How to extract data with full archive sandbox in R?


#1

Please help. I am doing my final college project, researching the 2019 Presidential Election in Indonesia. I use data from Twitter to predict the results of the election. I want to extract tweets every 2 weeks from September 2018 to January 2019, but I cannot collect tweets more than 7 days old. So I applied for a developer account to use the full archive search API to collect tweets from September 2018. My developer account has already been approved, I have already created an app, and I have already set the dev environment for my app. But I still cannot extract tweets more than 7 days old using my old code. So I searched for a solution and found this code:

library(jsonlite)

#Create your own application key at https://dev.twitter.com/apps
consumer_key = "insert_consumer_key";
consumer_secret = "insert_consumer_secret";

#Use basic auth
secret <- jsonlite::base64_enc(paste(consumer_key, consumer_secret, sep = ":"))
req <- httr::POST("https://api.twitter.com/oauth2/token",
  httr::add_headers(
    "Authorization" = paste("Basic", gsub("\n", "", secret)),
    "Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"
  ),
  body = "grant_type=client_credentials"
);

#Extract the access token
httr::stop_for_status(req, "authenticate with twitter")
token <- paste("Bearer", httr::content(req)$access_token)

#Actual API call
url <- "https://api.twitter.com/1.1/statuses/user_timeline.json?count=10&screen_name=Rbloggers"
req <- httr::GET(url, httr::add_headers(Authorization = token))
json <- httr::content(req, as = "text")
tweets <- fromJSON(json)
substring(tweets$text, 1, 100)

So I changed the consumer key and secret to my own, along with the URL, so it looks like this:

library(jsonlite)

#Create your own application key at https://dev.twitter.com/apps
consumer_key = "my_consumer_key";
consumer_secret = "my_consumer_secret";

#Use basic auth
secret <- jsonlite::base64_enc(paste(consumer_key, consumer_secret, sep = ":"))
req <- httr::POST("https://api.twitter.com/oauth2/token",
                  httr::add_headers(
                    "Authorization" = paste("Basic", gsub("\n", "", secret)),
                    "Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"
                  ),
                  body = "grant_type=client_credentials"
);

#Extract the access token
httr::stop_for_status(req, "authenticate with twitter")
token <- paste("Bearer", httr::content(req)$access_token)

#Actual API call
url <- "https://api.twitter.com/1.1/tweets/search/fullarchive/my_env_label.json"
req <- httr::GET(url, httr::add_headers(Authorization = token))
json <- httr::content(req, as = "text")
tweets <- fromJSON(json)
substring(tweets$text, 1, 100)

But when I run the code, nothing happens and it only shows this:

character(0)

So I tried changing the URL to include the query, like this:

url <- "https://api.twitter.com/1.1/tweets/search/fullarchive/my_env_label.json?query=%23JokowiLagi&fromDate=201809270000&toDate=201810010000"

But still nothing happens and the result is the same as before. The only difference is that this time the remaining requests shown in my subscription details are reduced.

So please help me: tell me what I should do to solve this problem.
Thank you.

(Note: I am using the R programming language to extract the data from Twitter and to run the code above.)


#2

There’s no R implementation of the full archive search, unfortunately, and writing your own R code for it might get a bit involved. Also, you don’t need to extract the access token every time; you can reuse the same one over and over.
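
For example, here is a minimal, untested sketch (my own, not part of any library) that caches the bearer token in a local file so you only authenticate once; the cache file name is just an assumption:

library(httr)
library(jsonlite)

# Sketch: fetch the app-only bearer token once and cache it locally,
# so later runs reuse it instead of re-authenticating.
get_bearer_token <- function(consumer_key, consumer_secret,
                             cache = "bearer_token.rds") {
  if (file.exists(cache)) return(readRDS(cache))
  secret <- base64_enc(paste(consumer_key, consumer_secret, sep = ":"))
  req <- POST("https://api.twitter.com/oauth2/token",
              add_headers(
                Authorization = paste("Basic", gsub("\n", "", secret)),
                `Content-Type` = "application/x-www-form-urlencoded;charset=UTF-8"
              ),
              body = "grant_type=client_credentials")
  stop_for_status(req, "authenticate with twitter")
  token <- paste("Bearer", content(req)$access_token)
  saveRDS(token, cache)
  token
}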

If you just need tweets loaded into R for analysis, give the command-line version of this a try: https://github.com/twitterdev/search-tweets-python

eg:

search_tweets.py \
--max-results 100 \
--results-per-call 100 \
--filter-rule "#JokowiLagi" \
--start-datetime 2018-09-27 \
--end-datetime 2018-10-01 \
--filename-prefix test_search \
--print-stream

If you have a ~/.twitter_keys.yaml file like the readme describes (the endpoint will be https://api.twitter.com/1.1/tweets/search/fullarchive/my_env_label.json), that example call will create a file called test_search.json with 100 tweets for #JokowiLagi in that date range.


#3

Thanks for your reply. So do you mean that I can’t extract Twitter data older than 7 days using R, even though I am using full archive search?


#4

Oh no, you can still do it in R, but you’ll have to implement it yourself; I’m only suggesting the Python way because it seems like less effort to get the data.

7 days is the rough limit for the normal REST search API; using the full archive search properly will give you everything, but you’ll need to authenticate, paginate, and manage your own rate limits, etc.: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search
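
If you do go the R route, an untested sketch of the paging loop might look something like this (the endpoint label and rule are placeholders, and token is the "Bearer ..." string from your earlier auth code). Note that the premium endpoints return tweets under $results, which is why tweets$text came back as character(0) before, and each page includes a "next" token you pass back to get the following page:

library(httr)
library(jsonlite)

# Untested sketch: page through premium full-archive search results.
url <- "https://api.twitter.com/1.1/tweets/search/fullarchive/my_env_label.json"
params <- list(query      = "#JokowiLagi",
               fromDate   = "201809270000",
               toDate     = "201810010000",
               maxResults = 100)

all_tweets <- list()
repeat {
  req <- GET(url, add_headers(Authorization = token), query = params)
  stop_for_status(req)
  page <- fromJSON(content(req, as = "text", encoding = "UTF-8"),
                   simplifyVector = FALSE)
  all_tweets <- c(all_tweets, page$results)  # tweets live under $results
  if (is.null(page[["next"]])) break         # no "next" token: last page
  params[["next"]] <- page[["next"]]         # ask for the following page
  Sys.sleep(2)                               # naive rate-limit spacing
}
length(all_tweets)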


#5

OK, I am trying to use Python now, following the GitHub repo:
https://github.com/twitterdev/search-tweets-python

I have already installed searchtweets and created a twitter_keys.yaml file for the credentials, just like in the GitHub readme. The .yaml file contains the same as the readme’s example (except for the endpoint, consumer key, and secret).

After that I tried this code from the readme:

from searchtweets import load_credentials

load_credentials(filename="./search_tweets_creds_example.yaml",
                 yaml_key="search_tweets_premium_example",
                 env_overwrite=False)

But I am stuck at this point. I don’t understand the filename or yaml_key parameters of load_credentials. What should I put in filename and yaml_key? And can you give me an example of the input?

Thanks for your help.


#6

Have a look at the “Problem with .twitter_keys.yaml” topic to see if it helps.

filename is the path to the actual yaml file, and yaml_key specifies which configuration to use, because one yaml file can hold multiple configurations for multiple endpoints and apps; see the sketch below.
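
For example, a credentials file with two configurations might look something like this (the key names, endpoint label, and placeholder values are just illustrations; check the readme for the exact fields):

search_tweets_premium:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/my_env_label.json
  consumer_key: <YOUR_CONSUMER_KEY>
  consumer_secret: <YOUR_CONSUMER_SECRET>

search_tweets_30day:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/30day/my_env_label.json
  consumer_key: <YOUR_CONSUMER_KEY>
  consumer_secret: <YOUR_CONSUMER_SECRET>

With that file, load_credentials(filename="~/.twitter_keys.yaml", yaml_key="search_tweets_premium", env_overwrite=False) would pick the first configuration.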

Also, passing --credential-file twitter_keys.yaml to that search_tweets.py command will work too, without having to modify the underlying code.


#7

OK, I finally found the file search_tweets.py and tried to run it. But I am confused about what I should change in that code:

#!C:\Python\python.exe
# Copyright 2017 Twitter, Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
import os
import argparse
import json
import sys
import logging
from searchtweets import (ResultStream,
                          load_credentials,
                          merge_dicts,
                          read_config,
                          write_result_stream,
                          gen_params_from_config)

logger = logging.getLogger()
# we want to leave this here and have it command-line configurable via the
# --debug flag
logging.basicConfig(level=os.environ.get("LOGLEVEL", "ERROR"))


REQUIRED_KEYS = {"pt_rule", "endpoint"}


def parse_cmd_args():
    argparser = argparse.ArgumentParser()
    help_msg = """configuration file with all parameters. Far,
          easier to use than the command-line args version.,
          If a valid file is found, all args will be populated,
          from there. Remaining command-line args,
          will overrule args found in the config,
          file."""

    argparser.add_argument("--credential-file",
                           dest="credential_file",
                           default=None,
                           help=("Location of the yaml file used to hold "
                                 "your credentials."))

    argparser.add_argument("--credential-file-key",
                           dest="credential_yaml_key",
                           default=None,
                           help=("the key in the credential file used "
                                 "for this session's credentials. "
                                 "Defaults to search_tweets_api"))

    argparser.add_argument("--env-overwrite",
                           dest="env_overwrite",
                           default=True,
                           help=("""Overwrite YAML-parsed credentials with
                                 any set environment variables. See API docs or
                                 readme for details."""))

    argparser.add_argument("--config-file",
                           dest="config_filename",
                           default=None,
                           help=help_msg)

    argparser.add_argument("--account-type",
                           dest="account_type",
                           default=None,
                           choices=["premium", "enterprise"],
                           help="The account type you are using")

    argparser.add_argument("--count-bucket",
                           dest="count_bucket",
                           default=None,
                           help=("""Bucket size for counts API. Options:,
                                 day, hour, minute (default is 'day')."""))

    argparser.add_argument("--start-datetime",
                           dest="from_date",
                           default=None,
                           help="""Start of datetime window, format
                                'YYYY-mm-DDTHH:MM' (default: -30 days)""")

    argparser.add_argument("--end-datetime",
                           dest="to_date",
                           default=None,
                           help="""End of datetime window, format
                                 'YYYY-mm-DDTHH:MM' (default: most recent
                                 date)""")

    argparser.add_argument("--filter-rule",
                           dest="pt_rule",
                           default=None,
                           help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)")

    argparser.add_argument("--results-per-call",
                           dest="results_per_call",
                           help="Number of results to return per call "
                                "(default 100; max 500) - corresponds to "
                                "'maxResults' in the API")

    argparser.add_argument("--max-results", dest="max_results",
                           type=int,
                           help="Maximum number of Tweets or Counts to return for this session")

    argparser.add_argument("--max-pages",
                           dest="max_pages",
                           type=int,
                           default=None,
                           help="Maximum number of pages/API calls to "
                           "use for this session.")

    argparser.add_argument("--results-per-file", dest="results_per_file",
                           default=None,
                           type=int,
                           help="Maximum tweets to save per file.")

    argparser.add_argument("--filename-prefix",
                           dest="filename_prefix",
                           default=None,
                           help="prefix for the filename where tweet "
                           " json data will be stored.")

    argparser.add_argument("--no-print-stream",
                           dest="print_stream",
                           action="store_false",
                           help="disable print streaming")

    argparser.add_argument("--print-stream",
                           dest="print_stream",
                           action="store_true",
                           default=True,
                           help="Print tweet stream to stdout")

    argparser.add_argument("--extra-headers",
                           dest="extra_headers",
                           type=str,
                           default=None,
                           help="JSON-formatted str representing a dict of additional request headers")

    argparser.add_argument("--debug",
                           dest="debug",
                           action="store_true",
                           default=False,
                           help="print all info and warning messages")
    return argparser


def _filter_sensitive_args(dict_):
    sens_args = ("password", "consumer_key", "consumer_secret", "bearer_token")
    return {k: v for k, v in dict_.items() if k not in sens_args}

def main():
    args_dict = vars(parse_cmd_args().parse_args())
    if args_dict.get("debug") is True:
        logger.setLevel(logging.DEBUG)
        logger.debug("command line args dict:")
        logger.debug(json.dumps(args_dict, indent=4))

    if args_dict.get("config_filename") is not None:
        configfile_dict = read_config(args_dict["config_filename"])
    else:
        configfile_dict = {}
    
    extra_headers_str = args_dict.get("extra_headers")
    if extra_headers_str is not None:
        args_dict['extra_headers_dict'] = json.loads(extra_headers_str)
        del args_dict['extra_headers']

    logger.debug("config file ({}) arguments sans sensitive args:".format(args_dict["config_filename"]))
    logger.debug(json.dumps(_filter_sensitive_args(configfile_dict), indent=4))

    creds_dict = load_credentials(filename=args_dict["credential_file"],
                                  account_type=args_dict["account_type"],
                                  yaml_key=args_dict["credential_yaml_key"],
                                  env_overwrite=args_dict["env_overwrite"])

    dict_filter = lambda x: {k: v for k, v in x.items() if v is not None}

    config_dict = merge_dicts(dict_filter(configfile_dict),
                              dict_filter(creds_dict),
                              dict_filter(args_dict))

    logger.debug("combined dict (cli, config, creds) sans password:")
    logger.debug(json.dumps(_filter_sensitive_args(config_dict), indent=4))

    if len(dict_filter(config_dict).keys() & REQUIRED_KEYS) < len(REQUIRED_KEYS):
        print(REQUIRED_KEYS - dict_filter(config_dict).keys())
        logger.error("ERROR: not enough arguments for the program to work")
        sys.exit(1)

    stream_params = gen_params_from_config(config_dict)
    logger.debug("full arguments passed to the ResultStream object sans password")
    logger.debug(json.dumps(_filter_sensitive_args(stream_params), indent=4))

    rs = ResultStream(tweetify=False, **stream_params)

    logger.debug(str(rs))

    if config_dict.get("filename_prefix") is not None:
        stream = write_result_stream(rs,
                                     filename_prefix=config_dict.get("filename_prefix"),
                                     results_per_file=config_dict.get("results_per_file"))
    else:
        stream = rs.stream()

    for tweet in stream:
        if config_dict["print_stream"] is True:
            print(json.dumps(tweet))


if __name__ == '__main__':
    main()

I tried to change some parts of that code as you suggested before, and the code looks like this:

#!C:\Python\python.exe
# Copyright 2017 Twitter, Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
import os
import argparse
import json
import sys
import logging
from searchtweets import (ResultStream,
                          load_credentials,
                          merge_dicts,
                          read_config,
                          write_result_stream,
                          gen_params_from_config)

logger = logging.getLogger()
# we want to leave this here and have it command-line configurable via the
# --debug flag
logging.basicConfig(level=os.environ.get("LOGLEVEL", "ERROR"))


REQUIRED_KEYS = {"pt_rule", "endpoint"}


def parse_cmd_args():
    argparser = argparse.ArgumentParser()
    help_msg = """configuration file with all parameters. Far,
          easier to use than the command-line args version.,
          If a valid file is found, all args will be populated,
          from there. Remaining command-line args,
          will overrule args found in the config,
          file."""

    argparser.add_argument(".twitter_keys.yaml",
                           dest="credential_file",
                           default=None,
                           help=("Location of the yaml file used to hold "
                                 "your credentials."))

    argparser.add_argument("search_tweets_premium",
                           dest="credential_yaml_key",
                           default=None,
                           help=("the key in the credential file used "
                                 "for this session's credentials. "
                                 "Defaults to search_tweets_api"))

    argparser.add_argument("--env-overwrite",
                           dest="env_overwrite",
                           default=True,
                           help=("""Overwrite YAML-parsed credentials with
                                 any set environment variables. See API docs or
                                 readme for details."""))

    argparser.add_argument("--config-file",
                           dest="config_filename",
                           default=None,
                           help=help_msg)

    argparser.add_argument("premium",
                           dest="account_type",
                           default=None,
                           choices=["premium", "enterprise"],
                           help="The account type you are using")

    argparser.add_argument("--count-bucket",
                           dest="count_bucket",
                           default=None,
                           help=("""Bucket size for counts API. Options:,
                                 day, hour, minute (default is 'day')."""))

    argparser.add_argument("2018-09-27",
                           dest="from_date",
                           default=None,
                           help="""Start of datetime window, format
                                'YYYY-mm-DDTHH:MM' (default: -30 days)""")

    argparser.add_argument("2018-10-01",
                           dest="to_date",
                           default=None,
                           help="""End of datetime window, format
                                 'YYYY-mm-DDTHH:MM' (default: most recent
                                 date)""")

    argparser.add_argument("#JokowiLagi",
                           dest="pt_rule",
                           default=None,
                           help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)")

    argparser.add_argument("100",
                           dest="results_per_call",
                           help="Number of results to return per call "
                                "(default 100; max 500) - corresponds to "
                                "'maxResults' in the API")

    argparser.add_argument("100", dest="max_results",
                           type=int,
                           help="Maximum number of Tweets or Counts to return for this session")

    argparser.add_argument("--max-pages",
                           dest="max_pages",
                           type=int,
                           default=None,
                           help="Maximum number of pages/API calls to "
                           "use for this session.")

    argparser.add_argument("--results-per-file", dest="results_per_file",
                           default=None,
                           type=int,
                           help="Maximum tweets to save per file.")

    argparser.add_argument("#JokowiLagi",
                           dest="filename_prefix",
                           default=None,
                           help="prefix for the filename where tweet "
                           " json data will be stored.")

    argparser.add_argument("--no-print-stream",
                           dest="print_stream",
                           action="store_false",
                           help="disable print streaming")

    argparser.add_argument("--print-stream",
                           dest="print_stream",
                           action="store_true",
                           default=True,
                           help="Print tweet stream to stdout")

    argparser.add_argument("--extra-headers",
                           dest="extra_headers",
                           type=str,
                           default=None,
                           help="JSON-formatted str representing a dict of additional request headers")

    argparser.add_argument("--debug",
                           dest="debug",
                           action="store_true",
                           default=False,
                           help="print all info and warning messages")
    return argparser


def _filter_sensitive_args(dict_):
    sens_args = ("password", "consumer_key", "consumer_secret", "bearer_token")
    return {k: v for k, v in dict_.items() if k not in sens_args}

def main():
    args_dict = vars(parse_cmd_args().parse_args())
    if args_dict.get("debug") is True:
        logger.setLevel(logging.DEBUG)
        logger.debug("command line args dict:")
        logger.debug(json.dumps(args_dict, indent=4))

    if args_dict.get("config_filename") is not None:
        configfile_dict = read_config(args_dict["config_filename"])
    else:
        configfile_dict = {}
    
    extra_headers_str = args_dict.get("extra_headers")
    if extra_headers_str is not None:
        args_dict['extra_headers_dict'] = json.loads(extra_headers_str)
        del args_dict['extra_headers']

    logger.debug("config file ({}) arguments sans sensitive args:".format(args_dict["config_filename"]))
    logger.debug(json.dumps(_filter_sensitive_args(configfile_dict), indent=4))

    creds_dict = load_credentials(filename=args_dict["credential_file"],
                                  account_type=args_dict["account_type"],
                                  yaml_key=args_dict["credential_yaml_key"],
                                  env_overwrite=args_dict["env_overwrite"])

    dict_filter = lambda x: {k: v for k, v in x.items() if v is not None}

    config_dict = merge_dicts(dict_filter(configfile_dict),
                              dict_filter(creds_dict),
                              dict_filter(args_dict))

    logger.debug("combined dict (cli, config, creds) sans password:")
    logger.debug(json.dumps(_filter_sensitive_args(config_dict), indent=4))

    if len(dict_filter(config_dict).keys() & REQUIRED_KEYS) < len(REQUIRED_KEYS):
        print(REQUIRED_KEYS - dict_filter(config_dict).keys())
        logger.error("ERROR: not enough arguments for the program to work")
        sys.exit(1)

    stream_params = gen_params_from_config(config_dict)
    logger.debug("full arguments passed to the ResultStream object sans password")
    logger.debug(json.dumps(_filter_sensitive_args(stream_params), indent=4))

    rs = ResultStream(tweetify=False, **stream_params)

    logger.debug(str(rs))

    if config_dict.get("filename_prefix") is not None:
        stream = write_result_stream(rs,
                                     filename_prefix=config_dict.get("filename_prefix"),
                                     results_per_file=config_dict.get("results_per_file"))
    else:
        stream = rs.stream()

    for tweet in stream:
        if config_dict["print_stream"] is True:
            print(json.dumps(tweet))


if __name__ == '__main__':
    main()

But after I run it, this message appears:

Traceback (most recent call last):
  File "c:/Python Code/search_tweets.py", line 207, in <module>
    main()
  File "c:/Python Code/search_tweets.py", line 148, in main
    args_dict = vars(parse_cmd_args().parse_args())
  File "c:/Python Code/search_tweets.py", line 38, in parse_cmd_args
    help=("Location of the yaml file used to hold "
  File "C:\Python\lib\argparse.py", line 1334, in add_argument
    raise ValueError('dest supplied twice for positional argument')
ValueError: dest supplied twice for positional argument

What should I change in the search_tweets.py code?
What is the meaning of that ValueError?
Is there any mistake in the changes I made above?
It would be helpful if you could provide some examples of the changes to search_tweets.py.

Thanks for your help so far.


#8

I want to extract data from 2017 and also in real time, and I faced the same problem as you, which is the limitation.
So I will get the Premium APIs soon, but I have some questions, if you can answer me please.
Can you show me an example of the JSON returned by a Twitter search?

Are these the columns of the data I will get?
@mentions, Replies, Retweets, Quote Tweets, Retweets of Quoted Tweets, Likes, Direct Messages Sent, Direct Messages Received, Follows, Blocks, Mutes, Typing indicators and Read receipts.


#9

Is there any GitHub code that shows me how to implement it myself in R?


#10

Oh, there’s no need to edit the Python code; you were changing the code that does the argument parsing for a command-line utility, which you can just run without modifying. (That is what the ValueError means: by replacing option names like --credential-file with bare values, you turned them into positional arguments, which argparse does not allow to also carry a dest keyword.) The command is:

python search_tweets.py \
--credential-file twitter_keys.yaml \
--max-results 100 \
--results-per-call 100 \
--filter-rule "#JokowiLagi" \
--start-datetime 2018-09-27 \
--end-datetime 2018-10-01 \
--filename-prefix test_search \
--print-stream

This assumes there is a twitter_keys.yaml file in the current directory. It will print out 100 tweets for #JokowiLagi from the date range specified and create a file called test_search.json with the tweets.
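
Once that file exists, getting the tweets back into R is easy. Assuming the output is newline-delimited JSON (one tweet per line, like the --print-stream output), something like this should work:

library(jsonlite)

# stream_in() reads newline-delimited JSON into a data frame.
tweets <- stream_in(file("test_search.json"))
substring(tweets$text, 1, 100)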


#11

As for real-time, I’d suggest starting with something like https://gwu-libraries.github.io/sfm-ui/about/overview, which is more user-friendly; once you identify what’s available and what exactly you need to do, you can build your own data-collecting approach. For the structure of tweet JSON, see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json


#13

Thank you…
I have used the standard Twitter API in R via the twitteR package for almost a month, but as I mentioned before, because of the limitations I want to use the premium API. It is my first time working with JSON and a premium API, so I am a little confused. Any advice or steps that could help me?


#14

I’d suggest starting with the docs https://developer.twitter.com/en/products/tweets/search and tools like https://github.com/twitterdev/search-tweets-python to see what’s possible and whether it fits what you need to do.

I’ve no doubt R can effectively use the Premium search APIs, but since there’s no library for that yet it’ll be a good chunk of work to implement.

Also, the twitteR library (https://github.com/geoffjentry/twitteR) is no longer maintained in favour of rtweet (https://github.com/mkearney/rtweet); I’d strongly suggest switching to rtweet in R.


#15

Thank you so much

