How to acquire tweets using Full-archive search with Python

python
search

#1

Hello,

I’m sorry, I am using translation software, so my grammar may be strange.

I am working on my graduation thesis, and I need to acquire tweets from July 2018. I am going to upgrade to a paid premium API, but before that I need to confirm that I can acquire tweets with the sandbox API.

I have acquired tweets from the last 7 days using packages in R. However, R cannot access the Full-archive API, so I thought I would use Python, and I am working on that.

I searched and tried various websites, for example Twitter’s official developer website, this forum, GitHub, etc. I do not have much experience with Python, so this is all a bit confusing for me, since most of the available documentation is geared toward experienced developers.

Please tell me how to acquire tweets using Python.

I only want to specify keywords and dates.

So, for example, using the advanced search operators it could look something like “iPhone lang:ja since:2018-07-03_JST until:2018-12-08_JST”.

I want to save the acquired data as CSV.

Please let me know what you think would work best. Thank you!


#2

I don’t know if there is a good package for R for the Sandbox / Premium APIs, but the Python one that should work is search-tweets-python: https://github.com/twitterdev/search-tweets-python
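It’s on PyPI, so installing it (assuming Python 3 and pip are available) is a single command:

pip install searchtweets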

Once you have it installed and have created a ~/.twitter_keys.yaml file with your keys as the readme suggests, you can use its command-line version, so it won’t involve much Python development at all: https://github.com/twitterdev/search-tweets-python#using-the-comand-line-application

e.g.:

search_tweets.py --max-results 100 --results-per-call 100 --filter-rule "iPhone lang:ja since:2018-07-03 until:2018-12-08" --filename-prefix test_search --no-print-stream

I’m not 100% sure about the _JST part of the until:2018-12-08_JST query. I didn’t know there was a way to specify the timezone in a query like that, and I didn’t try running it to check.


#3

Thank you for your reply.

I had already read that article, but I could not understand it. Thank you very much for your kind instructions.


I made a YAML file named “twitter_keys.yaml” with my keys:

search_tweets_api:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/30day/().json
  consumer_key: ()
  consumer_secret: ()

I put my personal keys in the ().
The file is saved in the Anaconda3 folder; is there a problem with that?


I entered the following at the command prompt:

search_tweets.py --max-results 10 --results-per-call 100 --filter-rule "iPhone lang:ja since:2018-12-01 until:2018-12-02" --filename-prefix test_search --no-print-stream

I thought that tweets would be displayed, but instead “search_tweets.py” opened in IDLE:

#!c:\users\social002\anaconda3\python.exe
# Copyright 2017 Twitter, Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
import os
import argparse
import json
import sys
import logging
from searchtweets import (ResultStream,
                          load_credentials,
                          merge_dicts,
                          read_config,
                          write_result_stream,
                          gen_params_from_config)

logger = logging.getLogger()
# we want to leave this here and have it command-line configurable via the
# --debug flag
logging.basicConfig(level=os.environ.get("LOGLEVEL", "ERROR"))


REQUIRED_KEYS = {"pt_rule", "endpoint"}


def parse_cmd_args():
    argparser = argparse.ArgumentParser()
    help_msg = """configuration file with all parameters. Far,
          easier to use than the command-line args version.,
          If a valid file is found, all args will be populated,
          from there. Remaining command-line args,
          will overrule args found in the config,
          file."""

    argparser.add_argument("--credential-file",
                           dest="credential_file",
                           default=None,
                           help=("Location of the yaml file used to hold "
                                 "your credentials."))

    argparser.add_argument("--credential-file-key",
                           dest="credential_yaml_key",
                           default=None,
                           help=("the key in the credential file used "
                                 "for this session's credentials. "
                                 "Defaults to search_tweets_api"))

    argparser.add_argument("--env-overwrite",
                           dest="env_overwrite",
                           default=True,
                           help=("""Overwrite YAML-parsed credentials with
                                 any set environment variables. See API docs or
                                 readme for details."""))

    argparser.add_argument("--config-file",
                           dest="config_filename",
                           default=None,
                           help=help_msg)

    argparser.add_argument("--account-type",
                           dest="account_type",
                           default=None,
                           choices=["premium", "enterprise"],
                           help="The account type you are using")

    argparser.add_argument("--count-bucket",
                           dest="count_bucket",
                           default=None,
                           help=("""Bucket size for counts API. Options:,
                                 day, hour, minute (default is 'day')."""))

    argparser.add_argument("--start-datetime",
                           dest="from_date",
                           default=None,
                           help="""Start of datetime window, format
                                'YYYY-mm-DDTHH:MM' (default: -30 days)""")

    argparser.add_argument("--end-datetime",
                           dest="to_date",
                           default=None,
                           help="""End of datetime window, format
                                 'YYYY-mm-DDTHH:MM' (default: most recent
                                 date)""")

    argparser.add_argument("--filter-rule",
                           dest="pt_rule",
                           default=None,
                           help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)")

    argparser.add_argument("--results-per-call",
                           dest="results_per_call",
                           help="Number of results to return per call "
                                "(default 100; max 500) - corresponds to "
                                "'maxResults' in the API")

    argparser.add_argument("--max-results", dest="max_results",
                           type=int,
                           help="Maximum number of Tweets or Counts to return for this session")

    argparser.add_argument("--max-pages",
                           dest="max_pages",
                           type=int,
                           default=None,
                           help="Maximum number of pages/API calls to "
                           "use for this session.")

    argparser.add_argument("--results-per-file", dest="results_per_file",
                           default=None,
                           type=int,
                           help="Maximum tweets to save per file.")

    argparser.add_argument("--filename-prefix",
                           dest="filename_prefix",
                           default=None,
                           help="prefix for the filename where tweet "
                           " json data will be stored.")

    argparser.add_argument("--no-print-stream",
                           dest="print_stream",
                           action="store_false",
                           help="disable print streaming")

    argparser.add_argument("--print-stream",
                           dest="print_stream",
                           action="store_true",
                           default=True,
                           help="Print tweet stream to stdout")

    argparser.add_argument("--extra-headers",
                           dest="extra_headers",
                           type=str,
                           default=None,
                           help="JSON-formatted str representing a dict of additional request headers")

    argparser.add_argument("--debug",
                           dest="debug",
                           action="store_true",
                           default=False,
                           help="print all info and warning messages")
    return argparser


def _filter_sensitive_args(dict_):
    sens_args = ("password", "consumer_key", "consumer_secret", "bearer_token")
    return {k: v for k, v in dict_.items() if k not in sens_args}

def main():
    args_dict = vars(parse_cmd_args().parse_args())
    if args_dict.get("debug") is True:
        logger.setLevel(logging.DEBUG)
        logger.debug("command line args dict:")
        logger.debug(json.dumps(args_dict, indent=4))

    if args_dict.get("config_filename") is not None:
        configfile_dict = read_config(args_dict["config_filename"])
    else:
        configfile_dict = {}
    
    extra_headers_str = args_dict.get("extra_headers")
    if extra_headers_str is not None:
        args_dict['extra_headers_dict'] = json.loads(extra_headers_str)
        del args_dict['extra_headers']

    logger.debug("config file ({}) arguments sans sensitive args:".format(args_dict["config_filename"]))
    logger.debug(json.dumps(_filter_sensitive_args(configfile_dict), indent=4))

    creds_dict = load_credentials(filename=args_dict["credential_file"],
                                  account_type=args_dict["account_type"],
                                  yaml_key=args_dict["credential_yaml_key"],
                                  env_overwrite=args_dict["env_overwrite"])

    dict_filter = lambda x: {k: v for k, v in x.items() if v is not None}

    config_dict = merge_dicts(dict_filter(configfile_dict),
                              dict_filter(creds_dict),
                              dict_filter(args_dict))

    logger.debug("combined dict (cli, config, creds) sans password:")
    logger.debug(json.dumps(_filter_sensitive_args(config_dict), indent=4))

    if len(dict_filter(config_dict).keys() & REQUIRED_KEYS) < len(REQUIRED_KEYS):
        print(REQUIRED_KEYS - dict_filter(config_dict).keys())
        logger.error("ERROR: not enough arguments for the program to work")
        sys.exit(1)

    stream_params = gen_params_from_config(config_dict)
    logger.debug("full arguments passed to the ResultStream object sans password")
    logger.debug(json.dumps(_filter_sensitive_args(stream_params), indent=4))

    rs = ResultStream(tweetify=False, **stream_params)

    logger.debug(str(rs))

    if config_dict.get("filename_prefix") is not None:
        stream = write_result_stream(rs,
                                     filename_prefix=config_dict.get("filename_prefix"),
                                     results_per_file=config_dict.get("results_per_file"))
    else:
        stream = rs.stream()

    for tweet in stream:
        if config_dict["print_stream"] is True:
            print(json.dumps(tweet))


if __name__ == '__main__':
    main()

I looked at GitHub and tried things by trial and error, but I cannot get it to work.
How can I get tweets?


#4

What exactly was the error? If it was “ERROR:searchtweets.credentials:Account type is not specified and cannot be inferred.”, try also specifying where the credentials YAML file is; I placed a twitter_keys.yaml file in the same directory where I’m running search_tweets.py.
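Also, about the script opening in IDLE: on Windows, typing search_tweets.py by itself launches whatever program is associated with .py files (IDLE in your case) rather than running the script. Invoking the interpreter explicitly should avoid that, e.g.:

python search_tweets.py --credential-file twitter_keys.yaml --filter-rule "iPhone lang:ja" --max-results 10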

If the error was about “since” or “until” not being supported operators, I didn’t notice that earlier: those dates need to be specified using the --start-datetime and --end-datetime parameters, not since:/until: inside the query.
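One more thing I noticed in your YAML: the endpoint points at the 30day search, but to reach tweets from July 2018 you’d need the Full-archive endpoint. As a sketch, assuming your sandbox environment label is dev (replace it with whatever label you created in the developer dashboard):

search_tweets_api:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json
  consumer_key: ()
  consumer_secret: ()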

This worked for me, and I had a test_search.json created in the same directory with 100 tweets (I also changed --no-print-stream to --print-stream here):

search_tweets.py \
--credential-file twitter_keys.yaml \
--max-results 100 \
--results-per-call 100 \
--filter-rule "iPhone lang:ja" \
--start-datetime 2018-12-01 \
--end-datetime 2018-12-02 \
--filename-prefix test_search \
--print-stream

Once you have that working, you can edit the --filter-rule and datetime parameters to suit you. Not all the search operators you might be familiar with are supported in Sandbox / Premium / Enterprise search; this page details what’s possible: https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/operators-by-product
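Since your original goal was CSV, here is a minimal Python sketch (not part of searchtweets) for converting the saved file. It assumes the output file contains one tweet JSON object per line, and the filenames and field list are just examples you can adapt:

import csv
import json

# Read newline-delimited tweet JSON and write a simple CSV.
with open("test_search.json", encoding="utf-8") as infile, \
        open("test_search.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["id_str", "created_at", "text"])
    for line in infile:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Premium search can return extended tweets; prefer full_text if present.
        text = tweet.get("extended_tweet", {}).get("full_text") or tweet.get("text", "")
        writer.writerow([tweet.get("id_str", ""), tweet.get("created_at", ""), text])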

