Using more requests than expected with searchtweets module and fullarchive endpoint


#1

Hi,

I’m using more requests than intended (I’d like to use just 1) when grabbing tweets for a handle over the last 90 days. Please see the relevant code and output below. Passing max_requests to ResultStream() also doesn’t help. Finally, in this case only 88 tweets are matched by that search query.

Is this expected behaviour?

Thanks

Output:
Starting: 0 API calls used
collecting results…
{"query": "from:xxxxxx", "maxResults": 500, "toDate": "201502050000", "fromDate": "201411070000"}
You have used 3 API calls

Code:

# Assumes search_args (from searchtweets.load_credentials) and to_scrape
# (an iterable of (index, handle_list, date) tuples) are defined earlier.
import pandas as pd
from searchtweets import gen_rule_payload, ResultStream

raw_data_test = {}

print("Starting: {} API calls used".format(ResultStream.session_request_counter))

def make_rule(handle, to_date, from_date):
    # Build a "from:<handle>" rule limited to the given date window.
    rule = gen_rule_payload("from:" + handle,
                            from_date=from_date,
                            to_date=to_date,
                            results_per_call=500)
    return rule

days_to_scrape = 90
for indx_, handle_list, date in to_scrape[32:33]:

    to_datetime = pd.to_datetime(date)
    from_datetime = to_datetime - pd.Timedelta(days_to_scrape, unit='D')
    from_datestring = str(from_datetime.date())
    to_datestring = str(to_datetime.date())

    for handle in handle_list:
        #print(handle)
        print('collecting results...')
        search_query = make_rule(handle, to_datestring, from_datestring)

        print(search_query)
        rs = ResultStream(rule_payload=search_query,
                          max_results=500,
                          **search_args)

        results_list = list(rs.stream())
        print("You have used {} API calls".format(ResultStream.session_request_counter))
        raw_data_test[search_query] = results_list
        #time.sleep(2)

#2

Sorry for duplicating, this seems to be the same issue as the one here:


#3

@JHCornford -
Using 3 API requests is likely due to the 90-day window: data is paged by the maxResults parameter or by 30-day (really 31-day) windows, whichever comes first. Unless your search matches over 1,500 results within those 3 months, you will always get exactly 3 request pages.

This is in our docs here: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

Request/response behavior
Using the fromDate and toDate parameters, you can request any time period that the API supports. The 30-day endpoint provides Tweets from the most recent 31 days (even though referred to as the ‘30-day’ endpoint, it makes 31 days available to enable users to make complete-month requests). The Full-archive endpoint provides Tweets back to the very first tweet (March 21, 2006).

Each Tweet data request contains a ‘maxResults’ parameter (range 10-500, with a default of 100; Sandbox environments have a maximum of 100) that specifies the maximum number of Tweets to return in the response. When the amount of data exceeds the ‘maxResults’ setting (and it usually will), the response will include a ‘next’ token and pagination will be required to receive all the data associated with your query (see the pagination section of the docs for more information).

For example, say your query matches 6,000 Tweets over the past 30 days (if you do not include date parameters in your request, the API will default to the full 30-day period). The API will respond with the first ‘page’ of results with either the first ‘maxResults’ of Tweets or all Tweets from the first 30 days if there are less than that for that time period. That response will contain a ‘next’ token and you’ll make another call with that ‘next’ token added to the request. To retrieve all of the 6,000 Tweets, approximately 12 requests will be necessary.
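To make the paging concrete: the ‘next’ token goes in the JSON request body, alongside query, fromDate, toDate and maxResults. Below is a rough sketch against the raw full-archive endpoint using the requests library; the environment label ("dev") and bearer token are placeholders, not values from this thread, and if you use searchtweets the ResultStream.stream() generator already does this paging for you.

import requests

# Placeholder environment label and bearer token - substitute your own.
ENDPOINT = "https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json"
HEADERS = {"Authorization": "Bearer <YOUR_BEARER_TOKEN>"}

payload = {
    "query": "from:xxxxxx",
    "fromDate": "201411070000",
    "toDate": "201502050000",
    "maxResults": 500,
}

tweets = []
while True:
    response = requests.post(ENDPOINT, json=payload, headers=HEADERS).json()
    tweets.extend(response.get("results", []))
    next_token = response.get("next")
    if not next_token:
        break                      # no more pages for this query / time window
    payload["next"] = next_token   # the 'next' token rides along in the next request body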


#4

Hi, where would you put the ‘next’ token in the code? I get an error saying “Invalid json, could not parse.” when I add it to the search query.


#5

I have exactly the same problem and I’m trying to understand whether it will be worth my while to buy a better dev API package - please help?

I use the searchtweets module for Python as well and do a search for a keyword over a one-week period in 2017. The basic sandbox package allows 50 requests per month, and after about 6 tests the 50-request limit was reached. maxResults is set to 100 at the moment and I didn’t check how many results were returned, but it certainly appeared to be far fewer than 100.

Right now my usage says 52 of 50 requests and 700 of 1M tweets.

What are my options? I’d like to do research on the sentiment of tweets from 2006 onwards. Based on my experience so far, I’ll be able to get 6 weeks of data for 50 requests, which means I’ll have to purchase the $774 package to get the data I need. Are my assumptions/calculations correct or am I doing something wrong? How do other researchers afford this?


#6

That sounds about right - requests are paginated by 31 days or 100 tweets at the sandbox level.

Also, if you upgrade, your used requests won’t reset: your limits will update, but the requests you’ve already used will still count against them until the next “billing period”.

When buying premium, you can make the most of your calls with better queries - filtering out retweets, for example.

I haven’t tried it myself yet, but you could buy the lowest possible tier ($99) to get access to the counts endpoint, so you can plan your collection better if that helps. (Either trade off money for more requests, or time, by spreading collection over a couple of months of billing cycles.)
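To illustrate, a counts-first check with search-tweets-python should look roughly like this (a sketch, not tested here - the credentials file, yaml key and example rule are placeholders, and operators like -is:retweet are only available on paid premium):

from searchtweets import gen_rule_payload, collect_results, load_credentials

# Placeholder credentials setup - point this at your own keys file / yaml key.
search_args = load_credentials("~/.twitter_keys.yaml",
                               yaml_key="search_tweets_premium",
                               env_overwrite=False)

# Ask the counts endpoint how many tweets the rule matches per day,
# before spending any of the much scarcer data requests.
count_rule = gen_rule_payload("snow -is:retweet",   # example rule; -is:retweet needs a paid tier
                              from_date="2017-01-01",
                              to_date="2017-01-08",
                              count_bucket="day")

counts = collect_results(count_rule, result_stream_args=search_args)
for bucket in counts:
    print(bucket["timePeriod"], bucket["count"])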

If you’re doing sentiment research I’d recommend looking for existing datasets you can hydrate with the standard REST API (statuses/lookup). E.g. the SemEval datasets are a good start: http://www.aclweb.org/anthology/S17-2088 (the data and scripts for downloading are on the task-specific Results pages for each task: http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools )
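As a rough sketch of what hydrating looks like (assuming tweepy 3.x, where the lookup method is statuses_lookup, and a hypothetical file of tweet IDs - deleted or protected tweets simply won’t come back):

import tweepy

# Placeholder app credentials - substitute your own.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Hypothetical file with one tweet ID per line, e.g. extracted from a SemEval dataset.
with open("semeval_tweet_ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

hydrated = []
for i in range(0, len(tweet_ids), 100):      # statuses/lookup takes up to 100 IDs per call
    hydrated.extend(api.statuses_lookup(tweet_ids[i:i + 100]))

print("hydrated {} of {} tweets".format(len(hydrated), len(tweet_ids)))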


#7

Thanks for the reply @IgorBrigadir. I think I’m screwed :slightly_smiling_face:, but just for my understanding I would like to know how to limit the damage per query.

If maxResults is set to 100 and I use the sandbox version of Premium, I should be able to do 50 queries per month, right? I was able to do about 10 and would like to understand why.


#8

I am also using the searchtweets open-source Python project provided by Twitter. The rates and costs are challenging to understand.

I have a general question that might fit in with this thread; @IgorBrigadir might be able to provide some guidance.

The search-tweets-python project is a Twitter-supplied Python library that appears to use a generator to yield tweets. Does that generator (called ‘stream’, I think) pay any attention to rate limits at all?

I tested it in sandbox mode a few times and ran right into the monthly limits.

Comments or advice are welcome. I want to download the full archive of tweets for a user that has over a million tweets. Can I even do this with a premium account?

Thanks


#9

There’s always a way to get “unscrewed” :stuck_out_tongue: - given the cheapest $99 paid tier, the counts endpoint, and efficient use of search queries, I think it’s possible to get more data if you carefully plan searches in advance, prioritizing specific users, types of tweets (quotes, replies, retweets) or more exact timeframes (maybe you can get away with only crawling daytime tweets, skipping weekends, etc.).

No, there are no rate-limit checks in the stream generator, but that’s a known missing feature: https://github.com/twitterdev/search-tweets-python/issues/40
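Until that lands, you have to budget requests yourself. A minimal sketch, reusing the ResultStream pattern from the first post (the credentials paths, the pacing value and the max_requests cap are illustrative assumptions, not library defaults):

import time
from searchtweets import gen_rule_payload, ResultStream, load_credentials

# Placeholder credentials setup - point this at your own keys file / yaml key.
search_args = load_credentials("~/.twitter_keys.yaml",
                               yaml_key="search_tweets_premium",
                               env_overwrite=False)

rule = gen_rule_payload("from:xxxxxx", results_per_call=500)

rs = ResultStream(rule_payload=rule,
                  max_requests=2,   # cap on paged requests for this rule (if your version honours it)
                  **search_args)

tweets = list(rs.stream())
print("requests used this session:", ResultStream.session_request_counter)

time.sleep(5)   # self-imposed pause before the next rule; the library won't do this for you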

A million tweets is a very large undertaking, and it also depends on the time span, but at current prices that’s at least two months of crawling (since even the largest number of requests you can buy runs into the 1 million monthly tweet cap). In that case, are you sure it isn’t easier to just ask the account owner for their Tweet Archive from Settings? You can arrange to have the Tweet IDs extracted from the CSV Twitter provides.


#10

@IgorBrigadir - thank you for your prompt reply. “You get what you pay for” certainly applies here. I think the enterprise solution is a better choice for my customer. As you point out, streaming millions of tweets through premium is not an efficient option; Enterprise/GNIP is looking like a more feasible, albeit expensive, option.