Low number of tweets returned

python
search

#1

I use the following code to scrape data about the Humbolt hockey team, but I only get 7013 tweets back. I ran a similar query with the free API in R, which only covers the last seven days, and I see 4699 tweets. Any ideas as to why this is happening?

Python for premium API query:

rule = gen_rule_payload("(humbolt OR Humbolt) -is:retweet lang:en",
                        from_date="2018-04-05",  # UTC 2018-04-05 00:00
                        to_date="2018-04-27",  # UTC 2018-04-27 00:00
                        results_per_call=500)

tweets = collect_results(rule,
                         max_results=200000,
                         result_stream_args=premium_search_args)
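
A quick way to check how many tweets came back is the length of that list (the same check as in the later post):

# how many tweets did the premium query return?
len(tweets)
# 7013 in my case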

These are the daily counts, but they don't match what I see in the free API:

rule = gen_rule_payload("(humbolt OR Humbolt) -is:retweet lang:en",
                        from_date="2018-04-05",  # UTC 2018-04-05 00:00
                        to_date="2018-04-27",  # UTC 2018-04-27 00:00
                        count_bucket="day",
                        results_per_call=500)
counts = collect_results(rule, max_results=500, result_stream_args=premium_search_args)
[print(c) for c in counts];
{'timePeriod': '201804260000', 'count': 27}
{'timePeriod': '201804250000', 'count': 87}
{'timePeriod': '201804240000', 'count': 74}
{'timePeriod': '201804230000', 'count': 40}
{'timePeriod': '201804220000', 'count': 44}
{'timePeriod': '201804210000', 'count': 61}
{'timePeriod': '201804200000', 'count': 43}
{'timePeriod': '201804190000', 'count': 71}
{'timePeriod': '201804180000', 'count': 54}
{'timePeriod': '201804170000', 'count': 71}
{'timePeriod': '201804160000', 'count': 118}
{'timePeriod': '201804150000', 'count': 131}
{'timePeriod': '201804140000', 'count': 154}
{'timePeriod': '201804130000', 'count': 349}
{'timePeriod': '201804120000', 'count': 1048}
{'timePeriod': '201804110000', 'count': 433}
{'timePeriod': '201804100000', 'count': 624}
{'timePeriod': '201804090000', 'count': 871}
{'timePeriod': '201804080000', 'count': 921}
{'timePeriod': '201804070000', 'count': 2017}
{'timePeriod': '201804060000', 'count': 10}
{'timePeriod': '201804050000', 'count': 12}
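
Summing those daily buckets gives the total the counts endpoint reports for the window, which lands close to the 7013 tweets actually collected:

# total across the daily buckets above (comes to 7260 for this window)
sum(c["count"] for c in counts)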

R for free API query:

tweets <- search_tweets(q = "humboldt",           # search term
                        n = 10000,                # maximum number of tweets to return
                        include_rts = FALSE,      # whether or not to include retweets
                        retryonratelimit = TRUE,  # deals with limits on searching -- keep this
                        geocode = lookup_coords("usa"),  # geocode for locations -- important for mapping
                        lang = "en"               # language
)

This returns 615 tweets for 2018-04-22 (UTC).
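
For comparison, the premium bucket for that same day can be read straight off the counts output above:

# premium daily count for 2018-04-22 UTC, versus the 615 tweets the free API returned
[c for c in counts if c["timePeriod"] == "201804220000"]
# [{'timePeriod': '201804220000', 'count': 44}]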


#2

Hi @willcipolli - have you dug into the quality of the data? The search methods differ between the premium Search API and the standard statuses/search endpoint.


#3

@happycamper What do you mean by the quality of the data?

The two methods I posted are different:

  • The top uses the premium API through Python's "searchtweets" library. I can confirm these requests hit my premium account, based on the request count shown in my account dashboard.
  • The bottom uses the free API through R's "rtweet" library.

My questions isn’t about what data or the quality of data being returned, but the number of tweets captured.

Running the following in Python with my premium credentials, I get 18 tweets.

rule = gen_rule_payload("(humbolt) -is:retweet lang:en",
                        from_date="2018-05-09",  # UTC 2018-05-09 00:00
                        to_date="2018-05-10",  # UTC 2018-05-10 00:00
                        results_per_call=500)

tweets = collect_results(rule,
                         max_results=200000,
                         result_stream_args=premium_search_args)
len(tweets)
18
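
As a cross-check, the same one-day window can also be run through the counts endpoint, using the same gen_rule_payload call as above with count_bucket="day" (a sketch; I haven't posted its output here):

count_rule = gen_rule_payload("(humbolt) -is:retweet lang:en",
                              from_date="2018-05-09",  # UTC 2018-05-09 00:00
                              to_date="2018-05-10",  # UTC 2018-05-10 00:00
                              count_bucket="day",
                              results_per_call=500)
day_counts = collect_results(count_rule,
                             max_results=500,
                             result_stream_args=premium_search_args)
print(day_counts)  # should show a single daily bucket for 2018-05-09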

Using my free-API credentials in R, I get over 350 tweets.

> tweets <- search_tweets(q = "humboldt", #search term
+                         n = 10000,   #maximum number of tweets to return
+                         include_rts = FALSE, #whether or not to include retweets
+                         retryonratelimit = TRUE, #deals with limits on searching -- keep this
+                         geocode = lookup_coords("usa"), # geocode for locations -- important for mapping
+                         lang = "en", #language
+                         since="2018-05-09",until="2018-05-10"
+ )
Searching for tweets...
Finished collecting tweets!
> length(unique(tweets$text)) ##how many tweets did we collect?
[1] 355
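
To make that comparison like-for-like, the equivalent uniqueness check on the premium results would be something along these lines (assuming collect_results returned parsed Tweet objects exposing a .text attribute):

# how many unique tweet texts did the premium query return?
# (.text is assumed here; adjust if the results are raw dicts)
len({t.text for t in tweets})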