I'm using the code below to collect tweets about the Humboldt hockey team, but the premium API returns only 7013 tweets. A similar query against the free API in R, which covers only the last seven days, returns 4699 tweets. Any ideas as to why this is happening?
Python for premium API query:
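from searchtweets import collect_results, gen_rule_payload

rule = gen_rule_payload("(humbolt OR Humbolt) -is:retweet lang:en",
                        from_date="2018-04-05",  # UTC 2018-04-05 00:00
                        to_date="2018-04-27",    # UTC 2018-04-27 00:00
                        results_per_call=500)
# premium_search_args holds the premium endpoint credentials (set up earlier, not shown)
tweets = collect_results(rule,
                         max_results=200000,
                         result_stream_args=premium_search_args)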
These are the daily counts from the counts endpoint for the same query, but they don’t match what I see in the free API:
rule = gen_rule_payload("(humbolt OR Humbolt) -is:retweet lang:en",
                        from_date="2018-04-05",  # UTC 2018-04-05 00:00
                        to_date="2018-04-27",    # UTC 2018-04-27 00:00
                        count_bucket="day",      # ask for daily counts instead of tweets
                        results_per_call=500)
counts = collect_results(rule, max_results=500, result_stream_args=premium_search_args)
# Print one count bucket per line
for c in counts:
    print(c)
{'timePeriod': '201804260000', 'count': 27}
{'timePeriod': '201804250000', 'count': 87}
{'timePeriod': '201804240000', 'count': 74}
{'timePeriod': '201804230000', 'count': 40}
{'timePeriod': '201804220000', 'count': 44}
{'timePeriod': '201804210000', 'count': 61}
{'timePeriod': '201804200000', 'count': 43}
{'timePeriod': '201804190000', 'count': 71}
{'timePeriod': '201804180000', 'count': 54}
{'timePeriod': '201804170000', 'count': 71}
{'timePeriod': '201804160000', 'count': 118}
{'timePeriod': '201804150000', 'count': 131}
{'timePeriod': '201804140000', 'count': 154}
{'timePeriod': '201804130000', 'count': 349}
{'timePeriod': '201804120000', 'count': 1048}
{'timePeriod': '201804110000', 'count': 433}
{'timePeriod': '201804100000', 'count': 624}
{'timePeriod': '201804090000', 'count': 871}
{'timePeriod': '201804080000', 'count': 921}
{'timePeriod': '201804070000', 'count': 2017}
{'timePeriod': '201804060000', 'count': 10}
{'timePeriod': '201804050000', 'count': 12}
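For reference, the daily counts can be summed and compared with the 7013 tweets that were actually returned. This is only a minimal sketch: the numbers are copied by hand from the output above, and the variable names (`daily_counts`, `retrieved`, `gap`) are just illustrative.

```python
# Daily counts copied from the counts-endpoint output above (UTC days).
daily_counts = {
    "201804260000": 27, "201804250000": 87, "201804240000": 74,
    "201804230000": 40, "201804220000": 44, "201804210000": 61,
    "201804200000": 43, "201804190000": 71, "201804180000": 54,
    "201804170000": 71, "201804160000": 118, "201804150000": 131,
    "201804140000": 154, "201804130000": 349, "201804120000": 1048,
    "201804110000": 433, "201804100000": 624, "201804090000": 871,
    "201804080000": 921, "201804070000": 2017, "201804060000": 10,
    "201804050000": 12,
}

# Total implied by the counts endpoint for the whole date range.
total = sum(daily_counts.values())

# Number of tweets the data query actually returned (from the post).
retrieved = 7013

# Difference between the counts total and the tweets retrieved.
gap = total - retrieved
print(total, retrieved, gap)
```

So the counts endpoint and the data query are in the same ballpark as each other; the large difference is between both of them and the free API.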
R for free API query:
tweets <- search_tweets(q = "humboldt",                  # search term
                        n = 10000,                       # maximum number of tweets to return
                        include_rts = FALSE,             # whether or not to include retweets
                        retryonratelimit = TRUE,         # deals with limits on searching -- keep this
                        geocode = lookup_coords("usa"),  # accesses geocode for locations -- important for mapping
                        lang = "en"                      # language
)
This returns 615 tweets for 4/22 (UTC), compared with a count of 44 from the premium API for the same day.
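To compare the two sources day by day, the retrieved tweets can be bucketed by UTC date on the Python side. The sketch below assumes each result behaves like a dict with the standard `created_at` string (for example "Sun Apr 22 00:05:00 +0000 2018"); `count_by_utc_day` is a hypothetical helper, and the sample records are made up purely to show the shape of the input.

```python
from collections import Counter
from datetime import datetime, timezone

# Format of the classic Twitter `created_at` field (assumed here).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def count_by_utc_day(tweets):
    """Tally tweets per UTC calendar day, keyed as YYYYMMDD."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        day = created.astimezone(timezone.utc).strftime("%Y%m%d")
        counts[day] += 1
    return counts


# Made-up sample records, not real data.
sample = [
    {"created_at": "Sun Apr 22 00:05:00 +0000 2018"},
    {"created_at": "Sun Apr 22 23:59:59 +0000 2018"},
    {"created_at": "Mon Apr 23 00:00:00 +0000 2018"},
]

by_day = count_by_utc_day(sample)
for day, n in sorted(by_day.items()):
    print(day, n)
```

Running the same function over the `tweets` list from the premium query would give per-day totals that can be lined up against both the counts output and the R results.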