Low number of tweets returned

python
search

#1

I use the following code to try to scrape data about the Humboldt hockey team, but I only get 7013 tweets returned. I ran a similar query using the free API in R, which only covers the last seven days, and I see 4699 tweets. Any ideas as to why this is happening?

Python for premium API query:

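from searchtweets import collect_results, gen_rule_payload

# premium_search_args is assumed to have been created with searchtweets.load_credentials(...)
rule = gen_rule_payload("(humbolt OR Humbolt) -is:retweet lang:en",
                        from_date="2018-04-05",  # UTC 2018-04-05 00:00
                        to_date="2018-04-27",  # UTC 2018-04-27 00:00
                        results_per_call=500)

tweets = collect_results(rule,
                         max_results=200000,
                         result_stream_args=premium_search_args)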

These are the daily counts, but they don’t match what I see in the free API:

rule = gen_rule_payload("(humbolt OR Humbolt) -is:retweet lang:en",
                        from_date="2018-04-05",  # UTC 2018-04-05 00:00
                        to_date="2018-04-27",  # UTC 2018-04-27 00:00
                        count_bucket="day",
                        results_per_call=500)
counts = collect_results(rule, max_results=500, result_stream_args=premium_search_args)
for c in counts:
    print(c)
{'timePeriod': '201804260000', 'count': 27}
{'timePeriod': '201804250000', 'count': 87}
{'timePeriod': '201804240000', 'count': 74}
{'timePeriod': '201804230000', 'count': 40}
{'timePeriod': '201804220000', 'count': 44}
{'timePeriod': '201804210000', 'count': 61}
{'timePeriod': '201804200000', 'count': 43}
{'timePeriod': '201804190000', 'count': 71}
{'timePeriod': '201804180000', 'count': 54}
{'timePeriod': '201804170000', 'count': 71}
{'timePeriod': '201804160000', 'count': 118}
{'timePeriod': '201804150000', 'count': 131}
{'timePeriod': '201804140000', 'count': 154}
{'timePeriod': '201804130000', 'count': 349}
{'timePeriod': '201804120000', 'count': 1048}
{'timePeriod': '201804110000', 'count': 433}
{'timePeriod': '201804100000', 'count': 624}
{'timePeriod': '201804090000', 'count': 871}
{'timePeriod': '201804080000', 'count': 921}
{'timePeriod': '201804070000', 'count': 2017}
{'timePeriod': '201804060000', 'count': 10}
{'timePeriod': '201804050000', 'count': 12}
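
As a quick cross-check, the daily buckets above can be summed and compared with the 7013 tweets that the data request returned. The sketch below simply hard-codes the counts from the listing (most recent day first); the variable names are illustrative only.

```python
# Daily counts copied from the listing above, most recent day first.
daily_counts = [27, 87, 74, 40, 44, 61, 43, 71, 54, 71, 118,
                131, 154, 349, 1048, 433, 624, 871, 921, 2017, 10, 12]

total_from_counts = sum(daily_counts)
tweets_returned = 7013  # number of tweets reported for the data request

print(len(daily_counts), "daily buckets")
print("sum of counts:", total_from_counts)
print("difference:", total_from_counts - tweets_returned)
```

The buckets add up to 7260, which is 247 more than the 7013 tweets collected, so the counts and the data request roughly agree with each other; the larger gap is between the premium results and the free API results.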

R for free API query:

tweets <- search_tweets(q = "humboldt", #search term
                      n = 10000,   #maximum number of tweets to return
                      include_rts = FALSE, #whether or not to include retweets
                      retryonratelimit = TRUE, #deals with limits on searching -- keep this
                      geocode = lookup_coords("usa"), #accesses geocode for locations -- important for mapping
                      lang = "en" #language,
)

This results in 615 tweets for 4/22 UTC.


#2

Hi @willcipolli, have you dug into the quality of the data? The search methods differ between the premium Search API and standard statuses/search.


#3

@happycamper What do you mean by the quality of the data?

The two methods I posted are different:

  • The top uses the premium API through Python’s "searchtweets" library. I can confirm these are querying my premium account by the count shown in my account dashboard.
  • The bottom uses the free API through R’s "rtweet" library.

My question isn’t about what data or the quality of data being returned, but the number of tweets captured.

Running the following in Python with my premium credentials, I get 18 tweets.

rule = gen_rule_payload("(humbolt) -is:retweet lang:en",
                        from_date="2018-05-09",  # UTC 2018-05-09 00:00
                        to_date="2018-05-10",  # UTC 2018-05-10 00:00
                        results_per_call=500)

tweets = collect_results(rule,
                         max_results=200000,
                         result_stream_args=premium_search_args)
len(tweets)
18

Using my free API credentials in R, I get over 350 tweets.

> tweets <- search_tweets(q = "humboldt", #search term
+                         n = 10000,   #maximum number of tweets to return
+                         include_rts = FALSE, #whether or not to include retweets
+                         retryonratelimit = TRUE, #deals with limits on searching -- keep this
+                         geocode = lookup_coords("usa"), #accesses geocode for locations -- important for mapping
+                         lang = "en", #language
+                         since="2018-05-09",until="2018-05-10"
+ )
Searching for tweets...
Finished collecting tweets!
> length(unique(tweets$text)) ##how many tweets did we collect?
[1] 355
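
One difference visible in the two snippets themselves is the keyword: the premium rule searches for humbolt, while the R call searches for humboldt (and also restricts results with geocode, which the premium rule does not). A minimal sketch, assuming the intent is to match either spelling; build_rule is a hypothetical helper, not part of searchtweets, and the string it returns would be passed to gen_rule_payload in place of the original rule text.

```python
def build_rule(terms, extra="-is:retweet lang:en"):
    """Hypothetical helper: OR the keywords together and append the filters."""
    keywords = " OR ".join(terms)
    return f"({keywords}) {extra}"


# Cover both spellings that appear in the two queries.
rule_text = build_rule(["humbolt", "humboldt"])
print(rule_text)
```

Running the counts endpoint once per spelling would show how much of the gap, if any, comes from the keyword alone.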

#4

Could you please provide more information about the difference between these two search methods? I am experiencing the same thing. With the Standard API I search with
q=%23customerexperience%20%23omnichannel%20-filter%3Aretweets%20-filter%3Amedia%20filter%3Alinks&lang=en&count=100&result_type=recent
and with the Premium API Sandbox I use
rule = gen_rule_payload("#customerexperience #omnichannel lang:en -has:media has:links", to_date=to_date, results_per_call=100)
where to_date is the date of the oldest tweet found by the Standard API. Standard search returns 37 results for the last 7 days, while the 30-day Sandbox returns 11 tweets.
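
To make the two queries easier to compare, the URL-encoded standard query can be decoded with Python's standard library. This is only a sketch for reading the query side by side with the premium rule; it does not call either API, and the variable names are illustrative.

```python
from urllib.parse import parse_qs

# Standard API query string exactly as posted above.
standard_query = ("q=%23customerexperience%20%23omnichannel%20-filter%3Aretweets"
                  "%20-filter%3Amedia%20filter%3Alinks&lang=en&count=100&result_type=recent")
# Premium rule text exactly as posted above.
premium_rule = "#customerexperience #omnichannel lang:en -has:media has:links"

params = parse_qs(standard_query)
decoded_q = params["q"][0]

print("standard:", decoded_q, "| lang =", params["lang"][0])
print("premium: ", premium_rule)
```

Decoded, the standard query reads `#customerexperience #omnichannel -filter:retweets -filter:media filter:links`. It excludes retweets, whereas the premium rule as posted contains no retweet operator, so the two requests are not strictly equivalent even before any difference in operator behavior.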


#5

The operators used with the two different products behave in different ways, i.e., they match on different JSON objects, which could be the cause of this. As Emily suggested, have you investigated the quality of the results between the two APIs?
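
One way to run that investigation is to collect the tweet IDs each API returns for the same time window and compare the two sets. The sketch below uses made-up placeholder IDs; in practice they would come from the id field of each result. compare_ids is a hypothetical helper, not part of either library.

```python
def compare_ids(premium_ids, standard_ids):
    """Hypothetical helper: split two ID collections into shared and unique sets."""
    premium, standard = set(premium_ids), set(standard_ids)
    return {
        "both": premium & standard,
        "premium_only": premium - standard,
        "standard_only": standard - premium,
    }


# Placeholder IDs for illustration only, not real tweets.
premium_ids = [101, 102, 103]
standard_ids = [102, 103, 104, 105]

report = compare_ids(premium_ids, standard_ids)
for name, ids in report.items():
    print(name, sorted(ids))
```

Reading the text of the tweets that appear in only one set usually shows which operator or matched field accounts for the difference.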

I personally recommend that you stick to the Premium product as much as possible, as we invest a lot more energy into this newer product.