I am using Tweepy to query the Twitter API for tweets that are mapped to particular cities and contain specific keywords. I then run a statistical analysis to see where and when keyword mentions in these cities spike, so it is important that my Python program returns the entire population of tweets I am looking for. I have run into a problem, however, that makes this analysis difficult.
My program has two queries. The first fetches the number of tweets from 2015 until now that are mapped to a bounding box enclosing a given city and contain my desired keyword. The second fetches the total number of tweets from 2015 until now that are mapped to the same bounding box. From there, I organize the data and run some functions to calculate keyword frequency over time in each city. Here’s the code for the queries:
import tweepy
client = tweepy.Client(
    bearer_token="REDACTED",
    consumer_secret=SecretKey,
    access_token=AccessToken,
    access_token_secret=AccessTokenSecret,
    wait_on_rate_limit=True,
)
# Search query: keyword plus bounding box (west_long south_lat east_long north_lat)
query = 'KEYWORD bounding_box:[-87.296643 14.018294 -87.116257 14.164161]'
# Start of the search window
start_time = '2015-01-01T00:00:00Z'
# End of the search window
end_time = '2022-08-01T00:00:00Z'
# Collect the daily count buckets (each has a start, an end, and a tweet_count)
KeywordTG = []
for bucket in tweepy.Paginator(client.get_all_tweets_count, query=query,
                               start_time=start_time, end_time=end_time,
                               granularity="day").flatten(limit=5000):
    KeywordTG.append(bucket)
**QUERY NO. 2**
client = tweepy.Client(
    bearer_token="REDACTED",
    consumer_secret=SecretKey,
    access_token=AccessToken,
    access_token_secret=AccessTokenSecret,
    wait_on_rate_limit=True,
)
# Search query: bounding box only, no keyword
query = 'bounding_box:[-87.296643 14.018294 -87.116257 14.164161]'
start_time = '2015-01-01T00:00:00Z'
end_time = '2022-08-01T00:00:00Z'
# Collect the daily count buckets (each has a start, an end, and a tweet_count)
AllTweets = []
for bucket in tweepy.Paginator(client.get_all_tweets_count, query=query,
                               start_time=start_time, end_time=end_time,
                               granularity="day").flatten(limit=5000):
    AllTweets.append(bucket)
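For context, here is a minimal sketch of how the daily buckets in KeywordTG and AllTweets could be rolled up into monthly totals and a keyword share. It assumes each bucket is a dict with "start" and "tweet_count" keys, which is the shape the counts endpoint returns; the function names monthly_totals and keyword_share are illustrative and not part of Tweepy.

```python
from collections import defaultdict


def monthly_totals(buckets):
    """Sum daily tweet_count values into 'YYYY-MM' monthly totals."""
    totals = defaultdict(int)
    for bucket in buckets:
        # "start" is an ISO 8601 timestamp; its first 7 characters are YYYY-MM.
        month = bucket["start"][:7]
        totals[month] += bucket["tweet_count"]
    return dict(totals)


def keyword_share(keyword_buckets, all_buckets):
    """Return keyword tweets as a fraction of all tweets, per month.

    Months with zero total tweets are skipped to avoid dividing by zero.
    """
    keyword = monthly_totals(keyword_buckets)
    total = monthly_totals(all_buckets)
    return {
        month: keyword.get(month, 0) / count
        for month, count in sorted(total.items())
        if count > 0
    }
```

Calling keyword_share(KeywordTG, AllTweets) gives one ratio per month. Because the denominator is the bounding-box total, any drift in how many tweets carry geo data feeds directly into this ratio.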
I am finding that for two of the cities I am studying (Tegucigalpa, Honduras and Guatemala City, Guatemala), the second query, which returns the total number of tweets, shows a steady decline over time. The first few months of 2015 turn up over 100,000 total tweets per month, but by summer 2022 that number is down to a little over 1,000 per month. What’s perplexing is that this is not a problem for my query in San Salvador, El Salvador. I am using exactly the same code for that query (with a different bounding box, of course), and the total number of tweets mapped to San Salvador does not decline over time.
As I understand it, the bounding_box operator matches tweets that have a) user-provided exact coordinates falling within the box or b) a user-provided location whose coordinates fall within the box. Is the issue I’m running into a problem with the bounding_box operator? Does it reflect the fact that Twitter phased out geotagging? Or might there be some local phenomenon in Guatemala City and Tegucigalpa, in which Twitter users gradually stopped providing information about their location? I have also found that the number of tweets matching has:geo has declined over time in Guatemala City and Tegucigalpa, but not in San Salvador. Very weird.
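One way to narrow this down might be to run the same counts request with a country-level geo operator alongside the bounding box, to see whether the decline is specific to the box or shows up across all place-tagged tweets in the country. The sketch below only builds the query strings. The helper name geo_query_variants is hypothetical, and the assumption is that the place_country operator (which takes an ISO alpha-2 country code) is available at the same access level as bounding_box.

```python
def geo_query_variants(box, country_code):
    """Build comparable geo queries for one city.

    box is (west_long, south_lat, east_long, north_lat), the same order
    the bounding_box operator expects. country_code is an ISO alpha-2
    code such as "HN" for Honduras.
    """
    west, south, east, north = box
    return {
        # Tweets whose coordinates or tagged place fall inside the box.
        "bounding_box": f"bounding_box:[{west} {south} {east} {north}]",
        # Tweets tagged with any place in the country (assumed diagnostic).
        "place_country": f"place_country:{country_code}",
    }


tegucigalpa = geo_query_variants(
    (-87.296643, 14.018294, -87.116257, 14.164161), "HN"
)
```

Each string in tegucigalpa can be passed as the query argument to the same Paginator call used above. If the country-wide series declines at the same rate as the bounding-box series, that would point to a change in geotagging behavior rather than to the box itself.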
Any help would be greatly appreciated, as my statistical analysis is much less valid if this problem remains.