Data mining (API) tweet collection for academic research--security changes?


#1

I am part of an academic research group that has been using Tweepy with the Twitter API, along with Twitter Advanced Search, to collect tweets on health topics.
We collected a number of tweets from a variety of sources (personal Twitter accounts, organization and company accounts) in February. When we tried to collect more at the end of April and the beginning of May, the tweets seemed to come only from formal sources (organizations, etc.) and not from personal accounts.
Has something in the security settings changed? Will we still have access to all tweets if we purchase the Premium API Search?


#2

It sounds like you’re using the standard search API and Tweepy to run searches on specific terms?

There have been no changes that I’m aware of, either in the search index or in security, that would cause the volume of Tweets matching your queries to drop. Are there specific Tweets that you believe should be returned by your code that are not appearing?

It would be very helpful if you were able to provide specific examples, and sample code.


#3

Yes, that is what we have been doing (standard search API and Tweepy), as well as using Twitter Advanced Search (https://twitter.com/search-advanced?lang=en) and Evernote to download the results. All have been giving similar results.

Some Tweets we expected to see, like these, are not appearing in either search:


We want to collect as many tweets as we can on the topic from Jan 1 to Dec 31 2017. Is there a better method you can recommend to do this?

Here is the code we are using with the standard search API and Tweepy:

import tweepy

consumer_key = "Xxxxxxx"
consumer_secret = "bxxxxxxx"

auth = tweepy.OAuthHandler(consumer_key=consumer_key, consumer_secret=consumer_secret)

api = tweepy.API(auth)

results = api.search(q="breastfeeding", count="100", geocode="45.391758,-75.7234487,25km")

# Print one tweet as a simple XML-like record.
def print_tweet(tweet):
    print("<tweet>")
    print("<handle>@%s</handle>" % tweet.user.screen_name)
    print("<profilelocation>%s</profilelocation>" % tweet.author.location)
    print("<coordinates>%s</coordinates>" % tweet.coordinates)
    print("<geotag>%s</geotag>" % tweet.geo)
    print("<username>%s</username>" % tweet.user.name)
    print("<timestamp>%s</timestamp>" % tweet.created_at)
    print("<text>%s</text>" % tweet.text)
    print("</tweet>")

# Print every tweet returned by the search.
for tweet in results:
    print_tweet(tweet)

#4

The standard search API does not support searching outside of a seven-day window from today’s date. Also, note that you’ve included a geo filter, which will heavily restrict the number of results that you’ll see. Only a small percentage of Tweets are geotagged (in the 2-3% range), so that will reduce the effectiveness of this search.
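
As a rough illustration of the seven-day limit, here is a small sketch that checks whether a date is still reachable by the standard search. The helper name `in_standard_search_window` and the constant are hypothetical, and the exact cutoff the API applies may differ slightly from this simple day count.

```python
from datetime import date, timedelta

# Assumption: the window is treated as exactly seven calendar days back from today.
SEARCH_WINDOW_DAYS = 7


def in_standard_search_window(day, today=None):
    """Return True if `day` falls within the last seven days, counting back from `today`."""
    today = today or date.today()
    return today - timedelta(days=SEARCH_WINDOW_DAYS) <= day <= today


# A date a couple of days back is reachable; a date from mid-2017 is not.
print(in_standard_search_window(date(2018, 5, 1), today=date(2018, 5, 3)))
print(in_standard_search_window(date(2017, 6, 15), today=date(2018, 5, 3)))
```

Anything for which this returns False, which includes all of 2017, would not come back from the standard search regardless of the query terms.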

In order to do a search that covers Jan 1 to Dec 31 2017 you’d need to apply for access to the new premium full archive search API. However, tweepy has not yet been updated to work with the premium APIs, so you’d probably want to look at an alternative like our search-tweets Python library.
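
To give a feel for what a full archive request covering 2017 might look like, here is a minimal sketch that only builds the request URL and JSON body, without sending anything. The endpoint path and the field names (`query`, `fromDate`, `toDate`, `maxResults`) follow my understanding of the premium search documentation and should be checked against it. The helper `build_fullarchive_request` and the environment label `dev` are hypothetical, and authentication is left out entirely.

```python
import json
from datetime import date, timedelta


def build_fullarchive_request(query, first_day, last_day, env_label="dev", max_results=100):
    """Build the URL and JSON body for a premium full archive search (nothing is sent).

    Assumptions: `env_label` is whatever label you gave your premium environment,
    and `toDate` is treated as exclusive, so one day is added to `last_day`.
    """
    url = "https://api.twitter.com/1.1/tweets/search/fullarchive/%s.json" % env_label
    payload = {
        "query": query,
        # Timestamps use the YYYYMMDDhhmm format, in UTC.
        "fromDate": first_day.strftime("%Y%m%d") + "0000",
        "toDate": (last_day + timedelta(days=1)).strftime("%Y%m%d") + "0000",
        "maxResults": max_results,
    }
    return url, payload


url, payload = build_fullarchive_request("breastfeeding", date(2017, 1, 1), date(2017, 12, 31))
print(url)
print(json.dumps(payload, sort_keys=True))
```

The geo filter is deliberately left out of the query here, since it would narrow the results to the small share of Tweets that are geotagged.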