Does the Streaming API miss some tweets?


#1

I am using tweepy to collect streaming tweets by keyword and store them in my database.
I then found that I keep missing a lot of tweets. Say a tweet was retweeted 10,000 times, so the retweet_count on the latest retweet is 10,000. In my database, though, I only have around 8,000 records for that tweet.
I also ran a test with my friend’s Twitter account and my own: I retweeted my friend’s post, and it seems that sometimes tweepy really doesn’t catch the retweet. It looks like it happens randomly?

If it is true that I can’t get the full data from the standard Streaming API, is there another way to do so? My goal is to get all the retweeters of some specific tweets, but statuses/retweeters/ids can’t return more than 100 retweeter IDs. Could anyone let me know how to get the whole list of retweeters? Thanks so much.


#2

It has happened to me as well. To get almost 100% of the data, I now use the REST API in parallel with the Streaming API to backfill tweets that the Streaming API didn’t send me.
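A minimal sketch of that backfill idea, assuming tweets are plain dicts keyed by their `id` field. The `merge_backfill` name and the sample data are hypothetical, and the actual REST request (whichever endpoint you poll) is left out:

```python
def merge_backfill(stored, backfilled):
    """Add tweets from a REST backfill that the stream did not deliver.

    `stored` maps tweet id -> tweet dict (what the stream already saved).
    `backfilled` is an iterable of tweet dicts returned by a REST call.
    Returns the list of tweets that were actually new.
    """
    added = []
    for tweet in backfilled:
        tweet_id = tweet["id"]
        if tweet_id not in stored:
            stored[tweet_id] = tweet
            added.append(tweet)
    return added


# Tweets received from the stream so far (hypothetical sample data).
stored = {1: {"id": 1, "text": "a"}, 2: {"id": 2, "text": "b"}}
# A REST response that overlaps with the stream and contains one missed tweet.
rest_page = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
new_tweets = merge_backfill(stored, rest_page)
```

Keying on the tweet ID means the overlap between the two sources is harmless: only tweets the stream never delivered get added.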

I’m not sure exactly what you’re doing, but could it be that most of the retweets you don’t get were made by private accounts? Because a 2k gap is a lot.


#3

You’re attempting to use the standard APIs for access to full-fidelity Twitter data. This has never been possible; you either need to fall back to filling in the data via rate-limited REST calls, or subscribe to a premium or enterprise API tier.


#4

Would you kindly explain a bit more about “fall back to fill in the data via rate limited REST calls”? For example, which endpoints can I use to get the (randomly) missing data?


#5

Would you please explain a bit more about how you use the REST API for this purpose? For example, which endpoints?
I don’t think it is because of private accounts, because many of the retweets I do have in my database are from private accounts.
I do think a 2k gap is a lot. But when I search my database, I start from the rows where “retweet_count” == 1, which means I was collecting from the very beginning of these specific tweets. Even so, in the dataset I see 9K rows against an 11K retweet count, 16K rows against a 20K retweet count, and so on. I’ve never seen a tweet with a smaller gap than this.
It is really giving me a headache, because I need the full retweeter list of a tweet, but I can’t get it with either the Streaming API or the REST API.
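To put numbers on that gap, here is a quick calculation using the figures above. The `capture_rate` helper is just for illustration and is not part of tweepy:

```python
def capture_rate(rows_stored, retweet_count):
    """Fraction of a tweet's retweets that ended up in the database."""
    return rows_stored / retweet_count


# (rows in my database, retweet_count on the latest retweet)
for rows, count in [(9_000, 11_000), (16_000, 20_000)]:
    missing = count - rows
    print(f"{rows}/{count}: {capture_rate(rows, count):.1%} captured, {missing} missing")
```

Both cases come out at roughly 80% captured, which matches the 8,000 out of 10,000 from my first post.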


#6

@rAin_Nevermore
You can learn more about the endpoint that you will have to use to fill in the gaps at this link.

You can upgrade to the premium or enterprise API tier to access full fidelity Twitter data and avoid missing tweets like you currently are with tweepy.
You can apply to get access to our Premium API at this link.
If you are interested in applying for Enterprise, you can do so at this link.


#7

I’ve applied for the premium API but still haven’t gotten the hang of it…
Does the Premium API make it possible to get the retweeter list (more than 100)? In other words, does the statuses/retweeters/ids endpoint still have the 100 limit or not?


#8

Well, I use the User Stream, so I don’t know if it fits your situation completely. I use the Streaming API to get replies made to my quiz bot. Replies are basically the most important thing in my game, because they contain my players’ guesses, which I have to analyze to decide whether to award points.

I was missing a lot of replies made to my account, on a daily basis, even though my bot receives only 300 replies a day. It was a problem because there are rankings, and sometimes good answers didn’t make it.

The first thing I did was to use the with=user parameter, because I didn’t need to see all the activity from the people I follow. Next, I added a “track” parameter when connecting to the Streaming API. I set it to track “@whattheshot” (the bot’s nickname), and that reduced the number of missing tweets almost to zero. The API is smart: it doesn’t send you duplicates if you get a mention from the standard user stream feature and the same one through the track parameter.

It was almost perfect, but sometimes a few mentions didn’t make it, or my game was off because the Streaming API was having problems, as it did a few weeks ago. So I created a “backup” script that uses the REST API and, in my case, calls the statuses/mentions_timeline.json endpoint every 30 seconds (to respect the rate limits) whenever my quiz is on (every 10 minutes, for five minutes).

Of course it’s not 100% perfect, but it is way better than before. I just needed to check the tweets the backup system gives me, to be sure I hadn’t already consumed them through the Streaming API.
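Roughly, the backup loop looks like the sketch below. This is not my actual script: `fetch_mentions` is a hypothetical callable standing in for the REST request to statuses/mentions_timeline.json, and I’m assuming tweets are dicts with an `id` key and that the IDs already consumed from the stream are kept in a set.

```python
import time


def poll_mentions(fetch_mentions, seen_ids, since_id=None):
    """Run one backup poll and return (new_tweets, updated_since_id).

    `fetch_mentions(since_id)` is a hypothetical stand-in for the REST call;
    it should return a list of tweet dicts that each have an "id" key.
    """
    new_tweets = []
    for tweet in fetch_mentions(since_id):
        tweet_id = tweet["id"]
        # Remember the highest ID fetched so the next poll can skip older tweets.
        if since_id is None or tweet_id > since_id:
            since_id = tweet_id
        if tweet_id in seen_ids:
            continue  # already consumed through the Streaming API
        seen_ids.add(tweet_id)
        new_tweets.append(tweet)
    return new_tweets, since_id


def run_backup(fetch_mentions, seen_ids, rounds, interval=30, sleep=time.sleep):
    """Poll `rounds` times, waiting `interval` seconds between polls."""
    since_id = None
    recovered = []
    for i in range(rounds):
        new_tweets, since_id = poll_mentions(fetch_mentions, seen_ids, since_id)
        recovered.extend(new_tweets)
        if i < rounds - 1:
            sleep(interval)
    return recovered
```

With a 30-second interval, a five-minute quiz session corresponds to about `rounds=10`. The `seen_ids` set is what keeps a mention from being scored twice when both the stream and the backup deliver it.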


#9

No. The premium APIs currently only offer 30-day search. There’s no change to the standard retweeters/ids endpoint.