Search API vs Streaming API

streaming
search
api

#1

Because the Search API documentation includes the following:

“Before getting involved, it’s important to know that the Search API is
focused on relevance and not completeness. This means that some Tweets
and users may be missing from search results. If you want to match for
completeness you should consider using a Streaming API instead.”

I thought I’d compare the data from the Search API to the data from the Streaming API.

Using Python and the statuses/filter API, I recorded tweets where track=@bostonglobe and left the stream open for 7 hours (9:16 AM EST through 4:16 PM EST). I then queried the search/tweets API with q=@bostonglobe and filtered the results down to only those tweets posted during the window in which I had the streaming API open.
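The time-window filtering step could look roughly like the sketch below. It assumes each tweet is a dict parsed from the API's JSON with the standard `created_at` string; the calendar date, the fixed UTC-5 offset for EST, and the helper names are assumptions for illustration, since the post only gives the times of day.

```python
from datetime import datetime, timedelta, timezone

# Format of the `created_at` field, e.g. "Wed Jan 13 14:30:00 +0000 2016"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"
EST = timezone(timedelta(hours=-5))  # assumed fixed offset for EST


def in_window(tweet, start, end):
    """Return True if the tweet was created between start and end (inclusive)."""
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    return start <= created <= end


def filter_to_window(tweets, start, end):
    """Keep only the tweets created inside the capture window."""
    return [t for t in tweets if in_window(t, start, end)]


# Hypothetical date; the original post does not state one.
start = datetime(2016, 1, 13, 9, 16, tzinfo=EST)
end = datetime(2016, 1, 13, 16, 16, tzinfo=EST)

sample = [
    {"id_str": "1", "created_at": "Wed Jan 13 14:30:00 +0000 2016"},  # 9:30 AM EST
    {"id_str": "2", "created_at": "Wed Jan 13 22:00:00 +0000 2016"},  # 5:00 PM EST
]
kept = filter_to_window(sample, start, end)
print([t["id_str"] for t in kept])
```

Comparing timezone-aware datetimes avoids having to convert the EST window to UTC by hand.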

statuses/filter provided 1,734 total tweets that mentioned @bostonglobe
search/tweets provided 2,168 total tweets for the query @bostonglobe

There were 248 tweets in the streaming data that were not included in the search data, and there were 701 tweets that were in the search data that were not in the streaming data.
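A comparison like this comes down to set differences on tweet IDs. The sketch below assumes both captures are lists of tweet dicts with an `id_str` field; the function name and sample data are hypothetical. Building sets also deduplicates each capture, which may matter here: the counts above imply an overlap of 1,734 - 248 = 1,486 from the streaming side but 2,168 - 701 = 1,467 from the search side, so one of the captures may contain repeated records.

```python
def compare_captures(stream_tweets, search_tweets):
    """Split tweet IDs into stream-only, search-only, and shared groups."""
    stream_ids = {t["id_str"] for t in stream_tweets}
    search_ids = {t["id_str"] for t in search_tweets}
    return {
        "only_in_stream": stream_ids - search_ids,
        "only_in_search": search_ids - stream_ids,
        "in_both": stream_ids & search_ids,
    }


# Tiny made-up example; real input would be the two recorded datasets.
stream = [{"id_str": "10"}, {"id_str": "11"}, {"id_str": "12"}]
search = [{"id_str": "11"}, {"id_str": "12"}, {"id_str": "13"}, {"id_str": "13"}]

result = compare_captures(stream, search)
print({name: sorted(ids) for name, ids in result.items()})
```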

No particular time span accounted for the discrepancy, and I haven’t identified any other pattern in the tweets that are missing from one dataset or the other. Any ideas as to why there is such a difference? I was actually expecting the search data to contain substantially fewer tweets than the streaming data.


#2

The Streaming API is limited to 1% of the firehose, so my guess would be that at any particular point in time you weren’t seeing everything that might have been flowing through.

In either case, the public APIs do not offer any guarantee of data completeness and are “as-is” - if that kind of fidelity is required, then the supported route is the commercial set of offerings from Gnip.


#3

In further examining the data I was able to figure out the discrepancy:

The 701 tweets from the search API that were not in the streaming API all started with “RT @BostonGlobe:” - and there were no tweets starting with that from the streaming API. So it appears that the streaming API does not track retweets of tweets made by the user ID that is being queried. But it does track retweets or any other tweets that mention the user ID somewhere else within the tweet.
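A check along these lines is enough to spot the pattern. This is a minimal sketch assuming tweet dicts with a `text` field; the helper name and sample tweets are made up for illustration.

```python
def is_rt_of_account(tweet, screen_name):
    """True if the tweet text starts with 'RT @<screen_name>:' (case-insensitive)."""
    prefix = f"rt @{screen_name.lower()}:"
    return tweet["text"].lower().startswith(prefix)


# Hypothetical tweets standing in for the search-only set.
search_only = [
    {"id_str": "1", "text": "RT @BostonGlobe: Snow expected this weekend"},
    {"id_str": "2", "text": "Great piece by @BostonGlobe this morning"},
    {"id_str": "3", "text": "rt @bostonglobe: Traffic update"},
]

rt_ids = [t["id_str"] for t in search_only if is_rt_of_account(t, "BostonGlobe")]
print(rt_ids)
```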

The 248 tweets that were in the streaming API but not in the search API contained “@bostonglobe” somewhere else within the tweet data, but not within the tweet text itself. For example, in https://twitter.com/MTraum/status/687311492585238529, @BostonGlobe appears in the text of the tweet being linked to but not in the original tweet itself; this shows up in the stream but not in the search.
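To see where in the tweet data a mention actually sits, something like the following sketch can help. It assumes tweet dicts in the REST/streaming JSON shape, where a linked tweet is embedded under `quoted_status` and a native retweet under `retweeted_status`; the function name, the sample tweet, and the placeholder link are hypothetical, and other fields (such as expanded URLs) are not checked.

```python
def mention_locations(tweet, screen_name):
    """List the parts of a tweet whose text contains @screen_name."""
    needle = "@" + screen_name.lower()
    locations = []
    # The top-level tweet text.
    if needle in tweet.get("text", "").lower():
        locations.append("text")
    # Text of an embedded quoted or retweeted tweet, when present.
    for key in ("quoted_status", "retweeted_status"):
        nested = tweet.get(key)
        if nested and needle in nested.get("text", "").lower():
            locations.append(key)
    return locations


# Made-up tweet that links to another tweet mentioning the account.
sample = {
    "text": "Worth a read https://t.co/example",
    "quoted_status": {"text": "New story from @BostonGlobe today"},
}
locations = mention_locations(sample, "bostonglobe")
print(locations)
```

A tweet that returns only `quoted_status` here would match the stream-only pattern described above.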


#4

This is really great analysis, thanks for the follow-up!

Part of what you’ve uncovered there is the difference in retweet behaviour. Before there were native retweets (and references to original Tweets inside the newly-retweeted Tweet object), retweeting was commonly done using the “RT” text syntax, so the Search API is picking those up. The Streaming API has evolved somewhat differently to the REST API, which is why the two behave differently.


#5

I had this issue too. With both the Streaming API and the Search API, the text of many tweets was incomplete and did not even contain the query term, yet those tweets still appeared in the results. I really wonder how this happens. Obviously the query term is present somewhere in the tweet, which is why it was matched, but why is the tweet text incomplete and missing the term?

Ps. @michoco whoever you are, seriously great analysis! Thanks for clearing my other doubts.