Difference between sample and filter streaming API


#1

I have a simple question. According to the documentation, GET statuses/sample “Returns a small random sample of all public statuses. The Tweets returned by the default access level are the same, so if two different clients connect to this endpoint, they will see the same Tweets”. In my opinion, this means that the Tweets for the streaming API are chosen in a deterministic way and are the same for everyone using the API. The other endpoint is POST statuses/filter, for which the documentation says “Returns public statuses that match one or more filter predicates”. My question is: are the public statuses that match one or more filter predicates drawn from the same “random” sample? Is the filtered stream always a subset of the random sample?
Thank you for your attention.


#2

Hi Luca,

Thanks for the question. Whether you use the [node:10390] or the [node:10389] endpoint you mentioned, you can retrieve roughly 1% of the firehose, which is the source of all public Tweets.

With [node:10389] in particular, you are filtering from the firehose, with a maximum resulting volume of 1% of the total Tweets at that moment, so this is not a subset of the sample. In other words, if the keywords you are tracking account for less than 1% of the firehose, you will receive all the matching Tweets; otherwise, you will be capped. To give you an idea, there are more than 500 million Tweets posted every single day on Twitter, so 1% still represents a very large number.
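To make the cap concrete, here is a minimal Python sketch of the arithmetic. It assumes the figure of 500 million Tweets per day quoted above and a flat 1% cap measured over a whole day. The function name `delivered` and the daily window are illustrative assumptions only, since nothing here states over what period the cap is actually measured.

```python
# Rough model of the filter cap. All names and the daily window are
# assumptions made for illustration, not part of the Twitter API.
TWEETS_PER_DAY = 500_000_000  # "more than 500 million Tweets" per day
CAP_PERCENT = 1               # "roughly 1%" of the firehose


def delivered(matching: int, total: int, cap_percent: int = CAP_PERCENT) -> int:
    """Return how many matching Tweets get through under a flat cap."""
    cap = total * cap_percent // 100  # integer arithmetic avoids float rounding
    return min(matching, cap)


# Keywords below the cap: every matching Tweet is delivered.
below_cap = delivered(1_000_000, TWEETS_PER_DAY)

# Keywords above the cap: delivery is limited to 1% of the total.
above_cap = delivered(10_000_000, TWEETS_PER_DAY)

print(below_cap, above_cap)
```

Under these assumptions, a keyword set matching 1 million Tweets a day is delivered in full, while one matching 10 million is cut down to the 5 million allowed by the cap.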

See also this FAQ entry:

[faq:6861]


#3

If, for example, I need to analyze all the Tweets referring to a particular event, would it be better to use filtering with some keywords rather than using the sample and then “cleaning” my data?
For instance, if I need all the Tweets about a NASA mission, is it better to use the filter API with track=nasa,moon,launch and so on, rather than using the sample API and then searching?
(I know there is the search API, but I’m interested in a real-time stream.)
I’ve tried both, and it seems that the filter API returns Tweets that aren’t in the sample, so this seems to be true :smiley: . I asked just to be sure.
Thank you for the help.


#4

Absolutely, [node:10389] is the way to go here.

Using the sample, you will receive a random selection of Tweets, about 1% of all Tweets. In your example, that means you may miss the majority of the Tweets related to the NASA mission you are interested in, and you will have to filter manually as a second step to extract them.

However, using the filtering endpoint, not only will you exclusively gather Tweets matching at least one of the keywords you are tracking, but you may also receive a majority of them, if not all of them, as long as they account for less than roughly 1% of the full firehose.
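The difference between the two approaches can be illustrated with a toy simulation in Python. This is only a sketch under stated assumptions: the stream size, the share of matching Tweets, the uniform 1% sample, and the function name `simulate` are all made up for illustration and say nothing about how Twitter actually selects Tweets.

```python
import random


def simulate(total=100_000, matching_share=0.002, rate_percent=1, seed=42):
    """Compare 'sample then filter locally' with 'filter at the source'.

    Every parameter is a hypothetical value chosen for the example.
    """
    rng = random.Random(seed)

    # True marks a Tweet that matches one of the tracked keywords.
    tweets = [rng.random() < matching_share for _ in range(total)]
    matching_total = sum(tweets)

    # Approach 1: take a ~1% random sample, then keep the matching Tweets.
    sampled = [t for t in tweets if rng.random() < rate_percent / 100]
    from_sample = sum(sampled)

    # Approach 2: filter first, capped at 1% of the whole stream.
    cap = total * rate_percent // 100
    from_filter = min(matching_total, cap)

    return matching_total, from_sample, from_filter


matching_total, from_sample, from_filter = simulate()
print(f"matching: {matching_total}, via sample: {from_sample}, via filter: {from_filter}")
```

With matching Tweets well below the cap, the filter approach returns all of them in this model, while the sample approach keeps only the small fraction that happened to land in the sample.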


#5

Hi Romain,

If I track #somerandomhashtag with https://stream.twitter.com/1.1/statuses/filter.json?track=%23somerandomhashtag will I - disconnections aside - receive EVERY SINGLE TWEET containing #somerandomhashtag, so long as the pool of tweets containing my chosen hashtag does not exceed 1% of the full firehose of tweets?
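For reference, the `%23` in that URL is just the percent-encoded `#`. A minimal Python sketch that builds the same URL with the standard library is below; the helper name `build_filter_url` is hypothetical, and the sketch only constructs the string without connecting or handling the authentication a real request would need.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted above.
FILTER_ENDPOINT = "https://stream.twitter.com/1.1/statuses/filter.json"


def build_filter_url(track: str) -> str:
    """Return the filter URL with the track value percent-encoded."""
    # urlencode escapes reserved characters, so '#' becomes '%23'.
    return f"{FILTER_ENDPOINT}?{urlencode({'track': track})}"


url = build_filter_url("#somerandomhashtag")
print(url)
```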

I don’t anticipate my hashtag reaching 1% of the full stream of tweets.

Thanks!


#6

That’s exactly the intent of the API, yes. So long as your filter query is not going to exceed 1% of the total pool, you should see all of the tweets matching the filter.


#7

Hi,
I have one quick question: when you say “1% of all Tweets”, what exactly is the time period? Every second? Every 10 seconds? Thanks,

