Stream.filter sample size


#1

Another developer and I used the filter endpoint to track an identical list of search terms over the course of a few days. After reading through the forum, I see that filter should pull a 1% sample.

After analyzing our datasets, we found that around 90% of the tweets are the same in both. How “random” is the 1% sample, and is there a possible explanation for the overlap?


#2

The random sample is 1% of the whole firehose volume, not 1% of all the tweets matching the query - so if your query selects a sufficiently unique or rarely used term, you are likely to receive most of the matching tweets, which would explain a lot of identical hits in both of your datasets.
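
To make the arithmetic concrete, here is a rough Python sketch. It assumes the limit behaves like a simple cap at 1% of firehose volume, which is my reading of the explanation above and not a documented formula. The function name and the volumes in the usage lines are made up for illustration.

```python
def delivered_fraction(matching_per_sec: float,
                       firehose_per_sec: float,
                       cap_ratio: float = 0.01) -> float:
    """Fraction of matching tweets delivered under a simple volume cap.

    Assumption: the stream delivers every matching tweet until the
    matching volume exceeds cap_ratio * firehose volume, and only the
    capped amount after that.
    """
    if matching_per_sec <= 0:
        return 1.0  # nothing matches, so nothing is dropped
    cap = cap_ratio * firehose_per_sec
    return min(matching_per_sec, cap) / matching_per_sec


# Hypothetical volumes, for illustration only.
print(f"{delivered_fraction(20, 6000):.0%}")   # rare terms, below the cap
print(f"{delivered_fraction(120, 6000):.0%}")  # busy terms, above the cap
```

Under this model, two people tracking the same rare terms both stay below the cap, so both receive essentially every matching tweet.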


#3

Thanks, that’s helpful. I found an error in my analysis of the two datasets - it looks like they are very close to matching 100%. This is over 100 search terms. So does the filter stream the same 1% random sample for everyone?
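
For anyone repeating the comparison, a minimal Python sketch of one way to measure the overlap is below. It assumes each dataset has been reduced to a list of tweet IDs, and it uses shared IDs divided by all distinct IDs as the overlap measure; other definitions are possible. The function name and sample IDs are hypothetical.

```python
def overlap_ratio(ids_a, ids_b) -> float:
    """Share of distinct tweet IDs that appear in both datasets."""
    a, b = set(ids_a), set(ids_b)  # sets drop duplicate IDs within a dataset
    union = a | b
    if not union:
        return 0.0  # both datasets are empty
    return len(a & b) / len(union)


# Hypothetical tweet IDs, for illustration only.
mine = [101, 102, 103, 104]
theirs = [103, 104, 105, 106]
print(overlap_ratio(mine, theirs))
```

Comparing on tweet ID rather than on text avoids counting retweets or near-identical text as the same tweet.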