I have read that the sample streaming API contains approximately 1% of the tweets from the full firehose. Is this a truly random sample? Is there any preprocessing performed on a given tweet or its metadata to determine whether a tweet might be excluded from the sample?
The sampling is based on a hash that is completely agnostic to any substantive metadata, so it should be a fair and proportional representation across all cross-sections, including whether there are links, whether there are hashtags, @-mentions, @-replies, what app/client generated the Tweet, etc.
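To illustrate why a hash-based rule gives proportional cross-sections, here is a minimal sketch. It is not the actual sampling method, which is not publicly documented. The function `in_sample`, the use of SHA-256 on the Tweet ID, the 100-bucket scheme, and the synthetic `has_link` field are all assumptions made for illustration only.

```python
import hashlib
import random


def in_sample(tweet_id: int, rate_percent: int = 1) -> bool:
    """Hypothetical rule: keep a Tweet when a hash of its ID lands in
    the lowest `rate_percent` of 100 buckets. Only the ID is hashed, so
    links, hashtags, mentions, and client cannot affect the decision."""
    digest = hashlib.sha256(str(tweet_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


# Synthetic population (assumption): about 30% of Tweets contain a link.
rng = random.Random(42)
population = [{"id": i, "has_link": rng.random() < 0.3} for i in range(200_000)]

sample = [t for t in population if in_sample(t["id"])]

sample_rate = len(sample) / len(population)
pop_share = sum(t["has_link"] for t in population) / len(population)
sample_share = sum(t["has_link"] for t in sample) / len(sample)

print(f"sample rate:              {sample_rate:.4f}")
print(f"link share in population: {pop_share:.4f}")
print(f"link share in sample:     {sample_share:.4f}")
```

Under these assumptions the sample rate comes out close to 1%, and the share of Tweets with links in the sample stays close to the share in the population, because the selection rule never looks at the content. Note that such a rule is deterministic: the same ID is always in or always out.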
So how is each sample drawn? Simple random sampling? There seems to be no documentation about that.
Also, what proportion of the population is it?
I read a study that attempted to compare the statistical properties of the streaming API sample and a sample of the Firehose. They concluded that the streaming API provided, on average, 43.5% of the total Firehose on any given day.
The Sample endpoint provides a statistically relevant sample of 1% of the full firehose.
Thanks for that. However, is each sample from the population equally likely to be drawn? Is there documentation for that?
I need to know that particularly because I’m using the streaming API as a source of data for my master’s thesis.
I understand the reason for your question; unfortunately we are not able to share any more details of the API beyond that.
I don’t understand. Are you not allowed to tell me whether the sample is truly random?
I’m just unable to tell you here how the sampling is done, as there’s no public documentation on that. Our description is just that it is “a statistically relevant sample” and there’s no further detail than that.