I have Stream A fetching tweets based on a geographic bounding box. I only consider tweets that actually carry a geo-location, i.e., I remove all tweets that passed the filter merely because the “place” field contained geo-coordinates. The current dataset contains tweets collected over 200 consecutive days. When I plotted the number of tweets per day I noticed something very distinct: at first, the average number of tweets was around 20k. After around 12 days (09./10. Nov ’14) it dropped to around 10k, and after another 170 days (27./28. April ’15) it eventually dropped below 5k tweets per day. The drops are extremely obvious.
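For context, the filtering step I apply is essentially the following (a minimal sketch; field names follow the Twitter v1.1 tweet JSON, where an exact GPS fix appears as a GeoJSON Point in `coordinates`, while tweets that only matched via `place` have `coordinates == None`):

```python
def has_exact_geo(tweet):
    """Keep only tweets with a true point geo-location.

    Tweets that matched the bounding-box filter solely through their
    "place" polygon carry no "coordinates" object and are dropped.
    """
    coords = tweet.get("coordinates")
    return coords is not None and coords.get("type") == "Point"

# Illustrative payloads (not real tweets): one exact GPS fix, one place-only match
gps_tweet = {"coordinates": {"type": "Point", "coordinates": [13.40, 52.52]}}
place_tweet = {"coordinates": None, "place": {"bounding_box": {"type": "Polygon"}}}
```

So only tweets for which `has_exact_geo` returns `True` end up in the dataset.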
Well, that made me curious. So I started a Stream B with the exact same filter conditions on another machine to compare the results. I ran it for an interval of 12 days, and the results were unambiguous: Stream A yields only 50% (almost exactly) of the tweets Stream B yields, despite identical filter conditions. In fact, I use the same Python script, just with different access tokens.
What is happening here? I know that there are caps if the rate of tweets is too high. But how come that (a) the number of tweets returned by a stream suddenly drops to 50%, and (b) two identical streams result in completely different numbers of tweets? Do long-running streams get less and less priority? Did I miss something in the documentation?
EDIT: One of the drops I see in my data is on the 27./28. of April. Other users seem to have experienced the same problem [1,2]. However, I cannot see how that explanation might apply to my case, since I only consider tweets that actually have geo-coordinates (independent of any information derived from the “place” field).