Two identical streams - one returns only 50% of tweets

rate-limits
oauth
streaming

#1

For testing purposes I created two identical Streams A and B, i.e., the same filter condition based on a bounding box, using different applications / access credentials. Both streams are running on the same machine.

However, Stream A returns only 50% of the tweets that Stream B returns. In fact, A misses every second tweet that Stream B returns – see the screenshot of two terminal windows just outputting the tweets of both streams side by side:

What could be the reason for getting different results from Stream A and B? Apart from the access credentials, I’m using the exact same script. Right now I have to assume that my own application has been restricted or something. I couldn’t find anything about that in the documentation though. Any ideas or suggestions?


#2

The streaming API is not designed for completeness. Depending on the volume of Tweets that it would have to return, it is usually limited to 1% of the full amount of Tweets. If you need all Tweets, you need to get special access from Twitter or use Gnip; both require payment as far as I know.


#3

Yes, I understand this. But I cannot see how that explains that two identical streams A and B differ exactly in such a way that one stream yields only every second tweet compared to the other stream. By now, I am actually running the same script (= stream with same filter conditions but different access credentials) on two different machines 1 and 2. So I have four streams: A1, A2, B1, B2. Three of them return exactly the same tweets, and one, say A1, only 50%, with every second tweet missing.

I understand that A2, B1 and B2 do not necessarily return the full amount of tweets. I’m just wondering why A1 returns fewer tweets than the other streams.


#4

Just to rule out any weirdness on your side, you could try seeing if it is possible to move the 50% issue from the computer running A1 to the computer running A2. Try switching IPs to see if the issue moves with the IP. If you cannot re-create the issue on different computers, then it may point to a problem with the computer or its connectivity.


#5

The issue doesn’t move with the IP. I just started the same stream A2 (same filter conditions + same access credentials) on the second machine 2. A1 is lacking every second tweet I see in A2. Thus, the only difference I could currently make out is the IP – it’s really a very simple script that collects all tweets given a bounding box. I also stopped and restarted A1; no change. [Edit: As already mentioned, the same stream on Machine 1, just with other credentials, also doesn’t show this 50% issue. So I think I can rule out any problems with the machine and connectivity.]

Maybe it helps to elaborate a bit. A1 with the 50% output is my long-running crawler, continuously collecting tweets for about 250 days by now. So maybe Twitter decided at some point to limit that stream. The following graph shows the number of daily tweets for the first 200 days, starting from October 31st, 2014:

As one can see, after 10 or 11 days there’s been a significant drop in the average number of tweets, about 50%. Then I have some spikes that are hard to explain, particularly the one on the weekend of March 7th/8th. Looking at the data showed that the number of tweeting users was much higher on that weekend; it was not the usual number of users simply sending more tweets.

A second clear drop happened on April 27th/28th. This has been observed by several people [1,2] and coincides with changes Twitter made with respect to fetching tweets based on a bounding box – although this change does not explain why the number of tweets per day would drop so significantly.

[1] 80% reduction in tweets with coordinate data
[2] Volume drop in streaming API


#6

Problem solved…and it has nothing to do with Twitter!

To be more flexible, I used RabbitMQ as the message queuing system. My crawler publishes each received tweet into a queue. So far so good. However, I have two consumers, i.e., two scripts that listen to that message queue and process each newly incoming tweet. The problem was that I had configured the message queue in such a way that the incoming tweets are SHARED among the two consumer scripts: 1 for script A, 1 for script B, 1 for script A, … hence the 50%.
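
For anyone who wants to see the effect in isolation, here is a minimal in-memory sketch of that sharing behaviour. It is not RabbitMQ code: `dispatch_round_robin` and the consumer names are made up for illustration, and it assumes that messages on a single queue are handed to the attached consumers in strict rotation.

```python
from itertools import cycle


def dispatch_round_robin(messages, consumer_names):
    """Hand each message to exactly one consumer, rotating through them.

    Models several consumers attached to the SAME queue.
    Hypothetical helper, not part of any RabbitMQ client library.
    """
    received = {name: [] for name in consumer_names}
    rotation = cycle(consumer_names)
    for message in messages:
        # Each message is delivered once, to whichever consumer is next.
        received[next(rotation)].append(message)
    return received


tweets = [f"tweet-{i}" for i in range(1, 7)]
shared = dispatch_round_robin(tweets, ["script_a", "script_b"])

print(shared["script_a"])  # every second tweet, starting with the first
print(shared["script_b"])  # the tweets script_a never sees
```

With two consumers on one queue, each script ends up with half of the stream, which matches the "every second tweet missing" pattern from the earlier posts.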

Now I have configured the message queue as a real Publish/Subscribe system, i.e., both scripts each have their OWN queue, and the crawler publishes into both of them. Now both scripts get 100% of all the incoming tweets.
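
The publish/subscribe setup can be sketched the same way. Again, this is only an in-memory illustration under my own assumptions, not the actual broker configuration: the `FanoutExchange` class and its method names are hypothetical stand-ins for an exchange that copies every message into each bound queue.

```python
from collections import deque


class FanoutExchange:
    """In-memory stand-in for a publish/subscribe exchange (illustrative only)."""

    def __init__(self):
        self._queues = []

    def bind_queue(self):
        """Create a queue owned by one consumer and attach it to the exchange."""
        queue = deque()
        self._queues.append(queue)
        return queue

    def publish(self, message):
        """Copy the message into every bound queue."""
        for queue in self._queues:
            queue.append(message)


exchange = FanoutExchange()
queue_a = exchange.bind_queue()  # consumed only by script A
queue_b = exchange.bind_queue()  # consumed only by script B

for i in range(1, 5):
    exchange.publish(f"tweet-{i}")

print(list(queue_a))  # script A sees all four tweets
print(list(queue_b))  # script B sees the same four tweets
```

In RabbitMQ terms this corresponds to a fanout exchange with one queue bound per consumer, rather than two consumers reading from the same queue.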