Doubts about how much data you can get through the Streaming API


#1

Hi there,

I am using the Streaming API to harvest geo-tagged English tweets for the Greater London area. After reading several papers, I found that those datasets contain much more data than I collect each day. For example, in one study the authors collected 41.2 million geo-tweets over the whole of 2014 (roughly 113k per day), while I can only collect 7k-9k per day (even though I collect English tweets only). I am wondering whether there is something wrong with my code/settings, or whether they were using commercial APIs and that is how they could get so much data.
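For context, the kind of setup I mean is roughly this (a minimal sketch, assuming tweepy 4.x against the v1.1 statuses/filter endpoint; the credentials are placeholders and the Greater London bounding box coordinates are only approximate):

```python
import tweepy

# Approximate Greater London bounding box: SW corner then NE corner,
# in (longitude, latitude) order as the locations filter expects.
LONDON_BBOX = [-0.51, 51.28, 0.33, 51.70]

class GeoTweetCollector(tweepy.Stream):
    def on_status(self, status):
        # Only keep Tweets that actually carry some geo information;
        # the locations filter matches exact coordinates and place bounding boxes.
        if status.coordinates or status.place:
            print(status.id, status.lang, status.text[:80])

stream = GeoTweetCollector(
    "CONSUMER_KEY", "CONSUMER_SECRET",
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)

# languages=["en"] restricts delivery to Tweets Twitter classifies as English.
stream.filter(locations=LONDON_BBOX, languages=["en"])
```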

Can anyone share their experience? How much data can you get in one day? I also want to know whether 7k-9k tweets per day is a reasonable number.


#2

There are several angles to this.

If you’re using the Streaming API, you have access to up to 1% of the Twitter firehose. If the terms or context you’re tracking fall inside that 1%, you may receive “all” of the Tweets on that topic or context; otherwise you’ll see limit messages in the stream responses.
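To make that concrete, here’s a minimal sketch of how you might tally limit notices against delivered Tweets when reading the raw stream (the message shape follows the documented v1.1 limit notice; `read_raw_stream_lines` is just a placeholder for however you consume the connection):

```python
import json

stats = {"delivered": 0, "undelivered": 0}

def handle_stream_message(raw_line, stats):
    """Classify one newline-delimited message from a statuses/filter connection."""
    msg = json.loads(raw_line)
    if "limit" in msg:
        # Limit notice: "track" is the cumulative count of matching Tweets
        # withheld because delivery would exceed your share of the firehose.
        stats["undelivered"] = msg["limit"]["track"]
    elif "text" in msg:
        # An ordinary Tweet payload that was actually delivered to you.
        stats["delivered"] += 1

# for line in read_raw_stream_lines():   # placeholder for your connection handling
#     handle_stream_message(line, stats)
```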

Only a relatively small proportion of Tweets carry geo information - some estimates put it in the single-digit percentages overall. You’re also layering on a language restriction, which will potentially lower the volume of Tweets you’re capturing further.
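As a rough way to see this in your own data, you could tally how the Tweets you do receive carry location information (a sketch, assuming each item is the raw JSON dict from the stream; "coordinates" and "place" are the standard v1.1 payload fields):

```python
def geo_breakdown(tweets):
    """Rough tally of how a sample of Tweet payloads carries location information."""
    exact = sum(1 for t in tweets if t.get("coordinates"))  # exact GeoJSON point
    place_only = sum(
        1 for t in tweets if t.get("place") and not t.get("coordinates")
    )  # place bounding box only
    no_geo = len(tweets) - exact - place_only
    total = len(tweets) or 1
    return {
        "exact_point_pct": 100.0 * exact / total,
        "place_only_pct": 100.0 * place_only / total,
        "no_geo_pct": 100.0 * no_geo / total,
    }
```

Bear in mind that a locations filter matches both exact points and place bounding boxes, so even the Tweets you do receive will often be place-tagged rather than carrying precise coordinates.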

So it is not easy to say whether the volume you’re seeing is typical - I haven’t personally attempted to track Tweets for that area over a 24-hour period - but it is not outside the realm of possibility.

Enterprise (commercial) APIs do not carry the firehose volume restriction, but they are still constrained by how few Tweets are posted with geodata, as sharing location is opt-in on a per-account basis.