Extended streaming of georeferenced tweets



Hi everyone,

For a research project I would like to compare spatiotemporal tweet patterns across nine global cities. This is a non-profit research project that will feed into my PhD in geography, and I will essentially only use the coordinates and timestamps of the tweets. I understand that, since the bounding boxes I would specify to stream the tweets from the API are quite large, I might hit the rate limit. On the other hand, I understand it is 'against the rules' to get around the rate limits by registering multiple applications to stream the data. Could anyone tell me if there is a way to achieve this anyway (perhaps via some kind of extended streaming access)? Since I have no funding for this project I am looking for a free-of-charge solution, i.e. GNIP or other commercial data options are not possible. Any help appreciated!

Thanks, Martin


The limits on the streaming API relate to the number of filter/track terms permitted on a connection, as well as an overall 1% cap relative to the firehose volume. You're allowed up to 25 location bounding boxes in a query, so your nine cities should be covered; additionally, since the proportion of Tweets that carry location data is relatively small (~2% overall), I'd imagine you shouldn't hit the limit.
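For illustration, here is a minimal sketch of how the nine cities could be combined into the single `locations` parameter that the streaming filter endpoint expects: each box is given as SW longitude, SW latitude, NE longitude, NE latitude, and the boxes are comma-joined. The city names and coordinates below are placeholders, not real study areas.

```python
# Sketch: build the `locations` parameter for the streaming filter.
# Each bounding box is (sw_lon, sw_lat, ne_lon, ne_lat); the API
# accepts up to 25 boxes, comma-joined into one string.
# Coordinates here are illustrative placeholders only.

CITY_BOXES = {
    "city_a": (-0.5, 51.3, 0.3, 51.7),
    "city_b": (-74.3, 40.5, -73.7, 40.9),
}

def build_locations_param(boxes):
    """Flatten bounding boxes into the comma-separated string
    expected by the streaming API's `locations` filter."""
    if len(boxes) > 25:
        raise ValueError("The streaming API allows at most 25 bounding boxes")
    return ",".join(str(coord) for box in boxes.values() for coord in box)

param = build_locations_param(CITY_BOXES)
print(param)  # -> "-0.5,51.3,0.3,51.7,-74.3,40.5,-73.7,40.9"
```

All nine real boxes would go into one dictionary, so the whole study runs on a single connection.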

“extended access” as such is commercial via Gnip; I’m not sure what other options to suggest, unfortunately.


Dear Andy, thanks a lot for your reply. That helps a lot. So far I have been trying to retrieve the data with a separate script for each city, each connecting to the same application, and I was getting 420s when using more than two connections on one app. I will try to incorporate all cities in one script and see how that works… Thanks again!
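One detail worth noting when all cities share one connection: the combined `locations` filter matches a Tweet falling in any of the boxes, so each incoming Tweet has to be assigned back to its city client-side. A minimal sketch of that assignment, with made-up city names and coordinates:

```python
# Sketch: assign a geotagged Tweet back to its city after streaming
# all bounding boxes over a single connection.  Boxes are
# (sw_lon, sw_lat, ne_lon, ne_lat); coordinates are placeholders.

CITY_BOXES = {
    "city_a": (-0.5, 51.3, 0.3, 51.7),
    "city_b": (-74.3, 40.5, -73.7, 40.9),
}

def city_for_point(lon, lat, boxes=CITY_BOXES):
    """Return the first city whose bounding box contains the point,
    or None if no box matches (watch out for overlapping boxes)."""
    for city, (sw_lon, sw_lat, ne_lon, ne_lat) in boxes.items():
        if sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat:
            return city
    return None

# A Tweet's GeoJSON point stores coordinates as [lon, lat]:
print(city_for_point(-0.1, 51.5))   # -> city_a
print(city_for_point(-74.0, 40.7))  # -> city_b
print(city_for_point(10.0, 10.0))   # -> None
```

Storing the city label with the coordinates and timestamp at collection time keeps the later spatiotemporal comparison simple.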


Yes, you’ll certainly have an issue if you’re trying to use more than one connection - sorry about that; it’s another limitation of the public streams.


Thanks Andy, there is one last question I meant to ask regarding the sample that can be obtained, given the 1% cap and the ~2% estimate of geotagged Tweets you mentioned.

How is the sampling applied? Taking your rough estimates: if I stream only geotagged tweets (via the definition of particular bounding boxes), do I get only the geolocated tweets that fall within the 1% firehose sample (i.e. 2% of 1% of the firehose stream), or is the cap applied to the total number of geotagged tweets (i.e. if 2% of the total population are geotagged, I would get half of them)? Thanks again for your helpful comments!


The cap is 1% of the total firehose at a point in time, so if you were tracking a keyword (for example) and the total volume of discussion on it was < 1% of the total, you’d theoretically get all the Tweets related to it. I’m honestly less familiar with how this works on the geofencing side of things, but you should still get a statistically decent sample; be aware, though, that since the number of Tweets with geodata attached is so small overall, you may not get a large amount of data.
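As a back-of-the-envelope illustration of how such a cap behaves (all the rates below are invented for the example, not measured values): if your matched Tweets arrive more slowly than 1% of the instantaneous firehose volume, you receive them all; otherwise the stream is sampled down to the cap.

```python
# Sketch: rough delivered-fraction estimate under a 1%-of-firehose cap.
# All rates are hypothetical, for illustration only.

def delivered_fraction(matched_rate, firehose_rate, cap=0.01):
    """Fraction of matched Tweets delivered if the stream is limited
    to `cap` * firehose_rate Tweets per unit time."""
    limit = cap * firehose_rate
    return min(1.0, limit / matched_rate) if matched_rate else 1.0

# If the firehose ran at 6000 Tweets/s and the boxes matched 30/s,
# 30 < 60 (the 1% cap), so nothing would be dropped:
print(delivered_fraction(30, 6000))   # -> 1.0
# If they matched 120/s, roughly half would get through:
print(delivered_fraction(120, 6000))  # -> 0.5
```

Given that geotagged Tweets from nine city-sized boxes are a small slice of the firehose, the first case is the likely one here.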


Thanks a lot, Andy! All that information really helps.