Getting all tweets for research purposes


#1

I am a Software Engineering student, and as part of my research I need to work with a large dataset of tweets.

I understand that the best way is to collect the stream of all tweets for a certain period using curl.

But I am having two problems:

  1. I can only use https://stream.twitter.com/1.1/statuses/sample.json. With https://stream.twitter.com/1.1/statuses/firehose.json I get an error:
HTTP/1.1 403 Forbidden
    ....
User not in role

I understand that the firehose is not available to developers by default. How can I get access to it?

  2. After about fifteen minutes the connection is interrupted. How can I create a long-lived connection? The OAuth authorization is only valid for a few minutes, so I have to create a new one every time. Is there any way to keep the connection alive, or to reconnect using curl with the same Authorization header?

Thanks! And sorry for my English, it’s not my native language…


#2

The Firehose consists of the entire Tweet and event stream for Twitter, which currently carries over 500m Tweets every day. This is a massive volume of Tweets per second, and it requires significant infrastructure to process. Firehose access is limited to a very small number of certified data partners. If you need to use that data you can contact them for access to their APIs. One example is Gnip (which is part of Twitter), but there are other independent data partners as well. One advantage is that these partner APIs are able to offer some very advanced filtering and tracking options.

You can use the sample or filter endpoints to pull up to 1% of the current Tweet stream at any one time. This is often enough for many applications, including those tracking hashtags, search terms, etc.

The timeout on your curl connection is almost certainly a limitation of curl itself, which by default disconnects after 15 minutes. You should either look at how to configure curl, or write an application of your own that listens to the streaming endpoint without any disconnection issues.
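
As a rough illustration of "an application of your own", here is a minimal Python sketch that reads the sample endpoint line by line and reconnects with exponential backoff when the connection drops. The names `stream_forever` and `open_sample_stream`, the 90-second read timeout, and the backoff limits are all assumptions made for this example, not values taken from the API documentation. The Authorization header is assumed to be built elsewhere.

```python
import http.client
import time
import urllib.request

SAMPLE_URL = "https://stream.twitter.com/1.1/statuses/sample.json"


def open_sample_stream(auth_header, url=SAMPLE_URL, timeout=90):
    """Yield non-empty lines from the streaming endpoint (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"Authorization": auth_header})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw in response:
            if raw.strip():  # skip blank lines (assumed to be keep-alives)
                yield raw.decode("utf-8")


def stream_forever(open_stream, max_failures=5, base_delay=1.0,
                   max_delay=320.0, sleep=time.sleep):
    """Yield lines from open_stream(), reopening it whenever it ends or fails.

    Gives up after max_failures consecutive attempts that produced no data.
    """
    failures = 0
    while True:
        try:
            for line in open_stream():
                failures = 0  # data arrived, so reset the backoff
                yield line
        except (OSError, http.client.HTTPException):
            pass  # network error or timeout: fall through and retry
        failures += 1
        if failures >= max_failures:
            return
        # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
        sleep(min(base_delay * 2 ** (failures - 1), max_delay))
```

It could be used as `for line in stream_forever(lambda: open_sample_stream(header)): ...`, where `header` is a freshly signed Authorization value. Note that this sketch also retries on HTTP errors such as 403, which a real client would probably want to treat as fatal.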


#3

Thanks!
Is there any time limit for listening to the streaming endpoint, or can I leave it open for a week or so?


#4

There’s no limit, assuming your own network connections do not time out or disconnect.


#5

Thank you very much, you helped me a lot.
I’m now writing a small program for that purpose. I’d love to know whether there is some way to get a permanent OAuth authorization, because what I get from the OAuth tool is valid for only a few minutes, so for an automated script I need a permanent one.
Thanks again.


#6

There are lots of libraries that could help you with this if you are using a language like Python or PHP.
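
To sketch what such a library does under the hood: an OAuth 1.0a Authorization header embeds a timestamp and a one-time nonce in its signature, which would explain why a header copied from the OAuth tool goes stale after a few minutes. The keys and tokens themselves can be reused, so a script can sign each new request itself. The function below is a minimal standard-library sketch of HMAC-SHA1 signing as described in RFC 5849; the name `oauth_header` and its parameters are made up for this example, and a maintained OAuth library is the safer choice in practice.

```python
import base64
import hashlib
import hmac
import secrets
import time
from urllib.parse import quote


def _enc(value):
    """Percent-encode per RFC 3986 (only unreserved characters stay as-is)."""
    return quote(str(value), safe="")


def oauth_header(method, url, params, consumer_key, consumer_secret,
                 token, token_secret, nonce=None, timestamp=None):
    """Build an OAuth 1.0a Authorization header value for one request."""
    oauth = {
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": secrets.token_hex(16) if nonce is None else nonce,
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(int(time.time()) if timestamp is None else timestamp),
        "oauth_token": token,
        "oauth_version": "1.0",
    }
    # Parameter string: request and oauth parameters, encoded, then sorted.
    pairs = sorted((_enc(k), _enc(v)) for k, v in {**params, **oauth}.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    # Signature base string: METHOD & encoded URL & encoded parameter string.
    base = "&".join([method.upper(), _enc(url), _enc(param_string)])
    key = f"{_enc(consumer_secret)}&{_enc(token_secret)}"
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    oauth["oauth_signature"] = base64.b64encode(digest).decode()
    fields = ", ".join(f'{_enc(k)}="{_enc(v)}"' for k, v in sorted(oauth.items()))
    return "OAuth " + fields
```

Calling `oauth_header("GET", url, {}, ...)` just before every connection attempt yields a fresh timestamp and nonce each time, so no manual re-authorization step should be needed as long as the stored credentials remain valid.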