How many days old can a Tweet be to guarantee recovery (using REST)?

rest
streaming
search

#1

Hi. I have a research project in which I gather data in real time using the Streaming API, for months at a time. I use a track filter, such as a few hashtags, to identify tweets about a given subject. When the stream disconnects or my collector crashes, I reconnect and attempt to recover the data lost during the outage using GET search/tweets with the same filter I use for the stream. Note that I usually (99% of the time) do not hit the limits of the Streaming 1% endpoint, and this question is not about hitting that limit.
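
To make the recovery step above concrete, here is a rough sketch of how the gap between the last tweet received before a disconnect and the first one received after reconnecting might be backfilled. It assumes the `since_id` (exclusive lower bound) and `max_id` (inclusive upper bound) parameters of GET search/tweets. The names `fetch_page` and `backfill_gap` are hypothetical; `fetch_page` stands in for whatever client code issues the actual request with the same query as the stream.

```python
from typing import Callable, Dict, List

# Hypothetical wrapper around GET search/tweets: takes (since_id, max_id) and
# returns one page of tweet dicts, each with at least an integer "id" field.
FetchPage = Callable[[int, int], List[Dict]]


def backfill_gap(fetch_page: FetchPage,
                 last_id_before_gap: int,
                 first_id_after_gap: int) -> List[Dict]:
    """Page backwards through search results for tweets inside the gap.

    Returns whatever the search index gives back for IDs strictly between
    the two bounds; it cannot tell how many tweets the index is missing.
    """
    recovered: List[Dict] = []
    max_id = first_id_after_gap - 1  # max_id is inclusive, so start just below the bound
    while max_id > last_id_before_gap:
        page = fetch_page(last_id_before_gap, max_id)
        if not page:
            break  # the index has nothing older left in this range
        recovered.extend(page)
        # Continue just below the oldest tweet seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
    return recovered
```

An empty page here only means the search index has nothing more for that range, not that nothing else was tweeted, which is exactly the uncertainty this question is about.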

My problem is that I cannot be certain I am recovering all of the missing data, since the REST API documentation does not guarantee it. It also gives no measure of how many tweets I failed to retrieve, unlike the {limit:"45"} messages you sometimes get with the Streaming API.

The documentation says I can retrieve tweets that are 6-9 days old, which means that a 7-day-old tweet might not be retrieved. At the same time, I cannot be certain that a 6-day-old tweet will be retrieved, as the cutoff seems to be gradual rather than a hard limit. What is certain is that once a tweet is 10 days old, it is lost to me forever.

What is the maximum age, in days, at which I can be certain of retrieving a tweet with a REST call?

Also, if you have a better method for recovering data lost to Streaming disconnections, please mention that instead. PLEASE DO NOT suggest paid services; I want to know the most efficient way to do this using the Public API!

If recovering the lost data is not possible, or recovery cannot be guaranteed based on a tweet’s age, it would be enough for me to know exactly how much data I lost in a specific time frame. Is there any way to find out how many tweets from the last X days match a certain query (such as a bunch of hashtags), even if I don’t actually retrieve those tweets?

Many thanks in advance!

PS: I could not find any of this information in any online documentation or forum, so this post is a last resort.


#2

Unfortunately I don’t think there’s a way to be any more specific than we have been in the documentation regarding the search limit - but in my experience you’re usually able to retrieve up to 7 days back. Beyond that, you might not find the Tweet. Additionally, note that not all Tweets are indexed for the Search API anyway - if an account is particularly new then Tweets may not appear in search right away (the docs do say that the index is not complete).

Your second request is not something that is possible in the public API. I think there are parts of the GNIP API that would give you information about how many Tweets match a set of terms without returning the Tweets themselves, but that’s obviously not something that you’re using. I don’t know how you would do this using the public API.


#3

There would be a way to be more specific: give a single number instead of a range (6-9 days), or state what determines whether the limit is 6 or 9 days in a given case. I understand if you don’t have access to that information, or if the criteria for saving/retrieving data are too cryptic or simply secret. I have also found that it’s usually possible to retrieve up to 7 days back, but “usually” is not accurate enough for scientific purposes.

Nevertheless, I’ll use your answer to justify treating 6 days as a hard limit on the age of a recoverable tweet. I’ll also run tweet recovery at intervals shorter than 6 days, just to be safe - maybe something like every 2 days.
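
As a rough sketch of that plan: the 6-day window and the 2-day interval below are assumptions taken from this thread, not documented guarantees, and the function names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumptions from this thread, not guarantees made by the Search API docs.
SAFE_WINDOW = timedelta(days=6)        # oldest tweet age treated as recoverable
RECOVERY_INTERVAL = timedelta(days=2)  # how often a recovery pass is run


def recovery_deadline(gap_start: datetime) -> datetime:
    """Latest time by which tweets from the start of a gap should be searched for."""
    return gap_start + SAFE_WINDOW


def is_within_safe_window(gap_start: datetime, now: datetime) -> bool:
    """True while the oldest tweet in the gap is no more than SAFE_WINDOW old."""
    return now - gap_start <= SAFE_WINDOW


def next_recovery_run(last_run: datetime) -> datetime:
    """Time of the next scheduled recovery pass."""
    return last_run + RECOVERY_INTERVAL


if __name__ == "__main__":
    # Example: a gap that started 3 days ago is still inside the window.
    now = datetime.now(timezone.utc)
    gap_start = now - timedelta(days=3)
    print(is_within_safe_window(gap_start, now), recovery_deadline(gap_start))
```

With a 2-day interval, any gap gets at least two recovery passes before it reaches the assumed 6-day limit, so a single failed pass does not by itself push the gap past the window.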