How to avoid losing tweets after reconnecting to the stream?


#1

I’m trying to find a solution to the following problem, which I will describe in detail.

I have to process mentions for one Twitter account. Every message is a vote, and I must process all of them without losing data (within a specific time interval).
According to this paragraph: https://dev.twitter.com/docs/streaming-apis/processing#Message_ordering
"messages are not delivered in sorted order"

I have a connection (User Streams) that started receiving the following tweet IDs:
"x10"
"x15"
"x11"
"x12"

Let’s suppose that we lose the connection after processing "x15".
After reconnecting, we continue to receive only new tweets, like this:
"x21"
"x25"
"x26"

So the problem is: how do I read ALL unprocessed tweets from the interval when the connection was lost?
Yes, I should use the statuses/mentions_timeline method, and I have to provide since_id and max_id.
How can I define those values correctly?
In the sample above, since_id="x15" and max_id="x21" (the first tweet from the newly reconnected stream),
but that is not correct: with since_id="x15" we lose tweets "x11" and "x12",
and with max_id="x21" we lose tweets "x22", "x23", and "x24".
Yes, I can take since_id="x15"-100 and max_id="x21"+100, but in that case I have to store tweets and check whether they have already been processed.
I would like to avoid storing processed tweet IDs.
Is there any other solution to this problem?
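
To make the trade-off concrete, here is a rough sketch of the widened-window idea with a check for already processed tweets. It is only an illustration under assumptions: small integers stand in for "x10", "x15", and so on; `RecentIds`, `backfill_bounds`, and `margin` are hypothetical names; and `rest_page` is a hard-coded stand-in for whatever statuses/mentions_timeline would return, so no real API call is made. It keeps a bounded number of recent IDs rather than all of them, which does not remove the storage entirely but caps it.

```python
from collections import deque


class RecentIds:
    """Bounded memory of processed tweet IDs (hypothetical helper)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._order = deque()
        self._seen = set()

    def add(self, tweet_id):
        """Record tweet_id; return True if it is new, False if already processed."""
        if tweet_id in self._seen:
            return False
        self._seen.add(tweet_id)
        self._order.append(tweet_id)
        if len(self._order) > self._capacity:
            # Forget the oldest remembered ID to keep memory bounded.
            self._seen.discard(self._order.popleft())
        return True


def backfill_bounds(last_id_before_drop, first_id_after_reconnect, margin):
    """Widen the REST window on both sides of the gap.

    since_id is exclusive and max_id is inclusive in the REST timeline methods.
    """
    since_id = last_id_before_drop - margin
    max_id = first_id_after_reconnect + margin
    return since_id, max_id


# Simulated run: 10 and 15 were processed before the connection dropped.
seen = RecentIds(capacity=1000)
for tweet_id in (10, 15):
    seen.add(tweet_id)
seen.add(21)  # first tweet delivered after reconnecting

since_id, max_id = backfill_bounds(15, 21, margin=10)

# Stand-in for a page returned by statuses/mentions_timeline (newest first).
rest_page = [24, 23, 22, 21, 15, 12, 11, 10]

# Keep only tweets inside the window that have not been processed yet.
recovered = [t for t in rest_page if since_id < t <= max_id and seen.add(t)]
```

In this simulated run `recovered` contains 24, 23, 22, 12, and 11, and a later out-of-order delivery of 22 on the stream would be rejected by `seen.add(22)`. The capacity and margin values are arbitrary here and would have to be chosen from the real tweet rate.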


#2

Keeping a stream connected while minimizing disconnect time is one of the big challenges of working with the streaming API. If you absolutely need to get every Tweet which was sent, I suggest checking out Twitter data reseller partners such as Gnip or Datasift, which sell access to historical archives of Tweet data.


#3

XXXXXXXXXX
Keeping a stream connected while minimizing disconnect time is one of the big challenges of working with the streaming API.
XXXXXXXXXX
Yes, I understand that we must minimize disconnect time.
But what if we have to count votes at more than 100,000 tweets per minute? (Yes, this is the maximum I have to handle at the moment.)
At that rate, about 1,700 tweets arrive per second, so if we lose the connection for 1-2 seconds we can lose thousands of votes, because we do not have a good solution to this problem!

XXXXXXXXX
If you absolutely need to get every Tweet which was sent, I suggest checking out Twitter
data reseller partners such as Gnip or Datasift, which sell access to historical archives
of Tweet data.
XXXXXXXXX
And they can sell access to historical archives in real time? Sounds incredible :slight_smile:
OK, thanks for the response. I have to think about this.


#4

The data resellers can sell you a filtered stream that is based on the Firehose. For example, Gnip would sell you an @mention stream, which would be real-time. You would then use their historical data product to cover any disconnection periods you experienced.


#5

That’s great! Good to know about this.
Thanks!