Stream.twitter.com enhancements: implications and timeline


#1

Over April 1-2, we will be rolling out a change to our stream.twitter.com cluster, affecting our [node:6344].

On the backend, this change will give us greater stability and room to scale. However, after April 1-2 and for some time thereafter, if you have gzip compression enabled, you will notice that your connection requires approximately double the bandwidth for the same Tweet/message volume. (If you are not using gzip compression, you will not notice any difference.)

UPDATE: This WILL affect User & Site Streams (on userstream.twitter.com and sitestream.twitter.com, respectively), but for most implementations, the bandwidth difference will be negligible.


#2

UPDATE: We have had to defer completion of the rollout. Please expect the rollout to continue over the next few days at least.


#3

I don’t get how enabling gzip compression can increase bandwidth. When you introduced gzip, it reduced the data to approximately 1/5th of the uncompressed size (https://blog.twitter.com/2012/announcing-gzip-compression-streaming-apis).

Do we have to make changes in order to process the data correctly?


#4

Any updates on the rollout?


#5

Dirk: We’re not enabling gzip compression; we’re changing the implementation. If YOU don’t have gzip compression enabled, then you won’t notice any difference, but if you do, then you’ll notice that the effective overall compression ratio is less than what it previously was. The new gzip implementation still requires less bandwidth than uncompressed streams.


#6

As of about 14:00 PDT today (Monday, April 14), we have completed the rollout to the stream.twitter.com cluster.


#7

As of 15:26 EST yesterday (Monday, April 14), all my scripts based on AnyEvent::Twitter::Stream stopped working.

The JSON data arrives truncated and incomplete. Maybe a gzip-related issue? Anyway, I can dig into the modules and find out, but this will affect ALL of the AnyEvent::Twitter::Stream Perl implementations out there!


#8

Hi Arthur, I believe I can fix the problem if I better understand what was changed in the implementation. Could you please be so kind as to provide details on what changed?

Thanks!


#9

Since yesterday, many of my Twitter stream connections have not been functioning correctly, failing with an “Exceeded connection limit for user” error.

I did not have this error before! Sometimes the error appears immediately, sometimes it takes 5 or 10 minutes to happen… Can we have more details on this error?


#10

I have the same issue: all my incoming JSON streams are broken, and none of our apps are working! Please advise. Everything was working perfectly yesterday.


#11

Yeah, something is fishy. In particular, the ?delimited=length option seems not to work anymore; the lengths indicated for the following data seem corrupted.


#12

Correct, they DO NOT match. I also tried delimited=length, and the reported length does not match the actual data received. Twitter will surely notice there is a problem once enough people start reporting it.


#13

I already found a fix, and probably also the reason why Twitter is not catching the error. I’m ‘validating’ the data; that is, I’m expecting some Tweets to be incomplete, so I skip them. For that, I check the length reported when I use &delimited=length and compare it to the length of the body received. If they match, I consider the data valid; if not, I simply skip it and do not attempt to JSON decode it.
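
For illustration, the check amounts to something like this (a Perl sketch; handle_delimited_message and the variable names are illustrative, and $body is assumed to hold the raw bytes that followed the length line):

    use strict;
    use warnings;
    use JSON::PP qw(decode_json);

    # Skip any message whose body does not match the length advertised
    # by delimited=length, and only decode the ones that do.
    sub handle_delimited_message {
        my ($reported_length, $body) = @_;

        if (length($body) != $reported_length) {
            warn "skipping truncated message: expected $reported_length bytes, "
               . "got " . length($body) . "\n";
            return;
        }

        my $tweet = decode_json($body);
        # ... process $tweet ...
        return $tweet;
    }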

I assume that Twitter is not catching the error because they are using a library that does this validation automatically… Obviously, this does NOT solve the problem; it simply shows there IS an issue Twitter is not catching. The fact is that the stream now includes a high number of truncated messages, and this is breaking some scripts and/or programs.


#14

What is the alternative to using delimited=length? Just take it out completely?


#15

If you take it out, then the ONLY way to know whether the data is valid is to attempt to JSON decode it (which is expensive), and you’ll also need to catch any exceptions if decoding fails so your script keeps running (e.g. using eval and checking the error in $@).
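
Something along these lines (a Perl sketch; try_decode is an illustrative name, and $line is assumed to hold one newline-delimited message from the stream):

    use strict;
    use warnings;
    use JSON::PP qw(decode_json);

    # Attempt the decode inside eval so a truncated or malformed message
    # doesn't kill the script; $@ holds the error if decoding failed.
    sub try_decode {
        my ($line) = @_;
        my $tweet = eval { decode_json($line) };
        if ($@) {
            warn "skipping undecodable message: $@";
            return;
        }
        return $tweet;    # hashref, ready to process
    }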


#16

OK, so how can we tell how long each JSON object is if we can’t rely on the specified length? Attempt to decode after every byte? Slightly confused…

To add my own symptoms to the pot: we use the Phirehose collector, which looks at the status length to decide how many bytes to read into the buffer. Of course this is now kaput, and our PHP script is overrunning memory…

Edit: the newer version of Phirehose seems to have different code for checking lengths, and at least our collector now stays up for longer than a minute! Will post back later if this has fixed the issue…


#17

Hi all,

I’m an engineer on the streaming API team. For the folks that are saying that delimited=length is no longer working, can you provide a bit more information about what you are seeing?

It’s important to understand that the delimited=length value is the length of the status data that follows. It may very well not correspond to how much data your client read in the previous read() call. If you have read fewer than that many bytes, you need to buffer until you have read them all and only then parse, before reading the next status length and continuing.
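
To illustrate (just a client-side sketch, not tied to any particular library; $fh is assumed to be the already-decompressed HTTP response body, and read_next_status is an illustrative name):

    use strict;
    use warnings;
    use JSON::PP qw(decode_json);

    # Read one length-prefixed status: first the length line, then keep
    # reading until that many bytes have arrived, however the transport
    # happened to split them.
    sub read_next_status {
        my ($fh) = @_;

        my $length_line;
        do {
            $length_line = <$fh>;
            return unless defined $length_line;    # stream closed
            $length_line =~ s/\r?\n$//;
        } while ($length_line eq '');              # skip blank keep-alive lines

        my ($want) = $length_line =~ /^(\d+)/;
        return unless defined $want;

        my $buf = '';
        while (length($buf) < $want) {
            my $read = read($fh, $buf, $want - length($buf), length($buf));
            return unless $read;                   # EOF or error mid-status
        }

        return decode_json($buf);
    }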

If you are seeing errors, can you confirm that after doing the above you are still seeing corrupt data?

Thanks,
Matt


#18

Hi Matt,

delimited=length does work. The issue is that the stream changed the way it delivers data. Before, you would receive all the data for a single Tweet at once; now it is necessary to buffer the data until the line break is found. I was able to fix the problem myself because I modified the module I use (I implemented the buffering myself), but you will get tons of reports from people hitting errors, because you have effectively changed the way the streaming API sends data. To me this is a breaking change, and since it is so recent you won’t see many people complaining yet.

My point is that before it was NOT necessary to buffer data, as each read contained a full Tweet; now a read may contain only a chunk, so libraries ‘break’, thinking the JSON is corrupt when in reality it is just a portion of the data.
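
For illustration, a minimal version of that buffering might look like this (a sketch; feed_chunk and $on_message are illustrative names):

    use strict;
    use warnings;
    use JSON::PP qw(decode_json);

    my $buffer = '';

    # Append whatever the transport handed us, then emit only complete
    # \r\n-terminated messages; anything after the last terminator stays
    # in the buffer until the next chunk arrives.
    sub feed_chunk {
        my ($chunk, $on_message) = @_;
        $buffer .= $chunk;

        while ($buffer =~ s/^(.*?)\r\n//s) {
            my $line = $1;
            next if $line eq '';                 # keep-alive newline
            my $msg = eval { decode_json($line) };
            $on_message->($msg) if $msg;
        }
    }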


#19

Yeah, confirmed: the fishiness was in my code. Whereas earlier it was more or less guaranteed that the delimited=length number of bytes arrived in a single network packet, that isn’t the case anymore.


#20

I’m sorry for the inconvenience this change has caused you. However, it’s important to realize that delimited=length exists precisely because there is no single solution that avoids buffering of stream data; something in the client has to buffer. Stream data is HTTP chunk-encoded and ultimately sent over TCP, and at each of these layers the data may be broken into pieces and sent at different times (at standard Internet MTU, nearly every update will span multiple TCP packets).

On the client side, the TCP packets are read back into the client, and it is up to the client HTTP library whether it buffers up to the HTTP chunk-encoding boundary before delivering the data to your application. Many libraries, such as libcurl, do not buffer and instead deliver data as it is received. As such, even if we continued to deliver all status updates in single HTTP chunks (which the API never promised), we are at the whim of the application/library in terms of how the data buffering is handled.

It is true that we made a change, and it is now unlikely that individual status updates will arrive in single HTTP chunks. But as explained above, it has always been essential that applications handle this possibility, since the behavior depends on the client libraries being used. Buffering at the application level is the correct solution here. If there are popular Twitter libraries that are doing this incorrectly, please let us know and we will try to reach out to them and help them fix the issue.
It is true that we made a change and now it is unlikely that individual status updates will be in single HTTP chunks, however as explained above it is still essential that applications deal with this possibility since the behavior is dependent on the client libraries being used. Buffering at the application level is the correct solution here. If there are popular Twitter libraries that are doing this incorrectly please let us know and we will try to reach out to them and help them fix this issue.