Encoding


#1

Summary: is the Streaming API inconsistent in its encodings, delivering everything in UTF-8 but places in latin-1?

A few weeks ago, I started a project for which I need the Streaming API of Twitter. I searched for a good Python wrapper and am currently using https://github.com/sixohsix/twitter

I’m using the Streaming API via these Twitter Tools, and I noticed that after a fairly random amount of time (sometimes ten minutes, sometimes forty), my program stopped receiving tweets from the iterator. I did a little investigating to find out why, and I think it is because of the encoding used.

The code tries to UTF-8-decode the buffer, which should be fine, as the text of tweets is in UTF-8 according to Twitter’s API documentation. But I ran into a field, ‘places’, which contained ‘attributes’, which in turn had a ‘street_name’. That’s where the UTF-8 decoding broke: the value appeared to be encoded in latin-1.

A VERY dirty fix which appears to work, even though I have no idea why, is replacing the ‘utf-8’ decoding with ‘latin-1’ decoding. This parses most characters well; even more exotic ones like a heart (U+2665) or a smiley (U+1F60C) appear correctly, but there are also leftover HTML entities like &lt;3 here and there.
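A slightly less dirty variant of the same idea would be to try UTF-8 first and fall back to latin-1 only when decoding fails, so valid UTF-8 text is not mangled. Just a sketch, not the library’s actual code (decode_chunk is a made-up name for wherever the buffer gets decoded):

def decode_chunk(raw_bytes):
    """Decode a stream chunk, preferring UTF-8 over latin-1."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this never
        # raises -- but it would mangle genuine multibyte UTF-8 text,
        # which is why it is only a fallback.
        return raw_bytes.decode('latin-1')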

I don’t know where the error is: am I doing something wrong, or is the Streaming API inconsistent in its encoding?


#2

I wonder if it’s not the encoding, but instead whether the library you are using is handling the socket timeout incorrectly. I had a similar problem with my code, and that turned out to be the issue for me.


#3

Could you be a bit more specific? Do you mean the handling that is described here?
https://dev.twitter.com/docs/streaming-apis/connecting#Reconnecting

What would the issue be, then? I would like a bit more of an explanation.


#4

The streaming API is unfortunately inconsistent in its encodings, but should always be UTF-8.

There have been some cases of place data being delivered as raw, unescaped UTF-8 characters. For example:

{"street_address":"37–47 Brighton Road"}

This contains a multibyte UTF-8 character (the en dash, U+2013), which is represented in UTF-8 by the bytes E2 80 93. We used to escape this as \u2013, but the raw value is perfectly valid JSON and your parser should be able to handle it.
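You can verify that both forms parse to the same string with Python’s built-in json module (just an illustration, independent of any wrapper library):

import json

# The same address, once with the \u escape and once with the raw
# UTF-8 bytes for the en dash (E2 80 93).
escaped = '{"street_address": "37\\u201347 Brighton Road"}'
raw = b'{"street_address": "37\xe2\x80\x9347 Brighton Road"}'.decode('utf-8')

assert json.loads(escaped) == json.loads(raw)  # identical after parsing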

It may be that your JSON parser is not handling the 3-byte UTF-8 character correctly. If you don’t think that’s the case, could you give me some examples (raw bytes would be the most useful) of latin-1 encoded place data and I’ll look into this further.
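One way to capture those raw bytes is to log the offending chunk before it is thrown away. A minimal sketch, assuming you can hook in wherever the library decodes the stream buffer (safe_decode is a hypothetical helper):

import logging

def safe_decode(raw_bytes):
    """Decode a stream chunk, logging the raw bytes when UTF-8 fails."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError as exc:
        # repr() keeps the offending bytes printable, e.g. 'P\xc7A.'
        logging.error('undecodable chunk: %r (%s)', raw_bytes, exc)
        raise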


#5

Yes, your link is what I was referring to. But maybe it’s a red herring, and Arne’s response is more relevant than mine.

The issue would appear for me when a “stall” occurred: I needed to disconnect and reconnect if 90 seconds passed without receiving any data.
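In case it helps, the detection itself can be as simple as a socket timeout set a bit above Twitter’s keep-alive interval. A rough sketch of what I mean, with the connection setup left out (open_stream stands in for whatever returns a connected socket):

import socket

def read_stream(open_stream):
    """Yield stream chunks, reconnecting after a 90-second stall."""
    while True:
        sock = open_stream()
        sock.settimeout(90)  # treat 90s of silence as a stall
        try:
            while True:
                chunk = sock.recv(4096)
                if not chunk:  # server closed the connection
                    break
                yield chunk
        except socket.timeout:
            pass  # stalled: fall through and reconnect
        finally:
            sock.close()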


#6

The JSON parser these tools use is just the one from Python itself; I am using Python 2.7.
I couldn’t say whether or not that handles the 3-byte UTF-8 character correctly. I would suppose it does, but I’m not very familiar with encoding issues.

I have got some examples for you; these are bytes that broke the JSON parser. I looked up these particular tweets and found that they contained places with characters like ü.

A random one:
{"street_address":"3061\x133067OlinAve"}

And this check-in: http://tinyurl.com/dy3khp8
{"street_address":"P\xc7A.FLORIANO,55"}
Note the Ç: that’s what raised a ValueError: ‘utf8’ codec can’t decode byte 0xc7: invalid continuation byte. This 0xc7 is the Ç in latin-1; on its own it is invalid in UTF-8.
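The difference is easy to reproduce in isolation with just those bytes (independent of the streaming library):

raw = b'P\xc7A.FLORIANO,55'

print(raw.decode('latin-1'))  # PÇA.FLORIANO,55 -- 0xC7 is 'Ç' in latin-1

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0xC7 opens a 2-byte UTF-8 sequence, but the following 'A' (0x41)
    # is not a continuation byte, hence "invalid continuation byte".
    print(exc)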