Summary: is the Streaming API inconsistent in its encodings, with everything in UTF-8 except places, which are in latin-1?
A few weeks ago, I started a project that needs Twitter’s Streaming API. I searched for a good Python wrapper and am currently using https://github.com/sixohsix/twitter
I’m using the Streaming API via these Twitter Tools, and I noticed that after a fairly random time (sometimes ten minutes, sometimes forty) my program stops receiving tweets from the iterator. I did a little investigating to find out why, and I think it is because of the encoding used.
The code tries to decode the buffer as UTF-8, which should be fine, as the text of tweets is in UTF-8 according to Twitter’s API. But I ran into a field, ‘places’, which contained ‘attributes’, which in turn had a ‘street_name’. That’s where the UTF-8 decoding broke: the value appeared to be encoded in latin-1.
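To make the failure concrete, here is a minimal sketch of what I think is happening. The street name is made up, not taken from a real tweet, and `try_utf8` is just a helper I wrote for this illustration, not part of the Twitter Tools.

```python
# Hypothetical street name (an assumption, not real API output).
street = "Rue de l'Église"

utf8_bytes = street.encode("utf-8")      # what a UTF-8 stream would carry
latin1_bytes = street.encode("latin-1")  # what I suspect 'street_name' carried


def try_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


ok = try_utf8(utf8_bytes)     # decodes back to the original string
bad = try_utf8(latin1_bytes)  # None: the latin-1 byte for 'É' is invalid UTF-8 here
```

If the wrapper does not catch the `UnicodeDecodeError`, that might explain why the iterator silently stops yielding tweets.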
A VERY dirty fix which appears to work, even though I have no idea why, is replacing the ‘utf-8’ decoding with ‘latin-1’ decoding. This parses most of the characters well; even more exotic ones like a heart (U+2665) or a smiley (U+1F60C) appear correctly, but there’s also a bit of &lt;3 and such.
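One possible explanation for why the dirty fix seems to work, sketched below under an assumption I have not verified: that the stream sends non-ASCII characters as JSON `\uXXXX` escapes, so the bytes on the wire are mostly plain ASCII. The payload here is hypothetical and built with `json.dumps`, not captured from the API.

```python
import html
import json

# Hypothetical payload. json.dumps escapes non-ASCII as \uXXXX by default,
# so the resulting bytes are pure ASCII (assumed to resemble the stream).
original = "I \u2665 this \U0001F60C &lt;3"
wire = json.dumps({"text": original}).encode("ascii")

# latin-1 maps every byte 0x00-0xFF to a code point, so decoding never raises.
all_bytes_ok = len(bytes(range(256)).decode("latin-1")) == 256

# The \uXXXX escapes survive any ASCII-compatible decoding; the JSON parser
# turns them back into the real characters afterwards.
text = json.loads(wire.decode("latin-1"))["text"]

# Raw (unescaped) UTF-8 bytes would be silently mangled instead of failing.
mangled = "\u2665".encode("utf-8").decode("latin-1")

# Entities such as &lt; are HTML escapes, unrelated to the byte encoding.
readable = html.unescape(text)
```

So latin-1 never fails, but it would quietly garble any text that arrives as raw UTF-8 bytes rather than as escapes. The `&lt;3` looks like HTML escaping of `<3` rather than an encoding problem, and `html.unescape` turns it back.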
I don’t know where the error is: am I doing something wrong, or is the Streaming API inconsistent with its encoding?