Tweet Text HTML Encoding Clarity

rest
api

#1

Hi,

I’ve been using the API for some time now and noticed that at some point the Tweet Text in the API started being HTML encoded. I remember specifically ensuring that UI elements using this data displayed correctly and did not fall prey to any JavaScript inside a tweet. However, this is no longer the case, and our data is showing incorrectly because the data we receive from the API is no longer the true text.

This is not documented anywhere that I can find. The recent changes page (https://dev.twitter.com/ads/overview/recent-changes) does not mention a change, and the tweet documentation (https://dev.twitter.com/overview/api/tweets) mentions nothing beyond UTF-8 encoding; it actually says text contains the “actual text”. Testing with the API console (https://dev.twitter.com/rest/tools/console) shows the text in the response as not HTML encoded, but this is misleading: the HTTP response to the tool looks to be encoded, and the tool does not represent that in the JSON object. The actual response coming back from the API using another tool definitely shows the HTTP response as encoded, i.e. text=&lt;script&gt;alert("here be dragons");&lt;/script&gt;.

I’d like to resolve the problem, but without documentation I am not certain we can simply decode the resulting text to get the correct text for the tweet. Is it possible to get the raw text by passing an undocumented flag? If not, what is the expected process to get the actual text of the tweet, without it having been processed for display in HTML?

Thanks,
Dan Saltmer


#2

You’re right - this is strange - I just tried tweet id 743080790070833153 with cURL. The tweet has the text:

@dansaltmer <script>alert("here be dragons);</script> 222222

I’d expect to see:
"text":"@dansaltmer <script>alert(\"here be dragons);</script> 222222"

in the JSON, but instead it’s:

"text":"@dansaltmer &lt;script&gt;alert(\"here be dragons);&lt;\/script&gt; 222222"


#3

Wait - it’s not strange at all - I found some old 2013 tweets that would have been written straight to a file, presumably preserving the way the API returned them: <, >, & are indeed HTML encoded!

I guess I never noticed that because of the way I’d pre-process the text. I always assumed no HTML encoding - but it looks like the API has been returning HTML encoded text for quite some time!

Eg: I had this old tweet retrieved from the API in 2013: https://twitter.com/heachamweather/status/297132225462411265

The JSON field was "text":"Heacham Weather:Temp=4.8C &amp; is Falling.Low=4.8C &amp; high=4.8C.Pressure=1006.9mb &amp; is Falling.Wind=4.5mph SSW &amp; gust 9.2mph.Rain today=3.3mm."

Exactly the same as it is now. It should be fine if you just HTML Decode the text.
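A minimal sketch of that decode step in Python, using only the standard library. The `api_text` value is a shortened copy of the field quoted above, and the variable names are just for illustration. Note that `html.unescape` decodes every HTML character reference, not only `&amp;`, `&lt;` and `&gt;`, so this assumes no other entity-like sequences need to be preserved as-is.

```python
import html

# Shortened copy of the "text" field from the 2013 tweet quoted above.
api_text = "Heacham Weather:Temp=4.8C &amp; is Falling.Low=4.8C &amp; high=4.8C."

# Decode the HTML entities to get back the text as the user typed it.
display_text = html.unescape(api_text)

print(display_text)
```

If the decoded text is later placed into an HTML page, it needs to be escaped again at that point, otherwise any markup in the tweet becomes live.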


#4

Thanks for looking that far back. I was certain that at some point around then I had been dealing with un-encoded tweets, but the bug has only just been noticed in our software. I even recall TweetDeck having an issue where they allowed JavaScript to be executed.

While I am happy to incorporate an HTML decode if that is the intended scenario, it feels wrong to me, as I’d like the unaltered text of the tweet. This is transforming it and then transforming it back again, which is not guaranteed to produce the same text. But if it’s safe to do so without getting the text wrong, then that’s what is required.

If that is the intended usage, the API tool could do with being updated to stop hiding the issue, and the documentation could explain what the API is expected to return. This would really help people coming to the API documentation and tools in future.
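The round-trip concern can be checked with a small sketch. It assumes the API replaces only `&`, `<` and `>` (as the samples in this thread suggest, but which is not confirmed by any documentation). `encode_like_api` and `round_trips` are hypothetical helpers written for this test, not part of any Twitter library.

```python
import html


def encode_like_api(text: str) -> str:
    # Assumed behaviour: only &, < and > are replaced, quotes are left alone.
    return html.escape(text, quote=False)


def round_trips(text: str) -> bool:
    # True if decoding the encoded form gives back exactly the original text.
    return html.unescape(encode_like_api(text)) == text


samples = [
    '<script>alert("here be dragons);</script>',
    "Temp=4.8C & is Falling",
    "a user who literally typed &lt; or &amp;",
    "5 > 3 & 2 < 4",
]

for sample in samples:
    print(round_trips(sample), repr(sample))
```

Under that assumption every `&` in the original becomes `&amp;`, so a single decode pass recovers the original text, including tweets where the user typed something that looks like an entity. If the API encodes differently from the assumption, this does not hold.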


#5

I encountered the problem of HTML encoding using the API recently. I was looking at the results using the Twitter Development Console as well as Twurl. The Console shows no encoding (inspecting with browser tools), however Twurl shows HTML encoded values. This inconsistency can cause confusion. It would be helpful to have the dev tools consistent on encoding.
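One way to check what the API returns without relying on either tool is to parse the raw response body directly. A minimal sketch in Python, where `raw_body` is a hypothetical fragment built from the `text` value quoted earlier in this thread rather than a live API response:

```python
import json

# Hypothetical raw body, wrapping the "text" field quoted earlier in braces.
raw_body = r'{"text":"@dansaltmer &lt;script&gt;alert(\"here be dragons);&lt;\/script&gt; 222222"}'

tweet = json.loads(raw_body)

# JSON parsing removes the JSON escapes (\" and \/) but leaves HTML entities untouched.
print(tweet["text"])
```

The entities survive JSON parsing, which suggests a tool showing un-encoded text is decoding or rendering them somewhere after the response arrives.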