HTML entities in Tweet text


I discovered today that when the REST API returns a ’ (&rsquo;) HTML entity, it doesn’t encode it at all and returns the literal “’” character.

Is there an exact list I can refer to of which HTML entities are encoded and which aren’t, and which encoding is used for each? We have a feature that does pattern matching, and being uncertain of the encodings, or getting them wrong, is causing production issues in our system.

I found this page on the web for reference, showing that certain entities have multiple possible encodings. I’d like to know which you’re using, which entities you’re encoding, and how.

Currently, we are using:

org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 and escapeHtml4

It seems like maybe we should switch to
org.apache.commons.lang3.StringEscapeUtils.unescapeJson and escapeJson
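
One way to avoid over-decoding is to unescape only an explicit set of entities instead of the full HTML4 table. The sketch below assumes, purely for illustration, that only &amp;amp;, &amp;lt; and &amp;gt; are entity-encoded in the text. That set is an assumption to confirm by experiment, not something stated in this thread, and `MinimalEntityDecoder` is a hypothetical helper rather than part of Commons Lang or twitter-text.

```java
// Hypothetical helper: handles only an explicit, ASSUMED set of entities
// (&amp; &lt; &gt;). Everything else, such as a literal right single
// quotation mark, passes through untouched.
class MinimalEntityDecoder {

    // Decode &amp; last, so that "&amp;lt;" becomes "&lt;" rather than "<".
    static String decode(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    // Encode '&' first, so the ampersands introduced by the later
    // replacements are not encoded a second time.
    static String encode(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // \u2019 is the right single quotation mark; it has no entity here.
        String fromApi = "It\u2019s &lt;b&gt; &amp; more";
        System.out.println(decode(fromApi));
        System.out.println(encode(decode(fromApi)).equals(fromApi));
    }
}
```

Keeping the entity set explicit means the decoder can be adjusted one entry at a time as experiments show what the API actually returns.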


Still digging around for the answer for you here, but in the meantime, is the Java implementation of our twitter-text parsing code of any use to you? The file seems to be pertinent.


Hi Andy,

It’s an interesting library, but it doesn’t contain the info I need. I need to know which characters you do and don’t encode in the JSON representation of Tweet text, and how they are encoded.

I’m guessing you are either using a library of some kind or an ad-hoc mapping that is a slight variant for one of the known encoding schemes.




Hi @andypiper,

What I’d really like is either access to the code that populates the “tweetText” field or a
precise description of what’s encoded / what’s not. I’d like to be able to exactly reproduce
the encoding that’s going on.

Is that something that you can make available?




Unfortunately I don’t have that information or that code available. Broadly speaking we use Unicode Normalization Form C in Tweet text internally. I’m not sure we’re able to share code on this beyond the twitter-text library.
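
Since Normalization Form C is mentioned, a minimal sketch of what that implies for pattern matching: normalize both the pattern and the text to NFC before comparing, so a decomposed sequence still matches its precomposed form. This uses the JDK's `java.text.Normalizer`. The class and method names are hypothetical, and whether the API's output is always NFC is an assumption to verify.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of twitter-text or the Twitter API.
class NfcMatching {

    // Convert a string to Unicode Normalization Form C (composed form).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Literal substring search after normalizing both sides to NFC.
    static boolean containsNormalized(String text, String needle) {
        return toNfc(text).contains(toNfc(needle));
    }

    public static void main(String[] args) {
        String decomposed = "cafe\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String composed = "caf\u00e9";    // precomposed U+00E9

        System.out.println(decomposed.equals(composed));               // false
        System.out.println(containsNormalized(composed, decomposed));  // true
    }
}
```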


Hi @andypiper,

Thanks for the clarification. Between that info & experimenting, I’ll figure out a way to infer an exact mapping.