HTML entities in tweet text


#1

I discovered today that when tweet text contains a ’ character (the &rsquo; HTML entity), the REST API doesn’t entity-encode it and returns it as the literal “’” character.

Is there an exact list I can refer to of which HTML entities are encoded and which aren’t, and which encoding is used for each? We have a feature that does pattern matching, and being uncertain of the encodings, or getting them wrong, is causing production issues in our system.

I found this page on the web for reference, showing that certain entities have multiple possible encodings. I’d like to know which one you’re using, which entities you’re encoding, and how.

https://dev.w3.org/html5/html-author/charref

Currently, we are using:

org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 and escapeHtml4

It seems like maybe we should switch to
org.apache.commons.lang3.StringEscapeUtils.unescapeJson and escapeJson
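
To make the question concrete, here is a minimal sketch of what a restricted mapping could look like in plain Java. It assumes, purely for illustration, that only &amp;, &lt; and &gt; are entity-encoded and everything else (including ’) is passed through literally. Nothing in this thread confirms that set; the class name TweetTextCodec and the mapping itself are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: NOT the API's actual implementation.
final class TweetTextCodec {
    // ASSUMPTION: only these three entities are ever encoded in tweet text.
    private static final Pattern ASSUMED_ENTITY = Pattern.compile("&(amp|lt|gt);");

    // Encode only the assumed set; every other character is left as-is.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Decode in a single pass so that "&amp;lt;" becomes "&lt;", not "<".
    static String decode(String text) {
        Matcher m = ASSUMED_ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement;
            switch (m.group(1)) {
                case "amp": replacement = "&"; break;
                case "lt":  replacement = "<"; break;
                default:    replacement = ">"; break; // "gt"
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Tom &amp; Jerry\u2019s &lt;3";
        System.out.println(decode(raw));          // entities decoded, ’ untouched
        System.out.println(encode(decode(raw)));  // round-trips to the raw form
    }
}
```

The point of the sketch is the round trip: with a restricted set, encode(decode(x)) reproduces text in which ’ stays literal, whereas a full HTML 4 escaper would, as far as I know, turn ’ into &rsquo; and no longer match what the API returned.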


#6

Still digging around for the answer for you here, but in the meantime, is the Java implementation of our twitter-text parsing code of any use to you? The Regex.java file seems to be pertinent.


#7

Hi Andy,

It’s an interesting library, but it doesn’t contain the info I need. I need to know which characters you do and don’t encode in the JSON representation of tweet text, and how they are encoded.

I’m guessing you are either using a library of some kind or an ad-hoc mapping that is a slight variant of one of the known encoding schemes.

Thanks,

Chris


#8

Hi @andypiper,

What I’d really like is either access to the code that populates the “tweetText” field or a
precise description of what’s encoded / what’s not. I’d like to be able to exactly reproduce
the encoding that’s going on.

Is that something that you can make available?

Thanks,

Chris


#9

Unfortunately I don’t have that information or that code available. Broadly speaking we use Unicode Normalization Form C in Tweet text internally. I’m not sure we’re able to share code on this beyond the twitter-text library.
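
For anyone comparing strings on the client side, Normalization Form C can be applied with the JDK's java.text.Normalizer. The sketch below only shows what NFC does to a string; the class name NfcExample is made up, and how or where normalization is applied internally is not something this thread specifies. Note that NFC concerns composed versus decomposed characters and says nothing about HTML entity escaping.

```java
import java.text.Normalizer;

// Illustrative only: shows NFC on the client side, not the server's pipeline.
final class NfcExample {
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "cafe\u0301"; // 'e' followed by a combining acute accent
        String composed = "caf\u00e9";    // precomposed 'é'

        System.out.println(decomposed.equals(composed));        // false
        System.out.println(toNfc(decomposed).equals(composed)); // true

        // The right single quotation mark is already in NFC, so it is unchanged.
        System.out.println(toNfc("\u2019").equals("\u2019"));   // true
    }
}
```

Normalizing both the pattern and the tweet text to NFC before matching avoids mismatches between composed and decomposed forms of the same visible character.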


#10

Hi @andypiper,

Thanks for the clarification. Between that info & experimenting, I’ll figure out a way to infer an exact mapping.
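
One way the experimenting could go, as a rough sketch: collect a sample of tweet text as returned by the API, then tally which entity-like tokens appear and which non-ASCII characters arrive as literals. The class name EncodingSurvey and the sample strings are hypothetical, and the regular expression is a simple approximation of entity syntax rather than a full HTML parser.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical survey tool for inferring a mapping from observed samples.
final class EncodingSurvey {
    // Matches named (&amp;), decimal (&#8217;) and hex (&#x2019;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    // Count every entity-like token seen across the samples.
    static Map<String, Integer> countEntities(List<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            Matcher m = ENTITY.matcher(text);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Count non-ASCII code points that arrived as literal characters.
    static Map<String, Integer> countLiteralNonAscii(List<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            text.codePoints()
                .filter(cp -> cp > 0x7F)
                .forEach(cp -> counts.merge(String.format("U+%04X", cp), 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up samples standing in for text fetched from the API.
        List<String> samples = List.of("Tom &amp; Jerry\u2019s", "a &lt; b &amp; c");
        System.out.println(countEntities(samples));
        System.out.println(countLiteralNonAscii(samples));
    }
}
```

Characters that only ever show up in the second tally are candidates for "never entity-encoded", though a sample can only suggest a mapping, not prove it.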