I discovered today that when the REST API returns text containing a right single quotation mark (’), it doesn’t encode it as an HTML entity; it returns the literal “’” character.
Is there a definitive list of which HTML entities are encoded and which aren’t, and which encoding is used for each? We have a feature that does pattern matching on this text, and being unsure of the encodings, or guessing them wrong, is causing production issues in our system.
I found this page on the web for reference, showing that certain entities have multiple possible encodings. I’d like to know which entities you’re encoding and which form you use for each.
https://dev.w3.org/html5/html-author/charref
Currently, we are using:
org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 and escapeHtml4
It seems like maybe we should switch to:
org.apache.commons.lang3.StringEscapeUtils.unescapeJson and escapeJson
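In case it helps frame the question, here is a rough sketch of what we are considering in the meantime: normalizing every string to literal characters before pattern matching, so the match does not depend on whether a character arrives as a named entity, a numeric reference, or the raw character. This is only an assumption about how we might work around the issue, not a description of what the REST API does. The class name EntityNormalizer and its small entity table are hypothetical and use only the JDK.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes character references in a single pass so that
// pattern matching always runs against literal characters.
final class EntityNormalizer {
    // Deliberately small, illustrative table; extend it with the entities you expect.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'",
        "lsquo", "\u2018", "rsquo", "\u2019");

    // Matches &#xHHHH; (hex), &#DDDD; (decimal), and &name; references.
    private static final Pattern REF =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String normalize(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1));
            // Unknown or invalid references are left exactly as they were.
            String replacement = decoded != null ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognized.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
            return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : null;
        } catch (NumberFormatException e) {
            return null; // digits overflowed an int
        }
    }
}
```

With this approach, EntityNormalizer.normalize("it&rsquo;s"), normalize("it&#8217;s"), and normalize("it’s") would all produce the same string, so a single pattern could be matched against the result. Because the decoding is a single pass, a double-escaped value such as "&amp;rsquo;" is only unwrapped one level.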