Is it normal for the "text" field of a JSON response to contain \u-escaped Unicode text, or should it contain raw UTF-8?


#1

When I fetched my posts from https://api.twitter.com/1.1/statuses/user_timeline.json using https://dev.twitter.com/console, the console showed:

"text": "아무리 오픈 소스래지만…"

If the tweet also contains English text, it looks like this:

"text": "Hello, 아무리 오픈 소스래지만…"

However, when I used an open source library such as twitcurl, it returned:

"text": "Hello, \uc544\ubb34\ub9ac \uc624\ud508 \uc18c\uc2a4\ub798\uc9c0\ub9cc…"

So only the Korean text is \u-escaped. (The code point of 아 is U+C544, written \uC544 in JSON, and its UTF-8 encoding is the three bytes 0xEC 0x95 0x84.)
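
As a check on those byte values, here is a minimal C++ sketch of how a single code point maps to UTF-8 bytes. The function name encode_utf8 is made up for this example and does not come from twitcurl or any other library.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // ASCII range: one byte, identical to the ASCII value.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, not a character
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0xC544) yields the three bytes 0xEC 0x95 0x84, while encode_utf8('H') yields the single byte 'H'. This is why ASCII text looks the same in both representations.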

Is the API supposed to return \u-escaped text, or does https://dev.twitter.com/console simply replace the escapes with the actual Korean text? Do you convert the \u escapes into printable Korean text in the console's JavaScript code?
If \u-escaped text is normal, it seems strange that the "Hello," part is not escaped as well. (I know that UTF-8 uses the same byte values as ASCII for the ASCII range, but…)

I clicked the "Snapshot" button, which shows the raw text, and it did not contain \u escapes. It contained real Korean text.
However, when I opened the Web Console in Firefox and inspected the JSON response, I could see the \u-escaped text.

So, should we application developers convert the \u-escaped text before displaying it? (I use C++. I noticed that some JSON libraries for PHP and Python decode the \u escapes correctly for display, while leaving the English portion unchanged. It still seems odd that only part of the text is escaped.)

I'm trying to build JSON libraries for C/C++ so that they behave consistently, but I'm not sure whether I should turn on Unicode support.
If the text is meant to be displayed as "\u…", I would turn Unicode support off, because a \u escape is an ASCII-only representation of a Unicode character. On the other hand, if there is a decode() function that replaces the \u escapes with real Unicode text, I may need to turn Unicode support on.
I need the answers to the questions above to decide what to do.

Could someone please respond?

Thank you in advance.


#2

https://dev.twitter.com/console replaces each \uXXXX escape with the corresponding Unicode character for display.
That replacement is the job of the JSON parser, and you almost always want the parser to do it (except perhaps when debugging).
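
For a C++ application, the replacement step could look roughly like the sketch below. It only illustrates what a JSON parser does internally; a real JSON library already handles this for you. The names append_utf8, parse_hex4 and decode_unicode_escapes are invented for the example. It assumes the input is the body of a JSON string (without the surrounding quotes), leaves other escapes such as a backslash followed by n untouched, and substitutes U+FFFD for unpaired surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append a code point (not a surrogate) as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: read exactly four hex digits starting at s[pos].
static bool parse_hex4(const std::string& s, std::size_t pos, std::uint32_t& value) {
    if (pos + 4 > s.size()) return false;
    value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;
        value = (value << 4) | digit;
    }
    return true;
}

// Replace JSON Unicode escapes with UTF-8; everything else is copied as-is.
std::string decode_unicode_escapes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '\\' || i + 1 >= in.size()) {
            out += in[i++];                       // ordinary character
            continue;
        }
        if (in[i + 1] != 'u') {                   // some other escape: copy both chars
            out += in[i];
            out += in[i + 1];
            i += 2;
            continue;
        }
        std::uint32_t cp = 0;
        if (!parse_hex4(in, i + 2, cp)) {         // malformed escape: keep it literally
            out += in[i++];
            continue;
        }
        i += 6;
        if (cp >= 0xD800 && cp <= 0xDBFF) {       // high surrogate: needs a low one next
            std::uint32_t low = 0;
            if (i + 1 < in.size() && in[i] == '\\' && in[i + 1] == 'u' &&
                parse_hex4(in, i + 2, low) && low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 6;
            } else {
                cp = 0xFFFD;                      // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                          // unpaired low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Passing the escaped "text" value from the twitcurl response through decode_unicode_escapes gives UTF-8 bytes that can be displayed directly, and the "Hello, " prefix comes out unchanged because it contains no escapes.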

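Non-ASCII characters are sent as \uXXXX because the resulting JSON is pure ASCII, which works best in most cases: it passes intact through clients and tools that mishandle character encodings. ASCII characters such as "Hello," need no escaping, so they are left as they are. The downside is that the raw response is harder to read when you work with non-ASCII languages.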
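
To illustrate the sending side, here is a rough C++ sketch of how a serializer might produce such output. It is a guess at the general technique, not Twitter's actual implementation. The names append_escape and escape_non_ascii are invented, the input is assumed to be valid UTF-8, and the escaping of quotes and backslashes that a complete JSON writer also needs is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append one UTF-16 code unit as a six-character escape.
static void append_escape(std::string& out, std::uint32_t unit) {
    char buf[7];  // backslash, 'u', four hex digits, terminating NUL
    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(unit));
    out += buf;
}

// Copy ASCII bytes unchanged and escape every non-ASCII character.
// Assumes valid UTF-8 input; stops early on a truncated sequence.
std::string escape_non_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {                 // ASCII: passes through as-is
            out += utf8[i++];
            continue;
        }
        // Sequence length follows from the lead byte.
        const std::size_t len = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : 2;
        if (i + len > utf8.size()) break;
        // Payload bits of the lead byte, then six bits per continuation byte.
        std::uint32_t cp = lead & (0xFFu >> (len + 1));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        }
        i += len;
        if (cp >= 0x10000) {               // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            append_escape(out, 0xD800 + (cp >> 10));
            append_escape(out, 0xDC00 + (cp & 0x3FF));
        } else {
            append_escape(out, cp);
        }
    }
    return out;
}
```

Only bytes at or above 0x80 take the escaping branch, which is why "Hello," comes through untouched while the Korean characters turn into escapes.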