Twitter entity indices need more documentation (character or byte indices?)


#1

Relevant documentation link: https://dev.twitter.com/overview/api/entities#obj-usermention

All the entity fields (hashtag indices, user mention indices, URL indices) give character ranges that locate the corresponding entity within the tweet text. Unfortunately, the documentation does not say whether these indices count characters (Unicode code points) or bytes of the UTF-8-encoded text. The two differ as soon as the tweet contains non-ASCII characters, so it is hard to parse the tweet correctly. Can the documentation please be updated with an authoritative statement on whether these are byte or character indices? Please note, I’m already aware of “Counting Characters” as specified here: https://dev.twitter.com/overview/api/counting-characters.
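To illustrate why the distinction matters, here is a minimal sketch with a made-up tweet text (the string and the hashtag are assumptions for illustration, not taken from the API docs). It locates the same hashtag once by code point offset and once by UTF-8 byte offset:

```python
# Hypothetical tweet text: one non-ASCII character (U+00E9) precedes the hashtag.
text = "caf\u00e9 #python"

# Offset of the hashtag counted in Unicode code points.
start_codepoints = text.index("#python")

# Offset of the same hashtag counted in bytes of the UTF-8 encoding.
start_bytes = text.encode("utf-8").index(b"#python")

print(start_codepoints)  # 5
print(start_bytes)       # 6, because U+00E9 takes two bytes in UTF-8
```

A single accented character is enough to shift every later index by one, which is why the documentation needs to state which unit is used.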


#2

The relevant part seems to be:

"Tweet length is measured by the number of codepoints in the NFC normalized version of the text"

so I expect the entity indices/ranges are counted the same way, i.e. in code points of the NFC-normalized text.
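If that expectation holds, parsing could look roughly like the sketch below. This is only an assumption, not confirmed behavior: it treats the indices as code point offsets into the NFC-normalized text, with an exclusive end index. The helper names `tweet_length` and `slice_entity` and the sample text are made up for illustration.

```python
import unicodedata


def tweet_length(text):
    """Count code points in the NFC-normalized version of the text."""
    return len(unicodedata.normalize("NFC", text))


def slice_entity(text, start, end):
    """Extract an entity from the text.

    ASSUMPTION: start/end are code point offsets into the NFC-normalized
    text and end is exclusive. The documentation does not confirm this.
    """
    return unicodedata.normalize("NFC", text)[start:end]


# Hypothetical tweet text in decomposed form: 'e' followed by U+0301
# (combining acute accent), which NFC composes into the single code point U+00E9.
decomposed = "cafe\u0301 @jack"

print(len(decomposed))                   # 11 code points before normalization
print(tweet_length(decomposed))          # 10 code points after NFC
print(slice_entity(decomposed, 5, 10))   # @jack
```

Python strings are indexed by code point, so plain slicing works here once the text is normalized. In a language whose strings are indexed by UTF-16 code units or bytes, the offsets would need converting first.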