Relevant documentation link: https://dev.twitter.com/overview/api/entities#obj-usermention
All the entity fields (hashtag indexes, user mention indexes, URL indexes) specify character ranges that locate the corresponding entity within the tweet text. Unfortunately, the documentation does not say whether these indexes are UTF-8 character indexes or byte indexes, which makes it hard to parse the tweet correctly. Can the documentation please be updated with an authoritative statement on whether these are byte or UTF-8 character indexes? Please note, I’m already aware of “Counting Characters” as specified here: https://dev.twitter.com/overview/api/counting-characters.
The relevant part seems to be:
"Tweet length is measured by the number of codepoints in the NFC normalized version of the text"
So I expect they count the indexes/ranges the same way, in code points of the NFC-normalized text.
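
If that expectation holds, an entity can be extracted by normalizing the text to NFC and slicing by code point. The following is only a minimal Python sketch under that assumption, not confirmed API behavior. The helper names slice_entity and codepoint_to_byte_range and the sample tweet are made up for illustration.

```python
import unicodedata


def slice_entity(text, start, end):
    """Return the entity text for [start, end).

    ASSUMPTION: indices are code point offsets into the NFC-normalized text.
    """
    normalized = unicodedata.normalize("NFC", text)
    # Python str is indexed by code point, so a plain slice is enough.
    return normalized[start:end]


def codepoint_to_byte_range(text, start, end):
    """Convert a code point range into the matching UTF-8 byte range."""
    normalized = unicodedata.normalize("NFC", text)
    # The byte offset of a position is the encoded length of the prefix before it.
    byte_start = len(normalized[:start].encode("utf-8"))
    byte_end = len(normalized[:end].encode("utf-8"))
    return byte_start, byte_end


# Hypothetical tweet text and user mention indices, for illustration only.
tweet = "Caf\u00e9 with @jack"
mention_indices = (10, 15)

mention = slice_entity(tweet, *mention_indices)
byte_range = codepoint_to_byte_range(tweet, *mention_indices)
```

Here the code point range (10, 15) maps to the byte range (11, 16), because "é" occupies two bytes in UTF-8. This is why treating the indices as byte offsets would cut the mention in the wrong place. Note that languages whose strings are indexed by UTF-16 code units (Java, JavaScript) would need an extra conversion step for characters outside the Basic Multilingual Plane, such as most emoji.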