Unicode characters breaking Tweet Entity Indices


I have some PHP that outputs tweets from the 1.1 API's user_timeline, using each entity's start and end indices for string replacement to add in the links. However, when Unicode characters are present, the index values do not account for the full byte length of those characters, so the string replacement cuts Unicode characters off incorrectly.

What would be the best way to ensure that my string replacement for entities does not interfere with Unicode characters that occupy more than one “character” in the string?

Example: https://twitter.com/PresHernandez_/status/390187276195491840

My current output is below; note the cut-off Unicode character before the t.co link.

Another sneak peek��<a href="http://t.co/ZqyFEgJcKI" target="_blank">http://t.co/ZqyFEgJcKI</a>c<a href="http://www.twitter.com/WEtv" target="_blank">@WEtv</a>etv

To generate the replacement, I’m using the following -

foreach ($tweet_entities as $entity_replace) {
  // Build the anchor tag that replaces the entity text.
  $tweet_replace = '<a href="'.$entity_replace['replace_url'].'" target="_blank">'.$entity_replace['replace_text'].'</a>';
  // Entity length taken from the API indices.
  $tweet_replace_length = $entity_replace['end'] - $entity_replace['start'];
  // substr_replace() treats start and length as byte offsets.
  $tweet_text = substr_replace($tweet_text, $tweet_replace, $entity_replace['start'], $tweet_replace_length);
}


Hi, I have the same problem. It seems that Twitter applies Unicode normalization when counting characters. Emoji, for example, take more than one unit of storage (two UTF-16 code units in JavaScript, four bytes in UTF-8), but Twitter counts each one as a single character.
I think you can find a better description of the problem here:

Still, I can’t figure out how to apply this in JavaScript :(
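
For the PHP case in the question, one possible approach is to slice the text with multibyte-aware functions instead of substr_replace(). The following is only a minimal sketch. It assumes the entity indices count Unicode code points (which matches the behaviour described above), that the tweet text is UTF-8, and that the mbstring extension is available. The function name replace_entities is hypothetical; the array keys are the ones used in the question's snippet. It also walks the entities from last to first, so that inserting a link does not shift the offsets of the entities that come before it.

```php
<?php
// Hypothetical helper: replace entity spans using code point offsets.
// Assumes UTF-8 text and indices that count Unicode code points.
function replace_entities(string $text, array $entities): string
{
    // Process the last entity first so earlier offsets stay valid.
    usort($entities, function ($a, $b) {
        return $b['start'] <=> $a['start'];
    });

    foreach ($entities as $entity) {
        $link = '<a href="' . $entity['replace_url'] . '" target="_blank">'
              . $entity['replace_text'] . '</a>';

        // mb_substr() counts characters rather than bytes.
        $before = mb_substr($text, 0, $entity['start'], 'UTF-8');
        $after  = mb_substr($text, $entity['end'], null, 'UTF-8');

        $text = $before . $link . $after;
    }

    return $text;
}

// Illustrative input only (not the real tweet): one emoji before a URL.
$sample_text = "peek \u{1F440} http://t.co/ZqyFEgJcKI";
$sample_entities = [[
    'start'        => 7,
    'end'          => 29,
    'replace_url'  => 'http://t.co/ZqyFEgJcKI',
    'replace_text' => 'http://t.co/ZqyFEgJcKI',
]];
$sample_html = replace_entities($sample_text, $sample_entities);
```

In JavaScript the same idea should apply, except that strings are indexed by UTF-16 code units, so the code point indices would need to be converted before slicing.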