Struggling with Hashtag entities

api

#1

In my Twitter application I’ve gotten highlighting and handling of hashtag entities to work… mostly.
In 95% of my cases I can highlight the hashtags in the entities just fine. However, in a few cases the highlighting works, but when I use the indices in the entity to continue displaying the rest of the tweet text, some parts of the hashtag remain in the text!

Examples:

Hashtags in the tweet with status ID 876535407378403328 are displayed as: #mbfamilydayy #VB777
And in the tweet with status ID 877155779383525376 they are displayed as: Cancunun

In these examples there are repeated letters after the hashtag (“y”, “7” and “un”), which must be due to my calculated indices being somewhat off and still fetching hashtag text from the text field.

I can’t, for the life of me, figure out why this is happening. The only common denominator I have found is that both of these examples have an emoji in them somewhere, but I’m developing in C#, where all strings are Unicode, so that shouldn’t matter (in theory).

Here’s the code that I use to generate the text with highlighted hashtags. I’d really appreciate it if someone could show me what’s wrong here.

int previousIndex = 0;
foreach (var entity in tweet.entities.hashtags) {
    // Plain text between the previous hashtag and this one.
    int length = entity.indices[0] - previousIndex;
    string subText = tweet.text.Substring(previousIndex, length);
    myTweet.add(subText);

    HashTag hashtag = new HashTag();
    hashtag.text = "#" + entity.text;
    myTweet.add(hashtag);

    previousIndex = entity.indices[1];
}
// Plain text after the last hashtag.
int finalLength = tweet.text.Length - previousIndex;
string finalSubText = tweet.text.Substring(previousIndex, finalLength);
myTweet.add(finalSubText);

#2

I ended up using int[] indexLookup = StringInfo.ParseCombiningCharacters(tweetText), and checking the entity indices against that array.

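The reason for this is that C# strings are UTF-16, so an emoji outside the Basic Multilingual Plane takes up two chars (a surrogate pair). StringInfo.ParseCombiningCharacters returns the starting char index of each text element and treats each such pair as a single element. That way I can look up that, e.g., an entity start index of 47 is actually index 49 in a string with two emojis in front of it.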
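To make the index shift concrete, here is a minimal sketch. The class name EmojiIndexDemo, the helper ElementStarts, and the sample string are made up for illustration; only StringInfo.ParseCombiningCharacters comes from the post above.

```csharp
using System;
using System.Globalization;

public static class EmojiIndexDemo
{
    // Hypothetical helper: UTF-16 start index of each text element in s.
    public static int[] ElementStarts(string s) =>
        StringInfo.ParseCombiningCharacters(s);

    public static void Main()
    {
        // One emoji (U+1F600, stored as a surrogate pair), a space, then a hashtag.
        string text = "\U0001F600 #tag";
        int[] starts = ElementStarts(text);

        Console.WriteLine(text.Length);   // 7 UTF-16 chars
        Console.WriteLine(starts.Length); // 6 text elements
        Console.WriteLine(starts[2]);     // 3: char index of '#', which is element 2
    }
}
```

An entity index of 2 for the "#" therefore has to be translated to char index 3 before it is passed to Substring.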

This is the edited (somewhat pseudo) code that handles emojis and does not duplicate text after hashtags:

// Start index (in UTF-16 chars) of each text element, so a surrogate-pair emoji counts once.
int[] textIndices = StringInfo.ParseCombiningCharacters(tweet.text);

int previousIndex = 0;
foreach (var entity in tweet.entities.hashtags) {
    // Translate the entity's start index into a char index in the string.
    int length = textIndices[entity.indices[0]] - previousIndex;
    if (length < 0) length = 0;
    if (previousIndex > tweet.text.Length) previousIndex = tweet.text.Length;
    string subText = tweet.text.Substring(previousIndex, length);
    myTweet.add(subText);

    HashTag hashtag = new HashTag();
    hashtag.text = "#" + entity.text;
    myTweet.add(hashtag);

    // indices[1] is the first position after the hashtag; a hashtag at the
    // very end of the tweet has no entry in textIndices.
    if (entity.indices[1] >= textIndices.Length)
        previousIndex = tweet.text.Length;
    else
        previousIndex = textIndices[entity.indices[1]];
}
// Plain text after the last hashtag.
int finalLength = tweet.text.Length - previousIndex;
if (finalLength < 0) finalLength = 0;
if (previousIndex > tweet.text.Length) previousIndex = tweet.text.Length;
string finalSubText = tweet.text.Substring(previousIndex, finalLength);
myTweet.add(finalSubText);
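For anyone who wants something they can compile, here is a self-contained sketch of the same idea. It assumes the entity indices count Unicode code points (which is my understanding of the API, but please verify against the docs), so it walks surrogate pairs directly instead of using StringInfo, whose text elements can also group combining marks. HashtagSplitter, ToUtf16Index, Split, and the sample tweet are all hypothetical names and data, not part of any Twitter library.

```csharp
using System;
using System.Collections.Generic;

public static class HashtagSplitter
{
    // Hypothetical helper: converts a code point offset to a UTF-16 char index.
    public static int ToUtf16Index(string text, int codePointIndex)
    {
        int charIndex = 0;
        for (int cp = 0; cp < codePointIndex && charIndex < text.Length; cp++)
        {
            // A surrogate pair is one code point but two chars.
            charIndex += char.IsSurrogatePair(text, charIndex) ? 2 : 1;
        }
        return charIndex;
    }

    // Splits text into plain and hashtag segments.
    // Each entry is { start, end } in code points, end exclusive (assumed).
    public static List<(string Text, bool IsHashtag)> Split(
        string text, IEnumerable<int[]> hashtagIndices)
    {
        var segments = new List<(string Text, bool IsHashtag)>();
        int previous = 0;
        foreach (int[] indices in hashtagIndices)
        {
            int start = ToUtf16Index(text, indices[0]);
            int end = ToUtf16Index(text, indices[1]);
            if (start > previous)
                segments.Add((text.Substring(previous, start - previous), false));
            segments.Add((text.Substring(start, end - start), true));
            previous = end;
        }
        if (previous < text.Length)
            segments.Add((text.Substring(previous), false));
        return segments;
    }

    public static void Main()
    {
        // Made-up tweet: the emoji sits before the hashtag, so char indices shift by one.
        string text = "Sun \U0001F31E in #Cancun today";
        var hashtags = new List<int[]> { new[] { 9, 16 } };

        foreach (var segment in Split(text, hashtags))
            Console.WriteLine((segment.IsHashtag ? "[tag] " : "[txt] ") + segment.Text);
    }
}
```

With this sample, using 9 and 16 directly as char indices would cut the hashtag one char early and leave a stray "n" at the start of the remaining text, which looks like the same kind of leftover described in the first post.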