Hashtag index issue?


#1

Hi

I’m very new here, so apologies if I’m missing something obvious. I am using the REST API 1.1 to retrieve all statuses for a group of users, along with the associated entities. I have noticed what seems to be a small glitch with the start and end index fields on the hashtags. I have one status update being returned which ends with three consecutive hashtags. The first has the correct indexes returned, but the second and third have identical start and end indexes. The details are:

user: 19902709
status id: 420698230910103552
The text is @Wonkypolicywonk Osborne refuses to back inflation-busting minimum wage rise | by @BethRigby http://on.ft.com/1cySeTz via @FT #ukemplaw #NMW #lowpay

and the hashtags are being returned as follows:
start end hashtag
128 137 ukemplaw
139 140 NMW
139 140 lowpay

By my reckoning it should be
start end hashtag
128 137 ukemplaw
138 142 NMW
143 150 lowpay

This is the only example I have found. There may be others, but I haven’t spent any time looking.
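
One quick way to look for other affected statuses is to check whether each hashtag's indices really point at `#` plus the tag in the returned text. The sketch below assumes the status has already been parsed into Python objects shaped like the API 1.1 JSON (a `text` string, and hashtag entities carrying `text` and `indices`). The function name and the short sample text are made up for illustration.

```python
def mismatched_hashtags(text, hashtags):
    """Return the hashtag entities whose indices do not point at '#<tag>' in text."""
    bad = []
    for tag in hashtags:
        start, end = tag["indices"]
        # Compare case-insensitively against the slice the indices describe.
        if text[start:end].lower() != ("#" + tag["text"]).lower():
            bad.append(tag)
    return bad


# Hypothetical truncated text: the last two hashtags share the same indices,
# which point at the trailing ellipsis rather than at a hashtag.
sample_text = "Budget news #ukemplaw \u2026"
sample_tags = [
    {"text": "ukemplaw", "indices": [12, 21]},
    {"text": "NMW", "indices": [22, 23]},
    {"text": "lowpay", "indices": [22, 23]},
]

bad = mismatched_hashtags(sample_text, sample_tags)
print([tag["text"] for tag in bad])
```

Any status for which this returns a non-empty list is worth a closer look before its indices are stored.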

I would like to store the hashtags in a table with a primary key on (status_id, startIndex, endIndex), but this data obviously violates uniqueness. I could just add the hashtag to the key as well (which would still be a problem if someone repeated the same hashtag in one status), but I would like to understand the issue better first. Is this behaviour correct and am I missing something, or is it a bug?
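
One possible workaround for the key, sketched here with Python's built-in `sqlite3` module: key each row on the status id plus the hashtag's position in the returned hashtags array, so identical indices or repeated hashtags cannot collide. The table name, the column names, and the `position` column are assumptions for illustration and not anything the API prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE status_hashtag (
        status_id   INTEGER NOT NULL,
        position    INTEGER NOT NULL,  -- order within the hashtags array
        start_index INTEGER NOT NULL,
        end_index   INTEGER NOT NULL,
        hashtag     TEXT    NOT NULL,
        PRIMARY KEY (status_id, position)
    )
""")


def store_hashtags(conn, status_id, hashtags):
    """Insert one row per hashtag entity, keyed by its position in the list."""
    rows = [
        (status_id, pos, tag["indices"][0], tag["indices"][1], tag["text"])
        for pos, tag in enumerate(hashtags)
    ]
    conn.executemany("INSERT INTO status_hashtag VALUES (?, ?, ?, ?, ?)", rows)


# The hashtags exactly as returned for the status in question.
store_hashtags(conn, 420698230910103552, [
    {"text": "ukemplaw", "indices": [128, 137]},
    {"text": "NMW", "indices": [139, 140]},
    {"text": "lowpay", "indices": [139, 140]},
])

row_count = conn.execute("SELECT COUNT(*) FROM status_hashtag").fetchone()[0]
print(row_count)
```

All three rows are stored even though two of them share the same start and end index.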

Many thanks


#2

OK, I’ve only just realised the obvious: the text goes beyond the 140-character limit, but that has just confused me more! There seem to be hashtags beyond the end of the text.


#3

As I said, very new to this, so I’m picking things up slowly. Perhaps I should have spent a bit more time on this before posting! I can see that these are all retweets, so I guess I’m going to have to go to the original status update to get my data correct.

I think I understand it now, but perhaps someone could confirm it. When someone retweets a status update, the original user name is prepended to the text returned by the API, making it potentially longer than 140 characters. Where this happens, the text is truncated, and any entities that fall beyond the 140-character limit are given incorrect start and end indexes. Seems a little odd to me, but as long as I know what’s going on, I can account for it.
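
In API 1.1 a retweet carries the original tweet in a nested `retweeted_status` object, so going back to the original status does not need a second request. A minimal sketch, assuming the status has been parsed into a Python dict; the function name and the sample data are hypothetical.

```python
def hashtags_for(status):
    """Return (text, hashtags), taken from the original tweet if this is a retweet."""
    # Retweets embed the untruncated original under "retweeted_status".
    source = status.get("retweeted_status", status)
    return source["text"], source["entities"]["hashtags"]


# Hypothetical retweet: the outer text is truncated, the inner one is not.
retweet = {
    "text": "RT @someone: Budget news \u2026",
    "entities": {"hashtags": [{"text": "NMW", "indices": [25, 26]}]},
    "retweeted_status": {
        "text": "Budget news #NMW",
        "entities": {"hashtags": [{"text": "NMW", "indices": [12, 16]}]},
    },
}

text, tags = hashtags_for(retweet)
for tag in tags:
    start, end = tag["indices"]
    print(tag["text"], text[start:end])
```

The indices taken from the nested object line up with the nested text, so they can be stored alongside the original status id rather than the retweet's.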