Changes to Tweet Entities for Retweets


#1

Today, we are announcing a change to [node:127, title="Tweet Entities"] that improves an existing behavior with Retweets. While this should be transparent, it might be a breaking change in some cases, so we want to make sure everyone has the opportunity to review their applications before we start rolling it out on January 6, 2014.

Please note that only applications relying on the top-level text and entities for Retweets, instead of the ones nested in retweeted_status, are impacted by this change. We strongly recommend using the text and entities fields from the retweeted_status object when working with Retweets.
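As a minimal sketch of that recommendation, the helper below prefers the nested retweeted_status when it is present and falls back to the top-level fields otherwise. The function name `text_and_entities` and the sample payload are hypothetical and only serve as an illustration; the payload is heavily abbreviated compared to a real API response.

```python
def text_and_entities(tweet):
    """Return (text, entities) for a Tweet payload, preferring retweeted_status."""
    # For a Retweet, the nested original Tweet carries the untruncated text
    # and the entities whose indices match that text.
    source = tweet.get("retweeted_status") or tweet
    return source["text"], source.get("entities", {})


# Abbreviated, made-up payload for illustration only.
retweet = {
    "text": "RT @example: hello #wor\u2026",
    "entities": {"hashtags": []},
    "retweeted_status": {
        "text": "hello #world",
        "entities": {"hashtags": [{"text": "world", "indices": [6, 12]}]},
    },
}

text, entities = text_and_entities(retweet)
start, end = entities["hashtags"][0]["indices"]
print(text[start:end])  # "#world"
```

A Tweet that is not a Retweet has no retweeted_status, so the same helper simply returns its own text and entities.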

As described in the documentation of [node:127, title="Entities for Retweets", hash="retweets"], the top-level text property of a Retweet may be truncated due to the addition of the “RT @username: ” prefix. Until now, entities have been extracted from this truncated attribute and, consequently, an entity might be missing or incorrect.

The change we are introducing extracts the entities from the original Tweet instead, which corrects this behavior. While this will ensure every Retweet has consistent top-level entities, the related from/to indices may now point to a truncated portion of the text attribute, or even only to the final ellipsis character.

The following Retweet example illustrates the change: https://twitter.com/romainhuet/status/390435661129740288

JSON extract of the Retweet before the change:

{
  ...
  "text": "RT @university: Learn more about the powerful #Linux container engine @docker in this video intro with @solomonstre - http:\/\/t.co\/QJLdA1762\u2026",
  "entities": {
    "hashtags": [{
      "text": "Linux",
      "indices": [46, 52]
    }],
    "symbols": [],
    "urls": [],
    "user_mentions": [{
      "screen_name": "university",
      "name": "Twitter University",
      "id": 1665823832,
      "id_str": "1665823832",
      "indices": [3, 14]
    }, {
      "screen_name": "docker",
      "name": "Docker",
      "id": 1138959692,
      "id_str": "1138959692",
      "indices": [70, 77]
    }, {
      "screen_name": "solomonstre",
      "name": "Solomon Hykes",
      "id": 9551792,
      "id_str": "9551792",
      "indices": [103, 115]
    }]
  }
}

Because the URL was truncated in the top-level text, the URL entity was previously missing. With this change, the URL entity will be included; its indices will start at the beginning of the URL text but end at the ellipsis character. The URL entity will contain the entire shortened URL, even though that URL is not fully contained in the Retweet text.

The @TwitterOSS user mention entity was also missing since it was entirely truncated. It will now be included but will only reference the ellipsis character at indices [139, 140]. Please also note that entities are not guaranteed to be ordered by their from indices.

JSON extract of the Retweet after the change:

```
{
  ...
  "text": "RT @university: Learn more about the powerful #Linux container engine @docker in this video intro with @solomonstre - http:\/\/t.co\/QJLdA1762\u2026",
  "entities": {
    "hashtags": [{
      "text": "Linux",
      "indices": [46, 52]
    }],
    "symbols": [],
    "urls": [{
      "url": "http://t.co/QJLdA1762Y",
      "expanded_url": "http://youtu.be/Q5POuMHxW-0",
      "display_url": "youtu.be/Q5POuMHxW-0",
      "indices": [118, 140]
    }],
    "user_mentions": [{
      "screen_name": "university",
      "name": "Twitter University",
      "id": 1665823832,
      "id_str": "1665823832",
      "indices": [3, 14]
    }, {
      "screen_name": "docker",
      "name": "Docker",
      "id": 1138959692,
      "id_str": "1138959692",
      "indices": [70, 77]
    }, {
      "screen_name": "solomonstre",
      "name": "Solomon Hykes",
      "id": 9551792,
      "id_str": "9551792",
      "indices": [103, 115]
    }, {
      "screen_name": "TwitterOSS",
      "name": "Twitter Open Source",
      "id": 376825877,
      "id_str": "376825877",
      "indices": [139, 140]
    }]
  }
}
```

Top-level entities will therefore reflect the ones from the original Tweet in retweeted_status, which remain unchanged. Read more in the documentation: [node:127, title="Entities in Twitter Objects", hash="retweets"].
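For applications that keep reading the top-level fields, one possible way to spot the affected entities is to flag every entity whose indices reach into the trailing ellipsis. The sketch below is a heuristic rather than an official method: the function name `truncated_entities` is hypothetical, the payload is trimmed down from the example above, and it assumes the indices count Unicode code points, which is how Python indexes strings.

```python
ELLIPSIS = "\u2026"


def truncated_entities(tweet):
    """List (kind, entity) pairs whose indices cover the final ellipsis."""
    text = tweet["text"]
    if not text.endswith(ELLIPSIS):
        return []  # nothing was cut off, so all indices fit the text
    last = len(text) - 1  # position of the ellipsis character
    hits = []
    for kind, items in tweet.get("entities", {}).items():
        for entity in items:
            start, end = entity["indices"]
            if end > last:  # the span includes the ellipsis
                hits.append((kind, entity))
    return hits


# Trimmed-down version of the Retweet shown above.
retweet = {
    "text": ("RT @university: Learn more about the powerful #Linux container "
             "engine @docker in this video intro with @solomonstre - "
             "http://t.co/QJLdA1762\u2026"),
    "entities": {
        "urls": [{"url": "http://t.co/QJLdA1762Y", "indices": [118, 140]}],
        "user_mentions": [
            {"screen_name": "docker", "indices": [70, 77]},
            {"screen_name": "TwitterOSS", "indices": [139, 140]},
        ],
    },
}

for kind, entity in truncated_entities(retweet):
    start, end = entity["indices"]
    print(kind, repr(retweet["text"][start:end]))
```

Here the URL and the @TwitterOSS mention are flagged, while @docker is not. For flagged entities, the entity's own fields (such as url or screen_name) hold the complete value, so there is no need to slice the truncated text.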

We plan to progressively roll out this change starting January 6, 2014 for both REST and Streaming APIs. As usual, we have updated our [node:12047, title="calendar of API changes"] accordingly.

Please review your applications if you are currently using the top-level entities for Retweets and feel free to use this thread for any questions.


#2

I’ve noticed that sometimes the indices are wrong and miss one or more characters. An example can be found in this tweet, id=425357584103903233:
here the retweeted status reports one media URL, starting at 56 and ending at 78, and extracting a substring from the retweeted status text using these indices yields one white space at the beginning and truncates the last character of the URL.

In another example, tweet id=425357572817039360, the indices for the media truncate the last two characters of the retweeted status.
In both examples, it seems that the Unicode characters in the text are removed before the indices are computed.

Could you please confirm whether this is the correct method for extracting the indices:
1: strip the Unicode special characters from the text
2: compute the indices using a regexp or some other method

I also tried normalizing the text with NFC, but it doesn’t seem to change the result.

Thanks


#3

I’m also seeing errors in the indices for tweet id 473541442758262786. The entity in question is the final (truncated) link: “http://t.c\u2026”. The entity entry gives the indices for this as 139 to 140, which would correspond to the \u2026 (ellipsis), not to the correct span of characters, “http://t.c\u2026” (131-140). This is making it impossible for my application to properly handle this entity. I’d appreciate any assistance from the API team on this issue. Thanks!