Full-archive search issue report:

During recent attempts to fetch tweets using the Twitter API v2, we noticed that ‘urls’ was missing under ‘entities’ for tweets that retweet another tweet. This happened most of the time, but not always. For example, the tweet with ID 1499880727214706692 has these entities:

{'mentions': [{'start': 3, 'end': 19, 'username': 'criticalthreats', 'id': '106738320'}, {'start': 112, 'end': 126, 'username': 'TheStudyofWar', 'id': '71298686'}], 'hashtags': [{'start': 27, 'end': 35, 'tag': 'Russian'}]}

Note that the example tweet, shown in the image below, does contain a URL, yet there is no 'urls' key in its entities.


See the attached spreadsheet, which includes a subset of retweets for @aei. The majority have a URL in the original tweet, and yet these URLs aren’t included in entities (column Q is filled where a URL is included). The green-shaded rows, within rows 2-21, do not have a URL in the original tweet and therefore should not have one in entities.

It isn’t clear why sometimes the URL from the original tweet is included in entities, and why it usually isn’t.

Getting the expanded URL:

Where tweet authors have used a shortened URL in their original tweet, the API often doesn’t include the fully expanded URL.

Retweets are nearly always truncated, so any entities that fall in the cut-off part of the retweet’s text will be missing. The solution is to request the referenced_tweets.id and referenced_tweets.id.author_id expansions and read the original, retweeted tweet, which carries the full payload. The original tweets appear under includes in the response.
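As a rough illustration of that lookup, here is a minimal Python sketch that works on an already-fetched v2 response and makes no network calls. The helper names (`urls_for_tweet`, `resolve_urls`) are hypothetical, and the sample payload is made up for the example: only the retweet ID comes from the report above, while the original tweet's ID and both URLs are placeholders. It assumes the usual v2 shape, with retweets listing the original under `referenced_tweets` with type `retweeted` and the original tweet appearing in `includes.tweets`.

```python
def urls_for_tweet(tweet, included_by_id):
    """Return the url entities for a tweet, preferring the retweeted original."""
    for ref in tweet.get("referenced_tweets", []):
        if ref.get("type") == "retweeted":
            original = included_by_id.get(ref.get("id"))
            if original is not None:
                # The original tweet carries the untruncated entities.
                return original.get("entities", {}).get("urls", [])
    # Not a retweet, or the original was not included (e.g. it was deleted):
    # fall back to whatever the tweet itself carries.
    return tweet.get("entities", {}).get("urls", [])


def resolve_urls(response):
    """Map each tweet ID in `data` to a list of its (expanded, if present) URLs."""
    included = response.get("includes", {}).get("tweets", [])
    included_by_id = {t["id"]: t for t in included}
    resolved = {}
    for tweet in response.get("data", []):
        resolved[tweet["id"]] = [
            u.get("expanded_url") or u.get("url")
            for u in urls_for_tweet(tweet, included_by_id)
        ]
    return resolved


# Made-up payload: the original tweet ID and the URLs are placeholders.
sample_response = {
    "data": [
        {
            "id": "1499880727214706692",
            "referenced_tweets": [{"type": "retweeted", "id": "1111"}],
            "entities": {"hashtags": [{"start": 27, "end": 35, "tag": "Russian"}]},
        }
    ],
    "includes": {
        "tweets": [
            {
                "id": "1111",
                "entities": {
                    "urls": [
                        {
                            "url": "https://t.co/abc",
                            "expanded_url": "https://example.org/report",
                        }
                    ]
                },
            }
        ]
    },
}

resolved = resolve_urls(sample_response)
print(resolved)
```

The fallback to the retweet's own entities is a design choice for cases where the original tweet is missing from includes; whether that is the right behaviour depends on how you want to treat deleted or protected originals.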

For example, twarc (GitHub: DocNow/twarc, a command line tool and Python library for archiving Twitter JSON) will request all available fields and expansions, and twarc-csv (GitHub: DocNow/twarc-csv, a plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV) will use these to merge retweets properly.
