Hello everyone, I have a list of tweet ids for which I’d like to collect some metadata.
What I could do so far:
- I could get the full tweet object using twarc hydrate, and also retweets, for a list of tweet ids. I tried this only on a small sample because my original file contains thousands of tweet ids.
- I could get some metadata (e.g., text, #retweets, #followers) by writing custom Python code against the Twitter API v1. However, I couldn't get replies or retweet content with this method.
- To get the conversation thread (replies, and also replies of replies), I switched to the Twitter API v2. I could get everything I needed (i.e., replies, intermediary replies, retweets, etc.) using Postman. However, it's very time-consuming and inefficient to use Postman to collect data for thousands of tweet ids, because there is a limit on the number of tweet ids per API call (100 tweet ids/request).
My question: The result of twarc hydrate or twarc retweets is a JSON file that contains the full tweet or user object. Is there a similar command (or any other way) to get only specific metadata and drop the fields I don't need? (e.g., I don't need profile_background_color, geo_enabled, is_translator, etc.)
I was thinking about collecting the full tweet JSON as is and then removing the unwanted fields by modifying the JSON file. However, it takes a long time to get the full tweet JSON for thousands of tweet ids using the hydrate command.
twarc hydrate and twarc retweets always call the v1.1 API, so there is no way to request a subset of fields with those commands; they will always return tweets in the v1.1 format: Tweet object | Docs | Twitter Developer Platform
twarc2 hydrate calls the v2 API, returning tweets in the v2 format: Tweet object | Docs | Twitter Developer Platform. By default it requests all possible expansions and fields - this was a design choice, to ensure that data collections are predictable and complete. The idea is that you gather the full payload from the API, and then subset the data when processing.
There is no v2 API equivalent of twarc retweets unfortunately - the v2 API can’t retrieve retweets of a specific tweet, but you can request retweets of a specific user, and sort through the results yourself:
twarc2 search --archive "retweets_of:user" all_user_retweets.jsonl
If the issue is with handling large files during processing, that's a separate problem - but also something I'm interested in fixing for twarc. Currently the twarc-csv plugin is designed to handle large files, for example, but it still includes all fields, so the results would still need to be filtered: GitHub - DocNow/twarc-csv: A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
Thank you @IgorBrigadir for your answer.
Since I could get retweets of tweets using the v1 API, what is your opinion about using v1 instead of v2? The downside of using the v1 API is that I won't be able to get the conversation reply threads, but having retweets is more important for me, as my main analysis will be on retweet cascades.
Yes, I think one problem is with large files and the other problem is how to remove the unwanted fields from the json file.
Thanks for introducing twarc-csv. Personally, I'm more comfortable working with CSV files than with JSON. However, I'm not sure if CSV would be the best format to keep Twitter data. The JSON response I get using Postman is all I need (retweets, replies, and even intermediary replies), but it's not efficient for a large number of tweet ids. Is there any way to get a similar response without Postman? (e.g., custom code in Python)
My only reservation with v1.1 would be that it only gives you the latest 100 retweets, but if this is enough for you, there’s nothing stopping you from:
getting retweets with v1.1 using twarc retweets, then extracting the ids with twarc dehydrate, and then getting the v2 representation with twarc2 hydrate, so everything ends up in a common v2 format.
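The round trip described above could look like this (the tweet id and filenames are placeholders; exact flags may differ slightly between twarc versions):

```shell
# 1. Get the (up to 100) latest retweets of a tweet via the v1.1 API
twarc retweets 123456789 > retweets_v1.jsonl
# 2. Reduce the v1.1 tweets back to bare tweet ids
twarc dehydrate retweets_v1.jsonl > retweet_ids.txt
# 3. Re-fetch those ids through the v2 API for the common v2 format
twarc2 hydrate retweet_ids.txt retweets_v2.jsonl
```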
You can certainly use both v1.1 and v2 to get better coverage, but just be aware of the limitations.
I’m not sure if csv would be the best format to keep Twitter data.
Yes, I totally agree (and I wrote twarc-csv!)
Feel free to reuse the twarc-csv code for your own purposes - twarc-csv/twarc_csv.py at main · DocNow/twarc-csv · GitHub is the function that writes the CSV, and here is where it uses pandas: twarc-csv/twarc_csv.py at main · DocNow/twarc-csv · GitHub. Changing those two functions should be enough to repurpose it. If you clone the repo and pip install -e ., you will be able to make changes to the code and they will immediately work with the twarc2 csv command, which you can rename or adapt as needed. See Plugins - twarc for more details on plugins for twarc.
Thank you so much @IgorBrigadir for the complete answer.
Hi @IgorBrigadir, I know it’s been a while since this post, but I just faced a new question:
I used the command you suggested above (twarc2 search ......"retweets_of:user") to get retweets. Does this command return all retweets for a list of usernames, or only the latest 100 retweets? Thanks!
Yep - retweets_of:user will retrieve all retweets, as far as I know!
Hi! On this topic: I have successfully used twarc2 with academic access to run a search_all query, and I used CSVConverter to convert the JSON file into a CSV file. Now this CSV file has all the field names, which gives me 93 columns. How can I filter in Python down to only the few field names I'm interested in? Is there no way to limit this in the search_all command directly?
thanks
There’s no way to limit this from the twarc2 search command directly right now, but you can select a subset of fields when converting to CSV: (provided you already have the results and have installed twarc-csv with pip install twarc-csv)
twarc2 csv results.json output.csv --output-columns "id,author.id,text"
See twarc-csv/twarc_csv.py at main · DocNow/twarc-csv · GitHub for a full list of field names.
Thanks Igor! What would be the way to do this inside Python with CSVConverter(infile, outfile)? I can always read the CSV file and save it again with only the columns of interest, but it wouldn't be as nice.
This should work with twarc-csv:
from twarc_csv import CSVConverter
with open("input.json", "r") as infile:
    with open("output.csv", "w") as outfile:
        converter = CSVConverter(infile=infile, outfile=outfile, output_columns="id,author.id,text")
        converter.process()
Thanks Igor, I'm stuck with the double quotes while trying to make the code readable
using:
info_returned = "".join(
['id,created_at,text,attachments.media,attachments.media_keys,' # tweet
'author.created_at,author.username,author.name,author.id,' # author
'author.location,author.public_metrics.followers_count,author.public_metrics.following_count,'
'geo.coordinates.coordinates,geo.coordinates.type,geo.country,'
'geo.country_code,geo.full_name,geo.geo.bbox,geo.geo.type,geo.id,'
'geo.name,geo.place_id,geo.place_type,'
'lang'])
and then to replace the single quotes by double quotes
converter = CSVConverter(infile=infile, outfile=outfile,
output_columns=json.dumps(info_returned))
but this still leads to ’ " … " ’ :
any ideas?
I don't understand why info_returned is in the form that it is, but output_columns expects a string of comma-separated values (the quote style doesn't matter in Python), and you don't need json.dumps. The "".join() already outputs a single comma-separated string, so it should work with this:
converter = CSVConverter(infile=infile, outfile=outfile, output_columns=info_returned)
Thank you,
My bug was that I was using e.g. ‘id, created_at, text, attachments.media, attachments.media_keys’ (with spaces after the commas) instead of ‘id,created_at,text,attachments.media,attachments.media_keys’, in case this helps anyone.
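That gotcha is easy to guard against: a tiny helper (my own sketch, not part of twarc-csv) can normalize the column string so stray spaces after commas can't creep in:

```python
def normalize_columns(cols):
    """Strip whitespace around each comma-separated column name."""
    return ",".join(part.strip() for part in cols.split(","))
```

Then pass output_columns=normalize_columns(info_returned) and the spacing of the source string no longer matters.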