Hello,
I have a list of tweet IDs (tids.csv) for which I need to collect all retweets. Since the v2 API can’t directly retrieve the retweets of a specific tweet, I tried getting all retweets of each user and then filtering for the retweets of the specific tweets we’re interested in. Here is the command I used to get all retweets of the users:
while read line; do twarc2 search --archive --start-time "..." --end-time "..." "retweets_of:$line"; done < usernames.txt > usersRetweets.jsonl
where usernames.txt is a text file containing one username per line.
PROBLEM:
- There are ~17 thousand unique usernames in my input text file, and collecting all retweets for that many users takes a long time and a lot of disk space. For example, it took a couple of days and ~5 GB to collect retweets for only ~600 users.
QUESTION:
- What is the most efficient way in terms of time and space to get all retweets?
The above command retrieves all retweets of all tweets for each user, while I only need the retweets of the specific tweets in my dataset (tids.csv). Here is what I’m thinking of doing:
For each user in usernames.txt:
- Get all retweets of all tweets for that username
- Compare the source tweet of each retweet with the tweet IDs in my tids.csv file, keeping only the retweets whose source tweet ID is found in the file and removing the rest.
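The comparison step could look roughly like the Python sketch below. It assumes tids.csv holds one tweet ID per row in its first column with no header row, and that each line of usersRetweets.jsonl is one API response page with the retweets under a "data" key, each carrying a referenced_tweets list. The function names are illustrative and not part of twarc.

```python
import csv
import json


def load_target_ids(csv_path):
    """Read tweet IDs from the first column of a CSV file into a set of strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[0].strip() for row in csv.reader(handle) if row and row[0].strip()}


def retweeted_id(tweet):
    """Return the ID of the tweet this one retweets, or None if it is not a retweet."""
    for ref in tweet.get("referenced_tweets") or []:
        if ref.get("type") == "retweeted":
            return str(ref.get("id"))
    return None


def filter_retweets(jsonl_path, target_ids):
    """Yield only the retweets whose source tweet ID is in target_ids."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            page = json.loads(line)  # assumed: one response page per line
            for tweet in page.get("data", []):
                if retweeted_id(tweet) in target_ids:
                    yield tweet
```

Because the file is read line by line and the IDs sit in a set, memory use stays small even for a multi-gigabyte input.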
Also, it would be great if there were a way to get only the attributes/fields I need in the JSON response in step 1, because I don’t need all of them.
Can someone please help me with this? I don’t know how to do this programmatically.
Thank you!
Unfortunately, nothing off the shelf does this. I think at this point it would be better to code this yourself using twarc as a library instead of a bash script. The available functions are here: twarc.Client2 - twarc
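As a rough illustration of calling twarc from Python rather than from the shell, here is a small sketch. It assumes the client behaves like twarc's Twarc2, whose search_all() yields response pages as dictionaries with the tweets under "data". The helper name iter_retweets_of is made up for this example.

```python
def iter_retweets_of(client, username, start_time, end_time):
    """Yield every retweet of `username` found between start_time and end_time.

    `client` is assumed to behave like twarc's Twarc2: search_all() yields
    API response pages (dicts) with the tweets under "data".
    """
    query = f"retweets_of:{username}"
    for page in client.search_all(query, start_time=start_time, end_time=end_time):
        for tweet in page.get("data", []):
            yield tweet


# Usage with the real library (needs twarc installed and a bearer token;
# start and end are datetime objects):
#   from twarc import Twarc2
#   client = Twarc2(bearer_token="...")
#   for tweet in iter_retweets_of(client, "some_user", start, end):
#       ...
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before spending any API quota.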
You’re on the right track. At a high level, I would break it down and set up a pipeline like this, processing one user at a time as opposed to everything at once:
- For your given time range, retrieve all retweets of a target user, setting the start time to the created_at time of the tweet you’re interested in, to minimize retrieving retweets of other tweets.
- Extract only the retweets of the tweet you’re interested in, discarding the rest (based on the referenced_tweets field).
- For each remaining retweet, extract only the fields you want from the JSON dictionary.
- After that, write this minimal dictionary to a file and compress it with gzip (JSON data generally compresses really well).
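The last two steps could be sketched in Python as follows. The field list in KEEP_FIELDS is a hypothetical choice, and the helper names are illustrative; swap in whichever fields you actually need.

```python
import gzip
import json

# Hypothetical choice of fields; replace with the ones you actually need.
KEEP_FIELDS = ("id", "author_id", "created_at", "referenced_tweets")


def slim(tweet, fields=KEEP_FIELDS):
    """Return a copy of the tweet dict holding only the requested fields."""
    return {name: tweet[name] for name in fields if name in tweet}


def write_gzip_jsonl(path, tweets, fields=KEEP_FIELDS):
    """Write one slimmed tweet per line to a gzip-compressed file; return the count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for tweet in tweets:
            handle.write(json.dumps(slim(tweet, fields)) + "\n")
            count += 1
    return count
```

Writing one file per user, for example a hypothetical some_user.jsonl.gz, keeps each run small and makes it easy to resume after an interruption.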
This way, you can reuse a lot of twarc functions without making major modifications, and processing one user at a time will keep space usage to a minimum, provided you know exactly which data you need.
Code-wise this is much more involved and may take some time, but nothing required here hasn’t been done before by others, so you could get pretty far assembling parts from Stack Overflow and other places.