twarc was actually designed to be used as a command line tool without having to write code, but this wasn't documented in the Twitter course. The twarc docs cover it: see twarc2 (en) - twarc and Tutorials - twarc. So if you like, you can skip the Python entirely and do this in the terminal:
Set up twarc along with a useful CSV plugin:
pip install --upgrade twarc twarc-csv
Add your bearer token:
twarc2 configure
Make a search using the academic access endpoint (--archive) and save the results as results.jsonl, a JSON file with one API response per line:
twarc2 search --archive --start-time "2021-01-01" --end-time "2021-05-30" "from:twitterdev -is:retweet" results.jsonl
Then convert the JSON to CSV to explore it:
twarc2 csv results.jsonl results.csv
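If you later want a quick look at the CSV from Python rather than a spreadsheet, a minimal sketch using only the standard library could look like this. The function name summarize_csv is made up for illustration, and nothing is assumed about which columns twarc-csv writes; the code just reports whatever header it finds.

```python
import csv


def summarize_csv(path):
    """Return (row_count, column_names) for a CSV file with a header row."""
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which matters because tweet text can contain newlines
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Accessing fieldnames reads the header row (None for an empty file)
        columns = list(reader.fieldnames or [])
        # Count the remaining rows without holding them all in memory
        row_count = sum(1 for _ in reader)
    return row_count, columns
```

Calling summarize_csv("results.csv") after the conversion step should give the number of rows and the list of column names.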
Alternatively, if you want to stick with the code: this example iterates over each tweet, printing it to the screen:
# Twarc returns all Tweets for the criteria set above, so we page through the results
for page in search_results:
    # The Twitter API v2 returns the Tweet information and the user, media etc. separately
    # so we use expansions.flatten to get all the information in a single JSON
    result = expansions.flatten(page)
    for tweet in result:
        # Here we are printing the full Tweet object JSON to the console
        print(json.dumps(tweet))
You can replace the print with a write to a file, writing one tweet per line:
# Open the output file once, before the loop: opening it in write mode for
# every tweet would truncate it each time and keep only the last tweet
with open("results.jsonl", "w") as f:
    # Twarc returns all Tweets for the criteria set above, so we page through the results
    for page in search_results:
        # The Twitter API v2 returns the Tweet information and the user, media etc. separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            # Here we are writing 1 Tweet object JSON per line
            f.write(json.dumps(tweet) + "\n")
Or write one API response per line:
# Open the output file once, before the loop: opening it in write mode for
# every page would truncate it each time and keep only the last response
with open("results.jsonl", "w") as f:
    # Twarc returns all Tweets for the criteria set above, so we page through the results
    for page in search_results:
        # The Twitter API v2 returns the Tweet information and the user, media etc. separately
        # Here we are writing 1 of these self contained responses with all metadata per line:
        f.write(json.dumps(page) + "\n")
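Both file-writing loops follow the same pattern: one JSON document per line. Here is a rough sketch of a reusable helper for that pattern. The name write_jsonl is hypothetical and not part of twarc; it takes any iterable of JSON-serializable objects, so it does not need the API to try out.

```python
import json


def write_jsonl(records, path):
    """Write each record as one JSON document per line; return how many were written."""
    count = 0
    # The file is opened a single time, so every record is kept
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Passing the pages from search_results would give one response per line, while passing a generator that yields the flattened tweets from each page would give one tweet per line.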
By default, this is exactly what the twarc2 command line tool does: it writes one original API response per line. If you want one tweet per line instead, run:
twarc2 flatten results.jsonl tweet_per_line.jsonl
This writes a new file, tweet_per_line.jsonl, with one tweet per line.
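To load a one-tweet-per-line file back into Python, something along these lines should work. It is a small sketch using only the standard json module; read_jsonl is an illustrative name, not a twarc function.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, e.g. a trailing newline at the end of the file
            if line:
                yield json.loads(line)
```

For example, sum(1 for _ in read_jsonl("tweet_per_line.jsonl")) counts the tweets in the flattened file without loading them all at once.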