I’m new to scraping Twitter data. My goal is to take the search results and analyze them in Pandas.
I’ve been able to do this with the Twarc2 command line program (I’m running Ubuntu), but saving the results as a .jsonl and converting it to a .csv before importing into a DF is clunky.
From this page, it seems that using twarc as a Python client library offers more flexibility in crafting a query. But I’m struggling to get the results to match those from the command-line version.
The command-line two-step (jsonl → csv → df) flattens the data into one table, where each nested json/dictionary is turned into dotted fields (e.g., “public_metrics.like_count”, “public_metrics.quote_count”).
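For illustration, pandas’ own json_normalize seems to produce the same dotted naming when given a nested record. The toy tweet below is made-up data standing in for a real v2 API result, not actual output:

```python
import pandas as pd

# Hypothetical, hand-written record shaped like a flattened v2 tweet
record = {
    "id": "1",
    "text": "hello",
    "public_metrics": {"like_count": 1, "quote_count": 0},
}

# json_normalize turns nested dicts into dotted column names
df = pd.json_normalize(record)
print(list(df.columns))
# Columns include "public_metrics.like_count" and "public_metrics.quote_count"
```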
Using the code example linked above, I can run the search from the Python client and get the results into a df. The original code prints the results; instead, I’m collecting them into a list, which I then turn into a df:
import pandas as pd
from twarc import Twarc2, expansions

tweet_list = []
for page in search_results:
    # The Twitter API v2 returns the Tweet information and the user, media etc. separately,
    # so we use expansions.flatten to get all the information in a single JSON
    result = expansions.flatten(page)
    for tweet in result:
        # Here we could print the full Tweet object JSON to the console
        # print(json.dumps(tweet))
        tweet_list.append(tweet)

tweets = pd.DataFrame(tweet_list)
This allows me to see the tweet data, but a lot of columns have json/dictionary entries. For example, the “public_metrics” column has
{'retweet_count': 0, 'reply_count': 2, 'like_count': 1, 'quote_count': 0}
I can turn these into individual df columns, but it’s a nuisance:
def make_retweets(tweet):
    return tweet['retweet_count']

tweets['retweets'] = tweets['public_metrics'].apply(make_retweets)
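The same extraction can be written inline, which at least avoids a named helper per field. This is just a sketch on made-up public_metrics values, assuming every entry in the column is a plain dict:

```python
import pandas as pd

# Invented public_metrics dicts standing in for real search results
tweets = pd.DataFrame({
    "public_metrics": [
        {"retweet_count": 0, "reply_count": 2, "like_count": 1, "quote_count": 0},
        {"retweet_count": 3, "reply_count": 0, "like_count": 5, "quote_count": 1},
    ]
})

# One field at a time, without a named helper function
tweets["retweets"] = tweets["public_metrics"].apply(lambda m: m["retweet_count"])

# Or all four counts at once: apply(pd.Series) expands each dict into columns
metrics = tweets["public_metrics"].apply(pd.Series)
tweets = tweets.join(metrics)
```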
It’s worse for more complicated fields like “context_annotations”, which has a nested list-of-dictionaries structure.
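To show what I mean, here is the kind of shape I’m dealing with, with invented annotation data. The best I’ve come up with is an explode plus json_normalize rather than a single apply:

```python
import pandas as pd

# Invented context_annotations mimicking the nested list/dict structure
tweets = pd.DataFrame({
    "id": ["1", "2"],
    "context_annotations": [
        [{"domain": {"name": "Sport"}, "entity": {"name": "Football"}}],
        [{"domain": {"name": "Music"}, "entity": {"name": "Jazz"}},
         {"domain": {"name": "Music"}, "entity": {"name": "Blues"}}],
    ],
})

# One row per annotation, then flatten each nested dict into dotted columns
exploded = tweets.explode("context_annotations")
annotations = pd.json_normalize(exploded["context_annotations"])

# Carry the tweet id along so annotations can be joined back later
annotations["id"] = exploded["id"].to_numpy()
```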
What is a better/easier way to get the Python client’s data into the same flat format as the command-line, two-step process?
Thanks,
Adam