I’m new to scraping Twitter data. My goal is to take the search results and analyze them in Pandas.

I’ve been able to do this with the Twarc2 command line program (I’m running Ubuntu), but saving the results as a .jsonl and converting it to a .csv before importing into a DF is clunky.

From this page it seems accessing Twitter using Twarc as a client offers more flexibility in crafting a query. But I’m struggling to get the results to match those from the command line version.

The command-line two-step (jsonl → csv → df) flattens the data into one flat list, where each nested json/dictionary is turned into separate fields (e.g., “public_metrics.like_count”, “public_metrics.quote_count”).

Using the code example linked above, I’ve been able to use the python client to do the search and get the search results into a df. The original code prints the results; I’m converting them to a list, which I then turn into a df:

tweet_list = []
for page in search_results:
    # The Twitter API v2 returns the Tweet information and the user, media etc. separately,
    # so we use expansions.flatten to get all the information in a single JSON
    result = expansions.flatten(page)
    for tweet in result:
        # Here we could print the full Tweet object JSON to the console
        # print(json.dumps(tweet))
        tweet_list.append(tweet)

tweets = pd.DataFrame(tweet_list)

This allows me to see the tweet data, but a lot of columns have json/dictionary entries. For example, the “public_metrics” column has

{'retweet_count': 0, 'reply_count': 2, 'like_count': 1, 'quote_count': 0}

I can turn these into individual df columns, but it’s a nuisance:

def make_retweets(tweet):
    return tweet['retweet_count']

tweets['retweets'] = tweets['public_metrics'].apply(make_retweets)

It’s worse for more complicated fields like “context_annotations” which has a nested list/dictionary structure.

What is the better / easier way to get the python client data formatted the same way as the command line, two-step process?

Thanks,
Adam

I generally recommend sticking with the command line unless you have some very specific requirements for altering how things are stored or processed. Currently, everything you can do in the API you can also do in the command line, so I’d be very interested to hear if we’ve missed something, or if we should change things around to make it easier to use via scripts.

Replicating what you do on the command line is also possible in scripts with twarc: you can use the twarc-csv dataframe converter: Examples of using twarc2 as a library - twarc

The example here still writes an intermediate JSON file before converting to a CSV and a dataframe, but I would not recommend downloading Twitter data without keeping the original API responses (keeping them makes it easier to regenerate a dataframe if you need to change how fields / columns are handled).

However, there’s no actual need to flatten the tweets or write JSON or CSVs if you don’t want to. Here’s an example that grabs the data and loads everything into a dataframe (it still saves the JSON separately, but you can leave that step out if you prefer):

from twarc import Twarc2, expansions
import datetime
import json

from twarc_csv import DataFrameConverter # need to pip install twarc-csv

import pandas as pd

# Replace your bearer token below
client = Twarc2(bearer_token="aaaaaa")

# Write a result to a file, and pass it on to another function
# Get rid of this if you don't want to save the data
def yield_and_write_json(results):
    for result in results:
        with open("results.jsonl", "w") as f:
            f.write(json.dumps(result) + "\n")
        yield result

def main():
    # Specify the start time in UTC for the time period you want Tweets from
    start_time = datetime.datetime(2022, 10, 31, 0, 0, 0, 0, datetime.timezone.utc)

    # Specify the end time in UTC for the time period you want Tweets from
    end_time = datetime.datetime(2022, 11, 1, 0, 0, 0, 0, datetime.timezone.utc)

    # This is where we specify our search query
    query = "from:twitterdev -is:retweet"

    # The search_all method calls the full-archive search endpoint to get Tweets based on the query, start and end times
    # note max_results is max per page, not overall!
    search_results = client.search_all(query=query, start_time=start_time, end_time=end_time, max_results=100)

    # Default options for Dataframe converter (process retweets, entities etc)
    converter = DataFrameConverter()

    # Save results to json and convert them to a dataframe on the fly:
    df = converter.process(yield_and_write_json(search_results))
    # Or just convert straight away without saving:
    #df = converter.process(search_results)

    # Results:
    print(df.describe(include="all"))


if __name__ == "__main__":
    main()

Igor,
Thanks for your response. I’ll look some more at the command line program since you recommend it.

I’ve also figured out a way to get the data into the form that I want, using the pandas json_normalize().

I’ve written a nicely formatted response, but I keep getting an error when I try to post it to the system :frowning:

Thanks!
Adam

What’s the exact error?

Also, json_normalize is exactly what twarc-csv does underneath! Except there are a bunch of extra steps that format the data a bit better too - twarc-csv/dataframe_converter.py at main · DocNow/twarc-csv · GitHub
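For anyone following along, here’s a minimal sketch of what that flattening does, using a made-up tweet record shaped like the public_metrics example earlier in the thread (nothing twarc-specific, just pandas):

```python
import pandas as pd

# A hypothetical tweet record shaped like the API output discussed above
tweet_list = [
    {
        "id": "1",
        "text": "hello",
        "public_metrics": {
            "retweet_count": 0,
            "reply_count": 2,
            "like_count": 1,
            "quote_count": 0,
        },
    }
]

# json_normalize flattens nested dicts into dotted column names,
# matching the command-line two-step output
tweets = pd.json_normalize(tweet_list)
print(sorted(tweets.columns))
# includes 'public_metrics.like_count', 'public_metrics.retweet_count', ...
```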

When I “reply” to your message, with a longer, formatted, post, I get the message:

“An error occurred: Sorry you cannot post a link to that host.”

With a bit more investigation / experimentation the problem seems to be that I included a link to
towardsdatascience dot com//all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd in the text of my message.

I cut-and-pasted this from my browser’s address box, and when it is a properly shaped HTTPS address (one that includes the https:// and a period instead of the word ‘dot’), the preview on the right side of the screen says:

Sorry, we were unable to generate a preview for this web page, because the following oEmbed / OpenGraph tags could not be found: description, image 

So the problem seems to be related to this system’s attempt to follow links and provide a preview: when it can’t generate one, it produces an error message.


Oh, that’s just the forum being flaky, it’s no big deal. If any links fail you can always use the “preformatted text” option!

So I’m trying to use the command-line option, as you recommended.

One question, which is probably more of a linux / jupyter / python issue than a Twarc2 one: how can I pass variables to the command line as parameters and have them interpreted properly?

I’m looking for tweets about the Doctor Strange movie, so I created a string with my search terms:

query = '\"Doctor Strange\"  -is:retweet'

and then tried to call the command line Twarc2:

!twarc2 search query --start-time 2022-1-1 --limit 5000 --archive tempFile.jsonl

but what was returned was tweets that contained the word “query” rather than my query string.

I’d like to be able to submit other parameters (like the integer for the limit or the name of the file) to the command line. Is this possible?

Thanks!
Adam

Oh, it looks like you’re running this in a Jupyter notebook. It’s probably worth running these command-line commands in a terminal rather than in a cell, but if you do need to, the way to pass values in is with $ shell variables, or by just writing the query out, being careful to escape the " quotes appropriately:

twarc2 search "\"doctor strange\"" --start-time "2022-01-01" --limit 5000 --archive tempFile.jsonl

The outer " characters enclose the command-line parameter; the inner " characters, escaped as \", define the exact phrase.

It’s also a good idea to enclose the date in " quotes, making it more explicit in case it would otherwise be parsed incorrectly.
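And in case the notebook route is still needed: in Jupyter/IPython, a Python variable can be interpolated into a ! command with {curly braces}, and the standard library’s shlex.quote takes care of the shell quoting for you. A rough sketch, reusing the query and file name from above:

```python
import shlex

query = '"Doctor Strange" -is:retweet'
limit = 5000

# shlex.quote wraps the string in single quotes so the shell passes it
# through as a single argument, with the inner double quotes intact
cmd = f"twarc2 search {shlex.quote(query)} --start-time 2022-01-01 --limit {limit} --archive tempFile.jsonl"
print(cmd)
# In a notebook cell, the same interpolation works directly:
#   !twarc2 search {shlex.quote(query)} --start-time 2022-01-01 --limit {limit} --archive tempFile.jsonl
```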