Hi,

I am working on an academic project and have full academic access. I am trying to pull tweets and then save them into a csv file. I am able to get tweets and save them into a jsonl file without an issue. Below is the code I am using:

> import datetime
> import json
> 
> from twarc import Twarc2
> 
> # Placeholder: replace with your own bearer token
> client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")
> 
> def main():
>     # Specify the start time in UTC for the time period you want Tweets from
>     start_time = datetime.datetime(2021, 6, 1, 0, 0, 0, 0, datetime.timezone.utc)
> 
>     # Specify the end time in UTC for the time period you want Tweets from
>     end_time = datetime.datetime(2021, 6, 2, 0, 0, 0, 0, datetime.timezone.utc)
> 
>     # This is where we specify our query as discussed in module 5
>     query = "(\"IPCC\" OR (\"Intergovernmental Panel\" (\"Climate change\" OR \"Global warming\")) OR (\"Intergovernmental Report\" (\"Climate change\" OR \"Global warming\")) OR ((\"United Nations Panel\" OR \"UN Panel\") (\"Climate change\" OR \"Global warming\")) OR ((\"United Nations Report\" OR \"UN Report\") (\"Climate change\" OR \"Global warming\"))) lang:en -is:retweet"
> 
>     # The search_all method calls the full-archive search endpoint to get Tweets based on the query, start and end times
>     search_results = client.search_all(query=query, start_time=start_time, end_time=end_time, max_results=10)
> 
>     # Twarc returns all Tweets for the criteria set above, so we page through the results
>     # and append each page to the file as one line of JSON
>     for page in search_results:
>         with open("results.jsonl", "a", encoding="utf8") as json_file:
>             json.dump(page, json_file)
>             json_file.write("\n")
> 
> if __name__ == "__main__":
>     main()

The jsonl file looks fine.
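
One way to double-check the JSONL file before converting it is to read it back line by line and look at the tweet text. This is only a minimal sketch: it assumes each line holds one page of results with the tweets under a "data" key, and the helper name iter_tweet_texts is made up for illustration.

```python
import json

def iter_tweet_texts(path):
    """Yield the text of every tweet stored in a JSONL file of result pages."""
    with open(path, "r", encoding="utf8") as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            page = json.loads(line)
            # Assumption: the tweets of a page live in its "data" list
            for tweet in page.get("data", []):
                yield tweet.get("text", "")
```

Calling `list(iter_tweet_texts("results.jsonl"))` should give the readable text. Note that json.dump escapes non-ASCII characters by default (a curly apostrophe is written as \u2019), so the JSONL file itself is plain ASCII and tends to look fine in any viewer regardless of the encoding that viewer assumes.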

Then I use this code to convert the jsonl file to csv

# Convert the JSONL file to CSV; output_columns limits the CSV to the listed columns
from twarc_csv import CSVConverter

with open("results.jsonl", "r", encoding="utf8") as infile:
    with open("output.csv", "w", encoding="utf8") as outfile:
        converter = CSVConverter(infile, outfile, json_encode_all=False, json_encode_lists=True, json_encode_text=False, inline_referenced_tweets=True, allow_duplicates=False, batch_size=1000, output_columns="id,text,created_at,author_id")
        converter.process()

I get a CSV file, but the text in it is changed: it is no longer the clearly readable text I see in the jsonl file.

Does anyone have suggestions for how I can avoid getting such text: “Äôs” “‚Äú”?

Also, when I don’t specify the output columns in the jsonl-to-csv code, the CSV data is mangled. For example:

The ID column, for example, has a YouTube link, or says mDFw/featured. Similarly, in some rows, other columns have information that should be in another column.

Thank you!

What version of twarc-csv comes up when you list the installed packages with pip list?

Make sure you have the latest one with:

pip install --upgrade twarc-csv

Also, try running the csv command in the terminal rather than in code, and see if any errors are reported:

twarc2 csv results.jsonl output.csv

Also, how are you opening the CSV file? Using what software? It should be opened as a UTF-8 encoded file.
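
To illustrate why the encoding used when opening the file matters, here is a small sketch that reproduces text like "‚Äôs". It assumes the program opening the file decodes the UTF-8 bytes as Mac Roman; that is a guess about what happens here, not something confirmed in this thread.

```python
# Characters that are common in tweets: a curly apostrophe and curly quotes
original = "IPCC’s “report”"

# Simulate a program reading the UTF-8 bytes with the wrong (Mac Roman) codec
garbled = original.encode("utf-8").decode("mac_roman")
print(garbled)

# Reversing the mistake recovers the text, so nothing is lost in the file itself
repaired = garbled.encode("mac_roman").decode("utf-8")
print(repaired == original)
```

If that assumption holds, the CSV on disk is fine and only the way it is displayed is wrong.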

Thank you for your response, Igor!

I wanted to clarify that I am using Jupyter notebook-- I am not sure if that has any implications for running the command in the terminal as opposed to in code?

For twarc, I have version 2.3.10; and for twarc-csv, I have version 0.3.6

When I run

twarc2 csv results.jsonl output.csv

I get the following error:

  File "<ipython-input-16-53dc62dd766d>", line 1
    twarc2 csv results.jsonl output.csv
           ^
SyntaxError: invalid syntax

I forgot to answer your last question: I am using Microsoft Excel to open the CSV file. I believe it is able to open a UTF-8 encoded file?

Yes, those versions should be ok.

In Jupyter, to run twarc2 commands on the command line instead of in Python, I think the syntax is:

!twarc2 csv results.jsonl output.csv
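
For context, a notebook cell is parsed as Python, so a bare shell command is a syntax error, while the leading ! hands the line to the shell. In a plain Python script the rough equivalent is the subprocess module. A minimal sketch follows; the helper name run_cli is made up for illustration.

```python
import subprocess

def run_cli(args):
    """Run a command-line program and return its exit code, stdout, and stderr."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr

# A bare shell command is not valid Python, which is why the notebook cell failed
try:
    compile("twarc2 csv results.jsonl output.csv", "<cell>", "exec")
    is_valid_python = True
except SyntaxError:
    is_valid_python = False

print(is_valid_python)
```

Something like `run_cli(["twarc2", "csv", "results.jsonl", "output.csv"])` should then behave like the terminal command, with any error messages available in the returned stderr text.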

Excel should have a drop-down option for UTF-8 when importing or opening the CSV. See the Stack Overflow question "Is it possible to force Excel recognize UTF-8 CSV files automatically?"
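
A commonly suggested workaround is to save a copy of the CSV that starts with a UTF-8 byte order mark (BOM), which Excel generally uses to detect the encoding. The sketch below does that with Python's "utf-8-sig" codec; the helper name add_utf8_bom is made up, and I have not verified the behaviour on every Excel version.

```python
import shutil

def add_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 text file, writing a byte order mark at the start of the copy."""
    # Reading with utf-8-sig drops an existing BOM, so the copy never gets two.
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        shutil.copyfileobj(src, dst)
```

For example, `add_utf8_bom("output.csv", "output_excel.csv")` writes a second file (the output name here is arbitrary) and leaves the original untouched.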

Thanks again Igor! When I run the command, I get this message

This is strange because I did have to enter my bearer token to initialise the twarc client… and I don’t fully understand what to do for

Please enter your Bearer Token (leave blank to skip to API key configuration):

I have tried putting in the bearer token like this:

Bearer Token="xxx "

and also tried leaving it blank

Bearer Token=" "

But it stays stuck on *

Oh, you may have to run this in a terminal instead, it will be easier:

twarc2 configure

This sets up the command-line tool (commands in the terminal are run without the !).

Separately, if the compressed tweet file is small enough, you can also try attaching it to a new issue on the DocNow/twarc-csv GitHub issues page. I’ll try to reproduce the same error.

Thanks, Igor! Strangely, when I run this code, I get a syntax error.

Ah, this looks like it is still in the Jupyter interface. The twarc2 configure command has to be run inside a command prompt or terminal window instead.

Thanks Igor, and sorry about that! I was able to configure twarc using the terminal, but unfortunately the CSV is still mangled. I’ve attached it to a new issue at the GitHub link you provided.

Thanks!

I had a look, and I think the issue is not with the CSV file itself. Format-wise, it is valid. However, when importing it into some software, especially Excel, you may have to specify some extra parameters.

Unfortunately I don’t have Excel to test, but I think these instructions will apply here: "How to open CSV files safely with Microsoft Excel" (SupportAbility Knowledge Base). You need to explicitly import the CSV into Excel as Text only, and not allow it to mangle the data by trying to guess the format.

I would not use Excel for any analysis as it struggles with data types like this.
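
If you want to confirm for yourself that the file is structurally fine, one rough check is to parse it with a real CSV parser and count the fields in each row. Tweet text often contains line breaks and commas, which a valid CSV wraps in quotes; a tool that ignores the quoting will shift values into the wrong columns. This is only a sketch, and the helper name field_count_summary is made up.

```python
import csv

def field_count_summary(path):
    """Return (number of header fields, set of field counts seen in the data rows)."""
    with open(path, "r", encoding="utf8", newline="") as infile:
        reader = csv.reader(infile)
        header = next(reader)
        counts = {len(row) for row in reader}
    return len(header), counts
```

If the set contains a single value equal to the header length, every row has the expected number of columns, and the shifted columns are most likely introduced by the software opening the file.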

As an alternative, it should work with Pandas in jupyter notebooks:

import pandas as pd

df = pd.read_csv("output_Singal.csv")
df.describe(include='all')

Preview data:

df.sample(10)
                       id                created_at                                               text  ...                                        __twarc.url __twarc.version  Unnamed: 93
1470  1399791897615421448  2021-06-01T18:16:09.000Z  @TomPlesier @LouisCy72344053 @BarbaraGirouard ...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
102   1403384238590791681  2021-06-11T16:10:50.000Z  Climate Change &amp; Biodiversity loss are und...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
403   1402993239909564419  2021-06-10T14:17:09.000Z  But here's the new kicker.\n\nCovid created a ...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
2192  1399586234113028096  2021-06-01T04:38:55.000Z  #Stop violating the law in tribal areas... #BA...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
944   1398793161405874176  2021-05-30T00:07:32.000Z  @GeraldKutney @ejwwest @DaisyMcnice @Robin_Hag...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
928   1400003670524104711  2021-06-02T08:17:40.000Z  @fagandr1 @ejwwest @BarbaraGirouard @Dsp3ncr @...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
736   1400848750843772941  2021-06-04T16:15:43.000Z  @GeraldKutney @Homer4K @TinTincognito @donkose...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
549   1402233393966989313  2021-06-08T11:57:48.000Z  UN climate change experts say doing nothing to...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
145   1403565332955074560  2021-06-12T04:10:27.000Z  @suba_says @ger_Kreuz @WalterLapp @sherean @bu...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN
1341  1399809104156282881  2021-06-01T19:24:32.000Z  @Richard16022464 @TomPlesier @LouisCy72344053 ...  ...  https://api.twitter.com/2/tweets/search/all?ex...          2.3.10          NaN

[10 rows x 94 columns]
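
One concrete reason to import the ID columns as text: tweet IDs are 19-digit integers, and a spreadsheet that treats them as floating-point numbers cannot hold them exactly. The sketch below simulates that with a Python float, using an ID from the preview above; how Excel actually displays the value may differ.

```python
tweet_id = 1399791897615421448  # an ID from the preview above

# Simulate storing the ID as a double-precision floating-point number
as_float = float(tweet_id)
round_tripped = int(as_float)

print(tweet_id)
print(round_tripped)
print(round_tripped == tweet_id)  # False: the trailing digits are not preserved
```

In pandas the id column is read as a 64-bit integer, so it stays exact. If you prefer to keep every column as text, `pd.read_csv("output_Singal.csv", dtype=str)` should do that.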

If you want to run twarc commands in a Jupyter notebook, you should be able to preface them with an exclamation mark:

! twarc2 configure

Thanks a bunch for your help, Igor!
