Hi everyone. Thank you in advance for any help. I am having a lot of trouble working with Twarc2 output. My search runs and returns a jsonl with about 200k tweets but it doesn’t look like the pretty json I see returned in a Postman query. I have tried numerous approaches to bring this jsonl into R or work with it in Python and all of them fail for one reason or another. I’m working on an M1 Mac, which I mention because I know there are some issues with Pandas and Numpy and other data science packages. I have been wrestling with this for about two weeks. Any help much appreciated.

I’ve been unable to take the CSV route to R: Twarc-csv generates an error code at 37%. I have rerun Twarc2 many times with different searches but all the files generate the same error (pasted below).

Following various patterns found online, I’ve tried to use jsonlite and lapply in R:
library(jsonlite)
library(dplyr)
setwd("/Users/rogerschoen/")
lines <- readLines("hashtagTCH.jsonl")
lines <- lapply(lines, fromJSON)
lines <- lapply(lines, unlist)
x <- bind_rows(lines)

but this generates many errors of the type:
New names:

  • data.referenced_tweets.type -> data.referenced_tweets.type...1
  • data.referenced_tweets.id -> data.referenced_tweets.id...2
  • data.referenced_tweets.type -> data.referenced_tweets.type...3
  • data.referenced_tweets.id -> data.referenced_tweets.id...4
  • data.referenced_tweets.type -> data.referenced_tweets.type...5
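
For anyone who would rather do this step in Python: a minimal sketch of reading the JSONL one line at a time and keeping only a few fields per tweet. It assumes each line of the twarc2 output is one API response whose "data" key holds the tweets (the data.referenced_tweets names in the messages above suggest this), and that "data" may be either a list or a single object. The function names iter_tweets and tweet_row and the choice of fields are illustrative, not part of twarc.

```python
import json
from typing import Any, Dict, Iterator


def iter_tweets(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one tweet dict at a time from a JSONL file of API responses."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            page = json.loads(line)
            # Assumption: tweets sit under "data"; fall back to the line itself
            # if the file was already flattened to one tweet per line.
            data = page.get("data", page)
            tweets = data if isinstance(data, list) else [data]
            for tweet in tweets:
                yield tweet


def tweet_row(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a handful of scalar fields so every row has the same columns."""
    metrics = tweet.get("public_metrics", {})
    return {
        "id": tweet.get("id"),
        "author_id": tweet.get("author_id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "like_count": metrics.get("like_count"),
        "retweet_count": metrics.get("retweet_count"),
    }
```

Because every row has the same scalar columns, repeated nested fields such as referenced_tweets never turn into duplicate column names. Something like rows = [tweet_row(t) for t in iter_tweets("hashtagTCH.jsonl")] would then give a list that can be written out with the csv module.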

I also tried the below in R (thanks to Michael Dow at U Montreal) which generates a “trailing garbage” error and I can’t figure out how to parse and edit my data to make it work:

# Replace '...' with the directory that has your files
setwd("/Users/rogerschoen/")
files = dir(pattern = "\\.jsonl$")

# save_as_csv() below comes from the rtweet package
library(rtweet)

# As many JSON fields as you want go here, for instance:
cols = c(
  "id",
  "conversation_id",
  "referenced_tweets.replied_to.id",
  "referenced_tweets.retweeted.id",
  "referenced_tweets.quoted.id",
  "author_id",
  "in_reply_to_user_id",
  "retweeted_user_id",
  "quoted_user_id",
  "created_at",
  "text",
  "public_metrics.like_count",
  "public_metrics.quote_count",
  "public_metrics.reply_count",
  "public_metrics.retweet_count",
  "entities.hashtags",
  "author.id",
  "author.created_at",
  "author.username",
  "author.location"
)

for (i in seq_along(files)) {
  start = Sys.time()
  lines = readLines(files[i])
  # Parse each line, flatten it, and keep only the requested fields
  temp = do.call(
    rbind,
    lapply(
      lines, function(x)
        unlist(jsonlite::fromJSON(x))[cols]
    )
  )
  colnames(temp) = cols
  y = as.data.frame(temp)
  save_as_csv(y, sub("\\.jsonl$", ".csv", files[i]), prepend_ids = T, fileEncoding = "UTF-8")
  end = Sys.time()
  message(files[i], ": ", round(difftime(end, start, units = "mins"), 2),
          " minutes for ", length(lines), " tweets")

  # file.remove(files[i])
  # Uncomment above if you want the jsonl files to be permanently deleted (like if space is an issue)
}

The Twarc2 CSV error:

rogerschoen@Rogers-Macbook-Pro-2021 ~ % twarc2 csv hashtagTCH.jsonl TCH.csv
 37%|█████████████████████▋ | Processed 286M/766M of input file [00:31<00:35, 14.2MB/s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/bin/twarc2", line 8, in <module>
    sys.exit(twarc2())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/twarc_csv.py", line 148, in csv
    writer.process()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 81, in process
    self._write_output(self.converter.process(batch), first_batch)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 65, in _write_output
    _df.to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 257, in save
    self._save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 262, in _save
    self._save_body()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 300, in _save_body
    self._save_chunk(start_i, end_i)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
    libwriters.write_csv_rows(
  File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
_csv.Error: need to escape, but no escapechar set
 37%|█████████████████████▋ | Processed 286M/766M of input file [00:32<00:53, 9.32MB/s]
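
For context on the last line of that traceback: a small sketch of one general condition under which Python's csv module raises this message. It is not a diagnosis of why twarc-csv hit it here. The helper name try_write and the sample row are made up for illustration.

```python
import csv
import io


def try_write(row, **writer_options):
    """Write one row to an in-memory CSV; return the text or the error message."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, **writer_options)
    try:
        writer.writerow(row)
    except csv.Error as error:
        return f"error: {error}"
    return buffer.getvalue()


row = ["1", 'a "quoted" tweet, with a comma']

# Quoting disabled and no escape character: the comma cannot be represented.
no_quoting = try_write(row, quoting=csv.QUOTE_NONE)

# Same settings plus an escape character: the writer can escape the comma.
with_escape = try_write(row, quoting=csv.QUOTE_NONE, escapechar="\\")

# Default settings: the field is wrapped in quotes and inner quotes are doubled.
default = try_write(row)

print(no_quoting)
print(default, end="")
```

The first call reports "need to escape, but no escapechar set"; the other two succeed. In other words, the writer met a character it had to protect and had neither quoting nor an escape character available to do so.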

If you could narrow down the chunk of data that fails, that would help a lot. I can’t reproduce this yet, so it would help if I could get the line from the file that fails. Maybe split the file in half, run twarc2 csv on both parts, and keep splitting the failing part like that until you have a reasonably sized chunk you can zip and attach to this issue: "twarc2 csv _csv.Error: need to escape, but no escapechar set", Issue #37, DocNow/twarc-csv on GitHub.

To split it, maybe try:

split -n l/2 hashtagTCH.jsonl hashtagTCH_parts

Or DM me a link where I can download it, and I’ll try to fix it later.
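
If the split on hand does not accept -n l/2 (that form is GNU coreutils syntax, and the split that ships with macOS may not support it), the same halving can be sketched in Python. This is a minimal sketch: the function name split_in_half and the _part1/_part2 output names are illustrative choices, and lines are copied as raw bytes so nothing is re-encoded.

```python
from pathlib import Path


def split_in_half(path):
    """Split a line-oriented file into two files with about equal line counts."""
    source = Path(path)

    # First pass: count lines without loading the whole file into memory.
    with source.open("rb") as handle:
        total = sum(1 for _ in handle)

    first = source.with_name(source.stem + "_part1" + source.suffix)
    second = source.with_name(source.stem + "_part2" + source.suffix)
    cutoff = (total + 1) // 2  # the first half gets the extra line if odd

    # Second pass: copy each line, unchanged, into one of the two halves.
    with source.open("rb") as handle, first.open("wb") as out1, second.open("wb") as out2:
        for index, line in enumerate(handle):
            (out1 if index < cutoff else out2).write(line)

    return first, second
```

Calling split_in_half("hashtagTCH.jsonl") would write hashtagTCH_part1.jsonl and hashtagTCH_part2.jsonl next to the original. Running twarc2 csv on each part and then splitting whichever part fails narrows the problem down by half each round.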

I’ll do both! Thank you!

I got hold of the file, but unfortunately I have not been able to reproduce the error. I suspect it may be something in the environment: I have Python 3.8.6, and pip list shows:

twarc                      2.8.1
twarc-csv                  0.5.1
pandas                     1.3.3

So maybe it’s worth trying to install those exact versions to check:

pip install --upgrade twarc==2.8.1 twarc-csv==0.5.1 pandas==1.3.3

Try those versions in your Python environment first, and if that fails, maybe try Python 3.9.1 (I think this one should also work with M1 chips?).

To install different versions of Python on your system, I recommend pyenv (GitHub: pyenv/pyenv, "Simple Python version management").

I could not get pyenv installed and python to build successfully until I tried running the terminal under Rosetta 2. That worked and I have 3.8.6 and those versions of pandas, twarc and twarc-csv running under pyenv.

Now, when I run twarc csv tweets.jsonl tweets.csv I get

twarc: error: argument command: invalid choice: 'csv' (choose from 'configure', 'dehydrate', 'filter', 'followers', 'friends', 'help', 'hydrate', 'replies', 'retweets', 'sample', 'search', 'timeline', 'trends', 'tweet', 'users', 'listmembers', 'version')

When I try

twarc2 csv tweets.jsonl tweets.csv

I get

Error: No such command 'csv'.

I have tried uninstalling and reinstalling twarc-csv but I still get the error.

Oh! that’s progress, i think it should work if you:

pip3 install --upgrade twarc twarc-csv

(pip3 usually installs to the right place. Because there can be several Python environments on one machine, I think it should be possible to tell which one is in use by running "which python" or "which pip3".)

And the command is twarc2 csv not twarc csv
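
One way to see which environment is actually in play, sketched in Python using only the standard library. The function name describe_environment is illustrative; the package names are the ones from the pip list earlier in this thread.

```python
import shutil
import sys
from importlib import metadata


def describe_environment(packages=("twarc", "twarc-csv", "pandas")):
    """Report the running interpreter, the twarc2 found on PATH, and package versions."""
    report = {
        "python": sys.executable,                 # interpreter running this script
        "twarc2_on_path": shutil.which("twarc2"),  # None if the shell cannot find it
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed for this interpreter
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it with the same python3 that pip3 belongs to. If the twarc2 found on PATH lives under a different Python installation than the interpreter shown, or twarc-csv comes back as None, that would be consistent with a twarc2 that does not know about the csv command.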

yes! progress! I’m very grateful for your help.

I thought it might be a location issue but twarc and twarc-csv are both in the same place. When I try

pip3 install --upgrade twarc twarc-csv

I get

pip3 install --upgrade twarc twarc-csv    
Requirement already up-to-date: twarc in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (2.8.1)
Requirement already up-to-date: twarc-csv in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (0.5.1)

Twarc searches run fine.

oh, that’s strange - does this work?

python3 -m twarc csv tweets.jsonl tweets.csv

(confusingly it’s twarc for twarc2 commands when running it as python -m twarc ...)

YES! Perfectly. Thank you again.

Thanks! I’ll get around to fixing it so it will work with Python 3.10 later. The error with twarc2 not working is something to do with the PATH / environments etc. I don’t have a Mac so I can’t test it, and as far as I know it does things slightly differently to Linux.