Hi everyone. Thank you in advance for any help. I am having a lot of trouble working with Twarc2 output. My search runs and returns a JSONL file with about 200k tweets, but it doesn’t look like the pretty-printed JSON I see returned from a Postman query. I have tried numerous approaches to bring this JSONL file into R or work with it in Python, and all of them fail for one reason or another. I’m working on an M1 Mac, which I mention because I know there are some issues with pandas, NumPy, and other data science packages on that platform. I have been wrestling with this for about two weeks. Any help is much appreciated.
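For context on the "pretty JSON" point: a JSONL file holds one complete JSON value per line, so a single line can be parsed and re-indented to look like the Postman output. A minimal sketch, where `first_record_pretty` is a hypothetical helper name and not part of twarc:

```python
import json


def first_record_pretty(path):
    """Return the first non-empty line of a JSONL file as indented JSON text.

    Hypothetical helper for inspection only. Returns None for an empty file.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                # Each line is its own JSON document, so parse it on its own.
                record = json.loads(line)
                return json.dumps(record, indent=2, ensure_ascii=False)
    return None
```

Calling `print(first_record_pretty("hashtagTCH.jsonl"))` should show the top-level keys of one record without loading the whole 766M file into memory.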
I’ve been unable to take the CSV route to R: Twarc-csv fails with an error at 37% of the input file. I have rerun Twarc2 many times with different searches, but all the resulting files trigger the same error (pasted below).
Following various patterns found online, I’ve tried to use jsonlite and lapply in R:
library(jsonlite)
library(dplyr)
setwd("/Users/rogerschoen/")
lines <- readLines("hashtagTCH.jsonl")
lines <- lapply(lines, fromJSON)
lines <- lapply(lines, unlist)
x <- bind_rows(lines)
but this generates many errors of the type:
New names:
- data.referenced_tweets.type -> data.referenced_tweets.type...1
- data.referenced_tweets.id -> data.referenced_tweets.id...2
- data.referenced_tweets.type -> data.referenced_tweets.type...3
- data.referenced_tweets.id -> data.referenced_tweets.id...4
- data.referenced_tweets.type -> data.referenced_tweets.type...5
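The repeated `data.` prefix in those names suggests that each line of the file is a page with a top-level `data` key holding several tweets, not a single tweet. Under that assumption (I have not confirmed the exact layout), a Python sketch of turning pages into one flat row per tweet could look like the following. The names `iter_tweets` and `flatten` are made up for illustration, and nested lists are kept as JSON strings so that no column name repeats:

```python
import json


def iter_tweets(lines):
    """Yield individual tweet dicts from JSONL lines.

    Assumes each line is either a page like {"data": [tweet, ...]} or a
    single tweet object; anything else is outside this sketch.
    """
    for line in lines:
        if not line.strip():
            continue
        page = json.loads(line)
        data = page.get("data", page)
        if isinstance(data, dict):
            data = [data]
        yield from data


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys; lists become JSON strings."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        elif isinstance(value, list):
            row[name] = json.dumps(value, ensure_ascii=False)
        else:
            row[name] = value
    return row
```

Usage would be along the lines of `rows = [flatten(t) for t in iter_tweets(open("hashtagTCH.jsonl", encoding="utf-8"))]`, after which the rows can be written out with `csv.DictWriter` or handed to a data frame library.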
I also tried the code below in R (thanks to Michael Dow at U Montreal), which generates a “trailing garbage” error. I can’t figure out how to parse and edit my data to make it work:
# Replace '...' with the directory that has your files
setwd("/Users/rogerschoen/")
files = dir(pattern = "\\.jsonl$")
# As many JSON fields as you want go here, for instance:
cols = c(
  "id",
  "conversation_id",
  "referenced_tweets.replied_to.id",
  "referenced_tweets.retweeted.id",
  "referenced_tweets.quoted.id",
  "author_id",
  "in_reply_to_user_id",
  "retweeted_user_id",
  "quoted_user_id",
  "created_at",
  "text",
  "public_metrics.like_count",
  "public_metrics.quote_count",
  "public_metrics.reply_count",
  "public_metrics.retweet_count",
  "entities.hashtags",
  "author.id",
  "author.created_at",
  "author.username",
  "author.location"
)
for (i in seq_along(files)) {
  start = Sys.time()
  lines = readLines(files[i])
  # Parse each line, flatten it, and keep only the requested fields
  temp = do.call(
    rbind,
    lapply(
      lines, function(x)
        unlist(jsonlite::fromJSON(x))[cols]
    )
  )
  colnames(temp) = cols
  y = as.data.frame(temp)
  # save_as_csv() is assumed to come from the rtweet package, which must be loaded
  save_as_csv(y, sub("jsonl", "csv", files[i]), prepend_ids = T, fileEncoding = "UTF-8")
  end = Sys.time()
  message(files[i], ": ", round(difftime(end, start, units = "mins"), 2),
          " minutes for ", length(lines), " tweets")
  # file.remove(files[i])
  # Uncomment above if you want the jsonl files to be permanently deleted (like if space is an issue)
}
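One way to narrow down a "trailing garbage" error is to check whether every line of the file parses as exactly one JSON value. The sketch below does that with Python's standard `json` module, which reports "Extra data" when something follows the first value on a line. `find_bad_lines` is a hypothetical helper, and I am only assuming that this is the same condition jsonlite complains about:

```python
import json


def find_bad_lines(lines, limit=10):
    """Return (line_number, message) pairs for lines that are not valid JSON.

    Blank lines are skipped. Scanning stops after `limit` problems so a badly
    damaged file does not produce an enormous report.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((number, err.msg))
            if len(bad) >= limit:
                break
    return bad
```

For example, `find_bad_lines(open("hashtagTCH.jsonl", encoding="utf-8"))` returns an empty list if every line is well formed, which would point the problem at the R side and not at the file.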
The Twarc2 CSV error:
rogerschoen@Rogers-Macbook-Pro-2021 ~ % twarc2 csv hashtagTCH.jsonl TCH.csv
37%|█████████████████████▋ | Processed 286M/766M of input file [00:31<00:35, 14.2MB/s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/bin/twarc2", line 8, in <module>
    sys.exit(twarc2())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/twarc_csv.py", line 148, in csv
    writer.process()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 81, in process
    self._write_output(self.converter.process(batch), first_batch)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 65, in _write_output
    _df.to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 257, in save
    self._save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 262, in _save
    self._save_body()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 300, in _save_body
    self._save_chunk(start_i, end_i)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
    libwriters.write_csv_rows(
  File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
_csv.Error: need to escape, but no escapechar set
37%|█████████████████████▋ | Processed 286M/766M of input file [00:32<00:53, 9.32MB/s]
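For reference, the final message comes from Python's built-in `csv` module, which raises it when a field contains a character that would need escaping and the writer has no `escapechar`. The sketch below reproduces that message with the standard library alone. I do not know how twarc-csv or pandas configure their writer in this case, so this only illustrates what the message means; `write_row` and `demo` are hypothetical names:

```python
import csv
import io


def write_row(row, **writer_options):
    """Write one row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, **writer_options).writerow(row)
    return buffer.getvalue()


def demo():
    """Trigger the error with quoting disabled, then avoid it with escapechar."""
    row = ["1", "text with, a comma"]
    try:
        # With quoting off, the comma inside the field cannot be written safely.
        write_row(row, quoting=csv.QUOTE_NONE)
        failed = None
    except csv.Error as err:
        failed = str(err)
    # Supplying an escape character lets the writer emit the comma as "\,".
    escaped = write_row(row, quoting=csv.QUOTE_NONE, escapechar="\\")
    return failed, escaped
```

Since the failure always happens at the same point in the file (37%), it seems plausible that a particular tweet's text triggers it, though that is a guess.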