I am trying to extract tweets that include words from both of two groups, (group1_word1, group1_word2) and (group2_word1, group2_word2), using twarc2. I have around 300 words in group1 and 300 words in group2. I am thinking of writing a custom function to run through all the words/pairs, but wanted to double-check whether there is an easier way or a ready-made function in twarc2.

Also, I am trying to run twarc2 as a Python library. I’ve noticed that my final CSV output differs when I extract tweets on my Mac (twarc csv 0.3.8) and on Windows (twarc csv 0.6.0). The older twarc csv version gives me a higher number of tweets (with the specification extra_input_columns="edit_history_tweet_ids"), whereas the newer version on Windows produces a CSV file with every other row empty (and it does not let me use the same specification for extra input columns). I am not sure what the issue is, or whether it is somehow related to using Windows (which I will eventually need to use to extract all the data).

I recommend using the latest version of twarc-csv. The extra tweets are referenced tweets: for example, if your query matched a reply, the original tweet being replied to is also included. In the new version this is optional, via the --inline-referenced-tweets option. In the old one it was on by default, which is why the counts differ.

Twarc doesn’t have query building apart from the searches command, where you can specify a text file with a bunch of searches that it will automatically batch together and run.

This is all for the command line however.
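For the Python side, one possible way to produce such a queries file from two word lists is sketched below. This is not part of twarc: `or_groups` and `build_queries` are hypothetical helper names, and the `max_len` default of 1024 is an assumed query length limit that you should check against your own access level. The idea is to split each list into OR groups and pair every group from the first list with every group from the second, so that each query stays under the limit.

```python
from itertools import product


def quote(term):
    """Wrap multi-word phrases in straight double quotes."""
    return f'"{term}"' if " " in term else term


def or_groups(terms, budget):
    """Split terms into '(a OR b OR ...)' groups no longer than budget characters."""
    groups, current = [], []
    for term in map(quote, terms):
        candidate = current + [term]
        if current and len("(" + " OR ".join(candidate) + ")") > budget:
            # Adding this term would exceed the budget, so close the current group.
            groups.append("(" + " OR ".join(current) + ")")
            current = [term]
        else:
            current = candidate
    if current:
        groups.append("(" + " OR ".join(current) + ")")
    return groups


def build_queries(group1, group2, suffix="", max_len=1024):
    # Assumption: half of the remaining character budget goes to each word group.
    budget = (max_len - len(suffix) - 2) // 2
    return [
        " ".join(part for part in (a, b, suffix) if part)
        for a, b in product(or_groups(group1, budget), or_groups(group2, budget))
    ]


queries = build_queries(
    ["cat", "pretty dog"], ["monkey", "big tiger"], suffix="lang:en -is:retweet"
)
print("\n".join(queries))
```

Writing the resulting lines to a text file, assuming one query per line, should give something the searches command can read.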

Do follow up with how you solve this though, maybe we can incorporate it or make things easier in future!

Thanks Igor, I will follow up once I get further with the code. I wanted to double-check another issue that has previously been discussed on the forum. On my Windows laptop I sometimes get the error "UnicodeEncodeError: 'charmap' codec can't encode characters" (other times the same code produces an empty CSV file without showing any error).

I did follow your advice from the forum to include encoding="utf-8"; however, sometimes I get the Twitter data and sometimes I still end up with an empty CSV file for the same query (so adding the utf-8 specification does not work all the time).

However, adding the following code (see the screenshot) gives me consistent output with Twitter data in the CSV file. I want to double-check that it is correct and does not mess anything up (since I did not see any suggestion like that on the forum).

Thanks

Try replacing

open("vax_results.jsonl", "r") as infile

with:

open("vax_results.jsonl", "r", encoding="utf-8")

and

open("vax_output.csv", "w")

with

open("vax_output.csv", "w", encoding="utf-8")

Does that work?
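A possible explanation for the every-other-row-empty CSV on Windows, offered as an assumption rather than a diagnosis: Python's csv documentation recommends opening CSV files with newline="", because otherwise Windows can write an extra carriage return per row, which shows up as blank rows. The sketch below uses only the standard library, and demo_output.csv is a made-up file name.

```python
import csv

rows = [["id", "text"], ["1", "vaccin\u00e9 \U0001F489"], ["2", "second tweet"]]

# newline="" stops Windows from turning "\r\n" into "\r\r\n" (seen as blank rows);
# encoding="utf-8" avoids the 'charmap' UnicodeEncodeError for non-ASCII characters.
with open("demo_output.csv", "w", encoding="utf-8", newline="") as outfile:
    csv.writer(outfile).writerows(rows)

# Read the file back with the same settings to confirm nothing was lost or added.
with open("demo_output.csv", "r", encoding="utf-8", newline="") as infile:
    read_back = list(csv.reader(infile))

print(len(read_back))
```

If the row count read back matches what was written, the same two arguments may be worth adding to the open() call for vax_output.csv.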

You can also run command-line commands from a Jupyter notebook:

In a new cell:

!twarc2 csv --json-encode-all vax_results.jsonl vax_output.csv

should give you the same thing.

Hi Igor @IgorBrigadir,

So we are working to extract multiple queries (with two groups of words occurring at the same time; thus the double parentheses) using a txt file and the twarc library. It seems like I can append pages within one JSON file (for the same query) and convert it to CSV. But I am not sure whether an additional step is needed to append all the JSON files (since we have multiple queries). The code seems to be working, but I am not sure it handles multiple JSON files correctly and saves all the tweets. If you could point me to any resources/issues, that would be super helpful!

Also, do you happen to know whether I am using escape characters correctly for this phrase:
(\"johnson & johnson\'s\")(word1, word2) ?

The query for these two groups of words does not work, while all the other queries do (everything is the same except for \"johnson & johnson\'s\").

Do you have a sample list of your queries?

So we are working to extract multiple queries (with two groups of words occurring at the same time; thus double parenthesis)

(\"johnson & johnson's\")(word1, word2)

In that case, in the text file, that should be:

("johnson johnson s") (word1 word2)

Ampersands (&) and apostrophes (') are stripped from text in search, so they should be removed. Quotes are not escaped when writing queries in a text file, only when writing them on the command line. There should also be a space between the two parenthesized groups, which acts as an implicit logical AND. There are no commas in queries; if you want a logical OR, write it like (word1 OR "two words" OR word3), quoting any phrase of two or more words.
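If the word lists live in Python, the same cleanup could be scripted. A minimal sketch, with `clean_term` and `or_group` as hypothetical helper names that are not part of twarc, and assuming the terms inside each group should be combined with OR:

```python
import re


def clean_term(term):
    """Drop characters that search ignores (& and apostrophes) and collapse spaces."""
    words = re.sub(r"[&'\u2019]", " ", term).split()
    text = " ".join(words)
    # Multi-word phrases need straight double quotes; single words do not.
    return f'"{text}"' if len(words) > 1 else text


def or_group(terms):
    """Join cleaned terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(clean_term(t) for t in terms) + ")"


# A space between the two groups acts as the implicit AND.
query = or_group(["johnson & johnson's"]) + " " + or_group(["word1", "two words", "word3"])
print(query)
```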

Instead of the above Python code, you can do the exact same thing on the command line:

!twarc2 searches --archive --start-time "2021-06-28" --end-time "2021-11-28" queries_auto.txt vax_results.jsonl
!twarc2 csv vax_results.jsonl vax_output.csv

That’s if you want everything in one file.

To write to separate files using your code, change the line:

with open("vax_results.jsonl", "a+") as f:

to something unique to the query, since you’re enumerating them anyway, like

with open(f"vax_results_query_{count}.jsonl", "a+") as f:

Then you’ll have to change the open("vax_results.jsonl", "r") and open("vax_output.csv", "a") below to the same file: open(f"vax_results_query_{count}.jsonl", "r") and open(f"vax_output_query_{count}.csv", "a") when converting to CSV. And then indent all the CSV code from converter down to be under the main for .. enumerate() loop, to get a bunch of different files like:

vax_results_query_0.jsonl
vax_output_query_0.csv
vax_results_query_1.jsonl
vax_output_query_1.csv
vax_results_query_2.jsonl
vax_output_query_2.csv
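The naming scheme above can be sketched on its own, without the search or conversion calls. `per_query_paths` and queries_demo.txt are hypothetical names used only for this illustration; the actual search and CSV code would sit inside the loop.

```python
from pathlib import Path


def per_query_paths(queries_file, out_dir="."):
    """Yield (count, query, jsonl_path, csv_path) for each non-empty line."""
    lines = Path(queries_file).read_text(encoding="utf-8").splitlines()
    queries = [line.strip() for line in lines if line.strip()]
    for count, query in enumerate(queries):
        jsonl_path = Path(out_dir) / f"vax_results_query_{count}.jsonl"
        csv_path = Path(out_dir) / f"vax_output_query_{count}.csv"
        yield count, query, jsonl_path, csv_path


# Demo with a throwaway queries file (hypothetical name and contents).
Path("queries_demo.txt").write_text("(cat) (monkey)\n\n(dog) (lion)\n", encoding="utf-8")
for count, query, jsonl_path, csv_path in per_query_paths("queries_demo.txt"):
    # The search and the CSV conversion for this query would go here.
    print(count, query, jsonl_path.name, csv_path.name)
```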

Thank you so much Igor! Yes, everything seems to be working. A sample query would look something like (specific words have been substituted, and we have around 20 terms in the second parenthesis):

(cat) ("pretty dog" OR "very big lion" OR "big tiger" OR monkey) -"small bird" -wolf -bee -#smallbird -#wolf -#bee lang:en -is:retweet

Please let me know if there is anything problematic about this sample query that will be placed in a txt file. Thanks again!

Thanks! Yes, that looks OK to me: it has the right spaces, parentheses, etc.
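With a few hundred generated queries, a rough automated check of each line might help before running the searches. The sketch below covers only the points raised in this thread (straight quotes, a space between groups, no commas, balanced parentheses); `check_query` is a hypothetical helper, the 1024-character limit is an assumption, and a comma inside a quoted phrase would be flagged even though it may be harmless.

```python
def check_query(query, max_len=1024):
    """Return a list of problems found in one query line (empty list if none found)."""
    problems = []
    if any(ch in query for ch in "\u201c\u201d\u2018\u2019"):
        problems.append("curly quotes")
    if query.count('"') % 2:
        problems.append("unbalanced double quotes")
    depth = 0
    for ch in query:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            # A closing parenthesis appeared before its opening one.
            break
    if depth != 0:
        problems.append("unbalanced parentheses")
    if ")(" in query:
        problems.append("no space between groups")
    if "," in query:
        problems.append("comma used instead of OR")
    if len(query) > max_len:
        problems.append("too long")
    return problems


sample = '(cat) ("pretty dog" OR monkey) -"small bird" -wolf lang:en -is:retweet'
print(check_query(sample))
```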

perfect, thanks again!