Hi everyone! What’s the best way to build a historical dataset using Postman? My query is ready, but since a single call returns at most 100 tweets, how can I scale this? Let’s say I am collecting tweets on ‘COVID and hoax’ from January to April 2021. Much appreciated!

I recommend using twarc instead of Postman. Postman is useful for debugging, but it is not built for data collection.

v2 Standard access will not be able to retrieve these tweets. If you are an academic, you can apply for academic access to the v2 full-archive endpoint; see Twitter Developer Access - twarc.

Thank you! Twarc2 was easy to set up. I think the search is still limited to a maximum number of tweets per call, right? How can I pipeline this to collect data for, let’s say, a few months? I’m on the academic research track.

To use academic search, add --archive to the twarc2 search command.

twarc2 search --archive "query" output.jsonl

twarc will paginate for you. With --archive it uses pages of 500 tweets by default, and it writes out all results. You can change the page size with --max-results if needed, or set a --limit on the total number of tweets. See twarc2 search --help for details.
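To make the pagination idea concrete, here is a minimal sketch of a "follow the next page token until done" loop. It is not twarc's actual code: `fake_search_page` and `collect` are hypothetical names, the API call is simulated locally, and the assumption that the limit is only checked after a whole page has been kept is taken from the behaviour described in this thread.

```python
from typing import Optional


def fake_search_page(all_ids: list, page_size: int, next_token: Optional[int]) -> dict:
    """Hypothetical stand-in for one API call.

    Returns one page of results plus a token pointing at the next page.
    The token is omitted when there are no more pages.
    """
    start = next_token or 0
    chunk = all_ids[start:start + page_size]
    end = start + len(chunk)
    meta = {"result_count": len(chunk)}
    if end < len(all_ids):
        meta["next_token"] = end
    return {"data": [{"id": str(i)} for i in chunk], "meta": meta}


def collect(all_ids: list, page_size: int = 500, limit: int = 0) -> list:
    """Follow next_token until results run out or the limit is reached.

    Assumption: the limit is checked only after a whole page has been
    kept, so the total can overshoot the limit by up to one page.
    A limit of 0 means "no limit".
    """
    collected = []
    token = None
    while True:
        page = fake_search_page(all_ids, page_size, token)
        collected.extend(page["data"])
        token = page["meta"].get("next_token")
        if token is None or (limit and len(collected) >= limit):
            break
    return collected


ids = list(range(1200))
print(len(collect(ids)))                            # all pages are followed
print(len(collect(ids, limit=300)))                 # one full page of 500 is kept
print(len(collect(ids, page_size=100, limit=300)))  # smaller pages stop closer to the limit
```

Under these assumptions, a small limit combined with a large page size returns a full page, which is one reason to lower --max-results when you only want a small sample.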


I am seeing some inconsistency with twarc2. I used the following query:

!twarc2 search "(vaccine OR jab OR vaxine) (-is:retweet) (lang:en)" --archive --start-time 2021-03-01T00:00:00 --end-time 2021-03-03T00:00:00 --limit 300 raw_output.json

!twarc2 flatten 'raw_output.json' 'flattened_output.json'

!twarc2 csv --output-columns "id,created_at,text" 'flattened_output.json' 'outputshort.csv'

I understand that this should only retrieve tweets containing the keywords vaccine, jab, or vaxine. However, I’m getting tweets that contain none of them (for example https://twitter.com/nyknicks/status/1366884819523801089 and https://twitter.com/jerusalemprayer/status/1366886289522495488).
Also, there were a few rows with tweet IDs but no text. I also got 750 rows even though I specified a limit of 300 tweets. What am I doing wrong?

There’s no need to run the twarc2 flatten step if you’re converting to CSV, unless you want that format yourself.

By default you will also get referenced tweets here: the original retweeted tweets, the tweets being replied to, and quoted tweets are all included. To remove these, add --no-include-referenced-tweets. This may also explain the number of tweets, although generally you get more than the limit because of the page size (500 tweets, which you can set in twarc2 search with --max-results; see twarc2 (en) - twarc).

Also, there were few rows with tweet IDs but no text.

This might be an error though. Can you upload the raw_output.json or flattened_output.json in an issue at Issues · DocNow/twarc-csv · GitHub?
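Before filing the issue, it may help to pin down exactly which rows are affected. The sketch below is one way to list the IDs of rows with an empty text column, using only the standard library. `rows_missing_text` is a hypothetical helper, and the sample data is made up; only the column names (id, created_at, text) come from the command above.

```python
import csv
import io


def rows_missing_text(csv_file) -> list:
    """Return the ids of rows whose text column is empty or only whitespace."""
    reader = csv.DictReader(csv_file)
    return [row["id"] for row in reader if not (row.get("text") or "").strip()]


# Made-up sample standing in for outputshort.csv
sample = io.StringIO(
    "id,created_at,text\n"
    "1,2021-03-01T00:00:01.000Z,first jab today\n"
    "2,2021-03-01T00:00:02.000Z,\n"
    "3,2021-03-01T00:00:03.000Z,vaccine news\n"
)
print(rows_missing_text(sample))  # ['2']
```

For the real file, the same function could be called on `open("outputshort.csv", newline="", encoding="utf-8")`, and the resulting IDs included in the issue.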

Thank you! Just to clarify, this includes replies to tweets with my keywords even if the replies are totally irrelevant to the main tweet?

Sure!

This should go with the query right? Because I’m getting Error: no such option: --no-include-referenced-tweets.

Oh, sorry, I got my own command line wrong. It’s --no-inline-referenced-tweets, so:

twarc2 csv --output-columns "id,created_at,text" --no-inline-referenced-tweets flattened_output.json outputshort.csv

Just to clarify, this includes replies to tweets with my keywords even if the replies are totally irrelevant to the main tweet?

The referenced tweets are the ones returned in the includes part of the API response. So if a tweet that matched your query was in reply to another tweet, that other tweet is included by default. So yes, they may not match your query at all. With --no-inline-referenced-tweets, the CSV contains only the tweets matching your query.
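As a rough illustration of why unrelated tweets show up, here is a sketch of a response page with matching tweets under data and referenced tweets under includes. The page contents are made up, and `csv_rows` is a hypothetical function that only mimics the idea of inlining referenced tweets; it is not how twarc-csv is implemented.

```python
# Made-up response page in the general shape of a v2 search result:
# matching tweets under "data", expanded (referenced) tweets under "includes".
page = {
    "data": [
        {
            "id": "10",
            "text": "@someone got my vaccine today",
            "referenced_tweets": [{"type": "replied_to", "id": "7"}],
        },
        {"id": "11", "text": "second jab booked"},
    ],
    "includes": {
        "tweets": [{"id": "7", "text": "great game last night"}],
    },
}


def csv_rows(page: dict, inline_referenced: bool = True) -> list:
    """Return (id, text) rows, optionally adding the referenced tweets."""
    rows = list(page.get("data", []))
    if inline_referenced:
        seen = {tweet["id"] for tweet in rows}
        extra = page.get("includes", {}).get("tweets", [])
        # Referenced tweets were never checked against the query.
        rows += [tweet for tweet in extra if tweet["id"] not in seen]
    return [(tweet["id"], tweet["text"]) for tweet in rows]


print(len(csv_rows(page)))                           # 3 rows, one of them off-topic
print(len(csv_rows(page, inline_referenced=False)))  # 2 rows, both matched the query
```

In this toy page, tweet 7 never matched the query; it appears only because tweet 10 replied to it, which is the same effect as the off-topic rows in the CSV.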

This is exactly what I needed, thank you so much! :)

Btw, is it the same as using -is:quote -is:reply in the query? I’m wondering whether this would help limit the number of tweets I’m fetching.

No, not exactly. Those operators modify the search query, so they apply to the tweets you retrieve.

--no-inline-referenced-tweets controls whether the extra tweets found in expansions are included in the CSV.

But adding -is:quote -is:reply would give you fewer results. Not sure if -is:quote makes sense here (I would consider quote retweets the same as original tweets), but it depends entirely on what analysis you’re carrying out.
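If you want to try the query with and without those operators, a small helper can keep the variants consistent. This is only a sketch: `build_query` is a hypothetical function that assembles the query string, and the result would still be passed to twarc2 search as usual.

```python
def build_query(keywords, lang=None, exclude=()):
    """Assemble a search query string from keywords, a language, and -is: exclusions."""
    parts = ["(" + " OR ".join(keywords) + ")"]
    if lang:
        parts.append(f"lang:{lang}")
    parts += [f"-is:{kind}" for kind in exclude]
    return " ".join(parts)


keywords = ["vaccine", "jab", "vaxine"]

# Same filter as the query used earlier in the thread
print(build_query(keywords, lang="en", exclude=["retweet"]))

# Narrower variant that also drops replies at search time
print(build_query(keywords, lang="en", exclude=["retweet", "reply"]))
```

The second variant changes which tweets are retrieved at all, whereas --no-inline-referenced-tweets only changes which of the retrieved tweets end up as CSV rows.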
