Hi everybody,
for my master's thesis I need to collect a large set of Tweets. Unfortunately, my Academic Access application was not approved, because in Europe it is not common to have a profile on your department's website. So, I will buy a subscription for the full archive search.
I want to use twarc and have familiarized myself with it. I have already downloaded some sample Tweets with the full archive query.
For my research, I need to download Tweets from March 11, 2020 until September 1, 2021 from a list of 270 accounts. My first question is: how can I write my code so that I don't need to include all 270 user names (or user IDs) in the query? My second question is: is there a way to loop through my code so that when I reach a rate limit, the code "sleeps" for some time and then runs again?
I am a social scientist and not really familiar with programming, so please excuse me if my questions sound a bit obvious …
Any help is highly appreciated!
Twarc will handle rate limits and waiting for you. If you are going to use the paid full archive search, you will be using the v1.1 API (twarc1 (en) - twarc), and it may be way more restrictive. But maybe you could crawl the full timelines (which can go back to the last 3,200 tweets) for each user and then use the premium search to fill in the gaps.
You don’t need any programming skill to use twarc, but you do need to know how to write commands in a terminal.
If it helps: if you got rejected, maybe your supervisor or professor can apply instead (Twitter Developer Access - twarc); sometimes that's a way around it.
Thanks so much @IgorBrigadir!! I will try to do that then!
Is there any way I could gather all the user names or IDs in one vector and then write it into the search query? Or do I have to write them all out?
This is one of those cases where knowing some code can really help: if you have all the names in an array, you could do it.
In python something like:
' OR '.join(list_of_names)
But you also have to make sure it's not over the query length limit for the endpoint you're calling (Premium full archive search allows 1,024 characters with a paid subscription, 256 with Sandbox).
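Putting those two steps together, a minimal Python sketch might look like this (the account names are placeholders, not from the thread, and the `from:` prefix is added so the query matches tweets authored by each account):

```python
# Placeholder account names -- replace with your own list of 270.
accounts = ["username_1", "username_2", "username_3"]

# Prefix each name with the from: operator and join with OR.
query = " OR ".join(f"from:{name}" for name in accounts)
print(query)  # from:username_1 OR from:username_2 OR from:username_3

# Check against the premium full archive limit (1,024 chars on paid).
assert len(query) <= 1024, "query too long for the endpoint"
```

You can then paste the printed string into the `twarc search` command.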
Alternatively, this is significantly easier with twarc2, but you may not get older tweets for the users (unless you get Academic Access at some point):
twarc2 timelines one_user_per_line.txt output.jsonl
You need to make a file with 1 user name or ID per line, and then run that.
That will give you the latest 3,200 tweets for each user.
And
twarc2 timelines --use-search one_user_per_line.txt output.jsonl
will get all tweets (works with Academic Access only).
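If it helps, the input file can be generated from a list in a few lines of Python (a sketch; the account names are just examples, and the filename matches the commands above):

```python
# Placeholder account names -- replace with your own list.
accounts = ["username_1", "username_2", "username_3"]

# Write one account name per line, as `twarc2 timelines` expects.
with open("one_user_per_line.txt", "w") as f:
    f.write("\n".join(accounts) + "\n")

# Quick check: the file really has one name per line.
with open("one_user_per_line.txt") as f:
    assert [line.strip() for line in f] == accounts
```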
Thanks so much @IgorBrigadir! Your help is very much appreciated!
I have another question: how can twarc continue collecting Tweets while my laptop is offline for a while? Does it restart my request automatically, or do I have to start it again? And how can I prevent the call from collecting Tweets twice?
Or would cloud computing be a solution to that?
Yes, a server in the cloud would work better. twarc will always attempt to recover from connection failures and retry requests for you, but if interrupted, or if you stop twarc and start again, it won't resume by default; it will start again from scratch. We're going to add some features to make this easier in future (Recover from stream failure with backfill · Issue #477 · DocNow/twarc · GitHub), but right now the best thing to do is to install twarc on a server and let it run there. I've had an instance running for a very long time without issues.
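If a run does get restarted and you end up with overlapping output files, duplicates can be removed afterwards by tweet ID. A rough Python sketch (the sample data and function name are made up; v1.1 tweet objects carry the ID in `id_str`):

```python
import json

def dedupe_tweets(lines):
    """Keep only the first occurrence of each tweet ID."""
    seen = set()
    unique = []
    for line in lines:
        tweet = json.loads(line)
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            unique.append(line)
    return unique

# Example: two overlapping "runs" of line-delimited JSON.
run_1 = [json.dumps({"id_str": "1", "text": "a"}),
         json.dumps({"id_str": "2", "text": "b"})]
run_2 = [json.dumps({"id_str": "2", "text": "b"}),
         json.dumps({"id_str": "3", "text": "c"})]

merged = dedupe_tweets(run_1 + run_2)
print(len(merged))  # 3
```

In practice you would read the lines from your `.jsonl` files instead of the inline examples.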
Okay thanks! Can you recommend a provider (preferably cheap)?
Sometimes your institution will have a relationship or deal with a large cloud provider like AWS or GCP or Azure - or your department may already have servers you can use - so it’s worth asking around in case there’s already a way to do this.
Otherwise, Linode or DigitalOcean are both very good for this.
Okay great. I will probably use GCP.
Another question related to twarc1: is it possible to include retweets in my search query?
twarc --app_auth search 'from: username_1 OR username_2' --fullarchive "my_dev_env" --sandbox --limit 500 > archivetest_tweets.jsonl
In that command retweets should already be included; it is not possible to exclude retweets at the sandbox access level.
When using sandbox, it's a good idea to try queries on the 30day endpoint instead and limit it to 100 tweets. Also, there is a space in the from: query that will break it; try:
twarc --app_auth search 'from:username_1 OR from:username_2' --30day "my_dev_env" --sandbox --limit 100 > archivetest_tweets.jsonl
(Provided the environment exists for the 30day endpoint too)
Also, your original query would search for any tweets authored by username_1, or any tweets by anyone mentioning username_2. If you want tweets authored by username_2, you need to specify:
from:username_1 OR from:username_2
as the query. See Operators by product | Docs | Twitter Developer Platform for all possible operators.
And always specify --from_date and --to_date when using --fullarchive, because otherwise you only get results from the last 30 days by default. twarc1 (en) - twarc
This is the command I wanted to use for my search. I have upgraded my full archive search to premium:
twarc --app_auth search 'from:chinafrance OR from:chinafrica1 OR from:chinaganda OR from:xhportugues OR from:xhchinenouvelle OR from:chinanewsweek OR from:dxinjiang OR from:dostifm98 OR from:ednewschina OR from:fullframecgtn OR from:globaltimesnews OR from:globaltimesrus OR from:gtopinion OR from:guangming_daily OR from:huxijingt OR from:huxijin_gt OR from:huanxinzhao OR from:jtao98 OR from:pdchinabusiness OR from:mundo_china OR from:xinhuachinese OR from:pcchinese OR from:peoplesdailyapp OR from:pdchina OR from:pdoaus OR from:phoenixtvusa OR from:phoenixtvhk OR from:puebloenlnea OR from:qingqingparis OR from:frenchrenmin OR from:renmindeutsch OR from:shen_shiwei OR from:globaltimesbiz OR from:thepapercn OR from:thouse_opinions OR from:cd_visual OR from:wangguanbeijing OR from:hongfenghuang OR from:wangxh65 OR from:xinhuatravel OR from:xhindonesia OR from:xhnorthamerica OR from:xhturkey OR from:xinhua_hindi OR from:yicaichina OR from:zhang_heqing OR from:zichenwanghere OR from:cns1952 OR from:pdchinese OR from:peopledailyjp OR from:phoenixcneeu' --fullarchive "my_dev_env" --from_date 2020-03-11 --to_date 2021-09-01 > tweets.jsonl
However, I get this error:
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url:
I checked the query length (under 1,024 characters) and also the development environment; everything seems fine. Maybe you can help me with that?
Is that the full error message? What does twarc.log contain? Does it work if you make the query smaller?
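If a smaller query does work, one workaround is to split the account list into several queries that each stay under the limit and run them as separate searches. A hypothetical Python sketch (the `chunk_query` helper is made up for illustration, not part of twarc):

```python
def chunk_query(accounts, max_len=1024):
    """Split accounts into OR-joined from: queries under max_len chars."""
    queries, current = [], []
    for name in accounts:
        candidate = " OR ".join(f"from:{n}" for n in current + [name])
        if len(candidate) > max_len and current:
            # Flush the current chunk and start a new one with this name.
            queries.append(" OR ".join(f"from:{n}" for n in current))
            current = [name]
        else:
            current.append(name)
    if current:
        queries.append(" OR ".join(f"from:{n}" for n in current))
    return queries

# Example: an artificially small limit to show the splitting behaviour.
names = [f"user_{i}" for i in range(10)]
chunks = chunk_query(names, max_len=60)
for q in chunks:
    assert len(q) <= 60
```

Each resulting string can then be used as the query of its own `twarc search` run, with the same date range, and the output files concatenated afterwards.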
Hi Igor,
I am unable to scrape time-specific tweets with twarc2. It gives me the option to search only the last week. Is there any way I can get tweets from the last three months on some specific issue?
Just for clarity: this is an old thread about twarc, which uses the v1.1 API; twarc2 uses the v2 API.
The default for twarc2 is to use the 7-day recent search. If you have Academic Access, you can specify
--archive on the command line to use the search/all endpoint:
twarc2 search --archive "..."
(provided that you are using credentials from an app attached to an Academic Access Project on your developer dashboard)