Hello
I am using the academic research track in order to scrape tweets from the API. I have been reading a bit in the forum and it seems Python or Java are the preferred languages for this. However, I am an R user and I have recently been recommended to take a look at this package.
I wondered if anyone else has an example (besides the ones provided there) of how to obtain a sample of tweets. I am facing some issues with it.
My research is based on getting tweets from the Black Lives Matter protests that took place last summer. The function "get_hashtag_tweets" returns all tweets matching the specified query, and I can't see a way of limiting the number of tweets.
Furthermore, introducing other query parameters to filter by country or exclude retweets doesn't seem to work in my case. I also can't work out how the function actually searches for the tweets; is there a way of creating a random sample of N tweets, for instance? (I know the function calls /2/tweets/search/all, but I am not sure whether there is a way to limit the number of tweets through the query itself.)
I attach an example of the code I am currently using
I have also tried including a file path, which stores the data in JSON files, but converting these back into dataframes seems very messy to me.
Thanks a lot in advance, and sorry for the basic questions. I have been trying to find some hints online but I haven't found much for R.
The GitHub issues of that project are probably the best place to get help with that R package, since it depends on how they implemented things. Technically, to limit the number of tweets you get, you would stop paginating the results.
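To make the pagination idea concrete, here is a minimal Python sketch. It is not academictwitteR's actual code; `fetch_page` is a hypothetical stand-in for one call to /2/tweets/search/all (a real implementation would pass the bearer token, query, and `next_token` parameter):

```python
# Sketch: stop paginating once N tweets have been collected.
# `fetch_page` is a hypothetical function representing one API call;
# it returns one parsed response dict from /2/tweets/search/all.

def collect_tweets(fetch_page, max_tweets=500):
    """Keep requesting pages until max_tweets are collected
    or the API stops returning a next_token."""
    tweets = []
    next_token = None
    while len(tweets) < max_tweets:
        page = fetch_page(next_token)          # one API response (dict)
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:                 # no more pages available
            break
    return tweets[:max_tweets]                 # trim the final page
```

The key point is that the limit lives in your loop, not in the query: the endpoint itself only caps results per page, so you stop asking for the next page once you have enough.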
Your query is missing some parentheses, so it may not be getting you what you want due to the precedence of operators (see Building queries | Twitter Developer): spaces are logical ANDs and are processed first, then ORs.
Your query should be: "(#BlackLivesMatter OR #BLM) -is:retweet place_country:US"
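A quick way to see why the parentheses matter is to add the implicit grouping yourself. This small illustrative helper (not part of any package, and only meaningful for flat queries with no explicit parentheses) applies the stated rule that AND binds before OR:

```python
def show_implicit_grouping(query):
    """Add the parentheses the search endpoint implies:
    space-separated terms (implicit AND) bind before OR.
    Only valid for flat queries without explicit parentheses."""
    branches = [b.strip() for b in query.split(" OR ")]
    return " OR ".join(f"({b})" for b in branches)

# Without explicit grouping, the filters attach only to the last branch:
show_implicit_grouping("#BlackLivesMatter OR #BLM -is:retweet place_country:US")
# → '(#BlackLivesMatter) OR (#BLM -is:retweet place_country:US)'
```

So in the unparenthesized version, -is:retweet and place_country:US restrict only the #BLM branch, while #BlackLivesMatter tweets come through unfiltered.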
This is significantly more difficult and involved, but it is theoretically possible with thousands of calls, each limited to a small slice of time. See here for more details: Generating a random set of Tweet IDs
I will try to keep working on it and ask in the GitHub repository too.
Thank you a lot for your answer; it has been really helpful.
I’ll also add that “scraping” generally refers to grabbing data from the HTML web page (which is against the Twitter Terms of Service); you are accessing Tweets via the Twitter API, not “scraping”.
On a slight tangent, it's super interesting to me how the R community seems to lean towards conflating "scraping" and REST API access: every time I read R docs, it's all called "scraping".
I don't have numbers for this, but I hope some linguists take note of how different communities adopt terminology in different ways. (I mention this because it overlaps with my own research interests; just throwing it out there into the world.)
Agreed - it is an interesting linguistic / descriptive divergence!
Hello again.
I am having some issues with the location of the tweets. I have been reading some threads here or here about how to get the location information back from the place_id data.
The way I understand it, the expansions required to get it return a separate object, includes, that contains the information.
My current code is:
blm <- get_hashtag_tweets("(#georgefloyd OR georgefloyd OR #justiceforgeorgefloyd OR #ICantBreathe OR #icantbreathe OR #blm OR #BLM OR #BlackLivesMatter OR #blacklivesmatter) has:geo lang:en -is:retweet place_country:US", "2020-05-25T00:00:00Z", "2020-06-25T01:00:00Z", bearer_token)
With this I obtain a dataframe with the column place_id, which I would like to convert to a location. I suspect that the reason I do not get includes has to do with how get_hashtag_tweets is built (whether it specifies the expansions), so I wondered if there is some way to get proper locations back.
Do the place_id values returned follow some sort of pattern that can be reversed by other means?
Thanks a lot!
Yes, it's a different part of the response: tweets are in data, and all the extra objects you request are in includes.
You are correct: the R code you're using does not seem to be processing the includes from the Twitter API response.
Either the R code has to be updated to process all the includes, or you'll have to process the JSON yourself; I don't know R well enough to suggest how best to do this.
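To give a sense of what processing the JSON yourself involves, here is a minimal Python sketch over a made-up one-tweet response (the field names follow the v2 payload: tweets in data, expanded place objects in includes.places; the place values are illustrative, not real data):

```python
# Illustrative v2 response: tweets live in "data",
# the expanded place objects live in "includes.places".
response = {
    "data": [
        {"id": "1", "text": "example tweet",
         "geo": {"place_id": "01a9a39529b27f36"}},
    ],
    "includes": {
        "places": [
            {"id": "01a9a39529b27f36", "full_name": "Manhattan, NY",
             "country_code": "US"},
        ]
    },
}

# Index the places by id, then join them onto each tweet by place_id.
places = {p["id"]: p for p in response["includes"]["places"]}
for tweet in response["data"]:
    pid = tweet.get("geo", {}).get("place_id")
    if pid in places:
        tweet["place_full_name"] = places[pid]["full_name"]
```

The same id-keyed join works for the other expansions (users, referenced tweets, media), which is essentially what the tools below automate.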
An alternative you can try is to use twarc as a command-line tool to write out a CSV that you can import into R; this will process all the extra information appropriately:
You may need to force the correct twarc version to be installed, as described here:
https://twarc-project.readthedocs.io/en/latest/twarc2/ may help with using it as a command-line tool; to convert the output to a CSV you can import into R, use https://pypi.org/project/twarc-csv/
The docs are still being written so unfortunately it’s a bit all over the place right now.
Thanks a lot for your help!
Hello Igor
I have been trying to follow your advice and import the data using twarc2 in Python. I am having some issues with the geolocation parameters that twarc reports. When I used the following code in R:
blm2 <- get_hashtag_tweets("(#georgefloyd OR georgefloyd OR #justiceforgeorgefloyd OR #ICantBreathe OR #icantbreathe OR #blm OR #BLM OR #BlackLivesMatter OR #blacklivesmatter) has:geo lang:en -is:retweet place_country:US", "2020-05-25T00:00:00Z", "2021-01-31T01:00:00Z", bearer_token)
I obtained a specific column, geo.coordinates.coordinates, containing the exact geolocation for the tweets.
Now I am trying to build the same query in python as:
twarc2 search --archive "(#georgefloyd OR georgefloyd OR #justiceforgeorgefloyd OR #ICantBreathe OR #icantbreathe OR #blm OR #BLM OR #BlackLivesMatter OR #blacklivesmatter) has:geo lang:en -is:retweet place_country:US" --start-time "2020-05-25" --end-time "2020-05-26" > output2.json
This does not give me the data for the coordinates but only for geo.bbox, which, if I understand correctly from here, can refer to the user's location or a place they're tweeting about.
Do you have any idea why this might be happening?
The R script does some expansion of the includes in the background; in twarc this is a separate step, because we wanted to keep the original responses for archiving.
To get the geo data that’s included in the responses, you can run:
twarc2 flatten output2.json output2_flat.json
or add --flatten to the search command (but it's preferable to run the search first and then flatten it; the advantage is being able to hold on to the original data so you can use different tools later if you need to):
twarc2 search --flatten --archive "(#georgefloyd OR georgefloyd OR #justiceforgeorgefloyd OR #ICantBreathe OR #icantbreathe OR #blm OR #BLM OR #BlackLivesMatter OR #blacklivesmatter) has:geo lang:en -is:retweet place_country:US" --start-time "2020-05-25" --end-time "2020-05-26" output2.json
The flatten command processes all the available expansions (geo, mentions, etc.) and includes them with each tweet.
edit: I didn't know View() has a limit on the number of columns it displays, which is why I thought I still had problems. Everything works perfectly now.
Thanks for your help!