Hi,
I am running the following command using the academictwitteR package, and I am unable to find the country information in the output.
tweets <- get_all_tweets(query = "happy",
                         start_tweets = "2020-03-21T00:00:00Z",
                         end_tweets = "2020-12-31T00:00:00Z",
                         n = 10,
                         is_retweet = FALSE)
I want to collect data for all countries initially and then analyze it by country. The output “tweets” has the variables “geo.coordinates.type”, “geo.coordinates.coordinates” and “geo.place_id”, but they do not include any country code or name, and they are always NA, NULL, and NA, respectively.
Here is what my output looks like:
colnames(tweets)
 [1] "attachments"         "id"                "possibly_sensitive" "author_id" "public_metrics"
 [6] "conversation_id"     "source"            "entities"           "text"      "lang"
[11] "in_reply_to_user_id" "referenced_tweets" "created_at"         "geo"
Any help on how to find the country information in my output would be appreciated. I certainly do not want to run a separate command for each country, since that would not be practical.
Thank you!
To get tweets with geo information, I think you’ll have to specify it in the query, for example with the place_country: operator. You’re unlikely to find any with a normal search, because only a very small fraction of tweets carry geo / place info. Even if you specify a query like place_country:US OR place_country:IE OR... etc., you will still end up with a small sample.
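A minimal sketch of building such a query string for several countries at once, so that a single search covers all of them. The helper name `build_country_query` is hypothetical, and the 1024-character cap is an assumption about academic access, so check the limit for your own access level.

```python
def build_country_query(keywords, country_codes, exclude_retweets=True, max_length=1024):
    """Combine keywords with a parenthesized group of place_country: operators.

    max_length is an assumed query-length cap, not a value taken from this thread.
    """
    places = " OR ".join(f"place_country:{code}" for code in country_codes)
    # Parentheses keep the OR group together before it is combined with the keywords.
    query = f"{keywords} ({places})"
    if exclude_retweets:
        query += " -is:retweet"
    if len(query) > max_length:
        raise ValueError(f"query is {len(query)} characters, over the assumed cap of {max_length}")
    return query


query = build_country_query("happy", ["US", "IE", "GB"])
print(query)
```

With many country codes the string grows quickly, so a long list may need to be split across a few queries rather than one per country.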
Thank you for the reply. You are right indeed, many of the tweets do not have location information. I just looked at a bigger file, and the variables “geo.coordinates.type”, “geo.coordinates.coordinates” and “geo.place_id” do contain values for a few tweets. Unfortunately, these values don’t seem to be mapped to any country.
For further exploration, I ran a command specifying a country (country = “US”) in my query and noted that “geo.place_id” is non-empty, although it is not the same for all tweets, whereas I would assume they all originated in the US.
tweets_US <- get_all_tweets(query = "happy",
                            start_tweets = "2020-03-21T00:00:00Z",
                            end_tweets = "2020-12-31T00:00:00Z",
                            n = 10,
                            is_retweet = FALSE,
                            country = "US")
The geo variable in the output file looks like this:
Since there is no country information in my output file, is there a way to include it? I can certainly specify a country in my query, but then I would need a great many queries, which seems impractical for now. Also, is there a way to use “place_id” to identify the country from my output?
I appreciate all the help!
This depends entirely on how academictwitteR deals with it. I know the API itself has this data:
I use twarc, and this is the output I see:
Setup and configure:
pip install twarc twarc-csv
twarc2 configure
Query 100 tweets and convert to CSV:
twarc2 search --archive --start-time "2020-03-21" --end-time "2020-12-31" --limit 100 "happy place_country:US -is:retweet" results.jsonl
twarc2 csv --no-inline-referenced-tweets results.jsonl results.csv
And I get:
    geo.place_id      geo.coordinates.type  geo.coordinates.coordinates  geo.country
40  946ccd22e1c9cda1  NaN                   NaN                          United States
76  6a4364ea6f987c10  Point                 [-122.069984, 37.349526]     United States
82  20c4b8c36d778d21  NaN                   NaN                          United States
31  015eab953129500a  NaN                   NaN                          United States
68  7142eb97ae21e839  NaN                   NaN                          United States
9   f97108ab3c4a42ed  NaN                   NaN                          United States
51  b49b3053b5c25bf5  Point                 [-104.98360001, 39.7391]     United States
39  723d666e3a15fd22  NaN                   NaN                          United States
72  c3f37afa9efcf94b  NaN                   NaN                          United States
20  3df4f427b5a60fea  NaN                   NaN                          United States
(so the data exists, but it’s just very sparse)
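On the question of mapping a place_id to a country: in the raw v2 API response, the place details arrive in a separate `includes.places` list (when the place expansion and country fields are requested), and each tweet only carries `geo.place_id`. A minimal sketch of that join, assuming that response shape; the function name `attach_country`, the tweet IDs and the tweet text are made up for illustration, while the place IDs are the ones from the table above.

```python
def attach_country(response):
    """Look up each tweet's geo.place_id in includes.places and return flat rows."""
    places = {p["id"]: p for p in response.get("includes", {}).get("places", [])}
    rows = []
    for tweet in response.get("data", []):
        place_id = tweet.get("geo", {}).get("place_id")
        place = places.get(place_id, {})  # empty dict when the tweet has no place
        rows.append({
            "id": tweet["id"],
            "place_id": place_id,
            "country": place.get("country"),
            "country_code": place.get("country_code"),
        })
    return rows


# Hypothetical response page: the tweet IDs and text are invented for this sketch.
sample_response = {
    "data": [
        {"id": "1", "text": "happy day", "geo": {"place_id": "946ccd22e1c9cda1"}},
        {"id": "2", "text": "so happy"},  # most tweets have no geo object at all
    ],
    "includes": {
        "places": [
            {"id": "946ccd22e1c9cda1", "country": "United States", "country_code": "US"},
        ]
    },
}

rows = attach_country(sample_response)
for row in rows:
    print(row)
```

This is also why different tweets from the same country have different place IDs: a place is usually a city or neighbourhood, and the country is just one attribute of it.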
Thank you for the help. So you see the variable “geo.country” using twarc! I don’t see this variable at all when using “get_all_tweets” in the academictwitteR package in R. I see all the other geo variables, but not geo.country.
Any idea why this might be the case? I wonder if geo.country is there and I am somehow unable to access it.
Thanks so much!
I think the best place to ask is the Issues page of the cjbarrie/academictwitteR repository on GitHub.
This issue looks related: “extraction of $places from jsons” (Issue #175, cjbarrie/academictwitteR on GitHub).
Alternatively you can always crawl the tweets with twarc, and import the csv for analysis into R.
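Before importing the CSV anywhere, a quick check of how many rows actually carry a country can be useful. A minimal Python sketch, assuming the CSV has a `geo.country` column as in the output above; the function name and the inline sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_country(csv_file, column="geo.country"):
    """Count rows per country; rows with an empty value are grouped under one label."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        country = row.get(column) or "(no place info)"
        counts[country] += 1
    return counts


# Hypothetical sample standing in for results.csv.
sample = io.StringIO(
    "id,geo.place_id,geo.country\n"
    "1,946ccd22e1c9cda1,United States\n"
    "2,,\n"
    "3,6a4364ea6f987c10,United States\n"
)

counts = count_by_country(sample)
print(counts)
```

For a real file, pass `open("results.csv", newline="", encoding="utf-8")` instead of the in-memory sample.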
Thank you so much for the help, Igor.
I have now been able to run twarc2 search and you are right, I can see the geo.country variable in the output. I can certainly use the final CSV output file in R.
One question: how do I get all tweets from 2017 that contain the words, say, “Happy” or “Happiness”? I am not sure how to set the “limit” for this. In R, I have been using n = Inf so that it fetches all the tweets within the specified timeframe. Here is my code for twarc2:
twarc2 search --archive --start-time "2017-01-01" --end-time "2017-12-29" --limit 10000000 "Happy OR Happiness -is:retweet" happy_tweets2017.jsonl
I have a monthly quota of 10000000. Does the above command fetch all the tweets with the given words (hopefully less than 10000000), or does it fetch 10000000 tweets regardless?
Sorry for this naive question. I learnt to use twarc2 just a few hours ago, and I am so glad my CSV files now have country information (unlike what I had in R from get_all_tweets).
A big thanks to you for helping me get started quickly with twarc2.
Good to hear!
In twarc, you’ll get all available results: if the --limit is hit, it will stop, and if there are fewer tweets than the limit, it will stop before reaching it. You can leave the limit out entirely to always get everything, but you might exhaust your monthly quota that way. Hope that makes sense. The limit is a cap, not a target number of tweets, so you won’t necessarily get exactly that many when you specify it.
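A toy model of the cap-versus-target idea, for illustration only. This is not twarc’s code: the function `collect` and the page sizes are hypothetical, and this version checks the limit after each page, which is one way a collector can end up with a count that differs from the limit.

```python
def collect(pages, limit=None):
    """Gather items page by page, stopping once the cap is reached or pages run out."""
    collected = []
    for page in pages:
        collected.extend(page)
        # The cap is checked after a whole page is stored, so the total can overshoot.
        if limit is not None and len(collected) >= limit:
            break
    return collected


def make_pages(total, page_size=100):
    """Split `total` fake tweet IDs into pages of at most `page_size`."""
    ids = list(range(total))
    return [ids[i:i + page_size] for i in range(0, total, page_size)]


# 250 matching tweets: a large cap returns all 250, a small cap stops early.
print(len(collect(make_pages(250), limit=1000)))
print(len(collect(make_pages(250), limit=150)))
print(len(collect(make_pages(250))))
```

In every case the number collected is bounded by what actually matches the query, never padded up to the limit.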
Also, your query needs parentheses around the OR to group the terms properly. Without them, the implicit AND binds tighter than OR, so -is:retweet would apply only to Happiness and not to Happy. See “Search Tweets - How to build a query” in the Twitter Developer Platform docs.
A good thing to do is to use counts to see how many matching tweets there are ahead of time, e.g.:
twarc2 counts --archive --start-time "2017-01-01" --end-time "2017-12-29" --granularity "day" --csv "(Happy OR Happiness) -is:retweet" happy_tweets2017_counts.csv
That should give you a daily count of results.
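To compare the expected volume with a monthly quota, the daily counts can be summed. A minimal sketch, assuming the counts CSV has one row per day with a count column named `day_count`; that column name is an assumption, so check the header of your own file. The sample rows are invented.

```python
import csv
import io


def total_count(csv_file, count_column="day_count"):
    """Sum the per-day counts from a counts CSV."""
    return sum(int(row[count_column]) for row in csv.DictReader(csv_file))


# Hypothetical rows standing in for happy_tweets2017_counts.csv.
sample = io.StringIO(
    "start,end,day_count\n"
    "2017-01-01T00:00:00.000Z,2017-01-02T00:00:00.000Z,120000\n"
    "2017-01-02T00:00:00.000Z,2017-01-03T00:00:00.000Z,95000\n"
)

MONTHLY_QUOTA = 10_000_000  # quota mentioned earlier in the thread
total = total_count(sample)
print(total, "fits in quota:", total <= MONTHLY_QUOTA)
```

If the total is above the quota, narrowing the date range or the query before running the search avoids using up the month’s allowance partway through.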
To see the options, run the command with --help, or see the twarc2 page in the twarc documentation.
Thank you so much for clarifying --limit in twarc. It all makes sense to me now.
Also, the ‘counts’ command is incredibly helpful, particularly for my current analysis, so thanks for sharing that.
Although I could not get the country information I was originally looking for from the academictwitteR package in R, using twarc to get the tweets as a CSV and then using the CSV in R is working smoothly for me.
Once again, thanks for helping me, Igor.