Hi,

I am running the following command using the academictwitteR package, and I am unable to find the country information in the output.

tweets <- get_all_tweets(query = "happy",
                         start_tweets = "2020-03-21T00:00:00Z",
                         end_tweets = "2020-12-31T00:00:00Z",
                         n = 10,
                         is_retweet = FALSE)

I want to collect data for all countries initially, but then analyze it by country. I see that the output “tweets” has the variables “geo.coordinates.type”, “geo.coordinates.coordinates” and “geo.place_id”, but they do not include any country code or name, and are always equal to NA, NULL, and NA, respectively.

Here is what my output looks like:

colnames(tweets)
 [1] "attachments"         "id"                  "possibly_sensitive"  "author_id"           "public_metrics"
 [6] "conversation_id"     "source"              "entities"            "text"                "lang"
[11] "in_reply_to_user_id" "referenced_tweets"   "created_at"          "geo"

Any help on how to find the country information in my output would be appreciated. I certainly do not want to run a separate command for each country, since that would not be practical.

Thank you!

To get tweets with geo information, I think you’ll have to specify it in the query, for example using the place_country: operator. You’re unlikely to find any with a normal search, because only a very small fraction of tweets carry geo / place info. So even if you specify a query like place_country:US OR place_country:IE OR ... etc., you will still end up with a small sample.
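As a rough illustration of how such a multi-country query string could be put together, here is a minimal Python sketch. The helper name build_country_query and its parameters are made up for this example; only the place_country: and -is:retweet operators come from the Twitter search syntax. Each OR group is wrapped in parentheses so it stays together when combined with the other terms.

```python
def or_group(terms):
    """Join terms with OR; add parentheses only when there is more than one."""
    joined = " OR ".join(terms)
    return f"({joined})" if len(terms) > 1 else joined


def build_country_query(keywords, countries, exclude_retweets=True):
    """Hypothetical helper: match any keyword, restricted to the listed countries."""
    parts = [
        or_group(keywords),
        or_group([f"place_country:{code}" for code in countries]),
    ]
    if exclude_retweets:
        parts.append("-is:retweet")
    # A space between groups acts as a logical AND in the search syntax.
    return " ".join(parts)


query = build_country_query(["happy"], ["US", "IE", "GB"])
print(query)
```

The resulting string can be passed as the query to whichever client you use. Keep in mind that it still only matches the small share of tweets that carry place information.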

Thank you for the reply. You are right indeed, many of the tweets do not have location information. I just looked at a bigger file, and the variables “geo.coordinates.type”, “geo.coordinates.coordinates” and “geo.place_id” do contain some values for a few tweets. Unfortunately, these values don’t seem to be mapped to any country.

For further exploration, I ran a command specifying a country (country = “US”) in my query and noted that “geo.place_id” is non-empty, although it is not the same for all tweets, whereas I would assume they all originated in the US.

tweets_US <- get_all_tweets(query = "happy",
                            start_tweets = "2020-03-21T00:00:00Z",
                            end_tweets = "2020-12-31T00:00:00Z",
                            n = 10,
                            is_retweet = FALSE,
                            country = "US")

The geo variable in the output file looks like this:

Since there is no country information in my output file, is there a way to include it? I can certainly specify a country in my query, but then I would need a great many queries, which seems impractical for now. Also, is there a way to use “place_id” to identify the country from my output?

I appreciate all the help!

This depends entirely on how academictwitteR deals with it. I know the API itself has this data:

I use twarc, and this is the output I see:

Setup and configure:

pip install twarc twarc-csv
twarc2 configure

Query 100 tweets and convert to CSV:

twarc2 search --archive --start-time "2020-03-21" --end-time "2020-12-31" --limit 100 "happy place_country:US -is:retweet" results.jsonl
twarc2 csv --no-inline-referenced-tweets results.jsonl results.csv

And I get:

        geo.place_id geo.coordinates.type geo.coordinates.coordinates    geo.country
40  946ccd22e1c9cda1                  NaN                         NaN  United States
76  6a4364ea6f987c10                Point    [-122.069984, 37.349526]  United States
82  20c4b8c36d778d21                  NaN                         NaN  United States
31  015eab953129500a                  NaN                         NaN  United States
68  7142eb97ae21e839                  NaN                         NaN  United States
9   f97108ab3c4a42ed                  NaN                         NaN  United States
51  b49b3053b5c25bf5                Point    [-104.98360001, 39.7391]  United States
39  723d666e3a15fd22                  NaN                         NaN  United States
72  c3f37afa9efcf94b                  NaN                         NaN  United States
20  3df4f427b5a60fea                  NaN                         NaN  United States

(so the data exists, but it’s just very sparse)
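To show where the country comes from in the raw data, here is a minimal Python sketch. It assumes a v2 response page that was requested with the geo.place_id expansion and the country place field, so that place objects arrive under includes.places and each tweet only carries a geo.place_id pointing at one of them. The function name country_by_tweet and the tweet IDs are made up for illustration; the place ID and country name are taken from the table above.

```python
def country_by_tweet(page):
    """Map each tweet ID to a country name by joining geo.place_id to includes.places."""
    places = {p["id"]: p for p in page.get("includes", {}).get("places", [])}
    result = {}
    for tweet in page.get("data", []):
        # Most tweets have no "geo" field at all, so default to an empty dict.
        place_id = tweet.get("geo", {}).get("place_id")
        place = places.get(place_id)
        result[tweet["id"]] = place.get("country") if place else None
    return result


# Illustrative page: tweet IDs "1" and "2" are placeholders, not real tweets.
page = {
    "data": [
        {"id": "1", "text": "happy", "geo": {"place_id": "946ccd22e1c9cda1"}},
        {"id": "2", "text": "happy"},
    ],
    "includes": {
        "places": [
            {"id": "946ccd22e1c9cda1", "country": "United States", "country_code": "US"}
        ]
    },
}

result = country_by_tweet(page)
print(result)
```

This is the kind of join that produces a geo.country column in a flattened CSV. A client that flattens the tweets without carrying the places along would show geo.place_id but no country.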

Thank you for the help. So you see the variable “geo.country” using twarc! I don’t see this variable at all when using “get_all_tweets” in the academictwitteR package in R. I see all the other geo variables, but not geo.country.

Any idea why this might be the case? I wonder if geo.country is there and I am somehow unable to access it.

Thanks so much!

I think the best place to ask is the Issues section of cjbarrie/academictwitteR on GitHub.

This issue looks related: “extraction of $places from jsons” (Issue #175 in cjbarrie/academictwitteR on GitHub).

Alternatively you can always crawl the tweets with twarc, and import the csv for analysis into R.

Thank you so much for the help, Igor.

I have now been able to run twarc2 search and you are right, I can see the geo.country variable in the output. I can certainly use the final CSV output file in R.

One question: how do I get all tweets from 2017 that contain the words, say, “Happy” or “Happiness”? I am not sure how to set the “limit” for this. In R, I have been using n = Inf so that it fetches all the tweets within the specified timeframe. Here is my code for twarc2:

twarc2 search --archive --start-time "2017-01-01" --end-time "2017-12-29" --limit 10000000 "Happy OR Happiness -is:retweet" happy_tweets2017.jsonl

I have a monthly quota of 10000000. Does the above command fetch all the tweets with the given words (hopefully less than 10000000), or does it fetch 10000000 tweets regardless?

Sorry for this naive question. I learnt to use twarc2 just a few hours ago, and I am so glad my CSV files now have country information (unlike what I had in R from get_all_tweets).

A big thanks to you for helping me get started so quickly with twarc2.


Good to hear!

In twarc, you’ll get all available results up to the --limit. If the limit is hit, it stops; if there are fewer matching tweets than the limit, it stops before reaching it. You can leave the limit out entirely to always get everything, but you might exhaust your quota for the month that way. Hope that makes sense. The limit is a cap, not a target number of tweets, so you won’t get exactly that many when you specify it.
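The cap-versus-target idea can be sketched in a few lines of Python. This is only an illustration of the semantics, not how twarc is implemented: the function fetch_with_limit is made up, and the real client retrieves results page by page, which is one reason the final count will not match the limit exactly.

```python
from itertools import islice


def fetch_with_limit(results, limit=None):
    """Return at most `limit` items; with limit=None, return everything available."""
    return list(islice(results, limit))


available = range(7)  # stand-in for 7 matching tweets

print(len(fetch_with_limit(available, limit=10)))  # fewer matches than the limit
print(len(fetch_with_limit(available, limit=5)))   # limit reached first
print(len(fetch_with_limit(available)))            # no limit given
```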

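Also, your query needs parentheses around the OR terms to group them properly. Without them, the implicit AND binds more tightly than OR, so Happy OR Happiness -is:retweet is read as Happy OR (Happiness -is:retweet), and the retweet filter only applies to the second word. See “Search Tweets - How to build a query” in the Twitter Developer Platform docs.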

A good thing to do is to use counts to see how many matching tweets there are ahead of time, eg:

twarc2 counts --archive --start-time "2017-01-01" --end-time "2017-12-29" --granularity "day" --csv "(Happy OR Happiness) -is:retweet" happy_tweets2017_counts.csv

That should give you a daily count of results.
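Once you have the counts file, you can add up the daily numbers and compare the total with your monthly quota before running the full search. Below is a minimal Python sketch. It assumes the CSV has a header row with a count column named day_count; check the header of your own file and adjust count_column if it differs. The sample rows are made-up numbers, not real counts.

```python
import csv
import io


def total_count(csv_file, count_column="day_count"):
    """Sum the per-interval counts in a counts CSV (column name is an assumption)."""
    reader = csv.DictReader(csv_file)
    return sum(int(row[count_column]) for row in reader)


# Made-up sample in the assumed layout; with a real file, use
# open("happy_tweets2017_counts.csv", newline="") instead of StringIO.
sample = io.StringIO(
    "start,end,day_count\n"
    "2017-01-01T00:00:00.000Z,2017-01-02T00:00:00.000Z,120\n"
    "2017-01-02T00:00:00.000Z,2017-01-03T00:00:00.000Z,95\n"
)

total = total_count(sample)
quota = 10_000_000  # monthly quota mentioned earlier in the thread
print(total, "tweets;", "fits in quota" if total <= quota else "exceeds quota")
```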

To see all the options, run twarc2 counts --help, or see the twarc2 page in the twarc documentation.

Thank you so much for clarifying --limit in twarc. It all makes sense to me now.

Also, the ‘counts’ command is incredibly helpful, particularly for my current analysis, so thanks for sharing that.

Although I could not get the country information I was originally looking for from the academictwitteR package in R, using twarc to get the tweets as a CSV and then loading the CSV into R is working smoothly for me.

Once again, thanks for helping me, Igor.
