Hello.

I need to get a sample of geocoded tweets (yes, I know it is only 1% of the sample, but I need the tweets to be geocoded).

Using some general code, I downloaded a small sample with twarc. In it I saw that some tweets had coordinates (the coordinates were in geo.coordinates.coordinates, and there was a Point value in the geo.coordinates.type field).

Now I want to use twarc2 to download only geocoded tweets from a region. I know has:geo isn’t the answer, since it also returns tweets that are linked to a country or region but have no coordinates (there isn’t a Point value). I only want tweets with a Point value.

Since Twitter’s tools (Downloader, Query Builder, etc.) haven’t been working for weeks, I’ve tried to write my own query. This is what I tried:

twarc2 search --archive --start-time 2021-01-01 --end-time 2022-01-01 "has:coordinates place_country:es" espana.json

But has:coordinates doesn’t exist, so I can’t filter that way, and if I download every tweet from the country and filter by coordinates afterwards, I would exceed my limit of 10 million tweets per month.

I’ve seen that twarc1 had an option to filter by coordinates, "--yes-coordinates", which isn’t working in twarc2.

So I wonder whether there is a way to filter for geocoded tweets. Is it possible, for example, to filter by geometry? After all, every tweet I want must have a Point value.

These are all the valid operators you can use in a query using twarc2: Search Tweets - How to build a query | Docs | Twitter Developer Platform

To get the largest proportion of tweets with exact geo, the best strategy is to use has:geo with either the point_radius: or bounding_box: operator and sample locations, but NOT place_country, because that matches on place objects, and the vast majority of those do not have exact coordinates. These operators match exact coordinates if available, or place objects otherwise, so you’ll have to filter for those yourself after retrieving the data.

eg:

twarc2 search --archive --start-time "2021-01-01" --end-time "2022-01-01" "has:geo point_radius:[-3.70256 40.4165 25mi]" madrid.json

By my rough estimate, about 5% of geo-matching tweets end up with exact point coordinates. Most of these are posted by third-party clients, because Twitter removed exact geo from its own clients long ago. Instagram is the most frequent: https://twitter.com/housefguadalupe/status/1477052180129632264 All the other geo-matching tweets will be based on place objects, which do not have exact point coordinates.
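For the filtering step after download, here is a minimal Python sketch. It assumes the output file is JSON lines where each line is either a response page with a "data" list or a single tweet object, and that exact coordinates sit under geo.coordinates with type "Point", as described earlier in this thread. The function names (has_exact_point, iter_tweets, filter_points) are made up for illustration and are not part of twarc.

```python
import json


def has_exact_point(tweet):
    """Return True if the tweet carries exact Point coordinates in geo.coordinates."""
    coords = (tweet.get("geo") or {}).get("coordinates") or {}
    return coords.get("type") == "Point" and len(coords.get("coordinates") or []) == 2


def iter_tweets(record):
    """Yield tweets from one parsed line.

    Assumption: a line is either a response page ({"data": [...]}),
    a single-tweet response ({"data": {...}}), or a bare tweet object.
    """
    data = record.get("data")
    if isinstance(data, list):
        yield from data
    elif isinstance(data, dict):
        yield data
    else:
        yield record


def filter_points(in_path, out_path):
    """Write tweets with exact Point coordinates to out_path, one JSON object per line."""
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            for tweet in iter_tweets(json.loads(line)):
                if has_exact_point(tweet):
                    dst.write(json.dumps(tweet) + "\n")
                    kept += 1
    return kept
```

Something like filter_points("madrid.json", "madrid_points.json") would then keep only the tweets with a Point value. Note that this filtering happens after retrieval, so the place-only tweets still count against the monthly cap.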

Hope that helps!

Hello. Thanks for the help.

Then, if I want to directly download a sample of geolocated tweets published in Spain, do I have to use has:geo point_radius:[coordinates 1000km] (meaning I put the center in Madrid and use a radius that covers all of Spain)? I recall that in the Tweet Download API the point radius had a distance limit (I guess it’s the same in twarc), which is why I planned to filter for tweets that had Spain as the place country and then keep only the ones with Point geometry.

I’ll go with the has:geo point_radius/bounding_box option, then. I checked the valid operators, but the coordinate-related ones only filter on specific coordinates, not on whether a tweet has coordinates at all.

The max for that operator is 25 miles, so you can’t specify 1000km, though you can use km as the unit. That’s why you’ll have to sample locations, picking a bunch of random points with a 25mi radius all over the country to get a good representative sample. Both the tweet downloader and twarc call the same API in the end, so they’re equivalent.
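As a rough illustration of the location sampling, here is a small Python sketch that draws random points inside a bounding box and builds one has:geo point_radius: query per point. The bounding box for mainland Spain is an approximate assumption of mine, and the function name sample_point_radius_queries and the sample_N.json output file names are hypothetical. Points drawn this way can land in the sea or in neighbouring countries, so the results would still need checking afterwards.

```python
import random

# Approximate bounding box for mainland Spain (an assumption, adjust as needed):
# (west, south, east, north) in degrees.
SPAIN_BBOX = (-9.4, 36.0, 3.4, 43.8)


def sample_point_radius_queries(n, bbox=SPAIN_BBOX, radius="25mi", seed=0):
    """Build n search queries, each centred on a random point inside bbox.

    point_radius takes longitude first, then latitude, then the radius.
    """
    west, south, east, north = bbox
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    queries = []
    for _ in range(n):
        lon = rng.uniform(west, east)
        lat = rng.uniform(south, north)
        queries.append(f"has:geo point_radius:[{lon:.5f} {lat:.5f} {radius}]")
    return queries


if __name__ == "__main__":
    # Print one twarc2 command per sampled location.
    for i, query in enumerate(sample_point_radius_queries(5)):
        print(
            'twarc2 search --archive --start-time "2021-01-01" '
            f'--end-time "2022-01-01" "{query}" sample_{i}.json'
        )
```

Each printed command has the same shape as the Madrid example above, just with a different centre point. Drawing longitude and latitude uniformly is not exactly uniform by area, but it should be close enough across Spain's latitude range for a first pass.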