Incorrect tweets returned when filtering on "locations"


#1

Hi,
I’m trying to measure some statistics on relative quantities of tweets for a few cities.
I understand the Streaming API is the only suitable option for this (even though it returns only about 1% of tweets).

I'm using Python with tweepy, with the following code:

import json
import sys

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def on_data(self, data):
        tweet_object = json.loads(data)
        print(tweet_object["text"].encode(sys.stdout.encoding, 'replace'))
        print(tweet_object["coordinates"])
        print(tweet_object["place"])
        print(tweet_object["lang"])
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    stream.filter(locations=[52.0978767, 20.8512898, 52.3679992, 21.2710983])

However, it returns tweets like this one, for example:

??????: ??? ??? ??? ??? ?? ??? ??? ??? http://t.co/zXAzq8mSqA
???
??? #??? #?????????
None
{u'country_code': u'SA', u'url': u'https://api.twitter.com/1.1/geo/id/001ad0741538b980.json', u'country': u'\u0627\u0644\u0645\u0645\u0644\u0643\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629', u'place_type': u'admin', u'bounding_box': {u'type': u'Polygon', u'coordinates': [[[44.6569686, 17.0167619], [44.6569686, 29.1031061], [55.6666671, 29.1031061], [55.6666671, 17.0167619]]]}, u'full_name': u'Eastern, Kingdom of Saudi Arabia', u'attributes': {}, u'id': u'001ad0741538b980', u'name': u'Eastern'}
ar

This clearly should not match the filter: the coordinates property is empty, and the bounding box in the place property covers a different region.
What am I doing wrong?


#2

You are doing nothing wrong. Twitter first checks whether coordinates matches your locations filter; if that fails, Twitter checks place. Here is the doc.


#3

Exactly.

The streaming API uses the following heuristic to determine whether a given Tweet falls within a bounding box:

  • If the coordinates field is populated, the values there will be tested against the bounding box. Note that this field uses geoJSON order (longitude, latitude).
  • If coordinates is empty but place is populated, the region defined in place is checked for intersection against the locations bounding box. Any overlap will match.
  • If none of the rules listed above match, the Tweet does not match the location query. Note that the geo field is deprecated, and ignored by the streaming API.
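A rough client-side sketch of that heuristic (my own illustration, not the server's actual code; `box` is the filter in geoJSON order):

```python
import json

def matches_locations(raw_data, box):
    """Rough version of the documented heuristic.

    box is (sw_lon, sw_lat, ne_lon, ne_lat), i.e. geoJSON order.
    """
    tweet = json.loads(raw_data)

    # Rule 1: exact point in "coordinates", geoJSON order [longitude, latitude]
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]
        return box[0] <= lon <= box[2] and box[1] <= lat <= box[3]

    # Rule 2: any overlap between the "place" bounding box and the filter box
    place = tweet.get("place")
    if place:
        points = place["bounding_box"]["coordinates"][0]
        lons = [p[0] for p in points]
        lats = [p[1] for p in points]
        return (min(lons) <= box[2] and box[0] <= max(lons)
                and min(lats) <= box[3] and box[1] <= max(lats))

    # Rule 3: neither field is set, so the Tweet does not match
    return False
```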

So why did I receive that tweet above from Saudi Arabia? Neither the tweet's coordinates nor its place fits the coordinates in my query.


#4

Your location is roughly longitude 52, latitude 20.
The returned bounding box is longitude 44 to 55, latitude 17 to 29.
If I’m doing this correctly, your location and the bounding box match.
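Checking the numbers from the thread (box corners taken from your filter call and from the returned place dict):

```python
# The filter as the API read it: (lon, lat) pairs, so the box lands at
# longitude 52.10-52.37, latitude 20.85-21.27 -- in the Arabian Peninsula.
filter_box = (52.0978767, 20.8512898, 52.3679992, 21.2710983)

# Bounding box from the returned place ("Eastern, Kingdom of Saudi Arabia")
place_box = (44.6569686, 17.0167619, 55.6666671, 29.1031061)

def overlap(a, b):
    # Axis-aligned boxes (sw_lon, sw_lat, ne_lon, ne_lat) overlap unless
    # one lies entirely to one side of the other.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

print(overlap(filter_box, place_box))  # True -- so the Tweet matched
```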


#5

Ok, it seems I swapped latitude and longitude.
Thanks a lot.
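For anyone landing here later: the fix is just reordering the same numbers as (longitude, latitude) pairs, south-west corner first. A quick sanity check that the corrected box contains Warsaw (the city-centre coordinates below are my own approximation):

```python
# [sw_longitude, sw_latitude, ne_longitude, ne_latitude]
warsaw_box = [20.8512898, 52.0978767, 21.2710983, 52.3679992]

# Warsaw city centre, approximately lon 21.01, lat 52.23
lon, lat = 21.0122, 52.2297
print(warsaw_box[0] <= lon <= warsaw_box[2] and
      warsaw_box[1] <= lat <= warsaw_box[3])  # True

# then, as in the original script:
# stream.filter(locations=warsaw_box)
```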


#6

Hi Marcin.

Do you actually get any tweets using your method? I tried to get Arabic tweets from Egypt using this technique and only managed to get 1 tweet every minute or so = useless.


#7

One thing to note is that very few Tweets overall are tagged by the user with location data.


#8

Hi Andy.

I take your point, thanks.

Does the same apply for language? (i.e are very few Tweets tagged for language?).

There must be some way to collect Arabic tweets (other than by keyword) surely…


#9

Language is inferred, or set as part of a user's device or web-browser configuration. I'm afraid I can't comment on volumes in the public stream, as I simply don't know. In principle, more Tweets may carry a language value than the low ~2% where authors explicitly add location, but I can't comment on specific volumes, sorry.
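That lang value can at least be checked client-side, e.g. inside your listener's on_data. A minimal sketch (the payloads below are made-up examples in the shape the streaming API delivers):

```python
import json

def is_arabic(raw_data):
    """True if a raw streaming payload carries Twitter's 'ar' language tag."""
    tweet = json.loads(raw_data)
    return tweet.get("lang") == "ar"

print(is_arabic('{"text": "\\u0645\\u0631\\u062d\\u0628\\u0627", "lang": "ar"}'))  # True
print(is_arabic('{"text": "hello", "lang": "en"}'))  # False
```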


#10

My findings so far are:

  1. Collecting by keyword - works, but is bad for what I want to do
  2. Collecting by location - works, but only at a rate of one tweet every 30-60 seconds

You know the API better than me - what do you think is the best way to gather Arabic tweets without restricting by keyword?

Thanks!


#11

The streaming API offers a 1% sample of the worldwide firehose, and if you track by location, only around 1-2% of Tweets are covered, so I'm not surprised that your results are limited.

For full-fidelity content tracking, you'll need to look at Twitter's commercial Gnip products.


#12

Awesome!

I will check that out.

Thanks!


#13

The Gnip APIs are enterprise commercial paid products so be aware that they represent a major investment compared to the free APIs. That said, that would be the way to go for much wider tracking of terms, languages and locations.


#14

I hear you, buddy.

I have contacted them; let's see what they come back with … hopefully they'll have some special rates for students :wink:

Thanks for your assistance.