A lot of noise when trying to "follow" 2000 users



Hi Twitter!

I’m using the streaming API to follow around 2,000 users by their user IDs, with the phirehose library and code from 140dev. The script seems to run fine and collects tweets from the 2,000 users, but it also stores a lot of tweets from user IDs that are not in my array of IDs to follow. In other words, I’m getting tweets that don’t match my filter. Is there usually this much noise with the streaming API, or could something be wrong with the phirehose library and its functions? I can provide the functions and scripts if anyone wants to look at them.


Just to confirm this observation: I also use the streaming API to follow almost 5,000 users, and in about 10% of the tweets I cannot find any user ID from my list of 5,000 users in any of ['user']['id_str'], ['in_reply_to_user_id_str'], ['retweeted_status']['in_reply_to_user_id_str'], or ['retweeted_status']['user']['id_str'].


Yes!!! But in my case almost 50% of the tweets are from user ids that are not in my list of 2,000. The tweets being collected look almost completely random.


Just to clarify, what do you mean by “[…] tweets are from user ids […]”? It reads as if a tweet counts as being from User 123 only when ['user']['id_str']=123. However, as far as I understand the API documentation, you will also get tweets that are replies to User 123 or retweets of tweets by User 123. If you only look at ['user']['id_str'], the 50% sounds fine to me. But I assume you also checked the other fields, like ['retweeted_status']['user']['id_str'] or ['in_reply_to_user_id_str'].

Admittedly, I’m not even sure I understand the API documentation [1] correctly. For the follow parameter it says that you get “Replies to any Tweet created by the user.” To me this is ambiguous: either the followed user wrote the reply, or the followed user wrote the tweet that someone has replied to. Looking at my data, I assume it’s the latter, since I have tweets with, for example, ['user']['id_str']=XXX and ['retweeted_status']['user']['id_str']=1234, where User 1234 is one of the users I follow.

[1] https://dev.twitter.com/streaming/overview/request-parameters
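
For anyone who wants to run the same check on their own data, here is a minimal sketch in Python of the field-by-field comparison described above. It assumes each tweet has already been decoded from JSON into a dict and that the followed IDs are stored as strings. The function name match_reasons and the reason labels are made up for illustration and are not part of phirehose or the Twitter API; the set of fields is only the one discussed in this thread and may not cover every way a tweet can match.

```python
def match_reasons(tweet, followed_ids):
    """Return labels describing which fields of `tweet` contain a followed user ID.

    `tweet` is a decoded tweet dict; `followed_ids` is a set of ID strings.
    An empty list means none of the checked fields point at a followed user.
    """
    # "retweeted_status" is only present on retweets, so fall back to {}.
    retweet = tweet.get("retweeted_status") or {}

    # (label, user ID found in that field or None); labels are hypothetical.
    candidates = [
        ("author", (tweet.get("user") or {}).get("id_str")),
        ("reply_to", tweet.get("in_reply_to_user_id_str")),
        ("retweet_of", (retweet.get("user") or {}).get("id_str")),
        ("retweet_of_reply_to", retweet.get("in_reply_to_user_id_str")),
    ]

    return [
        label
        for label, user_id in candidates
        if user_id is not None and user_id in followed_ids
    ]


if __name__ == "__main__":
    followed = {"1234"}
    sample = {
        "user": {"id_str": "999"},
        "in_reply_to_user_id_str": None,
        "retweeted_status": {
            "user": {"id_str": "1234"},
            "in_reply_to_user_id_str": None,
        },
    }
    print(match_reasons(sample, followed))
```

Tweets for which this returns an empty list are the ones worth looking at more closely. One thing to watch: the *_str fields are strings, so a follow list stored as integers would never match and would make everything look like noise.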


Correct. As I understand it, you basically get every tweet that is associated with a user ID you are following, as explained in the documentation. So yes, I do expect to get tweets from user IDs that I am not following, and I get a lot of them. But in my data I am also getting tweets that are completely irrelevant. I am even more sure of this because the user IDs I am following are all from one country, yet I get a lot of tweets from users in other countries. So I was just wondering why there would be so much noise.


that’s interesting, making progress of 300 additional