Your data team tweets interesting statistics after each World Cup match. For instance, this tweet
claims that there were 8 million tweets containing the #USAvPOR hashtag during the US-Portugal match on June 22nd.
I happened to be running a Twitter stream at the time that gathered those tweets, matching that hashtag as well as other US soccer hashtags like #USMNT, #OneNationOneTeam, and #1N1T.
Between 2014-6-22 7pm (UTC) and 2014-6-23 2am (UTC), a wide window around the match, that stream gathered a total of 450,515 tweets and received “tweets missed” messages amounting to a total of 45,918 tweets. Together that is 496,433 tweets, roughly half a million, which is far from the 8 million reported by @TwitterData.
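To make the gap concrete, here is the arithmetic as a quick sketch. The variable names are mine, and it assumes the “tweets missed” total is complete, which is exactly what I am unsure about.

```python
# Counts from my stream for the window around the match.
collected = 450_515          # tweets actually delivered by the stream
reported_missed = 45_918     # total from the "tweets missed" messages
claimed = 8_000_000          # figure reported by @TwitterData

# Everything the stream says it matched, delivered or not.
matched = collected + reported_missed
print(matched)                       # 496433

# How many times larger the reported figure is than what I can account for.
print(round(claimed / matched, 1))   # 16.1
```

So even counting every tweet the stream says it dropped, I can account for only about one sixteenth of the reported number.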
- Can you tell me exactly how @TwitterData arrives at its numbers, and which tweets are included in that count?
- Is it possible that the “number_missed” messages from the streaming API underreport the missed tweets this severely?