Difference between downloaded and parsed Tweets from Streaming API



I am downloading tweets from the Streaming API, but after the session ends, the number of parsed tweets is almost 50% lower than the number of downloaded tweets. For example:

"Capturing tweets…
Connection to Twitter stream was closed after 86404 seconds with up to 175507 tweets downloaded.
89271 tweets have been parsed."

Any idea how I can parse all of the downloaded tweets?



What code are you using that reports this message? This is not something that comes back from the Twitter API. What is “parsing” in this context?


Here is the code in R:

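library(streamR)  # filterStream(), parseTweets()
library(DBI)      # dbWriteTable(); assumes con is a DBI connection

# keywords, credential, con and clean() are defined elsewhere
tweetFileName <- "SunJul17201604AM.json"
filterStream(file.name = tweetFileName, track = keywords,
             oauth = credential, timeout = 10, language = "en")
df <- parseTweets(tweets = tweetFileName)
assign(paste(Sys.Date()), df)  # keep a copy named after today's date
temp3 <- clean(df)
dbWriteTable(con, tweetFileName, value = temp3)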


No matter what the timeout is, the number of parsed tweets is always lower than the number of downloaded tweets.


Well, I don’t know R very well, and I’ve never used the streamR package in particular, but according to the documentation for the parseTweets function that I just read:

The total number of tweets that are parsed might be lower than the number of lines in the file or object that contains the tweets because blank lines, deletion notices, and incomplete tweets are ignored.
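
To see how much each of those three categories contributes, you could count them in the raw file yourself. Below is a rough sketch in base R. count_line_types is a hypothetical helper, not part of streamR, and its "incomplete" check is only a heuristic (a line that does not start with "{" and end with "}"), not necessarily what parseTweets does internally. Deletion notices are recognised by their top-level "delete" key.

```r
# Hypothetical helper: classify each line of a streamed tweet file to see
# why the parsed count can be lower than the number of lines.
count_line_types <- function(lines) {
  trimmed <- trimws(lines)
  blank <- !nzchar(trimmed)
  # Deletion notices are JSON objects whose first key is "delete".
  deletion <- grepl('^\\{\\s*"delete"\\s*:', trimmed)
  # Heuristic only: treat a line as complete if it looks like one JSON object.
  complete <- grepl("^\\{.*\\}$", trimmed)
  c(
    total      = length(lines),
    blank      = sum(blank),
    deletion   = sum(deletion),
    incomplete = sum(!blank & !complete),
    candidates = sum(complete & !deletion)  # lines that could be real tweets
  )
}

# Usage on the file from this thread (assumes it exists on disk):
# count_line_types(readLines("SunJul17201604AM.json", warn = FALSE))
```

If "candidates" is close to the number of rows returned by parseTweets, the gap is explained by the ignored lines rather than by lost tweets.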

If you want to know more, you will probably have to ask the author of the package (unless anyone else here on our forums can help you).


Just throwing this out there: depending on your keywords, you may be picking up a lot of non-English tweets, or tweets in English from users who don’t have English as their language preference. Have you tried removing the English filter?
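
One way to check this after a run without the language filter is to tabulate the tweet language against the user's language setting. This is only a sketch: it assumes the data frame from parseTweets has lang and user_lang columns, language_breakdown is a hypothetical helper, and the toy data below is made up purely to show the shape of the output.

```r
# Hypothetical helper: cross-tabulate tweet language against the
# language preference of the posting user.
language_breakdown <- function(df) {
  table(tweet_lang = df$lang, user_lang = df$user_lang, useNA = "ifany")
}

# Toy data standing in for the output of parseTweets(); values are made up.
toy <- data.frame(
  lang      = c("en", "en", "es", "en"),
  user_lang = c("en", "fr", "es", "en"),
  stringsAsFactors = FALSE
)
language_breakdown(toy)
```

With real data you would call language_breakdown(df) on the parsed data frame and look at how many rows fall outside the "en" row or column.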