Suddenly not receiving any tweets


#1

I’ve been working on a Scala program for the past week, and it was working perfectly fine until a few hours ago.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterPopularTags {

  def main(args: Array[String]) {
    // ...

    // 60-second batches on a local master with two threads
    val ssc = new StreamingContext("local[2]", "TwitterPopularTags", Seconds(60), System.getenv("SPARK_HOME"))

    // None = let twitter4j pick up OAuth credentials from system properties
    val stream = TwitterUtils.createStream(ssc, None)
    stream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

That’s my code: it opens a stream and prints each batch. It was showing tweets, but now it isn’t.


#2

What error are you seeing?


#5

Self-bump out of desperation.


#6

There is no error; it just seems like it isn’t receiving data, even though it was before. The normal behaviour was:

Time: 1312300000
{Tweet here}
{Tweet here}

Time: 1312360000
{Tweet here}
{Tweet here}

But now it goes like this, all the time:

Time: 1312300000

Time: 1312360000

This is my build.sbt. The only change I made before the no-tweets behaviour appeared was the Spark version, from 0.9.0 to 1.5.0, in case it helps:

name := "TwitterPopularTags"

version := "0.1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.5.0",
  "com.github.scala-incubator.io" %% "scala-io-file" % "0.4.2",
  "org.apache.spark" % "spark-streaming-twitter_2.10" % "1.5.0",
  "org.apache.spark" % "spark-streaming_2.10" % "1.5.0",
  "org.twitter4j" % "twitter4j-core" % "3.0.6",
  "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
)

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.2" excludeAll ExclusionRule(organization = "javax.servlet")

libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.7.2" excludeAll ExclusionRule(organization = "javax.servlet")

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
  }
}


#7

Are you able to connect with the same filter from any other language or environment, e.g. twurl at the command line, using the same keys?

Is there any way to enable a debug mode in your Spark / Scala code, to make sure nothing is swallowing and hiding an error?


#8

Yes, I’m able to connect using twurl. Neither Spark nor Scala throws any error or exception. The strangest part is that if I set my sbt back to the 0.9.0-incubating version of Spark, it does get tweets, but I need to use 1.5.0 or later.
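For reference, this is roughly how I turned debug logging on (a minimal sketch; SparkContext.setLogLevel exists from Spark 1.4 onwards, so older versions need a log4j.properties on the classpath instead):

// Raise the log level on the context created in main; DEBUG is very
// verbose, so WARN or INFO may be enough to spot a dying receiver.
ssc.sparkContext.setLogLevel("DEBUG")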


#9

That sounds odd - I wonder what changes happened between those versions to cause this issue.


#10

I solved it. One of the Spark workers had stopped working, but it threw no exception, not even in debug mode; I had to search through an ocean of logs. Thanks for the support.
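In case it helps anyone else, attaching a streaming listener would have surfaced the dead receiver straight away. A minimal sketch (the println handlers are just placeholders for real alerting):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverError, StreamingListenerReceiverStopped}

// Register before ssc.start(); reports receiver failures instead of
// letting the job silently produce empty batches.
ssc.addStreamingListener(new StreamingListener {
  override def onReceiverError(error: StreamingListenerReceiverError): Unit =
    println(s"Receiver error: ${error.receiverInfo.lastErrorMessage}")
  override def onReceiverStopped(stopped: StreamingListenerReceiverStopped): Unit =
    println(s"Receiver stopped: ${stopped.receiverInfo.lastError}")
})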

By the way, is there a way to obtain more tweets?


#11

Glad to hear you got it fixed. What do you mean by “more Tweets”? The Streaming API is limited to 1% of the total firehose volume, so depending on your filter terms, you may well already be receiving all of the Tweets on the topic you are searching for.


#12

Oh, okay, I get it now. I thought that if I filtered the stream for “banana”, for example, I’d get the same number of tweets as without the filter. If that’s the case, is there a way to get more than that 1%?


#13

Right - so if the Tweets matching your search term make up less than 1% of the total Tweet volume at any one time, you’ll get all of them; if you’re searching for a particularly high-volume / popular term, then you’ll “miss” some of the Tweets due to the volume cap.
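For completeness, the filter terms are passed straight to the stream itself. A minimal sketch using the spark-streaming-twitter API you already have:

import org.apache.spark.streaming.twitter.TwitterUtils

// Deliver only Tweets matching at least one filter term. If the matching
// volume stays under ~1% of the total firehose, every match comes through.
val filtered = TwitterUtils.createStream(ssc, None, Seq("banana"))
filtered.print()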

You can apply for elevated access to a greater percentage of the sample volume, but that is rarely granted. If you need a reliable source of more data, our enterprise products from Gnip would be my recommendation: they provide full firehose access and are fully supported.