Detecting trends from Twitter requires listening to real-time Twitter APIs and processing Tweets on the fly. And while trend detection can be complex work, to categorize trends, Tweet themes and topics must also be identified—another potentially complex endeavor as it involves integrating with NER (Named Entity Recognition) / NLP (Natural Language Processing) services.
The Twitter API Toolkit for Google Cloud: Filtered Stream solves these challenges and supports the developer with a trend detection framework that can be installed on Google Cloud in 60 minutes or less.
Please use this thread to report issues/defects with the toolkit.
This is a great tutorial, thank you @prasannacs.
I had a question about having the tweet loader job run in the background to continue to accumulate tweets in BigQuery. It seems that regardless of the cron job scheduler, if the connection to the stream is lost which would occur when exiting the terminal, tweets stop loading into BigQuery. How would one keep the connection to the stream without having to be in the Google Cloud Terminal?
Hi Kaivman17,
Thanks for trying it out. You can connect to the stream by triggering the URL
curl https://<<APP_ENDPOINT_URL>>/stream
Thanks for the response @prasannacs!
But wouldn’t one need to be in the Google Cloud terminal to write the curl command? In other words - is there a way to stay connected to the stream?
The code is deployed to the AppEngine, which will be continuously running. The curl command is to trigger the initial connection.
You can view the AppEngine logs with the command: gcloud app logs tail -s default
It appears that the connection to the stream gets lost after about 1 hour of listening to tweets, even while connected to the Google Cloud Shell. Would there need to be a separate cron job that triggers the connection to the stream if one wanted to continue listening to tweets without being active in the terminal/Cloud Shell?
Hi,
The toolkit deployed on AppEngine will attempt to reconnect automatically if you have configured the cloud scheduler (CRON). Unfortunately, the Filtered Stream API does not have connection monitoring capabilities, unlike the PowerTrack API.
You can also manually reconnect to the stream by triggering curl https://<<APP_ENDPOINT_URL>>/stream/connect
Prasanna
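For anyone who wants to script the manual reconnect above instead of typing curl, here is a minimal Python sketch. The function names are hypothetical and not part of the toolkit, the /stream/connect path is taken from the post above, and whether the endpoint returns promptly or holds the request open is an assumption, hence the timeout.

```python
from urllib.request import urlopen


def build_connect_url(app_endpoint_url: str, path: str = "stream/connect") -> str:
    """Join the App Engine base URL and the connect path with exactly one slash."""
    return app_endpoint_url.rstrip("/") + "/" + path.lstrip("/")


def trigger_connect(app_endpoint_url: str, timeout: float = 30.0) -> int:
    """Send a GET to the connect endpoint and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for non-2xx responses and
    urllib.error.URLError on network failures or timeouts.
    """
    with urlopen(build_connect_url(app_endpoint_url), timeout=timeout) as response:
        return response.status
```

Calling trigger_connect with your own App Engine URL from any machine has the same effect as the curl command; nothing about it requires Cloud Shell.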
Thanks for bearing with me here and apologies for any of my misunderstandings @prasannacs
So to my understanding, the cron job pulls from the PubSub topic, but the tweet stream that sends tweets to the topic will disconnect after around an hour, and the toolkit does not support automated Twitter API reconnection.
I really appreciate your time!
The streaming API (FilteredStream) is a live HTTP connection that can be disconnected from the server side. So the CRON listener will try to reconnect if it pulls zero Tweets more than three times (the threshold is configurable; see reconnectCounter in confg.js). The toolkit is configured for reconnects but does not guarantee against data loss.
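The reconnect rule described above can be sketched roughly as follows. This is not the toolkit's actual code: the class and method names are hypothetical, and it assumes the zero-Tweet pulls must be consecutive and that the counter resets after a reconnect is requested.

```python
from dataclasses import dataclass


@dataclass
class ReconnectMonitor:
    # Hypothetical stand-in for the configurable reconnect threshold.
    reconnect_counter: int = 3
    # Number of consecutive scheduler runs that pulled zero Tweets.
    empty_pulls: int = 0

    def record_pull(self, tweet_count: int) -> bool:
        """Record one scheduler run; return True when a reconnect should be attempted."""
        if tweet_count > 0:
            # Any non-empty pull resets the streak.
            self.empty_pulls = 0
            return False
        self.empty_pulls += 1
        if self.empty_pulls > self.reconnect_counter:
            # "More than three times" means the fourth empty pull triggers.
            self.empty_pulls = 0
            return True
        return False
```

With a 5-minute schedule and a threshold of 3, this sketch would request a reconnect on the fourth empty run, about 20 minutes after the last Tweet was pulled from PubSub.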
Got it, so reconnects to the API are configured.
Unfortunately, the reconnects don’t appear to be working for me.
Below I connect to the stream around 18:47:
But appear to lose connection to the API around 19:04 and am clearly unable to reconnect to the API as tweets are no longer being pulled:
I reconnect to the stream via the curl command and tweets are loading again:
However, I am trying to listen to tweets throughout the day without being next to my machine, so manually reconnecting to the stream is not ideal.
Below is a screenshot of my cron job scheduler configuration
Is there anything that I am doing wrong in my configuration that could be leading to this?
Thanks for the time and help!
Since your rules match very few Tweets, can you try setting the CRON schedule to every 5 minutes?
Thank you for that suggestion @prasannacs. I changed the cron scheduler to every 5 minutes and this allowed me to stay connected to the API for much longer.
I connect to the stream at 17:01:
And remain connected until 19:35, but unfortunately am unable to reconnect.
It has received zero tweets more than 3 times at this point and is still unable to reconnect.
Hi, could you please tell me if it is possible to write an endpoint that would close an already running stream?
You can kill the process/container to disconnect from the stream. There is no Twitter API to disconnect from the stream.
From your screenshot, I don’t see zero Tweets pulled three consecutive times. Please note that even after the stream disconnects, the cloud scheduler could still be pulling Tweets from PubSub because of previously available messages/Tweets. The reconnection may be instant after a disconnect, and you would still have some data loss. If data loss is critical, you will need Enterprise API access (PowerTrack), which has redundancy features.
So in this screenshot:
The last few runs of the cron job have resulted in 0 tweets pulled from PubSub - wouldn’t that mean that there are no more tweets available in PubSub and the connection to the stream is lost and it needs to be reconnected?
In any case, I will start to explore Enterprise API access to solve this, thank you for the suggestion.
Looks like there aren’t any Tweets to pull, and I don’t see disconnects. When a disconnect eventually occurs, you will find the log message “reconnecting to stream”, which is the actual confirmation that a disconnect occurred.
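One way to check for that message without watching the terminal is to filter the log output. The sketch below is only an illustration: the function name is hypothetical, the sample log lines in the usage note are made up, and it assumes the message appears verbatim (in any letter case) somewhere in each relevant log line.

```python
from typing import Iterable, List

# Message text quoted in the post above; the exact log format is an assumption.
RECONNECT_MARKER = "reconnecting to stream"


def find_reconnects(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a stream reconnect attempt."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if RECONNECT_MARKER in line.lower()
    ]
```

For example, passing sys.stdin to find_reconnects while piping in the output of gcloud app logs tail -s default would list only the lines that confirm a disconnect.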
That is interesting. Below are my rules:
I am 100% sure that tweets containing or referring to “49ers” are being posted constantly throughout the day, so it doesn’t make sense to me that there would not be any tweets to pull. Is there any error in my stream rules that you see that could lead to this?
I can verify here that there are more tweets and the stream rules don’t appear to be an issue. Here no tweets are being pulled:
So I then run the curl command again to manually reconnect to the stream:
And now tweets are being pulled again:
Thanks for this great tutorial @prasannacs
When I tried step 11, tailing the log file for the deployed application, I got stuck and can’t receive the tweets
Could you please advise on the revision?
Thank you and happy new year!
Best Regards,
ty