Setting up daily CRON job to get last day's tweets


#1

I want to run a nightly CRON job that fetches and saves all tweets matching a search term since the last CRON run. Based on the documentation, this seems like the right approach:

– Query my DB for the last stored tweet id and pass it as the since_id in my initial Twitter request. This says, “I only want tweets newer than this one.”
– Twitter then returns paginated tweets, starting with the MOST recent tweets and working back toward my since_id. For each new page, I use the [next_results] => ?max_id=########## value from the previous response to request the next page of tweets.

This should then keep returning pages of tweets, working toward my since_id. Is this correct?
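
To make the loop concrete, here is a minimal sketch in Python. It does not call Twitter at all: fetch_page is a hypothetical stand-in for one search request, and make_fake_fetch simulates it over an in-memory list of ids. It assumes max_id is inclusive, which is why the next page asks for the oldest id seen minus one.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for one search request. It returns tweet ids,
# newest first, with since_id < id <= max_id (max_id assumed inclusive).
FetchPage = Callable[[int, Optional[int]], List[int]]


def collect_since(fetch_page: FetchPage, since_id: int) -> List[int]:
    """Page backward from the newest tweet until since_id is reached."""
    collected: List[int] = []
    max_id: Optional[int] = None  # no upper bound on the first request
    while True:
        page = fetch_page(since_id, max_id)
        if not page:
            break  # nothing left above since_id
        collected.extend(page)
        max_id = min(page) - 1  # next page: strictly older than what we hold
    return collected


def make_fake_fetch(all_ids: List[int], page_size: int) -> FetchPage:
    """Simulate the API over an in-memory list of ids (illustration only)."""
    def fetch(since_id: int, max_id: Optional[int]) -> List[int]:
        matches = sorted(
            (i for i in all_ids
             if i > since_id and (max_id is None or i <= max_id)),
            reverse=True,
        )
        return matches[:page_size]
    return fetch


ids = list(range(101, 111))  # pretend tweets 101..110 exist
tweets = collect_since(make_fake_fetch(ids, page_size=3), since_id=104)
print(tweets)  # [110, 109, 108, 107, 106, 105]
```

In a real job, fetch_page would wrap the actual search call and the max_id would come from next_results rather than being computed by hand.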

The issue I see with this approach is that if something interrupts the CRON, or I run out of requests before reaching my since_id, there will be a gap of tweets that were never retrieved. The next CRON run will then take its since_id from the newest stored tweet id, which sits above the gap, so that missing set is never fetched.
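
A small simulation of that failure, again in Python and again without touching the real API. The ids, page size, and request budget are made-up numbers, and run_with_budget is a hypothetical helper that stops after a fixed number of requests.

```python
from typing import List, Optional


def run_with_budget(available: List[int], since_id: int,
                    page_size: int, max_requests: int) -> List[int]:
    """Page backward from the newest id, stopping when the budget runs out."""
    saved: List[int] = []
    max_id: Optional[int] = None
    for _ in range(max_requests):
        page = sorted(
            (i for i in available
             if i > since_id and (max_id is None or i <= max_id)),
            reverse=True,
        )[:page_size]
        if not page:
            break  # reached since_id, the run completed normally
        saved.extend(page)
        max_id = min(page) - 1
    return saved


available = list(range(101, 121))  # pretend tweets 101..120 exist
last_stored = 100                  # since_id for tonight's run

# Only 2 requests are allowed, so the run stops before reaching id 100.
saved = run_with_budget(available, last_stored, page_size=5, max_requests=2)

next_since_id = max(saved)  # what "last stored tweet id" becomes tomorrow
skipped = [i for i in available if i > last_stored and i not in saved]

print(next_since_id)            # 120
print(skipped[0], skipped[-1])  # 101 110
```

Tomorrow's run starts from since_id 120, so ids 101 to 110 are never requested again. In this simulation the gap is bounded by two values the interrupted run already had in hand: its original since_id (100) and the max_id it was about to request next (110).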

I’m mainly trying to check whether I am understanding this correctly, and if there is a better way to accomplish what I want without missing tweets.


#2

I think that’s pretty much how I would implement it.


#3

I still don’t know how to deal with the issue where the process stops before completion, leaving a gap of missed tweets that I don’t know how to recover.

Any thoughts on how to deal with that?

Thanks for responding!


#4

Andy – I’m not sure if you’re still following this post, but I’m still wondering about the issue I describe above, where a run of API requests using since_id fails before completing, leaving a gap of missed tweets.


#5

I’m afraid I don’t have any better way of resolving the issue than the one you’ve described. It could indeed lead to that gap, but the only solution I can think of would be to use something like GNIP (which is commercial).