At Engagor we built an incremental crawler that performs search requests using the since_id parameter. Each time a tweet is found we bump the since_id, so that we only fetch tweets more recent than the newest one we have already seen. Lately we have been missing some tweets. After some investigation we found the following.
We did a search request on Tuesday, March 10, 2015, at 07:02:27 with the following params:
{"q":"@nmbs","count":100,"result_type":"recent","since_id":"575187765559713792"}
We got back a tweet with id 575189806499635201 (posted Tuesday, March 10, 2015, 07:01:51). However, this request should also have returned the tweet with id 575189802754129920 (posted Tuesday, March 10, 2015, 07:01:50). If we rerun the original request now, we do get the tweet we missed…
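The posting times can be cross-checked from the ids themselves. This is a minimal sketch, assuming both ids are Snowflake ids whose bits above the lowest 22 hold milliseconds since a custom epoch of 1288834974657 ms; the function name is made up for illustration and is not part of any Twitter library.

```python
from datetime import datetime, timezone

# Assumption: Snowflake layout, with the timestamp stored above the low 22 bits
# and counted in milliseconds from this custom epoch.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Decode the creation time embedded in a Snowflake-style tweet id (UTC)."""
    ms_since_unix_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms_since_unix_epoch / 1000, tz=timezone.utc)


returned_id = 575189806499635201  # the tweet we got back
missed_id = 575189802754129920    # the tweet that was missing

for tweet_id in (missed_id, returned_id):
    print(tweet_id, snowflake_to_datetime(tweet_id).isoformat())
```

Under that assumption the missed tweet decodes to a time slightly earlier than the returned one, and both ids are larger than the since_id in the request, so the missed tweet was inside the requested range.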
Is it possible that in some cases newer tweets are returned before older ones? If so, is there a reliable way to build an incremental crawler?
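For reference, the crawl loop described above can be sketched roughly as follows. This is a simplified illustration, not our production code: `crawl_once` and `make_fake_search` are hypothetical names, and the fake search backend only simulates the guess that an older tweet can become searchable after a newer one.

```python
from typing import Callable, Dict, List, Tuple

Tweet = Dict[str, int]  # hypothetical minimal tweet shape: {"id": <tweet id>}


def crawl_once(search: Callable[[dict], List[Tweet]], query: str,
               since_id: int) -> Tuple[List[Tweet], int]:
    """Run one incremental search; return the tweets found and the bumped since_id."""
    params = {"q": query, "count": 100, "result_type": "recent",
              "since_id": str(since_id)}
    tweets = search(params)
    # Bump since_id to the highest id seen so the next request skips older tweets.
    new_since_id = max([since_id] + [t["id"] for t in tweets])
    return tweets, new_since_id


def make_fake_search(batches: List[List[Tweet]]) -> Callable[[dict], List[Tweet]]:
    """Hypothetical stand-in for the search API: each call indexes one more batch."""
    indexed: List[Tweet] = []
    pending = iter(batches)

    def search(params: dict) -> List[Tweet]:
        indexed.extend(next(pending, []))
        return [t for t in indexed if t["id"] > int(params["since_id"])]

    return search


# Assumed scenario: the newer tweet becomes searchable before the older one.
search = make_fake_search([
    [{"id": 575189806499635201}],  # newer tweet, visible on the first request
    [{"id": 575189802754129920}],  # older tweet, visible only on the second
])

since_id = 575187765559713792
first, since_id = crawl_once(search, "@nmbs", since_id)
second, since_id = crawl_once(search, "@nmbs", since_id)
print(first, second)
```

In this simulation the second request returns nothing, because since_id has already moved past the older tweet by the time it becomes searchable. That is the behaviour we seem to be observing.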