Twitter REST API v1.1 "search/tweets" pagination problem


#1

Greetings,

First of all, this is not a question but what I believe to be a bug report. I found a legitimate problem in the “search/tweets” REST API v1.1 endpoint [1]. The problem is NOT severe, but it causes trouble for some Twitter libraries that make reasonable assumptions. I explain it below.

In the aforementioned API endpoint, the documentation states that for optimal search results page navigation, the “Working with Timelines” official guide [2] should be followed, including the usage of the “since_id” and “max_id” API parameters. This guide is very concise and clear on the best practices for paging. In particular, the last figure shown in this guide states that the following workflow should be implemented:

  1. Set the “since_id” parameter to the Id of the last Tweet your application has processed. The “count” parameter can optionally be set as well. Then make the first API request.
  2. In subsequent requests, provide the “since_id” parameter again with the SAME VALUE as in the initial request, and also provide the “max_id” parameter set to the lowest Tweet Id your application received in the previous request, minus 1 (to account for inclusive matching). Repeat this step until you have paged through all the results.
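To make the two steps concrete, here is a minimal sketch in Python. It is only an illustration, not code from the guide or from any library: `page_through` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that pretends the API holds Tweets with Ids 1 to 10 instead of making a real request.

```python
from typing import Callable, Dict, Iterator, List, Optional

Tweet = Dict[str, int]  # hypothetical minimal Tweet: only the "id" key is used


def page_through(
    fetch: Callable[[Dict[str, int]], List[Tweet]],
    since_id: int,
    count: int = 100,
) -> Iterator[Tweet]:
    """Yield every Tweet newer than since_id, newest first, page by page."""
    max_id: Optional[int] = None
    while True:
        # since_id keeps the SAME value on every request (step 1 and step 2).
        params = {"since_id": since_id, "count": count}
        if max_id is not None:
            params["max_id"] = max_id
        page = fetch(params)
        if not page:
            return  # an empty page means there is nothing left to read
        yield from page
        # Lowest Id received so far, minus 1 because max_id is inclusive.
        max_id = min(tweet["id"] for tweet in page) - 1


def fake_fetch(params: Dict[str, int]) -> List[Tweet]:
    """Hypothetical stand-in for the API: Tweets with Ids 1..10, newest first."""
    upper = params.get("max_id", 10)
    ids = [i for i in range(10, 0, -1) if params["since_id"] < i <= upper]
    return [{"id": i} for i in ids[: params["count"]]]
```

With `since_id=3` and `count=3`, the sketch issues requests with `max_id` unset, then 7, then 4, then 3, and stops when the last page comes back empty.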

This approach is very efficient and straightforward to implement. Moreover, the “search/tweets” endpoint responses also provide a “search_metadata” field that contains information about the search itself, such as the time spent, and in particular a “next_results” field containing a URL-encoded string of request parameters that can be used directly for the subsequent API call. This can be seen in [1] in lines 404-414 of the example payload.

Now the actual problem: as shown above, this field includes everything needed for the next API request (including “max_id” parameter, the query, and query parameters like “result_type”). However, the “since_id” parameter is NOT included in this string. If you never explicitly provided “since_id” from the beginning, the paging will work fine. However, if you did provide it, the subsequent pages will NOT be returned correctly.
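A client-side workaround could look like the following minimal sketch, which adds the missing “since_id” back to the “next_results” string before it is reused. The function name `with_since_id` is hypothetical and the query values in the usage note are made up for illustration; only the Python standard library is used.

```python
from urllib.parse import parse_qsl, urlencode


def with_since_id(next_results: str, since_id: int) -> str:
    """Return next_results with since_id appended when it is absent."""
    # next_results starts with "?", which is not part of the first key.
    pairs = parse_qsl(next_results.lstrip("?"), keep_blank_values=True)
    if not any(key == "since_id" for key, _ in pairs):
        pairs.append(("since_id", str(since_id)))
    # Re-encode so that characters such as "#" stay percent-encoded.
    return "?" + urlencode(pairs)
```

For example, `with_since_id("?max_id=99&q=%23java&count=100", 50)` gives `"?max_id=99&q=%23java&count=100&since_id=50"`. A string that already carries “since_id” is returned with its parameters unchanged.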

I discovered this problem while using the widely known Java library Twitter4J [3]. This library relies exclusively on the “next_results” field returned by Twitter to construct the next API request, as shown in the official example bundled with it [4]. That search fails to return correct results for subsequent pages if you specify a “since_id” value.

I managed to circumvent this problem by manually constructing subsequent API calls and setting explicitly the “since_id” and “max_id” values for each call by locally computing them, thus effectively bypassing Twitter4J’s native paging mechanism (based on Twitter’s search metadata field). One may think that the problem should be fixed in Twitter4J, but the library assumes that the fully correct string is returned by Twitter to operate correctly.

This problem was even reported as a library bug before [5] (so I’m not the only affected user). However, I don’t think the problem is in Twitter4J, but in Twitter’s “search/tweets” API endpoint.

I consider Twitter4J’s assumption of correctness fair enough, and I think Twitter should return a “next_results” query string that fully matches the initial search parameters, including the “since_id” value (which is currently missing). This is because, as the name implies, “next_results” should lead to the next results page without any further intervention. For these reasons, I believe the engineering team should at least consider this issue for discussion.

Best regards,
Hugo.

[1] https://dev.twitter.com/docs/api/1.1/get/search/tweets
[2] https://dev.twitter.com/docs/working-with-timelines
[3] http://twitter4j.org/
[4] https://github.com/yusuke/twitter4j/blob/master/twitter4j-examples/src/main/java/twitter4j/examples/search/SearchTweets.java
[5] http://issue.twitter4j.org/youtrack/issue/TFJ-782


#2

Hi Hugo!

We have recently run into the same problem as yours.

Namely, we used the Twitter4J library to get tweets from the search API, and we were surprised by the very low number of results, even for quite popular search queries. After investigating, we discovered exactly the same problem: Twitter4J relies exclusively on the “search_metadata.next_results” field of the JSON response when implementing QueryResult.nextQuery(). We tried some raw queries, only to find that 4 out of 10 invocations of /1.1/search/tweets.json do not return “next_results” in the resulting JSON.
We ran all of those queries with the “count=100” parameter. When we queried the API without “count”, we always got a proper “next_results” value.

I think this is erroneous behaviour on the Twitter API side, as it might mislead various applications (and their developers :) ).

The only solution we found is to get the maxId manually and handcraft the next search invocation, which is exactly what the “Working with Timelines” guide describes.
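As a rough illustration of that handcrafting, the sketch below builds the parameters for the next request from a parsed JSON response. It uses “next_results” when present and otherwise falls back to the lowest Tweet Id minus 1. The function name `next_params` is hypothetical, and the sketch assumes the response keeps its Tweets under “statuses”, each with a numeric “id”; treat it as a sketch under those assumptions rather than a definitive implementation.

```python
from typing import Dict, Optional
from urllib.parse import parse_qsl


def next_params(response: dict, original: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Build the next request's parameters, or return None when the page is empty."""
    statuses = response.get("statuses", [])
    if not statuses:
        return None  # nothing was returned, so there is no further page
    # Start from the original parameters so since_id, q and count survive.
    params = dict(original)
    next_results = response.get("search_metadata", {}).get("next_results")
    if next_results:
        # Let the server-provided values override ours where they exist.
        params.update(parse_qsl(next_results.lstrip("?")))
    else:
        # Fallback: lowest Id on this page, minus 1 (max_id is inclusive).
        params["max_id"] = str(min(status["id"] for status in statuses) - 1)
    return params
```

Because the fallback cannot tell a genuinely final page from a missing “next_results”, it may cost one extra request that comes back empty, at which point the function returns None.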

Best,
Mateusz & Nines