Unexpected behavior with pagination/cursoring -- cannot get more than ~5,200 records


#1

Hello,

I’m having trouble understanding why I’m not able to retrieve large data sets (query result “collections”) even when I’m following the instructions for dealing with the pagination limit of 5,000 records per request:

https://dev.twitter.com/docs/misc/cursoring

For example, when trying to query all of the followers of a particular user who has 38,000+ followers, I’m attempting to use the “next_cursor” value to request (the first) 5,000 records, another 5,000 records, another 5,000 records… etc. My expectation is that for a user with 38,000 followers, I would be able to submit 8 total queries using “/1.1/followers/ids.json” (i.e. 7 queries returning 5,000 results each, and 1 final query returning ~3,000 results).

However, rather than containing 38,000 unique Twitter user IDs, the data I’ve gotten back has only around 5,200 unique records in total.

Here’s the code I’ve used thus far (I’m using the Developer console to check my syntax, and executing using Twurl):

  1. Get a valid OAuth token (consumer key):

bin/twurl authorize --consumer-key XXXXX --consumer-secret YYYYY
Go to https://api.twitter.com/oauth/authorize?oauth_consumer_key=NNNNNNNNNN&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1359667877&oauth_token=TTTTTTTTTTTTTT&oauth_version=1.0 and paste in the supplied PIN

… select “Authorize App” in the web browser, and get the 7-digit PIN. Enter it back in the terminal session:

XXXXXXX
Authorization successful

  2. Query the followers of a specific screen name, then make subsequent queries with the appended “cursor” argument (the loop I’m effectively reproducing by hand is sketched after the commands below):

bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928"
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422051957302770720"
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422054807940119867"

bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422461748394431581"
(etc)
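
In other words, the loop I believe I’m reproducing by hand looks something like this (a minimal sketch; fetch_page is a hypothetical stand-in for the signed twurl request plus JSON parsing):

def all_follower_ids(fetch_page):
    # fetch_page(cursor) is assumed to GET /1.1/followers/ids.json with the
    # given cursor and return the parsed JSON response as a dict.
    ids = []
    cursor = -1                      # -1 requests the first page
    while cursor != 0:               # next_cursor == 0 marks the last page
        page = fetch_page(cursor)
        ids.extend(page['ids'])
        cursor = page['next_cursor']
    return ids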

  3. Filter out just the numerical user IDs from the JSON into a flat text file (1 record per line), using awk (a JSON-aware alternative is sketched after the sample output below):

for i in $(seq 1 5000); do awk -v num="$i" -F, '{print $num}' results_1.txt; done >> set_01.txt

head set_01.txt
55991139
763964120
20368419
517788112
42142914
1132114194
28491079
152356046
851879449
254802477
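
For comparison, here is a JSON-aware way to pull out the same IDs; a minimal sketch, assuming results_1.txt holds one raw response from /1.1/followers/ids.json:

import json

# results_1.txt is assumed to hold a single raw JSON response from
# /1.1/followers/ids.json; the user IDs live in its "ids" field.
with open('results_1.txt') as f:
    data = json.load(f)

with open('set_01.txt', 'w') as out:
    for user_id in data['ids']:
        out.write('%d\n' % user_id)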

  4. Look at the ‘unique-ness’ of these records:

$ wc -l set_0*
5000 set_01.txt
5000 set_02.txt
5000 set_03.txt
5000 set_04.txt
5000 set_05.txt
5000 set_06.txt
5000 set_07.txt
35000 total

$ cat set_0* | sort -u | wc -l
5278

My hypothesis is that something is wrong with the cursoring, or with my understanding of how it works… The public profile for the user in question clearly shows 38,000+ followers… however, each query (to /1.1/followers/ids.json) returns 5,000 records, with (apparently) valid “next_cursor” values, but always with a “previous_cursor” of 0, as if every response were the first page of the data set. I.e.:

…40864563,25812430,22147469,545062890,94819353,56063200,116919283,7317322,683813,38004645,909083328,160868645,29767325],"next_cursor":1422051957302770720,"next_cursor_str":"1422051957302770720","previous_cursor":0,"previous_cursor_str":"0"}

…8779202,209329470,40864563,25812430,22147469,545062890,94819353,56063200,116919283,7317322,683813,38004645,909083328,160868645],"next_cursor":1422054807940119867,"next_cursor_str":"1422054807940119867","previous_cursor":0,"previous_cursor_str":"0"}

…410293144,177192104,18023268,51829103,14429965,590265034,36491986,48312572,5200101,243198516,871136221,5871032,72137006,80813955,576935034,30245078],"next_cursor":1422439552921439430,"next_cursor_str":"1422439552921439430","previous_cursor":0,"previous_cursor_str":"0"}

Given that after 7 queries for the same user I’m getting MORE than 5,000 total followers (5,278), I’m inclined to think that there really are more than 5,000 records… but perhaps I’m getting some pseudo-random 5,000-record subset of the data that is always “close” to the beginning of the set, and because I’m not getting valid “previous_cursor” values I’m not able to “step” through the whole data set.

Any thoughts or suggestion are very much appreciated. Thanks.


#2

Somehow the posting mechanism has truncated my example results at the ‘tail’ end of each query…

I was trying to demonstrate that for each query, there is a “next_cursor” value, but that “previous_cursor” is always = 0

"next_cursor":1422051957302770720,"next_cursor_str":"1422051957302770720","previous_cursor":0,"previous_cursor_str":"0"}
"next_cursor":1422054807940119867,"next_cursor_str":"1422054807940119867","previous_cursor":0,"previous_cursor_str":"0"}
"next_cursor":1422439552921439430,"next_cursor_str":"1422439552921439430","previous_cursor":0,"previous_cursor_str":"0"}


#3

grrrrrr!!!

@#$%@ forms… sigh:

"next_cursor":1422051957302770720,"next_cursor_str":"1422051957302770720","previous_cursor":0,"previous_cursor_str":"0"}

#4

I’m not sure, because I’m not seeing your sequence of requests and responses, but make sure the cursor you’re using is complete: if you keep it in a variable of an integer type, for example, it will not fit, and you will be sending requests with a wrong cursor.

I had a similar problem, and the issue was that I was keeping the cursor in an integer variable; the cursor is larger than the type can hold, so it didn’t fit.
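
For example, a quick check with one of the cursor values from your output shows why a fixed-width number type can’t hold it:

# The cursor overflows a 32-bit integer and even exceeds the 53-bit
# precision of an IEEE-754 double, which is why the API also returns
# string fields like "next_cursor_str".
cursor = 1422051957302770720

print(cursor > 2**31 - 1)            # True: too big for a 32-bit signed int
print(int(float(cursor)) == cursor)  # False: a double rounds it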

Try it, and tell me if that fixes your problem.


#5

@cgruiz79

Thanks, that’s a good suggestion, but unfortunately this is not an issue of types or interpolation – I get this even when running the queries manually (i.e. not in any script) from the command line.

In fact, one thing I failed to mention in my original post is that I usually get the SAME cursor over and over again… at first I thought this could be some sort of rate limiting issue. If I run these two commands back to back:

bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=-1"
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422824612524461833"

where the value “1422824612524461833” is the “next_cursor” value listed in the response to the first query, I won’t get a new (third) “next_cursor” value… I will get “1422824612524461833” over and over again…
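
One way to rule out shell quoting when scripting this is to invoke twurl without a shell, so the whole path (including the &cursor=… part) travels as a single argument; a minimal sketch, assuming bin/twurl as above:

import subprocess

# No shell is involved here, so the & in the query string cannot cause
# word-splitting or backgrounding; twurl receives the full path intact.
cursor = '1422824612524461833'
path = '/1.1/followers/ids.json?screen_name=devops_borat&cursor=%s' % cursor
print(subprocess.check_output(['bin/twurl', path])[:300])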

This looks like a known issue:

https://dev.twitter.com/discussions/9698

…And unfortunately there has been no response or traction on it for the last 26+ weeks… ouch. Now I’m wondering whether the “uniqueness” problem I’m seeing is that, by the time I can submit a “new” valid query (i.e. when I wait 10 or 15 minutes between queries and actually do get a fresh “next_cursor” value back), my 5,000 responses are once again mostly near the beginning of the 38,000 values in the set, so I never get to move through the cursored sections of the data, and I’m really just reporting a variation of that same bug?


#6

OK good news and bad news:

I seem to have gotten nearer to the bottom of the problem:

I decided to run multiple iterations of the query with the successive “next_cursor” values in the web Developer Console. There, I get “proper” next_cursor AND previous_cursor values. Running the query a second and third time and copying and pasting the resulting JSON from 3 total queries yields 15,000 UNIQUE results.

If I run the exact same queries using the Twurl utility, I do NOT get proper “previous_cursor” values – I always get “0” …

So: same queries, two different tools… one tool (the console) returns the expected behavior, while the CLI (twurl) does not. Looks like the Twurl project needs a bug report filed.


#7

Thanks for following up with your findings. I’m able to pull all of the followers using the script below, and my initial attempt at doing this using twurl also seemed to work.

In your original post, the approach of awk’ing the response data for IDs doesn’t seem to extract the IDs properly from the response JSON object (they are returned in the ‘ids’ field). This could be due to a difference in awk versions (?), but it could also leave your output files without the correct number of IDs.

Also, there is the possibility that the response for the user was returning wonky results due to an invalid cache. Please let me know if you’re still seeing the issue and I’ll keep investigating.

  • Sean
#!/usr/bin/env python

import oauth2 as oauth
import json

CREDENTIALS = {
    'secret': '',
    'token': '',
    'consumer_key': '',
    'consumer_secret': ''
}

def main():
    consumer = oauth.Consumer(CREDENTIALS['consumer_key'], CREDENTIALS['consumer_secret'])
    token = oauth.Token(CREDENTIALS['token'], CREDENTIALS['secret'])
    client = oauth.Client(consumer, token)

    followings = []

    cursor = '-1'
    while str(cursor) != '0':
        url = 'https://api.twitter.com/1.1/followers/ids.json?screen_name=devops_borat&cursor=%s' % cursor
        resp, content = client.request(url, 'GET')
        if resp['status'] == '200':
            data = json.loads(content)
            followings.extend(data['ids'])

            print("Cursor %s returned %d results" % (cursor, len(data['ids'])))

            # Python ints have arbitrary precision, so next_cursor is safe
            # here; in other languages, prefer next_cursor_str.
            if 'next_cursor' in data:
                cursor = data['next_cursor']
            else:
                break
        else:
            # Stop on a non-200 response rather than re-requesting the
            # same cursor forever.
            print("Request failed with status %s" % resp['status'])
            break

    print(len(followings))
    print(len(set(followings)))

if __name__ == '__main__':
    main()

Response on my side:

Cursor -1 returned 5000 results
Cursor 1423453487261973809 returned 5000 results
Cursor 1417335960378559173 returned 5000 results
Cursor 1406408345044275028 returned 5000 results
Cursor 1397901444972287843 returned 5000 results
Cursor 1389640108164191797 returned 5000 results
Cursor 1383715518997437910 returned 5000 results
Cursor 1374182676824373058 returned 4763 results
39763
39763