Hello,
I’m having trouble understanding why I can’t retrieve large data sets (large numbers of query results, or “collections”), even though I’m following the instructions for dealing with the 5,000-record pagination limit:
https://dev.twitter.com/docs/misc/cursoring
For example, when trying to query all of the followers of a particular user who has 38,000+ followers, I’m using the “next_cursor” value to request the first 5,000 records, then another 5,000 records, another 5,000 records, etc. My expectation is that for a user with 38,000 followers, I would be able to submit 8 total queries to “/1.1/followers/ids.json” (i.e. 7 queries returning 5,000 results each, and 1 final query returning ~3,000 results).
However, rather than containing 38,000+ unique Twitter user ids, the data I’ve actually gotten back contains only around 5,200 unique records in total.
Here’s the code I’ve used thus far (I’m using the Developer Console to check my syntax, and executing with twurl):
- Get a valid OAuth token (using the consumer key and secret):
bin/twurl authorize --consumer-key XXXXX --consumer-secret YYYYY
Go to https://api.twitter.com/oauth/authorize?oauth_consumer_key=NNNNNNNNNN&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1359667877&oauth_token=TTTTTTTTTTTTTT&oauth_version=1.0
… select “Authorize App” in the web browser, get the 7-digit PIN, and enter it back in the terminal session:
XXXXXXX
Authorization successful
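To double-check that the token actually works before querying, a quick sanity check against the verify_credentials endpoint:
bin/twurl /1.1/account/verify_credentials.json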
- Query the followers of a specific screen name, then make subsequent queries with the appended “cursor” argument (the URLs are quoted so the shell doesn’t treat the “&” characters as command separators):
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928"
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422051957302770720"
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422054807940119867"
…
bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&user_id=295916270900092928&cursor=1422461748394431581"
(etc)
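For reference, here’s the loop I’m effectively running by hand. This is a minimal sketch, assuming jq is available for the JSON parsing; all_ids.txt is just an illustrative file name, and I’ve let screen_name alone drive the query:
# Start at cursor=-1 (the documented first-page value) and follow
# next_cursor until the API returns 0, which marks the end of the set.
cursor=-1
while [ "$cursor" != "0" ]; do
  page=$(bin/twurl "/1.1/followers/ids.json?screen_name=devops_borat&cursor=$cursor")
  echo "$page" | jq -r '.ids[]' >> all_ids.txt
  cursor=$(echo "$page" | jq -r '.next_cursor_str')
done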
- Filter just the numerical user ids out of the JSON into a flat text file (1 record per line), using Awk (a one-pass alternative is sketched after the sample output below):
for i in $(seq 1 5000); do awk -v num="$i" -F, '{print $num}' results_1.txt; done >> set_01.txt
head set_01.txt
55991139
763964120
20368419
517788112
42142914
1132114194
28491079
152356046
851879449
254802477
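For what it’s worth, the same extraction can be done in one pass instead of 5,000 Awk invocations; a sketch, again assuming jq:
jq -r '.ids[]' results_1.txt > set_01.txt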
- Look at the uniqueness of these records:
$ wc -l set_0*
5000 set_01.txt
5000 set_02.txt
5000 set_03.txt
5000 set_04.txt
5000 set_05.txt
5000 set_06.txt
5000 set_07.txt
35000 total
$ cat set_0* | sort -u | wc -l
5278
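As a quick way to see where the duplication comes from, I can also count how many ids two consecutive pages share (a diagnostic sketch; with working cursoring this should be at or near zero):
comm -12 <(sort set_01.txt) <(sort set_02.txt) | wc -l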
My hypothesis is that something is wrong with the cursoring, or with my understanding of how it works. The public record for the user in question clearly states that there are 38,000+ followers. However, each of the queries (to /1.1/followers/ids.json) returns a 5,000-record result with an (apparently) valid next_cursor value, but always with a “previous_cursor” value of “0”, as if every response were the first page of the set. I.e.:
…40864563,25812430,22147469,545062890,94819353,56063200,116919283,7317322,683813,38004645,909083328,160868645,29767325],"next_cursor":1422051957302770720,"next_cursor_str":"1422051957302770720","previous_cursor":0,"previous_cursor_str":"0"}
…8779202,209329470,40864563,25812430,22147469,545062890,94819353,56063200,116919283,7317322,683813,38004645,909083328,160868645],"next_cursor":1422054807940119867,"next_cursor_str":"1422054807940119867","previous_cursor":0,"previous_cursor_str":"0"}
…410293144,177192104,18023268,51829103,14429965,590265034,36491986,48312572,5200101,243198516,871136221,5871032,72137006,80813955,576935034,30245078],"next_cursor":1422439552921439430,"next_cursor_str":"1422439552921439430","previous_cursor":0,"previous_cursor_str":"0"}
Given that after 7 queries of the same user_id I’m getting MORE than 5,000 total followers (5,278), I’m inclined to think there really are more than 5,000 records… but perhaps I’m getting some pseudo-random 5,000-record subset of the data that is always “close” to the beginning of the set, and because I never get valid “previous_cursor” values, I’m not able to “step” through the whole data set.
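One sanity check I still want to run, since I’m passing both screen_name and user_id in the same request: confirm that the two parameters actually resolve to the same account. A sketch using the 1.1 users/show endpoint (the jq filter is just for readability):
bin/twurl "/1.1/users/show.json?user_id=295916270900092928" | jq '{screen_name, followers_count}'
If that doesn’t come back as the account I think I’m querying, the problem is upstream of the cursoring.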
Any thoughts or suggestions are very much appreciated. Thanks.