I have two slightly different but related issues, but first I should give a bit of background.
We have four months of geo-located data collected from the Streaming API. After some correspondence with Twitter we were warned that we could be in breach of the T&Cs. We stopped the application and then purchased a further three months of data. Twitter agreed that the T&Cs for purchasing the data also covered the use of the data that we had already collected ourselves, giving a dataset covering seven months.
One of our quality checks involves looking at the distribution, across all users, of the dates of each user's first and last tweets. We have noted a discontinuity that corresponds to the cutover between the self-collected and the purchased data. We have established that this is mostly explained by protected accounts. Specifically, tweets from accounts that were public when we collected them but were subsequently protected are still in the self-collected data. However, tweets from accounts that were protected at the time the purchased data was extracted are not in the purchased data.
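For reference, the check amounts to something like the sketch below. The record layout (a `(user_id, created)` pair per tweet) and the function names are assumptions made for illustration, not our actual pipeline, and the sample dates are made up.

```python
from collections import Counter
from datetime import date

def first_last_by_user(tweets):
    """Map each user_id to the (first, last) dates of that user's tweets."""
    ranges = {}
    for user_id, created in tweets:
        if user_id in ranges:
            first, last = ranges[user_id]
            ranges[user_id] = (min(first, created), max(last, created))
        else:
            ranges[user_id] = (created, created)
    return ranges

def last_date_histogram(ranges):
    """Count how many users have their last tweet on each date."""
    return Counter(last for _, last in ranges.values())

# Made-up sample records: (user_id, date of tweet)
sample = [
    (1, date(2000, 1, 1)),
    (1, date(2000, 3, 1)),
    (2, date(2000, 2, 1)),
]
ranges = first_last_by_user(sample)
histogram = last_date_histogram(ranges)
```

A spike in the histogram of last-tweet dates at the cutover is the discontinuity described above.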
My first question is: is there any way of identifying the user IDs of protected accounts? The idea would be to use this information to strip protected accounts out of the dataset. This would not only help with respecting the privacy of users who have retrospectively protected their accounts, it would also provide a more consistent dataset.
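If such a list of user IDs were available, the stripping step itself would be simple. The following is only a sketch: the `(user_id, text)` record layout and the name `strip_protected` are hypothetical, and it says nothing about how the list of protected IDs would be obtained, which is the actual question.

```python
def strip_protected(tweets, protected_ids):
    """Return only the tweets whose user_id is not in protected_ids.

    tweets: iterable of (user_id, text) pairs (hypothetical layout).
    protected_ids: iterable of user IDs known to be protected.
    """
    protected = set(protected_ids)  # set gives constant-time membership tests
    return [tweet for tweet in tweets if tweet[0] not in protected]

# Made-up sample: user 1 is assumed to have protected their account
sample = [(1, "first"), (2, "second"), (1, "third")]
kept = strip_protected(sample, [1])
```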
A related issue is a second discontinuity that occurs on 16 September. My second question is: did anything unusual happen with Twitter around that date that might have caused users to stop geolocating their tweets?