I have access to the academic Twitter API, and I am studying the set of tweets that contain particular political candidates’ names in different time periods.
My problem is that there are very large numbers of tweets for some of the candidates. If I just do a keyword search for their name, I will very quickly reach the 10 million tweet limit without getting data on all of the candidates. Also, I really don’t need millions of tweets for each candidate, just maybe 100,000 or so.
I was thinking it might be good to just take a random sample of the tweets in each time period for each keyword, and only request those tweets from the API. I need a sample that is roughly evenly distributed throughout the time period. Is there a way to do this? (Obviously, I know how to randomly sample after requesting from the API, but this defeats the purpose of the sampling.)
This is not currently easy to do, but twarc will hopefully have a way to do it soon: see "Random sample option" by igorbrigadir (Pull Request #459, DocNow/twarc on GitHub).
For now, I suggest narrowing the time range and keywords down even further. You can use the counts endpoint to get an idea of exactly how many tweets fall in a given time range; see the twarc2 docs for details.
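As a rough illustration of the narrowing approach, here is a minimal Python sketch that turns per-day counts into one short, randomly placed time window per day, sized so that the windows together should hold roughly the number of tweets you want. It is only a sketch under assumptions: the function name `sample_windows` and the shape of `daily_counts` are made up for this example, the counts are assumed to be ones you already collected from the counts endpoint, and tweets are assumed to be spread evenly within each day (they are not, so the totals will be approximate). Note that this samples time windows, not individual tweets, so it is not a true random sample.

```python
import random
from datetime import datetime, timedelta, timezone


def sample_windows(daily_counts, target_total, seed=None):
    """Pick one random sub-window per day, sized to reach about target_total tweets.

    daily_counts is a hypothetical input: a dict mapping the start of each day
    (UTC datetime) to the number of matching tweets on that day, for example
    as collected beforehand from the counts endpoint.
    """
    rng = random.Random(seed)
    one_day = timedelta(days=1)
    # Skip days with no matching tweets; there is nothing to request there.
    days = sorted(d for d, c in daily_counts.items() if c > 0)
    if not days:
        return []
    # Spread the target evenly over the days that have tweets.
    per_day_target = target_total / len(days)
    windows = []
    for day in days:
        # Fraction of the day expected to contain per_day_target tweets,
        # assuming a uniform rate within the day (an approximation).
        # Days with fewer tweets than the per-day target are taken whole,
        # so the overall total can fall short of target_total.
        fraction = min(1.0, per_day_target / daily_counts[day])
        length = one_day * fraction
        # Random start chosen so the window stays inside the day.
        start = day + (one_day - length) * rng.random()
        windows.append((start, start + length))
    return windows


if __name__ == "__main__":
    # Made-up counts for illustration only.
    counts = {
        datetime(2020, 10, 1, tzinfo=timezone.utc): 500_000,
        datetime(2020, 10, 2, tzinfo=timezone.utc): 20_000,
    }
    for start, end in sample_windows(counts, target_total=100_000, seed=1):
        print(start.isoformat(), end.isoformat())
```

Each printed pair can then be used as the start and end time of a separate search request for that keyword. Picking a different random start on each day keeps the windows from always landing on the same time of day, which matters if tweet volume or content varies by hour.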
One more question: is it possible to pre-filter tweets on any variables other than the operators listed at the bottom of this page? For example, it would be nice to request only tweets with a certain number of likes or replies.
Thanks so much!
No, unfortunately. Apart from the start and end dates, the operators listed in the docs are the only filters available.