For my research I’ve obtained a dataset from Harvard Dataverse and have hydrated the tweets it contains. My difficulty arises in expanding the hydrated tweets to obtain the cascades within which they reside.

I need to get all of the replies, quote retweets, and retweets for 8M hydrated tweets.

Examining how many replies each hydrated tweet has, I estimate that I will pull 5.5M tweets in replies alone for every 250k tweets hydrated. At that rate, I'll blow through the Academic Research cap of 10M tweets/month extremely quickly, and I won't have even begun collecting quote retweets or retweets.

Is there any way to increase the 10M cap for Academic Research (maybe an Academic Research+ tier), or to write my queries more efficiently?

Thanks in advance.

I don't know about increases to the limits, but given that you have the public_metrics for each tweet, it should be possible to count exactly how many tweets you need to retrieve and plan accordingly. Unfortunately, that may take 2 or 3 months of rate limits if it's that many. That said, if you're estimating 5.5M replies across 8M tweets, it might just about fit into a single 10M month: quote tweets generally aren't more numerous than replies, and as long as you don't also have to fetch the replies to those quote tweets, it should be OK.
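A minimal sketch of that counting step, assuming the hydrated tweets are stored as JSON Lines with the v2 `public_metrics` object on each tweet (the field names come from the Tweet object; the file path and function name are just placeholders):

```python
import json

def estimate_budget(path):
    """Sum reply/quote/retweet counts across a JSONL file of hydrated tweets.

    Returns a dict of totals, which together give the exact number of
    tweets a one-hop expansion would need to retrieve.
    """
    totals = {"reply_count": 0, "quote_count": 0, "retweet_count": 0}
    with open(path) as f:
        for line in f:
            metrics = json.loads(line).get("public_metrics", {})
            for key in totals:
                totals[key] += metrics.get(key, 0)
    return totals
```

Dividing the grand total by 10M then tells you how many monthly windows the plan would need.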

To give some additional context, my goal is to find the entire cascade that each hydrated tweet lives in — essentially a multi-start-node BFS. For each tweet I need all of its retweets, quote retweets, and replies. I went through all my tweets to examine the public_metrics, and the first step outwards alone looks like it would require retrieving 496M tweets.

That number was significantly higher than I expected, to the point where the 10M/month cap is clearly prohibitive for my current approach.

Unless you know of a better way to perform my desired BFS, I think I'll have to go back to the drawing board. Sigh.


Yeah, that seems like a lot. I would randomly subsample the initial 8M tweets until you get something more reasonable to retrieve.
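One way to sketch that subsampling, using each tweet's `public_metrics` to greedily keep a random sample whose estimated first-hop pull stays under the cap (the function name, `budget` default, and greedy skip-on-overflow policy are all assumptions, not a prescribed method):

```python
import random

def subsample_to_budget(tweets, budget=10_000_000, seed=42):
    """Randomly subsample hydrated tweets so the estimated first-hop
    pull (replies + quotes + retweets) fits within the monthly cap.

    tweets: list of dicts each carrying a v2 "public_metrics" object.
    Returns (sample, estimated_cost).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    shuffled = tweets[:]
    rng.shuffle(shuffled)
    sample, cost = [], 0
    for tweet in shuffled:
        m = tweet["public_metrics"]
        tweet_cost = m["reply_count"] + m["quote_count"] + m["retweet_count"]
        if cost + tweet_cost > budget:
            continue  # skip tweets that would overflow the cap
        sample.append(tweet)
        cost += tweet_cost
    return sample, cost
```

Fixing the RNG seed keeps the subsample reproducible, which matters if the retrieval ends up spread across several monthly windows.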