I am writing my thesis on the public debate during the covid-19 pandemic in Denmark.

Based on a live-scraped dataset collected during covid-19, I have identified a sample of Danish tweets. To comply with GDPR, I need to know which tweets have been deleted by their users so that I can remove them from my dataset. However, I have had a hard time identifying which tweets were deleted specifically by the user. Do you have a solution to this issue?

Kind regards
Ida

I need to know which tweets are deleted by the users, in order to remove these from my dataset.

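You can use the GET statuses/lookup endpoint (see the Twitter Developer Platform docs) to check up to 100 tweet IDs at a time. Deleted tweets and tweets from suspended accounts will not be retrieved, so a missing ID tells you the tweet is no longer available, but not by itself whether the user deleted it. If it’s a very large dataset it may take a while.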
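To make the batching idea concrete, here is a minimal Python sketch, not a definitive implementation. It splits the IDs into batches of 100 and treats every ID that does not come back as unavailable. The names `find_missing_ids` and `lookup_batch` are hypothetical: `lookup_batch` stands in for whatever function actually calls statuses/lookup with your credentials and returns the IDs it got back.

```python
from typing import Callable, Iterable, List, Set

BATCH_SIZE = 100  # statuses/lookup checks up to 100 IDs per request


def chunked(ids: List[str], size: int = BATCH_SIZE) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def find_missing_ids(
    ids: List[str],
    lookup_batch: Callable[[List[str]], Set[str]],
) -> Set[str]:
    """Return the IDs that the lookup did not return.

    `lookup_batch` is a hypothetical callable: it receives one batch of
    IDs and returns the subset that is still retrievable.
    """
    missing: Set[str] = set()
    for batch in chunked(ids):
        returned = set(lookup_batch(batch))
        missing.update(set(batch) - returned)
    return missing


# Stand-in for a real API call: pretends tweets "2" and "5" no longer exist.
def fake_lookup(batch: List[str]) -> Set[str]:
    return {tweet_id for tweet_id in batch if tweet_id not in {"2", "5"}}


print(sorted(find_missing_ids(["1", "2", "3", "4", "5"], fake_lookup)))  # ['2', '5']
```

The missing set covers deleted tweets and tweets from suspended accounts alike, so it is a list of IDs to drop from the dataset rather than proof of a user deletion.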

Twarc (GitHub: DocNow/twarc, a command line tool and Python library for archiving Twitter JSON) has a good dehydrate function to extract IDs and a hydrate function to look them up again. This is a good way to remove deleted tweets.
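Once the IDs have been hydrated again, the remaining step is to keep only the tweets that came back. Below is a rough Python sketch of that filtering step. It assumes (this is an assumption, not something confirmed here) that both the original dataset and the hydrated output are line-delimited JSON in which each tweet carries an `id_str` field; the function names `collect_ids` and `keep_available` are made up for illustration.

```python
import json
from typing import Iterable, Iterator, Set


def collect_ids(jsonl_lines: Iterable[str]) -> Set[str]:
    """Collect the `id_str` of every tweet in a line-delimited JSON source."""
    ids: Set[str] = set()
    for line in jsonl_lines:
        line = line.strip()
        if line:  # skip blank lines
            ids.add(json.loads(line)["id_str"])
    return ids


def keep_available(
    dataset_lines: Iterable[str],
    available_ids: Set[str],
) -> Iterator[str]:
    """Yield only the dataset lines whose tweet ID is still available."""
    for line in dataset_lines:
        line = line.strip()
        if line and json.loads(line)["id_str"] in available_ids:
            yield line
```

In practice you would pass open file handles, for example `collect_ids(open("hydrated.jsonl"))`, where the file name is only a placeholder, and write the lines yielded by `keep_available` to a new file instead of editing the original in place.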

There’s also a batch compliance endpoint (see the Introduction page in the Twitter Developer Platform docs), but I don’t think it’s fully available yet.

Thank you very much for your useful reply.

I have tried to use the batch Tweet compliance lookup, but it seems I lack ‘authentication’. I am not sure whether I got something wrong or whether it is because the endpoint is not available yet.

Do you have any experience with the batch compliance endpoint?

Yeah, unfortunately I don’t think it’s generally available yet. It was in trial for a while.

I think you’re stuck with using statuses/lookup to check for deletions. The twarc command line tool I linked is the best way to use that.