Dear community,

I used the Academic track full-archive search to retrieve a set of Tweets that included the same hashtag within a specific time period. Tweets using this hashtag were heavily moderated by Twitter (rightfully so).

I have noticed that a large percentage of the Tweets retrieved are no longer accessible (about 50%). I am speculating that those are:

  • Tweets that were deleted by Twitter moderators
  • Tweets that were deleted by users themselves

I have failed to find exact information in the API docs or on this forum about whether this search includes deleted Tweets, and if so, whether it includes Tweets deleted by users themselves or just those deleted by Twitter moderators (or something else).

Any clarification would be super helpful! Thank you :slight_smile:

As far as I know, the v2 API will never return results from suspended accounts, private (locked) accounts, or content that was deleted, whether by Twitter or by the user.

2 Likes

Thanks a lot for responding! Today I re-ran the query, and for example, it returned the tweet with the following id: 1346231552028549120

When I try to look for it (via browser), I get a “Hmm… that page doesn’t exist” and below a message displays saying “Sorry, that Tweet has been deleted.”

To me, that sort of suggests (strongly) that some deleted tweets do get included.

2 Likes

What was the query? Does it still happen now? Sometimes deleted Tweets may slip through - it’s not 100% immediate, as far as I know.

1 Like

The same logic applies to API v1.1 and v2. Deleted content is not in the API. The behaviour for protected accounts is dependent on the user context in v1.1 (for timelines, for example), but in search, protected accounts will not be indexed or returned.

Impossible to comment further without specific information on the query and the timing, but even then, we are unable to comment on private user data or to confirm the deletion timing or status beyond what you see from the API and the website/apps.

2 Likes

Thank you @andypiper and @IgorBrigadir!

To give more information, I was querying for tweets that include #QAnon in the period between December 10th 2020 and January 10th 2021 inclusive.

I originally fetched this data in March, but I fetched it again today with the same query, so yes - I can confirm that I am still getting Tweets that I am not able to access as a regular user through the platform (an example being the Tweet with the ID above, but I can provide hundreds more IDs, as it’s by no means an isolated incident).

2 Likes

To clarify: “I am still getting Tweets…”

I read this as: you got a set of Tweet IDs in March, but now when you try to access those IDs, you get errors. Is that correct?

If you are running the exact same query you ran in March and getting IDs that are inaccessible, I would be surprised. I would just like to understand the exact circumstances here. Is there any chance that your data from March is being retained and not overwritten, for example?

You, me, and Reviewer 4 for my paper would all love to understand this :slight_smile:

I am running the same query but in a new R workspace, downloading data to a new, separate folder, with a different bearer token, having updated the packages, etc. - so I am almost certain it is not getting mixed up with old data. Another reason to believe that my data is actually freshly obtained is that it doesn’t exactly match the previous dataset - about 97% is the same. If you think it can’t be anything else, I could run it from a different computer?

I queried my March data about a month later (in April), Tweet by Tweet, to see whether each one was accessible without Academic access, and about 50% of it was not. That is a very large percentage and does not match what I might expect people to have deleted themselves (and why would Twitter moderators delete content two months later in those volumes) - hence my theory that I am also getting deleted Tweets.

To be explicitly clear: the example tweet ID I have posted above was returned to me today when I ran the query for the aforementioned hashtag in the specified timeframe, alongside many others that are similarly not accessible via a browser but are returned via academic API.

Is this Tweet ID being returned? What is the API call your code is making?

Yes, this is one of the 43,788 Tweets that were returned when I queried the API today - of which I estimate 6,755 are not accessible via a browser. I was including this particular Tweet ID as an example.

I am using the academictwitteR library for R, and my query was:

get_all_tweets(
    query = "#qanon",
    start_tweets = "2020-12-07T15:44:33Z",
    end_tweets = "2021-01-10T20:22:15.00Z",
    file = "qanontweets2",
    data_path = "newdata/",
    n = 1000000
  )
1 Like

In case it is helpful, 5 more tweets that were returned today but are not accessible: 1338250804944396288, 1340798430122225664, 1347242998485151744, 1346971646532452352, 1347497392011141120 (and I am happy to provide the other 6 thousand!)

1 Like

It’s very late here so I will have to look further tomorrow or another day, but this is very intriguing.

2 Likes

I tried to reproduce this; I ran:

twarc2 search --archive --start-time "2020-12-07T15:44:33" --end-time "2021-01-10T20:22:15" --limit 1000000 "#qanon" qanontweets.jsonl

This retrieved 43,806 tweets.

Immediately extracting their IDs (this also needs pip install twarc-ids),

twarc2 ids qanontweets.jsonl qanontweets_ids.txt

and hydrating them again

twarc2 hydrate qanontweets_ids.txt qanontweets_hydrated.json

and extracting the results of that,

twarc2 ids qanontweets_hydrated.json qanontweets_hydrated_ids.txt

I get 43,797 tweets.

These should technically give the same IDs, with only a few minutes between requests - but the difference is just 9 Tweets:

diff qanontweets_ids.txt qanontweets_hydrated_ids.txt | grep "^<"
1344803909299302401
1344804172256976896
1344804690874122240
1344805592527880192
1344806569599500288
1344806627673718784
1344806695135027201
1345374048184799235
1345425107104235521

All of these are unavailable in the web interface (“This Tweet is from an account that no longer exists.”), and all of them are Retweets of a single Tweet from a deleted user.

None of these appeared in my dataset - I guess there are maybe some consistency issues, but nowhere near 6 thousand for me. If you can share those 6k IDs, I can try to retrieve them too, or check whether they appear in my results.

I can’t really comment on this replication or why it is returning something different, as I am using a different method to fetch data.

The 6,753 Tweets that I think are deleted are in a CSV file here.

This is freshly obtained data from today. The way I am getting the data is

require(academictwitteR)

tweets <- get_all_tweets(
    query = "#qanon",
    start_tweets = "2020-12-07T15:44:33Z",
    end_tweets = "2021-01-10T20:22:15.00Z",
    file = "qanontweets3",
    data_path = "data_new/",
    n = 1000000
  )

I am not sure what these hydrating steps are (I am by no means an expert in working with Twitter data :slight_smile:) - I am simply parsing the obtained JSON files into a table. The way I check whether a Tweet is deleted or not is by making separate calls with

curl(paste0("https://api.twitter.com/2/tweets/", df1$post_id[i]), handle = h)

and checking whether I get a Tweet back or the error message “Could not find tweet with id:”.
This method essentially replicates accessing a Tweet through the browser. I have checked a very large number of Tweet IDs this way and am quite confident that it works.
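
For anyone who wants to script the same check outside R, here is a minimal Python sketch of the idea using only the standard library. The helper names are mine, and it assumes a valid bearer token in a TWITTER_BEARER_TOKEN environment variable; the key observation is that the v2 single-Tweet lookup returns a "data" object for live Tweets and an "errors" array (with a detail like “Could not find tweet with id: …”) otherwise:

```python
import json
import os
import urllib.request

def looks_deleted(body: dict) -> bool:
    # v2 single-Tweet lookup: live Tweets come back under "data";
    # deleted or suspended ones come back with an "errors" array instead.
    return "data" not in body

def tweet_accessible(tweet_id: str) -> bool:
    # Hypothetical helper, mirroring the curl() call above: one GET per ID.
    req = urllib.request.Request(
        f"https://api.twitter.com/2/tweets/{tweet_id}",
        headers={"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return not looks_deleted(json.load(resp))
```

Note that doing this one ID at a time is slow for thousands of Tweets; batching IDs (as twarc2 hydrate does) is much faster, but the per-ID check above is the closest equivalent of visiting each Tweet in a browser.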

1 Like

Yep, this looks like it is pretty much the equivalent of what I’m doing too, except in R.

It doesn’t appear that I get any of the CSV file IDs in my results, though.

I have managed to debug this and am writing here to let you know what the issue was (while crying in frustration).

The problem was not Twitter, not academictwitteR, not twarc, not me… it was pandas.

Pandas read_json has a bug, reported in 2018 but still not fixed, in which large integers are read incorrectly if you don’t specify a dtype on import.

I was checking my dtypes, and they looked right, but the issue is that pandas first reads the column as object, then converts it to float, and then converts it back to int - and reports the dtype as int. The result was that some, but not all, of my Tweet IDs were wrong.

Sometimes they were off by one (e.g. 1342476785078960129 became 1342476785078960128), sometimes by a bit more (e.g. 1345201420283346955 became 1345201420283346944).
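
A minimal sketch of the effect, in case anyone wants to reproduce it: Tweet IDs are larger than 2^53, so any round-trip through float64 silently rounds them; passing an explicit dtype to read_json keeps pandas from routing the column through float (the single-row JSON here is just an illustration):

```python
import io
import pandas as pd

tweet_id = 1342476785078960129

# float64 carries only 53 bits of mantissa, so integers this large are
# silently rounded - the round-trip does not give the ID back.
assert int(float(tweet_id)) != tweet_id

# Workaround: declare the dtype up front so the column is never
# converted through float.
df = pd.read_json(
    io.StringIO('[{"id": 1342476785078960129}]'),
    orient="records",
    dtype={"id": str},
)
print(df["id"][0])  # the ID survives as an exact string
```

Reading IDs as strings throughout (as Twitter’s own id_str field encourages) avoids the problem entirely.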

What added to the confusion is that Twitter reports “Sorry, that Tweet has been deleted.” when you query for a wrong (never-existing) Tweet ID (e.g. 1342476785078960128), which is likely inaccurate (so perhaps this should be fed back to the dev team).

Thank you @IgorBrigadir and @andypiper for trying to resolve my issue and respond to my questions - I appreciate it a lot!

3 Likes

Wow thanks so much for following up, I had no idea about the pandas bug! I must check into that to make sure it’s not corrupting IDs in my own code too!

3 Likes

This sounds similar to the fact that JavaScript cannot parse large integers (see Twitter IDs | Docs | Twitter Developer Platform). I realise that pandas is a Python package, but this is worth being aware of in general. Thanks for sharing what you learned - this will be super helpful for others in the community!
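
The underlying limit is the same in both cases: IEEE-754 doubles (JavaScript’s only number type, and the float64 that pandas converted through) represent integers exactly only up to 2^53. A quick illustration in Python:

```python
# Integers are exactly representable as IEEE-754 doubles only up to 2**53;
# beyond that, adjacent integers can collapse onto the same float value.
print(2**53 == 2**53 + 1)                # False: distinct as ints
print(float(2**53) == float(2**53 + 1))  # True: identical as doubles
```

This is why the API provides id_str alongside the numeric id, and why treating Tweet IDs as strings end-to-end is the safest approach.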

2 Likes