Creation time of Tweet is not included in the downloadable Twitter archive


#1

I do not know if this also occurs in the REST API, but the JSON files included in the user-downloadable Twitter archive no longer include the time that the tweet was created. All times are flattened to 00:00:00 on the date the tweet was created.

Here is a diff of the data/js/tweets/2007_05.js file. The older version (revision 2) was downloaded 3 or 4 months ago and the newer one (working copy) only 3 days ago:

--- 2007_05.js (revision 2)
+++ 2007_05.js (working copy)
[…]
-  "created_at" : "Thu May 31 19:54:02 +0000 2007",
+  "created_at" : "2007-05-31 00:00:00 +0000",

(Diff also available here: http://bit.ly/1bvVNLY)


#2

I’m having a similar issue with the recent Twitter archive download. The created_at timestamps are flattened for some tweets, usually old ones.

Examples, from my archive’s 2010_10.js:

{
  "source" : "\u003Ca href=\"http:\/\/www.destroytwitter.com\" rel=\"nofollow\"\u003EDestroyTwitter\u003C\/a\u003E",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : { },
  "id_str" : "29230328481",
  "text" : "Oh? Google Chrome dev has non-modal dialogs like Opera now? http:\/\/crbug.com\/456",
  "id" : 29230328481,
  "created_at" : "2010-10-31 00:00:00 +0000",
  "user" : {
    "name" : "Lim Chee Aun",
    "screen_name" : "cheeaun",
    "protected" : false,
    "id_str" : "76993",
    "profile_image_url_https" : "https:\/\/pbs.twimg.com\/profile_images\/1424189315\/avatar-large_normal.png",
    "id" : 76993,
    "verified" : false
  }
}

And another example:

{
  "source" : "\u003Ca href=\"http:\/\/www.destroytwitter.com\" rel=\"nofollow\"\u003EDestroyTwitter\u003C\/a\u003E",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : { },
  "id_str" : "29245295894",
  "text" : "Ugh, 43 slides.",
  "id" : 29245295894,
  "created_at" : "2010-10-31 00:00:00 +0000",
  "user" : {
    "name" : "Lim Chee Aun",
    "screen_name" : "cheeaun",
    "protected" : false,
    "id_str" : "76993",
    "profile_image_url_https" : "https:\/\/pbs.twimg.com\/profile_images\/1424189315\/avatar-large_normal.png",
    "id" : 76993,
    "verified" : false
  }
}

Obviously it doesn’t make sense for the timestamp of two tweets to be exactly the same, down to the second. :confused:
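
For anyone who wants to check their own archive, here is a minimal Python sketch that flags created_at values whose time of day is exactly midnight. The function name and the format list are my own and only cover the two created_at layouts quoted in this thread; a tweet genuinely posted at 00:00:00 would be flagged too, so treat a hit as a hint rather than proof.

```python
from datetime import datetime

# Assumption: only the two created_at layouts quoted in this thread occur.
KNOWN_FORMATS = (
    "%Y-%m-%d %H:%M:%S %z",      # e.g. "2010-10-31 00:00:00 +0000"
    "%a %b %d %H:%M:%S %z %Y",   # e.g. "Thu May 31 19:54:02 +0000 2007"
)


def is_flattened(created_at: str) -> bool:
    """Return True if the timestamp's time of day is exactly 00:00:00."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(created_at, fmt)
        except ValueError:
            continue  # try the next known layout
        return (parsed.hour, parsed.minute, parsed.second) == (0, 0, 0)
    raise ValueError(f"unrecognised created_at layout: {created_at!r}")
```

Running it over every tweet in a monthly file and counting the hits should show quickly whether a whole month is affected.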


#3

I also have this issue. Are any Twitter engineers aware?


#4

Aware of this thread, yes - I’m not aware of whatever backend change may have caused this date discrepancy for older tweets, though. Will look into that.


#5

@andypiper Thanks; at the risk of submitting a smug report, I believe it’s related to the snowflake ID system.

All of the tweets in my archive with the old IDs have the time as 00:00:00, and as soon as they start having a snowflake ID, the time is correct; the first correct tweet I have is a snowflake tweet at 2010-11-04 21:51:08, and the most recent broken one is a non-snowflake tweet at 2010-11-04 19:59:53.

Hope that helps! :slight_smile:
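
To illustrate the snowflake connection, here is a rough Python sketch. A snowflake ID carries its creation time in the bits above the low 22, as milliseconds since Twitter's snowflake epoch (1288834974657 ms after the Unix epoch, i.e. 2010-11-04 01:42:54.657 UTC). The function name and the cutoff used to tell old IDs from snowflake IDs are assumptions of mine, based only on the ID lengths seen in this thread; old sequential IDs carry no timestamp, so the function returns None for them.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Twitter's snowflake epoch in milliseconds since the Unix epoch.
TWITTER_EPOCH_MS = 1288834974657

# Assumption: pre-snowflake IDs in this thread have at most 11 digits,
# so anything below this cutoff is treated as an old sequential ID.
SNOWFLAKE_MIN_ID = 10**12

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_created_at(tweet_id: int) -> Optional[datetime]:
    """Decode the creation time from a snowflake ID, or None for old IDs."""
    if tweet_id < SNOWFLAKE_MIN_ID:
        return None  # sequential ID: no time is encoded in it
    # Drop the low 22 bits (machine and sequence numbers) to get the offset.
    millis = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=millis)
```

For the tweets with old IDs, such as 29230328481 above, this returns None, which fits the observation that their time of day cannot be recovered from the ID alone.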


#6

That is a very good bit of sleuthing and a likely reason for the difference! Again, I can’t confirm right now, but it seems plausible.


#7

Sorry, I’m not a developer, so I can contribute little, and I realize I’m reviving a thread that is more than a year old, but this seemed like the most plausible place to hear some answers on this matter.

I just discovered this phenomenon in my own Twitter archive. The magic timestamp for me seems to be 2010-11-04 22:08:17 +0000 - that’s the first item in my downloaded CSV file with a time of day. I presume all tweets from that point forward are in chronological order. Prior to that point, the tweets are only chronological by day: tweets within one particular day all have 00:00:00 as the timestamp, and I see no particular rationale for the order in which each day’s tweets appear.

Is this something that is resolvable, or is that data simply no longer available in the very old portions of archives?

PS - I also noticed that the tweet_id became 4 digits longer at that 2010-11-04 22:08:17 +0000 timestamp as well.