Twitter and Open Data in academia


#1

Looking for help from anyone with experience releasing Twitter datasets in an academic context, and hopefully guidance from Twitter staff.

In the UK, as in other countries, there is a growing requirement to make research data openly accessible to other academics (and non-academics). Some of the funding that I hold includes this stipulation.

Around the UK General Election debates this year, we collected some 38,000 tweets posted to official hashtags. We used 2% of these tweets—just shy of 800—for a manual qualitative analysis. We’d like to release those 800 tweets as an open dataset alongside the relevant publication.

I’ve been in discussion with my university about the possible legal issues around this, and we’ve reached the following conclusions:

  • We think that releasing a CSV file of 800 tweets as a download is fine in principle (based on I.6.b.i of the Developer Policy: https://dev.twitter.com/overview/terms/agreement-and-policy)
  • We think the copyright aspect is fine. Although tweets may be copyrightable under EU law, the users are licensing them to Twitter, who are sublicensing them to us.

The point where we’re getting stuck is licensing. Normally, you would release your data under some form of Creative Commons license. However, the Twitter Developer terms specifically forbid sublicensing the content (I.B in the Developer Agreement: https://dev.twitter.com/overview/terms/agreement-and-policy).

Does anyone have any idea how we might release the tweets without falling foul of this?

(NB: We’re aware we can bypass all these issues by just releasing Tweet IDs. However, requiring other researchers to scrape these tweets themselves would be time-consuming, and not in the spirit of sharing data.)


#2

I’d love an official response to this too - maybe @andypiper can clarify?

The way I understand it, it’s fine to offer a CSV of tweets provided you honour deletions and private accounts (see part b of https://dev.twitter.com/overview/terms/agreement-and-policy#3.Update_Respect_Users_Control_and_Privacy). So you’d have to update your CSV if tweets get deleted or a user goes private.

No idea about licensing.

If making the data easier to deal with is important, releasing just the IDs plus a script without API keys might be a good option (see the sketch below).
This ensures that others must request the tweets themselves, so deletions and newly private accounts are handled appropriately; the downside is that anyone who wants to work with the data will need to register an application on Twitter and run your script with their own API keys. Downloading 800 tweets won’t take long - you can do it with 8 calls to https://dev.twitter.com/rest/reference/get/statuses/lookup
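
For what it’s worth, here’s a minimal sketch of what such a script might look like, using the requests and requests_oauthlib packages against the v1.1 statuses/lookup endpoint (the credential placeholders are hypothetical - whoever runs it substitutes their own keys):

    # Hydrate a list of tweet IDs via the v1.1 statuses/lookup endpoint.
    # The four credential placeholders are hypothetical; anyone running
    # this registers their own app and substitutes real keys.
    import json
    import requests
    from requests_oauthlib import OAuth1

    LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

    auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
                  "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    with open("ids.txt") as f:  # one tweet ID per line
        ids = [line.strip() for line in f if line.strip()]

    with open("tweets.json", "w") as out:
        # statuses/lookup accepts up to 100 comma-separated IDs per call,
        # so ~800 tweets need only 8 requests.
        for i in range(0, len(ids), 100):
            batch = ",".join(ids[i:i + 100])
            resp = requests.get(LOOKUP_URL, auth=auth, params={"id": batch})
            resp.raise_for_status()
            for tweet in resp.json():  # deleted/protected tweets are simply absent
                out.write(json.dumps(tweet) + "\n")

Because the endpoint silently drops deleted and protected tweets, anyone re-running the script gets a copy that already respects users’ current choices.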

Maybe a good way to do it is to release both: a CSV with just the IDs plus your annotations or labels (if any), and the full CSV, kept updated as tweets get deleted? A sketch of the first file follows.
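
Something like this, say (a hypothetical sketch - the “stance” column stands in for whatever coding scheme your qualitative analysis actually used):

    # Write the IDs-plus-annotations CSV. The rows are made up and the
    # "stance" column is a stand-in for your actual coding scheme.
    import csv

    annotations = [
        ("631234567890123456", "supportive"),
        ("631234567890123457", "critical"),
    ]

    with open("tweet_ids_annotated.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["tweet_id", "stance"])
        writer.writerows(annotations)

Keeping the IDs as strings matters here: 64-bit tweet IDs lose precision when spreadsheet tools parse them as floating-point numbers.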


#3

Thanks for your suggestion of a script… this is something we hadn’t considered at all. I don’t often work with this form of data, but of course that would be a common solution in some other fields.

It’s not an ideal solution because a) it would be nice to make the data more accessible to non-technical folk, and b) scaled up to the 50,000-tweet maximum that Twitter would allow us to release, I guess that would still take a few hours of API allowance? But it does give us another option to work with.

The deletion issue is another interesting one that we never quite resolved. It seems unlikely that we’d ever get a notification of a deletion, even though I’m sure some of the tweets have already been deleted or hidden. Surely we can’t be expected to check periodically – the data will be stored for 5–10 years!


#4

Yeah, I can understand the need for something that doesn’t require much technical knowledge or effort to share.

Would it be acceptable to release just the IDs plus whatever metadata you’ve annotated, for anyone familiar with the Twitter API, and make the easy-to-use CSV “available on request”? That might satisfy both data-sharing and Twitter TOS requirements.

twarc is good for this: https://github.com/edsu/twarc. Omitting the API keys, it can be as simple as:

python twarc.py --hydrate ids.txt > tweets.json

Easiest way to deal with deletions is just to re-download all the tweet IDs with that. 50,000 tweets using /statuses/lookup will take around 45 minutes: each 15-minute window allows 180 lookup calls at 100 IDs per call, i.e. 18,000 tweets per window, so 50,000 IDs need three windows. You can then diff the fresh copy against your archive to see what has disappeared (sketched below).
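
For instance, a hypothetical check against twarc’s line-oriented JSON output (one tweet per line) might look like:

    # Compare the archived ID list against a fresh hydration run to see
    # which tweets have since been deleted or made private.
    import json

    with open("ids.txt") as f:
        archived = {line.strip() for line in f if line.strip()}

    with open("tweets.json") as f:
        still_available = {json.loads(line)["id_str"] for line in f}

    gone = archived - still_available  # drop these from any shared full-text copy
    print(f"{len(gone)} of {len(archived)} tweets are no longer retrievable")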

On a related note - good thing you’re actually releasing the data in the first place (is there a preprint of the paper somewhere? I’m always on the lookout for Twitter datasets).

If you upload to http://figshare.org or http://datadryad.org, or follow https://guides.github.com/activities/citable-code/, you’ll get a DOI for your data that others can cite.


#5

Thanks for your great question, Nick.

Sharing just the Tweet IDs (see the related topic “Sharing Tweet-Ids in repositories”) is in fact what you ought to be doing in this case. This covers the case where Tweets are deleted or accounts are subsequently made private, since those Tweet IDs will not be retrievable later.


#6

Thanks Andy, that’s helpfully definitive! We’ll provide just the IDs and I’ll look into Igor’s suggestion of providing a template script to help others access them.

Just out of curiosity, in what circumstances might 6.b.i come into play? It seems like that route would always run into the same problems with deletions etc., and it was the source of our confusion here.

“You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.”


#7

Yeah, I can understand that causing some questions. I’m not the author of that clause, but I’ll see what I can find out, or what I can do to clarify it for the future. Thanks!