Tweet data sets for open source projects that can be shared without violating the Twitter terms


#1

Hi,

are there sample Tweet data sets that can be used and shared publicly without violating the Twitter terms and policies?

I am starting work on an open source Twitter library. The code will of course live in a public (GitHub) repo. In order to run automated tests in my CI routines, a sample Tweet data set is needed. It is of course easy to assemble a test data set, but as far as I understand the Twitter API terms, sharing such a data set in a public repo would violate the Developer terms.

If I only include the Tweet IDs in my test data, I would comply with the Twitter terms but would have to rehydrate those Tweets for every build and test execution. Moreover, since some Tweets may become unavailable over time, my test data would change and tests might fail because of this.

I am sure that this has been encountered before by other open source projects working with the Twitter API, but I haven’t been able to find any efficient and compliant solution for our project.

Hence, I wonder 1) how this has been approached by others in the community and 2) if there are stable, reusable Tweet data sets for testing that can be shared in an open source project in compliance with the Twitter terms?

Thanks in advance for any feedback.

/Stefan


#2

You’re right - hydrating tweets as part of a test sounds like a bad solution.

For 2)

This is a great idea I think - something like the twitter-text conformance tests https://github.com/twitter/twitter-text/tree/master/conformance but for full JSON objects? Is that what you mean?

For 1)

Just my opinion, but I think you can go ahead and include a couple of your own tweets as test cases. (You can always make some manual edits to fake the user and tweet ids / text to anonymize the data.)
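To make the "fake the ids" idea concrete, here is a minimal sketch of how a fixture could be anonymized before it is committed. It assumes a v1.1-style tweet payload with `id`, `id_str`, `text` and a nested `user` object; the helper name `anonymize_tweet` and the placeholder values are made up for illustration, not taken from any library.

```python
import copy


def anonymize_tweet(tweet, fake_tweet_id=1, fake_user_id=1000, fake_text=None):
    """Return a copy of a tweet dict with identifying fields replaced.

    Hypothetical helper: the field names assume a v1.1-style payload.
    """
    fake = copy.deepcopy(tweet)  # never mutate the original payload
    fake["id"] = fake_tweet_id
    fake["id_str"] = str(fake_tweet_id)
    if fake_text is not None:
        # Note: entity indices are NOT adjusted here; if the text changes,
        # the "entities" section has to be edited by hand to match.
        fake["text"] = fake_text
    user = fake.get("user")
    if user:
        user["id"] = fake_user_id
        user["id_str"] = str(fake_user_id)
        user["screen_name"] = "test_user"
        user["name"] = "Test User"
    return fake


original = {
    "id": 123456789,
    "id_str": "123456789",
    "text": "hello",
    "user": {"id": 42, "id_str": "42", "screen_name": "real", "name": "Real"},
}
anonymized = anonymize_tweet(original, fake_text="placeholder text")
```

The result can then be dumped with `json.dumps` into a fixture file. Whether this is enough to count as anonymized under the terms is a separate question that the code does not answer.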

Other libraries just hardcode raw JSON too: https://github.com/Twitter4J/Twitter4J/blob/master/twitter4j-core/src/test/java/twitter4j/StatusJSONImplTest.java

There are a couple of Twitter’s own repositories that just share the full tweet payload for examples / test cases, so I think it’s OK if you just include a sample JSON file with your own tweet or something like that:

eg: https://github.com/twitterdev/tweet_parser/tree/master/test
or https://github.com/twitterdev/tweet-updates/tree/master/samples

If you come up with a bunch of test cases with examples for Tweets please do share those! It would be great to have something like that all in one place.


#3

Thanks for the links! This might be a starting point for some very granular test cases, but I am more interested in larger Tweet sets that make it possible to extract cumulative stats and to test that extraction. As a very simple example, consider frequencies of entities such as hashtags. The examples you pointed at might help, but they are also not very representative of real Twitter data sets.

It would be really nice if @TwitterDev could provide an officially sanctioned set of test data that can be used by the community without coming into conflict with the Twitter terms and policies. A straightforward solution would be to wrap a fixed period of the @Twitter and/or @TwitterDev timeline into a dataset and make it available via GitHub. It is representative, big enough to provide coverage for a lot of test cases, and it is of course owned by Twitter. It would be great to be able to fall back on this!


#4

Yes, I agree, but for that specific example I don’t think it’s worth using a large set of tweets for frequency counts of hashtags. I’d instead break that up: use the existing tests for extraction https://github.com/twitter/twitter-text/blob/master/conformance/extract.yml#L705 and come up with my own separate tests for frequencies (depending on how I’m counting things). If scale is a requirement, you can always keep “replaying” the same tweets over and over.
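As a rough sketch of what a separate frequency test with “replaying” could look like: the fixture below is hand-written (not real tweets) and assumes the v1.1-style `entities.hashtags` layout. The function names and the case-insensitive counting are my own assumptions, so adjust them to however you actually count.

```python
from collections import Counter


def hashtag_frequencies(tweets):
    """Count hashtags across tweets, ignoring case (an assumption)."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


def replay(tweets, times):
    """Yield the same small fixture repeatedly to simulate a larger set."""
    for _ in range(times):
        yield from tweets


# Hand-made fixture: only the fields the counter needs.
FIXTURE = [
    {"id": 1, "entities": {"hashtags": [{"text": "Python"}, {"text": "testing"}]}},
    {"id": 2, "entities": {"hashtags": [{"text": "python"}]}},
    {"id": 3, "entities": {"hashtags": []}},
]

counts = hashtag_frequencies(FIXTURE)
scaled = hashtag_frequencies(replay(FIXTURE, 1000))
```

Because the expected counts for the small fixture are known by construction, the replayed counts should simply be those numbers multiplied by the replay factor, which gives a scale test without needing any real tweets.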

I remembered another Twitter-released dataset that might be relevant: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
(But this would only be useful for some manually run tests, since it requires hydrating the tweets first.)