Public domain Twitter sentiment corpus


#1

I want to release a data set for training Twitter sentiment analysis algorithms. It consists of ~4500 sentiment-labeled tweets. It would be publicly available and distributed without charge.

However, I’m afraid the data may violate the API ToS, specifically term 4A.

The data set consists of:

  • tweet text
  • tweet creation date
  • hand-curated tweet topic
  • hand-curated sentiment label: “positive”, “neutral”, “negative”, or “irrelevant”

Does this violate the ToS? Is there anyone I can contact to get permission to release it?


#2

Hello,

Under our API Terms of Service (https://dev.twitter.com/terms/api-terms), you may not resyndicate or share Twitter content, including datasets of Tweet text and follow relationships. You may, however, share datasets of Twitter object IDs, like a Tweet ID or a user ID. These can be turned back into Twitter content using the statuses/show and users/lookup API methods, respectively. You may also share derivative data, such as the number of Tweets with a positive sentiment.

As such, if you would like to share this data set, you will need to remove the tweet text and creation date from the data set and replace these with the appropriate Tweet ID. Let me know if you have any further questions.

Thanks,
Brian
Twitter API Policy
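
A minimal sketch of the conversion Brian describes: drop the tweet text and creation date, and keep the Tweet ID alongside the hand-curated labels. The column names (`tweet_id`, `topic`, `sentiment`, `text`, `created_at`) and the helper names are hypothetical; the thread does not specify a file layout.

```python
import csv
import io

# Hypothetical column names; the thread does not define the file layout.
SHAREABLE_FIELDS = ["tweet_id", "topic", "sentiment"]


def to_shareable_rows(rows):
    """Keep only the Tweet ID and hand-curated labels; drop text and date."""
    return [{field: row[field] for field in SHAREABLE_FIELDS} for row in rows]


def write_shareable_csv(rows, out_file):
    """Write the ID-only rows to an open text file as CSV with a header."""
    writer = csv.DictWriter(out_file, fieldnames=SHAREABLE_FIELDS)
    writer.writeheader()
    writer.writerows(to_shareable_rows(rows))


if __name__ == "__main__":
    # Made-up example row, for illustration only.
    sample = [{
        "tweet_id": "123",
        "text": "example tweet text",
        "created_at": "example date",
        "topic": "apple",
        "sentiment": "positive",
    }]
    buffer = io.StringIO()
    write_shareable_csv(sample, buffer)
    print(buffer.getvalue())
```

Recipients would then need to fetch the text for each ID themselves, for example through the statuses/show method mentioned above.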


#3

Brian,

Thanks for the quick reply.

I’ll distribute a script to help my users recreate the dataset from only the Tweet IDs.

It looks like I’ll have to pull the tweet texts individually using statuses/show. With the rate limit of 350 requests/hr, that comes out to ~13 hours to download 0.5 MB of text. Yipes.

Cheers,
Niek
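
The ~13 hour estimate above can be checked with a short calculation. This is only a sketch, assuming one request per tweet and the 350 requests/hr limit quoted in the post; the function names are illustrative.

```python
def estimated_hours(num_tweets, requests_per_hour=350):
    """Hours needed when each tweet costs one rate-limited request."""
    return num_tweets / requests_per_hour


def seconds_between_requests(requests_per_hour=350):
    """Spacing that spreads requests evenly across the hour."""
    return 3600 / requests_per_hour


if __name__ == "__main__":
    # ~4500 tweets at 350 requests/hr is a little under 13 hours.
    print(f"{estimated_hours(4500):.1f} hours")
    # One request roughly every 10 seconds stays within the limit.
    print(f"{seconds_between_requests():.1f} seconds between requests")
```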


#4

Hi Niek,

I’m very interested in your dataset. We are currently building Twitter sentiment analysis algorithms, and this set is exactly what I was looking for.

I would appreciate it if you could let me know where you publish the dataset.

Regards, Tim


#5

The dataset should be going up tomorrow. I’ll drop a link here when it does.

Best,
Niek


#6

The corpus is now available for download:
http://www.sananalytics.com/lab/twitter-sentiment/

It includes a small Python script that rebuilds the tweet text from the tweet IDs, to work around the redistribution restriction in Twitter’s ToS.


#7

I have not used the corpus yet, but I wanted to thank you for making it available.


#8

Thanks for sharing. I may ask you personally about this dataset in the future.


#9

Thanks a lot!
I’m working on my PhD dissertation and I need some tweets to train the system I’m developing.
I’ll play with it and I’ll give you my feedback.


#10

Thanks! I’m also working on my PhD with tweets and I need a dataset.


#11

Hi Niek,

I just got your script to download the corpus. I plan to run an existing classifier on it and will keep you posted. But before that, thank you so much for your contribution.


#12

Hello Sanders, thank you for the dataset. Have you published any papers about this sentiment analysis work? I have already searched Google and cannot find any papers by you. If you have published one, could I read it? I would like to cite it as a reference in my paper.

Many thanks,
ekky


#13

Thanks a lot


#14

Can anyone help me retrieve the corpus? The download takes a long time.


#15

Can I reduce the number of tweets from 5513 to 1000?
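
The thread does not answer this question. One possible approach, sketched here under the assumption that the corpus is a list of rows keyed by tweet ID, is to draw a random subset before downloading, so only the sampled IDs need to be fetched. The function name and the fixed seed are illustrative choices.

```python
import random


def sample_rows(rows, k=1000, seed=0):
    """Return k rows chosen uniformly at random, without replacement.

    A fixed seed makes the subset reproducible across runs.
    """
    if k >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, k)


if __name__ == "__main__":
    # Stand-in for the corpus: 5513 placeholder IDs.
    all_ids = [str(i) for i in range(5513)]
    subset = sample_rows(all_ids, k=1000)
    print(len(subset))
```

A plain random sample does not preserve the exact balance of sentiment labels; sampling per label would be needed if that balance matters.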


#16

Hi Brian,

A team in our college is conducting an online contest where the plan is to provide the participating groups the following:

  1. A list of tweet IDs (not tweet text).
  2. For every tweet ID, the corresponding hand-curated topic class the tweet text belongs to (say ‘Politics’ or ‘Sports’).
  3. A script (using the standard Twitter API) to download the tweet texts from a list of published tweet IDs.

The participants need to train a machine learning algorithm to predict the topic class a particular tweet belongs to. For this they need datasets consisting of about 10000 tweet texts. The script we provide will help them download this dataset.

Will this violate the Twitter Terms of Service?

I understand that Niek, above, used a similar method that lets his users recreate the tweets from a list of tweet IDs. We are following a similar approach. Our dataset will be limited to the participants of the contest and will not be publicly available.

We assure you that we will publish only the list of tweet IDs and not the text. The script we provide will help download the text on an as-needed basis.

Thanks a lot,
Anirban


#17

Hi,

Greetings of the day,

I am searching for positive and negative tweets and found your data set, but I would like to know on what basis you divided the tweets into positive and negative. Do you have the tweet text? For example, take the topic “apple”:

“apple is good for health” would be a positive tweet.

I would also like the text field, so please send it to me in CSV or text format.

Regards
Gopal


#18

Hi, I was trying out the free data set you provided and ran into a problem that I hope you can help me with.

I think I may have accidentally closed the Python script midway through the download. I should be able to restart the script and continue the download, right? However, I can’t do so right now: the script just exits after I press Enter three times. I have only managed to download 800+ tweets.

The error was:
return [ tweet_json['created_at'], tweet_json['text'] ]
KeyError: 'created_at'

Can anyone who has tried the script help me with it?
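
On the question of restarting: the thread does not say how the distributed script stores its progress. As a general pattern only, a downloader can be made resumable by skipping IDs that already have a saved result. The sketch below assumes a hypothetical layout of one cached file per tweet, named after its ID; `cache_path` and `pending_ids` are illustrative names, not functions from the corpus script.

```python
import os
import tempfile


def cache_path(cache_dir, tweet_id):
    # Hypothetical layout: one file per tweet, named after its ID.
    return os.path.join(cache_dir, f"{tweet_id}.json")


def pending_ids(tweet_ids, cache_dir):
    """Return the IDs with no cached file yet, in their original order."""
    return [tid for tid in tweet_ids
            if not os.path.exists(cache_path(cache_dir, tid))]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        # Pretend tweet "101" was saved before the script was interrupted.
        with open(cache_path(cache_dir, "101"), "w") as f:
            f.write("{}")
        print(pending_ids(["101", "102", "103"], cache_dir))
        # prints ['102', '103']
```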


#19

I managed to download 3 of 5513 tweets.

I got the same error as Koh Pei Jie:

return [ tweet_json['created_at'], tweet_json['text'] ]
KeyError: 'created_at'

I would really appreciate it if anyone could help with this.

Thank you in advance
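
The traceback in the two posts above shows the script indexing a parsed response that has no `created_at` key. One plausible cause, not confirmed anywhere in this thread, is that the API returned something other than a tweet object (for example an error payload for a deleted tweet or an exhausted rate limit). A defensive version of that return statement might look like the sketch below; `parse_tweet` is a hypothetical name, and the sample payloads are made up.

```python
def parse_tweet(tweet_json):
    """Return [created_at, text], or None if the response is not a tweet.

    Checking for the keys first avoids the KeyError and lets the caller
    decide whether to skip the tweet or retry later.
    """
    if not isinstance(tweet_json, dict):
        return None
    if "created_at" not in tweet_json or "text" not in tweet_json:
        return None
    return [tweet_json["created_at"], tweet_json["text"]]


if __name__ == "__main__":
    # Made-up payloads, for illustration only.
    good = {"created_at": "example date", "text": "example text"}
    bad = {"error": "example error message"}
    print(parse_tweet(good))
    print(parse_tweet(bad))
```

Logging the raw response whenever `parse_tweet` returns None would show what the API actually sent back.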


#20

Hi Tim,
I am thinking of taking up sentiment analysis as my project work and have been looking for information on the subject. I hope you have finished your work on this. Could you share the information, or the link where you have published it?

Thanks