Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the Experience Project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm predicts distributions over such labels more accurately than several competitive baselines.

Download Paper

Download Dataset

Download Code

Java code

Experimental Results

Bibtex

Comments

Use this section for remarks, critical comments, or other thoughts on the paper. To post, save what you write first, then type in the password and post (nothing will appear to happen), then copy the text and re-post. It's the only way to prevent spammers.


Gao Pengda (21 May 2013, 15:52)

Excuse me. When I use this RAE to classify short Chinese sentences, I split the labeled data into two parts: 90% for training and 10% for testing. I get 85% precision on the training set, but the classifier does very badly on the test set. It assigns about 80% of the test examples to one class and 20% to the other, while the true distribution is 50% to 50%. With a 500 vs. 500 test set, the confusion matrix often looks like [378,411;112,99].

Could you give me some suggestions for fixing this?

Thank you. Best wishes.
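
Editorial note: the figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python, not part of the released code. It assumes the matrix is in Matlab row notation with rows as predicted classes and columns as true classes; that orientation is a guess, chosen because the row totals match the reported 80%/20% split. The accuracy itself does not depend on the orientation.

```python
# Confusion matrix as quoted in the comment: [378,411;112,99].
# ASSUMPTION: rows are predicted classes, columns are true classes.
confusion = [[378, 411],
             [112, 99]]

total = sum(sum(row) for row in confusion)

# Correct predictions sit on the diagonal, whichever way the matrix is oriented.
correct = confusion[0][0] + confusion[1][1]
accuracy = correct / total

# Share of test examples assigned to each class (row totals under the assumption).
predicted_share = [sum(row) / total for row in confusion]

# Share of test examples that truly belong to each class (column totals).
true_share = [sum(col) / total for col in zip(*confusion)]

print(f"accuracy = {accuracy:.3f}")
print("predicted class shares:", predicted_share)
print("true class shares:", true_share)
```

Under that reading, test accuracy is 47.7%, slightly below chance for a balanced two-class problem, while training precision is 85%. That gap is consistent with the model overfitting the training data.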

RichardSocher (13 April 2013, 09:19)

Hi Tran Phi,

To extend it to paragraphs and documents, I would use a strictly left-branching tree, or in other words a simple chain. This has not been done yet. I actually think it would be an interesting research direction.
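
Editorial note: a strictly left-branching tree over a sequence can be sketched as a left fold, where each new unit is merged with everything combined so far. The Python below only builds the tree structure as nested pairs. The function name `left_branching_tree` and the placeholder sentence labels are hypothetical, and this is not taken from the released code. In a recursive autoencoder, each pair would presumably correspond to one composition step that produces a parent vector from its two children.

```python
from functools import reduce


def left_branching_tree(units):
    """Combine a sequence of units into a strictly left-branching binary tree.

    Each step pairs the tree built so far (left child) with the next unit
    (right child), so the result is a simple chain.
    """
    if not units:
        raise ValueError("need at least one unit")
    return reduce(lambda left, right: (left, right), units)


# Placeholder labels standing in for sentence representations.
sentences = ["s1", "s2", "s3", "s4"]
tree = left_branching_tree(sentences)
print(tree)  # ((('s1', 's2'), 's3'), 's4')
```

A sequence of n units always yields n - 1 merges, and the depth of the tree grows linearly with the number of units.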

By the way, I would also encourage you to just use the Stanford parser to get the right tree instead of searching for it with the reconstruction. Recent experiments showed improvements this way.

Gao Pengda, you can initialize the vocabulary with small random numbers, like x ~ Uniform(-0.001, 0.001).
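
Editorial note: one way to read this suggestion is sketched below in Python, using only the standard library. The helper name `init_vocabulary`, the example words, the vector dimension, and the fixed seed are all assumptions made for illustration; the released code may do this differently.

```python
import random


def init_vocabulary(words, dim, scale=0.001, seed=0):
    """Give every word a vector whose entries are drawn from Uniform(-scale, scale).

    The seed is fixed only so that the example is reproducible.
    """
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-scale, scale) for _ in range(dim)]
        for word in words
    }


# Hypothetical three-word vocabulary with 5-dimensional vectors.
vocab = init_vocabulary(["good", "bad", "movie"], dim=5)
for word, vector in vocab.items():
    print(word, vector)
```

Starting from values this close to zero means no word carries a sentiment prior at the outset, so any polarity has to be learned from the labeled data.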

Another thing that can help is if you label more nodes in the trees. That way the model can become better at capturing sentiment changes.

Hope that helps.

PS: Save what you write before you post, then type in the password, then copy the text and re-post. It's the only way to prevent spammers.

TranPhi (12 April 2013, 20:29)

Does this method naturally extend to the paragraph and document levels where there are multiple sentences?

Gao Pengda (12 April 2013, 04:58)

How can the lexica be used? Can they be used when initializing the vocabulary, and if so, how?

RichardSocher (11 April 2013, 10:30)

Hi Xinwei,

rt = Rotten Tomatoes, a corpus of short movie review snippets.

read_rtPolarity is the original file we used to read the Rotten Tomatoes movie review dataset and transform it into Matlab.

Xinwei Tang (07 April 2013, 05:55)

What is the purpose of the M-file read_rtPolarity, and what is rt short for?

RichardSocher (18 January 2013, 07:34)

Weird, I just tried it again and it works. Maybe GitHub had a temporary problem?

nitin thokare (17 January 2013, 11:42)

Unable to download the Java code zip file.