Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the Experience Project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.

New Development

Download Paper

Download Dataset

Download Code

Java code

Experimental Results

Bibtex

Comments

Use this space for remarks, critical comments, or other thoughts on the paper. Save what you write before you post: type in the password, post (nothing will appear to happen), then copy the text and re-post. It's the only way to prevent spammers.

RichardSocher, 24 July 2014, 09:27

PS: Our EMNLP 2013 paper is more accurate than this model, and its code is very easy to use and train yourself (see the Stanford CoreNLP Java package).

RichardSocher, 24 July 2014, 09:26

@Alisy, can you be a bit more specific?

Alisy, 24 April 2014, 02:59

Hi Socher, I recently read your paper. Very interesting indeed! But I have some confusion; I hope you have time to answer.

I want to know the meaning of the four kinds of variables in the rt-polarity_neg_binarized file, which comes with the data in your code. Also, how can I obtain that information for new data?

Thank you. Best wishes.

RichardSocher, 20 August 2013, 05:31

Hi Vargas, sorry for the late reply. You first use PCA or t-SNE to project the higher-dimensional word vectors into 2D. I then created this visualization in MATLAB using the text command inside a figure.
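
Roughly, a minimal sketch of that recipe (illustrative only, assuming We is the d x |V| embedding matrix and words a matching cell array of strings):

    % Illustrative sketch, not the original script. Assumes We is the d x |V|
    % embedding matrix and words is a matching 1 x |V| cell array of strings.
    [~, score] = pca(We');                  % Statistics Toolbox; t-SNE works too
    xy = score(:, 1:2);                     % keep the first two components (2D)
    figure; hold on;
    for i = 1:numel(words)
        text(xy(i,1), xy(i,2), words{i});   % draw each word at its 2D position
    end
    axis tight; hold off;
    print('-dpdf', 'word_vectors.pdf');     % export the figure as a PDF
    % For the red/blue effect, pass ..., 'Color', 'r' or 'b' to text() per word.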

RichardSocher, 20 August 2013, 05:29

Hi Phil, sorry for the late reply. This code is not very clean. For understanding backprop through RNNs, I encourage you to study the code of my ICML 2011 paper, which you can find at Parsing Natural Scenes And Natural Language With Recursive Neural Networks.

To your questions:

1. I'm not sure I understand that question. The output of one neuron is the input to another, recursively.

2. That del part is for ignoring the reconstruction error when it comes to the word vectors.

3. That is an interesting idea that we recently explored in our ACL paper, Parsing With Compositional Vector Grammars. Another group that also explored this idea, in combination with CCGs, is: www.karlmoritz.com/_media/hermannblunsom_acl2013.pdf

It has not been tried yet for sentiment. We will release a new dataset in ~3 weeks that will make this much more interesting :)

Vargas, 30 June 2013, 13:09

Hi Socher, I'm training my model right now with the Java implementation; how can I create the PDF file with the red and blue words?

Thank you! Great job!

Phil, 23 June 2013, 07:01

Hi Socher,

I'm currently reading your paper. Very interesting indeed! But I have some confusion here; I hope you have time to answer.

1. The calculation of nd1, nd2, parent_d (three deltas for the hidden layer and output layer) is given as, say, nd1 = f(y1)' * (y1 - c1). Here y1 is the output of a neuron; should it be the input of the neuron rather than the output?

2. In the calculation of parent_d, it is given as parent_d = f'(a1) * (w3*nd1 + w4*nd2 + mat*pd - del). What is the purpose of "mat*pd - del"? Does it serve as a sparsity constraint? I don't see sparsity constraints in your equations.

3. Would it improve performance if the autoencoders were given different sets of weights?

Thanks a lot!

Peng Li, 06 June 2013, 15:23

Hello, I am a little confused about the following two lines in computeCostAndGradRAE.m:

    L = We(:, words_indexed);
    words_embedded = We_orig(:, words_indexed) + L;

Why not use We(:, words_indexed) directly? Thanks!

RichardSocher, 23 May 2013, 10:21

Hi Gao,

Well... This could have many causes. Is your training data skewed? That is, do you have more instances labeled as class 1 in training? If so, one possible fix would be to weight those instances down in the cost function.
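
For example, a minimal sketch of such a per-class weighting in a cross-entropy cost (illustrative variable names, not the released code):

    % Illustrative sketch of down-weighting an over-represented class
    % (not the released code). labels is 1 x m with class indices 1..K,
    % preds is the K x m matrix of predicted class probabilities.
    classCounts = accumarray(labels(:), 1);           % frequency of each class
    classWeight = max(classCounts) ./ classCounts;    % rarer classes weigh more
    cost = 0;
    for i = 1:numel(labels)
        cost = cost - classWeight(labels(i)) * log(preds(labels(i), i));
    end
    cost = cost / numel(labels);                      % weighted cross-entropy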

Best, Richard

Gao Pengda, 21 May 2013, 15:52

Excuse me. When I use this RAE to classify short Chinese sentences, I split the labeled data into two parts, 90% for training and 10% for testing. I get 85% precision on the training set, but the classifier gets very bad results on the test set: it assigns 80% of the test examples to one class and 20% to the other, while the true distribution is 50/50. The confusion matrix is often something like [378, 411; 112, 99] with a 500 vs. 500 test set.

Could you give me some suggestions to work it out?

Thank you. Best wishes.

RichardSocher, 13 April 2013, 09:19

Hi Tran Phi,

I would use a strictly left-branching tree, in other words a simple chain, to extend it to paragraphs and documents. This has not been done yet; I actually think it would be an interesting research direction.
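
A rough sketch of the idea (illustrative only: s holds one column per sentence vector, W and b are the usual composition parameters):

    % Illustrative sketch of a strictly left-branching (chain) composition,
    % not code from the release. s is d x n with one sentence vector per
    % column; W (d x 2d) and b (d x 1) are the composition parameters.
    p = s(:, 1);
    for i = 2:size(s, 2)
        p = tanh(W * [p; s(:, i)] + b);   % fold the next sentence into the chain
    end
    % p now represents the whole paragraph/document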

By the way, I would also encourage you to just use the Stanford parser to get the right tree instead of searching for it with the reconstruction. Recent experiments showed improvements this way.

@Gao Pengda: You can initialize the vocabulary with small random numbers, e.g. x ~ Uniform(-0.001, 0.001).
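
For instance, a minimal sketch of that initialization (d and the variable names are just examples):

    % Illustrative initialization of the word embedding matrix We (d x |V|).
    d = 50;                               % embedding dimensionality (example value)
    V = numel(vocab);                     % vocabulary size
    We = -0.001 + 0.002 * rand(d, V);     % entries drawn from Uniform(-0.001, 0.001)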

Another thing that can help is if you label more nodes in the trees. That way the model can become better at capturing sentiment changes.

Hope that helps.

PS: Save what you write before you post, then type in the password, then copy the text and re-post. It's the only way to prevent spammers.

TranPhi, 12 April 2013, 20:29

Does this method naturally extend to the paragraph and document levels where there are multiple sentences?

Gao Pengda, 12 April 2013, 04:58

How do I use the lexica?

Can they be used in the procedure of initializing the vocabulary? And how?

RichardSocher, 11 April 2013, 10:30

Hi Xinwei,

rt = Rotten Tomatoes, a corpus of short movie review snippets.

read_rtPolarity is the original file we used to read and transform the Rotten Tomatoes movie review dataset into MATLAB.

Xinwei Tang, 07 April 2013, 05:55

I want to ask what the purpose of the M-file read_rtPolarity is, and what rt is short for.

RichardSocher, 18 January 2013, 07:34

Weird, I just tried it again and it works. Maybe GitHub had a temporary problem?

nitin thokare, 17 January 2013, 11:42

Unable to download the Java code zip file.