Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules.
We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the Experience Project website. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions.
Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.
New Development
- This model is superseded by our EMNLP 2013 paper, Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. You can test the model and find more information at http://nlp.stanford.edu/sentiment/
Download Paper
Download Dataset
- To download the dataset, we provide the following files:
- The first file has the following format:
ConfessionURLofId,Hugs,Rocks,Teehee,Understand,Wow
http://www.experienceproject.com/confessions.php?cid=2,0,3,19,0,3
- The first element in each line is the URL for downloading the actual text of the confession. The remaining 5 columns are the number of times users voted for each of the 5 categories.
- Since the confession text does not change, it can be downloaded with a simple script (a sketch follows at the end of this list). However, the votes can change, so for comparison with our model please use the votes and train/test sets in the above files.
- We used a random 70/30 split on the training set to get the development set.
- If you have any questions or trouble with the download, feel free to email richard at myLastName.org
- We thank Chris Potts for help with this interesting sentiment dataset.
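- A minimal sketch of such a download script in MATLAB (not part of the released files; EP_votes.csv is a hypothetical name for a vote file in the format shown above):
fid = fopen('EP_votes.csv');
cols = textscan(fid, '%s %d %d %d %d %d', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
urls = cols{1};                        % confession URLs, one per line
for i = 1:numel(urls)
    page = urlread(urls{i});           % raw HTML of the confession page
    % ... extract and save the story text here ...
end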
Download Code
- Download train-test code and dataset for our movie review experiments here: codeDataMoviesEMNLP.zip
- This code can be used in two major ways:
- To train a semi-supervised recursive autoencoder from random word vectors and without sentiment lexica on movie reviews.
- To test using our best trained model on the first movie review fold.
- If you have a multicore machine, the code will use all available cores in parallel.
- To run it, just open MATLAB and enter trainTestRAE (see the example below).
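For example (a minimal sketch; the exact path depends on where you unzipped codeDataMoviesEMNLP.zip):
cd codeDataMoviesEMNLP/code    % hypothetical location of the unzipped code
trainTestRAE                   % trains the semi-supervised RAE and evaluates on the movie review data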
Java code
Experimental Results
- Here's a visualization of word embeddings learned on the movie reviews data set.
- Notice that the current objective only uses sentiment, not POS tags (but such a constraint could easily be added).
- Words are colored by our model such that words with a high probability of being positive are red, and those with a low probability are blue.
- Click on the image for a legible PDF file.
Bibtex
- Please cite the following paper when you use the data set or code:
@inproceedings{SocherEtAl2011:RAE,
author = {Richard Socher and Jeffrey Pennington and Eric H. Huang and Andrew Y. Ng and Christopher D. Manning},
title = {{Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions}},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = 2011
}
Comments
For remarks, critical comments or other thoughts on the paper.
Save what you write before you post, then type in the password, post (nothing happens), then copy the text and re-post. It's the only way to prevent spammers.
Hi Socher, I have some questions about the EP dataset. I can't retrieve the text for the IDs in your file. Have the IDs been updated?
Hi Socher, I tried to download the EP dataset by the IDs but failed. I acquired 0.1 million entries, but I can't find the IDs from your file among them. Is something wrong? Can you tell me? Thank you very much.
I have a question: does the source code follow the approach described in the paper? I ask because the source code has very few comments.
YH — 31 August 2015, 10:37
I can't run it on my MATLAB 7.0. For example:
??? Error: File:
D:\Download\codeDataMoviesEMNLP\code\read_rtPolarity.m Line: 131 Column: 1
Illegal use of reserved keyword "end".
Error in ==> trainTestRAE at 53
read_rtPolarity
Also, MATLAB 7.0 does not recognize "parfor", and if I change parfor to for, many other errors appear.
Hi Socher,
When I try to use raebuilder to train the model like this: "raebuilder.main(str);", it reports many exceptions such as "java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V", even though I have added all the .jar files to the build path. Why does this happen and what can I do about it?
Thank you. Best Wishes.
Hi Socher,
I assume the code provided is for the first case of neural word representation (section 2.1), where the word embedding matrix weights L are learned as part of the backpropagation process.
Am I getting it correctly? Is there other code that learns the embedding matrix from n-gram validity targets, for example from English Wikipedia?
Thanks
Finn — 13 December 2014, 10:09
Hi Richard,
thanks for providing your great work to the community.
I intend to use your approach for a different domain, and I would really like to reproduce the visualization of the word embeddings. Is it possible to get the code for this?
Kind regards, Finn
Hi Socher, I want to use a Chinese movie dataset to train the RAE, but I have some questions about preparing the data and the input files for FullRun.java.
1. Are all the .txt files in the data/parsed folder input files? How did you get them? Does this Java code include a *.java file that parses the original review data, or do I need to produce the parsed files (word.txt, wordmap.txt, labels.txt, and so on) from the original data myself?
2. What is data/parsed/features.dat? Is it the generated word embeddings? And what is allSNum.txt for? I don't know what it means, so I don't know how to generate it.
I'm very confused about it. Thank you very much.
PS: Our EMNLP 2013 model is more accurate than this model, and its code is very easy to use and train yourself (see the Stanford Java CoreNLP package).
@Alisy, can you be a bit more specific?
Hi Socher, I recently read your paper. Very interesting indeed! But I have some points of confusion; I hope you have time to answer.
I want to know the meaning of the four kinds of variables in the rt-polarity_neg_binarized file, which comes with your code's data. Also, how can I derive this information for new data?
Thank you. Best wishes.
Hi Vargas,
Sorry for the late reply.
You first use PCA or t-SNE to project the high-dimensional word vectors into 2-D.
I then created the visualization in MATLAB using the text command inside a figure, roughly along the lines of the sketch below.
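A minimal sketch of this (not the original plotting code; We, words, and posProb are hypothetical names for the d x V embedding matrix, the word strings, and the model's per-word positive probability; pca requires the Statistics Toolbox, older versions use princomp):
[~, score] = pca(We');                          % one 2-D point per word vector
figure; hold on; axis off;
for i = 1:numel(words)
    c = [0 0 1];                                % blue: low probability of positive
    if posProb(i) > 0.5, c = [1 0 0]; end       % red: high probability of positive
    text(score(i,1), score(i,2), words{i}, 'Color', c, 'FontSize', 8);
end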
Hi Phil,
Sorry for the late reply.
This code is not very clean. For understanding backprop through RNNs, I encourage you to study the code of my ICML 2011 paper, which you can find at Parsing Natural Scenes And Natural Language With Recursive Neural Networks.
To your questions:
1. I'm not sure I understand that question. The output of one neuron is the input to another, recursively.
2. That del part is for ignoring the reconstruction error when it comes to the word vectors.
3. That is an interesting idea that we recently explored in our ACL paper, Parsing With Compositional Vector Grammars. Another group that also explored this idea in combination with CCGs is: www.karlmoritz.com/_media/hermannblunsom_acl2013.pdf
It has not been tried yet for sentiment. We will release a new dataset in ~3 weeks that will make this much more interesting :)
Hi Socher,
I'm currently training my model with the Java implementation. How can I create the PDF file with the red and blue words?
Thank you! Great job!
Phil — 23 June 2013, 07:01
Hi Socher,
I'm currently reading your paper. Very interesting indeed! But I have some points of confusion; I hope you have time to answer.
1. The calculation of nd1, nd2, parent_d (three deltas for the hidden layer and the output layer) is given as, say, nd1 = f(y1)' * (y1-c1). Here y1 is the output of a neuron; should it be the input of the neuron rather than the output?
2. In the calculation of parent_d, it is given as parent_d = f'(a1)*(w3*nd1 + w4*nd2 + mat*pd - del). What is the purpose of "mat*pd - del"? Does it serve as a sparsity constraint? I don't see sparsity constraints in your equations.
3. Would it improve performance if the autoencoders were given different sets of weights?
Thanks a lot!
Hello, I'm a little confused about the following two lines in computeCostAndGradRAE.m:
L = We(:, words_indexed);
words_embedded = We_orig(:, words_indexed) + L;
Why not use We(:, words_indexed) directly?
Thanks!
Hi Gao,
Well... this could have many causes. Is your training data skewed? That is, do you have more instances labeled as class 1 in training? If so, one possible fix is to weight those instances down in the cost function, along the lines of the sketch below.
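A minimal sketch of the idea (not from the released code; trainLabels is assumed to be a vector of integer class labels 1..K, and perInstanceLoss the loss of each training instance):
counts = accumarray(trainLabels(:), 1);           % number of instances per class
w = sum(counts) ./ (numel(counts) * counts);      % inverse-frequency class weights
instW = w(trainLabels);                           % one weight per training instance
J = mean(instW(:) .* perInstanceLoss(:));         % class-weighted cost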
Best,
Richard
Excuse me,
When I use this RAE to classify short Chinese sentences, I split the labeled data into two parts, 90% for training and 10% for testing. I get 85% precision on the training set, but the classifier gets very bad results on the test set: it assigns 80% of the test set to one class and 20% to the other, while the true distribution is 50/50.
The confusion matrix is often like [378,411;112,99] with a 500 vs. 500 test set.
Could you give me some suggestions for working this out?
Thank you.
Best wishes.
Hi Tran Phi,
I would use a strictly left-branching tree, in other words a simple chain, to extend it to paragraphs and documents (a sketch follows below).
This has not been done yet. I actually think it would be an interesting research direction.
By the way, I would also encourage you to just use the Stanford parser to get the right tree instead of searching for it with the reconstruction. Recent experiments showed improvements this way.
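A minimal sketch of the left-branching chain, using the paper's composition rule p = f(W[c1; c2] + b); W1, b1, and sentVecs (a matrix whose columns are sentence vectors) are hypothetical names for a trained composition matrix, its bias, and the inputs:
p = sentVecs(:, 1);
for i = 2:size(sentVecs, 2)
    p = tanh(W1 * [p; sentVecs(:, i)] + b1);   % fold in the next sentence
end
docVec = p;                                    % chain representation of the document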
Hi Gao Pengda,
You can initialize the vocabulary with small random numbers, like x ~ Uniform(-0.001,0.001).
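For example, in MATLAB (a minimal sketch; d and vocabSize stand for whatever embedding dimensionality and vocabulary size you use):
We = -0.001 + 0.002 * rand(d, vocabSize);   % each entry ~ Uniform(-0.001, 0.001)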
Another thing that can help is if you label more nodes in the trees. That way the model can become better at capturing sentiment changes.
Hope that helps.
PS: Save what you write before you post, then type in the password, then copy the text and re-post. It's the only way to prevent spammers.
Does this method naturally extend to the paragraph and document levels where there are multiple sentences?
How would one use sentiment lexica? Can they be used in the procedure of initializing the vocabulary, and if so, how?
Hi Xinwei,
rt = Rotten Tomatoes, a corpus of short movie review snippets.
read_rtPolarity is the original file we used to read and transform the rotten tomatoes movie review dataset into Matlab.
I want to ask: what is the purpose of the M-file read_rtPolarity, and what is rt short for?
Weird, I just tried it again and it works. Maybe GitHub had a temporary problem?
Unable to download the Java code zip file.