Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.
Download Paper
Download Dataset
- For downloading the dataset we provide the following files:
- epconfessions-release.txt - links and votes
- epconfessions-trainIDs.txt - training set
- epconfessions-testIDs.txt - test set
- The first file has the following format:
ConfessionURLofId,Hugs,Rocks,Teehee,Understand,Wow
http://www.experienceproject.com/confessions.php?cid=2,0,3,19,0,3
- The first element in each line is the URL for downloading the actual text of the confession. The remaining 5 columns are the number of times users voted for each of the 5 categories.
- Since the confession text does not change, it can be downloaded by a simple script. However, the votes can change, so for comparison to our model, please use the votes and train/test sets in the above files.
- We used a random 70/30 split of the training set to get the development set.
- If you have any questions or trouble with the download, feel free to email richard at myLastName.org
- We thank Chris Potts for help with this interesting sentiment dataset.
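The file format and the development split described above can be sketched in Python. This is a minimal illustration, not the authors' preprocessing code: the function names are hypothetical, returning `None` for a confession with no votes is an assumption, and the split assumes that the 70% portion is kept for training and the 30% portion becomes the development set.

```python
import csv
import io
import random

# Column order as given in the header of epconfessions-release.txt.
CATEGORIES = ["Hugs", "Rocks", "Teehee", "Understand", "Wow"]


def parse_votes(line):
    """Split one release-file line into (url, {category: vote count})."""
    row = next(csv.reader(io.StringIO(line.strip())))
    url = row[0]
    counts = [int(c) for c in row[1:6]]
    return url, dict(zip(CATEGORIES, counts))


def to_distribution(votes):
    """Normalize vote counts into a multinomial distribution.

    Returns None when there are no votes (an assumption of this sketch;
    the source does not say how such entries are handled).
    """
    total = sum(votes.values())
    if total == 0:
        return None
    return {category: count / total for category, count in votes.items()}


def split_train_dev(ids, train_fraction=0.7, seed=0):
    """Randomly split training IDs into (train, dev) parts.

    Assumes 70% stays in training and 30% becomes the development set.
    The seed is arbitrary and only makes the split repeatable.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Usage with the example line shown above.
url, votes = parse_votes(
    "http://www.experienceproject.com/confessions.php?cid=2,0,3,19,0,3"
)
distribution = to_distribution(votes)
```

For the example line, the 25 votes are spread over Rocks (3), Teehee (19), and Wow (3), so most of the probability mass falls on Teehee.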
Download Code
- Download train-test code and dataset for our movie review experiments here: codeDataMoviesEMNLP.zip
- This code can be used in two major ways:
- To train a semi-supervised recursive autoencoder from random word vectors and without sentiment lexica on movie reviews.
- To test using our best trained model on the first movie review fold.
- If you have a multicore machine, the code will be able to use all cores and parallelize.
- To run it, open MATLAB and enter trainTestRAE
Java code
- An implementation in Java is on github: https://github.com/sancha/jrae
- Alternatively, you could download it as a zip file: https://github.com/sancha/jrae/zipball/stable
- Read the USAGE file in the repo for instructions on usage.
- Bug reports are handled on github.
- Use only the stable branch.
Experimental Results
- Here's a visualization of word embeddings learned on the movie reviews data set.
- Notice that the current objective only uses sentiment, not POS tags (but such a constraint could easily be added).
- Words are colored by our model such that words with a high probability of being positive are red, those with low probability are blue.
- Click on the image for a legible pdf file.
Bibtex
- Please cite the following paper when you use the data set or code:
@inproceedings{SocherEtAl2011:RAE,
  author = {Richard Socher and Jeffrey Pennington and Eric H. Huang and Andrew Y. Ng and Christopher D. Manning},
  title = {{Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions}},
  booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = 2011
}
Comments
Use this section for remarks, critical comments, or other thoughts on the paper. Save what you write before you post. Then type in the password and post (nothing happens), copy the text, and re-post. It's the only way to prevent spammers.
