Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the Experience Project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm predicts distributions over such labels more accurately than several competitive baselines.
Download Paper
Download Dataset
- For downloading the dataset we provide the following files:
- epconfessions-release.txt - links and votes
- epconfessions-trainIDs.txt - training set
- epconfessions-testIDs.txt - test set
- The first file has the following format: ConfessionURLofId,Hugs,Rocks,Teehee,Understand,Wow
- Example line: http://www.experienceproject.com/confessions.php?cid=2,0,3,19,0,3
- The first element in each line is the URL for downloading the actual text of the confession. The remaining 5 columns are the number of times users voted for each of the 5 categories.
- Since the confession text does not change, it can be downloaded by a simple script. However, the votes can change, so for comparison with our model, please use the votes and train/test sets in the above files.
- We used a random 70/30 split of the training set to get the development set.
- If you have any questions or trouble with the download, feel free to email richard at myLastName.org
- We thank Chris Potts for help with this interesting sentiment dataset.
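The file format described above can be read with a few lines of Python. This is only a minimal sketch, not the code used in the paper: the function names `parse_line`, `vote_distribution`, and `split_train_dev` are hypothetical, and so are the uniform fallback for confessions with no votes, the fixed random seed, and the reading of "70/30" as 70% training and 30% development. It parses one line of epconfessions-release.txt, normalizes the five vote counts into a distribution over the labels, and draws a random development split from a list of training IDs.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple

# The five vote categories, in the column order given by the file format.
LABELS = ["Hugs", "Rocks", "Teehee", "Understand", "Wow"]


class Confession(NamedTuple):
    url: str
    votes: List[int]


def parse_line(line: str) -> Confession:
    """Parse one 'URL,Hugs,Rocks,Teehee,Understand,Wow' line."""
    # Split from the right so that only the last five fields are treated as counts.
    parts = line.strip().rsplit(",", len(LABELS))
    if len(parts) != len(LABELS) + 1:
        raise ValueError(f"expected a URL and {len(LABELS)} counts: {line!r}")
    return Confession(url=parts[0], votes=[int(c) for c in parts[1:]])


def vote_distribution(votes: Sequence[int]) -> List[float]:
    """Normalize raw vote counts so that they sum to 1."""
    total = sum(votes)
    if total == 0:
        # Assumption for this sketch: fall back to a uniform distribution.
        return [1.0 / len(votes)] * len(votes)
    return [v / total for v in votes]


def split_train_dev(ids: Sequence, dev_fraction: float = 0.3,
                    seed: int = 0) -> Tuple[list, list]:
    """Randomly hold out dev_fraction of the training IDs as a development set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


if __name__ == "__main__":
    example = "http://www.experienceproject.com/confessions.php?cid=2,0,3,19,0,3"
    confession = parse_line(example)
    for label, p in zip(LABELS, vote_distribution(confession.votes)):
        print(f"{label}: {p:.2f}")
```

For the example line, the 25 votes put most of the probability mass on Teehee (19 of 25). Because the random split drawn here will differ from the one used in the experiments, development-set numbers obtained this way are not directly comparable.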
Download Code
- Download train-test code and dataset for our movie review experiments here: codeDataMoviesEMNLP.zip
- This code can be used in two major ways:
- To train a semi-supervised recursive autoencoder from random word vectors and without sentiment lexica on movie reviews.
- To test using our best trained model on the first movie review fold.
- If you have a multicore machine, the code will be able to use all cores and parallelize.
- To run it, open MATLAB and enter trainTestRAE
Experimental Results
- Here's a visualization of word embeddings learned on the movie reviews data set.
- Notice that the current objective only uses sentiment, not POS tags (but such a constraint could easily be added).
- Words are colored by our model such that words with a high probability of being positive are red and those with a low probability are blue.
Bibtex
- Please cite the following paper when you use the data set or code:
@inproceedings{SocherEtAl2011:RAE,
author = {Richard Socher and Jeffrey Pennington and Eric H. Huang and Andrew Y. Ng and Christopher D. Manning},
title = {{Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions}},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = 2011
}
