Improving Word Representations via Global Context and Multiple Word Prototypes
Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems. However, most of these models are built with only local context and one representation per word. This is problematic because words are often polysemous and global context can also provide useful information for learning word meanings. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word. We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Paper Download
Word Vectors and Code
- Our final word vectors, together with code to disambiguate between multiple word vectors in context: vectors and testing code
- If you want to train the model on another language or your own corpus: training code
- Our method can also provide a single vector for each word. You can download these vectors as a text file ACL2012_wordVectorsTextFile.zip (14MB)
- The archive includes one text file with the vectors and one text file with the word list.
- If you want them in one file, just type paste vocab.txt wordVectors.txt
- This file contains the 10 word vectors learned for each of the frequent words, in text format: ACL2012_wordVectorsTextFile_multiple.zip (30MB)
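For readers who would rather load the single-prototype vectors programmatically than join the two files with paste, here is a minimal Python sketch. It assumes that line i of vocab.txt holds the word for line i of wordVectors.txt (which is what the paste command implies) and that each vector line is a whitespace-separated list of numbers. The second assumption is a guess about the file layout and is not confirmed here. The function names pair_vectors and cosine are hypothetical helpers, not part of the released code.

```python
import math
from itertools import zip_longest


def pair_vectors(vocab_lines, vector_lines):
    """Pair the i-th word with the i-th vector line.

    Assumption: each vector line is whitespace-separated floats.
    Both arguments can be open file objects or any iterables of lines.
    """
    embeddings = {}
    for word_line, vec_line in zip_longest(vocab_lines, vector_lines):
        if word_line is None or vec_line is None:
            raise ValueError("word list and vector file differ in length")
        embeddings[word_line.strip()] = [float(x) for x in vec_line.split()]
    return embeddings


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With the unzipped files in the working directory, a call such as pair_vectors(open("vocab.txt"), open("wordVectors.txt")) should return a dictionary from word to vector, provided the layout assumptions above hold.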
Dataset: Stanford's Contextual Word Similarities (SCWS)
- Download Link: SCWS.zip (dataset)
- Description: Each line in ratings.txt consists of a pair of words, their respective contexts, the 10 individual human ratings, as well as their averages. The target word is surrounded by <b>...</b> in its context. Each line is tab-delimited with the following format:
<id> <word1> <POS of word1> <word2> <POS of word2> <word1 in context> <word2 in context> <average human rating> <10 individual human ratings>
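The line format above can be read with a few lines of Python. The following is a sketch under stated assumptions, not an official loader: the field order follows the format line, every field after the average is treated as rating data (whether the ratings are separated by tabs or by spaces within one field), and the names ScwsRow, parse_row and target_in_context are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class ScwsRow(NamedTuple):
    pair_id: str
    word1: str
    pos1: str
    word2: str
    pos2: str
    context1: str
    context2: str
    average: float
    ratings: List[float]


# The target word is marked as <b>...</b> inside its context.
_TARGET = re.compile(r"<b>\s*(.*?)\s*</b>")


def parse_row(line: str) -> ScwsRow:
    """Split one tab-delimited line of ratings.txt into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 tab-separated fields, got {len(fields)}")
    head, rest = fields[:7], fields[7:]
    average = float(rest[0])
    # Assumption: everything after the average is an individual rating.
    ratings = [float(x) for field in rest[1:] for x in field.split()]
    return ScwsRow(*head, average, ratings)


def target_in_context(context: str) -> Optional[str]:
    """Return the text between <b> and </b>, or None if no marker is found."""
    match = _TARGET.search(context)
    return match.group(1) if match else None
```

Iterating over open("ratings.txt", encoding="utf-8") and calling parse_row on each line would then give one record per word pair. The file encoding is also an assumption.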
Bibtex
- @inproceedings{HuangEtAl2012,
author = {Eric H. Huang and Richard Socher and Christopher D. Manning and Andrew Y. Ng},
title = {{Improving Word Representations via Global Context and Multiple Word Prototypes}},
booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)},
year = 2012
}
