Distributed Representation of Words, Sentences and Paragraphs
朱维希 2016/3/14
Outline
- Explanation
  - Word Vector [Mikolov et al.]
  - Paragraph Vector [Quoc Le, Tomas Mikolov]
- Applications
  - Automatic speech recognition
  - Machine translation
  - NLP tasks
- Recommended Papers
Word Vector
Word Vector
- Assumption: words that appear in similar contexts have similar meanings.
- Skip-gram model: trained to predict the nearby words of each word.
- Softmax function: a generalization of the logistic function, used to normalize the prediction over the vocabulary.
- Efficient alternatives to the full softmax: hierarchical softmax, Noise Contrastive Estimation, Negative Sampling.
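For reference, the two formulas behind these bullets, following Mikolov et al.: the skip-gram model maximizes the average log-probability of the words surrounding each center word, and the full softmax defines that probability over the whole vocabulary:

\[
\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\!\big({v'_{w_O}}^{\top} v_{w_I}\big)}{\sum_{w=1}^{W} \exp\!\big({v'_w}^{\top} v_{w_I}\big)}
\]

Here v_w and v'_w are the "input" and "output" vector representations of word w, c is the context window size, and W is the vocabulary size.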
Negative Sampling
- A simplified form of NCE (Noise Contrastive Estimation).
- Maximizing the probability of the observed (w, c) pairs alone has a trivial solution: set v_c = v_w for every pair and make v_c · v_w = K, with K large enough (> 40).
- To prevent all vectors from becoming the same, some (w, c) pairs are disallowed: a set D′ of random (w, c) pairs is generated and assumed to be incorrect (the negative samples).
- Formulas on the slide: skip-gram model, softmax, negative sampling.
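For reference, the negative-sampling objective from Mikolov et al. that replaces the full softmax: each observed pair (w_I, w_O) is scored against k noise words drawn from a noise distribution P_n(w),

\[
\log \sigma\!\big({v'_{w_O}}^{\top} v_{w_I}\big) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \Big[ \log \sigma\!\big(-{v'_{w_i}}^{\top} v_{w_I}\big) \Big],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
\]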
Subsampling
- Each word w_i in the training set is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the frequency of word w_i and t is a chosen threshold (around 10^-5 in Mikolov et al.).
- "We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies."
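A minimal sketch of how this discard rule could be applied when streaming over a corpus; the function name, the way frequencies are counted, and the default threshold are illustrative, not taken from the slides:

import math
import random
from collections import Counter

def subsample(tokens, t=1e-5):
    """Drop frequent words with probability 1 - sqrt(t / f(w))."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total                        # relative frequency f(w)
        p_discard = max(0.0, 1.0 - math.sqrt(t / f)) # discard probability from the formula above
        if random.random() >= p_discard:             # otherwise keep the word
            kept.append(w)
    return kept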
Paragraph Vector
Paragraph Vector
- After being trained, the paragraph vectors can be used as features for the paragraph (e.g., in lieu of or in addition to bag-of-words). We can feed these features directly to conventional machine learning techniques such as logistic regression, support vector machines or K-means.
- The Distributed Memory Model of Paragraph Vectors (PV-DM): the paragraph vector is combined with the context word vectors to predict the next word, acting as a memory of what is missing from the current context.
- The Distributed Bag of Words version of Paragraph Vector (PV-DBOW): "Another way...ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output..."
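A minimal sketch of training both variants with the gensim library (gensim is an assumption here, not mentioned on the slides; the corpus and hyperparameters are illustrative). In gensim's Doc2Vec, dm=1 corresponds to PV-DM and dm=0 to PV-DBOW:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each paragraph becomes a TaggedDocument with a unique tag
corpus = [
    TaggedDocument("the movie was touching and great".split(), ["d0"]),
    TaggedDocument("the film was boring and terrible".split(), ["d1"]),
]

# PV-DM: the paragraph vector and the context words jointly predict the next word
pv_dm = Doc2Vec(corpus, dm=1, vector_size=50, window=3, min_count=1, epochs=40)

# PV-DBOW: the paragraph vector alone predicts words sampled from the paragraph
pv_dbow = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen paragraph, to be used as features downstream
features = pv_dm.infer_vector("a touching film".split())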
Paragraph Vector “...Dataset: This dataset was first proposed by (Pang & Lee, 2005) and subsequently extended by (Socher et al., 2013b) as a benchmark for sentiment analysis. It has sentences taken from the movie review site Rotten Tomatoes. The dataset consists of three sets: 8544 sentences for training, 2210 sentences for test and 1101 sentences for validation (or development). Every sentence in the dataset has a label which goes from very negative to very positive in the scale from 0.0 to 1.0. The labels are generated by human annotators using Amazon Mechanical Turk....”
Paragraph Vector “...Dataset: The IMDB dataset was first proposed by Maas et al. (Maas et al., 2011) as a benchmark for sentiment analysis. The dataset consists of 100,000 movie reviews taken from IMDB. One key aspect of this dataset is that each movie review has several sentences. The 100,000 movie reviews are divided into three datasets: 25,000 labeled training instances, 25,000 labeled test instances and 50,000 unlabeled training instances. There are two types of labels: Positive and Negative. These labels are balanced in both the training and the test set. ...”
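A minimal, self-contained sketch of the sentiment-classification setup described earlier (paragraph vectors fed to logistic regression); the toy texts stand in for the 25,000 labeled training and 25,000 test reviews, and scikit-learn is an assumption, not mentioned on the slides:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

train_texts = ["a wonderful touching film", "great acting and a great story",
               "boring and far too long", "a terrible waste of time"]
train_labels = [1, 1, 0, 0]                     # 1 = Positive, 0 = Negative
test_texts = ["a wonderful story", "a boring film"]

docs = [TaggedDocument(t.split(), [str(i)]) for i, t in enumerate(train_texts)]
model = Doc2Vec(docs, dm=1, vector_size=20, min_count=1, epochs=60)

X_train = [model.infer_vector(t.split()) for t in train_texts]
X_test = [model.infer_vector(t.split()) for t in test_texts]

clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(clf.predict(X_test))                      # predicted labels for the test texts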
Paragraph Vector “...Here, we have a dataset of paragraphs in the first 10 results returned by a search engine given each of 1,000,000 most popular queries. Each of these paragraphs is also known as a “snippet” which summarizes the content of a web page and how a web page matches the query. From such collection, we derive a new dataset to test vector representations of paragraphs. For each query, we create a triplet of paragraphs: the two paragraphs are results of the same query, whereas the third paragraph is a randomly sampled paragraph from the rest of the collection (returned as the result of a different query). Our goal is to identify which of the three paragraphs are results of the same query. ...”
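A minimal sketch of the triplet test described in the quote, assuming each of the three paragraphs has already been mapped to a vector; the cosine-similarity helper and the decision rule are illustrative:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(p0, p1, p2):
    """Return the index (0, 1, or 2) of the paragraph that does not belong to the query."""
    vecs = [np.asarray(p0), np.asarray(p1), np.asarray(p2)]
    pairs = [(0, 1), (0, 2), (1, 2)]
    sims = [cosine(vecs[i], vecs[j]) for i, j in pairs]
    # Assume the most similar pair are the two results of the same query;
    # the remaining paragraph is the randomly sampled one.
    closest = max(range(3), key=lambda k: sims[k])
    return ({0, 1, 2} - set(pairs[closest])).pop()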
Recommended Papers
- Manning C D. Computational Linguistics and Deep Learning[J]. Computational Linguistics, 2016.
- Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning[C]//Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
- Glorot X, Bordes A, Bengio Y. Domain adaptation for large-scale sentiment classification: A deep learning approach[C]//Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
- Mikolov T, Yih W, Zweig G. Linguistic Regularities in Continuous Space Word Representations[C]//HLT-NAACL. 2013.
- Schwenk H. Continuous space language models[J]. Computer Speech & Language, 2007, 21(3).
- Socher R, Lin C C, Manning C, et al. Parsing natural scenes and natural language with recursive neural networks[C]//Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
- Turney P D, Pantel P. From frequency to meaning: Vector space models of semantics[J]. Journal of Artificial Intelligence Research, 2010, 37(1).
- Turney P D. Distributional semantics beyond words: Supervised learning of analogy and paraphrase[J]. arXiv preprint, 2013.
- Weston J, Bengio S, Usunier N. Wsabie: Scaling up to large vocabulary image annotation[C]//IJCAI. 2011, 11.
References
- Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems. 2013.
- Le Q V, Mikolov T. Distributed representations of sentences and documents[J]. arXiv preprint, 2014.
- Goldberg Y, Levy O. word2vec Explained: Deriving Mikolov et al.'s negative-sampling word-embedding method[J]. arXiv preprint, 2014.