Presentation transcript:

Improving Word Representations via Global Context and Multiple Word Prototypes
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305, USA
2014/10/30 陳思澄

Outline
- Introduction
- Global Context-Aware Neural Language Model
- Multi-Prototype Neural Language Model
- Experiments
- Conclusion

Introduction
Unsupervised word representations are very useful in NLP tasks, both as inputs to learning algorithms and as extra word features in NLP systems. However, most of these models are built with only local context and one representation per word. This is problematic because words are often polysemous, and global context can also provide useful information for learning word meanings.

Introduction
We present a new neural network architecture which (1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and (2) accounts for homonymy and polysemy by learning multiple embeddings per word. We also introduce a new dataset with human judgments on pairs of words in sentential context and evaluate our model on it, showing that it outperforms competitive baselines and other neural language models.

Global Context-Aware Neural Language Model
Our model jointly learns word representations while learning to discriminate the next word given a short word sequence (local context) and the document (global context) in which the word sequence occurs. Because our goal is to learn useful word representations rather than the probability of the next word given previous words (which would prohibit looking ahead), our model can utilize the entire document to provide global context.

Global Context-Aware Neural Language Model
Given a word sequence s and document d in which the sequence occurs, our goal is to discriminate the correct last word in s from other random words. We compute scores g(s, d) and g(s^w, d), where s^w is s with the last word replaced by word w, and g(·,·) is the scoring function that represents the neural networks used. We want g(s, d) to be larger than g(s^w, d) by a margin of 1 for any other word w in the vocabulary, which corresponds to the training objective of minimizing the following ranking loss for each (s, d) found in the corpus:
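The loss formula itself is an image on the slide and is missing from the transcript; reconstructed from the margin-1 ranking description above, it should be:

```latex
C_{s,d} = \sum_{w \in V} \max\bigl(0,\; 1 - g(s, d) + g(s^w, d)\bigr)
```

where V is the vocabulary.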

Global Context-Aware Neural Language Model
We define two scoring components that contribute to the final score of a (word sequence, document) pair. The scoring components are computed by two neural networks, one capturing local context and the other global context.

Local Context Score:
To compute the score of the local context, score_l, we use a neural network with one hidden layer, where [x_1; x_2; ...; x_m] is the concatenation of the m word embeddings representing sequence s, f is an element-wise activation function such as tanh, a_1 ∈ R^{h×1} is the activation of the hidden layer with h hidden nodes, W_1 ∈ R^{h×(mn)} and W_2 ∈ R^{1×h} are respectively the first- and second-layer weights of the neural network, and b_1, b_2 are the biases of each layer.
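The local-score equations appear only as an image on the slide; a reconstruction consistent with the definitions above is:

```latex
a_1 = f\bigl(W_1 [x_1; x_2; \ldots; x_m] + b_1\bigr), \qquad
\mathit{score}_l = W_2\, a_1 + b_2
```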

Global Context Score:
For the score of the global context, we also represent the document as an ordered list of word embeddings, d = (d_1, d_2, ..., d_k). We first compute the weighted average c of all word vectors in the document, where w(·) can be any weighting function that captures the importance of word t_i in the document; we use idf-weighting as the weighting function. We then use a two-layer neural network similar to the one above to compute the global context score, score_g:
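The slide's two formulas are likewise images; reconstructed from the definitions above (and assuming, as in the paper, that the global network scores the average c concatenated with the embedding x_m of the sequence's last word), they would be:

```latex
c = \frac{\sum_{i=1}^{k} w(t_i)\, d_i}{\sum_{i=1}^{k} w(t_i)}, \qquad
\mathit{score}_g = W_2^{(g)} f\bigl(W_1^{(g)} [c;\, x_m] + b_1^{(g)}\bigr) + b_2^{(g)}
```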

Final Score:
The final score is the sum of the two scores, score = score_l + score_g. The local score preserves word order and syntactic information, while the global score uses a weighted average, which is similar to bag-of-words features and captures more of the semantics and topics of the document.
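To make the two scoring networks concrete, here is a minimal numpy sketch of score_l, score_g, and their sum. All names, sizes, and the random initialization are illustrative only; in the actual model the weights and embeddings are trained jointly via the ranking loss above, and the global network's input [c; x_m] follows the paper rather than the slide text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n-dimensional embeddings, m-word local windows,
# h hidden units (the paper uses n=50, m=10, h=100).
n, m, h = 50, 10, 100

# Randomly initialized parameters of the two scoring networks; in the
# real model these are learned jointly with the embeddings.
W1_l, b1_l = rng.normal(size=(h, m * n)), np.zeros(h)   # local net, layer 1
W2_l, b2_l = rng.normal(size=(1, h)), np.zeros(1)       # local net, layer 2
W1_g, b1_g = rng.normal(size=(h, 2 * n)), np.zeros(h)   # global net, layer 1
W2_g, b2_g = rng.normal(size=(1, h)), np.zeros(1)       # global net, layer 2

def local_score(window_embs):
    """score_l: one hidden layer over the concatenation [x_1; ...; x_m]
    of the window's word embeddings (order-sensitive)."""
    x = np.concatenate(window_embs)                      # shape (m*n,)
    a1 = np.tanh(W1_l @ x + b1_l)                        # hidden activation
    return (W2_l @ a1 + b2_l).item()

def global_score(doc_embs, idf_weights, last_word_emb):
    """score_g: idf-weighted average of all document word vectors,
    concatenated with the candidate last word's embedding."""
    w = np.asarray(idf_weights)
    c = (w[:, None] * np.asarray(doc_embs)).sum(axis=0) / w.sum()
    a1 = np.tanh(W1_g @ np.concatenate([c, last_word_emb]) + b1_g)
    return (W2_g @ a1 + b2_g).item()

def score(window_embs, doc_embs, idf_weights):
    """Final score g(s, d) = score_l + score_g."""
    return local_score(window_embs) + global_score(
        doc_embs, idf_weights, window_embs[-1])

# Toy usage with random vectors standing in for a real (s, d) pair.
window = [rng.normal(size=n) for _ in range(m)]
document = [rng.normal(size=n) for _ in range(200)]
idf = rng.uniform(0.5, 3.0, size=200)
print(score(window, document, idf))
```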

Multi-Prototype Neural Language Model
In order to learn multiple prototypes, we first gather the fixed-size context windows of all occurrences of a word (we use 5 words before and after the word occurrence). Each context is represented by a weighted average of the context words' vectors, where again, we use idf-weighting as the weighting function. We then use spherical k-means to cluster these context representations, which has been shown to model semantic relations well. Finally, each word occurrence in the corpus is re-labeled to its associated cluster and is used to train the word representation for that cluster.
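A rough sketch of this clustering step, assuming hypothetical inputs (a tokenized corpus, an embedding dict, and precomputed idf weights); scikit-learn ships no spherical k-means, so length-normalizing the context vectors and running ordinary k-means is used here as an approximation:

```python
import numpy as np
from sklearn.cluster import KMeans

def multi_prototype_labels(corpus, target, embeddings, idf, k=10, window=5):
    """Cluster the contexts of every occurrence of `target`; the returned
    label says which of the k prototypes each occurrence should train."""
    contexts = []
    for tokens in corpus:                          # corpus: list of token lists
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # 5 words before and after the occurrence (fixed-size window)
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            ctx = [t for t in ctx if t in embeddings]
            if not ctx:
                continue
            weights = np.array([idf.get(t, 1.0) for t in ctx])
            vecs = np.array([embeddings[t] for t in ctx])
            # idf-weighted average of the context words' vectors
            contexts.append((weights[:, None] * vecs).sum(axis=0) / weights.sum())
    X = np.vstack(contexts)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit length ~ spherical k-means
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```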

Similarity between a pair of words (w, w′) using the multi-prototype approach can be computed with or without context, as defined by Reisinger and Mooney, where p(c, w, i) is the likelihood that word w is in its cluster i given context c, μ_i(w) is the vector representing the i-th cluster centroid of w, and d(v, v′) is a function computing similarity between two vectors. The similarity measure can also be computed in the absence of context by assuming a uniform p(c, w, i) over i.
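The measure itself appears only as an image on the slide; Reisinger and Mooney's contextual measure AvgSimC, reconstructed from the definitions above with K prototypes per word, is:

```latex
\mathrm{AvgSimC}(w, w') = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K}
p(c, w, i)\, p(c', w', j)\, d\bigl(\mu_i(w), \mu_j(w')\bigr)
```

With uniform p(c, w, i), this reduces to the context-free AvgSim, i.e. the average of d(μ_i(w), μ_j(w′)) over all prototype pairs.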

Experiments
We chose Wikipedia as the corpus to train all models (about 2 million articles and 990 million tokens in total). We use a dictionary of the 30,000 most frequent words in Wikipedia. For all experiments, our models use 50-dimensional embeddings. We use 10-word windows of text as the local context, 100 hidden units, and no weight regularization for both neural networks. For the multi-prototype variants, we fix the number of prototypes to 10.

Table 1 shows the nearest neighbors of some words. In order to show that our model learns more semantic word representations with global context, we give the nearest neighbors of our single-prototype model versus C&W’s, which only uses local context.

Table 2 shows the nearest neighbors of our model using the multi-prototype approach. We see that the clustering is able to group contexts of different meanings of a word into separate groups, allowing our model to learn multiple meaningful representations of a word.

Table 3 shows our results compared to previous methods. Note that our model achieves higher correlation (64.2) than either using local context alone (C&W: 55.3) or using global context alone (Our Model-g: 22.8). We also found that correlation can be further improved by removing stop words (71.3). Thus, each window of text contains more information but still preserves some syntactic information as the words are still ordered in the local context.

Table 5 compares different models' results on this dataset. We compare against the following baselines: tf-idf, which represents words in a word-word matrix capturing co-occurrence counts in all 10-word context windows, and pruned tf-idf, which removes all but the top 200 features in each word vector (we report its result after tuning the threshold value on this dataset). The results suggest that the pruned tf-idf representations might be more susceptible to noise or mistakes in context clustering.

Conclusion
We presented a new neural network architecture that learns more semantic word representations by using both local and global context in learning. We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate a model's ability to capture the homonymy and polysemy of words in context. Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset.