Efficient Estimation of Word Representations in Vector Space


Efficient Estimation of Word Representations in Vector Space Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean Google Inc., Mountain View, CA  ICLR, 2013 Date: 2014/10/16 Presenter: 陳思澄

Outline Introduction New Log-linear Models Continuous Bag-of-Words Model (CBOW) Continuous Skip-gram Model Results Conclusion

Introduction The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity. Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities.

Introduction Example: vector(“King”) - vector(“Man”) + vector(“Woman”) results in a vector that is closest to the vector representation of the word Queen. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depend on the dimensionality of the word vectors and on the amount of training data.

New Log-linear Models We propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. We decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.

Continuous Bag-of-Words Model (CBOW) The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words; thus, all words get projected into the same position (their vectors are averaged).
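As a rough illustration (not the paper's implementation), the minimal NumPy sketch below shows a CBOW-style forward pass over a toy vocabulary: the context vectors share one projection (they are averaged) and a log-linear softmax layer scores the current word. All sizes, weights, and names here are illustrative assumptions.

```python
import numpy as np

# Toy CBOW forward pass: average the context word vectors (shared projection)
# and score every vocabulary word with a log-linear (softmax) output layer.
vocab_size, dim = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input/projection vectors
W_out = rng.normal(scale=0.1, size=(dim, vocab_size))  # output weights

def cbow_probs(context_ids):
    """Probability of each vocabulary word given the averaged context."""
    h = W_in[context_ids].mean(axis=0)   # shared projection: word order is ignored
    scores = h @ W_out                   # log-linear scores
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # softmax over the vocabulary

# Context word IDs for w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} (toy indices).
print(cbow_probs([1, 2, 4, 5]))
```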

Continuous Skip-gram Model The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it uses the current word to predict the surrounding words within a certain range before and after it. We found that increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity. P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} | w_t)
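For concreteness, here is a small sketch of how skip-gram (center, context) training pairs can be generated from a pre-tokenized sentence with a symmetric window; the example sentence and window size are placeholders, not values from the paper.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs used by the skip-gram objective."""
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                yield center, tokens[c]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(list(skipgram_pairs(sentence, window=2)))
```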

Result (1/6) We follow the previous observation that there can be many different types of similarities between words. We further denote two pairs of words with the same relationship as a question, as we can ask: “What is the word that is similar to small in the same sense as biggest is similar to big?” These questions can be answered by performing simple algebraic operations with the vector representations of words: vector X = vector("biggest") - vector("big") + vector("small"). Then, we search the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (we discard the input question words during this search). When the word vectors are well trained, it is possible to find the correct answer (the word smallest) using this method.
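A minimal NumPy sketch of this nearest-neighbour search by cosine similarity. The vocabulary below is a toy, randomly initialized dictionary purely for illustration; real vectors would come from a trained model, and only then does the search return meaningful answers.

```python
import numpy as np

def analogy(word_a, word_b, word_c, vectors):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c),
    e.g. biggest - big + small -> smallest."""
    x = vectors[word_a] - vectors[word_b] + vectors[word_c]
    x = x / np.linalg.norm(x)
    best, best_sim = None, -np.inf
    for w, v in vectors.items():
        if w in (word_a, word_b, word_c):
            continue  # discard the input question words during the search
        sim = (v @ x) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy vectors for illustration only.
rng = np.random.default_rng(0)
toy = {w: rng.normal(size=8) for w in ["big", "biggest", "small", "smallest", "other"]}
print(analogy("biggest", "big", "small", toy))
```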

Result (2/6) Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented.

Result (3/6) The test set contains 8869 semantic and 10675 syntactic questions.
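As a sketch of how accuracy on such questions can be computed, assuming the analogy helper and toy vectors from the previous sketch, questions given as (a, b, c, expected) tuples (the data format here is an assumption), and counting only exact matches as correct, as in the paper:

```python
def analogy_accuracy(questions, vectors):
    """Fraction of analogy questions whose top answer matches exactly."""
    correct = sum(1 for a, b, c, expected in questions
                  if analogy(a, b, c, vectors) == expected)
    return correct / len(questions)

# One illustrative question in (a, b, c, expected) form.
questions = [("biggest", "big", "small", "smallest")]
print(analogy_accuracy(questions, toy))
```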

Result (4/6) Word vector corpus: Google News (6B tokens); vocabulary size: 30k most frequent words. Corpus: LDC (320M words, 82k vocabulary).

Result (5/6)

Result (6/6)

Conclusion In this paper we studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks. We observed that it is possible to train high-quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high-dimensional word vectors from a much larger data set. Our ongoing work shows that the word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also to verification of the correctness of existing facts. In the future, it would also be interesting to compare our techniques to Latent Relational Analysis and others.

Word2vec Tool URL: https://code.google.com/p/word2vec/
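For example, vectors produced by the word2vec tool can be loaded and queried with gensim; the file name below is just a placeholder for whatever output file the tool wrote.

```python
from gensim.models import KeyedVectors

# Load vectors written by the word2vec tool in binary format (path is a placeholder).
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Analogy query: king - man + woman ~ queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```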

Word2vec Corpus: Delta Electronics EDU0000~EDU0003 meeting content

Word2vec Corpus: PTT Education board language model

RNN Experiment Results (WordNum = 5)

            Edu0000  Edu0001  Edu0002  Edu0003
Baseline     86.38    80.51    89.04    91.22
RNN+Ngram    82.0     75.83    83.69    87.92

            RNN (PPL)  Ngram (PPL)
edu0000      33.3438    36.1975
edu0001      39.6715    81.969
edu0002      33.5458    44.613
edu0003      34.0591    41.2404

Training corpus: MATBN + Delta language model (about 6000 sentences, file size 1481 KB).
Development set: edu0000~edu0003.
Test set: N-best lists of edu0000 through edu0003.
Experimental settings: Hidden_size: 400, Class_size: 200.