Word Embeddings with Limited Memory

Presentation transcript:

Word Embeddings with Limited Memory
Shaoshi Ling, Yangqiu Song, and Dan Roth
Computer Science Department, University of Illinois at Urbana-Champaign
Department of Computer Science and Engineering, Hong Kong University of Science and Technology

Abstract
We study the effect of limited-precision data representation and computation on word embeddings. We present a systematic evaluation of word embeddings with limited memory and discuss methods that directly train limited-precision embeddings with limited memory. We show that it is possible to use and train 8-bit fixed-point values for word embeddings without performance loss on word and phrase similarity and dependency parsing tasks.

Word Embedding
There is an accumulation of evidence that the use of dense distributional lexical representations, known as word embeddings, often supports better performance on a range of NLP tasks (Bengio et al., 2003; Turian et al., 2010; Collobert et al., 2011; Mikolov et al., 2013a; Mikolov et al., 2013b; Levy et al., 2015).
Figure: the CBOW and Skipgram architectures (context words word(i-k), word(i-k+1), ..., word(i+k); a sum/projection layer; the target word word(i)).
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html

Method (Post-processing)
Rounding. We want to round x so that it lies in the range [-r, r]. For example, if we want to use 8 bits to represent any value in the vector, then we only have 256 numbers, the integers from -128 to 127, to map onto [-r, r].
Stochastic Rounding. Stochastic rounding introduces some randomness into the rounding mechanism (Gupta et al., 2015). The probability of rounding x down to the nearest representable value ⌊x⌋ is proportional to the proximity of x to ⌊x⌋: P(round(x) = ⌊x⌋) = 1 - (x - ⌊x⌋)/ε, where ε is the gap between adjacent representable values.

Method (Training)
Auxiliary Update Vectors. Problem: reduce the memory used by the training process. Idea: keep each variable at a fixed precision and use auxiliary numbers to trade precision for space. Suppose we know the range of the update values in SGD; we then use m additional bits to store all the values smaller than the limited numerical precision r', so that the interval below r' is subdivided into 2^m finer levels and the new precision becomes r'/2^m.

Results (more results are shown in the paper)
Word similarity datasets (Faruqui and Dyer, 2014). Evaluation metric: Spearman's rank correlation coefficient (Myers and Well, 1995) between the algorithm's ranking and the human-labeled ranking.
Paraphrase (bigram) datasets (Wieting et al., 2015). Evaluation metric: cosine similarity, used to evaluate the correlation between the computed similarity and the annotated similarity between paraphrases.
Figure: performance on multiple similarity tasks with different truncation values. The y-axis shows Spearman's rank correlation coefficient for the word similarity datasets and the cosine value for the paraphrase (bigram) datasets.
Table: the detailed average results for word similarity and paraphrases corresponding to the figures above.
Table: comparison of the trained CBOW models. We set the average value of the original word2vec embeddings to 1, and the values in the table are relative to those original-embedding baselines. "avg. (w.)" is the average over all word similarity datasets; "avg. (b.)" is the average over all bigram phrase similarity datasets; "Stoch. (16 b.)" is stochastic rounding applied at 16-bit precision; "Trunc. (8 b.)" is truncation with 8-bit auxiliary update vectors applied at 8-bit precision.
Table: evaluation results for dependency parsing (in LAS) (Buchholz and Marsi, 2006; Guo et al., 2015).
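To make the post-processing step above concrete, here is a minimal NumPy sketch that clips embedding values to [-r, r] and maps them onto 2^bits integer levels (256 levels for 8 bits), with optional stochastic rounding in the spirit of Gupta et al. (2015). The function name, the default r, and the example dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quantize_embeddings(E, r=1.0, bits=8, stochastic=False, rng=None):
    """Map float embeddings onto 2**bits integer levels covering [-r, r].

    E          : (vocab, dim) float array of word embeddings
    r          : clipping range; values outside [-r, r] are truncated
    bits       : bits per stored value (8 -> integer levels -128..127)
    stochastic : if True, round up/down with probability proportional to
                 proximity (stochastic rounding); else nearest rounding
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** bits                        # e.g. 256 levels for 8 bits
    lo, hi = -(levels // 2), levels // 2 - 1  # e.g. -128 .. 127
    eps = 2 * r / levels                      # value spanned by one level

    x = np.clip(E, -r, r) / eps               # continuous position in level units
    if stochastic:
        floor = np.floor(x)
        # round up with probability equal to the fractional part of x
        x = floor + (rng.random(x.shape) < (x - floor))
    else:
        x = np.rint(x)
    q = np.clip(x, lo, hi).astype(np.int8 if bits == 8 else np.int16)
    return q, eps                              # approximate floats: q * eps

# Example: 10k-word, 200-dimensional embeddings quantized to 8 bits
E = np.random.uniform(-1, 1, size=(10_000, 200)).astype(np.float64)
q, eps = quantize_embeddings(E, r=1.0, bits=8, stochastic=True)
print(q.dtype, q.nbytes / E.nbytes)            # int8, 1/8 of the original memory
```

In this sketch, q * eps recovers approximate float values for downstream similarity computations, while the stored matrix q occupies one eighth of the 64-bit original.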
Memory Use
Storing 1 million words as 200-dimensional vectors takes 1.6 GB with the standard 64 bits per value.
Application: billions of tokens and multiple languages.
Question: what is the impact of representing each dimension of a dense representation with significantly fewer bits than the standard 64 bits?

Conclusion
We systematically evaluated how small the representation size of dense word embeddings can be before it starts to impact the performance of the NLP tasks that use them. We considered both the final size of the embeddings and the size we allow them while they are being learned. Our study covers both the CBOW and skipgram models at 25 and 200 dimensions and shows that 8 bits per dimension (and sometimes even fewer) are sufficient to represent each value and maintain performance on a range of lexical tasks. We also provided two ways to train the embeddings with reduced memory use. The natural next step is to extend these experiments and study the impact of the representation size on more advanced tasks.

This work was supported by DARPA under agreement numbers HR0011-15-2-0025 and FA8750-13-2-0008.
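For reference, the Memory Use figures above follow from simple arithmetic; the short sketch below (the function name and the decimal-GB convention are my assumptions) reproduces the 1.6 GB baseline for 64-bit values and shows the corresponding sizes at lower precisions.

```python
def embedding_memory_gb(vocab_size, dim, bits_per_value):
    """Memory needed to store vocab_size embedding vectors of dim values each."""
    return vocab_size * dim * bits_per_value / 8 / 1e9  # bits -> bytes -> GB

# 1 million words, 200-dimensional vectors, at different precisions
for bits in (64, 32, 16, 8):
    print(f"{bits:>2} bits per value: "
          f"{embedding_memory_gb(1_000_000, 200, bits):.2f} GB")
# 64 bits gives 1.60 GB (the figure quoted above); 8 bits shrinks it to 0.20 GB
```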