Presented by Jiaxing Tan

Presentation transcript:

Improving Distributional Similarity with Lessons Learned from Word Embeddings. Presented by Jiaxing Tan. Some slides are taken from the original paper presentation. I'd like to talk to you about how word embeddings are really improving distributional similarity. This is joint work with Yoav Goldberg and Ido Dagan.

Outline: Background; Hyper-parameters to experiment with; Experiments and Results.

Motivation of word vector representation: compare word similarity and relatedness. How similar is iPhone to iPad? How related is AlphaGo to Google? In a search engine, we represent the query as a vector to compare similarity. Representing words as vectors allows easy computation of similarity.
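
As a minimal illustration (not from the talk), similarity between two word vectors is typically measured with cosine similarity; the toy vectors and values below are made up.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional vectors, purely for illustration.
iphone = np.array([0.9, 0.1, 0.3, 0.0])
ipad = np.array([0.8, 0.2, 0.4, 0.1])
print(cosine_similarity(iphone, ipad))  # close to 1.0, i.e. very similar
```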

Approaches for Representing Words. Distributional Semantics (Count): used since the 90's; a sparse word-context PMI/PPMI matrix, optionally decomposed with SVD (dense). Word Embeddings (Predict): inspired by deep learning; word2vec (SGNS) (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). There are two approaches for creating these vectors. First, we've got the traditional count-based approach, where the most common variant is the word-context PMI matrix. The other, cutting-edge approach is word embeddings, which has become extremely popular with word2vec. Now, the interesting thing is that both approaches rely on the same linguistic theory: the distributional hypothesis. In previous work, which I'll talk about in a minute, we showed that the two approaches are even more related. Underlying theory: the Distributional Hypothesis (Harris, '54; Firth, '57): “Similar words occur in similar contexts”

Contributions: (1) identifying the existence of new hyperparameters, which are not always mentioned in papers; (2) adapting the hyperparameters across algorithms, which requires understanding the mathematical relation between the algorithms; (3) comparing algorithms across all hyperparameter settings, over 5,000 experiments. Finally, we were able to apply almost every hyperparameter to every method, and perform systematic “apples to apples” comparisons of the algorithms.

What is word2vec? So what is word2vec?...

What is word2vec? word2vec is not a single algorithm. It is a software package for representing words as vectors, containing: two distinct models (CBoW and Skip-Gram); various training methods (Negative Sampling and Hierarchical Softmax); and a rich preprocessing pipeline (Dynamic Context Windows, Subsampling, Deleting Rare Words). Word2vec is *not* an algorithm. It's actually a software package with two distinct models, various ways to train these models, and a rich preprocessing pipeline.

What is word2vec? word2vec is not a single algorithm. It is a software package for representing words as vectors, containing: two distinct models (CBoW and Skip-Gram (SG)); various training methods (Negative Sampling (NS) and Hierarchical Softmax); and a rich preprocessing pipeline (Dynamic Context Windows, Subsampling, Deleting Rare Words). Now, we're going to focus on Skip-Grams with Negative Sampling, which is considered the state-of-the-art.

Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. So, how does SGNS work? Let’s say we have this sentence, (This example was shamelessly taken from Marco Baroni) “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. And we want to understand what a wampimuk is. “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS). Marco saw a furry little wampimuk hiding in the tree. Word-context pairs: (wampimuk, furry), (wampimuk, little), (wampimuk, hiding), (wampimuk, in), … These pairs form the dataset D. What SGNS does is take the context words around “wampimuk”, and in this way constructs a dataset of word-context pairs, D. “word2vec Explained…” Goldberg & Levy, arXiv 2014
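
As a sketch of this step (my illustration, not the authors' code), word-context pairs can be extracted with a fixed symmetric window; the helper name and window size are assumptions.

```python
def word_context_pairs(tokens, window=2):
    """Collect (word, context) pairs from a symmetric window of fixed size."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "Marco saw a furry little wampimuk hiding in the tree".split()
D = word_context_pairs(sentence, window=2)
print([c for w, c in D if w == "wampimuk"])  # ['furry', 'little', 'hiding', 'in']
```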

Skip-Grams with Negative Sampling (SGNS). SGNS finds a vector w for each word w in our vocabulary V_W. Each such vector has d latent dimensions (e.g. d = 100). Effectively, it learns a matrix W of size |V_W| × d whose rows represent the words in V_W. Key point: it also learns a similar auxiliary matrix C of context vectors, of size |V_C| × d. In fact, each word has two embeddings. Now, using D, SGNS is going to learn a vector for each word in the vocabulary, a vector of, say, 100 latent dimensions. Effectively, it's learning a matrix W, where each row is a vector that represents a specific word. Now, a key point in understanding SGNS is that it also learns an auxiliary matrix C of context vectors. So in fact, each word has two different embeddings, which are not necessarily similar. For example: w(wampimuk) = (−3.1, 4.15, 9.2, −6.5, …) ≠ c(wampimuk) = (−5.6, 2.95, 1.4, −1.3, …). “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS). Maximize σ(w · c) for pairs where c was observed with w, e.g. (wampimuk, furry), (wampimuk, little), (wampimuk, hiding), (wampimuk, in). Minimize σ(w · c′) for pairs where c′ was hallucinated with w, e.g. (wampimuk, Australia), (wampimuk, cyber), (wampimuk, the), (wampimuk, 1985). SGNS maximizes the similarity between word and context vectors observed together, and minimizes the similarity between word and context vectors hallucinated together. “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS). SGNS samples k contexts c′ at random as negative examples. “Random” means the unigram distribution P(c) = #(c) / |D|. Spoiler: changing this distribution has a significant effect. Now, the way SGNS hallucinates is really important; it's actually where it gets its name from. For each observed word-context pair, SGNS samples k contexts at random as negative examples. When we say random, we mean the unigram distribution. And spoiler alert: changing this distribution has quite a significant effect.
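
A rough sketch of this sampling step (an illustration under my own assumptions, not word2vec's actual implementation, which works on word indices and, as discussed later, on a smoothed distribution):

```python
import numpy as np
from collections import Counter

def negative_samples(pairs, k, rng):
    """For each observed (w, c) pair, draw k negative contexts c' from the
    unigram distribution P(c) = #(c) / |D|."""
    counts = Counter(c for _, c in pairs)
    vocab = list(counts)
    probs = np.array([counts[c] for c in vocab], dtype=float)
    probs /= probs.sum()
    negatives = []
    for w, _ in pairs:
        for c_neg in rng.choice(vocab, size=k, p=probs):
            negatives.append((w, c_neg))
    return negatives

rng = np.random.default_rng(0)
D = [("wampimuk", "furry"), ("wampimuk", "little"), ("wampimuk", "hiding")]
print(negative_samples(D, k=2, rng=rng))
```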

“Neural Word Embeddings as Implicit Matrix Factorization” (Levy & Goldberg, NIPS 2014). What is SGNS learning? They prove that, for a large enough d and enough iterations, SGNS factorizes the word-context PMI matrix shifted by a global constant, where k is the number of negative samples: Opt(w · c) = PMI(w, c) − log k. In matrix form, the word matrix W (|V_W| × d) times the transposed context matrix C (|V_C| × d) reconstructs M^PMI − log k. …with a little twist. Now, recall that k is the number of negative samples generated for each positive example (w, c) ∈ D. This is of course the optimal value, and not necessarily what we get in practice.

New Hyperparameters. Preprocessing: Dynamic Context Windows (dyn, win), Subsampling (sub), Deleting Rare Words (del). Postprocessing: Adding Context Vectors (w+c), Eigenvalue Weighting (eig), Vector Normalization (nrm). Association Metric: Shifted PMI (neg), Context Distribution Smoothing (cds). So let's make a list of all the new hyperparameters we found. The first group of hyperparameters is in the pre-processing pipeline; all of these were introduced as part of word2vec. The second group contains post-processing modifications, and the third group affects the association metric between words and contexts. This last group is really interesting, because to make sense of it, you have to understand that SGNS is actually factorizing the word-context PMI matrix.

Dynamic Context Windows. Marco saw a furry little wampimuk hiding in the tree. So, dynamic context windows: let's say we have our sentence from earlier, and we want to look at a context window of 4 words around wampimuk.

Dynamic Context Windows. Marco saw a furry little wampimuk hiding in the tree. Now, some of these context words are obviously more related to the meaning of wampimuk than others. The intuition behind dynamic context windows is that the closer the context is to the target…

Dynamic Context Windows. Marco saw a furry little wampimuk hiding in the tree. word2vec (probability that each context position is included in the training data): 1/4 2/4 3/4 4/4 [wampimuk] 4/4 3/4 2/4 1/4. GloVe (deterministic weights): 1/4 1/3 1/2 1/1 [wampimuk] 1/1 1/2 1/3 1/4. The Word-Space Model (Sahlgren, 2006). …the more relevant it is. word2vec does just that, by randomly sampling the size of the context window around each token. What you see here are the probabilities that each specific context word will be included in the training data. GloVe also does something similar, but in a deterministic manner and with a slightly different distribution. There are of course other ways to do this, and they were all, apparently, applied to traditional algorithms about a decade ago.
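
A small sketch (my own, with invented helper names) of the two weighting schemes for a maximum window of 4: word2vec's value is the probability of keeping a context at a given distance when the window size is sampled uniformly from 1 to 4, while GloVe's is a fixed harmonic weight.

```python
from fractions import Fraction

def word2vec_weight(distance, max_window=4):
    """Probability that a context at this distance survives word2vec's
    uniform sampling of the window size from 1..max_window."""
    return Fraction(max_window - distance + 1, max_window)

def glove_weight(distance):
    """GloVe's deterministic harmonic weighting of context words."""
    return Fraction(1, distance)

for d in range(1, 5):
    print(d, word2vec_weight(d), glove_weight(d))
# 1 1 1
# 2 3/4 1/2
# 3 1/2 1/3
# 4 1/4 1/4
```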

Subsampling. Randomly removes words that are more frequent than some threshold t, with probability p = 1 − sqrt(t / f), where f marks the word's corpus frequency; t = 10⁻⁵ in the experiments. In effect, this removes stop words. The removal of tokens is done before the corpus is processed into word-context pairs.
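
A minimal sketch of this preprocessing step under the formula above; the function name and seed handling are my own.

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop frequent tokens before any context windows are built.
    A token with corpus frequency f is removed with probability 1 - sqrt(t / f)."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_remove = max(0.0, 1.0 - math.sqrt(t / f))
        if rng.random() >= p_remove:
            kept.append(w)
    return kept
```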

Delete Rare Words. Ignore words that are rare in the training corpus: remove these tokens from the corpus before creating context windows. This narrows the distance between the remaining tokens, and so inserts new word-context pairs that did not exist in the original corpus (with the same window size).
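
A minimal sketch; the min_count threshold of 5 is an assumed default, not a value stated on the slide.

```python
from collections import Counter

def delete_rare(tokens, min_count=5):
    """Drop tokens rarer than min_count *before* windows are extracted, so
    surviving words that were separated only by rare words become neighbours."""
    counts = Counter(tokens)
    return [w for w in tokens if counts[w] >= min_count]
```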

Adding Context Vectors. SGNS creates word vectors w, but it also creates auxiliary context vectors c (and so do GloVe and SVD). Instead of just w, represent a word as w + c. Introduced by Pennington et al. (2014), but only applied to GloVe. So, each word has two representations, right? So what if, instead of using just the word vectors, we also used the information in the context vectors to represent our words? This trick was introduced as part of GloVe, but it was also only applied to GloVe, and not to the other methods.
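
Concretely, the w+c variant is just an element-wise sum of the two aligned embedding matrices; the tiny matrices below are made up for illustration.

```python
import numpy as np

def add_context_vectors(W, C):
    """Represent each word by the sum of its word and context embeddings.
    Both matrices are |V| x d with rows aligned to the same vocabulary."""
    return W + C

# Hypothetical embeddings for a 3-word vocabulary with d = 2.
W = np.array([[0.5, 1.0], [0.1, -0.2], [0.9, 0.4]])
C = np.array([[0.4, 0.8], [0.0, -0.1], [1.1, 0.3]])
vectors = add_context_vectors(W, C)
```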

Eigenvalue Weighting. The word vectors (W) and context vectors (C) derived using SVD (M = U · Σ · Vᵀ) are typically represented by W = U_d · Σ_d and C = V_d. Eigenvalue weighting adds a parameter p to control the eigenvalue matrix Σ: W = U_d · Σ_d^p (e.g. p = 0, 0.5, or 1).
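
A sketch with numpy's SVD; the random matrix is only a stand-in for a real PPMI matrix, and p = 0.5 is one of the symmetric variants mentioned later in the practical guide.

```python
import numpy as np

def svd_embeddings(M, d=2, p=0.5):
    """Truncated SVD of a (PPMI-like) matrix with weighted singular values:
    W = U_d * Sigma_d**p, C = V_d."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    W = U[:, :d] * (s[:d] ** p)   # p = 1 recovers the textbook factorization
    C = Vt[:d, :].T
    return W, C

M = np.random.default_rng(0).random((6, 8))   # stand-in for a PPMI matrix
W, C = svd_embeddings(M, d=2, p=0.5)
```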

Vector Normalization (nrm). Normalize vectors to unit length (L2 normalization). Different variants: row, column, or both.
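
A sketch of the variants; treating "both" as row normalization followed by column normalization is my reading, not something the slide specifies.

```python
import numpy as np

def normalize(W, how="row"):
    """L2-normalize an embedding matrix by rows, columns, or both (in that order)."""
    if how in ("row", "both"):
        W = W / np.linalg.norm(W, axis=1, keepdims=True)
    if how in ("col", "both"):
        W = W / np.linalg.norm(W, axis=0, keepdims=True)
    return W
```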

k in SGNS learning. k is the number of negative samples. Recall: SGNS gives w · c = PMI(w, c) − log k, so k also determines the shift of the PMI matrix. …with a little twist. Now, recall that k is the number of negative samples generated for each positive example (w, c) ∈ D. This is of course the optimal value, and not necessarily what we get in practice.

Context Distribution Smoothing. The original calculation of PMI: PMI(w, c) = log [ P(w, c) / (P(w) · P(c)) ]. If c is a rare word, PMI can still be high. How to solve this? Smooth P(c) with an exponent α (= 0.75): instead of P(c) = #(c) / Σ_{c′ ∈ V_C} #(c′), use P_α(c) = #(c)^α / Σ_{c′ ∈ V_C} #(c′)^α.

Context Distribution Smoothing. We can adapt context distribution smoothing to PMI! Replace P(c) with P_0.75(c): PMI_0.75(w, c) = log [ P(w, c) / (P(w) · P_0.75(c)) ]. This consistently improves PMI on every task. Always use context distribution smoothing! Now, what's neat about context distribution smoothing is that we can adapt it to PMI. This is done by simply replacing the probability of c with the smoothed probability, just like this. And this tweak consistently improves performance on every task. So if there's one technical detail you should remember from this talk, it's this: always use context distribution smoothing!
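
A sketch of the smoothed PMI computation from a list of (word, context) pairs; this illustrates the formula above and is not the authors' code.

```python
import numpy as np
from collections import Counter

def smoothed_pmi(pairs, alpha=0.75):
    """PMI where P(c) is replaced by P_alpha(c) = #(c)^alpha / sum_c' #(c')^alpha."""
    pair_counts = Counter(pairs)
    w_counts = Counter(w for w, _ in pairs)
    c_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    c_norm = sum(n ** alpha for n in c_counts.values())

    pmi = {}
    for (w, c), n in pair_counts.items():
        p_wc = n / total
        p_w = w_counts[w] / total
        p_c_alpha = (c_counts[c] ** alpha) / c_norm
        pmi[(w, c)] = float(np.log(p_wc / (p_w * p_c_alpha)))
    return pmi
```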

Experiments and Results: Experiment Setup; Results.

Experiment Setup. 9 hyperparameters, 6 of them new. 4 word representation algorithms: PPMI, SVD, SGNS, GloVe. 8 benchmarks: 6 word similarity tasks and 2 analogy tasks. 5,632 experiments in total. So overall, we ran a huge number of experiments. We had 9 hyperparameters, 6 of which were new, 4 word representation algorithms, and 8 different benchmarks. This ended up in over 5,000 experiments, so I won't be able to walk you through all the results, but I can give you a taste.

Experiment Setup: Word Similarity. WordSim353 (Finkelstein et al., 2002), partitioned into two datasets, WordSim Similarity and WordSim Relatedness (Zesch et al., 2008; Agirre et al., 2009); Bruni et al.'s (2012) MEN dataset; Radinsky et al.'s (2011) Mechanical Turk dataset; Luong et al.'s (2013) Rare Words dataset; Hill et al.'s (2014) SimLex-999 dataset. All these datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation with the human ratings.
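
A sketch of this evaluation protocol (ranking correlation via Spearman's ρ); the data structures and function name are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, dataset):
    """`vectors` maps word -> np.ndarray; `dataset` is [(w1, w2, human_score)].
    Returns the Spearman correlation between cosine similarities and human scores."""
    model_scores, human_scores = [], []
    for w1, w2, gold in dataset:
        if w1 in vectors and w2 in vectors:
            u, v = vectors[w1], vectors[w2]
            model_scores.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            human_scores.append(gold)
    rho, _pvalue = spearmanr(model_scores, human_scores)
    return rho
```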

Experiment Setup: Analogy. MSR's analogy dataset (Mikolov et al., 2013c) and Google's analogy dataset (Mikolov et al., 2013a). The two analogy datasets present questions of the form "a is to a* as b is to b*", where b* is hidden and must be guessed from the entire vocabulary.
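
One common way to answer such questions is the 3CosAdd rule (argmax of cos(x, a* − a + b) over the vocabulary, excluding the question words); the sketch below is illustrative, and a multiplicative variant (3CosMul) is also used in the literature.

```python
import numpy as np

def solve_analogy(W, vocab, a, a_star, b):
    """Guess b* for "a is to a* as b is to b*" with the 3CosAdd rule."""
    idx = {w: i for i, w in enumerate(vocab)}
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-length rows
    target = Wn[idx[a_star]] - Wn[idx[a]] + Wn[idx[b]]
    scores = Wn @ target
    for w in (a, a_star, b):                            # exclude the question words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]
```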

Results. Time is limited, so let's jump to the results.

Overall Results. Hyperparameters often have stronger effects than algorithms. Hyperparameters often have stronger effects than more data. Prior superiority claims were not accurate. If time is available, I will show some details. Now, this is only a little taste of our experiments, and you'll have to read the paper to get the entire picture, but I will try to point out three major observations. First, tuning hyperparameters often has a stronger positive effect than switching to a fancier algorithm. Second, tuning hyperparameters might be even more effective than adding more data. And third, previous claims that one algorithm is better than another are not 100% true.

Hyper-Parameter

Results: Hyper-parameters vs. Algorithms. No dominant method.

SVD

Practical Guide. Always use context distribution smoothing (cds = 0.75) to modify PMI. Do not use SVD "correctly" (eig = 1); instead, use one of the symmetric variants. SGNS is a robust baseline; moreover, it is the fastest method to train, and the cheapest (by far) in terms of disk space and memory consumption. With SGNS, prefer many negative samples. For both SGNS and GloVe, it is worthwhile to experiment with the w+c variant.

Thanks