Word representations: A simple and general method for semi-supervised learning Joseph Turian with Lev Ratinov and Yoshua Bengio Goodies:
2 Sup model Sup data Supervised training
3 Sup model Sup data Supervised training Semi-sup training?
5 Sup model Sup data Supervised training Semi-sup training? More feats
6 Sup model Sup data More feats Sup model Sup data More feats sup task 1 sup task 2
7 Semi-sup model Unsup data Sup data Joint semi-sup
8 Semi-sup model Unsup model Unsup data Sup data unsup pretraining semi-sup fine-tuning
9 Unsup model Unsup data unsup training unsup feats
10 Semi-sup model Unsup data unsup training Sup training Sup data unsup feats
11 Unsup data unsup training unsup feats sup task 1 sup task 2 sup task 3
12 What unsupervised features are most useful in NLP?
13 Natural language processing Words, words, words
14 How do we handle words? Not very well
15 “One-hot” word representation |V| = |vocabulary|, e.g. 50K for PTB2 word -1, word 0, word +1 Pr dist over labels (3*|V|) x m 3*|V| m
16 One-hot word representation 85% of vocab words occur as only 10% of corpus tokens Bad estimate of Pr(label|rare word) word 0 |V| x m |V| m
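A minimal sketch of the one-hot window input described on the two slides above, assuming a toy four-word vocabulary (the vocabulary and the helper names are hypothetical, not from the talk). It only illustrates why the input has 3*|V| dimensions with exactly three active entries.

```python
def one_hot(index, size):
    """Return a vector of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def window_one_hot(words, vocab):
    """Concatenate the one-hot vectors of the words in a context window."""
    out = []
    for w in words:
        out.extend(one_hot(vocab[w], len(vocab)))
    return out

# Toy vocabulary (hypothetical); a real one would have tens of thousands of types.
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}

# Window of word -1, word 0, word +1.
x = window_one_hot(["the", "cat", "sat"], vocab)
```

With |V| = 50K the same window would have 150K input dimensions, and the weights attached to a rare word are updated only on the few tokens where that word occurs.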
17 Approach
18 Approach Manual feature engineering
20 Approach Induce word reprs over large corpus, unsupervised Use word reprs as word features for supervised task
21 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs
23 Distributional representations F W (size of vocab) C e.g. F w,v = Pr(v follows word w) or F w,v = Pr(v occurs in same doc as w)
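As a rough illustration of the first example on this slide, F w,v = Pr(v follows word w), the sketch below estimates the rows of F from bigram counts. The toy sentence and the function name are assumptions for illustration; the matrix is stored sparsely as nested dictionaries.

```python
from collections import Counter, defaultdict

def follow_probabilities(tokens):
    """Estimate F[w][v] = Pr(v follows w) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for w, v in zip(tokens, tokens[1:]):
        counts[w][v] += 1
    F = {}
    for w, row in counts.items():
        total = sum(row.values())
        F[w] = {v: c / total for v, c in row.items()}
    return F

# Toy corpus (hypothetical).
tokens = "the cat sat on the mat".split()
F = follow_probabilities(tokens)
```

Each row of F is the distributional representation of one word; on a real corpus the number of columns C is very large.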
24 Distributional representations F W (size of vocab) C d g(F) = f, e.g. g = LSI/LSA, LDA, PCA, ICA, rand trans C >> d
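One of the options listed for g is a random transformation. The sketch below is one possible version, assuming a dense W x C matrix given as a list of rows and a Gaussian projection matrix; the scaling by 1/sqrt(d), the seed, and the function name are illustrative choices, not details from the talk.

```python
import random

def random_projection(F, d, seed=0):
    """Map each C-dimensional row of F to d dimensions via a fixed random matrix."""
    rng = random.Random(seed)
    C = len(F[0])
    # C x d matrix of Gaussian entries, scaled by 1/sqrt(d) (an assumed convention).
    R = [[rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)] for _ in range(C)]
    return [[sum(row[c] * R[c][j] for c in range(C)) for j in range(d)]
            for row in F]

# Toy co-occurrence matrix: 3 words x 6 contexts (hypothetical values).
F = [
    [1.0, 0.0, 0.0, 2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0, 0.0, 0.0, 0.0],
]
f = random_projection(F, d=2)
```

Words with identical rows in F receive identical low-dimensional vectors, since the same matrix is applied to every row.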
25 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs
26 Class-based word repr |C| classes, hard clustering word 0 (|V|+|C|) x m |V|+|C| m
27 Class-based word repr Hard vs. soft clustering Hierarchical vs. flat clustering
28 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs –Brown (hard, hierarchical) clustering –HMM (soft, flat) clustering Distributed word reprs
30 Brown clustering Hard, hierarchical class-based LM Brown et al. (1992) Greedy technique for maximizing bigram mutual information Merge words by contextual similarity
31 Brown clustering (image from Terry Koo) cluster(chairman) = '0010' 2-prefix(cluster(chairman)) = '00'
32 Brown clustering Hard, hierarchical class-based LM 1000 classes Use prefixes = 4, 6, 10, 20
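Because each Brown cluster is a path in a binary merge tree, a prefix of the bit string names a coarser cluster. The sketch below extracts the prefixes listed on this slide. How paths shorter than the requested prefix length are handled is an assumption here (the whole path is returned); the function name is hypothetical.

```python
def brown_prefixes(bitstring, prefix_lengths=(4, 6, 10, 20)):
    """Return the cluster path truncated to each prefix length.

    Paths shorter than a prefix length are returned whole (assumed convention).
    """
    return {n: bitstring[:n] for n in prefix_lengths}

# Example path from the slides: cluster(chairman) = '0010'.
chairman = brown_prefixes("0010", prefix_lengths=(2,))

# A longer, made-up path with the prefix lengths used in the talk.
example = brown_prefixes("1010001101")
```

Short prefixes group many words together and long prefixes are more specific, so using several lengths at once gives features at multiple granularities.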
33 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs –Brown (hard, hierarchical) clustering –HMM (soft, flat) clustering Distributed word reprs
34 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs
35 Distributed word repr k- (low) dimensional, dense representation “word embedding” matrix E of size |V| x k word 0 k x m k m
36 Sequence labeling w/ embeddings word -1, word 0, word +1 (3*k) x m |V| x k, tied weights m “word embedding” matrix E of size |V| x k
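A minimal sketch of the tied-weight lookup on this slide: every position in the window indexes the same embedding matrix E, and the three k-dimensional rows are concatenated into a 3*k input. The toy embeddings (k = 2) and the function name are assumptions for illustration.

```python
def embed_window(words, E):
    """Look up each word in the shared embedding table E and concatenate."""
    x = []
    for w in words:
        x.extend(E[w])
    return x

# Toy embedding table with k = 2 (hypothetical values).
E = {
    "the": [0.1, -0.3],
    "cat": [0.7, 0.2],
    "sat": [-0.5, 0.4],
}

# Window of word -1, word 0, word +1 gives a dense vector of size 3 * k.
x = embed_window(["the", "cat", "sat"], E)
```

The input size depends on k rather than |V|, and a word's embedding is shared across all window positions.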
37 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs –Collobert + Weston (2008) –HLBL embeddings (Mnih + Hinton, 2007)
39 Collobert + Weston 2008 w1 w2 w3 w4 w5 50* w1 w2 w3 w4 w5 score > μ + score
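Reading the inequality on this slide as a margin criterion (the score of an observed window should exceed the score of a corrupted window by at least μ), the corresponding hinge loss can be sketched as below. The scoring network itself is omitted, the scores are passed in as plain numbers, and the default margin of 1.0 is an assumption.

```python
def ranking_hinge_loss(score_true, score_corrupt, margin=1.0):
    """Penalty is zero once score_true exceeds score_corrupt by `margin`."""
    return max(0.0, margin - score_true + score_corrupt)

# Criterion already satisfied: no loss.
satisfied = ranking_hinge_loss(score_true=2.0, score_corrupt=0.5)

# Scores tie: the full margin is paid.
violated = ranking_hinge_loss(score_true=0.5, score_corrupt=0.5)
```

Training with such a loss needs no normalization over the vocabulary, since only pairs of scores are compared.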
40 50-dim embeddings: Collobert + Weston (2008) t-SNE vis by van der Maaten + Hinton (2008)
41 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs –Collobert + Weston (2008) –HLBL embeddings (Mnih + Hinton, 2007)
42 Log bilinear Language Model (LBL) w1 w2 w3 w4 w5 Linear prediction of w5
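A rough sketch of the log-bilinear idea: the context embeddings are combined linearly into a predicted embedding, and each candidate next word is scored by its dot product with that prediction, followed by a softmax. The toy embeddings, identity position matrices, zero biases, and function names are all assumptions for illustration, and a two-word context is used instead of four to keep the example small.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lbl_next_word_probs(context, E, C, bias):
    """Pr(next word | context) under a log-bilinear model."""
    k = len(next(iter(E.values())))
    pred = [0.0] * k
    # Linear prediction: sum over positions of C_i times the word's embedding.
    for C_i, w in zip(C, context):
        pred = [p + q for p, q in zip(pred, matvec(C_i, E[w]))]
    scores = {w: sum(p * e for p, e in zip(pred, vec)) + bias[w]
              for w, vec in E.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    Z = sum(exps.values())
    return {w: e / Z for w, e in exps.items()}

# Toy model (hypothetical values): k = 2, two context positions.
E = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
C = [identity, identity]
bias = {w: 0.0 for w in E}

probs = lbl_next_word_probs(["the", "cat"], E, C, bias)
```

The sum over the whole vocabulary in the softmax is the expensive step that the hierarchical variant is meant to speed up.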
43 HLBL HLBL = hierarchical (fast) training of LBL Mnih + Hinton (2009)
44 Approach Induce word reprs over large corpus, unsupervised –Brown: 3 days –HLBL: 1 week, 100 epochs –C&W: 4 weeks, 50 epochs Use word reprs as word features for supervised task
45 Unsupervised corpus RCV1 newswire 40M tokens (vocab = all 270K types)
46 Supervised Tasks Chunking (CoNLL, 2000) –CRF (Sha + Pereira, 2003) Named entity recognition (NER) –Averaged perceptron (linear classifier) –Based upon Ratinov + Roth (2009)
47 Unsupervised word reprs as features Word = “the” Embedding = [-0.2, …, 1.6] Brown cluster = (cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)
48 Unsupervised word reprs as features Orig X = {pos-2=“DT”: 1, word-2=“the”: 1,...} X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1} X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6,...}
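The feature augmentation on this slide can be sketched with plain dictionaries. The key formats follow the slide loosely; the helper names and the two-dimensional toy embedding are assumptions (the talk uses 50 dimensions).

```python
def add_brown_features(features, position, prefixes):
    """Add one binary feature per Brown prefix, e.g. 'class-2-pre4=1010'."""
    out = dict(features)
    for n, bits in prefixes.items():
        out[f"class{position}-pre{n}={bits}"] = 1
    return out

def add_embedding_features(features, position, embedding):
    """Add one real-valued feature per embedding dimension, e.g. 'word-2-dim00'."""
    out = dict(features)
    for i, value in enumerate(embedding):
        out[f"word{position}-dim{i:02d}"] = value
    return out

# Original sparse features for the token two positions to the left.
x = {"pos-2=DT": 1, "word-2=the": 1}

x_brown = add_brown_features(x, -2, {4: "1010", 6: "101000"})
x_emb = add_embedding_features(x, -2, [-0.2, 1.6])  # toy 2-dim embedding
```

Brown features stay binary and sparse, while each embedding adds k real-valued features that are active for every token.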
49 Embeddings: Normalization E = σ * E / stddev(E)
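The scaling rule E = σ * E / stddev(E) can be sketched as follows, assuming stddev is taken over all entries of the matrix and computed as a population standard deviation (both are assumptions; the slide does not say). The toy matrix and function name are hypothetical.

```python
from statistics import pstdev

def scale_embeddings(E, sigma=0.1):
    """Rescale all entries so their overall standard deviation equals sigma."""
    flat = [x for row in E for x in row]
    sd = pstdev(flat)  # population stddev over every entry (assumed)
    return [[sigma * x / sd for x in row] for row in E]

# Toy embedding matrix: 2 words x 2 dimensions (hypothetical values).
E = [[-0.2, 1.6], [0.4, -1.0]]
E_scaled = scale_embeddings(E, sigma=0.1)
```

Rescaling keeps the real-valued embedding features in a range comparable to the binary features they are mixed with, with σ acting as a hyperparameter.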
50 Embeddings: Normalization (Chunking)
51 Embeddings: Normalization (NER)
52 Repr capacity (Chunking)
53 Repr capacity (NER)
54 Test results (Chunking)
55 Test results (NER)
56 MUC7 (OOD) results (NER)
57 Test results (NER)
58 Test results Chunking: C&W = Brown NER: C&W < Brown Why?
59 Word freq vs word error (Chunking)
60 Word freq vs word error (NER)
61 Summary Both Brown + word emb can increase acc of near-SOTA system Combining can improve accuracy further On rare words, Brown > word emb Scale parameter σ = 0.1 Goodies: Word features! Code!
62 Difficulties with word embeddings No stopping criterion during unsup training More active features (slower sup training) Hyperparameters –Learning rate for model –(optional) Learning rate for embeddings –Normalization constant vs. Brown clusters, few hyperparams
63 HMM approach Soft, flat class-based repr Multinomial distribution over hidden states = word representation 80 hidden states Huang and Yates (2009) No results with HMM approach yet