1
RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS. Heng Ji (jih@rpi.edu), Oct 13, 2015. Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg and Ido Dagan
2
Outline: Task Definition; Symbolic Semantics (Basic Features, World Knowledge, Learning Models); Distributional Semantics
3
Relation Extraction: Task. A relation is a semantic relationship between two entities.
ACE relation type | example
Agent-Artifact | Rubin Military Design, the makers of the Kursk
Discourse | each of whom
Employment/Membership | Mr. Smith, a senior programmer at Microsoft
Place-Affiliation | Salzburg Red Cross officials
Person-Social | relatives of the dead
Physical | a town some 50 miles south of Salzburg
Other-Affiliation | Republican senators
4
A Simple Baseline with K-Nearest-Neighbor (KNN) [figure: a test sample classified by its K=3 nearest training samples]
5
Relation Extraction with KNN. Distance between a test sample and a training sample:
1. If the heads of the mentions don't match: +8
2. If the entity types of the heads of the mentions don't match: +20
3. If the intervening words don't match: +10
[figure: test sample "the president of the United States" compared against Employment and Physical training samples ("the previous president of the United States", "the secretary of NIST", "US forces in Bahrain", "Connecticut's governor", "his ranch in Texas") with the resulting distances]
6
Typical Relation Extraction Features
Lexical: heads of the mentions and their context words, POS tags
Entity: entity and mention type of the heads of the mentions; entity positional structure; entity context
Syntactic: chunking (premodifier, possessive, preposition, formulaic); the sequence of the heads of the constituents/chunks between the two mentions; the syntactic relation path between the two mentions; dependent words of the mentions
Semantic: gazetteers; synonyms in WordNet; name gazetteers; personal relative trigger word list; Wikipedia (if the head extent of a mention is found, via simple string matching, in the predicted Wikipedia article of another mention)
References: Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010, 2011
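The feature families above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the helper name `extract_features` and the span encoding are invented, and a real system would add POS tags, chunk paths, WordNet synonyms, and gazetteer lookups.

```python
# Sketch of a few typical relation-extraction features for one mention pair.
def extract_features(tokens, m1, m2):
    """m1, m2: (start, end, entity_type) mention spans over `tokens`."""
    feats = {}
    h1, h2 = tokens[m1[1] - 1], tokens[m2[1] - 1]  # head = last token of span
    feats["head_pair"] = f"{h1}|{h2}"              # lexical: heads of the mentions
    feats["etype_pair"] = f"{m1[2]}|{m2[2]}"       # entity types of the mentions
    between = tokens[m1[1]:m2[0]]                  # intervening words
    feats["between_bow"] = "_".join(between)
    feats["n_between"] = len(between)              # positional structure
    return feats

feats = extract_features(
    ["Mr.", "Smith", ",", "a", "senior", "programmer", "at", "Microsoft"],
    (0, 2, "PER"), (7, 8, "ORG"))
```

The example reuses the "Mr. Smith, a senior programmer at Microsoft" sentence from the Employment/Membership row above.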
7
Using Background Knowledge (Chan and Roth, 2010). Features employed are usually restricted to the various representations of the target sentences, yet humans rely on background knowledge to recognize relations. Overall aim of this work: propose methods of using knowledge or resources that exist beyond the sentence (Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference), either as additional features or under the Constraint Conditional Model (CCM) framework with Integer Linear Programming (ILP).
8
8 8 David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team Using Background Knowledge
12
David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Using Background Knowledge: excerpts from the Wikipedia article on David Cone:
"David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as 'Cone-Heads.' Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network."
"Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues)."
13
Using Background Knowledge. David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Fine-grained predictions:
Employment:Staff 0.20
Employment:Executive 0.15
Personal:Family 0.10
Personal:Business 0.10
Affiliation:Citizen 0.20
Affiliation:Based-in 0.25
14
Using Background Knowledge. David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Fine-grained and coarse-grained predictions:
Employment:Staff 0.20, Employment:Executive 0.15 -> Employment 0.35
Personal:Family 0.10, Personal:Business 0.10 -> Personal 0.40
Affiliation:Citizen 0.20, Affiliation:Based-in 0.25 -> Affiliation 0.25
17
Knowledge 1: Wikipedia, feature 1 (as additional feature). We use a Wikifier system (Ratinov et al., 2010) which performs context-sensitive mapping of mentions to Wikipedia pages. Introduce a new feature based on whether one mention appears in the predicted Wikipedia article of the other [definition shown as a formula in the original slide], then a second feature combining it with the coarse-grained entity types of m_i and m_j.
18
Knowledge 1: Wikipedia, feature 2 (as additional feature). Given m_i and m_j, we use a Parent-Child system (Do and Roth, 2010) to predict whether they stand in a parent-child relation; introduce a new feature combining this prediction with the coarse-grained entity types of m_i and m_j.
19
Knowledge 2: Word Class Information (as additional feature). Supervised systems face an issue of data sparseness (of lexical features). Use class information of words to support generalization, instantiated in our work as word clusters, automatically generated from unlabeled texts using the algorithm of Brown et al. (1992). [figure: binary Brown-cluster tree over words such as apple, pear, Apple, IBM, bought, run, of, in, with 0/1 branch labels]
22
Knowledge 2: Word Class Information. All lexical features consisting of single words are duplicated with their corresponding bit-string representations. [figure: the cluster tree with bit strings such as 00, 01, 10, 11 at the leaves]
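The duplication step can be sketched directly. This is an illustrative sketch, not the paper's code: the cluster bit strings below are invented, and real Brown paths are longer and come from the clustering output.

```python
# Sketch: duplicating single-word lexical features with Brown-cluster bit strings.
brown_path = {"apple": "00", "pear": "00", "Apple": "01", "IBM": "01",
              "bought": "10", "run": "10", "of": "11", "in": "11"}

def add_cluster_features(feats):
    """For every feature whose value is a single word with a known cluster,
    add a duplicate feature keyed on the cluster bit string."""
    extra = {}
    for name, word in feats.items():
        if word in brown_path:
            extra[name + "_cluster"] = brown_path[word]
    feats.update(extra)
    return feats

f = add_cluster_features({"head_m1": "IBM", "between_w": "bought"})
```

A feature fired on the rare word "pear" now also fires on cluster "00", shared with "apple", which is how the clusters fight sparseness.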
24
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008). Score an assignment y with a weight vector for "local" models (a collection of classifiers), minus a penalty for violating each constraint, scaled by how far y is from a "legal" assignment: argmax_y w . f(x, y) - sum_k rho_k d(y, 1_{C_k}).
25
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008). Knowledge sources plugged into the framework: Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference.
26
Constraint Conditional Models (CCMs). David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Fine-grained and coarse-grained predictions:
Employment:Staff 0.20, Employment:Executive 0.15 -> Employment 0.35
Personal:Family 0.10, Personal:Business 0.10 -> Personal 0.40
Affiliation:Citizen 0.20, Affiliation:Based-in 0.25 -> Affiliation 0.25
27
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008). Key steps: 1) write down a linear objective function; 2) write down constraints as linear inequalities; 3) solve using integer linear programming (ILP) packages.
28
Knowledge 3: Relations between our target relations. [figure: relation hierarchy with coarse-grained labels such as "personal" and "employment" and fine-grained children "family", "biz" and "executive", "staff"]
29
Knowledge 3: Hierarchy of Relations. A coarse-grained classifier predicts labels such as "personal" and "employment"; a fine-grained classifier predicts their children ("family", "biz"; "executive", "staff").
37
Knowledge 3: Hierarchy of Relations. Write down a linear objective function over the coarse-grained and fine-grained prediction probabilities, with one indicator variable per candidate coarse-grained and fine-grained label; setting an indicator variable to 1 corresponds to a relation assignment.
38
Knowledge 3: Hierarchy of Relations. Write down constraints: if a relation R is assigned a coarse-grained label rc, then we must also assign to R a fine-grained label rf which is a child of rc. Capturing the inverse relationship: if we assign rf to R, then we must also assign to R the parent of rf, the corresponding coarse-grained label.
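For a single relation the constrained inference can be sketched by brute-force enumeration instead of an ILP solver (the ILP matters when many relations and constraints interact). This is an illustrative sketch, not the paper's system; the probabilities are taken from the David Cone example above, and the child-of test via the "Coarse:Fine" naming is an assumption of this sketch.

```python
# Sketch of CCM inference for one relation: enumerate joint (coarse, fine)
# assignments, score with the linear objective (sum of classifier
# probabilities), and enforce the hierarchy constraint.
coarse_p = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}
fine_p = {"Employment:Staff": 0.20, "Employment:Executive": 0.15,
          "Personal:Family": 0.10, "Personal:Business": 0.10,
          "Affiliation:Citizen": 0.20, "Affiliation:Based-in": 0.25}

def ccm_infer(coarse_p, fine_p):
    best, best_score = None, float("-inf")
    for rc in coarse_p:
        for rf in fine_p:
            if not rf.startswith(rc + ":"):    # hierarchy constraint
                continue
            score = coarse_p[rc] + fine_p[rf]  # linear objective
            if score > best_score:
                best, best_score = (rc, rf), score
    return best, best_score

assignment, score = ccm_infer(coarse_p, fine_p)
```

Note how the joint decision differs from greedy decoding: the coarse classifier alone prefers Personal (0.40), but the best consistent pair is Employment with Employment:Staff (0.35 + 0.20 = 0.55).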
40
Knowledge 4: Entity Type Constraints (Roth and Yih, 2004, 2007). Entity types (per, org, gpe) are useful for constraining the possible labels that a relation R over (m_i, m_j) can assume: Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in.
41
We gather information on entity type constraints from the ACE-2004 documentation and impose them on the coarse-grained relations. By improving the coarse-grained predictions and combining them with the hierarchical constraints defined earlier, the improvements propagate to the fine-grained predictions.
43
Knowledge 5: Coreference. If m_i and m_j corefer, no relation should hold between them (the null label). In this work we assume that we are given the coreference information, which is available from the ACE annotation.
44
Experiment Results. F1 improvement from using each knowledge source:
system | All nwire | 10% of nwire
BasicRE | 50.5% | 31.0%
45
Most Successful Learning Methods: Kernel-based.
Consider different levels of syntactic information: deep processing of text produces structural but less reliable results; simple surface information is less structural but more reliable.
Generalization of feature-based solutions: a kernel (kernel function) defines a similarity metric Ψ(x, y) on objects; no need for enumeration of features; efficient extension of normal features into high-order spaces; possible to solve linearly non-separable problems in a higher-order space.
Nice combination properties: closed under linear combination, under polynomial extension, and under direct sum/product on different domains.
References: Zelenko et al., 2002, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Che et al., 2005; Zhang et al., 2006; Qian et al., 2007; Zhou et al., 2007; Khayyamian et al., 2009; Reichartz et al., 2009
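The closure properties can be made concrete with a toy kernel. This is an illustrative sketch under invented names (`token_kernel`, `combine`, `product`), not any kernel from the cited papers; a real relation kernel would compare parse or dependency structures, not bags of tokens.

```python
# Sketch: kernels as similarity functions, and their closure under
# linear combination and product.
def token_kernel(x, y):
    """Toy kernel: number of tokens shared by two sequences."""
    return len(set(x) & set(y))

def combine(k1, k2, a=1.0, b=1.0):
    """A linear combination of kernels (a, b >= 0) is again a kernel."""
    return lambda x, y: a * k1(x, y) + b * k2(x, y)

def product(k1, k2):
    """A product of kernels is again a kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

k = combine(token_kernel, product(token_kernel, token_kernel))
s = k(["president", "of", "US"], ["president", "of", "France"])
```

Composite kernels for relation extraction are built exactly this way: sum or multiply kernels defined on different views (arguments, dependency paths) of the same mention pair.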
46
, where 1) Argument 2) Local dependency, where Kernel Examples for Relation Extraction K T is a token kernel defined as: (Zhao and Grishman, 2005) 3) Path, where Composite Kernels:
47
Bootstrapping for Relation Extraction. Loop: initial seed tuples -> occurrences of seed tuples -> generate extraction patterns -> generate new seed tuples -> augment table. Occurrences of seed tuples:
"Computer servers at Microsoft's headquarters in Redmond…"
"In mid-afternoon trading, shares of Redmond-based Microsoft fell…"
"The Armonk-based IBM introduced a new line…"
"The combined company will operate from Boeing's headquarters in Seattle."
"Intel, Santa Clara, cut prices of its Pentium processor."
48
Bootstrapping for Relation Extraction (cont'd). Learned patterns, generalized from the occurrences above: "X's headquarters in Y", "Y-based X", "X, Y,".
49
Bootstrapping for Relation Extraction (cont'd). Generate new seed tuples by matching the learned patterns against the corpus; augment the table and start a new iteration.
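One iteration of this loop can be sketched with string patterns. This is a minimal illustration in the spirit of DIPRE/Snowball-style bootstrapping, not a specific system: the helper names and the "capitalized word" heuristic for argument boundaries are assumptions of the sketch.

```python
import re

seeds = {("Microsoft", "Redmond")}
corpus = [
    "Computer servers at Microsoft's headquarters in Redmond crashed.",
    "The combined company will operate from Boeing's headquarters in Seattle.",
]

def middle_pattern(sent, arg1, arg2):
    """Generalize a seed occurrence to the text between its two arguments."""
    i, j = sent.find(arg1), sent.find(arg2)
    if i < 0 or j < 0 or i >= j:
        return None
    return sent[i + len(arg1):j]

def match_new_tuples(pattern, corpus):
    """Propose new (X, Y) tuples where capitalized words flank the pattern."""
    rx = re.compile(r"([A-Z]\w+)" + re.escape(pattern) + r"([A-Z]\w+)")
    return {m.groups() for sent in corpus for m in rx.finditer(sent)}

pat = middle_pattern(corpus[0], "Microsoft", "Redmond")  # "'s headquarters in "
new = match_new_tuples(pat, corpus)
```

The pattern learned from the (Microsoft, Redmond) seed fires on the Boeing sentence and proposes (Boeing, Seattle), which would be added to the table for the next iteration.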
50
Outline: Task Definition; Symbolic Semantics (Basic Features, World Knowledge, Learning Models); Distributional Semantics
51
Word Similarity & Relatedness How similar is pizza to pasta? How related is pizza to Italy? Representing words as vectors allows easy computation of similarity 51
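The "easy computation" is typically cosine similarity between embeddings. A minimal sketch, with toy 3-dimensional vectors invented for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented vectors: pizza and pasta point in similar "food" directions,
# italy points elsewhere.
pizza, pasta, italy = [1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.2, 0.3, 1.0]
sim_food = cosine(pizza, pasta)
sim_geo = cosine(pizza, italy)
```

With these toy vectors, pizza is more similar to pasta than to Italy, mirroring the similarity/relatedness distinction above.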
52
Approaches for Representing Words. Distributional semantics (count): used since the 90's; sparse word-context PMI/PPMI matrix; decomposed with SVD. Word embeddings (predict): inspired by deep learning; word2vec (Mikolov et al., 2013); GloVe (Pennington et al., 2014). Underlying theory: the Distributional Hypothesis (Harris, '54; Firth, '57): "Similar words occur in similar contexts."
53
Approaches for Representing Words Both approaches: Rely on the same linguistic theory Use the same data Are mathematically related “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014) How come word embeddings are so much better? “Don’t Count, Predict!” (Baroni et al., ACL 2014) More than meets the eye… 53
54
What's really improving performance? The contributions of word embeddings. Novel algorithms (objective + training method): skip-grams + negative sampling; CBOW + hierarchical softmax; noise contrastive estimation; GloVe. New hyperparameters (preprocessing, smoothing, etc.): subsampling; dynamic context windows; context distribution smoothing; adding context vectors.
59
Our Contributions 1)Identifying the existence of new hyperparameters Not always mentioned in papers 2)Adapting the hyperparameters across algorithms Must understand the mathematical relation between algorithms 3)Comparing algorithms across all hyperparameter settings Over 5,000 experiments 59
60
Background 60
62
What is word2vec ? How is it related to PMI? 62
64
What is word2vec? word2vec is not a single algorithm. It is a software package for representing words as vectors, containing: two distinct models (CBoW and Skip-Gram (SG)); various training methods (Negative Sampling (NS) and Hierarchical Softmax); and a rich preprocessing pipeline (dynamic context windows, subsampling, deleting rare words).
65
Demo http://rare-technologies.com/word2vec-tutorial/#app 65
66
Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. “word2vec Explained…” Goldberg & Levy, arXiv 2014 66
68
Skip-Grams with Negative Sampling (SGNS). Marco saw a furry little wampimuk hiding in the tree. ("word2vec Explained…", Goldberg & Levy, arXiv 2014)
words | contexts
wampimuk | furry
wampimuk | little
wampimuk | hiding
wampimuk | in
…
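The (word, context) training pairs in the table come from sliding a window over the sentence. A minimal sketch with a fixed window of 2 (the helper name `pairs` is invented; word2vec itself uses a dynamic window, discussed later):

```python
# Extract (word, context) pairs with a symmetric window around each token.
def pairs(tokens, window=2):
    out = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                out.append((w, tokens[j]))
    return out

sent = "Marco saw a furry little wampimuk hiding in the tree".split()
wamp = [c for w, c in pairs(sent) if w == "wampimuk"]
```

For "wampimuk" this reproduces exactly the contexts listed in the table above.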
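The SGNS training objective sketched in these slides (the formulas were images in the original deck) is, per Goldberg & Levy's "word2vec Explained" note, the per-pair local objective:

```latex
\ell(w, c) \;=\; \log \sigma(\vec{w} \cdot \vec{c})
\;+\; k \cdot \mathbb{E}_{c_N \sim P_D}\!\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Each observed (word, context) pair is pushed toward high $\vec{w} \cdot \vec{c}$, while $k$ "negative" contexts sampled from the empirical context distribution $P_D$ are pushed toward low dot products.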
74
What is SGNS learning? 74
81
What is SGNS learning? SGNS is doing something very similar to the older approaches SGNS is factorizing the traditional word-context PMI matrix So does SVD! GloVe factorizes a similar word-context matrix 81
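The matrix SGNS implicitly factorizes can be computed explicitly from co-occurrence counts: per Levy & Goldberg (2014), the optimal value of w . c is PMI(w, c) - log k, where k is the number of negative samples. A sketch with toy counts (the counts and helper name are invented for illustration):

```python
import math
from collections import Counter

# Toy (word, context) co-occurrence counts.
pair_counts = Counter({("pizza", "eat"): 4, ("pizza", "italy"): 2,
                       ("pasta", "eat"): 3, ("pasta", "italy"): 1})

def shifted_pmi(pair_counts, k=5):
    """M[w, c] = PMI(w, c) - log k, the matrix SGNS implicitly factorizes."""
    total = sum(pair_counts.values())
    wc, cc = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        wc[w] += n
        cc[c] += n
    return {(w, c): math.log(n * total / (wc[w] * cc[c])) - math.log(k)
            for (w, c), n in pair_counts.items()}

M = shifted_pmi(pair_counts, k=1)   # k = 1 recovers plain PMI
```

SVD applied to (a positive-thresholded version of) this same matrix is the classic count-based pipeline, which is why the two families are so closely related.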
82
But embeddings are still better, right? Plenty of evidence that embeddings outperform traditional methods “Don’t Count, Predict!” (Baroni et al., ACL 2014) GloVe (Pennington et al., EMNLP 2014) How does this fit with our story? 82
84
The Big Impact of “Small” Hyperparameters word2vec & GloVe are more than just algorithms… Introduce new hyperparameters May seem minor, but make a big difference in practice 84
85
Identifying New Hyperparameters 85
86
New Hyperparameters. Preprocessing (word2vec): dynamic context windows; subsampling; deleting rare words. Postprocessing (GloVe): adding context vectors. Association metric (SGNS): shifted PMI; context distribution smoothing.
90
Dynamic Context Windows Marco saw a furry little wampimuk hiding in the tree. 90
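The weighting schemes behind dynamic context windows can be written out. Per Levy, Goldberg & Dagan's analysis, word2vec samples the actual window size uniformly from 1..L, so a context at distance d is included with probability (L - d + 1)/L, while GloVe down-weights by 1/d; the function names below are invented for this sketch.

```python
# Expected weight of a context token at distance d from the target word.
def word2vec_weight(d, L):
    """Dynamic window: uniform sample of window size 1..L includes
    distance d with probability (L - d + 1) / L."""
    return (L - d + 1) / L

def glove_weight(d):
    """GloVe's harmonic weighting of distant contexts."""
    return 1 / d

w2v = [word2vec_weight(d, 4) for d in range(1, 5)]
gv = [glove_weight(d) for d in range(1, 5)]
```

Both schemes give nearby contexts more influence; they just decay at different rates, which is one of the "small" hyperparameters that turns out to matter.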
93
Adding Context Vectors 93
95
Adapting Hyperparameters across Algorithms 95
96
Context Distribution Smoothing 96
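Context distribution smoothing (CDS), borrowed from word2vec's negative-sampling distribution, raises context counts to the 0.75 power before computing the context probability used in PMI. A sketch with invented counts and helper name:

```python
# CDS: smooth the context distribution with alpha = 0.75 before PMI.
def smoothed_context_prob(context_counts, c, alpha=0.75):
    num = context_counts[c] ** alpha
    den = sum(n ** alpha for n in context_counts.values())
    return num / den

counts = {"the": 1000, "wampimuk": 1}
plain = counts["wampimuk"] / sum(counts.values())
smooth = smoothed_context_prob(counts, "wampimuk")
```

Smoothing inflates the probability of rare contexts relative to raw counts, which shrinks their PMI values and damps PMI's well-known bias toward rare events.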
99
Comparing Algorithms 99
101
Controlled Experiments Prior art was unaware of these hyperparameters Essentially, comparing “apples to oranges” We allow every algorithm to use every hyperparameter* * If transferable 101
102
Systematic Experiments. 9 hyperparameters (6 new); 4 word representation algorithms: PPMI (sparse & explicit), SVD(PPMI), SGNS, GloVe; 8 benchmarks: 6 word similarity tasks, 2 analogy tasks; 5,632 experiments.
105
Hyperparameter Settings. Classic vanilla setting (commonly used for distributional baselines): vanilla preprocessing, postprocessing, and association metric (plain PMI/PPMI). Recommended word2vec setting (tuned for SGNS): preprocessing with dynamic context windows and subsampling; association metric: shifted PMI/PPMI with context distribution smoothing.
106
Experiments 106
107
Experiments: Prior Art; Experiments: "Apples to Apples"; Experiments: "Oranges to Oranges". [results shown as charts in the original slides]
108
Experiments: Hyperparameter Tuning. [results for different settings shown as charts in the original slides]
109
Overall Results Hyperparameters often have stronger effects than algorithms Hyperparameters often have stronger effects than more data Prior superiority claims were not accurate 109
110
Re-evaluating Prior Claims 110
112
Don't Count, Predict! (Baroni et al., 2014) claimed that word2vec is better than count-based methods. Hyperparameter settings account for most of the reported gaps; embeddings do not really outperform count-based methods* (*except for one task…).
113
GloVe (Pennington et al., 2014) claimed that GloVe is better than word2vec. Hyperparameter settings account for most of the reported gaps: adding context vectors was applied only to GloVe, and the preprocessing differed. We observed the opposite: SGNS outperformed GloVe on every task. Our largest corpus: 10 billion tokens; perhaps larger corpora behave differently?
115
Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014) claimed that PPMI vectors perform on par with SGNS on analogy tasks. This holds for semantic analogies but not for syntactic analogies (MSR dataset). Hyperparameter settings (a different context type for the PPMI vectors) account for most of the reported gaps; on syntactic analogies there is a real gap in favor of SGNS.
116
Conclusions 116
118
Conclusions: Distributional Similarity. The contributions of word embeddings: novel algorithms and new hyperparameters. What's really improving performance? Hyperparameters (mostly); still, the algorithms are an improvement, and SGNS is robust & efficient.
119
Conclusions: Methodology. Look for hyperparameters; adapt hyperparameters across different algorithms. For good results: tune hyperparameters. For good science: tune the baselines' hyperparameters too. Thank you :)