Graph-based WSD (continued) DMLA 2008-12-10 小町守
Word sense disambiguation task of Senseval-3 English Lexical Sample
Predict the sense of “bank”:
– … the financial benefits of the bank (finance)'s employee package (cheap mortgages and pensions, etc.), bring this up to …
– In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden
– Possibly aligned to water a sort of bank (???) by a rushing river.
Training instances are annotated with their sense
Predict the sense of the target word in the test set
WSD with adjacency matrix
Assumption
– Similar examples tend to have the same label
– Can define (dis-)similarity between examples (prior knowledge, kNN)
Idea
– Perform clustering on an adjacency matrix
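A minimal sketch (mine, not from the slides) of how such an adjacency matrix could be built: instances are feature vectors, cosine similarity defines edge weights, and each node keeps only its k nearest neighbors. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def knn_adjacency(X, k=3):
    """Build a symmetric kNN adjacency matrix from instance feature vectors.

    X : (n_instances, n_features) array, one row per instance
    k : number of nearest neighbors kept for each node
    """
    # Cosine similarity between every pair of instances
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)              # no self-loops

    # Keep only the k strongest edges per node
    A = np.zeros_like(S)
    for i in range(S.shape[0]):
        nn = np.argsort(S[i])[-k:]        # indices of the k most similar instances
        A[i, nn] = S[i, nn]

    return np.maximum(A, A.T)             # symmetrize: undirected similarity graph
```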
Intuition behind using a similarity graph
– Known labels can be propagated to unlabeled data without any overlap
(Pictures taken from Zhu 2007)
Using unlabeled data via a similarity graph
Pros and cons
Pros
– Mathematically well-founded
– Can achieve high performance if the graph is well constructed
Cons
– Hard to determine an appropriate graph structure (and its edge weights)
– Relatively high computational cost
– Mostly transductive
Transductive learning: (unlabeled) test instances are given when building the classification model
Inductive learning: test instances are not known during training
Word sense disambiguation by kNN
– Seed instance = the instance whose sense is to be predicted
– System output = its k nearest neighbors (k = 3)
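A small sketch of the kNN prediction step described above, under the same cosine-similarity assumption as the earlier adjacency-matrix sketch; the function name and tie-breaking behavior are my own assumptions, not the slide's implementation.

```python
import numpy as np
from collections import Counter

def knn_predict_sense(seed_vec, labeled_X, labels, k=3):
    """Predict the sense of one seed instance by majority vote
    among its k nearest labeled neighbors (cosine similarity)."""
    sims = (labeled_X @ seed_vec) / (
        np.linalg.norm(labeled_X, axis=1) * np.linalg.norm(seed_vec) + 1e-12)
    nearest = np.argsort(sims)[-k:]          # k most similar labeled instances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority sense wins
```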
Simplified Espresso is HITS
Simplified Espresso = HITS on a bipartite graph whose adjacency matrix is A
Problem
– No matter which seed you start with, the same instance is always ranked topmost
– Semantic drift (also called topic drift in HITS)
The ranking vector i tends to the principal eigenvector of A^T A as the iteration proceeds, regardless of the seed instances!
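A toy illustration (my own, with made-up data) of why this happens: repeatedly applying A^T A and normalizing is power iteration, so any nonnegative seed vector converges to the principal eigenvector of A^T A.

```python
import numpy as np

def hits_instance_ranking(A, seed, iterations=50):
    """Simplified-Espresso / HITS style update: i <- normalize(A^T A i),
    where A is a (patterns x instances) bipartite adjacency matrix."""
    i = seed / np.linalg.norm(seed)
    for _ in range(iterations):
        i = A.T @ (A @ i)
        i /= np.linalg.norm(i)
    return i

# Two completely different seeds end up with essentially the same ranking vector
np.random.seed(0)
A = np.random.rand(5, 8)                       # toy graph: 5 patterns x 8 instances
r1 = hits_instance_ranking(A, np.eye(8)[0])    # seed = instance 0 only
r2 = hits_instance_ranking(A, np.eye(8)[7])    # seed = instance 7 only
print(np.allclose(r1, r2, atol=1e-6))          # True: the seed no longer matters
```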
Convergence process of Espresso
– The heuristics in Espresso help reduce semantic drift (however, early stopping is required for optimal performance)
(Plot: performance over bootstrapping iterations for Original Espresso, Simplified Espresso, and the most frequent sense baseline; Simplified Espresso undergoes semantic drift and ends up outputting the most frequent sense regardless of input)
Learning curve of Original Espresso: per-sense breakdown
– The number of most frequent sense predictions increases
– Recall for infrequent senses worsens even with original Espresso
(Plot: per-iteration breakdown of predictions into the most frequent sense vs. other senses)
Q. What caused drift in Espresso?
A. Espresso's resemblance to HITS
– HITS is an importance computation method (it gives a single ranking list for any seeds)
Why not use another type of link analysis measure, one that takes the seeds into account?
– A "relatedness" measure (it gives different rankings for different seeds)
The regularized Laplacian kernel
– A relatedness measure
– Takes higher-order relations into account
– Has only one parameter
Graph Laplacian: L = D − A
Regularized Laplacian matrix: R_β = Σ_{n≥0} β^n (−L)^n = (I + βL)^(−1)
– A: adjacency matrix of the graph
– D: (diagonal) degree matrix
– β: parameter
Each column of R_β gives the rankings relative to a node
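A short sketch of computing this kernel, assuming the combinatorial Laplacian L = D − A and the closed form (I + βL)^(−1); the slide's exact normalization may differ, and the function and variable names are assumptions.

```python
import numpy as np

def regularized_laplacian(A, beta=1e-2):
    """R_beta = (I + beta * L)^(-1), with L = D - A the graph Laplacian.

    Each column of R_beta scores all nodes by their relatedness to one node,
    so different seed nodes yield different rankings."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    return np.linalg.inv(np.eye(A.shape[0]) + beta * L)

# Usage sketch: rank instances by relatedness to a chosen seed node
# A = knn_adjacency(X, k=3)              # similarity graph from the earlier sketch
# R = regularized_laplacian(A, beta=1e-2)
# ranking = np.argsort(-R[:, seed_index])
```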
WSD on all nouns in Senseval-3

Algorithm | F measure
Most frequent sense (baseline) | 54.5
HyperLex | 64.6
PageRank | 64.6
Simplified Espresso | 44.1
Espresso (after convergence) | 46.9
Espresso (optimal stopping) | 66.5
Regularized Laplacian (β = 10^-2) | 67.1

– Outperforms other graph-based methods
– Espresso needs optimal stopping to achieve equivalent performance
More experiments on WSD datasets
– Niu et al. “Word Sense Disambiguation using LP-based Semi-Supervised Learning” (ACL 2005)
– Pham et al. “Word Sense Disambiguation with Semi-Supervised Learning” (AAAI 2005)
Dataset
– Pedersen (2000) line and interest data
– Line: six senses (e.g. cord/line, product, …)
– Interest: four senses (e.g. monetary interest, attention/concern, …)
Features
– Bag-of-words features
– Local collocation features
– Part-of-speech features
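A rough sketch of what these three feature types might look like for one instance; the exact feature templates of the original experiments are not given on the slide, so the window sizes and naming below are assumptions.

```python
def extract_features(tokens, pos_tags, target_index, window=3):
    """Toy WSD feature extractor for one instance.

    tokens, pos_tags : word and POS sequences of the sentence (same length)
    target_index     : position of the target word ("line" / "interest")
    """
    feats = {}
    # Bag-of-words features: every word in the sentence except the target itself
    for i, w in enumerate(tokens):
        if i != target_index:
            feats["bow=" + w.lower()] = 1
    # Local collocation features: positioned words in a small window around the target
    for offset in range(-window, window + 1):
        j = target_index + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["colloc[%d]=%s" % (offset, tokens[j].lower())] = 1
    # Part-of-speech features of the target and its immediate neighbors
    for offset in (-1, 0, 1):
        j = target_index + offset
        if 0 <= j < len(pos_tags):
            feats["pos[%d]=%s" % (offset, pos_tags[j])] = 1
    return feats
```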
Result

Dataset | MFS | Niu et al. | Pham et al. | BB | Proposed
interest | 54.6% | 79.8% | 76.4% | 75.5% | 75.6%
line | 53.5% | 59.4% | 68.0% | 62.7% | 61.3%
S3LS (1%) | 54.5% | 30.8% | - | - | 42.1%
S3LS (10%) | 54.5% | 56.5% | - | - | 56.0%
S3LS (25%) | 54.5% | 64.9% | - | - | 63.2%
S3LS (50%) | 54.5% | 68.6% | - | - | 66.3%
S3LS (75%) | 54.5% | 70.3% | - | - | 68.8%
S3LS (100%) | 54.5% | 71.8% | - | - | 69.8%
Discussion
– The proposed method (simple k-NN) achieved performance comparable to previous semi-supervised WSD systems
– Does additional data help?
“line” data with 90 labeled instances
“line” data with 150 labeled instances
“interest” data with 60 labeled instances
“interest” data with 300 labeled instances
Discussion (cont.)
– Additional data does not always help; sometimes it is worse than using none at all!
– Have not succeeded in using large-scale data on this task (BNC data could be used)
– All systems suffer from the data sparseness problem
– Robust feature selection (smoothing) is needed
Multiple clusters in similarity graphs
Generative model of co-occurrence
Construction of the similarity matrix
– Let G_z be a hidden topic graph; the edge between instance i_i and pattern p_j has weight P(z | i_i, p_j)
– The adjacency matrix A_z = A(G_z) holds P(z | i_i, p_j) for each such edge, and all other elements are set to 0
– A similarity matrix is computed as A_z^T A_z; its (i, j)-th element holds the co-occurrence value between instances i_i and i_j with respect to topic z
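A compact sketch of this construction, assuming A_z is stored as a (patterns × instances) matrix so that A_z^T A_z is instance-by-instance; this orientation, and treating P(z | i, p) as coming from a pLSI-style co-occurrence model, are assumptions on my part.

```python
import numpy as np

def topic_similarity(A_z):
    """Instance-instance similarity matrix for one hidden topic z.

    A_z : (n_patterns, n_instances) array whose nonzero entries hold
          P(z | instance, pattern) for the edges of the hidden topic graph G_z.
    """
    return A_z.T @ A_z    # (i, j): co-occurrence of instances i and j w.r.t. topic z
```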
Combination of von Neumann kernels
– The von Neumann kernel matrix is defined as K_β = Σ_{n≥1} β^(n−1) A^n = A (I − βA)^(−1)
– The final kernel matrix is computed by summing the kernel matrices of all hidden topics
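The kernel formula on the original slide is an image that did not survive extraction; assuming the standard von Neumann diffusion kernel given above, the per-topic combination could be sketched as below. Applying it to the per-topic similarity matrices is also an assumption.

```python
import numpy as np

def von_neumann_kernel(S, beta):
    """K_beta = sum_{n>=1} beta^(n-1) S^n = S (I - beta S)^(-1).

    Converges only when beta < 1 / (largest eigenvalue of S); this standard
    closed form is assumed here, not taken verbatim from the slide."""
    n = S.shape[0]
    return S @ np.linalg.inv(np.eye(n) - beta * S)

def combined_kernel(topic_similarity_matrices, beta):
    """Final kernel matrix: sum of the von Neumann kernels over all hidden topics."""
    return sum(von_neumann_kernel(S_z, beta) for S_z in topic_similarity_matrices)
```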
Result

Dataset | MFS | Niu et al. | K-NN | pLSI
S3LS | 54.5% | 71.8% | 69.8% | 51.7%
Discussion
– Poor result with the proposed method
– Likely to be caused by a mis-implementation or a bug
– The number of clusters (hidden variable z) does not seem to strongly affect the performance (tested |z| = 5, 20; got a 3-point improvement when increasing |z| to 20, but still below the most frequent sense baseline)