Download presentation
Presentation is loading. Please wait.
Published byCharlotte Houston Modified over 9 years ago
1
CS626-460: Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 12: WSD approaches (contd.)
2
2 The jury(2 ) praised the administration(3) and operation (8) of Atlanta Police Department(1) Step 1: Make a lattice of the nouns in the context, their senses and hypernyms. Step 2: Compute the conceptual density of resultant concepts (sub-hierarchies). Step 3: The concept with highest CD is selected. Step 4: Select the senses below the selected concept as the correct sense for the respective words. operation division administrative_unit jury committee police department local department government department department juryadministration body CD = 0.256 CD = 0.062 2 CFILT - IITB C ONCEPTUAL D ENSITY (E XAMPLE )
3
3 C ONCEPTUAL D ENSITY F ORMULA 3 CFILT - IITB Wish list The conceptual distance between two words should be proportional to the length of the path between the two words in the hierarchical tree (WordNet). The conceptual distance between two words should be proportional to the depth of the concepts in the hierarchy. where, c = concept nhyp = mean number of hyponyms h = height of the sub-hierarchy m = no. of senses of the word and senses of context words contained in the sub- hierarchy CD= Conceptual Density entity financelocation moneybank-1bank-2 d (depth) h (height) of the concept “location” Sub-Tree Conceptual Distance(S1, S2) = Length of the path between(S1, S2) Height (lowest common ancestor (L1, L2))
4
4 WSD U SING R ANDOM W ALK A LGORITHM S3 Bell ring church Sunday S2 S1 S3 S2 S1 S3 S2 S1 a c b e f g h i j k l 0.46 a 0.49 0.92 0.97 0.35 0.56 0.42 0.63 0.58 0.67 Step 1: Add a vertex for each possible sense of each word in the text. Step 2: Add weighted edges using definition based semantic similarity (Lesk’s method). Step 3: Apply graph based ranking algorithm to find score of each vertex (i.e. for each word sense). Step 4: Select the vertex (sense) which has the highest score. 4 CFILT - IITB
5
Page and Brin, 1998 “We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) +... + PR(Tn)/C(Tn)) Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.” 12/05/08 CFILT - IITB
6
Page and Brin 1998 (contd.) “PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page.” 12/05/08 CFILT - IITB
7
KB APPROACHES – COMPARISONS 7 CFILT - IITB AlgorithmAccuracy WSD using Selectional Restrictions44% on Brown Corpus Lesk’s algorithm50-60% on short samples of “Pride and Prejudice” and some “news stories”. WSD using conceptual density54% on Brown corpus. WSD using Random Walk Algorithms54% accuracy on SEMCOR corpus which has a baseline accuracy of 37%. Walker’s algorithm50% when tested on 10 highly polysemous English words.
8
KB APPROACHES –Observations 8 CFILT - IITB Drawbacks of WSD using Selectional Restrictions Needs exhaustive Knowledge Base. Drawbacks of Overlap based approaches Dictionary definitions are generally very small. Dictionary entries rarely take into account the distributional constraints of different word senses (e.g. selectional preferences, kinds of prepositions, etc. c igarette and ash never co-occur in a dictionary). Suffer from the problem of sparse match. Proper nouns are not present in a MRD. Hence these approaches fail to capture the strong clues provided by proper nouns. E.g. “Sachin Tendulkar” will be a strong indicator of the category “sports”. Sachin Tendulkar plays cricket.
9
9 Topics to be covered Knowledge Based Approaches WSD using Selectional Preferences (or restrictions) Overlap Based Approaches Machine Learning Based Approaches Supervised Approaches Semi-supervised Algorithms Unsupervised Algorithms Hybrid Approaches 9 CFILT - IITB
10
10 N AÏVE B AYES 10 CFILT - IITB sˆ= argmax s ε senses Pr(s| V w ) ‘ V w ’ is a feature vector consisting of: POS of w Semantic & Syntactic features of w Collocations (set of words around it) typically consists of next word(+1), next-to-next word(+2), -2, -1 & their POS's Co-occurrence stats (number of times words occurs in bag of words around it) Applying Bayes rule and naive independence assumption sˆ= argmax s ε senses Pr(s).Π i=1 n Pr(V w i |s)
11
11 E STIMATING P ARAMETERS Parameters in the probabilistic WSD are: Pr(s) Pr(V w i |s) Senses are marked with respect to sense repository (WORDNET) Pr(s)= count(s) / count(all senses) Pr(V w i |s)= Pr(V w i,s)/Pr(s) = c(V w i,s,w)/c(s,w) CFILT - IITB
12
12 D ECISION L IST A LGORITHM Collect a large set of collocations for the ambiguous word. Calculate word-sense probability distributions for all such collocations. Calculate the log-likelihood ratio Higher log-likelihood = more predictive evidence Collocations are ordered in a decision list, with most predictive collocations ranked highest. 12 CFILT - IITB Pr(Sense-A| Collocation i ) Pr(Sense-B| Collocation i ) Log( ) 12 CFILT - IITB Assuming there are only two senses for the word. Of course, this can easily be extended to ‘k’ senses.
13
13 Training DataResultant Decision List D ECISION L IST A LGORITHM (C ONTD.) Classification of a test sentence is based on the highest ranking collocation found in the test sentence. E.g. … plucking flowers affects plant growth … 13 CFILT - IITB
14
EXEMPLAR BASED WSD (K-NN) An exemplar based classifier is constructed for each word to be disambiguated. Step1: From each sense marked sentence containing the ambiguous word, a training example is constructed using: POS of w as well as POS of neighboring words. Local collocations Co-occurrence vector Morphological features Subject-verb syntactic dependencies Step2: Given a test sentence containing the ambiguous word, a test example is similarly constructed. Step3: The test example is then compared to all training examples and the k-closest training examples are selected. Step4: The sense which is most prevalent amongst these “k” examples is then selected as the correct sense. 14 CFILT - IITB
15
WSD USING SVM S SVM is a binary classifier which finds a hyperplane with the largest margin that separates training examples into 2 classes. As SVMs are binary classifiers, a separate classifier is built for each sense of the word Training Phase: Using a tagged corpus, f or every sense of the word a SVM is trained using the following features: POS of w as well as POS of neighboring words. Local collocations Co-occurrence vector Features based on syntactic relations (e.g. headword, POS of headword, voice of head word etc.) Testing Phase: Given a test sentence, a test example is constructed using the above features and fed as input to each binary classifier. The correct sense is selected based on the label returned by each classifier. 15 CFILT - IITB
16
SUPERVISED APPROACHES – COMPARISONS 16 CFILT - IITB ApproachAverage Precision Average RecallCorpusAverage Baseline Accuracy Naïve Bayes64.13%Not reportedSenseval3 – All Words Task 60.90% Decision Lists96%Not applicableTested on a set of 12 highly polysemous English words 63.9% Exemplar Based disambiguation (k- NN) 68.6%Not reportedWSJ6 containing 191 content words 63.7% SVM72.4% Senseval 3 – Lexical sample task (Used for disambiguation of 57 words) 55.2%
17
SUPERVISED APPROACHES – Observations 17 CFILT - IITB General Comments Use corpus evidence instead of relying of dictionary defined senses. Can capture important clues provided by proper nouns because proper nouns do appear in a corpus. Naïve Bayes Suffers from data sparseness. Since the scores are a product of probabilities, some weak features might pull down the overall score for a sense. A large number of parameters need to be trained. Decision Lists A word-specific classifier. A separate classifier needs to be trained for each word. Uses the single most predictive feature which eliminates the drawback of Naïve Bayes.
18
SUPERVISED APPROACHES – Observations 18 CFILT - IITB Exemplar Based K-NN A word-specific classifier. Will not work for unknown words which do not appear in the corpus. Uses a diverse set of features (including morphological and noun- subject-verb pairs) SVM A word-sense specific classifier. Gives the highest improvement over the baseline accuracy. Uses a diverse set of features.
19
Some Indian thought (7 century A.D.) 12/05/08 CFILT - IITB
20
12/05/08 CFILT - IITB
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.