
1 CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 8 – Closure on WSD; IWSD)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 20th Jan, 2011

2 WordNet Sub-Graph
[Figure: WordNet sub-graph around the synset {house, home}, glossed "a place that serves as the living quarters of one or more families". Hypernymy links it upward to {dwelling, abode}; hyponymy links it downward to hermitage and cottage; meronymy links it to its parts: study, bedroom, kitchen, guestroom, veranda, backyard.]

3 WSD USING CONCEPTUAL DENSITY (Agirre and Rigau, 1996)
Select a sense based on the relatedness of that word-sense to the context. Relatedness is measured in terms of conceptual distance, i.e. how close the concept represented by the word is to the concepts represented by its context words. This approach uses a structured hierarchical semantic net (WordNet) for finding the conceptual distance. The smaller the conceptual distance, the higher the conceptual density: if all words in the context are strong indicators of a particular concept, then that concept will have a high density.

4 CONCEPTUAL DENSITY FORMULA
Wish list:
The conceptual distance between two words should be proportional to the length of the path between the two words in the hierarchical tree (WordNet).
The conceptual distance between two words should be proportional to the depth of the concepts in the hierarchy.
[Figure: a WordNet sub-tree rooted at "entity", with the concepts "location" and "finance" below it and the senses bank-1, bank-2 and money at the leaves; d marks the depth of a concept and h the height of the sub-hierarchy rooted at the concept "location".]
CD(c, m) = Σ_{i=0..m-1} nhyp^(i^0.20) / Σ_{j=0..h-1} nhyp^j
where c = concept, nhyp = mean number of hyponyms, h = height of the sub-hierarchy, m = number of senses of the word and senses of context words contained in the sub-hierarchy, CD = Conceptual Density, and 0.20 is the smoothing factor.
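A minimal sketch of how this formula can be evaluated, assuming nhyp, h and m have already been computed for a candidate sub-hierarchy (the function name, the smoothing default and the toy numbers are illustrative choices, not values from the lecture):

```python
def conceptual_density(nhyp, h, m, smoothing=0.20):
    """Conceptual Density of a sub-hierarchy rooted at concept c.

    nhyp      : mean number of hyponyms per node in the sub-hierarchy
    h         : height of the sub-hierarchy
    m         : number of senses of the word and of the context words
                that fall inside the sub-hierarchy
    smoothing : the 0.20 smoothing factor from Agirre and Rigau (1996)
    """
    # Senses "explained" by the sub-hierarchy (numerator of the formula).
    numerator = sum(nhyp ** (i ** smoothing) for i in range(m))
    # Approximate total number of concepts in the sub-hierarchy (denominator).
    denominator = sum(nhyp ** j for j in range(h))
    return numerator / denominator

# Toy numbers: branching factor 3, height 4, two relevant senses inside.
print(conceptual_density(nhyp=3, h=4, m=2))
```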

5 CONCEPTUAL DENSITY (cntd)
The dots in the figure represent the senses of the word to be disambiguated and the senses of the words in the context. The CD formula yields the highest density for the sub-hierarchy containing the most senses. The sense of W contained in the sub-hierarchy with the highest CD is chosen.

6 CONCEPTUAL DENSITY (EXAMPLE)
[Figure: lattice of the context-word senses and their hypernyms (administrative_unit, body, division, committee, department, government department, local department, police department, jury, operation, administration), with conceptual densities CD = 0.062 and CD = 0.256 computed for the competing sub-hierarchies.]
Example sentence: The jury(2) praised the administration(3) and operation(8) of Atlanta Police Department(1).
Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.
Step 2: Compute the conceptual density of the resultant concepts (sub-hierarchies).
Step 3: Select the concept with the highest CD.
Step 4: Select the senses below the selected concept as the correct senses for the respective words.
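A rough, hedged sketch of these four steps on top of NLTK's WordNet interface; the way nhyp and the sub-hierarchy height are estimated here is a simplification of Agirre and Rigau's actual procedure, and the words "bank", "money", "finance" are just an illustrative context:

```python
from nltk.corpus import wordnet as wn

def sub_tree(root):
    """The root synset plus every synset reachable through hyponymy."""
    return {root} | set(root.closure(lambda s: s.hyponyms()))

def conceptual_density(nhyp, h, m, smoothing=0.20):
    # Same formula as the previous sketch, in compact form.
    num = sum(nhyp ** (i ** smoothing) for i in range(m))
    den = sum(nhyp ** j for j in range(h))
    return num / den if den else 0.0

def disambiguate(target, context):
    target_senses = wn.synsets(target, pos=wn.NOUN)
    context_senses = [s for w in context for s in wn.synsets(w, pos=wn.NOUN)]
    best = None
    for sense in target_senses:
        # Step 1: candidate sub-hierarchies = the nearest hypernyms of this
        # sense (restricted to a few ancestors to keep the toy example fast).
        for root in sense.hypernym_paths()[0][-4:]:
            tree = sub_tree(root)
            # Step 2: conceptual density of this sub-hierarchy.
            m = 1 + sum(1 for s in context_senses if s in tree)
            nhyp = sum(len(s.hyponyms()) for s in tree) / len(tree)
            h = max(s.max_depth() for s in tree) - root.max_depth() + 1
            cd = conceptual_density(nhyp, h, m)
            # Steps 3-4: keep the target sense under the densest concept.
            if best is None or cd > best[0]:
                best = (cd, sense)
    return best[1]

print(disambiguate("bank", ["money", "finance"]))
```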

7 WSD USING RANDOM WALK ALGORITHM (PageRank) (Sinha and Mihalcea, 2007)
[Figure: weighted sense graph for the context "Bell ring church Sunday"; each word contributes a vertex per sense (S1, S2, S3, ...) and the edges carry similarity weights such as 0.35, 0.42, 0.46, 0.49, 0.56, 0.58, 0.63, 0.67, 0.92, 0.97.]
Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition-based semantic similarity (Lesk's method).
Step 3: Apply a graph-based ranking algorithm to find the score of each vertex (i.e. of each word sense).
Step 4: Select the vertex (sense) with the highest score.

8 A look at Page Rank (from Wikipedia)
Developed at Stanford University by Larry Page (hence the name PageRank) and Sergey Brin as part of a research project about a new kind of search engine. The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was published in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.

9 A look at Page Rank (cntd)
PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents, so each document would begin with an estimated PageRank of 0.25. If pages B, C and D each link only to A, they would each confer their 0.25 PageRank to A. All PageRank in this simplistic system would thus gather to A, because all links would point to A:
PR(A) = PR(B) + PR(C) + PR(D) = 0.75

10 A look at Page Rank (cntd)
Suppose instead that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C, and only one third of D's PageRank is counted for A's PageRank (approximately 0.083).
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
In general, PR(U) = Σ_{V ∈ B(U)} PR(V)/L(V), where B(U) is the set of pages that link to U and L(V) is the number of outbound links from V.

11 A look at Page Rank (damping factor)
The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is the damping factor d.
PR(U) = (1 - d)/N + d · Σ_{V ∈ B(U)} PR(V)/L(V)
where N is the size of the document collection.
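A small worked version of this update for the four-page example above, assuming d = 0.85 and a fixed number of power iterations; page A has no outbound links in this toy setup, so a little rank mass leaks, which is acceptable for illustration:

```python
# Power-iteration PageRank for the four-page example:
# B links to A and C, C links to A, D links to A, B and C.
links = {
    "A": [],                 # A has no outbound links in this toy example
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

d = 0.85                     # damping factor
N = len(links)
pr = {p: 1.0 / N for p in links}   # initial estimate: 0.25 each

# Backlinks B(U): the pages that link to U.
backlinks = {p: [q for q in links if p in links[q]] for p in links}

for _ in range(50):          # iterate until the scores stabilise
    pr = {
        u: (1 - d) / N + d * sum(pr[v] / len(links[v]) for v in backlinks[u])
        for u in links
    }

print({p: round(score, 3) for p, score in pr.items()})
```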

12 For WSD: PageRank
Given a graph G = (V, E):
In(Vi) = predecessors of Vi
Out(Vi) = successors of Vi
In a weighted graph, the walker randomly selects an outgoing edge, with a higher probability of selecting edges with higher weight:
PR(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} [ w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ] · PR(Vj)
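A sketch of this weighted random walk over a toy sense graph for "bell ring church Sunday", assuming the networkx library is available; the sense labels and edge weights are invented stand-ins for Lesk-style similarity scores, not values from the lecture:

```python
import networkx as nx

G = nx.Graph()
# One vertex per (word, sense); one weighted edge per definition overlap.
G.add_edge(("bell", "s1"), ("ring", "s3"), weight=0.85)
G.add_edge(("bell", "s1"), ("church", "s2"), weight=0.55)
G.add_edge(("ring", "s3"), ("church", "s2"), weight=0.35)
G.add_edge(("church", "s2"), ("sunday", "s1"), weight=0.60)
G.add_edge(("bell", "s2"), ("ring", "s1"), weight=0.10)

# Weighted PageRank over the sense graph.
scores = nx.pagerank(G, alpha=0.85, weight="weight")

# For each word, keep the sense with the highest score.
best = {}
for (word, sense), score in scores.items():
    if word not in best or score > best[word][1]:
        best[word] = (sense, score)
print(best)
```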

13 Other Link Based Algorithms
HITS algorithm, invented by Jon Kleinberg (used by Teoma and now Ask.com)
IBM CLEVER project
TrustRank algorithm

14 KB Approaches – Comparisons
Algorithm: Accuracy
WSD using Selectional Restrictions: 44% on Brown Corpus
Lesk's algorithm: 50-60% on short samples of "Pride and Prejudice" and some "news stories"
Extended Lesk's algorithm: 32% on lexical samples from Senseval 2 (wider coverage)
WSD using conceptual density: 54% on Brown Corpus
WSD using Random Walk algorithms: 54% on the SEMCOR corpus, which has a baseline accuracy of 37%
Walker's algorithm: 50% when tested on 10 highly polysemous English words

15 KB Approaches –Conclusions
Drawbacks of WSD using Selectional Restrictions:
Needs an exhaustive knowledge base.
Drawbacks of overlap-based approaches:
Dictionary definitions are generally very small.
Dictionary entries rarely take into account the distributional constraints of different word senses (e.g. selectional preferences, kinds of prepositions, etc.; for instance, cigarette and ash never co-occur in a dictionary).
They suffer from the problem of sparse match.
Proper nouns are not present in a machine-readable dictionary (MRD), so these approaches fail to capture the strong clues provided by proper nouns.

16 SUPERVISED APPROACHES

17 NAÏVE BAYES
The algorithm finds the winner sense using
ŝ = argmax_{s ∈ senses} Pr(s|Vw)
where Vw is a feature vector consisting of:
POS of w
Semantic & syntactic features of w
Collocation vector (set of words around it): typically the next word (+1), the next-to-next word (+2), -2, -1 and their POSs
Co-occurrence vector (number of times w occurs in the bag of words around it)
Applying Bayes rule and the naive independence assumption:
ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(Vw_i|s)

18 BAYES RULE AND INDEPENDENCE ASSUMPTION
ŝ = argmax_{s ∈ senses} Pr(s|Vw), where Vw is the feature vector.
Apply Bayes rule: Pr(s|Vw) = Pr(s) · Pr(Vw|s) / Pr(Vw)
Pr(Vw|s) can be expanded by the chain rule and then approximated under the independence assumption:
Pr(Vw|s) = Pr(Vw_1|s) · Pr(Vw_2|s, Vw_1) · ... · Pr(Vw_n|s, Vw_1, ..., Vw_{n-1}) ≈ Π_{i=1..n} Pr(Vw_i|s)
Thus, ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(Vw_i|s)

19 ESTIMATING PARAMETERS
The parameters in probabilistic WSD are Pr(s) and Pr(Vw_i|s).
Senses are marked with respect to the sense repository (WordNet).
Pr(s) = count(s, w) / count(w)
Pr(Vw_i|s) = Pr(Vw_i, s) / Pr(s) = c(Vw_i, s, w) / c(s, w)
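A compact sketch of how these two parameters could be estimated from counts and combined, using a tiny invented sense-tagged sample for the word "bank" and log-probabilities with a small smoothing constant (both illustrative choices, not part of the lecture):

```python
from collections import Counter
import math

# Tiny invented sense-tagged data for the word "bank":
# each item is (sense, list of context features Vw).
training = [
    ("bank_river", ["water", "river", "flow"]),
    ("bank_river", ["river", "sand"]),
    ("bank_finance", ["money", "loan"]),
    ("bank_finance", ["money", "interest", "loan"]),
]

sense_count = Counter(s for s, _ in training)                          # count(s, w)
feat_count = Counter((s, f) for s, feats in training for f in feats)   # c(Vwi, s, w)
total = sum(sense_count.values())                                      # count(w)

def score(sense, features, alpha=1e-3):
    # log Pr(s) + sum_i log Pr(Vwi|s), with a small smoothing constant.
    log_p = math.log(sense_count[sense] / total)
    for f in features:
        log_p += math.log((feat_count[(sense, f)] + alpha)
                          / (sense_count[sense] + alpha))
    return log_p

context = ["river", "water"]
print(max(sense_count, key=lambda s: score(s, context)))   # -> bank_river
```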

20 Supervised Approaches – Comparisons
Algorithm: Average Precision | Average Recall | Corpus | Average Baseline Accuracy
Naïve Bayes: 64.13% | Not reported | Senseval 3 – All Words Task | 60.90%
Decision Lists: 96% | Not applicable | Tested on a set of 12 highly polysemous English words | 63.9%
Exemplar Based disambiguation (k-NN): 68.6% | – | WSJ6 containing 191 content words | 63.7%
SVM: 72.4% | – | Senseval 3 – Lexical sample task (used for disambiguation of 57 words) | 55.2%
Perceptron trained HMM: 67.60% | 73.74% | – | –

21 Supervised Approaches –Observations
General comments:
Use corpus evidence instead of relying on dictionary-defined senses.
Can capture important clues provided by proper nouns, because proper nouns do appear in a corpus.
Naïve Bayes:
Suffers from data sparseness.
Since the scores are a product of probabilities, some weak features might pull down the overall score for a sense.
A large number of parameters need to be trained.
Decision Lists:
A word-specific classifier; a separate classifier needs to be trained for each word.
Uses the single most predictive feature, which eliminates the drawback of Naïve Bayes.

22 Parameter Projection and Iterative WSD
Language and Domain Adaptation

23 Pioneering work at IITB on Multilingual WSD
Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.
Mitesh Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision, Conference of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.
Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Domain-Specific Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters, 5th International Conference on Global Wordnet (GWC 2010), Mumbai, January 2010.
Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Projecting Parameters for Multilingual Word Sense Disambiguation, Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2009.

24 Motivation
Parallel corpora, wordnets and sense annotated corpora are scarce resources.
Challenges: lack of resources and the multiplicity of Indian languages.
Can we do annotation work in one language and find ways of reusing it for other languages? Can a more resource-fortunate language help a less resource-fortunate language?

25 Introduction
Aim: Perform WSD in a multilingual setting involving Hindi, Marathi, Bengali and Tamil.
The wordnet and sense marked corpora of Hindi are used for all these languages.
The methodology rests on a novel multilingual dictionary framework.
Parameters are projected from Hindi to the other languages.
The domains of interest are Tourism and Health.

26 Related Work (1/2)
Knowledge Based Approaches: Lesk's algorithm, Walker's algorithm, conceptual density, PageRank. Fundamentally overlap-based algorithms; they suffer from data sparsity because dictionary definitions are generally small. Broad-coverage algorithms, but they suffer from poor accuracies.
Supervised Approaches: WSD using SVM, k-NN, Decision Lists. Typically word-specific classifiers with high accuracies, but they need large training corpora, making them unsuitable for resource-scarce languages.

27 Related Work (2/2)
Semi-Supervised/Unsupervised Approaches: Hyperlex, Decision Lists. Do not need large annotated corpora, but are word-specific classifiers, hence not suited for broad coverage.
Hybrid approaches (motivation for our work): Structural Semantic Interconnections. Combine more than one knowledge source (the wordnet as well as a small amount of tagged corpora). Suitable for broad coverage.
No single existing solution to WSD completely meets our requirements of multilinguality, high domain accuracy and good performance in the face of not-so-large annotated corpora.

28 Parameters for WSD (1/4) Motivating example
The river flows through this region to meet the sea.
S1: (n) sea (a division of an ocean or a large body of salt water partially enclosed by land)
S2: (n) ocean, sea (anything apparently limitless in quantity or volume)
S3: (n) sea (turbulent water with swells of considerable size) "heavy seas"
What are the parameters that influence the choice of the correct sense for the word sea?

29 Parameters for WSD (2/4) Domain specific distributions
In the Tourism domain the "water-body" sense is more prevalent than the other senses, so domain-specific sense distribution information should be harnessed.
Dominance of senses in a domain: a synset node in the wordnet hypernymy hierarchy is called dominant if the synsets in the sub-tree below it occur frequently in the domain corpora. {place, country, city, area}, {flora, fauna}, {mode of transport} and {fine arts} are dominant senses in the Tourism domain. A sense that belongs to the sub-tree of a dominant sense should be given a higher score than the other senses.

30 Parameters for WSD (3/4) Corpus Co-occurrence statistics
Co-occurring monosemous and/or already disambiguated words in the context help in disambiguation. Example: the frequency of co-occurrence of river (monosemous) with the "water-body" sense of sea is high.
Semantic distance: the shortest path length between two synsets in the wordnet graph. An edge on this shortest path can be any semantic relation (hypernymy, hyponymy, meronymy, holonymy, etc.).
Conceptual distance between noun synsets.
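A quick illustration of the shortest-path measure using NLTK's WordNet graph; note that NLTK's helper follows only hypernym/hyponym links, whereas the slide allows any semantic relation, so treat this as an approximation with illustrative synset choices:

```python
from nltk.corpus import wordnet as wn

river = wn.synset("river.n.01")
sea = wn.synset("sea.n.01")          # the "water-body" sense of sea
money = wn.synset("money.n.01")

# Shortest hypernym/hyponym path length between noun synsets:
print(river.shortest_path_distance(sea))     # small: closely related concepts
print(river.shortest_path_distance(money))   # larger: distant concepts
```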

31 Parameters for WSD (4/4) Summarizing parameters,
Wordnet-dependent parameters: belongingness-to-dominant-concept, conceptual distance, semantic distance.
Corpus-dependent parameters: sense distributions, corpus co-occurrence.

32 Building a case for Parameter Projection
Wordnet-dependent parameters depend on the graph-based structure of the wordnet, while corpus-dependent parameters depend on various statistics learnt from sense marked corpora. Both tasks, constructing a wordnet from scratch and collecting sense marked corpora for multiple languages, are tedious and expensive.
Can the effort required in constructing semantic graphs for multiple wordnets and collecting sense marked corpora in multiple languages be avoided?

33 Synset based Multilingual Dictionary (1/2)
Rajat Mohanty, Pushpak Bhattacharyya, Prabhakar Pande, Shraddha Kalele, Mitesh Khapra and Aditya Sharma, Synset Based Multilingual Dictionary: Insights, Applications and Challenges, Global Wordnet Conference, Szeged, Hungary, January 2008.
Unlike in a traditional dictionary, the synsets are linked first, and after that the words inside the synsets are linked.
Hindi is used as the central language: the synsets of all languages link to the corresponding Hindi synset.
Concepts | L1 (English) | L2 (Hindi) | L3 (Marathi)
04321: a youthful male person | {male-child, boy} | {लड़का ladkaa, बालक baalak, बच्चा bachchaa} | {मुलगा mulgaa, पोरगा porgaa, पोर por}
Advantage: the synsets in a particular column automatically inherit the various semantic relations of the Hindi wordnet; the wordnet-based parameters thus get projected.
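One way to picture a MultiDict row as a data structure, keyed on the pivot (Hindi) synset id; the field names and transliterations below are illustrative, not the actual storage format used at CFILT:

```python
# Sketch of the synset-aligned MultiDict row shown above: every language
# column hangs off the same pivot synset id, so Hindi wordnet relations
# attached to 04321 are inherited by the English and Marathi entries.
multidict = {
    "04321": {
        "concept": "a youthful male person",
        "english": ["male-child", "boy"],
        "hindi":   ["ladkaa", "baalak", "bachchaa"],
        "marathi": ["mulgaa", "porgaa", "por"],
    },
}

def words_for(synset_id, language):
    return multidict[synset_id][language]

print(words_for("04321", "marathi"))
```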

34 Synset based Multilingual Dictionary (2/2)
Cross-linkages are set up manually from the words of a synset to the words of the linked synset of the central language. Such cross-linkages actually solve the problem of lexical choice in translating text from one language to another.
[Figure: cross-linked synsets for the "boy" concept. Marathi synset: {मुलगा/MW1 mulagaa, पोरगा/MW2 poragaa, पोर/MW3 pora}; Hindi synset: {लड़का/HW1 ladakaa, बालक/HW2 baalak, बच्चा/HW3 bachcha, छोरा/HW4 choraa}; English synset: {male-child/HW1, boy/HW2}.]

35 Sense Marked Corpora
[Figure: snapshot of a Marathi sense tagged paragraph.]

36 Parameter Projection using MultiDict - P(Sense|Word) parameter (1/2)
P({water-body}|saagar) is, in principle, the count of the {water-body} sense of saagar divided by the total count of saagar in a Marathi sense marked corpus. Using the cross-linked Hindi words, P({water-body}|saagar) is instead estimated from the counts of samudra (the Hindi word cross-linked to the {water-body} sense) in the Hindi sense marked corpus. In general, the P(Sense|Word) parameter for a word is projected from the counts of the Hindi words cross-linked to each of its senses.
[Figure: the Marathi word saagar (sea) has two senses, Sense_2650 {water body} and Sense_8231 {abundance}; the {water body} sense is cross-linked to the Hindi word samudra (sea).]
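A simplified sketch of the projection idea with invented counts: the P(Sense|Word) estimate for the Marathi word saagar is computed from counts of the cross-linked Hindi words rather than from a Marathi sense-tagged corpus (the numbers and transliterated identifiers are placeholders):

```python
# Invented Hindi counts for the cross-linked words of each sense of saagar.
hindi_counts = {
    ("water-body", "samudra"): 450,   # cross-linked Hindi word for sense 1
    ("abundance", "saagar_hi"): 50,   # cross-linked Hindi word for sense 2
}

# Sense -> cross-linked Hindi word, as given by the MultiDict.
senses_of_saagar = {
    "water-body": "samudra",
    "abundance": "saagar_hi",
}

total = sum(hindi_counts[(s, w)] for s, w in senses_of_saagar.items())
p_projected = {
    s: hindi_counts[(s, w)] / total for s, w in senses_of_saagar.items()
}
print(p_projected)   # e.g. {'water-body': 0.9, 'abundance': 0.1}
```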

37 Parameter Projection using MultiDict - P(Sense|Word) parameter (2/2)
Sr. No | Marathi Word | Synset | P(S|word) as learnt from sense tagged Marathi corpus | P(S|word) as projected from sense tagged Hindi corpus
1 | किंमत (kimat) | {worth} | 0.684 | 0.714
  |               | {price} | 0.315 | 0.285
2 | रस्ता (rasta) | {roadway} | 0.164 | 0.209
  |               | {road, route} | 0.835 | 0.770
3 | ठिकाण (thikan) | {land site, place} | 0.962 | 0.878
  |               | {home} | 0.037 | 0.12
For Hindi→Marathi: average KL divergence = 0.29, Spearman's correlation coefficient = 0.77.
For Hindi→Bengali: average KL divergence = 0.05, Spearman's correlation coefficient = 0.82.
There is a high degree of similarity between the distributions learnt using projection and those learnt from the self corpus.

38 Comparison of projected and true sense distribution statistics for some Marathi words

39 Parameter Projection using MultiDict - Co-occurrence parameter
Within a domain, the statistics of co-occurrence of senses remain the same across languages. For example, the co-occurrence of the synsets {cloud} and {sky} is almost the same in the Marathi and Hindi corpora.
Sr. No | Synset | Co-occurring Synset | P(co-occurrence) as learnt from sense tagged Marathi corpus | P(co-occurrence) as learnt from sense tagged Hindi corpus
1 | {रोप, रोपटे} {small bush} | {झाड, वृक्ष, तरुवर, द्रुम, तरू, पादप} {tree} | 0.125 | –
2 | {मेघ, अभ्र} {cloud} | {आकाश, आभाळ, अंबर} {sky} | 0.167 | 0.154
3 | {क्षेत्र, इलाक़ा, इलाका, भूखंड} {geographical area} | {यात्रा, सफ़र} {travel} | 0.0019 | 0.0017

40 Comparison of projected and true sense co-occurrences statistics for some Marathi words

41 Algorithms for WSD – Iterative WSD
Motivated by the energy expression in a Hopfield network. The analogy:
Neuron ↔ Synset
Self-activation ↔ Corpus sense distribution
Weight of connection between two neurons ↔ Weight as a function of corpus co-occurrence and wordnet distance measures between synsets
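A hedged sketch of the iterative scoring loop this analogy suggests: each candidate synset receives a self-activation term (its corpus sense distribution) plus connection terms to the senses already fixed in the context. All function arguments are placeholders, not the actual IWSD implementation:

```python
def iterative_wsd(words, senses_of, p_sense, weight):
    """words     : words to disambiguate, in the order they are attacked
       senses_of : senses_of(w) -> candidate senses of w
       p_sense   : p_sense(s, w) -> corpus sense distribution P(s|w) (self-activation)
       weight    : weight(s, t) -> connection weight between synsets s and t,
                   e.g. a mix of co-occurrence and wordnet distance"""
    fixed = {}                                   # already-disambiguated words
    for w in words:
        best = max(
            senses_of(w),
            key=lambda s: p_sense(s, w)
                          + sum(weight(s, t) for t in fixed.values()),
        )
        fixed[w] = best                          # feeds back into later decisions
    return fixed
```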

42 Algorithms for WSD – Modified PageRank
Modification: instead of using the overlap between dictionary definitions as edge weights, the wordnet and corpus based parameters are used to calculate the edge weights.

43 Experimental Setup
Datasets: Tourism corpora in 4 languages (Hindi, Marathi, Bengali and Tamil) and Health corpora in 2 languages (Hindi and Marathi). A 4-fold cross validation was done for all the languages in both domains.
Size of manually sense tagged corpora (# of polysemous words/tokens) for different languages:
Language | Tourism Domain | Health Domain
Hindi | 50890 | 29631
Marathi | 32694 | 8540
Bengali | 9435 | –
Tamil | 17868 | –
Number of synsets in MultiDict for each language:
Language | # of synsets in MultiDict
Hindi | 29833
Marathi | 16600
Bengali | 10732
Tamil | 5727

44 Results
Tourism Domain (P% / R% / F%):
IWSD (training on self corpora; no parameter projection): Marathi 81.29 / 80.42 / 80.85, Bengali 81.62 / 78.75 / 79.94, Tamil 89.50 / 88.18 / 88.83
IWSD (training on Hindi and reusing parameters for another language): Marathi 73.45 / 70.33 / 71.86, Bengali 79.83 / 79.65 / 79.79, Tamil 84.60 / 73.79 / 78.82
PageRank (training on self corpora; no parameter projection): Marathi 79.61, Bengali 76.41, Tamil –
PageRank (training on Hindi and reusing parameters for another language): Marathi 71.11, Bengali 75.05, Tamil –
Wordnet Baseline: Marathi 58.07, Bengali 52.25, Tamil 65.62
Marathi, Health Domain (P% / R% / F%):
IWSD (training on Marathi): 84.28 / 81.25 / 82.74
IWSD (training on Hindi and reusing for Marathi): 75.96 / 67.75 / 71.62
Wordnet Baseline: 60.32

45 Observations
IWSD performs better than PageRank.
There is a drop in performance when we use parameter projection instead of self corpora, but the performance is still better than the wordnet baseline.
The performance is consistent in both domains.
One could trade accuracy against the cost of creating sense annotated corpora.
Drop in F-score when using projections (Tourism):
Language | IWSD | PageRank
Marathi | 9% | 8%
Bengali | 0.1% | 1%
Tamil | 10% | –

