
1 Natural Language Processing: Lexical Acquisition. Lecture 8, Pusan National University, Minho Kim (karma@pusan.ac.kr)

2 Contents: Introduction, Evaluation Measures, Verb Subcategorization, Attachment Ambiguity, Selectional Preferences, Semantic Similarity

3 Lexical acquisition Develop algorithms and statistical techniques for filling the holes in existing dictionaries and lexical resources by examining the occurrence patterns of words in large text corpora –Collocations –Selectional preferences –Subcategorization –Semantic categorization

4 General-purpose dictionary: the Standard Korean Language Dictionary (표준국어대사전). Each entry gives the headword, pronunciation information, conjugation information, definition, and example sentences.

5 Electronic dictionary: the Sejong Electronic Dictionary (세종전자사전).

6 The limits of hand-encoded lexical resources Manual construction of lexical resources is very costly Because language keeps changing, these resources have to be continuously updated Quantitative information (e.g., frequencies, counts) has to be computed automatically anyway

7 Lexical acquisition Examples: –“insulin” and “progesterone” are in WordNet 2.1, but “leptin” and “pregnenolone” are not. –“HTML” and “SGML” are, but not “XML” or “XHTML”. We need some notion of word similarity to know where to locate a new word in a lexical resource

8 Evaluation Measures

9 Performance of a system Information retrieval engine –A user wants documents related to “NLP” –Many documents related to “NLP” actually exist –The search returns exactly one document –That document is indeed related to “NLP” –Is this a good retrieval system? Speller –A user types 5 words –3 of the 5 words are misspelled –The system flags all 5 words as errors –Is this a good spelling checker? Word sense disambiguation system –The system always decides that “사과” means “apple” (never “apology”) –Is this a good system?

10 Evaluation: precision and recall For many problems, we have a set of targets –Information retrieval: relevant documents –Spelling error correction: error words –Word sense disambiguation: sense of ambiguous word Precision P = tp / (tp + fp): the fraction of selected items that are correct; a measure of exactness or fidelity. Recall R = tp / (tp + fn): the fraction of target items that were selected; a measure of completeness. [Figure: the selected set and the target set overlap in tp; selected-only items are fp, target-only items are fn, everything outside both sets is tn.]

11 Evaluation: precision and recall Used in information retrieval –A perfect Precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved) –A perfect Recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved). –Precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved –Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).

12 Evaluation: precision and recall Joint distribution of the variables: False positives = Type Ⅰ errors; False negatives = Type Ⅱ errors
System \ Actual    target                 ¬ target
selected           tp (true positives)    fp (false positives)
¬ selected         fn (false negatives)   tn (true negatives)
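As a quick illustration of these definitions, here is a minimal Python sketch (not from the original slides) that derives the four cells and precision/recall from a system's selected set and a gold-standard target set; the document sets below are made up.

```python
# Minimal sketch: confusion cells and precision/recall from sets.
def confusion(selected, target, universe):
    tp = len(selected & target)   # selected and in the target
    fp = len(selected - target)   # selected but not in the target
    fn = len(target - selected)   # in the target but not selected
    tn = len(universe - selected - target)
    return tp, fp, fn, tn

selected = {1, 2, 3, 4}           # hypothetical: docs the system returned
target = {3, 4, 5, 6, 7, 8}       # hypothetical: docs that are relevant
universe = set(range(1, 21))      # a 20-document collection

tp, fp, fn, tn = confusion(selected, target, universe)
print(tp / (tp + fp))             # precision = 2/4 = 0.5
print(tp / (tp + fn))             # recall    = 2/6 ≈ 0.33
```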

13 Computing Recall/Precision Let total # of relevant docs = 6. Check P and R at each new recall point, i.e., at each rank where another relevant document is retrieved: R=1/6=0.167, P=1/1=1; R=2/6=0.333, P=2/2=1; R=3/6=0.5, P=3/4=0.75; R=4/6=0.667, P=4/6=0.667; R=5/6=0.833, P=5/13=0.38. One relevant document is missed, so the system never reaches 100% recall.
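The slide's numbers fall out of a ranked result list with relevant documents at ranks 1, 2, 4, 6, and 13; this hypothetical reconstruction in Python recomputes P and R at each new recall point.

```python
# Sketch: precision/recall at each rank where a relevant doc appears.
# 1 = relevant, 0 = not relevant; ranks chosen to reproduce the slide.
ranked = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
total_relevant = 6                # one relevant doc is never retrieved

hits = 0
for rank, rel in enumerate(ranked, start=1):
    if rel:
        hits += 1
        print(f"R={hits}/{total_relevant}={hits/total_relevant:.3f}  "
              f"P={hits}/{rank}={hits/rank:.2f}")
```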

14 Precision and recall: trade-off [Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal sits at the top right. A high-precision system returns relevant documents but misses many useful ones too; a high-recall system returns most relevant documents but includes lots of junk.]

15 F measure Combines precision and recall in a single measure of overall performance: F = 1 / (α/P + (1−α)/R), where P is precision, R is recall, and α is a factor which determines the weighting of P and R. α = 0.5 is often chosen for equal weighting, in which case F reduces to the familiar F1 = 2PR / (P + R).
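A small sketch of the weighted F measure as defined above; the function name and sample values are illustrative.

```python
# Sketch: F = 1 / (alpha/P + (1-alpha)/R); alpha = 0.5 gives F1.
def f_measure(p, r, alpha=0.5):
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

print(f_measure(0.75, 0.5))             # 0.6, same as 2PR/(P+R)
print(f_measure(0.75, 0.5, alpha=0.8))  # ≈ 0.68, weights precision more
```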

16 Verb Subcategorization

17 Finding collocations Frequency –If two words occur together a lot, that may be evidence that they have a special function –But if we sort pairs by frequency C(w1, w2), then “of the” is the most frequent pair –So filter by POS patterns: A N (linear function), N N (regression coefficients), etc. Mean and variance of the distance between the words, for non-contiguous collocations –She knocked at his door (d = 2) –A man knocked on the metal front door (d = 4) Hypothesis testing (see Stat NLP, p. 162) –How do we know it’s really a collocation? A low mean distance can be accidental (e.g., new company) –We need to know whether two words occur together by chance or because they are a collocation
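To make the distance statistics concrete, here is a rough Python sketch; it assumes, matching the slide's numbers, that d counts the words between the two collocates, and the window size of 5 is an arbitrary choice.

```python
# Sketch: collect d = number of words between w1 and a following w2,
# then take the mean and variance of the collected distances.
import statistics

def gaps(tokens, w1, w2, max_gap=5):
    ds = []
    for i, tok in enumerate(tokens):
        if tok == w1:
            for j in range(i + 1, min(len(tokens), i + max_gap + 2)):
                if tokens[j] == w2:
                    ds.append(j - i - 1)
    return ds

text = ("she knocked at his door . "
        "a man knocked on the metal front door .")
ds = gaps(text.split(), "knocked", "door")
print(ds)                                           # [2, 4], as on the slide
print(statistics.mean(ds), statistics.variance(ds)) # 3.0 2.0
```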

18 Finding collocations Mutual information measure –A measure of how much one word tells us about the other, i.e., the reduction in uncertainty about one word due to knowing the other: I(w1, w2) = log2 [ P(w1 w2) / ( P(w1) P(w2) ) ] –0 when the two words are independent (see Stat NLP pages 66 and 178)
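A minimal sketch of the measure computed from corpus counts; all counts below are invented for illustration.

```python
# Sketch: pointwise mutual information of a word pair from raw counts.
import math

def pmi(c_w1, c_w2, c_pair, n):
    """log2 of P(w1 w2) / (P(w1) P(w2)), with MLE probabilities."""
    return math.log2((c_pair / n) / ((c_w1 / n) * (c_w2 / n)))

# hypothetical: w1 occurs 1000 times, w2 800 times, the pair 30 times
# in a corpus of 1,000,000 tokens
print(pmi(1000, 800, 30, 1_000_000))   # ≈ 5.23 bits: strong association
print(pmi(1000, 800, 1, 1_000_000))    # ≈ 0.32: near-independent pair
```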

19 Main verbs Transitive –requires a direct object (found with questions: what? or whom?) ?The child broke. The child broke a glass. Intransitive –does not require a direct object. The train arrived. Some verbs can be both transitive and intransitive –The ship sailed the seas. (transitive) –The ship sails at noon. (intransitive) –I met my friend at the airport. (transitive) –The delegates met yesterday. (intransitive)

20 Verb Phrases VP --> head-verb complements adjuncts Some VPs: –Verb: eat. –Verb NP: leave Montreal. –Verb NP PP: leave Montreal in the morning. –Verb PP: leave in the morning. –Verb S: think I would like the fish. –Verb VP: want to leave. want to leave Montreal. want to leave Montreal in the morning. want to want to leave Montreal in the morning.

21 Verb Subcategorization Verbs subcategorize for different syntactic categories. A particular set of syntactic categories that a verb can appear with is called a subcategorization frame.

22 Subcategorization frames Some verbs can take complements that others cannot I want to fly. *I find to fly. Verbs are subcategorized according to the complements they can take --> subcategorization frames –traditionally: transitive vs. intransitive –nowadays: up to 100 subcategories / frames

23 Subcategorization frames: parsing Subcategorization is important for parsing a) She told the man where Peter grew up. b) She found the place where Peter grew up. –tell subcategorizes for an object NP plus a clause, so in (a) the where-clause is a complement of told –find subcategorizes for just an object NP, so in (b) the where-clause is a relative clause modifying the place

24 Brent’s Lerner system The algorithm has two steps. ① Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty. Certainty is formalized as a probability of error: for each cue cj we define an error rate εj. ② Hypothesis testing by contradiction: we assume the frame is not appropriate for the verb (the null hypothesis H0), and we reject H0 if cue cj indicates with high probability that H0 is wrong.

25 Cues Cue for frame “NP NP”: (OBJ | SUBJ_OBJ | CAP) (PUNC | CC) OBJ = object personal pronouns (me, him) SUBJ_OBJ = subject-or-object personal pronouns (you, it) CAP = capitalized word PUNC = punctuation mark CC = subordinating conjunction (if, before) Example: […] greet-V Peter-CAP ,-PUNC […]
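A rough sketch of how such a cue can be checked mechanically: tokens carry "word-TAG" annotations and a regular expression looks for a verb followed by the cue. The tagging scheme and the sample string are simplified inventions based on the slide.

```python
# Sketch: fire the "NP NP" cue when a verb is followed by
# (OBJ | SUBJ_OBJ | CAP) and then (PUNC | CC) in tagged text.
import re

CUE_NP_NP = re.compile(
    r"\S+-V\s+\S+-(?:OBJ|SUBJ_OBJ|CAP)\s+\S+-(?:PUNC|CC)")

tagged = "yesterday-CAP we-SUBJ_OBJ greet-V Peter-CAP ,-PUNC ..."
print(bool(CUE_NP_NP.search(tagged)))   # True: cue fires for 'greet'
```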

26 Hypothesis testing n: # of times verb vi occurs in the corpus m = C(vi, cj): # of times vi occurs with cue cj vi(fj) = 0: verb vi does not permit frame fj εj: error rate for cue cj Under H0 (vi does not permit fj), every occurrence of the cue is an error, so pE = P(C(vi, cj) ≥ m | vi(fj) = 0) = Σ r=m..n C(n, r) εj^r (1 − εj)^(n−r) If pE < α, then we reject H0 and conclude that vi permits frame fj Precision: close to 100% (when α = 0.02); Recall: 47–100%
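Here is a small sketch of the test with made-up numbers; math.comb supplies the binomial coefficient.

```python
# Sketch: pE = P(cue fires >= m times in n verb occurrences | H0),
# i.e. the upper tail of a Binomial(n, eps) distribution.
from math import comb

def p_error(n, m, eps):
    return sum(comb(n, r) * eps**r * (1 - eps)**(n - r)
               for r in range(m, n + 1))

# hypothetical: verb seen n=80 times, cue fired m=6 times, eps=0.02
pE = p_error(80, 6, 0.02)
print(pE)             # ≈ 0.005
print(pE < 0.02)      # True -> reject H0: the verb permits the frame
```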

27 Attachment Ambiguity

28 Attachment ambiguity (8.14) The children ate the cake with a spoon. I saw the man with a telescope. (Log) likelihood ratio: a common and good way of comparing two exclusive alternatives Problem: ignores the preference for attaching phrases “low” in the parse tree

29 Simple model Example: Chrysler confirmed that it would end its troubled venture with Maserati.

30 Hindle and Rooth (1) Event space: all V NP PP sequences How likely is a preposition to attach to the verb vs. the noun? VAp: is there a PP headed by p which attaches to v? NAp: is there a PP headed by p which attaches to n? Both can be 1: He put the book on World War II on the table. She sent him into the nursery to gather up his toys.

31 Hindle and Rooth (2) Decide the attachment with a log likelihood ratio: λ(v, n, p) = log2 [ P(VAp = 1 | v) P(NAp = 0 | n) / P(NAp = 1 | n) ] (the noun gets first shot at the PP, so noun attachment needs only NAp = 1) If λ is positive, attach the PP to the verb; if negative, attach it to the noun.
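A minimal sketch of this decision rule; the probability estimates below are invented, whereas in Hindle and Rooth's actual system they are estimated from partially parsed corpus counts.

```python
# Sketch: Hindle & Rooth-style attachment score.
# lambda > 0 -> attach the PP to the verb; lambda < 0 -> to the noun.
from math import log2

def attach_lambda(p_va, p_na):
    """log2[ P(VAp=1|v) * P(NAp=0|n) / P(NAp=1|n) ]."""
    return log2(p_va * (1.0 - p_na) / p_na)

# hypothetical estimates for "end ... venture with Maserati":
p_va = 0.15   # P(some with-PP attaches to the verb 'end')
p_na = 0.70   # P(some with-PP attaches to the noun 'venture')
print(attach_lambda(p_va, p_na))   # ≈ -3.96 -> noun attachment
```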

32 General remarks on PP attachment Model’s limitations –Only considers the identity of the preposition, the noun, and the verb –Considers only the most basic case of a PP immediately after an NP object, modifying either the immediately preceding n or v: The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting] Other attachment issues –Attachment ambiguity in noun compounds: (a) [[Door bell] manufacturer]: left-branching (b) [Woman [aid worker]]: right-branching

33 Selectional Preferences

34 Selectional Preferences (1) Selectional preference (or selectional restriction): most verbs prefer arguments of a particular type These are preferences, not hard rules: eat can take a non-food argument, e.g., eating one’s words

35 Selectional Preferences (2) Acquiring selectional preferences is important in Statistical NLP for a number of reasons –If durian is missing from the dictionary, we can infer part of its meaning from selectional restrictions –Another important use is ranking the parses of a sentence: give high scores to parses where the verb has natural arguments

36 Selectional Preferences (3) Resnik’s model (1993, 1996) 1. Selectional preference strength: how strongly the verb constrains its direct object Two assumptions: ① take only the head noun ② work with classes of nouns S(v) = D( P(C|v) ‖ P(C) ) = Σc P(c|v) log2 [ P(c|v) / P(c) ] P(C): overall probability distribution of noun classes P(C|v): probability distribution of noun classes in the direct object position of v

37 Selectional Preferences (4) Table 8.5 Selectional Preference Strength
Noun class c   P(c)   P(c|eat)   P(c|see)   P(c|find)
People         0.25   0.01       0.25       0.33
Furniture      0.25   0.01       0.25       0.33
Food           0.25   0.97       0.25       0.33
Action         0.25   0.01       0.25       0.01
SPS S(v)              1.76       0.00       0.35
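The strengths in the table are KL divergences, so they can be checked directly; this short Python sketch reproduces the S(v) row.

```python
# Sketch: selectional preference strength S(v) = KL(P(C|v) || P(C)),
# reproducing Table 8.5.
from math import log2

def sps(p_c_given_v, p_c):
    return sum(p * log2(p / q)
               for p, q in zip(p_c_given_v, p_c) if p > 0)

p_c = [0.25, 0.25, 0.25, 0.25]   # People, Furniture, Food, Action
print(round(sps([0.01, 0.01, 0.97, 0.01], p_c), 2))   # eat  -> 1.76
print(round(sps([0.25, 0.25, 0.25, 0.25], p_c), 2))   # see  -> 0.0
print(round(sps([0.33, 0.33, 0.33, 0.01], p_c), 2))   # find -> 0.35
```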

38 Selectional Preferences (5) Resnik’s model (cont’d) 2. Selectional association between a verb v and a class c: A(v, c) = P(c|v) log2 [ P(c|v) / P(c) ] / S(v), i.e., the share of the preference strength contributed by class c A rule for assigning strength to nouns: A(v, n) = max over the classes c containing n of A(v, c) Ex) (8.31) Susan interrupted the chair. chair is in both a furniture class and a people class; interrupt associates more strongly with people, so that sense is selected

39 Selectional Preferences (6) Estimate the probability P(c|v) = P(v, c) / P(v), with P(v, c) ≈ (1/N) Σ n ∈ words(c) C(v, n) / |classes(n)| N: total number of verb-object pairs in the corpus words(c): set of all nouns in class c |classes(n)|: number of noun classes that contain n as a member C(v, n): number of verb-object pairs with v as the verb and n as the head of the object NP
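A toy sketch of this estimator: a noun's count is split evenly over the classes it belongs to. The noun-class mapping and the pairs are invented for illustration.

```python
# Sketch: estimate P(c|v) = P(v,c) / P(v) from verb-object pairs,
# splitting each noun's count across its |classes(n)| classes.
from collections import Counter

classes_of = {"chair": {"furniture", "people"},   # hypothetical classes
              "cake": {"food"}, "meeting": {"action"}}
pairs = [("interrupt", "chair"), ("interrupt", "meeting"),
         ("eat", "cake"), ("eat", "cake")]

def p_c_given_v(v, c):
    n_pairs = len(pairs)
    c_vn = Counter(pairs)
    p_vc = sum(cnt / len(classes_of[noun])
               for (verb, noun), cnt in c_vn.items()
               if verb == v and c in classes_of[noun]) / n_pairs
    p_v = sum(1 for verb, _ in pairs if verb == v) / n_pairs
    return p_vc / p_v

print(p_c_given_v("interrupt", "people"))   # 0.25
```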

40 Selectional Preferences (7) Resnik’s experiments on the Brown corpus (1996): Table 8.6 –Left half: typical objects; right half: atypical objects –For most verbs, association strength predicts which object is typical –Most errors the model makes are due to the fact that it performs a form of disambiguation, choosing the highest A(v, c) for A(v, n) Implicit object alternation a. Mike ate the cake. b. Mike ate. –The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction –Selectional preference strength (SPS) is seen as the more basic phenomenon, which explains the occurrence of implicit objects as well as association strength

41 Selectional Preferences (8)

42 Semantic Similarity

43 Semantic Similarity (1) Lexical acquisition: the acquisition of meaning Semantic similarity: automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is Most often used for generalization, under the assumption that semantically similar words behave similarly ex) Susan had never eaten a fresh durian before. Similarity-based generalization vs. class-based generalization –Similarity-based generalization: consider the closest neighbors –Class-based generalization: consider the whole class Uses of semantic similarity –Query expansion: astronaut → cosmonaut –k nearest neighbors classification

44 Semantic Similarity (2) A notion of semantic similarity –An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode –Two words are from the same domain or topic ex) doctor, nurse, fever, intravenous –Judgements of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991) Ambiguity presents a problem for all notions of semantic similarity –When applied to ambiguous words, “semantically similar” usually means ‘similar to the appropriate sense’ ex) litigation ≒ suit (≠ clothes) Similarity measures –Vector space measures –Probabilistic measures

45 Semantic Similarity (3)

46 Vector space measures (1) The two words whose semantic similarity we want to compute are represented as vectors in a multi-dimensional space. 1. A document-by-word matrix A (Figure 8.3): entry (i, j) contains the number of times word j occurs in document i 2. A word-by-word matrix B (Figure 8.4): entry (i, j) contains the number of times word j co-occurs with word i 3. A modifier-by-head matrix C (Figure 8.5): entry (i, j) contains the number of times that head j is modified by modifier i Different spaces capture different types of semantic similarity –Document-word and word-word spaces capture topical similarity –The modifier-head space captures more fine-grained similarity
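For concreteness, here is a toy sketch that builds a word-by-word matrix of the Figure 8.4 kind, counting co-occurrence within a sentence; the two "sentences" are invented.

```python
# Sketch: word-by-word co-occurrence counts from a toy corpus.
from collections import defaultdict
from itertools import permutations

sentences = [["cosmonaut", "Soviet", "spacewalking"],
             ["astronaut", "American", "spacewalking"]]

B = defaultdict(int)
for sent in sentences:
    for w_i, w_j in permutations(set(sent), 2):
        B[(w_i, w_j)] += 1      # entry (i, j): j co-occurs with i

print(B[("cosmonaut", "Soviet")])     # 1
print(B[("cosmonaut", "American")])   # 0
```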

47 Vector space measures (2) Three similarity measures for binary vectors (Table 8.7) –Matching coefficient |X ∩ Y| simply counts the number of dimensions on which both vectors are non-zero –Dice coefficient 2|X ∩ Y| / (|X| + |Y|) normalizes for the length of the vectors and the total number of non-zero entries –Jaccard (or Tanimoto) coefficient |X ∩ Y| / |X ∪ Y| penalizes a small number of shared entries more than the Dice coefficient does

48 Vector space measures (3) Similarity measures for binary vectors (cont’d) –Overlap coefficient |X ∩ Y| / min(|X|, |Y|) has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector –Cosine |X ∩ Y| / √(|X| · |Y|) penalizes less in cases where the number of non-zero entries is very different Real-valued vector space –A more powerful representation for linguistic objects –The length of a vector: |x| = √(Σi xi²)
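The binary measures above, together with the matching coefficient from the previous slide, are easy to state over sets of non-zero dimensions; a minimal sketch with toy context sets:

```python
# Sketch: binary-vector similarity measures; each vector is the set of
# dimensions on which it is non-zero.
from math import sqrt

def matching(x, y):   return len(x & y)
def dice(x, y):       return 2 * len(x & y) / (len(x) + len(y))
def jaccard(x, y):    return len(x & y) / len(x | y)
def overlap(x, y):    return len(x & y) / min(len(x), len(y))
def bin_cosine(x, y): return len(x & y) / sqrt(len(x) * len(y))

x = {"Soviet", "spacewalking"}     # toy context set for cosmonaut
y = {"American", "spacewalking"}   # toy context set for astronaut
print(dice(x, y), jaccard(x, y), overlap(x, y), bin_cosine(x, y))
# 0.5 0.333... 0.5 0.5
```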

49 Vector space measures (4) Real-valued vector space (cont’d) –The dot product between two vectors: x · y = Σi xi yi –The cosine measure: cos(x, y) = (x · y) / (|x| |y|) –The Euclidean distance: |x − y| = √(Σi (xi − yi)²) The advantages of vector spaces as a representational medium –Simplicity –Computational efficiency The disadvantages of vector spaces –Most measures operate only on binary data, except for the cosine –The cosine has its own problem: it assumes a Euclidean space, which is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts
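And the real-valued counterparts, as a short sketch with invented count vectors:

```python
# Sketch: dot product, cosine, and Euclidean distance on real vectors.
from math import sqrt

def dot(x, y):       return sum(a * b for a, b in zip(x, y))
def cosine(x, y):    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))
def euclidean(x, y): return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 0.0, 1.0]    # hypothetical counts for cosmonaut
y = [0.0, 1.0, 1.0]    # hypothetical counts for astronaut
print(cosine(x, y))    # 0.5
print(euclidean(x, y)) # ≈ 1.414
```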

50 Probabilistic measures (1) Transform semantic similarity into the similarity of two probability distributions –Transform the matrices of counts in Figures 8.3, 8.4, and 8.5 into matrices of conditional probabilities Ex) (American, astronaut): P(American | astronaut) = 1/2 = 0.5 Measures of (dis-)similarity between probability distributions (Table 8.9): three measures investigated by Dagan et al. (1997) 1. KL divergence: D(p ‖ q) = Σi pi log2 (pi / qi) –Measures how much information is lost if we assume distribution q when the true distribution is p –Two problems for practical applications: it becomes infinite when qi = 0 and pi ≠ 0, and it is asymmetric (D(p‖q) ≠ D(q‖p))

51 Probabilistic measures (2) Measures of similarity between probability distributions (cont’d) 2. Information radius (IRad): IRad(p, q) = D(p ‖ (p+q)/2) + D(q ‖ (p+q)/2) –Symmetric, and no problem with infinite values –Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution 3. L1 norm: L1(p, q) = Σi |pi − qi| –A measure of the expected proportion of events that are going to be different between the distributions p and q

52 Probabilistic measures (3) Measures of similarity between probability distributions (cont’d) L1 norm example (from Figure 8.5): p1 = P(Soviet | cosmonaut) = 0.5, p2 = 0, p3 = P(spacewalking | cosmonaut) = 0.5 q1 = 0, q2 = P(American | astronaut) = 0.5, q3 = P(spacewalking | astronaut) = 0.5 L1(p, q) = |0.5 − 0| + |0 − 0.5| + |0.5 − 0.5| = 1.0
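A short sketch computing the measures on the slide's p and q (note the plain KL divergence is undefined here, since q is zero where p is not):

```python
# Sketch: KL divergence, information radius, and L1 norm on the
# cosmonaut/astronaut distributions from the slide.
from math import log2

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.0, 0.5]   # P(. | cosmonaut): Soviet, American, spacewalking
q = [0.0, 0.5, 0.5]   # P(. | astronaut)
print(irad(p, q))     # 1.0
print(l1(p, q))       # 1.0
# kl(p, q) would divide by zero: infinite when q_i = 0 and p_i > 0
```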

53 The Role of Lexical Acquisition in Statistical NLP Lexical acquisition plays a key role in statistical NLP because available lexical resources are always lacking in some way –The cost of building lexical resources manually –The quantitative part of lexical acquisition almost always has to be done automatically –Many lexical resources were designed for human consumption The best solution: the augmentation of a manual resource by automatic means The main reason: the inherent productivity of language

54 What does the future hold for lexical acquisition? Look harder for sources of prior knowledge that can constrain the process of lexical acquisition –Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning –Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in Statistical NLP Dictionaries are only one source of information that can be important in lexical acquisition, in addition to text corpora –Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc. If we succeed in emulating human acquisition of language by tapping into this rich source of information, then a breakthrough in the effectiveness of lexical acquisition can be expected

