1
Handling of missing values in lexical acquisition
Núria Bel, Universitat Pompeu Fabra
GRUP DE TECNOLOGIES DELS RECURSOS LINGÜÍSTICS (TRL)
LREC 2010, La Valletta, Malta, May 2010
2
By automatic lexical information acquisition we mean finding ways to build repositories of language-dependent lexical information automatically.
Many of the technologies behind applications (MT, IE, automatic summarization, sentiment analysis, opinion mining, question answering, etc.) need this information to work.
Example entries, borrowed from the MT system Incyta (Metal family):
("paralelo" AST ALO "paralel" ATR POST CL (PF-AS PM-OS SF-A SM-O) FC (NPP) LY AMENTE MC ("a") PLC (NG) PRED (ESTAR SER) TA (OBJ-P REL) AUTHOR "juan" DATE "31-Aug-99" SITE "FB52")
("fiesta" NST ALO "fiest" CL (PF-AS SF-A) GD (F) KN MS PLC (NF) TYN (ABS) AUTHOR "juan" DATE "28-Aug-99" SITE "FB52")
3
Cue-based lexical acquisition
Differences in the distribution of certain contexts separate words of different classes (Harris, 1951). For example: some / *many mud.
Words (types) can be represented as a collection of contexts, where their occurrence or non-occurrence in each context is taken as a hint, or cue, that the word belongs to a particular class.
4
A word's occurrences are represented as a vector and used to train a classifier.
@data 15,2,8,4,0,8,1,0,1,0,0,0,0,0
Each value is the number of times the word has been observed in one of the defined contexts. Non-occurrence in a particular context is as informative as occurrence.
We use supervised classifiers (Support Vector Machines, Decision Trees) to predict the class (Abstract, Mass, etc.) of new words.
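As a rough illustration of the vector representation described on this slide, the sketch below counts how often a word was seen in each context. The cue names (cue_1 to cue_14) and the toy observations are invented for the example; the slides do not name the 14 contexts.

```python
from collections import Counter

# Hypothetical cue inventory: the slides do not name the 14 contexts.
CUES = [f"cue_{i}" for i in range(1, 15)]

def cue_vector(observed_cues, cues=CUES):
    """Count how many times a word was observed in each defined context."""
    counts = Counter(observed_cues)
    # Contexts that were never observed get an explicit 0.
    return [counts[c] for c in cues]

# Toy observations for a single word type (invented for illustration).
observations = ["cue_1"] * 15 + ["cue_2"] * 2 + ["cue_3"] * 8
vector = cue_vector(observations)
print(",".join(str(v) for v in vector))
```

Every context the word never appeared in ends up as 0, which is exactly the value whose interpretation the rest of the talk questions.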
5
Cues, classification and state-of-the-art results
Merlo and Stevenson (2001) selected very specific cues for classifying verbs into a number of Levin (1993) based verbal classes: animacy of the subject, passives, ...
Baldwin (2005) used general features, such as the POS tags of neighboring words, for type classification.
Joanis et al. (2007) used the frequency of filled syntactic positions or slots, tense and voice of occurring verbs, etc., to describe the whole system of English verbal classes.
The results are difficult to compare, but accuracy is about 70%.
6
The problem: missing values
7
The sparse data problem
Joanis and Stevenson (2003), Joanis et al. (2007) and Korhonen et al. (2008) all mention having to face the problem of sparse data: many of the types/words are low in frequency and provide very little information.
Most words occur very rarely (i.e. a Zipfian distribution) and therefore show few cues. Yallop et al. (2005) calculated that in the 100M-word British National Corpus, from a total of 124,120 distinct adjectives, 70,246 occur only once.
The cues we can use as information are mutually exclusive within a single occurrence: an adjective can be both prenominal and postnominal, but if it occurs only once it will show only one of these cues, and the others will have a zero value.
Even for types that occur more than once, the optional nature and the variety of the contexts of occurrence give rise to missing values.
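To put the Yallop et al. (2005) figures in proportion, a quick calculation of the share of adjectives that occur only once (the variable names are ours):

```python
# Figures quoted on the slide from Yallop et al. (2005).
singletons = 70_246   # adjectives occurring only once in the BNC
total = 124_120       # distinct adjectives in the BNC

share = singletons / total
print(f"{share:.1%} of distinct adjectives occur only once")
```

That is, more than half of the adjective types can show at most one cue.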
8
Zero values and learning
Zero values not only leave too little information to decide on; they also add uncertainty when learning from the data.
A zero may be a true negative, i.e. the informative cue is precisely that the context does not occur, or the context may simply not have been observed in the examined corpus, for various reasons.
When there are many zero values, the cue loses its predictive power because of this uncertainty.
Katz (1987) and Baayen and Sproat (1996), among others, acknowledged the importance of preprocessing low-frequency events, and Joanis et al. (2007) also decided to smooth the data, even when working with more than 1000 occurrences per verb in the BNC.
9
Our smoothing experiment: harmonization based on linguistic information
10
Intuitively: how likely is it that a 0 is just an unobserved feature and not a true 0, given the values of the other observations?
To classify Abstract/Concrete nouns in English:
Cue 1: suffixes "-ness", "-ism", ... (for Abstract; Light 1996)
Cue 2: determiners "such", "little", "much", ... (for Abstract)
Cue 3: adjectives like "big", "small", ... (for Concrete)
P(cue_1=1|[0,1,0]) = P(abstract=yes|[0,1,0]) * P(cue_1=1|abstract=yes)
                   + P(abstract=no|[0,1,0]) * P(cue_1=1|abstract=no)
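The formula on this slide is a weighted sum over the two classes. A minimal sketch with made-up numbers follows; both the class posterior and the per-class cue probabilities below are hypothetical, since the slide gives no values at this point.

```python
def p_cue_given_vector(posterior, p_cue_given_class):
    """Mix the per-class cue probabilities, weighted by the class posterior."""
    return sum(posterior[c] * p_cue_given_class[c] for c in posterior)

# Hypothetical P(class | [0,1,0]); not taken from the slides.
posterior = {"abstract": 0.8, "concrete": 0.2}
# Hypothetical P(cue_1=1 | class); not taken from the slides.
p_cue1 = {"abstract": 0.5, "concrete": 0.0}

p = p_cue_given_vector(posterior, p_cue1)
print(p)
```

Under these assumed numbers, the observed 0 for cue 1 would be read as fairly likely to be an unobserved feature rather than a true negative.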
11
We use the information from observed features to assess the likelihood of a particular unobserved cue.
Harmonization means substituting each 0 value with the likelihood of its being 1, given the other cues observed.
BUT ... in order to get P(cue_1=1|[0,1,0]) we need P(cue_n|class) for all cues in the vector.
12
The challenge: how to get P(cue_n|class) with so many 0's in the data?
By estimating P(cue_n|class) from linguistic information:

              Abstract   Concrete
Suffix=no     0.5        1.0
Suffix=yes    0.5        0.0
SC_Adj=no     1.0        0.5
SC_Adj=yes    0.0        0.5

"The probability of being Concrete and having suffix "ness" is 0"
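One way the table could be turned into harmonized values is sketched below. The probabilities are copied from the table; everything else is an assumption made for illustration and not necessarily the procedure used in the experiments: a uniform class prior, conditional independence of cues given the class, only cues observed as 1 counting as evidence, and the function names themselves.

```python
# P(cue=yes | class), copied from the table on this slide.
P_YES = {
    "suffix": {"abstract": 0.5, "concrete": 0.0},
    "sc_adj": {"abstract": 0.0, "concrete": 0.5},
}
CLASSES = ("abstract", "concrete")
CUES = ("suffix", "sc_adj")

def class_posterior(vector, cues=CUES):
    """P(class | observed cues) under a uniform prior (assumption).

    Only cues observed as 1 are used as evidence (assumption), because
    a 0 may be either a true negative or simply unobserved.
    """
    prior = {c: 1.0 / len(CLASSES) for c in CLASSES}
    scores = {}
    for c in CLASSES:
        score = prior[c]
        for cue, value in zip(cues, vector):
            if value:
                score *= P_YES[cue][c]
        scores[c] = score
    total = sum(scores.values())
    if total == 0:
        # Evidence contradicts every class: fall back to the prior.
        return prior
    return {c: s / total for c, s in scores.items()}

def harmonize(vector, cues=CUES):
    """Set observed cues to 1.0 and replace each 0 with P(cue=1 | vector)."""
    post = class_posterior(vector, cues)
    return [
        1.0 if value else sum(post[c] * P_YES[cue][c] for c in CLASSES)
        for cue, value in zip(cues, vector)
    ]

print(harmonize([0, 0]))  # nothing observed
print(harmonize([0, 1]))  # only SC_Adj observed
print(harmonize([1, 0]))  # only the suffix observed
```

With nothing observed, both zeros become small non-zero probabilities; once one cue has been observed, the table makes the other cue impossible for the implied class, so its 0 is kept as a true negative.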
13
Harmonization effects in the Spanish Mass experiment

agua ('water')
  Frequency:  0,3,0,1,0,1,1,0,1,0,0,1,1,0
  Harmonized: 0,1,0,1,0,1,1,0,1,0,0,1,1,0

acero ('steel')
  Frequency:  1,2,0,0,0,2,1,1,2,0,0,0,0,0
  Harmonized: 1,1,0.5,0.5,0.5,1,1,1,1,0,0,0,0,0

desabastecimiento ('shortage')
  Frequency:  0,0,0,0,0,0,1,0,0,0,0,0,0,0
  Harmonized: 0.5,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0,0,0,0,0

aceptabilidad ('acceptability')
  Frequency:  0,0,0,0,0,0,0,0,0,0,0,0,0,0
  Harmonized: 0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.47,0.47,0.47,0.47,0.47
14
Results of the experiments

                Spanish Mass      English Abstract
Experiment      DT      SVM       DT      SVM
Mean            74.2    63.8      57.8    61.0
Trimmed mean    77.5    67.4      55.6    61.0
Frequency       79.9    79.1      61.4    64.1
Harmonized      82.8    80.7      76.1    70.1
Baseline        74.8              61.5
15
Error analysis & future work
Frequency information used to filter noise has been neutralized.
Future work is about how to handle missing values and noise together.
16
Thanks for your attention!