1 Supervised Classification of Feature-based Instances.

2 Simple Examples for Statistics-based Classification
Based on class-feature counts
Contingency table:
         C    ~C
  f      a    b
 ~f      c    d
We will see several examples of simple models based on these statistics

3 Prepositional-Phrase Attachment
Simplified version of Hindle & Rooth (1993) [MS 8.3]
Setting: V NP-chunk PP
– Moscow sent soldiers into Afghanistan
– ABC breached an agreement with XYZ
Motivation for the classification task:
– Attachment is often a problem for (full) parsers
– Augment shallow/chunk parsers

4 Relevant Probabilities
P(prep|n) vs. P(prep|v)
– The probability that the preposition prep attaches to an occurrence of the noun n (or the verb v).
– Notice: a single feature for each class
Example: P(into|send) vs. P(into|soldier)
Decision measured by the likelihood ratio: $\lambda(v, n, prep) = \log_2 \frac{P(prep \mid v)}{P(prep \mid n)}$
Positive/negative λ → verb/noun attachment
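
The decision rule can be sketched in a few lines. This is a minimal illustration, assuming λ is the base-2 log of P(prep|v) / P(prep|n), so that a positive value favors verb attachment; the function names and the threshold parameter are hypothetical, not taken from the slides.

```python
import math

def attachment_lambda(p_prep_given_verb, p_prep_given_noun):
    """Base-2 log likelihood ratio; positive values favor verb attachment."""
    return math.log2(p_prep_given_verb / p_prep_given_noun)

def decide(lam, threshold=0.0):
    """Return 'verb' or 'noun', or None when |lambda| does not exceed the threshold."""
    if lam > threshold:
        return "verb"
    if lam < -threshold:
        return "noun"
    return None
```

With a threshold of 0 the rule always decides unless the two probabilities are equal; a larger threshold trades recall for precision.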

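5 Estimating Probabilities
Based on attachment counts from a training corpus
Maximum likelihood estimates: $P(prep \mid v) = C(v, prep) / C(v)$ and $P(prep \mid n) = C(n, prep) / C(n)$, where C(·) is a count over the training corpus
How to count from an unlabeled ambiguous corpus? (Circularity problem)
Some cases are unambiguous:
– The road to London is long (no verb precedes the PP, so it attaches to the noun)
– Moscow sent him to Afghanistan (a pronoun object does not take the PP, so it attaches to the verb)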

6 Heuristic Bootstrapping and Ambiguous Counting
1. Produce initial estimates (model) by counting all unambiguous cases.
2. Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold.
   E.g. |λ| > 2, meaning one attachment is at least 4 times more likely than the other.
3. Count each remaining ambiguous case as 0.5 for each attachment.
Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts.
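
The three steps above can be sketched as follows. This is a simplified illustration, not the Hindle & Rooth implementation: it assumes λ is a base-2 log ratio, uses add-k smoothing for the initial model (k and n_preps are arbitrary illustrative values), and expects the total occurrence count of each head word in a hypothetical head_totals table.

```python
import math
from collections import Counter

def smoothed_prob(pair_counts, head_totals, head, prep, k=0.5, n_preps=50):
    # Add-k estimate of P(prep | head); k and n_preps are illustrative assumptions.
    return (pair_counts[(head, prep)] + k) / (head_totals[head] + k * n_preps)

def bootstrap_counts(unambiguous, ambiguous, head_totals, threshold=2.0):
    """unambiguous: (head, prep) pairs; ambiguous: (verb, noun, prep) triples."""
    # Step 1: the initial model counts unambiguous cases only.
    initial = Counter(unambiguous)
    counts = Counter(initial)
    for verb, noun, prep in ambiguous:
        lam = math.log2(smoothed_prob(initial, head_totals, verb, prep)
                        / smoothed_prob(initial, head_totals, noun, prep))
        if lam > threshold:        # Step 2: confident verb attachment
            counts[(verb, prep)] += 1
        elif lam < -threshold:     # Step 2: confident noun attachment
            counts[(noun, prep)] += 1
        else:                      # Step 3: split the count evenly
            counts[(verb, prep)] += 0.5
            counts[(noun, prep)] += 0.5
    return counts
```

The returned counts would then be used to re-estimate P(prep|v) and P(prep|n) for the final model.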

7 Example Decision
Moscow sent soldiers into Afghanistan
Verb attachment is 70 times more likely

8 Hindle & Rooth Evaluation
H&R results for a somewhat richer model:
– 80% correct if we always make a choice
– 91.7% precision at 55.2% recall, when requiring |λ| > 3 for classification
Notice that the probability ratio doesn’t distinguish between decisions based on high vs. low frequencies.

9 Possible Extensions
Consider an a-priori structural preference for “low” attachment (to the noun)
Consider the lexical head of the PP:
– I saw the bird with the telescope
– I met the man with the telescope
Such additional factors can be incorporated easily, assuming their independence
Address more complex types of attachment, such as chains of several PPs
Similar attachment ambiguities arise within noun compounds: [N [N N]] vs. [[N N] N]

10 Classify by Best Single Feature: Decision List
Training: for each feature, measure its “entailment score” for each class, and register the class with the highest score
– Sort all features by decreasing score
Classification: for a given example, identify the highest entailment score among all “active” features, and select the corresponding class
– Test the features in decreasing score order until the first success → output the relevant class
– Default decision: the majority class
For multiple classes per example: may apply a threshold on the feature-class entailment score
Suitable when relatively few strong features indicate the class (compare to manually written rules)
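
A minimal sketch of decision list training and classification. The slide does not spell out the entailment score, so the code assumes a smoothed estimate of P(class | feature); the function names, the smoothing constant alpha, and the data layout are all illustrative choices.

```python
from collections import Counter, defaultdict

def train_decision_list(examples, alpha=0.1):
    """examples: list of (set_of_features, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feature_class = defaultdict(Counter)
    for features, label in examples:
        for f in features:
            feature_class[f][label] += 1
    classes = sorted(class_counts)
    rules = []
    for f, counts in feature_class.items():
        total = sum(counts.values())
        best = max(classes, key=lambda c: counts[c])
        # Assumed score: add-alpha smoothed P(best class | feature).
        score = (counts[best] + alpha) / (total + alpha * len(classes))
        rules.append((score, f, best))
    rules.sort(key=lambda r: r[0], reverse=True)   # decreasing score
    default = class_counts.most_common(1)[0][0]    # majority class
    return rules, default

def classify(rules, default, features):
    # The first active feature in score order decides the class.
    for _, f, label in rules:
        if f in features:
            return label
    return default
```

Only one rule ever fires per example, which is what distinguishes this scheme from classifiers that combine the evidence of all active features.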

11 Example: Accent Restoration
(David Yarowsky, 1994): for French and Spanish
Classes: alternative accent restorations for words in text without accent marking
Example: côte (coast) vs. côté (side)
A variant of the general word sense disambiguation problem; “one sense per collocation” motivates using decision lists
Similar tasks:
– Capitalization restoration in ALL-CAPS text
– Homograph disambiguation in speech synthesis (wind as noun and verb)

12 Accent Restoration - Features
Word-form collocation features:
– Single words in window: ±1, ±k (20-50)
– Word pairs at specified offsets (complex features)
– Easy to implement

13 Accent Restoration - Features
Local syntactic-based features (for Spanish)
– Use a morphological analyzer
– Lemmatized features: generalizing over inflections
– POS of adjacent words as features
– Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)

14 Accent Restoration – Decision Score
Probabilities estimated from training statistics, taken from a corpus with accents
Smoothing: add a small constant to all counts
Pruning:
– Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo vs. WEEKDAY; w1 w2 vs. w1)
– Cross validation: remove features that cause more errors than correct classifications on held-out data

15 “Add-1/Add-Constant” Smoothing
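
A common form of add-constant smoothing adds a constant k to every count and renormalizes: P(x) = (c(x) + k) / (N + k·V), where N is the total count and V the number of possible outcomes; k = 1 gives add-1 (Laplace) smoothing. The sketch below assumes that form; the function and parameter names are illustrative.

```python
def add_constant_estimate(count, total, n_outcomes, k=1.0):
    """Smoothed probability (count + k) / (total + k * n_outcomes)."""
    return (count + k) / (total + k * n_outcomes)

# Unseen outcomes get a small non-zero probability, and the estimates
# over all outcomes still sum to 1.
counts = {"seen_often": 9, "seen_once": 1, "unseen": 0}
total = sum(counts.values())
probs = {x: add_constant_estimate(c, total, len(counts), k=0.5)
         for x, c in counts.items()}
```

Smaller values of k move less probability mass to unseen outcomes, which matters when most possible outcomes are unseen.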

16 Accent Restoration – Results
Agreement with the accented test corpus for ambiguous words: 98%
– Vs. 93% for the baseline of the most frequent form
– The accented test corpus also includes errors
Worked well for most of the highly ambiguous cases (see random sample in next slide)
Results slightly better than Naive Bayes (weighing multiple features)
– Consistent with a related study on binary homograph disambiguation, where combining multiple features almost always agrees with using the single best feature
– Incorporating many low-confidence features may introduce noise that overrides the strong features

17 Accent Restoration – Tough Examples

18 Related Application: Anaphora Resolution
(Dagan, Justeson, Lappin, Leass, Ribak 1995)
The terrorist pulled the grenade from his pocket and threw it at the policeman
– it → ?
Traditional AI-style approach: manually encoded semantic preferences/constraints
– Weapon > Bombs > grenade
– Actions > Cause_movement > throw, drop

19 Statistical Approach
“Semantic” judgment from a corpus (text collection): one alternative occurs 20 times, the other 1 time
Statistics can be acquired from unambiguous (non-anaphoric) occurrences in a raw (English) corpus (cf. PP attachment)
Semantic confidence combined with syntactic preferences → it = grenade
“Language modeling” for disambiguation

20 Word Sense Disambiguation for Machine Translation
I bought soap bars: sense1 (‘chafisa’) or sense2 (‘sorag’)?
I bought window bars: sense1 (‘chafisa’) or sense2 (‘sorag’)?
Corpus (text collection):
– Sense1: 20 times, 15 times
– Sense2: 17 times, 22 times
Features: co-occurrence within distinguished syntactic relations
“Hidden” senses: manual labeling required(?)

21 Solution: Mapping to Target Language
English(-English)-Hebrew Dictionary:
– bar1 → ‘chafisa’
– bar2 → ‘sorag’
– soap → ‘sabon’
– window → ‘chalon’
Map ambiguous “relations” to the second language (all possibilities), counting each alternative in a Hebrew corpus:
– alternative 1: 20 times; alternative 2: 0 times
– alternative 1: 0 times; alternative 2: 15 times
Exploiting the difference in ambiguities between the languages
Principle: intersecting redundancies (Dagan and Itai 1994)

22 The Selection Model
Constructed to choose (classify) the right translation for a complete relation rather than for one word at a time
– Both words in a relation might be ambiguous, with their translations dependent upon each other
Assuming a multinomial model, under certain linguistic assumptions
– The multinomial variable: a source relation
– Each alternative translation of the relation is a possible outcome of the variable

23 An Example Sentence A Hebrew sentence with 3 ambiguous words: The alternative translations to English:

24 Example - Relational Representation

25 Selection Model
We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one j): $\ln(p_i / p_j)$
Estimation is based on smoothed counts
A potential problem: the odds ratio for probabilities doesn’t reflect the absolute counts from which the probabilities were estimated.
– E.g., a count of 3 vs. (smoothed) 0
Solution: use a one-sided confidence interval (lower bound) for the odds ratio

26 Confidence Interval (for a proportion)
Given an estimate, how confident are we that it is “correct”, or at least close enough to the true value?

27 Confidence Interval (cont.)
Approximating by a normal distribution: the distribution of the sampled proportion (across samples) approaches a normal distribution for large n.
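
Under the normal approximation, a two-sided interval for a proportion takes the standard form p̂ ± z·sqrt(p̂(1 − p̂)/n). The sketch below assumes that textbook form, since the slide's own formula is not reproduced in the transcript; the function name and the default confidence level are illustrative.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a proportion estimated from n trials."""
    p_hat = successes / n
    # z-value leaving (1 - confidence) / 2 probability in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width
```

For the same observed proportion the interval narrows as n grows, which is the property the selection model relies on when it distinguishes high-count from low-count evidence.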

28 Confidence Interval (cont.)

29 Selection Model (cont.)
The distribution of the log of the odds ratio (across samples) converges to a normal distribution
Selection “confidence” score for a single relation: the lower bound for the log odds ratio
The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ.
Notice the roles of θ vs. α, and the impact of n1, n2
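
A sketch of such a confidence score. It assumes the lower bound has the form ln(n1/n2) − Z(1−α)·sqrt(1/n1 + 1/n2), where n1 and n2 are the (smoothed, non-zero) counts of the two most frequent alternatives; this form is an assumption for illustration, since the slide's formula is not in the transcript, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def selection_confidence(n1, n2, alpha=0.1):
    """One-sided lower bound on ln(n1 / n2) at confidence level 1 - alpha.

    Assumes n1 >= n2 > 0 (smoothed counts) and a normal approximation with
    variance 1/n1 + 1/n2 for the log ratio.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return math.log(n1 / n2) - z * math.sqrt(1 / n1 + 1 / n2)
```

With the same ratio, larger counts give a higher bound: 30 vs. 10 scores higher than 3 vs. 1. The threshold θ sets how large the odds must be, while α sets how much statistical confidence is demanded of the estimate.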

30 Handling Multiple Relations in a Sentence: Constraint Propagation
1. Compute Conf(i) for each ambiguous source relation.
2. Pick the source relation with the highest Conf(i). If Conf(i) < θ, or if no source relations are left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list.
3. Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that now become unambiguous.
4. Go to step 2.
Notice the similarity to the decision list algorithm
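
The loop above can be sketched as follows. The data layout is hypothetical: each source relation maps to a list of (translation, count) alternatives, where a translation is a dict from source word to target word. The conf helper assumes a lower bound of the form ln(n1/n2) − Z(1−α)·sqrt(1/n1 + 1/n2) on add-0.5 smoothed counts, which is an illustrative assumption rather than the slide's exact formula.

```python
import math
from statistics import NormalDist

def conf(counts, alpha=0.1, smooth=0.5):
    """Assumed lower bound on ln(n1/n2) for the two most frequent alternatives."""
    ranked = sorted((c + smooth for c in counts), reverse=True)
    n1, n2 = ranked[0], ranked[1]
    z = NormalDist().inv_cdf(1 - alpha)
    return math.log(n1 / n2) - z * math.sqrt(1 / n1 + 1 / n2)

def select_translations(relations, theta=0.2, alpha=0.1):
    """relations: {name: [(translation_dict, count), ...]}; returns chosen word translations."""
    remaining = {r: list(alts) for r, alts in relations.items() if len(alts) > 1}
    chosen = {}
    while remaining:
        # Steps 1-2: score every ambiguous relation and pick the most confident one.
        scores = {r: conf([c for _, c in alts], alpha) for r, alts in remaining.items()}
        best = max(scores, key=scores.get)
        if scores[best] < theta:
            break
        translation, _ = max(remaining.pop(best), key=lambda alt: alt[1])
        chosen.update(translation)
        # Step 3: drop contradicting alternatives and relations that became unambiguous.
        for r in list(remaining):
            consistent = [(t, c) for t, c in remaining[r]
                          if all(chosen.get(w, tw) == tw for w, tw in t.items())]
            if len(consistent) > 1:
                remaining[r] = consistent
            else:
                del remaining[r]
    return chosen
```

Each iteration commits to the single most confident decision and lets it constrain the rest, which mirrors how a decision list fires only its highest-scoring rule.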

31 Selection Algorithm Example

32 Evaluation Results
Results for Hebrew → English translation:
– Coverage: ~70%
– Precision within coverage: ~90%
– ~20% improvement over choosing the most frequent translation (95% statistical confidence for an improvement relative to this common baseline)

33 Analysis
Correct selections capture:
– Clear semantic preferences: sign/seal treaty
– Lexical collocation usage: peace treaty/contract
No selection:
– Mostly: no statistics for any alternative (data sparseness), e.g. investigator/researcher of corruption
– Also: similar statistics for several alternatives
– Solutions:
   – Consult more features in remote (vs. syntactic) context: prime minister … take position/job
   – Class/similarity-based generalizations (corruption-crime)

34 Analysis (cont.)
Confusing multiple sources (senses) for the same target relation:
– ‘sikkuy’ (chance/prospect) ‘kattan’ (small/young)
Valid (frequent) target relations:
– small chance: correct
– young prospect: incorrect, because “young prospect” is the translation of another Hebrew expression, ‘tikva’ (hope) ‘zeira’ (young)
The “soundness” assumption of the multinomial model is violated:
– The model assumes that counting the generated target relations corresponds to sampling the source relation, hence assuming a known 1:n mapping (completeness is also assumed, which is another source of errors)
– Potential solutions: bilingual corpus, “reverse” translation

35 Sense Translation Model: Summary
Classification instance: a relation with multiple words, rather than a single word at a time, to capture immediate (“circular”) dependencies
Make local decisions, based on a single feature
Take into account the statistical confidence of decisions
Constraint propagation for multiple dependent classifications (remote dependencies)
Decision-list-style rationale: classifying by a single piece of high-confidence evidence is simpler, and may work better, than considering all weaker evidence simultaneously
– Computing statistical confidence for a combination of multiple events is difficult; it is easier to do for one event at a time
Statistical classification scenario (model) constructed for the linguistic setting
– Important to identify explicitly the underlying model assumptions, and to analyze the resulting errors

36 Word Sense Disambiguation
Many words have multiple meanings
– E.g., river bank, financial bank
Problem: assign the proper sense to each ambiguous word in text
Applications:
– Machine translation
– Information retrieval (mixed evidence)
– Semantic interpretation of text

37 Compare to POS Tagging?
Idea: treat sense disambiguation like POS tagging, just with “semantic tags”
The problems differ:
– POS tags depend on specific structural cues: mostly neighboring, and thus dependent, tags
– Senses depend on semantic context: less structured, longer-distance dependencies → many relatively independent/unstructured features

38 Approaches
Supervised learning: learn from a pre-tagged corpus
Dictionary-based learning: learn to distinguish senses from dictionary entries
Unsupervised learning: automatically cluster word occurrences into different senses

39 Using an Aligned Bilingual Corpus
Goal: get sense tagging cheaply
Use correlations between phrases in two languages to disambiguate
E.g., interest = ‘legal share’ (acquire an interest) or ‘attention’ (show interest)
– In German: Beteiligung erwerben vs. Interesse zeigen
For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation
Limited to senses that are discriminated by the other language; suitable for disambiguation in translation
Gale, Church and Yarowsky (1992)

40 Evaluation
Train and test on pre-tagged (or bilingual) texts
– Difficult to come by
Artificial data, cheap to train and test: ‘merge’ two words to form an ‘ambiguous’ word with two ‘senses’
– E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which
– Useful for developing sense disambiguation methods
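
Building such artificial data takes only a few lines. The sketch below is illustrative (the function name and return layout are assumptions): it replaces both words with the merged pseudo-word and keeps the original word as the gold “sense” label for each replaced position.

```python
def make_pseudoword_data(tokens, word_a, word_b):
    """Merge word_a and word_b into one pseudo-word; return tokens and gold labels."""
    pseudo = word_a + word_b
    merged, labels = [], []
    for i, tok in enumerate(tokens):
        if tok in (word_a, word_b):
            merged.append(pseudo)
            labels.append((i, tok))   # position and the original word (the "sense")
        else:
            merged.append(tok)
    return merged, labels

tokens = "open the door and close the window".split()
merged, gold = make_pseudoword_data(tokens, "door", "window")
```

A disambiguation method is then run on the merged tokens and scored against the gold labels, with no manual tagging required.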

41 Performance Bounds
How good is (say) 83.2%?
Evaluate performance relative to lower and upper bounds:
– Baseline performance: how well does the simplest “reasonable” algorithm do? E.g., compare to selecting the most frequent sense
– Human performance: what percentage of the time do people agree on the classification?
The nature of the senses used impacts accuracy levels
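
The most-frequent-sense lower bound can be computed as in the sketch below; the function name and the list-of-labels input format are illustrative assumptions.

```python
from collections import Counter

def most_frequent_sense_baseline(train_labels, test_labels):
    """Always predict the majority training sense; return it with its test accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)
```

A reported accuracy such as 83.2% is informative only relative to this number: if the majority sense already covers 80% of the test occurrences, the system adds little.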
