1 SIMS 290-2: Applied Natural Language Processing
Marti Hearst
November 8, 2004
2 Today
Using Very Very Large Corpora
Banko & Brill: Confusion Pair Disambiguation
Lapata & Keller: Several NLP Subtasks
Cucerzan & Brill: Spelling Correction
Clustering: Introduction
3 Algorithm or Data?
Given a text analysis problem and a limited amount of time to solve it, should we:
Try to develop smarter algorithms, or
Find more training data?
Banko & Brill ’01
Identified a problem with a billion words of training data
Compared using more data to:
–using different classification algorithms
–using voting algorithms
–using active learning
–using semi-supervised learning
4 Banko & Brill ’01
Example problem: confusion set disambiguation
{principle, principal}
{then, than}
{to, two, too}
{whether, weather}
Current approaches include:
Latent semantic analysis
Differential grammars
Decision lists
A variety of Bayesian classifiers
5 Banko & Brill ’01
Collected a 1-billion-word English training corpus
3 orders of magnitude larger than the largest corpus previously used for this problem
Consisted of:
–News articles
–Scientific abstracts
–Government transcripts
–Literature
–Etc.
Test set:
–1 million words of WSJ text (none used in training)
6 Training on a Huge Corpus
Each learner was trained at several cutoff points
First 1 million words, then 5M, etc.
Items drawn by probabilistically sampling sentences from the different sources, weighted by source size (see the sketch below).
Learners:
–Naïve Bayes, perceptron, winnow, memory-based
Results:
Accuracy continues to increase log-linearly even out to 1 billion words of training data
BUT the size of the trained model also increases log-linearly as a function of training set size.
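A rough Python sketch of sampling training items in proportion to source size; the source names and sizes are invented for illustration and are not the actual composition of the Banko & Brill corpus.

```python
import random

# Hypothetical sources and sizes (in sentences) -- illustrative only,
# not the real corpus breakdown.
sources = {
    "news": 600_000_000,
    "scientific_abstracts": 150_000_000,
    "government_transcripts": 120_000_000,
    "literature": 80_000_000,
}

def sample_sources(sources, n):
    """Draw n source choices, each with probability proportional to its size."""
    names = list(sources)
    weights = [sources[name] for name in names]
    return random.choices(names, weights=weights, k=n)

# One sentence would then be drawn from each chosen source until the
# current cutoff (1M words, 5M words, ...) is reached.
print(sample_sources(sources, 5))
```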
7 Banko & Brill ‘01
8 Comparing to Voting Algorithms
What about voting algorithms?
Ran Naïve Bayes, perceptron, and winnow.
Voting was done by “combining the normalized score each learner assigned” (sketched below).
Results:
Little gain beyond 1M words of training data
Starts to perform worse than a single classifier
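A minimal sketch of that score-combination voting, assuming each learner produces a dictionary of per-candidate scores; the interface and numbers below are made up for illustration, not the authors' implementation.

```python
def vote(scores_per_learner):
    """Combine normalized per-candidate scores from several learners and
    return the candidate with the highest combined score."""
    combined = {}
    for scores in scores_per_learner:
        total = sum(scores.values()) or 1.0   # normalize each learner's scores
        for label, s in scores.items():
            combined[label] = combined.get(label, 0.0) + s / total
    return max(combined, key=combined.get)

# Example: three learners scoring the confusion set {then, than}
print(vote([{"then": 3.0, "than": 1.0},
            {"then": 0.4, "than": 0.6},
            {"then": 5.0, "than": 4.0}]))   # -> "then"
```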
9 Banko & Brill ‘01
10 Comparing to Active Learning
Problem: In most cases huge volumes of labeled text are not available.
How can we use large unlabeled corpora?
Active Learning: Intelligently select examples for hand-labeling
Choose the most uncertain examples
Has mainly been studied for small training sets
Determining uncertainty:
Look at scores assigned across examples for a given learner; use the relative values to find the tough cases
Committee-based sampling: choose the examples with the most disagreement among several algorithms
11 Comparing to Active Learning
Bagging (to create a committee of classifiers):
A variant of the original training set is constructed by randomly sampling sentences with replacement
Produces N new training sets of size K
–Train N models and run them all on the same test set
–Compare the classification from each model on each test example
–The higher the disagreement, the tougher the training example
–Select the M toughest examples for hand-labeling (see the sketch below)
They used a variation in which M/2 of the chosen sentences have high disagreement and M/2 are random
–Otherwise the learners become too biased toward the hard cases and don’t work as well in general.
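A sketch of that selection step, assuming each bagged model exposes a `predict` method and measuring disagreement by how small the majority vote is; this is an illustration of the scheme described above, not the authors' code.

```python
import random
from collections import Counter

def select_for_labeling(models, unlabeled, M):
    """Pick M examples to hand-label: M/2 with the highest committee
    disagreement and M/2 drawn at random from the rest."""
    def disagreement(example):
        votes = Counter(model.predict(example) for model in models)
        # Fewer votes for the majority label == more disagreement.
        return -votes.most_common(1)[0][1]

    ranked = sorted(unlabeled, key=disagreement, reverse=True)
    hard = ranked[: M // 2]
    rest = [x for x in unlabeled if x not in hard]
    return hard + random.sample(rest, M // 2)
```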
12 Banko & Brill ‘01
13 Active Learning
Results:
Sample selection outperforms sequential sampling of training data
However, the larger the pool of candidate instances, the better the results
This is a way to improve results with more unlabeled training data, at a fixed cost for hand-annotation
Thus we can benefit from large corpora even if they aren’t all labeled
14 Weakly Supervised Learning
It would be ideal to use the large unlabeled corpus without additional hand-labeling.
Unsupervised learning:
Many approaches have been proposed
Usually start with a “seed” set of labels and then use unlabeled data to continue training
Use bagging again
This time choose the most certain instances to add to the training set.
If all the classifier models agree on a given example, then label the example and put it in the training set (sketched below).
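A minimal sketch of one round of that self-training step, assuming each bagged model exposes a `predict` method; retraining on the enlarged labeled set would follow each round (not shown).

```python
def self_train_round(models, labeled, unlabeled):
    """Move unanimously classified examples from the unlabeled pool into
    the labeled training set, keeping the rest for later rounds."""
    still_unlabeled = []
    for example in unlabeled:
        predictions = {model.predict(example) for model in models}
        if len(predictions) == 1:              # unanimous agreement: treat as certain
            labeled.append((example, predictions.pop()))
        else:
            still_unlabeled.append(example)
    return labeled, still_unlabeled
```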
15 Banko & Brill ‘01
16 Banko & Brill ‘01
17 Weakly Supervised Learning
Results:
Accuracy improves over the initial seed set
However, it drops off after a certain point
Training on the large labeled corpus is still better.
18 Lapata & Keller ’04
On 6 different NLP problems, compared:
Web-based n-grams (unsupervised)
BNC-based n-grams (a smaller corpus)
The state-of-the-art supervised algorithm.
In all cases, the Web-based n-grams were:
The same as or better than the BNC-based n-grams
As good as or nearly as good as the state-of-the-art algorithm
Thus, they propose using the Web as a baseline against which new algorithms should be compared.
19 Computing Web N-grams
Find the number of hits for each term via AltaVista
This gives document frequency, not term frequency
Smooth 0 counts to 0.5
All terms lower case
Three different types of queries (sketched below):
Literal queries
–Double quotes around the term
NEAR queries
–A NEAR B: within a 10-word window
Inflected queries
–Expand each term to all morphological variations
“history change” “histories change” “history changes” …
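A sketch of how the three query types might be constructed, plus the 0-to-0.5 smoothing rule. The morphological variants are supplied by hand here, and the actual hit counts would come from a search engine (AltaVista at the time), which is not called in this snippet.

```python
def query_variants(term, morph_variants):
    """Build literal, NEAR, and inflected query strings for a two-word term.
    `morph_variants` maps each word to its inflected forms."""
    w1, w2 = term.split()
    literal = [f'"{term}"']
    near = [f'{w1} NEAR {w2}']
    inflected = [f'"{a} {b}"'
                 for a in morph_variants.get(w1, [w1])
                 for b in morph_variants.get(w2, [w2])]
    return literal, near, inflected

def smoothed_count(hits):
    """Smooth zero counts to 0.5, as described above."""
    return hits if hits > 0 else 0.5

print(query_variants("history change",
                     {"history": ["history", "histories"],
                      "change": ["change", "changes", "changed", "changing"]}))
```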
20 Target Word Selection for Machine Translation
E.g., decide between 5 alternate translations:
Die Geschichte ändert sich, nicht jedoch die Geographie.
{History, story, tale, saga, strip} changes but geography does not.
Looked at a test set of verb-object pairs from Prescher et al. ’00
Assume that the target translation of the verb is known
Select the noun that is semantically most compatible from the candidate translations.
Web statistics:
Retrieve web counts for all possible (inflected) verb-object translations
Choose the most frequent one (sketched below)
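A sketch of the selection rule, with a hypothetical `hit_count` function standing in for the web counts; the counts below are faked purely for illustration.

```python
def choose_translation(verb_form, candidates, hit_count):
    """Pick the candidate noun translation whose pairing with the known
    verb translation is most frequent on the web."""
    return max(candidates, key=lambda noun: hit_count(f'"{noun} {verb_form}"'))

# Usage sketch with made-up counts for the slide's example:
fake_counts = {'"history changes"': 120000, '"story changes"': 45000,
               '"tale changes"': 800, '"saga changes"': 300,
               '"strip changes"': 900}
best = choose_translation("changes",
                          ["history", "story", "tale", "saga", "strip"],
                          lambda q: fake_counts.get(q, 0.5))
print(best)  # -> "history"
```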
21 Results on MT term selection
22 Confusion Set Spelling Correction
Confusion sets as before: {principle, principal}
Method: Use collocation counts as features
–Words next to the target word
For each word in the confusion set:
–Use the web to estimate how frequently it co-occurs with a word or a pair of words immediately to its left or right
–Disambiguate by selecting the word in the confusion set with the highest co-occurrence frequency
–Ties go to the most frequently occurring term
Results:
Best result obtained with one word to the left and one to the right of the target word: f(w1, t, w2) (sketched below).
Not as good as the best supervised method, but far simpler, and far better than the baseline.
–Many fewer features; doesn’t use POS information
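The f(w1, t, w2) selection can be sketched as follows, again with `hit_count` standing in for a web count lookup (an assumed function, not a real API).

```python
def correct_confusable(w1, target, w2, confusion_set, hit_count):
    """Replace `target` with whichever member of its confusion set occurs
    most often between the left and right context words, i.e. the highest
    f(w1, t, w2) web count.  Tie-breaking by overall word frequency,
    mentioned above, is not shown."""
    return max(confusion_set, key=lambda t: hit_count(f'"{w1} {t} {w2}"'))

# e.g. "... the (principle|principal) said ..."
# correct_confusable("the", "principle", "said", {"principle", "principal"}, hit_count)
```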
24 Ordering of Pre-nominal Adjectives
Which adjective comes before which others?
The small, wiggly, cute black puppy.
*The small, black, wiggly cute puppy.
*The wiggly, small cute puppy.
Approach:
Tested on Malouf ’00’s adjective pair set
Choose the adjective order with the higher web frequency, using literal queries
Results:
No significant difference between this and the state-of-the-art approach
–Which uses a back-off bigram model, positional probabilities, and a memory-based learner with many features
25 Ordering of Pre-nominal Adjectives
26 Compound Noun Bracketing
Which way do the nouns group?
[acute [migraine treatment]]
[[acute migraine] treatment]
Current best model (Lauer ’95) uses a thesaurus and taxonomy in an unsupervised manner
Method:
Compare the probability of left-branching to the probability of right-branching (as in Lauer ’95), as sketched below
But estimate the probabilities from web counts
–Inflected queries and the NEAR operator
Results:
Far better than the baseline; no significant difference from the best model
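A much-simplified sketch of the comparison (raw adjacency counts only, not Lauer's full probability model); `count` is an assumed web-count function.

```python
def bracket(n1, n2, n3, count):
    """Choose between left- and right-branching analyses of the compound
    (n1 n2 n3) by comparing web counts of the competing sub-compounds."""
    left = count(f'"{n1} {n2}"')     # evidence for [[n1 n2] n3]
    right = count(f'"{n2} {n3}"')    # evidence for [n1 [n2 n3]]
    return f"[[{n1} {n2}] {n3}]" if left > right else f"[{n1} [{n2} {n3}]]"

# bracket("acute", "migraine", "treatment", count)
```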
28 Noun Compound “Interpretation”
Determine the semantic relation between the nouns
onion tears -> CAUSED-BY
pet spray -> FOR
Method:
Look for prepositions that tend to indicate the relation
Used inflected queries; inserted determiners before nouns:
–story/stories about the/a/0 war/wars
Results:
Best scores obtained for f(n1, p, n2)
Significantly outperforms the best existing algorithm
29 Noun Compound “Interpretation”
30 Lapata & Keller ’04 Summary
Simple, unsupervised models using web counts can be devised for a variety of NLP tasks
For 4 of the 6 tasks, web counts beat BNC counts
In all but 1 case, the simple approach does not significantly outperform the best supervised model.
–Most of these supervised models use linguistic or taxonomic information in addition to being supervised.
But the web-based approach is not significantly different from the best model for 3 of the 6 problems.
And it is much better than the baseline for all of them.
So: the web provides a baseline that must be beaten in order to declare a new algorithm useful.
31 Cucerzan and Brill ’04
Spelling correction for the web using query logs
A harder task than traditional spelling correction
Have to handle:
–Proper names, new terms, company names, etc.: blog, shrek, nsync
–Multi-word phrases
–Frequent and severe spelling errors (10-15%)
–Very short contexts
Existing approaches:
–Rely on a dictionary for comparison
–Assume a single “point change”: insertion, deletion, transposition, substitution
–Don’t handle word substitution
32 Spelling Correction Algorithm
Main idea: Iteratively transform the query into other strings that correspond to more likely queries.
Use statistics from query logs to determine likelihood (sketched below).
–Despite the fact that many logged queries are themselves misspelled
–Assume that the less wrong a misspelling is, the more frequent it is, and that the correct spelling is more frequent than any misspelling
Example: ditroitigers -> detroittigers -> detroit tigers
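A simplified sketch of the iterative idea, ignoring the Viterbi search and the constraints of the real algorithm; `log_freq` and `alternatives` are assumed inputs, and the toy data below just replays the slide's example.

```python
def iterative_correct(query, log_freq, alternatives, max_rounds=3):
    """Repeatedly move to a nearby query string that is more frequent in
    the query log, stopping at a fixed point."""
    current = query
    for _ in range(max_rounds):
        candidates = alternatives(current) + [current]
        best = max(candidates, key=lambda q: log_freq.get(q, 0))
        if best == current:      # no more-frequent neighbor: stop
            break
        current = best
    return current

fake_log = {"ditroitigers": 5, "detroittigers": 40, "detroit tigers": 50000}
fake_alternatives = {"ditroitigers": ["detroittigers"],
                     "detroittigers": ["detroit tigers"]}
print(iterative_correct("ditroitigers", fake_log,
                        lambda q: fake_alternatives.get(q, [])))
# -> "detroit tigers"
```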
33 Cucerzan and Brill ‘04
34 Spelling Correction Algorithm
Algorithm:
Compute the set of all possible alternatives for each word in the query
–Look at word unigrams and bigrams from the logs
–This handles concatenation and splitting of words
Find the best possible alternative string to the input
–Do this efficiently with a modified Viterbi algorithm
Constraints:
No two adjacent in-vocabulary words can change simultaneously
Short queries have further (unstated) restrictions
In-vocabulary words can’t be changed in the first round of iteration
35 Spelling Correction Algorithm
Comparing string similarity
Damerau-Levenshtein edit distance:
–The minimum number of point changes required to transform one string into another (see the implementation sketch below)
Trading off distance-function leniency:
A rule that allows only one letter change can’t fix:
–dondal duck -> donald duck
A too-permissive rule makes too many errors:
–log wood -> dog food
Actual measure: “a modified context-dependent weighted Damerau-Levenshtein edit function”
–Point changes: insertion, deletion, substitution, immediate transpositions, long-distance movement of letters
–“Weights interactively refined using statistics from query logs”
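For reference, the standard unweighted (restricted) Damerau-Levenshtein distance looks like this; the paper's measure is a weighted, context-dependent variant with additional operations, which is not reproduced here.

```python
def damerau_levenshtein(a, b):
    """Minimum number of insertions, deletions, substitutions, and adjacent
    transpositions needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("teh", "the"))                  # 1 (one transposition)
print(damerau_levenshtein("dondal duck", "donald duck"))  # 2: more than one point change
```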
36 Spelling Correction Evaluation
Emphasizing recall
First evaluation: 1044 randomly chosen queries
Annotated by two people (91.3% agreement)
180 were misspelled; annotators provided corrections
81.1% system agreement with the annotators
–131 false positives: 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja zx6r
–156 suggestions for the misspelled queries
2 iterations were sufficient for most corrections
Problem: annotators were guessing user intent
37 Spelling Correction Evaluation
Second evaluation: Try to find a misspelling followed by its correction
–Sample successive pairs of queries from the log
Must be sent by the same user
Must differ from one another by a small edit distance
–Present the pair to human annotators for verification and placement into the gold standard
The paper doesn’t say how many pairs in total
38 Spelling Correction Evaluation
Results on this set: 73.1% accuracy
Disagreed with the gold standard 99 times; 80 suggestions
–40 of these were bad
–15 were functionally equivalent (audio file vs. audio files)
–17 were different valid suggestions (phone listings vs. telephone listings)
–8 found errors in the gold standard (brandy sniffers)
85.5% when correct or reasonable suggestions both count
Sent an unspecified subset of the errors to Google’s spellchecker
–Its agreement with the gold standard was slightly lower
39 Spell Checking: Summary
Can use the collective knowledge stored in query logs
Works quite well despite the noisiness of the data
Exploits the errors made by people
Might be further improved by incorporating text from other domains
40 Very Large Corpora: Summary
When building unsupervised NLP algorithms:
Simple algorithms applied to large datasets can perform as well as state-of-the-art algorithms (in some cases)
–If your algorithm can’t do better than a simple unsupervised web-based query, don’t publish it.
Active learning (choosing the most uncertain items to hand-label) can reduce the amount of labeling needed
However, algorithms still aren’t 100% correct, so it isn’t proven that more data is enough; good algorithms are needed to go the extra mile.
A smaller labeled subset can provide close to the same results as a large one, if annotated intelligently
41 What Is Clustering?
Adapted from a slide by Iwona Białynicka-Birula
Clustering: the act of grouping similar objects into sets
Classification vs. Clustering:
Classification assigns objects to predefined groups
Clustering infers groups based on inter-object similarity
–Best used for exploration, rather than presentation
42 Classification vs. Clustering
Adapted from a slide by www.kdnuggets.com/dmcourse
Classification: Supervised learning
Learns a method for predicting the instance class from pre-labeled (classified) instances
43 Clustering
Adapted from a slide by www.kdnuggets.com/dmcourse
Unsupervised learning: Finds a “natural” grouping of instances given unlabeled data
44 Clustering Methods
Adapted from a slide by www.kdnuggets.com/dmcourse
Many different methods and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
45 Clusters: Exclusive vs. Overlapping
Adapted from a slide by www.kdnuggets.com/dmcourse
(Figure: a simple 2-D representation of non-overlapping clusters vs. a Venn diagram of overlapping clusters over items a-k.)
46 Clustering Evaluation
Adapted from a slide by www.kdnuggets.com/dmcourse
Manual inspection
Benchmarking against existing labels
Cluster quality measures:
–Distance measures
–High similarity within a cluster, low across clusters
47 The Distance Function
Adapted from a slide by www.kdnuggets.com/dmcourse
Simplest case: one numeric attribute A
–Distance(X,Y) = A(X) – A(Y)
Several numeric attributes:
–Distance(X,Y) = Euclidean distance between X and Y
Nominal attributes: distance is 1 if the values are different, 0 if they are equal
Are all attributes equally important?
–Weighting the attributes might be necessary (sketched below)
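A sketch of such a mixed distance function with optional attribute weights; the attribute names in the usage example are illustrative.

```python
from math import sqrt

def distance(x, y, numeric_attrs, nominal_attrs, weights=None):
    """Distance between two instances (dicts): squared differences over the
    numeric attributes, 0/1 mismatch over the nominal ones, each optionally
    weighted, combined Euclidean-style."""
    weights = weights or {}
    total = 0.0
    for a in numeric_attrs:
        total += weights.get(a, 1.0) * (x[a] - y[a]) ** 2
    for a in nominal_attrs:
        total += weights.get(a, 1.0) * (0.0 if x[a] == y[a] else 1.0)
    return sqrt(total)

print(distance({"temp": 85, "outlook": "sunny"},
               {"temp": 70, "outlook": "rainy"},
               numeric_attrs=["temp"], nominal_attrs=["outlook"]))
```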
48 Simple Clustering: K-means
Adapted from a slide by www.kdnuggets.com/dmcourse
Works with numeric data only
1) Pick a number K of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
(A minimal implementation is sketched below.)
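A minimal K-means implementation for 2-D points following the four steps above, with convergence taken as "assignments stop changing"; the example points are made up.

```python
import random

def kmeans(points, k, max_iters=100):
    """Plain K-means on 2-D points (tuples): random initial centers, assign
    each point to its nearest center (squared Euclidean distance), move each
    center to the mean of its points, repeat until assignments stabilize."""
    centers = random.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                              + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old center if a cluster empties
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignment

pts = [(1, 1), (1.5, 2), (1, 0.6), (5, 7), (8, 8), (9, 11)]
print(kmeans(pts, 2))
```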
49 K-means Example, Step 1
Adapted from a slide by www.kdnuggets.com/dmcourse
Pick 3 initial cluster centers (randomly)
(Figure: 2-D scatter of points with centers k1, k2, k3 marked.)
50 K-means Example, Step 2
Adapted from a slide by www.kdnuggets.com/dmcourse
Assign each point to the closest cluster center
(Figure: points assigned to centers k1, k2, k3.)
51 K-means Example, Step 3
Adapted from a slide by www.kdnuggets.com/dmcourse
Move each cluster center to the mean of its cluster
(Figure: centers k1, k2, k3 shifted to the cluster means.)
52 K-means Example, Step 4
Adapted from a slide by www.kdnuggets.com/dmcourse
Reassign points that are now closest to a different cluster center
Q: Which points are reassigned?
(Figure: 2-D scatter with the moved centers k1, k2, k3.)
53 K-means Example, Step 4 (continued)
Adapted from a slide by www.kdnuggets.com/dmcourse
A: three points are reassigned
(Figure: the three reassigned points highlighted near centers k1, k2, k3.)
54 K-means Example, Step 4b
Adapted from a slide by www.kdnuggets.com/dmcourse
Re-compute the cluster means
(Figure: new means computed for clusters k1, k2, k3.)
55 K-means Example, Step 5
Adapted from a slide by www.kdnuggets.com/dmcourse
Move the cluster centers to the cluster means
(Figure: centers k1, k2, k3 moved to the new means.)
56 Discussion
Adapted from a slide by www.kdnuggets.com/dmcourse
The result can vary significantly depending on the initial choice of seeds
Can get trapped in a local minimum
(Figure: example of a poor outcome, showing the instances and the initial cluster centers.)
To increase the chance of finding the global optimum: restart with different random seeds
57 K-means Clustering Summary
Adapted from a slide by www.kdnuggets.com/dmcourse
Advantages:
Simple, understandable
Items automatically assigned to clusters
Disadvantages:
Must pick the number of clusters beforehand
All items are forced into a cluster
Too sensitive to outliers
58 K-means Variations
Adapted from a slide by www.kdnuggets.com/dmcourse
K-medoids: instead of the mean, use the median of each cluster
–Mean of 1, 3, 5, 7, 9 is 5
–Mean of 1, 3, 5, 7, 1009 is 205
–Median of 1, 3, 5, 7, 1009 is 5
Median advantage: not affected by extreme values
For large databases, use sampling
59 Hierarchical Clustering
Adapted from a slide by www.kdnuggets.com/dmcourse
Bottom up (agglomerative):
Start with single-instance clusters
At each step, join the two closest clusters (sketched below)
Design decision: distance between clusters
–E.g., two closest instances in the clusters vs. distance between cluster means
Top down (divisive):
Start with one universal cluster
Find two clusters
Proceed recursively on each subset
Can be very fast
Both methods produce a dendrogram
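A minimal bottom-up (agglomerative) sketch using single-link distance (the closest pair of instances) as the cluster-distance design decision; `dist` can be any instance-level distance function, such as the one sketched earlier.

```python
def agglomerative(points, dist, target_clusters=1):
    """Start with single-instance clusters and repeatedly join the two
    closest clusters (single-link), recording each merge so a dendrogram
    could be drawn from the result."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Example with 1-D points and absolute difference as the distance:
print(agglomerative([1, 2, 9, 10, 25], dist=lambda a, b: abs(a - b),
                    target_clusters=2))
```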
60 Hierarchical Clustering: Example
Adapted from a slide by Yair Even-Zohar
(Figure: five instances a-e clustered over steps 0-4. Step 1 joins a and b, step 2 joins d and e, step 3 adds c to {d, e}, and step 4 merges everything into {a, b, c, d, e}; agglomerative clustering reads the steps forward, divisive clustering reads them in reverse.)
61 Incremental Clustering
Adapted from a slide by www.kdnuggets.com/dmcourse
Heuristic approach (COBWEB)
Forms a hierarchy of clusters incrementally
Start: tree consists of an empty root node
Then: add instances one by one, updating the tree appropriately at each stage
–To update, find the right leaf for an instance
–May involve restructuring the tree
Base update decisions on category utility
62 Clustering Weather/Tennis Data
Adapted from a slide by www.kdnuggets.com/dmcourse
ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True
(Figure: stages 1-3 of the incrementally built cluster tree.)
63 Clustering Weather/Tennis Data (continued)
Adapted from a slide by www.kdnuggets.com/dmcourse
(Same weather/tennis table as on the previous slide; figure shows stages 3-5 of the cluster tree.)
Merge the best host and the runner-up
Consider splitting the best host if merging doesn’t help
64 Final Hierarchy
Adapted from a slide by www.kdnuggets.com/dmcourse
ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
(Figure: the final cluster hierarchy.)
Oops! a and b are actually very similar
65 Clustering Summary
Use clustering to find the main groups in the data
An inexact method; many parameters must be set
Results are not readily understandable enough to show to everyday users of a search system
Evaluation is difficult
–The typical method is to see whether the items in different clusters differ strongly from one another
–This doesn’t tell much about how understandable the clusters are
The important thing is to inspect the clusters to get ideas about the characteristics of the collection.