Project 4
Brian Nisonger, Shauna Eggers, Joshua Johanson
Our Project: Combination Systems
We tried to improve tagging accuracy by combining three different taggers.
Taggers: MaxEnt, TBL, Trigram
Methods: Voting, Weighted Voting (invented by Joshua), Naïve Bayes
Observations
Our combination system improved the overall tagging results.
We invented a new type of voting and a new way of splitting the data that show promise.
In applying the standard combination methods to tagging, we made a few tweaks.
Voting
Method
Our voting algorithm was quite simple. We trained three taggers on the data, then applied the following rule:
A = T(n) for W(n) from the Trigram tagger
B = T(n) for W(n) from TBL
C = T(n) for W(n) from MaxEnt
If A = C, then A; else B
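A minimal sketch of this rule in Python, where T(n) is a tagger's tag for word W(n); the per-token tag lists are hypothetical inputs, not part of the original system:

```python
def vote(trigram_tags, tbl_tags, maxent_tags):
    """Combine per-token outputs of the three taggers with the rule above:
    if Trigram (A) and MaxEnt (C) agree, take their tag; otherwise
    fall back to TBL's tag (B)."""
    combined = []
    for a, b, c in zip(trigram_tags, tbl_tags, maxent_tags):
        combined.append(a if a == c else b)
    return combined

# The three taggers disagree on "pull":
print(vote(["NN"], ["VBP"], ["VB"]))  # -> ['VBP'] (Trigram and MaxEnt differ, so TBL wins)
```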
Voting Method
[Diagram: the three taggers are trained on the training data; their outputs feed the voter, which produces the final output]
Splitting the Data
We used 83% of the training data to train the taggers. After being trained, they are tested on the remaining 17%, and that output is used to train the combination systems.
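A minimal sketch of this split; the 83/17 proportions come from the slide, while the corpus representation is illustrative:

```python
def split_corpus(sentences, train_frac=0.83):
    """First 83% trains the taggers; the remaining 17% is held out,
    tagged by the trained taggers, and used to train the combiner."""
    cut = int(len(sentences) * train_frac)
    return sentences[:cut], sentences[cut:]

corpus = [f"tagged sentence {i}" for i in range(100)]  # stand-in corpus
tagger_train, held_out = split_corpus(corpus)
print(len(tagger_train), len(held_out))  # 83 17
```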
Weighted Voter
[Diagram: the taggers are trained on the training data; their outputs on the held-out split train the weighted voter, which then combines the taggers' outputs into the final output]
Weighted Voter
Example output: pull/NN, pull/VBP, pull/VB
When this is the output, what is the most likely tag?
How often is the tagger right when it outputs this tag? What about on this specific word?
If two taggers both think it's some kind of verb, isn't it more likely to be a verb? (similarity)
Probabilities
i) P(t | w, t1, t2, t3)
ii) P(t | w, t1)
iii) P(t | w, t2)
iv) P(t | w, t3)
v) P(t | t1, t2, t3)
vi) P(t | t1)
vii) P(t | t2)
viii) P(t | t3)
Example: pull/NN, pull/VBP, pull/VB
i) P(t | pull, NN, VBP, VB)
ii) P(t | pull, NN)
iii) P(t | pull, VBP)
iv) P(t | pull, VB)
v) P(t | NN, VBP, VB)
vi) P(t | NN)
vii) P(t | VBP)
viii) P(t | VB)
How do you put all of these probabilities together?
Trial and error:
Multiply them – huge smoothing issue.
Add them – not very mathematical.
Weight certain probabilities higher.
Which did I end up with?
Add them together with weights.
How did you get the weights?
Complicated mathematical equations? Sorry, no can do.
Train the weights? I didn't have time.
Try different weights until I get good results. Isn't that cheating? Oh well.
What were the weights?
i) P(t | w, t1, t2, t3) – 50
ii) P(t | w, t1) – 1
iii) P(t | w, t2) – 1
iv) P(t | w, t3) – 1
v) P(t | t1, t2, t3) – 6
vi) P(t | t1) – 2
vii) P(t | t2) – 2
viii) P(t | t3) – 2
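A sketch of the weighted sum as I read it from these slides: each candidate tag's score is the sum of the eight conditional probabilities, each multiplied by its weight. The probability values here are illustrative stand-ins, not the project's actual tables:

```python
# Weights for the eight probabilities, in the order listed above.
WEIGHTS = [50, 1, 1, 1, 6, 2, 2, 2]

def score_tag(probs):
    """probs lists the eight conditional probabilities for one candidate tag:
    [P(t|w,t1,t2,t3), P(t|w,t1), P(t|w,t2), P(t|w,t3),
     P(t|t1,t2,t3),   P(t|t1),   P(t|t2),   P(t|t3)]
    The score is their weighted sum."""
    return sum(w * p for w, p in zip(WEIGHTS, probs))

def pick_tag(candidates):
    """candidates maps each candidate tag to its eight probabilities;
    the tag with the highest weighted sum wins."""
    return max(candidates, key=lambda t: score_tag(candidates[t]))

# Illustrative numbers only (not from the project's data):
candidates = {
    "VB": [0.5, 0.2, 0.4, 0.6, 0.4, 0.1, 0.3, 0.6],
    "NN": [0.3, 0.7, 0.1, 0.1, 0.2, 0.6, 0.1, 0.1],
}
print(pick_tag(candidates))  # -> 'VB'
```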
Future Work (not really)
Get more sophisticated, mathematically sound probabilities – for example, look at Naïve Bayes.
Actually train the weights, instead of just fudging until I get good numbers. That would require more splitting of the data.
Naïve Bayes
Naive Bayes: Model
Model: the selected tag t for the input word w is the one that maximizes, for k taggers,

    t* = argmax over t of P(t | w) P(t1 | t, w) P(t2 | t, w) ... P(tk | t, w)

where
P(t | w) = probability that t is the correct tag for w
P(ti | t, w) = probability that tagger hi produces tag ti for w
Naive Bayes: Model Derivation
Probability that tag t is correct for word w, given the set of hypotheses 1 through k:

    P(t | w, t1, ..., tk)

Bayes Magic:

    = P(t | w) P(t1, ..., tk | t, w) / P(t1, ..., tk | w)

Remove denominator (it is the same for every candidate t, so the argmax is unchanged):

    ∝ P(t | w) P(t1, ..., tk | t, w)

"Naivete" (independence assumption):

    ≈ P(t | w) P(t1 | t, w) P(t2 | t, w) ... P(tk | t, w)

Voila!
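A sketch of the resulting decision rule, assuming smoothed (nonzero) estimates of the two probabilities are already available; `p_tag` and `p_tagger_output` are hypothetical lookup functions, not part of the original system:

```python
import math

def naive_bayes_tag(w, hypotheses, p_tag, p_tagger_output):
    """Pick the tag t maximizing P(t|w) * prod_i P(t_i | t, w).
    hypotheses: tags t_1..t_k produced by the k taggers for word w.
    p_tag(t, w): estimate of P(t | w).
    p_tagger_output(i, t_i, t, w): estimate of P(t_i | t, w) for tagger i.
    Summing logs avoids underflow from multiplying small probabilities."""
    def log_score(t):
        s = math.log(p_tag(t, w))
        for i, t_i in enumerate(hypotheses):
            s += math.log(p_tagger_output(i, t_i, t, w))
        return s
    # Assumption: only consider tags that at least one tagger proposed.
    return max(set(hypotheses), key=log_score)
```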
Naive Bayes: Comparison to Parsing Model
(Henderson and Brill 1999)
Notice: the second probability is calculated for the correct hypothesis, not the produced hypothesis.
Applying a direct analog of this approach to tagging did not produce any improvement over the baseline taggers; it consistently produced the average of the baseline results.
Why different for H&B? (...or was it?)
Naive Bayes: Parameter Estimation
1. Probability that t is correct for w: P(t | w) = C(w, t) / C(w)
2. Probability that, when t is correct, tagger hi produces tag ti: P(ti | t, w) = Ci(w, t, ti) / C(w, t)
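A sketch of estimating these two parameters by relative frequency from the held-out split; the (word, gold tag, tagger tags) row format is my assumption, not the project's actual data layout:

```python
from collections import Counter

def estimate_params(held_out, k):
    """held_out: iterable of (w, t, [t_1..t_k]) rows, where t is the gold
    tag and t_i is tagger i's output.  Returns count tables from which
    P(t|w) = c_wt[(w,t)] / c_w[w] and
    P(t_i|t,w) = c_wtti[i][(w,t,t_i)] / c_wt[(w,t)]."""
    c_w, c_wt = Counter(), Counter()
    c_wtti = [Counter() for _ in range(k)]
    for w, t, tagger_tags in held_out:
        c_w[w] += 1
        c_wt[(w, t)] += 1
        for i, t_i in enumerate(tagger_tags):
            c_wtti[i][(w, t, t_i)] += 1
    return c_w, c_wt, c_wtti
```

Dividing the counts gives the two estimates; the next two slides handle unknown words and zero counts.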
Naive Bayes: Unknown Words
When words in the hypotheses were unknown to the taggers, treat them all the same way:
Convert all unknown words in the tagger outputs (the combiner inputs) to "OtherUnk".
Use the parameters for OtherUnk.
This did not make a terribly big difference in the output: only an average improvement. (But it didn't hurt either.)
Naive Bayes: Smoothing
Witten-Bell-ish: assign zero-frequency items the same value as the lowest-frequency items.
For each parameter, use the smallest value seen in the training data as the smoothing value.
Not Witten-Bell by definition, since the lowest-frequency item is not necessarily a singleton, but basically the same idea.
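A minimal sketch of this floor smoothing, reusing the count tables from the estimation sketch above; the helper names are mine:

```python
def floor_for_param(counts, context_counts, context_of):
    """Smallest relative frequency seen in training for one parameter;
    used as the probability of zero-frequency events."""
    return min(counts[k] / context_counts[context_of(k)] for k in counts)

def smoothed_prob(num, denom, floor):
    """Relative frequency with a floor for unseen events."""
    return num / denom if num > 0 and denom > 0 else floor

# e.g. for P(t|w), with c_wt keyed by (w, t) and c_w keyed by w:
# floor_tw = floor_for_param(c_wt, c_w, context_of=lambda key: key[0])
```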
Results (tagging accuracy, %; columns are training-set sizes)

                  1k           5k           10k          40k
Trigram           85.68/77.93  92.12/88.15  93.44/90.55  95.36/95.18
TBL               91.56/91.50  94.42/94.25  94.95/95.06  96.25/96.30
MaxEnt            88.31/87.38  93.55/92.75  94.63/94.45  96.34/96.26
Voting            91.99/91.45  94.42        95.55/       96.74/96.64
Weighted Voting   91.70        94.70        95.39        96.83/96.80
Bayes             88.98        93.80        95.22        97.18
Results
[Chart of the results above]
Splitting the Data
Is splitting the data bad?
Yes. It gives the tagger less data to train on, which means worse results for the tagger.
Then why don't you just split away less data?
Then you have less data to train the combination system.
Can't you get the input to the combination system from tagging the training data?
You can't run the tagger on the same data you trained on. We are learning the reliability of the tagger, and this artificially inflates the reliability.
Once you train the combination system on the results from the taggers trained on the split data, can you test it with the results from the taggers trained on the whole data set?
Technically, no. The combination system learns the reliability of each tagger, and if you change the data the tagger is trained on, you change the reliability of that tagger.
Isn't the reliability of the taggers trained on the split data close enough to that of the taggers trained on the whole data set?
Let's find out!
Weighted Voter
[Diagram: the weighted voter is trained on output from taggers trained on the split data, then applied to output from taggers trained on the full training data]
Does it work?
It slightly improves the weighted voter.
It decreases the Naïve Bayes.
Why?
The accuracy of the taggers goes up, so the overall accuracy goes up.
The combination systems are learning the reliability of a tagger, and the taggers were changed. This reduces their ability to predict the right reliability, so accuracy goes down.
Naïve Bayes is more sensitive to a change in taggers than the weighted voter.
If we can use the output from taggers trained on one set of data to train the combination system, and the output from taggers trained on another set of data to test it, then it doesn't matter how we split the data.
[Diagram: the training data divided into several portions, with a tagger trained on each portion's complement and run on the held-out portion]
We can then combine the output of these taggers trained on the different portions of the training data.
[Diagram: the per-portion outputs concatenated into one combined output]
Each of these segments is the result of a tagger being tested on unseen data, but together they give you how the tagger would have tagged the entire data set if it had never seen it.
This gives you a lot more data for the combination system, increasing its accuracy. You could then split this data, making it possible to train the weights.
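A sketch of this jackknife-style scheme: split the corpus into k portions, train a tagger on each portion's complement, tag the held-out portion, and concatenate the outputs so the combiner sees the entire corpus tagged as unseen data. `train_tagger` and `run_tagger` are hypothetical stand-ins for the real taggers:

```python
def jackknife_outputs(sentences, k, train_tagger, run_tagger):
    """Produce tagger output for every sentence in the corpus without the
    tagger ever having seen that sentence during training."""
    folds = [sentences[i::k] for i in range(k)]
    combined = []
    for i, fold in enumerate(folds):
        # Train on everything except this fold...
        rest = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_tagger(rest)
        # ...then tag the held-out fold with that model.
        combined.extend(run_tagger(model, fold))
    return combined
```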
Increasing the training data increases accuracy, but not enough for Naïve Bayes to recover from the loss of accuracy caused by using different data to train the tagger.
So what can we do about it?
The loss of accuracy came from the reliability of the output of taggers trained on the split data differing from that of taggers trained on all of the data. Take smaller slices.
[Diagram: the training data cut into thinner slices]
With smaller slices, the difference between the split and unsplit training data is smaller, so the taggers should be more similar, helping the combination systems correctly predict the reliability of the tagger.
So can we do it?
Taking smaller slices is very expensive, especially for TBL. If the tagger were retractable, we might be able to produce the training data without having to rerun the system several times.
Trigram Model
Train the Trigram model on the whole training data.
For each sentence in the training data, calculate what the probabilities would have been if the tagger had not been trained on that sentence.
Tag the sentence based on the new probabilities.
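A sketch of what retracting a sentence could look like for a count-based trigram tagger: subtract the sentence's own counts from the global counts before computing its tag probabilities, so each sentence is effectively tagged by a model that never saw it. The count structures here are my assumption about such a tagger's internals, not the project's actual implementation:

```python
from collections import Counter

def sentence_counts(sentence):
    """Tag-trigram and word/tag counts contributed by one tagged sentence
    (a list of (word, tag) pairs)."""
    trigrams, emissions = Counter(), Counter()
    tags = [t for _, t in sentence]
    for i in range(len(tags) - 2):
        trigrams[tuple(tags[i:i + 3])] += 1
    for w, t in sentence:
        emissions[(w, t)] += 1
    return trigrams, emissions

def leave_one_out_counts(global_tri, global_emit, sentence):
    """Counts as they would have been had this sentence not been in the
    training data: global counts minus the sentence's contribution."""
    tri, emit = sentence_counts(sentence)
    return global_tri - tri, global_emit - emit  # Counter '-' floors at zero
```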
Conclusions
Combination methods:
A simple method like voting works surprisingly well.
Naïve Bayes needs a lot of training data to show improvement, but when it does, the difference is substantial.
Our new methods for improving basic voting show consistently better results than voting by itself.
Data preparation:
The weighted voting method was further improved by more intelligent splitting of the data.
Applying the new splitting techniques to Naïve Bayes needs some investigation to see if there is any improvement.