Project 4
Brian Nisonger, Shauna Eggers, Joshua Johanson
Our Project: Combination Systems
We tried to improve tagging accuracy by combining three different taggers.
Taggers: MaxEnt, TBL, Trigram
Methods: Voting, Weighted Voting (invented by Joshua), Naïve Bayes
Observations
Our combination system improved the overall tagging results.
We invented a new type of voting and a new way of splitting the data that show promise.
In applying the standard combination methods to tagging, we made a few tweaks.
Voting
Method
Our voting algorithm was quite simple. We trained three taggers on the data, then applied the following rule:
A = T(n) for W(n) from the Trigram tagger
B = T(n) for W(n) from TBL
C = T(n) for W(n) from MaxEnt
If A = C, then A; else B
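A minimal sketch of this rule in Python, where T(n) is a tagger's tag for word W(n); the per-token tag lists are hypothetical inputs, not part of the original system:

```python
def vote(trigram_tags, tbl_tags, maxent_tags):
    """Combine per-token outputs of the three taggers with the rule above:
    if Trigram (A) and MaxEnt (C) agree, take their tag; otherwise
    fall back to TBL's tag (B)."""
    combined = []
    for a, b, c in zip(trigram_tags, tbl_tags, maxent_tags):
        combined.append(a if a == c else b)
    return combined

# The three taggers disagree on "pull":
print(vote(["NN"], ["VBP"], ["VB"]))  # -> ['VBP'] (Trigram and MaxEnt differ, so TBL wins)
```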
Voting Method
[Diagram: the three taggers are trained on the training data; their outputs feed the voter, which produces the final output]
Splitting the Data
We used 83% of the training data to train the taggers. After being trained, they are tested on the remaining 17%, and that output is used to train the combination systems.
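A minimal sketch of this split; the 83/17 proportions come from the slide, while the corpus representation is illustrative:

```python
def split_corpus(sentences, train_frac=0.83):
    """First 83% trains the taggers; the remaining 17% is held out,
    tagged by the trained taggers, and used to train the combiner."""
    cut = int(len(sentences) * train_frac)
    return sentences[:cut], sentences[cut:]

corpus = [f"tagged sentence {i}" for i in range(100)]  # stand-in corpus
tagger_train, held_out = split_corpus(corpus)
print(len(tagger_train), len(held_out))  # 83 17
```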
Weighted Voter
[Diagram: the taggers are trained on the training data; their outputs on the held-out split train the weighted voter, which then combines the taggers' outputs into the final output]
Weighted Voter
Example output: pull/NN, pull/VBP, pull/VB
When this is the output, what is the most likely tag?
How often is the tagger right when it outputs this tag? What about on this specific word?
If two taggers both think it's some kind of verb, isn't it more likely to be a verb? (similarity)
Probabilities
i) P(t | w, t1, t2, t3)
ii) P(t | w, t1)
iii) P(t | w, t2)
iv) P(t | w, t3)
v) P(t | t1, t2, t3)
vi) P(t | t1)
vii) P(t | t2)
viii) P(t | t3)
Example: pull/NN, pull/VBP, pull/VB
i) P(t | pull, NN, VBP, VB)
ii) P(t | pull, NN)
iii) P(t | pull, VBP)
iv) P(t | pull, VB)
v) P(t | NN, VBP, VB)
vi) P(t | NN)
vii) P(t | VBP)
viii) P(t | VB)
How do you put all of these probabilities together?
Trial and error:
Multiply them – huge smoothing issue.
Add them – not very mathematical.
Weight certain probabilities higher.
Which did I end up with?
Add them together with weights.
How did you get the weights?
Complicated mathematical equations? Sorry, no can do.
Train the weights? I didn't have time.
Try different weights until I get good results. Isn't that cheating? Oh well.
What were the weights?
i) P(t | w, t1, t2, t3) – 50
ii) P(t | w, t1) – 1
iii) P(t | w, t2) – 1
iv) P(t | w, t3) – 1
v) P(t | t1, t2, t3) – 6
vi) P(t | t1) – 2
vii) P(t | t2) – 2
viii) P(t | t3) – 2
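A sketch of the weighted sum as I read it from these slides: each candidate tag's score is the sum of the eight conditional probabilities, each multiplied by its weight. The probability values here are illustrative stand-ins, not the project's actual tables:

```python
# Weights for the eight probabilities, in the order listed above.
WEIGHTS = [50, 1, 1, 1, 6, 2, 2, 2]

def score_tag(probs):
    """probs lists the eight conditional probabilities for one candidate tag:
    [P(t|w,t1,t2,t3), P(t|w,t1), P(t|w,t2), P(t|w,t3),
     P(t|t1,t2,t3),   P(t|t1),   P(t|t2),   P(t|t3)]
    The score is their weighted sum."""
    return sum(w * p for w, p in zip(WEIGHTS, probs))

def pick_tag(candidates):
    """candidates maps each candidate tag to its eight probabilities;
    the tag with the highest weighted sum wins."""
    return max(candidates, key=lambda t: score_tag(candidates[t]))

# Illustrative numbers only (not from the project's data):
candidates = {
    "VB": [0.5, 0.2, 0.4, 0.6, 0.4, 0.1, 0.3, 0.6],
    "NN": [0.3, 0.7, 0.1, 0.1, 0.2, 0.6, 0.1, 0.1],
}
print(pick_tag(candidates))  # -> 'VB'
```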
Future Work (not really)
Get more sophisticated, mathematically sound probabilities – for example, look at Naïve Bayes.
Actually train the weights, instead of just fudging until I get good numbers. That would require more splitting of the data.
Naïve Bayes
Naive Bayes: Model
Model: the selected tag t for the input word w is the one that maximizes, for k taggers,

    t* = argmax over t of P(t | w) P(t1 | t, w) P(t2 | t, w) ... P(tk | t, w)

where
P(t | w) = probability that t is the correct tag for w
P(ti | t, w) = probability that tagger hi produces tag ti for w
Naive Bayes: Model Derivation
Probability that tag t is correct for word w, given the set of hypotheses 1 through k:

    P(t | w, t1, ..., tk)

Bayes Magic:

    = P(t | w) P(t1, ..., tk | t, w) / P(t1, ..., tk | w)

Remove denominator (it is the same for every candidate t, so the argmax is unchanged):

    ∝ P(t | w) P(t1, ..., tk | t, w)

"Naivete" (independence assumption):

    ≈ P(t | w) P(t1 | t, w) P(t2 | t, w) ... P(tk | t, w)

Voila!
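A sketch of the resulting decision rule, assuming smoothed (nonzero) estimates of the two probabilities are already available; `p_tag` and `p_tagger_output` are hypothetical lookup functions, not part of the original system:

```python
import math

def naive_bayes_tag(w, hypotheses, p_tag, p_tagger_output):
    """Pick the tag t maximizing P(t|w) * prod_i P(t_i | t, w).
    hypotheses: tags t_1..t_k produced by the k taggers for word w.
    p_tag(t, w): estimate of P(t | w).
    p_tagger_output(i, t_i, t, w): estimate of P(t_i | t, w) for tagger i.
    Summing logs avoids underflow from multiplying small probabilities."""
    def log_score(t):
        s = math.log(p_tag(t, w))
        for i, t_i in enumerate(hypotheses):
            s += math.log(p_tagger_output(i, t_i, t, w))
        return s
    # Assumption: only consider tags that at least one tagger proposed.
    return max(set(hypotheses), key=log_score)
```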
Naive Bayes: Comparison to Parsing Model
(Henderson and Brill 1999)
Notice: the second probability is calculated for the correct hypothesis, not the produced hypothesis.
Applying a direct analog of this approach to tagging did not produce any improvement over the baseline taggers; it consistently produced the average of the baseline results.
Why different for H&B? (...or was it?)
Naive Bayes: Parameter Estimation
1. Probability that t is correct for w: P(t | w) = C(w, t) / C(w)
2. Probability that, when t is correct, tagger hi produces tag ti: P(ti | t, w) = Ci(w, t, ti) / C(w, t)
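A sketch of estimating these two parameters by relative frequency from the held-out split; the (word, gold tag, tagger tags) row format is my assumption, not the project's actual data layout:

```python
from collections import Counter

def estimate_params(held_out, k):
    """held_out: iterable of (w, t, [t_1..t_k]) rows, where t is the gold
    tag and t_i is tagger i's output.  Returns count tables from which
    P(t|w) = c_wt[(w,t)] / c_w[w] and
    P(t_i|t,w) = c_wtti[i][(w,t,t_i)] / c_wt[(w,t)]."""
    c_w, c_wt = Counter(), Counter()
    c_wtti = [Counter() for _ in range(k)]
    for w, t, tagger_tags in held_out:
        c_w[w] += 1
        c_wt[(w, t)] += 1
        for i, t_i in enumerate(tagger_tags):
            c_wtti[i][(w, t, t_i)] += 1
    return c_w, c_wt, c_wtti
```

Dividing the counts gives the two estimates; the next two slides handle unknown words and zero counts.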
Naive Bayes: Unknown Words
When words in the hypotheses were unknown to the taggers, treat them all the same way:
Convert all unknown words in the tagger outputs (the combiner inputs) to "OtherUnk".
Use the parameters for OtherUnk.
This did not make a terribly big difference in the output: only an average improvement. (But it didn't hurt either.)
Naive Bayes: Smoothing
Witten-Bell-ish: assign zero-frequency items the same value as the lowest-frequency items.
For each parameter, use the smallest value seen in the training data as the smoothing value.
Not Witten-Bell by definition, since the lowest-frequency item is not necessarily a singleton, but basically the same idea.
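A minimal sketch of this floor smoothing, reusing the count tables from the estimation sketch above; the helper names are mine:

```python
def floor_for_param(counts, context_counts, context_of):
    """Smallest relative frequency seen in training for one parameter;
    used as the probability of zero-frequency events."""
    return min(counts[k] / context_counts[context_of(k)] for k in counts)

def smoothed_prob(num, denom, floor):
    """Relative frequency with a floor for unseen events."""
    return num / denom if num > 0 and denom > 0 else floor

# e.g. for P(t|w), with c_wt keyed by (w, t) and c_w keyed by w:
# floor_tw = floor_for_param(c_wt, c_w, context_of=lambda key: key[0])
```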
Results (tagging accuracy, %; columns are training-set sizes)

                  1k           5k           10k          40k
Trigram           85.68/77.93  92.12/88.15  93.44/90.55  95.36/95.18
TBL               91.56/91.50  94.42/94.25  94.95/95.06  96.25/96.30
MaxEnt            88.31/87.38  93.55/92.75  94.63/94.45  96.34/96.26
Voting            91.99/91.45  94.42        95.55/       96.74/96.64
Weighted Voting   91.70        94.70        95.39        96.83/96.80
Bayes             88.98        93.80        95.22        97.18
Results
[Chart of the results above]
Splitting the Data
Is splitting the data bad?
Yes. It gives the tagger less data to train on, which means worse results for the tagger.
Then why don't you just split away less data?
Then you have less data to train the combination system.
Can't you get the input to the combination system from tagging the training data?
You can't run the tagger on the same data you trained on. We are learning the reliability of the tagger, and this artificially inflates the reliability.
Once you train the combination system on the results from the taggers trained on the split data, can you test it with the results from the taggers trained on the whole data set?
Technically, no. The combination system learns the reliability of each tagger, and if you change the data the tagger is trained on, you change the reliability of that tagger.
Isn't the reliability of the taggers trained on the split data close enough to that of the taggers trained on the whole data set?
Let's find out!
Weighted Voter
[Diagram: the weighted voter is trained on output from taggers trained on the split data, then applied to output from taggers trained on the full training data]
Does it work?
It slightly improves the weighted voter.
It decreases the Naïve Bayes.
Why?
The accuracy of the taggers goes up, so the overall accuracy goes up.
The combination systems are learning the reliability of a tagger, and the taggers were changed. This reduces their ability to predict the right reliability, so accuracy goes down.
Naïve Bayes is more sensitive to a change in taggers than the weighted voter.
If we can use the output from taggers trained on one set of data to train the combination system, and the output from taggers trained on another set of data to test it, then it doesn't matter how we split the data.
[Diagram: the training data divided into several portions, with a tagger trained on each portion's complement and run on the held-out portion]
We can then combine the output of these taggers trained on the different portions of the training data.
[Diagram: the per-portion outputs concatenated into one combined output]
Each of these segments is the result of a tagger being tested on unseen data, but together they give you how the tagger would have tagged the entire data set if it had never seen it.
This gives you a lot more data for the combination system, increasing its accuracy. You could then split this data, making it possible to train the weights.
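A sketch of this jackknife-style scheme: split the corpus into k portions, train a tagger on each portion's complement, tag the held-out portion, and concatenate the outputs so the combiner sees the entire corpus tagged as unseen data. `train_tagger` and `run_tagger` are hypothetical stand-ins for the real taggers:

```python
def jackknife_outputs(sentences, k, train_tagger, run_tagger):
    """Produce tagger output for every sentence in the corpus without the
    tagger ever having seen that sentence during training."""
    folds = [sentences[i::k] for i in range(k)]
    combined = []
    for i, fold in enumerate(folds):
        # Train on everything except this fold...
        rest = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_tagger(rest)
        # ...then tag the held-out fold with that model.
        combined.extend(run_tagger(model, fold))
    return combined
```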
Increasing the training data increases accuracy, but not enough for Naïve Bayes to recover from the loss of accuracy caused by using different data to train the tagger.
So what can we do about it?
The loss of accuracy came from the reliability of the output of taggers trained on the split data differing from that of taggers trained on all of the data. Take smaller slices.
[Diagram: the training data cut into thinner slices]
With smaller slices, the difference between the split and unsplit training data is smaller, so the taggers should be more similar, helping the combination systems correctly predict the reliability of the tagger.
So can we do it?
Taking smaller slices is very expensive, especially for TBL. If the tagger were retractable, we might be able to produce the training data without having to rerun the system several times.
Trigram Model
Train the Trigram model on the whole training data.
For each sentence in the training data, calculate what the probabilities would have been if the tagger had not been trained on that sentence.
Tag the sentence based on the new probabilities.
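A sketch of what retracting a sentence could look like for a count-based trigram tagger: subtract the sentence's own counts from the global counts before computing its tag probabilities, so each sentence is effectively tagged by a model that never saw it. The count structures here are my assumption about such a tagger's internals, not the project's actual implementation:

```python
from collections import Counter

def sentence_counts(sentence):
    """Tag-trigram and word/tag counts contributed by one tagged sentence
    (a list of (word, tag) pairs)."""
    trigrams, emissions = Counter(), Counter()
    tags = [t for _, t in sentence]
    for i in range(len(tags) - 2):
        trigrams[tuple(tags[i:i + 3])] += 1
    for w, t in sentence:
        emissions[(w, t)] += 1
    return trigrams, emissions

def leave_one_out_counts(global_tri, global_emit, sentence):
    """Counts as they would have been had this sentence not been in the
    training data: global counts minus the sentence's contribution."""
    tri, emit = sentence_counts(sentence)
    return global_tri - tri, global_emit - emit  # Counter '-' floors at zero
```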
Conclusions
Combination methods:
A simple method like voting works surprisingly well.
Naïve Bayes needs a lot of training data to show improvement, but when it does, the difference is substantial.
Our new methods for improving basic voting show consistently better results than voting by itself.
Data preparation:
The weighted voting method was further improved by more intelligent splitting of the data.
Applying the new splitting techniques to Naïve Bayes needs some investigation to see if there is any improvement.