1
2002-08-19 Leif Grönqvist
Tagging a Corpus of Spoken Swedish
Leif Grönqvist
Växjö University, School of Mathematics and Systems Engineering
leifg@gslt.hum.gu.se
NordTalk Summer School, 20th of August 2002
2
Talk outline
- Why?
- How?
- Results!
- What next?
3
Background
- We wanted the Göteborg Spoken Language Corpus (1.3 million tokens) tagged for part-of-speech
- We had no tagged spoken data at all
- But SUC (the Stockholm-Umeå Corpus of written Swedish) was finished at that time
- Our old article, “Tagging Spoken Language Using Written Language Statistics”, gave promising results
4
A new statistical tagger then
We had experience in statistical methods, so the possibilities were the following:
- Tag a lot of data by hand and train the tagger (boring)
- Use, for example, the Baum-Welch algorithm (not as good; Merialdo 1994)
- Go on with our old approach and fine-tune a tagger trained on SUC
5
Reminder: a triclass tagger
- Lexical probabilities: P(w | c)
- Contextual probabilities: P(c_i | c_{i-2}, c_{i-1})
- A Hidden Markov Model with pairs of parts-of-speech in the states
- Words can be emitted in each state
- Find the most probable path given a sequence of emitted words: Viterbi
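The triclass model above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the tagger used for the corpus: the function name `viterbi_trigram`, the start symbol `<s>`, and the dictionary layout of the two probability tables are all made up for the example, and no smoothing is applied.

```python
import math

def viterbi_trigram(words, tags, lex, ctx, start="<s>"):
    """Most probable tag sequence under a trigram (triclass) HMM.

    lex[(word, tag)]    plays the role of P(w | c)
    ctx[(c2, c1, c)]    plays the role of P(c_i | c_{i-2}, c_{i-1})
    Each HMM state is a pair of tags (c_{i-1}, c_i).
    """
    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[state] = log-probability of the best path ending in that state
    best = {(start, start): 0.0}
    back = []  # one back-pointer table per word
    for w in words:
        new_best, pointers = {}, {}
        for (c2, c1), score in best.items():
            for c in tags:
                s = score + logp(ctx, (c2, c1, c)) + logp(lex, (w, c))
                if s == float("-inf"):
                    continue  # impossible transition or emission
                state = (c1, c)
                if state not in new_best or s > new_best[state]:
                    new_best[state] = s
                    pointers[state] = (c2, c1)
        if not new_best:
            raise ValueError(f"no tag sequence has non-zero probability at {w!r}")
        best = new_best
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(best, key=best.get)
    path = []
    for pointers in reversed(back):
        path.append(state[1])
        state = pointers[state]
    return path[::-1]
```

Working in log space avoids underflow on long utterances; a real tagger would also need smoothed tables so that unseen words and tag triples do not get zero probability.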
6
The new statistical model
Differences between spoken and written language:
- New parts-of-speech: Feedback (FB) and Own communication management (OCM) have to be added to the statistical model
- Lexical probabilities: just manual adjustments, examples on slide #9
- Contextual: bootstrapping, i.e. tag the spoken language corpus and use the resulting probabilities
7
Adjustments for lexical probabilities
In addition to the ordinary smoothed MLE from SUC, we estimate probabilities for:
- Ambiguous words: sum over the possible written forms
- Interrupted words: always tagged as OCM
- Strings that can be parsed as numerals: these get high probabilities as numeral
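The three adjustments could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the trailing `+` as the marker for an interrupted word, the tag names `NUM`, `PRON`, `VERB` and `ADV`, the value 0.9 standing in for "high probability", and the tiny numeral parser. The slides do not give these details.

```python
# Toy numeral vocabulary (hypothetical, far from complete).
UNITS = {"en", "ett", "två", "tre", "fyra", "fem", "sex", "sju", "åtta", "nio"}
TENS = {"tjugo", "trettio", "fyrtio", "femtio", "sextio", "sjuttio", "åttio", "nittio"}

def looks_like_numeral(token):
    """True for a unit, a ten, or a ten followed by a unit (e.g. 'tjugofem')."""
    if token in UNITS or token in TENS:
        return True
    return any(token.startswith(t) and token[len(t):] in UNITS for t in TENS)

def lexical_probs(token, suc_lex, written_forms, numeral_mass=0.9):
    """Return {tag: score} for a spoken token, where the score stands in for P(w | c).

    suc_lex[form]         -> {tag: P(form | tag)} estimated from written data
    written_forms[token]  -> list of written forms the spoken token may stand for
    """
    # Interrupted words are always OCM (the '+' marker is an assumption).
    if token.endswith("+"):
        return {"OCM": 1.0}

    # Ambiguous spoken forms: sum over the possible written forms.
    probs = {}
    for form in written_forms.get(token, [token]):
        for tag, p in suc_lex.get(form, {}).items():
            probs[tag] = probs.get(tag, 0.0) + p

    # Strings that parse as numerals get a high numeral score.
    if looks_like_numeral(token):
        probs["NUM"] = max(probs.get("NUM", 0.0), numeral_mass)
    return probs
```

For example, if the spoken form "va" may stand for the written forms "vad" and "var", its score for each tag is the sum of the two written forms' scores for that tag.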
8
More adjustments
- Probabilities for a list of manually selected possible FB and OCM forms are added
- These forms are highly frequent and quite few, typically fewer than 100
- Most of them are ambiguous only between FB and OCM
9
Manual adjustments
Manual adjustments are also made for some other high-frequency words, for example:
- “dom” gets high probabilities as pronoun and determiner, but low as noun
- “att” pronounced as “å” is always an infinitive marker
- “jag” as a noun is very rare in spoken language; setting P=0 improves the results
10
The contextual model
- The base tagger (the SUC model) uses additive smoothing
- The probabilities from SUC are probably not very suitable for spoken language
- OCM and FB have to be added
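Additive smoothing of the contextual probabilities can be sketched as follows. The smoothing constant `lam`, the padding symbol `<s>` and the function name are assumptions; the slides do not say which constant the base tagger uses. The estimate is P(c | c2, c1) = (count(c2, c1, c) + lam) / (count(c2, c1) + lam * |tagset|).

```python
from collections import Counter

def additive_trigram_model(tag_sequences, tagset, lam=1.0):
    """Return a function prob(c2, c1, c) with additively smoothed trigram estimates."""
    tri, bi = Counter(), Counter()
    for seq in tag_sequences:
        padded = ["<s>", "<s>"] + list(seq)
        for c2, c1, c in zip(padded, padded[1:], padded[2:]):
            tri[(c2, c1, c)] += 1   # count of the full tag triple
            bi[(c2, c1)] += 1       # count of the two-tag history
    size = len(tagset)

    def prob(c2, c1, c):
        return (tri[(c2, c1, c)] + lam) / (bi[(c2, c1)] + lam * size)

    return prob
```

Because every tag in `tagset` receives the extra mass `lam`, tags such as FB and OCM that never occur in the written training data still get a small non-zero contextual probability once they are included in the tag set.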
11
Bootstrapping
- We tagged our spoken corpus with the old contextual and the new lexical model
- The results were used to calculate a new contextual model
- This new contextual model includes better estimates for FB and OCM
- All probabilities are in fact adjusted to fit the spoken data better
12
Bootstrapping again
- We tried more iterations (tagging again with the new model), but the result got slightly worse after one more iteration, and even worse after the next one
- So we took the model obtained after the first iteration
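The bootstrapping procedure can be written as a short loop. This is a sketch under assumptions: `tag_fn`, `estimate_fn` and `score_fn` are hypothetical callables standing in for the tagger, the re-estimation of contextual probabilities and the evaluation on a hand-tagged test set, and picking the best-scoring iteration is one possible selection rule, not necessarily the one that was applied.

```python
def bootstrap(corpus, lexical, contextual, tag_fn, estimate_fn, score_fn, max_iters=3):
    """Re-estimate the contextual model from the tagger's own output.

    tag_fn(corpus, lexical, contextual) -> tagged corpus
    estimate_fn(tagged_corpus)          -> new contextual model
    score_fn(contextual)                -> accuracy on held-out data
    Returns the best-scoring contextual model and the score after each step.
    """
    best_model, best_score = contextual, score_fn(contextual)
    history = [best_score]
    model = contextual
    for _ in range(max_iters):
        tagged = tag_fn(corpus, lexical, model)   # tag with the current model
        model = estimate_fn(tagged)               # new contextual probabilities
        score = score_fn(model)
        history.append(score)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, history
```

Keeping the whole score history makes it easy to see the pattern described above, where the first iteration helps and later ones make the result worse.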
13
Results
- We tagged with the SUC tag set and then mapped it to a smaller tag set including the basic parts-of-speech, plus OCM and FB
- On a small test corpus containing ~10 000 words in 819 utterances, the tagger scored 96.16% (±0.48%) before our adjustments and 97.44% (±0.32%) after
- The difference is significant at the 99% level
16
Possible improvements
- The lexical probabilities for unknown words could probably be improved by some kind of suffix analysis
- 40% of the errors belong to five “confusion pairs”; the probabilities for these could be adjusted (the most common is “det” as pronoun or determiner)
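Finding the confusion pairs only requires comparing the tagger output with the hand-tagged test data. The sketch below assumes two aligned lists of tags; the function name and the choice to treat a pair as unordered (pronoun tagged as determiner counts together with determiner tagged as pronoun) are assumptions.

```python
from collections import Counter

def top_confusion_pairs(gold, predicted, k=5):
    """Return the k most frequent unordered confusion pairs and their share of all errors."""
    errors = Counter(
        tuple(sorted((g, p)))            # unordered pair of confused tags
        for g, p in zip(gold, predicted)
        if g != p
    )
    total = sum(errors.values())
    top = errors.most_common(k)
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share
```

With `k=5`, the returned share corresponds to the kind of figure quoted above for the five most common pairs.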
17
Possible improvements 2
- Adjustments in the model for frequent words or sequences
- Adjust the contextual probabilities to obtain Recall=Precision
- Maybe P(w_i | c_{i-1}, c_i) could be used instead of P(w_i | c_i) for frequent words
18
Conclusions
- The tagger works quite well, and so did our method for creating a decent tagger without exactly the right training data
- The tagger could probably be improved
Out of time? Discussion?