
Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013.

Presentation transcript:

1 Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013

2 Semi-Supervised Training Train an HMM with Expectation-Maximization (EM). Needed: a large raw corpus and a tag dictionary [Kupiec, 1992; Merialdo, 1994]. (A sketch of dictionary-constrained training follows.)
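
Below is a minimal sketch of tag-dictionary-constrained HMM training, assuming hard (Viterbi) EM for brevity; Kupiec- and Merialdo-style training uses soft EM (forward-backward expected counts), but the dictionary constraint works the same way. The function names, add-one smoothing, and uniform initialization are illustrative, not the paper's setup.

    import math
    from collections import Counter

    def train_constrained_hmm(corpus, tag_dict, all_tags, iters=10):
        """corpus: list of token lists; tag_dict: word -> set of allowed tags.
        Illustrative sketch: hard EM stands in for forward-backward."""
        vocab = {w for sent in corpus for w in sent}
        trans, emit = Counter(), Counter()          # bigram / emission counts

        def allowed(w):                             # the dictionary constraint:
            return tag_dict.get(w, all_tags)        # unknown words may take any tag

        def p_trans(a, b):                          # add-one smoothing keeps
            total = sum(trans[(a, t)] for t in all_tags)   # probabilities nonzero
            return (trans[(a, b)] + 1.0) / (total + len(all_tags))

        def p_emit(t, w):
            total = sum(emit[(t, v)] for v in vocab)
            return (emit[(t, w)] + 1.0) / (total + len(vocab))

        def viterbi(sent):                          # best path, restricted to
            chart = [{t: (math.log(p_trans('<s>', t)) + math.log(p_emit(t, sent[0])), '<s>')
                      for t in allowed(sent[0])}]   # dictionary-allowed tags
            for w in sent[1:]:
                prev, row = chart[-1], {}
                for t in allowed(w):
                    s, back = max((prev[a][0] + math.log(p_trans(a, t)), a) for a in prev)
                    row[t] = (s + math.log(p_emit(t, w)), back)
                chart.append(row)
            tag = max(chart[-1], key=lambda t: chart[-1][t][0])
            path = [tag]
            for row in reversed(chart[1:]):         # follow backpointers
                tag = row[tag][1]
                path.append(tag)
            return list(reversed(path))

        for _ in range(iters):
            new_trans, new_emit = Counter(), Counter()
            for sent in corpus:
                tags = viterbi(sent)                # E-step: hard assignment
                for a, b in zip(['<s>'] + tags, tags):
                    new_trans[(a, b)] += 1
                for w, t in zip(sent, tags):
                    new_emit[(t, w)] += 1
            trans, emit = new_trans, new_emit       # M-step: re-estimate counts
        return trans, emit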

3 Previous Work: Supervised learning provides high accuracy for POS tagging (Manning, 2011), but performs poorly when little supervision is available. Semi-supervised learning is done by training sequence models such as HMMs with the EM algorithm; work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

4 Previous Work: Goldberg et al. (2008) manually constructed a lexicon for Hebrew to train an HMM tagger; the lexicon was developed over a long period of time by expert lexicographers. Täckström et al. (2013) evaluated mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages; large parallel corpora are required.

5 Low-Resource Languages There are ~6,900 languages in the world, but only ~30 have non-negligible quantities of data, and there is no million-word corpus for any endangered language [Maxwell and Hughes, 2006; Abney and Bird, 2010].

6 Low-Resource Languages Kinyarwanda (KIN): Niger-Congo, morphologically rich. Malagasy (MLG): Austronesian, spoken in Madagascar. Also, English.

7 Collecting Annotations Supervised training is not an option. Semi-supervised training: annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks: type supervision and token supervision. (An illustration of the two annotation formats follows.)
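
A hypothetical rendering of the two annotation formats; the sentence and tags come from the later slides, while the dictionary entries and variable names are illustrative:

    # Type supervision: word type -> set of valid POS tags (illustrative)
    type_annotations = {
        'the': {'DT'},
        'dog': {'NN'},
    }

    # Token supervision: fully tagged sentences
    token_annotations = [
        [('the', 'DT'), ('dog', 'NN'), ('walks', 'VBZ')],
    ]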

8 Tag Dict Generalization These annotations are too sparse! We generalize them to the entire vocabulary.

9 Tag Dict Generalization Haghighi and Klein (2006) do this with a vector space, but we don't have enough raw data. Das and Petrov (2011) do this with a parallel corpus, but we don't have a parallel corpus.

10 Tag Dict Generalization Strategy: label propagation. Connect annotations to raw corpus tokens, then push tag labels to the entire corpus [Talukdar and Crammer, 2009]. (A propagation sketch follows.)
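
A minimal sketch of iterative label propagation over such a graph, assuming a plain neighborhood-averaging update. The paper uses Modified Adsorption [Talukdar and Crammer, 2009], which additionally has per-node injection, continuation, and abandonment weights; this simpler update only conveys the idea of pushing seed labels outward:

    from collections import defaultdict

    def propagate(edges, seeds, iters=20):
        # edges: node -> set of neighbors (undirected graph)
        # seeds: node -> {tag: prob}, fixed distributions from annotations
        labels = {n: dict(seeds.get(n, {})) for n in edges}
        for _ in range(iters):
            new = {}
            for n, nbrs in edges.items():
                if n in seeds:                  # seed nodes keep their labels
                    new[n] = dict(seeds[n])
                    continue
                dist = defaultdict(float)
                for m in nbrs:                  # average neighbor distributions
                    for tag, p in labels.get(m, {}).items():
                        dist[tag] += p
                z = sum(dist.values())
                new[n] = {tag: p / z for tag, p in dist.items()} if z else {}
            labels = new
        return labels                           # node -> tag distribution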

11 Morphological Transducers Finite-state transducers are used for morphological analysis: an FST accepts a word type and produces a set of morphological features. The power of FSTs is that they can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word. (A toy affix-guessing sketch follows.)
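
A toy affix-guessing analyzer in the spirit of the FSTs described here, assuming a hand-listed suffix table; real analyzers are compiled finite-state transducers (built with tools such as XFST or foma), and the suffixes and feature names below are illustrative:

    # Illustrative suffix -> morphological feature table (not from the paper)
    KNOWN_SUFFIXES = {'s': 'PL', 'ed': 'PAST', 'ing': 'PROG'}

    def analyze(word, lexicon):
        # Known word: return its listed features.
        if word in lexicon:
            return word, set(lexicon[word])
        # OOV: strip the longest known suffix and guess the rest is a stem
        # (require the stem to be at least 3 characters long).
        for suf, feat in sorted(KNOWN_SUFFIXES.items(), key=lambda kv: -len(kv[0])):
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[:-len(suf)], {feat}
        return word, set()                      # no analysis found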

12 Tag Dict Generalization [Figure: the label-propagation graph. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_thug_5, TOK_dog_2) connect to type nodes (TYPE_the, TYPE_thug, TYPE_dog), context nodes (PREV_, PREV_the, NEXT_thug, NEXT_walks), and affix nodes (PRE1_t, PRE2_th, SUF1_e, PRE1_d, PRE2_do, SUF1_g).]
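
A sketch of how such a graph might be assembled, assuming the node naming visible in the figure (TOK_, TYPE_, PREV_/NEXT_, PRE1_/PRE2_/SUF1_); the paper's exact feature inventory and tokenization may differ:

    from collections import defaultdict

    def build_graph(corpus):
        edges = defaultdict(set)

        def link(a, b):                         # undirected edge
            edges[a].add(b)
            edges[b].add(a)

        tok_id = 0
        for sent in corpus:
            for i, w in enumerate(sent):
                tok = 'TOK_%s_%d' % (w, tok_id)
                tok_id += 1
                link(tok, 'TYPE_' + w)          # token <-> its word type
                link(tok, 'PREV_' + (sent[i - 1] if i > 0 else ''))
                if i + 1 < len(sent):
                    link(tok, 'NEXT_' + sent[i + 1])
                link('TYPE_' + w, 'PRE1_' + w[:1])   # affix feature nodes
                link('TYPE_' + w, 'PRE2_' + w[:2])
                link('TYPE_' + w, 'SUF1_' + w[-1:])
        return edges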

13 Tag Dict Generalization [Figure: the same graph with type annotations as seeds: dictionary entries the → DT and dog → NN inject labels at the TYPE_the and TYPE_dog nodes.]

14 Tag Dict Generalization [Figure: the injected type labels in place: TYPE_the carries DT and TYPE_dog carries NN, ready to propagate through the graph.]

15 Tag Dict Generalization [Figure: token annotations added alongside the type annotations: the hand-tagged sentence "the dog walks" (DT NN VBZ) seeds labels directly.]

16 Tag Dict Generalization [Figure: the token-level labels in place: TOK_the_4 carries DT and TOK_dog_2 carries NN.]

17 Model Minimization [Ravi et al., 2010; Garrette and Baldridge, 2012] The LP graph has a node for each corpus token, and each node is labeled with a distribution over POS tags, so the graph provides a corpus of sentences labeled with noisy tag distributions. Greedily seek the minimal set of tag bigrams that describes the raw corpus, then train an HMM with EM. (A greedy-selection sketch follows.)
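
A minimal sketch of the greedy bigram selection, framed as plain set cover over adjacent-token positions; the actual procedures in Ravi et al. (2010) and Garrette and Baldridge (2012) include further stages (e.g., tag path completion) that are elided here:

    from collections import defaultdict

    def minimize_bigrams(tag_choices):
        # tag_choices: sentences as lists of tag sets (the tags that LP
        # assigned non-negligible mass to at each token position)
        covers = defaultdict(set)           # bigram -> positions it can cover
        for i, sent in enumerate(tag_choices):
            for j in range(len(sent) - 1):
                for a in sent[j]:
                    for b in sent[j + 1]:
                        covers[(a, b)].add((i, j))
        chosen = set()
        uncovered = {pos for positions in covers.values() for pos in positions}
        while uncovered:                    # greedily pick the bigram that
            best = max(covers, key=lambda bg: len(covers[bg] & uncovered))
            chosen.add(best)                # newly covers the most positions
            uncovered -= covers[best]
        return chosen                       # a small tag-bigram set covering the corpus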

18 Overall Accuracy All of these values were achieved using both FST and affix LP features.

19 Results

20 Types versus Tokens

21 Mixing Type and Token Annotations

22 Morphological Analysis

23 Annotator Experience

24 Conclusion Type annotations are the most useful input from a linguist. We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a linguist who is not a native speaker.

