Presentation transcript: "Morphology technology" (2/5/01)

Slide 1 (2/5/01): Morphology technology

Different applications -- different needs:
– stemmers collapse all forms of a word by pairing each with a "stem"
  – for (CL)IR
  – for (aspects of) language modeling
– analysis/synthesis pairs a wordform with lemma + affixes/features (or more), perhaps with disambiguation in textual context
  – for spelling correction
  – for pronunciation calculation
  – for language modeling
  – for information extraction
  – for translation
All cases can be modeled as relations on strings.
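The two styles of tool on this slide can be sketched as relations on strings. This is an illustrative toy (not from the slides): the word lists, suffixes, and feature labels are invented examples for English.

```python
# Toy sketch of the two kinds of morphology tool described above,
# both modeled as relations on strings.

def stem(wordform):
    """Stemmer: collapse inflected forms by stripping common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if wordform.endswith(suffix) and len(wordform) > len(suffix) + 2:
            return wordform[: -len(suffix)]
    return wordform

# Analyzer: pair a wordform with lemma + features (a toy lexicalized relation).
ANALYSES = {
    "walked":  [("walk", "V+Past")],
    "walks":   [("walk", "V+3sg"), ("walk", "N+Pl")],  # ambiguous: needs context
    "walking": [("walk", "V+Prog")],
}

def analyze(wordform):
    return ANALYSES.get(wordform, [(wordform, "?")])

print(stem("walking"))   # -> walk
print(analyze("walks"))  # -> [('walk', 'V+3sg'), ('walk', 'N+Pl')]
```

The stemmer collapses forms many-to-one; the analyzer keeps the relation many-to-many, which is why only the latter supports synthesis and contextual disambiguation.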

Slide 2: Hand-built engines

Well-understood FST technology:
– broad coverage, accurate, efficient
– 1-2 person-years per language (native speaker / linguist / programmer combination)
– application adaptation is also well understood
More informal approaches:
– often for less demanding applications such as stemmers
– sui generis algorithms developed for a single language
– typically more human effort

Slide 3: Related technology

Probability distribution over alternative analyses:
– can be listed for common forms
– approximated by a weighted transducer for rarer forms
Taggers -- contextual disambiguation of the output of a morphological analyzer:
– several well-understood approaches: HMM, Brill, FST
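Of the tagging approaches listed, the HMM one can be sketched compactly: Viterbi decoding picks the most probable tag sequence given transition and emission probabilities. The probabilities below are invented toy numbers, not estimates from any real corpus.

```python
# Minimal HMM tagger sketch: Viterbi decoding over a two-tag toy model.
import math

TAGS = ["N", "V"]
trans = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("N", "flies"): 0.4, ("V", "flies"): 0.6,
        ("N", "time"): 0.9, ("V", "time"): 0.1}

def viterbi(words):
    # best[t] = (log-prob, tag path) of the best sequence ending in tag t
    best = {t: (math.log(trans[("<s>", t)] * emit[(t, words[0])]), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {t: max(
            ((lp + math.log(trans[(p, t)] * emit[(t, w)]), path + [t])
             for p, (lp, path) in best.items()),
            key=lambda x: x[0]) for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["time", "flies"]))  # -> ['N', 'V']
```

A real tagger would add smoothing and handle unknown words; the point here is just the shape of the contextual-disambiguation computation.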

Slide 4: So what's the problem?

Transducers and taggers cost a lot:
– 1-3 person-years of highly skilled work per language
– or more, from a standing start (e.g. without electronic dictionaries and tagged texts)
Each language takes a long time:
– the work is difficult to parallelize
Only about 15 languages have been "done":
– coverage is uneven even for these
– the natural rate of addition is now slow
– only large, wealthy languages are likely candidates

Slide 5: Obvious solution -- use machine learning

Several approaches have been tried; others look interesting.
Some results are promising, but:
– no overall solutions so far
– different languages, training sets, and tests (if any)
– no comparisons:
  – of different methods on one problem
  – of one method on different problems
  – of quality in different applications

Slide 6: A test bench for morphology learning

For a dozen or so diverse languages:
– text corpus (~1-10 MW)
– tagged subcorpus (0.1-1 MW) for training and/or testing
– broad-coverage analyzer/synthesizer
  – generating data for (semi-)supervised learning
  – an "oracle" for active learning
– tagger generating approximately correct tagged data

Slide 7

Language       | Text | Hand-tagged text | Transducer   | Tagger
Arabic (MSA)   | Y    | N                | Xerox*       | N
Arabic (Egypt) | Y    | N                | LDC*         | N
Russian        | Y    | N                | Xerox*       | Y
Polish         | N    | N                | Xerox*       | Y
Czech          | Y    | Y                | Charles U+   | Y
Hungarian      | N    | N                | Xerox*       | Y
Finnish        | Y    | Y                | U. Helsinki+ | Y
Greek          | Y    | N                | Xerox*       | N
Dutch          | Y    | N                | Xerox*       | Y
Spanish        | Y    | N                | LDC*         | Y?
Italian        | Y    | N                | Xerox*       | Y
French         | Y    | Y                | INTEX+       | Y
Korean         | Y    | Y                | Seoul U*     | ?
Japanese       | Y    | N                | LDC*         | N

(* = in hand, + = promised)

Slide 8: Approaches I -- unsupervised learning

– lots of text as input (100M words or so)
– wordform similarity and context similarities
– stem classes or full decomposition
– promising performance on small subtasks: Yarowsky (2000)
– overall results (so far) are inadequate: Brent (1999), Goldsmith (2000)
– a 100M-word corpus is unrealistic for most languages where induction is needed
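The unsupervised idea on this slide can be sketched very roughly in the spirit of Goldsmith (2000), much simplified: split each wordform at every position and keep stem/suffix splits whose suffix set (a "signature") recurs across several stems. The word list and thresholds here are toy assumptions, not the published algorithm.

```python
# Toy signature-based segmentation sketch (loosely after Goldsmith 2000).
from collections import defaultdict

words = ["walk", "walks", "walked", "walking",
         "jump", "jumps", "jumped", "jumping", "dog", "dogs"]

splits = defaultdict(set)          # candidate stem -> suffixes seen with it
for w in words:
    for i in range(1, len(w) + 1):
        splits[w[:i]].add(w[i:])   # "" marks the bare stem

# A signature is a suffix set; it is evidence when shared by several stems.
signatures = defaultdict(list)
for stem, sufs in splits.items():
    signatures[frozenset(sufs)].append(stem)

for sig, stems in signatures.items():
    if len(stems) >= 2 and len(sig) >= 2:
        print(sorted(sig), "->", sorted(stems))
```

On this toy list the only surviving signature is {"", -s, -ed, -ing} shared by "walk" and "jump"; real systems need corpus-scale counts and a scoring criterion (e.g. MDL) to suppress spurious splits, which is part of why large corpora are required.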

Slide 9: Approaches II -- supervised learning

– no attempts (to our knowledge) to apply general inductive methods to the full problem
  – some very small experiments with neural nets
– why not?
  – the problem is hard (Gold's theorem)
  – the fully supervised case seems unrealistic
    – though a small tagged corpus is cheap
    – production can be parallelized

Slide 10: Approaches III -- semi-supervised learning

– Oflazer and Nirenburg (2000)
– elicit / build / test:
  – elicit paradigms
  – build simple FST rules using a Brill learner
  – test and "see what the problems are"
– initial results seem promising
  – quantitative evaluation?
  – role of human judgment?

Slide 11: Our approach

Set up a test bench to compare alternatives:
– for overall performance
– for application suitability
Try new things:
– modifications of existing algorithms
– new or untested algorithms
Get basic data on the problem:
– across languages and applications

Slide 12: An interesting idea

MAT learning of FSAs:
– Angluin (1987)
– exact learning in polynomial time and state space, given "queries and counterexamples"
– becomes PAC learning given membership queries only
– generalized to "multiplicity automata" (Beimel et al. 1998)
– generalized to two-tape automata (Yokomori 2000)
In tests: a transducer answers the membership queries.
In real applications: humans would give the answers.
Queries can often be run in parallel:
– might run nearly N times faster given N humans...
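The experimental setup described on this slide can be sketched as follows. This is not Angluin's L* algorithm itself, only the query plumbing: an existing analyzer stands in for the human and answers the learner's membership queries, and queries are batched so they can be answered in parallel. The oracle's word list is an invented stand-in.

```python
# Sketch of the membership-query setup: a hand-built analyzer (or a human)
# answers the learner's queries; batches can run in parallel.
from concurrent.futures import ThreadPoolExecutor

def oracle(wordform):
    """Membership query: is this a well-formed word? Here a toy stand-in
    for a broad-coverage transducer (or a human annotator)."""
    return wordform in {"walk", "walks", "walked", "walking"}

def answer_queries(queries, workers=4):
    # With N human annotators, a batch runs roughly N times faster,
    # as the slide notes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(queries, pool.map(oracle, queries)))

print(answer_queries(["walk", "walken", "walks", "walkness"]))
```

An L*-style learner would interleave such membership batches with equivalence queries (counterexamples), refining its hypothesis automaton after each answer.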

Slide 13: Prospective machine morphologists

Mark Liberman, Scott Weinstein, Robin Clark, David Embick, Steven Bird, Ian Ross, Paul Kingsbury, Na-Rae Han, [Michael Kearns]
(a reading and discussion group)

