Carnegie Mellon Christian Monson ParaMor Finding Paradigms Across Morphology Christian Monson

Turkish Morphology – Beads on a String
götürülmüyorsun = götür +ül +m +üyor +sun
take +passive +negative +present progressive +2nd person singular
"You are not being taken"

Applications of Computational Morphology
Machine Translation
– Turkish-English (Oflazer, 2007)
– Czech-English (Goldwater and McClosky, 2005)
Speech Recognition
– Finnish (Creutz, 2006)
Information Retrieval

Challenges of Computational Morphology
Time Consuming for a New Language
– Kemal Oflazer estimates 3-4 months to build a basic Turkish analyzer
– Plus lexicon development and maintenance
Expertise Needed
– Greenlandic
  Official language of Greenland
  Agglutinative Inuit language
  50,000 speakers
  Per Langaard

The Solution
Raw Text → Unsupervised Morphology Induction

ParaMor – Paradigm Morphology
ParaMor
– Unsupervised morphology induction system
Paradigm
– The natural structure of morphology
Outline: ParaMor, Identify (Search, Cluster, Filter), Segment, Evaluation, Results

Paradigms – The Structure of Morphology
Slot             Morphemes
Stem             götür (take)
Voice            ül (passive)
Polarity         m (negative)
Tense & Mood     üyor (present progressive), yecek (future)
Person & Number  sun (2nd person singular), um (1st person singular), Ø (3rd person singular), uz (1st person plural)
Paradigm
– Set of mutually replaceable strings
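
The idea of a paradigm as a set of mutually replaceable strings can be illustrated with a minimal Python sketch. It only concatenates the morphemes shown on the slides for the present progressive; the function name `inflect` and the dictionary layout are illustrative assumptions, and the sketch ignores morphophonology, so it should not be read as a Turkish generator.

```python
# Person & Number paradigm from the slides; "" stands for the null suffix Ø.
PERSON_NUMBER = {
    "sun": "2nd person singular",
    "um": "1st person singular",
    "": "3rd person singular",
    "uz": "1st person plural",
}

def inflect(prefix, paradigm):
    """Attach each member of a paradigm to the same preceding morphemes."""
    return {prefix + suffix: gloss for suffix, gloss in paradigm.items()}

# götür (take) + ül (passive) + m (negative) + üyor (present progressive)
forms = inflect("götür" + "ül" + "m" + "üyor", PERSON_NUMBER)
for form, gloss in forms.items():
    print(form, "-", gloss)
```

Swapping one Person & Number suffix for another changes only the last slot, which is the sense in which the suffixes are mutually replaceable.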

The ParaMor Algorithm
Identify suffix paradigms in 3 steps
1. Search for candidate paradigms
2. Cluster candidates modeling the same paradigm
3. Filter
Segment words
– Using the discovered paradigms

Search for Candidate Paradigms
All character boundaries are candidate morpheme boundaries
Begin search with the most frequent word-final string
– Spanish: s, as in autorizaciones, buscabamos, costas, importadoras, vallas, …
Identify the most frequent mutually replaceable string
– Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm
Stop adding suffixes
– When the most frequent mutually replaceable string severely decreases the stem count
Move on to the next most frequent word-final string

Search paths (candidate suffix set, followed by its stem count)
s
  Ø s  5501
  Ø r s  287
a  8981
  a o  2304
  a o os  1410
  a as o os  892
n  6051
  Ø n  1874
  Ø n r  509
  Ø do n r  354
  Ø da das do dos n ndo r ron  118
es  2751
  Ø es  874
an  1786
  a an  1049
  a an ar  413
  a an ar ó  353
  a ada adas ado ados an ar aron ó  149
rado  167
  rada rado  89
  rada rado rados  67
  rada radas rado rados  53
  ra rada radas rado rados ran rar raron ró  23
strado  15
  strada strado  12
  strada strado stró  9
  strada strado strar stró  8
  strada stradas strado strar stró  7
...
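
The search steps above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not ParaMor's implementation: the names `suffix_table` and `grow_candidate`, the toy word list, and the `min_ratio` stopping threshold are all hypothetical, since the slides say only that growth stops when the stem count "severely decreases".

```python
from collections import defaultdict

NULL = "Ø"  # label for the empty suffix

def suffix_table(words):
    """Map each word-final string to the set of stems it attaches to."""
    table = defaultdict(set)
    for word in words:
        # Every character boundary is a candidate morpheme boundary.
        for i in range(1, len(word) + 1):
            table[word[i:] or NULL].add(word[:i])
    return table

def grow_candidate(seed, table, min_ratio=0.25):
    """Greedily add the suffix that keeps the most stems; stop on a severe drop.

    min_ratio is a hypothetical threshold, not a value from the slides.
    """
    suffixes, stems = {seed}, set(table[seed])
    while True:
        best, best_stems = None, set()
        for candidate in sorted(table):  # sorted so ties break deterministically
            if candidate in suffixes:
                continue
            shared = stems & table[candidate]
            if len(shared) > len(best_stems):
                best, best_stems = candidate, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best)
        stems = best_stems

# Toy vocabulary (illustrative only).
words = ["costa", "costas", "valla", "vallas", "casa", "casas",
         "importadora", "importadoras", "mes"]
suffixes, stems = grow_candidate("s", suffix_table(words))
print(sorted(suffixes), sorted(stems))
```

On this toy list the seed s grows to the candidate Ø s, and the stem me (from mes) drops out because me never occurs on its own, mirroring how the stem count shrinks as suffixes are added.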

Cluster Candidates per Paradigm
Candidate, 15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
– 22 Stems: anunci, aplic, apoy, celebr, concentr, …
– 330 Covered Types
Candidate, 15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
– 23 Stems: anunci, apoy, confirm, consider, declar, …
– 345 Covered Types
Merged cluster, 16 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
– Cosine Similarity; Covered Types
Candidate, 15 suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
– 25 Stems: anunci, aplic, apoy, celebr, consider, …
– 375 Covered Types
Merged cluster, 17 suffixes: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
– Cosine Similarity; Covered Types
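
The counts on these slides are consistent with covered types being every stem paired with every suffix (15 × 22 = 330, 15 × 23 = 345, 15 × 25 = 375). A minimal sketch of comparing two candidates follows. It assumes the similarity is the set cosine |A ∩ B| / sqrt(|A| · |B|) over covered types, which the slides name but do not define; the function names and the tiny candidates are illustrative.

```python
from math import sqrt

def covered_types(stems, suffixes):
    """All word types a candidate paradigm covers: each stem with each suffix."""
    return {stem + suffix for stem in stems for suffix in suffixes}

def cosine_similarity(types_a, types_b):
    """Set cosine between two collections of covered types (assumed measure)."""
    if not types_a or not types_b:
        return 0.0
    return len(types_a & types_b) / sqrt(len(types_a) * len(types_b))

# Two small illustrative candidates that overlap in stems and suffixes.
first = covered_types({"anunci", "apoy"}, {"a", "an", "ar"})
second = covered_types({"anunci", "apoy", "declar"}, {"a", "an", "ó"})
similarity = cosine_similarity(first, second)
print(len(first), len(second), round(similarity, 3))
```

Candidates that share most of their stems and suffixes score close to 1 and are natural merge partners, as with the three 15-suffix candidates above.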

Filter Candidate Paradigms
2 types of filtering
1. Remove small unclustered candidate paradigms
2. Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
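
Harris (1955) is commonly associated with letter successor variety: a morpheme boundary is more plausible where many different letters can follow a given word-initial string. The sketch below illustrates only that general idea on a toy word list. How ParaMor turns it into a filter is not stated on the slide, and the function name is an assumption.

```python
def successor_variety(words, prefix):
    """Count the distinct characters that follow `prefix` across the vocabulary."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

# Toy vocabulary (illustrative only).
words = ["administrada", "administradas", "administrado",
         "administrar", "administró"]

for prefix in ("administ", "administr"):
    print(prefix, successor_variety(words, prefix))
```

Here only r can follow administ, while both a and ó can follow administr, so the boundary after administr is the more likely one.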

Segment Words Using Paradigms
Word to segment: administradas
Paradigm a ada adas ado ados an ar aron ó ...
– Replacing adas with ada gives administrada, also a word: administr +adas
Paradigm a as o os
– Replacing as with a gives administrada: administrad +as
Paradigm Ø s
– Replacing s with Ø gives administrada: administrada +s
Old way: Separate alternative analyses
– administr +adas, administrad +as, administrada +s
New way: Augment the current segmentation
– administr +ad +as, then administr +ad +a +s

Morpho Challenge 2007
Peer operated competition
– For unsupervised morphology induction algorithms
4 languages
– English
– German
– Finnish
– Turkish

ParaMor in Morpho Challenge 2007
Developed on Spanish
– ParaMor’s free parameters were frozen

2 Methods of Evaluation
1. Linguistic
Segmentations compared to a morphologically analyzed lexicon
Word           Analysis              Answer
administradas  administr +ad +a +s   administrar +Adj +Fem +Pl
administrada   administr +ad +a      administrar +Adj +Fem
2. Task based
Information retrieval
– Short two-sentence queries
– About international news topics
– Binary relevance assessments
– About 50 queries and 20K relevance judgements for each language
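
With binary relevance assessments, retrieval quality for one query is often summarized as average precision. The sketch below uses the standard uninterpolated definition; whether Morpho Challenge 2007 computed its scores in exactly this way is an assumption, and the document identifiers are made up for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical ranking for one query: d1 and d3 are judged relevant.
score = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(round(score, 3))
```

Averaging this value over all queries for a language gives a single figure that can be compared across systems.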

Linguistic Evaluation
[Bar charts: F1 for Bernhard 2, Morfessor, ParaMor, and ParaMor & Morfessor. Values that survive from the charts: 47.2, 50.6, 50.7, 60.8.]

IR Evaluation (TF/IDF)
[Bar charts: Average Precision for Morfessor, ParaMor, McNamee, Morfessor Baseline, and ParaMor & Morfessor, with a reference line for No Morphological Analysis. Values that survive from the charts: 27.0, 32.0.]

ParaMor: State-of-the-Art Unsupervised Morphology Induction System
Combined system among the best in Morpho Challenge 2007
Consistent across languages
Better than no morphology
– Task based (IR) measure

Many Future Directions
Improve Performance
– F1 of 50-60% is state-of-the-art!
– Inflection classes
– Morphophonology
Beyond beads-on-a-string

Thank You!