1
RePortS: A Simpler, Intuitive Approach to Morpheme Induction
Emily Pitler, Samarth Keshava
Yale University
2
Goals
Segment English words into morphemes
Simple algorithm
Minimize assumptions and “magic numbers”
3
Approach
Identify common morphemes in the language
– “prefix” and “suffix” lists
Use these to segment the test words
4
Intuition and Motivation
The word fragment that remains after removing a potential morpheme is often still a word
Examples:
– training = train+ing
– chairman = chair+man
– insufferable = insuffer+able
This test is used to identify morphemes; it is not used to segment words
5
Intuition and Motivation
Use fluctuations in transitional probabilities (Harris 1955, Hafer and Weiss 1974)
Examples:
– Expect Pr(t | repor) ≈ 1
– Expect Pr(s | report) < 1, because there are other words such as reported, reporting, report, etc.
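The two expectations above can be checked on a toy word list. This is a minimal sketch, not the authors' code: the function name `transitional_probability` and the five-word list are assumptions made for illustration, and each word is counted once rather than weighted by corpus frequency.

```python
def transitional_probability(words, prefix, next_char):
    """Estimate Pr(next_char | prefix) from a list of distinct words.

    Assumption: every word counts once; the slides do not say whether
    the original system weights words by corpus frequency.
    """
    starting = [w for w in words if w.startswith(prefix)]
    if not starting:
        raise ValueError(f"no word starts with {prefix!r}")
    continuing = [w for w in starting if w.startswith(prefix + next_char)]
    return len(continuing) / len(starting)


# Hypothetical toy word list, not the challenge corpus.
toy_words = ["report", "reports", "reported", "reporting", "reporter"]

# Every word beginning with "repor" continues with "t".
p_t = transitional_probability(toy_words, "repor", "t")
# Only "reports" continues "report" with "s"; the other words end or branch here.
p_s = transitional_probability(toy_words, "report", "s")
print(p_t, p_s)
```

The drop from `p_t` to `p_s` is the kind of fluctuation that suggests a morpheme boundary after "report".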
6
Four Steps
1. Preprocessing: build the lexicographic trees
2. Score word fragments to determine morphemes
3. Prune the morpheme lists
4. Segment words using the trees and morpheme lists
7
Step 1: Build the trees
We build a “forward tree” and a “backward tree”
We use these trees to calculate transitional probabilities in O(1) time
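One way to picture the two trees is as letter tries whose nodes store how many words pass through them, with the backward tree built over reversed words. The sketch below assumes that layout; `Node`, `build_tree`, `find` and `prob` are hypothetical names, and the slides do not show the actual data structure. Once a node has been reached, a transitional probability is a single division of two stored counts.

```python
class Node:
    """One letter position in the tree."""

    def __init__(self):
        self.count = 0        # number of words passing through this node
        self.children = {}    # next letter -> Node
        self.is_word = False  # True if a word ends exactly here


def build_tree(words, reverse=False):
    """Build a forward tree, or a backward tree when reverse=True."""
    root = Node()
    for word in words:
        letters = word[::-1] if reverse else word
        node = root
        node.count += 1
        for ch in letters:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.is_word = True
    return root


def find(root, fragment):
    """Walk down from the root; return None if the fragment never occurs."""
    node = root
    for ch in fragment:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def prob(node, ch):
    """Pr(ch | fragment leading to node): one lookup and one division."""
    child = node.children.get(ch)
    return child.count / node.count if child else 0.0


# Hypothetical toy word list, not the challenge corpus.
toy_words = ["report", "reports", "reported", "reporting", "reporter"]
forward = build_tree(toy_words)
backward = build_tree(toy_words, reverse=True)
```

For example, `prob(find(forward, "repor"), "t")` gives the forward probability used in the "reports" example, and `prob(backward, "s")` gives the share of words that end in "s".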
8
Hypothetical section of the forward tree
9
Step 2: Scoring morphemes
Example: scoring “s” in “reports”
– Check if “report” is a word in the corpus
– Check if Pr(t | repor) ≈ 1
– Check if Pr(s | report) < 1
If “s” passes all three tests, we add 19 to its suffix score; otherwise we subtract 1
10
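Step 2: Scoring morphemes
We declare fragments to be morphemes if they have positive scores
+19/-1 scheme
– Chosen so that a fragment's score is positive if and only if it passes more than 5% of its tests (19 × passes - failures > 0 exactly when passes / tests > 1/20)
– More frequent morphemes have higher scores
– Multiplying both numbers by the same positive factor would produce the same results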
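The scoring step for suffixes might look like the sketch below, which follows the slide wording literally: a split that fails any of the three tests costs the suffix one point. The names `prefix_counts` and `score_suffixes`, the `near_one` threshold standing in for "≈ 1", and the toy word list are all assumptions; the slides give no numeric threshold. Prefix scoring would be the mirror image over reversed words.

```python
from collections import Counter

REWARD, PENALTY = 19, 1  # the +19/-1 scheme from the slides


def prefix_counts(words):
    """Count how many words start with each prefix (including the empty one)."""
    counts = Counter()
    for word in words:
        for i in range(len(word) + 1):
            counts[word[:i]] += 1
    return counts


def score_suffixes(words, near_one=0.95):
    """Score every word-final fragment as a candidate suffix.

    near_one is a hypothetical stand-in for the slide's "≈ 1".
    """
    vocab = set(words)
    counts = prefix_counts(vocab)
    scores = Counter()
    for word in vocab:
        for i in range(1, len(word)):
            stem, suffix = word[:i], word[i:]
            # Test 1: the remaining fragment is itself a word.
            is_word = stem in vocab
            # Test 2: Pr(last letter of stem | rest of stem) is close to 1.
            p_inside = counts[stem] / counts[stem[:-1]]
            # Test 3: Pr(first letter of suffix | stem) is below 1.
            p_boundary = counts[stem + suffix[0]] / counts[stem]
            if is_word and p_inside >= near_one and p_boundary < 1.0:
                scores[suffix] += REWARD
            else:
                scores[suffix] -= PENALTY
    return scores


# Hypothetical toy word list, not the challenge corpus.
toy_words = ["report", "reports", "reported", "reporting",
             "train", "trains", "trained", "training"]
scores = score_suffixes(toy_words)
morphemes = {frag for frag, score in scores.items() if score > 0}
```

On this toy list only "s", "ed" and "ing" end up with positive scores; fragments such as "d" or "ng" never follow a complete word, so they only collect penalties.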
11
Step 3: Pruning
Don’t want “er”, “s” and “ers” all in the morpheme list
Remove any morpheme composed of two other morphemes with higher scores
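A minimal sketch of the pruning rule, assuming that "higher scores" means both parts score higher than the combined morpheme and that every candidate is checked against the original, unpruned list. The function name `prune` and the scores in the example are made up for illustration.

```python
def prune(scores):
    """Drop any morpheme that splits into two higher-scoring morphemes."""
    kept = {}
    for morpheme, score in scores.items():
        composed = any(
            morpheme[:i] in scores
            and morpheme[i:] in scores
            and scores[morpheme[:i]] > score
            and scores[morpheme[i:]] > score
            for i in range(1, len(morpheme))
        )
        if not composed:
            kept[morpheme] = score
    return kept


# Hypothetical scores, chosen only to illustrate the rule.
suffix_scores = {"s": 500, "er": 300, "ers": 120, "ing": 400}
pruned = prune(suffix_scores)
```

Here "ers" is removed because it is "er" + "s" and both parts score higher, while "er", "s" and "ing" survive.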
12
Top English Morphemes
Top 10 of the 808 morphemes in the “prefix” list:
1. un
2. re
3. dis
4. non
5. over
6. mis
7. in
8. sub
9. pre
10. inter
13
Top English Morphemes
Top 10 of the 987 morphemes in the “suffix” list:
1. s
2. ly
3. ness
4. ing
5. ed
6. al
7. ism
8. less
9. ist
10. able
14
Top English Morphemes
Prefixes and suffixes later in the list
Prefixes: 101. well 102. water 103. servo 104. make 105. quick
Suffixes: 101. ier 102. box 103. town 104. line 105. more
15
Step 4: Segmenting Words
politeness = polite+ness or politenes+s?
Use transitional probabilities again
– Expect Pr(n | polite) < Pr(s | politenes)
Peel off the morpheme with the smallest probability (unless all probabilities are 1)
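The peeling loop can be sketched for the suffix side as follows; prefixes would be handled the same way with the backward tree. The names `prefix_counts` and `segment_suffixes`, the toy word list and the three-entry suffix list are assumptions, and skipping stems that never occur in the word list is a choice made here, not something the slides specify.

```python
from collections import Counter


def prefix_counts(words):
    """Count how many words start with each prefix (including the empty one)."""
    counts = Counter()
    for word in words:
        for i in range(len(word) + 1):
            counts[word[:i]] += 1
    return counts


def segment_suffixes(word, suffixes, counts):
    """Repeatedly peel off the listed suffix with the lowest boundary probability."""
    pieces = []
    while True:
        best = None
        for suffix in suffixes:
            if len(suffix) < len(word) and word.endswith(suffix):
                stem = word[:-len(suffix)]
                if counts[stem] == 0:
                    continue  # assumption: no evidence for an unseen stem
                # Pr(first letter of suffix | stem)
                p = counts[stem + suffix[0]] / counts[stem]
                if best is None or p < best[0]:
                    best = (p, suffix)
        # Stop when no suffix applies or every candidate probability is 1.
        if best is None or best[0] >= 1.0:
            break
        pieces.insert(0, best[1])
        word = word[:-len(best[1])]
    return [word] + pieces


# Hypothetical toy data, not the challenge corpus or the real suffix list.
toy_words = ["polite", "politely", "politeness", "politer"]
counts = prefix_counts(toy_words)
segments = segment_suffixes("politeness", ["s", "ness", "ly"], counts)
```

On this toy list Pr(n | polite) is 1/4 while Pr(s | politenes) is 1, so "ness" is peeled off and the result is polite+ness rather than politenes+s.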
16
Results
English results
                                          F-score   Precision   Recall
On the provided 532-word Gold Standard    80.92%    82.84%      79.10%
On the organizers’ test data              76.8%     76.2%       77.4%
17
Results
Breakdown
– Contribution of the different intuitions
                       F-score   Precision   Recall
Criteria 1 only        57.33%    45.22%      78.29%
Criteria 2 & 3 only    60.58%    50.21%      76.36%
All                    80.92%    82.84%      79.10%
18
Results
           F-score   Precision   Recall
Finnish    46.62%    83.76%      32.30%
Turkish    54.04%    72.68%      43.01%
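The F-scores on the results slides are consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state which F-measure was used, so that is an assumption; the helper name `f_score` is also hypothetical. The sketch recomputes each reported F-score from its precision and recall.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (reported F-score, precision, recall) in percent, copied from the slides.
reported = {
    "English, gold standard": (80.92, 82.84, 79.10),
    "English, test data": (76.8, 76.2, 77.4),
    "Finnish": (46.62, 83.76, 32.30),
    "Turkish": (54.04, 72.68, 43.01),
}

for name, (f, p, r) in reported.items():
    print(f"{name}: reported {f}, recomputed {f_score(p, r):.2f}")
```

Small differences in the last digit are expected, because the precision and recall on the slides are already rounded.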
19
Simple and Effective
Based on intuition, not a complex model
– How we personally would segment words
Program was relatively short: 252 lines of Perl
Other variations had slightly better F-scores
Best mixture of performance and elegance
20
Thank you for listening.
Emily Pitler, Samarth Keshava