Published by Hilary Tyler. Modified over 9 years ago.
1
October 2009 HLT: Conflation Algorithms
Human Language Technology
Conflation Algorithms
2
Acknowledgements
John Repici (2002) http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
Jurafsky & Martin, appendix B, pp. 833-836.
3
Conflation
COMPUTES, COMPUTE, COMPUTATION, COMPUTABILITY, COMPUTING, COMPUTER → COMPUT
4
Types of Conflation Algorithm
Stemming
–Process based, e.g. affix stripping
Lemmatisation
–Attempt to map to same lemma
–POS dependent
Morphological Analysis
–Includes morpho-syntactic information
5
Word Conflation Algorithms
Morphological analysis versus conflation
Notion of word class used is application dependent
–Genealogy: phonetic similarity
–Information Retrieval: semantic similarity
Based on written language (not phonetic transcription)
Well-known algorithms
–Soundex
–Porter
6
Soundex: Problems with Names
Names can be misspelt: Rossner
The same name can be spelt in different ways: Kirkop; Chircop
The same name appears differently in different cultures: Tchaikovsky; Chaicowski
To solve this problem, we need phonetically oriented algorithms that can find similar-sounding terms and names. Just such a family of algorithms exists; they are called Soundexes, after the first patented version.
7
The Soundex Algorithm
A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
It is very handy for searching large databases.
Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.
8
Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. The following letters are discarded: a, e, i, o, u, h, w, and y.
3. Remaining consonants are given a code number.
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").
9
Code Numbers
Letters                   Code
b, p, f, v                1
c, s, k, g, j, q, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6
10
Soundex Algorithm: Example
Encoding the word ROSNER:
1. The first character of the word is retained as the first character of the Soundex code. [R]
2. The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]
3. Remaining consonants are given a code number. [R256]
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23"). [R256]
11
Soundex Algorithm 2
The resulting code is modified so that it becomes exactly four characters long:
–If it is shorter than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200")
–If it is longer than 4 characters, the code is truncated (e.g. "B2435" becomes "B243")
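The four encoding steps and the length rule can be sketched in Python as follows. This is a minimal sketch of the simplified procedure on these slides, not a reference implementation. The names CODES and soundex are illustrative, and it assumes that the retained first letter is not compared with the code of the letter that follows it, since the slides do not say either way.

```python
# Code table from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Encode a word with the slide's four steps plus the length rule."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    # Step 1: keep the first letter as it is.
    first = letters[0].upper()
    # Steps 2 and 3: a, e, i, o, u, h, w, y have no entry in CODES, so they
    # are dropped; every other letter is replaced by its code number.
    digits = [CODES[ch] for ch in letters[1:] if ch in CODES]
    # Step 4: keep a digit only if it differs from the digit before it.
    # Assumption: the first letter's own code takes no part in this check.
    collapsed = [d for i, d in enumerate(digits)
                 if i == 0 or d != digits[i - 1]]
    # Length rule: pad with zeroes, then cut to four characters.
    return (first + "".join(collapsed) + "000")[:4]


print(soundex("Rosner"), soundex("Rossner"))
```

Both spellings receive the same code, which is the behaviour the names example relies on.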
12
Uses for the Soundex Code
Airline reservations: the Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
U.S. Census: as noted above, the U.S. Census Bureau was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the twentieth century.
Genealogy: the Soundex code is most often used to avoid problems when dealing with names that might have alternate spellings.
13
Improvements
Preprocessing before applying the basic algorithm, e.g. identification of
–DG with G
–GH with H
–GN with N (not 'ng')
–KN with N
–PH with F
Question: where to stop?
Question: how to evaluate?
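One way such a preprocessing pass might look is sketched below. The substitution table is taken from the list above; the names DIGRAPHS and preprocess, and the choice to apply all substitutions in a single left-to-right pass so that one replacement cannot trigger another, are assumptions made for illustration.

```python
import re

# Digraph substitutions listed on the slide. Applying them in one
# left-to-right pass is an assumption, not something the slide specifies.
DIGRAPHS = {"DG": "G", "GH": "H", "GN": "N", "KN": "N", "PH": "F"}
_PATTERN = re.compile("|".join(DIGRAPHS))


def preprocess(word):
    """Replace each listed digraph before the basic algorithm is applied."""
    return _PATTERN.sub(lambda m: DIGRAPHS[m.group(0)], word.upper())


print(preprocess("Knight"), preprocess("Philip"))
```

The output of preprocess would then be passed to the basic Soundex encoder. How far to extend the table is exactly the "where to stop?" question.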
14
IR Applications
Information Retrieval: Query → Relevant Documents
“Bag of Terms” document model
What is a single term?
15
Why Stemming is Necessary
Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, …
Performance of an IR system will be improved if all of these terms are conflated.
–Fewer terms to worry about
–More accurate statistics
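The effect on a bag-of-terms model can be illustrated with a toy count. In the sketch below, CONFLATE is a hypothetical lookup table standing in for a real stemmer, and bag_of_terms is an illustrative helper; only the target form "comput" comes from the Conflation slide.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer.
CONFLATE = {w: "comput" for w in
            ["compute", "computes", "computer", "computing",
             "computation", "computability"]}


def bag_of_terms(text, conflate=False):
    """Count whitespace-separated terms, optionally conflating them first."""
    tokens = text.lower().split()
    if conflate:
        tokens = [CONFLATE.get(t, t) for t in tokens]
    return Counter(tokens)


doc = "compute computer computing computation computability"
print(len(bag_of_terms(doc)))            # distinct terms without conflation
print(bag_of_terms(doc, conflate=True))  # one term carrying all the counts
```

Without conflation the document contributes five terms that each occur once; with conflation it contributes a single term with a count of five, which is what "fewer terms" and "more accurate statistics" refer to.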
16
Issues
Is a dictionary available?
–Stems
–Affixes
Motivation: linguistic credibility or engineering performance?
When to remove an affix versus when to leave it alone
Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
relate/relativity vs. radioactive/radioactivity
17
Consonants and Vowels
A consonant is a letter other than a, e, i, o, u and other than y preceded by a consonant. In sky, the y follows a consonant and so counts as a vowel; in toy, the y follows a vowel and so counts as a consonant.
If a letter is not a consonant it is a vowel.
A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.
For example, the word troubles maps to C V C V C
Any word or part of a word therefore has one of the following forms:
CVCV…C
CVCV…V
VCVC…C
VCVC…V
18
Measure
All the above patterns can be replaced by the single expression [C](VC)^m[V], where square brackets mark an optional part and (VC)^m means VC repeated m times.
m is called the measure of any word or word part.
m=0: tr, ee, tree, y, by
m=1: trouble, oats, trees, ivy
m=2: troubles, private
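The measure can be computed directly from the consonant/vowel definition. The sketch below is one possible way to do it; the function names cv_form and measure are illustrative and not taken from the slides.

```python
import re


def cv_form(word):
    """Return the word's letter classes, e.g. 'troubles' -> 'ccvvccvc'."""
    form = ""
    for ch in word.lower():
        # y counts as a vowel only when the previous letter is a consonant.
        if ch in "aeiou" or (ch == "y" and form.endswith("c")):
            form += "v"
        else:
            form += "c"
    return form


def measure(word):
    """Return m, the number of vowel-run/consonant-run pairs in the word."""
    return len(re.findall(r"v+c+", cv_form(word)))


for w in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, measure(w))
```

Each match of v+c+ corresponds to one VC block, so counting the matches gives m; a leading consonant run and a trailing vowel run are simply not counted.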
19
Rules
Rules for removing a suffix are given in the form
(condition) S1 → S2
i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
Example rule: (m > 1) EMENT → ε
Example: enlargement → enlarg
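A single rule of this form might be applied as in the sketch below. The function apply_rule and its parameters are illustrative names; the condition is passed in as a function of the stem. A compact version of the measure is included only so that the block runs on its own.

```python
import re


def measure(stem):
    # Compact restatement of the measure: classify letters, count VC blocks.
    form = ""
    for ch in stem.lower():
        vowel = ch in "aeiou" or (ch == "y" and form.endswith("c"))
        form += "v" if vowel else "c"
    return len(re.findall(r"v+c+", form))


def apply_rule(word, s1, s2, condition):
    """Rewrite suffix s1 as s2 if the stem before s1 meets the condition."""
    if word.endswith(s1):
        stem = word[:len(word) - len(s1)]
        if condition(stem):
            return stem + s2
    return word


# (m > 1) EMENT -> (empty string)
print(apply_rule("enlargement", "ement", "", lambda s: measure(s) > 1))
print(apply_rule("cement", "ement", "", lambda s: measure(s) > 1))
```

In the second call the stem "c" has measure 0, so the condition fails and the word is returned unchanged.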
20
Conditions
*S: stem ends with s
*Z: stem ends with z
*T: stem ends with t
*v*: stem contains a vowel
*d: stem ends with a double consonant
*o: stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
21
Organisation
Step 1: Plurals and third person singular verbs (-s)
Step 2: Verbal past tense and progressive (-ed, -ing)
Step 3: Y to I, noun inflections (fly/flies)
Steps 4 and 5: Derivational morphology, multiple suffixes (visualisation → visualise)
Step 6: Derivational morphology, single suffixes
Step 7: Cleanup
22
Step 1: Plural Nouns and 3rd Person Singular Verbs
Condition   Rewrite      Example
(none)      SSES → SS    caresses → caress
(none)      IES → I      ponies → poni
(none)      SS → SS      caress → caress
(none)      S → ε        cats → cat
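Because the Step 1 rules carry no conditions, they make a convenient illustration of longest-suffix matching. The sketch below is one possible encoding; STEP1_RULES and step1 are illustrative names.

```python
# Step 1 rules as (suffix, replacement) pairs.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the Step 1 rule whose suffix is the longest one that matches."""
    # Try longer suffixes first so that "caress" matches SS rather than S.
    for suffix, replacement in sorted(STEP1_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1(w))
```

The SS → SS rule changes nothing by itself; its role is to match before S → ε does, which protects words such as caress from losing their final s.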
23
Step 2a: Verbal Past Tense and Progressive Forms
Condition   Rewrite     Example
(m > 0)     EED → EE    feed → feed; agreed → agree
(*v*)       ED → ε      plastered → plaster; bled → bled
(*v*)       ING → ε     killing → kill; sing → sing
24
Step 2b: Cleanup
Applies only if the 2nd or 3rd rule of the previous step succeeds.
Condition                        Rewrite            Example
(none)                           AT → ATE           generat → generate
(none)                           BL → BLE           troubl → trouble
(none)                           IZ → IZE           capsiz → capsize
(*d and not (*L or *S or *Z))    → single letter    hopp → hop; hiss → hiss
25
Step 3: Y to I
Condition   Rewrite   Example
(*v*)       Y → I     happy → happi; cry → cry
26
Step 4: Derivational Morphology I: Multiple Suffixes (excerpt)
Condition   Rewrite          Example
(m > 0)     ATIONAL → ATE    relational → relate
(m > 0)     TIONAL → TION    conditional → condition
(m > 0)     ENCI → ENCE      valenci → valence
(m > 0)     ABLI → ABLE      comfortabli → comfortable
(m > 0)     OUSLI → OUS      analogousli → analogous
(m > 0)     IZATION → IZE    digitization → digitize
(m > 0)     ATION → ATE      generation → generate
(m > 0)     ATOR → ATE       operator → operate
(m > 0)     ALISM → AL       formalism → formal
(m > 0)     IVENESS → IVE    pensiveness → pensive
(m > 0)     FULNESS → FUL    hopefulness → hopeful
(m > 0)     OUSNESS → OUS    callousness → callous
(m > 0)     ALITI → AL       formaliti → formal
(m > 0)     BILITI → BLE     possibiliti → possible
27
Step 6: Derivational Morphology III: Single Suffixes
Condition                 Rewrite      Example
(m > 1)                   AL → ε       revival → reviv
(m > 1)                   ANCE → ε     allowance → allow
(m > 1)                   ENCE → ε     inference → infer
(m > 1)                   ER → ε       airliner → airlin
(m > 1)                   IC → ε       gyroscopic → gyroscop
(m > 1)                   ABLE → ε     adjustable → adjust
(m > 1)                   ANT → ε      irritant → irrit
(m > 1)                   EMENT → ε    replacement → replac
(m > 1)                   MENT → ε     adjustment → adjust
(m > 1)                   ENT → ε      dependent → depend
(m > 1 and (*S or *T))    ION → ε      adoption → adopt
(m > 1)                   OU → ε       homologou → homolog
(m > 1)                   ISM → ε      formalism → formal
(m > 1)                   ATE → ε      activate → activ
(m > 1)                   ITI → ε
28
Porter Example
INPUT
in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management
29
Porter Output
Original Word    Stemmed Word
first            (unchanged)
focus            focu
area             (unchanged)
integrated       integr
projects         project
help             (unchanged)
develop          (unchanged)
principally      princip
common           (unchanged)
open             (unchanged)
platforms        platform
software         softwar
services         servic
supporting       support
distributed      distribut
information      inform
decision         decis
systems          system
risk             (unchanged)
crisis           crisi
management       manag
30
Stemming Errors
Under-stemming
–the error of taking off too small a suffix
–example: croulons → croulon
–since croulons is a form of the verb crouler
Over-stemming
–the error of taking off too much
–example: croûtons → croût
–since croûtons is the plural of croûton
Mis-stemming
–taking off what looks like an ending, but is really part of the stem
–example: reply → rep
31
Summary
Conflation serves different purposes.
Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity.
This can cause errors in the bag of words model.
Soundex and Porter are very well established and easily available.