Natural Language Processing: Conflation Algorithms (December 2007)

Acknowledgements
John Repici (2002), Ex1.htm
Porter, M.F. (1980), An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett (1997), Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN [Vince has a copy of this]
Jurafsky & Martin, appendix B, pp

Conflation
COMPUTES, COMPUTE, COMPUTATION, COMPUTABILITY, COMPUTING, COMPUTER → COMPUT

Word Conflation Algorithms
Morphological analysis versus conflation
Notion of word class is application dependent
- Genealogy: phonetic similarity
- Information Retrieval: semantic similarity
Soundex
Porter

Problems with Names
Names can be misspelt: Rossner
The same name can be spelt in different ways: Kirkop; Chircop
The same name appears differently in different cultures: Tchaikovsky; Chaicowski
To solve this problem, we need phonetically oriented algorithms which can find similar-sounding terms and names. Just such a family of algorithms exists; they are called Soundexes, after the first patented version.

The Soundex Algorithm
A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
It is very handy for searching large databases.
It was originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.

Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. The following letters are discarded: a, e, i, o, u, h, w, and y.
3. Remaining consonants are given a code number.
4. If consonants having the same code number appear consecutively, the number is coded only once (e.g. "B233" becomes "B23").

Code Numbers
b, p, f, v               1
c, s, k, g, j, q, x, z   2
d, t                     3
l                        4
m, n                     5
r                        6

Soundex Algorithm: Example
The Soundex Algorithm uses the following steps to encode a word: [ROSNER]
1. The first character of the word is retained as the first character of the Soundex code. [R]
2. The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]
3. Remaining consonants are given a code number. [R256]
4. If consonants having the same code number appear consecutively, the number is coded only once (e.g. "B233" becomes "B23"). [R256]

Soundex Algorithm 2
The resulting code is modified so that it becomes exactly four characters long:
- If it is less than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200")
- If it is more than 4 characters, the code is truncated (e.g. "B2435" becomes "B243")
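The steps on the Soundex slides can be sketched in a few lines of Python. This is a minimal sketch that follows the slide steps literally (keep the first letter, discard the uncoded letters, code the rest, collapse repeats, pad or truncate to four characters); it is not claimed to match every published Soundex variant, and the names `CODES` and `soundex` are illustrative.

```python
# Code numbers taken from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Simplified Soundex, following the slide steps in order."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    # Step 1: retain the first character.
    first = letters[0].upper()
    # Steps 2 and 3: a, e, i, o, u, h, w, y have no code and are dropped;
    # every other letter is replaced by its code number.
    digits = [CODES[ch] for ch in letters[1:] if ch in CODES]
    # Step 4: a run of identical code numbers is coded only once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Pad with zeroes or truncate so the code is exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]

print(soundex("Rosner"), soundex("Rossner"))
```

Under these steps the misspelling Rossner receives the same code as Rosner, which is the behaviour the names problem calls for.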

Uses for the Soundex Code
Airline reservations: the Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
U.S. Census: as noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
Genealogy: the Soundex code is most often used to avoid obstacles when dealing with names that might have alternate spellings.

Improvements
Preprocessing before applying the basic algorithm, e.g. identification of
- DG with G
- GH with H
- GN with N (not 'ng')
- KN with N
- PH with F
Question: where to stop?
Question: how to evaluate?
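One way to read the preprocessing idea is as a single left-to-right rewrite of the listed letter pairs before the word is handed to Soundex. The sketch below assumes exactly that reading; the function name `preprocess` and the choice of a single regular-expression pass are illustrative, not taken from the slides.

```python
import re

# Letter pairs from the "Improvements" slide and the letter each is identified with.
REWRITES = {"dg": "g", "gh": "h", "gn": "n", "kn": "n", "ph": "f"}
PAIR = re.compile("|".join(REWRITES))

def preprocess(word):
    """Rewrite each listed letter pair once, scanning left to right."""
    return PAIR.sub(lambda m: REWRITES[m.group()], word.lower())

print(preprocess("Phillip"), preprocess("Knight"))
```

Because the pattern matches "gn" and never "ng", the slide's caveat about 'ng' is respected without extra code.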

IR Applications
Information Retrieval: Query → Relevant Documents
“Bag of Terms” document model
What is a single term?

Why Stemming is Necessary
Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, ...
The performance of an IR system will be improved if all of these terms are conflated.
- Fewer terms to worry about
- More accurate statistics
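A toy bag-of-terms count makes both points concrete. In this sketch the `STEM` lookup table is a hypothetical stand-in for a real stemmer and the sample sentence is invented for illustration; only the word list and the stem "comput" come from the slides.

```python
from collections import Counter

# Hypothetical lookup standing in for a stemmer.
STEM = {w: "comput" for w in
        ["compute", "computer", "computing", "computation", "computability"]}

def bag_of_terms(tokens, conflate=False):
    """Count terms, optionally mapping each token to its conflated form."""
    return Counter(STEM.get(t, t) if conflate else t for t in tokens)

tokens = "the computer ran the computation while computing".split()
raw = bag_of_terms(tokens)
conflated = bag_of_terms(tokens, conflate=True)
print(len(raw), len(conflated), conflated["comput"])
```

With conflation the vocabulary shrinks, and the three related words contribute to a single term count instead of three separate counts of one.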

Issues
Is a dictionary available?
- Stems
- Affixes
Motivation: linguistic credibility or engineering performance?
When to remove an affix versus when to leave it alone
Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
relate/relativity vs. radioactive/radioactivity

Consonants and Vowels
A consonant is a letter other than a, e, i, o, u and other than y preceded by a consonant: sky, toy (the y in sky is a vowel; the y in toy is a consonant).
If a letter is not a consonant it is a vowel.
A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively. For example, the word troubles maps to C V C V C.
Any word or part of a word therefore has one of the following forms:
(CV)^n ... C
(CV)^n ... V
(VC)^n ... C
(VC)^n ... V

Measure
All the above patterns can be replaced by the following regular expression, in which the parenthesised C and V are optional:
(C)(VC)^m(V)
m is called the measure of any word or word part.
m=0: tr, ee, tree, y, by
m=1: trouble, oats, trees, ivy
m=2: troubles, private
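The measure can be computed directly from the consonant and vowel definitions. The following is a minimal sketch: the helper names `cv_pattern` and `measure` are illustrative, and y is treated as a vowel only when the preceding letter is a consonant, as defined on the previous slide.

```python
import re

def cv_pattern(word):
    """Label each letter 'c' or 'v'; y counts as a vowel only after a consonant."""
    pattern = ""
    for ch in word.lower():
        if ch in "aeiou" or (ch == "y" and pattern.endswith("c")):
            pattern += "v"
        else:
            pattern += "c"
    return pattern

def measure(word):
    """Return m in (C)(VC)^m(V): the number of VC pairs once runs are collapsed."""
    collapsed = re.sub(r"(.)\1+", r"\1", cv_pattern(word))
    return collapsed.count("vc")

for w in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, measure(w))
```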

Rules
Rules for removing a suffix are given in the form
(condition) S1 → S2
i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
Example rule: (m > 1) EMENT → ε (the empty string)
Example: enlargement → enlarg
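A rule of this form maps naturally onto a small function that takes the suffix, its replacement, and the condition as a predicate on the stem. This is a sketch under that assumption; `apply_rule` and the compact `measure` helper are illustrative names, not part of the slides.

```python
import re

def measure(stem):
    """Count VC groups in the stem (y is a vowel only after a consonant)."""
    pattern = ""
    for ch in stem.lower():
        is_vowel = ch in "aeiou" or (ch == "y" and pattern.endswith("c"))
        pattern += "v" if is_vowel else "c"
    return len(re.findall(r"v+c+", pattern))

def apply_rule(word, s1, s2, condition):
    """Apply (condition) S1 -> S2; return the word unchanged if the rule does not fire."""
    if not word.endswith(s1):
        return word
    stem = word[:len(word) - len(s1)]
    return stem + s2 if condition(stem) else word

# (m > 1) EMENT -> empty string
print(apply_rule("enlargement", "ement", "", lambda stem: measure(stem) > 1))
print(apply_rule("cement", "ement", "", lambda stem: measure(stem) > 1))
```

The second call leaves "cement" alone because the stem "c" has measure 0, which shows why the condition is tested on the stem and not on the whole word.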

Conditions
*S - stem ends with s
*Z - stem ends with z
*T - stem ends with t
*v* - stem contains a vowel
*d - stem ends with a double consonant
*o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
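Each condition is a simple predicate on the stem, so Boolean combinations are ordinary `and`/`or` expressions. The sketch below assumes lower-case ASCII stems; all function names are illustrative.

```python
def cv_pattern(stem):
    """Label each letter 'c' or 'v'; y counts as a vowel only after a consonant."""
    pattern = ""
    for ch in stem.lower():
        if ch in "aeiou" or (ch == "y" and pattern.endswith("c")):
            pattern += "v"
        else:
            pattern += "c"
    return pattern

def ends_with(stem, letter):            # *S, *Z, *T
    return stem.lower().endswith(letter)

def contains_vowel(stem):               # *v*
    return "v" in cv_pattern(stem)

def ends_double_consonant(stem):        # *d
    s = stem.lower()
    return len(s) >= 2 and s[-1] == s[-2] and cv_pattern(s)[-1] == "c"

def ends_cvc(stem):                     # *o
    s = stem.lower()
    return cv_pattern(s).endswith("cvc") and s[-1] not in "wxy"

# A Boolean combination such as (*S or *T):
stem = "adopt"
print(ends_with(stem, "s") or ends_with(stem, "t"))
```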

Organisation
Step 1: Plurals and Third Person Singular Verbs (-s)
Step 2: Verbal Past Tense and Progressive (-ed, -ing)
Step 3: Y to I, Noun Inflections (fly/flies)
Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
Step 6: Derivational Morphology, Single Suffixes
Step 7: Cleanup

Step 1: Plural Nouns and 3rd Person Singular Verbs
condition   rewrite      example
            SSES → SS    caresses → caress
            IES → I      ponies → poni
            SS → SS      caress → caress
            S → ε        cats → cat
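Since none of the Step 1 rules carries a condition, the step reduces to finding the longest matching suffix. A minimal sketch, with the rule list ordered longest suffix first so that the first match is also the longest (`STEP1_RULES` and `step1` are illustrative names):

```python
# (S1, S2) pairs from the Step 1 table, longest suffix first.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1(word):
    """Apply the first (that is, longest) matching Step 1 rule."""
    word = word.lower()
    for s1, s2 in STEP1_RULES:
        if word.endswith(s1):
            return word[:len(word) - len(s1)] + s2
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1(w))
```

The SS → SS rule changes nothing, but it must be present: it stops the shorter S rule from turning "caress" into "cares".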

Step 2a: Verbal Past Tense and Progressive Forms
condition   rewrite      example
(m>0)       EED → EE     feed → feed, agreed → agree
(*v*)       ED → ε       plastered → plaster, bled → bled
(*v*)       ING → ε      killing → kill, sing → sing

Step 2b: Cleanup
Applied only if the 2nd or 3rd rule of Step 2a succeeds.
condition                           rewrite           example
                                    AT → ATE          generat → generate
                                    BL → BLE          troubl → trouble
                                    IZ → IZE          capsiz → capsize
(*d and not (*L or *S or *Z))       → single letter   hopp → hop, hiss → hiss
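Steps 2a and 2b can be combined into one function: 2b runs only on a stem produced by the ED or ING rule. The sketch below covers just the rules listed on these two slides and assumes that only the longest matching suffix is considered (so "feed" is tested against EED and never against ED); the function names are illustrative.

```python
import re

def cv_pattern(word):
    """Label each letter 'c' or 'v'; y counts as a vowel only after a consonant."""
    pattern = ""
    for ch in word.lower():
        if ch in "aeiou" or (ch == "y" and pattern.endswith("c")):
            pattern += "v"
        else:
            pattern += "c"
    return pattern

def measure(word):
    return len(re.findall(r"v+c+", cv_pattern(word)))

def cleanup(stem):
    """Step 2b: repair the stem left after removing -ed or -ing."""
    for s1, s2 in (("at", "ate"), ("bl", "ble"), ("iz", "ize")):
        if stem.endswith(s1):
            return stem[:-len(s1)] + s2
    double = len(stem) >= 2 and stem[-1] == stem[-2] and cv_pattern(stem)[-1] == "c"
    if double and stem[-1] not in "lsz":
        return stem[:-1]          # hopp -> hop
    return stem

def step2(word):
    """Step 2a, followed by Step 2b when the ED or ING rule fires."""
    word = word.lower()
    if word.endswith("eed"):      # (m>0) EED -> EE
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ing", "ed"):  # (*v*) ING -> empty, (*v*) ED -> empty
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if "v" in cv_pattern(stem) else word
    return word

for w in ["agreed", "plastered", "bled", "killing", "sing", "hopping", "troubled"]:
    print(w, "->", step2(w))
```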

Step 3: Y to I
condition   rewrite   example
(*v*)       Y → I     happy → happi, cry → cry

Step 4: Derivational Morphology 1 - Multiple Suffixes (excerpt)
Condition   Rewrite            Example
(m > 0)     ATIONAL → ATE      relational → relate
(m > 0)     TIONAL → TION      conditional → condition
(m > 0)     ENCI → ENCE        valenci → valence
(m > 0)     ABLI → ABLE        comfortabli → comfortable
(m > 0)     OUSLI → OUS        analogousli → analogous
(m > 0)     IZATION → IZE      digitization → digitize
(m > 0)     ATION → ATE        generation → generate
(m > 0)     ATOR → ATE         operator → operate
(m > 0)     ALISM → AL         formalism → formal
(m > 0)     IVENESS → IVE      pensiveness → pensive
(m > 0)     FULNESS → FUL      hopefulness → hopeful
(m > 0)     OUSNESS → OUS      callousness → callous
(m > 0)     ALITI → AL         formaliti → formal
(m > 0)     BILITI → BLE       possibiliti → possible

Step 6: Derivational Morphology III: Single Suffixes
Condition                  Rewrite      Example
(m > 1)                    AL → ε       revival → reviv
(m > 1)                    ANCE → ε     allowance → allow
(m > 1)                    ENCE → ε     inference → infer
(m > 1)                    ER → ε       airliner → airlin
(m > 1)                    IC → ε       gyroscopic → gyroscop
(m > 1)                    ABLE → ε     adjustable → adjust
(m > 1)                    ANT → ε      irritant → irrit
(m > 1)                    EMENT → ε    replacement → replac
(m > 1)                    MENT → ε     adjustment → adjust
(m > 1)                    ENT → ε      dependent → depend
(m > 1 and (*S or *T))     ION → ε      adoption → adopt
(m > 1)                    OUS → ε      homologous → homolog
(m > 1)                    ISM → ε      formalism → formal
(m > 1)                    ATE → ε      activate → activ
                           ITI → ε

Porter Example
INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management

Porter Output
Original Word   Stemmed Word
first
focus           focu
area
integrated      integr
projects        project
help
develop
principally     princip
common
open
platforms       platform
software        softwar
services        servic
supporting      support
distributed     distribut
information     inform
decision        decis
systems         system
risk
crisis          crisi
management      manag

Summary
Conflation serves different purposes.
Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity. This can cause errors in the bag-of-words model.
Soundex and Porter are very well established and easily available.