Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming.


Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming Alexander Gelbukh

2 Previous Chapter: Conclusions Parallel computing can improve response time for each query and/or throughput: number of queries processed with same speed Document partitioning is simple, good for distributed computing Term partitioning is good for some data structures Distributed computing is MIMD computing with slow communication SIMD machines are good for Signature files Both are out of favor now

3 Previous Chapter: Research topics How to evaluate the speedup New algorithms Adaptation of existing algorithms Merging the results is a bottleneck Meta search engines Creating large collections with judgements Is recall important?

4 Problem Recall image retrieval: Find images similar in color, size,... Find photos of Korean President? Find nice girls? (Don't show ugly ones!) Looks very stupid Lacks understanding Too difficult Text retrieval is no exception Find stories with sad beginning and happy end? Lacks understanding Difficult but possible

5 Possible? Text is intended to facilitate understanding Supposedly, even partial understanding should help Degrees of understanding: Character strings (what is used now): well, geese, him Words (often used now): goose, he Concepts: hole in the ground (well), Roh Moo-Hyun Complex concepts: oil well, hot dog Situations (sentences, paragraphs) The story (direct meaning) The message (pragmatics, intended impact)

6 Easy? Main problems: Multiple ways to say the same Query does not match the doc Difficult to specify all variants Ambiguity of the text False alarms in matching Lack of implicit knowledge of the computer The computer does not understand the message Difficult to make inferences Natural Language Processing tries to solve them

7 Solutions Multiple ways to say the same? Normalizing: transforming to a standard variant Ambiguity of the text? Ambiguity resolution Normalizing to one of the variants Perhaps the main problem in natural language processing Lack of implicit knowledge of the computer? Dictionaries, grammars Knowledge of language structure is needed in all tasks Knowledge of the world is useful for advanced tasks Knowledge of language use is a substitute

8 Synonymy Multiple ways to say the same Or at least when the difference does not matter Can be substituted in any (many?) context Lexical synonymy Woman / female, professor / teacher Dictionaries Phrase-level or sentence-level synonymy They gave me a book / I was given a book by them Syntactic analyzers Semantic-level synonymy Reasoning

9 Not only synonymy Multiple ways to say the same (synonymy) less: more general (hypernymy) more: more specific (hyponymy) Complete synonyms are rare professor teacher Abbreviations are usually (almost) complete synonyms When the differences do not matter, can be treated as synonymy But: different data structures and methods

10 Lexical-level synonymy Lexical synonymy Woman / female Mixed-type synonymy: USA / United States Morphology is a kind of synonymy (actually hyponymy) geese = goose + many Russian knigu = kniga + accusative role The second part of the meaning is either not important or is another term Morphology is a very common problem in IR

11 Lexical synonymy Woman / female Dictionaries Synonym dictionaries WordNet Automatic learning of synonymy Clustering of contexts If the contexts are very similar, then possible synonyms Problem: preserves meaning? Monday / Tuesday An interesting solution: compare dictionary definitions

12 Uses in IR Query expansion Add synonyms of the word to the query and process normally Flexible, slow Best for lexical synonymy: few synonyms, doubtful Reducing at index time When reading the documents, reduce each word to a standard synonym Fast, rigid Best for morphology: many synonyms, less doubtful Hierarchical indexing
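The two strategies on this slide can be contrasted in a minimal sketch. The synonym table, the canonical-form table, and all function names below are hypothetical toy data for illustration, not part of the lecture; a real system would draw them from a dictionary such as WordNet or from a morphological analyzer.

```python
from collections import defaultdict

# Hypothetical toy synonym table (assumption, for illustration only).
SYNONYMS = {"woman": {"female"}, "female": {"woman"}}
# Hypothetical map from word forms to a standard variant (assumption).
CANONICAL = {"geese": "goose", "cats": "cat"}

def build_index(docs, normalize=lambda w: w):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return index

def search_expanded(index, word):
    """Query expansion: look up the word and each of its synonyms."""
    result = set()
    for term in {word} | SYNONYMS.get(word, set()):
        result |= index.get(term, set())
    return result

def search_reduced(index, word):
    """Index-time reduction: normalize the query word like the documents."""
    return set(index.get(CANONICAL.get(word, word), set()))

docs = {1: "a woman with geese", 2: "female goose", 3: "cats"}
plain_index = build_index(docs)                                   # words stored as written
reduced_index = build_index(docs, lambda w: CANONICAL.get(w, w))  # words reduced at index time
```

With expansion the index is untouched and the extra work happens per query (flexible, slow); with reduction the extra work happens once while reading the documents (fast, rigid).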

13 Hierarchical indexing (Gelbukh, Sidorov, Guzman-Arenas 2002)
Tree of concepts:
Living things
  Animals
    1. a. cat, b. cats
    2. a. dog, b. dogs
  Persons
    3. a. professor, b. professors
    4. a. student, b. students
Order the vocabulary by the order of the leaves of the tree.
Query expansion is done by ranges: cat: 1, living things: 1-4
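A minimal sketch of the range idea, assuming the concept tree is given as nested tuples (this representation and the names `TREE`, `number_leaves`, and `matches` are illustrative assumptions, not taken from the cited paper). Leaves are numbered in depth-first order, so every concept covers one contiguous range of leaf numbers.

```python
# Hypothetical concept tree mirroring the slide; a leaf holds its word forms.
TREE = ("living things", [
    ("animals", [("cat", ["cat", "cats"]), ("dog", ["dog", "dogs"])]),
    ("persons", [("professor", ["professor", "professors"]),
                 ("student", ["student", "students"])]),
])

def number_leaves(node, ranges, word_ids, counter=None):
    """Depth-first walk: number the leaves, record a (lo, hi) range per concept."""
    if counter is None:
        counter = [0]
    name, children = node
    if children and isinstance(children[0], str):   # leaf: a list of word forms
        counter[0] += 1
        for form in children:
            word_ids[form] = counter[0]
        ranges[name] = (counter[0], counter[0])
    else:                                           # inner node: covers its leaves
        lo = counter[0] + 1
        for child in children:
            number_leaves(child, ranges, word_ids, counter)
        ranges[name] = (lo, counter[0])
    return ranges, word_ids

RANGES, WORD_IDS = number_leaves(TREE, {}, {})

def matches(concept, word):
    """A word matches a query concept if its leaf number falls in the range."""
    lo, hi = RANGES[concept]
    return lo <= WORD_IDS[word] <= hi
```

Here `RANGES["cat"]` is `(1, 1)` and `RANGES["living things"]` is `(1, 4)`, as on the slide, so expanding a general query term costs one range comparison instead of a list of synonyms.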

14 Morphology One of the large concerns in IR Can be done precisely or approximately (quick-and-dirty) Level of generalization: inflection: student – students; derivation: study – student Ambiguity: all variants or one variant

15... morphology Result is The unique ID The dictionary form A stem: part of the same string

16 Morphological analyzers Precise analysis Ambiguous Give all variants Tables: "to table" or "the table"? Spanish charlas: charla "talk" or charlar "to talk" Russian dush: dush "shower" or dusha "soul" Common in languages with developed morphology For short words, some 3 – 5 – 10 variants Dictionaries are used

17 Morphological system
Dictionary specifies:
  Stem: bak-, ask-
  POS (part of speech): verb
  Inflection class (what endings it accepts): 1, 2
Tables of endings specify:
  Paradigms:
    1. -e, -es, -ed, -ed, -ing
    2. -, -s, -ed, -ed, -ing
  Meanings: participle, ...

18... morphological system Algorithm Decompose the word into an existing stem and ending Check compatibility of stem and ending Give the stem ID and ending meaning Ambiguous Many variants of decompositions Many stems with different IDs Many endings with different meaning -ed: past or participle Problem: words absent in dictionary
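The algorithm above can be sketched with the toy dictionary from the previous slide. The stems and paradigms come from the slides; the numeric stem IDs and the meaning labels other than "participle" are assumptions added for illustration.

```python
# Toy dictionary: stem -> (stem id, part of speech, inflection class).
# The stem ids are hypothetical.
STEMS = {"bak": (1, "verb", 1), "ask": (2, "verb", 2)}

# Paradigm tables: inflection class -> list of (ending, meaning).
# Meaning labels are assumptions; the slide lists only "participle, ...".
PARADIGMS = {
    1: [("e", "base"), ("es", "3sg"), ("ed", "past"),
        ("ed", "participle"), ("ing", "gerund")],
    2: [("", "base"), ("s", "3sg"), ("ed", "past"),
        ("ed", "participle"), ("ing", "gerund")],
}

def analyze(word):
    """Return all (stem id, POS, meaning) analyses; [] means an unknown word."""
    analyses = []
    for cut in range(1, len(word) + 1):        # try every stem + ending split
        stem, ending = word[:cut], word[cut:]
        if stem not in STEMS:
            continue
        stem_id, pos, infl_class = STEMS[stem]
        for end, meaning in PARADIGMS[infl_class]:
            if end == ending:                  # stem and ending are compatible
                analyses.append((stem_id, pos, meaning))
    return analyses
```

`analyze("baked")` returns two analyses (past and participle), which illustrates the ambiguity of -ed; a word whose stem is not in the dictionary returns an empty list, which is the unknown-word problem mentioned on the slide.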

19 Stemming Substitute for real analysis Both inflection and derivation Quick-and-dirty Only one variant Result: a part of the string gene, genial -> gen- Cheap development, bad results, simple description. Standard Often used in academic research Used to be used in real systems, but now less

20 Porter stemmer Martin Porter, 1980 Standard stemmer Provides equal basis for evaluation of different IR programs
Uses measure m: every word has the form [C](VC){m}[V], where C is a sequence of consonants and V is a sequence of vowels.
m=0: TR, EE, TREE, Y, BY
m=1: TROUBLE, OATS, TREES, IVY
m=2: TROUBLES, PRIVATE, OATEN, ORRERY
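A minimal sketch of computing the measure m. It follows Porter's definition of a consonant (any letter other than a, e, i, o, u, and other than y preceded by a consonant); the function names are illustrative.

```python
def is_consonant(word, i):
    """Porter's rule: a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Count the VC pairs in the form [C](VC){m}[V]."""
    word = word.lower()
    m = 0
    prev_vowel = False
    for i in range(len(word)):
        cons = is_consonant(word, i)
        if cons and prev_vowel:
            m += 1            # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m
```

For example, `measure("TREE")` is 0, `measure("TROUBLE")` is 1, and `measure("ORRERY")` is 2, matching the examples on the slide.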

21... Porter stemmer Step 1a
SSES -> SS    caresses -> caress
IES -> I    ponies -> poni, ties -> ti
SS -> SS    caress -> caress
S ->    cats -> cat
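Step 1a needs no conditions, so it can be sketched directly. Only the first matching rule is applied, with longer suffixes tested first; the function name is illustrative.

```python
def step_1a(word):
    """Porter step 1a: apply only the first (longest) matching suffix rule."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (protects "caress" from the S rule)
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word
```

The SS -> SS rule changes nothing by itself; it exists so that the final S rule does not fire on words such as caress.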

22... Porter stemmer Step 1b
(m>0) EED -> EE    feed -> feed, agreed -> agree
(*v*) ED ->    plastered -> plaster, bled -> bled
(*v*) ING ->    motoring -> motor, sing -> sing

23... Porter stemmer If the 2nd or 3rd rule of Step 1b is successful:
AT -> ATE    conflat(ed) -> conflate
BL -> BLE    troubl(ed) -> trouble
IZ -> IZE    siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter    hopp(ing) -> hop, tann(ed) -> tan, fall(ing) -> fall, hiss(ing) -> hiss, fizz(ed) -> fizz
(m=1 and *o) -> E    fail(ing) -> fail, fil(ing) -> file
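Step 1b and its follow-up rules can be sketched as below. The conditions follow Porter's definitions: *v* means the stem contains a vowel, *d means the stem ends in a double consonant, and *o means the stem ends consonant-vowel-consonant where the last consonant is not w, x, or y. The function names are illustrative, and this is a sketch of these two slides only, not a complete stemmer.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC pairs in [C](VC){m}[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):                       # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):           # condition *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                        # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def cleanup(stem):
    """Follow-up rules, applied only after ED or ING was removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"                  # AT -> ATE, BL -> BLE, IZ -> IZE
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]                   # double consonant -> single letter
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"                  # (m=1 and *o) -> E
    return stem

def step_1b(word):
    if word.endswith("eed"):               # longest suffix is tested first
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if has_vowel(stem) else word
    return word
```

The cleanup restores a plausible stem after the suffix is stripped: filing loses -ing to give fil, which then gets its E back, while failing gives fail, which does not satisfy *o and stays as it is.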

24... Porter stemmer Step 1c
(*v*) Y -> I    happy -> happi, sky -> sky

25... Porter stemmer Step 2
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE    valenci -> valence
(m>0) ANCI -> ANCE    hesitanci -> hesitance
(m>0) IZER -> IZE    digitizer -> digitize
(m>0) ABLI -> ABLE    conformabli -> conformable
(m>0) ALLI -> AL    radicalli -> radical
(m>0) ENTLI -> ENT    differentli -> different
(m>0) ELI -> E    vileli -> vile
(m>0) OUSLI -> OUS    analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE    predication -> predicate
(m>0) ATOR -> ATE    operator -> operate
(m>0) ALISM -> AL    feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL    formaliti -> formal
(m>0) IVITI -> IVE    sensitiviti -> sensitive
(m>0) BILITI -> BLE    sensibiliti -> sensible

26... Porter stemmer Step 3
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) ATIVE ->    formative -> form
(m>0) ALIZE -> AL    formalize -> formal
(m>0) ICITI -> IC    electriciti -> electric
(m>0) ICAL -> IC    electrical -> electric
(m>0) FUL ->    hopeful -> hope
(m>0) NESS ->    goodness -> good

27... Porter stemmer Step 4
(m>1) AL ->    revival -> reviv
(m>1) ANCE ->    allowance -> allow
(m>1) ENCE ->    inference -> infer
(m>1) ER ->    airliner -> airlin
(m>1) IC ->    gyroscopic -> gyroscop
(m>1) ABLE ->    adjustable -> adjust
(m>1) IBLE ->    defensible -> defens
(m>1) ANT ->    irritant -> irrit
(m>1) EMENT ->    replacement -> replac
(m>1) MENT ->    adjustment -> adjust
(m>1) ENT ->    dependent -> depend
(m>1 and (*S or *T)) ION ->    adoption -> adopt
(m>1) OU ->    homologou -> homolog
(m>1) ISM ->    communism -> commun
(m>1) ATE ->    activate -> activ
(m>1) ITI ->    angulariti -> angular
(m>1) OUS ->    homologous -> homolog
(m>1) IVE ->    effective -> effect
(m>1) IZE ->    bowdlerize -> bowdler

28... Porter stemmer Step 5a
(m>1) E ->    probate -> probat, rate -> rate
(m=1 and not *o) E ->    cease -> ceas
Step 5b
(m > 1 and *d and *L) -> single letter    controll -> control, roll -> roll

29 Statistical stemmers Take a list of words Construct a model of language that generates it The best one The simplest one? How to find? List of stems, list of endings Determine their probabilities Usage statistics Decompose any input string into a stem and an ending Take the most probable variant
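A deliberately naive sketch of the idea above, under loud assumptions: the "model" is just independent counts of word prefixes (candidate stems) and suffixes (candidate endings) over a word list, and the score of a split is the product of the two counts. This scoring and the tie-breaking rule are illustrative choices, not the method from the lecture; practical statistical stemmers use more careful models.

```python
from collections import Counter

def train(words):
    """Count every prefix as a candidate stem and every suffix as a candidate ending."""
    stems, endings = Counter(), Counter()
    for w in words:
        for cut in range(1, len(w) + 1):
            stems[w[:cut]] += 1
            endings[w[cut:]] += 1        # the empty ending is a candidate too
    return stems, endings

def stem(word, stems, endings):
    """Pick the split with the highest count product; ties go to the shorter stem."""
    # Raw counts are used instead of probabilities: the normalizing constants
    # are the same for every split, so the best split is unchanged.
    best_cut = max(
        range(1, len(word) + 1),
        key=lambda cut: (stems[word[:cut]] * endings[word[cut:]], -cut),
    )
    return word[:best_cut]

WORDS = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking",
         "jump", "jumps", "jumped", "jumping"]
STEM_COUNTS, ENDING_COUNTS = train(WORDS)
```

On this toy list, `stem("walked", STEM_COUNTS, ENDING_COUNTS)` returns `"walk"`. The result depends on the explicit preference for shorter stems when scores tie, which shows why the questions on the slide ("The best one? The simplest one? How to find?") are the hard part.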

30 Research topics Construction and application of ontologies Building of morphological dictionaries Treatment of unknown words with morphological analyzers Development of better stemmers Statistical stemmers?

31 Conclusions Reducing synonyms can help IR Better matching Ontologies are used. WordNet Morphology is a variant of synonymy widely used in IR systems Precise analysis: dictionary-based analyzers Quick-and-dirty analysis: stemmers Rule-based stemmers. Porter stemmer Statistical stemmers

32 Thank you! Till May 24? 25?, 6 pm