NLP: Text similarity


NLP

Text similarity

People can express the same concept (or related concepts) in many different ways. For example, “the plane leaves at 12pm” vs. “the flight departs at noon”. Text similarity is a key component of Natural Language Processing. If the user is looking for information about cats, we may want the NLP system to return documents that mention kittens even if the word “cat” is not in them. If the user is looking for information about “fruit dessert”, we want the NLP system to return documents about “peach tart” or “apple cobbler”. A speech recognition system should be able to tell the difference between similar-sounding words like the “Dulles” and “Dallas” airports. This set of lectures will teach you how text similarity can be modeled computationally.

[Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin, "Placing Search in Context: The Concept Revisited", ACM Transactions on Information Systems, 20(1), January 2002]
– tiger / cat: 7.35
– tiger / tiger
– book / paper: 7.46
– computer / keyboard: 7.62
– computer / internet: 7.58
– plane / car: 5.77
– train / car: 6.31
– telephone / communication: 7.50
– television / radio: 6.77
– media / radio: 7.42
– drug / abuse: 6.85
– bread / butter: 6.19
– cucumber / potato

[Felix Hill, Roi Reichart and Anna Korhonen, "SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation", preprint published on arXiv]
– delightful / wonderful (A): 8.65
– modest / flexible (A): 0.98
– clarify / explain (V): 8.33
– remind / forget (V): 0.87
– get / remain (V): 1.6
– realize / discover (V): 7.47
– argue / persuade (V): 6.23
– pursue / persuade (V): 3.17
– plane / airport (N): 3.65
– uncle / aunt (N): 5.5
– horse / mare (N): 8.33

Words most similar to “France”, computed using word2vec [Mikolov et al. 2013]:
spain, belgium, netherlands, italy, switzerland, luxembourg, portugal, russia, germany, catalonia (0.534)
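As a rough illustration (not part of the original slides), such neighbour lists can be reproduced with the gensim library, assuming a pretrained word2vec-format vector file is available; the file name below is only a placeholder, and the casing of the query word depends on the model's vocabulary.

```python
# Minimal sketch: nearest neighbours of a word under word2vec embeddings.
# Assumes gensim is installed; "vectors.bin" is a placeholder file name.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Ten nearest neighbours of "france" by cosine similarity.
for word, score in vectors.most_similar("france", topn=10):
    print(f"{word}\t{score:.3f}")
```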

Many types of text similarity exist:
– Morphological similarity (e.g., respect-respectful)
– Spelling similarity (e.g., theater-theatre)
– Synonymy (e.g., talkative-chatty)
– Homophony (e.g., raise-raze-rays)
– Semantic similarity (e.g., cat-tabby)
– Sentence similarity (e.g., paraphrases)
– Document similarity (e.g., two news stories on the same event)
– Cross-lingual similarity (e.g., Japan-Nihon)

Text Similarity

Words with the same root:
– scan (base form)
– scans, scanned, scanning (inflected forms)
– scanner (derived forms, suffixes)
– rescan (derived forms, prefixes)
– rescanned (combinations)

To stem a word is to reduce it to a base form, called the stem, by removing various suffixes and endings and, sometimes, performing additional transformations.
Examples:
– scanned → scan
– indication → indicate
In practice, prefixes are typically preserved, so rescan will not be stemmed to scan.

Porter’s stemming method is a rule-based algorithm introduced by Martin Porter in 1980. The paper (“An algorithm for suffix stripping”) has been cited more than 7,000 times according to Google Scholar. The input is an individual word; the word is then transformed in a series of steps to its stem. The method is not always accurate.

Example 1:
– Input = computational
– Output = comput
Example 2:
– Input = computer
– Output = comput
The two input words end up stemmed the same way.
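These two examples are easy to reproduce with an off-the-shelf implementation; the sketch below uses NLTK's Porter stemmer (assuming NLTK is installed), which implements Porter's algorithm with a few later extensions.

```python
# Reproducing the examples with NLTK's Porter stemmer (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("computational"))  # -> comput
print(stemmer.stem("computer"))       # -> comput
```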

The measure of a word is an indication of the number of syllables in it:
– Each sequence of consonants is denoted by C
– Each sequence of vowels is denoted by V
– The initial C and the final V are optional
– So, each word is represented as [C]VCVC...[V], or [C](VC){m}[V], where m is its measure

m=0: I, AAA, CNN, TO, GLEE
m=1: OR, EAST, BRICK, STREET, DOGMA
m=2: OPAL, EASTERN, DOGMAS
m=3: EASTERNMOST, DOGMATIC
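As a rough sketch (my own code, not from the Porter paper), the measure can be computed by mapping each letter to C or V, collapsing runs, and counting VC transitions; the handling of 'y' is simplified to "a vowel when it follows a consonant".

```python
import re

def measure(word: str) -> int:
    """Approximate Porter measure m: the number of VC sequences in [C](VC){m}[V].
    Simplification: 'y' counts as a vowel whenever it follows a consonant."""
    cv = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and cv[-1] == "c"):
            cv.append("v")
        else:
            cv.append("c")
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(cv))  # e.g. "ccvvcc" -> "cvc"
    return collapsed.count("vc")

for w in ["glee", "brick", "dogma", "dogmas", "easternmost"]:
    print(w, measure(w))  # 0, 1, 1, 2, 3
```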

The initial word is then checked against a sequence of transformation patterns, in order. An example pattern is:
– (m>0) ATION -> ATE    medication -> medicate
Note that this pattern matches medication and dedication, but not nation (the stem preceding -ation must have measure m > 0). Whenever a pattern matches, the word is transformed and the algorithm restarts from the beginning of the list of patterns with the transformed word. If no pattern matches, the algorithm stops and outputs the most recently transformed version of the word.
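A minimal sketch of checking one such conditional rule, reusing the measure() helper sketched above (the function name and structure are my own, not from the Porter paper):

```python
def apply_rule(word: str, suffix: str, replacement: str, min_measure: int):
    """Return the transformed word if 'suffix -> replacement' fires, i.e. the word
    ends in the suffix and the remaining stem has measure > min_measure; else None."""
    if not word.endswith(suffix):
        return None
    stem = word[: -len(suffix)]
    if measure(stem) > min_measure:  # measure() as sketched above
        return stem + replacement
    return None

print(apply_rule("medication", "ation", "ate", 0))  # medicate
print(apply_rule("dedication", "ation", "ate", 0))  # dedicate
print(apply_rule("nation", "ation", "ate", 0))      # None: stem 'n' has m = 0
```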

Step 1a
– SSES -> SS    presses -> press
– IES -> I    lies -> li
– SS -> SS    press -> press
– S -> ø    lots -> lot

Step 1b
– (m>0) EED -> EE    refereed -> referee
(doesn’t apply to bleed, since m(‘BL’)=0)
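Step 1a carries no measure condition, so it can be sketched as an ordered suffix table where the first matching rule wins (again my own illustrative code, not the reference implementation):

```python
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word: str) -> str:
    """Apply the first Step 1a rule whose suffix matches; otherwise leave the word unchanged."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["presses", "lies", "press", "lots"]:
    print(w, "->", step_1a(w))  # press, li, press, lot
```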

Step 2
– (m>0) ATIONAL -> ATE    inflational -> inflate
– (m>0) TIONAL -> TION    notional -> notion
– (m>0) IZER -> IZE    nebulizer -> nebulize
– (m>0) ENTLI -> ENT    intelligentli -> intelligent
– (m>0) OUSLI -> OUS    analogousli -> analogous
– (m>0) IZATION -> IZE    realization -> realize
– (m>0) ATION -> ATE    predication -> predicate
– (m>0) ATOR -> ATE    indicator -> indicate
– (m>0) IVENESS -> IVE    attentiveness -> attentive
– (m>0) ALITI -> AL    realiti -> real
– (m>0) BILITI -> BLE    abiliti -> able

Step 3
– (m>0) ICATE -> IC    replicate -> replic
– (m>0) ATIVE -> ø    informative -> inform
– (m>0) ALIZE -> AL    realize -> real
– (m>0) ICAL -> IC    electrical -> electric
– (m>0) FUL -> ø    blissful -> bliss
– (m>0) NESS -> ø    tightness -> tight

Step 4
– (m>1) AL -> ø    appraisal -> apprais
– (m>1) ANCE -> ø    conductance -> conduct
– (m>1) ER -> ø    container -> contain
– (m>1) IC -> ø    electric -> electr
– (m>1) ABLE -> ø    countable -> count
– (m>1) IBLE -> ø    irresistible -> irresist
– (m>1) EMENT -> ø    displacement -> displac
– (m>1) MENT -> ø    investment -> invest
– (m>1) ENT -> ø    respondent -> respond
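The later steps all have the same shape: an ordered table of suffix rules guarded by a measure condition, of which at most one fires. Below is a sketch of part of Step 2 encoded as data, reusing the measure() and apply_rule() helpers sketched earlier; listing longer suffixes before shorter ones approximates Porter's longest-match behaviour. The rule subset and function names are mine, not the full table.

```python
# A subset of the Step 2 rules; every rule carries the (m > 0) condition.
STEP_2 = [("ational", "ate"), ("tional", "tion"), ("ization", "ize"),
          ("izer", "ize"), ("ation", "ate"), ("ator", "ate")]

def step_2(word: str) -> str:
    """Apply the first matching Step 2 rule whose stem has measure m > 0."""
    for suffix, replacement in STEP_2:
        if word.endswith(suffix):
            result = apply_rule(word, suffix, replacement, 0)  # helper from earlier sketch
            return result if result is not None else word
    return word

for w in ["inflational", "realization", "indicator", "nation"]:
    print(w, "->", step_2(w))  # inflate, realize, indicate, nation
```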

Example 1:
– Input = computational
– Step 2: replace ational with ate: computate
– Step 4: replace ate with ø: comput
– Output = comput
Example 2:
– Input = computer
– Step 4: replace er with ø: comput
– Output = comput
The two input words end up stemmed the same way.

– Online demo
– Martin Porter’s official site

How will the Porter stemmer stem these words? Check the Porter paper (or the code for the stemmer) in order to answer these questions. Is the output what you expected? If not, explain why.
– construction?
– increasing?
– unexplained?
– differentiable?

– construction → construct
– increasing → increas
– unexplained → unexplain
– differentiable → differenti
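One way to check these answers (outside the original exercise) is to run the words through NLTK's Porter stemmer; note that NLTK's version adds a few extensions to the 1980 algorithm, so occasional outputs can differ from the paper.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["construction", "increasing", "unexplained", "differentiable"]:
    # Expected, per the answers above: construct, increas, unexplain, differenti
    print(w, "->", stemmer.stem(w))
```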

Thorny Stems, NACLO 2008 problem by Eric Breck

Thorny Stems

NLP