Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside, CA 92521

Information Retrieval Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries. The assumption that this is possible underlies the field of Information Retrieval.

[Diagram: the IR pipeline. An information need is expressed as a query (text input), which is parsed and pre-processed; the document collections are likewise parsed, pre-processed, and indexed; documents are ranked against the query and the results are evaluated. Key questions: How is the query constructed? How is the text processed?]

Terminology Token: a natural language word, e.g. “Swim”, “Simpson”, “92513”. Document: usually a web page, but more generally any file.

Some IR History –Roots in the scientific “Information Explosion” following WWII –Interest in computer-based IR from the mid-1950s: H.P. Luhn at IBM (1958) Probabilistic models at Rand (Maron & Kuhns) (1960) Boolean system development at Lockheed (‘60s) Vector Space Model (Salton at Cornell 1965) Statistical weighting methods and theoretical advances (‘70s) Refinements and advances in application (‘80s) User interfaces, large-scale testing and application (‘90s)

Relevance In what ways can a document be relevant to a query? –Answer precise question precisely. –Who is Homer’s Boss? Montgomery Burns. –Partially answer question. –Where does Homer work? Power Plant. –Suggest a source for more information. –What is Bart’s middle name? Look in Issue 234 of Fanzine –Give background information. –Remind the user of other knowledge. –Others...

[Diagram: the IR pipeline, as before.] The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form)

Stemming and Morphological Analysis Goal: “normalize” similar words Morphology (“form” of words) –Inflectional Morphology E.g., inflected verb endings and noun number Never changes grammatical class –dog, dogs –Bike, Biking –Swim, Swimmer, Swimming What about… build, building?

Examples of Stemming (using Porter’s algorithm): consign, consigned, consigning, consignment all stem to consign; consist, consisted, consistency, consistent, consistently, consisting, consists all stem to consist. Porter’s algorithm is available in Java, C, Lisp, Perl, Python etc. from ~martin/PorterStemmer/
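
As a concrete illustration, the same behaviour can be reproduced with any off-the-shelf Porter implementation; the sketch below assumes the NLTK Python package is installed, which is not part of the original slides:

```python
# A minimal sketch of Porter stemming, assuming the NLTK package is
# installed (pip install nltk); any Porter implementation behaves similarly.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["consign", "consigned", "consigning", "consignment",
          "consist", "consisted", "consistency", "consistent"]:
    print(w, "->", stemmer.stem(w))   # e.g. "consigned" -> "consign"
```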

Errors Generated by the Porter Stemmer (Krovetz 93) Homework!! Play with the following URL

Statistical Properties of Text Token occurrences in text are not uniformly distributed They are also not normally distributed They do exhibit a Zipf distribution

Most frequent tokens (with counts): the 8164, of 4771, to 4005, a 2834, and 2827, in 2802, The 1592, for 1370, is 1326, s 1324, that 1194, by 973, on 969, FT 915, Mr 883, was 860, be 855, Pounds 849, TEXT 798, PUB 798, PROFILE 798, PAGE 798, HEADLINE 798, DOCNO 798. Tokens occurring once: ABC, ABFT, ABOUT, ACFT, ACI, ACQUI, ACQUISITIONS, ACSIS, ADFT, ADVISERS, AE, … (Government documents collection: number of documents, tokens, unique tokens)

Plotting Word Frequency by Rank Main idea: count how many times each token occurs, over all texts in the collection, then order the tokens by how often they occur. A token’s position in this ordering is called its rank.

Rank / Freq / Term: 1 37 system, 2 32 knowledg, 3 24 base, 4 20 problem, 5 18 abstract, 6 15 model, 7 15 languag, 8 15 implem, 9 13 reason, 10–18: inform, expert, analysi, rule, program, oper, evalu, comput, case, 19 9 gener, 20 9 form. [Figure: the corresponding Zipf curve.]

Zipf Distribution The Important Points: –a few elements occur very frequently –a medium number of elements have medium frequency –many elements occur very infrequently

Zipf Distribution The product of the frequency of words (f) and their rank (r) is approximately constant –Rank = order of words’ frequency of occurrence Another way to state this is with an approximately correct rule of thumb: –Say the most common term occurs C times –The second most common occurs C/2 times –The third most common occurs C/3 times –…
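
As a quick check (a sketch, not part of the slides), rank × frequency can be computed directly from a token list; on a real corpus the products stay roughly constant:

```python
# Count token frequencies, sort by frequency, and print rank * frequency,
# which Zipf's law predicts to be roughly constant.
from collections import Counter

def zipf_check(tokens, top=20):
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(top), 1):
        print(f"{rank:>3} {freq:>6} {word:<12} rank*freq = {rank * freq}")

# Toy usage; replace with the tokens of any large text collection.
zipf_check("the cat sat on the mat and the dog sat on the cat".split())
```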

[Figure: Zipf distribution plotted on linear and log scales. Illustration by Jacob Nielsen.]

What Kinds of Data Exhibit a Zipf Distribution? Words in a text collection –Virtually any language usage Library book checkout patterns Incoming Web Page Requests Outgoing Web Page Requests Document Size on Web City Sizes …

Consequences of Zipf There are always a few very frequent tokens that are not good discriminators. –Called “stop words” in IR English examples: to, from, on, and, the,... There are always a large number of tokens that occur once and can mess up algorithms. Medium-frequency words are the most descriptive.

Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.

Statistical Independence Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together: P(x, y) = P(x) · P(y).

Lexical Associations Subjects write first word that comes to mind –doctor/nurse; black/white (Palermo & Jenkins 64) Text Corpora yield similar associations One measure: Mutual Information (Church and Hanks 89) If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
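
A small sketch of the idea (the toy probabilities below are assumptions, not figures from the AP corpus):

```python
import math

def pmi(p_x, p_y, p_xy):
    # If x and y were independent, p_xy would equal p_x * p_y and the
    # ratio (hence the PMI) would be ~0; associated pairs score well above 0.
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: "doctor" and "nurse" each appear in 1 of every
# 10,000 windows, but co-occur 50x more often than independence predicts.
print(pmi(1e-4, 1e-4, 5e-7))   # ~5.6 bits of association
```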

Statistical Independence [Diagram: co-occurrence is computed over a sliding window of words, w1 … w21.]

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Associations Are Important Because… We may be able to discover phrases that should be treated as a single word, e.g. “data mining”. We may be able to automatically discover synonyms, e.g. “Bike” and “Bicycle”.

Content Analysis Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies Text documents are transformed to vectors –Pre-processing includes tokenization, stemming, collocations/phrases

[Diagram: the IR pipeline, as before, now highlighting the question: How is the index constructed?] The section that follows is about Index Construction

Inverted Index This is the primary data structure for text indexes Main Idea: –Invert documents into a big index Basic steps: –Make a “dictionary” of all the tokens in the collection –For each token, list all the docs it occurs in. –Do a few things to reduce redundancy in the data structure

Inverted Indexes We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

How Inverted Files Are Created Documents are parsed to extract tokens. These are saved with the Document ID. Doc 1: “Now is the time for all good men to come to the aid of their country” Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”

How Inverted Files are Created After all documents have been parsed the inverted file is sorted alphabetically.

How Inverted Files are Created Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

How Inverted Files are Created Then the file can be split into –A Dictionary file and –A Postings file

How Inverted Files are Created [Figure: the resulting Dictionary file and Postings file.]

Inverted Indexes Permit fast search for individual terms For each term, you get a list consisting of: –document ID –frequency of term in doc (optional) –position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms

How Inverted Files are Used Query on “time” AND “dark”: 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with “dark” in the dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query. [Figure: the dictionary and postings entries involved.]
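
A minimal sketch of this whole pipeline, using the two example documents from the earlier slide (the data-structure details are assumptions, not the slides’ exact layout):

```python
# Build an inverted index (term -> postings of (doc_id, frequency)) and use
# it to answer the Boolean query "time" AND "dark".
from collections import defaultdict, Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().replace(".", "").split()).items():
        index[term].append((doc_id, freq))

def boolean_and(t1, t2):
    # Intersect the document-id sets from the two postings lists.
    return {d for d, _ in index[t1]} & {d for d, _ in index[t2]}

print(index["time"])                 # [(1, 1), (2, 1)]
print(boolean_and("time", "dark"))   # {2} -- only doc 2 satisfies the query
```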

[Diagram: the IR pipeline, as before.] The section that follows is about Querying (and ranking)

Simple query language: Boolean –Terms + Connectors (or operators) –terms: words, normalized (stemmed) words, phrases –connectors: AND, OR, NOT, NEAR (Pseudo Boolean) [Table: word-by-document incidence for the terms Cat, Dog, Collar, Leash.]

Boolean Queries: Cat; Cat OR Dog; Cat AND Dog; (Cat AND Dog); (Cat AND Dog) OR Collar; (Cat AND Dog) OR (Collar AND Leash); (Cat OR Dog) AND (Collar OR Leash)

Boolean Searching Information need: “Measurement of the width of cracks in prestressed concrete beams” Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete [Venn diagram of the four concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P).] Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Ordering of Retrieved Documents Pure Boolean has no ordering In practice: –order chronologically –order by total number of “hits” on query terms What if one term has more hits than others? Is it better to have one hit on each term or many hits on one term?

Boolean Model Advantages –simple queries are easy to understand –relatively easy to implement Disadvantages –difficult to specify what is wanted –too much returned, or too little –ordering not well determined Dominant language in commercial Information Retrieval systems until the WWW Since the Boolean model is limited, let’s consider a generalization…

Vector Model Documents are represented as “bags of words” Represented as vectors when used computationally –A vector is like an array of floating-point numbers –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse Smithers secretly loves Monty Burns Monty Burns secretly loves Smithers Both map to… [ Burns, loves, Monty, secretly, Smithers]
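
A small sketch of the bag-of-words mapping (the vocabulary here is just the five terms from the example):

```python
from collections import Counter

vocab = ["Burns", "loves", "Monty", "secretly", "Smithers"]

def to_vector(text):
    counts = Counter(text.split())          # word order is discarded
    return [counts[term] for term in vocab]

print(to_vector("Smithers secretly loves Monty Burns"))   # [1, 1, 1, 1, 1]
print(to_vector("Monty Burns secretly loves Smithers"))   # [1, 1, 1, 1, 1]
```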

Document Vectors One location for each word [Table: document–term matrix over the terms nova, galaxy, heat, h’wood, film, role, diet, fur for document ids A–I.]

We Can Plot the Vectors [Figure: documents plotted in a 2-D space with axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, a doc about mammal behavior.]

Documents in 3D Vector Space [Figure: documents D1–D11 plotted against term axes t1, t2, t3. Illustration from Jurafsky & Martin.]

Vector Space Model Note that the query is projected into the same vector space as the documents. The query here is for “Marge”. We can use a vector similarity model to determine the best match to our query (details in a few slides). But what weights should we use for the terms?

Assigning Weights to Terms Binary Weights Raw term frequency tf x idf –Recall the Zipf distribution –Want to weight terms highly if they are frequent in relevant documents … BUT infrequent in the collection as a whole

Binary Weights Only the presence (1) or absence (0) of a term is included in the vector We have already seen and discussed this model.

Raw Term Weights The frequency of occurrence for the term in each document is included in the vector This model is open to exploitation by websites… sex sex sex sex sex sex sex sex sex sex Counts can be normalized by document lengths.

tf * idf Weights tf * idf measure: –term frequency (tf) –inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution Goal: assign a tf * idf weight to each term in each document

tf * idf

Inverse Document Frequency IDF provides high values for rare words and low values for common words. For a collection of documents:
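
The formula itself did not survive the transcript; a standard formulation consistent with this description (the slide’s exact notation is an assumption) is:

```latex
% N   = number of documents in the collection
% n_k = number of documents containing term k
% Rare terms (small n_k) get a large idf; a term in every document gets idf = 0.
\mathrm{idf}_k = \log\left(\frac{N}{n_k}\right)
```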

Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient

Cosine

Vector Space Similarity Measure
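
The similarity formula is likewise missing from the transcript. Below is a minimal Python sketch of standard tf*idf weighting plus cosine similarity; the toy collection statistics are assumptions for illustration only:

```python
import math

def tfidf_vector(term_freqs, doc_freqs, n_docs, vocab):
    # weight = tf * log(N / df): terms frequent in this document but rare
    # in the collection get the highest weights.
    return [term_freqs.get(t, 0) * math.log(n_docs / doc_freqs[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy example: a 10-document collection and a 3-term vocabulary.
vocab = ["star", "diet", "fur"]
doc_freqs = {"star": 5, "diet": 3, "fur": 2}
d_astronomy = tfidf_vector({"star": 4}, doc_freqs, 10, vocab)
d_mammals = tfidf_vector({"diet": 2, "fur": 3}, doc_freqs, 10, vocab)
query = tfidf_vector({"star": 1}, doc_freqs, 10, vocab)
print(cosine(query, d_astronomy), cosine(query, d_mammals))  # 1.0 and 0.0
```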

Problems with Vector Space There is no real theoretical basis for the assumption of a term space –it is more for visualization than having any real basis –most similarity measures work about the same regardless of model Terms are not really orthogonal dimensions –Terms are not independent of all other terms

Probabilistic Models Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) Rely on accurate estimates of probabilities

Relevance Feedback Main Idea: –Modify existing query based on relevance judgements Query Expansion: extract terms from relevant documents and add them to the query Term Re-weighting: and/or re-weight the terms already in the query –Two main approaches: Automatic (pseudo-relevance feedback) Users select relevant documents –Users/system select terms from an automatically-generated list

Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query. Suppose you are interested in bovine agriculture on the banks of the river Jordan… [Diagram: the feedback loop Search → Display Results → Gather Feedback → Update Weights. The term vector [Jordan, Bank, Bull, River] starts with weights [1, 1, 1, 1] and, after feedback, ends with weights [1.1, 0.1, 1.3, 1.2].]

Rocchio Method
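
The formula on this slide is not preserved in the transcript; the standard Rocchio formulation (the exact constants and notation are an assumption) is:

```latex
% Q_0: original query vector; D_r: relevant documents judged by the user;
% D_n: non-relevant documents; alpha, beta, gamma: tuning weights.
Q_{new} = \alpha\, Q_0
        + \frac{\beta}{|D_r|}\sum_{d_j \in D_r} d_j
        - \frac{\gamma}{|D_n|}\sum_{d_j \in D_n} d_j
```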

Rocchio Illustration Although we usually work in vector space for text, it is easier to visualize in Euclidean space. [Figure: three panels — Original Query, Term Re-weighting, Query Expansion. Note that both the location of the center and the shape of the query have changed.]

Rocchio Method Rocchio automatically –re-weights terms –adds in new terms (from relevant docs) Most methods perform similarly –results heavily dependent on test collection Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback Known to improve results People don’t seem to like giving feedback!

[Diagram: the IR pipeline, as before.] The section that follows is about Evaluation

Evaluation Why Evaluate? What to Evaluate? How to Evaluate?

Why Evaluate? Determine if the system is desirable Make comparative assessments

What to Evaluate? How much of the information need is satisfied. How much was learned about a topic. Incidental learning: –How much was learned about the collection. –How much was learned about other topics. How inviting the system is.

What to Evaluate? What can be measured that reflects users’ ability to use the system? (Cleverdon 66) –Coverage of Information –Form of Presentation –Effort required/Ease of Use –Time and Space Efficiency –Recall: proportion of relevant material actually retrieved –Precision: proportion of retrieved material actually relevant (recall and precision together measure effectiveness)

Relevant vs. Retrieved [Venn diagram: the set of relevant docs and the set of retrieved docs within the collection of all docs.]

Precision vs. Recall [Venn diagram as before. Precision = |relevant ∩ retrieved| / |retrieved|; Recall = |relevant ∩ retrieved| / |relevant|.]

Why Precision and Recall? Intuition: Get as much good stuff while at the same time getting as little junk as possible.

Retrieved vs. Relevant Documents [Figure: retrieved set vs. relevant set — very high precision, very low recall.]

Retrieved vs. Relevant Documents [Figure: retrieved set vs. relevant set — very low precision, very low recall (0 in fact).]

Retrieved vs. Relevant Documents [Figure: retrieved set vs. relevant set — high recall, but low precision.]

Retrieved vs. Relevant Documents [Figure: retrieved set vs. relevant set — high precision, high recall (at last!).]

Precision/Recall Curves There is a tradeoff between Precision and Recall, so measure Precision at different levels of Recall. Note: this is an AVERAGE over MANY queries. [Figure: precision plotted against recall, points along a downward-sloping curve.]

Precision/Recall Curves Difficult to determine which of these two hypothetical results is better: [Figure: two crossing precision–recall curves.]

Document Cutoff Levels Another way to evaluate: –Fix the number of documents retrieved at several levels: top 5 top 10 top 20 top 50 top 100 top 500 –Measure precision at each of these levels –Take (weighted) average over results This is a way to focus on how well the system ranks the first k documents.
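
A small sketch of precision at fixed cutoff levels for a single ranked result list (the ranking and relevance judgments below are made-up toy data):

```python
# Precision at cutoff k: fraction of the top-k retrieved documents that
# are judged relevant.
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

ranking = [3, 17, 5, 21, 9, 2, 40, 8, 11, 6]   # system's top-10 ranking
relevant = {3, 5, 9, 11}                        # judged-relevant docs
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))   # 0.6 at 5, 0.4 at 10
```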

Problems with Precision/Recall Can’t know true recall value –except in small collections Precision/Recall are related –A combined measure sometimes more appropriate Assumes batch mode –Interactive IR is important and has different criteria for successful searches –Assumes a strict rank ordering matters.

Relation to Contingency Table [Table: Doc is retrieved AND relevant: a; retrieved AND NOT relevant: b; NOT retrieved AND relevant: c; NOT retrieved AND NOT relevant: d.] Accuracy: (a+d) / (a+b+c+d) Precision: a/(a+b) Recall: a/(a+c) Why don’t we use Accuracy for IR? –(Assuming a large collection) –Most docs aren’t relevant –Most docs aren’t retrieved –Inflates the accuracy value
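
A tiny sketch of these measures using the cell names a, b, c, d from the table, including why accuracy misleads on large, mostly non-relevant collections:

```python
def precision(a, b):       # retrieved-and-relevant / all retrieved
    return a / (a + b)

def recall(a, c):          # retrieved-and-relevant / all relevant
    return a / (a + c)

def accuracy(a, b, c, d):  # correct decisions / all documents
    return (a + d) / (a + b + c + d)

print(precision(30, 20), recall(30, 70))   # 0.6 precision, 0.3 recall
# A system that retrieves nothing from a 1,000,000-doc collection with
# only 10 relevant docs still scores 99.999% accuracy.
print(accuracy(0, 0, 10, 999_990))
```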

The E-Measure Combine Precision and Recall into one number (van Rijsbergen 79) P = precision R = recall b = measure of relative importance of P or R For example, b = 0.5 means user is twice as interested in precision as recall
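
The slide’s formula is not preserved here; van Rijsbergen’s E-measure is usually written as follows (the exact form used on the slide is an assumption):

```latex
% P = precision, R = recall, b = relative importance of P vs. R.
% b = 1 weights them equally; b < 1 emphasizes precision; lower E is better.
E = 1 - \frac{(1 + b^2)\, P\, R}{b^2\, P + R}
```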

How to Evaluate? Test Collections

TREC Text REtrieval Conference/Competition –Run by NIST (National Institute of Standards & Technology) –2004 (November) will be the 13th year Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs –Newswire & full text news (AP, WSJ, Ziff, FT) –Government documents (Federal Register, Congressional Record) –Radio Transcripts (FBIS) –Web “subsets”

TREC (cont.) Queries + Relevance Judgments –Queries devised and judged by “Information Specialists” –Relevance judgments done only for those documents retrieved -- not entire collection! Competition –Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66) –Results judged on precision and recall, going up to a recall level of 1000 documents

TREC Benefits: –made research systems scale to large collections (pre- WWW) –allows for somewhat controlled comparisons Drawbacks: –emphasis on high recall, which may be unrealistic for what most users want –very long queries, also unrealistic –comparisons still difficult to make, because systems are quite different on many dimensions –focus on batch ranking rather than interaction –no focus on the WWW

TREC is changing Emphasis on specialized “tracks” –Interactive track –Natural Language Processing (NLP) track –Multilingual tracks (Chinese, Spanish) –Filtering track –High-Precision –High-Performance

Homework…