1
Matching LBSC 708A/CMSC 828O March 8, 1999 Douglas W. Oard and Dagobert Soergel
2
Agenda Bag of words representation Boolean matching Similarity-based matching Probabilistic matching
3
Retrieval System Model
[Figure: system diagram connecting the User, Query Formulation, Detection, Selection, Examination, and Delivery with the Docs, Indexing, and the Index]
4
Detection Component Model
[Figure: a query side (Information Need → Query Formulation → Query → Query Processing / Representation Function → Query Representation) and a document side (Document → Document Processing / Representation Function → Document Representation) meet in a Comparison Function that outputs a Retrieval Status Value; Human Judgment relates that value to Utility]
5
“Bag of Words” Representation Simple strategy for representing documents Count how many times each term occurs A “term” is any lexical item that you choose –A fixed-length sequence of characters (an “n-gram”) –A word (delimited by “white space” or punctuation) –Some standard “root form” of each word (e.g., a stem) –A phrase (e.g., phrases listed in a dictionary) Counts can be recorded in any consistent order
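A minimal sketch of the word-based variant (not from the original slides), assuming whitespace/punctuation tokenization, no stemming, and a tiny illustrative stopword list:

```python
from collections import Counter
import re

# Illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "is", "to", "for", "of"}

def bag_of_words(text):
    """Count how many times each indexed term occurs in a document."""
    tokens = re.findall(r"[a-z']+", text.lower())  # words delimited by whitespace/punctuation
    return Counter(t for t in tokens if t not in STOPWORDS)

print(bag_of_words("The quick brown fox jumped over the lazy dog's back."))
```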
6
Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: the, is, to, for, of

Indexed Term   Document 1   Document 2
aid                 0            1
all                 0            1
back                1            0
brown               1            0
come                0            1
dog                 1            0
fox                 1            0
good                0            1
jump                1            0
lazy                1            0
men                 0            1
now                 0            1
over                1            0
party               0            1
quick               1            0
their               0            1
time                0            1
7
Boolean “Free Text” Retrieval Limit the bag of words to “absent” and “present” –“Boolean” values, represented as 0 and 1 Represent terms as a “bag of documents” –Same representation, but rows rather than columns Combine the rows using “Boolean operators” –AND, OR, NOT Any document with a 1 remaining is “detected”
8
Boolean Free Text Example
[Table: binary term-by-document matrix for Doc 1–Doc 8 over the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party]
dog AND fox – Doc 3, Doc 5
dog NOT fox – Empty
fox NOT dog – Doc 7
dog OR fox – Doc 3, Doc 5, Doc 7
good AND party – Doc 6, Doc 8
good AND party NOT over – Doc 6
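One way to implement this (a sketch, not code from the course) is to store each term's row as the set of documents that contain it and map AND, OR, and NOT onto set operations. The postings below are illustrative, chosen only to reproduce the results above:

```python
# Each term maps to the set of documents that contain it (a "bag of documents").
index = {
    "dog":   {3, 5},
    "fox":   {3, 5, 7},
    "good":  {2, 6, 8},
    "party": {6, 8},
    "over":  {1, 8},
}

print(index["dog"] & index["fox"])                       # dog AND fox              -> {3, 5}
print(index["dog"] - index["fox"])                       # dog NOT fox              -> set()
print(index["fox"] - index["dog"])                       # fox NOT dog              -> {7}
print(index["dog"] | index["fox"])                       # dog OR fox               -> {3, 5, 7}
print((index["good"] & index["party"]) - index["over"])  # good AND party NOT over  -> {6}
```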
9
Proximity Operator Example
[Table: term-by-document matrix for Doc 1 and Doc 2 in which each nonzero entry also records the term’s word position, e.g. quick at position 2 and fox at position 4 in Doc 1]
time AND come – Doc 2
time (NEAR 2) come – Empty
quick (NEAR 2) fox – Doc 1
quick WITH fox – Empty
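A sketch of how a NEAR operator could be evaluated from such positional postings (the index contents below are read off Documents 1 and 2 above; the helper is hypothetical, not the course's implementation):

```python
# Positional postings: term -> {doc_id: [word positions]} (1-based word offsets).
positions = {
    "quick": {1: [2]},
    "fox":   {1: [4]},
    "time":  {2: [4]},
    "come":  {2: [10]},
}

def near(term_a, term_b, k, index):
    """Documents in which term_a and term_b occur within k words of each other."""
    hits = set()
    for doc in index[term_a].keys() & index[term_b].keys():
        if any(abs(pa - pb) <= k
               for pa in index[term_a][doc]
               for pb in index[term_b][doc]):
            hits.add(doc)
    return hits

print(near("quick", "fox", 2, positions))  # quick (NEAR 2) fox -> {1}
print(near("time", "come", 2, positions))  # time (NEAR 2) come -> set()
```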
10
Controlled Vocabulary Example
Document 1: The quick brown fox jumped over the lazy dog’s back.  Descriptors: [Canine] [Fox]
Document 2: Now is the time for all good men to come to the aid of their party.  Descriptors: [Political action] [Volunteerism]

Descriptor         Doc 1   Doc 2
Canine               1       0
Fox                  1       0
Political action     0       1
Volunteerism         0       1

Canine AND Fox – Doc 1
Canine AND Political action – Empty
Canine OR Political action – Doc 1, Doc 2
11
Ranked Retrieval Paradigm Exact match retrieval often gives useless sets –No documents at all, or way too many documents Query reformulation is one “solution” –Manually add or delete query terms “Best-first” ranking can be superior –Select every document within reason –Put them in order, with the “best” ones first –Display them one screen at a time
12
Advantages of Ranked Retrieval Closer to the way people think –Some documents are better than others Enriches browsing behavior –Decide how far down the list to go as you read it Allows more flexible queries –Long and short queries can produce useful results
13
Ranked Retrieval Challenges “Best first” is easy to say but hard to do! –Probabilistic retrieval tries to approximate it How can the user understand the ranking? –It is hard to use a tool that you don’t understand Efficiency may become a concern –More complex computations take more time
14
Document Similarity How similar are two documents? –In particular, how similar is their bag of words?

1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term           1   2   3
nuclear        1   0   0
fallout        1   0   0
siberia        0   0   0
contaminated   1   0   0
interesting    0   1   0
complicated    0   0   1
information    0   1   1
retrieval      0   1   1
15
Similarity-Based Queries Treat the query as if it were a document –Create a query bag-of-words Find the similarity of each document –Using the coordination measure, for example Rank order the documents by similarity –Most similar to the query first
16
A Simple Ranking Strategy
information AND retrieval
–Readings in Information Retrieval
–Information Storage and Retrieval
–Speech-Based Information Retrieval for Digital Libraries
–Word Sense Disambiguation and Information Retrieval
information NOT retrieval
–The State of the Art in Information Filtering
retrieval NOT information
–Inference Networks for Document Retrieval
–Content-Based Image Retrieval Systems
–Video Parsing, Retrieval and Browsing
–An Approach to Conceptual Text Retrieval Using the EuroWordNet …
–Cross-Language Retrieval: English/Russian/French
17
The Coordination Measure
[Table: the same binary term-by-document matrix as above (nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval × documents 1–3)]
Query: complicated retrieval Result: 3, 2
Query: information retrieval Result: 2, 3
Query: interesting nuclear fallout Result: 1, 2
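The coordination measure simply counts how many query terms a document shares with the query; a minimal sketch over the three example documents:

```python
# Coordination measure: how many query terms also occur in the document.
docs = {
    1: {"nuclear", "fallout", "contaminated"},
    2: {"information", "retrieval", "interesting"},
    3: {"information", "retrieval", "complicated"},
}

def coordination(query_terms, doc_terms):
    return len(set(query_terms) & doc_terms)

query = ["complicated", "retrieval"]
scores = {d: coordination(query, terms) for d, terms in docs.items()}
ranked = [d for d in sorted(scores, key=scores.get, reverse=True) if scores[d] > 0]
print(ranked)  # [3, 2] -- matches the "complicated retrieval" result above
```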
18
The Vector Space Model
[Table: the same binary term-by-document matrix as on the previous slides]
Model similarity as inner product –Line up query and doc vectors –Multiply weights for each term –Add up the results With binary term weights, same results as coordination match
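A sketch of that inner product over bag-of-words vectors stored as term-to-weight maps (illustrative, with binary weights so the result equals the coordination measure):

```python
def inner_product(query_vec, doc_vec):
    """Line up the two vectors, multiply the weights term by term, add the results."""
    return sum(weight * query_vec.get(term, 0) for term, weight in doc_vec.items())

doc   = {"information": 1, "retrieval": 1, "complicated": 1}
query = {"information": 1, "retrieval": 1}
print(inner_product(query, doc))  # 2 -- with 0/1 weights this equals the coordination measure
```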
19
Term Frequency Terms tell us about documents –If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms –“the” is in every document -- not discriminating Documents are most likely described well by rare terms that occur in them frequently –Higher “term frequency” is stronger evidence –Low “collection frequency” makes it stronger still
20
Incorporating Term Frequency High term frequency is evidence of meaning –And high IDF is evidence of term importance Recompute the bag-of-words –Compute TF * IDF for every element
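Many TF and IDF variants exist; the sketch below uses raw term frequency and IDF = log10(N / df), which appears consistent with the IDF values on the example slide (0.301, 0.125, 0.602 for a 4-document collection), though the exact variant used in the course may differ. The document frequencies shown are illustrative:

```python
import math

def tf_idf(term_counts, doc_freq, n_docs):
    """Replace each raw count (TF) with TF * IDF, using IDF = log10(N / df)."""
    return {term: tf * math.log10(n_docs / doc_freq[term])
            for term, tf in term_counts.items()}

# Illustrative document frequencies over a 4-document collection.
doc_freq   = {"nuclear": 1, "information": 3, "retrieval": 2}
doc_counts = {"nuclear": 3, "information": 1, "retrieval": 2}
print(tf_idf(doc_counts, doc_freq, n_docs=4))
# {'nuclear': 1.806..., 'information': 0.1249..., 'retrieval': 0.602...}
```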
21
TF*IDF Example
[Table: raw term frequencies for nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval across four documents, the corresponding IDF values (e.g. 0.301, 0.125, 0.602, 0.000), and the resulting TF*IDF weights]
22
The Document Length Effect Humans look for documents with useful parts –But term weights are computed for the whole Document lengths vary in many collections –So term weight calculations could be inconsistent Two strategies –Adjust term weights for document length –Divide the documents into equal “passages”
23
Document Length Normalization Long documents have an unfair advantage –They use a lot of terms So they get more matches than short documents –And they use the same words repeatedly So they have much higher term frequencies Normalization seeks to remove these effects –Related somehow to maximum term frequency –But also sensitive to the number of terms
24
“Cosine” Normalization Compute the length of each document vector –Multiply each weight by itself –Add all the resulting values –Take the square root of that sum Divide each weight by that length
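A sketch of those four steps (the weights below are illustrative TF*IDF values, not taken from the example slide):

```python
import math

def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the document vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))  # square, sum, square root
    return {term: w / length for term, w in weights.items()}

doc = {"nuclear": 1.51, "fallout": 0.60, "contaminated": 0.50}  # illustrative TF*IDF weights
print(cosine_normalize(doc))
```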
25
Cosine Normalization Example
[Table: the TF*IDF weights from the previous example, the vector length of each document (1.70, 0.97, 2.67, 0.87), and the cosine-normalized weights obtained by dividing each weight by its document’s length]
26
Why Call It “Cosine”?
[Figure: two document vectors, d1 and d2, drawn from the origin with the angle between them]
27
Interpreting the Cosine Measure Think of a document as a vector from zero Similarity is the angle between two vectors –Small angle = very similar –Large angle = little similarity Passes some key sanity checks –Depends on pattern of word use but not on length –Every document is most similar to itself
28
The Okapi BM25 Formula Discovered mostly through trial and error –Requires a large IR test collection
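The formula itself was a graphic on the slide and is not in the text; one commonly published form of the BM25 score (there are several variants, so this may not be exactly the one shown in class) is:

```latex
\mathrm{score}(q,d) \;=\; \sum_{t \in q}
  \log\!\frac{N - df_t + 0.5}{df_t + 0.5}
  \cdot
  \frac{(k_1 + 1)\,tf_{t,d}}
       {k_1\!\left((1-b) + b\,\frac{dl_d}{avgdl}\right) + tf_{t,d}}
```

Here N is the collection size, df_t the document frequency of term t, tf_{t,d} its frequency in document d, dl_d the document length, avgdl the average document length, and k_1 and b are constants tuned empirically on a test collection (the "trial and error" the slide refers to).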
29
Summary So Far Find documents most similar to the query Optionally, obtain query term weights –Given by the user, or computed from IDF Compute document term weights –Some combination of TF and IDF Normalize the document vectors –Cosine is one way to do this Compute inner product of query and doc vectors –Multiply corresponding elements and then add
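Putting those steps together, a minimal end-to-end sketch (illustrative choices throughout, not the course's reference implementation):

```python
import math
from collections import Counter

def rank(query, docs):
    """Rank documents by the inner product of a binary query vector with
    cosine-normalized TF*IDF document vectors."""
    n = len(docs)
    bags = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(t for bag in bags.values() for t in bag)                    # document frequencies
    scores = {}
    for d, bag in bags.items():
        weights = {t: tf * math.log10(n / df[t]) for t, tf in bag.items()}   # TF*IDF
        length = math.sqrt(sum(w * w for w in weights.values())) or 1.0      # cosine length
        scores[d] = sum(weights.get(t, 0) / length for t in query.lower().split())
    return sorted(scores, key=scores.get, reverse=True)

docs = {2: "information retrieval is interesting",
        3: "information retrieval is complicated"}
print(rank("complicated retrieval", docs))  # [3, 2]
```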
30
Probability Ranking Principle Claim: –Documents should be ranked in order of decreasing probability of relevance to the query Binary relevance & independence assumptions –Each document is either relevant or it is not –Relevance of one doc reveals nothing about another
31
Probabilistic Retrieval Strategy Estimate how terms contribute to relevance –Use term frequency as evidence Make “binary independence” assumptions –Listed on the next slide Compute document relevance probability –Combine evidence from all terms Order documents by decreasing probability
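In the standard textbook development of this strategy (stated here for reference; the course's exact formulation may differ), the binary independence assumptions reduce the ranking to a sum of per-term log-odds weights over the query terms present in the document:

```latex
RSV(d) \;=\; \sum_{t \,\in\, q \cap d} \log\frac{p_t\,(1 - q_t)}{q_t\,(1 - p_t)}
```

where p_t is the probability that term t appears in a relevant document and q_t the probability that it appears in a non-relevant one; without known relevant documents these probabilities must be estimated, which is where the ad hoc weights criticized later come in.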
32
Inference Networks A flexible way of combining term weights –Boolean model –Binary independence model –Probabilistic models with weaker assumptions Efficient large-scale implementation –InQuery text retrieval system from U Mass
33
Binary Independence Assumptions No prior knowledge about any document Each document is either relevant or it is not Relevance of one doc tells nothing about others Presence of one term tells nothing about others
34
A Binary Independence Network
[Figure: network with document nodes d1–d4 linked to term nodes bat, cat, fat, hat, mat, pat, rat, vat, sat, which feed a query node]
35
Probability Computation Turn on exactly one document at a time –Boolean: Every connected term turns on –Binary Ind: Connected terms gain their weight Compute the query value –Boolean: AND and OR nodes use truth tables –Binary Ind: Fraction of the possible weight
36
A Boolean Inference Net
[Figure: the same network of document nodes d1–d4 and term nodes bat, cat, fat, hat, mat, pat, rat, vat, sat, with the query expressed through AND and OR operator nodes]
37
Critique of Probabilistic Retrieval Most of the assumptions are not satisfied! –Searchers want utility, not relevance –Relevance is not binary –Terms are clearly not independent –Documents are often not independent The best known term weights are quite ad hoc –Unless some relevant documents are known
38
A Response Ranked retrieval paradigm is powerful –Well suited to human search strategies Probability theory has explanatory power –At least we know where the weak spots are Inference networks are extremely flexible –Easily accommodates newly developed models Implemented by InQuery –Effective, efficient, and large-scale
39
Summary Three fundamental models –Boolean, probabilistic, vector space Common characteristics –Bag of words representation –Choice of indexing terms Probabilistic & vector space share commonalities –Ranked retrieval –Term weights –Term independence assumption