Full-Text Indexing Session 10 INFM 718N Web-Enabled Databases.



Agenda How to do it How it works The “A” Team

[Slide: the web-database technology stack, one layer per row: Client Hardware (PC), Web Browser (IE, Firefox), Client-side Programming (JavaScript), Interchange Language (HTML, XML), Server-side Programming (PHP), Database (MySQL), Server Hardware (PC, Unix); alongside, the matching design concerns: Interface Design, Interaction Design, Business rules, Relational normalization, Structured programming, Software patterns, Object-oriented design, Functional decomposition.]

Full-Text Indexing in MySQL Create a MyISAM table (not InnoDB!) –Include a CHAR, VARCHAR, or TEXT field –Text fields can hold a bit over 10,000 words Create a FULLTEXT index –ALTER TABLE x ADD FULLTEXT INDEX (y); Issue a (ranked) query –SELECT y FROM x WHERE MATCH (y) AGAINST ('cat');

Other Types of Queries Automatic (ranked) vocabulary expansion –SELECT y FROM x WHERE MATCH (y) AGAINST ('cat' WITH QUERY EXPANSION); Boolean (unranked) search –SELECT y FROM x WHERE MATCH (y) AGAINST ('+cat -dog' IN BOOLEAN MODE);

Query Details No more than 254 characters (~40 words) –Longer queries take more time Multiple words are implicitly joined by “OR” Boolean queries can use (unnested) operators –Words preceded by “+” must occur (AND) –Words preceded by “-” must not occur (AND NOT)

What’s a “Word?” Delimited by “white space” or “-” –White-space includes space, tab, newline, … Not case sensitive Exact string match –No “stemming” (automatic truncation) Boolean search has additional options –Truncation (e.g., time*) –Phrases (e.g., “cats and dogs”)
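The tokenization rules above can be sketched in Python. The minimum word length below matches the three-characters-or-fewer rule on the next slide, but the stopword set is a small illustrative stand-in, not MySQL's actual list:

```python
import re

def tokenize(text, min_len=4, stopwords=frozenset({"this", "that", "with"})):
    """Approximate the slide's rules: split on white space and "-",
    lowercase (not case sensitive), then drop words shorter than
    min_len characters and words on the stopword list.
    min_len and stopwords here are illustrative stand-ins."""
    words = re.split(r"[\s\-]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stopwords]
```

Note that "Full-Text" splits into two words because "-" is a delimiter, and short words like "in" disappear entirely.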

Unsearchable Words Very common words –Those that appear in more than 50% of docs Words of 3 or fewer characters –Rarely are topically specific Other “stopwords” –able about above according accordingly across actually after afterwards again against ain't …

Human-Machine Synergy Machines are good at: –Doing simple things accurately and quickly –Scaling to larger collections in sublinear time People are better at: –Accurately recognizing what they are looking for –Evaluating intangibles such as “quality” Both are pretty bad at: –Mapping consistently between words and concepts

Supporting the Search Process [Slide: the search process as a cycle: Source Selection, Query Formulation, Search (producing a Query), Selection (over a Ranked List), Examination (of a Document), and Document Delivery, with loops back for Query Reformulation and Relevance Feedback and for Source Reselection; the IR system's roles are labeled Nominate, Choose, and Predict.]

Supporting the Search Process [Slide: the same process diagram, extended on the system side: a Collection is Acquired and Indexed, and the resulting Index feeds the Search component.]

Taylor’s Model of Question Formation –Q1: Visceral Need –Q2: Conscious Need –Q3: Formalized Need –Q4: Compromised Need (the Query) [Slide: a diagram contrasting end-user search with intermediated search across these four stages.]

Search Goal Choose the same documents a human would –Without human intervention (less work) –Faster than a human could (less time) –As accurately as possible (less accuracy) Humans start with an information need –Machines start with a query Humans match documents to information needs –Machines match document & query representations

Search Component Model [Slide: parallel processing paths. On the query side, an Information Need passes through Query Formulation to a Query, then through a Representation Function (Query Processing) to a Query Representation; on the document side, a Document passes through a Representation Function (Document Processing) to a Document Representation. A Comparison Function over the two representations yields a Retrieval Status Value; Human Judgment of the document against the need yields Utility.]

Relevance Relevance relates a topic and a document –Duplicates are equally relevant, by definition –Constant over time and across users Pertinence relates a task and a document –Accounts for quality, complexity, language, … Utility relates a user and a document –Accounts for prior knowledge We seek utility, but relevance is what we get!

Problems With Word Matching Word matching suffers from two problems –Synonymy: paper vs. article –Homonymy: bank (river) vs. bank (financial) Disambiguation in IR: seek to resolve homonymy –Index word senses rather than words Synonymy usually addressed by –Thesaurus-based query expansion –Latent semantic indexing

“Bag of Terms” Representation Bag = a “set” that can contain duplicates  “The quick brown fox jumped over the lazy dog’s back”  {back, brown, dog, fox, jump, lazy, over, quick, the, the} Vector = values recorded in any consistent order  {back, brown, dog, fox, jump, lazy, over, quick, the, the}  [1 1 1 1 1 1 1 1 2]
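The bag and the vector above can be built directly, e.g. with Python's Counter (the term list is taken from the slide, after its normalization of "jumped" to "jump" and "dog's" to "dog"):

```python
from collections import Counter

# "The quick brown fox jumped over the lazy dog's back", normalized:
terms = ["back", "brown", "dog", "fox", "jump", "lazy", "over", "quick", "the", "the"]

bag = Counter(terms)                    # the bag: term -> count
vector = [bag[t] for t in sorted(bag)]  # counts in a consistent (alphabetical) order
```

The alphabetical ordering is one arbitrary but consistent choice; any fixed ordering works as long as queries and documents use the same one.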

Bag of Terms Example [Slide: Document 1, “The quick brown fox jumped over the lazy dog’s back.”, and Document 2, “Now is the time for all good men to come to the aid of their party.”, shown as a term-by-document table: quick, brown, fox, over, lazy, dog, back, and jump occur only in Document 1; now, time, all, good, men, come, aid, their, and party occur only in Document 2; the, is, for, to, and of are removed by the stopword list.]

Boolean IR Strong points –Accurate, if you know the right strategies –Efficient for the computer Weaknesses –Often results in too many documents, or none –Users must learn Boolean logic –Sometimes finds relationships that don’t exist –Words can have many meanings –Choosing the right words is sometimes hard

Proximity Operators More precise versions of AND –“NEAR n” allows at most n-1 intervening terms –“WITH” requires terms to be adjacent and in order Easy to implement, but less efficient –Store a list of positions for each word in each doc Stopwords become very important! –Perform normal Boolean computations Treat WITH and NEAR like AND with an extra constraint
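A minimal sketch of evaluating NEAR and WITH over the stored position lists (the function names are mine, not from the slides; pos_a and pos_b are the word positions of two terms within one document):

```python
def near(pos_a, pos_b, n):
    """NEAR n: some occurrence of term A lies within n word positions
    of term B, in either order (at most n-1 intervening terms)."""
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)

def with_op(pos_a, pos_b):
    """WITH: term A occurs immediately before term B (adjacent, in order)."""
    return any(b - a == 1 for a in pos_a for b in pos_b)
```

As the slide notes, these are just AND with an extra constraint on the positions, which is why stopwords matter: removing a stopword shifts or deletes positions and changes adjacency.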

Proximity Operator Example time AND come –Doc 2 time (NEAR 2) come –Empty quick (NEAR 2) fox –Doc 1 quick WITH fox –Empty [Slide: positional index for the two example documents, listing each term with its word positions; e.g. quick at position 2 and fox at position 4 in Doc 1, time at position 4 and come at position 10 in Doc 2.]

Advantages of Ranked Retrieval Closer to the way people think –Some documents are better than others Enriches browsing behavior –Decide how far down the list to go as you read it Allows more flexible queries –Long and short queries can produce useful results

Ranked Retrieval Challenges “Best first” is easy to say but hard to do! –The best we can hope for is to approximate it Will the user understand the process? –It is hard to use a tool that you don’t understand Efficiency becomes a concern –Only a problem for long queries, though

Similarity-Based Queries Treat the query as if it were a document –Create a query bag-of-words Find the similarity of each document –Using the coordination measure, for example Rank order the documents by similarity –Most similar to the query first Surprisingly, this works pretty well! –Especially for very short queries
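The coordination measure named above is simply the count of query terms that the document contains; a minimal sketch:

```python
def coordination(query_terms, doc_terms):
    """Coordination measure: the number of distinct query terms that
    appear in the document. Rank documents by this count, highest first."""
    return len(set(query_terms) & set(doc_terms))
```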

Counting Terms Terms tell us about documents –If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms –“the” is in every document -- not discriminating Documents are most likely described well by rare terms that occur in them frequently –Higher “term frequency” is stronger evidence –Low “collection frequency” makes it stronger still

The Document Length Effect Humans look for documents with useful parts –But probabilities are computed for the whole Document lengths vary in many collections –So probability calculations could be inconsistent Two strategies –Adjust probability estimates for document length –Divide the documents into equal “passages”

Incorporating Term Frequency High term frequency is evidence of meaning –And high IDF is evidence of term importance Recompute the bag-of-words –Compute TF * IDF for every element
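A sketch of that recomputation, assuming the common log(N/df) form of IDF (the slides do not pin down an exact variant):

```python
import math

def tf_idf(tf, df, n_docs):
    """TF * IDF: high term frequency is evidence of meaning, and
    rarity in the collection (high IDF) is evidence of importance."""
    return tf * math.log(n_docs / df)
```

A term that appears in every document gets weight zero, which is exactly the "the is not discriminating" observation from the earlier slide.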

TF*IDF Example [Slide: term-weight tables for four documents over the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval.] Unweighted query: contaminated retrieval Result: 2, 3, 1, 4 Weighted query: contaminated(3) retrieval(1) Result: 1, 3, 2, 4 IDF-weighted query: contaminated retrieval Result: 2, 3, 1, 4

Document Length Normalization Long documents have an unfair advantage –They use a lot of terms So they get more matches than short documents –And they use the same words repeatedly So they have much higher term frequencies Normalization seeks to remove these effects –Related somehow to maximum term frequency –But also sensitive to the number of terms

“Okapi” Term Weights [Slide: the Okapi term-weight formula, annotated with its TF component and its IDF component.]

MySQL Term Weights
local weight = (log(tf)+1)/sumtf * U/(1+0.0115*U)
global weight = log((N-nf)/nf)
query weight = local weight * global weight * qf
–tf: how many times the term appears in the row
–sumtf: the sum of (log(tf)+1) for all terms in the same row
–U: how many unique terms are in the row
–N: how many rows are in the table
–nf: how many rows contain the term
–qf: how many times the term appears in the query
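The weights above can be computed directly; this sketch assumes the 1 + 0.0115*U damping form used in MyISAM's ranking code, so treat that constant as an assumption rather than something the slide guarantees:

```python
import math

def mysql_ft_weight(tf, sumtf, u, n, nf, qf=1):
    """Term weight in the style of MySQL's MyISAM full-text ranking.
    tf: occurrences of the term in the row; sumtf: sum of (log(tf)+1)
    over the row's terms; u: unique terms in the row; n: rows in the
    table; nf: rows containing the term; qf: occurrences in the query.
    The 0.0115 damping constant is an assumption from MyISAM's source."""
    local_weight = (math.log(tf) + 1) / sumtf * u / (1 + 0.0115 * u)
    global_weight = math.log((n - nf) / nf)
    return local_weight * global_weight * qf
```

Note that the global weight goes negative once a term appears in more than half the rows, which is one way to see why MySQL refuses to search for such terms.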

Summary Goal: find documents most similar to the query Compute normalized document term weights –Some combination of TF, DF, and Length Optionally, get query term weights from the user –Estimate of term importance Compute inner product of query and doc vectors –Multiply corresponding elements and then add
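The final step in the summary, the inner product of the query and document vectors, is a one-liner, with ranking on top of it (doc_vecs as a dict of id to vector is my illustrative representation):

```python
def inner_product(query_vec, doc_vec):
    """Multiply corresponding term weights and then add."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def rank(query_vec, doc_vecs):
    """Order document ids by descending similarity to the query."""
    return sorted(doc_vecs,
                  key=lambda doc_id: inner_product(query_vec, doc_vecs[doc_id]),
                  reverse=True)
```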

The Indexing Process [Slide: a term-by-document incidence table for eight documents (terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party) is transposed into an inverted file: a letter-prefixed term dictionary whose entries each point to a postings list of the document numbers containing that term.]

The Finished Product [Slide: the finished inverted file on its own: the term dictionary plus the postings lists, with the original document table discarded.]

How Big Is the Postings File? Very compact for Boolean retrieval –About 10% of the size of the documents If an aggressive stopword list is used! Not much larger for ranked retrieval –Perhaps 20% Enormous for proximity operators –Sometimes larger than the documents!

Building an Inverted Index Simplest solution is a single sorted array –Fast lookup using binary search –But sorting large files on disk is very slow –And adding one document means starting over Tree structures allow easy insertion –But the worst case lookup time is linear Balanced trees provide the best of both –Fast lookup and easy insertion –But they require 45% more disk space
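For contrast with the sorted-array and balanced-tree options above, a hash table gives fast lookup and easy insertion but no ordered traversal (so no prefix or range queries); a minimal sketch:

```python
def build_index(docs):
    """Build an inverted index as a hash map from term to a sorted
    postings list of document ids. docs maps doc id -> text.
    Uses naive whitespace tokenization for brevity."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):  # one posting per doc
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index
```

Adding a document only appends to a few postings lists, which is the easy-insertion property the tree buys; the hash map just pays for it with unordered keys instead of extra disk space.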

How Big is the Inverted Index? Typically smaller than the postings file –Depends on number of terms, not documents Eventually, most terms will already be indexed –But the postings file will continue to grow Postings dominate asymptotic space complexity –Linear in the number of documents

Summary Slow indexing yields fast query processing –Key fact: most terms don’t appear in most documents We use extra disk space to save query time –Index space is in addition to document space –Time and space complexity must be balanced Disk block reads are the critical resource –This makes index compression a big win