Information Retrieval, CSE 8337 (Part A), Spring 2009
Some material for these slides was obtained from:
Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze
CSE 8337 Outline: Introduction; Simple Text Processing; Boolean Queries; Web Searching/Crawling; Indexes; Vector Space Model; Matching; Evaluation
Information Retrieval (IR): retrieving desired information from textual data. Library Science; Digital Libraries; Web Search Engines. Traditionally keyword based. Sample query: Find all documents about “data mining”.
Motivation: IR covers the representation, storage, organization of, and access to information items. The focus is on the user information need. User information need (example): Find all docs containing information on college tennis teams which (1) are maintained by a USA university and (2) participate in the NCAA tournament. The emphasis is on the retrieval of information (not data).
DB vs IR: Records (tuples) vs. documents. Well-defined results vs. fuzzy results. DB grew out of files and traditional business systems. IR grew out of library science and the need to categorize/group/access books/articles.
Unstructured data: Typically refers to free text. Allows keyword queries including operators, and more sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse. Classic model for searching text documents.
Semi-structured data: In fact almost no data is “unstructured”. E.g., this slide has distinctly identified zones such as the Title and Bullets. This facilitates “semi-structured” search such as: Title contains data AND Bullets contain search … to say nothing of linguistic structure.
DB vs IR (cont’d): Data retrieval: which docs contain a set of keywords? Well-defined semantics; a single erroneous object implies failure! Information retrieval: information about a subject or topic; semantics is frequently loose; small errors are tolerated. IR system: interpret contents of information items; generate a ranking which reflects relevance; the notion of relevance is most important.
Motivation: IR in the last 20 years: classification and categorization; systems and languages; user interfaces and visualization. Still, the area was seen as of narrow interest. The advent of the Web changed this perception once and for all: universal repository of knowledge; free (low cost) universal access; no central editorial board; many problems though: IR seen as key to finding the solutions!
[Figure: Unstructured (text) vs. structured (database) data in 1996]
[Figure: Unstructured (text) vs. structured (database) data in 2006]
Basic Concepts: The User Task. Retrieval: information or data, purposeful. Browsing: glancing around. Feedback. [Figure labels: Retrieval, Browsing, Database, Response, Feedback]
Basic Concepts: Logical view of the documents. [Figure labels: structure; accents, spacing; stopwords; noun groups; stemming; manual indexing; docs; structure; full text; index terms]
The Retrieval Process. [Figure labels: User Interface; Text Operations; Query Operations; Indexing; Searching; Ranking; Index; DB Manager Module; Text Database / WWW; flows: user need, text, query, user feedback, logical view, inverted file, retrieved docs, ranked docs]
Basic assumptions of Information Retrieval: Collection: a fixed set of documents. Goal: retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.
Fuzzy Sets and Logic: Fuzzy set: the set membership function is a real-valued function with output in the range [0,1]. f(x): probability x is in F. 1-f(x): probability x is not in F. Example: T = {x | x is a person and x is tall}. Let f(x) be the probability that x is tall; here f is the membership function.
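As a rough illustration of a membership function for the set T of tall people, the sketch below uses a piecewise-linear f. The function name and the 160 cm and 190 cm thresholds are assumptions made for this example only; the slides do not specify any particular shape for f.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Hypothetical membership function f(x) for the fuzzy set 'tall'.

    Returns 0.0 at or below `low`, 1.0 at or above `high`,
    and a linear value in between. Both thresholds are assumed.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


# Degree of membership in T, and in the complement of T, for one person.
f_x = tall_membership(175)
not_f_x = 1 - f_x
```

A crisp (non-fuzzy) set corresponds to the special case where f only ever returns 0 or 1.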
[Figure: Fuzzy Sets]
IR is Fuzzy. [Figure panels: Simple, Fuzzy; regions: Not Relevant, Relevant]
Information Retrieval Metrics: Similarity: a measure of how close a query is to a document. Documents which are “close enough” are retrieved. Metrics: Precision = $\frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$; Recall = $\frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$
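The two metrics can be computed directly from sets of document IDs. The following is a minimal sketch; the function name and the example document IDs are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant docs, 5 retrieved docs, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 8, 9})
# p is 3/5 and r is 3/4
```

Returning 0.0 when a denominator is empty is a convention chosen here, not something the slides define.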
[Figure: IR Query Result Measures]
Text Processing TOC: Simple Text Storage; String Matching; String-to-String Correction (Approximate matching)
Text storage: EBCDIC/ASCII; array of characters; linked list of characters; trees (B-tree, trie). Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp
Pattern Matching (Recognition): Pattern matching finds occurrences of a predefined pattern in the data. Applications include speech recognition, information retrieval, and time series analysis.
Similarity Measures: Determine similarity between two objects. Similarity characteristics: Alternatively, distance measures measure how unlike or dissimilar objects are.
String Matching Problem: Input: a pattern of length m and a text string of length n. Find one (next, all) occurrences of the pattern in the text string.
String Matching Algorithms: Brute Force; Knuth-Morris-Pratt; Boyer-Moore
Brute Force String Matching: Space O(m+n); Time O(mn). Source: Handbook of Algorithms and Data Structures.
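A brute-force matcher simply tries every alignment of the pattern against the text, which is where the O(mn) time bound comes from. A minimal sketch (the function name is chosen for this example):

```python
def brute_force_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # every candidate alignment
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # extend the match character by character
        if j == m:                      # all m characters matched
            matches.append(i)
    return matches


positions = brute_force_search("abracadabra", "abra")
```

The inner loop can run up to m times for each of the roughly n alignments, for example with text "aaaa...a" and pattern "aa...ab".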
[Figure: FSR]
Creating FSR: Create FSM: Construct the “correct” spine. Add a default “failure bus” to state 0. Add a default “initial bus” to state 1. For each state, decide its attachments to the failure bus, the initial bus, or other failure links.
Knuth-Morris-Pratt: Apply the FSM to the string by processing characters one at a time. The accepting state is reached when the pattern is found. Space O(m+n); Time O(m+n). Source: Handbook of Algorithms and Data Structures.
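One common way to realize the Knuth-Morris-Pratt machine in code is with a failure table that plays the role of the failure links: the state is the number of pattern characters matched so far, and on a mismatch the table says which shorter state to fall back to. The sketch below uses that table-based formulation rather than an explicit FSM diagram; function names are chosen for this example.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the KMP failure links)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]             # follow failure links
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0                               # current state: characters matched so far
    for i, ch in enumerate(text):       # each text character is read once
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):           # accepting state reached
            matches.append(i - k + 1)
            k = fail[k - 1]             # keep going to find overlapping matches
    return matches
```

Building the table costs O(m) and the scan costs O(n), which matches the O(m+n) time bound on the slide.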
Boyer-Moore: Scan the pattern from right to left. Skip many positions when an illegal character (one that does not occur in the pattern) is found in the string. Worst-case time O(mn); expected time better than KMP. Source: Handbook of Algorithms and Data Structures.
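The right-to-left scan and the character-based skip can be illustrated with the Horspool simplification of Boyer-Moore, which keeps only a bad-character shift table. This is a sketch of that simplified variant, not the full Boyer-Moore algorithm (it omits the good-suffix rule); the function name is chosen for this example.

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: compare right to left, shift by a
    bad-character table keyed on the text character under the
    last pattern position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                      # scan the pattern from right to left
        if j < 0:
            matches.append(i)
        # Characters that do not occur in the pattern allow a full skip of m.
        i += shift.get(text[i + m - 1], m)
    return matches
```

When the text character aligned with the end of the pattern does not occur in the pattern at all, the window jumps by m positions, which is the source of the good expected behavior.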
String-to-String Correction: A measure of similarity between strings. Can be used to determine how to convert from one string to another; the distance is the cost to convert one to the other. Transformations: Match: the current characters in both strings are the same. Delete: delete the current character in the input string. Insert: insert the current character of the target string into the input string.
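The conversion cost can be computed with dynamic programming over the three transformations listed above. The sketch below assumes a match costs 0 and each insert or delete costs 1; those costs, and the function name, are assumptions for illustration since the slide does not give numeric costs.

```python
def edit_distance(source, target, insert_cost=1, delete_cost=1):
    """Minimum cost to turn source into target using match (free),
    delete, and insert. No substitution operation is used."""
    m, n = len(source), len(target)
    # dist[i][j] = cost to convert source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * delete_cost
    for j in range(1, n + 1):
        dist[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(dist[i - 1][j] + delete_cost,   # delete source[i-1]
                       dist[i][j - 1] + insert_cost)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])   # match
            dist[i][j] = best
    return dist[m][n]
```

With these costs, replacing one character counts as a delete plus an insert, so "cat" to "cut" has distance 2.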
[Figure: Distance Between Strings]
Approximate String Matching: Find patterns “close to” the string (fuzzy matching). Applications: spelling checkers; IR. Define similarity (distance) between string and pattern.
Keyword Based Queries: Basic queries: single word; multiple words. Context queries: phrase; proximity.
Boolean Queries: Keywords combined with Boolean operators: OR: (e1 OR e2); AND: (e1 AND e2); BUT: (e1 BUT e2), which satisfies e1 but not e2. Negation is only allowed using BUT, to allow efficient use of the inverted index by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic.
Boolean Retrieval with Inverted Indices: Primitive keyword: retrieve the containing documents using the inverted index. OR: recursively retrieve e1 and e2 and take the union of the results. AND: recursively retrieve e1 and e2 and take the intersection of the results. BUT: recursively retrieve e1 and e2 and take the set difference of the results.
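The recursive evaluation described above can be sketched with Python sets. The query representation (nested tuples), the function name, and the tiny index with its document IDs are all hypothetical, chosen only to make the example self-contained.

```python
# Hypothetical inverted index: term -> set of document IDs.
index = {
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 4, 5, 6},
    "calpurnia": {2},
}


def evaluate(expr):
    """Evaluate a query given as a keyword string or an
    (operator, left, right) tuple with operator OR, AND, or BUT."""
    if isinstance(expr, str):           # primitive keyword
        return index.get(expr, set())
    op, left, right = expr
    a, b = evaluate(left), evaluate(right)
    if op == "OR":
        return a | b                    # union
    if op == "AND":
        return a & b                    # intersection
    if op == "BUT":
        return a - b                    # set difference
    raise ValueError(f"unknown operator: {op}")


# (brutus AND caesar) BUT calpurnia
result = evaluate(("BUT", ("AND", "brutus", "caesar"), "calpurnia"))
```

Because BUT is a set difference against an already retrieved set, the complement of a term never has to be materialized.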
Term-document incidence: 1 if the play contains the word, 0 otherwise. Query: Brutus AND Caesar but NOT Calpurnia.
Incidence vectors: So we have a 0/1 vector for each term. To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
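The bitwise AND over incidence vectors can be sketched with Python integers used as bit vectors. The three vectors below (one bit per play, six plays) are illustrative values assumed for this example; the actual matrix from the slide is not reproduced here.

```python
# Assumed 0/1 incidence vectors, one bit per document (6 documents).
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

ALL_DOCS = 0b111111                     # mask so the complement stays 6 bits wide

# Brutus AND Caesar AND NOT Calpurnia
answer = brutus & caesar & (~calpurnia & ALL_DOCS)

# Render as a 6-character bit string; a 1 marks a matching document.
answer_bits = format(answer, "06b")
```

The mask is needed because Python integers are unbounded, so a bare `~calpurnia` would be negative rather than a 6-bit complement.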
Inverted index: For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? What happens if the word Caesar is added to document 14? [Figure: postings for Brutus, Calpurnia, Caesar]
Inverted index: Linked lists are generally preferred to arrays: dynamic space allocation; insertion of terms into documents is easy; the cost is the space overhead of pointers. Postings lists are sorted by docID (more later on why). [Figure labels: Dictionary (Brutus, Calpurnia, Caesar); Postings lists; Posting]
Inverted index construction: Documents to be indexed (Friends, Romans, countrymen.) → Tokenizer → token stream (Friends, Romans, Countrymen) → Linguistic modules → modified tokens (friend, roman, countryman) → Indexer → inverted index (friend, roman, countryman). More on these later.
Indexer steps: Sequence of (modified token, document ID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sort by terms. Core indexing step.
Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.
The result is split into a Dictionary file and a Postings file.
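The indexer steps (generate pairs, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched on the two example documents. The tokenizer here is a crude stand-in that only lowercases and extracts words; the function names and the in-memory dict layout are assumptions for illustration, not the file format of a real system.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    # Stand-in for the tokenizer and linguistic modules: lowercase words only.
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    # Step 1: sequence of (modified token, document ID) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in tokenize(text)]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated pairs and record the within-document frequency.
    postings = {}
    for (term, doc_id), group in groupby(pairs):
        postings.setdefault(term, []).append((doc_id, len(list(group))))
    # Step 4: split into a dictionary (term -> number of docs) and postings.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings


dictionary, postings = build_index(docs)
```

For instance, `postings["caesar"]` lists document 1 with frequency 1 and document 2 with frequency 2, and each postings list comes out sorted by document ID because of the sort in step 2.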
Where do we pay in storage? Pointers; terms. Will quantify the storage later.
The index we just built: How do we process a query? (Today’s focus.) Later: what kinds of queries can we process?
Query processing: AND. Consider processing the query: Brutus AND Caesar. Locate Brutus in the Dictionary; retrieve its postings. Locate Caesar in the Dictionary; retrieve its postings. “Merge” the two postings.
The merge: Walk through the two postings simultaneously, in time linear in the total number of postings entries. If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID. [Figure: postings for Brutus and Caesar; merged result 2, 8]
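The linear-time merge can be sketched as a two-pointer walk over the sorted lists. The postings below are assumed values for illustration (the lists from the slide figure are not reproduced here); the function name is also chosen for this example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])        # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                      # advance the list with the smaller docID
        else:
            j += 1
    return answer


# Assumed postings lists, sorted by docID.
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
common = intersect(brutus, caesar)
```

Each step advances at least one pointer, so the loop runs at most x+y times. The argument depends on both lists being sorted by docID.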
Example: WestLaw. Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). Tens of terabytes of data; 700,000 users. The majority of users still use Boolean queries. Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM. Here /3 = within 3 words, /S = in same sentence.
Boolean queries: More general merges. Exercise: adapt the merge for the queries: Brutus AND NOT Caesar; Brutus OR NOT Caesar. Can we still run through the merge in time O(x+y)? What can we achieve?
Merging: What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra). Can we always merge in “linear” time? Linear in what? Can we do better?
Query optimization: What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together. Query: Brutus AND Calpurnia AND Caesar
Query optimization example: Process in order of increasing freq: start with the smallest set, then keep cutting further. This is why we kept freq in the dictionary. Execute the query as (Caesar AND Brutus) AND Calpurnia.
More general optimization: e.g., (madding OR crowd) AND (ignoble OR strife). Get freq’s for all terms. Estimate the size of each OR by the sum of its freq’s (conservative). Process in increasing order of OR sizes.
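The ordering heuristic can be sketched as a sort on the estimated size of each OR group. The frequencies below are invented for illustration, and the function name is an assumption of this example.

```python
def processing_order(or_groups, freq):
    """Order the OR-groups of an AND query by estimated result size,
    using the sum of term frequencies as a conservative upper bound."""
    return sorted(or_groups, key=lambda group: sum(freq[t] for t in group))


# Hypothetical document frequencies taken from the dictionary.
freq = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 40}

order = processing_order([("madding", "crowd"), ("ignoble", "strife")], freq)
# (ignoble OR strife) is estimated at 45, (madding OR crowd) at 510,
# so the smaller group is processed first.
```

The sum is an upper bound because a document containing both terms of an OR is counted twice.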
Exercise: Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Phrasal Queries: Retrieve documents with a specific phrase (ordered list of contiguous words), e.g. “information theory”. May allow intervening stop words and/or stemming: “buy camera” matches “buy a camera”, “buying the cameras”, etc.
Phrasal Retrieval with Inverted Indices: Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start the contiguity check with the least common word in the phrase.
Phrasal Search Algorithm
1. Find the set of documents D in which all keywords (k_1 … k_m) in the phrase occur (using AND query processing).
2. Initialize an empty set, R, of retrieved documents.
3. For each document, d, in D do
4.   Get the array, P_i, of positions of occurrences for each k_i in d
5.   Find the shortest array P_s of the P_i’s
6.   For each position p of keyword k_s in P_s do
7.     For each keyword k_i except k_s do
8.       Use binary search to find the position (p - s + i) in the array P_i
9.     If the correct position for every keyword is found, add d to R
10. Return R
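The steps above can be sketched in Python over a positional index. The index layout (term -> {docID: sorted positions}), the function names, and the sample data are assumptions made for this example; keyword indices are 0-based here, which leaves the offset p - s + i unchanged.

```python
from bisect import bisect_left


def contains(sorted_positions, target):
    """Binary search for target in a sorted list of positions."""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target


def phrase_search(phrase, index):
    """phrase: list of keywords; index: term -> {doc_id: sorted positions}."""
    if not phrase:
        return []
    # Step 1: documents containing every keyword (AND processing).
    candidates = set.intersection(*(set(index.get(t, {})) for t in phrase))
    results = []                                          # Step 2
    for doc in sorted(candidates):                        # Step 3
        positions = [index[t][doc] for t in phrase]       # Step 4
        s = min(range(len(phrase)),
                key=lambda i: len(positions[i]))          # Step 5
        for p in positions[s]:                            # Step 6
            # Steps 7-8: keyword i must occur at position p - s + i.
            if all(contains(positions[i], p - s + i)
                   for i in range(len(phrase)) if i != s):
                results.append(doc)                       # Step 9
                break
    return results                                        # Step 10


# Hypothetical positional index.
index = {
    "information": {1: [0, 7], 2: [3]},
    "theory": {1: [1], 2: [9]},
}
hits = phrase_search(["information", "theory"], index)
```

Starting from the shortest position array keeps the number of binary searches small, which is the reason the slides recommend beginning with the least common word.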
Proximity Queries: List of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words match “…dogs will begin the race…”. May also perform stemming and/or not count stop words.
Proximity Retrieval with Inverted Index: Use an approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints. During the binary search for positions of the remaining keywords, find the closest position of k_i to p and check that it is within the maximum allowed distance.
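The closest-position check can be sketched for two keywords within one document. This example assumes that "within d words" means the word positions differ by at most d, which is consistent with the "dogs will begin the race" example but is still an interpretation; the function names are chosen for this sketch.

```python
from bisect import bisect_left


def closest(sorted_positions, p):
    """Return the position in a non-empty sorted list that is nearest to p."""
    i = bisect_left(sorted_positions, p)
    # Only the neighbors on either side of the insertion point can be nearest.
    neighbors = sorted_positions[max(i - 1, 0):i + 1]
    return min(neighbors, key=lambda q: abs(q - p))


def within(positions_a, positions_b, max_distance):
    """True if some occurrence of keyword a has an occurrence of keyword b
    at most max_distance positions away."""
    if not positions_a or not positions_b:
        return False
    return any(abs(closest(positions_b, p) - p) <= max_distance
               for p in positions_a)


# "dogs will begin the race": dogs at position 0, race at position 4.
match = within([0], [4], 4)
```

A full implementation would run this check per candidate document after the AND step, just as in phrasal search.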
Pattern Matching: Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.
Simple Patterns: Prefixes: pattern that matches the start of a word; “anti” matches “antiquity”, “antibody”, etc. Suffixes: pattern that matches the end of a word; “ix” matches “fix”, “matrix”, etc. Substrings: pattern that matches an arbitrary contiguous sequence of characters within a word; “rapt” matches “enrapture”, “velociraptor”, etc. Ranges: pair of strings that matches any word lexicographically (alphabetically) between them; “tin” to “tix” matches “tip”, “tire”, “title”, etc.
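Prefix and range patterns can both be answered by binary search over a sorted vocabulary, while suffix and substring patterns need other structures (or a linear scan). The sketch below covers the first two; the vocabulary, the function names, and the use of U+FFFF as an upper sentinel (assuming no word contains that character) are assumptions of this example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted vocabulary (the dictionary of an index).
vocab = sorted(["antibody", "antiquity", "fix", "matrix", "tin",
                "tip", "tire", "title", "tix", "velociraptor"])


def prefix_matches(prefix):
    """All vocabulary words that start with prefix."""
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\uffff")   # just past the last match
    return vocab[lo:hi]


def range_matches(low, high):
    """All vocabulary words w with low <= w <= high (lexicographic order)."""
    return vocab[bisect_left(vocab, low):bisect_right(vocab, high)]


anti_words = prefix_matches("anti")
tin_to_tix = range_matches("tin", "tix")
```

The range here is inclusive of both endpoints; whether the endpoints themselves should match is a design choice the slide leaves open.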
Allowing Errors: What if the query or document contains typos or misspellings? Judge similarity of words (or arbitrary strings) using: edit distance (cost of insert/delete/match); Longest Common Subsequence (LCS). Allow proximity search with a bound on string similarity.
Longest Common Subsequence (LCS): Length of the longest subsequence of characters shared by two strings. A subsequence of a string is obtained by deleting zero or more characters. Examples: “misspell” to “mispell” is 7; “misspelled” to “misinterpretted” is 7 (“mis…p…e…ed”).
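The LCS length can be computed with the standard dynamic program, keeping only one previous row of the table. A minimal sketch (the function name is chosen for this example), checked against the two examples above:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)           # LCS lengths for the previous prefix of a
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop a character from a or b
        prev = curr
    return prev[-1]


first = lcs_length("misspell", "mispell")
second = lcs_length("misspelled", "misinterpretted")
```

Both calls give 7, in agreement with the slide. The running time is O(mn) for strings of lengths m and n.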
Regular Expressions: Language for composing complex patterns from simpler ones. An individual character is a regex. Union: if e1 and e2 are regexes, then (e1 | e2) is a regex that matches whatever either e1 or e2 matches. Concatenation: if e1 and e2 are regexes, then e1 e2 is a regex that matches a string that consists of a substring that matches e1 immediately followed by a substring that matches e2. Repetition: if e1 is a regex, then e1* is a regex that matches a sequence of zero or more strings that match e1.
Regular Expression Examples: (u|e)nabl(e|ing) matches unable, unabling, enable, enabling. (un|en)*able matches able, unable, unenable, enununenable.
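Both example patterns use only union, concatenation, and repetition, so they can be tried unchanged with Python's `re` module. The sketch below uses `re.fullmatch` so that the whole word must match; the helper name is chosen for this example.

```python
import re

first = re.compile(r"(u|e)nabl(e|ing)")
second = re.compile(r"(un|en)*able")


def matches_all(pattern, words):
    """True if pattern matches every word in its entirety."""
    return all(pattern.fullmatch(w) is not None for w in words)


ok_first = matches_all(first, ["unable", "unabling", "enable", "enabling"])
ok_second = matches_all(second, ["able", "unable", "unenable", "enununenable"])

# A word outside the language of the first pattern is rejected.
rejected = first.fullmatch("disable") is None
```

With `re.search` instead of `re.fullmatch`, the patterns would also accept words that merely contain a matching substring, such as "disable" for the second pattern.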
Enhanced Regex’s (Perl): Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for a specific range of number of occurrences: {min,max}. A{1,5}: one to five A’s. A{5,}: five or more A’s. A{5}: exactly five A’s.
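Perl Regex Examples: U.S. phone number with optional area code: /(\(\d{3}\)\s?)?\b\d{3}-\d{4}\b/ (the first \b is placed after the optional area code, because \b does not match between whitespace and an opening parenthesis). address: Note: packages are available to support Perl regex’s in Java.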
Structural Queries: Assumes documents have structure that can be exploited in search. Structure could be: a fixed set of fields, e.g. title, author, abstract, etc.; or a hierarchical (recursive) tree structure. [Figure: tree with nodes book, chapter, title, section, subsection]
Queries with Structure: Allow queries for text appearing in specific fields: “nuclear fusion” appearing in a chapter title. SFQL: relational database query language SQL enhanced with “full text” search. Example: Select abstract from journal.papers where author contains “Teller” and title contains “nuclear fusion” and date < 1/1/1950
Ranking search results: Boolean queries give inclusion or exclusion of docs. Often we want to rank/group results. Need to measure proximity from the query to each doc. Need to decide whether docs presented to the user are singletons, or a group of docs covering various aspects of the query.
The web and its challenges: Unusual and diverse documents. Unusual and diverse users, queries, information needs. Beyond terms, exploit ideas from social networks: link analysis, clickstreams... How do search engines work? And how can we make them better?
More sophisticated information retrieval: Cross-language information retrieval; question answering; summarization; text mining; …
Perl Regex’s
Character classes:
\w (word char): any alpha-numeric (not: \W)
\d (digit char): any digit (not: \D)
\s (space char): any whitespace (not: \S)
. (wildcard): anything
Anchor points:
\b (boundary): word boundary
^: beginning of string
$: end of string