Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo.

Slides:



Advertisements
Similar presentations
Liang, Introduction to Java Programming, Ninth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 9 Strings.
Advertisements

Chapter 5: Introduction to Information Retrieval
Modern Information Retrieval Chapter 1: Introduction
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
Tries Standard Tries Compressed Tries Suffix Tries.
CS 430 / INFO 430 Information Retrieval
Modern Information Retrieval Chapter 8 Indexing and Searching.
Modern Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Modern Information Retrieval Chapter 1: Introduction
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
Chapter 4 : Query Languages Baeza-Yates, 1999 Modern Information Retrieval.
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
Indexing and Searching
Modern Information Retrieval Chapter 4 Query Languages.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Search engines fdm 20c introduction to digital media lecture warren sack / film & digital media department / university of california, santa.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Scripting Languages Chapter 8 More About Regular Expressions.
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Apache Lucene in LexGrid. Lucene Overview High-performance, full-featured text search engine library. Written entirely in Java. An open source project.
Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.
Modern Information Retrieval Chap. 02: Modeling (Structured Text Models)
Last Updated March 2006 Slide 1 Regular Expressions.
Lecture #32 WWW Search. Review: Data Organization Kinds of things to organize –Menu items –Text –Images –Sound –Videos –Records (I.e. a person ’ s name,
Spring 2010CS 2251 Trees Chapter 6. Spring 2010CS 2252 Chapter Objectives Learn to use a tree to represent a hierarchical organization of information.
Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Sébastien François, EPrints Lead Developer EPrints Developer Powwow, ULCC.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and.
Instructor: Craig Duckett Lecture 08: Thursday, October 22 nd, 2015 Patterns, Order of Evaluation, Concatenation, Substrings, Trim, Position 1 BIT275:
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
12. Regular Expressions. 2 Motto: I don't play accurately-any one can play accurately- but I play with wonderful expression. As far as the piano is concerned,
Tutorial 13 Validating Documents with Schemas
Query Languages Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Information Retrieval
1 String Processing CHP # 3. 2 Introduction Computer are frequently used for data processing, here we discuss primary application of computer today is.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
© 2006 Lawrenceville Press Slide 1 Chapter 6 The Post-Test Do…Loop Statement  Loop structure that executes a set of statements as long as a condition.
Restrictions Objectives of the Lecture : To consider the algebraic Restrict operator; To consider the Restrict operator and its comparators in SQL.
An Introduction to Regular Expressions Specifying a Pattern that a String must meet.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
BIT 3193 MULTIMEDIA DATABASE CHAPTER 4 : QUERING MULTIMEDIA DATABASES.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
May 2006CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
© 2010 Lawrenceville Press Slide 1 Chapter 5 The Do…Loop Statement  Loop structure that executes a set of statements as long as a condition is true. 
CS 430: Information Discovery
Query Languages.
Chapter 5 The Do…Loop Statement
Matcher functions boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern. boolean lookingAt() Attempts to.
Query Languages Berlin Chen 2003 Reference:
Information Retrieval and Web Design
Presentation transcript:

Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto Prof. Raymond J. Mooney in CS378 at University of Texas Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, 1983, McGraw-Hill.

CSE 8337 Spring Query Languages TOC Keyword Based Boolean Weighted Boolean Context Based (Phrasal & Proximity) Pattern Matching Structural Queries

CSE 8337 Spring Keyword Based Queries Basic Queries Single word Multiple words Context Queries Phrase Proximity

CSE 8337 Spring Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic.

CSE 8337 Spring Boolean Retrieval with Inverted Indices Primitive keyword: Retrieve containing documents using the inverted index. OR: Recursively retrieve e 1 and e 2 and take union of results. AND: Recursively retrieve e 1 and e 2 and take intersection of results. BUT: Recursively retrieve e 1 and e 2 and take set difference of results.

CSE 8337 Spring Weighted Boolean Queries Extension to boolean queries which adds weights to terms. Weights indicate the degree of importance that that term has in the operation. Traditional Boolean Interpretation:

CSE 8337 Spring Processing Weighted Boolean Queries If both query and document terms are weighted, then retrieve a document if the document term weight is greater than the query term weight. If documents are not weighted rank based on some similarity measure. The weights in the query determines which (number) documents to be included. The next few slides show how this can be done in a system with no document term weights.

CSE 8337 Spring Weighted Boolean Interpretation A 0 A 0.33 A 0.5 A 0.66 A 1 To perform Boolean operations, must be able to determine distance from items in A(B) to B(A). Centroid based distances

CSE 8337 Spring Weighted OR A a OR B b As b gets closer to 1, include more items in B which are closest to A. As a gets closer to 1, include more items in A which are closest to B. A B A OR B 0.33 – All items in A plus 1/3 of those in B A 0.66 OR B 0.33 – 2/3 the items in A and 1/3 of those in B A B

CSE 8337 Spring Weighted AND A a AND B b As b gets closer to 1, items in A-B farthest from intersection are removed. As a gets closer to 1, items in B-A farthest from intersection are removed. A AND B 0.33 – Remove all of B and 1/3 of A outside of intersection. A B A 0.66 AND B 0.33 – Remove 1/3 of A and 2/3 of B. A B

CSE 8337 Spring Weighted NOT A a NOT B b As b gets closer to 1, remove more items from intersection that are farthest from A-B. a indicates how much of A to include. A B A NOT B 0.33 – All items in A-B and 2/3 of items in intersection. A 0.66 NOT B 0.33 – 2/3 the items in A-B and 2/3 of items in intersection. A B

CSE 8337 Spring Phrasal Queries Retrieve documents with a specific phrase (ordered list of contiguous words) “information theory” May allow intervening stop words and/or stemming. “buy camera” matches: “buy a camera” “buying the cameras” etc.

CSE 8337 Spring Phrasal Retrieval with Inverted Indices Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start contiguity check with the least common word in the phrase.

CSE 8337 Spring Phrasal Search Algorithm 1. Find set of documents D in which all keywords (k 1 …k m ) in phrase occur (using AND query processing). 2. Intitialize empty set, R, of retrieved documents. 3. For each document, d, in D do 4. Get array, P i, of positions of occurrences for each k i in d 5. Find shortest array P s of the P i ’s 6. For each position p of keyword k s in P s do 7. For each keyword k i except k s do 8. Use binary search to find a position (p – s + i ) in the array P i 1. If correct position for every keyword found, add d to R 2. Return R

CSE 8337 Spring Proximity Queries List of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words match “…dogs will begin the race…” May also perform stemming and/or not count stop words.

CSE 8337 Spring Proximity Retrieval with Inverted Index Use approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints. During binary search for positions of remaining keywords, find closest position of k i to p and check that it is within maximum allowed distance.

CSE 8337 Spring Pattern Matching Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

CSE 8337 Spring Simple Patterns Prefixes: Pattern that matches start of word. “anti” matches “antiquity”, “antibody”, etc. Suffixes: Pattern that matches end of word: “ix” matches “fix”, “matrix”, etc. Substrings: Pattern that matches arbitrary subsequence of characters. “rapt” matches “enrapture”, “velociraptor” etc. Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them. “tin” to “tix” matches “tip”, “tire”, “title”, etc.

CSE 8337 Spring Allowing Errors What if query or document contains typos or misspellings? Judge similarity of words (or arbitrary strings) using: Edit distance (cost of insert/delete/match) Longest Common Subsequence (LCS) Allow proximity search with bound on string similarity.

CSE 8337 Spring Longest Common Subsequence (LCS) Length of the longest subsequence of characters shared by two strings. A subsequence of a string is obtained by deleting zero or more characters. Examples: “misspell” to “mispell” is 7 “misspelled” to “misinterpretted” is 7 “mis…p…e…ed”

CSE 8337 Spring Regular Expressions Language for composing complex patterns from simpler ones. An individual character is a regex. Union: If e 1 and e 2 are regexes, then (e 1 | e 2 ) is a regex that matches whatever either e 1 or e 2 matches. Concatenation: If e 1 and e 2 are regexes, then e 1 e 2 is a regex that matches a string that consists of a substring that matches e 1 immediately followed by a substring that matches e 2 Repetition: If e 1 is a regex, then e 1 * is a regex that matches a sequence of zero or more strings that match e 1

CSE 8337 Spring Regular Expression Examples (u|e)nabl(e|ing) matches unable unabling enable enabling (un|en)*able matches able unable unenable enununenable

CSE 8337 Spring Enhanced Regex’s (Perl) Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for specific range of number of occurrences: {min,max}. A{1,5} One to five A’s. A{5,} Five or more A’s A{5} Exactly five A’s

CSE 8337 Spring Perl Regex’s Character classes: \w (word char) Any alpha-numeric (not: \W) \d (digit char) Any digit (not: \D) \s (space char) Any whitespace (not: \S). (wildcard) Anything Anchor points: \b (boundary) Word boundary ^ Beginning of string $ End of string

CSE 8337 Spring Perl Regex Examples U.S. phone number with optional area code: /\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b/ address: Note: Packages available to support Perl regex’s in Java

CSE 8337 Spring Structural Queries Assumes documents have structure that can be exploited in search. Structure could be: Fixed set of fields, e.g. title, author, abstract, etc. Hierarchical (recursive) tree structure: chapter titlesectiontitlesection titlesubsection chapter book

CSE 8337 Spring Queries with Structure Allow queries for text appearing in specific fields: “nuclear fusion” appearing in a chapter title SFQL: Relational database query language SQL enhanced with “full text” search. Select abstract from journal.papers where author contains “Teller” and title contains “nuclear fusion” and date < 1/1/1950