1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

Longest Common Subsequence
Analysis of Algorithms
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
Modern Information Retrieval Chapter 8 Indexing and Searching.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
Modern Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Modern Information Retrieval Chapter 1: Introduction
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
Chapter 4 : Query Languages Baeza-Yates, 1999 Modern Information Retrieval.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Indexing and Searching
Modern Information Retrieval Chapter 4 Query Languages.
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Text Search and Fuzzy Matching
Modern Information Retrieval Chap. 02: Modeling (Structured Text Models)
Tutorial 14 Working with Forms and Regular Expressions.
ASP.NET Programming with C# and SQL Server First Edition Chapter 5 Manipulating Strings with C#
Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
Regular Expressions Regular expressions are a language for string patterns. RegEx is integral to many programming languages:  Perl  Python  Javascript.
Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo.
ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Cleansing Ola Ekdahl IT Mentors 9/12/08.
1 Languages. 2 A language is a set of strings String: A sequence of letters Examples: “cat”, “dog”, “house”, … Defined over an alphabet:
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and.
IR Homework #2 By J. H. Wang Mar. 31, Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.
Instructor: Craig Duckett Lecture 08: Thursday, October 22 nd, 2015 Patterns, Order of Evaluation, Concatenation, Substrings, Trim, Position 1 BIT275:
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
12. Regular Expressions. 2 Motto: I don't play accurately-any one can play accurately- but I play with wonderful expression. As far as the piano is concerned,
Tutorial 13 Validating Documents with Schemas
Query Languages Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Vector Space Models.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
1 Information Retrieval LECTURE 1 : Introduction.
1 String Processing CHP # 3. 2 Introduction Computer are frequently used for data processing, here we discuss primary application of computer today is.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
7 Copyright © 2009, Oracle. All rights reserved. Regular Expression Support.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Restrictions Objectives of the Lecture : To consider the algebraic Restrict operator; To consider the Restrict operator and its comparators in SQL.
An Introduction to Regular Expressions Specifying a Pattern that a String must meet.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
Why indexing? For efficient searching of a document
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
CS 430: Information Discovery
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Query Languages.
Matcher functions boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern. boolean lookingAt() Attempts to.
Query Languages Berlin Chen 2003 Reference:
Information Retrieval and Web Design
Languages Fall 2018.
Presentation transcript:

1 Query Languages

2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic.

3 Boolean Retrieval with Inverted Indices Primitive keyword: Retrieve containing documents using the inverted index. OR: Recursively retrieve e 1 and e 2 and take union of results. AND: Recursively retrieve e 1 and e 2 and take intersection of results. BUT: Recursively retrieve e 1 and e 2 and take set difference of results.

4 “Natural Language” Queries Full text queries as arbitrary strings. Typically just treated as a bag-of-words for a vector-space model. Typically processed using standard vector- space retrieval methods.

5 Phrasal Queries Retrieve documents with a specific phrase (ordered list of contiguous words) –“information theory” May allow intervening stop words and/or stemming. –“buy camera” matches: “buy a camera” “buying the cameras” etc.

6 Phrasal Retrieval with Inverted Indices Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start contiguity check with the least common word in the phrase.

7 Phrasal Search Find set of documents D in which all keywords (k 1 …k m ) in phrase occur (using AND query processing). Intitialize empty set, R, of retrieved documents. For each document, d, in D: Get array, P i,of positions of occurrences for each k i in d Find shortest array P s of the P i ’s For each position p of keyword k s in P s For each keyword k i except k s Use binary search to find a position (p – s + i) in the array P i If correct position for every keyword found, add d to R Return R

8 Proximity Queries List of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words match “…dogs will begin the race…” May also perform stemming and/or not count stop words.

9 Proximity Retrieval with Inverted Index Use approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints. During binary search for positions of remaining keywords, find closest position of k i to p and check that it is within maximum allowed distance.

10 Pattern Matching Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

11 Simple Patterns Prefixes: Pattern that matches start of word. –“anti” matches “antiquity”, “antibody”, etc. Suffixes: Pattern that matches end of word: –“ix” matches “fix”, “matrix”, etc. Substrings: Pattern that matches arbitrary subsequence of characters. – “rapt” matches “enrapture”, “velociraptor” etc. Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them. –“tin” to “tix” matches “tip”, “tire”, “title”, etc.

12 Allowing Errors What if query or document contains typos or misspellings? Judge similarity of words (or arbitrary strings) using: –Edit distance (Levenstein distance) –Longest Common Subsequence (LCS) Allow proximity search with bound on string similarity.

13 Edit (Levenstein) Distance Minimum number of character deletions, additions, or replacements needed to make two strings equivalent. –“misspell” to “mispell” is distance 1 –“misspell” to “mistell” is distance 2 –“misspell” to “misspelling” is distance 3 Can be computed efficiently using dynamic programming in O(mn) time where m and n are the lengths of the two strings being compared.

14 Longest Common Subsequence (LCS) Length of the longest subsequence of characters shared by two strings. A subsequence of a string is obtained by deleting zero or more characters. Examples: –“misspell” to “mispell” is 7 –“misspelled” to “misinterpretted” is 7 “mis…p…e…ed”

15 Searching for Similar Words When spell-correcting a word, it is inefficient to serially search every word in the dictionary, compute the edit distance or LCS for each, and then take the most similar word. Use indexing to find most similar dictionary word without doing a linear search.

16 k-gram Index An inverted index for sequences of k characters contained in a word. –3-grams for “index”: $in, ind, nde, dex, ex$ (where $ is a special char denoting start or end of a word) For each k-gram encountered in the dictionary, the k-gram index has a pointer to all words that contain that k-gram. –dex → {index, dexterity, ambidextrous}

17 Using a k-gram Index Given a word, generate its “bag of k-grams” and use the k-gram index like a normal inverted index to find a word that contains many of the same k- grams. Like normal document retrieval except: –words → k-grams –documents → words Example: –Query: endex →{$en, end, nde, dex, ex$} –Retrieval Result: 1) index, 2) ended, 3) endear…. –Compute detailed score just for top retrievals and take final top-scoring candidate.

18 Regular Expressions Language for composing complex patterns from simpler ones. –An individual character is a regex. –Union: If e 1 and e 2 are regexes, then (e 1 | e 2 ) is a regex that matches whatever either e 1 or e 2 matches. –Concatenation: If e 1 and e 2 are regexes, then e 1 e 2 is a regex that matches a string that consists of a substring that matches e 1 immediately followed by a substring that matches e 2 –Repetition (Kleene closure): If e 1 is a regex, then e 1 * is a regex that matches a sequence of zero or more strings that match e 1

19 Regular Expression Examples (u|e)nabl(e|ing) matches –unable –unabling –enable –enabling (un|en)*able matches –able –unable –unenable –enununenable

20 Enhanced Regex’s (Perl) Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for specific range of number of occurrences: {min,max}. –A{1,5} One to five A’s. –A{5,} Five or more A’s –A{5} Exactly five A’s

21 Perl Regex’s Character classes: –\w (word char) Any alpha-numeric (not: \W) –\d (digit char) Any digit (not: \D) –\s (space char) Any whitespace (not: \S) –. (wildcard) Anything Anchor points: –\b (boundary) Word boundary –^ Beginning of string –$ End of string

22 Perl Regex Examples U.S. phone number with optional area code: –/\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b/ address: Note: Perl regex’s supported in java.util.regex package

23 Structural Queries Assumes documents have structure that can be exploited in search. Structure could be: –Fixed set of fields, e.g. title, author, abstract, etc. –Hierarchical (recursive) tree structure: chapter titlesectiontitlesection titlesubsection chapter book

24 Queries with Structure Allow queries for text appearing in specific fields: –“nuclear fusion” appearing in a chapter title SFQL: Relational database query language SQL enhanced with “full text” search. –Select abstract from journal.papers where author contains “Teller” and title contains “nuclear fusion” and date < 1/1/1950