CS 430 / INFO 430 Information Retrieval


CS 430 / INFO 430 Information Retrieval Lecture 7 String Processing

Course administration
Assignment 1: Dump of Files 1a and 1b
Extra words added to assignment: For each file, list out the data in the first few records, with the values in the various fields. The definitions of the fields and the data structures used to store the records should be described in the report.

Course administration
Porter Stemming Algorithm: Complex suffixes
Complex suffixes are removed bit by bit in the different steps. Thus:
  GENERALIZATIONS
  becomes GENERALIZATION (Step 1)
  becomes GENERALIZE (Step 2)
  becomes GENERAL (Step 3)
  becomes GENER (Step 4)

Query Languages: the Common Query Language
The Common Query Language: a formal language for queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information.
Objective: human readable and human writable; intuitive while maintaining the expressiveness of more complex languages.
Traditionally, query languages have fallen into two camps:
(a) Powerful and expressive languages which are not easily readable or writable by non-experts (e.g. SQL and XQuery).
(b) Simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL or Google's query language).

The Common Query Language
The Common Query Language is maintained by the Z39.50 International Maintenance Agency at the Library of Congress.
http://www.loc.gov/z3950/agency/zing/cql/
The following examples are taken from the CQL Tutorial, A Gentle Introduction to CQL.

The Common Query Language: Examples
Simple queries
  dinosaur
  comp.sources.misc
  "complete dinosaur"
  "the complete dinosaur"
  "ext->u.generic"
  "and"
Booleans
  dinosaur or bird
  dinosaur and bird or dinobird
  (bird or dinosaur) and (feathers or scales)
  "feathered dinosaur" and (yixian or jehol)
  (((a and b) or (c not d) not (e or f and g)) and h not i) or j

The Common Query Language: Examples
Indexes [fielded searching]
  title = dinosaur
  title = ((dinosaur and bird) or dinobird)
  dc.title = saurischia
  bath.title="the complete dinosaur"
  srw.serverChoice=foo
  srw.resultSet=bar
Index-set mapping [definition of fields]
  >dc="http://www.loc.gov/srw/index-sets/dc" dc.title=dinosaur and dc.author=farlow

The Common Query Language: Examples
Proximity
The prox operator: prox/relation/distance/unit/ordering
Examples:
  complete prox dinosaur              [adjacent]
  (caudal or dorsal) prox vertebra
  ribs prox//5 chevrons               [near 5]
  ribs prox//0/sentence chevrons      [same sentence]
  ribs prox/>/0/paragraph chevrons    [different paragraphs]

The Common Query Language: Examples
Relations
  year > 1998
  title all "complete dinosaur"
  title any "dinosaur bird reptile"
  title exact "the complete dinosaur"
  publicationYear < 1980
  numberOfWheels <= 3
  numberOfPlates = 18
  lengthOfFemur > 2.4
  bioMass >= 100
  numberOfToes <> 3

The Common Query Language: Examples
Relation Modifiers
  title all/stem "complete dinosaur"
  title any/relevant "dinosaur bird reptile"
  title exact/fuzzy "the complete dinosaur"
  author = /fuzzy tailor
The implementations of relevant and fuzzy are not defined by the query language.

The Common Query Language: Examples
Pattern Matching
  dinosaur*               [zero or more characters]
  *sauria
  man?raptor              [exactly one character]
  man?raptor*
  "the comp*saur"
  char\*                  [literal "*"]
Word Anchoring
  title="^the complete dinosaur"     [beginning of field]
  author="bakker^"                   [end of field]
  author all "^kernighan ritchie"
  author any "^kernighan ^ritchie ^thompson"
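The masking characters above can be illustrated by translating a CQL term into a regular expression. This is only a sketch, not how any CQL server is required to implement matching: the function name `cql_pattern_to_regex` is hypothetical, it handles only `*`, `?` and backslash escapes, and it does not interpret the `^` anchors.

```python
import re

def cql_pattern_to_regex(term):
    """Translate CQL masking characters into a compiled Python regex.

    Hypothetical helper for illustration only:
      *   -> zero or more characters
      ?   -> exactly one character
      \\x -> the literal character x
    """
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # Escaped character: take the next character literally.
            parts.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))

# Match whole words with fullmatch, since the pattern describes an entire term.
pattern = cql_pattern_to_regex("man?raptor*")
print(bool(pattern.fullmatch("maniraptora")))
```

Using `fullmatch` rather than `search` reflects the assumption that a masked term must account for the whole word, not just part of it.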

The Common Query Language: Examples
A complete example
  dc.author=(kern* or ritchie) and (bath.title exact "the c programming language" or dc.title=elements prox///4 dc.title=programming) and subject any/relevant "style design analysis"
Find records whose author (in the Dublin Core sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.

Regular Expressions in Java
Package java.util.regex
Classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.

String Searching: Naive Algorithm
Objective: Given a pattern, find any substring of a given text that matches the pattern.
  p  pattern to be matched
  m  length of pattern p (characters)
  t  the text to be searched
  n  length of t (characters)
The naive algorithm examines the characters of t in sequence.
  for j from 1 to n-m+1
    if character j of t matches the first character of p
      (compare following characters of t and p until a complete match or a difference is found)
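The pseudocode above can be sketched in Python as follows. The function name `naive_search` is an assumption for illustration, and positions are 0-based, so the slide's position j corresponds to index j-1 here.

```python
def naive_search(p, t):
    """Return the 0-based start positions of every occurrence of p in t."""
    m, n = len(p), len(t)
    matches = []
    for j in range(n - m + 1):
        # Compare p against t starting at j until a mismatch or a full match.
        k = 0
        while k < m and t[j + k] == p[k]:
            k += 1
        if k == m:
            matches.append(j)
    return matches

print(naive_search("uni", "the uniform commercial code"))
```

In the worst case every start position is compared against most of the pattern, so the running time is proportional to n times m.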

String Searching: Knuth-Morris-Pratt Algorithm
Concept: The naive algorithm is modified so that, whenever a partial match is found, it may be possible to advance the character index, j, by more than 1.
Example:
  p = "university"
  t = "the uniform commercial code ..."
  At j=5 there is a partial match ("uni"); after the partial match, the search continues at j=8.
To indicate how far to advance the character pointer, p is preprocessed to create a table, which lists how far to advance against a given length of partial match. In the example, j is advanced by the length of the partial match, 3.
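A minimal sketch of the idea, using the common textbook formulation in which the preprocessed table stores, for each prefix of p, the length of its longest proper prefix that is also a suffix. The slide describes the table in terms of how far to advance j; the two views are related by advance = (length of partial match) - (table entry). The names `failure_table` and `kmp_search` are assumptions, and positions are 0-based.

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(p, t):
    """Return the 0-based start positions of every occurrence of p in t."""
    if not p:
        return []
    fail = failure_table(p)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(t):
        # On a mismatch, fall back in the pattern; the text index never moves back.
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(failure_table("university"))
print(kmp_search("uni", "the uniform commercial code"))
```

For "university" every table entry is 0, because no proper prefix of the pattern reappears as a suffix of a partial match; that is why, in the slide's example, j can be advanced by the full length of the partial match.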

Signature Files: Sequential Search without Inverted File
Inexact filter: A quick test which discards many of the non-qualifying items.
Advantages
• Much faster than full text scanning -- 1 or 2 orders of magnitude
• Modest space overhead -- 10% to 15% of file
• Insertion is straightforward
Disadvantages
• Sequential searching is no good for very large files
• Some hits are false hits

Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function.
Block. A sequence of text that contains D distinct words.
Block signature. The logical or of all the word signatures in a block of text.

Signature Files: Example
  Word              Signature
  free              001 000 110 010
  text              000 010 101 001
  block signature   001 010 111 011
F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block

Signature Files
A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature even though the word is not in the block. This is a false hit (also called a false drop).
The design challenge is to minimize the false drop probability, Fd. Frakes, Section 4.2, page 47 discusses how to minimize Fd. The rest of that chapter discusses enhancements to the basic algorithm.
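The matching test can be sketched with bitwise operations, using the signatures for "free" and "text" from the example slide. The function names are assumptions, the signatures are typed in directly rather than produced by a hash function, and the two query signatures at the end are hypothetical values chosen only to show a false hit and a rejection.

```python
def parse_signature(bits):
    """Convert a string such as '001 000 110 010' into an integer bit pattern."""
    return int(bits.replace(" ", ""), 2)

def block_signature(word_signatures):
    """Logical OR of all the word signatures in a block."""
    signature = 0
    for word_signature in word_signatures:
        signature |= word_signature
    return signature

def may_contain(block_sig, word_sig):
    """True if every 1 bit of the word signature is also set in the block signature."""
    return (block_sig & word_sig) == word_sig

FREE = parse_signature("001 000 110 010")   # from the example slide
TEXT = parse_signature("000 010 101 001")   # from the example slide
BLOCK = block_signature([FREE, TEXT])

# Hypothetical signatures for words that are NOT in the block.
FALSE_HIT = parse_signature("001 010 100 001")  # all four bits happen to be set in BLOCK
ABSENT = parse_signature("100 000 000 111")     # has a bit that BLOCK does not

print(format(BLOCK, "012b"))
print(may_contain(BLOCK, FREE), may_contain(BLOCK, FALSE_HIT), may_contain(BLOCK, ABSENT))
```

A positive result only means the block has to be checked against the actual text; a negative result is always reliable, which is what makes the signature an inexact filter.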

String Matching
Find File: Find all files whose name includes the string q.
Simple algorithm: Build an inverted index of all suffixes of the file names (substrings of the form *f).
Example: if the file name is foo.txt, search terms are:
  foo.txt
  oo.txt
  o.txt
  .txt
  txt
  xt
  t
Every substring of a file name is a prefix of one of its suffixes, so lexicographic processing of the index allows searching by any q.
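One way to sketch this scheme is a sorted list of (suffix, file name) pairs searched with binary search. The function names and the sample file names in the usage line are assumptions, and a sorted in-memory list stands in for whatever index structure a real system would use.

```python
import bisect

def build_suffix_index(filenames):
    """Return a sorted list of (suffix, filename) pairs for every suffix of every name."""
    entries = []
    for name in filenames:
        for i in range(len(name)):
            entries.append((name[i:], name))
    entries.sort()
    return entries

def find_files(index, q):
    """Return the sorted names of all files whose name contains q as a substring."""
    # (q,) sorts before every (suffix, name) pair whose suffix starts with q,
    # so bisect_left lands on the first candidate.
    lo = bisect.bisect_left(index, (q,))
    found = set()
    for suffix, name in index[lo:]:
        if not suffix.startswith(q):
            break  # suffixes sharing the prefix q are contiguous in sorted order
        found.add(name)
    return sorted(found)

index = build_suffix_index(["foo.txt", "bar.txt", "notes.md"])
print(find_files(index, "oo"))
```

The index stores one entry per character of each file name, which is acceptable for short names but is the reason longer texts call for the more compact tree structures discussed next.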

Search for Substring In some information retrieval applications, any substring can be a search term. Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

Tries: Search for Substring
Basic concept
The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.
The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.
Suffix trees have a size of the same order of magnitude as the input documents.

Tries: Suffix Tree
Example: suffix tree for the following words: begin, beginning, between, bread, break

  b
    e
      gin
        null
        ning
      tween
    rea
      d
      k
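The sharing of common parts can be sketched with a plain trie built from nested dictionaries. Unlike the tree in the example, this sketch stores one character per edge and does not merge chains such as "gin" or "tween" into single edges; the names `build_trie` and `words_with_prefix` are assumptions.

```python
END = None  # sentinel key marking the end of a stored word (the "null" leaf)

def build_trie(words):
    """Build a character trie as nested dicts; shared prefixes are stored once."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def words_with_prefix(trie, prefix):
    """Return, in sorted order, every stored word that starts with prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for key, child in current.items():
            if key is END:
                results.append(text)
            else:
                stack.append((child, text + key))
    return sorted(results)

trie = build_trie(["begin", "beginning", "between", "bread", "break"])
print(words_with_prefix(trie, "be"))
```

Everything below the node reached by a prefix is exactly the set of words sharing that prefix, which mirrors the statement that a subtree represents all occurrences of the substring at its root.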

Tries: Sistrings
A binary example
String: 01 100 100 010 111

  Sistring  Bits
  1         01 100 100 010 111
  2         11 001 000 101 11
  3         10 010 001 011 1
  4         00 100 010 111
  5         01 000 101 11
  6         10 001 011 1
  7         00 010 111
  8         00 101 11

Tries: Lexical Ordering

  Sistring  Bits
  7         00 010 111
  4         00 100 010 111
  8         00 101 11
  5         01 000 101 11
  1         01 100 100 010 111
  6         10 001 011 1
  3         10 010 001 011 1
  2         11 001 000 101 11

Unique string indicated in blue
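The ordering above can be reproduced with a short sketch that generates the first eight sistrings of the example string, sorts them, and finds the shortest prefix that distinguishes each one from the others. The function names are assumptions, and uniqueness is computed only among these eight sistrings, as in the example, not among all suffixes of the string.

```python
def sistrings(text, count):
    """Return {start position (1-based): sistring} for the first `count` positions."""
    return {i + 1: text[i:] for i in range(count)}

def unique_prefix(s, others):
    """Shortest prefix of s that is not a prefix of any string in others."""
    for length in range(1, len(s) + 1):
        prefix = s[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return s  # s is itself a prefix of another string

TEXT = "01100100010111"
SIS = sistrings(TEXT, 8)

# Lexical ordering of the sistring numbers.
ORDER = sorted(SIS, key=SIS.get)

# Distinguishing prefix of each sistring relative to the other seven.
PREFIXES = {
    pos: unique_prefix(s, [o for other_pos, o in SIS.items() if other_pos != pos])
    for pos, s in SIS.items()
}

for pos in ORDER:
    print(pos, SIS[pos], PREFIXES[pos])
```

The distinguishing prefixes are the paths a binary trie needs to store in order to separate the sistrings; the remaining bits of each sistring can be recovered from its position in the text.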

Trie: Basic Concept
[Figure: binary trie built from the eight sistrings; leaves are labeled with sistring numbers.]

Patricia Tree
[Figure: Patricia tree for the eight sistrings; internal nodes are labeled with bit numbers and leaves with sistring numbers.]
Single-descendant nodes are eliminated. Nodes have bit number.