Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.

Slides:



Advertisements
Similar presentations
CHAPTER OBJECTIVE: NORMALIZATION THE SNOWFLAKE SCHEMA.
Advertisements

Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
Multimedia Database Systems
Basic IR: Modeling Basic IR Task: Slightly more complex:
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
1 CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Ming Li.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01.
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
CS 430 / INFO 430 Information Retrieval
Modern Information Retrieval Chapter 8 Indexing and Searching.
IR Models: Overview, Boolean, and Vector
Modern Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
IR Models: Structural Models
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Chapter 2Modeling 資工 4B 陳建勳. Introduction.  Traditional information retrieval systems usually adopt index terms to index and retrieve documents.
Chapter 4 : Query Languages Baeza-Yates, 1999 Modern Information Retrieval.
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
1 Query Language Baeza-Yates and Navarro Modern Information Retrieval, 1999 Chapter 4.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Modern Information Retrieval Chapter 4 Query Languages.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Adding metadata to web pages Please note: this is a temporary test document for use in internal testing only.
+ Atomic Design By Pattern Lab Delaney Metzger. + Goal of Atomic Design Atomic Design is an idea that can be used to translate the creative side into.
Modern Information Retrieval Chap. 02: Modeling (Structured Text Models)
Basic Web Applications 2. Search Engine Why we need search ensigns? Why we need search ensigns? –because there are hundreds of millions of pages available.
Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source:
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Querying Structured Text in an XML Database By Xuemei Luo.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Information Retrieval Models - 1 Boolean. Introduction IR systems usually adopt index terms to process queries Index terms:  A keyword or group of selected.
Web Document Clustering: A Feasibility Demonstration Oren Zamir and Oren Etzioni, SIGIR, 1998.
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Chapter 6: Information Retrieval and Web Search
A fast algorithm for the generalized k- keyword proximity problem given keyword offsets Sung-Ryul Kim, Inbok Lee, Kunsoo Park Information Processing Letters,
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval
Advantages of Query Biased Summaries in Information Retrieval by A. Tombros and M. Sanderson Presenters: Omer Erdil Albayrak Bilge Koroglu.
Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.
Structured Text Retrieval Models. Str. Text Retrieval Text Retrieval retrieves documents based on index terms. Observation: Documents have implicit structure.
Introduction n IR systems usually adopt index terms to process queries n Index term: u a keyword or group of selected words u any word (more general) n.
The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.
Chapter Three Presentation: User interface How to Build a Digital Library Ian H. Witten and David Bainbridge.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
Chapter 13. Structured Text Retrieval With Mounia Lalmas 무선 / 이동 시스템 연구실 김민혁.
Lecture 1: Introduction and the Boolean Model Information Retrieval
Text Based Information Retrieval
Searching EIT, Author Gay Robertson, 2017.
Data Mining Chapter 6 Search Engines
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Models for Retrieval and Browsing - Structural Models and Browsing
Recuperação de Informação B
Recuperação de Informação B
Information Retrieval and Web Design
Recuperação de Informação B
Advanced information retrieval
Presentation transcript:

Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body of the document n But, the document structure is one additional piece of information which can be taken advantage of n For instance, words appearing in the title or in sub-titles within the document could receive higher

Introduction n Consider the following information need: u Retrieve all documents which contain a page in which the string “atomic holocaust” appears in italic in the text surrounding a Figure whose label contains the word earth n The corresponding query could be: u same-page( near( “atomic holocaust”, Figure( label( “earth” ))))

Introduction n Advanced interfaces that facilitate the specification of the structure are also highly desirable n Models which allow combining information on text content with information on document structure are called structured text models n Structured text models include no ranking (open research problem)

Basic Definitions n Match point: the position in the text of a sequence of words that match the query u Query: “atomic holocaust in Hiroshima” u Doc dj: contains 3 lines with this string u Then, doc dj contains 3 match points n Region: a contiguous portion of the text n Node: a structural component of the text such as a chapter, a section, etc

Non-Overlapping Lists n Due to Burkowski, n Idea: divide the text in non-overlapping regions which are collected in a list n Multiple ways to divide the text in non- overlapping parts yield multiple lists: u a list for chapters u a list for sections u a list for subsections n Text regions from distinct lists might overlap

Non-Overlapping Lists L0L0 L1L1 L2L2 Sections SubSections SubSubSections L3L3 Chapter

Non-Overlapping Lists n Implementation: u single inverted file that combines keywords and text regions u to each entry in this inverted file is associated a list of text regions u lists of text regions can be merged with lists of keywords

Non-Overlapping Lists n Regions are non-overlapping which limits the queries that can be asked n Types of queries: u select a region that contains a given word u select a region A that does not contain a region B (regions A and B belong to distinct lists) u select a region not contained within any other region

Conclusions n The non-overlapping lists model is simple and allows efficient implementation n But, types of queries that can be asked are limited n Also, model does not include any provision for ranking the documents by degree of similarity to the query n What does structural similarity mean?

Proximal Nodes n Due to Navarro and Baeza-Yates, 1997 n Idea: define a strict hierarchical index over the text. This enrichs the previous model that used flat lists. n Multiple index hierarchies might be defined n Two distinct index hierarchies might refer to text regions that overlap

Definitions n Each indexing structure is a strict hierarchy composed of u chapters u sections u subsections u paragraphs u lines n Each of these components is called a node n To each node is associated a text region

Proximal Nodes Sections SubSections SubSubSections Chapter holocaust ,324

Proximal Nodes n Key points: u In the hierarchical index, one node might be contained within another node u But, two nodes of a same hierarchy cannot overlap u The inverted list for keywords complements the hierarchical index u The implementation here is more complex than that for non-overlapping lists

Proximal Nodes n Queries are now regular expressions: u search for strings u references to structural components u combination of these n Model is a compromise between expressiveness and efficiency n Queries are simple but can be processed efficiently n Further, model is more expressive than non- overlapping lists

Proximal Nodes n Query: find the sections, the subsections, and the subsubsections that contain the word “holocaust” u [(*section) with (“holocaust”)] n Simple query processing: u traverse the inverted list for “holocaust” and determine all match points u use the match points to search in the hierarchical index for the structural components

Proximal Nodes n Query: [(*section) with (“holocaust”)] n Sophisticated query processing: u get the first entry in the inverted list for “holocaust” u use this match point to search in the hierarchical index for the structural components u Innermost matching component: smaller one u Check if innermost matching component includes the second entry in the inverted list for “holocaust” u If it does, check the third entry and so on u This allows matching efficiently the nearby (or proximal) nodes

Conclusions n Model allows formulating queries that are more sophisticated than those allowed by non- overlapping lists n To speed up query processing, nearby nodes are inspected n Types of queries that can be asked are somewhat limited (all nodes in the answer must come from a same index hierarchy!) n Model is a compromise between efficiency and expressiveness