Querying Structured Text in an XML Database By Xuemei Luo.

2 Introduction Data retrieval (DR):  provides means to formulate queries based on exact matches of data. Information retrieval (IR):  is based on the notion of the relevance of documents within a document collection.

3 Introduction Traditional databases (including XML databases):  deal efficiently with data retrieval  are not good at information retrieval XML provides a unified view of all kinds of structured and semi-structured data, as well as loosely structured documents. It is therefore important to integrate information retrieval into standard database querying.

4 Introduction Relevance ranking:  It is central to information retrieval.  It becomes more complex in XML.

5 Introduction An algebra called TIX for querying Text In XML was developed to integrate information retrieval techniques into a standard database query evaluation engine. New evaluation strategies were also developed to obtain good performance.

6 articles.xml: #a1 #a2 Internet Technologies #a3 Jane #a4 Doe #a5 #a10 Search and Retrieval #a11 #a12 #a13 Search Engine Basics... #a14 #a15 Information Retrieval Techniques... #a16 Examples #a17... Here are some IR based search engines:... #a18... search engine NewsInEssence uses a new information retrieval technology... #a19... semantic information retrieval techniques are also being incorporated into some search engines... #a20 Figure 1: Example XML Database

7 Query 1: simple IR-style query Find document components in articles.xml that are about ‘search engine’. Relevance to ‘internet’ and ‘information retrieval’ is desirable but not necessary. Query 2: structured IR-style query Find document components in articles.xml that are part of an article written by an author with last name ‘Doe’ and are about ‘search engine’. Relevance to ‘internet’ and ‘information retrieval’ is desirable but not necessary. Figure 2: Example IR-style Queries

8 Motivation Problems of a boolean specification:  OR: retrieves components that are relevant only to the two secondary terms but not to the primary term (#a15).  AND: loses the relevant paragraph (#a18).  AND and OR combined: it is hard to find a query expression that suits all possible database instances. Weighting and ranking support is therefore required in the boolean query engine.

9 Algebra - scored data tree Definition: It is a rooted ordered tree, such that each node carries data in the form of a set of attribute-value pairs, including at least a tag and a real-valued score. The score of the tree is the score of the root node.
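
The definition above can be sketched as a small data structure. This is only an illustration: the class name `ScoredNode`, the helper `tree_score`, and the tags `section` and `p` are assumptions made for the example, not names taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredNode:
    """One node of a scored data tree (hypothetical names, for illustration)."""
    tag: str                                      # every node carries at least a tag
    score: float = 0.0                            # and a real-valued score
    attrs: dict = field(default_factory=dict)     # any other attribute-value pairs
    children: list = field(default_factory=list)  # ordered list of child nodes


def tree_score(root: ScoredNode) -> float:
    """The score of a scored data tree is the score of its root node."""
    return root.score


# Toy tree: the tags and scores are made up for the example.
paragraph = ScoredNode("p", score=0.8, attrs={"id": "#a18"})
section = ScoredNode("section", score=0.5, children=[paragraph])
print(tree_score(section))
```

Keeping the score as an ordinary field next to the tag mirrors the definition, where the score is just one of the attribute-value pairs that every node must have.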

10 Algebra - scored pattern tree Definition: It is a triple P = (T,F,S), where:  T = (V,E) is a node- and edge-labeled tree: each node in V has a distinct integer as its label, and each edge is labeled pc (for parent-child relationship), ad (for ancestor-descendant relationship), or ad* (for self-or-descendant relationship).  F is a formula: a boolean combination of predicates applicable to nodes.  S is a set of scoring functions specifying how to calculate the score of each node.
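
A minimal sketch of the triple P = (T,F,S) as a Python structure follows. The class `ScoredPatternTree`, the way a candidate match is passed as a `binding` dictionary, the tags `article`, `sname` and `p`, and the toy scoring function are all assumptions for illustration; the pattern is only loosely inspired by Query 2 and does not reproduce Figure 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# pc = parent-child, ad = ancestor-descendant, ad* = self-or-descendant
EDGE_LABELS = {"pc", "ad", "ad*"}


@dataclass
class ScoredPatternTree:
    nodes: List[int]                             # V: distinct integer labels
    edges: List[Tuple[int, int, str]]            # E: (parent, child, edge label)
    formula: Callable[[Dict[int, dict]], bool]   # F: boolean combination of predicates
    scoring: Dict[int, Callable[[dict], float]]  # S: scoring function per scored node

    def __post_init__(self) -> None:
        if len(set(self.nodes)) != len(self.nodes):
            raise ValueError("node labels must be distinct integers")
        for parent, child, label in self.edges:
            if label not in EDGE_LABELS:
                raise ValueError(f"unknown edge label: {label}")
            if parent not in self.nodes or child not in self.nodes:
                raise ValueError("edge refers to an unknown node")


# Hypothetical pattern: an article ($1) with an author last name ($2)
# and some self-or-descendant component ($3) that gets a relevance score.
pattern = ScoredPatternTree(
    nodes=[1, 2, 3],
    edges=[(1, 2, "ad"), (1, 3, "ad*")],
    formula=lambda b: b[1]["tag"] == "article" and b[2]["text"] == "Doe",
    scoring={3: lambda node: float(node["text"].lower().count("search engine"))},
)

# A candidate match binds each pattern node to one data node.
binding = {
    1: {"tag": "article"},
    2: {"tag": "sname", "text": "Doe"},
    3: {"tag": "p", "text": "A search engine ranks results."},
}
print(pattern.formula(binding))
print(pattern.scoring[3](binding[3]))
```

Separating `formula` from `scoring` keeps the value-based constraints (which decide whether a match exists) apart from the scoring functions (which decide how good the match is).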

11 Figure 3: Scored Pattern Tree for Query 2

12 Scored pattern tree Nodes are constrained in the normal ways: the pattern imposes structural requirements on the nodes. the formula imposes value-based constraints. the scoring function defines how the scores of nodes are calculated.

13 Scored pattern tree Primary IR-node:  Defined by a scoring function and  Relevance finding is applied to the node Secondary IR-node:  A node that has primary IR-nodes in its sub-tree or  A node defined by a scoring function based on the scores of other IR-nodes.

14 Extension of existing operators Scored selection Scored projection

15 Scored selection Input: data trees Parameter: a scored pattern tree Output: scored data trees  Each scored data tree matches the scored pattern tree  The score of each data IR-node is calculated using the corresponding scoring function

16 Figure 5: Three Representative Result Trees of Query 2 with Selection The figure shows three of the results obtained by applying Query 2 to the example database in Figure 1. The scores of the IR-nodes are calculated using the functions defined in Figure 9 and are shown in square brackets.

17 Scored projection Input: data trees Parameters: a scored pattern tree, a projection list PL Output: scored data trees  Nodes that do not match the scored pattern tree, or that are not preserved by the PL, are eliminated from the output.

18 Figure 6: Result Tree of Query 2 with Projection PL = {$1, $3, $4}

19 New operators Threshold Pick

20 Threshold Input: scored data trees Parameters: a scored pattern tree P, a threshold condition TC. TC is either a real number value V or an integer K.

21 Threshold Each output scored data tree satisfies the threshold condition:  If TC is a value V: at least one data IR-node matching the query IR-node in the result data tree has a score higher than V.  If TC is an integer K: at least one data IR-node has a rank higher than K, where the rank is obtained by sorting the data IR-nodes by score.
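
The two forms of the threshold condition can be sketched as follows. The representation of a tree as a dictionary with an `ir_scores` list, the function names, and the sample scores are assumptions for the example. "Rank higher than K" is read here as "within the top K after sorting by descending score", which is one plausible interpretation.

```python
def threshold_by_value(trees, v):
    """Keep trees in which at least one data IR-node scores higher than v."""
    return [t for t in trees if any(s > v for s in t["ir_scores"])]


def threshold_by_rank(trees, k):
    """Keep trees that own at least one of the k best-scored data IR-nodes."""
    # Flatten to (score, tree index) pairs and sort by descending score.
    ranked = sorted(
        ((s, i) for i, t in enumerate(trees) for s in t["ir_scores"]),
        key=lambda pair: -pair[0],
    )
    keep = {i for _, i in ranked[:k]}
    return [t for i, t in enumerate(trees) if i in keep]


# Made-up scores, for illustration only.
trees = [
    {"id": "t1", "ir_scores": [0.9, 0.4]},
    {"id": "t2", "ir_scores": [0.3]},
    {"id": "t3", "ir_scores": [0.8]},
]
print([t["id"] for t in threshold_by_value(trees, 0.5)])
print([t["id"] for t in threshold_by_rank(trees, 1)])
```

The value form can be checked one tree at a time, while the rank form has to see every data IR-node in the input before it can decide anything.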

22 Pick Input: scored data trees Parameters: a scored pattern tree, a pick-criterion PC Pick is the key operator for removing redundancy from the results.

23 Pick Pick is different from projection: Projection only needs information local to the node being projected, e.g., the tag name. Pick needs information that may reside elsewhere in the data tree, e.g., the ancestor nodes. Pick operator is usually applied after the projection operator to eliminate the redundancy

24 Figure 8: Result of Query 2 with Projection Followed by Pick PC condition (PickFoo):  Any data IR-node with a score of at least 0.8 is considered relevant.  For any data IR-node (starting with the one highest in the tree hierarchy): if more than 50% of its child nodes are relevant, and its direct parent node is not picked or it has no parent node, then the data IR-node is picked (parent/child redundancy elimination).
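
A pick criterion of this kind can be sketched as a top-down traversal. This is a plain recursive illustration, not the stack-based algorithm of Figure 12, and the names `Node` and `pick` are hypothetical. The slide does not say how a node without children is treated; the sketch assumes that such a node qualifies when it is itself relevant.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    score: float
    children: List["Node"] = field(default_factory=list)


def pick(node: Node, parent_picked: bool = False, threshold: float = 0.8) -> List[str]:
    """Return the names of picked nodes, visiting the tree top-down."""
    if node.children:
        relevant = sum(1 for c in node.children if c.score >= threshold)
        qualifies = relevant * 2 > len(node.children)  # more than 50% relevant
    else:
        qualifies = node.score >= threshold            # assumption for leaf nodes
    picked_here = qualifies and not parent_picked      # direct parent must not be picked
    result = [node.name] if picked_here else []
    for child in node.children:
        result += pick(child, picked_here, threshold)
    return result


# Made-up scores: two of the three paragraphs are relevant, so the
# section is returned once instead of each paragraph separately.
section = Node("section", 0.9, [Node("p1", 0.85), Node("p2", 0.9), Node("p3", 0.2)])
print(pick(section))

# Only one of three children is relevant, so that child is returned on its own.
sparse = Node("root", 0.1, [Node("a", 0.9), Node("b", 0.1), Node("c", 0.3)])
print(pick(sparse))
```

Passing `picked_here` down to the children is what makes the decision non-local: whether a node is picked depends on what happened to its parent, which is the difference from projection.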

25 Figure 9: Example User Functions

26 Example Using the example database and the scored pattern tree, the top result (#a10) is obtained as follows:  Projection: generates Figure 6.  Pick: generates Figure 8.  Selection: generates a collection of five trees corresponding to the five primary data IR-nodes.  Threshold: selects the highest-scored result. The subtree rooted at #a10 can then be retrieved.

27 Extension of XQuery Figure 10: XQuery Expression of IR-style Queries

28 Access methods Score generating methods:  TermJoin  PhraseFinder Score modifying methods Score utilizing methods

29 Score generating methods  More than one term may be used for relevance scoring.  Term matching is the most common IR predicate: a node is scored based on how many of the terms occur in the node itself and in its descendant nodes.  Phrase matching

30 Score generating methods TermJoin algorithm  Implement score generation based on term matching  Find all ancestors that are common among the terms in a query.  Terms are read from an inverted index. PhraseFinder algorithm  Use word offset information in the index to verify phrase occurrence.  Use phrase occurrences to generate appropriate score values.
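
The idea of verifying phrases through word offsets in an inverted index can be sketched as below. This is not the PhraseFinder algorithm itself, only a small illustration of the principle; the index layout (term to node to list of offsets), the function names, and the use of the phrase count as the score are assumptions. The sample texts are fragments quoted from the example database in Figure 1.

```python
from collections import defaultdict


def build_index(nodes):
    """Build term -> {node_id: [word offsets]} from node_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for node_id, text in nodes.items():
        for offset, word in enumerate(text.lower().split()):
            index[word][node_id].append(offset)
    return index


def phrase_counts(index, phrase):
    """Count phrase occurrences per node by checking consecutive offsets."""
    terms = phrase.lower().split()
    if not terms:
        return {}
    counts = {}
    for node_id, offsets in index.get(terms[0], {}).items():
        hits = 0
        for start in offsets:
            # The i-th following term must sit exactly i words after the first one.
            if all(
                start + i in index.get(term, {}).get(node_id, ())
                for i, term in enumerate(terms[1:], 1)
            ):
                hits += 1
        if hits:
            counts[node_id] = hits
    return counts


nodes = {
    "#a18": "search engine NewsInEssence uses a new information retrieval technology",
    "#a19": "semantic information retrieval techniques are also being incorporated into some search engines",
}
index = build_index(nodes)
print(phrase_counts(index, "search engine"))
print(phrase_counts(index, "information retrieval"))
```

Because the index stores offsets, the phrase check never rereads the text: it only compares positions. Without stemming, the plural "search engines" in #a19 does not match the phrase "search engine".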

31 Score modifying methods EXAMPLE: Consider the value join access method. It takes in two sets of scored witness trees and outputs a set of scored witness trees where each witness tree is the merging of two input witness trees that satisfied the join condition. c is the join condition A and B are the non-scored versions of input sets A and B. s is a score assigned to an output tree x.

32 Score utilizing methods Properties of the PC condition:  A notion of a relevance score threshold for data IR-nodes in the input collection.  Removal of redundancy either along the ad relationship or along the sibling relationship. Challenge of ad redundancy:  All nodes need to be examined. Pick algorithm:  Uses a stack-based strategy to eliminate redundancy.

33 Figure 12: Algorithm Pick

34 Experimental evaluation Goal: to evaluate the performance of the new access methods.  Use an XML database system.  Run each experiment five times.  Ignore the lowest and the highest readings, and average the remaining three.
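
The measurement procedure described above amounts to a trimmed mean over five runs. A minimal sketch, with a hypothetical function name and made-up timings:

```python
def trimmed_mean(readings):
    """Drop the lowest and highest of five readings and average the other three."""
    if len(readings) != 5:
        raise ValueError("expected exactly five readings")
    middle = sorted(readings)[1:-1]  # discard the minimum and the maximum
    return sum(middle) / len(middle)


# Made-up timings in seconds; 10.0 and 1.0 are discarded as outliers.
print(trimmed_mean([4.0, 2.0, 3.0, 10.0, 1.0]))
```

Dropping the extremes keeps a single slow run (for example a cold cache) from distorting the reported time.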

35 Experimental evaluation  TermJoin and PhraseFinder improve performance by a factor of two.  Pick eliminates redundancy efficiently.

36 Table 1: Performance (in seconds) of the different techniques using queries with different number of terms

37 Table 2: Performance (in seconds) of the PhraseFinder and Composite of Access Methods

38 Conclusion A new algebra, TIX, has been developed to integrate information retrieval into standard database querying. Advantages of TIX:  It manages the relevance score.  It manages result granularity. New access methods have been developed to manipulate scores, and they improve performance effectively.

39 Q&A