HKU CSIS DB Seminar: Efficient Filtering of XML Documents for Selective Dissemination of Information. Mehmet Altinel, Michael J. Franklin.

Presentation transcript:

HKU CSIS DB Seminar: Efficient Filtering of XML Documents for Selective Dissemination of Information. Mehmet Altinel, Michael J. Franklin, VLDB 2000. Speaker: Eric Lo

Introduction
- The increasing volume of data available in electronic form and the proliferation of the Internet have accelerated the development of SDI (Selective Dissemination of Information).
- The goal of selective dissemination of information is to avoid sending users/subscribers unnecessary information.
- An SDI application:
  - collects new data in a timely manner, such as stock quotes, traffic news, sports tickers and music
  - filters the data against subscriber profiles
  - delivers the relevant data to interested subscribers

Introduction
- Current SDI systems are based on simple keyword matching and typical IR techniques. E.g. a subscriber profile with the keyword “NBA” matches every news item in which the keyword “NBA” appears.
- HOWEVER, they still suffer from the typical problem: subscribers also receive irrelevant information, such as news with the headline “Bill Gates loves to watch NBA”.
- Although current systems pay much attention to improving effectiveness, they neglect EFFICIENCY.

Introduction
- One use of XML is as a standard information exchange mechanism.
- XML allows structural information to be encoded within documents, which makes it possible to create more focused and accurate profiles of user interests.
- “XFilter”, the system presented in this paper, addresses the concerns mentioned above.

XML-based SDI Architecture
- Subscribers use a GUI to specify their profiles.
- The underlying profile language is XPath, e.g. /sports/nba//news

XFilter Architecture
4 major components:
1. Event-based parser for XML documents
2. XPath parser for user profiles
3. Filter engine, which matches profiles against XML documents
4. Dissemination engine, which delivers the filtered data

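Generally, how does the system work?
3 subscribers, each with one profile:
- Q1: /sports/nba//news, with path nodes [Q1-1] [Q1-2] [Q1-3]
- Q2: //nba/*/news, with path nodes [Q2-1] [Q2-2]
- Q3: /stocks/quotes/PCCW, with path nodes [Q3-1] [Q3-2] [Q3-3]
[Figure: an incoming document, New_incoming_document.xml, is checked against a query index keyed on the element names sports, nba, news, stocks, quotes and PCCW; each entry has a Candidate List and a Wait List holding the path nodes Q1-1 to Q3-3.]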

Filter Engine of XFilter
- XFilter converts each XPath query into a Finite State Machine (FSM).
- A subscriber's XPath query (profile) MATCHES an XML document WHEN the FSM of the query reaches its final state.
- A Query Index is built over the states of the FSMs of the XPath queries.

Inside Filter Engine

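Path Nodes
- The XPath parser decomposes each XPath query into a set of path nodes.
- Only elements become path nodes (attributes do not); each path node acts as a state of the FSM.
- Wildcards (*) are skipped: they produce no path node.
- E.g. /sports/nba//news yields the path nodes sports, nba and news.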

Path Nodes Information
Each path node stores:
- Query ID
- Position
- Relative Position:
  - 0 for the 1st node, if the 1st node is not preceded by “//”
  - -1 for any node preceded by “//”
  - otherwise, 1 + (number of “*” nodes between the node and its predecessor node)
- Level:
  - if the node is the 1st node and is at an absolute distance from the root, level = 1 + (distance from root)
  - if the Relative Position is -1, the level is also -1
  - otherwise, 0
Example queries: Q1 = /sports/nba//news, with path nodes Q1-1, Q1-2, Q1-3; Q2 = //nba/*/news/Bulls, with path nodes Q2-1, Q2-2, Q2-3.
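The rules for Relative Position and Level can be sketched in Python as follows. This is an illustration only, not code from the paper or the talk: the names `PathNode` and `decompose` are hypothetical, and the sketch assumes queries that consist solely of `/`, `//`, `*` and element names (no attribute filters).

```python
import re
from dataclasses import dataclass

@dataclass
class PathNode:
    query_id: str
    position: int      # 1-based index among the query's path nodes
    relative_pos: int  # 0, -1 ("//" precedes the node), or 1 + skipped wildcards
    level: int         # required document level; -1 = any, 0 = set while filtering
    name: str

def decompose(query_id, xpath):
    """Split a simple XPath query into path nodes (wildcards produce none)."""
    steps = re.findall(r"(//|/)([^/]+)", xpath)
    nodes = []
    wildcards = 0       # "*" steps seen since the previous path node
    descendant = False  # "//" seen since the previous path node
    for axis, name in steps:
        if axis == "//":
            descendant = True
        if name == "*":
            wildcards += 1
            continue
        if descendant:
            rel, level = -1, -1
        elif not nodes:                    # 1st node at a fixed distance from root
            rel, level = 0, 1 + wildcards
        else:
            rel, level = 1 + wildcards, 0
        nodes.append(PathNode(query_id, len(nodes) + 1, rel, level, name))
        wildcards, descendant = 0, False
    return nodes

for node in decompose("Q1", "/sports/nba//news") + decompose("Q2", "//nba/*/news/Bulls"):
    print(node)
```

For Q2 the wildcard produces no node of its own; it only raises the Relative Position of the following node (news) to 2.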

Query Index
- All path nodes are added to the Query Index (a hash table keyed on element names).
- Each unique element name is associated with two lists: the Candidate List (CL) and the Wait List (WL).
- The current node of each query is placed on the CL; the other nodes are placed on the WL.
- The FSM moves to its next state when a path node is promoted from the WL to the CL.
[Figure: query index with entries sports, nba, news, stocks, quotes and PCCW, each with a Candidate List and a Wait List holding the path nodes Q1-1 to Q3-3.]
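A minimal sketch of how such an index could be built, assuming each query has already been decomposed into the ordered element names of its path nodes. The function name `build_query_index` and the dictionary layout are hypothetical choices for illustration.

```python
from collections import defaultdict

def build_query_index(queries):
    """queries maps a query ID to the ordered element names of its path nodes."""
    index = defaultdict(lambda: {"CL": [], "WL": []})
    for qid, names in queries.items():
        for pos, name in enumerate(names, start=1):
            node_id = f"{qid}-{pos}"
            # Only the first (current) path node starts on the Candidate List.
            index[name]["CL" if pos == 1 else "WL"].append(node_id)
    return index

queries = {
    "Q1": ["sports", "nba", "news"],     # /sports/nba//news
    "Q2": ["nba", "news"],               # //nba/*/news
    "Q3": ["stocks", "quotes", "PCCW"],  # /stocks/quotes/PCCW
}
index = build_query_index(queries)
for name, lists in index.items():
    print(name, lists)
```

With the three example queries, the entry for nba holds Q2-1 on its CL and Q1-2 on its WL, while the entry for news starts with an empty CL.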

XML Parsing and Filtering
When an XML document arrives, it is run through the event-driven SAX XML parser, and the Query Index is consulted whenever the parser encounters:
- a begin element tag
- an end element tag
- data internal to an element
Example of the events that the SAX API reports for an input XML document:
- Start document
- Start element: sports
- Start element: news
- Start element: ball games
- Start element: nba
- Characters: Michael Jordan
- End element: nba
- …
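The event stream can be reproduced with Python's standard `xml.sax` module. The sketch below uses a small hypothetical input document (not the one from the slide) and also tracks the element level, which the filter needs later; the class name `EventLogger` is an assumption of this example.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records SAX callbacks together with the current element level."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.events = []

    def startElement(self, name, attrs):
        self.level += 1
        self.events.append(("start", name, self.level))

    def characters(self, content):
        self.events.append(("chars", content, self.level))

    def endElement(self, name):
        self.events.append(("end", name, self.level))
        self.level -= 1

# Hypothetical input document.
doc = b"<sports><news><nba>Michael Jordan</nba></news></sports>"
logger = EventLogger()
xml.sax.parseString(doc, logger)
for event in logger.events:
    print(event)
```

Note that a SAX parser is allowed to deliver character data in several chunks, so a real handler should not assume one `characters` call per text node.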

XML Parsing and Filtering (cont)
Start_Element_Handler (element name, element level, attribute names, attribute values) { Look up the element name in the Query Index, examine all nodes in its CL, and perform the LEVEL CHECK and the ATTRIBUTE FILTER CHECK on each of them }

Level Check and Attribute Check
- The level check ensures that the level at which the element appears in the document matches the level expected by the user query. Recall:
  - if the level of a path node is -1, its relative position is -1, i.e. a “//” precedes the node, so its level is unrestricted
  - otherwise, the level of the path node must equal the level of the input element
- The attribute filter check applies any simple predicates that reference the attributes of the element.

Level Check and Attribute Check
- If both the level check and the attribute check succeed, the node passes.
- If the node is the final path node (final state) of the query (e.g. Q1-3), the document matches the query; otherwise, the query moves to its next state.
- The state move is done by copying the next node of the query from the WL to the CL and updating the corresponding relative position and level.

End element handler and character handler
- When the SAX parser encounters an end element tag, the path node of that element is deleted from the CL.
- When the SAX parser encounters element data, it works like the start element handler, except that it performs a content check rather than an attribute check.
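The start and end element handlers described above can be sketched as a small SAX handler. This is a simplified illustration under several assumptions, not the XFilter implementation: attribute and content checks are omitted, each query's step list stands in for its Wait List, a promoted node's level is taken to be the current level plus its relative position, and the end handler removes the nodes that the closing element promoted. The names `XFilterSketch`, `run_filter` and `QUERIES` are hypothetical.

```python
import xml.sax
from dataclasses import dataclass

@dataclass
class Node:
    qid: str
    pos: int    # index of this path node within its query
    level: int  # required document level, or -1 for "any level"

class XFilterSketch(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        # queries: qid -> list of (element name, relative position, initial level)
        self.queries = queries
        self.cl = {}        # element name -> Candidate List of Node objects
        self.depth = 0      # level of the element currently open
        self.promoted = []  # stack: nodes that each open element added to a CL
        self.matched = set()
        for qid, steps in queries.items():
            name, _rel, level = steps[0]
            self.cl.setdefault(name, []).append(Node(qid, 0, level))

    def startElement(self, name, attrs):
        self.depth += 1
        added = []
        for node in list(self.cl.get(name, [])):    # snapshot: promotions may append
            if node.level not in (-1, self.depth):  # level check
                continue
            steps = self.queries[node.qid]
            if node.pos == len(steps) - 1:          # final state: document matches
                self.matched.add(node.qid)
                continue
            nxt_name, rel, _ = steps[node.pos + 1]
            nxt_level = -1 if rel == -1 else self.depth + rel
            nxt = Node(node.qid, node.pos + 1, nxt_level)
            self.cl.setdefault(nxt_name, []).append(nxt)
            added.append((nxt_name, nxt))
        self.promoted.append(added)

    def endElement(self, name):
        # Undo the promotions made when this element was opened.
        for nxt_name, nxt in self.promoted.pop():
            self.cl[nxt_name].remove(nxt)
        self.depth -= 1

def run_filter(queries, doc):
    handler = XFilterSketch(queries)
    xml.sax.parseString(doc, handler)
    return handler.matched

QUERIES = {
    "Q1": [("sports", 0, 1), ("nba", 1, 0), ("news", -1, -1)],    # /sports/nba//news
    "Q2": [("nba", -1, -1), ("news", 2, 0)],                      # //nba/*/news
    "Q3": [("stocks", 0, 1), ("quotes", 1, 0), ("PCCW", 1, 0)],   # /stocks/quotes/PCCW
}
print(run_filter(QUERIES, b"<sports><nba><team><news>Bulls win</news></team></nba></sports>"))
```

In this hypothetical document, news sits two levels below nba, so both Q1 (any depth below nba) and Q2 (exactly one element in between) reach their final state.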

List Balancing
- Recall: the first path node of each XPath query is placed on the CL, and the remaining path nodes are placed on the WL.
- This is inefficient in many situations, because the 1st element usually has poor selectivity.
- The CLs are unbalanced: some are long and some are short (e.g. the CL of element “news” is usually much longer than the CL of element “NBA”).

List Balancing
- List balancing introduces a “pivot” node.
- When a new query is added to the index, the element node of the query whose index entry has the shortest CL is chosen as the pivot and is placed on the CL (instead of the 1st node).
- E.g. when a new subscriber adds /sports/worldcup//news, if the CL of the “worldcup” element is shorter than those of “sports” and “news”, “worldcup” is the pivot and is added to the CL.
- The prefix “sports” then becomes a precondition, which is held on a stack; filtering of the query stops if the precondition for the node fails.

List Balancing
Example: Q3 = /*/sports/news//bulls. Assume the element “news” has the shortest CL among the 3 elements.
Stack: “sports”
[Figure: path node tables of the query before and after list balancing.]
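Pivot selection for the /sports/worldcup//news example can be sketched as below. The CL lengths are made-up numbers, the function name `choose_pivot` is hypothetical, and breaking ties in favour of the earliest node is an assumption of this sketch.

```python
def choose_pivot(names, cl_length):
    """Pick the path node whose element currently has the shortest CL.

    names: element names of the query's path nodes, in query order.
    cl_length: current Candidate List length per element name.
    Ties go to the earliest node (an assumption of this sketch).
    """
    pivot = min(range(len(names)), key=lambda i: cl_length.get(names[i], 0))
    return pivot, names[:pivot]  # pivot index and its prefix (the precondition)

# Hypothetical CL lengths for the /sports/worldcup//news example.
cl_length = {"sports": 120, "worldcup": 3, "news": 450}
pivot, prefix = choose_pivot(["sports", "worldcup", "news"], cl_length)
print(pivot, prefix)  # 1 ['sports']
```

The returned prefix is what would be kept as the precondition to verify once the pivot element is seen in a document.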

List Balancing

Prefiltering
- Prefiltering eliminates from consideration any query that contains an element name not present in the input document, to avoid unnecessary work.
- It is done before the order and filter checking (thus every incoming XML document is parsed twice).

Prefiltering
- A “key” element is chosen for each query when the query is initially parsed; the key is chosen as in List Balancing.
- When a document arrives, a hash table (called the occurrence table) is constructed, containing an entry for each element name found in the document.
- The queries referenced by the table are checked to see whether all of their element names exist in the document; only the successful queries proceed further.

Prefiltering
Assume the key of each query is the element shown in blue:
- Q1: /sports/nba//news/scores
- Q2: /sports/NHL//news
- Q3: /sports/nba/Bulls//news
- Q4: /sports//Bulls/ranking
[Figure: an input sports XML document (its text includes “O’ Neal…” and “Bulls beat Lakers”) and its Occurrence Table, with entries sports, nba (Q1), Lakers, news and Bulls (Q3, Q4).]
Do all elements in the query exist in the document? Only Q3 passes.
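The prefiltering step for this example can be sketched as follows. The function names are hypothetical, and the key of Q2 is assumed to be NHL, since the transcript does not preserve the colour coding for it; the keys of Q1, Q3 and Q4 follow the occurrence table entries.

```python
import re

def element_names(xpath):
    """Element names used in a simple XPath query (wildcards excluded)."""
    return [step for step in re.split(r"/+", xpath) if step and step != "*"]

def prefilter(queries, keys, doc_elements):
    # Occurrence table: element name in the document -> queries keyed on it.
    occurrence = {name: [] for name in doc_elements}
    for qid, key in keys.items():
        if key in occurrence:
            occurrence[key].append(qid)
    survivors = []
    for qids in occurrence.values():
        for qid in qids:
            # Keep the query only if every element it names occurs in the document.
            if all(name in occurrence for name in element_names(queries[qid])):
                survivors.append(qid)
    return sorted(survivors)

queries = {
    "Q1": "/sports/nba//news/scores",
    "Q2": "/sports/NHL//news",
    "Q3": "/sports/nba/Bulls//news",
    "Q4": "/sports//Bulls/ranking",
}
keys = {"Q1": "nba", "Q2": "NHL", "Q3": "Bulls", "Q4": "Bulls"}  # Q2's key is assumed
doc_elements = {"sports", "nba", "Lakers", "news", "Bulls"}
print(prefilter(queries, keys, doc_elements))  # ['Q3']
```

Q2 is never examined because its key does not occur in the document; Q1 and Q4 are examined but dropped because scores and ranking are missing.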

Performance evaluation
Performance is evaluated by varying:
- the number of subscriber profiles
- the depth of subscriber queries and incoming XML documents
- the probability of wildcards
- filter placement and selectivity
List Balancing with Prefiltering has the best performance.

Related Work
- Enhance XFilter by considering not only elements but also attributes.
- Enhance XFilter by reordering the input profiles (the subscribers' XPath queries) when building the index, so as to obtain better-balanced Candidate Lists.
- Refer to “Indexing Attributes and Reordering Profiles for XML Document Filtering and Information Delivery” by Wang Lian, David Cheung and S.M. Yiu, WAIM 2001.

End