HKU CSIS DB Seminar: Efficient Filtering of XML Documents for Selective Dissemination of Information. Mehmet Altinel, Michael J. Franklin.

Efficient Filtering of XML Documents for Selective Dissemination of Information
Mehmet Altinel, Michael J. Franklin (VLDB 2000)
HKU CSIS DB Seminar. Speaker: Eric Lo

Introduction
The growing volume of data available in electronic form and the proliferation of the Internet have accelerated the development of SDI (Selective Dissemination of Information).
The goal of selective dissemination of information is to avoid sending users/subscribers information they do not need.
SDI applications:
- collect new data in a timely manner, such as stock quotes, traffic news, sports tickers and music
- filter the data against subscriber profiles
- deliver the relevant data to interested subscribers

Introduction
Current SDI systems:
- are based on simple keyword matching and typical IR techniques
- e.g. a subscriber profile with the keyword “NBA” matches every news item that contains the keyword “NBA”
HOWEVER, they still suffer from typical problems:
- Subscribers also receive irrelevant information, such as news with the headline “Bill Gates loves to watch NBA”
- Although current systems pay great attention to improving effectiveness, they overlook EFFICIENCY!

Introduction
One use of XML is as a standard information exchange mechanism.
XML allows structural information to be encoded within documents, which makes it possible to create more focused and accurate profiles of user interests.
“XFilter”, the system presented in this paper, addresses the concerns mentioned above.

XML-based SDI Architecture
Subscribers have a GUI to specify their profiles.
The underlying profile language is XPath, e.g. /sports/nba//news

XFilter Architecture
4 major components:
1. Event-based parser for XML documents
2. XPath parser for user profiles
3. Filter engine, which matches profiles against XML documents
4. Dissemination engine, which delivers the filtered data

Generally, how does the system work?
3 subscribers, one profile each (path nodes in brackets):
Q1: /sports/nba//news [Q1-1] [Q1-2] [Q1-3]
Q2: //nba/*/news [Q2-1] [Q2-2]
Q3: /stocks/quotes/PCCW [Q3-1] [Q3-2] [Q3-3]
Incoming document: New_incoming_document.xml
Query Index:
Element   Candidate List   Wait List
sports    Q1-1             -
nba       Q2-1             Q1-2
news      -                Q1-3, Q2-2
stocks    Q3-1             -
quotes    -                Q3-2
PCCW      -                Q3-3

Filter Engine of XFilter
XFilter converts each XPath query into a Finite State Machine (FSM).
A subscriber's XPath query (profile) MATCHES an XML document WHEN the FSM of the query reaches its final state.
A Query Index is built over the states of the XPath query FSMs.

Inside Filter Engine

Path Nodes
The XPath parser decomposes each XPath query into a set of path nodes.
Element nodes (not attributes) become path nodes and act as the states of the FSM.
Wildcard (*) nodes are ignored.
E.g. /sports/nba//news gives the path nodes: sports, nba, news

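Path Nodes Information
Each path node stores: Query ID, Position, Relative Position, Level
Relative Position:
- 0 for the 1st node if the 1st node is not preceded by “//”
- -1 for any node preceded by “//”
- otherwise 1 + (number of “*” nodes between the node and its predecessor node)
Level:
- if the node is the 1st node and has an absolute distance from the root, level = 1 + distance from the root
- if the Relative Position is -1, the level is also -1
- otherwise 0

Q1 = /sports/nba//news
Path node           Q1-1   Q1-2   Q1-3
Position            1      2      3
Relative Position   0      1      -1
Level               1      0      -1

Q2 = //nba/*/news/Bulls
Path node           Q2-1   Q2-2   Q2-3
Position            1      2      3
Relative Position   -1     2      1
Level               -1     0      0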
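The rules on this slide can be sketched in Python. This is a minimal illustration, not XFilter's own code: the names `PathNode` and `decompose` are hypothetical, and only plain element steps joined by "/", "//" and "*" are handled (no attributes or predicates).

```python
import re
from dataclasses import dataclass


@dataclass
class PathNode:
    query_id: str
    element: str
    position: int      # 1-based position among the query's element nodes
    relative_pos: int  # 0, -1, or 1 + number of wildcards skipped
    level: int         # expected document level, -1 = unrestricted, 0 = not yet known


def decompose(query_id, xpath):
    """Split a simple XPath into path nodes (hypothetical helper)."""
    nodes = []
    wildcards = 0       # "*" steps seen since the last element node
    descendant = False  # a "//" seen since the last element node
    # "//nba/*/news" -> [("//", "nba"), ("/", "*"), ("/", "news")]
    for sep, name in re.findall(r"(//|/)([^/]+)", xpath):
        if sep == "//":
            descendant = True
        if name == "*":
            wildcards += 1  # wildcards are not stored as path nodes
            continue
        if descendant:
            rel, level = -1, -1
        elif not nodes:
            # first node at an absolute distance from the root
            rel, level = 0, 1 + wildcards
        else:
            rel, level = 1 + wildcards, 0
        nodes.append(PathNode(query_id, name, len(nodes) + 1, rel, level))
        wildcards, descendant = 0, False
    return nodes


for node in decompose("Q2", "//nba/*/news/Bulls"):
    print(node)
```

Running it on Q1 and Q2 reproduces the Relative Position and Level values listed for those two queries.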

Query Index
All path nodes are added to the Query Index (a hash table keyed on element name).
Each unique element name is associated with two lists: a Candidate List (CL) and a Wait List (WL).
The current node of each query is placed in the CL; the others are in the WL.
The FSM moves to its next state when a path node is promoted from the WL to the CL.
Element   Candidate List   Wait List
sports    Q1-1             -
nba       Q2-1             Q1-2
news      -                Q1-3, Q2-2
stocks    Q3-1             -
quotes    -                Q3-2
PCCW      -                Q3-3
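A rough sketch of how such an index could be built, assuming a plain Python dictionary as the hash table. The function names `path_elements` and `build_index` are made up for this illustration, and path nodes are represented only by their labels (e.g. "Q1-2").

```python
from collections import defaultdict


def path_elements(xpath):
    """Element names of a simple XPath, with wildcards dropped."""
    return [step for step in xpath.split("/") if step and step != "*"]


def build_index(queries):
    """Map each element name to its Candidate List (CL) and Wait List (WL)."""
    index = defaultdict(lambda: {"CL": [], "WL": []})
    for qid, xpath in queries.items():
        for pos, name in enumerate(path_elements(xpath), start=1):
            # Initially the first node is the current state, so it goes on
            # the CL; every later node waits on the WL.
            which = "CL" if pos == 1 else "WL"
            index[name][which].append(f"{qid}-{pos}")
    return index


queries = {
    "Q1": "/sports/nba//news",
    "Q2": "//nba/*/news",
    "Q3": "/stocks/quotes/PCCW",
}
index = build_index(queries)
for name, lists in index.items():
    print(name, lists)
```

For the three example queries this yields the same distribution of path nodes over the CLs and WLs as the index shown on the slide.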

XML Parsing and Filtering
When an XML document arrives, it is run through the SAX XML parser (event-driven), and the Query Index is consulted whenever the parser encounters:
- a begin element tag
- an end element tag
- data internal to an element
The SAX API turns the input XML (containing the text “Michael Jordan” …) into events:
Start document
Start element: sports
Start element: news
Start element: ball games
Start element: nba
Characters: Michael Jordan
End element: nba
…
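To make the event stream concrete, here is a small sketch using Python's standard `xml.sax` module. The class `EventLogger`, the function `sax_events` and the sample document are assumptions made for this illustration. The handler also tracks the element depth, since the filter needs the level of each element.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records SAX events together with the current element depth."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0

    def startElement(self, name, attrs):
        self.depth += 1
        self.events.append(("start", name, self.depth))

    def endElement(self, name):
        self.events.append(("end", name, self.depth))
        self.depth -= 1

    def characters(self, content):
        # Skip whitespace-only text between tags.
        if content.strip():
            self.events.append(("chars", content.strip(), self.depth))


def sax_events(xml_text):
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


# Hypothetical sample document, not the one from the slide.
for event in sax_events("<sports><nba>Michael Jordan</nba></sports>"):
    print(event)
```

Each callback corresponds to one of the three situations listed above in which the Query Index is consulted.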

XML Parsing and Filtering (cont)
Start_Element_Handler (element name, element level, attribute names, attribute values) {
    Look up the element name in the Query Index, examine all nodes in its CL,
    and perform a LEVEL CHECK and an ATTRIBUTE FILTER CHECK on each
}
Example CL entry: path node Q1-1 (Query ID Q1, Position 1, Relative Position 0, Level 1)

Level Check and Attribute Check
The level check ensures that the level at which the element appears in the document matches the level expected by the user query.
Recall:
- if the level of a path node is -1, its relative position is -1, i.e. a “//” precedes this node, so its level is unrestricted
- otherwise the level of the path node must equal the level of the input element
The attribute filter check applies any simple predicates that reference the attributes of the element.
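The two checks can be written as small predicates. This is a sketch only: the function names are invented here, and the attribute filter is assumed to be a set of simple equality predicates held in a dictionary, which is one possible reading of "simple predicates".

```python
def level_check(node_level, element_level):
    """A level of -1 means unrestricted; otherwise the levels must be equal."""
    return node_level == -1 or node_level == element_level


def attribute_check(required, attrs):
    """Assumed predicate form: every required attribute equals the given value."""
    return all(attrs.get(name) == value for name, value in required.items())


def node_passes(node_level, required, element_level, attrs):
    return level_check(node_level, element_level) and attribute_check(required, attrs)


# Path node Q1-1 (level 1) against a <sports> element at document level 1.
print(node_passes(1, {}, 1, {}))
# A node reached through "//" (level -1) passes at any depth.
print(node_passes(-1, {}, 7, {}))
# Hypothetical filter [@team="Bulls"] against an element with team="Lakers".
print(node_passes(-1, {"team": "Bulls"}, 3, {"team": "Lakers"}))
```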

Level Check and Attribute Check
If both the level check and the attribute check succeed, the node passes.
- If the node is the final path node (final state) of the query (e.g. Q1-3), the document matches the query.
- If the node is not the final path node, the query is moved to its next state.
The state move is done by copying the next node of the query from the WL to the CL and updating its relative position and level accordingly.

End element handler and character handler
When the SAX parser encounters an end element tag, the path node of that element is deleted from the CL.
When the SAX parser encounters element data, it works like the start element handler, except that it performs a content check rather than an attribute check.
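Putting the handlers together, a stripped-down filter engine might look like the sketch below. Several things are assumptions rather than details from the slides: the class and function names, the rule that a promoted node's level becomes the current level plus its relative position (or stays -1 after "//"), and the end-element handler removing exactly the nodes that the matching start element promoted. Attribute and content checks are omitted.

```python
import re
import xml.sax
from collections import defaultdict


def decompose(xpath):
    """[(element, relative_pos, level), ...] for a simple XPath."""
    nodes, wildcards, descendant = [], 0, False
    for sep, name in re.findall(r"(//|/)([^/]+)", xpath):
        descendant = descendant or sep == "//"
        if name == "*":
            wildcards += 1
            continue
        if descendant:
            nodes.append((name, -1, -1))
        elif not nodes:
            nodes.append((name, 0, 1 + wildcards))
        else:
            nodes.append((name, 1 + wildcards, 0))
        wildcards, descendant = 0, False
    return nodes


class FilterEngine(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.nodes = {qid: decompose(xp) for qid, xp in queries.items()}
        # Candidate Lists: element name -> [(query id, node index, level)]
        self.cl = defaultdict(list)
        for qid, nodes in self.nodes.items():
            name, _, level = nodes[0]
            self.cl[name].append((qid, 0, level))
        self.depth = 0
        self.matched = set()
        self.undo = []  # one list of promoted entries per open element

    def startElement(self, name, attrs):
        self.depth += 1
        promoted = []
        for qid, pos, level in list(self.cl[name]):
            if level != -1 and level != self.depth:  # level check
                continue
            if pos == len(self.nodes[qid]) - 1:      # final state reached
                self.matched.add(qid)
                continue
            # Promote the next path node to its CL with an updated level.
            nxt_name, rel, _ = self.nodes[qid][pos + 1]
            nxt_level = -1 if rel == -1 else self.depth + rel
            entry = (qid, pos + 1, nxt_level)
            self.cl[nxt_name].append(entry)
            promoted.append((nxt_name, entry))
        self.undo.append(promoted)

    def endElement(self, name):
        # Restore the CLs to their state before this element was opened.
        for nxt_name, entry in self.undo.pop():
            self.cl[nxt_name].remove(entry)
        self.depth -= 1


def xfilter(queries, xml_text):
    engine = FilterEngine(queries)
    xml.sax.parseString(xml_text.encode("utf-8"), engine)
    return engine.matched


queries = {"Q1": "/sports/nba//news", "Q2": "//nba/*/news",
           "Q3": "/stocks/quotes/PCCW"}
print(xfilter(queries, "<sports><nba><team><news>x</news></team></nba></sports>"))
```

In the sample document, news sits two levels below nba, so both Q1 (any descendant of /sports/nba) and Q2 (exactly one wildcard step between nba and news) reach their final state, while Q3 never leaves its first state.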

List Balancing
Recall: the first path node of each XPath query is placed on the CL and the remaining path nodes are placed on the WL.
This is inefficient in many situations, because the 1st element usually has poor selectivity.
Some CLs are long and some are short, so the lists are not balanced (e.g. the CL of the element “news” is usually much longer than the CL of the element “NBA”).

List Balancing
List balancing introduces a “pivot” node.
When a new query is added to the index, the element node of the query whose index entry has the shortest CL is chosen as the pivot and placed on the CL (instead of the 1st node).
E.g. when a new subscriber adds /sports/worldcup//news, if the CL of “worldcup” is shorter than those of “sports” and “news”, “worldcup” becomes the pivot and is added to the CL.
The prefix “sports” then becomes a precondition, which is held on a stack; filtering of the query stops if the precondition for the node fails.
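Pivot selection can be sketched as follows. The function `add_query`, the dictionary of CL lengths and the numbers in it are hypothetical; the sketch tracks element names only, so wildcard steps in the prefix and the precondition check at filter time are left out.

```python
def path_elements(xpath):
    """Element names of a simple XPath, with wildcards dropped."""
    return [step for step in xpath.split("/") if step and step != "*"]


def add_query(qid, xpath, cl_lengths):
    """Pick the element with the shortest CL as pivot and update the lengths."""
    elements = path_elements(xpath)
    # min() returns the earliest index on ties, so an unbalanced index
    # degrades to the default "first node on the CL" behaviour.
    pivot = min(range(len(elements)),
                key=lambda i: cl_lengths.get(elements[i], 0))
    name = elements[pivot]
    cl_lengths[name] = cl_lengths.get(name, 0) + 1
    return {
        "query": qid,
        "pivot": name,
        "prefix": elements[:pivot],    # precondition, checked via a stack
        "rest": elements[pivot + 1:],  # stays on the Wait Lists
    }


# Made-up CL lengths for illustration only.
cl_lengths = {"sports": 120, "worldcup": 3, "news": 450}
print(add_query("Q9", "/sports/worldcup//news", cl_lengths))
```

With these lengths "worldcup" is chosen as the pivot and "sports" becomes the prefix, as in the example on the slide.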

List Balancing
Q3 = /*/sports/news//bulls
Assume the element “news” has the shortest CL among the 3 elements.
[Figure: path nodes of Q3 before list balancing (3 nodes) and after list balancing (2 nodes)]
Stack: “sports”


Prefiltering
Prefiltering eliminates from consideration any query that contains an element name not present in the input document, to avoid unnecessary work.
It is done before the order and filter checking (thus every incoming XML document is parsed twice).

Prefiltering
A “key” element is chosen for each query when it is initially parsed.
The key is chosen in the same way as the pivot in List Balancing.
When a document arrives, a hash table (called the occurrence table) is constructed, containing an entry for each element name in the document.
The queries referenced by the table are checked to see whether all of their element names exist in the document; only the successful queries go further.

Prefiltering
Keys (shown in blue on the slide): nba for Q1, Bulls for Q3 and Q4, as listed in the occurrence table.
Q1: /sports/nba//news/scores
Q2: /sports/NHL//news
Q3: /sports/nba/Bulls//news
Q4: /sports//Bulls/ranking
Input document Sports18012002.xml, with the text: O’ Neal… Bulls beat Lakers
Occurrence Table:
Element   Queries
sports    -
nba       Q1
Lakers    -
news      -
Bulls     Q3, Q4
Do all elements of the query exist in the document? Only for Q3.
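The example can be replayed with a short sketch. The function `prefilter` and its data layout are assumptions for illustration; the key of Q2 is taken to be NHL, which is a guess based on Q2 not appearing in the occurrence table.

```python
def path_elements(xpath):
    """Element names of a simple XPath, with wildcards dropped."""
    return [step for step in xpath.split("/") if step and step != "*"]


def prefilter(queries, keys, doc_elements):
    """Return the occurrence table and the queries that survive prefiltering."""
    # One entry per element name in the document.
    occurrence = {name: [] for name in doc_elements}
    # Attach each query to the entry of its key element, if present.
    for qid, key in keys.items():
        if key in occurrence:
            occurrence[key].append(qid)
    # Only queries referenced by the table are examined further.
    survivors = [
        qid
        for qids in occurrence.values()
        for qid in qids
        if all(name in occurrence for name in path_elements(queries[qid]))
    ]
    return occurrence, survivors


queries = {
    "Q1": "/sports/nba//news/scores",
    "Q2": "/sports/NHL//news",
    "Q3": "/sports/nba/Bulls//news",
    "Q4": "/sports//Bulls/ranking",
}
# Q2's key is an assumption; the other keys follow the occurrence table.
keys = {"Q1": "nba", "Q2": "NHL", "Q3": "Bulls", "Q4": "Bulls"}
doc_elements = ["sports", "nba", "Lakers", "news", "Bulls"]

occurrence, survivors = prefilter(queries, keys, doc_elements)
print(occurrence)
print(survivors)
```

Q1 is dropped because "scores" is missing from the document and Q4 because "ranking" is missing, so only Q3 is passed on to the filter engine.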

Performance evaluation
Performance is evaluated by varying:
- the number of subscriber profiles
- the depth of subscriber queries and incoming XML documents
- the probability of wildcards
- filter placement and selectivity
List Balancing with Prefiltering has the best performance.

Related Work
- Enhance XFilter by considering not only elements but also attributes
- Enhance XFilter by reordering the input profiles (XPath queries of subscribers) when building the index, so as to obtain better-balanced Candidate Lists
Refer to “Indexing Attributes and Reordering Profiles for XML Document Filtering and Information Delivery” by Wang Lian, David Cheung and S.M. Yiu, WAIM 2001
