1 Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University.

1 Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University Dayton, OH-45435, USA

2 Talk Outline Goal (What?) Background and Motivation (Why?) Query Language and Examples (What?) Implementation Details (How?) Evaluation and Applications (Why?) Conclusions

3 Goal

4 Develop a keyword-based XML Query Language and its Semantics that is flexible and sufficiently expressive easy to use (for query formulation) Implement, reusing mature software components, for efficient indexing and search

5 Background and Motivation

6 XML vs Text Documents DATA: Exploit metadata/markup and aggregation structure implicit in XML documents For expressiveness and precision QUERY: Obtain progressively improved extractions using convenient keyword-based queries in contrast with accurate extractions using complex XML-based queries

7 Relationship to Other Work Extends XSEarch (Cohen et al) Expressive power: Incorporates attributes and their values Equivalence (E.g., RDF) : vs s

8 Invariance under Refinement word_1 and word_2 vs word_1 and word_2

9 Coherence Interconnectedness : Infer related pieces of information using aggregation implicit in XML Cohen et al : Name equivalence XSEarch Li et al: Structural equivalence Scheme-free XML Guo et al : Completeness XRANK

10 Information Retrieval Explore robust relevance ranking strategy to deal with high recall Variation on TFIDF Naïve implementation computationally prohibitive Extension beyond “type-delimited-document” unclear

11 Query Language and Examples (What?)

12 Query Syntax Entity-Attribute-Keyword Search Terms e:a:k e:a:, :a:k, e::k e::, a::, ::k Signed/optional Search Terms + e:a:k vs e:a:k

13 Single Search Term Satisfaction e:a:k The search term e:a:k is satisfied by a tree containing a subtree with the top element e that is associated with the attribute a with value containing k, or a subelement a with descendant text node containing k.

14 Example (Mondial) <country id="f0_149" name="Austria" capital="f0_1467" population="8023244" datacode="AU" total_area="83850" population_growth="0.41" infant_mortality="6.2"... government="federal republic"...>... :name:Vienna is satisfied by Vienna 1583000... name::Vienna is satisfied by a part of it Vienna.

15 Example (Heterogeneity) Adam Dingle @inproceedings{IMN97, author="Adam Dingle and Ed MacNair and Thao Nguyen", … author:name:Dingle misses the last one.

16 Query Answer Candidate Query Answer Candidate for the query Q(t_1,t_2,...,t_m), is a Most preferred satisfying collection of trees (P_1,P_2,...,P_m) Precise : smallest enclosing Adequate : optional search terms satisfied as much as possible

17 Query Answer Query Answer for the query Q(t_1,t_2,...,t_m), is a Query Answer Candidate (P_1,P_2,...,P_m) in which Trees P_i’s are Interconnected Specifies trees related to the same “real- world” entity

18 Interconnectedness (Cohen et al) Two subtrees T_a and T_b are said to be interconnected if the path from their roots to the lowest common ancestor does not contain two distinct nodes with the same element, or the only distinct nodes with the same element are these roots.

19 Interconnectedness (Li et al) Two subtrees T_a and T_b are said to be interconnected, if the path from T_a's root to their lowest common ancestor in the tree does not contain another node that is the lowest common ancestor of T_a and a distinct subtree T_b‘, where T_b' has the same root element label as T_b.

20 Interconnectedness (Two Approaches)

21 Implementation Details (How?)

22 Tools Used Apache Lucene 2.0 APIs in Java A high-performance, text search engine library with smart indexing strategies. Further tuned for memory-centric operation in contrast with disk-centric defaults SAXParser APIs

23 Mapping to Lucene XML documents to Lucene documents for indexing XML keyword-based queries to Lucene queries for searching ENCODING XML fragment of an XML document is referred to internally using the filename and the XPath (of the XML fragment's root from the XML document root)

24 Evaluation and Application (Why?)

25 Experiments DATASETs: Sigmod, Mondial, and DBLP. PLATFORMS: For Sigmod and Mondial datasets: HP xw9300 Workstation with 2 GHz AMD Opteron dual-core processor (270), 4 GB of main memory, and 250 GB 7200 rpm hard drive, running 32-bit Windows XP. (java -Xms750M -Xmx1500M). For DBLP dataset: SUN Ultra-40 Workstation with 2.4 GHz dual AMD Opteron dual-core processor (280), 8GB of main memory, and 250GB 7500 rpm hard drive, running 64-bit Solaris 10. (java -Xms1000M -Xmx3600M).

26 Dataset Sizes DATASETSIZE Sigmod468 KB Mondial1743 KB DBLP337 MB

27 Dataset Indexing via Lucene DATASETINDEXING TIME INDEX SIZE Sigmod32 sec6 MB Mondial180 sec16 MB DBLP36 hrs4 GB

28 Query Answer: Computation Time vs Display Time DATASETSIMPLE QUERY COMPLEX QUERY Sigmod35 ms / 1 sec400 ms / 3 min Mondial25 ms / 350 ms 1 sec / 2 min DBLP335 ms / 1 sec ---

29 More Subtle Example In Extended Paper: A Coherent Keyword-Based XML Query Language

30 Pubs.xml - Modern Information Retrieval Ricardo Baeza-Yates Berthier Ribeiro-Neto - Digital Libraries Edward A. Fox Ohm Sornil - The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin Lawrence Page - An Algorithm for Suffix Stripping M.F.Porter - Indexing by Latent Semantic Analysis

31 Characteristics of Pubs.xml Total number of authors = 7 Total number of titles = 5 Title Distribution = 1 book (with 1 chapter) + [3 articles] Author Distribution = 2 ( 2 ) + [ 2 + 1 + 0 ]

32 Queries to Pubs.xml (Answer counts) Arbitrary mix and match of authors and titles = 7 * 5 = 35 author::, title:: (8 hits) +author::, +title:: (7 hits) +author::, +title::, +author:: (4 hits)

33 Completeness ( ::pWord, ::qWord) (Guo et al)

34 Conclusions

35 Developed declarative semantics for keyword-based XML Query language with an effective query answering algorithm Developed a notion of interconnectedness that provides coherent answers Implemented using Lucene 2.0 APIs Indexing: Time and Space Intensive But Query Answering: Quick

36 THANK YOU!

1 Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University.

Similar presentations

Presentation on theme: "1 Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University.

Similar presentations

Presentation on theme: "1 Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University."— Presentation transcript:

Similar presentations

About project

Feedback