XML Prefiltering as a String Matching Problem
Christoph Koch (1), Stefanie Scherzinger (2), Michael Schmidt (3)
(1) Cornell University, (2) IBM Boeblingen, (3) Freiburg University
24th International Conference on Data Engineering, April 9, 2008, Cancun (Mexico)

2 Motivation
XML data is often processed ad hoc, e.g. in streaming scenarios and by main-memory-based processors
Low main memory consumption then becomes the key prerequisite for performance
XML prefiltering is an established technique that aims at decreasing main memory consumption
We present a novel approach to XML prefiltering based on efficient string matching techniques

3 XML Prefiltering
Buffer only data that is relevant to query evaluation
Prefiltering/Projection
- Static analysis of the XQuery/XPath expression
- Identify parts of the input document that are relevant to query evaluation
- Discard parts of the input document that are not relevant to query evaluation
A. Marian and J. Siméon: Projecting XML Documents. In Proc. VLDB’03, pages 213–224, 2003
S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li and A. Maddalena: Accelerating Queries by Pruning XML Documents. TKDE, 54(2):211–240, 2005
V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen: Type-Based XML Projection. In Proc. VLDB’06, 2006

4 XML Prefiltering
XQuery: { /site//australia//description }
Relevant Paths: { /site//australia//description# }
[Figure: XML document tree with the nodes site, regions, africa, asia, australia, item, description and the text "PDA"]

5 XML Prefiltering
Existing Approaches
1. Analysis of the input query, extraction of relevant paths
2. Tokenization of the input document
3. Compilation of an automaton that projects the document token by token
Our Approach
1. Analysis of the input query, extraction of relevant paths
2. Use efficient string matching techniques to locate the relevant parts of the input document (without parsing and tokenizing the document)
Challenge: take string matching algorithms to the second dimension, to navigate in tree-structured data

6 String Matching Techniques
Example: Boyer-Moore Algorithm
Search for the keyword "begin" (length of keyword = 5) in the text "String matching for beginners"
[Figure: the keyword is shifted along the text, character positions 1 to 25, until it matches]
Similar algorithms exist for multi-keyword search (e.g., Commentz-Walter Algorithm)
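
The shifting idea on this slide can be sketched with the Horspool simplification of Boyer-Moore, which uses only a bad-character shift table. This is a swapped-in variant chosen for brevity, not the full Boyer-Moore algorithm and not necessarily what the authors' prototype uses; the function name horspool_search is hypothetical.

```python
def horspool_search(text: str, keyword: str, start: int = 0) -> int:
    """Return the 0-based offset of the first match at or after start, or -1."""
    m, n = len(keyword), len(text)
    if m == 0:
        return start
    # For every keyword character except the last one: distance from its
    # rightmost occurrence to the end of the keyword.
    shift = {ch: m - 1 - i for i, ch in enumerate(keyword[:-1])}
    pos = start
    while pos + m <= n:
        if text[pos:pos + m] == keyword:
            return pos
        # Shift by the table entry of the text character aligned with the
        # last keyword position; unknown characters allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


offset = horspool_search("String matching for beginners", "begin")
print(offset)
```

Because characters that do not occur in the keyword allow a shift by the full keyword length, large parts of the text are never inspected, which is the property the prefiltering approach relies on.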

7 String Matching and XML Prefiltering
String matching techniques were originally designed for search in flat and unstructured text
But: XML is structured, and prefiltering requires us to keep track of the axis relations in the input paths (such as child and descendant relations)
XML schema knowledge (e.g., in the form of DTDs) provides us with structural information that can be exploited for target-oriented search

8 The Runtime Automaton
Fragment of the XMark DTD:
<!DOCTYPE site [ <!ELEMENT item (location,name,payment,description,shipping,incategory+)> ... ]>
We restrict ourselves to non-recursive DTDs, which can be transformed into a finite automaton
The ideas are also applicable in the context of recursive DTDs

9 The Runtime Automaton ( child tags)

10 The Runtime Automaton ( child tags) Search for string “<site”

11 The Runtime Automaton ( child tags) Search for strings “<item” and “</australia” in parallel

12 The Runtime Automaton ( child tags) { /site //australia //description# }

13 The Runtime Automaton { /site //australia //description# }

17 Static Compilation into Lookup Tables
[Figure: runtime automaton with the states s, p0, p1, p2, q0, q1, q2]
Automaton A: transition table over the states s, p0, p1, p2, q0, q1, q2 [entries not recoverable]
Frontier Vocabulary V: one set of tags per state [tag names not recoverable]
Action Table T:
  s        no operation
  p0, p1   copy tag
  p2       copy on
  q0, q1   copy tag
  q2       copy off

18 Static Compilation into Lookup Tables
[Figure: extract from the original runtime automaton next to the extract from the optimized runtime automaton, each with the states s, p0, p1]
Shortest possible XML string between and : s=“ ” with |s|=25
Initially skip 25 characters
Initial Jump Table J:
  p0             25
  q2             43
  other states   0
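
One way to arrive at such a jump length is to add up the shortest serializations of all elements that must occur before the next relevant tag. The sketch below is an illustration under assumptions, not the authors' method: the content models in DTD are simplified and only loosely follow the XMark DTD, the two tags on the slide are assumed to be <site> and <australia>, and the names DTD, min_len and min_gap are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed, simplified content models: element -> ordered (child, required).
# Loosely modeled on the XMark DTD; not taken verbatim from the slides.
DTD: Dict[str, List[Tuple[str, bool]]] = {
    "site": [("regions", True)],
    "regions": [("africa", True), ("asia", True), ("australia", True)],
    "africa": [("item", False)],
    "asia": [("item", False)],
    "australia": [("item", False)],
}


def min_len(tag: str, dtd: Dict[str, List[Tuple[str, bool]]]) -> int:
    """Length of the shortest serialization of an element (non-recursive DTD)."""
    required = [child for child, req in dtd.get(tag, []) if req]
    if not required:
        return len(tag) + 3  # "<tag/>"
    inner = sum(min_len(child, dtd) for child in required)
    return (len(tag) + 2) + inner + (len(tag) + 3)  # "<tag>" ... "</tag>"


def min_gap(path: List[str], dtd: Dict[str, List[Tuple[str, bool]]]) -> int:
    """Minimum number of characters between the opening tag of path[0]
    and the opening tag of path[-1], descending along path."""
    gap = 0
    for parent, child in zip(path, path[1:]):
        for name, required in dtd[parent]:
            if name == child:
                break
            if required:
                gap += min_len(name, dtd)  # mandatory sibling before child
        if child != path[-1]:
            gap += len(child) + 2  # opening tag of an intermediate element
    return gap


jump = min_gap(["site", "regions", "australia"], DTD)
print(jump)
```

Under these assumptions the shortest string is <regions><africa/><asia/>, so the computed gap equals the 25 characters listed for p0 in the jump table.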

19 The Runtime Algorithm
Runtime Core Algorithm:
q := s; // current state
c := 0; // cursor position
while q is not final do begin
  (1) Perform initial jump J[q]
  (2) Perform keyword search for tags V[q] until a tag t is matched (starting from current cursor position c)
  (3) Assign q := A[q, t]
  (4) Perform action T[q]
end
Lean runtime algorithm
Operates on top of the precompiled tables
Uses efficient string-matching techniques to locate keywords (step (2))
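
The loop above can be sketched in Python as follows. This is a minimal illustration, not the SMP implementation: the concrete tag keywords, the transitions in A, the toy document and the function name prefilter are assumptions, and Python's built-in str.find stands in for the multi-keyword string matching of step (2).

```python
from typing import Dict, List, Set, Tuple


def prefilter(doc: str, start: str, finals: Set[str],
              A: Dict[Tuple[str, str], str], V: Dict[str, List[str]],
              T: Dict[str, str], J: Dict[str, int]) -> str:
    out: List[str] = []
    q, c = start, 0          # current state, cursor position
    copy_from = 0
    while q not in finals:
        c += J.get(q, 0)                        # (1) initial jump
        # (2) earliest occurrence of any frontier keyword (naive search;
        #     a keyword that is a prefix of another tag name would also match)
        hits = [(doc.find(k, c), k) for k in V[q]]
        hits = [h for h in hits if h[0] >= 0]
        if not hits:
            break                               # no relevant tag left
        i, t = min(hits)
        end = doc.index(">", i) + 1             # end of the matched tag
        q = A[(q, t)]                           # (3) state transition
        action = T.get(q, "no operation")       # (4) action of the new state
        if action == "copy tag":
            out.append(doc[i:end])
        elif action == "copy on":
            copy_from = i
        elif action == "copy off":
            out.append(doc[copy_from:end])
        c = end
    return "".join(out)


# Assumed tables for { /site //australia //description# }
A = {("s", "<site"): "p0", ("p0", "<australia"): "p1",
     ("p1", "<description"): "p2", ("p1", "</australia"): "q1",
     ("p2", "</description"): "q2", ("q2", "<description"): "p2",
     ("q2", "</australia"): "q1", ("q1", "</site"): "q0"}
V = {"s": ["<site"], "p0": ["<australia"],
     "p1": ["<description", "</australia"], "p2": ["</description"],
     "q2": ["<description", "</australia"], "q1": ["</site"]}
T = {"s": "no operation", "p0": "copy tag", "p1": "copy tag",
     "p2": "copy on", "q2": "copy off", "q1": "copy tag", "q0": "copy tag"}
J = {"p0": 25}  # only the jump for p0 is valid for the toy document below

DOC = ("<site><regions><africa/><asia/><australia><item><name>PDA</name>"
       "<description>Palm Zire 71</description></item></australia>"
       "</regions></site>")

result = prefilter(DOC, "s", {"q0"}, A, V, T, J)
print(result)
```

For the toy document, the projection keeps the site and australia tags and the complete description element. The jump for q2 is left at 0 here because the toy document omits the sibling elements that would justify a longer jump.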

20 A Sample Run
Relevant paths: { /site //australia //description# }
Input (markup partly lost): United States <name>TV Creditcard 15’’ LCD-FlatPanel Within country <incategory category="3"/>
Runtime algorithm: the loop from slide 19
Current state: q = s; Initial Jump: J[q=s] = 0; Frontier Voc.: V[q=s] = { }
Matched tag „ “: A[s, ] = p0; T[q=p0] = copy tag ( )
Current state: q = p0; Initial Jump: J[q=p0] = 25; Frontier Voc.: V[q] = { }
Matched tag „ “: A[p0, ] = p1; T[q=p1] = copy tag ( )
Current state: q = p1; Initial Jump: J[q=p1] = 0; Frontier Voc.: V[q=p1] = { , }

21 A Sample Run
Relevant paths: { /site //australia //description# }
Input (markup partly lost): Egypt PDA Check <description>Palm Zire 71
Runtime algorithm: the loop from slide 19
Current state: q = p1; Initial Jump: J[q=p1] = 0; Frontier Voc.: V[q=p1] = { , }
Matched tag „ “: A[p1, ] = p2; T[q=p2] = copy on
Current state: q = p2; Initial Jump: J[q=p2] = 0; Frontier Voc.: V[q=p2] = { }
Matched tag „ “: A[p2, ] = q2; T[q=q2] = copy off

22 Experiments
Prototype implementation in C++: SMP
Settings
- Core2 Duo IBM ThinkPad Z61p, T2500 2.00GHz CPU with 1GB RAM
- Ubuntu Linux 6.06 LTS
- Data sets: XMark, Medline, Protein Sequence
- Document sizes: 1MB up to 5,000MB
- Queries: XMark queries, user-defined XPath queries
Query Engines
- XQuery: Qizx/open, Saxon
- XPath: SPEX

23 Experimental Results
Projection of a 5,000MB XMark document for selected XMark benchmark queries
Projection Characteristics for Selected XMark Benchmark Queries:
               XM1       XM5       XM10       XM14        XM20
Proj. Size     67.64MB   22.10MB   307.63MB   1357.28MB   38.52MB
Memory         1.64MB    1.68MB    1.96MB     1.64MB      1.67MB
Usr+Sys        31.00s    19.91s    54.94s     53.71s      31.67s
CPU            12.52%    8.05%     13.85%     17.07%      12.92%
Char. Comp.    18.86%    9.87%     22.38%     21.24%      18.67%
Elapsed Time   4min 12s, 4min 55s, 5min 21s, 4min 10s (four values for five queries; column alignment not recoverable)
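
To relate the projection sizes to the input, the share of the document that survives prefiltering can be computed directly from the table. The sketch below assumes the input is exactly 5,000MB; the variable names are hypothetical.

```python
# Projection sizes in MB as listed in the table above.
proj_size_mb = {"XM1": 67.64, "XM5": 22.10, "XM10": 307.63,
                "XM14": 1357.28, "XM20": 38.52}
DOC_SIZE_MB = 5000.0  # assumed exact input size

# Percentage of the input document retained after projection.
retained_pct = {q: size / DOC_SIZE_MB * 100 for q, size in proj_size_mb.items()}

for query, pct in retained_pct.items():
    print(f"{query}: {pct:.2f}% of the input retained")
```

XM14 retains by far the largest share, which fits the query: it inspects the description of every item in the document.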

24 Experimental Results
Throughput comparison:
SMP projection for XMark (average over all queries on the 5,000MB document) vs. bare XML document tokenization performed by the Xerces C++ parser
SMP is faster than all projection systems that rely on a prior tokenization of the input XML document

25 Experimental Results
Success Rates for 18 XMark Queries with and without Projection (QizX XQuery Engine), where TimeFail: >1 hour, MemFail: >1GB
                              Success   TimeFail   MemFail
1000MB without projection     0         0          18
1000MB with projection        18        0          0
5000MB without projection     0         0          18
5000MB with projection        15        2          1
When coupled with projection, in-memory XQuery engines scale up to documents in the gigabyte range

26 Experimental Results
Throughput improvement
- 656MB Medline document
- 5 user-defined XPath queries
- Evaluated with the SPEX XPath engine

27 Summary
Efficient string matching techniques, originally designed for keyword search in flat text, can be used for search and navigation in unparsed XML documents
A novel approach to XML prefiltering on top of these ideas reduces XML prefiltering to a sequence of simple string matching tasks
Extensive experimental evaluation demonstrates
- persistently high throughput and scalability of our XML prefiltering system
- significant improvements for both XQuery and XPath engines when coupled with document prefiltering

Thank You for Your Attention!
Y. Diao et al.: “Path Sharing and Predicate Evaluation for High-Performance XML Filtering” in TODS, 2003.
T. J. Green et al.: “Processing XML streams with deterministic automata and stream indexes” in TODS, 2004.
D. Olteanu: “SPEX: Streamed and Progressive Evaluation of XPath” in TKDE, 2007.
X. Li and G. Agrawal: “Efficient Evaluation of XQuery over Streaming Data” in VLDB, 2005.
A. Marian and J. Simeon: “Projecting XML Documents” in VLDB, 2003.
V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen: “Type-Based XML Projection” in VLDB, 2006.
M. Schmidt, S. Scherzinger, and C. Koch: “Combined Static and Dynamic Analysis for Effective Buffer Minimization in Streaming XQuery Evaluation” in ICDE, 2007.
A. V. Aho: “Algorithms for finding patterns in strings” in Handbook of Theoretical Comp. Sc., Volume A, 1990.
B. W. Watson and G. Zwaan: “A taxonomy of sublinear multiple keyword pattern matching algorithms” in Sci. Comput. Program., 1996.
D. E. Knuth, J. H. Morris (Jr.), and V. R. Pratt: “Fast Pattern Matching in Strings” in SIAM J. Computing, 1977.
R. S. Boyer and J. S. Moore: “A Fast String Searching Algorithm” in Commun. ACM, 1977.
A. V. Aho and M. J. Corasick: “Efficient string matching: An aid to bibliographic search” in Commun. ACM, 1975.
B. Commentz-Walter: “A String Matching Algorithm Fast on the Average” in Proc. ICALP, 1979.
A. Berlea and H. Seidl: “Binary Queries for Document Trees” in Nordic J. of Computing, 2004.
J. Jaakkola and P. Kilpelainen: “Nested text-region algebra”, TR C-1999-2, Univ. of Helsinki, 1999.
M. Takeda et al.: “Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts” in Proc. SPIRE, 2002.
M. Altinel et al.: “Efficient Filtering of XML Documents for Selective Dissemination of Information” in ICDE, 2000.
A. Bruggemann-Klein and D. Wood: “One-Unambiguous Regular Languages” in Inform. and Comp., 1998.
J.-M. Champarnaud: “Subset Construction Complexity for Homogeneous Automata, Position Automata and ZPC-Structures” in Theor. Comput. Sci., 2001.

Additional Resources

The Runtime Automaton In some cases, intermediate states must be kept to keep track of axis relation { /a/b } NOT CORRECT!

The Runtime Automaton In some cases, intermediate states must be kept to keep track of axis relation { /a/b } CORRECT

Medline XPath Queries
M1: /MedlineCitationSet//CollectionTitle
M2: /MedlineCitationSet//DataBank[DataBankName/text()="PDB"]/AccessionNumberList
M3: /MedlineCitationSet//PersonalNameSubjectList/PersonalNameSubject[LastName/text()="Hippocrates" or DatesAssociatedWithName="Oct2006"]/TitleAssociatedWithName
M4: /MedlineCitationSet//CopyrightInformation[contains(text(),"NASA")]
M5: /MedlineCitationSet/MedlineCitation[contains(MedlineJournalInfo//text(),"Sterilization")]/DateCompleted

XMark Queries
XM1:
let $auction := doc("auction.xml") return
for $b in $auction/site/people/person[@id = "person0"]
return $b/name/text()
XM5:
let $auction := doc("auction.xml") return
count(
  for $i in $auction/site/closed_auctions/closed_auction
  where $i/price/text() >= 40
  return $i/price
)

XMark Queries
XM10:
let $auction := doc("auction.xml") return
for $i in distinct-values($auction/site/people/person/profile/interest/@category)
let $p :=
  for $t in $auction/site/people/person
  where $t/profile/interest/@category = $i
  return
    {$t/profile/gender/text()}
    {$t/profile/age/text()}
    {$t/profile/education/text()}
    {fn:data($t/profile/@income)}
    {$t/name/text()}
    {$t/address/street/text()}
    {$t/address/city/text()}
    {$t/address/country/text()}
    {$t/emailaddress/text()}
    {$t/homepage/text()}
    {$t/creditcard/text()}
return { {$i}, $p}

XMark Queries
XM14:
let $auction := doc("auction.xml") return
for $i in $auction/site//item
where contains(string(exactly-one($i/description)),"gold")
return $i/name/text()

XMark Queries
XM20:
let $auction := doc("auction.xml") return
{count($auction/site/people/person/profile[@income >= 100000])}
{count($auction/site/people/person/profile[@income < 100000 and @income >= 30000])}
{count($auction/site/people/person/profile[@income < 30000])}
{count(for $p in $auction/site/people/person where empty($p/profile/@income) return $p)}
