Flexible and Efficient XML Search with Complex Full-Text Predicates
  Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research
  Emiran Curtmola - University of California San Diego
  Alin Deutsch - University of California San Diego
SIGMOD, June

Introduction
  Need for complex full-text predicates beyond simple keyword search
  - Library of Congress (LoC)
  - Biomedical data
  - ACM, IEEE publications
  - INEX data collection
  - Wikipedia XML data set
A real XML fragment from LoC
  [Figure: XML tree of a LoC bill document]
  Labels in the figure: HR2739, committee-name, action-desc, bill, congress-info, nbrsponsors, action, legis-session, legis, legis-body, legis-desc
  Text in the figure: Congress on education and workforce, comments to appropriate services. 109th Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson on May 2, 2004 Joe Jefferson introduced the following bill. The bill was reintroduced later and was referred to the committee on education and workforce sponsored by Joe Jefferson House of Representatives Current chamber on workforce and services. Committees on education are headed by Jefferson Jefferson and services …
Query with complex FT predicates
  Return the document fragments (nodes) that
  - contain the keywords “Jefferson” and “education”, and
  - satisfy two predicates: both keywords occur within a window of 10 words, and “Jefferson” is ordered before “education”
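The two predicates of this query can be sketched over tuples of word positions. This is a minimal illustration, not the talk's implementation: the function names are hypothetical, and the window test assumes the common reading that all matched words must fit inside a span of the given number of consecutive words.

```python
from typing import Tuple

# One word position per keyword, in pattern order, e.g. ("Jefferson", "education").
Match = Tuple[int, ...]

def within_window(match: Match, size: int) -> bool:
    """Assumed semantics: every position fits in a span of `size` words."""
    return max(match) - min(match) + 1 <= size

def is_ordered(match: Match) -> bool:
    """Positions must be strictly increasing in pattern order."""
    return all(a < b for a, b in zip(match, match[1:]))

def satisfies(match: Match, size: int = 10) -> bool:
    """Conjunction of the window and ordered predicates from the query."""
    return within_window(match, size) and is_ordered(match)

# Hypothetical position pairs for ("Jefferson", "education").
print(satisfies((40, 45)))  # True: span of 6 words, correct order
print(satisfies((51, 45)))  # False: inside the window but wrong order
print(satisfies((22, 45)))  # False: correct order but span of 24 words
```

Keeping each predicate as a function of a single match tuple is what later allows predicates to be treated as selections.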
Example: LoC document
  [Figure: the LoC XML fragment shown earlier]
  Return document fragments
  - Naive solution: test the query at each node → redundant
  Need for efficient evaluation of full-text predicates
  - use the structural relationships between nodes
  - avoid redundant computation
Existing languages
  Many XML full-text search languages
  - expressive power, semantics, scores [BAS-06]
  - XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema Free XQuery
  Efficient query evaluation is limited to
  - conjunctive keyword search (no predicates)
  - full-text predicates in isolation
  Need for a universal optimization framework
  - guarantee the universality of the solution
Contributions
  Formal semantics for XQFT-class
  - Unified framework
  - Captures a family of tf*idf scoring methods
  Structure-aware algorithms to efficiently evaluate XQFT-class languages
  - XFT full-text algebra
  - Enables new optimizations inspired by relational rewritings
Talk Outline
  - Motivation & Contributions
  - Formalization of XML full-text search
  - Efficient evaluation
  - Experiments
  - Conclusion
Formalization: design goals
  - Capture existing full-text languages
  - Language semantics in terms of keyword patterns
    - pattern matches
    - predicates evaluated through matches
  - Manipulate tuples
    - enable relational query evaluation and rewritings
Formalization: patterns
  Pattern = tuple of simultaneously matching keywords
  Query expression: “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”
  Pattern: (“Jefferson”, “education”)
Formalization: patterns
  The formalization specifies
  - patterns ← conjunction of keywords
  - set of patterns ← disjunction of keywords
  - exclusion patterns ← negation of keywords
    - no matches in the document
Formalization: matches
  [Figure: the LoC XML fragment shown earlier]
  Pattern: (“Jefferson”, “education”)
  Each match pairs a word position of “Jefferson” with a word position of “education”
  Matches: (22, 3), (22, 45), (22, 67), (51, 3), …
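The enumeration of matches on this slide can be read as a cross product of per-keyword position lists. The sketch below assumes that reading. The position lists are illustrative, loosely based on the numbers that appear in the talk's tables, and `enumerate_matches` is a hypothetical helper name.

```python
from itertools import product
from typing import List, Sequence, Tuple

def enumerate_matches(positions_per_keyword: Sequence[List[int]]) -> List[Tuple[int, ...]]:
    """All tuples that take one position from each keyword's list, in pattern order."""
    return list(product(*positions_per_keyword))

# Illustrative word positions inside one node (assumed, not authoritative).
jefferson = [22, 28, 51, 54, 72]
education = [3, 45, 67]

all_matches = enumerate_matches([jefferson, education])
print(all_matches[:3])   # [(22, 3), (22, 45), (22, 67)]
print(len(all_matches))  # 15 = 5 positions x 3 positions
```

The count grows multiplicatively with the number of keywords, which is one reason the size of intermediate results matters for evaluation.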
Formalization: matching tables
  A matching table is a nested relation that records
  - each node in the document
  - each pattern in the query
  - the set of matches of that pattern in that node
Formalization: matching tables
  [Figure: the LoC XML fragment shown earlier]
  Node   | Pattern                  | Matches
  action | “Jefferson”, “education” | (28, 45), (51, 45)
  …      | …                        | …
XFT Algebra
  Similar to relational algebra
  - Manipulate matching tables
  - Leverage relational query evaluation + optimization techniques
  XFT operators
  - construct the matching table R_k for each keyword k: get(k)
  - manipulate matching tables: R_1 or R_2, R_1 and R_2, R_1 minus R_2, σ_times(R), σ_ordered(R), σ_window(R), σ_distance(R)
XFT Algebra
  Query: nodes that contain the keywords “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”
  [Figure: operator tree for the query, including a × node]
  Benefit: equivalent query rewritings
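A toy rendering of a few XFT operators may help make the algebra concrete. Everything below is an assumption for illustration: the matching table is simplified to a dictionary from node id to a set of match tuples, `INDEX` is a hypothetical inverted index, and only `get`, `and` and two selections are sketched (`or`, `minus`, `times` and `distance` are left out).

```python
from itertools import product

# Hypothetical toy index: keyword -> {node id -> word positions inside that node}.
INDEX = {
    "Jefferson": {"1": [22, 28, 51], "1.1": [22], "1.2": [28, 51]},
    "education": {"1": [3, 45], "1.1": [3], "1.2": [45]},
}

def get(keyword):
    """Matching table for one keyword: node -> set of 1-tuples of positions."""
    return {node: {(p,) for p in ps} for node, ps in INDEX[keyword].items()}

def and_(r1, r2):
    """Equi-join on node id; the matches of both sides are concatenated."""
    return {
        node: {m1 + m2 for m1, m2 in product(r1[node], r2[node])}
        for node in r1.keys() & r2.keys()
    }

def select(pred, r):
    """Keep the matches that satisfy pred; drop nodes left without matches."""
    out = {}
    for node, ms in r.items():
        kept = {m for m in ms if pred(m)}
        if kept:
            out[node] = kept
    return out

def ordered(m):
    return all(a < b for a, b in zip(m, m[1:]))

def window(size):
    # Assumed semantics: all positions fit in a span of `size` words.
    return lambda m: max(m) - min(m) + 1 <= size

# A window of 20 is used so that the toy data yields a non-empty answer.
result = select(window(20), select(ordered, and_(get("Jefferson"), get("education"))))
print(sorted(result.items()))  # [('1', {(28, 45)}), ('1.2', {(28, 45)})]
```

Because every operator maps matching tables to matching tables, plans compose freely, which is what makes equivalent rewritings possible.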
Query evaluation: AllNodes
  Straightforward implementation of the XFT algebra
  - Each node is considered separately
  - Each tuple is self-contained
  Relational-style evaluation
  - Joins → equi-joins
  - Predicates → selections on the set of matches
AllNodes example: join (×) of the matching tables for “Jefferson” and “education”
  Matching table for “Jefferson”
  Node | Matches
  1    | 22, 28, 51, 54, …
  …    | 22
  1.2  | 28, …
  …    | 51
  1.3  | 54, …
  …    | 72
  Matching table for “education”
  Node | Matches
  1    | 3, 45, …
  …    | 3
  1.2  | …
  …    | 45
  1.3  | …
  …    | 67
  Result of the join
  Node  | Pattern                  | Matches
  1     | “Jefferson”, “education” | (22, 45), (72, 67), …
  1.1   | “Jefferson”, “education” | (22, 3)
  1.2   | “Jefferson”, “education” | (28, 45), (51, 45)
  1.2.2 | “Jefferson”, “education” | (51, 45)
  …     | “Jefferson”, “education” | (51, 45)
  1.3   | “Jefferson”, “education” | (54, 67), (72, 67)
  1.3.2 | “Jefferson”, “education” | (72, 67)
  Predicate operates one tuple at a time
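The redundancy of the AllNodes representation can be illustrated by building a per-node table from keyword occurrences: each occurrence is recorded at its own node and at every ancestor. This is a sketch under assumptions: nodes carry Dewey-style ids such as 1.2.2, and the occurrences and the name `all_nodes_table` are hypothetical.

```python
from collections import defaultdict

def ancestors_or_self(dewey: str):
    """'1.2.2' -> ['1', '1.2', '1.2.2'] (assumes Dewey-style node ids)."""
    parts = dewey.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def all_nodes_table(occurrences):
    """Record every keyword position at its node and at all of its ancestors."""
    table = defaultdict(list)
    for node, position in occurrences:
        for ancestor in ancestors_or_self(node):
            table[ancestor].append(position)
    return {node: sorted(ps) for node, ps in table.items()}

# Hypothetical occurrences of one keyword: (node id, word position).
occurrences = [("1.1.2", 22), ("1.2.1", 28), ("1.2.2", 51)]
table = all_nodes_table(occurrences)

print(table["1"])    # [22, 28, 51]
print(table["1.2"])  # [28, 51]
print(sum(len(ps) for ps in table.values()))  # 9 stored positions for 3 occurrences
```

With self-contained tuples like these, a join only has to compare equal node ids, at the price of storing each occurrence once per tree level.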
Query evaluation: SCU
  AllNodes = straightforward algorithm
  Reduce the size of intermediate results
  - use structural relationships between nodes
  - avoid redundant match representation
  SCU = Smallest Containing Unit
SCU example: join (×) of the SCU tables for “Jefferson” and “education”
  Matching tables → SCU tables → capture the same information
  SCU table for “Jefferson”
  Node  | Matches
  1.1.3 | …
  …     | 51
  1.2   | …
  …     | 72
  SCU table for “education”
  Node  | Matches
  1.1.1 | …
  …     | 67
  Equi-join does not work: need to compute the LCA
  - 1.1 is the LCA of … and 1.1.1
  Result of the join
  Node  | Pattern                  | Matches
  1.1   | “Jefferson”, “education” | (22, 3)
  …     | “Jefferson”, “education” | (51, 45)
  1.2   | “Jefferson”, “education” | (28, 45)
  1.3.2 | “Jefferson”, “education” | (72, 67)
  1.3   | “Jefferson”, “education” | (54, 67)
  1     | “Jefferson”, “education” | (22, 45), …
  Intermediate tables on the slides
  - one table with rows 1.2: (28, 45); 1.3: (54, 67); 1: (22, 45), …
  - one table marked EMPTY !!!
  Tables during propagation
  - before: 1.3: (54, 67); 1: (22, 45), …
  - after: 1.3: (54, 67), (72, 67); 1: (22, 45), …
  Postorder: a stack supports a single scan
SCU summary
  - Equivalent to AllNodes
  - Structure-awareness reduces the size of intermediate results
  - Increased computation cost
    - compute LCAs of nodes
    - match propagation
  - Stack-based techniques
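The LCA step can be sketched with Dewey-style node ids, where the LCA of two nodes is their longest common prefix of components. The code below is a naive nested-loop illustration and an assumption throughout: it is not the stack-based single-scan algorithm of the talk, it omits match propagation to ancestors, and the tables and the name `scu_join` are hypothetical.

```python
def lca(a: str, b: str) -> str:
    """Lowest common ancestor of two Dewey ids, assuming a single root."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def scu_join(r1, r2):
    """Pair every match of r1 with every match of r2 and attach it to the LCA."""
    out = {}
    for n1, ms1 in r1.items():
        for n2, ms2 in r2.items():
            node = lca(n1, n2)
            out.setdefault(node, set()).update(m1 + m2 for m1 in ms1 for m2 in ms2)
    return out

# Hypothetical SCU tables: each position is stored only at its smallest containing node.
jefferson = {"1.1.3": {(51,)}, "1.2": {(72,)}}
education = {"1.1.1": {(45,)}, "1.2": {(67,)}}

joined = scu_join(jefferson, education)
print(lca("1.1.3", "1.1.1"))  # 1.1
print(sorted(joined["1"]))    # [(51, 67), (72, 45)]
print(joined["1.2"])          # {(72, 67)}
```

Comparing components rather than raw strings keeps ids such as 1.1 and 1.10 apart. A production version would avoid the quadratic loop, for example with a stack over nodes in postorder.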
Related work on LCA for XML
  LCA for conjunctive keyword search
  - XRank [GSBS-03]
  - Schema-free XQuery [LYJ-04]
  - XKSearch [XP-05]
  Shortcomings
  - No postprocessing, not compositional
    - input in document order
    - output in postorder traversal
  - Support for complex predicates is not straightforward
Experimental goals
  AllNodes vs. SCU
  - AllNodes: redundant representation
  - SCU: smaller sizes, more computation
  SCU overhead
  - Stack
  - Match propagation
  Benefit of rewritings
  - Relational-style rewritings
Experimental setup
  - Centrino 1.8GHz with 1GB of RAM
  - XMark-generated datasets
  - Sizes range from 50 MB to 300 MB
Experiments: AllNodes vs. SCU
  Varying document size (q1 - query without predicates)
  q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Experiments: SCU Overhead
  Queries
  q4 = σ window>1(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  q5 = σ window> (“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  Recall that q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Experiments: SCU Overhead
  Varying query predicates (not pushed)
  - q4 is always true → no match propagation, just the stack overhead
  - q5 is always false → all matches are propagated
Experiments: Benefit of Rewritings
  Queries
  q2 = σ orderedE(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  q3 = push selections in q2
  Recall that q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
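One way to picture the difference between q2 and q3 is to compare a plan that joins first and filters afterwards with a plan that applies the ordering test while the join is being built. This is only a sketch of the general idea of pushing a selection below a join: the talk does not spell out its rewriting rules here, and the position lists and function names are hypothetical.

```python
from itertools import product

def ordered(m):
    return all(a < b for a, b in zip(m, m[1:]))

def join_then_select(position_lists):
    """Build the full cross product, then keep the ordered tuples."""
    intermediate = list(product(*position_lists))
    return [m for m in intermediate if ordered(m)], len(intermediate)

def pushed_selection(position_lists):
    """Extend partial matches keyword by keyword, pruning unordered ones early."""
    partial = [(p,) for p in position_lists[0]]
    built = len(partial)
    for positions in position_lists[1:]:
        partial = [m + (p,) for m in partial for p in positions if m[-1] < p]
        built += len(partial)
    return partial, built

# Hypothetical positions of three keywords inside one node.
lists = [[5, 6, 7], [1, 2, 8], [3, 4, 9]]

late, late_size = join_then_select(lists)
early, early_size = pushed_selection(lists)
print(late == early)          # True: both plans return the same matches
print(late_size, early_size)  # 27 9: tuples materialized by each plan
```

Both plans are equivalent by construction, so a rewriting of this kind changes only the amount of intermediate work.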
Experiments: Benefit of Rewritings
  Varying document size (query with predicates)
  - 40% improvement for relational-like query rewritings
Conclusion
  A unified logical framework for XML full-text search languages
  The algebra admits
  - efficient algorithms for operator evaluation
  - rewritings of queries into more efficient forms
  It facilitates joint optimization of XML queries on both structure and text search
  Future work
  - Score-aware logical framework
Thank you!