1 The Role of Document Structure in Querying, Scoring and Evaluating XML Full-Text Search Sihem Amer-Yahia AT&T Labs Research - USA Database Department.

Slides:



Advertisements
Similar presentations
XML Retrieval: from modelling to evaluation Mounia Lalmas Queen Mary University of London qmir.dcs.qmul.ac.uk.
Advertisements

Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology,
Chapter 5: Introduction to Information Retrieval
Jianxin Li, Chengfei Liu, Rui Zhou Swinburne University of Technology, Australia Wei Wang University of New South Wales, Australia Top-k Keyword Search.
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
Efficient Keyword Search for Smallest LCAs in XML Database Yu Xu Department of Computer Science & Engineering University of California, San Diego Yannis.
Modern Information Retrieval Chapter 1: Introduction
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
1 Abdeslame ALILAOUAR, Florence SEDES Fuzzy Querying of XML Documents The minimum spanning tree IRIT - CNRS IRIT : IRIT : Research Institute for Computer.
XML R ETRIEVAL Tarık Teksen Tutal I NFORMATION R ETRIEVAL XML (Extensible Markup Language) XQuery Text Centric vs Data Centric.
Information Retrieval in Practice
ISP 433/533 Week 2 IR Models.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Information Retrieval and Databases: Synergies and Syntheses IDM Workshop Panel 15 Sep 2003 Jayavel Shanmugasundaram Cornell University.
DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT MAYURI UMRANIKAR.
1 Configurable Indexing and Ranking for XML Information Retrieval Shaorong Liu, Qinghua Zou and Wesley W. Chu UCLA Computer Science Department {sliu, zou,
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
TOSS: An Extension of TAX with Ontologies and Similarity Queries Edward Hung, Yu Deng, V.S. Subrahmanian Department of Computer Science University of Maryland,
Flexible and Efficient XML Search with Complex Full-Text Predicates Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research Emiran Curtmola - University.
CAREER: Towards Unifying Database Systems and Information Retrieval Systems NSF IDM Workshop 10 Oct 2004 Jayavel Shanmugasundaram Cornell University.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
1 - Fuhr: Information Retrieval Methods for XML Documents XIRQL: Eine Anfragesprache für Information Retrieval in XML- Dokumenten Norbert Fuhr Universität.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Information Retrieval in Practice
LOGO XML Keyword Search Refinement 郭青松. Outline  Introduction  Query Refinement in Traditional IR  XML Keyword Query Refinement  My work.
2 September 2005VLDB Tutorial on XML Full-Text Search XML Full-Text Search: Challenges and Opportunities Jayavel Shanmugasundaram Cornell University Sihem.
1 IDAR 2007 Emiran Curtmola A Platform for Efficient Full-Text SEARCH on the Web.
Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
1 Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang.
Querying Structured Text in an XML Database By Xuemei Luo.
ISP 433/533 Week 11 XML Retrieval. Structured Information Traditional IR –Unit of information: terms and documents –No structure Need more granularity.
Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Gökay Burak AKKUŞ Ece AKSU XRANK XRANK: Ranked Keyword Search over XML Documents Ece AKSU Gökay Burak AKKUŞ.
Personalizing XML Text Search in Piment Sihem Amer-Yahia AT&T Labs Research - USA Irini Fundulaki Bell Labs - USA Prateek Jain IIT-Kanpur - India Laks.
BNCOD07Indexing & Searching XML Documents based on Content and Structure Synopses1 Indexing and Searching XML Documents based on Content and Structure.
2 September 2005VLDB Tutorial on XML Full-Text Search XML Full-Text Search: Challenges and Opportunities Jayavel Shanmugasundaram Cornell University Sihem.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Information Retrieval CSE 8337 Spring 2007 Introduction/Overview Some Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Information Retrieval
Adaptive Processing of Top-k Queries in XML Amelie Marian, Sihem Amer-Yahia Nick Koudas, Divesh Srivastava Proceedings of the 21st International Conference.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
XML Information Retreival Hui Fang Department of Computer Science University of Illinois at Urbana-Champaign Some slides are borrowed from Nobert Fuhr’s.
Date: 2013/4/1 Author: Jaime I. Lopez-Veyna, Victor J. Sosa-Sosa, Ivan Lopez-Arevalo Source: KEYS’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang KESOSD.
Holistic Twig Joins Optimal XML Pattern Matching Nicolas Bruno Columbia University Nick Koudas Divesh Srivastava AT&T Labs-Research SIGMOD 2002.
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.
Databases and Information Retrieval: Rethinking the Great Divide SIGMOD Panel 14 Jun 2005 Jayavel Shanmugasundaram Cornell University.
Text Search over XML Documents Jayavel Shanmugasundaram Cornell University.
1 Efficient Processing of Partially Specified Twig Queries Junfeng Zhou Renmin University of China.
Querying Structured Text in an XML Database Shurug Al-Khalifa Cong Yu H. V. Jagadish (University of Michigan) Presented by Vedat Güray AFŞAR & Esra KIRBAŞ.
Structured-Value Ranking in Update- Intensive Relational Databases Jayavel Shanmugasundaram Cornell University (Joint work with: Lin Guo, Kevin Beyer,
1 Keyword Search over XML. 2 Inexact Querying Until now, our queries have been complex patterns, represented by trees or graphs Such query languages are.
1 Keyword Search over XML. 2 Inexact Querying Until now, our queries have been complex patterns, represented by trees or graphs Such query languages are.
Structure and Content Scoring for XML
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introduction to XML IR — Scoring and Ranking XML Group.
MCN: A New Semantics Towards Effective XML Keyword Search
Structure and Content Scoring for XML
Information Retrieval and Web Design
Relax and Adapt: Computing Top-k Matches to XPath Queries
Introduction to XML IR XML Group.
CoXML: A Cooperative XML Query Answering System
Presentation transcript:

1 The Role of Document Structure in Querying, Scoring and Evaluating XML Full-Text Search Sihem Amer-Yahia AT&T Labs Research - USA Database Department Talk at the Universities of Toronto and Waterloo Nov. 9 th and 10 th, 2005

2 Outline Introduction Querying Scoring Evaluation Open Issues

3 Outline Introduction  IR vs. Structured Document Retrieval (SDR)  XML vs. IR Search Querying Scoring Evaluation Open Issues

4 IR vs SDR Traditional IR is about finding relevant documents to a user’s information need, e.g., entire book. SDR allows users to retrieve document components that are more focussed on their information needs, e.g., a chapter, a page. Improve precision Exploit visual memory

5 DocumentsQuery Document representation Retrieval results Query representation IndexingFormulation Retrieval function Relevance feedback Conceptual Model for IR (Van Rijsbergen 1979)

6 Conceptual Model for SDR Structured documents Content + structure Inverted file + structure index tf, idf, … Matching content + structure Presentation of related components DocumentsQuery Document representation Retrieval results Query representation IndexingFormulation Retrieval function Relevance feedback

7 Conceptual Model for SDR (XML) Structured documentsContent + structure Inverted file + structure index tf, idf, … Presentation of related components Scoring may capture document structure query languages referring to both content and structure are being developed for accessing XML documents, e.g. XIRQL, NEXI, XQUERY FT XML adopted to represent a mix of structure and text (e.g., Library of Congress bills, IEEE INEX data collection) structure index captures in which document component the term occurs (e.g. title, section), as well as the type of document components (e.g. XML tags) additional constraints are imposed on structure e.g. a chapter and its sections may be retrieved Matching content + structure

8

9 XML Document Example 109th CONGRESS 1st Session H. R IN THE HOUSE OF REPRESENTATIVES May 26, 2005 Mr. Tierney (for himself, Ms. McCollum of Minnesota, Mr. George Miller of California ) introduced the following bill; which was referred to the Committee on Education and the Workforce …

10 THOMAS: Library of Congress

11 Outline Introduction Querying  search context: XML nodes vs entire document.  search result: XML nodes or newly constructed answers vs entire document.  search expression: keyword search, Boolean operators, proximity distance, scoping, thesaurus, stop words, stemming.  document structure: explicitly specified in query or used in query semantics. Scoring Evaluation Open Issues

12 Languages for XML Search Keyword search (CO Queries)  “xml” Tag + Keyword search  book: xml Path Expression + Keyword search (CAS Queries)  /book[./title about “xml db”] XQuery + Complex full-text search  for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5

13 XRank XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language … Searching on structured text is becoming more important with XML … The XQL language … … … (Guo et al, SIGMOD 2003)

14 XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language … Searching on structured text is becoming more important with XML … The XQL language … … … XRank

15 XIRQL XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language … Searching on structured text is becoming more important with XML … The XQL language … … (Fuhr & Großjohann, SIGIR 2001) index nodes

16 Similar Notion of Results Nearest Concept Queries  (Schmidt et al, ICDE 2002) XKSearch  (Xu & Papakonstantinou, SIGMOD 2005)

17 Languages for XML Search Keyword search (CO Queries)  “xml” Tag + Keyword search  book: xml Path Expression + Keyword search (CAS Queries)  /book[./title about “xml db”] XQuery + Complex full-text search  for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5

18 XSearch XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language … Searching on structured text is becoming more important with XML … … XML Indexing … Not a “meaningful” result (Cohen et al, VLDB 2003)

19 Languages for XML Search Keyword search (CO Queries)  “xml” Tag + Keyword search  book: xml Path Expression + Keyword search (CAS Queries)  /book[./title about “xml db”] XQuery + Complex full-text search  for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5

20 XPath 2.0 fn:contains($e, string) returns true iff $e contains string //section[fn:contains(./title, “XML Indexing”)] (W3C 2005)

21 XIRQL Weighted extension to XQL (precursor to XPath) //section[0.6 ·.//* $cw$ “XQL” ·.//section $cw$ “syntax”] (Fuhr & Großjohann, SIGIR 2001)

22 XXL Introduces a similarity operator ~ Select Z From Where zoos.#.zoo As Z and Z.animals.(animal)?.specimen as A and A.species ~ “lion” and A.birthplace.#.country as B and A.region ~ B.content (Theobald & Weikum, EDBT 2002)

23 NEXI Narrowed Extended XPath I INEX Content-and-Structure (CAS) Queries Specifically targeted for content-oriented XML search (i.e. “aboutness”) //article[about(.//title, apple) and about(.//sec, computer)] (Trotman & Sigurbjornsson, INEX 2004)

24 Languages for XML Search Keyword search (CO Queries)  “xml” Tag + Keyword search  book: xml Path Expression + Keyword search (CAS Queries)  /book[./title about “xml db”] XQuery + Complex full-text search  for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5

25 Schema-Free XQuery Meaningful least common ancestor (mlcas) for $a in doc(“bib.xml”)//author $b in doc(“bib.xml”)//title $c in doc(“bib.xml”)//year where $a/text() = “Mary” and exists mlcas($a,$b,$c) return {$b,$c} (Li, Yu, Jagadish, VLDB 2003)

26 TeXQuery and XQuery FT Fully composable FT primitives. Composable with XPath/XQuery. Based on a formal model. Scoring and ranking on all predicates. (Amer-Yahia, Botev, Shanmugasundaram, WWW 2004) ( W3C 2005) TeXQuery (Cornell U., AT&T Labs) IBM, Microsoft, Oracle proposals XQuery Full-Text Drafts

27 FTSelections and FTMatchoptions FTWord | FTAnd | FTOr | FTNot | FTMildNot | FTOrder | FTWindow | FTDistance | FTScope | FTTimes | FTSelection (FTMatchOptions)*  books//title [. ftcontains “usability” case sensitive with thesaurus “synonyms” ]  books//abstract [. ftcontains (“usability” || “web-testing”) ]  books//content ftcontains (“usability” && “software”) window at most 3 ordered with stopwords  books//abstract [. ftcontains ((“Utilisation” language “French” with stemming && “.?site” with wildcards) same sentence]  books//title ftcontains “usability” occurs 4 times && “web-testing” with special characters  books//book/section [. ftcontains books/book/title ]/title

28 FTScore Clause FOR $v [SCORE $s]? IN [FUZZY] Expr LET … WHERE … ORDER BY … RETURN Example FOR $b SCORE $s in FUZZY /pub/book[. ftcontains “Usability” && “testing” and./price < 10.00] ORDER BY $s RETURN $b In any order

29 Galax XQuery Engine Full-Text Primitives (FTWord, FTWindow, FTTimes etc.) evaluation Text Text Text Text.xml GalaTex Parser Equivalent XQuery Query Equivalent XQuery Query XQFT Query XQFT Query Preprocessing & Inverted Lists Generation Text </xml Text </xml inverted lists.xml 4 GalaTex Architecture positions API (

30 Outline Introduction Querying Scoring Evaluation Open Issues

31 Scoring Keyword queries and Tag + Keyword queries  initial term weights per element.  elements with same tag may have same score.  score propagation along document structure.  overlapping elements. Path Expression + Keyword queries  initial term weights based on paths. XQuery + Complex full-text queries  compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

32 Article ?XML,?search,?retrieval 0.9 XML 0.5 XML 0.2 XML 0.4 search 0.7 retrieval Term Weights TitleSection 1Section 2  how to obtain document and collection statistics (e.g., tf, idf)  how to estimate element scores (frequency, user studies, size)?  which components contribute best to content of Article?  do we need edge weights (e.g., size, number of children)?  is element size an issue?

33 Score Propagation (XXL) Title Section 1Section 2  Compute similar terms with relevance score r1 using an ontology (weighted distance in the ontology graph).  Compute TFIDF of each term for a given element content with relevance score r2.  Relevance of an element content for a term is r1*r2.  Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score. Article ?XML,?search, ?retrieval 0.9 XML 0.5 XML 0.2 XML 0.4 search 0.7 retrieval (Theobald & Weikum, EDBT 2002)

34 Overlapping elements Title Section 1Section 2  Section 1 and article are both relevant to “XML retrieval”  which one to return so that to reduce overlap?  Should the decision be based on user studies, size, types, etc? Article ?XML,?search, ?retrieval 0.9 XML 0.5 XML 0.2 XML 0.4 search 0.7 retrieval

35 Controlling Overlap Start with a component ranking, elements are re-ranked to control overlap. Retrieval status values (RSV) of those components containing or contained within higher ranking components are iteratively adjusted. 1.Select the highest ranking component. 2.Adjust the RSV of the other components. 3.Repeat steps 1 and 2 until the top m components have been selected. (Clarke, SIGIR 2005)

36 ElemRank w : Hyperlink edge d1/3 d1: Probability of following hyperlink 1-d1-d2-d3: Probability of random jump : Containment edge d2/2 d2: Probability of visiting a subelement d3 d3: Probability of visiting parent (Guo et al, SIGMOD 2003)

37 Scoring Keyword queries  compute possibly different scores. Tag + Keyword queries  compute scores based on tags and keywords. Path Expression + Keyword queries  compute scores based on paths and keywords. XQuery + Complex full-text queries  compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

38 Vector–based Scoring (JuruXML) Transform query into (term,path) conditions: article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)] (term,path)-pairs: hypercube, article/bm/bib/bibl/bb mesh, article/bm/bib/bibl/bb torus, article/bm/bib/bibl/bb nonnumerical, article/bm/bib/bibl/bb database, article/bm/bib/bibl/bb Modified cosine similarity as retrieval function for vague matching of path conditions. (Mass et al, INEX 2002)

39 JuruXML Vague Path Matching Modified vector-based cosine similarity Example of length normalization: cr (article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5

40 XML Query Relaxation Tree pattern relaxations:  Leaf node deletion  Edge generalization  Subtree promotion book edition paperback info author Dickens book edition paperback infoauthor Dickens book info author C. Dickens book edition (paperback) info author Charles Dickens edition? Query Data (Schlieder, EDBT 2002)(Delobel & Rousset, 2002) (Amer-Yahia, Lakshmanan, Pandit, SIGMOD 2004)

41 A Family of Scoring Methods Twig scoring  High quality  Expensive computation Path scoring Binary scoring  Low quality  Fast computation book edition (paperback) info author (Dickens) Query book edition (paperback) info author (Dickens) book edition (paperback) author (Dickens) book info + edition (paperback) author (Dickens) book info ++book (Amer-Yahia, Koudas, Marian, Srivastava, Toman, VLDB 2005)

42 Scoring Keyword queries  compute possibly different scores. Tag + Keyword queries  compute scores based on tags and keywords. Path Expression + Keyword queries  compute scores based on paths and keywords.  Evaluate effectiveness of scoring methods. XQuery + Complex full-text queries  compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).  compose approximation on structure and on text.

43 Outline Introduction Querying Scoring Evaluation  Formalization of existing XML search languages  Structure-aware evaluation algorithms  Implementation in GalaTex Open Issues

44 LOC document fragment 109th … Committee on Education … … 1 st session … …and the Workforce … Mr. Jefferson

45 Sample Query on LOC Find action descriptions of bills introduced by “Jefferson” with a committee name containing the words “education” and “workforce” at a distance of no more than 5 words in the text

46 Data model R tokPos NodetokPos wordposition list workforce{1, 3} education{2} Workforce 1 Education Workforce

47 Data model instantiation NodetokPos 1.2.2k1 ; {6} 1.1.2k1 ; {2, 4} 1.1.1k1 ; {1} 1.1 k1 1 k1, k2, k NodetokPos 1.2.2k1 ; {6} 1.2k1 ; {6} 1.1.2k1 ; {2, 4} 1.1.1k1 ; {1} 1.1k1 ; {1, 2, 4} 1k1 ; {1, 2, 4, 6} Instance 1: R k1 -redundant storage -each tuple is self- contained Instance 2: scuR k1 -no redundant positions -smallest nbr of nodes k2 5 k One relation per keyword in the document

48 FT-Algebra and Query Plan 5 σ distance({“education”},{”workforce”};≤5) R “education” R “workforce” ∏ nod e σ ordered({“education”,”workforce”}) R “Jefferson” × × × EC

49 Join Evaluation NodetokPos 1.2.2k1 ; {6} 1.2k1 ; {6} 1.1.2k1 ; {2, 4} 1.1.1k1 ; {1} 1.1k1 ; {1, 2, 4} 1k1 ; {1, 2, 4, 6} NodetokPos 1.2.1k2 ; {5} 1.2k2 ; {5} 1.1.2k2 ; {3} 1.1k2 ; {3} 1k2 ; {3, 5} NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1, 2, 4} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} 1.1 k1 1 k1, k2, k k2 5 k ×

50 Join Evaluation on SCU NodetokPos 1.2.2k1 ; {6} 1.1.2k1 ; {2, 4} 1.1.1k1 ; {1} NodetokPos 1.2.1k2 ; {5} 1.1.2k2 ; {3} NodetokPos 1.1.2k2 ; {3} k1 ; {2, 4} scuR k1 scuR k2 × 1.1 k1 1 k1, k2, k k2 5 k

51 Need for LCAs NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} scuR k1 scuR k2 × k1 1 k1, k2, k k2 5 k NodetokPos 1.2.2k1 ; {6} 1.1.2k1 ; {2, 4} 1.1.1k1 ; {1} NodetokPos 1.2.1k2 ; {5} 1.1.2k2 ; {3} (Schmidt et al, ICDE 2002)(Li, Yu, Jagadish, VLDB 2003) (Guo et al, SIGMOD 2003)(Xu & Papakonstantinou, SIGMOD 2005)

52 SCU: is LCA enough? NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} 1.1 k1 1 k1, k2, k k2 5 k σ distance({“k1”},{”k2”};=2 ) σ ordered({“k2”,”k1”})  fail scuR k1 scuR k2 ×  pass

53 SCU: is LCA enough? NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} 1.1 k1 1 k1, k2, k k2 5 k σ distance({“k1”},{”k2”};=2 ) σ ordered({“k2”,”k1”})  fail scuR k1 scuR k2  pass ×

54 SCU: is LCA enough? NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} 1.1 k1 1 k1, k2, k k2 5 k σ distance({“k1”},{”k2”};=2 ) σ ordered({“k2”,”k1”})  fail scuR k1 scuR k2 × Does not satisfy ‘ordered’ alone, but it should be an answer!

55 SCU: is LCA enough? NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} 1.1 k1 1 k1, k2, k k2 5 k σ distance({“k1”},{”k2”};=2 ) σ ordered({“k2”,”k1”})  fail scuR k1 scuR k2 ×

56 SCU: position propagation NodetokPos 1.2k2 ; {5} k1 ; {6} 1.1.2k2 ; {3} k1 ; {2, 4} 1.1k2 ; {3} k1 ; {1} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6} 1.1 k1 1 k1, k2, k k2 5 k σ distance({“k1”},{”k2”};=2 ) σ ordered({“k2”,”k1”}) NodetokPos 1.1k2 ; {3} k1 ; {1, 2, 4} 1k2 ; {3, 5} k1 ; {1, 2, 4, 6}  pass

57 SCU Summary Key ideas  R 1 SCU R 2 → find LCA  σ SCU (R) → propagation along doc. structure if node satisfies σ predicate, output node o/w propagate its tokPos to its first ancestor in R Benefit: reduces size of intermediate results Challenge: minimize computation overhead  selections additional column in R for direct access to ancestors TRIE structures  joins record highest ancestor in EC of each node in scuR and use sort-merge ×

58 Galax XQuery Engine Text Text Text Text.xml Parser: to FT-Algebra FT-Algebra plan FT-Algebra plan Full-Text Query Full-Text Query Preprocessing & Inverted Lists Generation Text </xml Text </xml.xml 4 GalaTex Architecture: in progress inverted lists BerkeleyDB (instance 1 / 2) Code generation AllNodes / SCU Query Execution FT-Algebra operators implem. FT-Algebra operators implem. Executable code Executable code EC +positions API

59 Open Issues (in no particular order) Difficult research issues in XML retrieval are not ‘just’ about the effective retrieval of XML documents, but also about what and how to evaluate! System architecture: DB on top of IR, IR on top of DB, true merging? Experimental evaluation of scoring methods (INEX). Score-aware algebra for XML for the joint optimization of queries on both structure and text. More details: