Introduction to XML IR XML Group
Outline Introduction XML Search XML Scoring and Ranking Conclusion
Introduction Submit keywords Top-k ranking
Introduction The result is an XML fragment. Example queries: Q1: Linda; Q2: Female, Linda; Q3: Female, Researcher. [Figure: XML tree of a Research Institute with Projects (Project, Topic) and Researchers (Name, Gender, ProjRef), with values such as "Alice", "Joe", "Linda", "John", "XML", "RDF", Female, Male; the answer to each query is a fragment of this tree.]
Introduction-Conceptual Model Documents Query Indexing Formulation Keywords Document representation Query representation Inverted index (Algorithm Design) Retrieval function Relevance feedback Retrieval results Matching content + structure (Scoring and Ranking) Presentation of related components (Semantic Definition)
Introduction XML IR Query Semantic Query Processing (XML Search Algorithm) Scoring and Ranking Result representation
Outline Introduction XML Search XML Scoring and Ranking Conclusion XML Search Semantic XML Search Algorithms XML Scoring and Ranking Conclusion
XML Search Languages Three classes of XML search languages: (1) Keyword search: “book xml” (2) Path expression + keyword search: /book[./title about “xml db”] (3) XQuery + complex full-text search: for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5
Search Semantic Tree & Graph (IDRef) Q: Female, XML [Figure: the Research Institute XML tree (Projects, Project, Topic, Researcher, Name, Gender, ProjRef) with two candidate answer fragments that connect a Researcher (Name "Linda", Gender Female) to a Project (Topic "XML"); one is labeled "ideal result" and the other "wrong result".]
Search Semantic Factors that affect the semantics: Tree & Graph (whether IDRef is considered); Relationship between entities; Schema (whether the schema is considered); Flexibility of the XML structure
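Search Semantic: Related Work Classified by IDRef (Tree vs. Graph) and by Schema (NO vs. YES). Schema NO: LCA [ICDE01], XSEarch [VLDB03], XRANK [SIGMOD03], MLCA [VLDB04], SLCA [SIGMOD05], Symmetry [WWW06]. Schema YES: Interconnection [CIKM05], XKeyword [ICDE03]. Annotations: considers relationships between entities; considers exchange between entities.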
Outline Introduction XML Search XML Scoring and Ranking Conclusion XML Search Semantic search on XML Tree search on XML Tree considering entity relationship search on XML Graph considering schema XML Search Algorithms XML Scoring and Ranking Conclusion
LCA & SLCA(MLCA) Q3: Bit, 1999 (LCA) Q1: Ben, Bit Q2: Bob, Byte Q3: Bit, 1999 (SLCA)
XSEarch [VLDB03] Find papers by Vianu on the topic of “logical databases”. How can we find such papers?
Standard Search Engine A document containing some of the three query terms is considered as a result.
The document contains the three query terms. Hence, it is returned by a standard search engine. BUT it does not contain any paper on “logical databases” by Vianu: the first fragment does not represent a paper by Vianu, and the second fragment does not represent a paper about logical databases. The document is not relevant to the query. This does not work!!!
<proceedings> <inproceedings> <author>Moshe Y. Vardi</author> <title>Querying Logical Databases</title> </inproceedings> <inproceedings> <author>Victor Vianu</author> <title>A Web Odyssey: From Codd to XML</title> </inproceedings> </proceedings>
Relationship Trees The relationship tree of nodes n1, ..., nk is the subtree T of the document D such that T is rooted at the lowest common ancestor (lca) of n1, ..., nk, and T consists of the k paths from the lca to n1 through nk. [Figure: the lowest common ancestor of n1, n2, …, nk and, below it, the relationship tree of n1, n2, …, nk.]
XSEarch: A Semantic Search Engine for XML Nodes n1, ..., nk are interconnected if either (1) the relationship tree of n1, ..., nk does not contain two nodes with the same label, or (2) the only nodes with the same label in the relationship tree of n1, ..., nk are among n1, ..., nk.
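The two definitions above (relationship tree, interconnection) can be sketched as follows. This is a minimal illustration, not XSEarch's implementation: the document is assumed to be given as a parent map and a label map, and the node ids and the names PARENT, LABEL, relationship_tree and interconnected are made up for the example. The data mirrors the proceedings document used in the examples.

```python
# Hypothetical node ids for a small proceedings document:
# 0 proceedings
#   1 inproceedings: 2 author (Vardi), 3 title (Querying Logical Databases)
#   4 inproceedings: 5 author (Vianu), 6 author (Abiteboul), 7 title
PARENT = {1: 0, 2: 1, 3: 1, 4: 0, 5: 4, 6: 4, 7: 4}
LABEL = {0: "proceedings", 1: "inproceedings", 2: "author", 3: "title",
         4: "inproceedings", 5: "author", 6: "author", 7: "title"}


def path_to_root(node, parent):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def relationship_tree(nodes, parent):
    """Return (lca, tree_nodes) for the given nodes."""
    paths = [path_to_root(n, parent) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from any node, the first common ancestor met is the lca.
    lca = next(n for n in paths[0] if n in common)
    tree = set()
    for p in paths:
        tree.update(p[: p.index(lca) + 1])  # path from the node up to the lca
    return lca, tree


def interconnected(nodes, parent, label):
    """True if nodes sharing a label in the relationship tree are all among `nodes`."""
    _, tree = relationship_tree(nodes, parent)
    first_with_label = {}
    for n in tree:
        other = first_with_label.setdefault(label[n], n)
        if other != n and not (n in nodes and other in nodes):
            return False
    return True
```

With this data, a title and an author from different papers (nodes 3 and 5) are not interconnected because the relationship tree contains two inproceedings nodes, while two authors of the same paper (nodes 5 and 6) are interconnected because the repeated label only occurs on the nodes themselves.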
Example (1) [Figure: document tree with proceedings at the root and inproceedings entities below it, each with author and title (Moshe Y. Vardi, "Querying Logical Databases"; Victor Vianu, "A Web Odyssey: From Codd to XML"); the relationship tree and the lowest common ancestor of the circled nodes are marked.] Circled nodes belong to different inproceedings entities. They ARE NOT interconnected!
Example (2) [Figure: the same document tree (proceedings, inproceedings, author, title; Moshe Y. Vardi, "Querying Logical Databases"; Victor Vianu, "A Web Odyssey: From Codd to XML"); the relationship tree and the lowest common ancestor of the circled nodes are marked.] Circled nodes belong to the same inproceedings entity. They ARE interconnected!
Example (3) [Figure: document tree with proceedings at the root and two inproceedings entities; titles "Queries and Computation on the Web" and "Querying Logical Databases"; authors Moshe Y. Vardi, Victor Vianu, Serge Abiteboul; the relationship tree and the lowest common ancestor of the circled nodes are marked.] Circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected.
Interconnection Semantics for Keyword Search in XML [CIKM05] ID references are ignored; that is, documents are always trees. The schema is ignored; therefore, missing information is not taken into account.
Keyword-Search Example {Cohen , IR}
A Result {Cohen , IR} Cohen and IR are in the same department
Another Result {Cohen, IR} Cohen and IR are in the same department, and Cohen wrote an article about IR. This fragment should have a higher rank. Identifying meaningful relationships can improve ranking in keyword search.
A Schema Defines Document Structure The root of the schema is the label of the root of the document
A Schema Defines Document Structure An edge in the document is allowed only if an edge between the corresponding labels appears in the schema
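The two rules above (root label, allowed edges) can be sketched as a small conformance check. This is only an illustration under assumptions: the schema is represented as a root label plus a set of (parent label, child label) pairs, the document as a parent map and a label map, and the labels, node ids and the name conforms are chosen for the example.

```python
# Hypothetical schema: root label and allowed (parent label, child label) edges.
SCHEMA_ROOT = "department"
SCHEMA_EDGES = {
    ("department", "publications"),
    ("publications", "article"),
    ("article", "title"),
    ("article", "author"),
}


def conforms(doc_root, parent, label, schema_root, schema_edges):
    """Check the two rules: root label matches, and every document edge is allowed."""
    if label[doc_root] != schema_root:
        return False
    return all((label[p], label[c]) in schema_edges for c, p in parent.items())


# A document that follows the schema (child id -> parent id).
GOOD_PARENT = {1: 0, 2: 1, 3: 2, 4: 2}
GOOD_LABEL = {0: "department", 1: "publications", 2: "article",
              3: "title", 4: "author"}

# A document with a title directly under department, which the schema forbids.
BAD_PARENT = {1: 0}
BAD_LABEL = {0: "department", 1: "title"}
```

Here conforms(0, GOOD_PARENT, GOOD_LABEL, SCHEMA_ROOT, SCHEMA_EDGES) holds, while the second document is rejected because the edge department to title does not appear in the schema.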
Patterns In the formal framework, patterns are the basic building blocks. Formally, a pattern is a pair (L, C), where L is a set of labels and C is a tree of labels, such that C contains L and C has no redundant edges. Example: ({title, publication, author}, [Figure: the tree C])
Interconnection by Patterns A set O of objects is interconnected by a pattern if the objects are in a tree that is isomorphic to the pattern. Example pattern: ({title, publication, author}, [Figure: the pattern tree]). Now I’ll define when a pattern interconnects a set of objects. This is a formal definition; I will explain it by an example.
Interconnection by Patterns ({title, publication, author}, [Figure: the pattern tree]) [Figure: the document, with a candidate subtree to be matched against the pattern.]
Interconnection by Patterns ({title, publication, author}, [Figure: the pattern tree]) Interconnected … so a pattern represents a specific meaningful relationship.
Interconnection Semantics An interconnection semantics P is a set of patterns. A set of objects is interconnected by P if it is interconnected by a pattern of P. Example: two patterns over {title, name}, each with a different tree. [Figure: the two pattern trees]
The Subtrees of {title,publication}
The Subtrees of {title,author}
The Subtrees of {title,author} ? The author did not actually write the paper. Is this what we mean?
The Interconnection Semantics Puca Intuitively, a pattern p = (L, C) is structurally minimal if internal nodes in C cannot be roots of trees containing L. The semantics Puca(S) is the set of all structurally minimal patterns.
One Structurally Minimal Pattern ({title,author} , ) In the schema, article is the only common ancestor of {title,author}
Another Structurally Minimal Pattern ({title,author} , ) In the schema, inproc. is the only common ancestor of {title,author}
A Third Structurally Minimal Pattern ({title,author} , ) In the schema, inproc. is the only common ancestor of {title,author}
Not a Structurally Minimal Pattern ({title,author} , ) In the schema, department, publications and inproc. are all common ancestors of {title,author}
Back to the Document
Puca(S)-Interconnected title and author This subtree shows Puca(S)-interconnection! ({title,author} , )
Not Puca(S)-Interconnected This subtree does not show Puca(S)-interconnection! ({title,author} , )
Keyword Proximity Search on XML Graphs [ICDE03] Input: a set of keywords. Results: trees of XML fragments (called target objects) that contain all the keywords, ranked according to their size. Assumes the existence of a schema, which facilitates the presentation of the results and is used to optimize the performance of the system.
Name[John] - person - supplier - lineitem - linepart - product - descr[set of VCR and DVD], size 6
Name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR], size 8
Query semantics Result: the set of all possible Minimal Total Target Object Networks (MTTONs). What is an MTTON? Node network j: an uncycled subgraph of G, such that each edge in j is an edge in G. Total node network j of keywords {k1, …, km}: a node network where every keyword is contained in at least one node n of j. Minimal Total Node Network (MTNN): a total node network j from which no node can be removed such that j is still a total node network. Score: number of edges.
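The definitions of "total" and "minimal" can be sketched as simple checks. This is a rough illustration and not XKeyword's algorithm. It assumes the node network is a connected tree, so that only leaves can be removed without breaking it, and it assumes each node is described by the set of keywords it contains. The names NODES, EDGES, TEXT, is_total and is_minimal_total are made up for the example; the node names follow the size-6 result shown earlier.

```python
NODES = {"name", "person", "supplier", "lineitem", "linepart", "product", "descr"}
EDGES = [("name", "person"), ("person", "supplier"), ("supplier", "lineitem"),
         ("lineitem", "linepart"), ("linepart", "product"), ("product", "descr")]
TEXT = {"name": {"John"}, "descr": {"VCR", "DVD"}}  # keywords held by each node
KEYWORDS = {"John", "VCR"}


def is_total(nodes, keywords, text):
    """Every keyword is contained in at least one node."""
    return all(any(k in text.get(n, set()) for n in nodes) for k in keywords)


def leaves(nodes, edges):
    """Nodes with at most one incident edge (assumes a connected tree)."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [n for n, d in degree.items() if d <= 1]


def is_minimal_total(nodes, edges, keywords, text):
    """Total, and removing any leaf would lose a keyword."""
    if not is_total(nodes, keywords, text):
        return False
    return all(not is_total(nodes - {leaf}, keywords, text)
               for leaf in leaves(nodes, edges))


def score(edges):
    """Score of a network: its number of edges."""
    return len(edges)
```

For the network above, both leaves hold a keyword that no other node holds, so it is minimal and total with score 6. Attaching an extra keyword-free leaf would make it non-minimal.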
Outline Introduction XML Search XML Scoring and Ranking Conclusion XML Search Language XML Search Algorithms XRank XML Scoring and Ranking Conclusion
Main Issue Given: Query keywords Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order
Naïve Method Naïve inverted lists: Ricardo: 1; 5; 6; 8. XQL: 1; 5; 6; 7. [Figure: document tree with element IDs: <workshop> 1, date 2 ("28 July …"), <title> 3 ("XML and …"), <editors> 4 ("David Carmel …"), <proceedings> 5, <paper> 6, <title> 7 ("XQL and …"), <author> 8 ("Ricardo …"), and a second <paper> …] Problems: 1. Space overhead 2. Spurious results. Main issue: decouples representation of ancestors and descendants.
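The naïve lists on this slide can be reproduced with a short sketch. It assumes the element IDs from the figure, a parent map derived from it, and a map from each keyword to the elements that directly contain it. The names PARENT, DIRECT and naive_inverted_lists are illustrative.

```python
# Element IDs from the figure: 1 workshop, 2 date, 3 title, 4 editors,
# 5 proceedings, 6 paper, 7 title (of the paper), 8 author.
PARENT = {2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}

# Elements that directly contain each keyword.
DIRECT = {"XQL": [7], "Ricardo": [8]}


def naive_inverted_lists(parent, direct):
    """Index each keyword under the containing element and all of its ancestors."""
    lists = {}
    for keyword, elements in direct.items():
        ids = set()
        for e in elements:
            while e is not None:
                ids.add(e)
                e = parent.get(e)  # None once the root has been added
        lists[keyword] = sorted(ids)
    return lists


LISTS = naive_inverted_lists(PARENT, DIRECT)
# Elements that contain both keywords, including ancestors of the real answer.
BOTH = sorted(set(LISTS["XQL"]) & set(LISTS["Ricardo"]))
```

Each occurrence is stored once per ancestor (space overhead), and intersecting the lists returns 1, 5 and 6 although only the paper (6) is the most specific answer (spurious results).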
Dewey Encoding of IDs [1850s] [Figure: the same document tree with Dewey IDs: <workshop> (root), date 0.0 ("28 July …"), <title> 0.1 ("XML and …"), <editors> 0.2 ("David Carmel …"), <proceedings> 0.3, <paper> 0.3.0, <paper> 0.3.1, <title> 0.3.0.0 ("XQL and …"), <author> 0.3.0.1 ("Ricardo …").]
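A minimal sketch of Dewey numbering, assuming the tree is given as nested (tag, children) pairs and that the root gets ID 0 (the figure only shows IDs from 0.0 downwards). Because an ID embeds the IDs of all ancestors, the lowest common ancestor of two elements is their longest common prefix. The names TREE, assign_dewey and lca are illustrative.

```python
TREE = ("workshop", [
    ("date", []),
    ("title", []),
    ("editors", []),
    ("proceedings", [
        ("paper", [("title", []), ("author", [])]),
        ("paper", []),
    ]),
])


def assign_dewey(node, dewey=(0,), out=None):
    """Return (dewey_id, tag) pairs in document order."""
    if out is None:
        out = []
    tag, children = node
    out.append((".".join(map(str, dewey)), tag))
    for i, child in enumerate(children):
        assign_dewey(child, dewey + (i,), out)  # child i extends the parent's ID
    return out


def lca(a, b):
    """Lowest common ancestor of two Dewey IDs: their longest common prefix."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


IDS = assign_dewey(TREE)
```

The paper's title and author receive 0.3.0.0 and 0.3.0.1, and lca("0.3.0.0", "0.3.0.1") gives 0.3.0, the paper itself, without touching the tree.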
XRank: Dewey Inverted List (DIL) Each keyword's list is sorted by Dewey Id. Store IDs of elements that directly contain the keyword: avoids space overhead.
Keyword   Dewey Id    Score   Position List
XQL       5.0.3.0.0   85      32
          8.0.3.8.3   38      89 91
          …           …       …
Ricardo   5.0.3.0.1   82      38
          8.2.1.4.2   99      52
          …           …       …
XRank: Ranked Dewey Inverted List (RDIL) For each keyword (e.g. XQL): an inverted list sorted by score, plus a B+-tree on Dewey Id. The same structure is kept for the other keywords.
RDIL: Algorithm An element may be ranked highly in one list and low in another list; the B+-tree helps search for the low-ranked element. When to stop scanning inverted lists? Based on the Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold: scanning can stop if we have sufficient results above the threshold. Extension to most specific results.
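The stopping rule can be illustrated with a plain version of the Threshold Algorithm. This sketch is not XRank's RDIL procedure: it ranks flat element IDs by the sum of their per-keyword scores, treats a missing score as 0, uses a dictionary lookup in place of the B+-tree, and ignores the extension to most specific results. The element IDs, scores and the name threshold_topk are invented for the example.

```python
import heapq


def threshold_topk(lists, k):
    """lists: one list per keyword of (element_id, score), sorted by score descending."""
    lookup = [dict(lst) for lst in lists]  # random access, stands in for the B+-tree
    best = {}                              # element_id -> total score
    longest = max(len(lst) for lst in lists)
    for depth in range(longest):
        last_seen = []
        for lst in lists:
            if depth < len(lst):
                element, score = lst[depth]
                last_seen.append(score)
                if element not in best:
                    # Fetch this element's score from every list.
                    best[element] = sum(d.get(element, 0) for d in lookup)
            else:
                last_seen.append(0)        # exhausted list contributes nothing
        # No unseen element can score higher than the sum of the last scores read.
        threshold = sum(last_seen)
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])


XQL = [("e1", 90), ("e2", 60), ("e3", 20)]
RICARDO = [("e2", 80), ("e1", 70), ("e3", 10)]
```

For k = 1, the first round sees e1 (total 160) and e2 (total 140) with threshold 170, so scanning continues; after the second round the threshold drops to 130 and e1 is returned without reading the last entries of either list.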
RDIL: Query Processing [Figure: animated run over the inverted lists for Ricardo and XQL, each with a B+-tree on Dewey Id, using a temp heap and an output heap. Entry P (9.0.4.2.0) and entry R are read from the lists and Rank(9.0.4) is computed; the threshold is shown as Score(P)+Max-Score and as Score(P)+Score(R). Other Dewey Ids shown: 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3.]
Outline Introduction XML Search XML Scoring and Ranking Conclusion XML Search Language XML Search Algorithms XML Scoring and Ranking PageRank -> XRank INEX Query Relaxation … Conclusion
Q&A Thanks!