Introduction to XML IR XML Group.


1 Introduction to XML IR XML Group

2 Outline
Introduction; XML Search; XML Scoring and Ranking; Conclusion

3 Introduction
Submit keywords; top-k ranking

4 Introduction
The result is an XML fragment.
Q1: Linda
Q2: Female, Linda
Q3: Female, Researcher
[Figure: XML tree rooted at Research Institute, with Projects (Project, Topic: ”XML”, ”RDF”) and Researcher elements (Name: ”Alice”, ”Joe”, ”Linda”, ”John”; Gender: Female, Male; ProjRef), plus a highlighted Researcher fragment returned as a result.]

5 Introduction-Conceptual Model
[Diagram elements: Documents; Query; Indexing; Formulation; Keywords; Document representation; Query representation; Inverted index (Algorithm Design); Retrieval function; Relevance feedback; Retrieval results; Matching content + structure (Scoring and Ranking); Presentation of related components (Semantic Definition)]

6 Introduction
XML IR: Query Semantics; Query Processing (XML Search Algorithm); Scoring and Ranking; Result Representation

7 Outline
Introduction; XML Search (XML Search Semantics, XML Search Algorithms); XML Scoring and Ranking; Conclusion

8 XML Search Languages
Three classes of XML search languages:
Keyword search: “book xml”
Path expression + keyword search: /book[./title about “xml db”]
XQuery + complex full-text search: for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5

9 Search Semantic
Tree & Graph (IDRef)
Q: Female, XML
[Figure: the Research Institute tree with two candidate result fragments for the query, one labeled "ideal result" and one labeled "incorrect result". Both fragments combine a Researcher (Name ”Linda”, Gender Female) with a Project whose Topic is ”XML”; one of them reaches the Project through ProjRef.]

10 Search Semantic
Factors that affect the semantics:
Tree & Graph (whether IDRef is considered)
Relationships between entities
Schema (whether the schema is considered)
Flexibility of the XML structure

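11 Search Semantic-Related Work
Schema NO, Tree: LCA[ICDE01], XSEarch[VLDB03], XRANK[SIGMOD03], MLCA[VLDB04], SLCA[SIGMOD05], Symmetry[WWW06]
Schema YES, Tree: Interconnection[CIKM05]
Schema YES, Graph (IDRef): XKeyword[ICDE03]
Figure annotations: considers relationships between entities; considers exchange between entities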

12 Outline
Introduction; XML Search (XML Search Semantics: search on XML tree, search on XML tree considering entity relationships, search on XML graph considering schema; XML Search Algorithms); XML Scoring and Ranking; Conclusion

13 LCA & SLCA (MLCA)
Q1: Ben, Bit
Q2: Bob, Byte
Q3: Bit, 1999 (LCA)
Q3: Bit, (SLCA)
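The difference between the two semantics can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm from the cited papers: nodes are identified by Dewey-style paths (tuples of child positions), the tiny document and its keyword matches are hypothetical, and the brute-force enumeration over all match combinations is chosen only for clarity.

```python
from itertools import product


def lca(a, b):
    """Lowest common ancestor of two nodes = longest common prefix of their paths."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def all_lcas(match_lists):
    """LCA of every combination that picks one matching node per keyword."""
    found = set()
    for combo in product(*match_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        found.add(node)
    return found


def slcas(match_lists):
    """Smallest LCAs: LCAs that have no other LCA as a proper descendant."""
    candidates = all_lcas(match_lists)

    def is_proper_ancestor(a, d):
        return len(a) < len(d) and d[:len(a)] == a

    return {c for c in candidates
            if not any(is_proper_ancestor(c, d) for d in candidates)}


# Hypothetical document: root (0,) with two papers (0,0) and (0,1).
# "Bit" occurs in both papers, "1999" only in the first one.
bit_matches = [(0, 0, 0), (0, 1, 0)]
year_matches = [(0, 0, 1)]

print(all_lcas([bit_matches, year_matches]))  # the first paper and the root
print(slcas([bit_matches, year_matches]))     # only the first paper
```

Under LCA semantics the root is also an answer because it joins "Bit" from the second paper with "1999" from the first; SLCA discards it because a more specific answer exists below it.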

14 Outline
Introduction; XML Search (XML Search Semantics: search on XML tree, search on XML tree considering entity relationships, search on XML graph considering schema; XML Search Algorithms); XML Scoring and Ranking; Conclusion

15 XSEarch[VLDB03]
Find papers by Vianu on the topic of “logical databases”. How can we find such papers?

16 Standard Search Engine
A document containing some of the three query terms is considered as a result.

17 The document is not relevant to the query. This does not work!!!
The document contains the three query terms. Hence, it is returned by a standard search engine. BUT it does not contain any paper on “logical databases” by Vianu.
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
The first inproceedings fragment does not represent a paper by Vianu. The second does not represent a paper about logical databases.

18 Relationship Trees
The relationship tree of nodes n1, ..., nk is the subtree T of the document D such that T is rooted at the lowest common ancestor (lca) of n1, ..., nk, and T consists of the k paths from the lca to n1 through nk.
[Figure: relationship tree of n1, n2, …, nk, rooted at the lowest common ancestor of n1, n2, …, nk.]

19 XSEarch: A Semantic Search Engine for XML
Nodes n1, ..., nk are interconnected if either (1) the relationship tree of n1, ..., nk does not contain two nodes with the same label, or (2) the only nodes with the same label in the relationship tree of n1, ..., nk are among n1, ..., nk themselves.
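The interconnection test can be tried out with Python's standard xml.etree.ElementTree module. The sketch below is an illustration of the definition only, not XSEarch's implementation; the function names are made up here, and the sample document assumes that each author/title pair of the proceedings example sits inside its own inproceedings element.

```python
import xml.etree.ElementTree as ET

DOC = """<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>"""


def path_to_root(node, parent):
    """Return [node, parent of node, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def relationship_tree(nodes, parent):
    """Nodes on the paths from the lowest common ancestor down to each input node."""
    paths = [path_to_root(n, parent) for n in nodes]
    common = set(paths[0])
    for p in paths[1:]:
        common &= set(p)
    # Walking up from the first node, the first shared node is the lca.
    lca = next(n for n in paths[0] if n in common)
    tree = []
    for p in paths:
        for n in p[: p.index(lca) + 1]:
            if n not in tree:
                tree.append(n)
    return tree


def interconnected(nodes, parent):
    """True if repeated labels in the relationship tree occur only among the input nodes."""
    by_tag = {}
    for n in relationship_tree(nodes, parent):
        by_tag.setdefault(n.tag, []).append(n)
    targets = set(nodes)
    return all(len(group) == 1 or all(g in targets for g in group)
               for group in by_tag.values())


root = ET.fromstring(DOC)
parent = {child: p for p in root.iter() for child in p}
papers = root.findall("inproceedings")
vardi_author, vardi_title = papers[0].find("author"), papers[0].find("title")
vianu_author = papers[1].find("author")

print(interconnected([vardi_title, vianu_author], parent))  # different papers
print(interconnected([vardi_title, vardi_author], parent))  # same paper
```

The first call fails the test because the relationship tree contains two inproceedings nodes that are not part of the query answer; the second call passes because the tree has no repeated labels.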

20 Example (1)
[Figure: proceedings tree with two inproceedings children, each with an author and a title (Moshe Y. Vardi, "Querying Logical Databases"; Victor Vianu, "A Web Odyssey: From Codd to XML"). The relationship tree is rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to different inproceedings entities. They ARE NOT interconnected!

21 Example (2)
[Figure: the same proceedings tree (Moshe Y. Vardi, "Querying Logical Databases"; Victor Vianu, "A Web Odyssey: From Codd to XML"). The relationship tree is rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to the same inproceedings entity. They ARE interconnected!

22 Example (3)
[Figure: proceedings tree with two inproceedings children, one with a title and an author and one with a title and two authors. Values shown: Moshe Y. Vardi, Victor Vianu, Serge Abiteboul, "Queries and Computation on the Web", "Querying Logical Databases". The relationship tree is rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected.

23 Outline
Introduction; XML Search (XML Search Semantics: search on XML tree, search on XML tree considering entity relationships, search on XML graph considering schema; XML Search Algorithms); XML Scoring and Ranking; Conclusion

24 Interconnection Semantics for Keyword Search in XML[CIKM05]
ID references are ignored. That is, documents are always trees.
The schema is ignored. Therefore, missing information is not taken into account.

25 Keyword-Search Example
{Cohen , IR}

26 A Result
{Cohen , IR}
Cohen and IR are in the same department.

27 Another Result
{Cohen , IR}
Cohen and IR are in the same department, and Cohen wrote an article about IR. This fragment should have a higher rank.
Identifying meaningful relationships can improve ranking in keyword search.

28 A Schema Defines Document Structure
The root of the schema is the label of the root of the document

29 A Schema Defines Document Structure
An edge in the document is allowed only if an edge between the corresponding labels appears in the schema
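The two schema rules (matching root label, and every document edge backed by a schema edge) are easy to check mechanically. The following Python sketch assumes the schema is given as a root label plus a set of (parent label, child label) pairs; the function name, the schema, and both sample documents are hypothetical and only loosely follow the labels used in these slides.

```python
import xml.etree.ElementTree as ET


def conforms(doc_root, schema_root, schema_edges):
    """Check both rules: root labels agree, and every document edge is a schema edge."""
    if doc_root.tag != schema_root:
        return False
    return all((parent.tag, child.tag) in schema_edges
               for parent in doc_root.iter()
               for child in parent)


# Hypothetical schema: a department holds publications, which hold articles.
SCHEMA_ROOT = "department"
SCHEMA_EDGES = {
    ("department", "publications"),
    ("publications", "article"),
    ("article", "title"),
    ("article", "author"),
}

good = ET.fromstring(
    "<department><publications><article>"
    "<title>IR</title><author>Cohen</author>"
    "</article></publications></department>"
)
# 'article' directly under 'department' is not an edge of the schema.
bad = ET.fromstring("<department><article><title>IR</title></article></department>")

print(conforms(good, SCHEMA_ROOT, SCHEMA_EDGES))
print(conforms(bad, SCHEMA_ROOT, SCHEMA_EDGES))
```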

30 In the formal framework, patterns are the basic building blocks
Formally, a pattern is a pair (L, C), where L is a set of labels and C is a tree of labels such that C contains L and C has no redundant edges.
Example: ({title,publication,author}, [tree of labels shown in the figure])

31 Interconnection by Patterns
A set O of objects is interconnected by a pattern if the objects are in a tree that is isomorphic to the pattern.
Example pattern: ({title,publication,author}, [tree of labels shown in the figure])
Speaker note: Now I'll define when a pattern interconnects a set of objects. This is a formal definition; I will explain it with an example.

32 Interconnection by Patterns
({title,publication,author}, [tree of labels shown in the figure])

33 Interconnection by Patterns
({title,publication,author}, [tree of labels shown in the figure])
Interconnected … so a pattern represents a specific meaningful relationship.

34 Interconnection Semantics
An interconnection semantics P is a set of patterns. A set of objects is interconnected by P if it is interconnected by a pattern of P.
Example patterns: ({title,name}, [tree shown in the figure]) and ({title,name}, [tree shown in the figure])

35 The Subtrees of {title,publication}

36 The Subtrees of {title,author}

37 The Subtrees of {title,author}
The author did not actually write the paper. Is this what we mean?

38 The Interconnection Semantics Puca
Intuitively, a pattern p = (L, C) is structurally minimal if internal nodes in C cannot be roots of trees containing L. The semantics Puca(S) is the set of all structurally minimal patterns.

39 One Structurally Minimal Pattern
({title,author} , ) In the schema, article is the only common ancestor of {title,author}

40 Another Structurally Minimal Pattern
({title,author} , ) In the schema, inproc. is the only common ancestor of {title,author}

41 A Third Structurally Minimal Pattern
({title,author} , ) In the schema, inproc. is the only common ancestor of {title,author}

42 Not a Structurally Minimal Pattern
({title,author} , ) In the schema, department, publications and inproc. are all common ancestors of {title,author}

43 Back to the Document

44 Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection! ({title,author} , )

45 Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection! ({title,author} , )

46 Not Puca(S)-Interconnected
This subtree does not show Puca(S)-interconnection! ({title,author} , )

47 Outline
Introduction; XML Search (XML Search Semantics: search on XML tree, search on XML tree considering entity relationships, search on XML graph considering schema; XML Search Algorithms); XML Scoring and Ranking; Conclusion

48 Keyword Proximity Search on XML Graphs[ICDE03]
Input: a set of keywords.
Results: trees of XML fragments (called target objects) that contain all the keywords, ranked according to their size.
Assumes the existence of a schema, which facilitates the presentation of the results and is used to optimize the performance of the system.

49
Name[John] - person - supplier - lineitem - linepart - product - descr[set of VCR and DVD], size 6
Name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR], size 8

50 Query semantics
Result: the set of all possible Minimal Total Target Object Networks (MTTONs).
What is an MTTON?
Node network j: an acyclic subgraph of G, such that each edge in j is an edge in G.
Total node network j of keywords {k1,…,km}: a node network where every keyword is contained in at least one node n of j.
Minimal Total Node Network (MTNN): a total node network j from which no node can be removed such that j is still a total node network.
Score: number of edges.
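The definitions of "total" and "minimal" can be checked on a small example. The Python sketch below is an illustration under stated assumptions, not XKeyword's algorithm: it assumes the node network is a connected tree (so the only nodes that can be removed without disconnecting it are leaves), it represents each node by the set of keywords it contains, and the node names are hypothetical labels loosely modeled on the size-6 result of slide 49.

```python
def is_total(nodes, keywords):
    """Every query keyword is contained in at least one node of the network."""
    covered = set().union(*nodes.values())
    return set(keywords) <= covered


def is_minimal_total(nodes, edges, keywords):
    """Total, and no leaf can be dropped while the network stays total.

    Assumes the network is a connected tree, so only leaves are removable.
    """
    if not is_total(nodes, keywords):
        return False
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    for n, d in degree.items():
        if d <= 1:
            rest = {m: kw for m, kw in nodes.items() if m != n}
            if is_total(rest, keywords):
                return False
    return True


def score(edges):
    """Score of a network = its number of edges (smaller is better)."""
    return len(edges)


nodes = {
    "name": {"John"}, "person": set(), "supplier": set(), "lineitem": set(),
    "linepart": set(), "product": set(), "descr": {"VCR", "DVD"},
}
edges = [("name", "person"), ("person", "supplier"), ("supplier", "lineitem"),
         ("lineitem", "linepart"), ("linepart", "product"), ("product", "descr")]
query = {"John", "VCR"}

print(is_minimal_total(nodes, edges, query), score(edges))

# Attaching an extra keyword-free leaf makes the network total but not minimal.
bigger_nodes = dict(nodes, nation=set())
bigger_edges = edges + [("person", "nation")]
print(is_minimal_total(bigger_nodes, bigger_edges, query))
```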

51 Outline
Introduction; XML Search (XML Search Language; XML Search Algorithms: XRank); XML Scoring and Ranking; Conclusion

52 Main Issue
Given: query keywords
Compute: Least Common Ancestors (LCAs) that contain the query keywords, in ranked order

53 Naïve Method
Naïve inverted lists:
Ricardo 1 ; 5 ; 6 ; 8
XQL ; 5 ; 6 ; 7
[Figure: document tree with numbered elements: <workshop> with date (28 July …), <title> (XML and …), <editors> (David Carmel …), <proceedings>, and two <paper> elements with <title> (XQL and …) and <author> (Ricardo …).]
Problems: 1. Space overhead 2. Spurious results
Main issue: decouples representation of ancestors and descendants

54 Dewey Encoding of IDs [1850s]
[Figure: document tree with Dewey IDs: <workshop>; date 0.0 (28 July …); <title> 0.1 (XML and …); <editors> 0.2 (David Carmel …); <proceedings> 0.3; <paper> 0.3.0; <paper> 0.3.1, with <title> (XQL and …) and <author> (Ricardo …) below.]
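A Dewey ID encodes the whole path from the root, so ancestor tests and common-ancestor computation need no tree traversal at all. The following Python sketch shows the idea with helper names invented for this example; the IDs come from the figure on this slide.

```python
def parse(dewey):
    """'0.3.1' -> (0, 3, 1)"""
    return tuple(int(part) for part in dewey.split("."))


def fmt(path):
    """(0, 3) -> '0.3'"""
    return ".".join(str(part) for part in path)


def is_ancestor(a, d):
    """True if element a is a proper ancestor of element d."""
    a, d = parse(a), parse(d)
    return len(a) < len(d) and d[:len(a)] == a


def common_ancestor(a, b):
    """Deepest common ancestor = longest common prefix of the two IDs."""
    prefix = []
    for x, y in zip(parse(a), parse(b)):
        if x != y:
            break
        prefix.append(x)
    return fmt(prefix)


print(common_ancestor("0.3.0", "0.3.1"))  # the two papers meet at <proceedings>
print(common_ancestor("0.1", "0.3.1"))    # <title> and a paper meet at the root
print(is_ancestor("0.3", "0.3.1"))
```

Comparing parsed components rather than raw strings matters: as strings, "0.3" is a prefix of "0.30", but element 0.3 is not an ancestor of element 0.30.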

55 XRank: Dewey Inverted List (DIL)
[Figure: Dewey inverted lists for the keywords XQL and Ricardo. Each entry holds a Dewey ID, a score, and a position list; each list is sorted by Dewey ID.]
Stores the IDs of elements that directly contain the keyword, which avoids the space overhead.

56 XRank: Ranked Dewey Inverted List (RDIL)
[Figure: for each keyword (XQL shown, other keywords likewise), an inverted list sorted by score, plus a B+-tree on Dewey ID.]

57 RDIL: Algorithm
An element may be ranked highly in one list and low in another list. The B+-tree helps search for such low-ranked elements.
When to stop scanning the inverted lists? The algorithm is based on the Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold. Scanning can stop once there are sufficient results above the threshold.
Extension to most specific results.
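The stopping rule can be illustrated with a generic version of the Threshold Algorithm. This Python sketch is not XRank's RDIL: it ignores Dewey ancestors and most specific results, uses a dictionary lookup where RDIL would probe the B+-tree, sums per-keyword scores as an assumed aggregation function, and treats a missing score as 0. The element IDs and scores are made up.

```python
import heapq


def threshold_topk(lists, k):
    """Top-k by summed score.

    lists: one list per keyword of (element_id, score), sorted by score descending.
    """
    lookup = [dict(entries) for entries in lists]  # random access per keyword
    totals = {}
    for depth in range(max(len(entries) for entries in lists)):
        last_seen = []
        for entries in lists:
            if depth < len(entries):
                element, s = entries[depth]
                last_seen.append(s)
                if element not in totals:
                    # Random access: fetch this element's score in every list.
                    totals[element] = sum(d.get(element, 0) for d in lookup)
            else:
                last_seen.append(0)
        # No unseen element can score higher than the sum of the last seen scores.
        threshold = sum(last_seen)
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


xql = [("a", 90), ("b", 80), ("c", 30)]
ricardo = [("b", 95), ("c", 50), ("a", 10)]

print(threshold_topk([xql, ricardo], 1))
print(threshold_topk([xql, ricardo], 2))
```

With k = 1 the scan stops after two rounds: element b already has a total of 175, and the threshold has dropped to 80 + 50 = 130, so nothing further down the lists can beat it.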

58 RDIL: Query Processing
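[Figure: RDIL query processing for the keywords Ricardo and XQL. Each keyword has an inverted list and a B+-tree on Dewey ID; results are kept in an output heap and a temp heap. Labels in the figure: P, R, Rank(9.0.4), 10.8.3, threshold = Score(P)+Score(R), threshold = Score(P)+Max-Score.]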

59 Outline
Introduction; XML Search (XML Search Language, XML Search Algorithms); XML Scoring and Ranking (PageRank -> XRank, INEX, Query Relaxation); Conclusion

60 Q&A Thanks!

