1 Extending PRIX for Similarity-based XML Query
Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao

2 Agenda
- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting PRIX algorithm
    - Indexing
    - Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion

3 System Architecture Introduction

5 Query Expansion (I)
An example:
- Tags in a sample query: {title, Praveen Rao, information retrieval}
- Keywords: {title, Praveen, Rao, information, retrieval}
- Keyword Extensions: {{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
- Valid Keyword Extensions: {{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
(Continued on the next slide)

6 Query Expansion (II)
- Tag Extensions: {{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}
- Valid Tag Extensions: {{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}
- Query Expansions:
  1. { {title}, {Praveen Rao}, {modern information retrieval} }
  2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} }
  ...
- Valid Queries: { {title}, {Praveen Rao}, {modern information retrieval} }
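The expansion steps on these two slides can be sketched roughly as follows. This is only an illustration: the synonym table, the vocabulary used to decide which extensions are "valid", and all function names are assumptions, since the slides do not say which thesaurus or validity check the group used.

```python
from itertools import product

# Hypothetical synonym table standing in for a thesaurus lookup.
SYNONYMS = {
    "title": ["title", "status title", "deed", "claim", "entity", "style"],
    "information": ["data", "entropy", "information"],
    "retrieval": ["retrieval", "recovery"],
}

# Hypothetical set of keywords that occur in the indexed documents.
# Assumption: an extension is "valid" when it appears in this set.
VOCABULARY = {
    "title", "claim", "entity", "Praveen", "Rao",
    "data", "entropy", "information", "retrieval", "recovery",
}


def keyword_extensions(keyword):
    """Return the keyword together with its synonyms (or just itself)."""
    return SYNONYMS.get(keyword, [keyword])


def valid_keyword_extensions(keyword):
    """Keep only the extensions that occur in the vocabulary."""
    return [k for k in keyword_extensions(keyword) if k in VOCABULARY]


def tag_extensions(tag):
    """Combine per-keyword extensions into candidate tags (Cartesian product)."""
    choices = [valid_keyword_extensions(k) for k in tag.split()]
    return [" ".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(valid_keyword_extensions("title"))
    print(tag_extensions("information retrieval"))
```

For the tag "information retrieval" this yields the six keyword combinations listed under Tag Extensions; filtering those against actual document content would give the Valid Tag Extensions.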

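7 Semantic Similarity Computation
Similarity between query q and one of its extensions q’:
- t: tag in query q
- t’: tag in query q’
- n: number of tags in q
- m: number of keywords in tag t
- Similarity between keyword k_i in t and the corresponding keyword k_i’ in t’:

\[
\mathrm{sim}(k_i, k_i') =
\begin{cases}
1, & \text{if } k_i = k_i' \\
\alpha \ (0 < \alpha < 1), & \text{otherwise}
\end{cases}
\]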
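A minimal sketch of how the quantities on this slide might be combined. The keyword rule (1 for an identical keyword, α otherwise) comes from the slide; the aggregation is an assumption, since the transcript does not preserve the full formula. Here the keyword scores are averaged over the m keywords of each tag, and the tag scores over the n tags of the query. The value of `ALPHA` and all function names are hypothetical.

```python
ALPHA = 0.8  # hypothetical value; the slide only requires 0 < alpha < 1


def keyword_sim(k, k_ext, alpha=ALPHA):
    """1 if the keyword is unchanged, alpha if it was replaced by an extension."""
    return 1.0 if k == k_ext else alpha


def tag_sim(tag, tag_ext, alpha=ALPHA):
    """Assumed aggregation: mean keyword similarity over the m keywords of a tag."""
    if not tag or len(tag) != len(tag_ext):
        raise ValueError("tags must be non-empty and have the same number of keywords")
    return sum(keyword_sim(k, e, alpha) for k, e in zip(tag, tag_ext)) / len(tag)


def semantic_sim(query, query_ext, alpha=ALPHA):
    """Assumed aggregation: mean tag similarity over the n tags of the query."""
    if not query or len(query) != len(query_ext):
        raise ValueError("queries must be non-empty and have the same number of tags")
    return sum(tag_sim(t, e, alpha) for t, e in zip(query, query_ext)) / len(query)


if __name__ == "__main__":
    q = [["title"], ["information", "retrieval"]]
    q_ext = [["title"], ["data", "retrieval"]]
    print(semantic_sim(q, q_ext))
```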

9 Indexing: PRIX (PRüfer sequences for Indexing XML)

10 Indexing: PRIX (PRüfer sequences for Indexing XML)
- AD-Label (Ancestor-Descendant)
- Indexing structure in DB
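For readers unfamiliar with the index, here is a minimal sketch of how a tree could be turned into Prüfer sequences in the PRIX style: number the nodes in postorder, then repeatedly delete the leaf with the smallest number and record its parent. The `(label, children)` tuple encoding and the function name are assumptions made for illustration. The AD-labels and the database layout mentioned on the slide are not covered.

```python
def prufer_sequences(tree):
    """Return (LPS, NPS) for a tree given as (label, [children]).

    LPS is the sequence of parent labels and NPS the sequence of parent
    postorder numbers, in the order the nodes are deleted.
    """
    labels = {}   # postorder number -> label
    parent = {}   # postorder number -> parent's postorder number
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        child_numbers = [visit(child) for child in children]
        counter += 1
        labels[counter] = label
        for number in child_numbers:
            parent[number] = counter
        return counter

    root = visit(tree)
    # With postorder numbering, the smallest-numbered leaf at step i is
    # always node i, so the deletion order is simply 1, 2, ..., root - 1.
    deletion_order = range(1, root)
    nps = [parent[i] for i in deletion_order]
    lps = [labels[parent[i]] for i in deletion_order]
    return lps, nps


if __name__ == "__main__":
    # A has children B and E; B has children C and D.
    example = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
    print(prufer_sequences(example))
```

Note that in this plain form only the labels of non-leaf nodes appear in the label sequence.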

11 Query Processing
Procedure:
- Filtering
  - Based on subsequence matching
  - O(n²m), where n is the number of nodes in the document and m is the number of nodes in the query
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency

12 Subsequence Matching
- Definition
- Example:
  * Good results: media, mult, mm, ted, tia, etc.
- Why does it work?
- It is not enough on its own; more refinements are needed.
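The definition behind the filtering step can be sketched as below. The data string "multimedia" is an assumption: the slide's own example was in a figure that did not survive, but every result listed on the slide is a subsequence of that word. The function only checks the definition for one pair of sequences; it is not the index-based filtering PRIX performs.

```python
def is_subsequence(query, data):
    """True if the items of `query` occur in `data` in the same order,
    not necessarily contiguously."""
    remaining = iter(data)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each later symbol has to be found after the previous one.
    return all(symbol in remaining for symbol in query)


if __name__ == "__main__":
    data = "multimedia"  # assumed example sequence
    for query in ["media", "mult", "mm", "ted", "tia", "dim"]:
        print(query, is_subsequence(query, data))
```

A subsequence match only shows that the labels appear in a compatible order, which is why the refinement steps that follow are still needed.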

13 Refinement #1: Concept of Dummy Nodes
- PRIX offers only a partial match
- Solution: extend PRIX to the leaf level
- Example:
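One way to picture the dummy-node idea, as a hedged sketch: give every leaf a dummy child, so that the original leaves become parents and their labels show up in the label sequence. The dummy label `"#"`, the `(label, children)` tuple encoding, and the helper names are assumptions made for illustration.

```python
DUMMY = "#"  # hypothetical label for dummy nodes


def add_dummy_leaves(node):
    """Return a copy of the tree in which every leaf gets one dummy child."""
    label, children = node
    if not children:
        return (label, [(DUMMY, [])])
    return (label, [add_dummy_leaves(child) for child in children])


def labeled_sequence(tree):
    """Parent labels in postorder of the children (all nodes except the root)."""
    sequence = []

    def visit(node):
        label, children = node
        for child in children:
            visit(child)            # first everything below the child ...
            sequence.append(label)  # ... then the child's own parent label

    visit(tree)
    return sequence


if __name__ == "__main__":
    tree = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
    print(labeled_sequence(tree))                    # leaf labels are missing
    print(labeled_sequence(add_dummy_leaves(tree)))  # leaf labels now appear
```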

14 Refinement #2: Connection vs. Connectionless
- Definition
- How to check it?
- If not connected, then what?
- Solution: apply penalty
- Example (Disconnected By Gap):
- Example (Disconnected By Unknown):

15 Refinement #3: Checking for Gap Consistency
- Gap consistency depends on the gaps of the Prüfer sequence
- How to check it?
- Determines whether the query tree is a subset of the search domain

16 Refinement #4: Checking for Frequency Consistency
- Frequency consistency depends on gap consistency and on the occurrences in the NPS (Numbered Prüfer Sequence)
- How to check it?
- Determines whether the query tree is an exact match in the search domain
- If not frequency consistent, then what?
- Solution: apply penalty

17 Structure Similarity
- Calculations are based on edit distances, which are transformed into penalty values
- Each mismatched node in the structure has a penalty equal to the size of its subtree + 1
- The overall penalty is the dot product of all mismatches
- All results are normalized with respect to the worst-case penalty
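The per-node rule above can be illustrated with a small sketch. It covers only the penalty of a single mismatched node. Two things are assumptions: that "size of subtree" counts the mismatched node together with its descendants, and the `(label, children)` tuple encoding. How the penalties are combined and normalized is not reproduced here, because the slide does not give enough detail.

```python
def subtree_size(node):
    """Number of nodes in the subtree rooted at `node` (assumed to include the node)."""
    _label, children = node
    return 1 + sum(subtree_size(child) for child in children)


def mismatch_penalty(node):
    """Penalty for one mismatched node: size of its subtree + 1."""
    return subtree_size(node) + 1


if __name__ == "__main__":
    mismatched = ("B", [("C", []), ("D", [])])
    print(mismatch_penalty(mismatched))
```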

18 Structural Similarity #1: Connectivity

19 Structural Similarity #2: Gap Similarity

20 Structural Similarity #3: Frequency Similarity

22 Rank returned XML patterns
Similarity(q, q’’) = Semantic_sim(q, q’) × Structure_sim(q’, q’’)
where q’ is an extension of query q and q’’ is a returned XML pattern.
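A minimal sketch of the ranking step, assuming both similarity scores have already been computed for each returned pattern. The tuple layout, the pattern identifiers, and the function name are hypothetical.

```python
def rank_patterns(candidates):
    """Rank patterns by Semantic_sim * Structure_sim, highest first.

    `candidates` is a list of (pattern_id, semantic_sim, structure_sim).
    Returns a list of (pattern_id, overall_similarity).
    """
    scored = [(pid, semantic * structure) for pid, semantic, structure in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    results = [
        ("pattern-1", 1.0, 0.5),  # exact wording, weak structural match
        ("pattern-2", 0.8, 0.9),  # extended wording, strong structural match
        ("pattern-3", 0.9, 0.5),
    ]
    for pattern_id, score in rank_patterns(results):
        print(pattern_id, round(score, 3))
```

Because the two scores are multiplied, a pattern found through a semantically weaker extension can still rank first if it matches the structure well.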

23 Advantages of the approach
- PRIX indexing
  - Faster
  - Captures all structural information
- Similarity-based
  - Structure similarity
  - Semantic similarity

24 Limitations and Extensions
Limitation of PRIX:
- Ordering of nodes
- We need to handle it in query extension
[Figure: example trees over nodes a, b, c]

25 Limitations and Extensions
More limitations of PRIX:
- It is difficult to map intuitive structural similarities between trees to similarities between PRIX sequences
- It is therefore difficult to define the similarity accurately
However:
- Translating tree structures into equivalent sequences, and then doing data mining or similarity matching on those sequences, is a promising direction

26 Limitations and Extensions
Limitations of semantic similarity:
- Too many similar results
However:
- We consider semantic similarity together with structure information
In a broad sense:
- Structure similarity
- Semantic similarity
- Syntax similarity
- Similarity information from co-occurrences of keywords
- Similarity information from user feedback
- Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)
