1 Extending PRIX for Similarity-based XML Query
Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao
2 Agenda
- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting PRIX algorithm
    - Indexing
    - Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion
3 System Architecture Introduction
5 Query Expansion (I)
An example:
- Tags in a sample query: {title, Praveen Rao, information retrieval}
- Keywords: {title, Praveen, Rao, information, retrieval}
- Keyword Extensions: {{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
- Valid Keyword Extensions: {{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
(Continued on the next slide)
6 Query Expansion (II)
- Tag Extensions: {{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}
- Valid Tag Extensions: {{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}
- Query Expansions:
  1. { {title}, {Praveen Rao}, {modern information retrieval} }
  2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} }
  ……
- Valid Queries: { {title}, {Praveen Rao}, {modern information retrieval} }
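The Tag Extensions listed for the tag "information retrieval" look like every combination of one alternative per keyword. A minimal sketch of that step, assuming it is a plain Cartesian product (the function name `tag_extensions` is hypothetical, and the later validity filtering against the data is not shown):

```python
from itertools import product


def tag_extensions(keyword_extensions):
    """Combine one alternative per keyword into candidate tag extensions.

    keyword_extensions: one list of alternatives per keyword of the tag.
    Assumption: candidates are the Cartesian product of the alternatives.
    """
    return [list(combo) for combo in product(*keyword_extensions)]


# Valid keyword extensions for the two keywords of "information retrieval".
candidates = tag_extensions([
    ["data", "entropy", "information"],
    ["retrieval", "recovery"],
])
for candidate in candidates:
    print(candidate)
```

The six candidates correspond to the two-keyword sets on the slide; only those that actually occur in the indexed documents would survive as valid tag extensions.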
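7 Semantic Similarity Computation
Similarity between query q and one of its extensions q'
- t: tag in query q
- t': tag in query q'
- n: number of tags in q
- m: number of keywords in tag t
- $k_i$, $k_i'$: i-th keyword of t and of t'
Keyword-level similarity of $k_i$ and $k_i'$:
- $1$, if $k_i = k_i'$
- $\alpha$ ($0 < \alpha < 1$), if $k_i \neq k_i'$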
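The slide defines n (tags in q), m (keywords in a tag) and a keyword score of 1 on equality and α otherwise, but the aggregate formula is not reproduced here. The sketch below therefore assumes plain averaging over the m keywords of each tag and then over the n tags, and assumes q and q' are aligned tag by tag and keyword by keyword. All function names and the default `alpha=0.5` are hypothetical.

```python
def keyword_sim(k, k_ext, alpha=0.5):
    """1 if the keyword is unchanged, alpha if it was replaced by an extension."""
    return 1.0 if k == k_ext else alpha


def tag_sim(tag, tag_ext, alpha=0.5):
    """Assumption: average the keyword scores over the m keywords of the tag."""
    scores = [keyword_sim(k, k_ext, alpha) for k, k_ext in zip(tag, tag_ext)]
    return sum(scores) / len(tag)


def semantic_sim(q, q_ext, alpha=0.5):
    """Assumption: average the tag scores over the n tags of the query."""
    scores = [tag_sim(t, t_ext, alpha) for t, t_ext in zip(q, q_ext)]
    return sum(scores) / len(q)


q = [["title"], ["Praveen", "Rao"], ["information", "retrieval"]]
q_ext = [["claim"], ["Praveen", "Rao"], ["information", "recovery"]]
print(semantic_sim(q, q_ext, alpha=0.5))  # tag scores 0.5, 1.0, 0.75
```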
9 Indexing: PRIX (PRüfer sequences for Indexing XML)
10 Indexing: PRIX (PRüfer sequences for Indexing XML)
- AD-Label (Ancestor-Descendant)
- Indexing structure in DB
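As a reminder of what gets indexed, here is a minimal sketch of building a Prüfer sequence for a small tree. It assumes nodes are numbered in postorder, so that repeatedly deleting the smallest-numbered leaf removes nodes in the order 1, 2, ..., and the sequence is just the parent of each node in that order. The tuple-based tree encoding and the name `prufer_sequences` are illustrative assumptions, not the system's storage format.

```python
def prufer_sequences(tree):
    """Return (LPS, NPS) for a tree given as (label, [children]).

    NPS holds the postorder number of each deleted node's parent;
    LPS holds the label of that parent.
    """
    labels = {}   # postorder number -> label
    parent = {}   # postorder number -> parent's postorder number
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        child_numbers = [visit(child) for child in children]
        counter += 1
        labels[counter] = label
        for number in child_numbers:
            parent[number] = counter
        return counter

    root = visit(tree)
    nps = [parent[i] for i in range(1, root)]  # the root itself is never deleted
    lps = [labels[p] for p in nps]
    return lps, nps


# Tree: A has children B and D; B has child C.
tree = ("A", [("B", [("C", [])]), ("D", [])])
lps, nps = prufer_sequences(tree)
print(lps)  # ['B', 'A', 'A']
print(nps)  # [2, 4, 4]
```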
11 Query Processing Procedure
- Filtering
  - Based on subsequence matching
  - O(n*n*m): n is the number of nodes in the document; m is the number of nodes in the query
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency
12 Subsequence Matching
- Definition
- Example:
  - Good results: media, mult, mm, ted, tia, etc.
- Why does it work?
- It is not enough on its own; further refinements are needed
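A minimal sketch of the subsequence test used for filtering. The slide's example string is not reproduced here; the word "multimedia" is assumed below only because it is consistent with the listed results. This greedy check finds a single match and is not the full O(n*n*m) procedure that enumerates matches; `match_subsequence` is a hypothetical name.

```python
def match_subsequence(query, document):
    """Return positions in document matching query in order, or None.

    Greedy left-to-right scan: each query symbol is matched at its first
    occurrence after the previous match.
    """
    positions = []
    start = 0
    for symbol in query:
        try:
            index = document.index(symbol, start)
        except ValueError:
            return None
        positions.append(index)
        start = index + 1
    return positions


for candidate in ["media", "mult", "mm", "ted", "tia", "dam"]:
    print(candidate, match_subsequence(candidate, "multimedia"))
```

The same function works on lists of labels, so it can be applied to a labeled Prüfer sequence as well as to a string.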
13 Refinement #1
Concept of Dummy Nodes
- PRIX alone offers only a partial match (leaf labels do not appear in a Prüfer sequence)
- Solution: extend PRIX to the leaf level
- Example:
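To illustrate the dummy-node idea, the sketch below attaches a dummy child to every leaf before building the labeled Prüfer sequence, so that the original leaf labels show up in it. The dummy label `"#"`, the tuple tree encoding, and the function names are assumptions made for this example.

```python
DUMMY = "#"  # hypothetical label for dummy nodes


def add_dummy_leaves(tree):
    """Give every leaf of (label, [children]) a single dummy child."""
    label, children = tree
    if not children:
        return (label, [(DUMMY, [])])
    return (label, [add_dummy_leaves(child) for child in children])


def labeled_prufer_sequence(tree):
    """Parent label of every non-root node, listed in postorder."""
    sequence = []

    def visit(node, parent_label):
        label, children = node
        for child in children:
            visit(child, label)
        if parent_label is not None:
            sequence.append(parent_label)

    visit(tree, None)
    return sequence


tree = ("A", [("B", [("C", [])]), ("D", [])])
print(labeled_prufer_sequence(tree))                    # ['B', 'A', 'A']
print(labeled_prufer_sequence(add_dummy_leaves(tree)))  # ['C', 'B', 'A', 'D', 'A']
```

Without dummy nodes the leaves C and D are invisible to the sequence; with them, every original label appears at least once.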
14 Refinement #2
Connected vs. Disconnected
- Definition
- How to check it?
- If not connected, then what?
- Solution: apply a penalty
- Example (Disconnected By Gap):
- Example (Disconnected By Unknown):
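One simple way to check connectivity, sketched here directly on the tree rather than on the sequences: a set of matched nodes forms one connected piece exactly when a single matched node has its parent outside the set. This is a plain tree-based check substituted for illustration, not the sequence-based test used by PRIX, and the names are hypothetical.

```python
def count_components(matched, parent):
    """Number of connected pieces formed by the matched nodes.

    parent maps each node to its parent; the root has no entry.
    Each piece has exactly one top node whose parent is not matched.
    """
    matched = set(matched)
    return sum(1 for node in matched if parent.get(node) not in matched)


def is_connected(matched, parent):
    return count_components(matched, parent) == 1


# Postorder-numbered tree: 4 is the root with children 2 and 3; 2 has child 1.
parent = {1: 2, 2: 4, 3: 4}
print(is_connected({1, 2, 4}, parent))   # True
print(is_connected({1, 4}, parent))      # False: node 2 is missing in between
print(count_components({1, 3}, parent))  # 2
```

A disconnected match need not be discarded; the number of extra pieces is one possible input to the penalty.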
15 Refinement #3
Checking for Gap Consistency
- Gap consistency depends on the gaps of the Prüfer sequence
- How to check it?
- Determines whether the query tree is a subset of the search domain
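A minimal sketch of a gap-consistency check. The slide does not spell out the definition, so the following is assumed: the gap is the difference between adjacent numbers in a numbered Prüfer sequence, and a query is gap consistent with a matched data subsequence when each pair of corresponding gaps has the same sign and the query gap is no larger in magnitude than the data gap.

```python
def gaps(nps):
    """Differences between adjacent elements of a numbered Prüfer sequence."""
    return [a - b for a, b in zip(nps, nps[1:])]


def sign(x):
    return (x > 0) - (x < 0)


def gap_consistent(query_nps, data_nps):
    """Assumed rule: same sign per gap, and |query gap| <= |data gap|."""
    if len(query_nps) != len(data_nps):
        return False
    for query_gap, data_gap in zip(gaps(query_nps), gaps(data_nps)):
        if sign(query_gap) != sign(data_gap):
            return False
        if abs(query_gap) > abs(data_gap):
            return False
    return True


print(gap_consistent([2, 3], [5, 9]))  # True: gaps -1 and -4
print(gap_consistent([2, 3], [9, 5]))  # False: signs differ
print(gap_consistent([1, 5], [5, 7]))  # False: query gap is larger
```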
16 Refinement #4
Checking for Frequency Consistency
- Frequency consistency depends on gap consistency and on the occurrences of elements in the NPS
- How to check it?
- Determines whether the query tree is an exact match in the search domain
- If not frequency consistent, then what?
- Solution: apply a penalty
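A minimal sketch of a frequency-consistency check, under an assumed definition that the slide does not state: two numbered Prüfer sequences of equal length are frequency consistent when any two positions hold equal numbers in the query exactly when they hold equal numbers in the data. In tree terms, nodes that share a parent in the query must also share a parent in the match.

```python
def frequency_consistent(query_nps, data_nps):
    """Assumed rule: query[i] == query[j] if and only if data[i] == data[j]."""
    if len(query_nps) != len(data_nps):
        return False
    n = len(query_nps)
    for i in range(n):
        for j in range(i + 1, n):
            same_in_query = query_nps[i] == query_nps[j]
            same_in_data = data_nps[i] == data_nps[j]
            if same_in_query != same_in_data:
                return False
    return True


print(frequency_consistent([2, 4, 4], [5, 9, 9]))   # True
print(frequency_consistent([2, 4, 4], [5, 9, 10]))  # False
```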
17 Structure Similarity
- Calculations are based on edit distances, which are transformed into penalty values
- Each mismatched node in the structure has a penalty equal to the size of its subtree + 1
- The overall penalty is the dot product of all mismatches
- All results are normalized with respect to the worst-case penalty
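A rough sketch of how these penalties could be turned into a score. Several points are assumptions, not statements from the slide: "size of subtree" is taken as the number of descendants, the "dot product" is read as the 0/1 mismatch vector multiplied by the per-node penalty vector (so penalties of mismatched nodes are summed), the worst case is every node mismatched, and the final score is 1 minus the normalized penalty.

```python
def node_penalties(children, root):
    """Penalty per node: number of descendants + 1 (assumed reading)."""
    penalties = {}

    def count_descendants(node):
        total = 0
        for child in children.get(node, []):
            total += 1 + count_descendants(child)
        penalties[node] = total + 1
        return total

    count_descendants(root)
    return penalties


def structure_sim(children, root, mismatched):
    """1 - (penalty of mismatched nodes) / (penalty if all nodes mismatch)."""
    penalties = node_penalties(children, root)
    overall = sum(penalties[node] for node in mismatched)
    worst = sum(penalties.values())
    return 1.0 - overall / worst


# Tree: A has children B and D; B has child C.
children = {"A": ["B", "D"], "B": ["C"]}
print(node_penalties(children, "A"))
print(structure_sim(children, "A", {"B"}))  # 0.75
```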
18 Structural Similarity #1: Connectivity
19 Structural Similarity #2: Gap Similarity
20 Structural Similarity #3: Frequency Similarity
22 Rank returned XML patterns
Similarity(q, q'') = Semantic_sim(q, q') * Structure_sim(q', q'')
- q: original query; q': expanded query; q'': returned XML pattern
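The ranking formula can be sketched as follows, assuming both component scores have already been computed and lie in [0, 1]. The tuple layout and the name `rank_patterns` are hypothetical.

```python
def rank_patterns(candidates):
    """Sort candidates by semantic_sim * structure_sim, best first.

    candidates: list of (pattern_id, semantic_sim, structure_sim).
    """
    scored = [(pid, semantic * structure) for pid, semantic, structure in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank_patterns([
    ("p1", 1.0, 0.5),   # exact keywords, weak structural match
    ("p2", 0.75, 1.0),  # expanded keywords, exact structural match
    ("p3", 0.5, 0.5),
])
print(ranked)  # [('p2', 0.75), ('p1', 0.5), ('p3', 0.25)]
```

Because the scores are multiplied, a pattern must do reasonably well on both the semantic and the structural side to rank highly.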
23 Advantages of the approach
- PRIX indexing
  - Faster
  - Captures all structural information
- Similarity-based
  - Structure similarity
  - Semantic similarity
24 Limitations and Extensions
- Limitation of PRIX: ordering of nodes
- We need to handle it in query extension
(Figure: example trees over nodes a, b, c)
25 Limitations and Extensions
- More limitations of PRIX: it is difficult to map intuitive structural similarities between trees to similarities between PRIX sequences, so it is difficult to define the similarity accurately
- However: translating tree structures into equivalent sequences, and then doing data mining or similarity matching on those sequences, is a promising direction
26 Limitations and Extensions
- Limitations of semantic similarity: too many similar results
- However: we consider semantic similarity together with structure information
- In a broad sense, similarity can draw on:
  - Structure similarity
  - Semantic similarity
  - Syntax similarity
  - Similarity information from co-occurrences of keywords
  - Similarity information from user feedback
  - Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)