Effective XML Keyword Search with Relevance Oriented Ranking
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu


Introduction
XML keyword search
–Inspired by IR-style keyword search on the web
–Enables users to access information in an XML database
–XML data is modeled as a rooted, labeled tree
–Recent research efforts: efficiency, effectiveness

Capture the user's search intention
–Identify the target that the user intends to search for
–Infer the predicate constraints that the user intends to search via
Result ranking
–Rank the query results according to their objective relevance to the user's search intention

State of the Art
Search semantics design
–LCA (Lowest Common Ancestor)
  Node v is an LCA of keyword set K = {w_1, w_2, …, w_k} if the sub-tree rooted at v contains at least one occurrence of every keyword in K, after excluding the sub-elements that already contain all keywords in K
–SLCA (Smallest LCA)
  Node v is an SLCA of keyword set K = {w_1, w_2, …, w_k} if
  (1) v is an LCA of K
  (2) no proper descendant of v is an LCA of K
–XSeek
  Infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes
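The SLCA definition can be made concrete with a small brute-force sketch. It assumes each node is identified by a Dewey label written as a tuple (the root is `(1,)`, its second child is `(1, 2)`, and so on) and that `matches` lists, per keyword, the labels of the nodes that directly contain it. The function name and input layout are illustrative, not taken from XKSearch or any other system named in the slides, and the code makes no attempt at the efficiency those systems target.

```python
def slca(matches):
    """Return the SLCA nodes for a keyword query (naive sketch).

    matches: dict mapping each keyword to a list of Dewey labels (tuples)
             of the nodes that directly contain that keyword.
    """
    keywords = set(matches)
    covered = {}  # Dewey label -> keywords found somewhere in its sub-tree
    for kw, labels in matches.items():
        for label in labels:
            # Every prefix of a label is an ancestor-or-self of that node.
            for depth in range(1, len(label) + 1):
                covered.setdefault(label[:depth], set()).add(kw)
    # Nodes whose sub-tree contains every keyword.
    full = {node for node, kws in covered.items() if kws == keywords}
    # Keep only those with no proper descendant that also contains every keyword.
    return sorted(
        n for n in full
        if not any(m != n and m[:len(n)] == n for m in full)
    )


# Toy input: "rock" and "music" co-occur in one deep node, "music" also elsewhere.
print(slca({"rock": [(1, 1, 3, 3, 1)], "music": [(1, 1, 3, 3, 1), (1, 2, 1, 2)]}))
```

On this toy input the only SLCA is the deep node that holds both keywords, which mirrors the later "rock music" example where SLCA returns a single interest node rather than the enclosing customer.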

State of the Art (cont)
Efficient result retrieval
–Designed based on a certain search semantics
–XKSearch, Multiway SLCA, etc.
Result ranking
–XRANK, XKSearch, EASE
–They consider only
  Structural compactness of matching results
  Keyword proximity
  Similarity at node level

Problems Unaddressed
Existing approaches do not address the user's search intention adequately!
–Meaningfulness of query results
  SLCA is less meaningful in many cases
–Keyword ambiguity problems
  1. A keyword can appear both as an XML node type and as the text value of some other node
  2. A keyword can appear in the text values of different XML node types and carry different meanings
Neither SLCA nor XSeek addresses keyword ambiguity well

Problems: Meaningfulness
Keyword query: rock music
–Search intention: find customers interested in rock music (C3)
–SLCA returns: the interest node of C3
[Figure: the storeDB XML tree, with customers C1 to C4 under customers and books B1 and B2 under books]

Problems: Keyword Ambiguity
Q = customer, interest, art
–Ambiguity 1: customer, interest; Ambiguity 2: art
–Intention: find customers whose interest is art
–Less relevant or irrelevant results are returned as well: C1, C3, and B1's title
[Figure: fragment of the storeDB XML tree showing customer C2]

Problems: Keyword Ambiguity (cont)
Q = customer, art
–art can be the value of an interest node (C2, C4), a name node (C3), or a street node (C1) of a customer, or of the title node of a book (B1)
–customer can be the tag name of a customer node, or (part of) the value of the title of B1
–How to rank C1 to C4 and B1?
[Figure: the storeDB XML tree, with customers C1 to C4 under customers and books B1 and B2 under books]

Objectives & Challenges
Challenges
I. How to decide which sub-tree(s) with appropriate node types can capture the information the user desires
II. How to return sub-trees of an appropriate size (i.e. containing enough but not overwhelming information)
III. How to rank those sub-trees by their relevance
Address the following as a single problem
–Search intention identification
–Query result retrieval
–Result ranking
–Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data

Challenges
Difficulty in applying TF*IDF to XML
–An XML DB carries semantic information, while a text DB contains pure text. XML TF*IDF must be aware of the underlying semantics.
–All contents of XML data are stored in leaf nodes only
–What is the analogue of a flat document in XML?
  o A sub-tree, classified according to its prefix path
–The normalization factor is not simply the size of the sub-tree
  o The structure of sub-trees may also affect the ranks

TF*IDF Recap
Rule 1 (IDF): A keyword appearing in many documents should not be regarded as more important than a keyword appearing in a few.
Rule 2 (TF): A document with more occurrences of a query keyword should not be regarded as less important for that keyword than a document that has fewer.
Rule 3: A normalization factor is needed to balance between long and short documents
–Rule 2 discriminates against short documents, which have less chance to contain many occurrences of a keyword.
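The three rules can be seen together in a minimal flat-document scorer. The weighting below (a log-dampened TF, `log(1 + N/df)` for IDF, and the square root of the document length as the normalization factor) is one common textbook variant chosen for illustration; the slides do not say which variant the authors start from, and `tfidf_scores` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_scores(docs, query):
    """Score each document against the query keywords (one common TF*IDF variant)."""
    n = len(docs)
    counts = [Counter(d.lower().split()) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    scores = []
    for c in counts:
        length = sum(c.values())
        norm = math.sqrt(length) if length else 1.0  # Rule 3: length normalization
        score = sum(
            (1 + math.log(c[k]))        # Rule 2: more occurrences never score lower
            * math.log(1 + n / df[k])   # Rule 1: keywords in fewer documents weigh more
            for k in query
            if c[k] > 0
        )
        scores.append(score / norm)
    return scores


docs = ["rock music rock", "art street", "art of music"]
print(tfidf_scores(docs, ["rock"]))
```

Only the first toy document contains "rock", so it is the only one with a non-zero score.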

Our Approach
Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, capturing the hierarchical structure of XML documents by analyzing statistics of the underlying XML data
Major contributions
1. Identify the user's desired search-for node and search-via node(s) in a heuristic way
  –Define XML TF (term frequency) and XML DF (document frequency)
  –Confidence formulas for search-for and search-via candidates
2. Define XML TF*IDF similarity
  –Propose 3 guidelines specifically for XML keyword search
  –Take keyword ambiguity problems into account
3. Design a keyword search engine, XReal

Data Model
Node type
–Two nodes are of the same node type if they share the same prefix path
–E.g. /storeDB/customers/customer/name vs. /storeDB/books/book/publisher/name
Value node: the text values contained in a leaf node
Structural node
Single-valued node type, multi-valued node type
Grouping type: all its children are of the same multi-valued type
[Figure: the storeDB XML tree, with customers C1 to C4 under customers and books B1 and B2 under books]
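To illustrate node types as prefix paths, the sketch below walks a tiny document with Python's standard `xml.etree.ElementTree` and labels every element with its root-to-node tag path. The two-branch toy document is made up to mirror the slide's two `name` paths, and `node_types` is a hypothetical helper, not part of XReal.

```python
import xml.etree.ElementTree as ET


def node_types(root):
    """Yield (prefix_path, element) for every element in the tree."""
    stack = [("/" + root.tag, root)]
    while stack:
        path, elem = stack.pop()
        yield path, elem
        for child in elem:
            stack.append((path + "/" + child.tag, child))


doc = ET.fromstring(
    "<storeDB>"
    "<customers><customer><name>Art Smith</name></customer></customers>"
    "<books><book><publisher><name>Oxford</name></publisher></book></books>"
    "</storeDB>"
)

for path, elem in node_types(doc):
    if elem.tag == "name":
        print(path, "->", elem.text)
```

Both elements are tagged `name`, but their prefix paths differ, so they are two distinct node types.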

XML TF and IDF
XML DF (document frequency)
–The number of T-typed nodes that contain keyword k in their sub-trees in the XML database
–The granularity of similarity measurement is the sub-tree of a certain node type T
XML TF (term frequency)
–The number of occurrences of keyword k in a given value node a in the XML database
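A small sketch of both statistics follows. It counts, for every node type (prefix path) T, how many T-typed nodes contain the keyword in their sub-tree, and counts occurrences inside one value node. Treating a keyword as "contained" when it equals a tag name or a whitespace-separated token of the text is an assumption made here for illustration; `xml_df` and `xml_tf` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def xml_df(root, keyword):
    """Map each node type T to the number of T-typed nodes whose sub-tree contains keyword."""
    df = Counter()

    def visit(elem, path):
        # Assumed matching rule: tag name or a token of the element's own text.
        found = keyword == elem.tag.lower() or keyword in (elem.text or "").lower().split()
        for child in elem:
            # Always recurse so that deeper node types are counted too.
            if visit(child, path + "/" + child.tag):
                found = True
        if found:
            df[path] += 1
        return found

    visit(root, "/" + root.tag)
    return df


def xml_tf(value_node, keyword):
    """Number of occurrences of keyword in the text of a single value node."""
    return (value_node.text or "").lower().split().count(keyword)


doc = ET.fromstring(
    "<storeDB><customers>"
    "<customer><name>Art Smith</name><interests><interest>rock music</interest></interests></customer>"
    "<customer><name>Rock Davis</name><interests><interest>art</interest></interests></customer>"
    "</customers></storeDB>"
)
print(xml_df(doc, "art"))
```

In this toy data, "art" gives a DF of 2 for the customer type but only 1 each for the name and interest types, which is the kind of per-type statistic the confidence formulas build on.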

Infer the Desired Search-For Node
Guidelines: a node type T is considered a desired search-for node if
1. T is intuitively related to every query keyword
2. XML nodes of type T are informative enough to contain enough relevant information
3. XML nodes of type T are not so overwhelming that they contain too much irrelevant information
Confidence of T as the search-for node w.r.t. query q
[Formula: confidence of T as the search-for node w.r.t. q]
–A product instead of a sum is used, to follow the 1st guideline
–The log part is designed to follow the 3rd guideline
–The exponential part is designed to follow the 2nd guideline
–r is a decay factor in (0,1]
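The confidence formula is an image on the slide and is missing from this transcript. One form that fits the annotations (a product over keywords, wrapped in a log, times an exponential in a decay factor r) is log(1 + product of f_k^T over all keywords k in q) * r^depth(T), where f_k^T is the XML DF of k for type T. That exact form, the depth convention, the default r = 0.8, and all the numbers below are assumptions for illustration only.

```python
import math


def search_for_confidence(df_by_keyword, depth, r=0.8):
    """Assumed form: log(1 + prod_k f_k^T) * r**depth(T).

    df_by_keyword: XML DF of each query keyword for node type T.
    depth:         depth of T in the tree (the convention is an assumption).
    r:             decay factor in (0, 1]; 0.8 is an arbitrary choice here.
    """
    product = 1.0
    for f in df_by_keyword.values():
        product *= f  # a single zero DF drives the whole product to zero
    return math.log(1 + product) * r ** depth


# Made-up statistics for Q = customer, art.
customer = search_for_confidence({"customer": 4, "art": 4}, depth=3)
interest = search_for_confidence({"customer": 0, "art": 2}, depth=5)
print(customer, interest)
```

Because of the product, a type that is unrelated to even one keyword (here the hypothetical interest type, which never contains "customer") gets zero confidence, which is how the 1st guideline is enforced.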

Infer the Search-Via Nodes
Infer the structural node to search via
–A structural node n is a good candidate if it is related to as many (but not necessarily all) keywords as possible
–The search-via node type is normally not unique
Infer the individual value node to search via
–Statistics alone are not adequate to infer the likelihood of a value node being (part of) a search-via node
–Capture keyword co-occurrence

[Figure: the storeDB XML tree, with customers C1 to C4 under customers and books B1 and B2 under books]
E.g. Q = customer, name, rock, interest, art
–It is easy to find that name and interest have high confidence to be the search-via nodes
–But it is hard to know whether rock is a value of name or of interest, and whether art is a value of interest or of name
–How to differentiate customer C4 from C3?
Capture keyword co-occurrence

Capture keyword co-occurrence
Proximity factors for a value node v of type k_t containing keyword k
–Given a query q and a certain value node v, if there are two keywords k_t and k in q, s.t. k_t matches the type of an ancestor node of v and k matches a keyword in v
–In-Query distance
  Distance between keyword k and node type k_t in query q
  Favors: k_t appears before k
–Structural distance
  Depth distance between v and the nearest k_t-typed ancestor node of v
–Value-Type distance
  Max of the above two
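The three distances can be sketched as follows. The slide gives no exact formulas, so several choices here are assumptions: the in-query distance is the position gap between k_t and k and is infinite unless k_t appears before k, and the structural distance counts steps from v up to its nearest ancestor tagged k_t. All function names are hypothetical.

```python
import math


def in_query_distance(query, k_t, k):
    """Position gap between k_t and k in the query; infinite unless k_t precedes k (assumed)."""
    if k_t not in query or k not in query:
        return math.inf
    gap = query.index(k) - query.index(k_t)
    return gap if gap > 0 else math.inf


def structural_distance(ancestor_tags, k_t):
    """Steps from value node v up to its nearest ancestor tagged k_t.

    ancestor_tags lists the tags of v's ancestors from the root down to v's parent.
    """
    for steps, tag in enumerate(reversed(ancestor_tags), start=1):
        if tag == k_t:
            return steps
    return math.inf


def value_type_distance(query, ancestor_tags, k_t, k):
    """Max of the in-query and structural distances, as stated on the slide."""
    return max(in_query_distance(query, k_t, k), structural_distance(ancestor_tags, k_t))


query = ["customer", "name", "rock", "interest", "art"]
name_path = ["storeDB", "customers", "customer", "name"]
interest_path = ["storeDB", "customers", "customer", "interests", "interest"]

print(value_type_distance(query, name_path, "name", "rock"))         # rock as a name value
print(value_type_distance(query, interest_path, "interest", "rock")) # rock as an interest value
```

Under these assumptions, "rock" inside a name value is close to its type keyword, while "rock" inside an interest value is not, because interest follows rock in the query. That is the co-occurrence signal the slide relies on to separate C4 from C3.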

Principles of XML keyword search
Principle 1
–When searching for D-typed nodes via a single-valued type V, ideally only the values and structures nested in V-typed nodes can affect the relevance, regardless of the size of other typed nodes nested in D-typed nodes.
–However, TF*IDF similarity in IR normalizes the relevance score of each document w.r.t. its size.
Principle 2 (addresses keyword Ambiguity 2)
–When searching for nodes of type D via a multi-valued type V, the relevance of a D-typed node which contains a query-relevant V-typed node should not be affected (i.e. normalized) too much by other query-irrelevant V-typed nodes.
–Example: for query art, C4 should not be less relevant than C1

Principles of XML keyword search (cont)
Principles 1 and 2
–Especially useful for interpreting a pure keyword query: find the search-via node correctly
Principle 3
–The order of keywords in a query is important to indicate the search intention
–Incorporate the search-via confidence C_via defined before

XML TF*IDF Similarity
To calculate the similarity between the search-for node and the query q
–Base case: similarity between value node a and q
  Apply the original TF*IDF directly, since a contains keywords only, without any structure
  [Formula: base-case similarity, with its IDF, TF, and normalization-factor parts labeled]
–Recursive case: similarity between structural node n and q
  Based on the similarities of its children c and the confidence level of c as the node type to search via

XML TF*IDF Similarity (cont.)
Recursive case
–Intuition 2. An internal node n is relevant to q if n has a child c such that the type of c has high confidence to be a search-via node w.r.t. q (i.e. large C_via(T_c, q)), and c is highly relevant to q (i.e. large sim(q, c)).
–Intuition 3. An internal node n is more relevant to q if n has more query-relevant children, all else being equal.
[Formula: recursive-case similarity]
–Numerator: weighted sum of the similarities of all of n's children and their confidence to be the search-via node
–Denominator: overall weight of node n w.r.t. query q, which essentially plays the role of a normalization factor
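The recursion can be sketched as a bottom-up weighted sum. The slide's actual formula, including the overall weight of n, is an image that is not in this transcript, so the normalization used below (the Euclidean norm of the children's C_via values) is a stand-in chosen only to make the sketch runnable. `Node`, `base_sim`, and the C_via numbers are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    node_type: str
    base_sim: float = 0.0  # base-case similarity; used only when the node has no children
    children: list = field(default_factory=list)


def similarity(node, c_via):
    """Similarity of a node to the query, computed bottom-up.

    c_via maps a node type to its (precomputed) search-via confidence.
    """
    if not node.children:
        return node.base_sim  # base case: value node
    weighted = [
        (similarity(c, c_via), c_via.get(c.node_type, 0.0)) for c in node.children
    ]
    total = sum(sim * conf for sim, conf in weighted)
    # Stand-in normalization (an assumption, not the slide's overall weight).
    norm = math.sqrt(sum(conf * conf for _, conf in weighted))
    return total / norm if norm else 0.0


c_via = {"name": 0.2, "interest": 0.9}  # made-up confidences
c3 = Node("customer", children=[Node("name", 0.8), Node("interest", 0.0)])
c4 = Node("customer", children=[Node("name", 0.0), Node("interest", 0.8)])
print(similarity(c3, c_via), similarity(c4, c_via))
```

With these made-up numbers the two customers have equally relevant children, but the one whose relevant child is of the high-confidence search-via type ranks higher, which is the behaviour Intuition 2 describes.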

Flowchart of answering a query
1. Identify the user's search intention
–Compute the confidence of all possible candidate node types and choose the desired search-for node type T_for
2. Relevance-oriented ranking
–Compute XML TF*IDF similarity in a bottom-up approach, from the value nodes containing keywords up to the nodes of type T_for
–Return a ranked list of sub-trees rooted at nodes of type T_for
If more than one search-for node type has comparable confidence, a ranked list is returned for each of them

Experimental Results
Data sets
–DBLP, XMark, WSU, eBay
Comparison
–Compare XReal with SLCA and XSeek
Equipment
–Implemented in Java
–Run on a 3.6 GHz Pentium IV PC with 1 GB memory and Windows XP
–Berkeley DB Java Edition for storing the keyword inverted lists and the keyword frequency table

Search Effectiveness
Accuracy in inferring the search-for node
–Evaluated by a user survey
–Tested queries contain at least one of the two ambiguity problems
–Conclusion: XReal works well, especially when the search-for node is not given explicitly in the query

Search Effectiveness (cont)
Result effectiveness
–Measured by precision, recall, and F-measure
–Observations
  XReal achieves higher precision than SLCA and XSeek for queries that contain ambiguities
  XReal performs as well as XSeek when queries have no ambiguity in the XML data
  XReal: top-100 precision is higher than overall precision
  F-measure also shows good overall effectiveness of both XReal and XSeek
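For reference, the three measures can be computed from a returned result set and a set of relevant results as below. This uses the standard definitions (the F-measure is the harmonic mean of precision and recall); the result identifiers are made up and this is not the authors' evaluation code.

```python
def precision_recall_f(returned, relevant):
    """Return (precision, recall, F-measure) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


# Made-up example: four results returned, two of them relevant.
print(precision_recall_f(["C2", "C4", "C1", "B1"], ["C2", "C4"]))
```

A system that also returns less relevant results such as C1 and B1 keeps full recall here but loses precision, which is the effect the ambiguity queries are meant to expose.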

Ranking Effectiveness
Metrics
–Number of Top-1 answers that are relevant
–Reciprocal Rank (R-Rank)
–Mean Average Precision (MAP)
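The ranking metrics can be sketched with their standard IR definitions: reciprocal rank is 1 over the position of the first relevant answer, and MAP averages, over queries, the mean of the precision values at each relevant answer's position. The function names and the toy rankings are illustrative only.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant answer, or 0.0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks at which relevant answers appear."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


ranked, relevant = ["C1", "C2", "B1", "C4"], {"C2", "C4"}
print(reciprocal_rank(ranked, relevant), average_precision(ranked, relevant))
```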

Efficiency & Scalability
Compare three index variants for XReal against SLCA
–Dup
  Stores only the Dewey ID and XML TF
–DupType
  Additionally stores the node type (i.e. its prefix path)
–DupTypeNorm
  Additionally stores the normalization factor W_a for each value node

[Figure: result plots for XMark and DBLP]

Q&A
Thank You

[Figure: the storeDB XML tree, with customers C1 to C4 under customers and books B1 and B2 under books]
