Introduction to XML IR — Scoring and Ranking XML Group.

Outline Introduction Related Work Conclusion

Introduction (1) XML IR vs. Traditional IR
XML IR vs. traditional IR
  No fixed retrieval unit vs. document
  Nested document components vs. flat text
It is becoming increasingly popular to publish data on the Web in the form of XML documents.
XML search
  XPath and XQuery
  Keyword search
Knowledge a user may need
  Knowledge of the schema
  Knowledge of a query language (e.g. XQuery)
  Knowledge of the role of the keywords
Queries may not always be precise and can return a large number of results, especially in large document collections.
  Rank the query results so that the most relevant results appear first
XML scoring and ranking
  Score elements with respect to their relevance to a query
  Determine the appropriate level of component granularity to return to users

XML scoring and ranking:
Score elements with respect to their relevance to a query
[Figure: results for the keyword query {Cohen, IR}; more relevant elements receive a higher rank]

XML scoring and ranking:
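Determine the appropriate level of component granularity to return to users
[Figure: the query 'XML' against a book tree with title and sec children containing 'Web' and 'XML'; candidate answers at different levels of the tree: book, sec, title]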

Introduction (2) XML Scoring and Ranking Approaches: Challenges
Traditional approach: weighting terms by TF-IDF values
Challenges
  Term and element statistics: scoring at element level requires statistics at element level
  XML elements are nested
  Capturing the structure of the XML document

Outline
Introduction
Related Work
  Vector-based Scoring
  XRank
  XML Query Relaxation
Conclusion

Vector-based Scoring
  Searching XML Documents via XML Fragments [SIGIR 2003]
  XSEarch: A Semantic Search Engine for XML [VLDB 2003]
  Configurable Indexing and Ranking for XML Information Retrieval [SIGIR 2004]
XRank
XML Query Relaxation
Conclusion

Recall: Vector Space Model in Text Retrieval
Task
  Given a query, compute the relevance score for each document according to the retrieval formula
  Rank the documents according to relevance score
Vector Space Model
  Represent each document and query by a vector of terms
  Relevance between document d and query q is measured by the distance between the two vectors
  Relevance is estimated by similarity measures such as cosine similarity between two objects of the same nature: a query and a document
Weighting terms (TF-IDF values)
  Term frequency (tf)
  Inverse document frequency (idf)
  A standard approach for weighting terms is the term frequency-inverse document frequency (tf-idf) approach
Notes
  Documents are ranked by relevance to a query of typically only a few terms
  Relevance is based on term distribution statistics
  The standard index format for storing term frequencies is the inverted file format
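The ranking task recalled above can be sketched in a few lines of Python. This is a minimal illustration, not the formula of any particular system: the idf variant log(N / df), the whitespace tokenization, and the names tfidf_vectors, cosine, rank and DOCS are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one tf-idf vector (a dict) per tokenized document.

    Assumption: idf(t) = log(N / df(t)); other variants are equally common.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs, best score first."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy collection, already tokenized.
DOCS = [
    "xml retrieval ranks xml elements".split(),
    "web search ranks pages".split(),
    "databases store records".split(),
]

ranked = rank(["xml", "retrieval"], DOCS)
```

Here the first document is the only one containing the query terms, so it is ranked first and the other two receive a score of zero.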

Searching XML Documents via XML Fragments [SIGIR 2003]
Extending the Vector Space Model
  A query term is denoted by (ti, ci): ti is the content, ci is the context leading to ti
  (ti, ci) can be matched with several document terms (ti, ck)
  Example: <article>XML</article> gives the query term (XML, article), which can be matched with (XML, article/title), (XML, article/bdy/sec) and more
  Modified cosine similarity as retrieval function for vague matching of path conditions
Example
  Query: <book> <title> XML </title> </book>, with query context C = /book/title
  Document contexts: C1 = /book/title, C2 = /book/sec/title
  cr1 = 1, cr2 = (1+2)/(1+3) = 0.75
[Figure: document trees book1.xml and book2.xml with book, title and sec nodes containing 'XML' and 'Web']
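The two cr values in the example can be reproduced with a small function. The formula below is inferred from the arithmetic on the slide, (1+2)/(1+3), as (1 + length of query context) / (1 + length of document context); the condition that the query context must be a subsequence of the document context, and the names is_subsequence and context_resemblance, are assumptions of this sketch.

```python
def is_subsequence(short, long):
    """True if the items of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(step in remaining for step in short)


def context_resemblance(query_ctx, doc_ctx):
    """Score how well a document context matches a query context.

    Assumed formula: (1 + |query_ctx|) / (1 + |doc_ctx|) when the query
    path is a subsequence of the document path, and 0 otherwise.
    """
    q = [step for step in query_ctx.split("/") if step]
    d = [step for step in doc_ctx.split("/") if step]
    if not is_subsequence(q, d):
        return 0.0
    return (1 + len(q)) / (1 + len(d))


cr1 = context_resemblance("/book/title", "/book/title")      # 1.0
cr2 = context_resemblance("/book/title", "/book/sec/title")  # 0.75
```

An exact path match scores 1, and every extra step in the document path lowers the score, which is what makes the path matching vague rather than strict.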

XSEarch [VLDB 2003]
Query: tag + keyword, e.g. sec:'XML' title:'search' book:
[Figure: candidate book subtrees with title and sec children containing 'XML', 'search' and 'web']
Result
  Subtrees of a document are extracted using the interconnection index and other indices
Ranking method
  Estimates relevance
  Extension of the tf*idf weighting that is classical in IR
Ranking factors
  Similarity between query and result
  Weight of labels appearing in the result
  Characteristics of the result tree

XSEarch ranking (continued)
Term weights for the example query: w('XML', sec) + w('search', title)
Characteristics of the result tree
  Size of the relationship tree: a small fragment indicates that its nodes are closer and thus probably more related
  Ancestor-descendant (A-D) relationships: an ancestor-descendant relationship between a pair of nodes in a fragment indicates a strong relation between these nodes

Weighted Term Frequency [SIGIR 2004]
Query: //article[about(., 'XML')]
[Figure: an articles collection with two article elements, each with fm and body children; fm holds year and kwd, body holds sec elements; label weights are shown on the tree]
Users configure label weights
  w(fm.kwd) = w(fm) * w(kwd) = 1 * 5 = 5
  w(body.sec) = w(body) * w(sec) = 2 * 1 = 2
Weighted term frequency
  tfw(article, 'XML') = w(fm.kwd) * tf(fm.kwd, 'XML') + w(body.sec) * tf(body.sec, 'XML') = 5*1 + 2*1 = 7
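The arithmetic above can be checked with a short sketch. The label weights and term frequencies are the ones from the slide; the dotted-path representation and the names LABEL_WEIGHTS, path_weight, weighted_tf and TF_XML are assumptions made for illustration.

```python
from math import prod

# Label weights as configured in the example.
LABEL_WEIGHTS = {"fm": 1, "kwd": 5, "body": 2, "sec": 1}


def path_weight(path, label_weights):
    """Weight of a dotted label path = product of its label weights."""
    return prod(label_weights[label] for label in path.split("."))


def weighted_tf(term_freqs, label_weights):
    """Sum of path weight * term frequency over all paths under an element."""
    return sum(
        path_weight(path, label_weights) * tf for path, tf in term_freqs.items()
    )


# tf of 'XML' per path inside one article, as in the example.
TF_XML = {"fm.kwd": 1, "body.sec": 1}

tfw = weighted_tf(TF_XML, LABEL_WEIGHTS)  # 5*1 + 2*1 = 7
```

Raising w(kwd) makes keyword-section occurrences count more than body occurrences without touching the index, which is the point of keeping the weights configurable.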

XRank [SIGMOD 2003]
  From PageRank to ElemRank
XML Query Relaxation
Conclusion

Recall: PageRank [Brin & Page 1998]
Web as directed graph; random walk of a web surfer
  d: probability of following a hyperlink (the damping factor, usually set to 0.85)
  1-d: probability of random jump
  Nd: total number of documents
  Nh(u): the number of out-going hyperlinks from document u
[Figure: example graph with hyperlink edges and random-jump edges, labelled with transition probabilities such as d/3+(1-d)/6 and (1-d)/6]
Intuition behind PageRank
  If a page is referenced by many other pages, it is probably an important page
  A page that is not referenced many times but is referenced by an important page may also be important
  A page's importance is divided evenly among the pages it references
  The importance of a page is measured by its PageRank
  However, if p1 links to an important page, the importance of p1 cannot be determined from that link
From hyperlinks to containment edges in XML
  If a paper element has a high rank, its section elements should naturally have a high rank too, so element rank is propagated forward along containment edges
  If a workshop contains many papers that have high ElemRanks, then the workshop should have a high ElemRank; this corresponds to reverse ElemRank propagation
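The slide text defines the symbols but the formula itself is not in the transcript. A minimal sketch, assuming the standard formulation p(v) = (1-d)/Nd + d * sum over hyperlinks (u, v) of p(u)/Nh(u), is given below; the function name pagerank, the fixed iteration count and the toy graph LINKS are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PageRank on a small graph.

    `links` maps each document to the list of documents it links to.
    Assumptions: every document is a key of `links` and has at least one
    out-going hyperlink, so no special handling of dangling pages is needed.
    """
    nodes = sorted(links)
    n_d = len(nodes)  # Nd: total number of documents
    rank = {v: 1.0 / n_d for v in nodes}
    for _ in range(iterations):
        # Random jump: each document receives (1 - d) / Nd.
        new_rank = {v: (1.0 - d) / n_d for v in nodes}
        for u in nodes:
            targets = links[u]  # len(targets) is Nh(u)
            for v in targets:
                # u passes an equal share of its rank along each hyperlink.
                new_rank[v] += d * rank[u] / len(targets)
        rank = new_rank
    return rank


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

ranks = pagerank(LINKS)
```

In this toy graph page c is linked from both a and b, so it ends up with a higher rank than b, which is linked only from a.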

From PageRank to ElemRank
[Figure legend: hyperlink edge, containment edge, reverse containment edge]
Simple adaptation: do not distinguish between containment and hyperlink edges
  d: probability of following an edge
  1-d: probability of random jump
  Ne: the number of XML elements
  Nc(u): the number of sub-elements of u
  Nh(u): the number of out-going hyperlinks from u
With separate probabilities per edge type
  d1: probability of following a hyperlink
  d2: probability of visiting a sub-element
  d3: probability of visiting the parent
  1-d1-d2-d3: probability of random jump
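The symbols above suggest an iteration of the following shape. This is a sketch under assumptions, since the formula is not in the transcript: it assumes e(v) = (1-d1-d2-d3)/Ne + d1 * sum over hyperlinks (u, v) of e(u)/Nh(u) + d2 * sum over containment edges (u, v) of e(u)/Nc(u) + d3 * sum over reverse containment edges (u, v) of e(u). The values of d1, d2, d3, the function name elemrank and the toy data are illustrative only.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """Iterative ElemRank-style scoring over an XML element graph.

    `children` maps every element to its list of sub-elements (leaves map
    to an empty list); `hyperlinks` maps an element to the elements it
    links to. Scores are not renormalized, so they need not sum to 1.
    """
    elements = sorted(children)
    n_e = len(elements)  # Ne: the number of XML elements
    parent = {c: p for p, subs in children.items() for c in subs}
    rank = {v: 1.0 / n_e for v in elements}
    for _ in range(iterations):
        new_rank = {v: (1.0 - d1 - d2 - d3) / n_e for v in elements}
        for u in elements:
            targets = hyperlinks.get(u, [])
            for v in targets:  # hyperlink edges, split over Nh(u)
                new_rank[v] += d1 * rank[u] / len(targets)
            for v in children[u]:  # containment edges, split over Nc(u)
                new_rank[v] += d2 * rank[u] / len(children[u])
            if u in parent:  # reverse containment edge to the parent
                new_rank[parent[u]] += d3 * rank[u]
        rank = new_rank
    return rank


# Hypothetical workshop with two papers; paper2 cites paper1.
CHILDREN = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["sec1", "sec2"],
    "paper2": ["sec3"],
    "sec1": [], "sec2": [], "sec3": [],
}
HYPERLINKS = {"paper2": ["paper1"]}

scores = elemrank(CHILDREN, HYPERLINKS)
```

In this example paper1 scores higher than paper2: it receives rank through the hyperlink from paper2 and through reverse containment edges from two sections instead of one.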

XRank
XML Query Relaxation
  Structure and Content Scoring for XML [VLDB 2005]
Conclusion

XML Query Relaxation [VLDB 2005]
Motivation: XML data heterogeneity
Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
[Figure: the query tree and several data trees for the same book, built from book, info, title (Great Expectations), author (Dickens) and edition (paperback) nodes; the data trees differ in structure, and one of them has no edition node]

XML Query Relaxation (2)
Tree pattern relaxations [Amer-Yahia et al. EDBT'02]
  Leaf node deletion
  Edge generalization
  Subtree promotion
[Figure: the query tree over book, info, title (Great Expectations), author (Dickens) and edition (paperback), followed by its relaxations, down to book with only author (Dickens)]

Scoring Function for XML Approximate Matches
Exact matches should be scored higher than relaxed matches (idf): score(a) >= score(b)
Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf): score(a) <= score(c)
[Figure: three book answers (a), (b) and (c) built from title (Great Expectations), info, author (Dickens) and edition (paperback) nodes; one answer has no edition node and one has two edition (paperback) nodes]

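Scoring
[Figure: a query over nodes a and b, its relaxations, and a data tree with nodes a, b, c and e]
idf values shown for the query and its relaxations: (1+4)/(1+2) = 1.67, (1+4)/(1+3) = 1.25, and 1
(idf, tf) pairs shown for the example answers: (1.67, 1), (1.67, 2), (1, 1), (1.25, 1)
Lexicographic order on answers: d ≤ d' if idf(d) < idf(d'), or idf(d) = idf(d') and tf(d) ≤ tf(d')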
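The ordering rule compares idf first and uses tf only to break ties, which is exactly how Python compares tuples. A minimal sketch using the four (idf, tf) pairs from the slide follows; the answer names w, x, y, z and the identifiers Answer and rank_answers are hypothetical.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    name: str   # hypothetical label for an answer
    idf: float  # idf of the most specific relaxation the answer satisfies
    tf: int     # number of matches for that relaxation


def rank_answers(answers):
    """Order answers by idf first and by tf on ties, best answer first."""
    return sorted(answers, key=lambda a: (a.idf, a.tf), reverse=True)


ANSWERS = [
    Answer("w", 1.67, 1),
    Answer("x", 1.67, 2),
    Answer("y", 1.0, 1),
    Answer("z", 1.25, 1),
]

ordered = [a.name for a in rank_answers(ANSWERS)]  # ['x', 'w', 'z', 'y']
```

Because idf dominates, an answer to a less relaxed query always outranks an answer to a more relaxed one, no matter how many matches the latter has.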

XRank
XML Query Relaxation
Conclusion

Conclusion (1)
Score values should reflect the relevance of an answer to the user query; higher scores imply a higher degree of relevance.
Queries return document fragments, so the granularity of returned results affects scoring.
For queries containing conditions on structure, the structural conditions may affect scoring.
Existing proposals extend common scoring methods: PageRank or vector-based similarity.

Conclusion (2) Scoring for keyword search
No structure constraint in the query
How to take document structure into account?
How to use semantic information?

Thank you ! Q&A
