Introduction to XML IR — Scoring and Ranking XML Group.

Presentation transcript:

Introduction to XML IR — Scoring and Ranking XML Group

Outline Introduction Related Work Conclusion


Introduction (1)
It is becoming increasingly popular to publish data on the Web in the form of XML documents.
XML search
- XPath and XQuery
- Keyword search
- User knowledge at stake: the schema, a query language (e.g., XQuery), and the role of the keywords
XML IR vs. traditional IR
- No fixed retrieval unit vs. the document as the unit
- Nested document components vs. flat text
Queries may not always be precise and can return a large number of results, especially in large document collections.
- Rank the query results so that the most relevant results appear first
XML scoring and ranking
- Score elements with respect to their relevance to a query
- Determine the appropriate level of component granularity to return to users

XML scoring and ranking: score elements with respect to their relevance to a query
[Figure: candidate results for the keyword query {Cohen, IR}, with the more relevant result given the higher rank.]

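XML scoring and ranking: determine the appropriate level of component granularity to return to users
[Figure: a book tree with title and sec elements containing the text "Web" and "XML"; for the keyword 'XML', the candidate answers range from the whole book down to an individual sec or title.]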

Introduction (2)
XML scoring and ranking approaches: challenges
- Weighting terms: traditional IR uses TF-IDF values
- Term and element statistics: scoring at the element level requires statistics at the element level, and XML elements are nested
- Capturing the structure of the XML document

Outline
- Introduction
- Related Work
  - Vector-based Scoring
  - XRank
  - XML Query Relaxation
- Conclusion

Outline
- Introduction
- Related Work
  - Vector-based Scoring
    - Searching XML Documents via XML Fragments [SIGIR 2003]
    - XSEarch: A Semantic Search Engine for XML [VLDB 2003]
    - Configurable Indexing and Ranking for XML Information Retrieval [SIGIR 2004]
  - XRank
  - XML Query Relaxation
- Conclusion

Recall: Vector Space Model in Text Retrieval
Task
- Given a query, compute a relevance score for each document according to the retrieval formula
- Rank the documents according to their relevance scores
Vector space model
- Represent each document and each query by a vector of terms
- Relevance between a document d and a query q is estimated by a similarity measure between the two vectors, such as cosine similarity; both objects are of the same nature
- Terms are weighted by TF-IDF values: term frequency (tf) and inverse document frequency (idf)
Notes
- Documents are ranked by relevance to a query that typically contains only a few terms
- Relevance is based on term distribution statistics
- The standard approach for weighting terms is tf-idf
- The standard index format for storing term frequencies is the inverted file
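The recap above can be sketched in a few lines of Python. This is a minimal illustration, not the formula used by any of the systems discussed later: the idf variant `log(1 + N/df)`, the function names, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """idf(t) = log(1 + N / df(t)); one of several common idf variants (assumed here)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(1 + n / df_t) for t, df_t in df.items()}

def tfidf(terms, idf):
    """Weight each term by tf * idf; terms unseen in the collection get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(terms).items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (doc_index, score) pairs, most similar document first."""
    idf = idf_table(docs)
    q_vec = tfidf(query, idf)
    scores = [(i, cosine(q_vec, tfidf(doc, idf))) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy collection (hypothetical): each document is a list of tokens.
docs = [["xml", "retrieval", "xml"], ["database", "systems"], ["xml", "database"]]
ranking = rank(["xml", "retrieval"], docs)
```

With this toy collection, the first document shares both query terms and ranks first, while the second shares none and scores zero.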


Searching XML Documents via XML Fragments [SIGIR 2003]
Query: <book> <title> XML </title> </book>, with query context C = /book/title
[Figure: document trees book1.xml and book2.xml, in which "XML" occurs under the contexts C1 = /book/title and C2 = /book/sec/title. Against the query context C, the values shown are cr1 = 1 and cr2 = (1+2)/(1+3) = 0.75.]
Extending the vector space model
- A query term is denoted by (ti, ci), where ti is the content and ci is the context leading to ti
- (ti, ci) can be matched with several document terms (ti, ck)
- Example: <article>XML</article> gives the query term (XML, article), which can be matched with (XML, article/title), (XML, article/bdy/sec), and more
- A modified cosine similarity serves as the retrieval function for vague matching of path conditions
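The two cr values on this slide are consistent with a score of the form (1 + |query context|) / (1 + |document context|), where the length is the number of path steps. The sketch below assumes that reading and further assumes that a query context matches a document context when its steps appear in the same order within the document path; the function name `context_resemblance` is hypothetical.

```python
def context_resemblance(query_ctx, doc_ctx):
    """Score how closely a document context matches a query context.

    Assumption: the query context matches if its steps occur, in order,
    within the document context (extra steps may be inserted between them).
    """
    q_steps = [s for s in query_ctx.split("/") if s]
    d_steps = [s for s in doc_ctx.split("/") if s]
    remaining = iter(d_steps)
    # `step in remaining` consumes the iterator, so order is enforced.
    if all(step in remaining for step in q_steps):
        return (1 + len(q_steps)) / (1 + len(d_steps))
    return 0.0

cr1 = context_resemblance("/book/title", "/book/title")      # exact context
cr2 = context_resemblance("/book/title", "/book/sec/title")  # one extra step
```

Under these assumptions the exact context scores 1 and the longer context scores 0.75, matching the slide; a context with no ordered match scores 0.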


XSEarch [VLDB 2003]
Query: tag + keyword pairs, for example sec: 'XML', title: 'search', book:
[Figure: a books tree with two book subtrees whose title and sec elements contain the words "XML", "search", and "web".]
Result
- Subtrees of a document, extracted using the interconnection index and other indices
Ranking method
- Estimates relevance with an extension of the tf*idf weighting that is classical in IR
Ranking factors
- Similarity between query and result
- Weight of labels appearing in the result
- Characteristics of the result tree

[Figure: two candidate result trees for the query sec: 'XML', title: 'search', book:, each scored as w('XML', sec) + w('search', title).]
Characteristics of the result tree
- Size of the relationship tree: a small fragment indicates that its nodes are closer and thus probably more related
- Ancestor-descendant (A-D) relationships: an ancestor-descendant relationship between a pair of nodes in a fragment indicates a strong relation between these nodes


Weighted Term Frequency [SIGIR 2004]
Query: //article[about(., 'XML')]
[Figure: an articles tree with two article elements, each with fm and body children; fm holds year and kwd elements, body holds sec elements. Text values include 2003, "XML retrieval", 2000, "XML", and "Database".]
Users configure the label weights:
w(fm.kwd) = w(fm) * w(kwd) = 1 * 5 = 5
w(body.sec) = w(body) * w(sec) = 2 * 1 = 2
tfw(article, 'XML') = w(fm.kwd) * tf(fm.kwd, 'XML') + w(body.sec) * tf(body.sec, 'XML') = 5 * 1 + 2 * 1 = 7
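The arithmetic above can be reproduced with a short Python sketch. The label weights and term frequencies are taken from the slide; the function names and the dictionary layout are illustrative assumptions, not the paper's implementation.

```python
from math import prod

# User-configured label weights, as read off the slide's worked example.
label_weight = {"fm": 1, "kwd": 5, "body": 2, "sec": 1}

def path_weight(path):
    """Weight of a dotted label path is the product of its label weights."""
    return prod(label_weight[label] for label in path.split("."))

def weighted_tf(tf_by_path):
    """Sum of path weight times term frequency over all paths in the element."""
    return sum(path_weight(path) * tf for path, tf in tf_by_path.items())

# Term frequencies of 'XML' under each path of the first article (from the slide).
tfw = weighted_tf({"fm.kwd": 1, "body.sec": 1})
```

Here `tfw` evaluates to 5 * 1 + 2 * 1 = 7, the value on the slide. Raising the weight of `kwd` is how a user would make keyword-list hits count more than body-text hits.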

Outline
- Introduction
- Related Work
  - Vector-based Scoring
  - XRank [SIGMOD 2003]
    - From PageRank to ElemRank
  - XML Query Relaxation
- Conclusion

Recall: PageRank [Brin & Page 1998]
- The Web is modeled as a directed graph, and PageRank as the random walk of a web surfer
- d: probability of following a hyperlink (the damping factor, usually set to 0.85)
- 1-d: probability of a random jump
- Nd: total number of documents
- Nh(u): the number of outgoing hyperlinks from document u
[Figure: example graph with hyperlink edges and random-jump edges, annotated with the weights d/3 + (1-d)/6 and (1-d)/6.]
Intuition
- A page that is referenced by many other pages is probably important
- A page that is referenced only a few times, but by an important page, may also be important
- A page's importance is split evenly among the pages it references
- Page importance is measured by its PageRank score; however, if p1 links to an important page, that alone does not tell us how important p1 is
From hyperlinks to containment edges in XML
- Forward propagation: if a paper element has a high rank, its section elements should naturally have a high rank too, so element rank propagates forward along containment edges
- Reverse propagation: if a workshop contains many papers that have high ElemRanks, then the workshop should have a high ElemRank
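With the symbols defined above, the usual PageRank update gives each document v the value (1-d)/Nd plus d times the sum of p(u)/Nh(u) over all documents u that link to v. The following power-iteration sketch assumes that form and assumes every document has at least one outgoing link; the three-page graph and the function name are made up for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration over a dict mapping each document to its outgoing links.

    Assumption: every document has at least one outgoing hyperlink, so no
    special handling of dangling nodes is needed.
    """
    n_d = len(links)                                # Nd: total number of documents
    rank = {u: 1.0 / n_d for u in links}
    for _ in range(iterations):
        new_rank = {v: (1 - d) / n_d for v in links}    # random-jump share
        for u, outgoing in links.items():
            share = d * rank[u] / len(outgoing)         # len(outgoing) is Nh(u)
            for v in outgoing:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, c collects rank from both a and b and therefore ends up above b, which only receives half of a's share.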

From PageRank to ElemRank
Edge types: hyperlink edge, containment edge, reverse containment edge
Without distinguishing between containment and hyperlink edges
- d: probability of following an edge
- 1-d: probability of a random jump
- Ne: the number of XML elements
- Nc(u): the number of sub-elements of u
- Nh(u): the number of outgoing hyperlinks from u
Distinguishing the edge types
- d1: probability of following a hyperlink
- d2: probability of visiting a sub-element
- d3: probability of visiting the parent
- 1-d1-d2-d3: probability of a random jump
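One plausible reading of the four probabilities above can be sketched as an iteration in which each element passes a d1 share along its hyperlinks, a d2 share to its sub-elements, and a d3 share to its parent. This is an illustrative approximation, not the published ElemRank formula: the uniform random-jump term (1-d1-d2-d3)/Ne, the default values of d1, d2, and d3, and the toy workshop tree are all assumptions.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Iterate a PageRank-style update over three edge types.

    children:   dict mapping every element to its list of sub-elements
    hyperlinks: dict mapping an element to the elements it links to
    """
    elements = list(children)
    parent = {c: p for p, subs in children.items() for c in subs}
    n_e = len(elements)                               # Ne: number of XML elements
    rank = {e: 1.0 / n_e for e in elements}
    for _ in range(iterations):
        new_rank = {e: (1 - d1 - d2 - d3) / n_e for e in elements}
        for u in elements:
            for v in hyperlinks.get(u, []):           # hyperlink edges, split by Nh(u)
                new_rank[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children[u]:                     # containment edges, split by Nc(u)
                new_rank[v] += d2 * rank[u] / len(children[u])
            if u in parent:                           # reverse containment edge
                new_rank[parent[u]] += d3 * rank[u]
        rank = new_rank
    return rank

# Hypothetical document: a workshop with two papers, one of which has a section.
tree = {"workshop": ["paper1", "paper2"], "paper1": ["sec1"], "paper2": [], "sec1": []}
er = elemrank(tree, hyperlinks={})
```

In this sketch the scores need not sum to 1, because elements without hyperlinks, sub-elements, or a parent do not pass on their full share. The reverse containment edge is what lets paper1 outrank paper2 here: it receives rank back from its section.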

Outline
- Introduction
- Related Work
  - Vector-based Scoring
  - XRank
  - XML Query Relaxation
    - Structure and Content Scoring for XML [VLDB 2005]
- Conclusion

XML Query Relaxation [VLDB 2005]
Motivation: XML data heterogeneity
Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
[Figure: several book data trees that arrange title (Great Expectations), edition (paperback), info, and author (Dickens) in different ways; one of them has no edition element.]

XML Query Relaxation (2) [Amer-Yahia et al. EDBT'02]
Tree pattern relaxations
- Leaf node deletion
- Edge generalization
- Subtree promotion
[Figure: the query tree over book, title (Great Expectations), edition (paperback), info, and author (Dickens), followed by relaxed versions of it, including one without edition and one reduced to book and author (Dickens).]
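As a rough illustration of the first relaxation, the sketch below enumerates every pattern obtained by deleting exactly one leaf from a tree pattern. The tuple representation, the function name, and the omission of value predicates are simplifying assumptions; the nesting follows the query string given on the previous slide. Edge generalization and subtree promotion are not covered.

```python
def leaf_deletions(pattern):
    """Return all patterns that result from deleting one leaf node.

    A pattern is a (label, children) tuple; the root itself is never deleted.
    """
    label, children = pattern
    relaxed = []
    for i, child in enumerate(children):
        before, after = children[:i], children[i + 1:]
        if not child[1]:
            # The child is a leaf: drop it from this node.
            relaxed.append((label, before + after))
        else:
            # Otherwise delete a leaf somewhere inside the child's subtree.
            for relaxed_child in leaf_deletions(child):
                relaxed.append((label, before + [relaxed_child] + after))
    return relaxed

# book[./info[./title and ./author] and ./edition], without value predicates.
query = ("book", [("info", [("title", []), ("author", [])]), ("edition", [])])
relaxations = leaf_deletions(query)
```

For this query the function yields three relaxed patterns: one without title, one without author, and one without edition. Applying it again to each result gives the next level of relaxation.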

Scoring Function for XML Approximate Matches
- Exact matches should be scored higher than relaxed matches (idf): score(a) >= score(b)
- Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf): score(a) <= score(c)
[Figure: three answers (a), (b), and (c) built from book, info, title (Great Expectations), author (Dickens), and edition (paperback) nodes; answer (b) lacks the edition node, and answer (c) contains edition (paperback) twice.]

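Scoring
[Figure: a query over nodes a and b, its relaxations, and a data tree. The idf values shown are (1+4)/(1+2) = 1.67, (1+4)/(1+3) = 1.25, and 1; the answers are annotated with (idf = 1.67, tf = 1), (idf = 1.67, tf = 2), (idf = 1, tf = 1), and (idf = 1.25, tf = 1).]
Ordering of answers: d ≤ d' if idf(d) < idf(d'), or if idf(d) = idf(d') and tf(d) ≤ tf(d')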
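The ordering rule above is a lexicographic comparison: idf decides first, and tf only breaks ties. A minimal Python sketch follows, using the four (idf, tf) pairs from the slide; the answer names and the function name are hypothetical labels added for the example.

```python
def rank_answers(answers):
    """Sort (name, idf, tf) triples: higher idf first, then higher tf."""
    return sorted(answers, key=lambda answer: (answer[1], answer[2]), reverse=True)

# (idf, tf) pairs as shown on the slide; the names are made up.
answers = [
    ("ans1", 1.67, 1),
    ("ans2", 1.67, 2),
    ("ans3", 1.00, 1),
    ("ans4", 1.25, 1),
]
ordered = [name for name, _, _ in rank_answers(answers)]
```

Python compares tuples element by element, so the key `(idf, tf)` implements the rule directly: ans2 and ans1 share the top idf and are separated by tf, followed by ans4 and then ans3.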


Conclusion (1)
- The score value should reflect the relevance of an answer to the user query; higher scores imply a higher degree of relevance
- Queries return document fragments, so the granularity of the returned results affects scoring
- For queries containing conditions on structure, the structural conditions may affect scoring
- Existing proposals extend common scoring methods: PageRank or vector-based similarity

Conclusion (2)
Scoring for keyword search
- The query carries no structural constraint
- How can document structure be taken into account?
- What about semantic information?

Thank you! Q&A