Introduction to XML IR XML Group.

Presentation transcript:

Introduction to XML IR XML Group

Outline Introduction XML Search XML Scoring and Ranking Conclusion

Introduction Submit keywords Top-k ranking

Introduction
The result is an XML fragment.
Q1: Linda
Q2: Female, Linda
Q3: Female, Researcher
[Figure: XML tree of a Research Institute with Projects, Researcher, Project, Topic, Name, ProjRef and Gender elements; values include "Alice", "Joe", "Linda", "John", "XML", "RDF", Female and Male. A result fragment is highlighted: a Researcher with Name, Gender Female and ProjRef.]

Introduction: Conceptual Model
[Figure: conceptual model of XML retrieval. Documents pass through indexing into a document representation (inverted index; algorithm design). Keywords pass through formulation into a query and a query representation. A retrieval function matches content + structure (scoring and ranking) and produces the retrieval results, with relevance feedback. Results are presented as related components (semantic definition).]

Introduction XML IR Query Semantic Query Processing (XML Search Algorithm) Scoring and Ranking Result representation

Outline
Introduction
XML Search
  XML Search Semantic
  XML Search Algorithms
XML Scoring and Ranking
Conclusion

XML Search Languages
Three classes of XML search languages:
Keyword search: "book xml"
Path expression + keyword search: /book[./title about "xml db"]
XQuery + complex full-text search: for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5

Search Semantic
Tree & Graph (IDRef)
Q: Female, XML
[Figure: the Research Institute XML tree (Projects, Researcher, Project, Topic, Name, ProjRef, Gender) with two candidate result fragments for the query, one labeled "ideal result" and one labeled "wrong result". The fragments involve a Researcher with Name "Linda", Gender Female and a ProjRef, and a Project with Topic "XML".]

Search Semantic
Factors that affect the semantics:
Tree & Graph (whether IDRef is considered)
Relationship between entities
Schema (whether the schema is considered)
Flexibility of the XML structure

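Search Semantic: Related Work
Approaches are classified by data model (Tree, or Graph with IDRef) and by whether the schema is considered.
Schema NO: LCA [ICDE01], XSEarch [VLDB03], XRANK [SIGMOD03], MLCA [VLDB04], SLCA [SIGMOD05], Symmetry [WWW06]
Schema YES: Interconnection [CIKM05], XKeyword [ICDE03]
Annotations in the figure: considers relationships between entities; considers exchange between entities.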

Outline
Introduction
XML Search
  XML Search Semantic
    search on XML Tree
    search on XML Tree considering entity relationship
    search on XML Graph considering schema
  XML Search Algorithms
XML Scoring and Ranking
Conclusion

LCA & SLCA (MLCA)
Q1: Ben, Bit
Q2: Bob, Byte
Q3: Bit, 1999
[Figure: example XML tree marking the result of Q3 under LCA semantics and under SLCA semantics.]
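The difference between the two semantics can be illustrated with a small sketch. The tree, the node names and the keyword matches below are hypothetical (the slide's figure is not available in the transcript); the code only assumes the usual definitions: an LCA result is the lowest common ancestor of one match per keyword, and an SLCA is an LCA with no other LCA below it.

```python
from itertools import product

# Hypothetical tree (child -> parent); NOT the tree from the slide's figure.
PARENT = {
    "classA": "school", "classB": "school",
    "titleA": "classA", "yearA": "classA",
    "titleB": "classB", "yearB": "classB",
}
# Hypothetical keyword matches: keyword -> nodes whose text contains it.
MATCHES = {"Bit": ["titleA", "titleB"], "1999": ["yearA"]}


def ancestors(node):
    """Path from node up to the root, node first (a node counts as its own ancestor)."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(nodes):
    """Lowest common ancestor of a non-empty list of nodes."""
    common = ancestors(nodes[0])
    for n in nodes[1:]:
        up = set(ancestors(n))
        common = [a for a in common if a in up]
    return common[0]  # entries are ordered deepest first


def all_lcas(keywords):
    """LCAs of every combination that picks one match per keyword."""
    return {lca(list(c)) for c in product(*(MATCHES[k] for k in keywords))}


def slcas(keywords):
    """Keep only the LCAs that have no other LCA as a descendant."""
    found = all_lcas(keywords)
    return {a for a in found
            if not any(b != a and a in ancestors(b) for b in found)}


print(sorted(all_lcas(["Bit", "1999"])))  # LCA semantics
print(sorted(slcas(["Bit", "1999"])))     # SLCA semantics
```

In this toy tree the LCA semantics also returns the root, because a "Bit" match in one class and the "1999" match in the other only meet there, while SLCA keeps only the more specific class element. This brute-force enumeration is for illustration and is not the algorithm of the cited papers.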

XSEarch [VLDB03]
Find papers by Vianu on the topic of "logical databases".
How can we find such papers?

Standard Search Engine A document containing some of the three query terms is considered as a result.

This does not work!!!
The document contains the three query terms. Hence, it is returned by a standard search engine. BUT it does not contain any paper on "logical databases" by Vianu, so the document is not relevant to the query.
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
The first fragment does not represent a paper by Vianu. The second fragment does not represent a paper about logical databases.

Relationship Trees
The relationship tree of nodes n1, ..., nk is the subtree T of the document D, such that T is rooted at the lowest common ancestor (lca) of n1, ..., nk, and T consists of the k paths from the lca to n1 through nk.
[Figure: relationship tree of n1, n2, ..., nk, rooted at their lowest common ancestor.]

XSEarch: A Semantic Search Engine for XML
Nodes n1, ..., nk are interconnected if either
the relationship tree of n1, ..., nk does not contain two nodes with the same label, or
the only nodes with the same label in the relationship tree of n1, ..., nk are among n1, ..., nk.

Example (1)
[Figure: document tree (proceedings with inproceedings elements, each with author and title; values Moshe Y. Vardi, Querying Logical Databases, Victor Vianu, A Web Odyssey: From Codd to XML), showing the relationship tree rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to different inproceedings entities. They ARE NOT interconnected!

Example (2)
[Figure: the same document tree, showing the relationship tree rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to the same inproceedings entity. They ARE interconnected!

Example (3)
[Figure: document tree (proceedings with two inproceedings elements; one with title Querying Logical Databases and author Moshe Y. Vardi, one with title Queries and Computation on the Web and authors Victor Vianu and Serge Abiteboul), showing the relationship tree rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected.
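The interconnection test from the XSEarch slides can be sketched in a few lines. This is a minimal illustration that assumes the definition exactly as stated in the transcript; the document is modeled on Example (3), and the helper names (`path_to_root`, `relationship_tree`, `interconnected`) are hypothetical, not XSEarch's actual implementation.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <author>Serge Abiteboul</author>
    <title>Queries and Computation on the Web</title>
  </inproceedings>
</proceedings>
""")
# ElementTree has no parent pointers, so build a child -> parent map.
PARENT = {child: parent for parent in DOC.iter() for child in parent}


def path_to_root(node):
    """Nodes from `node` up to the document root, `node` first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def relationship_tree(nodes):
    """All nodes on the paths from the lca of `nodes` down to each of them."""
    paths = [path_to_root(n) for n in nodes]
    lca = next(a for a in paths[0] if all(a in p for p in paths[1:]))
    tree = []
    for p in paths:
        for a in p[: p.index(lca) + 1]:
            if a not in tree:
                tree.append(a)
    return tree


def interconnected(nodes):
    """True unless two nodes share a label and are not both among `nodes`."""
    tree = relationship_tree(nodes)
    for i, a in enumerate(tree):
        for b in tree[i + 1:]:
            if a.tag == b.tag and not (a in nodes and b in nodes):
                return False
    return True


authors = DOC.findall(".//author")  # Vardi, Vianu, Abiteboul (document order)
titles = DOC.findall(".//title")    # Querying ..., Queries and Computation ...

print(interconnected([authors[1], titles[0]]))   # Vianu with Vardi's title
print(interconnected([authors[1], titles[1]]))   # Vianu with his own title
print(interconnected([authors[1], authors[2]]))  # two authors of one paper
```

The first pair is rejected because its relationship tree passes through two distinct inproceedings nodes; the last pair is accepted even though both nodes are labeled author, since they are themselves the nodes being tested.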

Interconnection Semantics for Keyword Search in XML [CIKM05]
ID references are ignored. That is, documents are always trees.
The schema is ignored. Therefore, missing information is not taken into account.

Keyword-Search Example
{Cohen, IR}

A Result
{Cohen, IR}
Cohen and IR are in the same department.

Another Result
{Cohen, IR}
Cohen and IR are in the same department, and Cohen wrote an article about IR.
This fragment should have a higher rank.
Identifying meaningful relationships can improve ranking in keyword search.

A Schema Defines Document Structure The root of the schema is the label of the root of the document

A Schema Defines Document Structure An edge in the document is allowed only if an edge between the corresponding labels appears in the schema
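The two schema conditions above (the root label must match, and every document edge must correspond to a schema edge between the same labels) can be checked mechanically. The sketch below assumes a schema given as a root label plus a set of allowed (parent label, child label) pairs; the concrete labels are hypothetical and only loosely follow the department/publications example used in this part of the deck.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: root label and allowed (parent label, child label) edges.
SCHEMA_ROOT = "department"
SCHEMA_EDGES = {
    ("department", "name"),
    ("department", "publications"),
    ("publications", "article"),
    ("article", "title"),
    ("article", "author"),
}


def conforms(doc_root):
    """Check the root label and that every document edge is a schema edge."""
    if doc_root.tag != SCHEMA_ROOT:
        return False
    return all((parent.tag, child.tag) in SCHEMA_EDGES
               for parent in doc_root.iter()
               for child in parent)


good = ET.fromstring(
    "<department><name>CS</name><publications><article>"
    "<title>IR</title><author>Cohen</author>"
    "</article></publications></department>")
# Same labels, but title hangs directly under department: not a schema edge.
bad = ET.fromstring("<department><title>IR</title></department>")

print(conforms(good), conforms(bad))
```

Only labels and edges are checked here; cardinalities, ordering and data types, which real schema languages also constrain, are outside this sketch.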

In the formal framework, patterns are the basic building blocks.
Formally, a pattern is a pair (L, C):
L is a set of labels, e.g. {title, publication, author}
C is a tree of labels
C contains L
C has no redundant edges

Interconnection by Patterns
A set O of objects is interconnected if the objects are in a tree that is isomorphic to the pattern.
[Figure: a pattern ({title, publication, author}, C), with C drawn as a tree.]
Speaker notes: Now I'll define when a pattern interconnects a set of objects. This is a formal definition; I will explain it by an example.

Interconnection by Patterns
[Figure: the pattern ({title, publication, author}, C) next to a document tree.]

Interconnection by Patterns
[Figure: the pattern ({title, publication, author}, C) next to a document tree, with the matching objects marked "Interconnected".]
… so a pattern represents a specific meaningful relationship.

Interconnection Semantics
An interconnection semantics P is a set of patterns.
A set of objects is interconnected by P if it is interconnected by a pattern of P.
[Figure: two patterns with the label set {title, name} and different trees.]

The Subtrees of {title,publication}

The Subtrees of {title,author}

The Subtrees of {title,author}
The author did not actually write the paper.
Is this what we mean?

The Interconnection Semantics Puca
Intuitively, a pattern p is structurally minimal if internal nodes in C cannot be roots of trees containing L.
The semantics Puca(S) is the set of all structurally minimal patterns.

One Structurally Minimal Pattern
({title, author}, C) [Figure: pattern tree C rooted at article.]
In the schema, article is the only common ancestor of {title, author}.

Another Structurally Minimal Pattern
({title, author}, C) [Figure: pattern tree C rooted at inproc.]
In the schema, inproc. is the only common ancestor of {title, author}.

A Third Structurally Minimal Pattern
({title, author}, C) [Figure: a further pattern tree C.]
In the schema, inproc. is the only common ancestor of {title, author}.

Not a Structurally Minimal Pattern
({title, author}, C) [Figure: pattern tree C containing department, publications and inproc.]
In the schema, department, publications and inproc. are all common ancestors of {title, author}.

Back to the Document

Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection!
[Figure: document subtree matching a structurally minimal pattern ({title, author}, C); the slide is shown twice with different subtrees.]

Not Puca(S)-Interconnected
This subtree does not show Puca(S)-interconnection!
[Figure: document subtree connecting a title and an author that matches no structurally minimal pattern ({title, author}, C).]

Keyword Proximity Search on XML Graphs [ICDE03]
Input: a set of keywords.
Results: trees of XML fragments (called target objects) that contain all the keywords, ranked according to their size.
The approach assumes that a schema exists; the schema facilitates the presentation of the results and is used to optimize the performance of the system.

name[John] - person - supplier - lineitem - linepart - product - descr[set of VCR and DVD], size 6
name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR], size 8

Query semantics
Result: the set of all possible Minimal Total Target Object Networks (MTTONs).
What is an MTTON?
Node network j: an acyclic subgraph of G, such that each edge in j is an edge in G.
Total node network j of keywords {k1, …, km}: a node network where every keyword is contained in at least one node n of j.
Minimal Total Node Network (MTNN): a total node network j from which no node can be removed such that j is still a total node network.
Score: number of edges.
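The definitions above can be made concrete with a short sketch. It assumes that a node network is a connected tree given as a list of edges, so that only leaf nodes can be removed without disconnecting it; that assumption, the node identifiers and the helper names are illustrative, not taken from the XKeyword system. The two example networks follow the paths listed on the previous slide for the keywords John and VCR.

```python
def chain(labels):
    """Edges of a simple path through the given node identifiers."""
    return list(zip(labels, labels[1:]))


def nodes_of(edges):
    return {n for edge in edges for n in edge}


def covers(nodes, contains, keywords):
    """True if every keyword is contained in at least one of the nodes."""
    found = set()
    for n in nodes:
        found |= contains.get(n, set())
    return set(keywords) <= found


def leaves(edges):
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return [n for n, d in degree.items() if d == 1]


def is_minimal_total(edges, contains, keywords):
    """Total, and no leaf can be dropped while staying total.

    Assumes the network is a connected tree, so leaves are the only
    nodes whose removal keeps it a node network.
    """
    nodes = nodes_of(edges)
    if not covers(nodes, contains, keywords):
        return False
    return not any(covers(nodes - {leaf}, contains, keywords)
                   for leaf in leaves(edges))


def score(edges):
    return len(edges)  # score = number of edges, as on the slide


# Which node contains which keyword (hypothetical identifiers).
CONTAINS = {"name[John]": {"John"}, "descr": {"VCR", "DVD"}, "name[VCR]": {"VCR"}}
n1 = chain(["name[John]", "person", "supplier", "lineitem",
            "linepart", "product", "descr"])
n2 = chain(["name[John]", "person", "supplier", "lineitem",
            "linepart", "part#1", "subpart", "part#2", "name[VCR]"])

for net in sorted([n2, n1], key=score):
    print(score(net), is_minimal_total(net, CONTAINS, ["John", "VCR"]))
```

Sorting by ascending score puts the size-6 network first, on the assumption that smaller networks rank higher. Appending a dangling edge such as ("descr", "extra") keeps the network total but makes it non-minimal, because the new leaf contributes no keyword.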

Outline
Introduction
XML Search
  XML Search Language
  XML Search Algorithms
    XRank
XML Scoring and Ranking
Conclusion

Main Issue Given: Query keywords Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order

Naïve Method
Naïve inverted lists:
Ricardo: 1; 5; 6; 8
XQL: 1; 5; 6; 7
Element IDs in the example document:
1 <workshop>
  2 date: 28 July …
  3 <title>: XML and …
  4 <editors>: David Carmel …
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and …
      8 <author>: Ricardo …
    <paper> …
Problems:
1. Space overhead
2. Spurious results
Main issue: decouples representation of ancestors and descendants.
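Both problems of the naïve method show up in a few lines of code. The sketch assumes the element numbering implied by the slide's inverted lists (workshop = 1, proceedings = 5, paper = 6, title = 7, author = 8); the function names are hypothetical.

```python
# Parent links for the elements that matter here (child id -> parent id),
# as implied by the slide's lists: 8 and 7 sit under 6, 6 under 5, 5 under 1.
PARENT = {5: 1, 6: 5, 7: 6, 8: 6}
# Elements whose own text directly contains the keyword.
DIRECT = {"XQL": [7], "Ricardo": [8]}


def naive_list(keyword):
    """Naive inverted list: the matching elements plus ALL their ancestors."""
    ids = set()
    for node in DIRECT[keyword]:
        while node is not None:
            ids.add(node)
            node = PARENT.get(node)
    return sorted(ids)


def naive_query(keywords):
    """Intersect the naive lists of all keywords."""
    lists = [set(naive_list(k)) for k in keywords]
    return sorted(set.intersection(*lists))


print(naive_list("Ricardo"))            # four entries for a single occurrence
print(naive_list("XQL"))
print(naive_query(["XQL", "Ricardo"]))  # paper 6 plus its ancestors 5 and 1
```

Each occurrence is stored once per ancestor (space overhead), and the intersection returns not only the paper element but also the proceedings and workshop elements above it (spurious results). Plain integer IDs carry no information about which entry is an ancestor of which, so the list alone cannot filter them out.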

Dewey Encoding of IDs [1850s]
0 <workshop>
  0.0 date: 28 July …
  0.1 <title>: XML and …
  0.2 <editors>: David Carmel …
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and …
      0.3.0.1 <author>: Ricardo …
    0.3.1 <paper> …
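With Dewey IDs, the ancestor relationship is encoded in the ID itself: an element's ID is a prefix of the IDs of all its descendants, so the lowest common ancestor of two elements is the longest common prefix of their IDs. A minimal sketch using the IDs from the slide (the helper names are hypothetical):

```python
def parse(dewey):
    """'0.3.0.1' -> (0, 3, 0, 1)"""
    return tuple(int(part) for part in dewey.split("."))


def fmt(components):
    return ".".join(str(c) for c in components)


def is_ancestor(a, d):
    """True if element `a` is a proper ancestor of element `d`."""
    a, d = parse(a), parse(d)
    return len(a) < len(d) and d[: len(a)] == a


def lca(*ids):
    """Lowest common ancestor = longest common prefix of the Dewey IDs.

    Returns '' when the IDs share no prefix at all.
    """
    prefix = []
    for components in zip(*(parse(i) for i in ids)):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return fmt(prefix)


print(lca("0.3.0.0", "0.3.0.1"))      # <title> and <author> of the first paper
print(is_ancestor("0.3", "0.3.0.1"))  # <proceedings> above that <author>
```

Here the title 0.3.0.0 and the author 0.3.0.1 meet at 0.3.0, the paper element, without consulting the document tree.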

XRank: Dewey Inverted List (DIL)
Keyword   Dewey Id    Score   Position List
XQL       5.0.3.0.0   85      32
          8.0.3.8.3   38      89 91
          …           …       …
Ricardo   5.0.3.0.1   82      38
          8.2.1.4.2   99      52
          …           …       …
Each list is sorted by Dewey Id.
Store IDs of elements that directly contain the keyword: this avoids the space overhead.
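A Dewey inverted list can be sketched as one list per keyword, sorted by Dewey ID compared component by component (as tuples of integers; comparing the IDs as strings would put "10.1" before "9.2"). The postings below are the ones shown on the slide; `build_dil` and `entries_under` are hypothetical helper names, and the binary search only stands in for an index over Dewey IDs.

```python
import bisect

# (keyword, Dewey ID, score, position list), values as shown on the slide.
POSTINGS = [
    ("XQL", "8.0.3.8.3", 38, [89, 91]),
    ("XQL", "5.0.3.0.0", 85, [32]),
    ("Ricardo", "8.2.1.4.2", 99, [52]),
    ("Ricardo", "5.0.3.0.1", 82, [38]),
]


def parse(dewey):
    return tuple(int(part) for part in dewey.split("."))


def build_dil(postings):
    """keyword -> entries (dewey_tuple, score, positions), sorted by Dewey ID."""
    dil = {}
    for keyword, dewey, score, positions in postings:
        dil.setdefault(keyword, []).append((parse(dewey), score, positions))
    for entries in dil.values():
        entries.sort(key=lambda entry: entry[0])
    return dil


def entries_under(entries, prefix):
    """Entries for the element `prefix` itself or any of its descendants."""
    keys = [entry[0] for entry in entries]
    p = parse(prefix)
    lo = bisect.bisect_left(keys, p)
    # The smallest ID that is no longer inside the subtree of `prefix`.
    hi = bisect.bisect_left(keys, p[:-1] + (p[-1] + 1,))
    return entries[lo:hi]


dil = build_dil(POSTINGS)
print(dil["XQL"])
print(entries_under(dil["Ricardo"], "5.0.3"))
```

Because only the directly containing elements are stored, each occurrence appears once, and all entries below a given element form one contiguous run of the sorted list.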

XRank: Ranked Dewey Inverted List (RDIL)
[Figure: for each keyword (e.g. XQL), an inverted list sorted by score, with a B+-tree on Dewey Id over its entries; the same structure is kept for the other keywords.]

RDIL: Algorithm
An element may be ranked highly in one list and low in another list.
The B+-tree helps search for low-ranked elements.
When to stop scanning the inverted lists?
Based on the Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold.
Can stop if we have sufficient results above the threshold.
Extension to most specific results.
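The stopping rule can be illustrated with a generic sketch of the Threshold Algorithm. It makes several simplifying assumptions that are not from the slides: the same object ID appears in every keyword's list, the overall score is the plain sum of the per-list scores, a missing entry counts as 0, and a dictionary lookup stands in for the random access that RDIL performs through its B+-tree. RDIL additionally has to find common ancestors of different elements, which is not modeled here.

```python
import heapq


def threshold_top_k(lists, k):
    """lists: one list per keyword of (object_id, score), sorted by score descending."""
    lookup = [dict(lst) for lst in lists]  # stand-in for random access
    seen = {}                              # object_id -> total score
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        last_scores = []
        for lst in lists:
            if depth < len(lst):
                obj, score = lst[depth]
                last_scores.append(score)
                if obj not in seen:
                    # Random access: fetch this object's score in every list.
                    seen[obj] = sum(lk.get(obj, 0) for lk in lookup)
            else:
                last_scores.append(0)
        # No unseen object can score higher than this threshold.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # stop early: k results at or above the threshold
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical lists: e1 is ranked highly in one list and low in the other.
XQL_LIST = [("e1", 90), ("e2", 80), ("e3", 10)]
RICARDO_LIST = [("e2", 70), ("e3", 60), ("e1", 20)]

print(threshold_top_k([XQL_LIST, RICARDO_LIST], 1))
```

For k = 1 the scan stops after the second position of each list: e2 already totals 150, while the threshold has dropped from 160 to 140, so nothing further down the lists can overtake it.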

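RDIL: Query Processing
[Figure: query processing over the inverted lists for Ricardo and XQL, each with a B+-tree on Dewey Id, using a temp heap and an output heap. Entries shown include P: 9.0.4.2.0 (Ricardo) and R: 9.0.4.1.2 (XQL), combined as Rank(9.0.4); other Dewey Ids shown are 8.2.1.4.2, 9.0.5.6 and 10.8.3. Two threshold formulas appear: threshold = Score(P) + Max-Score and threshold = Score(P) + Score(R).]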

Outline
Introduction
XML Search
  XML Search Language
  XML Search Algorithms
XML Scoring and Ranking
  PageRank -> XRank
  INEX
  Query Relaxation
  …
Conclusion

Q&A Thanks!