Keyword Search over XML

Presentation transcript:

1 Keyword Search over XML

2 Inexact Querying
Until now, our queries have been complex patterns, represented by trees or graphs.
Such query languages are not appropriate for the naive user:
– if XML “replaces” HTML as the web standard, users can’t be expected to write graph queries
Allow keyword search over XML!

3 Keyword Search
A keyword search is a list of search terms.
There can be different ways to define legal search terms. Examples:
– label:keyword, e.g., author:Smith
– keyword, e.g., :Smith
– label, e.g., author:
– value (without distinguishing between keywords and labels)
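The term syntax above can be made concrete with a small parser. This is only a sketch: the names `SearchTerm`, `parse_term` and `parse_query` are hypothetical, and the assumption that terms in a query are separated by commas is taken from the example queries on the following slides.

```python
from typing import NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str]      # element name to match, e.g. "author"
    keyword: Optional[str]    # text to match inside an element, e.g. "Smith"
    any_value: Optional[str]  # bare value: may match a label or a keyword


def parse_term(term: str) -> SearchTerm:
    """Parse 'label:keyword', ':keyword', 'label:' or a bare value."""
    if ":" not in term:
        # No colon: the term does not distinguish labels from keywords.
        return SearchTerm(None, None, term)
    label, keyword = term.split(":", 1)
    # An empty side of the colon means "not constrained".
    return SearchTerm(label or None, keyword or None, None)


def parse_query(query: str) -> list[SearchTerm]:
    """Split a comma-separated query (an assumed convention) into terms."""
    return [parse_term(t.strip()) for t in query.split(",") if t.strip()]
```

For instance, `parse_query(":XML, author:")` yields one keyword-only term and one label-only term.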

4 Challenges (1)
Determining which part of the XML document corresponds to an answer:
– When searching HTML, the result units are usually documents
– When searching XML, a finer granularity should be returned, e.g., a subtree

5 What should be returned for the query :ACID, :Kempster ?

6 Challenges (2)
Avoiding the return of elements that are not meaningfully related:
– XML documents often contain many unrelated fragments of information. Can these information units be recognized?

7 What should be returned for the query :XML, author: ?

8 What should be returned for the query :XML, :Kempster ?

9 Challenges (3)
Ranking mechanisms:
– How should document fragments/XML elements be ranked? Ideas?

10 In what order should the answers be returned for :ACID, author: ?

11 Defining a Search Semantics
When defining a search over XML, all previous challenges must be considered. We must decide:
– what portions of a document are a search result?
– should any results be filtered out since they are not meaningful?
– how should ranking be performed?
Typically, research focuses on one of these problems and provides simple solutions for the other problems.

12 Topics Discussed
XRank: the paper presents a variation of PageRank for ranking XML elements
– focus on ranking
Interconnection Semantics: methods to determine whether a set of nodes is meaningfully related
– focus on filtering out meaningless results

13 XRank: Ranked Keyword Search over XML Documents. Guo, Shao, Botev, Shanmugasundaram. SIGMOD 2003

14 Queries and their Semantics
Queries are keywords k_1, …, k_n, as in a search engine.
Query results are portions of XML documents that contain all the keywords. Formally:
– Let v be a node in the document. To determine whether v should be returned: first, “remove” any descendants of v that contain all the keywords k_1, …, k_n. If v still contains all of k_1, …, k_n, then v should be a result of the search.
– Intuition: only return v if no more specific element can be returned.
Note: containment is via child edges, not IDREF edges.
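The definition above can be checked on a toy tree with a direct recursive implementation. This is a minimal sketch, not the paper's algorithm: the `Node` class, the whitespace tokenisation and the sample document are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def search(root: Node, keywords: list[str]) -> list[str]:
    """Return names of result nodes, most deeply nested results first."""
    wanted = {k.lower() for k in keywords}
    results: list[str] = []

    def visit(node: Node) -> set[str]:
        # Keywords occurring directly in this node's own text.
        found = wanted & set(node.text.lower().split())
        has_complete_child = False
        for child in node.children:
            child_all = visit(child)
            if child_all == wanted:
                # This subtree already contains every keyword, so it is
                # "removed" before deciding whether `node` is a result.
                has_complete_child = True
            else:
                found |= child_all
        if found == wanted:
            results.append(node.name)
        # Report what the whole subtree contains to the parent.
        return wanted if has_complete_child else found

    visit(root)
    return results


# Hypothetical document, loosely modelled on the running example.
doc = Node("proceedings", children=[
    Node("paper1", children=[
        Node("title1", "XQL and Proximal Nodes"),
        Node("abstract1", "We consider the language"),
        Node("section1", "At first sight the XQL language looks"),
    ]),
    Node("paper2", children=[Node("title2", "Querying XML in Xyleme")]),
])
hits = search(doc, ["XQL", "language"])
```

Here `section1` is a result because its own text has both keywords, and `paper1` is also a result because its title and abstract still cover both keywords once `section1` is set aside. `proceedings` is not a result, since nothing is left after `paper1` is removed.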

15 [Figure: example XML document. Workshop “XML and Information Retrieval: A SIGIR 2000 Workshop” (David Carmel, Yoelle Maarek, Aya Soffer); paper “XQL and Proximal Nodes” (Ricardo Baeza-Yates, Gonzalo Navarro) with the text “We consider the language …” and “At first sight, the XQL language looks …”; paper “Querying XML in Xyleme” …] What should be returned for the query XQL language?

16 [Figure: the same example XML document as on the previous slide] What should be returned for the query XQL language?

17 Ranking Results: Intuition
Granularity of ranking:
– In HTML, there is a rank for each document
– In XML, we want a rank for each element. Different elements in the same document may have different ranks
The proposal is to extend ideas used for ranking HTML:
– PageRank: documents with more incoming links are more important (recursive definition)
– Proximity: if the document contains the search terms close together, then the document is more important
Overall rank: a combination of PageRank and proximity

18 [Figure: the example XML document, with the papers “XQL and Proximal Nodes” and “Querying XML in Xyleme”] Should both papers be ranked the same?

19 Topics
We discuss:
– Ranking
– The Index Structure
– Query Processing

20 Ranking Results
Take into consideration:
– hyperlinks
– proximity
Here we only discuss ranking by the linking structure. Ranking by proximity can easily be defined (ideas?).
What kinds of “links” are there in a graph of XML documents?

21 [Figure: the example XML document] Child/Parent “links”

22 [Figure: the example XML document] IDREF “links”

23 [Figure: the example XML document] XLink “links” (out of the document)

24 Remember: PageRank
d: probability of following a hyperlink
1-d: probability of a random jump
[Figure: node v and hyperlink edges, with an edge labelled d/3; the annotated terms of the formula are the number of documents and the number of outgoing links]
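The formula itself is not preserved in this transcript. A standard way to write PageRank that matches the annotations on the slide is given below. The symbols are assumed notation: p(v) is the rank of document v, N the number of documents, HE the set of hyperlink edges, and N_h(u) the number of outgoing links of u.

```latex
p(v) \;=\; \frac{1-d}{N} \;+\; d \sum_{(u,v) \in HE} \frac{p(u)}{N_h(u)}
```

With three outgoing links from u, each target receives the share d/3 of p(u), which is the edge label visible in the figure.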

25 A Graph of XML Documents
Nodes: N
– each element in a document is a node
Edges: E = CE ∪ CE⁻¹ ∪ HE
– CE are “containment links”, i.e., there is an edge (u,v) in CE if u is a parent of v in the XML document
– HE are “hyperlinks”, i.e., there is an edge (u,v) in HE if there is an IDREF link or XLink link from u to v
We want to define ElemRank, the parallel to PageRank, but for XML elements.

26 Attempt 1 at ElemRank. There are now 4 ways to get to an element. Consider all of them in the formula. [Figure: node v with hyperlink edges and containment edges]

27 Attempt 1 at ElemRank: Problem. Consider a paper with few sections and many references. The more references there are, the less important each section is. Why? [Figure: node v with hyperlink edges and containment edges]

28 Attempt 2 at ElemRank. Consider hyperlinks and structural links separately. [Figure: node v with hyperlink edges and containment edges]

29 Attempt 2 at ElemRank: Problem. In fact, it is better to consider parent-child links differently from child-parent links. [Figure: node v with hyperlink edges and containment edges]

30 Actual ElemRank. Consider hyperlinks, parent links and child links separately. [Figure: node v with hyperlink edges and containment edges]
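The ElemRank formula is not preserved in this transcript. The reconstruction below is consistent with the random-walk reading on the next slide, but it is a sketch with assumed notation, not a quotation. N_d is the number of documents, N_de(v) the number of elements in the document containing v, N_h(u) the number of outgoing hyperlinks of u, and N_c(u) the number of children of u.

```latex
e(v) \;=\; \frac{1 - d_1 - d_2 - d_3}{N_d \cdot N_{de}(v)}
      \;+\; d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
      \;+\; d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)}
      \;+\; d_3 \sum_{(u,v) \in CE^{-1}} e(u)
```

The last term has no denominator because an element has exactly one parent, so following the parent edge involves no choice.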

31 Interpretation in Terms of Random Walks
The element rank of e is the probability that e will be reached if we start at a random element and at each point choose one of the following options:
– with probability 1-d_1-d_2-d_3, jump to a random element in a random page
– with probability d_1, follow a random hyperlink from the current element
– with probability d_2, follow a random edge to a child element
– with probability d_3, follow the parent edge
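The random walk above can be turned into a simple fixed-point iteration. This is a minimal sketch under stated assumptions: the function name `elem_rank`, its dictionary-based inputs and the fixed iteration count are hypothetical. When an element has no hyperlinks, children or parent, the corresponding probability mass is simply dropped here, so the ranks need not sum to 1.

```python
def elem_rank(doc_of, children, hyperlinks, d1=0.3, d2=0.3, d3=0.3, iterations=100):
    """
    doc_of:     element -> id of the document containing it
    children:   element -> list of child elements (containment edges)
    hyperlinks: element -> list of hyperlink targets (IDREF / XLink edges)
    """
    elements = list(doc_of)
    n_docs = len(set(doc_of.values()))
    doc_size = {}
    for e in elements:
        doc_size[doc_of[e]] = doc_size.get(doc_of[e], 0) + 1
    parent = {c: p for p, cs in children.items() for c in cs}

    rank = {e: 1.0 / len(elements) for e in elements}
    for _ in range(iterations):
        # Random jump: pick a random document, then a random element in it.
        new = {e: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[e]])
               for e in elements}
        for u in elements:
            targets = hyperlinks.get(u, [])
            for v in targets:   # follow a random hyperlink
                new[v] += d1 * rank[u] / len(targets)
            kids = children.get(u, [])
            for v in kids:      # follow a random edge to a child
                new[v] += d2 * rank[u] / len(kids)
            if u in parent:     # follow the parent edge
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank


# Hypothetical one-document example: root r with children a, b, c;
# c additionally has a hyperlink to a.
doc_of = {"r": "D", "a": "D", "b": "D", "c": "D"}
children = {"r": ["a", "b", "c"]}
ranks = elem_rank(doc_of, children, {"c": ["a"]})
```

In this example `a` and `b` are structurally identical siblings, so any difference in their ranks comes only from the hyperlink that points to `a`.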

32 ElemRank Example. Suppose that d_1 = d_2 = d_3 = 0.3. In what order will the nodes be ranked? What will be the formula for each node? [Figure: nodes 1, 2, 3 and 4 connected by hyperlink edges and containment edges]

33 Think About It
Very nice definition of ElemRank. Does it make sense? Would ElemRank give good results in the following scenarios:
– IDREFs connect articles with articles that they cite
– IDREFs connect managers with their departments
– IDREFs connect cleaning staff with the departments in which they work
– IDREFs connect countries with bordering countries (as in the CIA factbook)

34 Topics
We discuss:
– Ranking
– The Index Structure
– Query Processing

35 Indexing
We now discuss the index structure.
Recall that we will be ranking according to ElemRank.
Recall that we want to return “most specific elements”.
How should the data be stored in an index?

36 Naive Method
Treat elements as documents: normal inverted lists
Ricardo: 0; 4; 5; 8
XQL: 0; 4; 5; 7
Problem: space overhead. How much space is needed in storage?

37 Naive Method
Treat elements as documents: normal inverted lists
Ricardo: 0; 4; 5; 8
XQL: 0; 4; 5; 7
Problem: spurious results. We can't simply return the intersection of the lists, since if a node satisfies a query, so do all its ancestors.
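Both problems of the naive method show up on a tiny example. This is a sketch only: the function `naive_inverted_lists`, the integer node ids and the sample tree are hypothetical and do not correspond to the ids on the slide.

```python
from collections import defaultdict


def naive_inverted_lists(tree, texts, root):
    """Index every element as if it were a document holding all text below it.

    tree:  node id -> list of child ids
    texts: node id -> text stored directly in that node
    """
    lists = defaultdict(set)

    def visit(node):
        words = set(texts.get(node, "").lower().split())
        for child in tree.get(node, []):
            words |= visit(child)
        for word in words:
            # Space overhead: the word is repeated for every ancestor.
            lists[word].add(node)
        return words

    visit(root)
    return lists


# Hypothetical tree: 0 is the root, 1 and 2 its children, 3 and 4 children of 1.
tree = {0: [1, 2], 1: [3, 4]}
texts = {3: "XQL language", 4: "Ricardo", 2: "Xyleme"}
lists = naive_inverted_lists(tree, texts, root=0)
spurious = sorted(lists["xql"] & lists["language"])
```

Only node 3 directly contains both words, yet the intersection also returns its ancestors 1 and 0, and each word is stored once per ancestor rather than once.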

38 Dewey Encoding of ID
Use path information to identify elements: the DeweyID.
An ancestor’s ID is a prefix of its descendant’s ID.
Actually (not shown), all the node IDs are prefixed by the document number.
[Figure: the example XML document tree labelled with Dewey IDs]
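A short sketch of Dewey numbering, with IDs represented as Python tuples. The helper names and the convention that the root element gets the ID (document number, 0) are assumptions for illustration; the slide only states that IDs are path-based and prefixed by the document number.

```python
def assign_dewey_ids(children, root, doc_number):
    """Assign each element the tuple of child positions on its path from the root.

    children: element -> ordered list of child elements
    """
    ids = {}

    def visit(node, dewey):
        ids[node] = dewey
        for position, child in enumerate(children.get(node, [])):
            visit(child, dewey + (position,))

    # Assumed convention: the root element is numbered 0 under its document.
    visit(root, (doc_number, 0))
    return ids


def is_ancestor(a, d):
    """True if the element with Dewey ID `a` is a proper ancestor of `d`."""
    return len(a) < len(d) and d[:len(a)] == a


# Hypothetical element names for a document numbered 47.
children = {"proceedings": ["paper1", "paper2"], "paper1": ["title", "abstract"]}
ids = assign_dewey_ids(children, "proceedings", 47)
```

Because tuples compare lexicographically, sorting Dewey IDs puts every element before its descendants and keeps a subtree contiguous, which is what makes a single sorted scan of the lists possible.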

39 Dewey Inverted List (DIL)
Store, for each keyword, a list containing:
– the ID of the node containing the keyword
– the rank of the node containing the keyword
– the positions of the keyword in the node
Rank and positions are needed to compute the ranking.
To simplify, in the following slides we only store lists of node IDs.

40 Topics
We discuss:
– Ranking
– The Index Structure
– Query Processing

41 Query Processing
Challenges:
– How do we find nodes that contain all keywords?
– How do we find only the most specific node that contains all keywords?
– Can this be done in a single scan of the inverted keyword lists?

42 Example: Document. [Figure: the 47th document in the corpus as a tree: a proceedings element with paper children; one paper has title, abstract, section and subsection elements, with the text fragments “… XQL …”, “… language …”, “… XQL …” and “… XQL language …”; another paper contains “… XQL …”]

43 Example: Document with IDs. [Figure: the same tree, with each node labelled by its Dewey ID]

44 Example: Inverted Lists. Lists contain IDs for nodes that directly contain the keyword. Lists are sorted. [Figure: the same tree with the inverted lists for XQL and language]

45 Example: Inverted Lists. We want to find the nodes that should be returned. Which? How will they be ranked? [Figure: the same tree with the inverted lists for XQL and language]

46 Algorithm: Data Structures. [Figure: the inverted lists for XQL and language; a Dewey stack with the columns DeweyID, Contains[1], Contains[2] and ContainsAll; a result heap]

47 Algorithm: Pseudo Code
Find the smallest next entry in the inverted lists.
Find the longest common prefix of the entry and the Dewey stack.
Pop all non-matching values from the Dewey stack. When popping:
– propagate containment information down the stack, if ContainsAll is false
– if ContainsAll turns from false to true, add the result to the output
Add the non-matching values from the entry to the Dewey stack.
Mark containment for the entry’s keyword.
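The pseudo code can be sketched in Python as a single merge over the sorted lists. This is one reading of the steps, chosen so that it agrees with the result definition given earlier (a node is a result if it contains all keywords once descendants that already contain all of them are set aside). Ranks, keyword positions and the result heap are omitted. The function name, the list-based stack entries and the Dewey IDs in the example are hypothetical.

```python
import heapq


def dewey_stack_search(inverted_lists):
    """inverted_lists: keyword -> sorted list of Dewey IDs (tuples) of the
    nodes that directly contain the keyword. Returns result Dewey IDs."""
    keywords = list(inverted_lists)
    n = len(keywords)
    # Single scan: merge all lists in Dewey order, tagged with keyword index.
    merged = heapq.merge(*[[(dewey, i) for dewey in inverted_lists[k]]
                           for i, k in enumerate(keywords)])
    stack = []    # one entry per Dewey component: [component, contains, contains_all]
    results = []

    def pop():
        component, contains, contains_all = stack.pop()
        if all(contains):
            # Every keyword was seen outside already-complete subtrees.
            results.append(tuple(e[0] for e in stack) + (component,))
            contains_all = True
        if stack:
            parent = stack[-1]
            if contains_all:
                parent[2] = True   # parent must ignore this whole subtree
            else:
                parent[1] = [a or b for a, b in zip(parent[1], contains)]

    for dewey, i in merged:
        # Longest common prefix of the entry and the Dewey stack.
        lcp = 0
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:
            pop()
        for component in dewey[lcp:]:
            stack.append([component, [False] * n, False])
        stack[-1][1][i] = True     # mark containment for the entry's keyword
    while stack:
        pop()
    return results


# Hypothetical Dewey IDs for a document numbered 47.
lists = {
    "XQL": [(47, 0, 0, 0), (47, 0, 0, 2), (47, 0, 0, 2, 0), (47, 0, 1, 0)],
    "language": [(47, 0, 0, 1), (47, 0, 0, 2, 0)],
}
found = dewey_stack_search(lists)
```

In this example the node (47, 0, 0, 2, 0) contains both keywords directly, and (47, 0, 0) still contains both through its first two children after that subtree is set aside, so both are reported.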

48 Example: Algorithm. [Figure: initial state: the inverted lists for XQL and language, an empty Dewey stack (columns DeweyID, Contains[1], Contains[2], ContainsAll) and an empty result heap]

49 Example: Algorithm. Smallest entry is for keyword 1, XQL. lcp with Dewey stack = none. Pop (nothing). Add (all).

50 Example: Algorithm. [Figure: Dewey stack after the first entry has been added]

51 Example: Algorithm. Next smallest entry is for keyword 2, language. lcp with Dewey stack = … Pop non-matching entries.

52 Example: Algorithm. Next smallest entry is for keyword 2, language. lcp with Dewey stack = … Add additional entries.

53 Example: Algorithm. Next smallest entry is for keyword 2, language. lcp with Dewey stack = … [Figure: Dewey stack after the entry has been added]

54 Example: Algorithm. Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = … Pop non-matching entries.

55 Example: Algorithm. Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = … Continue on Blackboard!