Published by Isabella Stephens. Modified over 9 years ago.
1 Searching XML Documents via XML Fragments
D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer
Presented by Hui Fang
2 Background(1) --- Data
Database: well-structured data
–Schema: Papers (Title, Authors, Conf., Journal)
Title        Authors                  Conf.   Journal
XIRQL        N. Fuhr, K. Großjohann   SIGIR   ---
Protein ---  John Smith               ---     Bioinformatics
IR: unstructured data
–An example document: "Intel: New chip, new price war. February 1, 2004: 6:32 PM EST. Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998, ….."
Limitations: lack of flexibility; lack of extensibility; lack of the logical structure of a document
DB+IR: semi-structured data
–E.g. XIRQL, N. Fuhr, K. Großjohann, SIGIR
Why is semi-structured data important?
3 XML in a nutshell
Hierarchical data format
Nested element structure having a root
Self-describing data (tags): the schema is attached to the data itself
Example: <book id="25"> <year>1997</year> <author>Karen Sparck Jones</author> <author>Peter Willett</author> <publisher>Morgan Kaufmann</publisher> <title>Readings in Information Retrieval</title> … </book>
Labeled parts: start tag, content, end tag, attribute, element
Tree view: book (id="25") with children year, author, author, title, publisher
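The nested, self-describing structure can be made concrete with a minimal Python sketch. The XML text below is a hypothetical reconstruction of the slide's book record: the element names come from the labels on the slide's tree diagram, and the helper `walk` is introduced here only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the slide's book record; element names
# follow the labels on the slide's tree diagram.
BOOK_XML = """
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>
"""

root = ET.fromstring(BOOK_XML)


def walk(element, path=""):
    """Yield (path, text) for every element, depth first."""
    current = f"{path}/{element.tag}"
    text = (element.text or "").strip()
    yield current, text
    for child in element:
        yield from walk(child, current)


nodes = list(walk(root))                            # every element with its path
book_id = root.get("id")                            # attribute on the root element
authors = [a.text for a in root.findall("author")]  # repeated child elements

for path, text in nodes:
    print(path, "->", text)
```

The path printed for each element (for example `/book/title`) is the kind of structural context that later slides pair with terms.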
4 Background(2) --- Query
Database: Boolean query
–SQL (Structured Query Language): SELECT title FROM papers WHERE conf = 'SIGIR'
–Return the unranked tuples satisfying the query.
IR: ranked query
–Keywords: paper SIGIR
–Return the documents ranked according to their relevance.
How to query semi-structured data (e.g. XML data)?
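The contrast between the two query styles can be sketched in a few lines of Python. The table rows, the three toy documents, and the keyword-counting `score` function are all assumptions made for illustration; real IR systems use more refined relevance formulas.

```python
import sqlite3

# Toy data; the rows and the documents below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, authors TEXT, conf TEXT, journal TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?)",
    [
        ("XIRQL", "N. Fuhr, K. Grossjohann", "SIGIR", None),
        ("Protein", "John Smith", None, "Bioinformatics"),
    ],
)

# Boolean query: a row either satisfies the predicate or it does not.
boolean_result = [
    row[0]
    for row in conn.execute("SELECT title FROM papers WHERE conf = ?", ("SIGIR",))
]

# Ranked query: every document gets a score and results are ordered by it.
docs = {
    "d1": "a SIGIR paper about XML retrieval",
    "d2": "a paper about protein folding",
    "d3": "notes on chip prices",
}


def score(query, text):
    """Count how many times each query keyword occurs in the text."""
    words = text.lower().split()
    return sum(words.count(term.lower()) for term in query.split())


query = "paper SIGIR"
scored = sorted(
    ((score(query, text), doc_id) for doc_id, text in docs.items()),
    reverse=True,
)
ranked_result = [(doc_id, s) for s, doc_id in scored if s > 0]
```

The Boolean result is an unordered set of matching rows, while the ranked result keeps partial matches such as `d2` but places them lower.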
5 Related Work
DB-oriented approaches
–E.g. XML-QL, XQL, XQuery …
–Example query (markup not preserved): WHERE Harry Potter $a, $y in "books.xml", $y>2002 CONSTRUCT $t
DB+IR approaches
–E.g. XIRQL
IR-oriented approaches
–E.g. this paper
6 Problem Refinement --- CAS Search
Document collection:
–XML documents
–Each document is a hierarchical structure of nested elements.
–Markup in the document mainly serves to expose the logical structure of the document.
Query:
–Content + explicit references to the XML structure
–Specifies the target element that needs to be returned
An example: Retrieve all articles from the years 1999-2000 that deal with work on nonmonotonic reasoning. Do not retrieve articles that are calendars or calls for papers.
7 Approach
Compare apples with apples
Recall vector space models
–Both documents and queries are expressed in free text.
–Compare unstructured data to unstructured data
This paper:
–Search XML documents via XML fragments
8 Query --- XML Fragments(1)
Topic 1: Find all books about fishing
–XML fragment (tags not preserved): fishing
Topic 2: Find all books having a title about search
–XML fragment (tags not preserved): fishing
–XQuery: { for $t in document("library.xml")//book/title where contains($t/text(), "search") return $t }
XML fragments are more intuitive and more flexible.
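A fragment query can be read as a set of terms, each with the path of tags that encloses it. The Python sketch below illustrates that reading. The two fragment strings are hypothetical, written from the topic descriptions since the slide's markup is not preserved, and `term_context_pairs` is a helper named here for illustration, not a function from the paper.

```python
import xml.etree.ElementTree as ET


def term_context_pairs(xml_text):
    """Return (term, context) pairs, where context is the path of enclosing tags."""
    pairs = []

    def visit(element, path):
        context = f"{path}/{element.tag}"
        # Text directly inside this element.
        for term in (element.text or "").split():
            pairs.append((term.lower(), context))
        for child in element:
            visit(child, context)
            # Text that follows a child still belongs to this element.
            for term in (child.tail or "").split():
                pairs.append((term.lower(), context))

    visit(ET.fromstring(xml_text), "")
    return pairs


# Hypothetical fragments for the two topics on the slide.
topic1 = term_context_pairs("<book>fishing</book>")
topic2 = term_context_pairs("<book><title>search</title></book>")
```

Because the same function works on a full document, queries and documents end up in the same representation, which is the "compare like with like" idea behind the approach.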
9 Query --- XML Fragment(2)
Limited expressiveness
–E.g. "Finding figures that describe the Corba architecture and the paragraphs that refer to those figures."
–Requires a "join" operation between two elements, "figures" and "paragraphs"
10 Recall: Text Retrieval Task
Given a query
–According to the retrieval formula, compute the relevance score for each document.
–Rank the documents according to their relevance scores.
Vector Space Model
–Represent doc/query by a vector of terms
–Relevance between doc and query: distance between the two vectors d and q
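As a reminder of how this works in practice, here is a minimal Python sketch of vector space ranking. It assumes raw term frequencies as weights and cosine similarity as the closeness measure, which is one common choice; the slide itself does not fix either.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query and return ids, best first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]


# Toy collection for illustration.
docs = {
    "d1": "xml retrieval with xml fragments",
    "d2": "relational database query",
    "d3": "ranking xml documents",
}
ranking = rank("xml retrieval", docs)
```

Document `d1` shares both query terms and ranks first; `d2` shares none and is dropped.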
11 Extending the Vector Space Model(1)
Indexing unit: (term, context) pair
–E.g. ("Harry Potter", /book/title)
–Can be matched with ("Harry Potter", /book) and ("Harry Potter", /book/sec/title)
Retrieval formula [not preserved]
Context resemblance measure between contexts c_i and c_k
–Perfect match: 1 when c_i = c_k; 0 otherwise
–Partial match: nonzero [value not preserved] when c_i is a subsequence of c_k; 0 otherwise
–Fuzzy match: [formula not preserved]
–Flat (ignore context): [formula not preserved]
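The four context resemblance options can be sketched in Python. Several details here are assumptions rather than the slide's own formulas: the partial-match value (1 + |c_i|) / (1 + |c_k|) is a commonly cited form, the fuzzy variant uses `difflib.SequenceMatcher` as a stand-in string similarity, the flat variant is taken to return 1 for every pair, and c_i is treated as the query-side context.

```python
from difflib import SequenceMatcher


def steps(context):
    """Split a path such as '/book/title' into ['book', 'title']."""
    return [s for s in context.split("/") if s]


def is_subsequence(short, long):
    """True if the items of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def cr_perfect(ci, ck):
    """1 only when the two contexts are identical."""
    return 1.0 if steps(ci) == steps(ck) else 0.0


def cr_partial(ci, ck):
    # Assumed form: (1 + |ci|) / (1 + |ck|) when ci is a subsequence of ck.
    a, b = steps(ci), steps(ck)
    return (1 + len(a)) / (1 + len(b)) if is_subsequence(a, b) else 0.0


def cr_fuzzy(ci, ck):
    # Assumed stand-in: similarity ratio between the two tag sequences.
    return SequenceMatcher(None, steps(ci), steps(ck)).ratio()


def cr_flat(ci, ck):
    """Ignore context entirely: every pair of contexts matches fully."""
    return 1.0


query_context = "/book/title"
for doc_context in ("/book/title", "/book", "/book/sec/title"):
    print(
        doc_context,
        cr_perfect(query_context, doc_context),
        cr_partial(query_context, doc_context),
        round(cr_fuzzy(query_context, doc_context), 3),
        cr_flat(query_context, doc_context),
    )
```

Under these assumptions, `/book/title` against `/book/sec/title` gets a partial score below 1 because the document context is longer, while `/book` only matches under the fuzzy and flat variants.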
12 Extending the Vector Space Model(2)
Term weights use a context-dependent idf(t, c) [formula not preserved]
–If c is rare, idf(t, c) would be high in spite of t being very common.
"Merge-idf" variant: [formula not preserved]
"Merge" variant: [formula not preserved]
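The issue with a rare context can be illustrated with a small Python sketch. It assumes the textbook form idf = log(N / df) and interprets the merged variant as counting a term's document frequency across all contexts; the slide's exact formulas are not preserved, so both choices are assumptions, and the toy collection is invented.

```python
import math

# Each document is a set of (term, context) pairs; toy data for illustration.
docs = [
    {("xml", "/article/title"), ("xml", "/article/body")},
    {("xml", "/article/body")},
    {("xml", "/article/body"), ("search", "/article/body")},
    {("search", "/article/body")},
]


def idf_context(term, context, docs):
    """idf(t, c): count only documents where t occurs in context c."""
    df = sum(1 for d in docs if (term, context) in d)
    return math.log(len(docs) / df) if df else 0.0


def idf_merged(term, docs):
    """Merged idf: count documents containing t in any context."""
    df = sum(1 for d in docs if any(t == term for t, _ in d))
    return math.log(len(docs) / df) if df else 0.0


rare_context = idf_context("xml", "/article/title", docs)   # log(4 / 1)
common_context = idf_context("xml", "/article/body", docs)  # log(4 / 3)
merged = idf_merged("xml", docs)                            # log(4 / 3)
```

Here "xml" is common in the collection, yet its context-specific idf in the rarely used title context is high; merging the counts across contexts removes that inflation.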
13 Evaluation
Runs
–Partial-match
–Partial-match.merge-idf
–Partial-match.merge
–Fuzzy-match.merge-idf
–Flat (ignore context)
14 Result(1)
Result for "free-text-oriented" topics
–An example topic (markup not preserved): 1995,1996,1997,1998,1999 XML Electronic commerce
15 Result(2)
Result for "context-oriented" topics
–An example topic (markup not preserved): Content-Based retrieval of video databases
16 Summary
Using XML fragments with an extended vector space model is promising.
Use different solutions for different types of applications.
Something wrong?
17 Another Problem --- CO Search
Document collection:
–XML documents
Query:
–A set of keywords
Task: Find the smallest element satisfying the query
Challenge: rank components instead of documents
18 Possible Solutions
Possible Method(1): treat each component as a document.
[Scoring formula and figure with terms t1, t2 not preserved]
Problem with this method: XML components are nested.
19 Possible Solutions (Cont.)
Possible Method(2): count TF at the component level; compute N and DF at the document level.
[Scoring formula and figure with terms t1, t2 not preserved]
Problem with this method: impossible to differentiate between the rankings of the three sections.
20 Proposed Solution
Create an index for each component type
–Elements in each index are regarded as documents
–Keep N, DF, TF for the specific component type
–Can apply the regular vector space model on each index
Given a query
–Run the query in parallel on each index
–Return one ranked list of results from each index
Normalize the scores in each index into the range (0,1)
–Achieved by computing [formula not preserved]
Merge the normalized results into a single ranked list of all components
Assume the set of potential components to be returned is known in advance.
Assume no nesting of the same component.
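The normalize-then-merge step can be sketched in Python. The slide's own normalization formula is not preserved, so this sketch assumes each index's scores are divided by that index's top score; the component types, component ids, and raw scores are invented for illustration.

```python
def normalize(results):
    """Scale one index's scores by its top score.

    Assumption: dividing by the maximum observed score; the slide's own
    normalization formula is not preserved.
    """
    if not results:
        return []
    top = max(score for _, score in results)
    if top <= 0:
        return [(component, 0.0) for component, _ in results]
    return [(component, score / top) for component, score in results]


def merge(results_per_index):
    """Merge normalized per-index result lists into one ranked list."""
    merged = []
    for component_type, results in results_per_index.items():
        for component, score in normalize(results):
            merged.append((score, component_type, component))
    # Stable sort: ties keep the order in which the indices were listed.
    merged.sort(key=lambda item: item[0], reverse=True)
    return [(ctype, comp, score) for score, ctype, comp in merged]


# Hypothetical raw scores from three per-type indices.
raw = {
    "article": [("a1", 12.0), ("a2", 6.0)],
    "section": [("a1/sec[2]", 3.0), ("a2/sec[1]", 1.5)],
    "paragraph": [("a1/sec[2]/p[1]", 0.8)],
}
ranked = merge(raw)
```

Without the normalization step, articles would dominate the merged list simply because their raw scores are on a larger scale than those of sections and paragraphs.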
21 Conclusion
Possible solutions to the following challenges:
–Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
  The document may no longer be the most natural unit.
  Components in a document may be more appropriate.
–Challenge 2 (Query): What is an appropriate query language?
  Keyword (free text) query is no longer the only choice.
  Constraints on the structure can be posed.
22 References
–Retrieving the most relevant XML components, by Y. Mass, M. Mandelbrod. INEX '03 workshop.
–Searching XML Documents via XML Fragments, by D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR '03.
–XIRQL: A Query Language for Information Retrieval in XML Documents, by N. Fuhr, K. Großjohann. SIGIR '02.