ISP 433/533 Week 11 XML Retrieval
Structured Information
  Traditional IR
    –Unit of information: terms and documents
    –No structure
  Need more granularity
  Document has structure
    –E.g. title, sections, footnotes, etc.
  A markup language is a mechanism to identify structures in a document
    –Data + Metadata
Extensible Markup Language (XML)
  Markup (tags, not a fixed set)
  Content
  Nested, named trees with attributes
  Example content (book records): One Fish Two Fish, John Meyer, Peter Smith, 7.95; Goodnight Moon, Margaret Brown
Elements
  Delimited by angle brackets
  Identify the nature of the content they surround
  Elements can be nested within another element
    –A tree structure
  An element may have attributes
    –E.g.
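As a rough illustration of nested elements and attributes, the sketch below parses a small XML fragment with Python's standard xml.etree.ElementTree module. The tag names (bib, book, title, author, price) and the id attribute are assumptions made for this example; only the book values come from the earlier XML slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names and the "id" attribute are assumed,
# not taken from the original slides.
BIB_XML = """
<bib>
  <book id="b1">
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book id="b2">
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
  </book>
</bib>
"""

root = ET.fromstring(BIB_XML.strip())


def outline(elem, depth=0):
    """Return one indented line per element, showing the tree structure."""
    attrs = "".join(f' {k}="{v}"' for k, v in elem.attrib.items())
    lines = ["  " * depth + f"<{elem.tag}{attrs}>"]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(root)))

# Attributes are read with get(); nested children are reached with find/findall.
first_book = root.find("book")
book_id = first_book.get("id")
authors = [a.text for a in first_book.findall("author")]
print(book_id, authors)
```

The outline makes the tree structure visible: each level of indentation corresponds to one level of element nesting.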
Unit of Retrieval
  Traditional IR
    –Document
  XML IR
    –Element or fragment of an element
Example Retrieval Units
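To make the idea of element-level retrieval units concrete, here is a minimal sketch that lists every element of a document as a candidate unit, together with its path and text. The article markup and the retrieval_units helper are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; tag names are assumptions for this example.
ARTICLE = (
    "<article><title>XML Retrieval</title>"
    "<sec><p>Elements are units.</p><p>So are fragments.</p></sec>"
    "</article>"
)


def retrieval_units(root):
    """Return (path, text) for every element; each one is a candidate unit."""
    units = []

    def visit(elem, path):
        here = f"{path}/{elem.tag}"
        # Join all text in the subtree and normalise whitespace.
        text = " ".join(" ".join(elem.itertext()).split())
        units.append((here, text))
        for child in elem:
            visit(child, here)

    visit(root, "")
    return units


units = retrieval_units(ET.fromstring(ARTICLE))
for path, text in units:
    print(path, "->", text)
```

A traditional IR system would index only the first unit (the whole article); an XML IR system can return any of them.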
Requirements for XML Retrieval
  Basic needs for XML retrieval
    –Query both data and metadata
    –Express the query in a user-friendly way
    –Return the proper document fragments
    –Rank the results according to their relevance
INEX
  The Initiative for the Evaluation of XML Retrieval
    –International, coordinated effort to promote evaluation procedures for content-based XML retrieval
    –Provides a large test collection of XML documents (12,000 articles from IEEE CS publications since 1995)
    –Introduces both content-only (CO) and content-and-structure (CAS) topics
    –Designed to be a long-term initiative with workshops held on a yearly basis (currently in the second year)
INEX CO Topic Example
  Title: semantic web
  Description: Research and business opportunities and challenges in developing and deploying the concept of the Semantic Web and the associated idea of web services.
  Narrative: To be relevant, a document/component must either discuss the technical issues and opportunities associated with the semantic web, or it must discuss the business challenges, especially the question of viable business models for web services.
  Keywords: semantic web, ontologies, SOAP, UDDI, RDF…
INEX CAS Topic Example
  Title:
    –Target elements: //fig, //p, //ip1
    –Content words “Corba architecture” in context //fgc
    –Content words “Figure Corba Architecture” in context //p, //ip1
  Description: Find figures that describe the Corba architecture and the paragraphs that refer to those figures.
  Narrative: To be relevant a figure must describe the standard Corba architecture or a system architecture that relies heavily on Corba…Retrieved components would ideally contain both the figure and the paragraph referring to it.
  Keywords: CORBA Object Request Broker Architecture …
An Inverted Index for XML
  Element index: (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) … … (1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) … …
  Text index, terms “retrieval” and “information”: (1, 3, 2) … … (1, 4, 2) … …
  Text in the figure: Information Retrieval Using RDBMS Beyond Simple Translation Extension of IR Features
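The postings on this slide can be read as (document, start:end, level) for elements and (document, position, level) for words; that reading is an assumption, since the figure itself is not available. Under that assumption, the sketch below numbers start tags, words, and end tags with one running counter, builds both indexes, and answers a containment query by comparing positions. The sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document.
BOOK = (
    "<book><title>Information Retrieval</title>"
    "<chapter><heading>XML Retrieval</heading></chapter></book>"
)


def build_indexes(doc_id, root):
    """Build element and text indexes using region (start/end) numbering."""
    element_index = defaultdict(list)  # tag  -> [(doc, start, end, level)]
    text_index = defaultdict(list)     # word -> [(doc, position, level)]
    counter = 0

    def add_words(text, level):
        nonlocal counter
        for word in (text or "").split():
            counter += 1
            text_index[word.lower()].append((doc_id, counter, level))

    def visit(elem, level):
        nonlocal counter
        counter += 1                   # position of the start tag
        start = counter
        add_words(elem.text, level + 1)
        for child in elem:
            visit(child, level + 1)
            add_words(child.tail, level + 1)
        counter += 1                   # position of the end tag
        element_index[elem.tag].append((doc_id, start, counter, level))

    visit(root, 0)
    return element_index, text_index


def elements_containing(element_index, text_index, tag, word):
    """Return postings of `tag` elements whose region encloses `word`."""
    hits = []
    for doc, start, end, level in element_index.get(tag, []):
        if any(d == doc and start < pos < end
               for d, pos, _ in text_index.get(word, [])):
            hits.append((doc, start, end, level))
    return hits


elem_idx, text_idx = build_indexes(1, ET.fromstring(BOOK))
title_hits = elements_containing(elem_idx, text_idx, "title", "retrieval")
print(dict(elem_idx))
print(dict(text_idx))
print(title_hits)
```

Because containment reduces to comparing integers, queries such as "title elements containing retrieval" can be answered from the two indexes without touching the original document.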
XPath
  XPath is a non-XML language for identifying particular parts of XML documents
    –Picking nodes and sets of nodes
  Similar to Unix file system expressions, e.g. “/people/person/name/first_name”
    –“*” wildcard
    –“..” parent
    –“.” context node
    –“//” descendants
    –“@” attribute
    –“[]” predicate, specifies a condition
XPath Examples
  Sample document tree (figure): a document element with class="H.3.3", an author (John Smith), a title (XML Retrieval), and chapter elements containing heading and section elements; text shown includes Introduction, This..., Syntax, Examples, XML Query Language XQL, and We describe syntax of XQL
  Queries evaluated on this tree:
    –chapter/heading
    –chapter//heading
    –//chapter[heading]
    –author="John Smith"]
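The queries above can be tried with the limited XPath subset supported by Python's xml.etree.ElementTree. This is only a sketch: the exact nesting of the sample tree is an assumption reconstructed from the flattened figure, the collection wrapper element is hypothetical, and the last expression is truncated in the source, so only its predicate is illustrated.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample document; the nesting is a guess.
DOC = """
<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
    This...
  </chapter>
  <chapter>
    <heading>XML Query Language XQL</heading>
    <section>
      <heading>Syntax</heading>
      We describe syntax of XQL
    </section>
    <section>
      <heading>Examples</heading>
    </section>
  </chapter>
</document>
"""

document = ET.fromstring(DOC.strip())
# Hypothetical wrapper so that "document" itself can be selected by a path.
collection = ET.Element("collection")
collection.append(document)

# chapter/heading: headings that are direct children of a chapter.
direct = [h.text for h in document.findall("chapter/heading")]

# chapter//heading: headings at any depth below a chapter.
nested = [h.text for h in document.findall("chapter//heading")]

# //chapter[heading]: chapters that have a heading child.
chapters = collection.findall(".//chapter[heading]")

# Predicate on a child's text: documents whose author is John Smith.
by_smith = collection.findall("document[author='John Smith']")

print(direct)
print(nested)
print(len(chapters), len(by_smith))
```

ElementTree evaluates paths relative to the element they are called on, which is why the descendant queries start with "." here.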
More XPath Examples
  //*[@id]
    –All the elements that have attribute “id”
  //middle_initial/../first_name
    –All the first_name elements that are siblings of middle_initial elements
  //person[profession='physicist']
    –All person elements that have a profession child element with the value “physicist”
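A minimal sketch of these path expressions, again using the XPath subset in Python's xml.etree.ElementTree. The people document below is hypothetical sample data. ElementTree does not accept absolute paths on an element, so each query is written relative to the root people element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration.
PEOPLE = """
<people>
  <person id="p1">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>computer scientist</profession>
  </person>
  <person id="p2">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
  </person>
</people>
"""

people = ET.fromstring(PEOPLE.strip())

# /people/person/name/first_name, relative to the root element.
first_names = [e.text for e in people.findall("person/name/first_name")]

# Wildcard plus attribute test: every element that carries an id attribute.
with_id = [e.tag for e in people.findall(".//*[@id]")]

# Parent step: first_name siblings of middle_initial elements.
siblings = [e.text for e in people.findall(".//middle_initial/../first_name")]

# Predicate on a child's value.
physicists = [p.get("id") for p in people.findall(".//person[profession='physicist']")]

print(first_names, with_id, siblings, physicists)
```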
XQuery
  A language to query data that is similar to XML in structure
    –Nested, named trees with attributes
  Based on XPath
  General form:
    FOR/LET PathExpression
    WHERE AdditionalSelectionCriteria
    RETURN ResultConstruction
XQuery Example
  Find the name(s) of customers who have ordered the part whose part_id is "xx"
    FOR $c IN customers
    FOR $o IN orders
    WHERE $c/cust_id = $o/cust_id AND $o/part_id = "xx"
    RETURN $c/name
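The FOR / WHERE / RETURN pattern of this query maps naturally onto nested loops with a filter. The sketch below is a Python rendering of the same join, not XQuery itself; the db wrapper element and the sample customers and orders are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; element names follow the query on the slide.
DB_XML = """
<db>
  <customers>
    <customer><cust_id>1</cust_id><name>Ann</name></customer>
    <customer><cust_id>2</cust_id><name>Bob</name></customer>
  </customers>
  <orders>
    <order><cust_id>1</cust_id><part_id>xx</part_id></order>
    <order><cust_id>2</cust_id><part_id>yy</part_id></order>
  </orders>
</db>
"""

db = ET.fromstring(DB_XML.strip())

names = [
    c.findtext("name")                      # RETURN: the customer's name
    for c in db.iter("customer")            # FOR $c IN customers
    for o in db.iter("order")               # FOR $o IN orders
    if c.findtext("cust_id") == o.findtext("cust_id")   # WHERE: join condition
    and o.findtext("part_id") == "xx"                   # WHERE: selection
]
print(names)
```

A customer with several matching orders would appear once per order; wrapping the result in set() removes such duplicates if that is not wanted.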
Another XQuery Example
  Find titles and prices of books by 'Meyer' or 'Smith'
    FOR $b IN document("bib.xml")//book
    WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
    RETURN $b/title $b/price
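One Document Structure
  The previous XQuery works
  Document tree (figure): bookinfo contains book elements; the first book has title, author, and price, the second has title and price
  Values shown: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Hedi, $13.95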
Another Document Structure
  The same XQuery doesn’t work
  Document tree (figure): bookinfo contains author elements; each author has a name and book children, and books have title and price
  Values shown: Dr. Meyer; M. Brown; Goodnight Moon; One Fish Two Fish, $12.50; Cat in the Hat, $14.95
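The dependence on document structure can be demonstrated with a Python rendering of the book query. The two documents below are assumed reconstructions of the flattened figures (the exact nesting is a guess), and titles_and_prices is a hypothetical helper that mirrors the query's paths //book, $b/author, $b/title, and $b/price.

```python
import xml.etree.ElementTree as ET

# Structure 1 (assumed): authors are children of book.
STRUCTURE_1 = """
<bookinfo>
  <book>
    <title>Just Lost</title>
    <author>Mercy Meyer</author>
    <author>Gina Meyer</author>
    <price>$5.75</price>
  </book>
</bookinfo>
"""

# Structure 2 (assumed): books are children of author.
STRUCTURE_2 = """
<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
    <book><title>Cat in the Hat</title><price>$14.95</price></book>
  </author>
  <author>
    <name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>
"""


def titles_and_prices(root, names):
    """Return (title, price) of books having an author child matching a name."""
    results = []
    for book in root.iter("book"):                                 # //book
        authors = [a.text or "" for a in book.findall("author")]   # $b/author
        if any(n in a for a in authors for n in names):            # contains
            results.append((book.findtext("title"), book.findtext("price")))
    return results


result_1 = titles_and_prices(ET.fromstring(STRUCTURE_1.strip()), ["Meyer", "Smith"])
result_2 = titles_and_prices(ET.fromstring(STRUCTURE_2.strip()), ["Meyer", "Smith"])
print(result_1)
print(result_2)
```

The second result is empty even though Dr. Meyer's books are in the document: the query looks for author below book, while this structure puts book below author.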
Problem with XQuery
  Requires knowledge of the document structure
  Queries depend on the document structure
  Difficult for naive users
  Extensions are needed to solve the problem
  Still an area of active research
Don’t know the tags?
  Integrate with full-text keyword search
  Automatically identify tag names
  Translate query terms to tag names
  Query expansion
Don’t know the structure?
  Schema-free XQuery
    –Automatically identifies the minimal, meaningful set of nodes that can provide the answer
  Document tree (figure): element names bookinfo, book, title, name, price; values Just Lost, Mercy Meyer, Gina Meyer, $5.75, Brown Bear, $13.95
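One simple way to approximate "the minimal set of nodes that can provide the answer" is to look for the smallest subtrees that contain all query keywords. The sketch below does exactly that; it is a simplified stand-in and not the actual Schema-free XQuery algorithm. The sample documents and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents with different structures.
BY_BOOK = """
<bookinfo>
  <book><title>Just Lost</title><author>Mercy Meyer</author><price>$5.75</price></book>
  <book><title>Brown Bear</title><price>$13.95</price></book>
</bookinfo>
"""
BY_AUTHOR = """
<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
    <book><title>Cat in the Hat</title><price>$14.95</price></book>
  </author>
  <author>
    <name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>
"""


def contains_all(elem, terms):
    """True if the text of elem's subtree mentions every term."""
    text = " ".join(elem.itertext()).lower()
    return all(t.lower() in text for t in terms)


def smallest_answers(root, terms):
    """Elements that contain all terms while none of their children do."""
    return [
        elem for elem in root.iter()
        if contains_all(elem, terms)
        and not any(contains_all(child, terms) for child in elem)
    ]


answers_1 = smallest_answers(ET.fromstring(BY_BOOK.strip()), ["Meyer", "Lost"])
answers_2 = smallest_answers(ET.fromstring(BY_AUTHOR.strip()), ["Meyer", "Cat"])
print([e.tag for e in answers_1])
print([e.tag for e in answers_2])
```

The same function returns a book element for the first structure and an author element for the second, without the query naming either tag.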
Querying XML with Natural Language
  Translate a natural language query into Schema-free XQuery
  NaLIX demo
Relevance Scoring
  Query: articles about “search engine”
TermJoin
  A user-defined score function generates the score based on term occurrences and other information
  They are then joined
  Figure labels: score = 1, score = 2, score = 4, score = 5
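To illustrate scoring elements from term occurrences, the sketch below uses a deliberately naive score function: the number of query-term occurrences in an element's subtree, computed bottom-up so that each parent combines the scores of its children. This is an assumed stand-in for a user-defined score function, not the TermJoin algorithm itself; the article markup and function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article about search engines.
ARTICLE = """
<article>
  <title>Search engine basics</title>
  <sec>
    <p>A search engine ranks pages.</p>
    <p>Crawling feeds the engine.</p>
  </sec>
  <sec>
    <p>Unrelated text.</p>
  </sec>
</article>
"""


def own_hits(elem, terms):
    """Count term occurrences in text directly inside elem (not in children)."""
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    words = re.findall(r"\w+", " ".join(pieces).lower())
    return sum(words.count(t) for t in terms)


def score_elements(root, terms):
    """Return {element: score}, where score = term occurrences in the subtree."""
    scores = {}

    def visit(elem):
        total = own_hits(elem, terms) + sum(visit(child) for child in elem)
        scores[elem] = total
        return total

    visit(root)
    return scores


article = ET.fromstring(ARTICLE.strip())
scores = score_elements(article, ["search", "engine"])
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for elem, score in ranked:
    print(elem.tag, score)
```

A raw occurrence count always favours the largest element, so a practical score function would also normalise, for example by element length; that refinement is left out here.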