ISP 433/533 Week 11 XML Retrieval


1 ISP 433/533 Week 11 XML Retrieval

2 Structured Information
Traditional IR
–Unit of information: terms and documents
–No structure
Need more granularity
Document has structure
–E.g. title, sections, footnotes, etc.
A markup language is a mechanism to identify structures in a document
–Data + Metadata

3 Extensible Markup Language (XML)
Markup (tags – not a fixed set)
Content
Nested, named trees with attributes
Example content (the tags were lost in this transcript): One Fish Two Fish; John Meyer; Peter Smith; 7.95; Goodnight Moon; Margaret Brown; 10.55; ...
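Because the slide's markup did not survive, the following is only a minimal sketch of what such a document might look like. The element names (bib, book, title, author, price) are assumptions; only the text values come from the slide.

```python
import xml.etree.ElementTree as ET

# Assumed markup: tag names are guesses, only the text values are from the slide.
BIB_XML = """
<bib>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bib>
"""


def summarize(xml_text):
    """Return one (title, authors, price) tuple per book element."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("book"):
        title = book.findtext("title")
        authors = [author.text for author in book.findall("author")]
        price = float(book.findtext("price"))
        rows.append((title, authors, price))
    return rows


book_rows = summarize(BIB_XML)
```

The nesting (book contains title, author, price) is the "named tree" the slide refers to; the repeated author element shows that the tag set and cardinality are not fixed.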

4 Elements
Delimited by angle brackets
Identify the nature of the content they surround
Elements can be nested within another element
–A tree structure
An element may have attributes
–E.g. [example not preserved in this transcript]

5 Unit of Retrieval
Traditional IR
–Document
XML IR
–Element or fragment of an element

6 Example Retrieval Units
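To make the idea of element-level retrieval units concrete, here is a small sketch that lists every element of a document as a candidate unit. The sample article and its tag names (article, title, sec, p) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; the tag names are assumptions for illustration.
ARTICLE = (
    "<article><title>XML Retrieval</title>"
    "<sec><p>Elements are units.</p><p>So are fragments.</p></sec>"
    "</article>"
)


def retrieval_units(root):
    """Return (path, text) for every element; each one is a candidate unit."""
    units = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        # Collect all text below this element and normalize whitespace.
        text = " ".join(" ".join(elem.itertext()).split())
        units.append((here, text))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return units


units = retrieval_units(ET.fromstring(ARTICLE))
```

A traditional system would index only the first entry (the whole article); an XML retrieval system can return any of the five.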

7 Requirements for XML Retrieval
Basic needs for XML retrieval
–Query both data and metadata
–Express the query in a user-convenient way
–Return proper document fragments
–Rank the results according to their relevance

8 INEX
The Initiative for the Evaluation of XML Retrieval
–International, coordinated effort to promote evaluation procedures for content-based XML retrieval
–Provides a large test collection of XML documents (12,000 articles in IEEE CS publications since 1995)
–Introduces both content-only (CO) and content-and-structure (CAS) topics
–Designed to be a long-term initiative with workshops held on a yearly basis (currently in the second year)

9 INEX CO Topic example
semantic web
Research and business opportunities and challenges in developing and deploying the concept of the Semantic Web and the associated idea of web services.
To be relevant, a document/component must either discuss the technical issues and opportunities associated with the semantic web, or it must discuss the business challenges, especially the question of viable business models for web services.
semantic web, ontologies, SOAP, UDDI, RDF…

10 INEX CAS Topic example
//fig, //p, //ip1
Corba architecture //fgc
Figure Corba Architecture //p, //ip1
Find figures that describe the Corba architecture and the paragraphs that refer to those figures.
To be relevant a figure must describe the standard Corba architecture or a system architecture that relies heavily on Corba… Retrieved components would ideally contain both the figure and the paragraph referring to it.
CORBA Object Request Broker Architecture …

11 An Inverted Index for XML
[Figure: an element index and a text index built over a sample document whose tokens are numbered 1 to 23. The document text reads "Information Retrieval Using RDBMS", "Beyond Simple Translation", "Extension of IR Features".]
Element index
–(1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) …
–(1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) …
Text index
–(1, 3, 2) …
–(1, 4, 2) …
Terms shown in the text index: “retrieval”, “information”
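The numbers in the figure are consistent with postings of the form (document id, start:end position, nesting level) for elements and (document id, position, nesting level) for terms, where tags and words share one position counter. The sketch below rebuilds both indexes under that reading. The tag names section and title are assumptions, since the figure's labels were lost; the tokenizer is deliberately simple and ignores attributes.

```python
import re
from collections import defaultdict

# Assumed tag names ("section", "title"); the slide only preserves the numbers.
DOC = (
    "<section><title>Information Retrieval Using RDBMS</title>"
    "<section><title>Beyond Simple Translation</title>"
    "<section><title>Extension of IR Features</title>"
    "</section></section></section>"
)


def build_indexes(doc_id, xml_text):
    """Number every tag and word in document order and build both indexes.

    Element index: tag  -> [(doc_id, start, end, level)]
    Text index:    term -> [(doc_id, position, level)]
    """
    element_index = defaultdict(list)
    text_index = defaultdict(list)
    open_elements = []  # stack of (tag, start_position)
    position = 0
    for token in re.findall(r"</?\w+>|\w+", xml_text):
        position += 1
        if token.startswith("</"):
            tag, start = open_elements.pop()
            level = len(open_elements)  # depth of the element just closed
            element_index[tag].append((doc_id, start, position, level))
        elif token.startswith("<"):
            open_elements.append((token[1:-1], position))
        else:
            text_index[token.lower()].append((doc_id, position, len(open_elements)))
    for postings in element_index.values():
        postings.sort()
    return dict(element_index), dict(text_index)


def elements_containing(element_index, text_index, tag, term):
    """Elements of the given tag whose start:end interval encloses the term."""
    return [
        elem
        for elem in element_index.get(tag, [])
        for hit in text_index.get(term, [])
        if elem[0] == hit[0] and elem[1] < hit[1] < elem[2]
    ]


element_index, text_index = build_indexes(1, DOC)
titles_with_retrieval = elements_containing(element_index, text_index, "title", "retrieval")
```

Containment reduces to an interval check (start < position < end), which is why the start:end encoding is attractive for element-level retrieval.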

12 XPath
XPath is a non-XML language for identifying particular parts of XML documents
–Picking nodes and sets of nodes
Similar to Unix file system expressions
–“/people/person/name/first_name”
–“*” wildcard
–“..” parent
–“.” context node
–“//” descendants
–“@” attribute
–“[]” predicate, specifies a condition

13 XPath Example
chapter/heading
[Figure: document tree. Root "document" with attribute class="H.3.3"; children "author" (John Smith), "title" (XML Retrieval) and "chapter" elements; "section" elements below a chapter; headings "Introduction", "XML Query Language XQL", "Syntax", "Examples"; text "This..." and "We describe syntax of XQL".]

14 XPath Example
chapter//heading
[Figure: the same document tree as on slide 13.]

15 XPath Example
//chapter[heading]
[Figure: the same document tree as on slide 13.]

16 XPath Example
/document[@class="H.3.3" and author="John Smith"]
[Figure: the same document tree as on slide 13.]
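The four expressions on slides 13 to 16 can be tried out with Python's xml.etree.ElementTree, which implements a subset of XPath. The document below is an assumed reconstruction of the slides' tree (the exact nesting is a guess). ElementTree has no boolean operators and cannot address the root with an absolute path, so the last query uses chained predicates and a hypothetical wrapper element.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the tree in the figure; nesting details are a guess.
DOC = """
<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
    This...
  </chapter>
  <chapter>
    <heading>XML Query Language XQL</heading>
    <section>
      <heading>Examples</heading>
    </section>
    <section>
      <heading>Syntax</heading>
      We describe syntax of XQL
    </section>
  </chapter>
</document>
"""

document = ET.fromstring(DOC)


def texts(elements):
    return [elem.text for elem in elements]


# chapter/heading: headings that are direct children of a chapter.
child_headings = texts(document.findall("chapter/heading"))

# chapter//heading: headings at any depth below a chapter.
all_headings = texts(document.findall("chapter//heading"))

# //chapter[heading]: chapters that have a heading child.
chapters_with_heading = document.findall(".//chapter[heading]")

# Slide 16: both conditions on the document element must hold.
# ElementTree lacks boolean operators, so the conditions are chained predicates,
# and a wrapper element stands in for the node above the root.
wrapper = ET.Element("wrapper")
wrapper.append(document)
matching_documents = wrapper.findall(
    "document[@class='H.3.3'][author='John Smith']"
)
```

With this data, chapter/heading yields the two chapter headings, while chapter//heading also picks up the two section headings.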

17 More XPath Examples
//@id/..
–All the elements that have an attribute “id”
//middle_initial/../first_name
–All the first_name elements that are siblings of middle_initial elements
//person[profession='physicist']
–All person elements that have a profession child element with the value “physicist”
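A sketch of these three queries using xml.etree.ElementTree follows. The sample data is hypothetical; element and attribute names follow the expressions on the slide. ElementTree cannot select attribute nodes, so the first query is rewritten in the equivalent predicate form.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; names follow the slide's expressions.
PEOPLE = """
<people>
  <person id="p1">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
  </person>
  <person>
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>mathematician</profession>
  </person>
</people>
"""

people = ET.fromstring(PEOPLE)

# //@id/.. : every element carrying an "id" attribute (predicate form).
with_id = people.findall(".//*[@id]")

# //middle_initial/../first_name : first_name siblings of a middle_initial.
siblings = people.findall(".//middle_initial/../first_name")

# //person[profession='physicist'] : persons whose profession child is "physicist".
physicists = people.findall(".//person[profession='physicist']")
```

Only the first person has a middle_initial, so the sibling query returns a single first_name even though both persons have one.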

18 XQuery
A language to query data that is similar to XML in structure
–Nested, named trees with attributes
Based on XPath
FOR/LET PathExpression
WHERE AdditionalSelectionCriteria
RETURN ResultConstruction

19 XQuery Example
Find the name(s) of customers who have ordered the part whose part_id is "xx"
FOR $c IN customers
FOR $o IN orders
WHERE $c.cust_id=$o.cust_id AND $o.part_id="xx"
RETURN $c.name
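The query reads as a nested loop with a filter, which maps directly onto a Python comprehension. The records below are hypothetical sample data; only the field names (cust_id, part_id, name) come from the slide.

```python
# Hypothetical sample data; field names follow the slide's query.
customers = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Grace"},
    {"cust_id": 3, "name": "Linus"},
]
orders = [
    {"cust_id": 1, "part_id": "xx"},
    {"cust_id": 2, "part_id": "yy"},
    {"cust_id": 3, "part_id": "xx"},
    {"cust_id": 1, "part_id": "xx"},
]


def customers_who_ordered(part_id):
    """FOR each customer, FOR each order, WHERE they join and the part matches, RETURN the name."""
    return [
        c["name"]
        for c in customers          # FOR $c IN customers
        for o in orders             # FOR $o IN orders
        if c["cust_id"] == o["cust_id"] and o["part_id"] == part_id  # WHERE
    ]                               # RETURN $c.name


names = customers_who_ordered("xx")
```

As written, the query returns one name per matching (customer, order) pair, so a customer with two orders for the part appears twice; removing duplicates would be a separate step.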

20 Another XQuery Example
Find titles and prices of books by 'Meyer' or 'Smith'
FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN $b/title $b/price

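21 One Document Structure
Previous XQuery works
[Figure: tree with root "bookinfo" and two "book" children. First book: title "Just Lost", authors "Mercy Meyer" and "Gina Meyer", price $5.75. Second book: "title" and "price" children, price $13.95; remaining labels read "Brown Hedi".]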

22 Another Document Structure
Same XQuery doesn’t work
[Figure: tree with root "bookinfo" whose children are "author" elements, each with a "name" and nested "book" elements. Labels: names "Dr. Meyer" and "M. Brown"; books "Goodnight Moon", "One Fish Two Fish" ($12.50) and "Cat in the Hat" ($14.95).]
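The structure dependence can be reproduced in a few lines. Both documents below are assumed reconstructions of the two figures (which book belongs to which author, and the second book's title "Brown Bear", are guesses). The function mirrors the slide-20 query: it looks at each book and tests its author children.

```python
import xml.etree.ElementTree as ET

# Assumed reconstructions of the two figures; details are guesses.
BOOK_CENTRIC = """
<bookinfo>
  <book>
    <title>Just Lost</title>
    <author>Mercy Meyer</author>
    <author>Gina Meyer</author>
    <price>$5.75</price>
  </book>
  <book>
    <title>Brown Bear</title>
    <price>$13.95</price>
  </book>
</bookinfo>
"""

AUTHOR_CENTRIC = """
<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
    <book><title>Cat in the Hat</title><price>$14.95</price></book>
  </author>
  <author>
    <name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>
"""


def books_by(xml_text, *names):
    """For each book at any depth, keep it if an author child contains a name."""
    root = ET.fromstring(xml_text)
    hits = []
    for book in root.iter("book"):
        authors = [a.text or "" for a in book.findall("author")]
        if any(name in author for author in authors for name in names):
            hits.append((book.findtext("title"), book.findtext("price")))
    return hits


works = books_by(BOOK_CENTRIC, "Meyer", "Smith")
fails = books_by(AUTHOR_CENTRIC, "Meyer", "Smith")
```

In the second layout the author is the parent of the book rather than its child, so the same path finds nothing even though Dr. Meyer's books are present.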

23 Problem with XQuery
Requires knowledge of document structure
Dependent on document structure
Difficult for naive users
Needs extensions to solve the problem
Still in active research

24 Don’t know the tags?
Integrating with full-text keyword search
Automatically identifying tag names
Translate query terms to tag names
Query expansion

25 Don’t know the structure?
Schema-free XQuery
–Automatically identifying the minimum, meaningful set of nodes that can provide the answer
[Figure: a "bookinfo" tree with "book", "title", "name" and "price" nodes. Labels: "Just Lost", "Mercy Meyer", "Gina Meyer", $5.75, "Brown Bear", $13.95.]
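One way to picture "minimum, meaningful set of nodes" is through lowest common ancestors: a title is treated as related to a keyword match only if neither node has a closer partner of the other kind. The sketch below is a simplified illustration of that idea under stated assumptions, not the actual Schema-free XQuery algorithm. Both sample documents and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed reconstructions of two "bookinfo" layouts; details are guesses.
BOOK_CENTRIC = """<bookinfo>
  <book><title>Just Lost</title><author>Mercy Meyer</author>
        <author>Gina Meyer</author><price>$5.75</price></book>
  <book><title>Brown Bear</title><price>$13.95</price></book>
</bookinfo>"""

AUTHOR_CENTRIC = """<bookinfo>
  <author><name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
    <book><title>Cat in the Hat</title><price>$14.95</price></book></author>
  <author><name>M. Brown</name>
    <book><title>Goodnight Moon</title></book></author>
</bookinfo>"""


def root_paths(root):
    """Map each element to the tuple of elements from the root down to it."""
    paths = {root: (root,)}
    for parent in root.iter():  # document order, so parents come first
        for child in parent:
            paths[child] = paths[parent] + (child,)
    return paths


def shared_depth(path_a, path_b):
    """Number of leading elements two root paths share (depth of their LCA)."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a is not b:
            break
        depth += 1
    return depth


def titles_related_to(xml_text, keyword, target_tag="title"):
    """Titles whose lowest common ancestor with a keyword match is as deep as possible."""
    root = ET.fromstring(xml_text)
    paths = root_paths(root)
    targets = list(root.iter(target_tag))
    found = []
    for match in root.iter():
        if match.tag == target_tag or keyword not in (match.text or ""):
            continue
        peers = list(root.iter(match.tag))  # all nodes tagged like the match
        for target in targets:
            depth = shared_depth(paths[target], paths[match])
            closest_for_target = max(shared_depth(paths[target], paths[p]) for p in peers)
            closest_for_match = max(shared_depth(paths[t], paths[match]) for t in targets)
            if depth == closest_for_target == closest_for_match and target.text not in found:
                found.append(target.text)
    return found


book_centric_titles = titles_related_to(BOOK_CENTRIC, "Meyer")
author_centric_titles = titles_related_to(AUTHOR_CENTRIC, "Meyer")
```

The same call works on both layouts because it never names a path between author and book; it relies only on which nodes sit closest together in the tree.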

26 Querying XML with Natural Language
Translate natural language query to Schema-free XQuery
NaLIX demo

27 Relevance Scoring
Query: articles about “search engine”

28 TermJoin
A user-defined score function generates the score based on term occurrences and other information
They are then joined
[Figure: elements annotated with score = 1, score = 2, score = 4, score = 5.]
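As an illustration of a user-defined score function, the sketch below scores every element by the number of query-term occurrences in its subtree and ranks the elements. This is an assumed, deliberately naive scoring rule and does not reproduce the TermJoin algorithm itself; the sample article and its tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; the scoring rule is an assumption, not TermJoin itself.
ARTICLE = """
<article>
  <title>Building a search engine</title>
  <sec>
    <p>A search engine crawls and indexes pages.</p>
    <p>Ranking is what makes an engine useful.</p>
  </sec>
  <sec>
    <p>XML adds structure to documents.</p>
  </sec>
</article>
"""


def score_elements(root, terms):
    """Score each element by counting query-term occurrences in its subtree."""
    scores = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        words = " ".join(elem.itertext()).lower().replace(".", " ").split()
        score = sum(words.count(term) for term in terms)
        scores.append((score, here))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return scores


def ranked(root, terms):
    """Elements with a non-zero score, best first (ties broken by path)."""
    hits = [s for s in score_elements(root, terms) if s[0] > 0]
    return sorted(hits, key=lambda s: (-s[0], s[1]))


ranking = ranked(ET.fromstring(ARTICLE), ["search", "engine"])
```

With raw counts the whole article always scores highest, because it contains every occurrence; this is one reason a practical score function would also use other information, such as element length.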

