WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11.


Overview. Monday: XML; Clustering 1. Tuesday: Clustering 2; Clustering 3, Interactive Retrieval. Wednesday: Classification 1; Classification 2. Thursday: Classification 3; Information Extraction. Friday: Bioinformatics; Projects. Joker: Active learning in Text Mining.

Today’s Topics: quick XML intro; XML indexing and search; database approach; XQuery; two IR approaches.

What is XML? eXtensible Markup Language. A framework for defining markup languages. No fixed collection of markup tags. Each XML language is targeted at a particular application. All XML languages share features, which enables building generic tools.

Basic Structure: An XML document is an ordered, labeled tree. Character data at the leaf nodes contains the actual data (text strings). Element nodes are each labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; element nodes can have child nodes.
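The tree model above can be made concrete with a small sketch using Python's standard `xml.etree.ElementTree`. The sample document and the `describe` helper are invented for illustration and are not part of the lecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
doc = '<chapter id="c1"><title>FileCab</title><para>Manage the application.</para></chapter>'

root = ET.fromstring(doc)

def describe(node, depth=0):
    """Return one line per element: name, attributes, and leaf text."""
    text = (node.text or "").strip()
    lines = ["  " * depth + f"{node.tag} {node.attrib} {text!r}"]
    for child in node:  # children are kept in document order
        lines.extend(describe(child, depth + 1))
    return lines

tree_lines = describe(root)
print("\n".join(tree_lines))
```

Each element node carries a name (`tag`), a set of attributes (`attrib`), and ordered children; the character data sits at the leaves.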

XML Example

FileCab This chapter describes the commands that manage the FileCab inet application.

Elements: Elements are denoted by markup tags, e.g. <foo attr1="…">thetext</foo>. Element start tag: <foo>. Attribute: attr1. The character data: thetext. Matching element end tag: </foo>.

XML vs HTML Relationship?

XML vs HTML: HTML is a markup language for a specific purpose (display in browsers). XML is a framework for defining markup languages. HTML can be formalized as an XML language (XHTML). XML defines logical structure only. HTML: same intention, but has evolved into a presentation language.

XML: Design Goals. Separate syntax from semantics to provide a common framework for structuring information. Allow tailor-made markup for any imaginable application domain. Support internationalization (Unicode) and platform independence. Be the future of (semi)structured information (do some of the work now done by databases).

Why Use XML? Represent semi-structured data (data that are structured, but don’t fit the relational model). XML is more flexible than DBs. XML is more structured than simple IR. You get a massive infrastructure for free.

Applications of XML: XHTML; CML (chemical markup language); WML (wireless markup language); ThML (theological markup language). ThML example: Having a Humble Opinion of Self. EVERY man naturally desires knowledge (Aristotle, Metaphysics, i. 1.); but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars (Augustine, Confessions V. 4.).

XML Schemas: Schema = syntax definition of an XML language. Schema language = formal language for expressing XML schemas. Examples: DTD, XML Schema (W3C). Relevance for XML information retrieval: our job is much easier if we have a (one) schema.

XML Tutorial: tml (Anders Møller and Michael Schwartzbach). Previous (and some following) slides are based on their tutorial.

XML Indexing and Search

Native XML Database: Uses the XML document as the logical unit. Should support: elements, attributes, PCDATA (parsed character data), document order. Contrast with: a DB modified for XML; a generic IR system modified for XML.

XML Indexing and Search: Most native XML databases have taken a DB approach: exact match, evaluation of path expressions, no IR-type relevance ranking. Only a few focus on relevance ranking. Many types of XML don’t need relevance ranking. If there is a lot of text data, relevance ranking is usually needed.

Timber: DB extension for XML. DB: search tuples. Timber: search trees. Main focus: complex and variable structure of trees (vs. tuples); ordering. Non-native XML database, without relevance ranking and without “IR-type” handling of text.

Three Native XML Databases: ToXin, XIRQL, IBM Haifa system.

ToXin: Exploits overall path structure. Supports any general path query. Query evaluation in three stages: preselection stage, selection stage, postselection stage.

ToXin: Motivation. Strawman (DataGuides): index all paths occurring in the database. Sufficient for simple queries (find all authors with last name Smith). Does not allow backward navigation. Example query: find all the titles of articles authored by Smith.

Query Evaluation Stages for Backward Navigation. Pre-selection: first navigation down the tree. Selection: value selection according to filter. Post-selection: navigation up and down again.
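As a rough illustration of the three stages, the sketch below answers the "titles of articles authored by Smith" query over a small in-memory tree. It does not reproduce ToXin's path index; the sample document, element names, and the `parent` map are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography, invented for illustration.
doc = """
<bib>
  <article><title>XML Indexing</title><author><last>Smith</last></author></article>
  <article><title>Web Mining</title><author><last>Jones</last></author></article>
  <article><title>Path Queries</title><author><last>Smith</last></author></article>
</bib>
"""
root = ET.fromstring(doc)

# ElementTree keeps no parent pointers, so build a child -> parent map
# to make backward (upward) navigation possible.
parent = {child: node for node in root.iter() for child in node}

# Pre-selection: navigate down the tree to the candidate value nodes.
candidates = root.findall("./article/author/last")

# Selection: keep only the nodes whose value satisfies the filter.
selected = [n for n in candidates if (n.text or "").strip() == "Smith"]

# Post-selection: navigate up to the article, then down again to its title.
titles = []
for last in selected:
    article = parent[parent[last]]  # last -> author -> article
    titles.append(article.findtext("title"))

print(titles)
```

A plain forward path index can serve the first two stages; the upward step in the third stage is what requires backward navigation.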

ToXin

Evaluation: Factors Impacting Performance. Data source (collection) specific: document size; number of XML nodes and values; path complexity (degree of nesting); average value size. Query specific: selectiveness of path constraint; size of query answer; number of elements selected by filter.

Test Collections

Query Classification

Evaluation

ToXin: Summary. Efficient system supporting structured queries. All paths are indexed (not just from root). Path index linear in corpus size. Shortcomings: order of nodes ignored; no IR-type relevance.

IR/Relevance Ranking for XML Why is this difficult?

IR XML Challenge 1: Term Statistics. There is no document unit in XML. How do we compute tf and idf? Global tf/idf over all text contexts is problematic. Consider a medical collection: “new” is not a discriminative term in general, but it is very discriminative for journal titles (New England Journal of Medicine).
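The "new" example can be checked on toy data. The sketch below compares a single global idf with an idf computed separately per context; the records, the context paths, and the plain log(N/df) formula are assumptions chosen for illustration, not the lecture's data or weighting.

```python
import math

# Hypothetical toy collection: each record is (context path, text).
records = [
    ("/article/journal", "New England Journal of Medicine"),
    ("/article/journal", "Lancet"),
    ("/article/journal", "Nature Medicine"),
    ("/article/journal", "British Medical Journal"),
    ("/article/abstract", "a new treatment for asthma"),
    ("/article/abstract", "new results on a vaccine"),
    ("/article/abstract", "a new study of new drugs"),
    ("/article/abstract", "new evidence on diet"),
]

def idf(term, texts):
    """Plain log(N / df), treating each text as one indexing unit."""
    n = len(texts)
    df = sum(1 for t in texts if term in t.lower().split())
    return math.log(n / df) if df else float("inf")

all_texts = [t for _, t in records]
journal_texts = [t for c, t in records if c == "/article/journal"]
abstract_texts = [t for c, t in records if c == "/article/abstract"]

global_idf = idf("new", all_texts)         # one statistic for every context
journal_idf = idf("new", journal_texts)    # "new" is rare among journal titles
abstract_idf = idf("new", abstract_texts)  # "new" occurs in every abstract

print(global_idf, journal_idf, abstract_idf)
```

In this toy collection the global value sits between the two per-context values, so it understates how discriminative "new" is for journal titles and overstates it for abstracts.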

IR XML Challenge 2: Fragments. Which fragments are legitimate to return? Paragraph, abstract, title? Bold, italic? IR systems don’t store content (only an index), so they need to go to the document to display a fragment. This is problematic if the fragment is not simply a node.

Remainder of Lecture: Queries for semi-structured text and how they differ from regular IR queries; XQuery; two XML search systems with relevance ranking (XIRQL, IBM Haifa system).

Types of (Semi)Structured Queries. Location/position (“chapter no.3”). Simple attribute/value: /play/title contains “hamlet”. Path queries: title contains “hamlet”; /play//title contains “hamlet”. Complex graphs: employees with two managers. All of the above: mixed structure/content.

XPath: Declarative language for addressing (used in XLink/XPointer and in XSLT) and pattern matching (used in XSLT and in XQuery). Location path: a sequence of location steps separated by /. Example: child::section[position()<6] / descendant::cite / attribute::href

Axes in XPath: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self.

Location steps. A single location step has the form: axis :: node-test [ predicate ]. The axis selects a set of candidate nodes (e.g. the child nodes of the context node). The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction, etc.) or names (e.g. element name). The predicates (zero or more) cause a further, more complex filtration. Example: child::section[position()<6]
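To make the axis / node-test / predicate split concrete, the sketch below evaluates the location path child::section[position()<6] / descendant::cite / attribute::href by hand. Python's `ElementTree` supports only a subset of XPath, so each step is emulated explicitly; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with six sections; only the first five should match.
doc = """
<chapter>
  <section><para>See <cite href="a.html">A</cite>.</para></section>
  <section><cite href="b.html">B</cite></section>
  <section/>
  <section/>
  <section/>
  <section><cite href="f.html">F</cite></section>
</chapter>
"""
context = ET.fromstring(doc)

# Step 1: child::section[position()<6]
#   axis = child, node-test = section, predicate = position() < 6
sections = [c for c in context if c.tag == "section"][:5]

hrefs = []
for section in sections:
    # Step 2: descendant::cite (any depth below the section)
    for cite in section.iter("cite"):
        # Step 3: attribute::href
        if "href" in cite.attrib:
            hrefs.append(cite.get("href"))

print(hrefs)
```

The sixth section is dropped by the position predicate, so its citation does not appear in the result.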

XQuery: SQL for XML. Usage scenarios: human-readable documents; data-oriented documents; mixed documents (e.g., patient records). Based on XPath.

XQuery Expressions: path expressions; element constructors; list expressions; conditional expressions; quantified expressions; datatype expressions.

FLWR Expressions: FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml")//book[publisher = $p] WHERE count($b) > 100 RETURN $p. FOR generates an ordered list of bindings of publisher names to $p. LET associates to each binding a further binding of the list of book elements with that publisher to $b. At this stage, we have an ordered list of tuples of bindings: ($p,$b). WHERE filters that list to retain only the desired tuples. RETURN constructs for each tuple a resulting value.

XQuery vs SQL: Order matters! document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"]. XQuery is Turing complete, SQL is not.
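The FOR / LET / WHERE / RETURN pipeline can be mimicked in Python as a rough sketch. The sample data and the function name `publishers_with_more_than` are invented; the sketch also binds each distinct publisher name once, which is a simplification of the query as written on the slide, and it uses a small threshold instead of 100 so that the toy data produces a result.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
doc = """<bib>
  <book><title>T1</title><publisher>Addison-Wesley</publisher></book>
  <book><title>T2</title><publisher>Addison-Wesley</publisher></book>
  <book><title>T3</title><publisher>Morgan Kaufmann</publisher></book>
</bib>"""
bib = ET.fromstring(doc)

def publishers_with_more_than(bib, threshold):
    result = []
    # FOR: ordered bindings of $p (distinct names, first-occurrence order).
    names = list(dict.fromkeys(p.text for p in bib.iter("publisher")))
    for p in names:
        # LET: bind $b to the list of books with that publisher.
        b = [book for book in bib.iter("book") if book.findtext("publisher") == p]
        # WHERE: keep only the tuples ($p, $b) that pass the filter.
        if len(b) > threshold:
            # RETURN: construct the result value for this tuple.
            result.append(p)
    return result

print(publishers_with_more_than(bib, 1))
```

With the slide's threshold of 100 the toy data returns an empty list, since no publisher has that many books.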

XQuery Example Møller and Schwartzbach

XQuery 1.0 Standard on Order: Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model. … if a node in document A is before a node in document B, then every node in document A is before every node in document B. This structure-oriented ordering can have undesirable effects. Example: Medline.
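Document order within a single document is easy to see in code: `Element.iter()` in Python's `ElementTree` walks the tree depth-first, left to right. The tiny document below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a has children b and e; b has c and d; e has f.
doc = "<a><b><c/><d/></b><e><f/></e></a>"
root = ET.fromstring(doc)

# iter() yields the element itself and then its descendants in document order.
order = [node.tag for node in root.iter()]
position = {tag: i for i, tag in enumerate(order)}

def comes_before(x, y):
    """True if element x precedes element y in document order."""
    return position[x] < position[y]

print(order)
```

Here `d` precedes `e` although `e` is closer to the root, which shows that the ordering follows structure and position only and carries no notion of relevance.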

Document collection = 100s of XML docs, each with thousands of abstracts <!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" " edline_ dtd"> some content

Document collection = 100s of XML docs, each with thousands of abstracts <!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" " edline_ dtd"> (content) …

How XQuery makes ranking difficult All documents in collection A must be ranked before all documents in collection B. Fragments must be ordered in depth-first, left-to-right order.

Semi-Structured Queries: More complex than “unstructured” queries. XQuery standard.

XIRQL: University of Dortmund. Goal: open source XML search engine. Motivation: “returnable” fragments are special (“atomic units”), e.g., don’t return a “some text” fragment; the Structured Document Retrieval Principle; empower users who don’t know the schema (enable search for any person_name no matter how the schema refers to it).

Atomic Units: Specified in the schema. Only atomic units can be returned as the result of a search (unless a unit is specified). Tf.idf weighting is applied to atomic units. Probabilistic combination of “evidence” from atomic units.

XIRQL Indexing

Structured Document Retrieval Principle: A system should always retrieve the most specific part of a document answering a query. Example query: xql. Document (term weights): 0.3 XQL; 0.5 example; 0.8 XQL, 0.7 syntax. -> Return section, not chapter.

Augmentation weights: Ensure that the Structured Document Retrieval Principle is respected. Assume different query conditions are disjoint events -> independence. P(XQL|chapter-N) = P(XQL|chapter-F) + P(sec.|chapter-N)*P(XQL|sec.) - P(XQL|chapter-F)*P(sec.|chapter-N)*P(XQL|sec.) = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636. P(XQL|sec.) = 0.8 > 0.636 = P(XQL|chapter-N), so the section is ranked ahead of the chapter.
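The arithmetic can be verified with a short sketch. It takes 0.3 as the chapter's own weight for XQL (from the example document), 0.6 as the augmentation weight, and 0.8 as the section's weight; the function name `augmented_weight` is invented for this example.

```python
def augmented_weight(own, augmentation, child):
    """Combine a node's own weight with a down-weighted child weight,
    treating the two pieces of evidence as independent events:
    P(A or B) = P(A) + P(B) - P(A) * P(B)."""
    propagated = augmentation * child
    return own + propagated - own * propagated

p_section = 0.8                                # P(XQL|sec.)
p_chapter = augmented_weight(0.3, 0.6, 0.8)    # P(XQL|chapter-N)

print(round(p_chapter, 3))
```

Because the augmentation weight is below 1, evidence loses strength as it propagates upward, so the more specific section outranks the enclosing chapter.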

Datatypes. Example: person_name. Assign all elements and attributes with person semantics to this datatype. Allow the user to search for “person” without specifying a path.

XIRQL: Summary. Relevance ranking; fragment/context selection; datatypes (person_name); probabilistic combination of evidence.

IBM Haifa Approach: Reject XQuery. Willing to give up some expressiveness: no joins and no backward navigation (e.g., “find all the titles of articles authored by Smith”). Simpler and more efficient approach. Represent queries as XML fragments.

Query Examples

Extended Weighting Formula: Direct extension of tf.idf; cr = context resemblance measure.

Context Resemblance Measures. Flat: Perfect match: cr(ci,cj) := 1 if i == j, cr(ci,cj) := 0 otherwise. Partial match: cr(ci,cj) := (1+|ci|)/(1+|cj|) if ci is a subsequence of cj, cr(ci,cj) := 0 otherwise. Fuzzy match: for example, string similarity of paths (example?). Ignore context: cr(ci,cj) := 1 in all cases.
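The measures that come with explicit formulas can be sketched as follows. Contexts are assumed here to be slash-separated paths, with |c| taken as the number of path steps; that representation and the function names are assumptions made for illustration. The fuzzy variant is omitted because the slide gives no formula for it.

```python
def steps(path):
    """Split a context such as '/country/capital' into its path steps."""
    return [s for s in path.split("/") if s]

def is_subsequence(short, long):
    """True if the steps of `short` occur in `long` in the same order."""
    it = iter(long)
    return all(step in it for step in short)

def cr_perfect(ci, cj):
    """Perfect match: 1 only when the two contexts are identical."""
    return 1.0 if steps(ci) == steps(cj) else 0.0

def cr_partial(ci, cj):
    """Partial match: (1 + |ci|) / (1 + |cj|) if ci is a subsequence of cj."""
    a, b = steps(ci), steps(cj)
    return (1 + len(a)) / (1 + len(b)) if is_subsequence(a, b) else 0.0

def cr_ignore_context(ci, cj):
    """Ignore context: every pair of contexts matches fully."""
    return 1.0

print(cr_partial("/country/capital", "/country/state/capital"))
```

For example, a query context of /country/capital against a document context of /country/state/capital scores (1+2)/(1+3) under partial match and 0 under perfect match.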

Implementation. Indexing: index term/context pairs, t#c. Example: istambul#/country/capital. Retrieval: fetch all contexts of a term.

Weighting. IDF per context: compute inverse document frequency for each context separately. Problem: not enough data. Global IDF: compute a single global idf weighting. Merge-idf: compute idf for context ci by looking at all contexts with similarity > 0. Merge-all: compute tf in analogy to merge-idf.
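A minimal sketch of the t#c indexing idea, assuming the context of a term is the path of element names from the root down to the element that contains it. The documents, the helper names, and the linear scan over keys are all illustrative choices; a real index would look up a term's contexts without scanning.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents, invented for illustration.
docs = {
    "d1": "<country><name>Turkey</name><capital>Ankara</capital></country>",
    "d2": "<trip><destination>Ankara</destination></trip>",
}

def index_documents(docs):
    """Map each 'term#context' key to the set of documents containing it."""
    index = defaultdict(set)

    def walk(node, path, doc_id):
        context = f"{path}/{node.tag}"
        # Only text directly inside the element is indexed in this sketch.
        for term in (node.text or "").lower().split():
            index[f"{term}#{context}"].add(doc_id)
        for child in node:
            walk(child, context, doc_id)

    for doc_id, xml in docs.items():
        walk(ET.fromstring(xml), "", doc_id)
    return index

def contexts_of(index, term):
    """Retrieval step: fetch all contexts in which a term occurs."""
    prefix = term + "#"
    return sorted(key[len(prefix):] for key in index if key.startswith(prefix))

index = index_documents(docs)
print(contexts_of(index, "ankara"))
```

Once all contexts of a query term are fetched, each one can be compared with the query context using a context resemblance measure.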

Results

Discussion. Flat is best, but it depends on the query; the average hides individual differences. Which queries will flat do well on? Semantics of XML structure. Best case: XML structure corresponds to the unit/subunit structure of documents. Worst case (except for flat): semantics of terms differ in different structural units.

IBM Haifa: Summary. Goal: information discovery (vs. data exchange, data access via API, etc.). Queries are XML fragments; no separate query language. One of the best performers in the INEX bakeoff. Extension of vector space. Works well for: specific context, vague information need. Doesn’t work well for: non-specific context, DB-type information need.

XML Summary. DB approach: good for DB-type queries, but no relevance ranking and no ordering. Why you can’t use a standard IR engine: term statistics / indexing granularity; issues with fragments (granularity, coherence …). Different approaches to relevance-ranked XML IR.

XML IR challenge: Schemas. Ideally: there is one schema and the user understands it. In practice this is rare: many schemas; schemas not known in advance; schemas change; users don’t understand schemas. Need to identify similar elements in different schemas. Example: employee.

XML IR challenge: UI. Help the user find relevant nodes in the schema (author, editor, contributor, “from:”/sender). What is the query language you expose to the user? XQuery? No. Forms? Parametric search? A textbox? In general: design a layer between XML and the user.

Project Suggestions: XML information retrieval using Lucene. Address some of the XML IR challenges: automatic creation of datatypes; weighting; others?

Resources: xquery full text requirements. Other approaches: ons/2003/sigmod2003-xrank.pdf; proceedings/DelNoe01/22_Schlieder.pdf. Xml classification.