Extracting Relations from XML Documents
C. T. Howard Ho, Joerg Gerhardt, Eugene Agichtein*, Vanja Josifovski
IBM Almaden and Columbia University*


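2 Extraction for Data Integration: Motivating Example
[Figure: two XML schema trees and a target table. Native Schema labels: Products, books, music, video, item, booktitle, author, publisher, ISBN, price. External Schema labels: Publications, book, title, author, publisher, ISBN, price. Target table columns: ISBN, Title, Author, Publisher, Price.]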

3 Why Extract Data from XML?
– XML query processing is still in development and not yet as fast as an RDBMS
– Relational query processing is still the standard for many business applications
– Extracting into one relational schema avoids the overhead of runtime XML data integration
– Extracted relations are best exploited for relatively static data (e.g., product catalogs)

4 Related Work
– XTRACT (induces DTDs)
– Lore/DataGuides
– HTML wrappers (LixTo, RoadRunner, WHISK, STALKER, …)
– Plain-text information extraction (Proteus, Snowball, Rapier)
– Supervised/assisted XML schema mapping (e.g., Clio)

5 Outline
– Motivation
– Problem statement
– XMLMiner approach
– Training XMLMiner
– Extraction from new documents
– Some observations from the prototype
– Summary

6 Problem Statement
Given a target flat relation R, extract information for the tuples in R from XML (or HTML) documents, with potentially significant variations in schema.
Problems with current integration/extraction approaches:
– Hard-coding the rules/queries requires significant effort, and the resulting rules can be brittle
– An XML Schema or DTD is not always provided

7 XMLMiner Approach
– Learn signatures from example XML documents
– Represent document structure while maintaining flexibility (to allow schema variations)
– Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node (the subtree may contain more detail about the tuple than is needed)
– Represent input document nodes as vectors, then find the one closest (i.e., most similar) to the learned instance node vector
– Use labels and data values to map children of the instance node to target tuple attributes

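8 XMLMiner Architecture: Training and Extraction
[Figure: architecture diagram of the training and extraction stages; the only surviving label is “Canonical Tree”.]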

9 High Level Description
Training:
– Each XML document is merged/split into a schema-like tree, called the canonical tree
– The user identifies the attribute nodes (under the instance node) that correspond to the target tuple attributes
– The system derives the instance node in the tree
– The system builds a model for the structure of the tuple and of each attribute
Extracting:
– Apply the model to find the most likely instance node and attribute nodes in the new XML documents

10 Training Stage I: Create Canonical Tree for each Example Document

11 Canonical Form Conversion Example: Merging Similar Nodes
– Merge all siblings with the same label (e.g., Item → Item*)
– Intuition: siblings with the same label represent “similar” entities
[Figure: original document structure next to the “merged” document.]
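
The merge step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XMLMiner implementation: the `Node` class, the `merge` and `show` helpers, and the sample tree are hypothetical, and the sketch assumes that the children of merged siblings are pooled into one child list. The split step for heterogeneous nodes is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled tree node; `children` keeps document order."""
    label: str
    children: list["Node"] = field(default_factory=list)


def merge(instances: list[Node]) -> Node:
    """Merge nodes that share a label into one canonical node.

    Assumption: children of all merged instances are pooled. A child label
    is starred (Item -> Item*) only if it repeats under a single parent.
    """
    groups: dict[str, list[Node]] = {}  # child label -> all child instances
    repeated: set[str] = set()          # labels repeated under one parent
    for inst in instances:
        seen: set[str] = set()
        for child in inst.children:
            groups.setdefault(child.label, []).append(child)
            if child.label in seen:
                repeated.add(child.label)
            seen.add(child.label)

    children = []
    for child_label, members in groups.items():
        merged = merge(members)  # recurse so merging happens at every depth
        if child_label in repeated:
            merged.label += "*"
        children.append(merged)
    return Node(instances[0].label, children)


def show(node: Node) -> str:
    """Render a tree as nested parentheses, e.g. 'a(b,c)'."""
    if not node.children:
        return node.label
    return f"{node.label}({','.join(show(c) for c in node.children)})"


# Hypothetical document: two Item siblings with partly different children.
doc = Node("Books", [
    Node("Item", [Node("Title"), Node("Author")]),
    Node("Item", [Node("Title"), Node("ISBN")]),
    Node("Category_Desc"),
])
canonical = merge([doc])
print(show(canonical))  # Books(Item*(Title,Author,ISBN),Category_Desc)
```

Pooling the children means a label that appears under only some Item nodes (here ISBN) still shows up once in the merged tree.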

12 Example: Split Heterogeneous Nodes → Canonical Form
[Figure: the resulting canonical tree.]

13 Training Stage I Result: Canonical Tree
[Figure: the original document next to its canonical form.]

14 Training Stage II: Generate Instance Node Signatures
Features used to create the signature of an instance node I (Item) in the canonical tree:
– A: ancestors of I
– S: siblings of I
– C: descendants of I
– I: self (the tag of I)
Siblings and ancestors → position of I in the document
Descendants → internal structure of I

15 Training Stage (cont.): Example Instance Node Signature
Signature (A, S, C, I) for Item:
[ A: { “Products”, “Books” },
  S: { “Category_Desc” },
  C: { “Title”, “Author”, “Publisher”, “New”, “Used”, “ISBN”, “Price”, “Num_Copies” },
  I: { “Item” } ]
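
As a rough sketch of how such a signature could be read off a canonical tree, the Python below collects the four label sets for a chosen node. The `Node` class, the helper names, and the exact shape of the sample tree are assumptions made for illustration; the tree is simply one arrangement that is consistent with the example signature.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def descendants(node: Node) -> set[str]:
    """Labels of every node strictly below `node`."""
    labels: set[str] = set()
    for child in node.children:
        labels.add(child.label)
        labels |= descendants(child)
    return labels


def signature(root: Node, target: str) -> Optional[dict[str, set[str]]]:
    """(A, S, C, I) label sets for the first node labeled `target`, or None."""

    def walk(node: Node, ancestors: list[str], siblings: list[str]):
        if node.label == target:
            return {
                "A": set(ancestors),     # labels on the path from the root
                "S": set(siblings),      # labels of the other children of the parent
                "C": descendants(node),  # labels anywhere below the node
                "I": {node.label},       # the node's own tag
            }
        for child in node.children:
            others = [c.label for c in node.children if c is not child]
            found = walk(child, ancestors + [node.label], others)
            if found is not None:
                return found
        return None

    return walk(root, [], [])


# Hypothetical canonical tree, chosen to be consistent with the example.
tree = Node("Products", [Node("Books", [
    Node("Category_Desc"),
    Node("Item", [
        Node("Title"),
        Node("Author"),
        Node("Publisher", [
            Node("New", [Node("ISBN"), Node("Price"), Node("Num_Copies")]),
            Node("Used"),
        ]),
    ]),
])])

sig = signature(tree, "Item")
print(sig)
```

The sets here are unweighted; turning them into TF*IDF-weighted vectors would be a separate step.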

16 Signature Similarity
– Vector space model, TF*IDF weights for terms
– Incorporates structure (similarity-by-region)
S_X: [ A: { “Products”:1 },
       S: { “Music”:0.33, “Video”:0.33 },
       C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “New”:0.2, “Used”:0.2, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
       I: { “Item” } ]
S_Y: [ A: { “Products”:1, “Books”:0.5 },
       S: { “CDs”:0.5 },
       C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
       I: { “Book” } ]
$\mathrm{Similarity}(S_X, S_Y) = S_X.A \cdot S_Y.A + S_X.S \cdot S_Y.S + S_X.C \cdot S_Y.C + S_X.I \cdot S_Y.I$
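
To make the region-by-region sum concrete, here is a minimal Python sketch that evaluates it on the two example signatures. It assumes each product in the formula is a plain dot product of sparse term-weight vectors with no length normalization, and that the self region carries weight 1.0, since the slide gives no weight there. The function names are illustrative only.

```python
REGIONS = ("A", "S", "C", "I")


def dot(u: dict[str, float], v: dict[str, float]) -> float:
    """Dot product of two sparse term-weight vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def similarity(sx: dict, sy: dict) -> float:
    """Sum of per-region dot products (ancestors, siblings, descendants, self)."""
    return sum(dot(sx[r], sy[r]) for r in REGIONS)


# Weights copied from the example; the 1.0 for the self tag is an assumption.
S_X = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},
}
S_Y = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},
}

per_region = {r: dot(S_X[r], S_Y[r]) for r in REGIONS}
total = similarity(S_X, S_Y)
print(per_region, round(total, 4))
```

Under these assumptions the ancestor region contributes 1.0, the descendant region about 0.977, and the sibling and self regions contribute nothing because the labels differ, so the total is about 1.977.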

17 Training Stage III: Attribute Signatures
Structural + data signature S(D, A, S, C, I):
– Data signature D for the values of R.X (e.g., can be a histogram of values for X)
– Structure signature (A; S; C; I) for attribute X, similar to the instance signature:
   – Original instance node → “document” root
   – A → ancestors (Item, Publisher, New)
   – I → self (ISBN)
   – S → siblings (Price, NumCopies)
   – C → null
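
The slide leaves the form of the data signature open ("can be a histogram of values"). One possible reading, sketched below in Python, is a normalized histogram over character classes compared by cosine similarity. The choice of character classes, the function names, and the sample values are all assumptions made for illustration, not details from the source.

```python
import math
from collections import Counter


def char_class(ch: str) -> str:
    """Coarse character class used as the histogram bucket."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    if ch.isspace():
        return "space"
    return "punct"


def data_signature(values: list[str]) -> dict[str, float]:
    """Normalized character-class histogram over all values of an attribute."""
    counts = Counter(char_class(ch) for value in values for ch in value)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()} if total else {}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse histograms (0.0 if either is empty)."""
    num = sum(w * v.get(k, 0.0) for k, w in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0


# Made-up sample values for a trained ISBN attribute and two candidate nodes.
trained_isbn = data_signature(["0-13-110362-8", "0-201-03801-3"])
candidate_isbn = data_signature(["0-262-03384-4"])
candidate_title = data_signature(["An Example Book Title"])

print(cosine(trained_isbn, candidate_isbn))
print(cosine(trained_isbn, candidate_title))
```

A histogram like this separates digit-heavy identifiers from free text, but it would not tell an ISBN from a phone number; that is where the structure signature is expected to help.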

18 Outline
– Motivation
– Problem statement
– XMLMiner approach
– Training XMLMiner
– Extraction from new documents
– XMLMiner prototype
– Summary

19 Extraction Stage
1. Assumption: input documents have internal regularity
2. Compute the canonical tree for some of the input documents
3. Build the signature of each node in the canonical form, and compute its similarity with the known instance node signatures
4. Map descendants of the highest-scoring node to attributes of the target table using attribute signatures

20 Extraction I: Represent test documents in canonical form
[Figure: a test document and its canonical form. Test document labels: Publications, book, title, author, publisher, price, editor, ISBN. Canonical form labels: Publications, book*, title, author, publisher, price, editor, ISBN.]
– Intuition: robustness (allows “optional” nodes)
– Efficiency: the canonical form has fewer nodes than the original tree

21 Extraction II: Find Instance Node in Canonical Tree
For each node K in the canonical tree CT:
   Compute the signature S_K of K
   Compute the score for K as Similarity(S_K, S_I)
S_I is the signature of instance node I from training.
The node with the highest score is the instance node in CT.
[Figure: canonical tree with labels Publications, book*, title, author, publisher, price, editor, ISBN.]
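
The loop on this slide amounts to an arg-max over node signatures. The Python sketch below shows one way to write it, with several stated simplifications: signatures are unweighted label sets rather than TF*IDF vectors, each region is compared with a binary cosine, and labels are lowercased before comparison. The `Node` class, all function names, and the two sample inputs are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def descendants(node: Node) -> set[str]:
    """Lowercased labels of every node strictly below `node`."""
    labels: set[str] = set()
    for child in node.children:
        labels.add(child.label.lower())
        labels |= descendants(child)
    return labels


def all_signatures(root: Node) -> list[tuple[Node, dict[str, set[str]]]]:
    """(node, signature) pairs for every node K in the tree."""
    found = []

    def walk(node: Node, ancestors: list[str], siblings: list[str]) -> None:
        found.append((node, {
            "A": set(ancestors),
            "S": set(siblings),
            "C": descendants(node),
            "I": {node.label.lower()},
        }))
        for child in node.children:
            others = [c.label.lower() for c in node.children if c is not child]
            walk(child, ancestors + [node.label.lower()], others)

    walk(root, [], [])
    return found


def region_cosine(a: set[str], b: set[str]) -> float:
    """Cosine of two binary label vectors; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity(sx: dict, sy: dict) -> float:
    return sum(region_cosine(sx[r], sy[r]) for r in ("A", "S", "C", "I"))


def find_instance_node(root: Node, trained: dict) -> Node:
    """Return the node whose signature scores highest against `trained`."""
    node, _ = max(all_signatures(root),
                  key=lambda pair: similarity(pair[1], trained))
    return node


# Trained signature S_I (lowercased) and a canonical test tree.
S_I = {
    "A": {"products", "books"},
    "S": {"category_desc"},
    "C": {"title", "author", "publisher", "new", "used",
          "isbn", "price", "num_copies"},
    "I": {"item"},
}
test_doc = Node("Publications", [Node("book*", [
    Node("title"), Node("author"), Node("publisher"),
    Node("price"), Node("editor"), Node("ISBN"),
])])

print(find_instance_node(test_doc, S_I).label)  # book*
```

The normalization matters in this example: the root and `book*` share the same five descendant labels with the trained signature, and only the smaller descendant set of `book*` gives it the higher score.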

22 Extraction III: Map children of instance node to attributes
For each node J of the subtree at K:
   For each attribute X of R:
      AS_J ← attribute signature of J
      AS_X ← attribute signature of X
      Compute the score for J as Similarity(AS_J, AS_X)
Pick the mapping such that the product of the scores over the attributes of R is maximized.
[Figure: instance node book* with children title, author, publisher, price, editor, ISBN.]
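
Choosing the mapping that maximizes the product of scores can be sketched as a brute-force search. The Python below assumes the mapping is one-to-one (each node serves at most one attribute), which the slide does not state, and the score table is invented for illustration.

```python
from itertools import permutations
from math import prod


def best_mapping(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return the attribute -> node mapping with the largest score product.

    `scores[attribute][node]` is the similarity between the attribute's
    signature and the node's signature. Assumption: one node per attribute,
    and no node is used twice.
    """
    attributes = list(scores)
    nodes = sorted({n for row in scores.values() for n in row})
    best: dict[str, str] = {}
    best_score = -1.0
    # Try every ordered choice of distinct nodes, one per attribute.
    for chosen in permutations(nodes, len(attributes)):
        score = prod(scores[a].get(n, 0.0) for a, n in zip(attributes, chosen))
        if score > best_score:
            best, best_score = dict(zip(attributes, chosen)), score
    return best


# Hypothetical similarity scores for three target attributes.
scores = {
    "Title":  {"title": 0.9, "editor": 0.2, "author": 0.3, "price": 0.1},
    "Author": {"title": 0.2, "editor": 0.6, "author": 0.8, "price": 0.1},
    "Price":  {"title": 0.1, "editor": 0.1, "author": 0.1, "price": 0.9},
}
mapping = best_mapping(scores)
print(mapping)  # {'Title': 'title', 'Author': 'author', 'Price': 'price'}
```

The search is exponential in the number of attributes, which is acceptable for a handful of columns. For wider relations, maximizing the product is equivalent to maximizing the sum of log scores, so a standard assignment algorithm could be used instead.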

23 Extraction IV: Generate XPath queries for the new documents
– Apply XPath queries to the “new” XML documents
– Simple XPath queries can be handled by the Xerces parser or a more advanced “streaming parser”
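
A small Python sketch of this last step follows. It swaps in the standard-library `xml.etree.ElementTree` module, which supports a limited XPath subset, in place of the Xerces or streaming parsers named on the slide. The helper names, the label paths, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def xpath_for(path: list[str]) -> str:
    """Join canonical-tree labels into a relative XPath, dropping '*' marks."""
    return "./" + "/".join(label.rstrip("*") for label in path)


def extract(xml_text: str,
            instance_path: list[str],
            attribute_paths: dict[str, list[str]]) -> list[dict[str, Optional[str]]]:
    """Return one row per instance node; missing attributes become None.

    `instance_path` is relative to the document root element, and each
    attribute path is relative to the instance node.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for instance in root.findall(xpath_for(instance_path)):
        rows.append({
            attribute: instance.findtext(xpath_for(path))
            for attribute, path in attribute_paths.items()
        })
    return rows


# Hypothetical new document and the paths learned from its canonical tree.
document = """
<Publications>
  <book><title>A</title><author>X</author><ISBN>111</ISBN><price>10</price></book>
  <book><title>B</title><ISBN>222</ISBN></book>
</Publications>
"""
rows = extract(
    document,
    instance_path=["book*"],
    attribute_paths={"ISBN": ["ISBN"], "Title": ["title"], "Price": ["price"]},
)
for row in rows:
    print(row)
```

Because the paths come from the canonical tree, an optional node such as `price` simply yields a missing value for documents that omit it.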

24 XMLMiner Prototype
– Successfully finds the best instance node (“Book”) in the test document

25 Summary
– Partially supervised, low-effort XML → relational extraction
– Flexible vector space representation that preserves some of the original structure
– Can potentially be more robust than current state-of-the-art systems that rely on rules