Extracting Relations from XML Documents
C. T. Howard Ho, Joerg Gerhardt, Eugene Agichtein*, Vanja Josifovski
IBM Almaden and Columbia University*

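2 Extraction for Data Integration: Motivating Example
[Figure: two source schemas feeding one target relation. Node labels in the Native Schema: Products, books, music, video, item, book, title, author, publisher, ISBN, price. Node labels in the External Schema: Publications, book, title, author, publisher, ISBN, price. Target relation columns: ISBN, Title, Author, Publisher, Price.]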

3 Why Extract Data from XML?
XML query processing is still in development and is not yet as fast as an RDBMS
Relational query processing is still the standard for many business applications
Extracting into one relational schema avoids the overhead of XML data integration at run time
Extracted relations are best exploited for relatively static data (e.g., product catalogs)

4 Related Work
XTRACT (induces DTDs)
Lore/DataGuides
HTML wrappers (LixTo, RoadRunner, WHISK, STALKER, …)
Plain-text information extraction (Proteus, Snowball, Rapier)
Supervised/assisted XML schema mapping (e.g., Clio)

5 Outline
Motivation
Problem statement
XMLMiner approach
Training XMLMiner
Extraction from new documents
Some observations from the prototype
Summary

6 Problem Statement
Given a target flat relation R, extract information for the tuples in R from XML (or HTML) documents with potentially significant variations in schema.
Problems with current integration/extraction approaches:
–Hard-coding the rules/queries requires significant effort, and the resulting rules can be brittle.
–An XML Schema or DTD is not always provided.

7 XMLMiner Approach
Learn signatures from example XML documents
Represent document structure while maintaining flexibility (to allow schema variations)
Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node (the subtree may contain more detail about the tuple than is needed)
Represent input document nodes as vectors, then find the closest (i.e., most similar) instance node vector
Use labels and data values to map children of the instance node to target tuple attributes

8 XMLMiner Architecture: Training and Extraction
[Figure: architecture diagram; the only label that survives is "Canonical Tree".]

9 High-Level Description
Training:
–Each XML document is merged/split into a schema-like tree, called the canonical tree
–The user identifies the attribute nodes (under the instance node) that correspond to the target tuple attributes
–The system derives the instance node in the tree
–The system builds a model for the structure of the tuple and of each attribute
Extracting:
–Apply the model to find the most likely instance node and attribute nodes in the new XML documents

10 Training Stage I: Create Canonical Tree for each Example Document

11 Canonical Form Conversion Example: Merging Similar Nodes
Merge all siblings with the same label (e.g., Item → Item*)
Intuition: siblings with the same label represent “similar” entities.
[Figure: original document structure next to the “merged” document.]
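
The merging step can be sketched as follows. This is a minimal illustration, not the XMLMiner implementation: the function names `merge` and `outline`, the sample document, and the convention of appending `*` only when a label repeats under a single parent are all assumptions made for the example. Text content is ignored, and the splitting of heterogeneous nodes is not covered.

```python
import xml.etree.ElementTree as ET

def merge(nodes):
    """Merge a list of same-label nodes into one canonical node (hypothetical helper)."""
    canon = ET.Element(nodes[0].tag)
    groups = {}       # child label -> all children with that label, across `nodes`
    repeated = set()  # labels that occur more than once under a single parent
    for node in nodes:
        counts = {}
        for child in node:
            groups.setdefault(child.tag, []).append(child)
            counts[child.tag] = counts.get(child.tag, 0) + 1
        repeated.update(label for label, n in counts.items() if n > 1)
    for label, members in groups.items():
        sub = merge(members)          # merge recursively, level by level
        if label in repeated:
            sub.tag = label + "*"     # assumed marker for a merged, repeated node
        canon.append(sub)
    return canon

def outline(node):
    """Render a tree as nested labels, e.g. a(b,c)."""
    if len(node) == 0:
        return node.tag
    return node.tag + "(" + ",".join(outline(c) for c in node) + ")"

doc = ET.fromstring(
    "<Products><Books>"
    "<Item><Title/><Author/></Item>"
    "<Item><Title/><ISBN/></Item>"
    "</Books></Products>"
)
canonical = merge([doc])
print(outline(canonical))
```

Because the two Item siblings are pooled before their children are examined, ISBN appears in the canonical tree even though only one Item carries it, which is how the merged form tolerates optional nodes.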

12 Example: Split Heterogeneous Nodes → Canonical Form
[Figure: canonical tree.]

13 Training Stage I Result: Canonical Tree
[Figure: original document next to its canonical form.]

14 Training Stage II: Generate Instance Node Signatures
Features used to create signatures for an instance node I (Item) in the canonical tree:
–A: ancestors of I
–S: siblings of I
–C: descendants of I
–I: self (the tag of I)
Siblings and ancestors → position of I in the document
Descendants → internal structure of I
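
The four feature sets can be collected with a short tree walk. The sketch below uses Python's `xml.etree.ElementTree` and plain label sets without weights; the function name `signature` and the layout of the sample document (in particular, where Publisher, New, and Used sit under Item) are assumptions chosen so that the result matches the labels listed for Item on the next slide.

```python
import xml.etree.ElementTree as ET

def signature(root, node):
    """Return the (A, S, C, I) label sets for `node` inside the tree at `root`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}
    ancestors, cur = set(), node
    while cur in parent:
        cur = parent[cur]
        ancestors.add(cur.tag)
    if node in parent:
        siblings = {s.tag for s in parent[node] if s is not node}
    else:
        siblings = set()
    descendants = {d.tag for d in node.iter() if d is not node}
    return {"A": ancestors, "S": siblings, "C": descendants, "I": {node.tag}}

# Assumed canonical tree for the training document.
tree = ET.fromstring(
    "<Products><Books><Category_Desc/>"
    "<Item><Title/><Author/>"
    "<Publisher><New><ISBN/><Price/><Num_Copies/></New><Used/></Publisher>"
    "</Item></Books></Products>"
)
item_signature = signature(tree, tree.find("./Books/Item"))
print(item_signature)
```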

15 Training Stage (cont.): Example Instance Node Signature
Signature (A, S, C, I) for Item:
[ A: { “Products”, “Books” },
 S: { “Category_Desc” },
 C: { “Title”, “Author”, “Publisher”, “New”, “Used”, “ISBN”, “Price”, “Num_Copies” },
 I: { “Item” } ]

16 Signature Similarity
Vector space model with TF*IDF weights for terms
Incorporates structure (similarity-by-region)
S_X: [ A: { “Products”:1 },
 S: { “Music”:0.33, “Video”:0.33 },
 C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “New”:0.2, “Used”:0.2, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
 I: { “Item” } ]
S_Y: [ A: { “Products”:1, “Books”:0.5 },
 S: { “CDs”:0.5 },
 C: { “Title”:0.33, “Author”:0.33, “Publisher”:0.33, “ISBN”:0.6, “Price”:0.2, “Copies”:0.5 },
 I: { “Book” } ]
Similarity(S_X, S_Y) = S_X.A * S_Y.A + S_X.S * S_Y.S + S_X.C * S_Y.C + S_X.I * S_Y.I
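
As a worked example, the region-by-region sum can be evaluated on the two signatures from this slide. The sketch treats each region as a sparse vector and each product as a plain dot product. Two points are assumptions: the slide gives no weight for the I region, so a weight of 1 is used, and no length normalization is applied because the slide does not say whether the vectors are normalized.

```python
REGIONS = ("A", "S", "C", "I")

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v[term] for term, w in u.items() if term in v)

def similarity(sx, sy):
    """Sum of the per-region dot products."""
    return sum(dot(sx[r], sy[r]) for r in REGIONS)

S_X = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},   # weight 1 is an assumption
}
S_Y = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},   # weight 1 is an assumption
}

per_region = {r: dot(S_X[r], S_Y[r]) for r in REGIONS}
total = similarity(S_X, S_Y)
print(per_region, total)
```

Under these assumptions the A region contributes 1.0 (the shared “Products” term), the S and I regions contribute nothing because they share no terms, and the C region contributes 3 × 0.33² + 0.6² + 0.2² + 0.5² = 0.9767, for a total of about 1.9767.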

17 Training Stage III: Attribute Signatures
Structural + data signature S(D, A, S, C, I):
–Data signature D for the values of R.X (e.g., a histogram of the values for X)
–Structure signature for attribute X, (A; S; C; I), similar to the instance signature:
 Original instance node → “document” root
 A → ancestors (Item, Publisher, New)
 I → self (ISBN)
 S → siblings (Price, NumCopies)
 C → null
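
One possible reading of the data signature is sketched below. The slide only says that D can be a histogram of values; the choice to histogram coarse value "shapes" (digits become `9`, letters become `a`) and to compare histograms by cosine similarity is an assumption made for this example, as are the function names and the sample values.

```python
import math
from collections import Counter

def shape(value):
    """Coarse shape of a value: digits -> '9', letters -> 'a', other characters kept."""
    return "".join("9" if ch.isdigit() else "a" if ch.isalpha() else ch for ch in value)

def data_signature(values):
    """Normalized histogram of value shapes (one hypothetical form of D)."""
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse histograms."""
    num = sum(w * v.get(k, 0.0) for k, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

# Illustrative values only.
trained_isbn = data_signature(["0131103628", "0262033844"])
candidate_isbn = data_signature(["0201633612"])
candidate_price = data_signature(["29.95", "105.00"])

print(cosine(trained_isbn, candidate_isbn), cosine(trained_isbn, candidate_price))
```

A shape histogram generalizes to unseen values, which a histogram over the raw values would not; whether the prototype does anything similar is not stated in the slides.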

18 Outline
Motivation
Problem statement
XMLMiner approach
Training XMLMiner
Extraction from new documents
XMLMiner prototype
Summary

19 Extraction Stage
1. Assumption: input documents have internal regularity
2. Compute the canonical tree for some of the input documents
3. Build the signature of each node in the canonical form, and compute its similarity with the known instance node signatures
4. Map descendants of the highest-scoring node to attributes of the target table using attribute signatures

20 Extraction I: Represent test documents in canonical form
[Figure: the test document (Publications with book nodes; labels title, author, publisher, price, editor, ISBN) next to its canonical form (Publications with a merged book* node and the same child labels).]
Intuition: robustness (allows “optional” nodes)
Efficiency: the canonical form has fewer nodes than the original tree

21 Extraction II: Find Instance Node in Canonical Tree
For each node K in the canonical tree CT:
 Compute the signature S_K of K
 Compute the score for K as Similarity(S_K, S_I), where S_I is the signature of instance node I from training
The node with the highest score is the instance node in CT
[Figure: canonical tree with Publications, book*, and the children title, author, publisher, price, editor, ISBN.]
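
The search loop can be sketched end to end on the test tree from the previous slide. To stay self-contained, the example substitutes a simpler scoring scheme for the TF*IDF weights: each region is an unweighted, lowercased label set, and regions are compared with set cosine. That substitution, the helper names, and the use of a plain `book` tag for the merged `book*` node are all assumptions.

```python
import math
import xml.etree.ElementTree as ET

REGIONS = ("A", "S", "C", "I")

def signatures(root):
    """Yield (node, signature) for every node of the canonical tree."""
    parent = {child: p for p in root.iter() for child in p}
    for node in root.iter():
        anc, cur = set(), node
        while cur in parent:
            cur = parent[cur]
            anc.add(cur.tag.lower())
        sib = {s.tag.lower() for s in parent[node] if s is not node} if node in parent else set()
        desc = {d.tag.lower() for d in node.iter() if d is not node}
        yield node, {"A": anc, "S": sib, "C": desc, "I": {node.tag.lower()}}

def set_cosine(a, b):
    """Cosine similarity of two label sets with binary weights."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def similarity(sk, si):
    return sum(set_cosine(sk[r], si[r]) for r in REGIONS)

def find_instance_node(root, trained):
    """Return the node whose signature is most similar to the trained one."""
    return max(signatures(root), key=lambda pair: similarity(pair[1], trained))[0]

# Trained instance signature for Item, lowercased.
S_I = {
    "A": {"products", "books"},
    "S": {"category_desc"},
    "C": {"title", "author", "publisher", "new", "used", "isbn", "price", "num_copies"},
    "I": {"item"},
}
test_tree = ET.fromstring(
    "<Publications><book><title/><author/><publisher/>"
    "<price/><editor/><ISBN/></book></Publications>"
)
best = find_instance_node(test_tree, S_I)
print(best.tag)
```

Normalizing each region matters here: the root node has every descendant that `book` has plus `book` itself, so an unnormalized overlap count would tie the two, while cosine favors the tighter match.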

22 Extraction III: Map children of instance node to attributes
For each node J of the subtree at K:
 For each attribute X of R:
  AS_J ← attribute signature of J
  AS_X ← attribute signature of X
  Compute the score for J against X as Similarity(AS_J, AS_X)
Pick the mapping that maximizes the product of the scores over the attributes of R.
[Figure: book* with children title, author, publisher, price, editor, ISBN.]
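
Choosing the mapping with the largest product of scores can be illustrated by exhaustive search, which is practical for a handful of attributes. The sketch assumes the mapping is one-to-one (each attribute gets a distinct node), which the slide does not state, and the score table is invented for illustration rather than taken from the prototype.

```python
from itertools import permutations
from math import prod

def best_mapping(nodes, attributes, score):
    """Try every one-to-one assignment of nodes to attributes; keep the best product."""
    best, best_value = None, -1.0
    for chosen in permutations(nodes, len(attributes)):
        value = prod(score(j, x) for j, x in zip(chosen, attributes))
        if value > best_value:
            best, best_value = dict(zip(attributes, chosen)), value
    return best, best_value

# Hypothetical Similarity(AS_J, AS_X) values; unlisted pairs get a small default.
SCORES = {
    ("title", "Title"): 0.9,
    ("author", "Author"): 0.8,
    ("editor", "Author"): 0.4,
    ("publisher", "Publisher"): 0.7,
    ("editor", "Publisher"): 0.3,
    ("ISBN", "ISBN"): 0.95,
    ("price", "Price"): 0.85,
}

def score(node, attribute):
    return SCORES.get((node, attribute), 0.05)

nodes = ["title", "author", "publisher", "price", "editor", "ISBN"]
attributes = ["ISBN", "Title", "Author", "Publisher", "Price"]
mapping, value = best_mapping(nodes, attributes, score)
print(mapping)
```

The `editor` node is left unmapped because every assignment that uses it scores lower. For larger relations, an assignment solver over the negated log scores would avoid the factorial search.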

23 Extraction IV: Generate XPath queries for the new documents
Apply the XPath queries to the “new” XML documents
Simple XPath queries can be handled by the Xerces parser or by a more advanced “streaming parser”
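
Once the instance node and the attribute mapping are known, extraction reduces to one path for the instance nodes plus one relative path per attribute. The sketch below uses Python's `xml.etree.ElementTree`, which supports a limited subset of XPath, in place of the Xerces parser named on the slide. The function name, the paths, and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

def extract_tuples(doc, instance_path, attribute_paths):
    """Return one dict per instance node; attributes with no match map to None."""
    rows = []
    for instance in doc.findall(instance_path):
        rows.append({attr: instance.findtext(path)
                     for attr, path in attribute_paths.items()})
    return rows

# Illustrative document with an optional price and an extra editor node.
new_doc = ET.fromstring(
    "<Publications>"
    "<book><title>T1</title><author>A1</author><ISBN>111</ISBN><price>10</price></book>"
    "<book><title>T2</title><author>A2</author><editor>E2</editor><ISBN>222</ISBN></book>"
    "</Publications>"
)
rows = extract_tuples(
    new_doc,
    "./book",
    {"ISBN": "ISBN", "Title": "title", "Author": "author", "Price": "price"},
)
for row in rows:
    print(row)
```

The second book has no price element, so its Price is None, and its editor element is ignored because no target attribute maps to it.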

24 XMLMiner Prototype Successfully finds best instance node (“Book”) in test document

25 Summary
Partially supervised, low-effort XML → relational extraction
Flexible vector space representation that preserves some of the original structure
Potentially more robust than current state-of-the-art systems that rely on rules
