XML Schema Integration Ray Dos Santos July 19, 2009.


1 XML Schema Integration Ray Dos Santos July 19, 2009

2 XML integration
Fundamental problem: schema matching, which takes two (or more) schemas and produces a mapping between the elements (or attributes) of those schemas that correspond semantically to each other.
Objective: find corresponding entities.

3 Some application domains for XML
- Health Level Seven (HL7): http://www.hl7.org/
- Geography Markup Language (GML)
- Systems Biology Markup Language (SBML): http://sbml.org/
- XBRL, the XML-based business reporting standard: http://www.xbrl.org/
- Global Justice XML Data Model (GJXDM): http://it.ojp.gov/jxdm
- ebXML: http://www.ebxml.org/
- Encoded Archival Description (EAD): http://lcweb.loc.gov/ead/
- Digital photography metadata (XMP)
- An XML grammar for sensor data (SensorML)
- Really Simple Syndication (RSS 2.0)

4 Integrating two schemas
Consider two schemas, S1 and S2, representing two customer relations, Cust and Customer.

S1 (Cust)     S2 (Customer)
CNo           CustID
CompName      Company
FirstName     Contact
LastName      Phone

5 Integrating two schemas (contd)
Represent the mapping with a similarity relation, ≅, over the power sets of S1 and S2, where each pair in ≅ represents one element of the mapping. E.g.,
- Cust.CNo ≅ Customer.CustID
- Cust.CompName ≅ Customer.Company
- {Cust.FirstName, Cust.LastName} ≅ Customer.Contact
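One way to make the similarity relation concrete is to store each mapping element as a pair of attribute sets, which handles 1:1 and many-to-one matches uniformly. The sketch below is illustrative only: the names `MAPPING`, `cardinality`, and `unmatched` are assumptions introduced here, not part of the presentation.

```python
# Schemas from the slide, written as sets of qualified attribute names.
S1 = {"Cust.CNo", "Cust.CompName", "Cust.FirstName", "Cust.LastName"}
S2 = {"Customer.CustID", "Customer.Company", "Customer.Contact", "Customer.Phone"}

# Each mapping element pairs a subset of S1 with a subset of S2.
MAPPING = {
    (frozenset({"Cust.CNo"}), frozenset({"Customer.CustID"})),
    (frozenset({"Cust.CompName"}), frozenset({"Customer.Company"})),
    (frozenset({"Cust.FirstName", "Cust.LastName"}), frozenset({"Customer.Contact"})),
}


def cardinality(pair):
    """Classify one mapping element as 1:1, 1:m, m:1 or m:n."""
    left, right = pair
    if len(left) == 1 and len(right) == 1:
        return "1:1"
    if len(left) == 1:
        return "1:m"
    if len(right) == 1:
        return "m:1"
    return "m:n"


def unmatched(schema, side):
    """Attributes of `schema` that appear in no mapping element.

    `side` is 0 for the S1 half of each pair and 1 for the S2 half.
    """
    matched = set().union(*(pair[side] for pair in MAPPING))
    return schema - matched


if __name__ == "__main__":
    for pair in sorted(MAPPING, key=lambda p: sorted(p[0])):
        print(sorted(pair[0]), "->", sorted(pair[1]), cardinality(pair))
    print("unmatched in S2:", sorted(unmatched(S2, 1)))
```

Using frozensets on both sides keeps the representation faithful to "pairs over the power sets" while still letting the pairs live in an ordinary Python set.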

6 Different types of matching
- Schema-level only matching: only schema information is considered.
- Domain and instance-level only matching: some instance data (data records) and possibly the domain of each attribute are used. This case is quite common on the Web.
- Integrated matching of schema, domain and instance data: both schema and instance data (and possibly domain information) are available.

7 Pre-processing for integration
- Tokenization: break an item into atomic words using a dictionary, e.g., break “fromCity” into “from” and “city”, and “first-name” into “first” and “name”.
- Expansion: expand abbreviations and acronyms to their full words, e.g., from “dept” to “departure”.
- Stopword removal and stemming.
- Standardization of words: irregular words are standardized to a single form, e.g., from “colour” to “color”.
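The pre-processing steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: tokenization here splits on punctuation and camelCase boundaries rather than consulting a dictionary as the slide describes, the lookup tables `ABBREVIATIONS`, `STANDARD_FORMS`, and `STOPWORDS` are hypothetical and contain only a few entries, and stemming is omitted.

```python
import re

# Hypothetical lookup tables; a real system would use much larger ones.
ABBREVIATIONS = {"dept": "departure"}
STANDARD_FORMS = {"colour": "color"}
STOPWORDS = {"a", "an", "the", "of"}

# Matches an acronym run, a (possibly capitalized) word, or a digit run.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(name):
    """Split an element name on punctuation and camelCase boundaries."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        tokens.extend(_WORD.findall(part))
    return [t.lower() for t in tokens]


def normalize(name):
    """Tokenize, expand abbreviations, standardize spelling, drop stopwords."""
    result = []
    for token in tokenize(name):
        token = ABBREVIATIONS.get(token, token)   # expansion
        token = STANDARD_FORMS.get(token, token)  # standardization
        if token not in STOPWORDS:                # stopword removal
            result.append(token)
    return result


if __name__ == "__main__":
    for raw in ("fromCity", "first-name", "dept-colour", "nameOfTheDept"):
        print(raw, "->", normalize(raw))
```

Normalized token lists produced this way can then feed the name and description comparisons used in linguistic matching.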

8 Schema-level matching
Schema-level matching relies on information such as name, description, data type, relationship type (e.g., part-of, is-a), and constraints.
Match cardinality:
- 1:1 match: one element in one schema matches one element of another schema.
- 1:m match: one element in one schema matches m elements of another schema.
- m:n match: m elements in one schema match n elements of another schema.

9 An example
An m:1 match is similar to a 1:m match. An m:n match is complex, and there is little work on it.

10 Linguistic approaches
Linguistic approaches derive match candidates from the names, comments, or descriptions of schema elements.
Name match:
- Equality of names
- Synonyms
- Equality of hypernyms: A is a hypernym of B if B is a kind-of A.
- Common sub-strings
- Cosine similarity
- User-provided name match: usually a domain-dependent match dictionary
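Several of the name-match criteria above can be combined into a single score. The sketch below is an assumption-laden illustration: the synonym groups are hypothetical, the score of 0.9 for synonyms is an arbitrary choice, and hypernym matching is left out because it needs an external thesaurus. The common-substring step uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical, user-provided synonym groups (a domain-dependent dictionary).
SYNONYM_GROUPS = [
    {"phone", "telephone"},
    {"company", "firm"},
]


def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def name_score(name1, name2):
    """Score two element names in [0, 1]; the weights are illustrative only."""
    a, b = name1.lower(), name2.lower()
    if a == b:
        return 1.0  # equality of names
    if any(a in group and b in group for group in SYNONYM_GROUPS):
        return 0.9  # synonyms (arbitrary score, an assumption)
    if not a or not b:
        return 0.0
    # Common sub-string, normalized by the longer name.
    return longest_common_substring(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    for pair in (("CompName", "Company"), ("Phone", "Telephone"), ("CNo", "CustID")):
        print(pair, round(name_score(*pair), 2))
```

Thresholding such a score yields match candidates, which a later step (for example constraint checks) can confirm or reject.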

11 Linguistic approaches (contd)
Description match: in many files, schema elements carry comments. Cosine similarity can be used to compare these comments after stemming and stopword removal.
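As a rough illustration of description matching, the sketch below turns each comment into a bag-of-words vector and computes the cosine of the angle between the two vectors. The stopword list is a small hypothetical one and stemming is omitted, so this is a simplification of what the slide describes.

```python
import math
import re
from collections import Counter

# Small hypothetical stopword list; stemming is not performed in this sketch.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in"}


def term_vector(comment):
    """Lowercase the comment, keep alphabetic words, and drop stopwords."""
    words = re.findall(r"[a-z]+", comment.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_similarity(comment1, comment2):
    """Cosine similarity of the two comments' term-frequency vectors."""
    v1, v2 = term_vector(comment1), term_vector(comment2)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cosine_similarity("the name of the customer company", "company name"))
    print(cosine_similarity("phone number", "contact person"))
```

The value ranges from 0 (no shared terms) to 1 (identical term distributions), so it can be thresholded like any other similarity score.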

12 Constraint-based approaches
Matching can use constraints such as data types, value ranges, uniqueness, and relationship types. An equivalence or compatibility table for data types and keys can be provided, e.g., string ≅ varchar, and (primary key) ≅ unique.
Note: on the Web, constraint information is often not available, but some of it can be inferred from the domain and instance data.
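A compatibility table of this kind can be sketched as a set of unordered pairs. Only the first two entries come from the slide; the third entry and the function name `compatible` are assumptions added for illustration.

```python
# Unordered pairs of constraint names treated as compatible.
COMPATIBLE = {
    frozenset({"string", "varchar"}),      # from the slide
    frozenset({"primary key", "unique"}),  # from the slide
    frozenset({"integer", "int"}),         # hypothetical extra entry
}


def compatible(constraint1, constraint2):
    """True if the two constraints are identical or listed as compatible."""
    a, b = constraint1.lower(), constraint2.lower()
    return a == b or frozenset({a, b}) in COMPATIBLE


if __name__ == "__main__":
    print(compatible("string", "varchar"))
    print(compatible("unique", "primary key"))
    print(compatible("string", "integer"))
```

Storing each entry as a frozenset makes the lookup symmetric, so the table does not need both orderings of every pair.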

13 Domain and instance-level matching
In many applications, some data instances or attribute domains may be available. Value characteristics are used in matching.
Two different types of domains:
- Simple domain: each value in the domain has only a single component (the value cannot be decomposed).
- Composite domain: each value in the domain contains more than one component.

14 Match of simple domains
A simple domain can be of any type. If the data type information is not available (this is often the case on the Web), the instance values can often be used to infer types, e.g.,
- Words may be considered as strings.
- Phone numbers can have a regular expression pattern.
Data type patterns (in regular expressions) can be learned automatically or defined manually. They can be used to identify such types as integer, real, string, month, weekday, date, time, zip code, and phone number.
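A manually defined set of type patterns might look like the following sketch. All patterns are assumptions made for illustration (US-style phone numbers and zip codes, ISO-style dates), and they are tried from most to least specific, so a column of five-digit integers would be reported as zip codes.

```python
import re

# Hypothetical, manually defined patterns, ordered from most to least specific.
TYPE_PATTERNS = [
    ("phone", r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"),
    ("zip code", r"\d{5}(-\d{4})?"),
    ("date", r"\d{4}-\d{2}-\d{2}"),
    ("weekday", r"(mon|tues|wednes|thurs|fri|satur|sun)day"),
    ("real", r"[+-]?\d+\.\d+"),
    ("integer", r"[+-]?\d+"),
]
_COMPILED = [(name, re.compile(p, re.IGNORECASE)) for name, p in TYPE_PATTERNS]


def infer_type(values):
    """Return the first type whose pattern matches every instance value."""
    if not values:
        return "string"
    for name, pattern in _COMPILED:
        if all(pattern.fullmatch(v.strip()) for v in values):
            return name
    return "string"  # fallback: treat anything else as a string


if __name__ == "__main__":
    print(infer_type(["(540) 231-6000", "540-231-6000"]))
    print(infer_type(["24061", "24060-1234"]))
    print(infer_type(["hello", "42"]))
```

Two attributes whose instance values yield the same inferred type become candidates for a match, which is useful when no schema-level type information is published.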

15 XML is different from databases
- Limited use of acronyms and abbreviations in XML: natural language words and phrases are used instead, so that the general public can understand them. Databases, by contrast, use acronyms and abbreviations extensively.
- Limited vocabulary: for easy understanding.
- A large number of similar databases: a large number of sites offer the same services or sell the same products.
- Additional structures: the information is usually organized in some meaningful way, but the organization needs to be understood first. Related attributes are grouped together, and the organization is hierarchical.

16 Instance-based matching via footprints
- Assume a global schema and a set of instances are given.
- The method uses each instance value of every attribute to probe the underlying ontology and obtain the footprints.
- These footprints are used to help with a best-faith matching estimate.
It performs matches based on:
- Events
- Relationships
- Spatial characteristics
- Attributes

17 [Figure: ontology diagram with an instance example]
Ontology nodes: Entity, Rel, Event, Object, Spatial
Attribute labels: name, description, shape, size, atts; begin at, stop at, elapse, start, end; touches, overlaps, contains, spans, within; parent, child, type-of, is-a, part-of
Instance example: Washington DC Monuments. Attached to the ground. Intelligently places itself at the height of the underlying terrain. -77.0822035425683,- 111.42228990140251,0
Instance example: Washington Monument. Major monuments of the federal capital. -77.0822035425683 11.42228990140251,0

18 [Figure: ontology diagram annotated with footprints]
Ontology nodes: Entity, Rel, Event, Object, Spatial
Attribute labels: name, description, shape, size, atts; begin at, stop at, elapse, start, end; touches, overlaps, contains, spans, within; parent, child, type-of, is-a, part-of
FootPrint: 01ZZ9918025310 Events: 9920 01XX7718 10 01ZZ99200253109920 05XX4412 10 Rel: 01ZZ9920025310 Spatial: 9920 01XX7720 10
Annotations: temporal, moving objects; relationships among objects; location

19 Next Steps
- How to define footprints
- Define rules to minimize footprint size and count
- Order footprints so that the most appropriate ones are “looked at” first
- Footprints for subtrees?

