Extracting Information from Heterogeneous Information Sources Using Ontologically Specified Target Views
Joachim Biskup, Universität Dortmund
and
David W. Embley, Brigham Young University
Funded by NSF
Information Exchange
[Figure labels: Source, Target, Information Extraction, Schema Matching]
Leverage this … … to do this
Presentation Outline
Overview
Matching (Direct)
Matching (Derived)
Matching Algorithm
Summary
Requirements
1. f is an injective function.
2. f maps object sets to object sets and relationship sets to relationship sets.
3. f respects relationship-set arities.
4. f respects referential integrity.
5. f respects types.
6. f respects real-world identity.
7. f’s coercions are G/S compatible.
8. f respects subset constraints.
9. f respects mutual-exclusion constraints.
10. f respects union constraints.
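The first three requirements are mechanical enough to check directly. The following is a minimal sketch, not the authors' implementation: it assumes the mapping f is held as a Python dict from target elements to source elements, and the SchemaElement class, its field names, and check_mapping are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaElement:
    """Hypothetical stand-in for an object set or a relationship set."""
    name: str
    kind: str       # "object_set" or "relationship_set"
    arity: int = 1  # number of connected object sets; 1 for object sets


def check_mapping(f):
    """Return the violations of requirements 1-3 for f (dict: target -> source)."""
    violations = []
    sources = list(f.values())
    # Requirement 1: no two target elements may share a source element.
    if len(set(sources)) != len(sources):
        violations.append("1: f is not injective")
    for target, source in f.items():
        # Requirement 2: object sets map to object sets, rel. sets to rel. sets.
        if target.kind != source.kind:
            violations.append(f"2: {target.name} -> {source.name} mixes kinds")
        # Requirement 3: a relationship set keeps its arity.
        elif target.kind == "relationship_set" and target.arity != source.arity:
            violations.append(f"3: {target.name} -> {source.name} changes arity")
    return violations
```

Requirements 4 through 10 depend on populated data and on constraint semantics, so they are left out of this sketch.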
User Interaction (IDS Statements)
Issue
–Explains the issue
–Example: units, may need transformation
Default
–Explains the default option
–Example: if no transformation, no conversion
Suggestion
–Gives a suggestion about how to resolve the issue
–Example: if needed, specify the conversion
Theorem Let f be the generated mapping from target t to source s, populated such that s has a valid interpretation. Let t’ be the submodel of t populated from s by f. Then t’ has a valid interpretation. Proof: the paper is the proof …
Target (Graphical View)
Target (Textual View)
Source Example (Assumed to be Populated)
Matching (Direct) Object Sets Relationship Sets
Object-Set Type Compatibility
1. type(a) = type(b)
2. type(a) ⊃ type(b)
3. type(a) ⊂ type(b)
4. type(a) ≠ type(b)
type(a) = type(b)
Same type
–string = string, but Airport ≠ Head Of State
–Need better matching techniques
Same type, different units
–Size vs. Nr Sq Km
–Need unit conversion
Same type, different format
–Date = Date, but 01/02/2002 ≠ Jan 2, 2002
–Need format conversion
Same type, same units and format, different assumptions
–Altitude = Altitude, but altitude of aircraft and spacecraft differ
–Need same assumptions
Same type, same units and format, same assumption, OIDs
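The unit and format cases can be illustrated with two small conversions. This is a sketch under stated assumptions: the slide does not say which unit Size uses or how 01/02/2002 should be read, so square miles and month/day/year order are assumptions, and both function names are hypothetical.

```python
from datetime import datetime

# Exact factor, since 1 mile is defined as 1.609344 km.
SQ_KM_PER_SQ_MILE = 2.589988110336


def to_long_date(text):
    """Convert '01/02/2002' to 'Jan 2, 2002' (assumes month/day/year order)."""
    parsed = datetime.strptime(text, "%m/%d/%Y")
    # %b is locale dependent; the default C locale yields English abbreviations.
    return f"{parsed:%b} {parsed.day}, {parsed.year}"


def sq_miles_to_sq_km(size):
    """Convert a Size assumed to be in square miles to Nr Sq Km."""
    return size * SQ_KM_PER_SQ_MILE
```

The last two cases on the slide (different assumptions, OIDs) have no comparable mechanical conversion and are not covered here.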
type(a) ⊃ type(b) and type(a) ⊂ type(b)
Real ⊃ Integer or Video ⊃ Image
–Target has greater discriminating power
–Can add .0 or make a video of a single image (?)
Integer ⊂ Real or Image ⊂ Video
–Source has greater discriminating power
–Can round off or select one of the frames (?)
type(a) ≠ type(b)
Image ≠ String
–Mismatch, even if same attribute (e.g. both City)
–Types can help discard potential matches
String(5) ≠ Integer
–But suppose the integer is 2
–Might work, but is “2.000” ok?
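The four type-compatibility cases (same type, target type wider, source type wider, mismatch) can be sketched as a lookup plus a numeric coercion. The WIDENS_TO table holds only the pairs named on the slides; the table, the function names, and the returned labels are hypothetical and chosen for illustration.

```python
# (narrower, wider) pairs taken from the slide examples; an assumed, partial table.
WIDENS_TO = {("Integer", "Real"), ("Image", "Video")}


def classify_types(target_type, source_type):
    """Classify a target/source type pair into one of the four cases."""
    if target_type == source_type:
        return "equal"
    if (source_type, target_type) in WIDENS_TO:
        return "target wider"   # e.g. target Real, source Integer
    if (target_type, source_type) in WIDENS_TO:
        return "source wider"   # e.g. target Integer, source Real
    return "mismatch"


def coerce_number(value, target_type):
    """Integer -> Real adds '.0'; Real -> Integer rounds off."""
    if target_type == "Real":
        return float(value)
    if target_type == "Integer":
        # Python rounds exact halves to the nearest even integer.
        return round(value)
    raise ValueError(f"no numeric coercion to {target_type}")
```

A mismatch result is what lets types discard a potential match early, before any value-level comparison.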
Relationship Match Requirements
Referential integrity
Constraints
–Cardinality
–Mandatory/Optional
Referential Integrity
[Figure: Target and Source, with nodes a, b, a’, b’, ..., a’’]
The types of a, a’, and a’’ can all be different, but not arbitrary.
Example: a (String), a’ (Integer), a’’ (Real).
Relationship-Set Constraint Compatibility 1.constr(a) constr(b) 2.(constr(a) constr(b)) 3.(constr(a) constr(b)) 4.(constr(a) constr(b))
constr(a) constr(b) Person Car owns drives o o o o Person Car ? o o Need more information to resolve: Perhaps “?” is “purchased.”
(constr(a) constr(b)) City City Map City City Map ab The target (a) expects many maps, but the source can’t supply them.
(constr(a) constr(b)) City City Map City City Map ab The target (a) expects one map, but the source can supply many.
(constr(a) constr(b)) City City Map City City Map ab The target (a) expects at least one and potentially many maps, but the source may have none or at most one. o
Matching (Derived)
Generalization/Specialization
Composite Values
Derived Relationship Sets
Displayable/Nondisplayable Object Sets
Generalization/Specialization
For a target object set, a source object set may:
–have no overlap (just ignore)
–have a proper subset (accept or find missing generalization)
–have the same values (direct match)
–have a proper superset (hard, except for roles)
–overlap (like proper subset and proper superset)
Consider roles and missing generalizations
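The five cases can be told apart with plain set comparisons. This is a minimal sketch that assumes both object sets are available as collections of comparable values; classify_overlap and its return labels are hypothetical names.

```python
def classify_overlap(target_values, source_values):
    """Relate the source object set's values to the target object set's values."""
    target, source = set(target_values), set(source_values)
    if target.isdisjoint(source):
        return "no overlap"        # just ignore
    if source == target:
        return "same values"       # direct match
    if source < target:
        return "proper subset"     # accept or find missing generalization
    if source > target:
        return "proper superset"   # hard, except for roles
    return "overlap"               # mix of the subset and superset cases
```

In practice the target is usually not populated in advance, so a check like this would have to run against expected or sample values; that choice is outside the sketch.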
Roles target: source: City Travel Video CityClip: Video o o o o Video With City Scene Video With City Scene
Missing Generalization
[Figure, target and source: City Map, Country Map, City Map: Image, Country Map: Image, Map: Image]
Composite Values
Composite in Source (split)
Composite in Target (merge)
Examples of Derived Relationships
Composite in Source
[Figure, target and source: Video, Time, Nr Hours, Nr Minutes]
Note also that we generated a source path.
Composite in Target
[Figure, target and source: Video, Time, Nr Hours, Nr Minutes]
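The split and merge directions for composite values can be sketched with the Time example. The slides do not give a concrete representation for Time, so the "H:MM" string form is an assumption, and split_time and merge_time are hypothetical names.

```python
def split_time(time_value):
    """Split a composite Time such as '1:45' into (Nr Hours, Nr Minutes)."""
    hours, minutes = time_value.split(":")
    return int(hours), int(minutes)


def merge_time(nr_hours, nr_minutes):
    """Merge Nr Hours and Nr Minutes into a composite Time such as '1:05'."""
    return f"{nr_hours}:{nr_minutes:02d}"
```

For example, split_time("1:45") yields the pair (1, 45), and merge_time(1, 45) rebuilds the original string.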
Displayable/Nondisplayable Object-Set Matches
Nondisplayable in Source: find a key
Nondisplayable in Target: create a key
Nondisplayable in Source
[Figure, target and source: Airport, City, Airline, flys to, serves]
No Key: Discard Match
Nondisplayable in Source
[Figure, target and source: Airport, City, Airline, flys to, serves, Airport Name]
One Key: Choose it
Nondisplayable in Source
[Figure, target and source: Airport, City, Airline, flys to, serves, Airport Name, Airport Code]
Two or more Keys: Choose One
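The three nondisplayable-in-source cases amount to a small decision on the number of candidate keys. The sketch below is an illustration only: the slides do not say how one key is chosen when several exist, so the default of taking the first listed key is an assumption, and resolve_nondisplayable_match is a hypothetical name.

```python
def resolve_nondisplayable_match(candidate_keys, choose=lambda keys: keys[0]):
    """Pick a displayable key for a nondisplayable source object set.

    Returns None when the match should be discarded.
    """
    if not candidate_keys:
        return None                  # No key: discard match
    if len(candidate_keys) == 1:
        return candidate_keys[0]     # One key: choose it
    # Two or more keys: choose one. The default policy (first listed) is an
    # assumption; a user prompt or a ranking could be passed in instead.
    return choose(candidate_keys)
```

With the Airport example, an empty key list discards the match, ["Airport Name"] selects Airport Name, and ["Airport Name", "Airport Code"] defers to the supplied choice policy.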
Matching Algorithm
Sample Match Table
Pictorial View of Match Table target source
Summary
Concluding Remarks QED (the theorem holds) Let f be the generated mapping from target t to source s, populated such that s has a valid interpretation. Let t’ be the submodel of t populated from s by f. Then t’ has a valid interpretation. Proof: the paper is the proof …
Pictorial View of Match Table t = target s = source f = the mapping t’ has a valid interpretation t’ = submodel
Concluding Remarks
QED (the theorem holds)
Merge (several sources)
–All sources extracted to same view
–Union merge
Object identity problems
Constraint problems
Source Modeling (convert to OSM)
Framework defined, but not implemented