Discovering, Maintaining, and Using Semantics for Database Schemas Yuan An, Ph.D. iSchool at Drexel February 23, 2009 CS Department at Villanova Univ.


1 Discovering, Maintaining, and Using Semantics for Database Schemas Yuan An, Ph.D. iSchool at Drexel February 23, 2009 CS Department at Villanova Univ.

3 Background Information integration is the problem of sharing and using data across disparate information sources. It is challenging because the sources are often distributed, autonomous, and heterogeneous.

4 Example of Information Integration Patient healthcare and medical data usually reside in multiple sources, such as different hospital units, labs, clinics, personal data management devices, and even drugstores. Example tasks for information integration: –obtaining a holistic view of a patient's health status –merging data for multiple healthcare providers

5 Information Integration 5

6 A Central Issue A key component of any solution for information integration is the definition of mappings between different data sources/schemas. Despite a decade of effort, building schema mappings remains a very difficult problem. The difficulty lies in the need to understand the meaning of the schemas being mapped.

7 An Example Task: transfer patient medical information from Philadelphia General Hospital to Boston Mass General Hospital. [Figure: the relational schemas of the two databases. Philadelphia General Hospital DB: Admission(ID, AdmDate, PatRef, DisDate), Treatment(ID, Doc, PatRef, Date, Desc), Patient(ID, Name, MedCr#, Diagnosis). Boston Mass General Hospital DB: Coronary(ID, Enter, Policy#, Leave, Patient), Pulmonary(ID, Enter, Policy#, Leave, Patient), Admission(ID), Treatment(ID, Doc, ProgID, Date), Progress(ID, Symptom, PatRef).]

8 Schema Semantics [Figure: tables from the two hospital schemas shown next to fragments of a conceptual model. The table Progress(ID, Symptom, PatRef) is paired with the concept Progress (hasID, hasSymptom), linked to Patient (hasRefN, hasName) by the relationship relate. The table Treatment(ID, Doc, ProgID, Date) is paired with the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), and Progress (hasID, hasSymptom), linked by the relationships prescribe and apply.]

9 Discovering Semantics We aim to develop an automatic tool for discovering semantic mappings from database schemas to conceptual models (CMs). [Figure: a DB linked to its conceptual model.]

10 Benefits of Discovering Semantics for Schemas [Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) of the Philadelphia General Hospital DB schema shown against the Boston Mass Hospital DB conceptual model, whose concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName) are linked by the relationships prescribe, recommend, apply, monitor, and relate.]

11 Maintaining Semantics We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution. [Figure: a DB and its conceptual model evolving into DB’ and conceptual model’.]

12 Using Semantics for Discovering Schema Mapping [Figure: DB1 with conceptual model 1 and DB2 with conceptual model 2.]

13 Roadmap: Background; Contributions; Discovering Semantics for Schemas; Maintaining Semantics for Schemas; Using the Semantics for Schema Mapping; Conclusions

14 Challenges Conceptual models carry much more semantics than schemas, e.g., weak entities, partOf, n-ary relationships, ISA relationships… All of these need to be distinguished from schema structures. The goal is to discover all and only the “reasonable” trees, which we call semantic trees, that are plausible semantics of the table. [Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) and a conceptual model with the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), linked by the relationships prescribe, recommend, apply, monitor, and relate.]

15 Existing Mapping Tools Schema matching tools: associate atomic elements in different schemas using syntactic links. Schema mapping tools: infer query expressions for translating/exchanging data. –unable to discover the expected semantics of a schema in terms of a conceptual model. [Figure: the tables Treatment(ID, Doc, PatRef, Date, Desc) and Treatment(ID, Doc, ProgID, Date).]

16 Our Solution for Discovering Semantics Simple correspondences can be specified manually or by using a schema matching tool. We focus on deriving semantic trees that connect the individual concepts using “reasonable” links. The key is to discover “reasonable” links based on 1. an analysis of key and foreign key constraints in schemas. 2. a careful study of standard database design principles. [Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) and a conceptual model with the concepts Treatment, Doctor, Progress, and Patient, linked by the relationships prescribe, recommend, apply, monitor, and relate.]

17 Discovering Semantic Trees Step 1: determine a skeleton tree and its anchor from the key columns. Step 2: determine the skeleton trees, and their anchors, that correspond to the foreign key columns. Step 3: link the skeleton trees using shortest functional paths. Step 4: link any concepts corresponding to unaccounted-for columns. [Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) and a conceptual model with the concepts Treatment, Doctor, Progress, and Patient, linked by the relationships prescribe, recommend, apply, monitor, and relate.]
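The step of linking skeleton trees by shortest functional paths can be illustrated with a minimal sketch. It assumes the conceptual model is given as a list of directed edges that are functional from source to target (at most one target instance per source instance). The edge directions below, and the function names `shortest_functional_path` and `semantic_tree`, are hypothetical and chosen only for illustration; this is not the MAPONTO implementation.

```python
from collections import deque

# Hypothetical conceptual model: each (source, relationship, target) edge is
# ASSUMED to be functional from source to target. The directions are
# illustrative and are not taken from the slides.
FUNCTIONAL_EDGES = [
    ("Treatment", "prescribe", "Doctor"),
    ("Treatment", "apply", "Progress"),
    ("Progress", "relate", "Patient"),
]


def shortest_functional_path(edges, anchor, target):
    """Breadth-first search along functional edges from anchor to target.

    Returns the list of edges on a shortest path, or None if the target
    cannot be reached through functional edges.
    """
    adjacency = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))

    queue = deque([(anchor, [])])
    visited = {anchor}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None


def semantic_tree(edges, anchor, targets):
    """Union of the shortest functional paths from the anchor to each target."""
    tree = []
    for target in targets:
        path = shortest_functional_path(edges, anchor, target)
        if path is None:
            continue  # no functional connection; left for manual inspection
        for edge in path:
            if edge not in tree:
                tree.append(edge)
    return tree


# Anchor: the concept matched by the key column; targets: the concepts
# matched by the foreign key columns (Doc and PatRef in the example table).
tree = semantic_tree(FUNCTIONAL_EDGES, "Treatment", ["Doctor", "Patient"])
```

Breadth-first search is used here because every edge counts as one step, so the first path found to a target is a shortest one.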

18 “Divide and Conquer” Proceed in a gradual manner: 1. ER0 – an initial subset with binary relationships. 2. ER1 – adding n-ary relationships. 3. ER2 – adding ISA relationships.

19 “Good” Properties of the Algorithm Guarantees hold only for “standard” relational schemas. 1. A sense of “completeness”: the algorithm finds all the “correct” semantics. 2. A sense of “soundness”: for multiple candidates, each one would result in an “indistinguishable” table by the standard database design methodology.

20 The MAPONTO Tool [Screenshot: the MAPONTO tool displaying the mapping formulas.]

21 Evaluation Results
Schema              # of Tables   # of Columns   Ontology              # of Nodes   # of Links
UTCS Department     8             32             Academic Department   62           1913
VLDB Conference     9             38             Academic Conference   27           143
DBLP Bibliography   5             27             Bibliographic Data    75           1178
OBSERVER Project    8             115            Bibliographic Data    75           1178
Country             6             18             CIA Factbook          52           125

22 Evaluation Results The tool found the correct semantics for 85% of the tested tables. The maximum number of candidate semantics was 4. The average execution time was less than 1 second.

23 Roadmap: Background; Contributions; Discovering Semantics for Schemas; Maintaining Semantics for Schemas; Using the Semantics for Schema Mapping; Conclusions

24 Maintaining Semantics We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution. [Figure: a DB and its conceptual model evolving into DB’ and conceptual model’.]

25 Challenges in Maintenance What to maintain: how to define the property to be maintained and how to detect violations of that property. How to capture changes to CMs and relational schemas. How to reconcile CMs and schemas according to the intent of users.

26 Our Goals of Mapping Maintenance To keep the mapping consistent: a consistent conceptual-relational mapping allows legal instances to be translated in both directions. To reconcile the conceptual model when the associated schema evolves. To update the mapping when the associated conceptual model evolves.

27 Capturing CM/Schema Changes A user can change a CM/schema in different ways: –Modifying the original model. –Generating a new model. It is difficult to ask the user to provide a sequence of primitive actions. It would be easier to ask the user to draw correspondences. [Figure: correspondences between the table Biosample(bsid,species,organ,…,donor_disease) and the tables Biosample(bsid,species,organ,…) and tissue(bsid,donor_disease).]

28 Reconciling CM and Schema Analyzing the existing semantics in the original mappings in terms of skeleton trees and connections between anchors. Discovering changes through correspondences between old and new models. Synchronizing models and adapting the mapping accordingly. 28
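As a rough illustration of discovering changes through correspondences, the sketch below records user-drawn column correspondences between an old and a new schema and reports the columns that now live only in a different table. It assumes, for illustration, that the single Biosample table is the old schema and the Biosample/tissue pair is the new one. Only the columns named on the slide are used, and the function name `moved_columns` is hypothetical.

```python
# User-drawn correspondences, ASSUMING the single-table Biosample schema is
# the old version: (old table, old column) -> [(new table, new column), ...].
# Columns elided on the slide are deliberately left out.
CORRESPONDENCES = {
    ("Biosample", "bsid"): [("Biosample", "bsid"), ("tissue", "bsid")],
    ("Biosample", "species"): [("Biosample", "species")],
    ("Biosample", "organ"): [("Biosample", "organ")],
    ("Biosample", "donor_disease"): [("tissue", "donor_disease")],
}


def moved_columns(correspondences):
    """Report old columns whose counterparts all sit in another table."""
    moved = []
    for (old_table, old_col), targets in correspondences.items():
        new_tables = sorted({table for table, _ in targets})
        if new_tables and old_table not in new_tables:
            moved.append((old_table, old_col, new_tables))
    return moved


changes = moved_columns(CORRESPONDENCES)
```

A change detected this way is only a hint about what the user intended; the reconciliation itself still has to adapt the conceptual model and the mapping.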

29 Evaluation Methodology and Results The same data sets as for discovering conceptual-relational mappings. Measuring efficiency and benefits in comparison to the mapping reconstruction approach. Comparing the number of mapping candidates generated by the maintenance and reconstruction approaches. The maintenance approach can save at least 80% of the user effort needed to reach consistent mappings. Execution time is insignificant: avg. < 1 sec.

30 Roadmap: Background; Contributions; Discovering Semantics for Schemas; Maintaining Semantics for Schemas; Using the Semantics for Schema Mapping; Conclusions

31 Using CM-Relational Mappings for Discovering Schema Mapping [Figure: DB1 with conceptual model 1 and DB2 with conceptual model 2.]

32 Current Solutions for Schema Mapping Compose Progress(ID,PatRef,Symptom) with Treatment(ID’,ProgID,Doc,Date) where Progress.ID=Treatment.ProgID → Treatment(ID’,PatRef,Doc,Date,Symptom). [Figure: SOURCE tables Treatment(ID, Doc, ProgID, Date) and Progress(ID, Symptom, PatRef); TARGET table Treatment(ID, Doc, PatRef, Date, Desc).]
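The composition on this slide is a join of the two source tables on Progress.ID = Treatment.ProgID. A minimal sketch over in-memory rows follows. The sample values and the function name `compose` are made up for illustration, and an inner join is assumed, so treatments without a matching progress record are dropped.

```python
# Hypothetical sample rows; all values are illustrative only.
progress = [
    {"ID": "p1", "PatRef": "pat7", "Symptom": "chest pain"},
    {"ID": "p2", "PatRef": "pat9", "Symptom": "cough"},
]
treatment = [
    {"ID": "t1", "ProgID": "p1", "Doc": "d3", "Date": "2009-02-01"},
    {"ID": "t2", "ProgID": "p2", "Doc": "d4", "Date": "2009-02-03"},
    {"ID": "t3", "ProgID": "p9", "Doc": "d4", "Date": "2009-02-05"},
]


def compose(progress_rows, treatment_rows):
    """Inner join on Progress.ID = Treatment.ProgID.

    Produces rows shaped like Treatment(ID', PatRef, Doc, Date, Symptom).
    """
    progress_by_id = {row["ID"]: row for row in progress_rows}
    result = []
    for t in treatment_rows:
        p = progress_by_id.get(t["ProgID"])
        if p is None:
            continue  # no matching Progress row: dropped by the inner join
        result.append({
            "ID": t["ID"],
            "PatRef": p["PatRef"],
            "Doc": t["Doc"],
            "Date": t["Date"],
            "Symptom": p["Symptom"],
        })
    return result


target_rows = compose(progress, treatment)
```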

33 Using the Semantics 1. Load Doctor.name and Doctor.clinic into employee as employee.name and employee.clinic in the target. 2. Load Scientist.name and Scientist.lab into employee as employee.name and employee.lab in the target. 3. Compose Doctor(ssn,name’,clinic) with Scientist(ssn,name,lab) where they have the same ssn → employee(z,name,clinic,lab). [Figure: source tables Doctor(ssn, name, clinic) and Scientist(ssn, name, lab); target table employee(eid, name, clinic, lab); a conceptual model with Employee (ssn, name), Doctor (ssn, clinic), and Scientist (ssn, lab), marked with an X.]

34 Principles of the Semantic Approach Discover two conceptual subgraphs (CSGs) that are “semantically similar” (≠ “structurally matching”), then translate the CSGs into algebraic expressions. 1. Connections between corresponding pairs of nodes are semantically similar or compatible, e.g., ISA, partOf… 2. Desirable properties of database queries are maintained. 3. The principle of parsimony: smallest trees.

35 Evaluation Methodology Comparison between the semantic approach and traditional approaches based on referential integrity constraints. Manually specified mapping expressions as a “gold standard”. Traditional “precision” and “recall” as evaluation criteria. Data collection from a variety of domains.
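For reference, precision and recall against a gold standard can be computed as in the sketch below. Mappings are represented as opaque identifiers, and the identifiers and the function name `precision_recall` are placeholders. Returning 0.0 when a set is empty is a convention assumed here, not something stated in the slides.

```python
def precision_recall(generated, gold):
    """Precision and recall of generated mappings against a gold standard.

    precision = |generated ∩ gold| / |generated|
    recall    = |generated ∩ gold| / |gold|
    """
    generated, gold = set(generated), set(gold)
    correct = generated & gold
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Placeholder mapping identifiers: two of the four generated mappings are
# in the gold standard, and one gold mapping is missed.
precision, recall = precision_recall(
    generated={"m1", "m2", "m3", "m4"},
    gold={"m1", "m2", "m5"},
)
```

In this example precision is 2/4 and recall is 2/3. Eliminating a suspicious generated mapping raises precision, while finding a previously missed gold mapping raises recall.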

36 Test Data
Schemas; # tables; Associated CMs; # nodes in CM; # mappings tested
DBLP1, DBLP2; 22, 9; Bibliographic, DBLP2 ER; 75, 7; 6
Mondial1, Mondial2; 28, 26; Factbook, Mondial2 ER; 52, 26; 5
Amalgam1, Amalgam2; 15, 27; Amalgam1 ER, Amalgam2 ER; 8, 26; 7
3Sdb1, 3Sdb2; 9999; 3Sdb1 ER, 3Sdb2 ER; 33
UTCS, UTDB; 8, 13; KA ontology, CS dept. ontology; 105, 62; 2
HotelA, HotelB; 6565; hotelA ontology, hotelB ontology; 7777; 5
NetworkA, NetworkB; 18, 19; networkA ontology, networkB ontology; 28, 27; 6

37 Summary of the Evaluation Results Found all the expected mappings that the traditional approach found. Improved precision (in 70% of the test cases) by eliminating suspicious pairings. Improved recall (in 40% of the test cases) by treating ISA as a functional relationship. Where the schemas involve little complicated semantics, there is no improvement.

38 Roadmap: Background; Contributions; Discovering Semantics for Schemas; Maintaining Semantics for Schemas; Using the Semantics for Schema Mapping; Conclusions

39 Conclusions A novel and effective tool for discovering semantics for schemas in terms of conceptual models. A round-trip engineering process for maintaining semantic mappings. A semantic approach for improving schema mappings using the semantics. A suite of tools for assisting users to discover and maintain mappings between different data representations in a variety of information integration situations. 39

40 Thank You! 40

