1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linköpings universitet.

Slides:



Advertisements
Similar presentations
Ontology-Based Computing Kenneth Baclawski Northeastern University and Jarg.
Advertisements

Three-Step Database Design
Università di Modena e Reggio Emilia ;-)WINK Maurizio Vincini UniMORE Researcher Università di Modena e Reggio Emilia WINK System: Intelligent Integration.
Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
1 Ontolog OOR Use Case Review Todd Schneider 1 April 2010 (v 1.2)
Schema Matching and Query Rewriting in Ontology-based Data Integration Zdeňka Linková ICS AS CR Advisor: Július Štuller.
CH-4 Ontologies, Querying and Data Integration. Introduction to RDF(S) RDF stands for Resource Description Framework. RDF is a standard for describing.
Information Integration Using Logical Views Jeffrey D. Ullman.
1 Global-as-View and Local-as-View for Information Integration CS652 Spring 2004 Presenter: Yihong Ding.
0 General information Rate of acceptance 37% Papers from 15 Countries and 5 Geographical Areas –North America 5 –South America 2 –Europe 20 –Asia 2 –Australia.
An Extensible System for Merging Two Models Rachel Pottinger University of Washington Supervisors: Phil Bernstein and Alon Halevy.
DataFoundry: An Approach to Scientific Data Integration Terence Critchlow Ron Musick Ida Lozares Center for Applied Scientific Computing Tom SlezakKrzystof.
CS652 Spring 2004 Summary. Course Objectives  Learn how to extract, structure, and integrate Web information  Learn what the Semantic Web is  Learn.
ISP 433/533 Week 2 IR Models.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Constraint Logic Programming Ryan Kinworthy. Overview Introduction Logic Programming LP as a constraint programming language Constraint Logic Programming.
Making the Most of What We Know: Towards Effective Use of Genomics Data Terence Critchlow Center for Applied Scientific Computing Lawrence Livermore National.
Organizing Data & Information
How can Computer Science contribute to Research Publishing?
2005Integration-intro1 Data Integration Systems overview The architecture of a data integration system:  Components and their interaction  Tasks  Concepts.
ReQuest (Validating Semantic Searches) Norman Piedade de Noronha 16 th July, 2004.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
TAMBIS Transparent Access to Multiple Biological Information Sources.
Chapter 4 Relational Databases Copyright © 2012 Pearson Education, Inc. publishing as Prentice Hall 4-1.
1 Information Integration and Source Wrapping Jose Luis Ambite, USC/ISI.
Overview of Search Engines
Chapter 4 Relational Databases Copyright © 2012 Pearson Education 4-1.
Quete: Ontology-Based Query System for Distributed Sources Haridimos Kondylakis, Anastasia Analyti, Dimitris Plexousakis Kondylak, analyti,
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
IST Databases and DBMSs Todd S. Bacastow January 2005.
Integration of Biological Sources: Current Systems and Challenges Ahead ( Sigmod Record, Vol. 33. No. 3, September 2004 ) Thomas Hernandez & Sybbarao Kambhampati.
Knowledge Mediation in the WWW based on Labelled DAGs with Attached Constraints Jutta Eusterbrock WebTechnology GmbH.
Amarnath Gupta Univ. of California San Diego. An Abstract Question There is no concrete answer …but …
Research Topics in Computing Data Modelling for Data Schema Integration 1 March 2005 David George.
XML BIS4430 – unit 10. XML Origins Extensible Markup Language (XML) 1998 Inspired by Standard Generalized Markup Language (SGML) and HTML. SGML defines.
CSE 636 Data Integration Overview Fall What is Data Integration? The problem of providing uniform (sources transparent to user) access to (query,
Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.
1 Lessons from the TSIMMIS Project Yannis Papakonstantinou Department of Computer Science & Engineering University of California, San Diego.
Dimitrios Skoutas Alkis Simitsis
Transparent access to multiple bioinformatics information sources (TAMBIS) Goble, C.A. et al. (2001) IBM Systems Journal 40(2), Genome Analysis.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
INFORMATION INTEGRATION Shengyu Li CS-257 ID-211.
Ontology-Based Computing Kenneth Baclawski Northeastern University and Jarg.
Labeling and Enhancing Life Science Links S. Heymann*, F. Naumann*, L. Raschid +, P. Rieger * * Humboldt Universität zu Berlin + University of Maryland.
Information Integration BIRN supports integration across complex data sources – Can process wide variety of structured & semi-structured sources (DBMS,
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
Aim Ability to automate the detection of financial inconsistency and irregularity Problem Need to create a unified and logically rigorous terminology.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
1 Open Ontology Repository initiative - Planning Meeting - Thu Co-conveners: PeterYim, LeoObrst & MikeDean ref.:
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Service Marts: a Service Framework for Search Computing Alessandro Campi Andrea Maesani.
Chapter 7 K NOWLEDGE R EPRESENTATION, O NTOLOGICAL E NGINEERING, AND T OPIC M APS L EO O BRST AND H OWARD L IU.
The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.
© University of Manchester Creative Commons Attribution-NonCommercial 3.0 unported 3.0 license Quality Assurance, Ontology Engineering, and Semantic Interoperability.
Presented by Kyumars Sheykh Esmaili Description Logics for Data Bases (DLHB,Chapter 16) Semantic Web Seminar.
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
WonderWeb. Ontology Infrastructure for the Semantic Web. IST WP4: Ontology Engineering Heiner Stuckenschmidt, Michel Klein Vrije Universiteit.
Ontology Technology applied to Catalogues Paul Kopp.
Answering Queries Using Views Presented by: Mahmoud ELIAS.
Of 24 lecture 11: ontology – mediation, merging & aligning.
Databases and DBMSs Todd S. Bacastow January
Building Trustworthy Semantic Webs
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Database System Architecture
Data Model.
Database Architecture
Information Retrieval and Web Design
Presentation transcript:

1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linköpings universitet

2 Accessing multiple data sources Where? Which? How? Disease information Target structure Chemical structure Disease models Clinical trials Metabolism, toxicology Genomics DISCOVERY

3 Access to multiple data sources-Problems Users need good knowledge on where the required information is stored and how it can be accessed Representation of an entity in different data sources can be different. Same name in different data sources can refer to different entities.

4 Queries over multiple data sources Find PubMed publications on diseases of certain insulin sequences SWISS-PROT OMIMPubMed Find Divide & order Execute Combine get relevant diseases get IDs of publications get publications

5 Access to multiple data sources - steps Decide which data sources should be used Divide query into sub-queries to the data sources Decide in which order to send sub-queries to the data sources Send sub-queries to the data sources - use the terminology of the data sources Merge results from the data sources to an answer for the original query  mistake in any step can lead to inefficient processing of the query or failure to get a result

6 Sub-query1 query Sub-query1 Answer1 Answer2 Answer3 Answer1 Answer2 Answer3 Answer1 Answer2 Answer3

7 query Answer1 Answer2 Answer3 Sub-query2(answer1) Answer1.1 Answer1.2 Answer1.1 Answer1.2 Answer1.1 Answer1.2

8 query Answer1 Answer2 Answer3 Sub-query2(answer2) Answer2.1 Answer2.2 Answer2.1 Answer2.2 Answer1.1 Answer1.2 Answer2.1 Answer2.2

9 query Answer1 Answer2 Answer3 Sub-query2(answer3) Answer3.1 Answer1.1 Answer1.2 Answer2.1 Answer2.2 Answer3.1

10 query Answer1 Answer2 Answer3 Answer1.1 Answer1.2 Answer2.1 Answer2.2 Answer3.1 Subquery3(Answer1.1,Answer1.2, Answer2.1,Answer2.2,Answer3.1) Subquery3(Answer1.1,Answer1.2, Answer2.1,Answer2.2,Answer3.1) Answer.a Answer.b Answer.c Answer.d Answer.e Answer.f Answer.a Answer.b Answer.c Answer.d Answer.e Answer.f result

11 Problem formulation Data source properties –Autonomous data sources –Different data models –Differences in terminology –Overlapping, redundant data Integration aims to provide transparent access to multiple heterogenous data sources –uniform query language –uniform representation of results Data source 2 Structure(name, structure, organism) date>2000 Data source 1 Protein(name, authors, date, organism) Article(authors, title, year) date>1995

12 Problem formulation Data source properties –Autonomous data sources –Different data models –Differences in terminology –Overlapping, redundant data Integration aims to provide transparent access to multiple heterogenous data sources –uniform query language –uniform representation of results Data source 2 Structure(name, structure, organism) date>2000 Protein(name, date, organism) ProteinStructure(name, structure) Data source 1 Protein(name, authors, date, organism) Article(authors, title, year) date>1995

13 Methods for integration  Link driven federations Explicit links between data sources.  Warehousing Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.  Mediation or View integration A global schema is defined over all data sources.

14 Link driven federations Creates explicit links between data sources query: get interesting results and use web links to reach related data in other data sources

15 Link driven federations User Interface Retrieval Engine Answer Assembler Index A->B Data source B Data source A Index B->A Query interpreter wrapper

16 SRS Integrates more than 300 resources Possible to add own resources interface: SRSWWW, getz

17 SRS – query language text search [swissprot-des:kinase] documents in swissprot that contain ’kinase’ in the ’description’-field [swissprot-des:kin*] documents in swissprot that contain a word that starts with ’kin’ in the ’description’-field

18 SRS – query language boolean operators: and (&), or (|), andnot (!) [swissprot-des:(adrenergic & receptor) ! (alpha1A)] documents in swissprot that contain ’adrenergic’ and ’receptor’ in the ’description’-field, but not ’alpha1A’

19 SRS – query language boolean operators: and (&), or (|), andnot (!) [swissprot-des:kinase] & [swissprot-org:human] documents in swissprot that contain ’kinase’ in the ’description’-field and ’human’ in the ’organism’- field

20 SRS – query language links [swissprot-des:kinase] > PDB documents in PDB that are referred to from documents in swissprot that contain ’kinase’ in the ’description’-field

21 SRS – query language links [swissprot-id: acha_human] > prosite > swissprot documents in swissprot that are referred to from documents in prosite that are referred to from documents in swissprot that contain ’acha_human’ in the ’id’- field

22 SRS – query language links [swissprot-org:human] > [swissprot-features:transmem] documents in swissprot that contain ’transmem’ in the ’features’-field and that are referred to from documents in swissprot that contain ’human’ in the ’organism’-field

23 SRS – query language multiple sources [{swissprot sptremb}-des:kinase] [dbs={swissprot sptremb}-des:kinase] & [dbs-org:human]

24 Link driven federations Advantages - complex queries - fast Disadvantages - require good knowledge - syntax based - terminology problem not solved

25

26 Mediation Define a global schema over the data sources high level query language

27 Mediation User Interface Retrieval Engine Query Interpreter and ExpanderAnswer Assembler Ontology Base Data source Knowledge Base Data source Data source wrapper

28 Mediation Advantages - complex queries - requires less knowledge - solution for terminology problem - semantics based

29 Mediation Disadvantages - more computation - view maintenance

30 Mediation Query problem How to answer queries expressed using the global schema. Modeling problem How to model the global schema, data sources and mappings. Application Global schema Local schema Query Wrapper Source Mediator

31 Queries use the global schema Conjunctive queries  select-project-join queries Mediator reformulates queries in terms of a set of queries that use the local schemas. Equivalence and containment of queries needs to be preserved. - Q1 is contained in Q2 if the result of Q1 is a subset of the result of Q2. Queries p(X,Z) :- a(X,Y), b(Y,Z) head body/subgoals if q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)

32 Mediator Query Wrapper Mediator Source Mediator is responsible for query processing – reformulation of queries, decide query plan – query optimization – execution of query plan, assemble results into final answer Issues: – Semantically correct reformulation – Access only relevant data sources Query Reformulation Query Optimization Query Execution Engine

33 Knowledge Application Global schema Local schema Query Wrapper Source Mediator Mappings Description of data source content – global schema (domain model/ontology) – local schema (data source model) Information for integration – mapping Capabilities – attributes and constraints – processing capabilities – completeness – cost of query answering – reliability Used for – selection of relevant data sources – query plan formulation – query plan optimization

34 Mapping Relation between domain and data source content  Global as view The global schema is defined in terms of source terminology Global schema: Protein(name, date, organism) ProteinStructure(name, structure) Data source local schema: DS1(name, authors, date, organism) DS2(name, structure, organism) Protein(name, date, organism) :- DS1(name, authors, date, organism) ProteinStructure(name, structure) :- DS2(name, structure, organism)

35 Mapping Relation between domain and data source content  Local as view The sources are defined in terms of the global schema. Global schema: Protein(name, date, organism) ProteinStructure(name, structure) DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000 Data source local schema: DS1(name, authors, date, organism) DS2(name, structure, organism)

36 Query processing in GAV No explicit representation of data source content Mapping gives direct information about which data satisfies the global schema. Query is processed by expanding the query atoms according to their definitions. Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) GAV: Protein(name, date, organism) :- DS1(name, authors, date, organism) ProteinStructure(name, structure) :- DS2(name, structure, organism) New query: q(name, structure) :- DS1(name, authors, 2001, ’human’), DS2(name, structure, organism)

37 Query processing in LAV Mapping does not give direct information about which data satisfies the global schema. To answer the query it needs to be inferred how the mappings should be used. Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) LAV: DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000

38 Query processing in LAV Bucket algoritm (Information Manifold)  For each sub-goal in query create bucket of relevant views.  Define rewritings of query. Each rewriting consists of one conjunct from every bucket. Check whether the resulting conjunction is contained in the query.  The result is the union of the rewritings. New query: q(name, structure) :- DS1(name, authors, 2001, ’human’), DS2(name, structure, organism) Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) LAV: DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000

39 Comparison GAV - LAV Global as view  Clear how data sources interact  When a data source is added, the global schema can change  Query processing is easy Local as view  Each data source is specified in isolation  Easy to add data sources  Easier to specify constraints on the contents of sources  Query processing requires reasoning

40 Capabilities Most common capabilities describe attributes  f - free, attribute can be specified or not  b - bound, a value must be specified for the attribute, all values are permitted  u - unspecified, not permitted to specify a value for the attribute  c[S] - value should be one of the values in finite set S  o[S] - value is not specified or one of the values in finite set S DS1: (name, authors, date, organism)f f b c[human mouse]

41 Mediation User Interface Retrieval Engine Query Interpreter and ExpanderAnswer Assembler Ontology Base Data source Knowledge Base Data source Data source wrapper Research issues: - knowledge representation - creation and maintenance of global schema

42 Mediation User Interface Retrieval Engine Query Interpreter and ExpanderAnswer Assembler Ontology Base Data source Knowledge Base Data source Data source wrapper Research issues: - knowledge representation - creation of ontologies - aligning/merging of ontologies

43 Mediation User Interface Retrieval Engine Query Interpreter and ExpanderAnswer Assembler Ontology Base Data source Knowledge Base Data source Biological databank wrapper Research issues: - unified query language - knowledge representation - strategies for query expansion - query optimization

44 Mediation User Interface Retrieval Engine Query Interpreter and ExpanderAnswer Assembler Ontology Base Data source Knowledge Base Data source Data source wrapper Research issue: - semi-automatic generation of wrappers

45