DataFoundry: An Approach to Scientific Data Integration


DataFoundry: An Approach to Scientific Data Integration
Terence Critchlow, Ron Musick, Ida Lozares (Center for Applied Scientific Computing)
Tom Slezak, Krzysztof Fidelis (Biology and Biotechnology Research Program)
Lawrence Livermore National Laboratory
IBC Bioinformatics, October 1, 1999

Outline
- Motivation
- DataFoundry's integration strategy
- Improving user interfaces
- Beyond fully integrated data
- Conclusions

Current environment
User task: "Need to find the ss of all disulfide bridges within 3 AAs of active sites in small proteins."
- Data is
  - hard to find
  - hard to understand
  - hard to reconcile
  - hard to analyze
Scientists waste time and energy doing data management.
[Diagram labels: SCoP, SWISS-PROT, PDB; User tasks: parse input, map similar concepts, transform data format, access the data; User applications]

What is our ideal environment?
A single location that provides effective access to a consistent view of data from many sources through an intuitive and useful interface.
Businesses use data warehouses to accomplish this.
[Diagram labels: SCoP, SWISS-PROT, PDB; parse input, map similar concepts, transform data format, access data; User applications; User task: "I need to find the..." followed by "Now that I have the data I need, I can...."]

Data warehouses
A data warehouse is a repository that provides a single access point to a collection of data obtained from a set of distributed, heterogeneous sources.
[Diagram labels: SWISS-PROT, dbEST, SCoP, PDB; Wrapper; Mediator; Data Warehouse]

Data warehouses
- Interfaces
  - provide intuitive access to the data
  - possibly change the data format to meet user expectations
- Warehouse
  - stores a consistent view of the data in a local repository
- Mediator
  - transforms data from the source format to the warehouse format
- Wrappers
  - read data from a source into an internal representation
[Diagram labels: SWISS-PROT, dbEST, SCoP, PDB; Wrapper; Mediator; Data Warehouse]
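The division of labor among wrapper, mediator, and warehouse can be sketched in a few lines. This is only an illustration under stated assumptions, not DataFoundry's code: the line-tagged record layout, the field tags (`AC`, `DE`, `OS`), the function and class names, and the in-memory dict standing in for the warehouse are all hypothetical.

```python
from typing import Dict, List

# Hypothetical source entry: a two-letter tag, three spaces, then the value.
RAW_ENTRY = "AC   P00001\nDE   Example protein\nOS   Homo sapiens\n"


def wrapper(raw: str) -> Dict[str, str]:
    """Wrapper: read source text into an internal representation."""
    record: Dict[str, str] = {}
    for line in raw.splitlines():
        tag, _, value = line.partition("   ")
        record[tag] = value.strip()
    return record


def mediator(record: Dict[str, str]) -> Dict[str, str]:
    """Mediator: transform the source format into the warehouse format."""
    return {
        "accession": record["AC"],
        "description": record["DE"],
        "organism": record["OS"],
    }


class Warehouse:
    """Warehouse: store a consistent view of the data locally."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}

    def load(self, row: Dict[str, str]) -> None:
        self.rows[row["accession"]] = row

    def find_by_organism(self, organism: str) -> List[Dict[str, str]]:
        # Interface-level query against the consistent view.
        return [r for r in self.rows.values() if r["organism"] == organism]


warehouse = Warehouse()
warehouse.load(mediator(wrapper(RAW_ENTRY)))
```

In this sketch every source needs its own hand-written `wrapper` and `mediator`, so a change to the source layout or the warehouse columns means editing code in both places.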

Warehouses don't work in dynamic domains
When schemata are modified, or new sources are added, wrappers and mediators break.
[Diagram labels: SWISS-PROT, dbEST, SCoP, PDB; Wrapper; Mediator; Data Warehouse]

Key insight
Extensive use of meta-data can dramatically reduce maintenance costs.
[Diagram labels: SWISS-PROT, dbEST, SCoP, PDB; Wrapper; API; Mediator; Data Warehouse]

The DataFoundry approach:
[Diagram labels: Sources; Wrapper; API; Mediator; API; Warehouse; Meta-Data; Mediator]

Four types of meta-data are required
[Diagram labels: Wrapper; API; Mediator; API; source rep; data manip; target rep; pop code; warehouse; meta-data types: abstraction, transformation, mapping, abstraction, db descr]
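One way to picture the four meta-data types (abstraction, transformation, mapping, database description) is as plain data interpreted by a single generic mediator. The sketch below is an assumption about how such meta-data could be laid out, not the format DataFoundry uses; every attribute, table, and function name is hypothetical.

```python
# Hypothetical meta-data for one source; all names are illustrative only.
META = {
    # abstraction: attributes of the source-side and target-side representations
    "abstraction": {
        "source": ["id", "len_aa", "species"],
        "target": ["accession", "length", "organism"],
    },
    # transformation: named data manipulations
    "transformation": {
        "identity": lambda value: value,
        "to_int": int,
        "capitalize": str.capitalize,
    },
    # mapping: (source attribute, transformation name, target attribute)
    "mapping": [
        ("id", "identity", "accession"),
        ("len_aa", "to_int", "length"),
        ("species", "capitalize", "organism"),
    ],
    # database description: warehouse table and column order
    "database": {"table": "protein", "columns": ["accession", "length", "organism"]},
}


def mediate(source_record, meta):
    """Apply the mappings to turn a source record into a target record."""
    target = {}
    for src_attr, transform_name, dst_attr in meta["mapping"]:
        transform = meta["transformation"][transform_name]
        target[dst_attr] = transform(source_record[src_attr])
    return target


def insert_statement(target, meta):
    """Build a parameterized INSERT from the database description."""
    db = meta["database"]
    columns = ", ".join(db["columns"])
    marks = ", ".join("?" for _ in db["columns"])
    sql = f"INSERT INTO {db['table']} ({columns}) VALUES ({marks})"
    params = tuple(target[column] for column in db["columns"])
    return sql, params


row = mediate({"id": "1abc", "len_aa": "153", "species": "homo sapiens"}, META)
sql, params = insert_statement(row, META)
```

Under this layout, a schema change or a new source becomes an edit to `META` rather than to `mediate` or `insert_statement`, which is the kind of maintenance saving the slides attribute to meta-data.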

Generating the mediators
[Diagram labels: Meta-Data (Abstractions, Transformation Descriptions, Data Mappings, Database Description); User-defined methods; Mediator Generator; API (Mediator Class, Mediator Interface); Translation Library; Transformation Calls; Population Code; SQL Interface; Data Access; Translation Code; Method Description; Data Definition]
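The idea of generating a mediator class from meta-data, rather than writing it by hand, can be illustrated with a toy generator. This is a sketch under assumptions: the mapping tuples, the class name `ScopMediator`, and the choice to emit Python source are all hypothetical, and the real generator's inputs and outputs are only known from the diagram labels above.

```python
# Hypothetical mapping meta-data: (source attribute, target attribute, cast).
MAPPINGS = [
    ("id", "accession", "str"),
    ("len_aa", "length", "int"),
]


def generate_mediator(class_name, mappings):
    """Emit Python source for a mediator class from mapping meta-data."""
    lines = [
        f"class {class_name}:",
        "    def translate(self, src):",
        "        return {",
    ]
    for src_attr, dst_attr, cast in mappings:
        lines.append(f"            {dst_attr!r}: {cast}(src[{src_attr!r}]),")
    lines.append("        }")
    return "\n".join(lines) + "\n"


# Generate the source text, then load it into a fresh namespace.
source_code = generate_mediator("ScopMediator", MAPPINGS)
namespace = {}
exec(source_code, namespace)
generated_mediator = namespace["ScopMediator"]()
translated = generated_mediator.translate({"id": "d1abc_", "len_aa": "120"})
```

Regenerating after a meta-data change replaces the whole class, so no hand-edited mediator code has to be kept in sync with the schema.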

The translation library and mediator class are used by the wrapper
[Diagram labels: Wrapper; API; Mediator; API; parser; source rep; mediator; semantic mapping; high-level object]

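Results: Integrating SCoP into a warehouse that already contains PDB and SWISS-PROT.
[Table: activity by integration style. Columns: manual, meta-data, diff, %diff. Rows: understanding SCoP, writing wrapper, modifying schema, writing mediator, modifying meta-data, total time in days. Cell values are missing from the transcript; the only surviving entries are "(1.0)" and "---" after "modifying meta-data".]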

Improving Data Access
Scientists need
- Better access to the data
  - combine data from multiple sources
  - annotate data
  - perform complex queries
- Better functionality
  - customized notification messages
  - integrated interface to tools
  - personalized responses to queries
By extending our meta-data representation, we can provide powerful, customizable access to data.

DataFoundry’s meta-data driven interface

[Screenshot: query results table with columns Src, Name, score, e-val, organism, Descr. Visible rows come from SP (Human, Mouse), dbEST (zl85d08.r, Homo sapiens colon), and PDB (4AAH, Methylophilus meth; 1ZAP, Candida albicans); most names and scores are truncated in the transcript. Controls: Expand, Reduce, Annotate, Enter; options: alignment, structure, blast, has kwd.]
Query: 644 VGGGDRWCWHLLDKEAKVRLSSPCFKDGTGNPIP 677
+GGG W W+ D + + F G+GNP P
Sbjct: 231 IGGGTNWGWYAYDPKLNL------FYYGSGNPAP 258

Beyond fully integrated data
- There are over 500 genomics data sources available on the web.
- Scientists want as much relevant information as possible.
- Integrating data from all of these sources is impossible.
Semantically integrating critical data sources, and providing basic access to others, offers the best possible solution.

Summary
- Integrated data
  - allows complex queries
  - is easier to understand
  - limits the number of sites
- Non-integrated data
  - allows more sites
  - is more flexible
  - is harder to query
Scientists need intuitive access to data from both internal and external sites.
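The trade-off summarized above, rich queries over integrated data plus basic access to everything else, might look like the following from a single search entry point. This is a hedged sketch, not the DataFoundry interface: the `WAREHOUSE` rows, the site name `ExampleDB`, and the `example.org` URL template are all hypothetical placeholders.

```python
from urllib.parse import quote

# Hypothetical integrated rows held in the local warehouse.
WAREHOUSE = [
    {"source": "PDB", "name": "1ZAP", "organism": "Candida albicans"},
    {"source": "SWISS-PROT", "name": "EXAMPLE1", "organism": "Homo sapiens"},
]

# Hypothetical non-integrated site, reachable only through a search URL.
EXTERNAL_SITES = {"ExampleDB": "https://example.org/search?q={term}"}


def search(organism):
    """Query integrated data directly; offer links for non-integrated sites."""
    integrated = [
        row for row in WAREHOUSE
        if row["organism"].lower() == organism.lower()
    ]
    external = {
        site: template.format(term=quote(organism))
        for site, template in EXTERNAL_SITES.items()
    }
    return {"integrated": integrated, "external": external}


result = search("Candida albicans")
```

The integrated half returns structured rows that can be filtered and joined, while the external half can only hand the user a link to follow, which mirrors the "harder to query" cost listed for non-integrated data.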

Conclusions
DataFoundry is building on its meta-data based infrastructure to develop a scalable, flexible, and usable system.
- Meta-data provides a way to
  - reduce the cost of integrating new sources
  - reduce the cost of accessing non-integrated sources
  - provide a powerful and intuitive query mechanism
  - customize the user interface

DataFoundry: An Approach to Scientific Data Integration
Terence Critchlow
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
Work performed under the auspices of the U.S. DOE by LLNL under contract No. W-7405-ENG-48. UCRL-JC