Making the Most of What We Know: Towards Effective Use of Genomics Data Terence Critchlow Center for Applied Scientific Computing Lawrence Livermore National.


Making the Most of What We Know: Towards Effective Use of Genomics Data Terence Critchlow Center for Applied Scientific Computing Lawrence Livermore National Laboratory PITTCON 2001 March 6, 2001 UCRL-VG

Outline
- A day in the life of a geneticist/biologist.
- The business approach. (present)
- The impact of the web. (future)
- The key to success.

Hey! Today, I finally get to analyze the sequencing data I have been collecting for the last week.

Wow! I need to look at 500 clones. That will take a couple of weeks.

Man, this sure would be easier if I could move between these different sites without having to reformat the data every time.

Currently, accessing genomics data is challenging.
The user is required to perform all data management tasks. Different users end up doing the same thing.
[Figure: user applications, the source-specific schemas of PDB, dbEST, SCoP, and SWISS-PROT, and the four data management tasks: access the data, parse input/output, map similar concepts, transform data format.]

The result of the current environment: Scientists limit their searches to a small number of sources they are familiar with. A huge amount of relevant information is not utilized. Resources are wasted. Time is wasted. Knowledge is wasted.

What is our ideal environment?
A single location that provides effective access to a consistent view of data from many sources through an intuitive and useful interface. Businesses use data warehouses to accomplish this.
[Figure: user applications, the sources PDB, dbEST, SCoP, and SWISS-PROT, and the four data management tasks: access the data, parse input/output, map similar concepts, transform data format.]

The business approach.

Data warehouses
- Interfaces
  - provide intuitive access to the data
  - possibly change data format to meet user expectations
- Warehouse
  - stores a consistent view of data in a local repository
- Mediator
  - transforms data from source format to warehouse format
- Wrappers
  - read data from source into internal representation
[Figure labels: SWISS-PROT, SCoP, PDB, dbEST, Wrapper, Mediator, Ingest Code, Data Warehouse.]
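The division of labor on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the DataFoundry implementation: the flat-file layout, the `FIELD_MAP` table, and the names `wrapper`, `mediator`, and `Warehouse` are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from source-specific tags to warehouse column names.
FIELD_MAP = {"ID": "name", "AC": "accession", "DE": "description"}


def wrapper(raw: str) -> dict:
    """Wrapper: read a simplified 'TAG   value' flat file into an internal dict."""
    record = {}
    for line in raw.splitlines():
        parts = line.split(None, 1)  # split the tag from the rest of the line
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def mediator(record: dict) -> dict:
    """Mediator: rename source fields to the warehouse (global schema) names."""
    return {FIELD_MAP[tag]: value for tag, value in record.items() if tag in FIELD_MAP}


@dataclass
class Warehouse:
    """Warehouse: a local repository holding rows in one consistent format."""
    rows: list = field(default_factory=list)

    def ingest(self, row: dict) -> None:
        self.rows.append(row)

    def find(self, accession: str) -> list:
        """Interface: a simple lookup that hides where the data came from."""
        return [row for row in self.rows if row.get("accession") == accession]
```

A record would flow through the pieces as `warehouse.ingest(mediator(wrapper(raw_text)))`, after which `warehouse.find(...)` answers queries without touching the original source.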

Typical approaches combine wrapper and mediator functionality.
[Figure: for each source (PDB, SWISS-PROT, dbEST, SCoP), a single component handles parsing, transformations, and data entry between the source-specific schema and the global schema of the data warehouse. User applications carry out the user task through an API. Data management tasks shown: access the data, parse input/output, map similar concepts, transform data format.]

When a source changes, propagating the change can be time consuming.
Need an architecture capable of managing change.
[Figure: for each source (PDB, SWISS-PROT, dbEST, SCoP), a single component handles parsing, transformations, and data entry between the source-specific schema and the global schema of the data warehouse. User applications carry out the user task through an API. Data management tasks shown: access the data, parse input/output, map similar concepts, transform data format.]

DataFoundry separates these tasks.
[Figure: PDB, SWISS-PROT, dbEST, and SCoP each have their own wrapper. A mediator, shown with a Mediator Generator and meta-data, sits between the wrappers (source-specific schema) and the data warehouse (global schema). User applications carry out the user task through an API. Data management tasks shown: access the data, parse input/output, map similar concepts, transform data format.]
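One way to picture a mediator generator driven by meta-data is a function that builds each mediator from a declarative mapping, so that a change at a source means editing one mapping entry rather than rewriting code. The sketch below rests on that assumption only; `METADATA`, `generate_mediator`, and the field names are hypothetical and are not taken from DataFoundry.

```python
# Hypothetical meta-data: for each source, source field -> global schema field.
METADATA = {
    "swissprot": {"AC": "accession", "DE": "description"},
    "pdb": {"idcode": "accession", "title": "description"},
}


def generate_mediator(mapping: dict):
    """Build a mediator function from a declarative field mapping."""
    def mediate(record: dict) -> dict:
        # Keep only mapped fields and rename them to the global schema names.
        return {
            target: record[source]
            for source, target in mapping.items()
            if source in record
        }
    return mediate


# One generated mediator per source; regenerate after the meta-data changes.
MEDIATORS = {name: generate_mediator(mapping) for name, mapping in METADATA.items()}
```

If a source renamed one of its fields, only the matching `METADATA` entry would need updating before regenerating `MEDIATORS`; the wrappers and the warehouse schema would stay untouched.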

The impact of the Web.

The web has changed how scientific data is shared.
4 years ago (when we started): ~30 sources distributed over the WWW. Scientists access data from 1-10 community resources using the web.
[Figure labels: PDB, dbEST, SCoP, SWISS-PROT, and further sources.]

The web has changed how scientific data is shared.
Today: ~500 sources distributed over the WWW. Scientists access data from 4-15 community resources using the web.
[Figure labels: PDB, dbEST, SCoP, SWISS-PROT, DF, and further sources.]

So, what makes this so much more difficult?
- No central directory
  - How do you find the sources in the first place?
- Different data formats and semantics
  - Once you find a source, how do you make sense of it?
- Complex analysis
  - Queries require more than simple data retrieval; they need to invoke complex programs.
- Lots of data sources
  - Too much work to be done manually.
  - Need to select an appropriate subset for each user / query.
  - How do you present the results in a useful format?

A hybrid approach to information access holds the most promise.
- Detailed analysis: the data warehouse, fed by the PDB, SWISS-PROT, dbEST, and SCoP wrappers through the mediator (Mediator Generator, meta-data).
- Dynamic access to large numbers of non-integrated sources: a query engine with dynamic source access wrappers, a preprocess step, and meta-data.
[Figure labels: API, Global Schema, Source Specific Schema, Data Warehouse, Query Engine.]
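The hybrid idea can be sketched as a query engine that consults the integrated warehouse first and then fans out to sources that were never integrated. Everything here is an assumption for illustration: the `QueryEngine` class, the shape of the warehouse (a dict keyed by accession), and the fetch callables standing in for dynamic wrappers are hypothetical.

```python
from typing import Callable, Optional


class QueryEngine:
    """Combine a local warehouse with on-demand access to other sources."""

    def __init__(
        self,
        warehouse: dict,
        dynamic_sources: dict[str, Callable[[str], Optional[dict]]],
    ) -> None:
        self.warehouse = warehouse              # accession -> integrated row
        self.dynamic_sources = dynamic_sources  # source name -> fetch function

    def lookup(self, accession: str) -> list:
        """Return (origin, row) pairs from the warehouse and dynamic sources."""
        results = []
        if accession in self.warehouse:
            results.append(("warehouse", self.warehouse[accession]))
        for name, fetch in self.dynamic_sources.items():
            row = fetch(accession)  # a dynamic wrapper would call the source here
            if row is not None:
                results.append((name, row))
        return results
```

In a real system the fetch functions would issue web requests and the engine would select only the sources relevant to each query; both are left out to keep the sketch self-contained.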

The key to success is management support.

Ideally, management fully supports the bioinformatics effort.
Good management understands the complexity of the problems and provides the required resources. The technical problems surrounding bioinformatics can be overcome.
- Schedules are reasonable
- Goals are realistic
- Staff is properly trained
- Funding is adequate and stable

Unfortunately, that is usually not the case.
Poor management underestimates the complexity of the problems and provides inadequate resources. You are contributing to the problems facing bioinformatics instead of their solutions.
- Schedules are unreasonable
- Goals are unrealistic
- Morale is low & staff is poorly trained
- Funding is minimal and unstable

Conclusions
The good news: The technical challenges surrounding bioinformatics can be overcome.
The bad news: Bioinformatics has a long way to go before it can support scientific exploration. The major challenge is political, not technical.

Conclusions Be part of the solution, not part of the problem.

Questions?

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W ENG-48.