
Proteomics databases for comparative studies: Transactional and Data Warehouse approaches
Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall
GeneProt, Switzerland

What is Proteomics?
[Figure: workflow diagram. Labels: Sample; Separation (CEX, RP); MS; BioInformatics processes; DB; Manual analysis. Peptides P1 to P5 are shown against Protein, EST (E1 to E3) and Genomic (G1) sequences.]

About DBs for Proteomics at GeneProt
• Needs
• Data
• Transactional DB
• Data Warehouse
• Data Mining

Data Management Challenges
• A high-throughput environment requires near-real-time processing.
• Respond quickly to evolving laboratory procedures and evolving user needs.
• Accommodate heterogeneous data types.
• Manage a constantly rising flood of data.
• Provide convenient data access at all levels of granularity via analysis software and web front ends.
• Adapt to the demand for global queries across all proteomics studies.
• Adapt and innovate to offer new tools:
  - Statistics
  - Data mining

Data Flow
[Figure: data flow diagram. Labels: experimental data (LIMS); identification of peptides and proteins; annotation; external data sources; DB; XML; data export.]

Data details
• Experimental data:
  - Store MS and MS/MS peak lists
  - Store all metadata
• Identification:
  - Load peptide matches, identified proteins, scores
• Automatic annotation and analysis:
  - Give access to data, store results
• Expert annotation:
  - Give interactive access to data through a Web interface; store manual validation and annotation
• External data sources:
  - Import information from external data sources: taxonomy, ontologies, bibliography…
• Export data:
  - Export all or a subset of the data as a flat file or a database dump
• Misc:
  - Access control, security and confidentiality
  - Data consistency/integrity checks
  - Error checks and corrections
  - Run statistics
  - Backup and archive

Data production per project
• Raw data (spectra): >
• Identified peptides: >
• Identified sequences: >
• Database size: 15 GB to 140 GB
• Number of projects: 16
• 1 TB of database files

Implementation: transactional
• Intended to capture all relevant information from proteomics experiments, protein identification, and automatic and manual annotation and validation.
• Each proteome is isolated in its own ProtDB (16 at present).
• Complex and generic data model for efficient data storage.
• Built-in data consistency and error checks.
• A layer of « views » provides fast query access.
• Web front end: an interactive means to visualize, update and validate data.
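
The « views » layer could look roughly like the sketch below. It is only an illustration: SQLite stands in for Oracle, and every table, column and accession name (protein, peptide_match, validation, v_protein_summary, PROT_A…) is hypothetical rather than taken from the actual ProtDB model. A plain view like this one mainly hides the joins of a generic schema behind a simple query; any speed gain in the real system would depend on database features not shown here.

```python
import sqlite3

# Hypothetical, heavily simplified schema; not the actual ProtDB data model.
SCHEMA = """
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    accession  TEXT NOT NULL
);
CREATE TABLE peptide_match (
    match_id   INTEGER PRIMARY KEY,
    protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
    sequence   TEXT NOT NULL,
    score      REAL NOT NULL
);
CREATE TABLE validation (
    protein_id INTEGER PRIMARY KEY REFERENCES protein(protein_id),
    status     TEXT NOT NULL CHECK (status IN ('accepted', 'rejected'))
);
-- The view hides the joins and exposes one row per protein.
CREATE VIEW v_protein_summary AS
SELECT p.accession       AS accession,
       COUNT(m.match_id) AS n_peptides,
       MAX(m.score)      AS best_score,
       v.status          AS status
FROM protein p
LEFT JOIN peptide_match m ON m.protein_id = p.protein_id
LEFT JOIN validation v    ON v.protein_id = p.protein_id
GROUP BY p.protein_id, p.accession, v.status;
"""


def build_demo_db():
    """Create an in-memory database with the schema and a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [(1, "PROT_A"), (2, "PROT_B")],
    )
    conn.executemany(
        "INSERT INTO peptide_match (protein_id, sequence, score) VALUES (?, ?, ?)",
        [(1, "ELVISK", 41.5), (1, "LIVESR", 63.0), (2, "PEPTIDEK", 22.0)],
    )
    conn.execute("INSERT INTO validation VALUES (1, 'accepted')")
    conn.commit()
    return conn


def protein_summary(conn):
    """Query the view instead of joining the base tables by hand."""
    return conn.execute(
        "SELECT accession, n_peptides, best_score, status "
        "FROM v_protein_summary ORDER BY accession"
    ).fetchall()
```

Client code, such as a web front end, would then only need `protein_summary(build_demo_db())` and never touch the underlying tables directly.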

Limitations
• We have 16 projects online:
  - High maintenance cost to keep all database schemas compatible.
  - Space: could we archive some of the projects? New spectrometers produce more data.
• Inter-database queries:
  - The technique « exists », but implementation is often awkward and there is no efficient solution in our case.
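
To give a feel for why inter-database queries are awkward, here is a minimal sketch that assumes each project lives in its own SQLite file (a stand-in for the per-project Oracle databases). The table name protein, the aliases proj0, proj1… and the helper names are all hypothetical. Every database has to be attached by hand and the query is stitched together with one SELECT per project, which only works while all schemas stay compatible.

```python
import os
import sqlite3
import tempfile


def make_project_db(path, accessions):
    """Create one per-project database file with a single 'protein' table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE protein (accession TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO protein VALUES (?)", [(a,) for a in accessions])
    conn.commit()
    conn.close()


def proteins_across_projects(paths):
    """Return {accession: [project aliases]} by querying every project file.

    Note: SQLite attaches at most 10 databases by default, so even this toy
    approach would need extra work for 16 projects.
    """
    conn = sqlite3.connect(":memory:")
    selects = []
    for i, path in enumerate(paths):
        alias = f"proj{i}"  # generated internally, so safe to interpolate
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
        selects.append(
            f"SELECT '{alias}' AS project, accession FROM {alias}.protein"
        )
    # One SELECT per project, glued together: this grows with every new project.
    query = " UNION ALL ".join(selects) + " ORDER BY accession, project"
    found = {}
    for project, accession in conn.execute(query):
        found.setdefault(accession, []).append(project)
    conn.close()
    return found
```

With two project files containing (PROT_A, PROT_B) and (PROT_B, PROT_C), the function reports PROT_B under both aliases and the other two under one alias each.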

How can we overcome these limitations and take advantage of this wealth of data?
• Decide what data are actually important in the long term.
• Merge the data from all the projects.
• Clean and consolidate the data.
• Implement an update procedure to keep this « merged data system » up to date.
• (Archive old projects.)

Data Warehouse?
This looks very much like the definition of a data warehouse!
• Data consolidation and integration
• Non-instantaneous accuracy, non-volatility
• Comprehensive data structure
• Query throughput

ProtWare: proteomics data warehouse
1. Stores consolidated and final analysis results; centralises data common to proteins in all proteome studies.
2. Is read-only and not real time; asynchronous updates are run weekly.
3. Data model is focused on proteome-to-proteome comparisons.
4. Comprehensive data structure, which enhances the performance of analysis queries.
5. Ideally suited for statistical analysis and data mining tools.
6. Provides a decision support system.

ProtDB and ProtWare data flow
[Figure: data flow diagram. Labels: identification; automatic annotation; annotation; website; classification, taxonomy…; per-project databases P1, P2, … Pn; export as flat file, DB dump and XML; Extraction, Transformation, Loading (ETL); ProtWare; analyses & statistical queries.]
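
The Extraction, Transformation, Loading step between the per-project databases and the warehouse can be sketched as follows. This is a toy version under stated assumptions: in-memory SQLite databases replace Oracle, and the tables annotation_history and protein_result, their columns and the project and accession names are invented for the example. The transform step keeps only the final validation status per protein, which mirrors the idea of a warehouse holding consolidated results without history.

```python
import sqlite3


def make_source(rows):
    """Stand-in for one per-project transactional database.

    rows: (accession, status, step) tuples; 'step' orders the validation history.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation_history (accession TEXT, status TEXT, step INTEGER)"
    )
    conn.executemany("INSERT INTO annotation_history VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def extract(source):
    """Extraction: read the full annotation history of one project."""
    return source.execute(
        "SELECT accession, status, step FROM annotation_history"
    ).fetchall()


def transform(project, rows):
    """Transformation: keep the latest status per protein and tag the project."""
    latest = {}
    for accession, status, step in rows:
        if accession not in latest or step > latest[accession][1]:
            latest[accession] = (status, step)
    return [(project, acc, status) for acc, (status, _) in sorted(latest.items())]


def load(warehouse, records):
    """Loading: upsert consolidated rows, so a periodic re-run refreshes them."""
    warehouse.executemany(
        "INSERT OR REPLACE INTO protein_result VALUES (?, ?, ?)", records
    )


def run_etl(sources):
    """Merge all projects ({project name: connection}) into one warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE protein_result ("
        "project TEXT, accession TEXT, final_status TEXT, "
        "PRIMARY KEY (project, accession))"
    )
    for project, source in sources.items():
        load(warehouse, transform(project, extract(source)))
    warehouse.commit()
    return warehouse
```

Running `run_etl` on a schedule (the slides mention weekly updates) would keep the merged table current without the warehouse ever being written to by users.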

ProtDB vs ProtWare
ProtDB: transactional system
• Data input, real-time access to data
• Data updates, annotation, validation
• Error and consistency checks
• Stores experimental data
• Stores all steps of data annotation and validation (keeps history)
• In-depth queries on a given proteome
ProtWare: data warehouse
• Read-only, asynchronous updates from ProtDB
• Consolidated data and final results of annotation and validation (no history)
• No experimental data
• Queries oriented to proteome comparisons, statistics, data mining
• Decision support system
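
As an illustration of a comparison-oriented warehouse query, the sketch below assumes a single consolidated table listing which protein was identified in which proteome. The table protein_in_proteome, the proteome labels and the accessions are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3


def build_warehouse(observations):
    """observations: (proteome, accession) pairs already consolidated by ETL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein_in_proteome ("
        "proteome TEXT, accession TEXT, PRIMARY KEY (proteome, accession))"
    )
    conn.executemany("INSERT INTO protein_in_proteome VALUES (?, ?)", observations)
    conn.commit()
    return conn


def shared_proteins(conn):
    """Accessions identified in every proteome stored in the warehouse."""
    rows = conn.execute(
        "SELECT accession FROM protein_in_proteome "
        "GROUP BY accession "
        "HAVING COUNT(DISTINCT proteome) = "
        "       (SELECT COUNT(DISTINCT proteome) FROM protein_in_proteome) "
        "ORDER BY accession"
    )
    return [r[0] for r in rows]


def specific_proteins(conn, proteome):
    """Accessions identified in the given proteome and in no other."""
    rows = conn.execute(
        "SELECT accession FROM protein_in_proteome "
        "GROUP BY accession "
        "HAVING COUNT(*) = 1 AND MIN(proteome) = ? "
        "ORDER BY accession",
        (proteome,),
    )
    return [r[0] for r in rows]
```

Because all proteomes sit in one table, both questions are single GROUP BY queries, in contrast to the per-database stitching a transactional layout would need.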

The needle in a haystack
• Of course we are looking for the Holy Grail!
• Find the interesting proteins in all our data that:
  - Can be used for diagnostics,
  - Can explain a disease,
  - Can be used to cure a disease.

KDD and Data Mining
• Knowledge Discovery in Databases (KDD) is « the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data ».
• Data mining is the discovery stage of KDD.
• Data mining tools provide additional possibilities to explore a database.

Data Mining tools
• ProtWare: the data warehouse model is oriented towards protein queries.
• R package: statistics and clustering tools
• Oracle 10g: new data mining functions

Database infrastructure
• Data input files use XML.
• RDBMS: Oracle 9i, moving to Oracle 10g, on Linux.
• ProtWare uses ANSI SQL and is portable to other ANSI SQL-compliant systems (e.g. PostgreSQL).
• Web interface built using standard technologies: Perl, CGI, DBI, HTML, JavaScript, SVG.
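
Since data input files use XML, a loader might start from something like the sketch below. The element and attribute names (identifications, protein, peptide, accession, sequence, score) and the sample content are assumptions made for this example; the slides do not describe the actual XML format.

```python
import xml.etree.ElementTree as ET

# Made-up input file content; the real format is not given in the slides.
SAMPLE = """<identifications project="P1">
  <protein accession="PROT_A">
    <peptide sequence="ELVISK" score="41.5"/>
    <peptide sequence="LIVESR" score="63.0"/>
  </protein>
  <protein accession="PROT_B">
    <peptide sequence="PEPTIDEK" score="22.0"/>
  </protein>
</identifications>"""


def parse_identifications(xml_text):
    """Flatten the XML into (project, accession, sequence, score) rows,
    ready to be inserted into a relational table."""
    root = ET.fromstring(xml_text)
    project = root.get("project")
    rows = []
    for protein in root.findall("protein"):
        accession = protein.get("accession")
        for peptide in protein.findall("peptide"):
            rows.append(
                (project, accession, peptide.get("sequence"), float(peptide.get("score")))
            )
    return rows
```

Calling `parse_identifications(SAMPLE)` yields three rows, one per peptide match, which could then be handed to a database driver's bulk insert.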