Page 1
Integrated Microbial Genomes (IMG) System
A Case Study in Biological Data Management
Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto
Biological Data Management & Technology Center, Lawrence Berkeley National Lab
Nikos C. Kyrpides, Natalia N. Ivanova
Microbial Genome Analysis Program, Joint Genome Institute

Different views on biological data management (VLDB 2004 Panel on Biological Data Management)
Computer Scientists
o Source of problems for database research
o Publication in database papers
o Prototypes
Biologists
o Vehicle for rapid data analysis
o Publication in biology papers
o Immediate solutions
Page 2
Biological Data Management Problem
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data
o Single data type: data generation & collection
o Multiple data types: data association
Page 3
Background: Microbial Genomes
o Jan 04: 532 microbial genome projects
o Mar 05: 847 microbial genome projects
Applications: healthcare, environmental cleanup, agriculture, industrial processes, alternative energy production
Page 4
Microbial Genome Data Analysis Context
Page 5
Data Analysis Example: Occurrence Profiles
[Figure: a pathway with reactions R1 (e1), R4 (e2), R3 (e3), R4 (e4), shown against genes x1 to x4 of Genome X and genes y1 to y4 of Genome Y]
o Proteins from same cellular pathway are expected to co-occur in the majority of organisms from a phylogenetic branch
o Functionally related genes tend to cluster on chromosome
Key Challenges
o Representing abstract concepts with experimental data
o Specifying individual and composite operations
o Data coherence, completeness, integration
Page 6
Microbial Genomes: Data Generation & Collection Process
Data Processing & Refinement
o Raw data
  – Small DNA sequence fragments
  – Assembled sequence fragments (contigs)
  – Complete (one contiguous) sequence
o Interpreted data
  – Gene prediction (models)
  – Functional prediction (annotations)
  – Expert data validation (cleaning)
  – Expert annotations
Key Challenges
o Diversity of data sources
  – Differences in models, depth/breadth of annotations
o Consistency of the data transformation process
  – Evolution & diversity of
    Technology platforms
    Algorithms & parameters
    Experimental, data collection conditions
Page 7
Data Transformation Process Example
[Figure: data flow between two annotation pipelines and IMG; labels listed in extraction order]
Microbial Genome Annotation Pipeline (ORNL): ORF Calling, Preliminary Functional Annotation, Post, Fetch, Sequence Data Files, Annotation Data Files, IMG Loading, IMG Load Report, Replace
Microbial Genome Annotation Review & Correction (JGI): Reference Genes, NR, IMG, Download Data For Review, Download Annotation Data Files, Data Review, Data Cleansing, Final Review & Lock, Revised Annotation Data Files
Page 8
Microbial Genomes: Data Association
[Figure: associations among Organisms, Functions, and Predicted Genes]
Key Challenges
o Data quality/precision for different types of data, sources
o Transience of identifiers, relationships
Page 9
Biological Data Management Problem Revisited
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data while addressing
o Data quality
  – Data semantics, precision, integrity, provenance
o System quality
  – Comprehensibility, performance, reliability, scalability
o Development strategy
  – Choice of technologies
  – Devising (cost, time) effective solutions
Challenging in academic settings
Page 10
Needed: System Development Framework
[Figure: system development process; labels listed in extraction order]
Deploy System; Requirements Specification; Requirement Examples; Requirements Analysis; Prototype Database, Tools; Use Scenarios; Case Studies; Data Model Abstraction; Definitions; Design & Planning; Plans & Schedules; Develop System*; Development Documents; System; Stages; Docs; Tools; Program; Test; Revise & Refine; Document; Final Release; Preliminary Release
* System Development Time/Cost Constraints
Page 11
Requirement Analysis Example: IMG Data Analysis
Find “unique” genes in a genome of interest Ψ0 with respect to related genomes Ψ1, …, Ψk
o Query construction
o Query results
o Collect genes of interest
o “Similar” gene analysis
o Chromosomal neighborhood analysis
o Iterate
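The “unique genes” requirement can be read as a set operation over genomes. The following is a minimal sketch, assuming every gene has already been assigned to a homolog group (for example by a sequence similarity search). The function name, gene identifiers, and group labels are hypothetical and are not taken from IMG.

```python
from typing import Dict, Iterable, List


def unique_genes(
    target: Dict[str, str],
    related: Iterable[Dict[str, str]],
) -> List[str]:
    """Return genes of the target genome with no homolog in any related genome.

    Each genome is a mapping gene_id -> homolog_group (hypothetical labels).
    """
    # Homolog groups seen in at least one related genome
    seen = set()
    for genome in related:
        seen.update(genome.values())
    # Keep target genes whose group never occurs in the related genomes
    return sorted(gene for gene, group in target.items() if group not in seen)


# Toy data: psi0 is the genome of interest, psi1 and psi2 are related genomes
psi0 = {"a1": "grpA", "a2": "grpB", "a3": "grpC"}
psi1 = {"b1": "grpA"}
psi2 = {"c1": "grpB", "c2": "grpD"}
print(unique_genes(psi0, [psi1, psi2]))  # ['a3']
```

In an iterative analysis the result would feed the next steps on the slide, such as inspecting the chromosomal neighborhood of each unique gene.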
Page 12
Data Model Abstraction
Motivation
o Adds precision
o Allows reasoning in an established framework
  – Analogies to traditional data domain
Biological data modeling
o Data warehouse concepts
  – Proven technology for large scale biological data management applications
o Data Structure
  – Multidimensional data space: gene, genome, function/pathway
o Operations
  – Multidimensional space selections, projections, aggregations: slice & dice, roll up, drill down… analogies
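The warehouse analogy can be illustrated with a small fact table over the three dimensions named on the slide. This is a sketch under stated assumptions: the rows, the function names, and the pathway labels are invented for illustration and do not reflect the IMG schema.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (gene, genome, function) association
FACTS = [
    ("g1", "G1", "glycolysis"),
    ("g2", "G1", "glycolysis"),
    ("g3", "G1", "tca_cycle"),
    ("g4", "G2", "glycolysis"),
    ("g5", "G2", "tca_cycle"),
]


def slice_by_genome(facts, genome):
    """Slice: fix the genome dimension and keep (gene, function) pairs."""
    return [(gene, fn) for gene, g, fn in facts if g == genome]


def roll_up_gene_counts(facts):
    """Roll up: aggregate away the gene dimension, counting genes per (genome, function)."""
    counts = defaultdict(int)
    for _gene, genome, fn in facts:
        counts[(genome, fn)] += 1
    return dict(counts)


print(slice_by_genome(FACTS, "G2"))
print(roll_up_gene_counts(FACTS))
```

Drill down is the inverse of the roll up shown here: starting from a count for one (genome, function) cell and listing the individual genes behind it.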
Page 13
Data Model Abstraction Example: IMG Data Model
[Figure: IMG data model diagram; labels listed in extraction order]
KEGG Expasy EC COG GO EBI Reviews Meta Genomes Pfam Interpro JGI Genomes Native Pathways LIGAND/ChEBI Native Terms Scaffold Feature Transcript (ORF) Gene Protein IPR Family Pfam Family Enzyme Compound KEGG Pathway Image ROI IMG Pathway IMG Reaction Pathway Network COG Reaction Meta Genome Fragment Eco Sample Chromosome / Plasmid Taxon Assembly Orthologs Paralogs GO Ontology IMG Term IMG Cluster
Dimensions: Gene, Genome, Chromosome, Function, Pathway
Page 14
Data Model Abstraction Example: IMG Operations
Dimensions: Genes, Functions/Pathways, Genomes
o Gene occurrence profile across genomes
o Gene occurrence profiles across pathways
o Pathways shared by genomes
Example selection: genes “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5
[Figure: occurrence matrix of genes g1, g2, g3 against genomes G1 to G5, with + for present and - for absent; the cell layout is not recoverable from the extracted text]
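An occurrence profile and the “in”/“not in” selection can be sketched with plain sets. The example below assumes each genome is summarized as the set of gene families it contains; the family labels, the toy data, and both function names are hypothetical, and the matrix from the slide is not reproduced.

```python
def occurrence_profile(family, genomes):
    """Return a '+'/'-' string with one symbol per genome, in insertion order."""
    return "".join("+" if family in members else "-" for members in genomes.values())


def select_by_pattern(genomes, present_in, absent_from):
    """Return families found in every genome of present_in and in none of absent_from."""
    if not present_in:
        raise ValueError("present_in must name at least one genome")
    # Families shared by all required genomes
    candidates = set.intersection(*(set(genomes[g]) for g in present_in))
    # Families occurring in any excluded genome
    excluded = set().union(*(genomes[g] for g in absent_from))
    return sorted(candidates - excluded)


# Toy data: genome name -> set of gene families (hypothetical labels)
GENOMES = {
    "G1": {"fA", "fB", "fC"},
    "G2": {"fA", "fB"},
    "G3": {"fB", "fC"},
    "G4": {"fA", "fB", "fC"},
    "G5": {"fA", "fB"},
}

print(occurrence_profile("fA", GENOMES))  # ++-++
print(select_by_pattern(GENOMES, ["G1", "G2", "G4", "G5"], ["G3"]))  # ['fA']
```

The same two functions cover the other operations on the slide if genomes are replaced by pathways as the column dimension.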
Page 15
Data Analysis Example: Searching for Unique Genes
[Screenshot annotations]
o Parasite in horses
o Causes human disease in tropical areas (melioidosis)
Page 16
Identifying Unique Genes of Interest
Genes involved in adherence and invasion
Page 17
Exploring Unique Gene Details
Page 18
Summary
Needed: effective solutions for academic biological data management
o Employing appropriate technologies and methods
o Developed within (time, cost) constraints
IMG Case Study
o System development process framework essential for
  – Continuously evolving content, aiming at coherence, completeness
  – Developing meaningful data analysis tools
  – Clarity of methods, parameters, results
o Metric for success
  – Community adoption and support
  – Increase in analysis productivity and value
Page 19
Summary
Biological Data Management in Academic Settings
o Problems discussed in numerous forums since 1990
o Tools, techniques: poorly understood & used
Potential Causes
o “… biologists have been ineffective in the “care and feeding” of databases… that now extends to poor maintenance of genomics databases…” (American Academy of Microbiology Report, 2002)
o Computer scientists
  – In pursuit of “insignificant or misunderstood problems” (Bio Data Management Workshop, 2003)
  – Have little interest in tedious, repetitive, data management tasks
o “… diminished responsibility for biological databases … is correlated with lack of enthusiasm for funding these efforts …” (AAM Report, 2002)
o Poor industry support