Integrated Microbial Genomes (IMG) System


Page 1 Integrated Microbial Genomes (IMG) System
A Case Study in Biological Data Management
Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto (Biological Data Management & Technology Center, Lawrence Berkeley National Lab)
Nikos C. Kyrpides, Natalia N. Ivanova (Microbial Genome Analysis Program, Joint Genome Institute)
Different views on biological data management (VLDB 2004 Panel on Biological Data Management):
- Computer scientists
  - Source of problems for database research
  - Publication in database papers
  - Prototypes
- Biologists
  - Vehicle for rapid data analysis
  - Publication in biology papers
  - Immediate solutions

Page 2 Biological Data Management Problem
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data.
- Single data type: data generation & collection
- Multiple data types: data association

Page 3 Background: Microbial Genomes
- Jan 04: 532 microbial genome projects
- Mar 05: 847 microbial genome projects
- Applications: healthcare, environmental cleanup, agriculture, industrial processes, alternative energy production

Page 4 Microbial Genome Data Analysis Context

Page 5 Data Analysis Example: Occurrence Profiles
- Key challenges
  - Representing abstract concepts with experimental data
  - Specifying individual and composite operations
  - Data coherence, completeness, integration
- Proteins from the same cellular pathway are expected to co-occur in the majority of organisms from a phylogenetic branch.
- Functionally related genes tend to cluster on the chromosome.
[Figure: pathway reactions R4 (e4), R3 (e3), R4 (e2), R1 (e1); Genome X genes x1, x2, x3, x4; Genome Y genes y1, y2, y3, y4; question marks on the unresolved assignments]
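
The co-occurrence expectation on this slide can be illustrated with a minimal sketch. It assumes each gene family is reduced to the set of genomes in which it occurs, and it uses Jaccard similarity as one possible way to compare profiles; the family names, genome labels, and the 0.7 threshold are all hypothetical and are not taken from IMG.

```python
from itertools import combinations

# Hypothetical presence/absence data: gene family -> genomes in which it occurs.
occurrence = {
    "famA": {"G1", "G2", "G4", "G5"},
    "famB": {"G1", "G2", "G4", "G5"},
    "famC": {"G1", "G3"},
    "famD": {"G2", "G4", "G5"},
}


def jaccard(a, b):
    """Jaccard similarity of two genome sets (1.0 means identical profiles)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_occurring_pairs(profiles, threshold=0.7):
    """Return family pairs whose occurrence profiles overlap at least `threshold`."""
    pairs = []
    for f1, f2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[f1], profiles[f2])
        if score >= threshold:
            pairs.append((f1, f2, round(score, 2)))
    return pairs


if __name__ == "__main__":
    for pair in co_occurring_pairs(occurrence):
        print(pair)
```

Families with highly similar profiles become candidates for belonging to the same pathway; the threshold trades sensitivity against noise in the underlying gene predictions.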

Page 6 Microbial Genomes: Data Generation & Collection
- Process (data processing & refinement)
  - Raw data
    - Small DNA sequence fragments
    - Assembled sequence fragments (contigs)
    - Complete (one contiguous) sequence
  - Interpreted data
    - Gene prediction (models)
    - Functional prediction (annotations)
    - Expert data validation (cleaning)
    - Expert annotations
- Key challenges
  - Diversity of data sources
    - Differences in models, depth/breadth of annotations
  - Consistency of the data transformation process
    - Evolution & diversity of technology platforms
    - Algorithms & parameters
    - Experimental, data collection conditions

Page 7 Data Transformation Process Example
[Flow diagram; labels as transcribed]
- Microbial Genome Annotation Pipeline (ORNL): ORF Calling, Preliminary Functional Annotation
- Post / Fetch: Sequence Data Files, Annotation Data Files
- IMG Loading, IMG Load Report, Replace
- Microbial Genome Annotation Review & Correction (JGI): Reference Genes, NR, IMG, Download Data For Review, Download Annotation Data Files, Data Review, Data Cleansing, Final Review & Lock, Revised Annotation Data Files

Page 8 Microbial Genomes: Data Association
[Figure: associations among Organisms, Predicted Genes, and Functions]
- Key challenges
  - Data quality/precision for different types of data, sources
  - Transience of identifiers, relationships

Page 9 Biological Data Management Problem Revisited
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data, while addressing:
- Data quality: data semantics, precision, integrity, provenance
- System quality: comprehensibility, performance, reliability, scalability
- Development strategy
  - Choice of technologies
  - Devising (cost, time) effective solutions
Challenging in academic settings.

Page 10 Needed: System Development Framework
[Process diagram; labels as transcribed]
- Requirements Specification: Requirement Examples
- Requirements Analysis: Prototype Database, Tools; Use Scenarios; Case Studies
- Data Model Abstraction: Definitions
- Design & Planning: Plans & Schedules
- Develop System*: Development Documents; System Stages; Docs; Tools
  - Program, Test, Revise & Refine, Document
  - Preliminary Release, Final Release
- Deploy System
* System development time/cost constraints

Page 11 Requirement Analysis Example: IMG Data Analysis
Find “unique” genes in a genome of interest Ψ0 with respect to related genomes Ψ1, …, Ψk.
- Query construction
- Query results
- Collect genes of interest
- “Similar” gene analysis
- Chromosomal neighborhood analysis
- Iterate
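
A minimal sketch of the “unique” gene requirement, assuming that similarity between genes has already been computed elsewhere and is available as a lookup from each gene of the genome of interest to the genomes that hold a similar gene. The function name, gene identifiers, and genome labels are hypothetical; the slide does not specify how IMG computes similarity.

```python
def unique_genes(genes_of_interest, similar_hits, related_genomes):
    """Return genes with no similar gene in any of the related genomes.

    genes_of_interest: genes of the genome of interest (Psi_0).
    similar_hits: gene -> set of genomes containing a similar gene (assumed precomputed).
    related_genomes: the related genomes Psi_1 ... Psi_k.
    """
    related = set(related_genomes)
    return sorted(
        gene
        for gene in genes_of_interest
        if not (similar_hits.get(gene, set()) & related)
    )


if __name__ == "__main__":
    psi0_genes = ["g1", "g2", "g3", "g4"]
    hits = {"g1": {"psi1", "psi2"}, "g2": {"psi9"}, "g4": {"psi2"}}
    # g2 only has a hit outside the related set and g3 has no hits at all.
    print(unique_genes(psi0_genes, hits, ["psi1", "psi2"]))
```

Changing the list of related genomes and re-running corresponds to the "Iterate" step: the set of unique genes depends on which genomes are chosen for comparison.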

Page 12 Data Model Abstraction
- Motivation
  - Adds precision
  - Allows reasoning in an established framework
    - Analogies to traditional data domain
- Biological data modeling
  - Data warehouse concepts
    - Proven technology for large scale biological data management applications
  - Data structure
    - Multidimensional data space: gene, genome, function/pathway
  - Operations
    - Multidimensional space selections, projections, aggregations
    - Slice & dice, roll up, drill down… analogies
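
The warehouse analogy can be sketched in a few lines, assuming the multidimensional space is stored as (gene, genome, pathway) facts. The facts, the helper names `slice_by` and `roll_up`, and the pathway labels are illustrative assumptions, not the IMG schema.

```python
from collections import Counter

# Hypothetical facts in a (gene, genome, pathway) data space.
FACTS = [
    ("g1", "G1", "glycolysis"),
    ("g2", "G1", "glycolysis"),
    ("g3", "G1", "TCA cycle"),
    ("g4", "G2", "glycolysis"),
    ("g5", "G2", "TCA cycle"),
]

DIMENSIONS = {"gene": 0, "genome": 1, "pathway": 2}


def slice_by(facts, **fixed):
    """Selection: keep facts whose named dimensions equal the given values."""
    return [
        fact
        for fact in facts
        if all(fact[DIMENSIONS[dim]] == value for dim, value in fixed.items())
    ]


def roll_up(facts, *dims):
    """Aggregation: count facts grouped by the named dimensions."""
    return Counter(tuple(fact[DIMENSIONS[d]] for d in dims) for fact in facts)


if __name__ == "__main__":
    print(slice_by(FACTS, genome="G1", pathway="glycolysis"))
    print(roll_up(FACTS, "genome"))             # coarse view: genes per genome
    print(roll_up(FACTS, "genome", "pathway"))  # drill down: genes per genome and pathway
```

Grouping by fewer dimensions is a roll up, and adding a dimension back is a drill down; fixing one or more dimension values is a slice or dice.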

Page 13 Data Model Abstraction Example: IMG Data Model
[Schema diagram; labels as transcribed]
- Sources: KEGG, ExPASy EC, COG, GO, EBI, Reviews, Meta Genomes, Pfam, InterPro, JGI Genomes, Native Pathways, LIGAND/ChEBI, Native Terms
- Entities: Scaffold, Feature, Transcript (ORF), Gene, Protein, IPR Family, Pfam Family, Enzyme, Compound, KEGG Pathway, Image ROI, IMG Pathway, IMG Reaction, Pathway Network, COG, Reaction, Meta Genome Fragment, Eco Sample, Chromosome / Plasmid, Taxon, Assembly, Orthologs, Paralogs, GO Ontology, IMG Term, IMG Cluster
- Dimensions: Gene; Genome, Chromosome; Function, Pathway

Page 14 Data Model Abstraction Example: IMG Operations
- Dimensions: Genes, Functions/Pathways, Genomes
- Gene occurrence profile across genomes
- Gene occurrence profiles across pathways
- Pathways shared by genomes
- Example selection: genes “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5
[Figure: genes g1, g2, g3 against genomes G1, G2, G3, G4, G5]
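
Two of the operations listed on this slide can be sketched with plain set operations, assuming each gene (or gene family) and each genome is represented by a set. The function names and sample data are hypothetical; only the “in”/“not in” pattern comes from the slide.

```python
def match_profile(gene_genomes, present, absent):
    """Genes found in every genome of `present` and in no genome of `absent`."""
    present, absent = set(present), set(absent)
    return sorted(
        gene
        for gene, genomes in gene_genomes.items()
        if present <= genomes and not (absent & genomes)
    )


def shared_pathways(genome_pathways, genomes):
    """Pathways that occur in all of the listed genomes (empty list if none given)."""
    sets = [genome_pathways[g] for g in genomes]
    return sorted(set.intersection(*sets)) if sets else []


if __name__ == "__main__":
    gene_genomes = {
        "g1": {"G1", "G2", "G4", "G5"},
        "g2": {"G1", "G2", "G3", "G4", "G5"},
        "g3": {"G1", "G4"},
    }
    # The slide's pattern: in G1, in G2, not in G3, in G4, in G5.
    print(match_profile(gene_genomes, ["G1", "G2", "G4", "G5"], ["G3"]))

    genome_pathways = {
        "G1": {"glycolysis", "TCA cycle"},
        "G2": {"glycolysis"},
    }
    print(shared_pathways(genome_pathways, ["G1", "G2"]))
```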

Page 15 Data Analysis Example: Searching for Unique Genes
- Parasite in horses
- Causes human disease in tropical areas (melioidosis)

Page 16 Identifying Unique Genes of Interest
- Genes involved in adherence and invasion

Page 17 Exploring Unique Gene Details

Page 18 Summary
- Needed: effective solutions for academic biological data management
  - Employing appropriate technologies and methods
  - Developed within (time, cost) constraints
- IMG case study
  - System development process framework essential for:
    - Continuously evolving content, aiming at coherence, completeness
    - Developing meaningful data analysis tools
    - Clarity of methods, parameters, results
  - Metric for success
    - Community adoption and support
    - Increase in analysis productivity and value

Page 19 Summary
- Biological data management in academic settings
  - Problems discussed in numerous forums since 1990
  - Tools, techniques: poorly understood & used
- Potential causes
  - “… biologists have been ineffective in the “care and feeding” of databases… that now extends to poor maintenance of genomics databases…” (American Academy of Microbiology Report, 2002)
  - Computer scientists in pursuit of “insignificant or misunderstood problems” (Bio Data Management Workshop, 2003)
    - Have little interest in tedious, repetitive data management tasks
  - “… diminished responsibility for biological databases … is correlated with lack of enthusiasm for funding these efforts …” (AAM Report 2002)
  - Poor industry support