The Integrated Microbial Genome (IMG) systems


1 The Integrated Microbial Genome (IMG) systems
Nikos Kyrpides

2 Data analysis Data Integration Comparative Analysis

3 Data management system for comparative analysis of biological data
What is the Matrix?
IMG: Genes, Genomes, Functions, Metadata, Clusters, SNPs, Proteomics, Regulons, Transcriptomes

4 Integrated Microbial Genomes (IMG) [It’s easier to analyze 1000 genomes than a single one]
What is IMG: IMG is a data management system for comparative analysis and annotation of all publicly available genomes from three domains of life in a uniquely integrated context.
Mission: To become the home of microbial genome and metagenome analysis.
Background: Launched in March 2005; 3 releases per year; >5,000 unique visitors per month; >300 citations.
Current status: 8939 genomes, 24 million genes.
Bacteria: 3930
Archaea:
Eukarya: 177
Plasmids: 1205
Viruses:
Gfragments: 654
Users can: search data, browse data, compare data, export data.

5 http://img.jgi.doe.gov/
USERS CAN: Search data, Browse data, Compare data, Export data
USERS CAN: Submit data, Annotate data

6 Data Model Abstraction Example: IMG Operations
Gene occurrence profile across genomes: genes (g1, g2, g3) present in G1 and absent from G2, G3, G4 and G5.
Gene occurrence profiles across pathways.
Pathways shared by genomes.
Dimensions: Genes, Genomes, Functions/Pathways.
Speaker note: the dimensional modeling approach has a positive impact on data exploration. Examples 1 and 2 are slice-and-dice operations; the result is data reduction and a focus on the data relevant to the question.
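The "present in G1, absent from G2 to G5" operation can be illustrated with plain set arithmetic. This is only a minimal sketch: the toy presence data, the function names `present_in_absent_from` and `occurrence_profile`, and the idea that genes are already grouped into comparable families are all assumptions for illustration, not a description of how IMG implements its profiler.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy data: each genome maps to the set of gene families it contains.
genomes: Dict[str, Set[str]] = {
    "G1": {"g1", "g2", "g3"},
    "G2": {"g2"},
    "G3": {"g2", "g3"},
    "G4": set(),
    "G5": {"g2"},
}


def present_in_absent_from(
    profiles: Dict[str, Set[str]], present_in: str, absent_from: Iterable[str]
) -> Set[str]:
    """Return the families found in `present_in` but in none of `absent_from`."""
    excluded: Set[str] = set().union(*(profiles[g] for g in absent_from))
    return profiles[present_in] - excluded


def occurrence_profile(profiles: Dict[str, Set[str]], family: str) -> Dict[str, bool]:
    """Return a genome -> presence flag vector for a single gene family."""
    return {genome: family in families for genome, families in profiles.items()}


unique_to_g1 = present_in_absent_from(genomes, "G1", ["G2", "G3", "G4", "G5"])
g3_profile = occurrence_profile(genomes, "g3")
```

Each query slices one dimension (genomes) and keeps only the genes that satisfy the presence/absence pattern, which is the data reduction the speaker note refers to.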

7 IMG Data Integration
Genes: 24.2M; Genomes: 8939; Functions: 1.1M
COG, GO, Pfam, TIGRfam, InterPro, KEGG, BioCyc, SEED, Protein product, MyIMG, IMG Terms, IMG Pathways, IMG Networks
Groupings: Phylogenetic, Phenotypic, Ecotypic, Disease, Geographical, Isolation
RNAs, Proteins, Sequence Clusters, Positional clusters, Regulatory clusters, Fusions, Operons, Expression

8 IMG Toolkit Chromosome Map Function Profile Gene Synteny Abundance
Profiles Functional Categories Projects IMG Pathway Metadata Search Phylogenetic Genome Clustering Compare Annotations KEGG Maps Distribution Chromosomal Artemis VISTA Recruitment Plot Fragment

9 Challenges and Opportunities
Annotations Quality Metadata Genes Functions Data Analysis New data types and tools Integration # genes and genomes Scaling

10 Metadata Curation
K. Liolios
Metadata types: Organism Information, Genome Project Information, Sequencing Information, Environmental Metadata, Host Metadata, Organism Metadata

11 Metagenome Classification
Genomes vs Metagenomes

12 The negative example or why we need high quality data
If we run the same 3-click query, we find that Burkholderia mallei (the smaller one) still has 548 genes with no match in Burkholderia pseudomallei. Are these genes important for its lifestyle? No, they are not: 89% of them are artifacts of the different gene prediction algorithms used by different sequencing centers. That is why IMG has implemented validation and correction procedures. We have started with JGI genomes and will extend them to all public genomes, so that users see only the real differences between organisms and not the false positives.
Phylogenetic profiler finds 548 unique genes in B. mallei. However, 497 of them in fact exist in B. pseudomallei, but they have not been called as real genes. The difference in gene models reveals an 89.2% error rate in unique genes.
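The check described above amounts to splitting the "unique" genes into those that are absent from the other genome's gene calls but still detectable in its DNA, and those that are really missing. The sketch below uses made-up gene names and assumes the DNA-level evidence has already been collected by some sequence search; the function names and the input format are hypothetical and do not describe the IMG validation pipeline.

```python
from typing import Iterable, List, Set, Tuple


def split_unique_genes(
    unique_genes: Iterable[str], found_in_other_dna: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split profiler-unique genes into (missed gene calls, truly unique genes)."""
    genes = list(unique_genes)
    missed_calls = sorted(g for g in genes if g in found_in_other_dna)
    truly_unique = sorted(g for g in genes if g not in found_in_other_dna)
    return missed_calls, truly_unique


def artifact_rate(unique_genes: Iterable[str], found_in_other_dna: Set[str]) -> float:
    """Fraction of 'unique' genes that are explained by gene-calling differences."""
    genes = list(unique_genes)
    if not genes:
        return 0.0
    missed_calls, _ = split_unique_genes(genes, found_in_other_dna)
    return len(missed_calls) / len(genes)


# Hypothetical toy input: four genes reported as unique by the profiler,
# three of which have a match in the other genome's nucleotide sequence.
reported_unique = ["geneA", "geneB", "geneC", "geneD"]
dna_level_hits = {"geneA", "geneC", "geneD"}

missed, real = split_unique_genes(reported_unique, dna_level_hits)
rate = artifact_rate(reported_unique, dna_level_hits)
```

Only the genes left in the second list would count as real differences between the two organisms.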

13 Gene Prediction - Standards
Re-Annotation workshops: January 19-20, JGI; February 29, Ivanova
Usage of reference genomes: annotation of isolate genomes, single cells and metagenomes; computation of gene cassettes; computation of pangenomes
Problems with current public reference genomes: lack of provenance for the predicted features; presence of artificial (non-biological) variation between the genomes, including variation in gene content and variation within protein families

14 Constant Benchmarking
K. Mavrommatis
Evaluation of annotation quality with constantly changing: sequencing technologies, read lengths, gene calling methods, similarity methods, clustering methods, functional annotation methods

15 Blat & Uclust vs Blast

16 Program Informatics Challenges and Opportunities
Annotations Quality Data Analysis New data types and tools Integration # genes and genomes Scaling

17 Why annotate unassembled reads?
Kansas soil
Total size: 102,722,384 (2x150) reads
Assembled contigs: 1,375,950 contigs
Assembled (reported by the CLC workbench assembler): 38,094,033 reads; 5060 different pfams
Assembled reads mapped (by bwa): 11,778,925 reads
Genes called on unassembled reads: 64,737,444 genes; 7481 different pfams
Similar to genes on contigs: 8,373,641 (12%) genes
Genes with similarity to isolate genomes: 40,778,854 genes
Additional information about functions and phylogeny
Assembled only
More accurate statistics based on unassembled + assembled
Unassembled + assembled + real metagenome
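The argument for annotating unassembled reads is easier to see as proportions. The sketch below only recomputes ratios from the counts on this slide; the dictionary keys are illustrative names, and the choice of denominators (total reads for the read counts, genes called on unassembled reads for the gene counts) is an assumption about how the slide's figures relate to each other.

```python
# Counts copied from the Kansas soil slide; key names are illustrative only.
counts = {
    "total_reads": 102_722_384,
    "assembled_reads": 38_094_033,        # reported by the CLC workbench assembler
    "mapped_reads": 11_778_925,           # assembled reads mapped by bwa
    "genes_on_unassembled": 64_737_444,
    "genes_similar_to_contigs": 8_373_641,
    "genes_similar_to_isolates": 40_778_854,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Assumed denominators: total reads for read counts, unassembled-read genes for gene counts.
summary = {
    "reads assembled (CLC)": percent(counts["assembled_reads"], counts["total_reads"]),
    "reads mapped (bwa)": percent(counts["mapped_reads"], counts["total_reads"]),
    "genes similar to contig genes": percent(
        counts["genes_similar_to_contigs"], counts["genes_on_unassembled"]
    ),
    "genes similar to isolate genomes": percent(
        counts["genes_similar_to_isolates"], counts["genes_on_unassembled"]
    ),
}

for label, value in summary.items():
    print(f"{label}: {value:.1f}%")
```

Under these assumptions, well under half of the reads end up in the assembly, while most genes called on unassembled reads have no counterpart on the contigs.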

18 Annotating unassembled Illumina data
SEPTEMBER 2011: Samples 937; DNA (bps) 84 B; Private Genes 188 M
MAY 2012: Samples 997; DNA (bps) 608 B; Private Genes 6.03 B
Average Illumina metagenome: Sequences 673,374,734; Bases 64,545,005,513; Genes 667,966,495; Genes with COGs 14%; Genes with Pfam 8.8%; Genes with KO 6%

19 Where do we go from here?

20 DELUGE AVALANCHE FLOOD TSUNAMI OPPORTUNITY

21 Program Informatics Challenges and Opportunities
Annotations / Publications Quality Data Analysis New data types and tools Integration # genes and genomes Scaling

22 Challenges and Opportunities
Gene Clustering Metagenome Classification Data Analysis

23 MGM Workshop Attendees
Europe: Belgium, Czech Rep, Denmark, Estonia, Finland, France, Germany, Greece, Ireland, Italy, Hungary, Netherlands 4, Norway, Russia, Portugal, Poland, Spain, Sweden, Switzerland 1, UK 10
Asia: China, Hong Kong, India, Israel, Japan, Korea, Malaysia, Philippines, Saudi Arabia 4, Singapore, Taiwan, Thailand, Turkey
North America (356): Canada, Mexico, USA
South America (21): Argentina, Brazil, Chile, Colombia, Ecuador, Peru, Uruguay
Africa: Algeria, Egypt, Ethiopia
Oceania: Australia, New Zealand 2
Total: 545 / 48 countries, April 20, 2012

24 Unique properties of IMG
Largest metadata integration from GOLD
Largest integration of genes (> 4 billion)
Clustering of all metagenomic genes
IMG QC of all gene predictions from isolate genomes
Metagenome classification scheme
Large array of function & pathway analysis tools

