Genomics of Microbial Eukaryotes. Igor Grigoriev, Fungal Genomics Program Head, US DOE Joint Genome Institute, Walnut Creek, CA
2 Large and Complex Eukaryotes
3 Outline Eukaryotic Genome Annotation Fungal Genomics Program MycoCosm
4 Started with Human Genome Project
5 genome.jgi.doe.gov IMG MycoCosm 150+ annotated eukaryotic genomes
6 Annotation Pipeline
  Genomic assembly and ESTs
  Annotation: Gene predictions; Protein annotations; Reference data mapping; Repeat masking; Manual curation (optional)
  Analysis: Gene families; Gene expression; Phylogenomics; Proteomics; Protein targeting; etc.
  Validations
7 Eukaryotic Gene Prediction
  Ab initio methods use knowledge of known genes’ structures to predict start, stop, and splice sites in CDS only (Fgenesh, GeneMark). Train on known genes.
  Protein-based methods build CDS exons around known protein alignments (Fgenesh+, GeneWise). Evidence: GenBank protein.
  Transcript-based methods map or assemble transcripts on the genome, including UTRs (EST_map, CombEST). Evidence: EST contig.
  Gene model: Promoter, 5’UTR, ATG, exons and introns (GT ... AG), TGA, 3’UTR, PolyA
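The signals named on this slide (ATG start, stop codon, GT ... AG intron boundaries) can be checked on a candidate gene model with a few lines of Python. This is a minimal sketch under stated assumptions, not how Fgenesh or GeneMark work internally: the toy sequence, the exon coordinates, and the function name `check_gene_model` are all hypothetical, and only the forward strand is handled.

```python
# Toy check of the sequence signals a gene model is expected to carry.
# Hypothetical example: sequence, coordinates, and names are illustrative only.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_gene_model(genome, exons):
    """exons: sorted list of (start, end), 0-based half-open, forward strand."""
    # Join the exons to obtain the coding sequence (CDS).
    cds = "".join(genome[s:e] for s, e in exons)
    # Introns are the gaps between consecutive exons.
    introns = [genome[e1:s2] for (_, e1), (s2, _) in zip(exons, exons[1:])]
    # Codons before the final one must not be stop codons.
    internal = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    return {
        "cds": cds,
        "starts_with_atg": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "in_frame": len(cds) % 3 == 0,
        "no_internal_stop": all(c not in STOP_CODONS for c in internal),
        "canonical_introns": all(
            i.startswith("GT") and i.endswith("AG") for i in introns
        ),
    }

# Flank + exon 1 + intron (GT...AG) + exon 2 + flank.
genome = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA" + "TT"
result = check_gene_model(genome, [(2, 8), (20, 26)])
for key, value in result.items():
    print(key, value)
```

A real predictor scores these signals probabilistically after training on known genes; the hard pass/fail checks here only illustrate which features are involved.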
8 More Gene Prediction
  ESTEXT: use ESTs/cDNAs to extend, correct, or predict gene models (predicted model + ESTs -> extended model with 5’UTR and 3’UTR)
  FGENESH2: detect orthologs with poor alignments and refine with synteny-based methods (Genome A, Genome B)
  Representative set: a non-redundant gene set is built from “the best” models at each locus (FGENESH, GENEWISE, external models) according to homology and ESTs, followed by manual curation
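Choosing "the best" model per locus can be sketched as a simple scoring pass. The slide does not give the actual selection criteria, so the following is an assumption-heavy illustration: the `Model` fields, the equal weights, and the idea that models are already grouped by locus are all hypothetical.

```python
# Sketch: pick one representative model per locus from several predictors.
# The scoring formula and weights are assumptions, not the JGI criteria.
from collections import namedtuple

Model = namedtuple("Model", "name locus predictor homology_cov est_cov")

def pick_representatives(models, w_homology=0.5, w_est=0.5):
    """Return {locus: best Model}, scoring by weighted homology and EST coverage."""
    best = {}
    for m in models:
        score = w_homology * m.homology_cov + w_est * m.est_cov
        # Keep the model only if it beats the current best at this locus.
        if m.locus not in best or score > best[m.locus][0]:
            best[m.locus] = (score, m)
    return {locus: m for locus, (_, m) in best.items()}

models = [
    Model("m1", "locus1", "FGENESH", 0.60, 0.90),
    Model("m2", "locus1", "GENEWISE", 0.95, 0.40),
    Model("m3", "locus2", "FGENESH", 0.10, 0.20),
    Model("m4", "locus2", "EXTERNAL", 0.80, 0.70),
]
representatives = pick_representatives(models)
for locus, m in sorted(representatives.items()):
    print(locus, m.name, m.predictor)
```

In practice the loci would first have to be defined, for example by grouping overlapping models on the same strand, and the automatic choice would still be subject to manual curation as the slide notes.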
9 Combine Gene Predictors for Better Quality (Heterobasidion annosum v1.0)
  Metric                                  Eugene   Genemark   Fgenesh   JGI Pipe
  Number of gene models                   11,547   9,609      8,409     12,270
  Models with partial EST support
    with full length EST support
  EST coverage per gene                   77.7%    68.2%      80.8%     79.1%
  Supported splice sites                  41,581   40,808     45,498    47,671
  Models with homology support
    with strong homology support (80+% ide, 80+% cov.)
  Model coverage                          64%      60%        68%       69%
  Models with homology and EST support
10 Re-annotation Using Comparative Genomics (Asaf Salamov)
  Metric                             MAKER    JGI pipeline   Re-annot
  # of predicted gene models         9,940    12,290         12,802
  with Swissprot hits                6,521    7,356          7,900
  with non-repeat PFAM domains       5,365    6,010          6,353
  with EST support                   9,252    10,796         11,105
  with >90% EST support              7,729    9,178          9,444
  # of unique PFAM domains           2,207    2,245          2,322
  # EST-supported splice sites       99,627   102,200        104,246
  EST coverage per gene: 93.0%, 93.3%
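Absolute counts and proportions can tell different stories when gene sets differ in size. As a small worked example, the fraction of models with EST support can be derived from the counts on this slide; the dictionary layout and the function name `est_support_fraction` are illustrative choices, and the numbers are copied from the slide.

```python
# Fraction of predicted gene models with EST support, per annotation.
# Counts are taken from the slide; the data layout is illustrative.
counts = {
    "MAKER": {"models": 9940, "est_supported": 9252},
    "JGI pipeline": {"models": 12290, "est_supported": 10796},
    "Re-annot": {"models": 12802, "est_supported": 11105},
}

def est_support_fraction(row):
    """Share of gene models that have any EST support."""
    return row["est_supported"] / row["models"]

for name, row in counts.items():
    print(f"{name}: {row['est_supported']} of {row['models']} "
          f"({est_support_fraction(row):.1%})")
```

On these counts the re-annotation has the most EST-supported models in absolute terms, while MAKER has the highest supported fraction of a smaller gene set, so both views are worth reporting side by side.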
11 Protein Annotation
  Starting from each predicted protein:
  Signal peptide (SignalP)
  Domain (InterPro, TMHMM)
  Possible paralogs (Blastp+MCL)
  Possible orthologs (in nr, SwissProt, KEGG, KOG)
  Higher order assignments: Gene Ontology terms; EC numbers --> KEGG pathways; Gene families, with and without other species
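Grouping proteins into candidate paralog families from pairwise Blastp hits can be illustrated with a deliberately simpler technique than MCL. The sketch below uses single-linkage clustering (connected components via union-find), which is a stand-in and not the Markov clustering named on the slide; the protein IDs, identity values, and the 30% threshold are hypothetical.

```python
# Sketch: group proteins into families from pairwise similarity hits.
# Uses connected components (union-find) as a simple stand-in for MCL.
# IDs, identities, and the threshold are made up for illustration.
def cluster_by_hits(proteins, hits, min_identity=30.0):
    """hits: iterable of (protein_a, protein_b, percent_identity)."""
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity in hits:
        if identity >= min_identity and a != b:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())

proteins = ["p1", "p2", "p3", "p4", "p5"]
hits = [("p1", "p2", 85.0), ("p2", "p3", 40.0), ("p4", "p5", 20.0)]
families = cluster_by_hits(proteins, hits)
print(families)
```

Single linkage tends to merge families through a single weak or multi-domain hit; MCL is commonly preferred because it uses the whole similarity graph to split such chains.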
12 Validation with Transcriptomics
  Transformation of EST sequencing: Sanger, 454, Illumina
  Old Sanger days: ESTs, models, EST profile
  Processing RNA-Seq with CombEST
13 Validation with Proteomics Wright et al, BMC Genomics (2009)
14 Gene Cluster Analysis Comparative analysis
15 Genome Portal Framework
16 Many Genes of Eco-responsive Daphnia pulex. First crustacean and aquatic animal sequenced; new model organism. 30,940 predicted D. pulex genes in a ~200 Mb genome; 85% supported by 1+ lines of evidence. Colbourne et al, Science, 2011
17 Half of Daphnia Genes: No Homologs, Expressed Under Environmental Stress. With Evgeny Zdobnov’s group (Univ. Genève). * Of 716 highly conserved single-copy orthologs, Daphnia is missing only two. Colbourne et al, 2011
18 Outline Eukaryotic Genome Annotation Fungal Genomics Program MycoCosm
19 Fungal Genomics for Energy & Environment
  Grow: plant symbionts and pathogens
  Degrade: lignocellulose degradation
  Ferment: sugar fermentation
  Bio-refinery
  GOAL: Scale up sequencing and analysis of fungal diversity for DOE science and applications
20 GOLD (October 2011) 758 fungal projects
21 Genomic Encyclopedia of Fungi
  Chapter 1: Plant health (Symbiosis; Plant pathogenicity; Biocontrol)
  Chapter 2: Biorefinery (Lignocellulose degradation; Sugar fermentation; Industrial organisms)
  Chapter 3: Diversity (Phylogenetics; Ecology)
22 Genome-Centric View Comparative View fungal genomes visitors/month
23 Comparative Genome Analysis
24 Strategy: 1000 Fungal Genomes Goal: Sequencing 1000 fungal genomes from across the Fungal Tree of Life will provide references for research on plant-microbe interactions and environmental metagenomics.
25 Strategy: Fungal Systems
  Model fungi: S. commune, T. terrestris
  Simple systems: Lichen (alga + fungus); ECM (plant + fungus)
  Complex environments: Forest soil metagenomes
26 Model Mushroom Development (Ohm et al, 2010). SEQUENCE, FUNCTION, MODEL: WT S. commune; gene knock-outs; modeling regulatory cascades
27 Summary
  Eukaryotic Annotation Recipe: combine gene predictors, experimental data, and community expertise
  Fungal Genomics: we aim to scale up sequencing and comparative analysis of fungi relevant to energy and environment (jgi.doe.gov/fungi)
28 Enjoy Algae as well!
29 Acknowledgements JGI Staff Our Users