Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.

Slides:



Advertisements
Similar presentations
1 Orthologs: Two genes, each from a different species, that descended from a single common ancestral gene Paralogs: Two or more genes, often thought of.
Advertisements

Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Tucson High School Biotechnology Course Spring 2010.
Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.
Phylogenetic Trees Understand the history and diversity of life. Systematics. –Study of biological diversity in evolutionary context. –Phylogeny is evolutionary.
Summer Bioinformatics Workshop 2008 Comparative Genomics and Phylogenetics Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Assembly.
FOG: High-Resolution Fungal Orthologous Groups René van der Heijden Project 5.10: Comparative genomics for the prediction of protein function and pathways.
Zebra Finch Seg Dup Analysis 1.Genome 2.Parameters for Pipeline 3.Analysis.
Finding Orthologous Groups René van der Heijden. What is this lecture about? What is ‘orthology’? Why do we study gene-ancestry/gene-trees (phylogenies)?
Phylogeny - based on whole genome data
Genome sequencing and assembling
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Metagenomics Binning and Machine Learning
De-novo Assembly Day 4.
Mouse Genome Sequencing
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority Computational Metagenomics: Algorithms for Understanding.
Probes can be designed in an evolutionary hierarchy.
Gene Regulatory Network Inference. Progress in Disease Treatment  Personalized medicine is becoming more prevalent for several kinds of cancer treatment.
Fig Chapter 12: Genomics. Genomics: the study of whole-genome structure, organization, and function Structural genomics: the physical genome; whole.
20.1 Structural Genomics Determines the DNA Sequences of Entire Genomes The ultimate goal of genomic research: determining the ordered nucleotide sequences.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.
AMOS tools for assembly validation Automatically scan an assembly to locate misassembly signatures for further analysis and correction  Load Assembly.
The Changing Face of Sequencing
Advancing Science with DNA Sequence Metagenome definitions: a refresher course Natalia Ivanova MGM Workshop September 12, 2012.
The iPlant Collaborative
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Current Challenges in Metagenomics: an Overview Chandan Pal 17 th December, GoBiG Meeting.
BNFO 615 Usman Roshan. Short read alignment Input: – Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji *, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
University of Connecticut School of Engineering Assembler Reference Abyss Simpson et al., J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones,
__________________________________________________________________________________________________ Fall 2015GCBA 815 __________________________________________________________________________________________________.
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Environmental Genome Shotgun Sequencing of the Sargasso Sea Venter et. al (2004) Presented by Ken Vittayarukskul Steven S. White.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Animal phylogeny The animal kingdom comprises 34 large groups, or phyla - species within a phylum share a basic body plan, or set of characters that allows.
Gene3D, Orthology and Homology-Based Inheritance of Protein-Protein Interactions Corin Yeats
De Novo Assembly of Mitochondrial Genomes from Low Coverage Whole-Genome Sequencing Reads Fahad Alqahtani and Ion Mandoiu University of Connecticut Computer.
Introduction to Bioinformatics Resources for DNA Barcoding
Preprocessing Data Rob Schmieder.
Quality Control & Preprocessing of Metagenomic Data
Phylogeny - based on whole genome data
Metafast High-throughput tool for metagenome comparison
Metagenomic assembly Cedric Notredame
Research in Computational Molecular Biology , Vol (2008)
Very important to know the difference between the trees!
A Hybrid Algorithm for Multiple DNA Sequence Alignment
Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome.
Human Gut Microbiome: Function Matters
Mattew Mazowita, Lani Haque, and David Sankoff
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
CSCI 1810 Computational Molecular Biology 2018
Introduction to Sequencing
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Schematic representation of a transcriptomic evaluation approach.
Genome resolved metagenomics
Comparison of species and function profiles with ultradeep sequencing data. Comparison of species and function profiles with ultradeep sequencing data.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley

State of metagenomics In July 2005, 9 projects had been completed. General challenges were becoming apparent Paper focuses on computational problems

Assembling communities Goal ◦ Retrieval of nearly complete genomes from the environment Challenges ◦ Need sufficient read depth- species must be prominent ◦ Avoid mis-assembling across species while maximizing contig size

Comparative assembly Align all reads to a closely-related “reference” genome Infer contigs from read alignments Rearrangements limit effectiveness Pop M. et al. Comparative genome assembly. Briefings in Bioinformatics 2004.

“Assisted” Assembly De novo assembly Complement by aligning reads to reference genome(s) Short overlaps can be trusted Single mate links can be trusted Mis-assemblies can be detected Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.

Assisted Assembly Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.

Assisted Assembly Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.

Assisted Assembly Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.

Metagenomics application Pros: ◦ Low coverage species ◦ If conservative, unlikely to hurt Cons ◦ Exotic microbes may have no good references ◦ Potential to propagate mis-assemblies

Overlap-layout-consensus Species-level ◦ Increased polymorphism ◦ Reads come from different individuals ◦ Missed overlaps System-level ◦ Homologous sequence ◦ False overlaps

Polymorphic diploid eukaryotes Reads sequenced from 2 chromosomes Single reference sequence expected Keep duplications separate Keep polymorphic haplotypes together

Strategy 1 Form contigs aggressively Detect alignments between contigs and resolve Avoid merging duplications by respecting mate pair distances Jones, T. et al. The diploid genome sequence of Candida albicans. PNAS 2004.

Strategy 2 Assemble chromosomes separately Erase overlaps with splitting rule Vinson et al. Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi. Genome Research 2005.

Back to metagenomics Strategy 1 ◦ Assemble aggressively ◦ Detect mis-assemblies and fix Strategy 2 ◦ Separate reads or filter overlaps

Binning Presence of informative genes ◦ E.g. 16S rRNA Machine learning ◦ K-mers ◦ Codon bias Worked well only with big scaffolds Lots of progress in this area since 2005

Abundances Depth of read coverage suggests relative abundance of species in sample Difficult if polymorphism is significant ◦ Separate individuals  too low ◦ Merge species  too high ◦ Depends on good classification

How much sequencing G = genome size (or sum of genomes) c = global coverage k = local coverage n k = bp w/ coverage k

Poisson model “Interval” =[x – l r, x] “Events” = read starts “ λ ” = coverage xx-l r

Gene Finding Focus on genes, rather than genomes Bacterial gene finders are very accurate Assemble and run on scaffolds BLAST leftover reads against protein db

Partial genes Tested GLIMMER on simulated 10 Kb contigs Many genes crossed borders ◦ GLIMMER often predicted a truncated version Gene finding models could be adjusted to account for this case

Gene-centric analysis Cluster genes by orthology ◦ Orthology refers to genes in different species that derive from a common ancestor Express sample as vector of abundances

UPGMA on KEGG vectors

PCA on KEGG vectors Principal components may correspond to interesting pathways or functions

How much sequencing N = # genes in community f = fraction found Coupon collector’s problem

Phylogeny Apply multiple sequence alignment and phylogeny reconstruction to gene sequences

Partial sequences Bad for common msa programs Semi-global alignment is required

Supertree methods Construct tree from multiple subtrees Split gene into segments? Construct subtree on sequences that align fully to segment?

Thanks!