Advancing Science with DNA Sequence Metagenome definitions: a refresher course Natalia Ivanova MGM Workshop September 12, 2012.

Presentation transcript:

1 Metagenome definitions: a refresher course. Natalia Ivanova, MGM Workshop, September 12, 2012

2 Metagenome definitions
A metagenome is the collective genome of a microbial community, AKA microbiome (native, enriched, sorted, etc.).
A metagenomic library (or libraries) is constructed from isolated DNA (native, enriched, etc.).
A metagenomic library can be single-end (AKA standard) or paired-end.

3 Metagenome definitions
A single-end (standard) metagenomic library will produce contigs upon assembly (i.e. longer sequences based on overlap between reads).
Any Ns found in contigs correspond to low-quality bases.
A paired-end metagenomic library will produce scaffolds upon assembly (non-contiguous joining of reads based on read-pair information).
Ns found in scaffolds correspond either to low-quality bases or to gaps of unknown size.
[Figure: a contiguous double-stranded sequence (ATGCAAAGGCCGCATCCAGCAGGTT / TACGTTTCCGGCGTAGGTCGTCCAA) shown next to the same sequence split into two pieces (ATGCAAAGGCCGCATCC / TACGTTTCCGGCGTAGG and AGCAGGTT / TCGTCCAA) joined by NNNNNN.]
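The relationship between scaffolds and contigs described on this slide can be sketched in a few lines of Python. This is only an illustration, not part of any assembler or of IMG: the function name `split_scaffold` and the `min_gap` parameter are hypothetical. Because Ns in a scaffold may be either low-quality bases or gaps, the sketch lets the caller decide how long a run of Ns must be before it is treated as a gap.

```python
import re

def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold into (offset, contig) pieces at runs of >= min_gap Ns.

    min_gap is a hypothetical heuristic: shorter N runs are left inside the
    contig, on the assumption that they are low-quality bases, not gaps.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = []
    start = 0
    for match in gap.finditer(scaffold):
        if match.start() > start:
            contigs.append((start, scaffold[start:match.start()]))
        start = match.end()
    if start < len(scaffold):
        contigs.append((start, scaffold[start:]))
    return contigs

# The scaffold drawn on the slide: two pieces joined by six Ns.
scaffold = "ATGCAAAGGCCGCATCC" + "NNNNNN" + "AGCAGGTT"
for offset, contig in split_scaffold(scaffold):
    print(offset, contig)
```

Note that the number of Ns says nothing reliable about the true gap size, since the slide states that gaps are of unknown size.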

4 Amplified and Unamplified Libraries
[Figure: workflow diagrams for the amplified and unamplified library protocols. Steps named in the figure include: Fragmentation (1 µg), Double SPRI, DNA Chip, End repair / Phosphorylation, A-tailing with Klenow exo-, Heat Inactivation, Adaptor Ligation, SPRI Clean, PCR 10-cycle Amplification, DNA Chip, qPCR Quantification.]

5 Metagenome definitions (contd): overlap = alignment of reads at x% identity
Unless the community has very low complexity (i.e. dominated by one or a few clonal populations), assembly at 100% nucleotide identity will be very fragmented.
What to do with k-mer based assemblies? Use multiple k-mer settings, then combine the assemblies with an overlap-layout-consensus assembler like minimus2, using a minimum identity of 95%.
There is a tradeoff between overlap length and % identity.
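To make the tradeoff between overlap length and % identity concrete, here is a minimal Python sketch. It is not how minimus2 or any real assembler computes overlaps: it only considers ungapped suffix-prefix overlaps, and the names `overlap_identity`, `best_overlap`, `min_length` and `min_identity` are hypothetical.

```python
def overlap_identity(left, right, length):
    """Percent identity between the last `length` bases of `left`
    and the first `length` bases of `right` (ungapped)."""
    a = left[-length:]
    b = right[:length]
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / length

def best_overlap(left, right, min_length=4, min_identity=95.0):
    """Longest ungapped suffix-prefix overlap meeting both thresholds.

    Returns (length, identity) or None if no overlap qualifies.
    """
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        identity = overlap_identity(left, right, length)
        if identity >= min_identity:
            return length, identity
    return None

left = "AAAACGTACGT"
exact = "CGTACGTTTT"    # overlaps `left` perfectly over 7 bases
variant = "CGTTCGTTTT"  # same overlap with one mismatch

print(best_overlap(left, exact))                        # accepted at 95%
print(best_overlap(left, variant))                      # rejected at 95%
print(best_overlap(left, variant, min_identity=85.0))   # accepted at 85%
```

With the strict threshold the two sequences that differ by a single base are not joined; relaxing the threshold joins them, which is the behavior that merges near-identical sequences from one population into a single contig.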

6 Reasoning behind combining multiple assemblies

7 A snapshot of older (454-Illumina) metagenome assembly pipeline
Assembly Pipeline v.0.9
Trimming does not appear to be ideal for this process.
Picking the best k-mer: manual process, CPU time intensive, no known metagenomic k-mer prediction algorithm.

8 Metagenome definitions (contd): overlap = alignment of reads at x% identity
Assembly of sequences at less than 100% identity => population contigs and scaffolds representing a consensus sequence of a species population.
[Figure labels: isolate contig; species population contigs]
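The idea of a population contig as a consensus sequence can be illustrated with a simple majority vote over reads that are already aligned. This is a sketch under strong assumptions, not the method of any particular assembler: the reads are assumed to be pre-aligned and padded to equal length with '-', and the function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus of equal-length, pre-aligned reads.

    '-' marks positions a read does not cover; a column with no
    covering read is reported as 'N'.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    out = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        out.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(out)

# Three reads from slightly different strains of one population;
# the second read carries a variant (T) at the fifth position.
reads = [
    "ATGCAAAG--",
    "ATGCTAAGGC",
    "--GCAAAGGC",
]
print(consensus(reads))
```

The resulting sequence need not match any single strain exactly, which is why the slide distinguishes an isolate contig from a species population contig.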

9 Two more important definitions
1. Sequence coverage (AKA read depth)
How many times each base has been sequenced => needs to be considered when calculating protein family abundance
Per-contig average coverage
Per-base coverage => per-gene coverage
2. Bins
Scaffolds, contigs and unassembled reads can be binned into sets of sequences (bins) that likely originated from the same species population or from a population of a broader taxonomic lineage.
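The coverage definitions above can be made concrete with a small Python sketch. It assumes read alignments are given as 0-based, half-open (start, end) intervals on a contig; the function names and the family labels "famA" and "famB" are hypothetical, and the abundance estimate (summing per-gene mean coverage per family) is one simple way to weight counts by coverage, not necessarily what IMG/M does.

```python
def per_base_coverage(contig_length, alignments):
    """Depth at each base, given (start, end) half-open read alignments."""
    depth = [0] * contig_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, contig_length)):
            depth[pos] += 1
    return depth

def mean_coverage(depth, start=0, end=None):
    """Average depth over [start, end); whole contig by default."""
    end = len(depth) if end is None else end
    window = depth[start:end]
    return sum(window) / len(window) if window else 0.0

def family_abundance(genes, depth):
    """Sum per-gene mean coverage for each protein family.

    genes: iterable of (family, start, end) on the same contig.
    """
    totals = {}
    for family, start, end in genes:
        totals[family] = totals.get(family, 0.0) + mean_coverage(depth, start, end)
    return totals

# A 10-base contig covered by three reads.
depth = per_base_coverage(10, [(0, 6), (2, 8), (4, 10)])
print(depth)                      # per-base coverage
print(mean_coverage(depth))       # per-contig average coverage
print(family_abundance([("famA", 0, 4), ("famB", 4, 8)], depth))
```

Counting each gene once would give both families the same abundance; weighting by coverage shows that the gene in the more deeply sequenced region represents more copies in the community.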

10 What IMG does and doesn't do
Scaffolds and contigs are generated by assembly (not provided in IMG/M).
Sequence coverage can be computed by the assembler from the alignments it generates (preferable), or it can be added later by aligning reads to contigs (the latter can be provided in IMG/M).
Bins are generated by binning software (not provided in IMG/M).
Scaffolds, contigs and unassembled reads are annotated with non-coding RNAs, repeats (CRISPRs), and protein-coding genes (CDSs); the latter are assigned to protein families (COGs, Pfams, TIGRfams, KEGG Orthology, EC numbers, internal clusters) (provided in IMG/M).

11 What's the difference between IMG and MG-RAST, IMG and CAMERA?
We prefer to assemble the data:
  longer sequences -> better quality of gene prediction and functional annotation
  longer sequences -> chromosomal context and binning -> population-level analysis
But we don't provide assembly services, except for metagenomes sequenced at the JGI:
  we may be able to help with assembly of 454 data
  we're not equipped to assemble massive amounts of Illumina data
  http://galaxy.jgi-psf.org
  Contact person: Ed Kirton, ESKirton@lbl.gov
IMG does not provide tools for analysis of 16S data from the metagenome itself:
  we do assembly -> assembled 16S sequences are generally not very reliable
  BLASTn of reads matching conserved regions is misleading
  we do pyrotags or i-tags for every metagenome sequenced at the JGI
  http://pyrotagger.jgi-psf.org

