Metagenomic dataset preprocessing: data reduction
Konstantinos Mavrommatis
KMavrommatis@lbl.gov
Complexity
The total metagenome is the result of a cell community. Cells belong to different organisms, ranging from strains to domains.
  Who is there? (Phylogenetic content)
  What does it do? (Functional content)
  Why is it there? (Comparative study)
[Figure: species complexity (axis ticks 1, 10, 100, 1000, 10000) for example environments: Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil]
Dataset processing
[Figure: workflow diagram. Stages named in the diagram: Sample preparation, High throughput sequencing, Assemble reads, QC, Feature prediction, Functional annotation and comparative analysis, Binning, Analysis]
Dataset processing (v 3.0a)
Submitted files (fasta/fastq): assembled contigs, 454 reads, Illumina reads
  File QC: check character set and contig name
  Remove trailing Ns
  Trimming: Q=20; Q=13 (output: fasta)
  Low complexity: size of 80 bp
  Dereplication: prefix = 5, identity 95%
  Clustering: 100% identity
Output: fasta file for gene calling
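The first two steps above (character-set check and removal of trailing Ns) can be sketched as follows. This is only an illustration, not the pipeline's actual code: the accepted alphabet (IUPAC nucleotide codes) and the name rule (non-empty, no whitespace) are assumptions, and all function names are hypothetical.

```python
import re

# Assumption: the accepted character set is the IUPAC nucleotide alphabet.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def has_valid_characters(sequence: str) -> bool:
    """Return True if every character is an IUPAC nucleotide code."""
    return set(sequence.upper()) <= IUPAC_NUCLEOTIDES


def has_valid_name(name: str) -> bool:
    """Hypothetical name rule: non-empty and free of whitespace."""
    return bool(name) and not any(ch.isspace() for ch in name)


def strip_trailing_ns(sequence: str) -> str:
    """Remove the run of N/n characters at the end of the sequence, if any."""
    return re.sub(r"[Nn]+$", "", sequence)
```

Internal Ns are left untouched; only the terminal run is removed, which matches the wording "Remove trailing Ns".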
Dataset processing: Feature prediction pipeline (v 3.0a)
Input: fasta file for gene calling
  CRISPR detection: crt / pilercr
  RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam)
  CDS detection: isolates: prodigal; metagenomes: varies
  Unassembled reads + assembled contigs
  Conflict resolution
  Concatenation of all results and creation of the final output file
Output: file for IMG
Dataset processing: Quality trimming
Remove low-quality sequence from the ends of the reads.
  454 datasets: lucy
  Illumina: keep the longest high-quality string
Figure courtesy of Alex Copeland. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
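The "longest high quality string" idea for Illumina reads can be sketched as below: keep the longest contiguous stretch of bases whose quality is at or above a threshold. This is a minimal sketch, not the production trimmer; the Phred+33 quality encoding and the function names are assumptions, and the threshold is passed in by the caller.

```python
def longest_high_quality_span(scores, min_q):
    """Return (start, end) of the longest run with score >= min_q (end exclusive).

    If several runs tie, the first one is returned; if no base passes,
    the result is the empty span (0, 0).
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(scores):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(sequence, quality_string, min_q):
    """Trim a read to its longest high-quality span.

    Assumption: quality_string uses the Phred+33 encoding.
    """
    scores = [ord(ch) - 33 for ch in quality_string]
    start, end = longest_high_quality_span(scores, min_q)
    return sequence[start:end], quality_string[start:end]
```

A read with no base at or above the threshold comes back empty, so a later length filter can discard it.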
Dataset processing: Low complexity filter
Examples of low-complexity sequence:
  tatatatatatatatatat
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Low-complexity regions are identified using dust (NCBI).
Remove sequences with fewer than 80 informative bases.
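The 80-base rule can be illustrated on sequences that have already been masked. The sketch below assumes the masker marks low-complexity bases in lowercase (soft masking) or as N, and that an "informative" base is an unmasked A, C, G, or T; both the definition and the function names are assumptions, not the pipeline's code.

```python
def informative_bases(masked_sequence: str) -> int:
    """Count unmasked bases: uppercase A, C, G, or T.

    Assumption: masked (low-complexity) bases are lowercase or N.
    """
    return sum(1 for base in masked_sequence if base in "ACGT")


def passes_low_complexity_filter(masked_sequence: str, min_informative: int = 80) -> bool:
    """Keep a sequence only if it has at least min_informative unmasked bases."""
    return informative_bases(masked_sequence) >= min_informative
```

For example, a 100 bp read in which 60 bp are masked as a poly-A run has only 40 informative bases and would be removed.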
Dataset processing Dereplication
Dataset processing: Sequence dereplication
Using uclust: 95% identity (global alignment), identical prefix (5 nt).
[Figure: example alignments]
  atcccat / atc-cat / atcccat / atcccat / atcccat
  gctacat / gctncat / gctacat / gctacat (one case labeled "Not dereplicated")
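The two dereplication criteria (identical 5 nt prefix, 95% identity) can be sketched with a greedy pass over the reads. This is not uclust and does not reproduce its algorithm: identity is approximated here as 1 minus the edit distance divided by the longer sequence length, which is an assumption made to keep the example short, and all names are hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def approximate_identity(a: str, b: str) -> float:
    """Approximate global identity as 1 - edit_distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def dereplicate(sequences, prefix_length=5, min_identity=0.95):
    """Keep the first sequence of each group of near-identical sequences.

    A sequence is dropped only if it shares its first prefix_length bases
    with an already kept sequence and reaches min_identity against it.
    """
    representatives = defaultdict(list)  # prefix -> kept sequences
    kept = []
    for seq in sequences:
        prefix = seq[:prefix_length]
        if any(approximate_identity(seq, rep) >= min_identity
               for rep in representatives[prefix]):
            continue
        representatives[prefix].append(seq)
        kept.append(seq)
    return kept
```

Bucketing by prefix means a read is only compared against kept reads that start with the same 5 bases, which keeps the number of pairwise alignments small.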
Dataset processing: Evaluation of processing tools
Unassembled sequences, because of their small size, quality problems, and large number, need to be processed with efficient pipelines.
Simulated datasets:
  Sequences extracted from finished genomes (perfect sequences)
  Reads that were used to assemble finished genomes (real errors)
Evaluation and development of new tools/wrappers.
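The "perfect sequences" kind of simulated dataset can be illustrated by cutting error-free fragments out of a finished genome at random positions. The sketch below is an assumption about how such a dataset could be built, not the procedure used here: it samples a single strand, uses a fixed read length, and the function name is hypothetical.

```python
import random


def sample_perfect_reads(genome: str, read_length: int, count: int, seed: int = 0):
    """Return a list of (start, read) pairs cut from the genome without errors.

    Positions are drawn uniformly; the seed makes the sample reproducible.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)
    reads = []
    for _ in range(count):
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append((start, genome[start:start + read_length]))
    return reads
```

Because each read's origin is recorded, predictions made on the reads can later be compared against the annotation of the finished genome.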
Dataset processing: Feature prediction
Available methods:
  Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal
  Similarity based: Blastx, USEARCH
[Figure: isolate vs. metagenome gene predictions, with categories CORRECT, MISSED, WRONG, NEW]
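The four categories in the figure can be counted by comparing predicted genes with a reference annotation. The slide does not define the categories, so the definitions below are assumptions: CORRECT means same strand and same 3' end as an overlapping reference gene, WRONG means overlap without such a match, NEW means no overlap with any reference gene, and MISSED means a reference gene with no overlapping prediction. The `Gene` type and function names are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def three_prime_end(gene: Gene) -> int:
    """Coordinate of the 3' end (stop side) of the gene."""
    return gene.end if gene.strand == "+" else gene.start


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two coordinate ranges share at least one base."""
    return a.start <= b.end and b.start <= a.end


def classify(reference, predicted):
    """Count predictions per category under the assumed definitions."""
    counts = {"CORRECT": 0, "WRONG": 0, "NEW": 0, "MISSED": 0}
    for pred in predicted:
        hits = [ref for ref in reference if overlaps(pred, ref)]
        if not hits:
            counts["NEW"] += 1
        elif any(ref.strand == pred.strand
                 and three_prime_end(ref) == three_prime_end(pred)
                 for ref in hits):
            counts["CORRECT"] += 1
        else:
            counts["WRONG"] += 1
    for ref in reference:
        if not any(overlaps(ref, pred) for pred in predicted):
            counts["MISSED"] += 1
    return counts
```

Matching on the 3' end rather than on both ends tolerates differences in start-site choice, which is a common but not universal convention.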
Method performance
[Figures: quality effect and trimming. Datasets shown: 454 Ti (no errors), 454 Ti (with errors), Illumina 115 bp, Illumina 74 bp. Contigs: frameshift, wrong prediction]
Why annotate unassembled reads?
Sample
  Total size: 102,722,384 (2x150) reads
  Assembled contigs: 1,375,950 contigs; 5060 different pfams
  Assembled reads, mapped (by bwa): 11,778,925 reads
  Genes called on unassembled reads: 64,737,444 genes; 7481 different pfams
  Genes similar to genes on contigs: 8,373,641 (12%) genes
  Genes with similarity to isolate genomes: 40,778,854 genes
Additional information about functions and phylogeny
More accurate statistics based on unassembled + assembled
[Figure labels: Assembled only; Unassembled + assembled; Unassembled + assembled + real metagenome]
Processing time (metagenomes)
  Total submissions: 336
  Processing time: 2.45 days (annotation), 24 days (integration)
  Data size (bp): 174,719,855 (average), 58,006,992,092 (total)
Processing time (isolates)
  Total submissions: 3630
  Processing time: 10 hours (annotation), 12 days (integration)
  Data size (bp): 1,658,242 (average), 4,114,099,773 (total)
Thank you for your attention