1
The bioinformatics behind shotgun metagenomic sequencing
Roche UG 2012
Rob Edwards
2
Outline
Metagenomics
Annotation
Virus:host prediction
3
How Much Has Been Sequenced?
[Figure: number of known sequences by year, with milestones marked: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]
4
How Much Will Be Sequenced?
Everybody in USA
Everybody in the Indy 500 infield
One genome from every species
100 people
Most major microbial environments
All cultured Bacteria
Training people for the future
X-Prize competition: sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome
5
Metagenomics Sequencing the World
6
Shark Metagenomes
7
Shark Metagenomes
8
Metagenomics Analysis Steps
1. Clean the data (prinseq)
   Check the quality of the data
   Remove bad reads
2. Annotate the data (rtmg)
   What is there?
   Who is there?
3. Analyze the data
9
PRINSEQ
Clean, dereplicate, check quality, analyze
[Figures: PRINSEQ; PRINSEQ results]
Rob
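The cleaning and dereplication steps named on this slide can be illustrated with a minimal sketch. This is not PRINSEQ's implementation. The function names, the Phred+33 quality encoding, and every threshold below are assumptions chosen only for illustration.

```python
from statistics import mean


def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return mean(ord(c) - offset for c in qual)


def clean_reads(reads, min_length=50, min_mean_quality=20, max_n_fraction=0.01):
    """Drop short, low-quality, and N-rich reads, then remove exact duplicates.

    `reads` is an iterable of (identifier, sequence, quality) tuples.
    All thresholds are illustrative assumptions, not PRINSEQ defaults.
    """
    seen = set()   # sequences already kept, used for dereplication
    kept = []
    for read_id, seq, qual in reads:
        seq = seq.upper()
        if not seq or len(seq) < min_length:
            continue                      # too short
        if mean_quality(qual) < min_mean_quality:
            continue                      # low average base quality
        if seq.count("N") / len(seq) > max_n_fraction:
            continue                      # too many ambiguous bases
        if seq in seen:
            continue                      # exact duplicate of an earlier read
        seen.add(seq)
        kept.append((read_id, seq, qual))
    return kept


if __name__ == "__main__":
    example = [
        ("r1", "ACGT" * 20, "I" * 80),
        ("r2", "ACGT" * 20, "I" * 80),   # exact duplicate of r1
        ("r3", "ACGT" * 5, "I" * 20),    # shorter than min_length
    ]
    for read_id, seq, _ in clean_reads(example):
        print(read_id, len(seq))
```

Only exact duplicates are removed here. A real tool would typically offer further filters and duplicate definitions.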
10
Data Processed May 2012
Datasets processed: 6,223
Sequences processed: 8,698,402,976
Bases processed: ,472,118,033
11
Length Matters
12
Preprocessing Data
13
Annotating Metagenomes
Identify the functions and organisms present in the sample
BLAST is very slow
Immediate processing of data
RTMG
14
A Better Way To Annotate Metagenomes
Real Time Metagenomics (RTMg)
15
Shark Microbe Metabolism
Microbes on sharks struggle for iron
16
Host – Virus Interactions
Can we predict the host a virus infects just from the sequence?
Bas, Michiyo
17
How To Predict Hosts
Count all 2-, 3-, 4-, 5-, 6-, and 7-bp sequences:
AA, AG, AC, AT, GA, GG, GC, GT …
AAA, AAG, AAC, AAT, AGA, AGG, AGC, AGT …
AAAA, AAAG, AAAC, AAAT, AAGA, AAGG, …
Count in host and virus
Use machine learning (Random Forest) to identify which hosts and which viruses match
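The oligonucleotide counting step above can be sketched as follows. This is a minimal illustration, not the code used for the results in these slides. The function names, the skipping of windows containing non-ACGT characters, and the normalization to relative frequencies are all assumptions.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption; the slides do not say how ambiguous bases are handled).
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profile(sequence, k_values=(2, 3, 4, 5, 6, 7)):
    """Relative k-mer frequencies for each k, in a fixed lexicographic order.

    Returns one flat feature vector with 4**k entries per value of k,
    so host and virus profiles are directly comparable.
    """
    profile = []
    for k in k_values:
        counts = kmer_counts(sequence, k)
        total = sum(counts.values())
        for kmer in map("".join, product("ACGT", repeat=k)):
            profile.append(counts[kmer] / total if total else 0.0)
    return profile


if __name__ == "__main__":
    print(kmer_counts("AAGA", 2))
    print(len(kmer_profile("ACGTACGTAC", k_values=(2, 3))))
```

Profiles computed this way for viruses with known hosts could then serve as training features for a Random Forest classifier, for example scikit-learn's `RandomForestClassifier`. The slides name only the method, so that library choice is an assumption.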
18
Classification Accuracy
Using all known viruses and their hosts
[Figure: classification error (%) by oligonucleotide length]
19
Length Matters!
Using all known viruses and their hosts, 200 bp reads
Using known samples as control: ~5% of reads classified
Sequencing error: little effect
[Figure: prediction percent for correct and wrong predictions]
20
[Figure: predicted host vs. actual host]
21
[Figure: predicted vs. actual host]
89% of misclassifications are near neighbors
11% are outside the near neighbors
22
Shark Virus Hosts
Virus hosts include eukaryotes, bacteria, and plants
23
Thanks Liz Forest Stuart Katie Alan
24
Take Home Points
Check your data (e.g. prinseq)
Annotate the data (e.g. RTMg)
Analyze your data
25
The Lab Ramy Jeremy Bas Dave Sajia Rob Kate Joakim Brad Steve Sheridan
Stephanie Adam Carny Rima Josh Daniel Michiyo Vasken Matt S Matt H Bianca Andrés Nick C Nick T Brian Geni Jimmy Amanda
26
Funding
PhAnToMe
TUES
Viral Dark Matter
Brazil-US Marine Sciences Consortium
Coral Reef Image Analysis