The bioinformatics behind

Presentation on theme: "The bioinformatics behind"— Presentation transcript:

1 The bioinformatics behind shotgun metagenomic sequencing
Roche UG 2012
Rob Edwards

2 Outline
Metagenomics
Annotation
Virus:host prediction

3 How Much Has Been Sequenced?
[Figure: number of known sequences by year, with milestones marked: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]

4 How Much Will Be Sequenced?
Everybody in the USA
Everybody in the Indy 500 infield
One genome from every species
100 people
Most major microbial environments
All cultured Bacteria
Training people for the future
X-Prize competition: sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome

5 Metagenomics Sequencing the World

6 Shark Metagenomes

7 Shark Metagenomes

8 Metagenomics Analysis Steps
1. Clean the data (PRINSEQ)
   Check the quality of the data
   Remove bad reads
2. Annotate the data (RTMG)
   What is there?
   Who is there?
3. Analyze the data

9 PRINSEQ: Clean, dereplicate, check quality, analyze
PRINSEQ Results
Rob
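The clean and dereplicate steps named on this slide can be illustrated with a minimal Python sketch. This is not PRINSEQ's implementation or its command-line interface. The function names, the `(read_id, sequence)` tuple layout, and the thresholds `min_len` and `max_n_frac` are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read_id, sequence); a hypothetical layout


def is_clean(seq: str, min_len: int = 50, max_n_frac: float = 0.01) -> bool:
    """Return True if the read is long enough and has few ambiguous (N) bases."""
    # max(min_len, 1) also rejects empty reads and avoids division by zero.
    if len(seq) < max(min_len, 1):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    return n_frac <= max_n_frac


def clean_and_dereplicate(
    reads: Iterable[Read], min_len: int = 50, max_n_frac: float = 0.01
) -> List[Read]:
    """Drop bad reads, then keep only the first copy of each exact duplicate."""
    seen = set()
    kept: List[Read] = []
    for read_id, seq in reads:
        seq = seq.upper()
        if not is_clean(seq, min_len, max_n_frac):
            continue  # too short or too many Ns
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((read_id, seq))
    return kept


example_reads = [
    ("r1", "ACGT" * 20),  # good read
    ("r2", "acgt" * 20),  # exact duplicate of r1 (case-insensitive)
    ("r3", "ACG"),        # too short
    ("r4", "N" * 80),     # all ambiguous bases
]
cleaned = clean_and_dereplicate(example_reads)
```

Only exact duplicates are removed here; real tools may also handle reverse-complement or prefix duplicates and per-base quality scores.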

10 Data Processed (May 2012)
Datasets processed: 6,223
Sequences processed: 8,698,402,976
Bases processed: ,472,118,033

11 Length Matters

12 Preprocessing Data

13 Annotating Metagenomes
Identify functions and organisms present in the sample
BLAST is very slow
Immediate processing of data
RTMG

14 A Better Way To Annotate Metagenomes
Real time metagenomics (RTMG)

15 Shark Microbe Metabolism
Microbes on sharks struggle for iron

16 Host – Virus Interactions
Can we predict the host a virus infects just from the sequence?
Bas
Michiyo

17 How To Predict Hosts
Count all 2-, 3-, 4-, 5-, 6-, and 7-bp sequences:
AA, AG, AC, AT, GA, GG, GC, GT …
AAA, AAG, AAC, AAT, AGA, AGG, AGC, AGT …
AAAA, AAAG, AAAC, AAAT, AAGA, AAGG, …
Count in host and virus
Use machine learning (Random Forest) to identify which hosts and which viruses match
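The counting step above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. The names `count_kmers` and `kmer_profile`, the use of relative frequencies rather than raw counts, and the choice to skip windows containing non-ACGT characters are all assumptions. The Random Forest step is not shown; a profile like this could serve as the feature vector for such a classifier.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List

ALPHABET = "AGCT"  # ordered so k-mers enumerate as on the slide: AA, AG, AC, AT, GA, ...


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-bp sequence, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    valid = set(ALPHABET)
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


def kmer_profile(seq: str, k_values: Iterable[int] = range(2, 8)) -> List[float]:
    """Concatenate relative k-mer frequencies for each k in a fixed k-mer order."""
    profile: List[float] = []
    for k in k_values:
        counts = count_kmers(seq, k)
        total = sum(counts.values())
        for kmer in map("".join, product(ALPHABET, repeat=k)):
            profile.append(counts[kmer] / total if total else 0.0)
    return profile


dimer_counts = count_kmers("AAGA", 2)        # windows: AA, AG, GA
full_profile = kmer_profile("ACGTACGTACGT")  # k = 2 through 7
```

For k = 2 through 7 the profile has 4^2 + 4^3 + ... + 4^7 = 21,840 entries, so host and virus profiles computed this way are directly comparable position by position.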

18 Classification Accuracy
Using all known viruses and their hosts
[Figure: classification error (%) versus oligonucleotide length]

19 Length Matters!
Using all known viruses and their hosts
200 bp reads
Using known samples as control: ~5% of reads classified
Sequencing error has little effect
[Figure: prediction percent, correct versus wrong predictions]

20 [Figure: predicted host versus actual host]

21 [Figure: predicted versus actual host]
89% of misclassifications are near neighbors
11% are outside the near neighbors
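A summary like the one on this slide could be computed from (predicted, actual) host pairs roughly as below. The slide does not say how near neighbors were defined, so this sketch assumes a hypothetical `group_of` mapping from each host to a taxonomic group and treats two hosts in the same group as near neighbors. The function name and the toy labels are illustrative only, not data from the talk.

```python
from typing import Dict, List, Tuple


def summarize_predictions(
    pairs: List[Tuple[str, str]], group_of: Dict[str, str]
) -> Dict[str, float]:
    """Return the error rate and the share of errors that fall on near neighbors.

    pairs: (predicted_host, actual_host) for each virus.
    group_of: hypothetical host -> group mapping; same group means near neighbor.
    """
    wrong = [(pred, actual) for pred, actual in pairs if pred != actual]
    near = sum(
        1
        for pred, actual in wrong
        if group_of.get(pred) is not None
        and group_of.get(pred) == group_of.get(actual)
    )
    return {
        "error_pct": 100.0 * len(wrong) / len(pairs) if pairs else 0.0,
        "near_neighbor_pct": 100.0 * near / len(wrong) if wrong else 0.0,
    }


# Toy labels, not results from the presentation.
toy_groups = {"hostA": "group1", "hostB": "group1", "hostC": "group2"}
toy_pairs = [
    ("hostA", "hostA"),  # correct
    ("hostB", "hostA"),  # wrong, but same group (near neighbor)
    ("hostC", "hostA"),  # wrong, different group
    ("hostA", "hostA"),  # correct
]
toy_summary = summarize_predictions(toy_pairs, toy_groups)
```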

22 Shark Virus Hosts
Virus hosts include eukaryotes, bacteria, and plants

23 Thanks
Liz Forest Stuart Katie Alan

24 Take Home Points
Check your data (e.g. PRINSEQ)
Annotate the data (e.g. RTMG)
Analyze your data

25 The Lab
Ramy Jeremy Bas Dave Sajia Rob Kate Joakim Brad Steve Sheridan Stephanie Adam Carny Rima Josh Daniel Michiyo Vasken Matt S Matt H Bianca Andrés Nick C Nick T Brian Geni Jimmy Amanda

26 Funding
PhAnToMe
TUES
Viral Dark Matter
Brazil-US Marine Sciences Consortium
Coral Reef Image Analysis

