Are Roche 454 shotgun reads giving an accurate picture of the genome?

The material ● 1 Titanium E. coli run from the local platform = test run used to validate the sequencer (two half plates) + control sequences ● 4 GS FLX E. coli runs found in the NCBI Short Read Archive ● 1 Titanium Erwinia run from the local platform (eight lanes)

The references ● Escherichia coli str. K-12 substr. MG1655, complete genome – NCBI / LOCUS NC_ – 4,639,675 bp – circular BCT DNA – 08-MAY-2009 ● Erwinia amylovora – Sanger Center – 3,805,874 bp – 30-SEPT-2008

The questions ● Do 454 shotgun reads reflect the genome? – Do the reads correspond to the genome (possible alignment / errors : substitutions, gaps,...)? – Are the reads randomly distributed? ● What is the quality of the sequences? ● What are the biases? ● Are there criteria for filtering out the low-quality sequences?

Sequence length : E coli sequences sequences Total : sequences

Sequence quality sequences sequences

Mapping against the reference genome ● out of can be mapped ● forward / reverse ● unmapped reads ● 64 contigs produced AMOScmp-shortReads + hawkeye

Unmapped sequences ● Many short reads ● The average quality is not affected Read length Read average quality

Unmapped sequences clustering Contig length in Log scale Contig depth Cap3 clustering : contigs ( reads) singlets

Unmapped sequence annotation ● Megablast vs prokaryotes ● out of can be annotated with the NCBI prokaryote database ● (0.15%) sequences can be neither clustered nor annotated ● A very low number of reads could not be linked to the genome.

Mapped sequences uncertainties (Ns) and quality ● 64 contigs / reads ● Per block / per read – Nb substitutions – Nb insertions – Nb deletions – Nb uncertainties (Ns)

Mapped sequences error rate ● Number of sequences and error rate (log)

Mapped sequences Ns rate ● Distribution of average nb of Ns along the reads

Mapped sequences reads ● % of the reads match the consensus perfectly ● 6.24% have one or more Ns ● 19.37% have one or more substitutions ● 37.58% have one or more insertions ● 59.57% have one or more deletions
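The per-read percentages above can be illustrated with a minimal sketch. It assumes each mapped read has already been reduced to per-read error counts; the field names (`substitutions`, `insertions`, `deletions`, `ns`) and the toy data are hypothetical and do not come from the slides.

```python
# Hypothetical per-read error counts; field names are assumptions for illustration.
ERROR_FIELDS = ("substitutions", "insertions", "deletions", "ns")


def percent_with_at_least_one(reads, field):
    """Percentage of reads with one or more errors of the given type."""
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if read[field] > 0)
    return 100.0 * hits / len(reads)


def percent_perfect(reads, fields=ERROR_FIELDS):
    """Percentage of reads with no error of any type."""
    if not reads:
        return 0.0
    perfect = sum(1 for read in reads if all(read[f] == 0 for f in fields))
    return 100.0 * perfect / len(reads)


# Toy data, not taken from the runs described in the slides.
example_reads = [
    {"substitutions": 0, "insertions": 0, "deletions": 0, "ns": 0},
    {"substitutions": 1, "insertions": 0, "deletions": 2, "ns": 0},
    {"substitutions": 0, "insertions": 1, "deletions": 1, "ns": 0},
    {"substitutions": 0, "insertions": 0, "deletions": 0, "ns": 1},
]
```

Because a read can carry several error types at once, the categories overlap and the percentages need not add up to 100.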

Mapped sequences blocks ● Waiting for the new data

Mapped sequences ● Waiting for the new data

Duplicated sequences ● Laurence Drouilhet : PhD student ● False SNPs linked to reads having the same start

Duplicated read search ● Reference : – Splitting E. coli in sequences and looking for duplicated reads ● Strategies : – Using the alignment – Cutting the sequences and sorting them – Aligning the sequences and selecting those having the same start
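The third strategy (align, then select reads with the same start) could look roughly like the sketch below. It assumes the alignments are available as `(read_id, contig, strand, start)` tuples; this record layout and the function name are illustrative assumptions, not the tooling used for the slides.

```python
from collections import defaultdict


def group_reads_by_start(alignments):
    """Group aligned reads that share contig, strand and start position.

    alignments: iterable of (read_id, contig, strand, start) tuples
    (a hypothetical layout). Returns only groups with two or more reads.
    """
    groups = defaultdict(list)
    for read_id, contig, strand, start in alignments:
        groups[(contig, strand, start)].append(read_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Toy alignments: r1 and r2 start at the same place on the same strand.
example_alignments = [
    ("r1", "contig1", "+", 100),
    ("r2", "contig1", "+", 100),
    ("r3", "contig1", "+", 250),
    ("r4", "contig1", "-", 100),
]
duplicated = group_reads_by_start(example_alignments)
```

Keeping the strand in the key is a design choice: a forward and a reverse read at the same coordinate are not treated as duplicates here.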

Building the reference ● NC_ : 4,639,675 bp random selection of sequences Number of duplicated reads per length
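A random reference of this kind can be approximated by simulation. The sketch below assumes reads start uniformly at random on either strand of a circular genome; the function names, the seed, and the read count in the usage note are hypothetical, since the number of sequences drawn is not given in the slides.

```python
import random
from collections import Counter


def simulate_duplicated_reads(genome_len, n_reads, seed=0):
    """Count reads that share (start, strand) with at least one other read
    when starts are drawn uniformly at random (an assumed null model)."""
    rng = random.Random(seed)
    starts = Counter(
        (rng.randrange(genome_len), rng.choice("+-")) for _ in range(n_reads)
    )
    return sum(count for count in starts.values() if count > 1)


def expected_duplicated_reads(genome_len, n_reads):
    """Analytic expectation under the same model: a read is duplicated
    unless all other reads miss its (start, strand) slot."""
    p_all_others_miss = (1.0 - 1.0 / (2 * genome_len)) ** (n_reads - 1)
    return n_reads * (1.0 - p_all_others_miss)
```

For example, `expected_duplicated_reads(4_639_675, 500_000)` gives the number of duplicated reads expected by chance for a hypothetical 500,000 reads on the E. coli reference; an observed count well above that value would point to a technical bias rather than random sampling.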

Duplicated reads of the 454 ● Two half runs (absolute / relative)

Duplicated reads and complexity ● Distance between two adjacent reads / complexity

What is the structure of the duplicated read graph? ● Number of couples, triplets,...
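Counting couples, triplets and larger groups amounts to a histogram of cluster sizes. A minimal sketch, assuming clusters are given as lists of read identifiers (a hypothetical representation):

```python
from collections import Counter


def cluster_size_histogram(clusters):
    """Map cluster size (2 = couple, 3 = triplet, ...) to number of clusters.

    Singletons are ignored because they are not duplicated reads.
    """
    return Counter(len(cluster) for cluster in clusters if len(cluster) > 1)


# Toy clusters: two couples, one triplet, one singleton.
example_clusters = [["a", "b"], ["c", "d"], ["e", "f", "g"], ["h"]]
histogram = cluster_size_histogram(example_clusters)
```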

Where are the duplicated reads located on the plate? ● No specific location ● But the half runs have different profiles

Where are the duplicated reads on the genome? ● No specific location

Do the half runs share the same duplicated reads? ● No, the number of couples should drop ● Only 922 reads out of from the second half run also exist in the first half run Cluster size Number of sequences

Do duplicated reads have specific patterns? ● No specific pattern : – GC % – Di-nucleotide % – Tri-nucleotide %
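The composition checks listed above (GC %, di- and tri-nucleotide %) can be computed as in the sketch below. The function names are illustrative; comparing the values for duplicated reads against all reads is one plausible way to look for a pattern, not necessarily the procedure used for the slides.

```python
from collections import Counter


def gc_percent(seq):
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def kmer_percent(seq, k):
    """Percentage of each overlapping k-mer (k=2 di-, k=3 tri-nucleotides).

    k-mers containing ambiguous bases such as N are skipped.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    kmers = [kmer for kmer in kmers if set(kmer) <= set("ACGT")]
    if not kmers:
        return {}
    counts = Counter(kmers)
    return {kmer: 100.0 * n / len(kmers) for kmer, n in counts.items()}
```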

What happens when we are less stringent? ● Using megablast and same start (-p 98 -s 140) ● Same start alignment result strand : – forward/forward – forward/reverse

Less stringent clustering clview

Validation of the observation in other runs ● GS FLX : NCBI SRA (absolute / relative)

Validation of the observation in other runs ● Erwinia (absolute)

Number of reads for Erwinia ● relative

Duplicated reads location Erwinia ● Differences between the lanes duplicated reads all reads

What are the impacts of n-plicated reads? ● Longer assembly processing ● False SNPs ● Wrong expression measurement

Example of the false SNP ● Detection depends on the depth ● Origin : PCR errors,...

Impact of n-plicated reads in SNP detection in ESTs ● Number of SNPs removed with the removal of n-plicated sequences. – Quail scc1 : > 3493 = 33.5% – Duck sap1 : > 1638 = 33.8% – Chicken sgg9 : > = 6.5%

Where are the big n-plicated clusters located? Are the sequences from the same cluster aligned at the same place on the genome? ● First half run ● Clusters of > 6 reads with the same start ● 643 out of 1245 clusters have all reads in the same contig starting at the same position ● They don't come from replicated regions of the genome

Is the Roche 454 suited for expression analysis? ● The n-plicated reads limit the possible use of the absolute number of reads in the contig as the expression level of the mRNA ● It is possible to use the contig average depth instead after n-plicated reads removal
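The difference between a raw read count and the contig average depth after duplicate removal can be sketched as follows. The sketch assumes reads are given as half-open `(start, end)` intervals on a contig and that duplicates are reads with identical intervals; both are simplifying assumptions made for illustration.

```python
def average_depth(intervals, contig_len):
    """Mean depth: total aligned bases divided by contig length.

    intervals: iterable of half-open (start, end) read positions.
    """
    if contig_len <= 0:
        raise ValueError("contig_len must be positive")
    aligned_bases = sum(end - start for start, end in intervals)
    return aligned_bases / contig_len


# Toy contig of 200 bp: one read present three times plus one distinct read.
raw_reads = [(0, 100), (0, 100), (0, 100), (50, 150)]
unique_reads = sorted(set(raw_reads))  # naive removal of identical intervals

raw_depth = average_depth(raw_reads, 200)
clean_depth = average_depth(unique_reads, 200)
```

In this toy case the raw data suggest a depth twice as high as the deduplicated data, which is the kind of inflation that would distort an expression estimate.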

Conclusions ● The overall quality of the reads is good : – Number of matching reads is high – Alignment of the reads is good (the closer to the expected length, the better) – N-plicated read search has to be conducted on all runs ● 454 perhaps has no cloning bias, but it has an n-plicated read bias – Remove the n-plicated reads before assembly, SNP search and expression analysis – No criterion was found to do this reliably

Epilogue ● Mail from Roche, July 6th, 2009 – In our experience, we typically observe an increase in redundant/duplicate reads in FLX Titanium sequencing as compared to Standard series chemistry. For standard (non-amplified) shotgun libraries, this generally translates to 20-25% redundancy with Titanium versus 15-20% for Standard series methods. – Please note that the new kits, with the improved capture beads and the oil, will decrease the redundancy.

Pyrocleaner ● Removing the n-plicated sequences to be as close as possible to the random sorting results ● Using the start and end positions
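The idea of filtering on start and end positions can be illustrated with the sketch below. It is not Pyrocleaner's implementation: the record layout (`id`, `contig`, `strand`, `start`, `end`, `mean_quality`) and the rule of keeping the read with the highest mean quality are assumptions made for this example.

```python
def remove_n_plicated(reads):
    """Keep one read per (contig, strand, start, end) group.

    reads: iterable of dicts with hypothetical keys 'id', 'contig',
    'strand', 'start', 'end' and 'mean_quality'. Within a group, the read
    with the highest mean quality is kept (an assumed tie-breaking rule);
    on equal quality the first read seen is kept.
    """
    best = {}
    for read in reads:
        key = (read["contig"], read["strand"], read["start"], read["end"])
        kept = best.get(key)
        if kept is None or read["mean_quality"] > kept["mean_quality"]:
            best[key] = read
    return list(best.values())


# Toy reads: r1 and r2 are n-plicates; r3 has a different end position.
example_reads = [
    {"id": "r1", "contig": "c1", "strand": "+", "start": 10, "end": 410, "mean_quality": 30.0},
    {"id": "r2", "contig": "c1", "strand": "+", "start": 10, "end": 410, "mean_quality": 34.0},
    {"id": "r3", "contig": "c1", "strand": "+", "start": 10, "end": 380, "mean_quality": 28.0},
]
kept_ids = [read["id"] for read in remove_n_plicated(example_reads)]
```

Requiring both start and end to match is stricter than matching on the start alone, so it removes fewer reads and is less likely to discard reads that coincide by chance.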