Are Roche 454 shotgun reads giving an accurate picture of the genome?
The material ● 1 Titanium E. coli run from the local platform = test run used to validate the sequencer (two half plates) + control sequences ● 4 GS Flex E. coli runs found in the NCBI Short Read Archive ● 1 Titanium Erwinia run from the local platform (eight lanes)
The references ● Escherichia coli str. K-12 substr. MG1655, complete genome – NCBI / LOCUS NC_ – 4,639,675 bp – circular BCT DNA – 08-MAY-2009 ● Erwinia amylovora – Sanger Center – 3,805,874 bp – 30-SEPT-2008
The questions ● Do 454 shotgun reads reflect the genome? – Do the reads correspond to the genome (possible alignment / errors: substitutions, gaps, ...)? – Are the reads randomly distributed? ● What is the quality of the sequences? ● What are the biases? ● Are there criteria for filtering out low-quality sequences?
Sequence length : E. coli sequences sequences Total : sequences
Sequence quality sequences sequences
Mapping against the reference genome ● out of can be mapped ● forward / reverse ● unmapped reads ● 64 contigs produced AMOScmp-shortReads + hawkeye
Unmapped sequences ● Many short reads ● The average quality is not affected [Figures: read length; read average quality]
Unmapped sequences clustering ● Cap3 clustering : contigs ( reads) singlets [Figure axes: contig length (log scale), contig depth]
Unmapped sequence annotation ● Megablast vs prokaryotes ● out of can be annotated with the NCBI prokaryote database ● (0.15%) sequences can be neither clustered nor annotated ● Only a very small number of reads could not be linked to the genome.
Mapped sequences: uncertainties (Ns) and quality ● 64 contigs / reads ● Per block / per read – Number of substitutions – Number of insertions – Number of deletions – Number of uncertainties (Ns)
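The per-read counts listed above can be illustrated with a minimal sketch. It assumes each read is available as a gapped pairwise alignment against the consensus, given as two equal-length strings with `-` for gaps. The function name `count_alignment_errors` and this input format are illustrative assumptions, not the pipeline used for the slides.

```python
from collections import Counter


def count_alignment_errors(ref_aln, read_aln):
    """Count error types in one gapped pairwise alignment.

    ref_aln and read_aln are aligned strings of equal length,
    with '-' marking a gap (hypothetical input format).
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    counts = Counter(substitutions=0, insertions=0, deletions=0, ns=0)
    for ref_base, read_base in zip(ref_aln.upper(), read_aln.upper()):
        if ref_base == "-" and read_base == "-":
            continue  # padding column, carries no information
        if ref_base == "-":
            counts["insertions"] += 1      # base present in the read only
        elif read_base == "-":
            counts["deletions"] += 1       # base present in the reference only
        elif read_base == "N":
            counts["ns"] += 1              # uncertain base call
        elif ref_base != read_base:
            counts["substitutions"] += 1
    return counts


if __name__ == "__main__":
    print(count_alignment_errors("ACGT-ACGT", "ACCTTA-NT"))
```

Counting Ns separately from substitutions keeps the uncertainty rate distinct from the true mismatch rate, as on the slide.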
Mapped sequences error rate ● Number of sequences and error rate (log)
Mapped sequences Ns rate ● Distribution of the average number of Ns along the reads
Mapped sequences: reads ● % of the reads match the consensus perfectly ● 6.24% have one or more Ns ● 19.37% have one or more substitutions ● 37.58% have one or more insertions ● 59.57% have one or more deletions
Mapped sequences: blocks ● Waiting for the new data
Mapped sequences ● Waiting for the new data
Duplicated sequences ● Laurence Drouilhet : PhD student ● False SNPs linked to reads having the same start
Duplicated read search ● Reference : – Splitting E. coli into sequences and looking for duplicated reads ● Strategies : – Using the alignment – Cutting the sequences and sorting them – Aligning the sequences and selecting those having the same start
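The "cutting the sequences and sorting them" strategy could look roughly like the sketch below: every read is cut to a fixed prefix, the prefixes are sorted, and reads with an identical prefix end up next to each other. The prefix length of 50 and the function name `duplicated_by_prefix` are assumptions chosen for illustration; the slides do not state the values actually used.

```python
from itertools import groupby


def duplicated_by_prefix(reads, prefix_len=50):
    """Group read names whose first prefix_len bases are identical.

    reads: dict mapping read name -> sequence.
    Reads shorter than prefix_len are ignored.
    Returns a list of groups (lists of names) with at least two members.
    """
    keyed = sorted(
        (seq[:prefix_len].upper(), name)
        for name, seq in reads.items()
        if len(seq) >= prefix_len
    )
    groups = []
    # After sorting, identical prefixes are adjacent, so groupby finds them.
    for _, members in groupby(keyed, key=lambda item: item[0]):
        names = [name for _, name in members]
        if len(names) > 1:
            groups.append(names)
    return groups


if __name__ == "__main__":
    toy = {"r1": "ACGTACGT", "r2": "ACGTACGA", "r3": "TTTTACGT", "r4": "ACG"}
    print(duplicated_by_prefix(toy, prefix_len=6))
```

This only detects reads with the same start on the same strand and no error in the prefix; the alignment-based strategies are needed to tolerate sequencing errors.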
Building the reference ● NC_ : 4,639,675 bp ● Random selection of sequences [Figure: number of duplicated reads per length]
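The idea of a random reference can be sketched as follows: draw read starts uniformly on the circular genome and count how many reads share a start and strand by chance alone. This is a simplified model assuming uniform sampling and exact same-start matching; the function names and the read count used in the example are arbitrary and not taken from the runs.

```python
import random
from collections import Counter


def simulate_same_start(genome_len, n_reads, seed=0):
    """Simulate uniform read starts and count reads that share
    (start, strand) with at least one other read."""
    rng = random.Random(seed)
    starts = Counter(
        (rng.randrange(genome_len), rng.choice("+-")) for _ in range(n_reads)
    )
    return sum(count for count in starts.values() if count > 1)


def expected_same_start(genome_len, n_reads):
    """Expected number of reads sharing (start, strand) with another read
    under uniform sampling over 2 * genome_len possible starts."""
    slots = 2 * genome_len
    p_alone = (1.0 - 1.0 / slots) ** (n_reads - 1)
    return n_reads * (1.0 - p_alone)


if __name__ == "__main__":
    genome_len = 4_639_675   # E. coli K-12 MG1655 length from the slides
    n_reads = 100_000        # arbitrary example value
    print(simulate_same_start(genome_len, n_reads))
    print(round(expected_same_start(genome_len, n_reads), 1))
```

Any excess of same-start reads in a real run over this baseline points to duplication introduced by the protocol rather than by chance.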
Duplicated reads of the 454 ● Two half runs (absolute / relative)
Duplicated reads and complexity ● Distance between two adjacent reads / complexity
What is the structure of the duplicated read graph? ● Number of couples, triplets,...
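One way to obtain the numbers of couples, triplets and larger groups is to treat each pair of duplicated reads as an edge and to count the sizes of the connected components. The sketch below uses a small union-find; the input format (a list of read-name pairs) and the function name `component_sizes` are assumptions for illustration.

```python
from collections import Counter


def component_sizes(pairs):
    """Return {cluster size: number of clusters} for the graph whose
    edges are the given (read_a, read_b) duplicate pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b        # merge the two clusters

    members_per_root = Counter(find(x) for x in list(parent))
    return dict(sorted(Counter(members_per_root.values()).items()))


if __name__ == "__main__":
    edges = [("r1", "r2"), ("r3", "r4"), ("r4", "r5"), ("r6", "r7")]
    print(component_sizes(edges))  # two couples and one triplet
```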
Where are the duplicated reads located on the plate? ● No specific location ● But the half runs have different profiles
Where are the duplicated reads on the genome ● No specific location
Do the half runs have the same duplicated reads? ● No, the number of couples should drop ● Only 922 reads out of from the second half run also exist in the first half run [Figure axes: cluster size, number of sequences]
Do duplicated reads have specific patterns? ● No specific pattern : – GC % – Di-nucleotide % – Tri-nucleotide %
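The composition measures compared between duplicated and non-duplicated reads can be computed as in this minimal sketch. The function names are illustrative, and the handling of Ns (ignored in both measures) is an assumption.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage over the A/C/G/T bases of seq (Ns are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def kmer_percent(seq, k):
    """Percentage of each overlapping k-mer (k=2: di-, k=3: tri-nucleotides).
    k-mers containing anything other than A/C/G/T are skipped."""
    seq = seq.upper()
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = {m: c for m, c in kmers.items() if set(m) <= set("ACGT")}
    total = sum(kmers.values())
    if total == 0:
        return {}
    return {m: 100.0 * c / total for m, c in sorted(kmers.items())}


if __name__ == "__main__":
    print(gc_percent("ACGTGGCC"))
    print(kmer_percent("ACACGT", 2))
```

Comparing these profiles between the two read sets is what leads to the "no specific pattern" statement on the slide.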
What happens when we are less stringent? ● Using megablast and same start (-p 98 -s 140) ● Same start alignment result strand : – forward/forward – forward/reverse
Less stringent clustering clview
Validation of the observation in other runs ● GS FLX : NCBI SRA (absolute / relative)
Validation of the observation in other runs ● Erwinia (absolute)
Number of reads for Erwinia ● relative
Duplicated reads location Erwinia ● Differences between the lanes duplicated reads all reads
What are the impacts of n-plicated reads? ● Longer assembly processing ● False SNPs ● Wrong expression measurement
Example of the false SNP ● Detection depends on the depth ● Origin : PCR errors,...
Impact of n-plicated reads on SNP detection in ESTs ● Number of SNPs removed with the removal of n-plicated sequences – Quail scc1 : > 3493 = 33.5% – Duck sap1 : > 1638 = 33.8% – Chicken sgg9 : > = 6.5%
Where are the big n-plicated clusters located? Are the sequences from the same cluster aligned at the same place on the genome? ● First half run ● Clusters > 6 reads with same start ● 643 out of 1245 clusters have all reads in the same contig starting at the same position ● They don't come from replicated regions of the genome
Is the Roche 454 suited for expression analysis? ● The n-plicated reads limit the possible use of the absolute number of reads in the contig as the expression level of the mRNA ● It is possible to use the contig average depth instead after n-plicated reads removal
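The difference between the two measures can be shown with a small sketch: reads sharing the same start and strand are collapsed to the longest one, and the contig average depth is computed before and after. The tuple format `(start, end, strand)` with an exclusive end, and the rule of keeping the longest read, are assumptions made for this example.

```python
def average_depth(alignments, contig_len):
    """Average depth = total aligned bases / contig length.
    alignments: iterable of (start, end, strand), 0-based, end exclusive."""
    return sum(end - start for start, end, _ in alignments) / contig_len


def collapse_same_start(alignments):
    """Keep one read (the longest) per (start, strand)."""
    longest = {}
    for start, end, strand in alignments:
        key = (start, strand)
        kept = longest.get(key)
        if kept is None or end - start > kept[1] - kept[0]:
            longest[key] = (start, end, strand)
    return sorted(longest.values())


if __name__ == "__main__":
    alns = [(0, 100, "+"), (0, 100, "+"), (0, 90, "+"),
            (50, 150, "+"), (120, 200, "-")]
    print(len(alns), average_depth(alns, 200))
    cleaned = collapse_same_start(alns)
    print(len(cleaned), average_depth(cleaned, 200))
```

In the toy contig, three of the five reads are copies of the same fragment, so the raw read count overstates the expression level while the depth after collapsing does not.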
Conclusions ● The overall quality of the reads is good : – Number of matching reads is high – Alignment of the reads is good (the closer to the expected length, the better) – N-plicated read search has to be conducted on all runs ● 454 may have no cloning bias, but it does have an n-plicated read bias – Remove the n-plicated reads before assembly, SNP search and expression analysis – No criterion found for doing this entirely reliably
Epilogue ● Mail from Roche, July 6th, 2009 – In our experience, we typically observe an increase in redundant/duplicate reads in FLX Titanium sequencing as compared to Standard series chemistry. For standard (non-amplified) shotgun libraries, this generally translates to 20-25% redundancy with Titanium versus 15-20% for Standard series methods. – Please note that the new kits with the improved capture beads and the oil, will decrease the redundancy.
Pyrocleaner ● Removing the n-plicated sequences to be as close as possible to the random sorting results ● Using the start and end positions
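As an illustration of filtering on start and end positions, the sketch below keeps a single read per (start, end, strand) and prefers the one with the highest mean quality. This is not Pyrocleaner's actual implementation: the record fields, the tie-breaking rule and the function name `remove_nplicated` are assumptions, and the sketch removes every exact duplicate rather than tuning the removal toward the random expectation.

```python
def remove_nplicated(reads):
    """Keep one read per (start, end, strand), the one with the best
    mean quality; earlier reads win ties.

    reads: list of dicts with keys
           'name', 'start', 'end', 'strand', 'mean_quality' (hypothetical).
    """
    best = {}
    for read in reads:
        key = (read["start"], read["end"], read["strand"])
        kept = best.get(key)
        if kept is None or read["mean_quality"] > kept["mean_quality"]:
            best[key] = read
    return sorted(best.values(),
                  key=lambda r: (r["start"], r["end"], r["name"]))


if __name__ == "__main__":
    toy = [
        {"name": "r1", "start": 10, "end": 410, "strand": "+", "mean_quality": 30.0},
        {"name": "r2", "start": 10, "end": 410, "strand": "+", "mean_quality": 34.5},
        {"name": "r3", "start": 10, "end": 395, "strand": "+", "mean_quality": 33.0},
        {"name": "r4", "start": 10, "end": 410, "strand": "-", "mean_quality": 28.0},
    ]
    print([r["name"] for r in remove_nplicated(toy)])
```

Requiring both start and end to match is stricter than the same-start criterion, so reads such as `r3` above are kept.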