Published by Evelyn Ramsey. Modified over 8 years ago.
1
Do Roche 454 shotgun reads give an accurate picture of the genome?
2
The material ● 1 Titanium E. coli run from the local platform: the test run used to validate the sequencer (two half plates), plus control sequences ● 4 GS FLX E. coli runs found in the NCBI Short Read Archive ● 1 Titanium Erwinia run from the local platform (eight lanes)
3
The references ● Escherichia coli str. K-12 substr. MG1655, complete genome – NCBI / LOCUS NC_000913 – 4,639,675 bp – circular BCT DNA – 08-MAY-2009 ● Erwinia amylovora – Sanger Center – 3,805,874 bp – 30-SEPT-2008
4
The questions ● Do 454 shotgun reads reflect the genome? – Do the reads correspond to the genome (can they be aligned; errors: substitutions, gaps, ...)? – Are the reads randomly distributed? ● What is the quality of the sequences? ● What are the biases? ● Are there criteria for filtering out the low-quality sequences?
5
Sequence length: E. coli ● 671 856 sequences ● 529 653 sequences ● Total: 1 201 509 sequences
6
Sequence quality ● 671 856 sequences ● 529 653 sequences
7
Mapping against the reference genome ● 1 160 389 out of 1 201 509 reads can be mapped ● 580 120 forward / 580 269 reverse ● 41 120 unmapped reads ● 64 contigs produced ● Tools: AMOScmp-shortReads + hawkeye
8
Unmapped sequences ● Many short reads ● The average quality is not affected ● Plots: read length; read average quality
9
Unmapped sequences clustering ● Cap3 clustering: 4 120 contigs (33 241 reads), 7 879 singlets ● Plot axes: contig length (log scale); contig depth
10
Unmapped sequence annotation ● Megablast vs prokaryotes ● 10 241 out of 11 999 (4 120 contigs + 7 879 singlets) can be annotated with the NCBI prokaryote database ● 1 758 sequences (0.15%) can be neither clustered nor annotated ● Only a very small number of reads could not be linked to the genome.
11
Mapped sequences: uncertainties (Ns) and quality ● 64 contigs / 1 160 389 reads ● Per block / per read – Number of substitutions – Number of insertions – Number of deletions – Number of uncertainties (Ns)
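The per-read counts listed on this slide can be illustrated with a minimal sketch. It assumes each read is available as a gapped pairwise alignment against the consensus (two equal-length strings, with `-` marking a gap); the function name `count_errors` and this input format are illustrative assumptions, not the pipeline used for the slides.

```python
from collections import Counter


def count_errors(read_aln: str, ref_aln: str) -> Counter:
    """Count error types in one gapped read/consensus alignment.

    Both strings must have the same length; '-' marks a gap.
    Hypothetical helper, not the tool used in the presentation.
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned strings must have the same length")
    counts = Counter(substitutions=0, insertions=0, deletions=0, ns=0)
    for r, g in zip(read_aln.upper(), ref_aln.upper()):
        if r == "N":
            counts["ns"] += 1             # uncertain base call in the read
        elif g == "-":
            counts["insertions"] += 1     # base present in the read only
        elif r == "-":
            counts["deletions"] += 1      # base present in the consensus only
        elif r != g:
            counts["substitutions"] += 1  # mismatching base
    return counts


# One substitution, one deletion, one N and one insertion.
print(count_errors("AT-GNCA", "ACGGAC-"))
```

A read whose four counts are all zero would fall in the "matches the consensus perfectly" category.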
12
Mapped sequences error rate ● Number of sequences and error rate (log)
13
Mapped sequences Ns rate ● Distribution of the average number of Ns along the reads
14
Mapped reads ● 22.66% of the reads match the consensus perfectly ● 6.24% have one or more Ns ● 19.37% have one or more substitutions ● 37.58% have one or more insertions ● 59.57% have one or more deletions
15
Mapped sequences blocks ● Waiting for the new data
16
Mapped sequences ● Waiting for the new data
17
Duplicated sequences ● Laurence Drouilhet: PhD student ● False SNPs linked to reads having the same start
18
Duplicated read search ● Reference : – Splitting E. coli in sequences and looking for duplicated reads ● Strategies : – Using the alignment – Cutting the sequences and sorting them – Aligning the sequences and selecting those having the same start
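The "cutting the sequences and sorting them" strategy can be sketched as follows. This is a minimal illustration, assuming reads are held in a dictionary of name to sequence and that two reads count as duplicates when their first `prefix_len` bases are identical; the function name and the prefix length are assumptions, not values taken from the slides.

```python
from itertools import groupby


def duplicated_by_prefix(reads: dict, prefix_len: int) -> list:
    """Group reads whose first `prefix_len` bases are identical.

    Reads are cut to the prefix, sorted, and adjacent identical
    prefixes are collected. Reads shorter than the prefix are skipped.
    """
    keyed = sorted(
        (seq[:prefix_len].upper(), name)
        for name, seq in reads.items()
        if len(seq) >= prefix_len
    )
    groups = []
    for _, members in groupby(keyed, key=lambda item: item[0]):
        names = [name for _, name in members]
        if len(names) > 1:  # keep only prefixes seen more than once
            groups.append(names)
    return groups


reads = {
    "r1": "ACGTACGTAA",
    "r2": "ACGTACGTCC",
    "r3": "TTGTACGTAA",
    "r4": "ACG",
}
print(duplicated_by_prefix(reads, prefix_len=8))
```

Exact prefix matching is the strictest variant; tolerating sequencing errors requires an alignment-based comparison such as the megablast approach mentioned later in the deck.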
19
Building the reference ● NC_000913: 4,639,675 bp ● Random selection of 550 000 sequences ● Plot: number of duplicated reads per length
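The random reference gives the number of same-start reads expected by chance alone. A rough sketch of that baseline, assuming read starts are drawn uniformly and independently over `n_starts` possible positions (a simplification: it ignores strand, read length and the circular genome, and is not the procedure used for the slide):

```python
import random
from collections import Counter


def expected_duplicated_reads(n_reads: int, n_starts: int) -> float:
    """Expected number of reads sharing their start with at least one other read."""
    # Probability that at least one of the other n_reads - 1 reads
    # falls on the same start position as a given read.
    p_shared = 1.0 - (1.0 - 1.0 / n_starts) ** (n_reads - 1)
    return n_reads * p_shared


def simulate_duplicated_reads(n_reads: int, n_starts: int, seed: int = 0) -> int:
    """Draw random start positions and count reads whose start is not unique."""
    rng = random.Random(seed)
    starts = Counter(rng.randrange(n_starts) for _ in range(n_reads))
    return sum(count for count in starts.values() if count > 1)


# Figures from the slide: 550 000 reads drawn from a 4,639,675 bp genome.
expected = expected_duplicated_reads(550_000, 4_639_675)
print(f"expected fraction of same-start reads: {expected / 550_000:.3f}")
```

Any excess of same-start reads in a real run over this kind of baseline is what the following slides attribute to the sequencing process rather than to chance.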
20
Duplicated reads of the 454 ● Two half runs (absolute / relative)
21
Duplicated reads and complexity ● Distance between two adjacent reads / complexity
22
What is the structure of the duplicated read graph? ● Number of couples, triplets,...
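Counting couples, triplets and larger groups amounts to a histogram of cluster sizes. A minimal sketch, assuming the duplicated reads have already been grouped into lists of read names (the input format and function name are assumptions):

```python
from collections import Counter


def cluster_size_histogram(clusters: list) -> dict:
    """Map cluster size to the number of clusters of that size.

    Size 2 corresponds to couples, size 3 to triplets, and so on.
    Singletons are ignored because they are not duplicated reads.
    """
    sizes = Counter(len(cluster) for cluster in clusters if len(cluster) > 1)
    return dict(sorted(sizes.items()))


clusters = [["a", "b"], ["c", "d"], ["e", "f", "g"], ["h"]]
print(cluster_size_histogram(clusters))  # {2: 2, 3: 1}
```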
23
Where are the duplicated reads located on the plate? ● No specific location ● But the half runs have different profiles
24
Where are the duplicated reads on the genome ● No specific location
25
Do the half runs share the same duplicated reads? ● No, the number of couples should drop ● Only 922 reads out of 57 334 from the second half run also exist in the first half run ● Plot axes: cluster size; number of sequences
26
Do duplicated reads have specific patterns? ● No specific pattern: – GC % – Di-nucleotide % – Tri-nucleotide %
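The three composition measures compared on this slide can be computed as below. This is an illustrative sketch with hypothetical helper names; it counts overlapping k-mers, which is one common convention and may differ from the one used for the slides.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def kmer_percent(seq: str, k: int) -> dict:
    """Percentage of each overlapping k-mer (k=2: di-nucleotides, k=3: tri-nucleotides)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: 100.0 * n / total for kmer, n in counts.items()} if total else {}


print(gc_percent("GGCCAATT"))   # 50.0
print(kmer_percent("ACAC", 2))  # AC twice, CA once
```

Comparing these profiles between duplicated reads and all reads is one way to check for a sequence-specific pattern.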
27
What happens when we are less stringent? ● Using megablast and same start (-p 98 -s 140) ● Same start alignment result strand : – 1 216 659 forward/forward – 110 896 forward/reverse
28
Less stringent clustering clview
29
Validation of the observation in other runs ● GS FLX : NCBI SRA (absolute / relative)
30
Validation of the observation in other runs ● Erwinia (absolute)
31
Number of reads for Erwinia ● relative
32
Duplicated reads location Erwinia ● Differences between the lanes duplicated reads all reads
33
What are the impacts of n-plicated reads ● Longer assembly processing ● False SNPs ● Wrong expression measurement
34
Example of the false SNP ● Detection depends on the depth ● Origin : PCR errors,...
35
Impact of n-plicated reads on SNP detection in ESTs ● Number of SNPs removed when the n-plicated sequences are removed: – Quail scc1: 5254 -> 3493 = 33.5% – Duck sap1: 2476 -> 1638 = 33.8% – Chicken sgg9: 27774 -> 25976 = 6.5%
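The percentages on this slide are the share of SNPs that disappear once n-plicated sequences are removed. A small check of that arithmetic, using only the counts given on the slide (the function name is illustrative):

```python
def percent_removed(before: int, after: int) -> float:
    """Percentage of SNPs that disappear after removing n-plicated reads."""
    return 100.0 * (before - after) / before


# SNP counts before and after removal, as given on the slide.
snp_counts = {
    "Quail scc1": (5254, 3493),
    "Duck sap1": (2476, 1638),
    "Chicken sgg9": (27774, 25976),
}
for library, (before, after) in snp_counts.items():
    print(f"{library}: {percent_removed(before, after):.1f}% of SNPs removed")
```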
36
Where are the big n-plicated clusters located? Are the sequences from the same cluster aligned at the same place on the genome? ● First half run ● Clusters of more than 6 reads with the same start ● 643 out of 1245 clusters have all their reads in the same contig, starting at the same position ● They do not come from replicated regions of the genome
37
Is the Roche 454 suited for expression analysis? ● The n-plicated reads limit the possible use of the absolute number of reads in the contig as the expression level of the mRNA ● It is possible to use the contig average depth instead after n-plicated reads removal
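The contig average depth proposed here can be sketched as total aligned bases divided by contig length. The example assumes n-plicated reads have already been removed and that each remaining read is described by half-open `(start, end)` coordinates on the contig; both the input format and the function name are assumptions.

```python
def average_depth(alignments: list, contig_length: int) -> float:
    """Average depth of a contig: total aligned bases / contig length.

    `alignments` holds (start, end) pairs with 0-based, half-open coordinates.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    aligned_bases = sum(end - start for start, end in alignments)
    return aligned_bases / contig_length


# Three 100 bp reads on a 200 bp contig give an average depth of 1.5.
print(average_depth([(0, 100), (50, 150), (100, 200)], 200))
```

Unlike the raw read count, this value is normalised by contig length, which is what makes it comparable between contigs.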
38
Conclusions ● The overall quality of the reads is good: – The number of matching reads is high – The alignment of the reads is good (the closer to the expected length, the better) – An n-plicated read search has to be conducted on all runs ● 454 perhaps has no cloning bias, but it has an n-plicated read bias – N-plicated reads should be removed before assembly, SNP search and expression analysis – No criterion has been found to do this really properly
39
Epilogue ● Mail from Roche, July 6th, 2009 – In our experience, we typically observe an increase in redundant/duplicate reads in FLX Titanium sequencing as compared to Standard series chemistry. For standard (non-amplified) shotgun libraries, this generally translates to 20-25% redundancy with Titanium versus 15-20% for Standard series methods. – Please note that the new kits with the improved capture beads and the oil, will decrease the redundancy.
40
Pyrocleaner ● Removing the n-plicated sequences to get as close as possible to the results expected from random sampling ● Using the start and end positions
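The idea of removing n-plicated sequences from their start and end positions can be sketched as follows. This is not Pyrocleaner's actual implementation: the `Hit` record, its fields and the rule of keeping the first read of each group are all assumptions made for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical alignment record for one read."""
    read_id: str
    contig: str
    strand: str
    start: int
    end: int


def remove_nplicates(hits: list) -> list:
    """Keep one read per (contig, strand, start, end) group, in input order."""
    seen = set()
    kept = []
    for hit in hits:
        key = (hit.contig, hit.strand, hit.start, hit.end)
        if key not in seen:  # first read seen at these coordinates
            seen.add(key)
            kept.append(hit)
    return kept


hits = [
    Hit("r1", "c1", "+", 10, 250),
    Hit("r2", "c1", "+", 10, 250),  # same start and end as r1: n-plicate
    Hit("r3", "c1", "-", 10, 250),  # other strand: kept
    Hit("r4", "c1", "+", 10, 300),  # same start, different end: kept
]
print([hit.read_id for hit in remove_nplicates(hits)])  # ['r1', 'r3', 'r4']
```

Requiring both start and end to match is stricter than the same-start criterion used earlier in the deck, so fewer reads are discarded.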