Are Roche 454 shotgun reads giving an accurate picture of the genome?


1 Are Roche 454 shotgun reads giving an accurate picture of the genome?

2 The material ● One Titanium E. coli run from the local platform: the test run used to validate the sequencer (two half plates), plus control sequences ● Four GS FLX E. coli runs found in the NCBI Short Read Archive ● One Titanium Erwinia run from the local platform (eight lanes)

3 The references ● Escherichia coli str. K-12 substr. MG1655, complete genome – NCBI / LOCUS NC_000913 – 4,639,675 bp – circular BCT DNA – 08-MAY-2009 ● Erwinia amylovora – Sanger Center – 3,805,874 bp – 30-SEPT-2008

4 The questions ● Do 454 shotgun reads reflect the genome? – Do the reads correspond to the genome (can they be aligned, and with which errors: substitutions, gaps, ...)? – Are the reads randomly distributed? ● What is the quality of the sequences? ● What are the biases? ● Are there criteria for filtering out low-quality sequences?

5 Sequence length: E. coli ● Two half runs: 671 856 sequences and 529 653 sequences ● Total: 1 201 509 sequences

6 Sequence quality ● 671 856 sequences ● 529 653 sequences

7 Mapping against the reference genome ● 1 160 389 out of 1 201 509 reads can be mapped ● 580 120 forward / 580 269 reverse ● 41 120 unmapped reads ● 64 contigs produced ● Tools: AMOScmp-shortReads + Hawkeye
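The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the function name `mapping_summary` is illustrative and does not come from any tool named in the talk.

```python
def mapping_summary(total, mapped, forward, reverse):
    """Derive the unmapped count and strand percentages from read counts."""
    return {
        "unmapped": total - mapped,
        "mapped_pct": 100.0 * mapped / total,
        "forward_pct": 100.0 * forward / mapped,
        # Forward and reverse reads should add up to the mapped total
        "strands_add_up": forward + reverse == mapped,
    }


# Figures quoted on the slide
summary = mapping_summary(
    total=1_201_509, mapped=1_160_389, forward=580_120, reverse=580_269
)
print(summary)
```

The unmapped count comes out at 41 120, matching the slide, and the mapped reads split almost evenly between the two strands.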

8 Unmapped sequences ● Many short reads ● The average quality is not affected [Figure panels: read length; read average quality]

9 Unmapped sequences clustering ● CAP3 clustering: 4 120 contigs (33 241 reads) and 7 879 singlets [Figure axes: contig length (log scale); contig depth]

10 Unmapped sequence annotation ● Megablast vs prokaryotes ● 10 241 out of 11 999 sequences (contigs plus singlets) can be annotated with the NCBI prokaryote database ● 1 758 sequences (0.15%) can be neither clustered nor annotated ● A very low number of reads could not be linked to the genome.

11 Mapped sequences: uncertainties (Ns) and quality ● 64 contigs / 1 160 389 reads ● Per block / per read – Number of substitutions – Number of insertions – Number of deletions – Number of uncertainties (Ns)

12 Mapped sequences error rate ● Number of sequences and error rate (log)

13 Mapped sequences: Ns rate ● Distribution of the average number of Ns along the reads

14 Mapped sequences: reads ● 22.66% of the reads match the consensus perfectly ● 6.24% have one or more Ns ● 19.37% have one or more substitutions ● 37.58% have one or more insertions ● 59.57% have one or more deletions

15 Mapped sequences: blocks ● Awaiting the new data

16 Mapped sequences ● Awaiting the new data

17 Duplicated sequences ● Laurence Drouilhet: PhD student ● False SNPs linked to reads having the same start

18 Duplicated read search ● Reference: – Splitting the E. coli genome into sequences and looking for duplicated reads ● Strategies: – Using the alignment – Cutting the sequences and sorting them – Aligning the sequences and selecting those having the same start
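The "cutting the sequences and sorting them" strategy can be sketched as grouping reads by a fixed-length prefix. This is a minimal illustration under assumptions: the function name `find_prefix_duplicates`, the default prefix length and the toy reads are hypothetical, and exact prefix matching is a simplification that misses duplicates separated by a sequencing error.

```python
from collections import defaultdict


def find_prefix_duplicates(reads, prefix_len=50):
    """Group read names whose first `prefix_len` bases are identical.

    `reads` maps read name -> sequence. Reads shorter than `prefix_len`
    are skipped because they cannot be compared over the full prefix.
    """
    groups = defaultdict(list)
    for name, seq in reads.items():
        if len(seq) >= prefix_len:
            groups[seq[:prefix_len].upper()].append(name)
    clusters = [sorted(names) for names in groups.values() if len(names) > 1]
    # Largest clusters first, ties broken by read name for stable output
    return sorted(clusters, key=lambda names: (-len(names), names))


# Hypothetical toy reads, not data from the runs
toy_reads = {
    "r1": "ACGTACGTAA",
    "r2": "ACGTACGTCC",  # same first 8 bases as r1
    "r3": "TTTTACGTAA",
    "r4": "acgtacgtgg",  # same prefix in lower case
    "r5": "ACGT",        # too short for an 8-base prefix
}
clusters = find_prefix_duplicates(toy_reads, prefix_len=8)
print(clusters)
```

Varying `prefix_len` gives a count of duplicated reads per compared length.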

19 Building the reference ● NC_000913: 4,639,675 bp ● Random selection of 550 000 sequences [Figure: number of duplicated reads per length]
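A random reference of this kind shows how many reads would share a start position by chance alone. The sketch below is a simplified model, not the procedure used in the talk: it assumes uniformly random start positions on both strands and ignores read length. Function names and the `seed` parameter are illustrative.

```python
import random
from collections import Counter


def expected_shared_start_fraction(genome_len, n_reads, both_strands=True):
    """Probability that a read shares its (start, strand) with another read."""
    slots = genome_len * (2 if both_strands else 1)
    return 1.0 - (1.0 - 1.0 / slots) ** (n_reads - 1)


def simulate_shared_starts(genome_len, n_reads, both_strands=True, seed=0):
    """Count simulated reads whose (start, strand) is drawn more than once."""
    rng = random.Random(seed)
    strands = "+-" if both_strands else "+"
    draws = Counter(
        (rng.randrange(genome_len), rng.choice(strands)) for _ in range(n_reads)
    )
    return sum(count for count in draws.values() if count > 1)


# Slide figures: NC_000913 length and number of randomly selected sequences
GENOME_LEN = 4_639_675
N_READS = 550_000
print(f"{expected_shared_start_fraction(GENOME_LEN, N_READS):.2%}")
```

Under these assumptions the closed form gives a little under 6% of reads sharing a start purely by chance, which is the baseline against which the observed duplication would be compared. Calling `simulate_shared_starts(GENOME_LEN, N_READS)` gives an empirical count for the same model.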

20 Duplicated reads of the 454 ● Two half runs (absolute / relative)

21 Duplicated reads and complexity ● Distance between two adjacent reads / complexity

22 What is the structure of the duplicated read graph? ● Number of couples, triplets,...
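Counting couples, triplets and larger groups amounts to a histogram of cluster sizes. A minimal sketch, assuming the duplicated reads have already been grouped into clusters of read names (the toy clusters are hypothetical):

```python
from collections import Counter


def cluster_size_histogram(clusters):
    """Map cluster size -> number of clusters (2 = couples, 3 = triplets, ...)."""
    sizes = Counter(len(cluster) for cluster in clusters)
    return dict(sorted(sizes.items()))


def reads_in_clusters(histogram):
    """Total number of reads that belong to any cluster."""
    return sum(size * count for size, count in histogram.items())


# Hypothetical clusters of read names
toy_clusters = [["r1", "r2"], ["r3", "r4"], ["r5", "r6", "r7"]]
histogram = cluster_size_histogram(toy_clusters)
print(histogram, reads_in_clusters(histogram))
```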

23 Where are the duplicated reads located on the plate? ● No specific location ● But the half runs have different profiles

24 Where are the duplicated reads on the genome? ● No specific location

25 Do the half runs share the same duplicated reads? ● No, the number of couples should drop ● Only 922 of the 57 334 reads from the second half run also exist in the first half run [Figure axes: cluster size; number of sequences]

26 Do duplicated reads have specific patterns? ● No specific pattern in: – GC % – Di-nucleotide % – Tri-nucleotide %
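The three composition measures listed here can be computed per read and compared between duplicated and non-duplicated reads. The sketch below is one possible way to do it; the function names are illustrative, and skipping k-mers that contain an N is an assumption, not something stated in the talk.

```python
from collections import Counter


def gc_percent(seq):
    """GC content as a percentage of called bases (A, C, G, T)."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)


def kmer_percent(seq, k):
    """Percentage of each overlapping k-mer (k=2 di-, k=3 tri-nucleotides)."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip k-mers containing uncertain bases such as N
    kmers = [kmer for kmer in kmers if set(kmer) <= set("ACGT")]
    counts = Counter(kmers)
    return {kmer: 100.0 * n / len(kmers) for kmer, n in sorted(counts.items())}


print(gc_percent("ATGC"), kmer_percent("ACNGT", 2))
```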

27 What happens when we are less stringent? ● Using megablast and same start (-p 98 -s 140) ● Strand of the same-start alignment results: – 1 216 659 forward/forward – 110 896 forward/reverse

28 Less stringent clustering clview

29 Validation of the observation in other runs ● GS FLX : NCBI SRA (absolute / relative)

30 Validation of the observation in other runs ● Erwinia (absolute)

31 Number of reads for Erwinia ● relative

32 Duplicated read location: Erwinia ● Differences between the lanes [Figure panels: duplicated reads; all reads]

33 What are the impacts of n-plicated reads? ● Longer assembly processing ● False SNPs ● Wrong expression measurement

34 Example of the false SNP ● Detection depends on the depth ● Origin : PCR errors,...

35 Impact of n-plicated reads on SNP detection in ESTs ● Number of SNPs removed when n-plicated sequences are removed: – Quail scc1: 5254 -> 3493 = 33.5% – Duck sap1: 2476 -> 1638 = 33.8% – Chicken sgg9: 27774 -> 25976 = 6.5%
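The percentages on this slide are the share of SNPs that disappear once n-plicated sequences are removed, that is (before - after) / before. A short check using only the slide's own counts:

```python
# (SNPs before, SNPs after) removal of n-plicated sequences, from the slide
snp_counts = {
    "Quail scc1": (5254, 3493),
    "Duck sap1": (2476, 1638),
    "Chicken sgg9": (27774, 25976),
}


def percent_removed(before, after):
    """Share of SNP calls that disappear, as a percentage of the initial count."""
    return 100.0 * (before - after) / before


reductions = {
    name: round(percent_removed(before, after), 1)
    for name, (before, after) in snp_counts.items()
}
print(reductions)
```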

36 Where are the big n-plicated clusters located? ● Are the sequences from the same cluster aligned at the same place on the genome? ● First half run ● Clusters of more than 6 reads with the same start ● 643 out of 1 245 clusters have all reads in the same contig starting at the same position ● They don't come from replicated regions of the genome

37 Is the Roche 454 suited for expression analysis? ● The n-plicated reads limit the use of the absolute number of reads in a contig as a measure of the mRNA expression level ● It is possible to use the contig average depth instead, after removal of the n-plicated reads
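Average depth can be taken as the total number of aligned bases divided by the contig length. The sketch below illustrates that definition on hypothetical read spans; collapsing identical (start, end) spans stands in for n-plicated read removal and is an assumption, not the method used in the talk.

```python
def average_depth(spans, contig_len):
    """Mean per-base coverage: aligned bases divided by contig length.

    `spans` are (start, end) pairs, 0-based with `end` exclusive.
    """
    if contig_len <= 0:
        raise ValueError("contig_len must be positive")
    return sum(end - start for start, end in spans) / contig_len


# Hypothetical contig of 200 bp with one span present three times
toy_spans = [(0, 100), (0, 100), (0, 100), (50, 150), (100, 200)]
raw_depth = average_depth(toy_spans, contig_len=200)
dedup_depth = average_depth(sorted(set(toy_spans)), contig_len=200)
print(raw_depth, dedup_depth)
```

In this toy case the raw read count (5) and raw depth overstate the signal; after collapsing the repeated span, three reads remain and the depth drops accordingly.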

38 Conclusions ● The overall quality of the reads is good: – The number of matching reads is high – The alignment of the reads is good (the closer to the expected length, the better) – An n-plicated read search has to be conducted on all runs ● 454 perhaps has no cloning bias, but it has an n-plicated read bias – N-plicated reads should be removed before assembly, SNP search and expression analysis – No criteria were found to do this really properly

39 Epilogue ● Mail from Roche, July 6th 2009 – In our experience, we typically observe an increase in redundant/duplicate reads in FLX Titanium sequencing as compared to Standard series chemistry. For standard (non-amplified) shotgun libraries, this generally translates to 20-25% redundancy with Titanium versus 15-20% for Standard series methods. – Please note that the new kits with the improved capture beads and the oil will decrease the redundancy.

40 Pyrocleaner ● Removing the n-plicated sequences to be as close as possible to the random sorting results ● Using the start and end positions
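Removal based on start and end positions can be sketched as keeping one read per alignment key. This is not Pyrocleaner's actual implementation: the `AlignedRead` record, the `remove_nplicated` function and the choice to keep the first read seen are all assumptions made for illustration.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical minimal alignment record."""
    name: str
    contig: str
    start: int
    end: int
    strand: str


def remove_nplicated(reads):
    """Keep the first read seen for each (contig, strand, start, end) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.contig, read.strand, read.start, read.end)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


toy_alignments = [
    AlignedRead("r1", "contig1", 10, 260, "+"),
    AlignedRead("r2", "contig1", 10, 260, "+"),  # same start and end as r1
    AlignedRead("r3", "contig1", 10, 240, "+"),  # same start, different end
    AlignedRead("r4", "contig1", 10, 260, "-"),  # same span, other strand
]
kept_names = [read.name for read in remove_nplicated(toy_alignments)]
print(kept_names)
```

Requiring both start and end to match is stricter than matching on the start alone; relaxing the key to (contig, strand, start) would remove more reads, at the risk of discarding reads that share a start by chance.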

