A statistical base-caller for the Illumina Genome Analyzer Wally Gilks University of Leeds.


1 A statistical base-caller for the Illumina Genome Analyzer Wally Gilks University of Leeds

2 DNA sequencing technologies: Sanger sequencing; “Next-Generation” sequencing (Roche 454, ABI SOLiD, Illumina (Solexa)); “Next-Next (3rd) Generation” sequencing (VisiGen, Helicos, Oxford Nanopore)

3 Illumina Genome Analyzer: Description of technology; Technological problems; Our statistical model for base-calling; Comparing our accuracy with Illumina’s

4 Illumina Genome Analyzer: flow cell, showing lanes (image: sbss.cap.ed.ac.uk/solexa)

5 Layout of a flow cell (diagram labels): lanes 1 to 8; tile (330 per lane); control lane

6 One tile of a flow cell: sequence clusters (30,000 per tile). Image: Chi, K.R., Nature Methods 5, 11-14 (2008)

7 DNA sample preparation (over-simplified) 1) Extract DNA 2) Randomly shatter 3) Attach adapter sequence

8 4) Attach to flow-cell surface 5) PCR-amplify into clusters

9 Sequence clusters on the flow cell (diagram): Cluster 1, Cluster 2 and Cluster 3 each hold copies of one fragment (for example A C T G A A...), attached to the flow-cell surface by an adapter sequence

10 Sequencing cycle 1: add free adapters and dye-labelled bases (template A C T G A A...)

11 Sequencing cycle 1: add block

12 Sequencing cycle 1: fire laser, record intensities

13 Light detector; frequency spectrum

14 Sequencing cycle 1: remove block

15 Sequencing cycle 2: add dye-labelled bases

16 Sequencing cycle 2: fire laser, record intensities

17 Sequencing cycle 3: fire laser, record intensities

18 Sequencing cycle 4: fire laser, record intensities

19 Illumina Genome Analyzer: Description of technology; Technological problems; Our statistical model for base-calling; Comparing our accuracy with Illumina’s

24 The “sticky-T” problem: non-specific accumulation of T dye gives a mixed signal

25 Sticky-T: solution Regress intensity for cluster c against cycle number i, for each dye k. Normalise
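
The regression step on this slide can be sketched as follows. This is a minimal illustration, not the talk's actual procedure: the slide does not say how the normalisation is done, so subtracting the fitted linear trend is an assumption, and the names `fit_line` and `detrend_cluster` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys against xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def detrend_cluster(intensities):
    """Remove a per-dye linear trend over cycles for one cluster.

    intensities: dict mapping dye k -> list of raw intensities, cycle 1 first.
    Returns a dict mapping dye k -> residuals after subtracting the fitted line.
    Assumption: subtraction stands in for the unspecified normalisation.
    """
    residuals = {}
    for dye, ys in intensities.items():
        cycles = list(range(1, len(ys) + 1))
        intercept, slope = fit_line(cycles, ys)
        residuals[dye] = [y - (intercept + slope * i) for i, y in zip(cycles, ys)]
    return residuals
```

A steadily rising T channel, as expected from dye accumulation, is flattened by this step, while cycle-to-cycle peaks remain as positive residuals.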

30 Illumina Genome Analyzer: Description of technology; Technological problems; Our statistical model for base-calling; Comparing our accuracy with Illumina’s

31 The “cross-talk” problem Ideally, base “A” would produce a strong and distinct intensity on the A dye. Similarly for the other bases. But in reality, base “A” can produce a signal on the “C” dye, and so on. This is called dye “cross-talk”.

32 Light detector Frequency spectrum

33 What is the true base at cycle 1 in cluster 1? Observations: (plot not preserved)

34 What is the true base at cycle 18 in cluster 1? Observations: (plot not preserved)

35 What is the true base at cycle 36 in cluster 1? Observations: (plot not preserved)

36 Cross-talk: solution. Model the normalised intensity at cycle i in cluster c as a 4-dimensional multivariate normal distribution whose mean vector μ and variance matrix V depend on cycle number i and true base b.
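
To make the multivariate normal model concrete, here is a small standard-library sketch of its log-density. The function names `cholesky` and `mvn_logpdf` are hypothetical, and how the mean vector and variance matrix are estimated for each cycle and base is not covered here. Cross-talk shows up as off-target entries in the mean vector: a mean for base A may have a sizeable value in the C-dye position.

```python
import math


def cholesky(V):
    """Lower-triangular Cholesky factor L of a symmetric positive-definite V (V = L L^T)."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = V[r][c] - sum(L[r][k] * L[c][k] for k in range(c))
            L[r][c] = math.sqrt(s) if r == c else s / L[c][c]
    return L


def mvn_logpdf(x, mu, V):
    """Log-density of a multivariate normal with mean mu and variance matrix V at x."""
    n = len(x)
    L = cholesky(V)
    # Forward substitution: solve L z = x - mu, so that z.z is the Mahalanobis distance squared.
    z = []
    for r in range(n):
        s = (x[r] - mu[r]) - sum(L[r][k] * z[k] for k in range(r))
        z.append(s / L[r][r])
    log_det = 2.0 * sum(math.log(L[r][r]) for r in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))
```

Evaluating `mvn_logpdf` for one observed 4-vector under each of the four candidate bases gives the likelihood terms that a base-caller compares.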

37 The “phase” problem (diagram): template A C T G A A... at cycle 4, ideal versus misphased

38 Phase problem: solution. Assume a probability of a base-incorporation error at a given cycle i that is constant over all cycles but depends on cluster c. This implies a probability, decreasing with i, of being correctly phased at cycle i.

39 The “drop-off” problem (diagram: cycle 4, ideal versus dropped off): sequencing reactions terminated, perhaps due to failure of block release

40 Drop-off problem: solution. Assume a probability of dropping off at a given cycle i that is constant over all cycles and clusters. This implies a probability, decreasing with i, of not having dropped off before cycle i.
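
A constant per-cycle event probability leads to a geometric survival probability, which can be sketched as below. The function name `survival_probability` is hypothetical, and the sketch assumes the event (drop-off, or equally a permanent phasing error) occurs independently at each of cycles 1 to i-1; the exact exponent used in the talk is not preserved in this transcript.

```python
def survival_probability(per_cycle_rate, cycle):
    """Probability that a molecule has NOT suffered the event before the given cycle.

    Assumes the event happens independently with probability per_cycle_rate
    at each of cycles 1 .. cycle-1, and that it is irreversible.
    """
    if not 0.0 <= per_cycle_rate <= 1.0:
        raise ValueError("per_cycle_rate must lie in [0, 1]")
    if cycle < 1:
        raise ValueError("cycle numbering starts at 1")
    return (1.0 - per_cycle_rate) ** (cycle - 1)
```

Even a small per-cycle rate compounds over a 36-cycle run, which is why signal quality degrades towards the end of a read.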

41 Putting it all together. We do not know when a molecule becomes misphased or drops off, so we integrate over these events. There are many identical molecules in each cluster; we assume they are independent, which motivates normal theory.

42 The resulting model of the mean intensity vector at cycle i in cluster c, when the true base is b (equation not preserved). Labels on the equation: fixed parameters; cluster-specific parameter; known base frequency

49 Illumina Genome Analyzer: Description of technology; Technological problems; Our statistical model for base-calling; Comparing our accuracy with Illumina’s

50 Base-calling. The posterior probability that cluster c at cycle i has base b is given by a formula (not preserved) built from the model described above. Call the base b that maximises this posterior.
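
The maximum-posterior rule can be illustrated with a short sketch. It is not the talk's formula, which is not preserved here: it assumes the posterior is proportional to a prior base frequency times the likelihood of the observed intensity vector under each base, and the name `call_base` is hypothetical.

```python
import math


def call_base(log_likelihood, base_frequency):
    """Return (called base, posterior dict) for one cluster at one cycle.

    log_likelihood: dict base -> log-likelihood of the observed intensities given that base.
    base_frequency: dict base -> prior probability of that base (assumed prior).
    """
    log_post = {b: log_likelihood[b] + math.log(base_frequency[b]) for b in log_likelihood}
    # Normalise with the log-sum-exp trick to avoid underflow for very negative log-likelihoods.
    top = max(log_post.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
    posterior = {b: math.exp(v - log_norm) for b, v in log_post.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior
```

The posterior of the called base doubles as a per-base measure of confidence in the call.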

59 BLASTing reads. The study should be designed with many replicates. BLAST is used to group similar reads. A consensus sequence is called for each group.
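
Consensus calling for one group of reads can be sketched as a column-wise majority vote. This is an illustrative assumption, not necessarily the method used in the study: the reads are taken to be already grouped, aligned and of equal length, and the name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Column-wise majority-vote consensus of equal-length, pre-aligned reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must all have the same length")
    # zip(*reads) yields one tuple of bases per position across all reads.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))
```

With many replicates per group, an isolated miscall in one read is outvoted by the correct calls in the others.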

60 Conclusion. Currently, our method performs about as well as the Illumina pipeline. Our method produces a posterior probability of correctness for each base call. Further work addressing heavy tails in the residuals should improve results. Others are trying to estimate the phase at each cycle for each cluster.

61 Thanks to: Irina Abnizova; Tom Skelly; Nava Whiteford; Klaus Maisinger; Next-Gen Sequencing Group, Sanger Inst.; Illumina; Oxford Nanopore Technologies

