High Throughput Sequencing: Technologies & Applications Michael Brudno CSC 2431 – Algorithms for HTS University of Toronto 06/01/2010.


1 High Throughput Sequencing: Technologies & Applications Michael Brudno CSC 2431 – Algorithms for HTS University of Toronto 06/01/2010

2 High Throughput Sequencers (figure: bases per machine run, 1 Mb to 100 Gb, plotted against read length, 10 bp to 1,000 bp; from Gabor Marth, BC)
- ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours
- 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours
- AB/SOLiDv3, Illumina/GAII short-read sequencers: 10+ Gb in 50-100 bp reads, >100M reads, 4-8 days

3 Sequencing chemistries: DNA ligation and DNA base extension (Church, 2005)

4 Massively parallel sequencing (Church, 2005)

5 Features of HTS data
- Short (for now) sequence reads: 200-400 bp for 454 (Roche); 35-100 bp for Solexa (Illumina) and SOLiD (AB)
- Huge amount of sequence per run: up to tens of gigabases
- Huge number of reads per run: up to hundreds of millions
- Higher error rate (compared with Sanger), with a different error profile

6 The Raw Data
Machine readouts are different: read length, accuracy, and error profiles are variable. All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improve.

7 454 Pyrosequencer error profile
- Multiple bases in a homopolymeric run are incorporated in a single incorporation test
- So the number of bases must be determined from a single scalar signal
- So the majority of errors are INDELs
- Error rates are nucleotide-dependent
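To make the "single scalar signal" point concrete, here is a minimal sketch of homopolymer calling. It assumes an idealized model in which one incorporated base produces a signal of about 1.0; the function name, the `unit` parameter, and the flow values are hypothetical and not taken from any 454 software.

```python
def call_homopolymer_length(signal, unit=1.0):
    """Estimate how many identical bases were incorporated in one flow.

    Assumption: the signal grows linearly with the run length, so the
    nearest integer multiple of `unit` is taken as the call.
    """
    return max(0, round(signal / unit))


# Hypothetical flow values: (nucleotide tested, measured signal).
flows = [("T", 1.04), ("A", 0.08), ("C", 2.91), ("G", 5.48)]
sequence = "".join(base * call_homopolymer_length(sig) for base, sig in flows)
print(sequence)  # TCCCGGGGG
```

The last flow (5.48) sits close to the boundary between 5 and 6 bases. A small amount of noise would change the called run length, which is why miscalls show up as insertions or deletions rather than substitutions.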

8 Illumina/Solexa base accuracy Error rate grows as a function of base position within the read A large fraction of the reads contains 1 or 2 errors

9 AB SOLiD System dibase sequencing
2-base, 4-color: 16 probe combinations (figure: probes NNNTGzzz, NNNGAzzz, NNNATzzz with 3’ and 5’ ends labeled, and a 4x4 color matrix of 1st base against 2nd base, colors 0-3)
- 4 dyes encode the 16 2-base combinations
- Detecting a single color indicates 4 combinations and eliminates 12
- Each color reflects position, not the base call
- Each base is interrogated by two probes
- Dual interrogation eases discrimination of errors (random or systematic) from SNPs (true polymorphisms)

10 Converting dibase (color) into letters
The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of the two bases is known. Without a known base there are 4 possible sequences; in the slide's example these are AACAAGCCTC, CCACCTAAGA, GGTGGATTCT, and TTGTTCGGAG.
(figure: 4x4 color matrix of 1st base against 2nd base, colors 0-3)

11 SOLiD error checking code
- No change: ACGGTCGTCGTGTGCGT
- SNP: ACGGTCGCCGTGTGCGT
- Measurement error: ACGGTCGTCGTGTGCGT

12 SOLiD Error rate & QVs

13 Pacific Biosciences (PacBio)

14

15 Current and future application areas
- De novo genome sequencing
- Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery (figure: a deletion and a SNP against the reference genome)
- Short-read sequencing will be (at least) an alternative to microarrays for: DNA-protein interaction analysis (ChIP-Seq), novel transcript discovery, quantification of gene expression, epigenetic analysis (methylation profiling)

16 What’s in it for us?
Fields: Vision/Graphics, Machine Learning, String Algorithms, Systems, Databases, Human-Computer Interaction
Topics: Image Management, Base Calling, Probabilistic Models, Variant Calling, Read Mapping, Assembly, Data Storage, Cloud Computing, Data Management, Data Integrity, Data Representation for Biologists

17 Fundamental informatics challenges
1. Interpreting machine readouts: base calling, base error estimation
2. Alignment of billions of reads
3. Dealing with non-uniqueness in the genome: resequenceability

18 Informatics challenges (cont’d)
4. SNP, short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management

19 High Throughput Sequencing: Technologies & Applications Questions?

20 SHRiMP: SHort Read Mapping Package
- Fast mapping algorithm: spaced seed hashing; vectored (very fast) Smith-Waterman; handles micro insertions/deletions
- Specialized algorithm for aligning color-space (AB SOLiD) reads
- Computes p-values (and other statistics)

21 Regular Smith-Waterman (figure: alignment matrix of ACTAGACTTG against TCCAGT, marking the cell being computed and the previously computed cells)
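As a reference point for the vectorized variants discussed later, the following is a minimal sketch of the standard Smith-Waterman recurrence with a linear gap penalty. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not the parameters used by SHRiMP.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # Each cell depends only on its upper, left, and upper-left neighbors.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


print(smith_waterman("ACTAGACTTG", "TCCAGT"))
```

Because each cell needs its left neighbor, the cells of one row cannot be filled independently, which is what makes a naive row-wise vectorization awkward.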

22 Fast Local Alignment: BLAST (Altschul et al., 1990) and FASTA (Pearson, 1987)
Example sequences shown for both methods:
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

23 SHRiMP overview (figure: genome and reads feed a hashing step, followed by vectored Smith-Waterman). SHRiMP uses spaced seeds.
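A minimal sketch of spaced-seed hashing follows. A spaced seed is a mask such as `1101`: only positions marked `1` contribute to the hash key, so a mismatch at a `0` position still produces a hit. The helper names and the choice to index the reads and scan the genome are assumptions made for illustration; the slide does not specify these details.

```python
from collections import defaultdict


def spaced_kmer(seq, start, seed):
    """Extract the bases at the '1' positions of the seed mask."""
    return "".join(seq[start + i] for i, bit in enumerate(seed) if bit == "1")


def build_index(reads, seed):
    """Map each spaced k-mer to the (read id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for off in range(len(read) - len(seed) + 1):
            index[spaced_kmer(read, off, seed)].append((rid, off))
    return index


def scan_genome(genome, index, seed):
    """Yield (read id, genome position, read offset) for every seed hit."""
    for pos in range(len(genome) - len(seed) + 1):
        for rid, off in index.get(spaced_kmer(genome, pos, seed), ()):
            yield rid, pos, off


seed = "1101"
reads = ["ACTT"]                  # differs from the genome at the masked position
index = build_index(reads, seed)
hits = list(scan_genome("GGACGTGG", index, seed))
print(hits)  # [(0, 2, 0)]
```

Each hit would then be passed to a full Smith-Waterman alignment around the hit position.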

24 Vectored Instructions
Modern computers can perform the same operation on several elements at once (SIMD). Can we take advantage of vectorized instructions in Smith-Waterman?
Example: [6 3 9 1] + [4 8 4 5] = [10 11 13 6]; max([6 3 9 1], [4 8 4 5]) = [6 8 9 5]

25 Vectorizing Smith-Waterman (1st try) (figure: alignment matrix of ACTAGACTTG against TCCAGT, marking the cell being computed and the previously computed cells)

26 Vectorizing Smith-Waterman (Wozniak, 1997) (figure: alignment matrix of ACTAGACTTG against TCCAGT with the current, previous, and penultimate diagonals marked)
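The "current / previous / penultimate" labels refer to filling the matrix one anti-diagonal at a time. The sketch below shows that traversal order in plain Python under assumed scoring values. It is not SIMD code; it only illustrates that every cell on one anti-diagonal depends solely on the two preceding anti-diagonals, so a vector unit could compute them together.

```python
def smith_waterman_antidiagonal(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score, computed one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    # Anti-diagonal d holds the cells (i, j) with i + j == d.
    for d in range(2, n + m + 1):
        i_lo, i_hi = max(1, d - m), min(n, d - 1)
        # All cells in this list read only diagonals d-1 (previous) and
        # d-2 (penultimate), so they are independent of each other.
        scores = [
            max(
                0,
                H[i - 1][d - i - 1] + (match if a[i - 1] == b[d - i - 1] else mismatch),
                H[i - 1][d - i] + gap,
                H[i][d - i - 1] + gap,
            )
            for i in range(i_lo, i_hi + 1)
        ]
        for i, s in zip(range(i_lo, i_hi + 1), scores):
            H[i][d - i] = s
        best = max([best] + scores)
    return best


print(smith_waterman_antidiagonal("ACTAGACTTG", "TCCAGT"))
```

The result is identical to the row-by-row computation; only the order in which cells are filled changes.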

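27 Vectorizing Smith-Waterman (SHRiMP) (figure: alignment matrix of ACTAGACTTG against TCCAGT with the current, previous, and penultimate diagonals marked; TCCAGT is also shown reversed as TGACCT next to a + - - - + match/mismatch vector)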

28 SHRiMP Speed
SW within SHRiMP while mapping 50,000 reads against a 4Mb contig of C. savignyi:

         Unvectored  Wozniak  Farrar  SHRiMP
Xeon     97          261      335     338
Core2    105         285      533     537

SHRiMP performance for mapping 11,200 AB SOLiD 25 bp reads to the 180Mb Ciona savignyi genome:

K-mer      (7,8)  (8,9)  (9,10)  (10,11)  (12,13)
% in SW    45%    25%    12%     7%       3%
Time (s)   2066   520    255     195      205

29 Color-space (dibase) Sequencing
Color code for each pair of adjacent bases (rows: first base; columns: second base):

     A  C  G  T
A    0  1  2  3
C    1  0  3  2
G    2  3  0  1
T    3  2  1  0
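The color table on this slide can be written compactly: with A=0, C=1, G=2, T=3, the color of a base pair equals the bitwise XOR of the two base codes. The sketch below uses that observation to encode a base sequence; the function and variable names are illustrative.

```python
# Numeric base codes chosen so that color(x, y) == code[x] XOR code[y],
# which reproduces the 4x4 table (e.g. A->C is 1, C->G is 3, T->T is 0).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """Encode a base string as one color per pair of adjacent bases."""
    return "".join(str(CODE[x] ^ CODE[y]) for x, y in zip(bases, bases[1:]))


print(to_colors("TTGAGTTATGGAT"))  # 012210331023
```

A sequence of n bases yields n-1 colors, so the first base must be known separately to recover the letters.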

30 Mapping reads in Color-space
G: TTGAGTTATGGAT  012210331023
R: TTGACTTATGGAT  012120331023

SNPs:
TGAGTT  12210
TGACTT  12120
TGAATT  12030
TGATTT  12300

INDELS:
TGAGTTA   122103
TGA-TTA   12-303
TGAGTTTA  1221003
TGAGTATA  1221333
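The SNP examples on this slide all change two adjacent colors while keeping their XOR fixed (21, 12, 03, and 30 all XOR to 3), because the two flanking bases are unchanged. A lone color change cannot be explained by a single base substitution. The following sketch applies that rule to two equal-length color strings; the function name and labels are hypothetical, and indels are not handled.

```python
def classify_color_diffs(ref_colors, read_colors):
    """Label each color mismatch as 'SNP-consistent' or 'error'.

    Assumes both strings have the same length (no indels).
    """
    calls, i, n = [], 0, len(ref_colors)
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        j = i + 1
        # Two adjacent changed colors with an unchanged XOR match a single
        # substituted base between unchanged neighbors.
        if (j < n and ref_colors[j] != read_colors[j]
                and int(ref_colors[i]) ^ int(ref_colors[j])
                == int(read_colors[i]) ^ int(read_colors[j])):
            calls.append((i, "SNP-consistent"))
            i += 2
        else:
            calls.append((i, "error"))
            i += 1
    return calls


print(classify_color_diffs("012210331023", "012120331023"))  # [(3, 'SNP-consistent')]
print(classify_color_diffs("012210331023", "012212331023"))  # [(5, 'error')]
```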

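31 Mapping reads in Letter Space
G: TGACTTATGGAT
R: 012212331023
Letter-space translations of R shown on the slide: TTGAGTCGCAAGC (alignment marks: |||||) and CCAGACTATGGAT (alignment marks: |||||||)
(figure: color-code diagram over A, C, G, T)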

32 SOLiD Translations
Given the following read, there are 4 translations (we need an initial base):
012233102
AACTCGCAAG
CCAGATACCT
GGTCTATGGA
TTGAGCGTTC

33 SOLiD Translations
Reads begin with a known primer (‘T’), so the translation of 012233102 is TTGAGCGTTC (the other candidates were AACTCGCAAG, CCAGATACCT, and GGTCTATGGA).

34 SOLiD Translations
What if we had a sequencing error? The read 010233102 translates to:
AACCTATGGA
CCAAGCGTTC
GGTTCGCAAG
TTGGATACCT
The right translation was TTGAGCGTTC.
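The translation slides can be reproduced with a few lines of code. This sketch assumes the same XOR formulation of the color table (A=0, C=1, G=2, T=3); `translate` is an illustrative helper, not part of any SOLiD toolkit.

```python
BASES = "ACGT"  # index of each base is its numeric code


def translate(first_base, colors):
    """Decode a color string into letters, given the base that precedes it."""
    seq = [first_base]
    for c in colors:
        # The next base is the previous base's code XOR the color.
        seq.append(BASES[BASES.index(seq[-1]) ^ int(c)])
    return "".join(seq)


read = "012233102"
translations = [translate(b, read) for b in BASES]
print(translations)
# ['AACTCGCAAG', 'CCAGATACCT', 'GGTCTATGGA', 'TTGAGCGTTC']

# One miscalled color (third color 2 -> 0) changes every base after it.
print(translate("T", "010233102"))  # TTGGATACCT
```

The last line shows why translating color reads to letters before alignment is fragile: a single color error corrupts the rest of the translated read.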

35 Colour-space Smith-Waterman
- Think of 4 SW matrices stacked above one another (Frame 1 to Frame 4)
- If we have 1 read error, but otherwise a perfect match, we’ll use 2 matrices
(figure: genome against read, with four frames and a letter axis)

36 Combined Color/Letter Space SW (figure: color-code diagram over A, C, G, T with the dibase transitions AC, GT, TG, CA)


38 SHRiMP on Ciona savignyi
C. savignyi is a chordate with a very large SNP rate (5%). Mapped 22 million AB SOLiD reads to the reference C. savignyi genome (6 hours on 200 CPUs).

Example alignment:
G: 1123724  TA-ACCACGGTCACACTTGCATCAC  1123701
            || |||||||||| |||X|||||||
T:          TACACCACGGTCAGACTtGCATCAC
R: 0       T0311101130121221211313211  24

               p<.05   p<.01
Reads mapped   20%     9%
SNP rate       .039    .024
Indel rate     .004    .003
Error rate     .024    .020

39 SHRiMP Summary
- Fast mapping of short reads to a genome: handles indels & color-space reads; easy to parallelize; small memory footprint
- Computation of p-values & other statistics for hits
- Publicly available & free

40 Acknowledgments
UofT: Stephen Rumble
Stanford: Phil Lacroute, Anton Valouev, Arend Sidow
http://compbio.cs.toronto.edu/shrimp
FUNDING: NSERC, CFI, NIH


42 Why is color-space good?
- SNP discovery
- Error correction with letter & color reads (assembly)
- Can fix errors without (explicit) overlap
- Don’t just do everything in color space!

Example 1:
R1: 0  TAGACCACGGTCACACTTGCATCAC  24
       || |||||||||| |||X|||||||
T:     TACACCACGGTCAGACTtGCATCAC
R2: 0 T0311101130121221211313211  24

Example 2:
T:   TACACCACGGTCAGACTTGCATCAC
R1: T0311101130121221211013211  24
R2: T2113013122121101321103111  24
R3: T2212110132110311121130131  24

43 What are structural variations? Various examples of structural variations

44 Types of Structural Variations (1): Insertion (figure: A against REF)

45 Types of Structural Variations (2): Deletion (figure: A against REF)

46 Types of Structural Variations (3): Inversion (figure: A against REF, with 5’ and 3’ ends marked)

47 Types of Structural Variations (4): Translocation (figure: chr1 and chr2)

48 Clone-end Sequencing Approaches
1. “Fine-scale structural variation of the human genome” [Tuzun et al, 2005]
- Mapping matepairs onto the reference genome
- If the mappings of a matepair are not consistent, a structural variation is present
2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al, 2007]
- Proposed a high-throughput, massive paired-end mapping technique
- Detailed types of structural variations

49 Motivation
Tuzun & Korbel used scores that combine several factors (e.g. length, identity, quality of the sequences, concordance). Reads can map to many locations on the genome. How do we choose between them?

50 Probabilistic Framework (1)
p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes. We use p(Y) to build our probabilistic framework.

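51 Probabilistic Framework (2): Insertion
$$P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\,P(X_j \mid \mathrm{ins}=r)$$
$$P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta), \qquad \delta = |\mu_Y - (s + r)|$$
where s is the mapped distance. (figure: p(Y) with $\mu_Y - \delta$ marked; labeled $\mu_Y = (s + r)$)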

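52 Probabilistic Framework (3): Deletion
$$P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\,P(X_j \mid \mathrm{del}=r)$$
$$P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta), \qquad \delta = |\mu_Y - (s - r)|$$
where s is the mapped distance. (figure: p(Y) with $\mu_Y - \delta$ marked; labeled $\mu_Y = (s - r)$)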
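The insertion and deletion probabilities can be evaluated directly from an empirical sample of mapped distances. The sketch below is one possible reading of the formulas: it estimates the probability mass inside the interval by counting observed distances, and all names and the toy data are hypothetical.

```python
from statistics import mean


def matepair_probability(distances, s, r, kind):
    """P(X | ins=r) or P(X | del=r) from an empirical sample of p(Y).

    distances: mapped distances of uniquely mapped matepairs (sample of Y)
    s: mapped distance of the matepair being scored
    r: hypothesized event size; kind: "ins" or "del"
    """
    mu = mean(distances)
    corrected = s + r if kind == "ins" else s - r
    delta = abs(mu - corrected)
    # Empirical estimate of P(mu - delta <= Y <= mu + delta).
    inside = sum(1 for y in distances if mu - delta <= y <= mu + delta)
    return 1.0 - inside / len(distances)


def joint_probability(distances, pairs, r, kind):
    """Product over matepairs, following the independence assumption."""
    p = 1.0
    for s in pairs:
        p *= matepair_probability(distances, s, r, kind)
    return p


sample = [90, 95, 100, 105, 110]          # toy sample of Y, mean 100
good = matepair_probability(sample, s=60, r=35, kind="ins")   # corrected distance 95
bad = matepair_probability(sample, s=80, r=10, kind="ins")    # corrected distance 90
print(good, bad)
```

A hypothesized event size that brings the corrected distance close to the mean gives a small interval and therefore a high probability; a poor fit gives a wide interval and a probability near zero.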

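53 Probabilistic Framework (4): Inversion
$$c - d = s(X_1) - s(X_2)$$
$$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta), \qquad \delta = |\mu_{|Y_1 - Y_2|} - (c - d)|$$
(figure: $p(|Y_1 - Y_2|)$ with $\mu_{|Y_1 - Y_2|} - \delta$ marked)

54 Probabilistic Framework (5): Translocation
$$(c - a) - (d - b) = s(X_1) - s(X_2)$$
$$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta), \qquad \delta = |\mu_{|Y_1 - Y_2|} - ((c - a) - (d - b))|$$
(figure: $p(|Y_1 - Y_2|)$ with $\mu_{|Y_1 - Y_2|} - \delta$ marked)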

55 Flow of our Framework (1)
1. Preprocessing step
- Remove short mappings
- Make all possible combinations of mappings
- Discard concordant matepairs
- Remove invalid strands (-,+)
- Get top K mappings
- Mask repeats
- Remove very similar mappings

56 Flow of our Framework (2)
2. Clustering
- Do hierarchical clustering for each structural variation (Insertion, Deletion, Inversion, Translocation)
3. Finding structural variations
- Find initial configuration
- Learn parameters for the objective function
- Find a (locally) optimal configuration

57 Hierarchical Clustering (1) (ex) Insertion
A cluster is a set of mapped locations explaining the same structural variant. Linkage distance is D(X1, X2) = - ln P(X1, X2|C), where C = {X1, X2}. (figure: X1 and X2 on A against REF)

58 Hierarchical Clustering (2)
Linkage distance is (formula not preserved in the transcript). Find the two closest clusters; if D(C_u, C_v) < cutoff, merge. (figure: reads R1 and R2, mapped locations 1-5, clusters C1 and C2)
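The merge rule can be sketched as a small agglomerative clustering loop. The transcript does not preserve the cluster-level linkage formula, so this example assumes average linkage over the pairwise distances D(X1, X2) = -ln P(X1, X2|C). That choice, the function names, and the toy probability model are all assumptions.

```python
import math


def cluster(items, pair_prob, cutoff):
    """Agglomerative clustering with D = -ln P and (assumed) average linkage."""
    def dist(cu, cv):
        total = sum(-math.log(pair_prob(x, y)) for x in cu for y in cv)
        return total / (len(cu) * len(cv))

    clusters = [[x] for x in items]
    while len(clusters) > 1:
        # Find the two closest clusters.
        d, u, v = min(
            (dist(clusters[u], clusters[v]), u, v)
            for u in range(len(clusters))
            for v in range(u + 1, len(clusters))
        )
        if d >= cutoff:          # merge only while D(C_u, C_v) < cutoff
            break
        clusters[u] += clusters.pop(v)
    return clusters


# Toy model: items are implied insertion sizes; similar sizes are more
# likely to explain the same variant.
def toy_prob(x, y):
    return math.exp(-abs(x - y) / 100)


result = cluster([100, 110, 120, 900, 910], toy_prob, cutoff=0.5)
print(result)
```

In a real pipeline `pair_prob` would come from the probabilistic framework of the preceding slides rather than from a toy function.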

59 Find a Unique Mapped Location
Assign matepairs to unique mapped locations (and hence unique clusters). (figure: reads R1 and R2, mapped locations 1-5, clusters C1 and C2, and mappings M 1,4, M 2,4, M 3,5)

60 Which Location is Best?
We define an objective function J(ω):
- ƒ1 corresponds to BLAT hit scores
- ƒ2 corresponds to the probability
- ƒ3 corresponds to the size of clusters

61 Finding the “Best” Location
- Find the initial configuration greedily: assign matepairs to clusters, starting with those with the fewest mapped locations
- Learn parameters for the objective function J(ω): we used hill climbing search to maximize the log likelihood of P(ω|λ_i)
- Finally, find a configuration that locally maximizes J(ω), using hill climbing search

62 Clustering Results
We started with ~2,984,000 matepairs:
- ~93% were uniquely mapped
- ~94% had a concordant position (mapped within μ ± 2σ)
Through the clustering procedure we found (FDR 0.05):
- 795 insertion clusters (691 had a uniquely mapped read)
- 1289 deletion clusters (1120)
- ~200 inversion clusters (~150)
- 164 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)

63 Example Deletion

64 Agreement with Previous Results
We have compared our clusters with previous studies. All of the correlations (besides one) are significant (p-values < 0.001 via Monte Carlo).

Type       All          Tuzun        Levy          Korbel        DGV-All
Insertion  795(691)     50(36)/139   109(101)/319  1(1)/34       209(169)/2216
Deletion   1289(1120)   84(70)/102   194(188)/344  275(236)/742  539(446)/4697
Inversion  ~200(~150)   198(46)/56   N/A           67(55)/105    111(87)/164

65 Translocations
- 47% of the translocations were close to the centromeres
- She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years
- The two donors are ~0.2 million years apart
- These could also be mis-assemblies

Distance to centromere:

                    <10^6   (10^6, 4.5*10^6]   >4.5*10^6
<10^6               38      36                 19
(10^6, 4.5*10^6]            3                  3
>4.5*10^6                                      65

66 Summary (Structural Variation)
- Introduced a probabilistic framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions
- Isolated hundreds of insertions, deletions, and inversions between the reference public human genome and the JCVI donor
- These results show statistically significant correlation with previous variation studies
- About 2/3 of the structural variants we isolate are not found in the Database of Genomic Variants

67 What about Copy Number Variants? Copy Number Variants are the result of duplications and deletions of large genomic segments Currently mainly found using microarray technology (ROMA, CGH) There is no algorithm for CNV finding with short reads (?) Goal: predict the number of times a certain segment appears in the genome

68 A Little Bit of Math
- Let C = #reads / length of genome
- Let i be a read
- Let x_i be the number of times it was sampled
The assembled genome should contain every read about x_i / C times. For example, let C = 3, x_i = 7.
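The worked example on this slide is simple arithmetic. A minimal sketch, with an illustrative function name:

```python
def expected_copies(x_i, C):
    """Estimated number of genomic copies of a read sampled x_i times
    at coverage C (reads per genome position)."""
    return x_i / C


copies = expected_copies(7, 3)
print(f"{copies:.2f} -> nearest integer copy count {round(copies)}")
# 2.33 -> nearest integer copy count 2
```

With C = 3 and x_i = 7 the estimate is about 2.33, so the read most plausibly occurs twice. Rounding each read independently ignores the relationships between overlapping reads.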

69 More formally
Let n = number of reads, N = length of the genome. The probability P_i that read i was sampled x_i times, given that it appears in the genome g_i times, is (formula not preserved in the transcript). We want to maximize the likelihood that all of the reads were sampled from the genome (formula not preserved in the transcript). However, there is an additional constraint.
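The slide's formulas are not preserved in this transcript. One common way to model such sampling counts, offered here purely as an assumption and not as the author's stated model, is a Poisson distribution whose rate is proportional to the copy count:

```latex
% Assumed model (not from the slides): Poisson sampling with rate g_i n / N.
P_i = P(x_i \mid g_i) = \frac{\lambda_i^{x_i} e^{-\lambda_i}}{x_i!},
\qquad \lambda_i = g_i \,\frac{n}{N}

% Likelihood over all reads, and its negative logarithm:
L = \prod_i P_i,
\qquad
-\ln L = \sum_i \left( \lambda_i - x_i \ln \lambda_i + \ln x_i! \right)
```

Under this assumed model each term of the negative log-likelihood is convex in g_i.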

70 The additional constraint…
The reads GATCGGCACT and TATCGGCACT (copy counts g_1 and g_2) both overlap the read ATCGGCACTG (copy count g_3), so g_1 + g_2 = g_3.

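71 Solving for all g_i … Simultaneously!
Instead of maximizing the product, minimize the sum of the negative logs (formula not preserved in the transcript). With constraints such as g_1 + g_2 = g_3 for the reads ATCGGCACTG, GATCGGCACT, and TATCGGCACT, this is just min-cost network flow with convex costs!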

72 Copy Count Prediction Results Simulated reads from E.Coli bacteria (4.5Mb) How to scale this to Human??? CCopy-Count Error -20+1+2+3 50x43973.9 M170186 75X074.3 M2200 100X024.5 M600 200X004.5 M400

73 Discovering Variation
- SHRiMP (SHort Read Mapping Package): computes p-values & other statistics; specialized color-space alignment
- Algorithm for Structural Variation Discovery: will it scale to short reads?
- A model for Copy Count Prediction: works well with reads from E. coli, but how to scale to Human?

