Presentation is loading. Please wait.

Presentation is loading. Please wait.

TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto.

Similar presentations


Presentation on theme: "TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto."— Presentation transcript:

1 TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto

2 Outline Lab focus Our tools SHRiMP: read mapper VARiD: SNP and indel finder Savant: genome browser Discussion

3 main focus: genomic analysis using output from high- throughput sequencing (HTS) machines higher throughput: sequence billions of nucleotides per week poorer data quality: “reads” are shorter; error profiles are poorly understood What we do

4 HTS Pipeline

5 Problem 1. Assembly ASSEMBLY reconstruct the complete (or most) of the sequenced genome our tool: unreleased

6 Problem 2. Alignment ALIGNMENT find region in a “reference” genome that matches closely with each read; suggests similar origin from “donor” our tool: SHRiMP

7 Problem 3. Genetic Variation Discovery GENETIC VARIATION DISCOVERY find differences between two genomes between donor and reference between two samples (e.g. tumour vs. normal) our tools: VARiD, MODiL, CNVer

8 Genetic Variation Single Nucleotide Polymorphism (SNP): genomes have different nucleotides at corresponding positions VARiD – VARiation IDentification Insertions and Deletions (Indels): genomes have additional sequence put in or sequence taken out at corresponding locations MODiL – Mixtures of Distributions Indel Locator Copy Number Variation (CNV): genomes have a different number of the same sequence CNVer

9 Our Tools READ MAPPING (SHRiMP) SNP DETECTION (VARiD) SNP DETECTION (VARiD) INDEL DETECTION (MODiL) INDEL DETECTION (MODiL) CNV DETECTION (CNVer) CNV DETECTION (CNVer) ASSEMBLY (UNNAMED) ASSEMBLY (UNNAMED) VISUALIZATION (SAVANT) VISUALIZATION (SAVANT)

10 SHRIMP – SHORT READ MAPPING PACKAGE Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

11 Key SHRiMP Features High Sensitivity Support for common formats (SAM, FASTQ, etc) Flexible seeding framework Multi-threading Full support for SOLiD and Illumina (and 454) reads Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

12 Sensitivity/Specificity Comparison Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

13 Runtime comparison Savant Genome Browser - http://compbio.cs.toronto.edu/savant/ Unpaired 50bp Reads Paired 75bp Reads Mapping 6 million reads to C. Savignyi (180 Mb)

14 VARID – SNP AND INDEL DETECTION Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

15 motivation | methods | results | summary Variation detection from NGS reads Reference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC Donor: ??????????????????????????????????????? GCATCGACTGCA CGGGATCGACTG Aligned reads: ATCCATTGCA GATCCACTGCAC Determine differences (variation) between reference and donor using NGS reads of the donor

16 Motivation Color-space and Letter-space platforms bring them together Motivation Color-space and Letter-space platforms bring them together Methods Summary Results 16

17 motivation | methods | results | summary Sequencing Platforms letter-space Sanger, 454, Illumina, etc > NC_005109.2 | BRCA1 SX3 TCAGCATCGGCATCGACTGCACAGG color-space AB SOLiD less software tools available > NC_005109.2 | BRCA1 AF3 T212313230313232121311120 many differences -> useful to combine this information sequencing biases inherent errors advantages 17

18 Color Space motivation | methods | results | summary Translation MatrixTranslation Automata 18

19 Translating > T212313230313232121311120 > T Sequencing Error vs SNP Sequencing Error > T212313230313232121311120 > T212313230310232121311120 > TCAGCATCGGCAAGCTGACGTGTCC SNP > TCAGCATCGGCATCGACTGCACAGG > TCAGCATCGGCAGCGACTGCACAGG > T212313230312332121311120 A G C T CAGC ATCGGCATCGACTGCACA GG Color Space motivation | methods | results | summary 19

20 Color Space motivation | methods | results | summary 20 clear distinction between a sequencing error and a SNP can this help us in SNP detection? sounds like it! single color change  error, 2 colors changed  (likely) SNP. Easy snp call Well covered bases Difficult Case reference in color-space reads position

21 Detection Heterozygous SNPs Homozygous SNPs Tri-allelic SNPs small indels account for various errors, quality values & misalignments Motivation variation caller to handle both letter-space & color-space reads Motivation motivation | methods | results | summary 21 VARiD system to make inferences on the donor bases variation detection

22 Methods Simple HMM Model states, emissions, transitions, FB Extended HMM Model gaps, diploids, exceptions Methods Simple HMM Model states, emissions, transitions, FB Extended HMM Model gaps, diploids, exceptions Motivation Summary Results 22

23 Statistical model for a system - states Assume that system is a Markov process with state unobserved. Markov Process: next state depends only on current state We can observe the state’s emission (output) each state has a probability distribution over outputs Hidden Markov Model (HMM) motivation | methods | results | summary 23 S1S1 S2S2 S3S3 e1e1 e2e2 e1e1 e2e2 e1e1 e2e2

24 Hidden Markov Model (HMM) motivation | methods | results | summary 24 Apply HMM to variation detection: we don’t know the state (donor), but we can observe some output determined by the state (reads)

25 Hidden Markov Model (HMM) motivation | methods | results | summary 25..... B 6 B 7 B 8 B 9..... B 6 B 7 B 7 B 8 B 8 B 9 A ACAC color 0 color 1 A ACAC color 0 color 1 A ACAC color 0 color 1 unknown donor

26 Why pairs of letters? Handle colors. AA and TT gives the same colors. Can’t just model colors The donor could be: letters: AA color 0 letters: AC color 1 : letters: TT color 0 16 combinations States motivation | methods | results | summary 26 B 6 B 7

27 States and Transitions motivation | methods | results | summary 27... B 6 B 7 B 8... B 6 B 7 B 7 B 8 AA CA AT TT : : GA CT : : AA TT.................. Possible States Transitions only certain transitions allowed when allowed, p(X t |X t-1 ) = freq(X t ) each state depends only on the previous states (Markov Process) States 16 possible states only look at second letter

28 T01020100311223 T1030101311223 T20100311223 ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC Unknown genome Color reads Letter reads Emissions motivation | methods | results | summary 28..... B 6 B 7 B 8 B 9..... B 7 B 8 color 0 color 1 A ACAC color 0 A

29 A color 1 color 2 color 3 letters A letters C 1 – 3ε ε ε ε emission probability p(em|AA) letters Tξ 1- 3ξ letters Gξ ξ Emission Probabilities motivation | methods | results | summary 29 Same color emission distribution T Different letter emission distribution T

30 E.g. For state CC: Combining emission probabilities probability that this state emitted these reads. motivation | methods | results | summary Emission Probabilities 30 T01020100311223 T1030101311223 T20100311223 ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC..... B 6 B 7 B 8 B 9.....

31 Summary unknown state donor pair at location transitions transition probabilities emissions reads at location emission probabilities motivation | methods | results | summary Simple HMM 31 B 6 B 7 AA AC color 0 color 1

32 Have set-up a form of an HMM run Forward-Backward algorithm get probability distribution over states at some position AA CA ATAT TT : : GA CT : : likely state motivation | methods | results | summary Forward-Backward Algorithm 32 Variation Detection: compare most likely state with reference: ref: GCTATCCA don:...AT...

33 Methods Simple HMM Model states, emissions, transitions, FB Extended HMM Model gaps, diploids, exceptions Methods Simple HMM Model states, emissions, transitions, FB Extended HMM Model gaps, diploids, exceptions Motivation Summary Results 33

34 Simple HMM only detects homozygous SNPs Extended HMM: short indels heterozygous SNPs complex error profiles & quality values motivation | methods | results | summary Extended HMM 34

35 Expand states Have states that include gaps emit: gap or color A- -- -G AG TG T- Have larger states, for diploids Transitions built in similar fashion as before Same algorithm, but in all we have 1600 states with very sparse transitions Expansion: Gaps and het. SNPs motivation | methods | results | summary 35

36 Emission probabilities o Support quality values o Use variable error rates for emissions Translate through the first letter o first color is incorrect o letter-space signal Donor: ACAGCATCGGCATCGACTGC 1123132303123321213 read: >T2123132303123321213 > C123132303123321213 Expansion motivation | methods | results | summary 36 Post-process putative SNPs o uncorrelated adjacent errors may support het SNPs o check putative SNPs

37 motivation | methods | results | summary blue: varid steps Summary

38 Results Motivation Methods Summary 38

39 Results motivation | methods | results | summary Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773) Color-space dataset: Compare random subsets: Corona (with AB mapper) VARiD (with SHRiMP) VARiD (with AB mapper) Conclusions: Using F-measure, the three pipelines perform very similarly. High-coverage results is as good as can be achieved 39

40 Results motivation | methods | results | summary Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773) Letter-space dataset: Compare random subsets : GigaBayes (with Mosaik) VARiD (with SHRiMP) VARiD (with Mosaik) Conclusion: Using F-measure the three pipelines perform very similarly. High-coverage results is as good as can be achieved 40

41 letter-space F-meas.0x1x2.5x5x10x Color-space 0x0.019.443.559.171.7 10x47.851.859.469.576.5 20x60.658.965.373.480.3 50x73.169.873.680.083.5 100x75.675.277.982.786.0 Results VARiD: Combining Letter-space and Color-space Data to achieve increased accuracy in at-cost comparison motivation | methods | results | summary 41

42 Summary Motivation Methods Results 42

43 Summary of VARiD HMM modeling underlying donor Treats color-space and letter-space together in the same framework no translation – take advantage of each technology’s properties accurately calls short SNPs, indels in both color- and letter-space improved results with hybrid data. Summary motivation | methods | results | summary Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) Contact: varid@cs.utoronto.ca Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) Contact: varid@cs.utoronto.ca 43

44 SAVANT GENOME BROWSER Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

45 Challenge in Genomic Data Analysis genomic data is generated in high volumes interpretation and analysis challenge typical pipeline employs many separate tools for computation and visualization Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

46 Tools for HTS data analysis ToolCostComputationVisualization Read Alignment e.g. Bowtie, BWA FreeYN File Format Conversion e.g. Galaxy, SAMTools FreeYN Other Comand-line Tools e.g. Genetic Variation Discovery, Comparitive Genomics, etc. FreeYN UCSC Genome BrowserFreeNY Integrative Genomics ViewerFreeNY GBrowseFreeNY CLC Genomics Workbench$$$YY Savant Genome Browser - http://compbio.cs.toronto.edu/savant/ substantial disconnect between the processes of computational analysis and visualization

47 Tools for HTS data analysis ToolCostComputationVisualization Read Alignment e.g. Bowtie, BWA FreeYN File Format Conversion e.g. Galaxy, SAMTools FreeYN Other Comand-line Tools e.g. Genetic Variation Discovery, Comparitive Genomics, etc. FreeYN UCSC Genome BrowserFreeNY Integrative Genomics ViewerFreeNY GBrowseFreeNY CLC Genomics Workbench$$$YY Savant Genome BrowserFreeYY Savant Genome Browser - http://compbio.cs.toronto.edu/savant/ substantial disconnect between the processes of computational analysis and visualization

48 Savant Genome Browser platform for integrated visual analysis of genomic data feature-rich genome browser computationally extensible via plugin framework Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

49 (Very) Short List of Features FASTA, BAM (local and remote), BED, WIG, GFF, tab-delimited Data Format Support very fast data access (<1s); small memory footprint (<250 MB) Speed and Interactivity pack, squish for BED and GFF tracks; mismatch, SNP, matepair modes for BAM tracks Alternative Visualization Modes enables powerful analytic extensions to the browser Plugin Framework sessions, bookmarking of interesting regions, track locking, data selection Extras works on all major operating systems; virtually no hardware requirements System Requirements Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

50 FEATURE DEMONSTRATION Savant Genome Browser - http://compbio.cs.toronto.edu/savant/ INTERFACE HTS READ ALIGNMENTS EXAMPLE PLUGINS: SNP FINDER

51 Power of visual analytics task: find the correct parameter for command-line tool Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

52 Plugin Framework unlocks the potential for performing visual analytics mutually beneficial for both users and tool developers for users: perform complex data analyses on-the-fly within a visual environment for programmers: platform for simple development and deployment of various programs Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

53 Plugin Development plugin development is easy API contains over a hundred prebuilt functions (e.g. get track data, add bookmarks, draw custom graphics, etc.) SDK includes API and example plugin project on website Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

54 CONCLUSIONS Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

55 Conclusions Savant is a platform for integrated visualization and analysis of genomic data stand-alone genome browser novel features: e.g. table view, visualization modes, data selection, etc. computationally extensible through plugin framework makes interpretation and analysis of genomic data easier and more efficient Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

56 Acknowledgements RecepAndrewVladMike Brudno YueMarc Vanessa Orion JoeNilgun Paul Vera MiskoYoni

57 Questions? SHRiMP http://compbio.cs.toronto.edu/shrimp VARiD http://compbio.cs.toronto.edu/varid Savant Genome Browser http://compbio.cs.toronto.edu/savant


Download ppt "TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto."

Similar presentations


Ads by Google