TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto.


Outline: lab focus; our tools (SHRiMP: read mapper; VARiD: SNP and indel finder; Savant: genome browser); discussion.

What we do. Main focus: genomic analysis using output from high-throughput sequencing (HTS) machines. Higher throughput: billions of nucleotides sequenced per week. Poorer data quality: "reads" are shorter, and error profiles are poorly understood.

HTS Pipeline

Problem 1. Assembly: reconstruct all (or most) of the sequenced genome. Our tool: unreleased.

Problem 2. Alignment: for each read, find the region in a "reference" genome that closely matches it; a close match suggests the read came from the corresponding region of the "donor" genome. Our tool: SHRiMP.
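The alignment problem can be illustrated with a deliberately naive sketch: slide the read along the reference and keep the ungapped placement with the fewest mismatches. This is not how SHRiMP works (the slides mention a seeding framework); the function name `best_hit` is hypothetical, and the sequences are the reference and one read shown on a later VARiD slide.

```python
def best_hit(read, reference):
    """Return (offset, mismatches) for the best ungapped placement of read.

    Brute-force scan over every offset; ties go to the leftmost offset.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


reference = "TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC"
print(best_hit("GCATCGACTGCA", reference))  # exact match at offset 9
```

A real mapper avoids the full scan by indexing seeds and only scoring candidate locations, and it also has to handle gaps.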

Problem 3. Genetic variation discovery: find differences between two genomes, either between donor and reference or between two samples (e.g. tumour vs. normal). Our tools: VARiD, MODiL, CNVer.

Genetic Variation. Single Nucleotide Polymorphism (SNP): genomes have different nucleotides at corresponding positions; tool: VARiD (VARiation IDentification). Insertions and Deletions (Indels): genomes have additional sequence put in or sequence taken out at corresponding locations; tool: MODiL (Mixtures of Distributions Indel Locator). Copy Number Variation (CNV): genomes have a different number of copies of the same sequence; tool: CNVer.

Our Tools: read mapping (SHRiMP); SNP detection (VARiD); indel detection (MODiL); CNV detection (CNVer); assembly (unnamed); visualization (Savant).

SHRIMP – SHORT READ MAPPING PACKAGE Savant Genome Browser -

Key SHRiMP Features: high sensitivity; support for common formats (SAM, FASTQ, etc.); flexible seeding framework; multi-threading; full support for SOLiD and Illumina (and 454) reads.

Sensitivity/Specificity Comparison [figure not preserved in transcript]

Runtime Comparison [figure not preserved in transcript]: mapping 6 million reads to C. savignyi (180 Mb); panels: unpaired 50 bp reads, paired 75 bp reads.

VARID – SNP AND INDEL DETECTION

motivation | methods | results | summary
Variation detection from NGS reads: determine differences (variation) between reference and donor using NGS reads of the donor.
Reference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC
Donor:     ???????????????????????????????????????
Aligned reads (alignment offsets lost in extraction): GCATCGACTGCA, CGGGATCGACTG, ATCCATTGCA, GATCCACTGCAC

Motivation: color-space and letter-space platforms; bring them together. (Section outline: Motivation, Methods, Results, Summary.)

Sequencing Platforms. Letter-space: Sanger, 454, Illumina, etc.; example read: > NC_ | BRCA1 SX3 TCAGCATCGGCATCGACTGCACAGG. Color-space: AB SOLiD; fewer software tools available; example read: > NC_ | BRCA1 AF3 T (color digits lost in extraction). Many differences (sequencing biases, inherent errors, advantages), so it is useful to combine this information.

Color Space: Translation Matrix; Translation Automata [figures not preserved in transcript]
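Since the translation matrix itself did not survive extraction, here is a minimal sketch of the standard SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names and the choice of `T` as the leading primer base are assumptions for illustration; the XOR scheme agrees with the later slide stating that AA and TT both give color 0 and AC gives color 1.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def color(a, b):
    """Color for the adjacent base pair (a, b): XOR of the 2-bit codes."""
    return CODE[a] ^ CODE[b]


def encode(seq, primer="T"):
    """Letter-space -> color-space read: primer base followed by colors."""
    prev, colors = primer, []
    for base in seq:
        colors.append(str(color(prev, base)))
        prev = base
    return primer + "".join(colors)


def decode(read):
    """Color-space read -> letter-space, translating from the primer base."""
    prev, letters = read[0], []
    for c in read[1:]:
        prev = BASE[CODE[prev] ^ int(c)]
        letters.append(prev)
    return "".join(letters)


print(encode("TCAG"))          # T0212
print(decode(encode("TCAG")))  # TCAG
```

Decoding is sequential: every letter depends on the previous one, which is why an early color error affects the rest of the translation.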

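Color Space: translating; sequencing error vs. SNP [figure; color digits lost in extraction]. The original read translates to TCAGCATCGGCATCGACTGCACAGG. Sequencing error: one wrong color corrupts every letter translated after it, giving TCAGCATCGGCAAGCTGACGTGTCC. SNP: one changed letter, as in TCAGCATCGGCAGCGACTGCACAGG, changes two adjacent colors and leaves the rest of the translation intact.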

Color Space: there is a clear distinction between a sequencing error and a SNP. Can this help us in SNP detection? Sounds like it! A single color change indicates an error; two adjacent colors changed indicate a (likely) SNP. [Figure: reference in color-space with aligned reads by position; annotated cases: easy SNP call, well-covered bases, difficult case.]
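The error-versus-SNP distinction can be checked on the sequences from the slides. The sketch below assumes the standard XOR two-base encoding (A=0, C=1, G=2, T=3) with a `T` primer; the helper names are hypothetical. Changing one letter alters exactly two adjacent colors, while changing one color alters every letter decoded after it.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq, primer="T"):
    """Return the list of colors for seq, starting from the primer base."""
    prev, colors = primer, []
    for base in seq:
        colors.append(CODE[prev] ^ CODE[base])
        prev = base
    return colors


def decode(colors, primer="T"):
    """Translate a list of colors back to letters, starting from the primer."""
    prev, letters = primer, []
    for c in colors:
        prev = BASE[CODE[prev] ^ c]
        letters.append(prev)
    return "".join(letters)


ref = "TCAGCATCGGCATCGACTGCACAGG"
snp = "TCAGCATCGGCAGCGACTGCACAGG"  # one letter differs (index 12)

ref_colors = encode(ref)
snp_diff = [i for i, (a, b) in enumerate(zip(ref_colors, encode(snp))) if a != b]
print(snp_diff)  # two adjacent color positions: [12, 13]

err_colors = list(ref_colors)
err_colors[12] ^= 3  # simulate a single miscalled color
err_seq = decode(err_colors)
print(err_seq)  # TCAGCATCGGCAAGCTGACGTGTCC
```

The corrupted translation matches the "sequencing error" string on the slide: every letter after the bad color is swapped A<->T, C<->G.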

Motivation: a variation caller that handles both letter-space and color-space reads. VARiD: a system to make inferences on the donor bases (variation detection). Detection targets: heterozygous SNPs, homozygous SNPs, tri-allelic SNPs, and small indels, while accounting for various errors, quality values, and misalignments.

Methods. Simple HMM model: states, emissions, transitions, Forward-Backward (FB). Extended HMM model: gaps, diploids, exceptions.

Hidden Markov Model (HMM): a statistical model for a system with states. Assume the system is a Markov process whose state is unobserved. Markov process: the next state depends only on the current state. We can observe each state's emission (output); each state has a probability distribution over outputs. [Figure: states S1, S2, S3, each emitting e1 or e2.]

Hidden Markov Model (HMM): apply the HMM to variation detection. We don't know the state (donor), but we can observe some output determined by the state (reads).

Hidden Markov Model (HMM) [figure]: the unknown donor bases ... B6 B7 B8 B9 ...; the hidden states are the overlapping pairs B6B7, B7B8, B8B9; each state emits letters (A, C, ...) and colors (color 0, color 1, ...).

States. Why pairs of letters? To handle colors: AA and TT give the same color, so we can't just model colors. The donor pair B6B7 could be: letters AA (color 0), letters AC (color 1), ..., letters TT (color 0); 16 combinations in all.

States and Transitions. States: 16 possible states (AA, CA, AT, TT, ..., GA, CT, ...) for each donor pair B6B7, B7B8, ... Transitions: only certain transitions are allowed; when allowed, p(X_t | X_{t-1}) = freq(X_t), looking only at the second letter; each state depends only on the previous state (Markov process).
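A minimal sketch of the state space and transition table follows. The slide only says that "certain transitions" are allowed; this sketch assumes the allowed ones are those where consecutive pairs overlap (the second letter of the previous state equals the first letter of the next), and it reads freq(X_t) as the frequency of the new second letter. The uniform `FREQ` values are a placeholder, not numbers from the talk.

```python
from itertools import product

BASES = "ACGT"
STATES = ["".join(pair) for pair in product(BASES, repeat=2)]  # 16 pair states

# Placeholder base frequencies (assumption: uniform).
FREQ = {base: 0.25 for base in BASES}


def transition(prev, cur):
    """p(cur | prev): zero unless the pairs overlap, else freq of the new base."""
    if prev[1] != cur[0]:
        return 0.0
    return FREQ[cur[1]]


# Full 16 x 16 table as nested dicts; each row has only 4 non-zero entries.
TRANS = {s: {t: transition(s, t) for t in STATES} for s in STATES}

print(len(STATES))                                  # 16
print([t for t in STATES if TRANS["AC"][t] > 0])    # ['CA', 'CC', 'CG', 'CT']
```

Because the frequencies sum to one, every row of the table is a valid probability distribution.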

Emissions [figure]: the unknown genome (... B6 B7 B8 ...) with aligned color reads (each starting with T; color digits lost in extraction) and letter reads (ATTGCGCAATGCG, TTGGGCAATGCGA, GCGCACTGCGAC); state B7B8 emits colors (color 0, color 1, ...) and letters (A, C, ...).

Emission Probabilities. Emission probability p(em | AA):
  colors:   color 0: 1 - 3ε    color 1: ε    color 2: ε    color 3: ε
  letters:  A: 1 - 3ξ          C: ξ          G: ξ          T: ξ
A state such as TT has the same color emission distribution but a different letter emission distribution.

Emission Probabilities: combining the emission probabilities of all reads at a location gives the probability that a given state (e.g. state CC) emitted these reads [formula not preserved in transcript]. [Figure: color reads and letter reads aligned over donor positions B6 B7 B8 ...]
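The combination step can be sketched as a product over the reads covering a position, computed in log space. This assumes reads are independent given the state, that a letter read reports the second letter of the pair, and that colors follow the XOR two-base encoding; the default error rates `eps` and `xi` are placeholders, and all function names are hypothetical.

```python
import math

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def p_color(state, observed, eps):
    """Probability that pair state emits the observed color."""
    expected = CODE[state[0]] ^ CODE[state[1]]
    return 1 - 3 * eps if observed == expected else eps


def p_letter(state, observed, xi):
    """Probability that pair state emits the observed letter (second base)."""
    return 1 - 3 * xi if observed == state[1] else xi


def log_emission(state, colors, letters, eps=0.01, xi=0.01):
    """Log-probability that state emitted all color and letter observations."""
    total = 0.0
    for c in colors:
        total += math.log(p_color(state, c, eps))
    for letter in letters:
        total += math.log(p_letter(state, letter, xi))
    return total


# Three color reads and two letter reads covering one position.
print(log_emission("CC", colors=[0, 0, 1], letters=["C", "C"]))
print(log_emission("CA", colors=[0, 0, 1], letters=["C", "C"]))
```

State CC scores far higher than CA here because it explains two of the three colors and both letters.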

Simple HMM: summary. Unknown state: the donor pair at a location. Transitions: transition probabilities. Emissions: the reads at a location, with emission probabilities. [Figure: state B6B7 with candidate values AA (color 0), AC (color 1), ...]

Forward-Backward Algorithm. Having set up a form of HMM, run the Forward-Backward algorithm to get a probability distribution over the states (AA, CA, AT, TT, ..., GA, CT, ...) at each position, and take the likely state. Variation detection: compare the most likely state with the reference, e.g. ref: GCTATCCA, don: ...AT...
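For readers who want the mechanics, here is a generic, dependency-free sketch of the Forward-Backward algorithm with per-position rescaling. It is a textbook version, not VARiD's code; the two-state toy model and its numbers are invented for illustration, and `lik[i][s]` stands for the combined emission probability of the observations at position `i` under state `s`.

```python
def forward_backward(init, trans, lik):
    """Posterior state probabilities for each position of an HMM.

    init[s]     : prior probability of state s at the first position
    trans[s][t] : probability of moving from state s to state t
    lik[i][s]   : probability of the observations at position i given state s
    """
    n, k = len(lik), len(init)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at every position to avoid underflow.
    fwd = [normalise([init[s] * lik[0][s] for s in range(k)])]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append(normalise([
            lik[i][t] * sum(prev[s] * trans[s][t] for s in range(k))
            for t in range(k)
        ]))

    # Backward pass, also rescaled.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = normalise([
            sum(trans[s][t] * lik[i + 1][t] * bwd[i + 1][t] for t in range(k))
            for s in range(k)
        ])

    # Posterior is proportional to forward * backward at each position.
    return [normalise([f * b for f, b in zip(fwd[i], bwd[i])]) for i in range(n)]


# Toy model: two "sticky" states; the middle position has uninformative data.
post = forward_backward(
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    lik=[[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]],
)
most_likely = [max(range(2), key=row.__getitem__) for row in post]
print(most_likely)  # [0, 0, 0]
```

The middle position has no evidence of its own, yet its posterior strongly favours state 0 because both neighbours do; this is the benefit of smoothing over a per-position call.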

Methods. Simple HMM model: states, emissions, transitions, Forward-Backward (FB). Extended HMM model: gaps, diploids, exceptions.

Extended HMM. The simple HMM only detects homozygous SNPs. The extended HMM adds: short indels; heterozygous SNPs; complex error profiles and quality values.

Expansion: Gaps and het. SNPs. Expand the states: include states with gaps, which emit a gap or a color (e.g. A-, --, -G, AG, TG, T-), and use larger states for diploids. Transitions are built in a similar fashion as before. Same algorithm, but in all we have 1600 states with very sparse transitions.

Expansion. Emission probabilities: support quality values; use variable error rates for emissions. Translate through the first letter: first color is incorrect; letter-space signal. Example: Donor: ACAGCATCGGCATCGACTGC; read: >T > C (color digits lost in extraction). Post-process putative SNPs: uncorrelated adjacent errors may support het SNPs; check putative SNPs.
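One common way to turn quality values into variable error rates is the Phred convention, where a quality score Q corresponds to an error probability of 10^(-Q/10). The slides do not say how VARiD maps quality values to emission probabilities, so the sketch below is an assumption: it uses Phred scaling and splits the error mass evenly over the three wrong colors. Function names are hypothetical.

```python
def phred_to_error(q):
    """Error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)


def p_color_with_quality(expected, observed, q):
    """Emission probability of one observed color, given its quality score.

    Assumption: the total error probability is shared equally by the
    three incorrect colors.
    """
    err = phred_to_error(q)
    return 1 - err if observed == expected else err / 3


for q in (10, 20, 30):
    print(q, phred_to_error(q), p_color_with_quality(0, 0, q))
```

A low-quality color thus contributes a flatter emission distribution and has less influence on the inferred state than a high-quality one.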

Summary [pipeline figure not preserved in transcript; blue: VARiD steps]

Results

Results: human dataset from Harismendy et al. (NA17156, 17275, 17460, 17773), color-space dataset. Random subsets compared across three pipelines: Corona (with AB mapper), VARiD (with SHRiMP), VARiD (with AB mapper). Conclusions: by F-measure, the three pipelines perform very similarly; high-coverage results are as good as can be achieved.

Results: human dataset from Harismendy et al. (NA17156, 17275, 17460, 17773), letter-space dataset. Random subsets compared across three pipelines: GigaBayes (with Mosaik), VARiD (with SHRiMP), VARiD (with Mosaik). Conclusion: by F-measure, the three pipelines perform very similarly; high-coverage results are as good as can be achieved.
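The comparisons above are scored by F-measure. The slides do not define it, so the sketch below assumes the usual balanced F1 score, the harmonic mean of precision and recall, computed over sets of called and true variant positions. The positions in the example are made up.

```python
def f_measure(called, truth):
    """Balanced F-measure (F1) of called variant positions against a truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(called)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical positions: two of three calls are correct, one true SNP is missed.
print(f_measure(called={101, 250, 999}, truth={250, 999, 1400}))
```

Using a single combined score makes it possible to compare pipelines that trade precision for recall differently.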

Results: VARiD combining letter-space and color-space data to achieve increased accuracy in an at-cost comparison. [Table of F-measure by coverage; columns: letter-space coverage 0x, 1x, 2.5x, 5x, 10x; rows: color-space coverage; values not preserved in transcript.]

Summary

Summary of VARiD: an HMM modeling the underlying donor. Treats color-space and letter-space together in the same framework: no translation; takes advantage of each technology's properties. Accurately calls SNPs and short indels in both color- and letter-space. Improved results with hybrid data. Website: (VARiD freely available) Contact:

SAVANT GENOME BROWSER

Challenge in Genomic Data Analysis: genomic data is generated in high volumes; interpretation and analysis are a challenge; a typical pipeline employs many separate tools for computation and visualization.

Tools for HTS data analysis

Tool                                                                                  Cost   Computation   Visualization
Read alignment (e.g. Bowtie, BWA)                                                     Free   Y             N
File format conversion (e.g. Galaxy, SAMtools)                                        Free   Y             N
Other command-line tools (e.g. genetic variation discovery, comparative genomics)     Free   Y             N
UCSC Genome Browser                                                                   Free   N             Y
Integrative Genomics Viewer                                                           Free   N             Y
GBrowse                                                                               Free   N             Y
CLC Genomics Workbench                                                                $$$    Y             Y

There is a substantial disconnect between the processes of computational analysis and visualization.

Tools for HTS data analysis (continued): the same table with one row added.

Tool                                                                                  Cost   Computation   Visualization
Savant Genome Browser                                                                 Free   Y             Y

Savant Genome Browser: a platform for integrated visual analysis of genomic data; a feature-rich genome browser; computationally extensible via a plugin framework.

(Very) Short List of Features
  Data format support: FASTA, BAM (local and remote), BED, WIG, GFF, tab-delimited
  Speed and interactivity: very fast data access (<1 s); small memory footprint (<250 MB)
  Alternative visualization modes: pack, squish for BED and GFF tracks; mismatch, SNP, matepair modes for BAM tracks
  Plugin framework: enables powerful analytic extensions to the browser
  Extras: sessions, bookmarking of interesting regions, track locking, data selection
  System requirements: works on all major operating systems; virtually no hardware requirements

FEATURE DEMONSTRATION: interface; HTS read alignments; example plugins: SNP finder

Power of visual analytics. Task: find the correct parameter for a command-line tool.

Plugin Framework: unlocks the potential for performing visual analytics; mutually beneficial for users and tool developers. For users: perform complex data analyses on the fly within a visual environment. For programmers: a platform for simple development and deployment of various programs.

Plugin Development: plugin development is easy. The API contains over a hundred prebuilt functions (e.g. get track data, add bookmarks, draw custom graphics). The SDK, available on the website, includes the API and an example plugin project.

CONCLUSIONS

Conclusions: Savant is a platform for integrated visualization and analysis of genomic data. It is a stand-alone genome browser with novel features (e.g. table view, visualization modes, data selection); it is computationally extensible through a plugin framework; it makes interpretation and analysis of genomic data easier and more efficient.

Acknowledgements: Recep, Andrew, Vlad, Mike Brudno, Yue, Marc, Vanessa, Orion, Joe, Nilgun, Paul, Vera, Misko, Yoni

Questions? SHRiMP VARiD Savant Genome Browser