Introduction to bioinformatics for NGS analysis Donavan Cheng, Ph.D. Assistant Attending, Department of Pathology AMP Webinar March 26th, 2015

Presentation transcript:

Introduction to bioinformatics for NGS analysis Donavan Cheng, Ph.D. Assistant Attending, Department of Pathology AMP Webinar March 26th, 2015

Bioinformatics is a data science
Bioinformatics ~ molecular profiling: “Application of quantitative methods in statistics and computer science to manage and analyze increasingly large datasets produced by high throughput technologies.”
Molecular profiling technologies: Sanger, Arrays, Next Generation Sequencing

Applications of bioinformatics in NGS
Workflow: 1) Sequencer output 2) Sequence alignment 3) Variant calling 4) Variant annotation/filtering
1) DNA-Seq: Identify mutations (deviations from reference genome)
-Targeted capture (Amplicon/Hybridization capture)
-Exome capture
-Whole genome sequencing
2) RNA-Seq applications:
i) Evaluate transcript abundance
ii) Identify alternatively spliced transcript isoforms
iii) Identify gene fusions

Webinar agenda
1) Background: Overview of NGS sequencing technologies
2) Overview of sequence alignment methods:
-Introduction to global and local alignment methods
-Scoring matrices for alignment
-Introduction to BLAST
3) A typical analysis workflow for DNA-Seq
4) Variant calling: SNVs, Indels, Copy Number Variants and Structural Rearrangements

Next Generation Sequencing Platforms
-Illumina HiSeq 2000: 300 – 600 Gb, 6 – 11 days
-Illumina MiSeq: 1.5 Gb, 1 day
-Ion Torrent PGM: 10 Mb – 1 Gb, 6 hours
Current: Illumina HiSeq Gb 27 hours (Rapid Run); Ion Torrent PGM
Slide courtesy of Michael Berger

Next Generation Sequencing Platforms: Illumina
Library Prep, Cluster Amplification, Sequencing by Synthesis
Illumina, Inc. Slide courtesy of Michael Berger

Next Generation Sequencing Platforms Ion Torrent Life Technologies Slide courtesy of Michael Berger

Mapping sequencing output to the reference genome
How do we go from:
>Sequence1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAAC
to:
>Sequence1: chr
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAAC
for millions of reads in a timely manner?

An introduction to sequence alignment algorithms
Global alignment (Needleman-Wunsch)
-Alignment is performed from the beginning to the end of the sequence to find the best possible alignment.
-Every position in the sequence is evaluated.
-Useful if query and reference sequences are of similar length.
Local alignment (Smith-Waterman)
-Finds regions of high local similarity between query and reference sequences.
-Works even if query and reference sequences are dissimilar in length.

Dynamic programming
Global alignment: Needleman-Wunsch algorithm
Situation / Score: Match +1, Mismatch -1, Gap -1
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j)
  T(i-1,j) + Gap penalty
  T(i,j-1) + Gap penalty

Dynamic programming
2. Induction step
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j): 0 + (-1) = -1
  T(i-1,j) + Gap penalty: -1 + (-1) = -2
  T(i,j-1) + Gap penalty: -1 + (-1) = -2

Dynamic programming 2. Repeat for every (i,j) position in the matrix.

Dynamic programming
3. Traceback
Traceback moves: GAP, ALIGN, GAP
The alignment is built up one column at a time:
COELACANTH
-
COELACANTH
-P
COELACANTH
-PE
…
Final alignment:
COELACANTH
-PELICAN--
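The three steps above (initialization, induction, traceback) can be sketched in Python. This is a minimal illustration and not code from the webinar: the function name `needleman_wunsch` and its parameter names are assumptions, and the default scores are the Match +1 / Mismatch -1 / Gap -1 values from the slide. When several paths tie, this traceback prefers ALIGN over GAP, so gap placement can differ from the slide while the score stays the same.

```python
def needleman_wunsch(ref, query, match=1, mismatch=-1, gap=-1):
    """Global alignment of query against ref; returns (score, aligned_ref, aligned_query)."""
    n, m = len(ref), len(query)
    # T[i][j] = best score for aligning query[:i] with ref[:j]
    T = [[0] * (n + 1) for _ in range(m + 1)]

    # 1. Initialization: leading gaps accumulate the gap penalty
    for j in range(1, n + 1):
        T[0][j] = j * gap
    for i in range(1, m + 1):
        T[i][0] = i * gap

    # 2. Induction: fill every (i, j) cell from its three neighbours
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,   # align
                          T[i - 1][j] + gap,     # gap in ref
                          T[i][j - 1] + gap)     # gap in query

    # 3. Traceback from the bottom-right cell to the top-left cell
    i, j = m, n
    out_ref, out_query = [], []
    while i > 0 or j > 0:
        s = match if (i > 0 and j > 0 and query[i - 1] == ref[j - 1]) else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + s:
            out_ref.append(ref[j - 1]); out_query.append(query[i - 1])
            i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            out_ref.append("-"); out_query.append(query[i - 1])
            i -= 1
        else:
            out_ref.append(ref[j - 1]); out_query.append("-")
            j -= 1
    return T[m][n], "".join(reversed(out_ref)), "".join(reversed(out_query))


score, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(top)
print(bottom)
```

Both the table and the traceback cost time proportional to the product of the two sequence lengths, which is why this approach alone does not scale to millions of reads against a whole genome.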

Local alignment: Smith-Waterman
1. Initialization step
Matrix columns: COELACANTH. Matrix rows: PELICAN. The first column is set to 0 for every row (P, E, L, I, C, A, N).

Local alignment: Smith-Waterman
2. Induction step
Scoring: Match +4, Mismatch -6, Gap -2
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j)
  T(i-1,j) + Gap penalty
  T(i,j-1) + Gap penalty
  0

Local alignment: Smith-Waterman
2. Induction step (cell P vs C)
Scoring: Match +4, Mismatch -6, Gap -2
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j): -6
  T(i-1,j) + Gap penalty: -2
  T(i,j-1) + Gap penalty: -2
  0
Result: 0

Local alignment: Smith-Waterman
2. Induction step (cell E vs E)
Scoring: Match +4, Mismatch -6, Gap -2
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j): +4
  T(i-1,j) + Gap penalty: -2
  T(i,j-1) + Gap penalty: -2
  0
Result: 4 (row E so far: 0 0 0 4)

Local alignment: Smith-Waterman
Traceback step
-Start from the highest value cell (16)
-Follow arrows while score is still positive (greater than zero)
Local:
COEL-ACANTH
  ||  |||
  ELI-CAN
Global:
COELACANTH
-PELICAN--
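The local alignment procedure can be sketched the same way. The code below is an illustrative sketch and not the presenter's implementation: `smith_waterman` and its parameter names are assumptions, and the defaults are the Match +4 / Mismatch -6 / Gap -2 scores shown on the slides. The two differences from the global version are the extra 0 in the maximum and a traceback that starts at the highest cell and stops when the score reaches 0.

```python
def smith_waterman(ref, query, match=4, mismatch=-6, gap=-2):
    """Local alignment; returns (best_score, aligned_ref, aligned_query)."""
    n, m = len(ref), len(query)
    # 1. Initialization: first row and first column stay at 0
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    # 2. Induction: same recurrence as global alignment, floored at 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            T[i][j] = max(0,
                          T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # 3. Traceback: start at the highest cell, stop when the score hits 0
    i, j = best_pos
    out_ref, out_query = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        s = match if query[i - 1] == ref[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + s:
            out_ref.append(ref[j - 1]); out_query.append(query[i - 1])
            i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] + gap:
            out_ref.append("-"); out_query.append(query[i - 1])
            i -= 1
        else:
            out_ref.append(ref[j - 1]); out_query.append("-")
            j -= 1
    return best, "".join(reversed(out_ref)), "".join(reversed(out_query))


best, local_ref, local_query = smith_waterman("COELACANTH", "PELICAN")
print(best)
print(local_ref)
print(local_query)
```

With these scores, two adjacent gaps (-4) cost less than one mismatch (-6), so the best local alignment bridges the A/I difference with gaps. The order of the two gaps is a tie, so the printed alignment may place them differently from the slide.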

Scoring matrices
“User-defined”:
1. Identity: Identical = +1, Mismatch = -1
2. For DNA: Identical = +3, Transitions = -1, Transversions = -3
Observed/Empirically derived:
3. For amino acid alignments: PAM and BLOSUM
PAM (Point accepted mutations)
-Derived from global alignments of very similar proteins (> 85% similarity)
-PAM1: 1 accepted point mutation per 100 amino acids
-Use for global alignments
-Lower PAM number = strict; higher PAM number = permissive
BLOSUM (Blocks substitution matrix)
-Derived from multiple ungapped local alignments of related proteins
-BLOSUM62: built from blocks in which sequences at least 62% identical were clustered together
-Use for local alignments
-Lower BLOSUM number = permissive; higher BLOSUM number = strict
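As a small illustration of the user-defined DNA scheme above, the sketch below scores a pair of bases as identical, transition (purine to purine, or pyrimidine to pyrimidine) or transversion (purine to pyrimidine, or the reverse). The function name `dna_score` is an assumption made for this example, and the default scores are the ones on the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, identical=3, transition=-1, transversion=-3):
    """Score one pair of DNA bases under the identical/transition/transversion scheme."""
    if a == b:
        return identical
    # A transition stays within one chemical class; a transversion crosses classes
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


for pair in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
    print(pair, dna_score(*pair))
```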

Example of scoring matrices
PAM 1 scoring matrix: 1 accepted point mutation per 100 amino acids
(Matrix figure; rows and columns are the amino acids A R N D C Q E G H I L K M F P S T W Y V)

BLAST
-Solution for rapid searching of a sequence database (e.g. GenBank) for sequences that match a query sequence.
1) Generate a word seed dictionary from query sequence
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215

BLAST
2) Generate ‘neighborhood’ of possible matching words
Query word PQG: 20 x 20 x 20 possible words (PRG, PEG, PQA, …)
Apply scoring matrix: PRG 14, PEG 15, PQA 12
Keep only words scoring higher than threshold (> 13): PRG 14, PEG 15
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
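The neighborhood step can be sketched as follows. This is an illustration only: the helper names (`pair_score`, `word_score`, `neighborhood`) are assumptions, and the dictionary holds just the handful of BLOSUM62 entries needed for the three candidate words on the slide, not the full matrix. With those entries the scores come out as on the slide (PRG 14, PEG 15, PQA 12).

```python
# Partial scoring matrix: only the BLOSUM62 entries needed for this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "R"): 1, ("Q", "E"): 2, ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a symmetric substitution score for two amino acids."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum the position-by-position substitution scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=13):
    """Keep only candidate words scoring strictly above the threshold."""
    scored = {w: word_score(query_word, w) for w in candidates}
    return {w: s for w, s in scored.items() if s > threshold}


print(neighborhood("PQG", ["PRG", "PEG", "PQA"]))
```

A real search would score all 20 x 20 x 20 candidate words against every word in the query, which needs the complete matrix.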

BLAST
1) Word dictionary from query sequence: PQG, PRG, PEG, …
2) Identify regions of exact matching between the query words and the GenBank sequence
3) Seed and extend to obtain a high scoring pair (HSP)

BLAST
4) Report sequences in database with HSPs.
E-value takes into account:
a) significance of a pairwise alignment
b) size of database queried
c) scoring system used
An E-value of 0.05 means that 0.05 alignments scoring at least this well are expected by chance in a database of this size; for small E-values this is approximately a 1 in 20 probability that the result occurred by chance.
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
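The relationship between score, database size and E-value can be made concrete with the Karlin-Altschul form E = K·m·n·e^(-λS), where m is the query length, n the database length, and K and λ depend on the scoring system. The sketch below is illustrative: the function names are assumptions, and the K and λ values in the usage lines are placeholders rather than parameters of any real matrix.

```python
import math


def evalue(score, query_len, db_len, K, lam):
    """Expected number of chance alignments with at least this score (Karlin-Altschul form)."""
    return K * query_len * db_len * math.exp(-lam * score)


def evalue_to_pvalue(e):
    """Probability of seeing at least one such chance alignment, assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e)


# K and lam below are placeholder values chosen only for illustration.
e_small_db = evalue(score=50, query_len=100, db_len=1_000_000, K=0.1, lam=0.3)
e_large_db = evalue(score=50, query_len=100, db_len=2_000_000, K=0.1, lam=0.3)
print(e_small_db, e_large_db)    # the same score is less significant in a larger database
print(evalue_to_pvalue(0.05))    # close to, but slightly below, 0.05
```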

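Flavors of BLAST
Program: Query vs. Reference database
-blastn: Nucleotide sequence vs. Nucleotide database
-blastx: 6 frame translation of query nucleotide sequence vs. Amino acid database
-blastp: Amino acid sequence vs. Amino acid database
-PSI-blast: Amino acid sequence vs. Amino acid database
-tblastn: Amino acid sequence vs. 6 frame translation of reference nucleotide database
-tblastx: 6 frame translation of query nucleotide sequence vs. 6 frame translation of reference nucleotide database
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215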

DNA-Seq: MSK-IMPACT – Targeted sequencing of 341 cancer genes
1) Prepare libraries
2) Hybridize and select (NimbleGen SeqCap); probes for 340 cancer genes
3) Sequence to X (2 lanes of HiSeq 2500)
4) Align to genome and analyze
Adapted from Wagle, Berger et al., Cancer Discovery, 2:82-93, Jan 2012
Genomics Core Lab: Agnes Viale. Berger Lab.
Slide courtesy of Michael Berger

DNA-Seq: MSK-IMPACT – Targeted sequencing of 341 cancer genes
Tumor Cells: >30x coverage in Tumor
Normal Cells: >30x coverage in Normal
~ bp genomic DNA, bp read, “Paired end” sequencing
Slide courtesy of Michael Berger

Analysis workflow in DNA-Seq
General best practice workflow
1) Adaptor trimming
2) Mapping reads to the human genome [Sequence alignment]
3) Marking PCR duplicates
4) Realignment of indels
5) Base quality recalibration

1. Adaptor Trimming
Sheared genomic DNA (150bp)
Ligate sequencing adaptors
2x100bp sequencing
Forward read: 100bp sequenced, 50bp remaining
Reverse read: 100bp sequenced, 50bp remaining

1. Adaptor Trimming
Degraded FFPE DNA (70bp)
Ligate sequencing adaptors
2x100bp sequencing
Forward read and reverse read: 100bp sequenced (70bp actual/30bp adaptor)
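When the insert is shorter than the read length, as in the 70bp FFPE case, the read runs into the adaptor and those bases must be removed before alignment. A minimal sketch of exact-match 3' trimming is shown below. The function name `trim_adapter`, the `min_overlap` parameter and the example sequences are assumptions for illustration; the adaptor string is a commonly used Illumina adaptor prefix included only as an example, and production trimmers also tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adaptor from a read using exact matching only."""
    # Case 1: the whole adaptor is present somewhere in the read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway into the adaptor (longest prefix first)
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # Case 3: no adaptor found; the read is returned unchanged
    return read


ADAPTER = "AGATCGGAAGAGC"        # example adaptor prefix, for illustration only
insert = "ACGTTGCAACGTTGCA"      # made-up genomic insert
print(trim_adapter(insert + ADAPTER, ADAPTER))
print(trim_adapter(insert + ADAPTER[:5], ADAPTER))
print(trim_adapter(insert, ADAPTER))
```

A small `min_overlap` removes more true adaptor remnants but also trims genomic bases that match the adaptor start by chance, so the value is a trade-off.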

Alignment algorithms for NGS Li H. et al, Bioinformatics, 2010

BWA (Burrows-Wheeler Aligner)
Li H. et al, Bioinformatics, 2010
1) Burrows-Wheeler Transform: a data structure representation that reduces the footprint of loading the entire human genome into memory.
2) Seed and extend method, which uses Smith-Waterman local alignment to assess regions of high similarity.
3) BWA-MEM is an updated version of BWA; it provides more consistent performance across a wider range of read lengths and permits multiple primary alignments.
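To show what the Burrows-Wheeler Transform does to a string, here is a naive sketch based on sorting all rotations. It is only a teaching example with assumed function names (`bwt`, `inverse_bwt`); BWA itself builds the transform and its index with far more memory-efficient methods than this.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive, for small strings only)."""
    s = text + terminator                      # the terminator sorts before the letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)   # last column of the sorted rotations


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending the BWT column and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


encoded = bwt("banana")
print(encoded)
print(inverse_bwt(encoded))
```

The transform is reversible and tends to group identical characters together, which is what makes a compressed, searchable index of a large genome practical.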

BWA (Burrows-Wheeler Aligner) Li H. et al, Bioinformatics, 2010 Simulated Single end reads Simulated Paired end reads

3. Marking PCR duplicates
Hybridization capture: Capture, then Post-Capture PCR
Unique coverage post-deduplication

3. Marking PCR duplicates
Amplicon PCR: Non-unique coverage
-By definition, all amplicons have the same start and stop positions. Unable to deduplicate.
-Representation of template molecules in final PCR output may differ from input mix.
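The difference between the two library types can be seen in a position-based duplicate marker. The sketch below is a simplification with assumed names and a made-up read layout (dictionaries with `name`, `chrom`, `start`, `end`, `strand`, `qual`); production tools apply more detailed criteria. Reads sharing the same coordinates and strand are grouped, the highest-quality read is kept, and the rest are flagged.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates (same chrom/start/end/strand)."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r["chrom"], r["start"], r["end"], r["strand"])].append(r)
    duplicates = set()
    for group in groups.values():
        keep = max(group, key=lambda r: r["qual"])   # keep the best read of each group
        duplicates.update(r["name"] for r in group if r is not keep)
    return duplicates


# Hybridization capture: random shearing gives fragments distinct coordinates
capture = [
    {"name": "r1", "chrom": "chr9", "start": 100, "end": 250, "strand": "+", "qual": 35},
    {"name": "r2", "chrom": "chr9", "start": 100, "end": 250, "strand": "+", "qual": 30},
    {"name": "r3", "chrom": "chr9", "start": 117, "end": 262, "strand": "+", "qual": 33},
]
# Amplicon PCR: every read shares the amplicon's start and stop positions
amplicon = [
    {"name": f"a{i}", "chrom": "chr9", "start": 500, "end": 650, "strand": "+", "qual": 30 + i}
    for i in range(4)
]
print(mark_duplicates(capture))
print(mark_duplicates(amplicon))
```

On the amplicon reads this rule flags everything except one read, even though the reads may come from different template molecules, which is why position-based deduplication is not applied to amplicon data.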

3. Marking PCR duplicates
Efficient PCR: Single round PCR. All template molecules amplified.
Inefficient PCR: Multiple rounds of PCR. Not all template molecules are amplified each round.
M = Mutant reads. Variant Frequency (VF) labels in the figure: 67%, 67%, 60%, 55%

4. Indel realignment Allows accurate phasing of indels Ronak Shah

4. Indel realignment Ronak Shah

5. Base quality calibration Before After Base quality ~ f(read group, machine cycle, reported quality score, nucleotide context)
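A rough sketch of the idea behind recalibration: bases are binned by covariates such as those listed above, and the reported quality of each bin is compared with the error rate actually observed at positions where the true base is assumed known. The Phred definition Q = -10·log10(p) is standard; the `empirical_quality` helper, its pseudo-count smoothing and the example counts are assumptions made for illustration.

```python
import math


def phred(p_error):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches, observations):
    """Empirical quality of one covariate bin, smoothed with pseudo-counts (an assumption)."""
    return phred((mismatches + 1) / (observations + 2))


# Hypothetical bins keyed by (reported quality, machine cycle): (mismatches, observations)
bins = {
    (30, 1): (10, 10_000),
    (30, 100): (100, 10_000),
}
for key, (mm, obs) in bins.items():
    print(key, round(empirical_quality(mm, obs), 1))
```

In this made-up example both bins were reported as Q30, but the late-cycle bin shows ten times more mismatches, so its recalibrated quality would be lowered.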

Variant calling in DNA-Seq
Types of mutations that can be identified using DNA-Seq (MSK-IMPACT as example)
1) Point mutations (Single nucleotide variants – SNVs)
2) Insertions and deletions (Indels)
3) Copy number variants (CNVs)
4) Structural variants (e.g. translocations, inversions)

SNV calling in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
Somatic Nonsense Mutation, BRCA2 (E1395*):
-TUMOR Cell Line (HCC1395): G: 968, T: 605
-NORMAL Cell Line (HCC1395BL): G: 1021, T: 1
Loss of Heterozygosity, BRCA1 (R1751*):
-TUMOR Cell Line (HCC1395): G: 1, A: 936
-NORMAL Cell Line (HCC1395BL): G: 749, A: 771
Slide courtesy of Michael Berger
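The read counts in this example can be turned into variant allele frequencies to show why the two sites are interpreted differently. The sketch below is illustrative only: `vaf` and `classify` are assumed names, the thresholds are arbitrary placeholders, and real somatic callers use statistical models rather than fixed cut-offs.

```python
def vaf(alt_reads, ref_reads):
    """Variant allele frequency: fraction of reads carrying the alternate allele."""
    return alt_reads / (alt_reads + ref_reads)


def classify(tumor_vaf, normal_vaf):
    """Toy tumor/normal comparison with placeholder thresholds."""
    if normal_vaf < 0.02 and tumor_vaf >= 0.05:
        return "somatic"
    if 0.3 <= normal_vaf <= 0.7 and tumor_vaf > 0.9:
        return "germline heterozygous with loss of heterozygosity in tumor"
    return "other"


# Counts taken from the HCC1395 / HCC1395BL example
brca2_tumor, brca2_normal = vaf(605, 968), vaf(1, 1021)
brca1_tumor, brca1_normal = vaf(936, 1), vaf(771, 749)

print(round(brca2_tumor, 3), round(brca2_normal, 3), classify(brca2_tumor, brca2_normal))
print(round(brca1_tumor, 3), round(brca1_normal, 3), classify(brca1_tumor, brca1_normal))
```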

SNV calling in MSK-IMPACT Cibulskis K. et al, Nature Biotech, 2013

Copy number variant calling in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
Copy Number Alterations (genes labeled in the figure): NOTCH4, MLL3, MET, FGFR3, CDKN2A/B, PTEN, ERG
Slide courtesy of Michael Berger

Adjustment for GC bias
Ostensibly,
Copy number = (Sequence coverage for gene in tumor sample) / (Sequence coverage for gene in normal sample)
BUT sequence coverage can be skewed by GC content of capture probes
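One simple way to picture a GC adjustment is to rescale each probe's coverage by the median coverage of probes with similar GC content, and only then form the tumor/normal ratio. The sketch below is a simplified stand-in rather than the method used in MSK-IMPACT: the function names, the ten GC bins and the example numbers are all assumptions.

```python
import math
from collections import defaultdict
from statistics import median


def gc_normalize(coverage, gc_fraction, n_bins=10):
    """Rescale probe coverage so that every GC bin has the same median coverage."""
    def bin_of(g):
        return min(int(g * n_bins), n_bins - 1)

    bins = defaultdict(list)
    for cov, g in zip(coverage, gc_fraction):
        bins[bin_of(g)].append(cov)
    bin_median = {b: median(v) for b, v in bins.items()}
    overall = median(coverage)
    return [cov * overall / bin_median[bin_of(g)] for cov, g in zip(coverage, gc_fraction)]


def log2_ratio(tumor_cov, normal_cov):
    """Log2 tumor/normal coverage ratio; 0 means no change, 1 means a doubling."""
    return math.log2(tumor_cov / normal_cov)


# Made-up probes: the two GC-rich probes are captured half as efficiently
coverage = [100, 100, 50, 50]
gc = [0.35, 0.36, 0.62, 0.63]
print(gc_normalize(coverage, gc))
print(log2_ratio(200, 100))
```

After this adjustment the remaining differences between tumor and normal coverage are more likely to reflect copy number than probe chemistry.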

Segmentation: Identifying levels in Copy Number data RATIO SEGMENTED

IMPACT aCGH (non-NGS confirmation)

Structural variant detection in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
TUMOR Cell Line (HCC1395) vs. NORMAL Cell Line (HCC1395BL)
Structural Rearrangement on chr9 (regions with baits and with no baits): 240-kb deletion including CDKN2A/B
Slide courtesy of Michael Berger

Paired-end RNA-Seq Can Reveal Fusion Transcripts
BCR-ABL1 fusion in K-562 CML line (Chr22 and Chr9)
Discordant read pairs
Slide courtesy of Michael Berger

Paired-end RNA-Seq Can Reveal Fusion Transcripts
Discordant read pairs (Chr22 and Chr9)
Fusion-spanning individual reads
Slide courtesy of Michael Berger
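The notion of a discordant read pair can be sketched as a simple rule: the two mates either map to different chromosomes, or map much further apart than the library's fragment size allows. The function name, the `max_insert` cut-off and the coordinates below are hypothetical and chosen only to illustrate the rule.

```python
def classify_pair(chrom1, pos1, chrom2, pos2, max_insert=1000):
    """Label a read pair by where its two mates map (toy rule with a placeholder cut-off)."""
    if chrom1 != chrom2:
        return "discordant: mates on different chromosomes"
    if abs(pos2 - pos1) > max_insert:
        return "discordant: mates too far apart"
    return "concordant"


# Hypothetical coordinates, for illustration only
print(classify_pair("chr22", 23_000_000, "chr9", 133_000_000))   # pattern expected for a fusion
print(classify_pair("chr9", 21_900_000, "chr9", 22_140_000))     # pattern expected across a large deletion
print(classify_pair("chr9", 21_900_000, "chr9", 21_900_300))
```

Several independent discordant pairs pointing to the same two regions, ideally with individual reads spanning the junction, give much stronger evidence than any single pair.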

Acknowledgements
Berger Lab: Ronak Shah, Rose Brannon, Sasinya Scott, Helen Won, Gregory McDermott, Nancy Bouvier, Michael F. Berger
Clinical Bioinformatics: Ahmet Zehir, Aijazuddin Syed, Raghu Chandramohan, Abhinita Mohanty, Meera Prasad, Mustafa Syed, Sumit Middha, Elsa Wang, Ryan Ptashkin, Jack Birnbaum, Zhen Yu (Tony) Liu
Molecular Diagnostics Service: Marc Ladanyi, Maria Arcila, Liying Zhang, Ryma Benayed, Talia Mitchell, Catherine O’Reilly, Jacklyn Casanova, Angela Yannes, Anna Plentsova, Michael Zaidinski, Yvonne Chekaluk, Nana Mensah, Justyna Sadowska, Doudja Nafa, Dara Ross, Jackie Hechtman, Laetitia Borsu, Meera Hameed, David Klimstra
Bioinformatics Core: Manda Wilson, Nicholas Socci, Chris Pepper, Joanne Edington, JianJiong Gao, Niki Schultz
Genomics Core: Agnes Viale