Introduction to bioinformatics for NGS analysis
Donavan Cheng, Ph.D.
Assistant Attending, Department of Pathology
AMP Webinar, March 26th, 2015
Bioinformatics is a data science
Bioinformatics ~ molecular profiling
“Application of quantitative methods in statistics and computer science to manage and analyze increasingly large datasets produced by high throughput technologies.”
Molecular profiling technologies: Sanger, Arrays, Next Generation Sequencing
Applications of bioinformatics in NGS
Workflow: 1. Sequencer output, 2. Sequence alignment, 3. Variant calling, 4. Variant annotation/filtering
1) DNA-Seq: Identify mutations (deviations from reference genome)
-Targeted capture (Amplicon/Hybridization capture)
-Exome capture
-Whole genome sequencing
2) RNA-Seq: Applications:
i) Evaluate transcript abundance
ii) Identify alternatively spliced transcript isoforms
iii) Identify gene fusions
Webinar agenda
1) Background: Overview of NGS sequencing technologies
2) Overview of sequence alignment methods:
-Introduction to global and local alignment methods
-Scoring matrices for alignment
-Introduction to BLAST
3) A typical analysis workflow for DNA-Seq
4) Variant calling: SNVs, Indels, Copy Number Variants and Structural Rearrangements
Next Generation Sequencing Platforms
-Illumina HiSeq 2000: 300 – 600 Gb, 6 – 11 days
-Illumina MiSeq: 1.5 Gb, 1 day
-Ion Torrent PGM: 10 Mb – 1 Gb, 6 hours
-Current: Illumina HiSeq Gb 27 hours (Rapid Run)
Slide courtesy of Michael Berger
Next Generation Sequencing Platforms
Illumina: Library Prep, Cluster Amplification, Sequencing by Synthesis
Image: Illumina, Inc.
Slide courtesy of Michael Berger
Next Generation Sequencing Platforms Ion Torrent Life Technologies Slide courtesy of Michael Berger
Mapping sequencing output to the reference genome
How do we go from:
>Sequence1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAAC
to:
>Sequence1: chr
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAAC
for millions of reads in a timely manner?
An introduction to sequence alignment algorithms
Global alignment (Needleman-Wunsch)
-Alignment is performed from the beginning to the end of the sequence to find the best possible alignment.
-Every position in the sequence is evaluated.
-Useful if query and reference sequences are of similar length.
Local alignment (Smith-Waterman)
-Finds regions of high local similarity between query and reference sequences.
-Works even if query and reference sequences are dissimilar in length.
Dynamic programming
Global alignment: Needleman-Wunsch algorithm
Scores: Match = +1, Mismatch = -1, Gap = -1
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j)
  T(i-1,j) + Gap penalty
  T(i,j-1) + Gap penalty
Dynamic programming
2. Induction step
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j): 0 + (-1) = -1
  T(i-1,j) + Gap penalty: -1 + (-1) = -2
  T(i,j-1) + Gap penalty: -1 + (-1) = -2
Dynamic programming 2. Repeat for every (i,j) position in the matrix.
Dynamic programming
3. Traceback
The alignment is built one column at a time (GAP, ALIGN, GAP, …):
COELACANTH
-
COELACANTH
-P
COELACANTH
-PE
…
Final alignment:
COELACANTH
-PELICAN--
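The recurrence and traceback steps can be sketched in Python as follows. This is a minimal illustration using the slide's scores (match +1, mismatch -1, gap -1). The function name and the tie-breaking order in the traceback are assumptions, so an equally scoring alignment other than the one drawn on the slide may be returned.

```python
def needleman_wunsch(query, ref, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch. Returns (score, aligned_query, aligned_ref)."""
    rows, cols = len(ref) + 1, len(query) + 1
    T = [[0] * cols for _ in range(rows)]

    # Initialization: leading gaps are penalized in a global alignment.
    for i in range(1, rows):
        T[i][0] = i * gap
    for j in range(1, cols):
        T[0][j] = j * gap

    # Induction: fill every (i, j) cell from its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if ref[i - 1] == query[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,   # align the two characters
                          T[i - 1][j] + gap,     # gap in the query
                          T[i][j - 1] + gap)     # gap in the reference

    # Traceback: walk from the bottom-right cell back to the origin.
    i, j = len(ref), len(query)
    aligned_query, aligned_ref = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if ref[i - 1] == query[j - 1] else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + s:
            aligned_query.append(query[j - 1])
            aligned_ref.append(ref[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            aligned_query.append("-")
            aligned_ref.append(ref[i - 1])
            i -= 1
        else:
            aligned_query.append(query[j - 1])
            aligned_ref.append("-")
            j -= 1

    return (T[len(ref)][len(query)],
            "".join(reversed(aligned_query)),
            "".join(reversed(aligned_ref)))


score, aligned_query, aligned_ref = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(aligned_query)
print(aligned_ref)
```

The full matrix is kept in memory here for clarity, which costs time and space proportional to the product of the two sequence lengths.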
Local alignment: Smith-Waterman
1. Initialization step
Matrix: COELACANTH across the top; P, E, L, I, C, A, N down the side, each row starting at 0.
Local alignment: Smith-Waterman
2. Induction step
Scores: Match = +4, Mismatch = -6, Gap = -2
T(i,j) = maximum of:
  T(i-1,j-1) + Match/mismatch(i,j)
  T(i-1,j) + Gap penalty
  T(i,j-1) + Gap penalty
  0
Local alignment: Smith-Waterman
2. Induction step (first cell, P against C)
  T(i-1,j-1) + Match/mismatch(i,j) = -6
  T(i-1,j) + Gap penalty = -2
  T(i,j-1) + Gap penalty = -2
  0
The maximum is 0.
Local alignment: Smith-Waterman
2. Induction step (E against E)
  T(i-1,j-1) + Match/mismatch(i,j) = +4
  T(i-1,j) + Gap penalty = -2
  T(i,j-1) + Gap penalty = -2
  0
The maximum is 4.
Local alignment: Smith-Waterman
Traceback step
-Start from the highest value cell (16)
-Follow arrows while score is still positive (greater than zero)
Local:
COEL-ACANTH
  ||  |||
  ELI-CAN
Global:
COELACANTH
-PELICAN--
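The local alignment steps can be sketched in Python in the same way. This is a minimal illustration with the slide's scores (match +4, mismatch -6, gap -2). The function name and tie-breaking order are assumptions, so the two gaps may be placed in the opposite order from the slide while giving the same score.

```python
def smith_waterman(query, ref, match=4, mismatch=-6, gap=-2):
    """Local alignment sketch. Returns (score, aligned_query, aligned_ref)."""
    rows, cols = len(ref) + 1, len(query) + 1
    # Initialization: first row and first column stay at 0.
    T = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Induction: same recurrence as the global case, floored at 0.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if ref[i - 1] == query[j - 1] else mismatch
            T[i][j] = max(0,
                          T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Traceback: start at the highest cell, stop when the score reaches 0.
    i, j = best_pos
    aligned_query, aligned_ref = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        s = match if ref[i - 1] == query[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + s:
            aligned_query.append(query[j - 1])
            aligned_ref.append(ref[i - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == T[i - 1][j] + gap:
            aligned_query.append("-")
            aligned_ref.append(ref[i - 1])
            i -= 1
        else:
            aligned_query.append(query[j - 1])
            aligned_ref.append("-")
            j -= 1

    return (best,
            "".join(reversed(aligned_query)),
            "".join(reversed(aligned_ref)))


score, aligned_query, aligned_ref = smith_waterman("COELACANTH", "PELICAN")
print(score)
print(aligned_query)
print(aligned_ref)
```

Only the region around EL and CAN is reported; the unaligned flanks (CO, TH, P) are simply left out, which is the practical difference from the global result.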
Scoring matrices
“User-defined”:
1. Identity: Identical = +1, Mismatch = -1
2. For DNA: Identical = +3, Transitions = -1, Transversions = -3
Observed/Empirically derived:
3. For Amino Acid Alignments: PAM, BLOSUM
PAM (Point accepted mutations)
-Global alignments of very similar proteins (> 85% similarity)
-PAM1: 1 accepted point mutation per 100 amino acids
-Use for global alignments
-Lower PAM number = strict; higher PAM number = permissive
BLOSUM (Block substitution matrix)
-Multiple ungapped local alignments of related proteins
-BLOSUM62: Proteins used were at least 62% similar
-Use for local alignments
-Lower BLOSUM number = permissive; higher BLOSUM number = strict
Example of scoring matrices
PAM 1 scoring matrix: 1 accepted point mutation per 100 amino acids
Rows and columns: A R N D C Q E G H I L K M F P S T W Y V
BLAST
-Solution for rapid searching of a sequence database (e.g. GenBank) for sequences that match a query sequence.
1) Generate a word seed dictionary from query sequence
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
BLAST
2) Generate ‘neighborhood’ of possible matching words
-Query word: PQG
-20 x 20 x 20 possible words: PRG, PEG, PQA, …
-Apply scoring matrix: PRG 14, PEG 15, PQA 12
-Keep only words scoring higher than threshold (> 13): PRG 14, PEG 15
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
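The word dictionary and neighborhood steps can be sketched in Python as below. This is an illustration only, not how BLAST is implemented: the function names and the query string are hypothetical, and the partial substitution table holds just the six pair scores needed for the slide's example. Those values are assumed to be BLOSUM62 entries, which reproduce the slide's word scores of 14, 15 and 12.

```python
# Partial substitution table (assumption: BLOSUM62 entries for these pairs).
PAIR_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "R"): 1, ("Q", "E"): 2, ("G", "A"): 0,
}


def word_dictionary(query, k=3):
    """Step 1: map every length-k word in the query to its start positions."""
    words = {}
    for pos in range(len(query) - k + 1):
        words.setdefault(query[pos:pos + k], []).append(pos)
    return words


def pair_score(a, b):
    """Look up a symmetric pair score; raises KeyError for pairs not in the toy table."""
    return PAIR_SCORES[(a, b)] if (a, b) in PAIR_SCORES else PAIR_SCORES[(b, a)]


def word_score(query_word, candidate):
    """Sum the per-position substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=13):
    """Step 2: keep candidate words scoring strictly above the threshold."""
    scored = {w: word_score(query_word, w) for w in candidates}
    return {w: s for w, s in scored.items() if s > threshold}


print(word_dictionary("PQGEFG"))                      # hypothetical query
print(neighborhood("PQG", ["PRG", "PEG", "PQA"]))     # words from the slide
```

A real search would score all 20 x 20 x 20 candidate words against a full matrix; the three candidates here are only the ones named on the slide.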
BLAST
1) Word dictionary from query sequence: PQG, PRG, PEG, …
2) Identify regions of exact matching between the QUERY and a Genbank sequence
3) Seed and extend: High scoring pair (HSP)
BLAST
4) Report sequences in database with HSPs.
E-value takes into account:
a) significance of a pairwise alignment
b) size of database queried
c) scoring system used
An E-value of 0.05 means that about 0.05 hits with a score at least this high are expected by chance in a database of this size (roughly a 1 in 20 chance of seeing such a hit at random).
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
Flavors of BLAST
-blastn: nucleotide query against a nucleotide database
-blastx: 6 frame translation of nucleotide query against an amino acid database
-blastp, PSI-blast: amino acid query against an amino acid database
-tblastn: amino acid query against a 6 frame translation of a nucleotide database
-tblastx: 6 frame translation of nucleotide query against a 6 frame translation of a nucleotide database
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
DNA-Seq: MSK-IMPACT – Targeted sequencing of 341 cancer genes
-Prepare libraries
-Hybridize and select (NimbleGen SeqCap), probes for 340 cancer genes
-Sequence to X (2 lanes of HiSeq 2500)
-Align to genome and analyze
Adapted from Wagle, Berger et al., Cancer Discovery, 2:82-93, Jan 2012
Genomics Core Lab: Agnes Viale; Berger Lab
Slide courtesy of Michael Berger
DNA-Seq: MSK-IMPACT – Targeted sequencing of 341 cancer genes
-Tumor Cells: >30x coverage in Tumor
-Normal Cells: >30x coverage in Normal
-~ bp genomic DNA, bp read
-“Paired end” sequencing
Slide courtesy of Michael Berger
Analysis workflow in DNA-Seq
General best practice workflow
1) Adaptor trimming
2) Mapping reads to the human genome [Sequence alignment]
3) Marking PCR duplicates
4) Realignment of indels
5) Base quality recalibration
1. Adaptor Trimming
Sheared genomic DNA (150bp)
-Ligate sequencing adaptors
-2x100bp sequencing
-Forward and reverse reads: 100bp sequenced, 50bp remaining
1. Adaptor Trimming
Degraded FFPE DNA (70bp)
-Ligate sequencing adaptors
-2x100bp sequencing
-Forward and reverse reads: 100bp sequenced (70bp actual/30bp adaptor)
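The read-through case above (a 70bp fragment sequenced for 100 cycles) can be sketched as follows. This is a simplified illustration using exact string matching. The function name, the `min_overlap` parameter and the adaptor string are illustrative assumptions, and dedicated trimming tools also tolerate sequencing errors, which this sketch does not.

```python
# Arbitrary illustrative adaptor sequence (assumption, not tied to any kit).
EXAMPLE_ADAPTOR = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def trim_adaptor(read, adaptor=EXAMPLE_ADAPTOR, min_overlap=3):
    """Remove adaptor sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adaptor is present somewhere in the read.
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends part-way through the adaptor.
    longest = min(len(adaptor) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:-overlap]
    # Case 3: no adaptor detected, return the read unchanged.
    return read


# 70bp fragment, 100 cycles: the last 30 bases come from the adaptor.
short_insert = "ACGTACGTAC" * 7
read = (short_insert + EXAMPLE_ADAPTOR)[:100]
print(len(trim_adaptor(read)))

# 150bp fragment, 100 cycles: no adaptor is reached, nothing is trimmed.
long_insert = "ACGTACGTAC" * 15
print(len(trim_adaptor(long_insert[:100])))
```

A small `min_overlap` trims more aggressively but also removes genuine bases that happen to match the start of the adaptor, so the value is a trade-off rather than a fixed rule.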
Alignment algorithms for NGS Li H. et al, Bioinformatics, 2010
BWA (Burrows-Wheeler Aligner)
Li H. et al, Bioinformatics, 2010
1) Burrows-Wheeler Transform: a data structure representation that reduces the memory footprint of loading the entire human genome.
2) Seed and extend method, which uses Smith-Waterman local alignment to assess regions of high similarity.
3) BWA-MEM is an updated version of BWA. It provides more consistent performance across a wider range of read lengths and permits multiple primary alignments.
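To make the Burrows-Wheeler Transform concrete, here is a naive Python sketch of the transform and its inverse. It is a teaching illustration under stated assumptions, not how BWA builds its index: sorting all rotations is far too slow and memory-hungry for a genome, and the `$` end marker and function names are choices made for this example.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (naive version)."""
    text = text + "$"  # end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row ending in the end marker is the original text.
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("TAACCCTAACCC")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes a compressed, searchable index of the reference possible.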
BWA (Burrows-Wheeler Aligner) Li H. et al, Bioinformatics, 2010 Simulated Single end reads Simulated Paired end reads
3. Marking PCR duplicates
Hybridization capture: Capture, Post-Capture PCR, Unique coverage post-deduplication
3. Marking PCR duplicates
Amplicon PCR: Non-unique coverage
-By definition, all amplicons have the same start and stop positions, so reads cannot be deduplicated.
-Representation of template molecules in final PCR output may differ from input mix.
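Coordinate-based duplicate marking can be sketched as below, which also shows why it fails for amplicon data. This is a simplified illustration: the read dictionaries, field names and the rule of keeping the highest-quality read per group are assumptions for the example, and production tools apply more elaborate criteria.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, start, end and strand are treated as copies of
    one template molecule; the read with the highest quality sum is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Hybridization capture: random shearing gives distinct coordinates.
capture_reads = [
    {"name": "r1", "chrom": "chr9", "start": 100, "end": 250, "strand": "+", "qual": 900},
    {"name": "r2", "chrom": "chr9", "start": 100, "end": 250, "strand": "+", "qual": 850},
    {"name": "r3", "chrom": "chr9", "start": 130, "end": 270, "strand": "+", "qual": 880},
]
print(mark_duplicates(capture_reads))

# Amplicon PCR: every read has the same coordinates, so all but one are flagged
# even when they come from different template molecules.
amplicon_reads = [
    {"name": f"a{i}", "chrom": "chr9", "start": 100, "end": 250, "strand": "+", "qual": 900 - i}
    for i in range(5)
]
print(len(mark_duplicates(amplicon_reads)))
```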
3. Marking PCR duplicates
M = Mutant reads
-Efficient PCR: single round of PCR, all template molecules amplified.
-Inefficient PCR: multiple rounds of PCR, not all template molecules are amplified each round.
-Variant Frequency = 67%; VF = 67%; VF = 60%; VF = 55%
4. Indel realignment Allows accurate phasing of indels Ronak Shah
4. Indel realignment Ronak Shah
5. Base quality recalibration
Before / After
Base quality ~ f(read group, machine cycle, reported quality score, nucleotide context)
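The idea of modelling base quality as a function of covariates can be sketched as follows. Quality scores are Phred-scaled, Q = -10 log10(p), where p is the error probability. The sketch groups bases by covariate tuple and converts the observed mismatch rate in each group to an empirical quality. The function name, the input layout and the +1/+2 smoothing are assumptions for illustration, and it presumes mismatches at known variant sites have already been excluded.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Estimate an empirical Phred quality for each covariate combination.

    observations: iterable of (covariates, is_mismatch) pairs, where covariates
    is a tuple such as (read_group, machine_cycle, reported_quality, context).
    """
    counts = defaultdict(lambda: [0, 0])  # covariates -> [bases seen, mismatches]
    for covariates, is_mismatch in observations:
        counts[covariates][0] += 1
        counts[covariates][1] += int(is_mismatch)

    qualities = {}
    for covariates, (n_bases, n_mismatches) in counts.items():
        # Smoothed error rate so that zero mismatches does not give infinite Q.
        error_rate = (n_mismatches + 1) / (n_bases + 2)
        qualities[covariates] = -10 * math.log10(error_rate)
    return qualities


# Hypothetical bin: bases reported as Q30 at cycle 95 in context "GG".
bin_key = ("rg1", 95, 30, "GG")
observations = [(bin_key, i < 9) for i in range(998)]  # 9 mismatches in 998 bases
print(round(empirical_qualities(observations)[bin_key], 1))
```

In this made-up bin the bases were reported as Q30 but behave like Q20, which is the kind of discrepancy recalibration is meant to correct.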
Variant calling in DNA-Seq
Types of mutations that can be identified using DNA-Seq (MSK-IMPACT as example)
1) Point mutations (Single nucleotide variants, SNVs)
2) Insertions and deletions (Indels)
3) Copy number variants (CNVs)
4) Structural variants (e.g. translocations, inversions)
SNV calling in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
-Somatic Nonsense Mutation, BRCA2 (E1395*): TUMOR Cell Line (HCC1395) G: 968, T: 605; NORMAL Cell Line (HCC1395BL) G: 1021, T: 1
-Loss of Heterozygosity, BRCA1 (R1751*): TUMOR Cell Line (HCC1395) G: 1, A: 936; NORMAL Cell Line (HCC1395BL) G: 749, A: 771
Slide courtesy of Michael Berger
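The read counts in this example can be turned into variant allele fractions with a few lines of Python. The classification rule below is a deliberately naive sketch with hypothetical thresholds (`min_vaf`, `hom_vaf`), not the statistical model used by a real somatic caller. It only illustrates how tumor and normal counts are compared.

```python
def vaf(ref_count, alt_count):
    """Variant allele fraction: alternate reads over total reads."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def classify(tumor, normal, min_vaf=0.05, hom_vaf=0.95):
    """Naive tumor/normal comparison. tumor and normal are (ref, alt) counts."""
    tumor_vaf, normal_vaf = vaf(*tumor), vaf(*normal)
    if tumor_vaf >= min_vaf and normal_vaf < min_vaf:
        return "somatic"
    if tumor_vaf >= hom_vaf and min_vaf <= normal_vaf < hom_vaf:
        return "germline with loss of heterozygosity"
    if normal_vaf >= min_vaf:
        return "germline"
    return "no call"


# Counts from the HCC1395 example: (reference reads, variant reads).
print(round(vaf(968, 605), 3), classify((968, 605), (1021, 1)))   # BRCA2
print(round(vaf(1, 936), 3), classify((1, 936), (749, 771)))      # BRCA1
```

Fixed thresholds like these ignore sequencing error rates, coverage and tumor purity, which is why dedicated callers use probabilistic models instead.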
SNV calling in MSK-IMPACT Cibulskis K. et al, Nature Biotech, 2013
Copy number variant calling in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
Copy Number Alterations (genes labeled): NOTCH4, MLL3, MET, FGFR3, CDKN2A/B, PTEN, ERG
Slide courtesy of Michael Berger
Adjustment for GC bias
Ostensibly,
Copy number = (Sequence coverage for gene in tumor sample) / (Sequence coverage for gene in normal sample)
BUT sequence coverage can be skewed by GC content of capture probes
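One simple way to adjust for GC content before taking the tumor/normal ratio is sketched below. The slide does not say which correction method is used, so the binning approach here (dividing each probe's coverage by the median coverage of probes with similar GC content) is an assumption chosen for brevity, as are the function names and the bin width. All coverage values are assumed to be positive.

```python
import math
from collections import defaultdict
from statistics import median


def gc_normalize(coverages, gc_fractions, bin_width=0.05):
    """Divide each probe's coverage by the median coverage of its GC bin."""
    bins = defaultdict(list)
    for coverage, gc in zip(coverages, gc_fractions):
        bins[int(gc / bin_width)].append(coverage)
    bin_median = {b: median(values) for b, values in bins.items()}
    return [coverage / bin_median[int(gc / bin_width)]
            for coverage, gc in zip(coverages, gc_fractions)]


def log2_ratios(tumor_cov, normal_cov, gc_fractions):
    """GC-adjusted log2(tumor / normal) coverage ratio for each probe."""
    tumor = gc_normalize(tumor_cov, gc_fractions)
    normal = gc_normalize(normal_cov, gc_fractions)
    return [math.log2(t / n) for t, n in zip(tumor, normal)]


# Made-up data: six probes, the high-GC probes capture less efficiently in the
# normal sample, and the third probe has lost one of two copies in the tumor.
gc = [0.41, 0.42, 0.43, 0.61, 0.62, 0.63]
normal = [100, 100, 100, 50, 50, 50]
tumor = [200, 200, 100, 100, 100, 100]
print(log2_ratios(tumor, normal, gc))
```

Dividing by a per-bin median also absorbs overall depth differences between the two samples, but it can mask real events when most probes in a bin share the same alteration, so the values are best read as relative rather than absolute copy number.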
Segmentation: Identifying levels in Copy Number data
Panels: RATIO, SEGMENTED
IMPACT aCGH (non-NGS confirmation)
Structural variant detection in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
TUMOR Cell Line (HCC1395), NORMAL Cell Line (HCC1395BL)
Structural Rearrangement on chr9 (regions with baits and with no baits): 240-kb deletion including CDKN2A/B
Slide courtesy of Michael Berger
Paired-end RNA-Seq Can Reveal Fusion Transcripts
BCR-ABL1 fusion in K-562 CML line (Chr22, Chr9)
-Discordant read pairs
Slide courtesy of Michael Berger
Paired-end RNA-Seq Can Reveal Fusion Transcripts
Chr22, Chr9
-Discordant read pairs
-Fusion-spanning individual reads
Slide courtesy of Michael Berger
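Counting discordant read pairs can be sketched as follows. This is a minimal illustration in which a pair is called discordant only when its two mates map to different chromosomes. The input layout, function name and `min_pairs` threshold are assumptions, and real fusion callers also use fusion-spanning reads, gene annotation and filters that are not shown.

```python
from collections import Counter


def fusion_candidates(read_pairs, min_pairs=2):
    """Count read pairs whose mates map to different chromosomes.

    read_pairs: iterable of ((chrom1, pos1), (chrom2, pos2)) mate positions.
    Returns chromosome pairs supported by at least min_pairs discordant pairs.
    """
    support = Counter()
    for (chrom1, _pos1), (chrom2, _pos2) in read_pairs:
        if chrom1 != chrom2:
            support[tuple(sorted((chrom1, chrom2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}


# Hypothetical mate positions (coordinates are made up for illustration).
pairs = [
    (("chr22", 1000), ("chr9", 5000)),
    (("chr9", 5100), ("chr22", 1050)),
    (("chr22", 1020), ("chr9", 5040)),
    (("chr1", 200), ("chr1", 450)),     # concordant pair, ignored
    (("chr3", 700), ("chr7", 900)),     # single discordant pair, below threshold
]
print(fusion_candidates(pairs))
```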
Acknowledgements
Berger Lab: Ronak Shah, Rose Brannon, Sasinya Scott, Helen Won, Gregory McDermott, Nancy Bouvier, Michael F. Berger
Clinical Bioinformatics: Ahmet Zehir, Aijazuddin Syed, Raghu Chandramohan, Abhinita Mohanty, Meera Prasad, Mustafa Syed, Sumit Middha, Elsa Wang, Ryan Ptashkin, Jack Birnbaum, Zhen Yu (Tony) Liu
Molecular Diagnostics Service: Marc Ladanyi, Maria Arcila, Liying Zhang, Ryma Benayed, Talia Mitchell, Catherine O’Reilly, Jacklyn Casanova, Angela Yannes, Anna Plentsova, Michael Zaidinski, Yvonne Chekaluk, Nana Mensah, Justyna Sadowska, Doudja Nafa, Dara Ross, Jackie Hechtman, Laetitia Borsu, Meera Hameed, David Klimstra
Bioinformatics Core: Manda Wilson, Nicholas Socci, Chris Pepper, Joanne Edington, JianJiong Gao, Niki Schultz
Genomics Core: Agnes Viale