Introduction to bioinformatics for NGS analysis
Donavan Cheng, Ph.D.
Assistant Attending, Department of Pathology
chengd1@mskcc.org
AMP Webinar, March 26th, 2015

Bioinformatics is a data science
Bioinformatics ~ molecular profiling
“Application of quantitative methods in statistics and computer science to manage and analyze increasingly large datasets produced by high throughput technologies.”
Molecular profiling technologies: Sanger, Arrays, Next Generation Sequencing

Applications of bioinformatics in NGS
Workflow: Sequencer output (1) -> Sequence alignment (2) -> Variant calling (3) -> Variant annotation/filtering (4)
1) DNA-Seq: Identify mutations (deviations from reference genome)
-Targeted capture (Amplicon/Hybridization capture)
-Exome capture
-Whole genome sequencing
2) RNA-Seq applications:
i) Evaluate transcript abundance
ii) Identify alternatively spliced transcript isoforms
iii) Identify gene fusions

Webinar agenda
1) Background: Overview of NGS sequencing technologies
2) Overview of sequence alignment methods:
-Introduction to global and local alignment methods
-Scoring matrices for alignment
-Introduction to BLAST
3) A typical analysis workflow for DNA-Seq
4) Variant calling
-SNVs, Indels, Copy Number Variants and Structural Rearrangements

Next Generation Sequencing Platforms
Illumina HiSeq 2000: 300 – 600 Gb (current), 6 – 11 days
Illumina HiSeq 2500: 100 Gb, 27 hours (Rapid Run)
Illumina MiSeq: 1.5 Gb, 1 day
Ion Torrent PGM: 10 Mb – 1 Gb, 6 hours
Slide courtesy of Michael Berger

Next Generation Sequencing Platforms Illumina Library Prep Cluster Amplification Sequencing by Synthesis Illumina, inc. Slide courtesy of Michael Berger

Next Generation Sequencing Platforms Ion Torrent Life Technologies Slide courtesy of Michael Berger

Mapping sequencing output to the reference genome
How do we go from:
>Sequence1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAAC
to:
>Sequence1: chr12 95629
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAAC
for millions of reads in a timely manner?

An introduction to sequence alignment algorithms
Global alignment (Needleman-Wunsch)
-Alignment is performed from the beginning to the end of the sequence to find the best possible alignment.
-Every position in the sequence is evaluated.
-Useful if query and reference sequences are of similar length.
Local alignment (Smith-Waterman)
-Finds regions of high local similarity between query and reference sequences.
-Works even if query and reference sequences are dissimilar in length.

Dynamic programming
Global alignment: Needleman-Wunsch algorithm
Situation / Score: Match +1, Mismatch -1, Gap -1
T(i,j) = maximum of:
-T(i-1,j-1) + Match/mismatch(i,j)
-T(i-1,j) + Gap penalty
-T(i,j-1) + Gap penalty

Dynamic programming
2. Induction step
T(i,j) = maximum of:
-T(i-1,j-1) + Match/mismatch(i,j): 0 + (-1) = -1
-T(i-1,j) + Gap penalty: -1 + (-1) = -2
-T(i,j-1) + Gap penalty: -1 + (-1) = -2

Dynamic programming 2. Repeat for every (i,j) position in the matrix.

Dynamic programming
3. Traceback
Each traceback move is either GAP or ALIGN, and the alignment is built up one column at a time:
COELACANTH
-
COELACANTH
-P
COELACANTH
-PE
…
COELACANTH
-PELICAN--
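The three steps above (initialization, induction, traceback) can be sketched in a few lines of Python. This is a minimal illustration using the slide's scores (match +1, mismatch -1, gap -1); the function name and the tie-breaking order in the traceback are assumptions, so an equally good alignment may differ from the one drawn on the slide.

```python
def needleman_wunsch(query, ref, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch; returns (score, aligned_query, aligned_ref)."""
    rows, cols = len(ref) + 1, len(query) + 1
    # T[i][j] = best score for aligning ref[:i] with query[:j]
    T = [[0] * cols for _ in range(rows)]
    # 1. Initialization: leading gaps accumulate the gap penalty
    for i in range(1, rows):
        T[i][0] = i * gap
    for j in range(1, cols):
        T[0][j] = j * gap
    # 2. Induction: fill every (i, j) cell from its three neighbours
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if ref[i - 1] == query[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
    # 3. Traceback from the bottom-right cell to the top-left cell
    aligned_q, aligned_r = [], []
    i, j = len(ref), len(query)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if ref[i - 1] == query[j - 1] else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + s:
            aligned_q.append(query[j - 1]); aligned_r.append(ref[i - 1])
            i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            aligned_q.append("-"); aligned_r.append(ref[i - 1])
            i -= 1
        else:
            aligned_q.append(query[j - 1]); aligned_r.append("-")
            j -= 1
    return (T[len(ref)][len(query)],
            "".join(reversed(aligned_q)),
            "".join(reversed(aligned_r)))


score, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(top)
print(bottom)
```

Every position of both sequences appears in the output, which is what makes the alignment global.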

Local alignment: Smith-Waterman
1. Initialization step
Query (columns): COELACANTH. Reference (rows): PELICAN.
The first row and the first column of the matrix are set to 0.

Local alignment: Smith-Waterman
2. Induction step
Scoring: Match +4, Mismatch -6, Gap -2
T(i,j) = maximum of:
-T(i-1,j-1) + Match/mismatch(i,j)
-T(i-1,j) + Gap penalty
-T(i,j-1) + Gap penalty
-0

Local alignment: Smith-Waterman
2. Induction step (first cell, P against C)
Scoring: Match +4, Mismatch -6, Gap -2
T(i,j) = maximum of:
-T(i-1,j-1) + Match/mismatch(i,j): -6
-T(i-1,j) + Gap penalty: -2
-T(i,j-1) + Gap penalty: -2
-0
The cell is therefore set to 0.

Local alignment: Smith-Waterman
2. Induction step (E against E)
Scoring: Match +4, Mismatch -6, Gap -2
T(i,j) = maximum of:
-T(i-1,j-1) + Match/mismatch(i,j): +4
-T(i-1,j) + Gap penalty: -2
-T(i,j-1) + Gap penalty: -2
-0
The cell is therefore set to 4.

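Local alignment: Smith-Waterman
Filled matrix as shown on the slide (one cell in row N is missing from the transcript):
      C  O  E  L  A  C  A  N  T  H
   0  0  0  0  0  0  0  0  0  0  0
P  0  0  0  0  0  0  0  0  0  0  0
E  0  0  0  4  0  0  0  0  0  0  0
L  0  0  0  0  8  6  4  2  0  0  0
I  0  0  0  0  6  4  2  0  0  0  0
C  0  1  0  0  4  2  8  6  4  2  0
A  0  0  0  0  0  8  6 12 10  8  6
N  0  0  0  0  0  6  4    16 14 12
3. Traceback step
-Start from the highest value cell (16)
-Follow arrows while score is still positive (greater than zero)
Local:
COEL-ACANTH
  || |||
  ELI-CAN
Global:
COELACANTH
-PELICAN--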
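The local alignment procedure can be sketched as follows, using the slide's scores (match +4, mismatch -6, gap -2). The function name and the tie-breaking order during traceback are assumptions. Where two gap placements score the same, the sketch may place the gaps differently from the slide while reaching the same score of 16.

```python
def smith_waterman(query, ref, match=4, mismatch=-6, gap=-2):
    """Local alignment sketch; returns (score, aligned_query, aligned_ref)."""
    rows, cols = len(ref) + 1, len(query) + 1
    # 1. Initialization: first row and first column stay at 0
    T = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # 2. Induction: same recurrence as global alignment, floored at 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if ref[i - 1] == query[j - 1] else mismatch
            T[i][j] = max(0,
                          T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)
    # 3. Traceback: start at the highest cell, stop when the score reaches 0
    aligned_q, aligned_r = [], []
    i, j = best_pos
    while i > 0 and j > 0 and T[i][j] > 0:
        s = match if ref[i - 1] == query[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + s:
            aligned_q.append(query[j - 1]); aligned_r.append(ref[i - 1])
            i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] + gap:
            aligned_q.append("-"); aligned_r.append(ref[i - 1])
            i -= 1
        else:
            aligned_q.append(query[j - 1]); aligned_r.append("-")
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_r))


score, top, bottom = smith_waterman("COELACANTH", "PELICAN")
print(score, top, bottom)
```

Only the high-similarity region is reported; the flanking CO and TH of the query and the leading P of the reference are left out of the alignment.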

Scoring matrices
1. Identity (“User-defined”): Identical = +1, Mismatch = -1
2. For DNA (“User-defined”): Identical = +3, Transitions = -1, Transversions = -3
3. For Amino Acid Alignments (Observed/Empirically derived):
PAM (Point accepted mutations)
-Global alignments of very similar proteins (> 85% similarity)
-PAM1: 1 accepted point mutation per 100 amino acids
-Use for global alignments
-Lower PAM number = strict, higher PAM number = permissive
BLOSUM (Block substitution matrix)
-Multiple ungapped local alignments of related proteins
-BLOSUM62: Proteins used were at least 62% similar
-Use for local alignments
-Lower BLOSUM number = permissive, higher BLOSUM number = strict
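As a small illustration of the DNA scoring scheme above, the sketch below scores a pair of bases using the slide's values (+3, -1, -3). A transition is a purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) change; anything else is a transversion. The function name is an assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, identical=3, transition=-1, transversion=-3):
    """Score one aligned pair of DNA bases."""
    a, b = a.upper(), b.upper()
    if a == b:
        return identical
    # Same chemical class (both purines or both pyrimidines) = transition
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


print(dna_score("A", "A"), dna_score("A", "G"), dna_score("A", "C"))
```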

Example of scoring matrices
PAM 1 scoring matrix: 1 accepted point mutation per 100 amino acids
[Matrix figure; rows and columns are the amino acids A R N D C Q E G H I L K M F P S T W Y V]

BLAST
-Solution for rapid searching of a sequence database (e.g. GenBank) for sequences that match a query sequence.
1) Generate a word seed dictionary from query sequence
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215

BLAST
2) Generate ‘neighborhood’ of possible matching words
-For the query word PQG there are 20 x 20 x 20 possible words: PRG, PEG, PQA, …
-Apply scoring matrix: PRG 14, PEG 15, PQA 12
-Keep only words scoring higher than threshold (> 13): PRG 14, PEG 15
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
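The neighborhood filter can be sketched as below. The slide does not name the scoring matrix; the handful of substitution scores used here are BLOSUM62 entries, chosen as an assumption because they reproduce the slide's word scores (14, 15, 12). Only the pairs needed for this example are included, and the function names are hypothetical.

```python
# Partial substitution table (assumed BLOSUM62 values; only pairs used below)
SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "R"): 1, ("Q", "E"): 2, ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a symmetric substitution score; KeyError if not in the table."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def word_score(query_word, candidate):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=13):
    """Keep candidate words scoring strictly above the threshold."""
    scored = {w: word_score(query_word, w) for w in candidates}
    return {w: s for w, s in scored.items() if s > threshold}


print(neighborhood("PQG", ["PRG", "PEG", "PQA"]))
```

A full implementation would score all 20 x 20 x 20 candidate words against a complete matrix rather than a short hand-picked list.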

BLAST
1) Word dictionary from query sequence: PQG, PRG, PEG, …
2) Identify regions of exact matching between the QUERY words and the GenBank sequence
3) Seed and extend to a High scoring pair (HSP); scores shown on the slide: 23, 30, 50

BLAST
4) Report sequences in database with HSPs.
E-value takes into account:
a) significance of a pairwise alignment
b) size of database queried
c) scoring system used
An E-value of 0.05 means that 0.05 alignments scoring at least this well are expected by chance in a database of this size, which corresponds to roughly a 1 in 20 probability that the result occurred by chance.
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215
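The link between an E-value and a probability can be made concrete. Assuming chance hits follow a Poisson distribution, the probability of seeing at least one chance hit is P = 1 - exp(-E), which is close to E itself when E is small. The function name below is an assumption.

```python
import math


def evalue_to_pvalue(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but is more accurate for tiny E
    return -math.expm1(-e_value)


for e in (0.05, 1.0, 10.0):
    print(e, round(evalue_to_pvalue(e), 4))
```

For E = 0.05 the probability is about 0.049, so reading it as "1 in 20" is a good approximation; for E of 1 or more the two quantities diverge.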

Flavors of BLAST
Program: query / reference database
-blastn: nucleotide sequence / nucleotide database
-blastx: 6 frame translation of query nucleotide sequence / amino acid database
-blastp, PSI-blast: amino acid sequence / amino acid database
-tblastn: amino acid sequence / 6 frame translation of reference nucleotide database
-tblastx: 6 frame translation of query nucleotide sequence / 6 frame translation of reference nucleotide database
Lobo, I. (2008) Basic Local Alignment Search Tool (BLAST). Nature Education 1(1):215

DNA-Seq: MSK-IMPACT – Targeted sequencing of 341 cancer genes
-Prepare 12-24 libraries
-Hybridize and select (NimbleGen SeqCap); probes for 340 cancer genes
-Sequence to 500-1000X (2 lanes of HiSeq 2500)
-Align to genome and analyze
Genomics Core Lab (Agnes Viale), Berger Lab
adapted from Wagle, Berger et al., Cancer Discovery, 2:82-93, Jan 2012
Slide courtesy of Michael Berger

DNA-Seq: MSK-IMPACT – Targeted sequencing of 341 cancer genes
-Tumor Cells: >30x coverage in Tumor
-Normal Cells: >30x coverage in Normal
-Genomic DNA fragments of ~200 - 400 bp
-“Paired end” sequencing with 50 - 100 bp reads
Slide courtesy of Michael Berger

Analysis workflow in DNA-Seq
General best practice workflow
1) Adaptor trimming
2) Mapping reads to the human genome [Sequence alignment]
3) Marking PCR duplicates
4) Realignment of indels
5) Base quality recalibration

1. Adaptor Trimming
-Sheared genomic DNA (150bp)
-Ligate sequencing adaptors
-2x100bp sequencing
-Forward read: 100bp sequenced, 50bp remaining
-Reverse read: 100bp sequenced, 50bp remaining

1. Adaptor Trimming
-Degraded FFPE DNA (70bp)
-Ligate sequencing adaptors
-2x100bp sequencing
-Forward read and reverse read: 100bp sequenced (70bp actual/30bp adaptor)
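The FFPE case above, where a 100bp read runs through a 70bp fragment into the adaptor, can be sketched with a simple exact-match trimmer. The adaptor string below is the commonly used Illumina adaptor prefix and is an assumption here, since the slides do not name one. The function name and the `min_overlap` parameter are hypothetical, and the sketch ignores sequencing errors within the adaptor.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adaptor prefix, not taken from the slides


def trim_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Remove adaptor sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the full adaptor occurs inside the read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adaptor fits at the very end of the read
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


insert = "ACGTTGCA" * 8 + "ACGTTG"              # 70bp fragment
read = (insert + ADAPTER + "T" * 50)[:100]      # 100bp read: 70bp actual, 30bp adaptor
print(len(trim_adapter(read)))
```

With the 150bp fragment from the previous slide, the read ends before reaching the adaptor and is returned unchanged.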

Alignment algorithms for NGS Li H. et al, Bioinformatics, 2010

BWA (Burrows-Wheeler Aligner)
1) Burrows-Wheeler Transform: a data structure representation that reduces the footprint of loading the entire human genome into memory.
2) Seed and extend method, which uses Smith-Waterman local alignment to assess regions of high similarity.
3) BWA-MEM is an updated version of BWA. It provides more consistent performance across a wider range of read lengths and permits multiple primary alignments.
Li H. et al, Bioinformatics, 2010
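To give a feel for the Burrows-Wheeler Transform itself, here is a naive sketch that sorts all rotations of a string and reads off the last column, along with a naive inverse. This is for illustration only: it is an assumption-laden toy, not how BWA is implemented, since real aligners build a compressed index over the transform rather than materializing rotations.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Naive inverse: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The original string is the row that ends with the sentinel
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("TAACCCTAACCC")))
```

The transform is reversible and tends to group identical characters together, which is what makes the indexed genome compact.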

BWA (Burrows-Wheeler Aligner) Li H. et al, Bioinformatics, 2010 Simulated Single end reads Simulated Paired end reads

3. Marking PCR duplicates
Hybridization capture:
-Capture
-Post-Capture PCR
-Unique coverage post-deduplication
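For hybridization capture data, duplicate marking can be sketched as grouping reads that share the same mapped coordinates and keeping one representative per group. The record layout (dicts with `name`, `chrom`, `start`, `end`, `strand`, `mapq`) and the choice of the highest mapping quality as the representative are assumptions for illustration; production tools use more refined criteria.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads marked as duplicates.

    Reads with identical (chrom, start, end, strand) are treated as copies
    of one template molecule; the copy with the highest mapq is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["mapq"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Made-up reads for illustration only
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 300, "strand": "+", "mapq": 60},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 300, "strand": "+", "mapq": 40},
    {"name": "r3", "chrom": "chr1", "start": 105, "end": 310, "strand": "+", "mapq": 60},
]
print(mark_duplicates(reads))
```

Randomly sheared fragments rarely share both endpoints by chance, which is why identical coordinates are taken as evidence of PCR duplication.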

3. Marking PCR duplicates
Amplicon PCR: non-unique coverage
-By definition, all amplicons have the same start and stop positions, so reads cannot be deduplicated.
-Representation of template molecules in final PCR output may differ from input mix.

3. Marking PCR duplicates
M = mutant reads
-Efficient PCR: single round of PCR, all template molecules amplified.
-Inefficient PCR: multiple rounds of PCR, not all template molecules are amplified each round.
Variant frequencies shown on the slide: Variant Frequency = 67%; VF = 67%; VF = 60%; VF = 55%

4. Indel realignment Allows accurate phasing of indels Ronak Shah

4. Indel realignment Ronak Shah

5. Base quality recalibration
[Figure panels: Before, After]
Base quality ~ f(read group, machine cycle, reported quality score, nucleotide context)
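Base qualities are reported on the Phred scale, Q = -10 log10(p), where p is the probability that the base call is wrong. A rough sketch of the recalibration idea: group bases by the covariates listed above, count mismatches against the reference at sites not known to be variant, and convert the observed error rate back to a quality score. The function names and the add-one smoothing are assumptions.

```python
import math


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches, observations):
    """Observed quality for one covariate bin, with add-one smoothing (assumed)."""
    return phred((mismatches + 1) / (observations + 2))


# A bin whose bases were reported as Q40 but show about 1 error per 1000 bases
print(round(empirical_quality(9, 9998), 1))
```

If the empirical quality of a bin differs from the reported quality, the reported scores in that bin are shifted accordingly.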

Variant calling in DNA-Seq
Types of mutations that can be identified using DNA-Seq (MSK-IMPACT as example)
1) Point mutations (Single nucleotide variants – SNVs)
2) Insertions and deletions (Indels)
3) Copy number variants (CNVs)
4) Structural variants (e.g. translocations, inversions)

SNV calling in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
TUMOR Cell Line (HCC1395), NORMAL Cell Line (HCC1395BL)
-Somatic Nonsense Mutation, BRCA2 (E1395*): Tumor G: 968, T: 605; Normal G: 1021, T: 1
-Loss of Heterozygosity, BRCA1 (R1751*): Tumor G: 1, A: 936; Normal G: 749, A: 771
Slide courtesy of Michael Berger
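The read counts on this slide can be turned into variant allele frequencies (VAF = alternate reads / total reads) and compared between tumor and normal. The sketch below assumes the first count pair of each example belongs to the tumor and the second to the normal. The thresholds and the function names are hypothetical and chosen only for illustration; real somatic callers use statistical models rather than fixed cutoffs.

```python
def vaf(ref_count, alt_count):
    """Variant allele frequency from reference and alternate read counts."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def classify(tumor, normal, min_somatic_vaf=0.05, max_normal_vaf=0.02):
    """Label a site from (ref, alt) counts; thresholds are illustrative only."""
    t, n = vaf(*tumor), vaf(*normal)
    if n <= max_normal_vaf and t >= min_somatic_vaf:
        return "somatic"
    # Heterozygous in normal, one allele (almost) lost in tumor
    if 0.3 <= n <= 0.7 and (t >= 0.9 or t <= 0.1):
        return "loss of heterozygosity"
    return "unclassified"


# Counts from the slide: (reference reads, alternate reads)
print(classify(tumor=(968, 605), normal=(1021, 1)))   # BRCA2 E1395*
print(classify(tumor=(1, 936), normal=(749, 771)))    # BRCA1 R1751*
```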

SNV calling in MSK-IMPACT Cibulskis K. et al, Nature Biotech, 2013

Copy number variant calling in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
Copy Number Alterations; genes labeled on the plot: NOTCH4, MLL3, MET, FGFR3, CDKN2A/B, PTEN, ERG
Slide courtesy of Michael Berger

Adjustment for GC bias
Ostensibly, Copy number = (Sequence coverage for gene in tumor sample) / (Sequence coverage for gene in normal sample)
BUT sequence coverage can be skewed by GC content of capture probes
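One simple way to picture the adjustment: before taking the tumor/normal ratio, divide each target's coverage by the median coverage of targets with similar GC content in the same sample. This is a minimal sketch under stated assumptions: the binning approach, bin width and function names are illustrative, coverage values must be non-zero, and the slides do not specify which correction method MSK-IMPACT actually uses.

```python
import math
from collections import defaultdict
from statistics import median


def gc_normalize(coverage, gc_fractions, bin_width=0.1):
    """Divide each target's coverage by the median coverage of its GC bin."""
    bins = defaultdict(list)
    for cov, gc in zip(coverage, gc_fractions):
        bins[int(gc / bin_width)].append(cov)
    bin_median = {b: median(values) for b, values in bins.items()}
    return [cov / bin_median[int(gc / bin_width)]
            for cov, gc in zip(coverage, gc_fractions)]


def log2_copy_ratio(tumor_cov, normal_cov, gc_fractions):
    """Per-target log2(tumor / normal) after GC normalization of each sample."""
    tumor = gc_normalize(tumor_cov, gc_fractions)
    normal = gc_normalize(normal_cov, gc_fractions)
    return [math.log2(t / n) for t, n in zip(tumor, normal)]


# Made-up coverage values: high-GC targets capture half as well in both samples
gc = [0.35, 0.35, 0.35, 0.65, 0.65, 0.65]
normal = [100, 100, 100, 50, 50, 50]
tumor = [100, 100, 200, 50, 50, 100]
print(log2_copy_ratio(tumor, normal, gc))
```

In this toy input the third and sixth targets come out at a log2 ratio of 1 (a doubling), regardless of their GC bin.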

Segmentation: Identifying levels in Copy Number data
[Figure panels: RATIO, SEGMENTED]

IMPACT aCGH (non-NGS confirmation)

Structural variant detection in MSK-IMPACT
EXAMPLE: Breast cancer cell line HCC1395 pair
TUMOR Cell Line (HCC1395), NORMAL Cell Line (HCC1395BL)
Structural Rearrangement on chr9 (regions with baits and with no baits): 240-kb deletion including CDKN2A/B
Slide courtesy of Michael Berger

Paired-end RNA-Seq Can Reveal Fusion Transcripts
BCR-ABL1 fusion in K-562 CML line (Chr22, Chr9)
-Discordant read pairs
Slide courtesy of Michael Berger

Paired-end RNA-Seq Can Reveal Fusion Transcripts
(Chr22, Chr9)
-Discordant read pairs
-Fusion-spanning individual reads
Slide courtesy of Michael Berger
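Discordant read pairs can be sketched as pairs whose two mates map to different chromosomes; a chromosome pair supported by several such read pairs is a fusion candidate. The tuple layout, function name, support threshold and coordinates below are all assumptions made up for illustration, not values from the slides.

```python
from collections import Counter


def candidate_fusions(pairs, min_support=2):
    """Count inter-chromosomal read pairs per chromosome pair.

    pairs: iterable of (chrom1, pos1, chrom2, pos2), one tuple per read pair.
    Returns {(chromA, chromB): count} for pairs with enough support.
    """
    counts = Counter(
        tuple(sorted((chrom1, chrom2)))
        for chrom1, _pos1, chrom2, _pos2 in pairs
        if chrom1 != chrom2  # mates on different chromosomes = discordant
    )
    return {key: n for key, n in counts.items() if n >= min_support}


# Made-up read pairs: two support a chr22/chr9 junction, one is concordant
pairs = [
    ("chr22", 1000, "chr9", 5000),
    ("chr9", 5100, "chr22", 1050),
    ("chr1", 200, "chr1", 450),
]
print(candidate_fusions(pairs))
```

Fusion-spanning individual reads, the second kind of evidence on the slide, would need split-read alignment and are not covered by this sketch.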

Acknowledgements
Berger Lab: Ronak Shah, Rose Brannon, Sasinya Scott, Helen Won, Gregory McDermott, Nancy Bouvier, Michael F. Berger
Clinical Bioinformatics: Ahmet Zehir, Aijazuddin Syed, Raghu Chandramohan, Abhinita Mohanty, Meera Prasad, Mustafa Syed, Sumit Middha, Elsa Wang, Ryan Ptashkin, Jack Birnbaum, Zhen Yu (Tony) Liu
Molecular Diagnostics Service: Marc Ladanyi, Maria Arcila, Liying Zhang, Ryma Benayed, Talia Mitchell, Catherine O’Reilly, Jacklyn Casanova, Angela Yannes, Anna Plentsova, Michael Zaidinski, Yvonne Chekaluk, Nana Mensah, Justyna Sadowska, Doudja Nafa, Dara Ross, Jackie Hechtman, Laetitia Borsu, Meera Hameed, David Klimstra
Bioinformatics Core: Manda Wilson, Nicholas Socci, Chris Pepper, Joanne Edington, JianJiong Gao, Niki Schultz
Genomics Core: Agnes Viale
