First Bite of Variant Calling in NGS/MPS Precourse materials


First Bite of Variant Calling in NGS/MPS Precourse materials Yonglan Zheng (yzheng3@uchicago.edu) 2017.6.10

Precourse materials
• General NGS variant calling workflow [slide #3]: GATK Best Practice as an example
• NGS file formats [slides #4-8]
• NGS variant calling tools and platform [slide #9]: FastQC, BWA-MEM, Picard (MarkDuplicates), GATK (RealignerTargetCreator, IndelRealigner, UnifiedGenotyper), SnpEff, Freebayes

GATK Best Practice (v3.x)
Multi-sample calling is replaced by single-sample calling in gVCF mode followed by joint genotyping analysis. A Genome VCF (gVCF) records both variant and non-variant positions.
https://software.broadinstitute.org/gatk/best-practices

FASTQ
A FASTQ file (.fq or .fastq) is a text-based file that stores a biological sequence (usually a nucleotide sequence) together with its corresponding quality scores. Each entry in a FASTQ file consists of four lines:
• Sequence identifier
• Sequence
• Quality score identifier line (consisting of a +)
• Quality score
Quality
A quality value Q is an integer mapping of p, the probability that the corresponding base call is incorrect.
Phred quality score: Q = -10 log10(p)
Sequence identifier
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
https://en.wikipedia.org/wiki/FASTQ_format
https://support.illumina.com/
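The relationship between Q and p can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names are hypothetical, and the Phred+33 ASCII offset is an assumption that matches current Illumina output but not every historical FASTQ variant.

```python
from __future__ import annotations

import math


def phred_to_error_prob(q: int) -> float:
    """Return the error probability p for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> int:
    """Return Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into Phred scores.

    The default offset of 33 (Phred+33) is an assumption; older
    Illumina pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality]
```

For example, Q = 20 corresponds to a 1 in 100 chance that the base call is wrong, and Q = 30 to a 1 in 1000 chance.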

FASTQ
An example of a valid entry is as follows:
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
https://en.wikipedia.org/wiki/FASTQ_format
https://support.illumina.com/
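To make the four-line layout and the identifier fields concrete, here is a small Python sketch that reads FASTQ text and splits an Illumina-style identifier into named fields. The function and field names are hypothetical, and the parser assumes each sequence and quality string sits on a single line; it is not a replacement for an established FASTQ library.

```python
from __future__ import annotations

from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading "@"
    sequence: str
    quality: str


def parse_fastq(text: str) -> list[FastqRecord]:
    """Parse FASTQ text in which every entry spans exactly four lines."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ entry starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append(FastqRecord(header[1:], seq, qual))
    return records


def parse_illumina_identifier(identifier: str) -> dict[str, str]:
    """Split an Illumina sequence identifier into its colon-separated fields."""
    location, _, flags = identifier.lstrip("@").partition(" ")
    location_keys = ["instrument", "run_number", "flowcell_id",
                     "lane", "tile", "x_pos", "y_pos"]
    flag_keys = ["read", "is_filtered", "control_number", "index_sequence"]
    fields = dict(zip(location_keys, location.split(":")))
    fields.update(zip(flag_keys, flags.split(":")))
    return fields
```

Applied to the identifier in the entry above, `parse_illumina_identifier` reports lane 2, tile 5, read 1 and index sequence ATCACG.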

SAM/BAM/CRAM
A SAM (Sequence Alignment/Map) file (.sam) is a tab-delimited text file that contains sequence alignment data. A BAM file (.bam) is the binary version of a SAM file. A CRAM file (.cram) uses reference-based compression, meaning that only base calls that differ from a designated reference sequence need to be stored; it typically achieves a 40-50% space saving over the alternative BAM format.
A SAM file consists of:
• Headers
• Alignment section: mandatory fields, including the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string
• Alignment section: optional fields
https://github.com/samtools/hts-specs
http://www.sanger.ac.uk/science/tools/cram
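The CIGAR string is easier to understand with a small example. The Python sketch below splits a CIGAR string into (length, operation) pairs and computes how many reference and read bases the alignment covers. The function names are hypothetical; the operation letters and which of them consume the reference or the read follow the SAM specification.

```python
import re

# One CIGAR element: a run length followed by an operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the reference and along the read.
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; "*" means unavailable."""
    if cigar == "*":
        return []
    elements = CIGAR_ELEMENT.findall(cigar)
    # Reject strings with characters the pattern did not account for.
    if "".join(n + op for n, op in elements) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in elements]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)
```

For the made-up string 8M2I4M1D3M, the read contributes 17 bases (the 2-base insertion counts, the deletion does not) while the alignment spans 16 reference bases.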

VCF/BCF
A VCF (Variant Call Format) file (.vcf) is a text file that contains:
• Meta-information lines (prefixed with "##")
• A header line (prefixed with "#")
• Data lines, each containing information about a position in the genome and genotype information on samples for that position (text fields separated by tabs)
VCF's binary counterpart is BCF.
https://github.com/samtools/hts-specs
https://petridishtalk.com/
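The three kinds of VCF lines can be told apart by their prefix alone, as the following Python sketch shows. It is a deliberately minimal illustration with a hypothetical function name: it keeps every field as a string and does not interpret the INFO or FORMAT columns, which a real VCF library would do.

```python
def parse_vcf(text):
    """Split VCF text into meta-information lines, header columns and records.

    Returns (meta, header, records), where each record is a dict that
    maps a header column name to the raw string in that column.
    """
    meta, header, records = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            meta.append(line[2:])            # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")    # single header line
        else:
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records
```

Given a file with one "##fileformat" line, the standard "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT" header plus one sample column, and one data line, the function returns one meta entry, ten header names and one record whose last key is the sample name.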

MAF
A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations. The format originates from The Cancer Genome Atlas (TCGA) project. Its columns include: Hugo_Symbol, Entrez_Gene_Id, Center, NCBI_Build, Chromosome, Start_Position, End_Position, Strand, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2, ...
https://wiki.nci.nih.gov/display/TCGA/

BED
A BED (Browser Extensible Data) file (.bed) is a tab-delimited text file that defines a feature track. It consists of one line per feature, each containing 3-12 columns of data. The first three columns (chrom, chromStart, chromEnd) are required; the remaining nine are optional.
http://www.ensembl.org/info/website/upload/bed.html
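As a rough illustration of the BED layout, the Python sketch below maps the 3-12 tab-separated values of one line onto the standard column names. The function names are hypothetical. Note that BED coordinates are zero-based and half-open, so the feature length is simply chromEnd minus chromStart.

```python
# Standard BED column names, in order; only the first three are required.
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd",
    "name", "score", "strand", "thickStart", "thickEnd",
    "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line):
    """Parse one BED feature line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if not 3 <= len(values) <= 12:
        raise ValueError("a BED line must have between 3 and 12 columns")
    feature = dict(zip(BED_COLUMNS, values))
    # Convert the required coordinates; optional columns stay as strings.
    feature["chromStart"] = int(feature["chromStart"])
    feature["chromEnd"] = int(feature["chromEnd"])
    return feature


def feature_length(feature):
    """Length in bases; start is 0-based and end is exclusive."""
    return feature["chromEnd"] - feature["chromStart"]
```

A six-column line such as chr7, 127471196, 127472363, Pos1, 0, + (values chosen only for illustration) yields a feature of 1167 bases on the plus strand.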

NGS Variant Calling Tools and Platform
• FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• BWA-MEM: https://github.com/lh3/bwa
• Picard: https://broadinstitute.github.io/picard/command-line-overview.html
• Picard MarkDuplicates: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
• GATK: https://software.broadinstitute.org/gatk/
• GATK RealignerTargetCreator: https://software.broadinstitute.org/gatk/gatkdocs/3.7-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
• GATK IndelRealigner: https://software.broadinstitute.org/gatk/gatkdocs/3.7-0/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php
• SnpEff: http://snpeff.sourceforge.net/
• Freebayes: https://github.com/ekg/freebayes
• Galaxy: https://usegalaxy.org/ (main site); https://test.galaxyproject.org/ (test site)
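For readers who prefer the command line to Galaxy, the order in which these tools are usually chained can be sketched as a list of commands. The Python function below only builds the argument lists and does not run anything. Everything specific in it is an assumption made for illustration: the file names, the jar locations, the read-group values, and the Picard SortSam step, which the slides do not list but which is one common way to produce the coordinate-sorted BAM that MarkDuplicates and GATK 3.x expect. It also assumes the reference has already been indexed, and it leaves out annotation with SnpEff.

```python
from __future__ import annotations


def build_pipeline(sample: str,
                   ref: str = "ref.fa",
                   picard: str = "picard.jar",
                   gatk: str = "GenomeAnalysisTK.jar") -> list[tuple[str, list[str]]]:
    """Return (step name, argv) pairs for a GATK 3.x style workflow.

    All paths are hypothetical placeholders derived from `sample`.
    """
    fq1, fq2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    intervals = f"{sample}.intervals"
    realigned_bam = f"{sample}.realigned.bam"
    vcf = f"{sample}.vcf"
    # GATK needs read-group information; bwa expands the literal "\t".
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    return [
        ("FastQC", ["fastqc", fq1, fq2]),
        # bwa mem writes SAM to standard output; redirect it to `sam`.
        ("BWA-MEM", ["bwa", "mem", "-R", read_group, ref, fq1, fq2]),
        ("Picard SortSam", ["java", "-jar", picard, "SortSam",
                            f"I={sam}", f"O={sorted_bam}",
                            "SORT_ORDER=coordinate"]),
        ("Picard MarkDuplicates", ["java", "-jar", picard, "MarkDuplicates",
                                   f"I={sorted_bam}", f"O={dedup_bam}",
                                   f"M={sample}.dup_metrics.txt",
                                   "CREATE_INDEX=true"]),
        ("GATK RealignerTargetCreator", ["java", "-jar", gatk,
                                         "-T", "RealignerTargetCreator",
                                         "-R", ref, "-I", dedup_bam,
                                         "-o", intervals]),
        ("GATK IndelRealigner", ["java", "-jar", gatk,
                                 "-T", "IndelRealigner",
                                 "-R", ref, "-I", dedup_bam,
                                 "-targetIntervals", intervals,
                                 "-o", realigned_bam]),
        ("GATK UnifiedGenotyper", ["java", "-jar", gatk,
                                   "-T", "UnifiedGenotyper",
                                   "-R", ref, "-I", realigned_bam,
                                   "-o", vcf]),
    ]
```

Printing the result of `build_pipeline("sample1")` gives a readable checklist of the steps; each argument list could be passed to `subprocess.run` once the placeholder paths are replaced with real ones.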