The Variant Call Format

Slides:



Advertisements
Similar presentations
IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis.
Advertisements

A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Exploiting SNP polymorphism data Formation Bio-informatique, 9 au 13 février 2015.
Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.
Variant Calling Workshop Chris Fields Variant Calling Workshop v2 | Chris Fields1 Powerpoint by Casey Hanson.
SOLiD Sequencing & Data
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Fundamentals of Python: From First Programs Through Data Structures
NGS Analysis Using Galaxy
Steve Newhouse 28 Jan  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &
Whole Exome Sequencing for Variant Discovery and Prioritisation
Coding for Excel Analysis Optional Exercise Map Your Hazards! Module, Unit 2 Map Your Hazards! Combining Natural Hazards with Societal Issues.
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Computer Programming for Biologists Class 5 Nov 20 st, 2014 Karsten Hokamp
PAGE: A Framework for Easy Parallelization of Genomic Applications 1 Mucahid Kutlu Gagan Agrawal Department of Computer Science and Engineering The Ohio.
Miscellaneous Excel Combining Excel and Access. – Importing, exporting and linking Parsing and manipulating data. 1.
File formats Wrapping your data in the right package Deanna M. Church
GBS Bioinformatics Pipeline(s) Overview
DAY 1. GENERAL ASPECTS FOR GENETIC MAP CONSTRUCTION SANGREA SHIM.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
Quick introduction to genomic file types Preliminary quality control (lab)
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.
Alexis DereeperCIBA courses – Brasil 2011 Detection and analysis of SNP polymorphisms.
A Simple Guide to Using SPSS ( Statistical Package for the Social Sciences) for Windows.
Files Tutor: You will need ….
Overview Excel is a spreadsheet, a grid made from columns and rows. It is a software program that can make number manipulation easy and somewhat painless.
Genome STRiP ASHG Workshop demo materials
GSVCaller – R-based computational framework for detection and annotation of short sequence variations in the human genome Vasily V. Grinev Associate Professor.
Canadian Bioinformatics Workshops
Review: A Computational View Programming language common concepts: 1. sequence of instructions -> order of operations important 2. conditional structures.
Introduction to Variant Analysis of Exome- and Amplicon sequencing data Lecture by: Date: Training: Extended version see: Dr. Christian Rausch 29 May 2015.
Canadian Bioinformatics Workshops
DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Cancer Institute Frederick National Laboratory is a federally funded research.
Inheritance Model testing Andrew Stubbs Dept. Bioinformatics.
1 Bioinformatics Tools for Genotyping Frances Tong Dr. Garry Larson, Ph.D City of Hope Department of Molecular Medicine Southern California Bioinformatics.
From Reads to Results Exome-seq analysis at CCBR
Short Read Workshop Day 5: Mapping and Visualization Video 3 Introduction to BWA.
DAY 2. GETTING FAMILIAR WITH NGS SANGREA SHIM. INDEX  Day 2  Get familiar with NGS  Understanding of NGS raw read file  Quality issue  Alignment/Mapping.
An Overview of Applications for the MiSeq and HiSeq 2500 April 4, 2016 Kevin Shianna, Ph.D. Sequencing Specialist - Illumina, Inc. MGC USERS GROUP.
Canadian Bioinformatics Workshops
071126_EAS56_0057_FC – lanes 1-8 read 2 b a _EAS56_0057_FC – lanes 1-8 read 1 Table S1. Summary tables for a read 1 and b read 2 of a.
Using command line tools to process sequencing data
NGS File formats Raw data from various vendors => various formats
Lesson: Sequence processing
Component 1.6.
Cancer Genomics Core Lab
Next Generation Sequencing Analysis
Miscellaneous Excel Combining Excel and Access.
MGmapper A tool to map MetaGenomics data
CSE 182 Project.
Call SNPs & Infer Phylogeny (CSI Phylogeny)
NGS Analysis Using Galaxy
Variant Calling Workshop
Short Read Sequencing Analysis Workshop
First Bite of Variant Calling in NGS/MPS Precourse materials
Introduction to RAD Acropora millepora.
MiSeq Validation Pipeline
The FASTQ format and quality control
EMC Galaxy Course November 24-25, 2014
Assessment of HaloPlex Amplification for Sequence Capture and Massively Parallel Sequencing of Arrhythmogenic Right Ventricular Cardiomyopathy–Associated.
Intro to PHP & Variables
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
Yonglan Zheng Galaxy Hands-on Demo Step-by-step Yonglan Zheng
Introduction to Data Formats and tools
Assessment of HaloPlex Amplification for Sequence Capture and Massively Parallel Sequencing of Arrhythmogenic Right Ventricular Cardiomyopathy–Associated.
Spreadsheets, Modelling & Databases
Canadian Bioinformatics Workshops
BF528 - Sequence Analysis Fundamentals
RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

The Variant Call Format Week 2: VCF files GEN 8900-Computational Genomics

Outline of Today’s Class Next-Gen (Illumina) Library Construction and Sequencing Overview of a typical data processing pipeline How to call variants to generate VCF files Understanding the information within a VCF Exercises: 1) Read a VCF file into R, 2) count the genotypes, 3) calculate % heterozygosity and 4) convert to an alternate format Week 2: VCF files GEN 8900-Computational Genomics

Constructing an Illumina Library Start with genomic DNA: Flowcell Binding Site Forward Primer Site Overhang Reverse Primer Site Randomly fragment DNA: (usually with sonication) Add Forward and Reverse Adapters: Size Selection: ~500 bp Final Library: PCR amplification:

Sequencing-By-Synthesis Fragments in the library are bound to the oligos on the flow cell (via the recognition seq. on the ends of the adapters) Illumina Flow Cell w/ 8 lanes Flow cell is pre-coated with small DNA oligos Library: In situ amplification creates clusters of identical copies of each fragment A C G T Add labeled nucleotides A T C G Cross section of one cluster (notice there is a mutation/PCR error at one position) Base Prob. Wrong T <1% Overhead view of the same cluster G C T T A <1% A A Output T T T <1% G 30%

Illumina Output: Fastq Format Base Prob. Phred Score Code T 0.01% 40 H A 0.1% 30 ? 0.05% 33 B G 30% 5 & 1% 20 4 C 5% 13 . Phred Score (Q) = -log10(Prob. of Error) Code comes from a subset of ASCII characters **Note that different versions of Illumina use slightly different sets of codes Single Cluster Sequencing-By-Synthesis @Seqname:Flowcell:Lane:X1:Y1 TATGAC +Seqname:Flowcell:Lane:X1:Y1 H?B&4. @Seqname:Flowcell:Lane:X2:Y2 AAAGGG +Seqname:Flowcell:Lane:X2:Y2 HH??AB Fastq format (.fq or .fastq): A text file with 4 lines per sequence

Data Processing Pipeline Quality Check Trimming Remove leftover adapters FastQC Trimmomatic RawData.fq CleanData.fq FASTQ FASTQ Map to a Reference BWA Bowtie2 Gmap soap2 NAST AlignedReads.sam SAM/BAM .msa, .bed, .psl Find Variant Sites b/t individual aligned files: Single Nucleotide Polymorphism (SNPs) Insertion/Deletions (InDels) VCF GATK Samtools varFilter freeBayes

SNP Calling SNP Position Sample 1 Sample 2 Sample 3 1 0/0 1/1 2 0/1 3 4 SNP 5 Reference Genome: Map. Rds Sample 1: Map. Rds Sample 2: Map. Rds Sample 3: SNP Position Sample 1 Sample 2 Sample 3 1 0/0 1/1 2 0/1 3 4 ./. 5

VCF Files At its core, a VCF file is just a tab-delimited text file ##fileformat=VCFv4.2 ##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype"> ##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities"> ##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002 20 1291018 rs11449 G A 20 PASS . GT 0/0 0/1 20 2300608 rs84825 C T 30 PASS . GT:GP 0/1:. 0/1:0.03,0.97,0 20 2301308 rs84823 T G 30 PASS . GT:PL 1/1:26,3,0 1/1:10,5,0 ## Denotes a Meta-information Line. These lines can define the FILTER, INFO, and FORMAT terms, depending on what program created the vcf file (so not all vcf files are exactly the same!). The first line will always specify which VCF version a file is. # Denotes the Header line. The first NINE columns should always be the same for every VCF (unless you have a really old version). Then, there will be one column for every individual in your sample (i.e. these columns will change for each data set). The names for these columns are usually taken from your input file names. Meta-information typically gives you things like the definition of each term in the “Format” field, and will do the same for the “info” field if it is used Filter information will tell you what filters (if any) were applied to the vcf (for instance using a tool like GatK or VCFtools) Can also open a (small) vcf in Excel-give this a try to look at the data set for today The remaining rows have the information about each SNP position, with 1 row per VARIANT site (i.e. sites with data, but no observed differences, are NOT in the VCF file by default!)

VCF Files ##fileformat=VCFv4.2 ##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype"> ##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities"> ##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002 20 1291018 rs11449 G A 20 PASS . GT 0/0 0/1 20 2300608 rs84825 C T 30 PASS . GT:GP 0/1:. 0/1:0.03,0.97,0 20 2301308 rs84823 T G 30 PASS . GT:PL 1/1:26,3,0 1/1:10,5,0 CHROM: The chromosome (or scaffold) where the SNP is located. Comes from the names within your .fasta reference genome file POS: The position within the chromosome of the SNP (positions start at 1). ID: A database ID for each SNP (if there is one). Often this may be blank: “.” REF: The allele for the Reference Genome at the SNP position. If the position has an InDel mutation, the REF may be a string instead of a single letter. ALT: The allele for alternate/SNP allele found at the position. If there are more than 2 alleles at a site, ALT will have a comma delimited list of all possible alleles. Meta-information typically gives you things like the definition of each term in the “Format” field, and will do the same for the “info” field if it is used Filter information will tell you what filters (if any) were applied to the vcf (for instance using a tool like GatK or VCFtools) Can also open a (small) vcf in Excel-give this a try to look at the data set for today QUAL: The quality or likelihood score given to the site by the program used to call variants. Often a phred-scaled score, but sometimes a –ln(likelihood) or other score in a very different range. FILTER: If you run a filter on the vcf after calling the SNPs, this will say whether each SNP passed or failed (rather than deleting SNPs that fail the filter). INFO: Often this field is blank (“.”), unless you have run some additional analysis, such as annotation prediction.

VCF Files ##fileformat=VCFv4.2 ##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype"> ##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities"> ##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002 20 1291018 rs11449 G A 20 PASS . GT 0/0 0/1 20 2300608 rs84825 C T 30 PASS . GT:GP 0/1:. 0/1:0.03,0.97,0 20 2301308 rs84823 T G 30 PASS . GT:PL 1/1:26,3,0 1/1:10,5,0 The FORMAT column tells us exactly what “fields” we can expect in each of our sample columns. Each field is separated by a “:” and the fields are typically defined in the meta information. Different programs can return different info., but the piece we are most interested in is the “GT” field, which is the actual genotype. Individual Genotypes are always given in the form: allele1/allele2 0 = reference allele 1 = alternate allele 0/0 = homozygous ref; 0/1 = heterozygous; 1/1 = homozygous alt A “./.” means that the genotype is missing for that individual. If a site is multi-allelic, then there will be additional encodings (e.g. 0/2, 2/2, etc.) If a sample is polyploid, the genotype will give all of the alleles: 0/0/1/1 (tetraploid) If a sample is “phased” (a term we will discuss during linkage), there is a | between alleles instead of a / Meta-information typically gives you things like the definition of each term in the “Format” field, and will do the same for the “info” field if it is used Filter information will tell you what filters (if any) were applied to the vcf (for instance using a tool like GatK or VCFtools) Can also open a (small) vcf in Excel-give this a try to look at the data set for today

VCF Files and R Since the VCF format is essentially a text file, it is easily readable by R The things to watch out for are the special characters: VCF files have ## and # headers, which is NOT commonly differentiated by most computing languages The use of “.” as a missing data character can trip up some regular expression searches The use of “|” in phased VCF files can also mess up regular expression searches. It is also important to keep in mind that it is almost impossible to write a script that will work correctly with every version of the VCF format; early versions in particular might cause problems. This is also true of software with dedicated teams of programmers (like GATK), a change in version can break certain package functions! So, don’t feel bad – just be aware of the issue! Meta-information typically gives you things like the definition of each term in the “Format” field, and will do the same for the “info” field if it is used Filter information will tell you what filters (if any) were applied to the vcf (for instance using a tool like GatK or VCFtools) Can also open a (small) vcf in Excel-give this a try to look at the data set for today A very detailed guide to VCF files can be found here: https://samtools.github.io/hts-specs/VCFv4.1.pdf