First Bite of Variant Calling in NGS/MPS
Precourse materials
Yonglan Zheng (yzheng3@uchicago.edu)
2017.6.10
Precourse materials
• General NGS variant calling workflow [slide #3]: GATK Best Practices as an example
• NGS file formats [slides #4-8]
• NGS variant calling tools and platform [slide #9]: FastQC, BWA-MEM, Picard (MarkDuplicates), GATK (RealignerTargetCreator, IndelRealigner, UnifiedGenotyper), SnpEff, Freebayes
GATK Best Practices (v3.x)
Direct multi-sample calling is replaced by a two-step combination: single-sample calling in gVCF mode, followed by joint genotyping analysis across samples. A Genome VCF (gVCF) records both variant and non-variant positions.
https://software.broadinstitute.org/gatk/best-practices
FASTQ
A FASTQ file (.fq or .fastq) is a text-based file that stores a biological sequence (usually a nucleotide sequence) together with its corresponding quality scores. Each entry in a FASTQ file consists of four lines:
• Sequence identifier
• Sequence
• Quality score identifier line (consisting of a +)
• Quality scores
Quality
A quality value Q is an integer mapping of p, the probability that the corresponding base call is incorrect.
Phred quality score: Q = -10 log10(p)
Sequence identifier
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
https://en.wikipedia.org/wiki/FASTQ_format
https://support.illumina.com/
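The Phred relationship between Q and p, Q = -10 log10(p), can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides. The function names are made up for this example, and the ASCII offset of 33 (the "Phred+33" encoding) is an assumption that holds for Sanger-style and recent Illumina files but not for some older Illumina pipelines.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p to the nearest integer Phred score Q."""
    return round(-10 * math.log10(p))


def decode_quality(quality_line, offset=33):
    """Decode a FASTQ quality line into integer Phred scores.

    offset=33 is an assumption (Phred+33 encoding); adjust it for other encodings.
    """
    return [ord(char) - offset for char in quality_line]


if __name__ == "__main__":
    for q in decode_quality("I5+!"):
        print(q, phred_to_error_prob(q))
```

Under this encoding, Q = 20 corresponds to a 1 in 100 chance that the base call is wrong, and Q = 30 to a 1 in 1000 chance.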
FASTQ
An example of a valid entry is as follows:
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
https://en.wikipedia.org/wiki/FASTQ_format
https://support.illumina.com/
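To make the four-line layout and the identifier fields concrete, here is a minimal Python sketch that reads FASTQ records and splits an Illumina-style identifier into its named parts. The function names and the returned dictionary keys are hypothetical choices for this example, and the short record in the usage section is made up. The sketch also assumes that each sequence and quality string sits on a single line, which the format does not strictly require.

```python
import io


def parse_illumina_id(line):
    """Split '@<instrument>:...:<y-pos> <read>:...:<index sequence>' into fields."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    left, right = line[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, is_filtered, control, index_seq = right.split(":")
    return {
        "instrument": instrument,
        "run_number": int(run),
        "flowcell_id": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "control_number": int(control),
        "index_sequence": index_seq,
    }


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples, four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header, sequence, quality


if __name__ == "__main__":
    # Made-up record for illustration only.
    text = "@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG\nACGT\n+\nIIII\n"
    for header, sequence, quality in read_fastq(io.StringIO(text)):
        print(parse_illumina_id(header)["lane"], sequence, quality)
```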
SAM/BAM/CRAM
A SAM (Sequence Alignment/Map) file (.sam) is a tab-delimited text file that contains sequence alignment data. A BAM file (.bam) is the binary version of a SAM file. CRAM typically achieves 40-50% space saving over the alternative BAM format. It uses reference-based compression, meaning that only base calls that differ from a designated reference sequence need to be stored.
Parts of a SAM file:
• Headers
• Alignment section: mandatory fields, including the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string
• Alignment section: optional fields
https://github.com/samtools/hts-specs
http://www.sanger.ac.uk/science/tools/cram
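As an illustration of the CIGAR string, the following Python sketch splits a CIGAR into (length, operation) pairs and counts how many reference and read bases the alignment covers. It is a simplified example, not a full SAM parser. The function names and the example CIGAR are made up here, while the operation letters and which of them consume reference or read bases follow the SAM specification.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a CIGAR string."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR (excluding hard clips)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


if __name__ == "__main__":
    example = "3S10M2I5M1D4M"  # made-up CIGAR for illustration
    print(parse_cigar(example))
    print(reference_span(example), query_length(example))
```

In the example, the soft clip (S) and the insertion (I) count toward the read length only, while the deletion (D) counts toward the reference span only.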
VCF/BCF
A VCF (Variant Call Format) file (.vcf) is a text file that contains meta-information lines (prefixed with "##"), a header line (prefixed with "#"), and data lines. Each data line describes one position in the genome, along with genotype information on the samples at that position, as text fields separated by tabs. VCF's binary counterpart is BCF.
https://github.com/samtools/hts-specs
https://petridishtalk.com/
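The three kinds of VCF lines can be told apart by their prefixes, as in the minimal Python sketch below. It is illustrative only: the function name and the returned structure are assumptions made for this example, the sample data is invented, and real files are better handled with a dedicated library. The fixed column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one column per sample) follows the VCF specification.

```python
def parse_vcf(lines):
    """Split VCF text lines into meta-information, sample names, and records."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])          # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split("\t")  # header line
            samples = columns[9:]           # sample names follow FORMAT
        else:
            fields = line.split("\t")       # data line
            record = {
                "CHROM": fields[0],
                "POS": int(fields[1]),
                "ID": fields[2],
                "REF": fields[3],
                "ALT": fields[4].split(","),
                "QUAL": fields[5],
                "FILTER": fields[6],
                "INFO": fields[7],
            }
            if samples:
                keys = fields[8].split(":")
                record["genotypes"] = {
                    name: dict(zip(keys, value.split(":")))
                    for name, value in zip(samples, fields[9:])
                }
            records.append(record)
    return meta, samples, records


if __name__ == "__main__":
    # Invented example data, not taken from a real call set.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
        "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30",
    ]
    meta, samples, records = parse_vcf(example)
    print(samples, records[0]["genotypes"])
```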
MAF
A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations. The format originates from The Cancer Genome Atlas (TCGA) project. Its columns include: Hugo_Symbol, Entrez_Gene_Id, Center, NCBI_Build, Chromosome, Start_Position, End_Position, Strand, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2, ...
https://wiki.nci.nih.gov/display/TCGA/
BED
A BED (Browser Extensible Data) file (.bed) is a tab-delimited text file that defines a feature track. It consists of one line per feature, each containing 3-12 columns of data. The first three columns are required; the remaining nine are optional.
http://www.ensembl.org/info/website/upload/bed.html
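A BED line can be read with a short Python sketch like the one below. The function names are hypothetical and the example line is made up for illustration. The twelve column names follow the common BED field naming. The sketch also relies on the convention that BED start coordinates are 0-based and end coordinates are exclusive, which is why the feature length is simply end minus start.

```python
BED_FIELDS = [
    "chrom", "chromStart", "chromEnd",             # required columns
    "name", "score", "strand", "thickStart",       # optional columns
    "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line):
    """Parse one tab-delimited BED line with 3 to 12 columns into a dict."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= 12:
        raise ValueError("a BED line must have between 3 and 12 columns")
    record = dict(zip(BED_FIELDS, fields))
    record["chromStart"] = int(record["chromStart"])
    record["chromEnd"] = int(record["chromEnd"])
    return record


def feature_length(record):
    """Length in bases, assuming 0-based start and exclusive end."""
    return record["chromEnd"] - record["chromStart"]


if __name__ == "__main__":
    # Made-up feature for illustration.
    record = parse_bed_line("chr7\t127471196\t127472363\tPos1\t0\t+")
    print(record["name"], feature_length(record))
```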
NGS Variant Calling Tools and Platform
• FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• BWA-MEM: https://github.com/lh3/bwa
• Picard: https://broadinstitute.github.io/picard/command-line-overview.html
• Picard MarkDuplicates: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
• GATK: https://software.broadinstitute.org/gatk/
• GATK RealignerTargetCreator: https://software.broadinstitute.org/gatk/gatkdocs/3.7-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
• GATK IndelRealigner: https://software.broadinstitute.org/gatk/gatkdocs/3.7-0/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php
• SnpEff: http://snpeff.sourceforge.net/
• Freebayes: https://github.com/ekg/freebayes
• Galaxy: https://usegalaxy.org/ (main site); https://test.galaxyproject.org/ (test site)
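To show how several of these tools chain together on the command line, the Python sketch below builds (but does not run) one possible sequence of commands for a single paired-end sample. It is a rough outline under stated assumptions, not the workflow used in the course: all file names, the jar paths, and the read group values are hypothetical. The coordinate-sorting step with Picard SortSam is an addition that does not appear in the tool list above, included because MarkDuplicates expects sorted input. Prerequisites such as reference indexing and BAM indexing, as well as SnpEff annotation and Freebayes calling, are left out.

```python
def build_pipeline(ref, fq1, fq2, sample,
                   picard_jar="picard.jar",            # hypothetical path
                   gatk_jar="GenomeAnalysisTK.jar"):   # hypothetical path (GATK 3.x)
    """Return a list of command-line argument lists, in execution order."""
    # bwa expects the literal two-character sequence '\t' inside the read group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"              # bwa mem writes SAM to stdout; redirect it here
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    intervals = f"{sample}.intervals"
    realigned_bam = f"{sample}.realigned.bam"
    vcf = f"{sample}.vcf"
    return [
        ["fastqc", fq1, fq2],
        ["bwa", "mem", "-R", read_group, ref, fq1, fq2],
        ["java", "-jar", picard_jar, "SortSam",
         f"I={sam}", f"O={sorted_bam}", "SORT_ORDER=coordinate"],
        ["java", "-jar", picard_jar, "MarkDuplicates",
         f"I={sorted_bam}", f"O={dedup_bam}", f"M={sample}.dup_metrics.txt"],
        ["java", "-jar", gatk_jar, "-T", "RealignerTargetCreator",
         "-R", ref, "-I", dedup_bam, "-o", intervals],
        ["java", "-jar", gatk_jar, "-T", "IndelRealigner",
         "-R", ref, "-I", dedup_bam, "-targetIntervals", intervals,
         "-o", realigned_bam],
        ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper",
         "-R", ref, "-I", realigned_bam, "-o", vcf],
    ]


if __name__ == "__main__":
    # Hypothetical input names for illustration.
    for command in build_pipeline("ref.fa", "s1_R1.fastq", "s1_R2.fastq", "s1"):
        print(" ".join(command))
```

Building the commands as argument lists keeps each step easy to inspect before running it, for example with subprocess.run, with the bwa output redirected to the SAM file.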