National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Variant Calling Workshop
Overview
There will be two parts to the workshop:
1. Variant calling analysis (on the cluster)
2. Visualization (on the desktop) using IGV
Command prompts (what you will type) will be in boxes preceded by ‘$’. Output will be in red:
$ mkdir foo
$ cd foo
$ ls -la
total 96
drwxrwxr-x 2 cjfields cjfields Jun 23 22:51 .
drwxr-x cjfields cjfields Jun 23 22:51 ..
Prelude : Variant Calling Setup
1. Log into the cluster using your classroom account.
2. Create a work folder (I call mine ‘mayo_test’):
$ mkdir mayo_test
$ cd mayo_test
$ ll
total 0
Part Ia : Variant Calling Setup
3. Link all scripts from the main work folder into this directory (note the space before the final ‘.’, which names the current directory as the link destination):
$ ln -s /home/mirrors/gatk_bundle/mayo_workshop/*.sh .
$ ls
annotate_snpeff.sh call_variants_ug.sh hard_filtering.sh post_annotate.sh
Part Ia : Variant Calling Setup
Data for this workshop is from the 1000 Genomes project: whole-genome sequencing (WGS) at 60x coverage.
The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration) has already been done, and the BAM file has been reduced to a portion of human chromosome 20.
Otherwise, we would not even finish the alignment within the next few days, let alone the other steps.
Part Ia : Variant Calling
Start the variant calling job. Check the status of the job using ‘qstat’:
$ qsub call_variants_ug.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default call_variants_ug gb -- R 00:01
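Rather than re-running ‘qstat’ by hand, the check can be wrapped in a small polling loop. This is a minimal sketch, not part of the workshop scripts: the function name wait_for_job and the STATUS_CMD override are hypothetical, and it assumes the scheduler listing starts each row with the numeric job ID followed by a dot. On clusters that keep completed jobs in the listing for a while, the loop will only return once the job is dropped from it.

```shell
# Block until job_id no longer appears in the scheduler listing.
# STATUS_CMD is a hypothetical override (default: qstat) so the loop
# can be tried away from the cluster.
wait_for_job() {
    job_id=$1
    interval=${2:-30}   # seconds between checks
    while ${STATUS_CMD:-qstat} 2>/dev/null | grep -q "^${job_id}\."; do
        sleep "$interval"
    done
    echo "job ${job_id} is no longer listed"
}
```

Call it with the numeric part of the ID that qsub printed, for example ‘wait_for_job <job number> 30’.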
Part Ia : Variant Calling
Discussion: what did we just do?
We ran the GATK UnifiedGenotyper to call variants.
Show the script…
Part Ia : Variant Calling
Job done yet? It should only take a few minutes…
What do the data look like? (anyone here use UNIX?)
$ qstat -u
$ ll *vcf*
-rw-rw-r-- 1 cjfields cjfields Jun 23 23:10 raw_indels.vcf
-rw-rw-r-- 1 cjfields cjfields 2829 Jun 23 23:10 raw_indels.vcf.idx
-rw-rw-r-- 1 cjfields cjfields Jun 23 23:08 raw_snps.vcf
-rw-rw-r-- 1 cjfields cjfields Jun 23 23:08 raw_snps.vcf.idx
$ tail -n 2 raw_indels.vcf
rs CAGAC AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371 GT:AD:DP:GQ:PL 0/1:44,26:75:99:1343,0,
rs GTG AC=1;AF=0.500;AN=2;BaseQRankSum=3.814;DB;DP=83;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=57.12;MQ0=0;MQRankSum=-1.411;QD=18.11;ReadPosRankSum=1.387 GT:AD:DP:GQ:PL 0/1:33,36:76:99:1540,0,1253
Part Ia : Variant Calling
How many SNPs and indels were called? Any found in dbSNP?
$ grep -c -v '^#' raw_snps.vcf
$ grep -c -v '^#' raw_indels.vcf
1070
$ grep -c 'rs[0-9]*' raw_snps.vcf
$ grep -c 'rs[0-9]*' raw_indels.vcf
1019
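The pattern 'rs[0-9]*' matches any line that contains the letters ‘rs’, header lines included, so it can overcount. A stricter sketch follows, assuming the standard tab-delimited VCF layout in which the ID is the third column. The function names are made up for illustration and are not part of the workshop scripts.

```shell
# Count data records (non-header lines) in a VCF.
count_records() {
    grep -c -v '^#' "$1"
}

# Count data records whose ID column (column 3) holds at least one dbSNP rsID.
count_dbsnp() {
    awk -F'\t' '!/^#/ && $3 ~ /(^|;)rs[0-9]+/ { n++ } END { print n + 0 }' "$1"
}
```

For example, ‘count_dbsnp raw_indels.vcf’ reports only records whose ID field carries an rsID, which may differ from the grep count above.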
Part Ib : Hard filtering
We need to filter the variant calls.
Generally, for human data we would use variant quality score recalibration (VQSR), but we have a very small set of variants, so here we use hard filtering.
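Hard filtering means applying fixed thresholds to the annotations on each call. The sketch below illustrates the idea on a single annotation, QD (quality by depth), read from the INFO column. It is not what hard_filtering.sh does: the function name qd_filter is hypothetical, and the default cutoff of 2.0 is an assumption taken from commonly cited GATK hard-filtering guidance for SNPs, not from the workshop script.

```shell
# Label each data record PASS or LowQD based on the QD value in INFO (column 8).
# min_qd defaults to 2.0, an assumed illustrative cutoff.
qd_filter() {
    vcf=$1
    min_qd=${2:-2.0}
    awk -F'\t' -v min="$min_qd" '
        /^#/ { next }                       # skip header lines
        {
            qd = ""
            n = split($8, info, ";")        # INFO is a ;-separated list of KEY=VALUE
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^QD=/) { qd = substr(info[i], 4) }
            }
            if (qd == "") { status = "NoQD" }
            else if (qd + 0 < min + 0) { status = "LowQD" }
            else { status = "PASS" }
            print $1, $2, qd, status        # CHROM POS QD label
        }' "$vcf"
}
```

Running ‘qd_filter raw_snps.vcf’ prints one line per call, which makes it easy to see which records a given threshold would flag.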
Part Ib : Hard filtering
Start the hard filtering step. This will be fast:
$ qsub hard_filtering.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default hard_filtering.s gb -- R --
You will have two new VCF files in a minute:
hard_filtered_snps.vcf
hard_filtered_indels.vcf
Part Ib : Hard filtering
What are we doing?
Questions:
Did we lose any variants?
How many PASSed the filter?
What is the difference between the filtered and raw output?
$ grep -c 'PASS' hard_filtered_snps.vcf
8270
$ grep -c 'PASS' hard_filtered_indels.vcf
1041
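Counting lines that contain ‘PASS’ also counts any header line that mentions it, and it says nothing about the records that failed. A minimal sketch that tallies every value in the FILTER column (column 7) instead, assuming the filtering step marks failing records there rather than deleting them. The function name filter_tally is hypothetical.

```shell
# Tally data records by their FILTER value (column 7), skipping header lines.
filter_tally() {
    awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f, n[f] }' "$1" | sort
}
```

Comparing the total from ‘filter_tally hard_filtered_snps.vcf’ with the record count of raw_snps.vcf is one way to check whether any variants were lost.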
Part Ic : Annotate the variants (SnpEff)
Run the next job, which uses SnpEff to add annotation to the VCF. This takes a couple of minutes…
$ qsub annotate_snpeff.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default annotate_snpeff gb -- R --
Two new VCF files:
hard_filtered_snps_annotated.vcf
hard_filtered_indels_annotated.vcf
Part Ic : Annotate the variants (SnpEff)
SnpEff adds information about where the variants are in relation to specific genes.
The IDs for the human assembly version we use are from Ensembl (ENSGXXXXXXXXXXX).
The Ensembl ID for FOXA2 is ENSG
Part Ic : Annotate the variants (SnpEff)
The Ensembl ID for FOXA2 is ENSG
Are there any variants called for FOXA2?
$ grep -c 'ENSG ' hard_filtered_snps_annotated.vcf
3
$ grep -c 'ENSG ' hard_filtered_indels_annotated.vcf
0
SnpEff also creates some additional output files; we’ll see those in a bit.
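A count says how many records mention the gene but not where they are. The following sketch lists the position and alleles of each matching record, assuming SnpEff writes its annotation into the INFO column (column 8). The function name records_for_gene is hypothetical, and the gene ID is passed in by the caller.

```shell
# Print CHROM, POS, REF and ALT for data records whose INFO column (column 8)
# mentions the given gene ID. The ID is supplied by the caller.
records_for_gene() {
    awk -F'\t' -v id="$2" '!/^#/ && index($8, id) > 0 { print $1, $2, $4, $5 }' "$1"
}
```

Usage takes the form ‘records_for_gene hard_filtered_snps_annotated.vcf <Ensembl gene ID>’.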
Part Id : GATK VariantAnnotator
SnpEff adds a lot of information to the VCF.
GATK VariantAnnotator helps remove much of the extraneous information.
Part Id : GATK VariantAnnotator
The last step. This may take about 5-10 minutes:
$ qsub post_annotate.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default post_annotate.sh gb -- R 00:01
While this is going on…
Let’s start a little tutorial on the Integrative Genomics Viewer (IGV), which is also from the Broad Institute.
Prelude to Part II
We need to download the results from your user folders to the local desktop.
We’ll use FileZilla for this.
FileZilla
Transfer folder to the desktop
Part II : Viewing Results in IGV
Open IGV.
Switch the genome to ‘Human (b37)’.