National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Variant Calling Workshop.


Variant Calling Workshop v2 | Chris Fields | PowerPoint by Casey Hanson


Overview

There will be two parts to the workshop:
1. Variant calling analysis (on the cluster)
2. Visualization (on the desktop) using IGV

On the slides, command prompts (what you will type) appear in boxes preceded by ‘$’, and output appears in red:

    $ mkdir foo
    $ cd foo
    $ ls -la
    total 96
    drwxrwxr-x 2 cjfields cjfields Jun 23 22:51 .
    drwxr-x cjfields cjfields Jun 23 22:51 ..

Prelude : Variant Calling Setup

1. Log into the cluster using your classroom account.
2. Create a work folder (I call mine ‘mayo_test’):

    $ mkdir mayo_test
    $ cd mayo_test
    $ ll
    total 0

Part Ia : Variant Calling Setup

3. Link in all scripts from the main work folder to this directory (note the trailing ‘.’, which names the current directory as the link target):

    $ ln -s /home/mirrors/gatk_bundle/mayo_workshop/*.sh .
    $ ls
    annotate_snpeff.sh  call_variants_ug.sh  hard_filtering.sh  post_annotate.sh

Part Ia : Variant Calling Setup

Data for this workshop is from the 1000 Genomes project and is WGS at 60x coverage.
The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration) has been done, and the BAM file has been reduced to a portion of human chromosome 20.
Otherwise, we would not even finish the alignment within the next few days, let alone the other steps.

Part Ia : Variant Calling

Start the variant calling job. Check the status of the job using ‘qstat’:

    $ qsub call_variants_ug.sh
    biocluster.igb.illinois.edu
    $ qstat -u
    biocluster.igb.illinois.edu:
    Req'd Req'd Elap
    Job ID  Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time  S  Time
    biocluste  cjfields  default  call_variants_ug  gb  --  R  00:01
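When several jobs are queued, the job name and state are usually all you want from the ‘qstat’ listing. The sketch below pulls out just those two columns. It assumes the Torque/PBS layout shown in the header above (11 whitespace-separated columns, with the username in column 2, the job name in column 4 and the state in column 10); the function name `job_states` is made up for this example.

```shell
# Print "jobname state" for each job owned by the given user.
# Reads `qstat -u` output on stdin.
# Assumption: Torque/PBS column layout (Username = col 2, Jobname = col 4, S = col 10).
# Typical Torque states: Q = queued, R = running, C = completed.
job_states() {
    awk -v user="$1" '$2 == user && NF >= 11 { print $4, $10 }'
}

# Usage on the cluster (not run here):
#   qstat -u "$USER" | job_states "$USER"
```

Once the job no longer shows up as Q or R, the output files should be in place.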

Part Ia : Variant Calling

Discussion: what did we just do?
We ran the GATK UnifiedGenotyper to call variants.
Show the script…
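The script itself is not reproduced in this transcript. As a rough idea of what a UnifiedGenotyper call looks like, here is a hypothetical dry-run sketch that only prints GATK 3-style command lines. Every file name in it (`GenomeAnalysisTK.jar`, `reference.fasta`, `sample.bam`, `dbsnp.vcf`) is a placeholder, and the real call_variants_ug.sh may use different options.

```shell
# Hypothetical sketch: assemble (but do not run) a GATK 3-style
# UnifiedGenotyper command line. All paths are placeholders.
ug_command() {
    model="$1"   # genotype likelihoods model: SNP or INDEL
    out="$2"     # output VCF, e.g. raw_snps.vcf
    echo java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I sample.bam \
        -L 20 \
        -glm "$model" \
        --dbsnp dbsnp.vcf \
        -o "$out"
}

# Dry run: print one command per output file named in this workshop.
ug_command SNP raw_snps.vcf
ug_command INDEL raw_indels.vcf
```

Running SNPs and indels as two separate calls would match the two raw VCF files the workshop job produces, but that split is an assumption here.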

Part Ia : Variant Calling

Job done yet? Should only be a few minutes…
What do the data look like? (anyone here use UNIX?)

    $ qstat -u
    $ ll *vcf*
    -rw-rw-r-- 1 cjfields cjfields Jun 23 23:10 raw_indels.vcf
    -rw-rw-r-- 1 cjfields cjfields 2829 Jun 23 23:10 raw_indels.vcf.idx
    -rw-rw-r-- 1 cjfields cjfields Jun 23 23:08 raw_snps.vcf
    -rw-rw-r-- 1 cjfields cjfields Jun 23 23:08 raw_snps.vcf.idx
    $ tail -n 2 raw_indels.vcf
    rs CAGAC AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371 GT:AD:DP:GQ:PL 0/1:44,26:75:99:1343,0,
    rs GTG AC=1;AF=0.500;AN=2;BaseQRankSum=3.814;DB;DP=83;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=57.12;MQ0=0;MQRankSum=-1.411;QD=18.11;ReadPosRankSum=1.387 GT:AD:DP:GQ:PL 0/1:33,36:76:99:1540,0,1253
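Raw VCF lines are hard to read because the INFO column is so long. A minimal sketch for looking at just the fixed columns, assuming a tab-delimited VCF 4.x file (the function name `vcf_core` is made up for this example):

```shell
# Print CHROM, POS, ID, REF, ALT, QUAL and FILTER for each record,
# skipping the header lines that start with '#'.
# Assumption: standard tab-delimited VCF with these as columns 1-7.
vcf_core() {
    awk -F'\t' -v OFS='\t' '!/^#/ { print $1, $2, $3, $4, $5, $6, $7 }' "$1"
}

# Usage on the cluster (not run here):
#   vcf_core raw_indels.vcf | tail -n 2
```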

Part Ia : Variant Calling

How many SNPs and indels were called? Any found in dbSNP?

    $ grep -c -v '^#' raw_snps.vcf
    $ grep -c -v '^#' raw_indels.vcf
    1070
    $ grep -c 'rs[0-9]*' raw_snps.vcf
    $ grep -c 'rs[0-9]*' raw_indels.vcf
    1019
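One caveat with the dbSNP count: the pattern 'rs[0-9]*' also matches any line that merely contains the letters "rs" (the '*' allows zero digits), header lines included, so it can overcount slightly. A stricter sketch, assuming a tab-delimited VCF in which dbSNP IDs appear in the ID column (column 3); the function names are made up for this example:

```shell
# Count variant records (non-header lines).
count_records() {
    awk -F'\t' '!/^#/ { n++ } END { print n + 0 }' "$1"
}

# Count records whose ID column (column 3) starts with an rs number.
# Assumption: dbSNP IDs are written to the ID column as rs<digits>.
count_dbsnp() {
    awk -F'\t' '!/^#/ && $3 ~ /^rs[0-9]+/ { n++ } END { print n + 0 }' "$1"
}

# Usage on the cluster (not run here):
#   count_records raw_snps.vcf
#   count_dbsnp raw_snps.vcf
```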

Part Ib : Hard filtering

We need to filter the variant calls.
Generally, for human data we would use variant quality score recalibration, but we have a very small set of variants, so here we use hard filtering.
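Hard filtering simply means applying fixed thresholds to annotations in the INFO column. The awk sketch below illustrates the idea only: it is not the GATK tool that hard_filtering.sh presumably runs, and the single QD (quality by depth) threshold is an assumption chosen for illustration.

```shell
# Conceptual hard filter: set FILTER (column 7) to PASS or LowQD
# depending on the QD value found in INFO (column 8).
# Arguments: VCF file, minimum QD. Records without a QD value pass.
# Note: failing records are labeled, not removed.
hard_filter_qd() {
    awk -F'\t' -v OFS='\t' -v min="$2" '
        /^#/ { print; next }
        {
            status = "PASS"
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^QD=/ && substr(kv[i], 4) + 0 < min + 0)
                    status = "LowQD"
            $7 = status
            print
        }' "$1"
}

# Usage on the cluster (not run here; output name is a placeholder):
#   hard_filter_qd raw_snps.vcf 2.0 > my_filtered_snps.vcf
```

Because failing records keep their line and only change their FILTER value, the total number of records stays the same before and after filtering in this sketch.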

Part Ib : Hard filtering

Start the hard filtering step. This will be fast. You will have two new VCF files in a minute:
hard_filtered_snps.vcf
hard_filtered_indels.vcf

    $ qsub hard_filtering.sh
    biocluster.igb.illinois.edu
    $ qstat -u
    biocluster.igb.illinois.edu:
    Req'd Req'd Elap
    Job ID  Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time  S  Time
    biocluste  cjfields  default  hard_filtering.s  gb  --  R  --


Part Ib : Hard filtering

What are we doing?
Questions:
Did we lose any variants?
How many PASS’ed the filter?
What is the difference between the filtered and raw output?

    $ grep -c 'PASS' hard_filtered_snps.vcf
    8270
    $ grep -c 'PASS' hard_filtered_indels.vcf
    1041
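The grep above counts every line that contains the string PASS anywhere. To answer all three questions at once, a tally of the FILTER column is handy. A minimal sketch, assuming a tab-delimited VCF with FILTER in column 7 (the function name `filter_tally` is made up for this example):

```shell
# Tally the values in the FILTER column (column 7) of a VCF,
# printing one "value count" pair per line in sorted order.
filter_tally() {
    awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f, n[f] }' "$1" | sort
}

# Usage on the cluster (not run here):
#   filter_tally raw_snps.vcf
#   filter_tally hard_filtered_snps.vcf
```

Comparing the two tallies shows whether the record total changed and how many records moved from unfiltered to PASS or to a named filter.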

Part Ic : Annotate the variants (SnpEff)

Run the next job, which uses SnpEff to add annotation to the VCF. This takes a couple of minutes… Two new VCF files:
hard_filtered_snps_annotated.vcf
hard_filtered_indels_annotated.vcf

    $ qsub annotate_snpeff.sh
    biocluster.igb.illinois.edu
    $ qstat -u
    biocluster.igb.illinois.edu:
    Req'd Req'd Elap
    Job ID  Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time  S  Time
    biocluste  cjfields  default  annotate_snpeff  gb  --  R  --

Part Ic : Annotate the variants (SnpEff)

SnpEff adds information about where the variants are in relation to specific genes.
The IDs for the human assembly version we use are from Ensembl (ENSGXXXXXXXXXXX).
The Ensembl ID for FOXA2 is ENSG


Part Ic : Annotate the variants (SnpEff)

Are there any variants called for FOXA2?
SnpEff also creates some additional output files; we’ll see those in a bit.

    $ grep -c 'ENSG ' hard_filtered_snps_annotated.vcf
    3
    $ grep -c 'ENSG ' hard_filtered_indels_annotated.vcf
    0
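The same lookup works for any gene. The sketch below takes the gene ID as an argument and searches only the INFO column (column 8), which is where SnpEff writes its annotations, treating the ID as a fixed string rather than a pattern. The function name `count_gene` and the variable `GENE_ID` are made up for this example; set `GENE_ID` to the full Ensembl ID from the slide.

```shell
# Count records whose INFO column (column 8) mentions the given gene ID.
# Arguments: gene ID (matched as a literal string), annotated VCF file.
count_gene() {
    gene_id="$1"
    vcf="$2"
    awk -F'\t' -v id="$gene_id" \
        '!/^#/ && index($8, id) > 0 { n++ } END { print n + 0 }' "$vcf"
}

# Usage on the cluster (not run here):
#   count_gene "$GENE_ID" hard_filtered_snps_annotated.vcf
#   count_gene "$GENE_ID" hard_filtered_indels_annotated.vcf
```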

Part Id : GATK VariantAnnotator

SnpEff adds a lot of information to the VCF.
GATK VariantAnnotator helps remove a lot of the extraneous information.

Part Id : GATK VariantAnnotator

The last step. This may take about 5-10 minutes:

    $ qsub post_annotate.sh
    biocluster.igb.illinois.edu
    $ qstat -u
    biocluster.igb.illinois.edu:
    Req'd Req'd Elap
    Job ID  Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time  S  Time
    biocluste  cjfields  default  post_annotate.sh  gb  --  R  00:01

While this is going on…

Let’s start a little tutorial on the Integrative Genomics Viewer (IGV, also from Broad).

Prelude to Part II

We need to download the results from your user folders to the local desktop.
We’ll use FileZilla for this.
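Before transferring anything, it can help to confirm that each step left a non-empty output file in the work folder. A minimal sketch (the function name `check_outputs` is made up for this example; the file names in the usage note are the ones listed in Part I):

```shell
# Report whether each named file exists and is non-empty.
# Returns success only if every file is present.
check_outputs() {
    missing=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok      $f"
        else
            echo "MISSING $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage on the cluster (not run here):
#   check_outputs raw_snps.vcf raw_indels.vcf \
#       hard_filtered_snps.vcf hard_filtered_indels.vcf \
#       hard_filtered_snps_annotated.vcf hard_filtered_indels_annotated.vcf
```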

FileZilla

Transfer folder to the desktop

Part II : Viewing Results in IGV

Open IGV.
Switch the genome to ‘Human (b37)’.