Material for today’s workshop is at: http://www.c3g.ca/computational-epigenetics-workshop/
2 days 1.5 hours
Launching jobs at Compute Canada David Bujold Epigenomic Data Analysis
What is HPC? High Performance Computing. “Traditional” in-house servers can quickly get overloaded. Compute Canada provides HPC resources to Canadian academic labs. There are also many options in the private sector, such as AWS. HPC uses clusters of computers; each individual computer in the cluster is called a node.
What is Compute Canada? A CFI-funded national platform integrating HPC resources at partner consortia across the country to create a dynamic computational resource. ACEnet, Calcul Québec, SciNet, HPCVL, SHARCNET, WestGrid.
Concepts connected to CC accounts: Compute Canada is a shared resource for Canadian academia. An account gives you access to free compute resources. You get a yearly allocation of compute time (in core-years) and storage space. Once logged in, you can launch compute jobs. A job is a software execution. Compute jobs use up the yearly allocation.
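Since compute time is counted in core-years, it can help to see how much of an allocation a single job consumes. The sketch below is only an illustration: the job size is hypothetical, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def core_years(cores: int, walltime_hours: float) -> float:
    """Return the compute time consumed by one job, in core-years."""
    return cores * walltime_hours / HOURS_PER_YEAR


# Hypothetical job: 16 cores running for 48 hours.
used = core_years(16, 48)
print(f"{used:.4f} core-years")
```

One core running for a full year is one core-year, so a job using many cores for a short time can cost as much as a long single-core job.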
How to get an account: First, apply for an account at the Compute Canada website: https://www.computecanada.ca/research-portal/apply-for-an-account/ Next, apply for an account in one of the consortia: log into the CCDB portal, follow the link "Apply for a Consortium Account", and choose to open an account at, for example, Calcul Québec. Finally, log into the Calcul Québec portal and request access to the desired HPC under the "My Profile" tab: https://portail.calculquebec.ca/accounts/login/
Concepts connected to CC accounts: When you log into an HPC, you are on a login node. Login nodes are the HPC entry point, from which users submit commands to the scheduler. The scheduler is a queuing system in which computation jobs wait for available compute nodes. Compute nodes are the nodes on which jobs get executed. Resources on login nodes are limited, so jobs should always be launched through the scheduler. HPC sysadmins don’t like jobs launched on the login nodes!
Scheduler: At Compute Canada HPCs, you launch jobs by submitting commands to the scheduler. When you launch a job, you can specify a number of cores (CPUs) and a walltime, the maximum amount of time the job can take (after which it gets killed). It’s important to set those numbers properly: jobs requesting less walltime get processed quicker, but they are killed if they run over.
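To make the two settings concrete, here is a minimal sketch that writes a job script requesting a number of cores and a walltime. The slides do not name the scheduler, so the `#SBATCH` directives below assume a Slurm-style scheduler; the helper names and the command are hypothetical, and nothing is actually submitted.

```python
def format_walltime(hours: int, minutes: int = 0) -> str:
    """Format a walltime as HH:MM:SS."""
    return f"{hours:02d}:{minutes:02d}:00"


def build_job_script(command: str, cores: int, hours: int, minutes: int = 0) -> str:
    """Return the text of a job script (assumes Slurm-style directives)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --cpus-per-task={cores}",  # number of cores (CPUs)
        f"#SBATCH --time={format_walltime(hours, minutes)}",  # walltime limit
        command,  # the software execution itself
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 4 cores for at most 2 hours 30 minutes.
script = build_job_script("echo 'hello from a compute node'", cores=4, hours=2, minutes=30)
print(script)
```

On a Slurm system the resulting file would typically be handed to `sbatch`; other schedulers use different directives, so check the documentation of the HPC you are using.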
Concepts connected to CC accounts: The time you will wait in the queue depends on many factors: how busy the HPC is, job length, number of cores (CPUs) needed, remaining allocation, etc. You can control things such as job length and the number of cores when submitting jobs to the scheduler. In this workshop, we will set the scheduler aside: software will be executed directly on an interactive node.
Software through GenAP Bioinformatics software pre-installed on Compute Canada https://www.genap.ca/ http://www.computationalgenomics.ca/cvmfs-modules/ http://www.computationalgenomics.ca/cvmfs-genomes/
Modules: Software is made available as loadable modules. To load the list of CVMFS modules: module use; to list all available software: module avail; to load a module: module load. You also need to load modules when you launch jobs on the scheduler.
Ready to see how it works? Let’s look at the ChIP-seq lab!
Module 3 Introduction to WGBS and analysis Guillaume Bourque Epigenomic Data Analysis
Bisulfite treatment Xi and Li, BMC Bioinformatics, 2009
Workflow for analyzing BS-data. Processing of bisulfite-sequencing data: quality control and pre-processing; bisulfite sequence alignment; quantification of absolute DNA methylation. Data visualization and statistical analysis: visual inspection of selected regions in a genome browser; visualization of the global distribution of methylation values; clustering of samples based on similarity. Downstream analysis: identification of Differentially Methylated Regions (DMRs); global analysis of DMRs.
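The quantification step can be illustrated with a small sketch. It assumes the usual definition of the methylation level at a cytosine, namely the fraction of reads covering it that still show a C after bisulfite treatment; the function name and the counts are hypothetical.

```python
from typing import Optional


def methylation_level(methylated: int, unmethylated: int) -> Optional[float]:
    """Fraction of reads supporting methylation, or None without coverage."""
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


# Hypothetical per-site read counts: (methylated reads, unmethylated reads).
counts = {
    ("chr1", 1000): (18, 2),
    ("chr1", 1050): (0, 15),
    ("chr1", 1100): (0, 0),
}

levels = {site: methylation_level(m, u) for site, (m, u) in counts.items()}
for (chrom, pos), level in levels.items():
    shown = "no coverage" if level is None else f"{level:.2f}"
    print(f"{chrom}:{pos}\t{shown}")
```

Returning `None` for uncovered sites keeps them distinct from sites that are covered but fully unmethylated.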
Quality metrics: read quality, presence of adapter sequences, duplicate rates, conversion rate
ENCODE WGBS Standards: Experiments should have two or more biological replicates; they may have two technical replicates per biological replicate. The C to T conversion rate should be ≥98%. The CpG quantification should have a Pearson correlation of ≥0.8 for sites with ≥10X coverage. Sequencing may be paired- or single-ended, as long as sequencing type is specified and paired sequences are indicated. The experiment must pass routine metadata audits in order to be released. https://www.encodeproject.org/wgbs/
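The two numeric thresholds can be checked with a few lines of code. This is a sketch under stated assumptions, not the ENCODE implementation: the conversion rate is assumed to be estimated from cytosines known to be unmethylated (for example a spike-in control), replicates are assumed to be stored as dictionaries of (methylated reads, coverage), and all names and counts are hypothetical.

```python
from math import sqrt


def conversion_rate(converted: int, unconverted: int) -> float:
    """Fraction of known-unmethylated cytosines read as T."""
    return converted / (converted + unconverted)


def pearson(x: list, y: list) -> float:
    """Pearson correlation (undefined if either list has zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def replicate_correlation(rep1: dict, rep2: dict, min_cov: int = 10) -> float:
    """Correlate methylation levels at sites with >= min_cov reads in both."""
    shared = [s for s in rep1
              if s in rep2 and rep1[s][1] >= min_cov and rep2[s][1] >= min_cov]
    x = [rep1[s][0] / rep1[s][1] for s in shared]
    y = [rep2[s][0] / rep2[s][1] for s in shared]
    return pearson(x, y)


# Hypothetical data: site -> (methylated reads, coverage).
rep1 = {"cpg1": (9, 10), "cpg2": (2, 20), "cpg3": (6, 12), "cpg4": (1, 5)}
rep2 = {"cpg1": (16, 20), "cpg2": (2, 10), "cpg3": (7, 14), "cpg4": (3, 30)}

rate = conversion_rate(converted=9900, unconverted=100)
r = replicate_correlation(rep1, rep2)
print("conversion rate ok:", rate >= 0.98)
print("replicate correlation ok:", r >= 0.8)
```

Here `cpg4` is dropped from the correlation because one replicate covers it with fewer than 10 reads.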
Bisulphite sequence alignment Bock, Nat Rev Genet, 2012
Bismark
Visualizing BS-seq data in IGV https://www.broadinstitute.org/igv
GenPipes – Methyl-seq pipeline http://www.computationalgenomics.ca/genpipes/
Ready to see how it works? Let’s look at the WGBS lab!