Some hints and tips for bioinformatics in the real world Servers, R, online reports and an example Robert William Davies Feb 5, 2015.


Outline
1 – How to be a good server citizen
2 – Some useful tricks in R (including ESS)
3 – Project reporting using github + knitr
4 – NGS pipeline example – wild mice

1 – How to be a good server citizen
Server throughput is affected by:
- CPU usage: cat /proc/cpuinfo, top or htop
- RAM: top and htop
- Swap space: when your computer doesn't have enough RAM for open jobs, it puts some on the hard disk. This is BAD
- Disk input/output and space: iostat, df, du

Check CPU information using cat /proc/cpuinfo
cat /proc/cpuinfo | head
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD Opteron(tm) Processor 6344
stepping : 0
microcode : 0x600081c
cpu MHz :
cache size : 2048 KB
physical id : 0
cat /proc/cpuinfo | grep processor | wc -l
48
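The core count above can also be read in one step. This is a minimal sketch, assuming a Linux server where /proc/cpuinfo exists; the helper name count_cores is made up for illustration, and nproc (GNU coreutils) is only used as a fallback.

```shell
#!/bin/sh
# count_cores is a hypothetical helper name, not part of any tool.
# grep -c counts matching lines directly, so no pipe to wc -l is needed.
count_cores() {
    if [ -r /proc/cpuinfo ]; then
        grep -c '^processor' /proc/cpuinfo
    else
        # Fallback for systems without /proc/cpuinfo (assumes coreutils nproc).
        nproc
    fi
}

echo "logical cores: $(count_cores)"
```

The number counts logical cores, which is the figure to compare the load average against.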

Check RAM amount + usage and CPU usage with htop and top
- 48 cores
- Load average: average over 1, 5, 15 minutes
- RAM: 512GB total, 142 in use (rest free)

Check disk use using iostat
iostat -m -x 2
- Annotations on the output: high sequential reading (fast!); relatively unused
- Also note the state column in top and htop: D = limited by IO
- There are also ways to optimize disk use for different IO requirements on a server; ask Warren Kretzschmar
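The D state mentioned above can be counted from ps output as a quick check of how many processes are waiting on IO. This is a sketch only: count_io_blocked is a hypothetical helper, and the ps invocation in the comment assumes the procps version of ps found on most Linux servers.

```shell
#!/bin/sh
# count_io_blocked is a hypothetical helper: it reads ps output on stdin
# and counts processes whose state starts with D (uninterruptible sleep,
# usually waiting on disk IO). The first line is treated as a header.
count_io_blocked() {
    awk 'NR > 1 && $1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Typical use on a Linux server (procps ps assumed):
#   ps -eo state,pid,comm | count_io_blocked
# Small canned example so the sketch runs anywhere:
printf 'S PID COMMAND\nD 101 samtools\nS 102 bash\nD 103 gzip\n' | count_io_blocked
```

A count that stays high while CPU usage is low suggests the jobs are limited by the disk rather than by the cores.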

Check disk usage using du and df
- du: get sizes of directories
- df: get available disk space for drives
Options:
-h, --human-readable print sizes in human readable format (e.g., 1K 234M 2G)
-s, --summarize display only a total for each argument
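A short sketch of how these two commands are typically combined. The directory is a throwaway one created just for the example, so the paths here are illustrative rather than anything from the talk.

```shell
#!/bin/sh
# Make a throwaway directory so the commands have something to measure.
workdir=$(mktemp -d)
printf 'some data\n' > "$workdir/example.txt"

# du -sh: one human-readable total for the directory.
du -sh "$workdir"

# df -h: used and free space on the file system that holds it.
df -h "$workdir"

# Largest entries first: sizes in KB (-k), numeric reverse sort, top 5.
du -sk "$workdir"/* | sort -rn | head -n 5

rm -rf "$workdir"
```

Pointing the last command at a project directory is a quick way to find what is filling a disk.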

1 – How to be a good server citizen – take away
CPU usage
- Different servers / groups have different philosophies
- In general, try for load <= number of cores
RAM
- High-memory jobs can take down a server very easily by pushing RAM to swap, and will make others very mad at you. Best to avoid
Disk IO
- For IO-bound jobs you often get better combined throughput from running one or a few jobs than many in parallel. Test to determine which is best for you
- Also, try to avoid clogging up disks
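The "load <= number of cores" rule of thumb can be checked before launching more jobs. A minimal sketch, assuming Linux (/proc/loadavg and /proc/cpuinfo); load_ok is a hypothetical helper name.

```shell
#!/bin/sh
# load_ok is a hypothetical helper: it succeeds when the load average is at
# or below the number of cores. awk does the floating-point comparison.
load_ok() {
    load=$1
    cores=$2
    awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l <= c) }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ]; then
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(grep -c '^processor' /proc/cpuinfo)
    if load_ok "$load" "$cores"; then
        echo "load $load <= $cores cores: room for more jobs"
    else
        echo "load $load > $cores cores: wait before starting more jobs"
    fi
fi
```

This only looks at CPU load; RAM and disk IO still need to be checked separately with top, htop and iostat.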

2 – Some useful tricks in R (including ESS)
R is a commonly used programming language / statistical environment
Pros
- (Almost) everyone uses it (especially in Genetics), so it's very easy to use for collaborations
- Very easy to learn and use
Cons
- It's "slow"
- It can't do X
But! R can be faster, and it might be able to do X! Here I'll show a few tricks

R editors: ESS (Emacs Speaks Statistics)
- There are many R editors I won't talk about here (RStudio comes to mind)
- Emacs is a general-purpose text editor. An extension to Emacs called ESS allows you to use R within Emacs
- This allows you to analyze data on a server very nicely using a split-screen environment and keyboard shortcuts to run your code

I have code on the left, an R terminal on the right
- Running a line of code: ctrl-c ctrl-j
- Running a paragraph of code: ctrl-c ctrl-p
- Switching windows: ctrl-x o

Google: ESS cheat sheet
C- = ctrl, M- = option key
It's easy to find cheat sheets for editors like emacs+ESS

R Rsamtools
- R package that gives you basic access to BAM files
- Useful if you want to manually interrogate BAM files
- For example, get the reads in an interval, then calculate the number of reads, average mapping quality, etc.

R mclapply
- lapply: apply a function to members of a list
- mclapply: do it multicore!
- Note there is a spawning cost that depends on the memory of the current R job
- Not 19X faster, due to chromosome size differences
- Also, I ran this on a 48 core server with a load of 40

R ff
- Save R objects (like matrices) to disk in a non-human-readable format. Later, you can reload part of a matrix instead of the whole thing
- Example: matrix of 583,937 rows, 106 columns. Accessing 1 entry takes 0.01 seconds with ff and 2 seconds when you load the whole thing into R first
- Bonus: you can write to different entries in an ff file using different processes!

R Rcpp
- The only thing I've ever been stuck on running fast in R is long for loops with dependent elements, like when writing an HMM
- Here, I used C++ in R to take a reference genome (size 60M) coded as an integer 0 to 3, and calculate the number of Kmers of size K
- I write the C++ as a vector in R, compile it using R (which takes a few seconds), then call the function as I would any other in R
- Works with multi-variable input and output using lists
- Note that you can call fancy R things from C++

2 – Some useful tricks in R (including ESS) – take away
- A lot of people complain about R being slow, but it's really not that slow
- Lots of packages exist for speeding up your code, including Rcpp, ff, multicore, Rsamtools, etc
- Spend the time finding an editor that works for you (emacs+ESS, vi, RStudio, etc). It will save you a lot of time as you memorize keyboard shortcuts

3 – Project reporting using knitr+github
- "Robbie, what if you were to use alpha=2 instead of alpha=3? Surely alpha=2 is better"
- "Robbie, why don't you try filtering out X? I think that would improve things"
- "Robbie, can you send me new figures showing the effect of alpha=2?"
- "Sorry, actually now that I've thought about it I decided that alpha=3 is better"

What are knitr and github?
Knitr
- Write R code to automatically generate PDF (using latex) or markdown (fancy html) files from results and parameters
- When results change, your output automatically incorporates those changes!
Github
- Traditionally used for hosting code, versioning, collaborating, etc.
- Can also be used to host project output online

Setting up a knitr+github pipeline
Cons
- Takes an afternoon to set up
- Everything takes ~20-60 minutes longer as you write code to put it online
Pros
- You can make small changes and easily regenerate all of your downstream plots and tables
- Everything is neat and organized: less scrambling to find files / code 6+ months later

Real life examples My github for one of my projects Kiran github for PacBio malaria sequencing eports/FirstLook eports/FirstLook

Real life example: changing a small parameter
- 2015_01_06: earlier version
- 2015_01_22: made a small change to a filtering condition in the middle of the pipeline. New downstream plot is similar (but better)!

3 – Project reporting using github + knitr – take away
- Some start-up cost
- Once set up, allows you to very easily modify parameters and re-run analysis
- Easy to return to and look up how you made all your figures, tables, etc
- I will use this or something similar for every subsequent project I'm involved with

4 – An example of an NGS pipeline – wild mice analysis
- We have data (fastQs) on 69 mice
- We want VCFs (genotypes at SNPs) to build recombination rate maps and to do population genetic analyses
- Here I will discuss what the pipeline involved in terms of software + run times

[Figure: sample sizes and sequencing coverage per group. Labels: N=1 – 40X; N=1 – 40X; N=1 – 40X; N=20 – 10X; N=10 – 30X; N= X; N=13 – 40X. Subspecies: M. m. domesticus, M. m. castaneus, M. m. musculus]

Pipeline steps
1. bwa aln -q 10
2. Stampy --bamkeepgoodreads
3. Add read group info
4. Merge into library-level BAM using Picard MergeSamFiles
5. Picard MarkDuplicates
6. Merge into sample-level BAM
7. Use GATK RealignerTargetCreator on each population
8. Realign using GATK IndelRealigner per BAM
9. Use GATK UnifiedGenotyper on each population to create a list of putative variant sites
10. GATK BaseRecalibrator to generate recalibration tables per mouse
11. GATK PrintReads to apply recalibration
Result: 69 analysis-ready BAMs!
6 pops: 20 French, 20 Taiwan, 10 Indian, 17 Lab mice, 1 Fam, 1 Caroli
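The realignment and recalibration steps can be sketched as a dry run that only prints commands. This is not the pipeline actually used for the wild mice: it assumes GATK 3.x style command lines, and ref.fa, the mouse BAM names, putative_sites.vcf, french.intervals and the jar path are all hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: prints GATK 3.x style commands instead of running them.
# ref.fa, the BAM names, putative_sites.vcf and the jar path are hypothetical.
REF=ref.fa
GATK="java -jar GenomeAnalysisTK.jar"

# Print the realignment and recalibration commands for one sample BAM.
# $1 = sample name, $2 = interval file from RealignerTargetCreator
gatk_steps() {
    s=$1
    intervals=$2
    echo "$GATK -T IndelRealigner -R $REF -I $s.bam -targetIntervals $intervals -o $s.realigned.bam"
    echo "$GATK -T BaseRecalibrator -R $REF -I $s.realigned.bam -knownSites putative_sites.vcf -o $s.recal.table"
    echo "$GATK -T PrintReads -R $REF -I $s.realigned.bam -BQSR $s.recal.table -o $s.recal.bam"
}

# Interval finding is done once per population, using every BAM in it.
echo "$GATK -T RealignerTargetCreator -R $REF -I mouse1.bam -I mouse2.bam -o french.intervals"
for sample in mouse1 mouse2; do
    gatk_steps "$sample" french.intervals
done
```

Printing the commands first makes it easy to review them, or to hand each line to a cluster scheduler, before committing to steps that take hours or days per BAM.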

Example for 1 Mus caroli (~2.5 GB genome, ~50X coverage)
- Downloaded 95 GB of gzipped .sra (15 files)
- Turned back into FQs (relatively fast) (30 files)
- bwa: about 2 days at 40 AMD cores (86 GB output, 30 files)
- Merged 30 -> 15 files (215 GB)
- stampy: cluster 3, about 2-3 days, 1500 jobs (293 GB output, 1500 files)
- Merge stampy jobs together, turn into BAMs (220 GB, 15 files)
- Merge library BAMs together, then remove duplicates per library, then merge and sort into final BAM (1 output, took about 2 days, 1 AMD core). 1 BAM, 170 GB
- Indel realignment, find intervals: 16 Intel cores, fast (30 mins)
- Indel realignment, apply realignment: 1 Intel core, slower. 1 BAM, 170 GB
- BQSR, call putative set of variants: 16 Intel cores (<2 hours)
- BQSR, generate recalibration tables: 16 Intel cores, 10.2 hours (note: used a relatively new GATK which allows multi-threading for this)
- BQSR, output: 1 Intel core, 37.6 hours. 1 BAM, 231 GB
NOTE: GATK also has scatter-gather for cluster work. Probably worthwhile to investigate if you're working on a project with 10T+ data

Wildmice – calling variants
We made two sets of callsets using the GATK:
- 3 population-specific (Indian, French, Taiwanese), principally for estimating recombination rate. This is susceptible to false positives (FP), so prioritize low error at the expense of sensitivity
- Combined, for pop gen
We used the GATK to call variants and VQSR to filter

What is the VQSR? (Variant Quality Score Recalibrator)
- Take the raw callset
- Split into known (array, dbSNP, etc) and novel
- Fit a Gaussian Mixture Model (GMM) on QC parameters of the known variants
- Keep the novel variants that are close to the GMM, remove those that are far away
Ti/Tv: expect ~2.15 genome wide, higher in genic regions

Columns: Population, Training, Sensitivity, HetsInHomE, chrXHetE, nSNPs, TiTv, arrayCon, arraySen
French, Array Filtered: ,957,
French, Array Filtered: ,606,
French, Array Filtered: ,353,
French, Array Not Filt: ,071,
French, Array Not Filt: ,369,
French, Array Not Filt: ,008,
French, 17 Strains: ,805,
French, 17 Strains: ,547,
French, 17 Strains: ,843,
French, Hard Filters, NA: ,805,
It's a good idea to benchmark your SNP data and decide on the one with the parameters that suit the needs of your project (like sensitivity (finding everything) vs specificity (being right))

Column definitions
- Sensitivity: you set this. How much of your training set do you want to recover?
- HetsInHomE: look at homozygous regions in the mouse. How many hets do you see?
- chrXHetE: look at chromosome X in males. How many hets do you see?
- nSNPs: number of SNPs
- TiTv: transition/transversion ratio. Expect ~2.15 for real, 0.5 for FP
- arrayCon: concordance with array genotypes
- arraySen: sensitivity for polymorphic array sites
We chose a dataset for recombination rate estimation with a low error rate but still a good number of SNPs
Notes
- VQSR sensitivity is not always "calibrated"
- It's a good idea to benchmark your callsets and decide on the one with the parameters that suit the needs of your project (like sensitivity (finding everything) vs specificity (being right))
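The TiTv column can be reproduced for any callset with a few lines of awk. This is a sketch, not the benchmarking code behind the table: titv is a hypothetical helper, it only looks at the REF and ALT columns of uncompressed VCF text, and it ignores indels and multi-allelic sites.

```shell
#!/bin/sh
# titv is a hypothetical helper: reads VCF text on stdin and prints the
# transition/transversion ratio for biallelic SNPs.
# Transitions are A<->G and C<->T; any other single-base change is a transversion.
titv() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        $4 ~ /^[ACGTacgt]$/ && $5 ~ /^[ACGTacgt]$/ {    # single-base REF and ALT only
            pair = toupper($4 $5)
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
            else tv++
        }
        END { if (tv > 0) printf "%.2f\n", ti / tv; else print "NA" }
    '
}

# Tiny made-up VCF body: 2 transitions (A>G, C>T) and 1 transversion (A>C).
printf '1\t100\t.\tA\tG\t50\tPASS\t.\n1\t200\t.\tC\tT\t50\tPASS\t.\n1\t300\t.\tA\tC\t50\tPASS\t.\n' | titv
```

For a gzipped file the same function can be fed with zcat; comparing the value across filtering settings is one of the cheapest benchmarks available.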

Columns: Population, Training, Sensitivity, HetsInHomE, chrXHetE, nSNPs, TiTv, arrayCon, arraySen
Taiwan, Array Not Filt: ,344, NA
Taiwan, Array Not Filt: ,183, NA
Taiwan, Array Not Filt: ,864, NA
Taiwan, 17 Strains: ,748, NA
Taiwan, 17 Strains: ,112, NA
Taiwan, 17 Strains: ,549, NA
Taiwan, Hard Filters, NA: ,692, NA
Indian, Array Not Filt: ,190, NA
Indian, Array Not Filt: ,134, NA
Indian, Array Not Filt: ,220, NA
Indian, 17 Strains: ,674, NA
Indian, 17 Strains: ,981, NA
Indian, 17 Strains: ,103, NA
Indian, Hard Filters, NA: ,487, NA
All, Array Not Filt: ,827,
All, Array Not Filt: ,447,
All, Array Not Filt: ,977,
- Some of the datasets are extremely big
- Combined datasets allow us to better evaluate differences between populations
Notes
- VQSR sensitivity is not always "calibrated"
- Be VERY skeptical of the work of others wrt sensitivity and specificity that depends on NGS. Different filtering on different datasets can often explain a lot

Huge Taiwan and French bottleneck, India OK
Homozygosity = red. French and Taiwanese very inbred, not so for the Indian mice
[Figure panels: Taiwan, France, India]

Admixture / introgression common
Recent admixture is visible in French and Taiwanese populations

- French hotspots are cold in Taiwan and vice-versa
- Our Domesticus hotspots are enriched in an already known Domesticus motif
- Broad-scale correlation is conserved between subspecies, like in humans vs chimps

4 – An example of an NGS pipeline – wild mice analysis – take away
- All the stuff involving BAMs is slow. Take care and try to avoid mistakes, but redo analyses if appropriate to fix them
- If you're doing human stuff, you can probably get away with Ti/Tv for SNP filtering. If not human, try to set up benchmarks to guide SNP calling and filtering
- Boy do I wish I had used some sort of knitr + github reporting system (for the downstream stuff)

Extra 1 – Useful random linux
- screen: log onto a server and start a "screen" session. You can then disconnect from the server and reconnect at a later time with all your programs still open
- Set up password-less ssh using public-private keys! Google "password less ssh"
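Both tips can be sketched as small shell functions. The session name and the user@server address are placeholders, the function names are made up for illustration, and nothing runs until you call one of them. The sketch assumes OpenSSH (ssh-keygen, ssh-copy-id) and GNU screen are installed.

```shell
#!/bin/sh
# Hypothetical names throughout: the function names, the session name and
# user@server are placeholders. The functions are only defined here.

# One-time setup on your own machine: make a key pair, then install the
# public key on the server so later logins need no password.
setup_keys() {
    ssh-keygen -t ed25519
    ssh-copy-id "$1"          # e.g. setup_keys user@server
}

# On the server: start a named screen session.
# Detach with ctrl-a d; the programs keep running after you log out.
start_session() {
    screen -S "$1"            # e.g. start_session analysis
}

# Later, after logging in again: reattach to the named session.
resume_session() {
    screen -r "$1"
}

# List the sessions you currently have on this server.
list_sessions() {
    screen -ls
}
```

Giving each session a name makes it easy to keep one per project and to find the right one again after a few days away.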

Extra 2 – Give some thought to folder organization

Conclusions
- Please don't crash the server
- Please don't hog the server without reason (especially RAM and disk IO!)
- Consider something like emacs and ESS for quick programming in R
- R is pretty fast if you program it right, and there are lots of packages and tricks to make it faster
- Consider something like iPython or knitr (+/- github) to document your work and auto-generate reports on long projects
- Sequencing data is big, slow and unwieldy. But it is very informative!

Acknowledgements
- Simon Myers: supervisor
- Jonathan Flint, Richard Mott: close collaborators
- Oliver Venn: recombination work for wild mice
- Kiran Garimella: GATK, github
- Cai Na: pre-processing pipeline
- Winni Kretzschmar: ESS, many other things
- Amelie Baud, Binnaz Yalcin, Xiangchao Gan and many others for the wild mice