Some hints and tips for bioinformatics in the real world
Servers, R, online reports and an example
Robert William Davies, Feb 5, 2015
Outline
1 – How to be a good server citizen
2 – Some useful tricks in R (including ESS)
3 – Project reporting using github + knitr
4 – NGS pipeline example – wild mice
1 – How to be a good server citizen
Server throughput is affected by:
- CPU usage: cat /proc/cpuinfo, top or htop
- RAM: top and htop
- Swap space: when the machine doesn't have enough RAM for running jobs, it puts some on the hard disk. This is BAD
- Disk input/output and space: iostat, df, du
Check CPU information using cat /proc/cpuinfo
rwdavies@dense:~$ cat /proc/cpuinfo | head
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6344
stepping        : 0
microcode       : 0x600081c
cpu MHz         : 1400.000
cache size      : 2048 KB
physical id     : 0
rwdavies@dense:~$ cat /proc/cpuinfo | grep processor | wc -l
48
Check RAM amount + usage and CPU usage using top and htop
- 48 cores
- Load average: averaged over 1, 5 and 15 minutes
- RAM: 512 GB total, 142 GB in use (rest free)
Check disk I/O using iostat
rwdavies@dense:~$ iostat -m -x 2
In the example output, one drive shows high sequential reading (fast!) while the others are relatively unused.
Also note the process state column in top and htop: state D means uninterruptible sleep, i.e. the job is limited by IO.
There are also ways to optimize disk use for different IO requirements on a server – ask Warren Kretschmar
Check disk usage using du and df
- du: get sizes of directories
- df: get available disk space for drives
Useful flags:
- -h, --human-readable: print sizes in human readable format (e.g., 1K 234M 2G)
- -s, --summarize: display only a total for each argument
1 – How to be a good server citizen: take away
- CPU usage: different servers / groups have different philosophies. In general, aim for load <= number of cores
- RAM: high-memory jobs can take down a server very easily by pushing RAM into swap, and will make others very mad at you – best to avoid
- Disk IO: for IO-bound jobs you often get better combined throughput from running one or a few jobs than from many in parallel. Test to determine what is best for you. Also, don't forget to avoid clogging up disks
2 – Some useful tricks in R (including ESS)
R is a commonly used programming language / statistical environment.
Pros:
- (Almost) everyone uses it (especially in genetics), so it's very easy to use for collaborations
- Very easy to learn and use
Cons:
- It's "slow"
- It can't do X
But! R can be faster, and it might be able to do X! Here I'll show a few tricks.
R editors: ESS (Emacs Speaks Statistics)
There are many R editors I won't talk about here (RStudio comes to mind).
Emacs is a general-purpose text editor; ESS is an extension to Emacs that lets you use R within it.
This allows you to analyze data on a server very nicely, using a split-screen environment and keyboard shortcuts to run your code.
The typical layout: code on the left, an R terminal on the right.
- Run a line of code: ctrl-c ctrl-j
- Run a paragraph of code: ctrl-c ctrl-p
- Switch windows: ctrl-x o
Google "ESS cheat sheet": http://ess.r-project.org/refcard.pdf
C- = ctrl, M- = meta (Alt, or the Option key on a Mac)
It's easy to find cheat sheets for editors like emacs+ESS.
Rsamtools
An R package that gives you basic access to BAM files. Useful if you want to manually interrogate BAM files – for example, get the reads in an interval, then calculate average mapping quality, etc.
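A minimal sketch of this kind of interrogation (the BAM file name and interval here are made up):

    library(Rsamtools)
    library(GenomicRanges)

    # hypothetical BAM file and interval
    param <- ScanBamParam(which = GRanges("chr1", IRanges(1e6, 2e6)),
                          what = "mapq")
    reads <- scanBam("sample.bam", param = param)[[1]]
    length(reads$mapq)              # number of reads in the interval
    mean(reads$mapq, na.rm = TRUE)  # average mapping quality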
R mclapply
- lapply: apply a function to the members of a list
- mclapply: do it multicore! (sketch below)
- Note there is a spawning cost per worker, which depends on the memory of the current R job
- In my per-chromosome example it was not 19X faster on 19 chromosomes, due to chromosome size differences
- Also, I ran this on a 48-core server that already had a load of 40
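A sketch of the pattern (the per-chromosome function here is a stand-in):

    library(parallel)

    # hypothetical per-chromosome analysis
    analyzeChr <- function(chr) {
      sum(rnorm(1e6))  # stand-in for real work
    }

    results <- lapply(1:19, analyzeChr)                  # serial
    results <- mclapply(1:19, analyzeChr, mc.cores = 8)  # forked, multicore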
R ff
Save R objects (like matrices) to disk in a non-human-readable format. Later, you can reload part of a matrix instead of the whole thing.
Example: for a matrix of 583,937 rows and 106 columns, accessing 1 entry takes 0.01 seconds with ff, versus 2 seconds when you load the whole thing into R first.
Bonus: you can write to different entries in an ff file from different processes!
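A minimal sketch (dimensions from the example above; the file name is made up):

    library(ff)

    # matrix backed by a file on disk rather than held in RAM
    m <- ff(vmode = "double", dim = c(583937, 106), filename = "mat.ff")
    m[1000, 5] <- 2.5  # write one entry
    m[1000, 5]         # read it back without loading the whole matrix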
R Rcpp
The only thing I've ever been stuck on running fast in R is long for loops with dependent elements, like when writing an HMM.
Here, I used C++ in R to take a reference genome (size 60M) coded as integers 0 to 3, and calculate the number of K-mers of size K.
I write the C++ as a character string in R, compile it from R (which takes a few seconds), then call the function as I would any other in R.
It works with multi-variable input and output using lists. Note that you can call fancy R things from C++.
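A sketch of the K-mer counter idea using Rcpp::cppFunction (my reconstruction, not the original code):

    library(Rcpp)

    cppFunction('
    IntegerVector countKmers(IntegerVector g, int K) {
      int nK = 1;
      for (int i = 0; i < K; i++) nK *= 4;   // 4^K possible K-mers
      IntegerVector counts(nK);              // initialized to zero
      for (int i = 0; i + K <= g.size(); i++) {
        int code = 0;
        for (int j = 0; j < K; j++) code = 4 * code + g[i + j];
        counts[code]++;
      }
      return counts;
    }')

    genome <- sample(0:3, 1e6, replace = TRUE)  # toy stand-in genome
    counts <- countKmers(genome, 4)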
2 – Some useful tricks in R (including ESS): take away
- A lot of people complain about R being slow, but it's really not that slow
- Lots of packages exist for speeding up your code, including Rcpp, ff, multicore, Rsamtools, etc.
- Spend the time finding an editor that works for you (emacs+ESS, vi, RStudio, etc.). It will save you a lot of time as you memorize keyboard shortcuts
3 – Project reporting using knitr+github
"Robbie, what if you were to use alpha=2 instead of alpha=3? Surely alpha=2 is better"
"Robbie, why don't you try filtering out X? I think that would improve things"
"Robbie, can you send me new figures showing the effect of alpha=2?"
"Sorry, actually now that I've thought about it I decided that alpha=3 is better"
What are knitr and github?
knitr:
- Write R code to automatically generate PDF (via latex) or markdown (fancy html) files from results and parameters
- When results change, your output automatically incorporates those changes!
github:
- Traditionally used for hosting code, versioning, collaborating, etc.
- Can also be used to host project output online
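A minimal .Rmd sketch of the idea (the file and chunk names are made up): change alpha and re-knit, and every downstream figure updates.

    ---
    title: "Report"
    ---

    ```{r parameters}
    alpha <- 3  # change to 2 and re-knit
    ```

    ```{r downstream-plot}
    plot(density(rnorm(1000)^alpha))
    ```

Render with rmarkdown::render("report.Rmd"), or knitr::knit for plain markdown output.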
Setting up a knitr+github pipeline
Cons:
- Takes an afternoon to set up
- Everything takes ~20-60 minutes longer, as you write code to put it online
Pros:
- You can make small changes and easily regenerate all of your downstream plots and tables
- Everything is neat and organized – less scrambling to find files / code 6+ months later
Real life examples
My github for one of my projects: https://github.com/rwdavies/hotspotDeath
Kiran's github for PacBio malaria sequencing: https://github.com/kvg/PacBio/tree/master/reports/FirstLook
Real life example: changing a small parameter
- 2015_01_06: earlier version
- 2015_01_22: made a small change to a filtering condition in the middle of the pipeline. The new downstream plot is similar (but better)!
3 – Project reporting using github + knitr: take away
- Some start-up cost
- Once set up, allows you to very easily modify parameters and re-run analyses
- Easy to return to and look up how you made all your figures, tables, etc.
- I will use this or something similar for every subsequent project I'm involved with
4 – An example of an NGS pipeline: wild mice analysis
We have data (fastQs) on 69 mice. We want VCFs (genotypes at SNPs) to build recombination rate maps and to do population genetic analyses.
Here I will discuss what the pipeline involved in terms of software and run times.
[Figure: sample cohorts by subspecies (M. m. domesticus, M. m. castaneus, M. m. musculus), ranging from single mice at 40X coverage to cohorts of N=10-20 at 10-30X]
Pipeline (6 pops: 20 French, 20 Taiwan, 10 Indian, 17 lab mice, 1 Fam, 1 Caroli):
1. bwa aln -q 10
2. Stampy --bamkeepgoodreads
3. Add read group info
4. Merge into library-level BAM using Picard MergeSamFiles
5. Picard MarkDuplicates
6. Merge into sample-level BAM
7. GATK RealignerTargetCreator on each population
8. GATK IndelRealigner per BAM
9. GATK UnifiedGenotyper on each population to create a list of putative variant sites
10. GATK BaseRecalibrator to generate recalibration tables per mouse
11. GATK PrintReads to apply recalibration
Result: 69 analysis-ready BAMs!
Example for 1 Mus caroli (~2.5 GB genome, ~50X coverage):
- Downloaded 95 GB of gzipped .sra (15 files)
- Turned back into FQs (relatively fast) (30 files)
- bwa: about 2 days on 40 AMD cores (86 GB output, 30 files)
- Merged 30 -> 15 files (215 GB)
- stampy: cluster 3, about 2-3 days, 1500 jobs (293 GB output, 1500 files)
- Merged stampy jobs together, turned into BAMs (220 GB, 15 files)
- Merged library BAMs together, then removed duplicates per library, then merged and sorted into final BAM (1 output, took about 2 days, 1 AMD core): 1 BAM, 170 GB
- Indel realignment, find intervals: 16 Intel cores, fast (30 mins)
- Apply realignment: 1 Intel core, slower: 1 BAM, 170 GB
- BQSR, call putative set of variants: 16 Intel cores (<2 hours)
- BQSR, generate recalibration tables: 16 Intel cores, 10.2 hours (note: used a relatively new GATK which allows multi-threading for this)
- BQSR, output: 1 Intel core, 37.6 hours: 1 BAM, 231 GB
NOTE: GATK also has scatter-gather for cluster work – probably worthwhile to investigate if you're working on a project with 10T+ of data
Wild mice: calling variants
We made two sets of callsets using the GATK:
- 3 population-specific callsets (Indian, French, Taiwanese), principally for estimating recombination rates. This analysis is susceptible to false positives, so we prioritized low error at the expense of sensitivity
- A combined callset, for pop gen
We used the GATK to call variants and the VQSR to filter.
What is the VQSR? (Variant Quality Score Recalibrator)
- Take the raw callset and split it into known and novel (array, dbSNP, etc.)
- Fit a Gaussian mixture model (GMM) on QC parameters of the known variants (see the sketch below)
- Keep the novel variants that are close to the GMM, remove those far away
Ti/Tv: expect ~2.15 genome-wide, higher in genic regions.
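To give the flavor of the GMM step (this is just the idea, not GATK's implementation; the QC annotations below are made up), one could do something like:

    library(mclust)

    # made-up QC annotations for known (trusted) sites
    known <- data.frame(QD = rnorm(5000, 20, 5),
                        MQ = rnorm(5000, 60, 3))
    fit <- densityMclust(known)          # Gaussian mixture density estimate
    # score novel sites by their density under the fitted mixture
    novel <- data.frame(QD = rnorm(1000, 12, 8),
                        MQ = rnorm(1000, 50, 10))
    dens <- predict(fit, newdata = novel)
    keep <- dens > quantile(dens, 0.05)  # drop sites far from the mixture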
Population | Training | Sensitivity | HetsInHomE | chrXHetE | nSNPs | TiTv | arrayCon | arraySen
French | Array Filtered | 95 | 0.64 | 1.97 | 12,957,830 | 2.20 | 99.08 | 94.02
French | Array Filtered | 97 | 0.72 | 2.28 | 14,606,149 | 2.19 | 99.07 | 96.01
French | Array Filtered | 99 | 1.12 | 3.62 | 17,353,264 | 2.16 | 99.06 | 98.09
French | Array Not Filt | 95 | 2.06 | 5.82 | 18,071,593 | 2.14 | 99.07 | 96.58
French | Array Not Filt | 97 | 2.97 | 8.24 | 19,369,816 | 2.10 | 99.07 | 98.01
French | Array Not Filt | 99 | 6.11 | 15.73 | 22,008,978 | 2.01 | 99.06 | 99.20
French | 17 Strains | 95 | 1.29 | 3.89 | 16,805,717 | 2.14 | 99.07 | 93.49
French | 17 Strains | 97 | 2.20 | 6.52 | 18,547,713 | 2.11 | 99.07 | 96.49
French | 17 Strains | 99 | 4.19 | 11.63 | 20,843,679 | 2.04 | 99.06 | 98.62
French | Hard Filters | NA | 5.36 | 16.37 | 19,805,592 | 2.06 | 99.09 | 96.96
It's a good idea to benchmark your SNP data and choose the callset whose parameters suit the needs of your project, like sensitivity (finding everything) vs specificity (being right).
Column definitions for the table above:
- Sensitivity: you set this; how much of your training set you want to recover
- HetsInHomE: look at homozygous regions in the mouse; how many hets do you see (an error proxy)
- chrXHetE: look at chromosome X in males; how many hets do you see (an error proxy)
- nSNPs: number of SNPs
- TiTv: transition/transversion ratio; expect ~2.15 for real SNPs, 0.5 for false positives (see the small example after this list)
- arrayCon: concordance with array genotypes
- arraySen: sensitivity for polymorphic array sites
We chose a dataset for recombination rate estimation with a low error rate but still a good number of SNPs.
Note: VQSR sensitivity is not always "calibrated". It's a good idea to benchmark your callsets and choose the one whose parameters suit the needs of your project, like sensitivity (finding everything) vs specificity (being right).
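As a quick illustration of the Ti/Tv metric (the allele vectors here are made up): transitions are A<->G and C<->T, everything else is a transversion.

    # ref and alt alleles for a few hypothetical SNPs
    ref <- c("A", "C", "G", "T", "A")
    alt <- c("G", "T", "A", "C", "C")
    isTi <- paste(pmin(ref, alt), pmax(ref, alt)) %in% c("A G", "C T")
    sum(isTi) / sum(!isTi)  # Ti/Tv; expect ~2.15 genome-wide for real SNPs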
Population | Training | Sensitivity | HetsInHomE | chrXHetE | nSNPs | TiTv | arrayCon | arraySen
Taiwan | Array Not Filt | 95 | 2.05 | 11.20 | 36,344,063 | 2.12 | NA | NA
Taiwan | Array Not Filt | 97 | 2.87 | 14.67 | 39,183,932 | 2.10 | NA | NA
Taiwan | Array Not Filt | 99 | 6.34 | 25.57 | 42,864,322 | 2.05 | NA | NA
Taiwan | 17 Strains | 95 | 1.83 | 10.32 | 29,748,456 | 2.11 | NA | NA
Taiwan | 17 Strains | 97 | 2.16 | 11.20 | 34,112,325 | 2.11 | NA | NA
Taiwan | 17 Strains | 99 | 3.66 | 15.80 | 39,549,666 | 2.08 | NA | NA
Taiwan | Hard Filters | NA | 6.11 | 19.44 | 33,692,857 | 2.04 | NA | NA
Indian | Array Not Filt | 95 | 1.11 | 1.80 | 66,190,390 | 2.18 | NA | NA
Indian | Array Not Filt | 97 | 1.59 | 2.57 | 71,134,757 | 2.16 | NA | NA
Indian | Array Not Filt | 99 | 3.70 | 5.56 | 78,220,348 | 2.11 | NA | NA
Indian | 17 Strains | 95 | 0.67 | 1.16 | 57,674,209 | 2.18 | NA | NA
Indian | 17 Strains | 97 | 1.09 | 1.63 | 65,981,654 | 2.17 | NA | NA
Indian | 17 Strains | 99 | 2.63 | 3.31 | 75,103,886 | 2.13 | NA | NA
Indian | Hard Filters | NA | 5.41 | 72.61 | 78,487,616 | 2.10 | NA | NA
All | Array Not Filt | 95 | 1.90 | 8.95 | 140,827,810 | 2.04 | 99.07 | 96.74
All | Array Not Filt | 97 | 2.38 | 13.99 | 160,447,255 | 2.03 | 99.07 | 98.20
All | Array Not Filt | 99 | 4.52 | 22.73 | 184,977,157 | 1.99 | 99.06 | 99.36
Some of the datasets are extremely big. Combined datasets allow us to better evaluate differences between populations.
Notes: VQSR sensitivity is not always "calibrated". Be VERY skeptical of others' claims about sensitivity and specificity that depend on NGS: different filtering on different datasets can often explain a lot.
[Figure: genome-wide homozygosity (red) for Taiwan, France and India]
Huge bottlenecks in Taiwan and France; India is OK. The French and Taiwanese mice are very inbred, not so the Indian mice.
Admixture / introgression is common. Recent admixture is visible in the French and Taiwanese populations.
- French hotspots are cold in Taiwan and vice-versa
- Our domesticus hotspots are enriched for an already known domesticus motif
- Broad-scale correlation is conserved between subspecies, as in humans vs chimps
4 – An example of an NGS pipeline, wild mice analysis: take away
- All the stuff involving BAMs is slow. Take care and try to avoid mistakes, but redo analyses if appropriate to fix them
- If you're doing human work, you can probably get away with Ti/Tv for SNP filtering. If not human, try to set up benchmarks to guide SNP calling and filtering
- Boy do I wish I had used some sort of knitr + github reporting system (for the downstream stuff)
Extra 1 – Useful random Linux tips
- screen: log onto a server and start a "screen" session. You can then disconnect from the server and reconnect at a later time with all your programs still open
- Set up password-less ssh using public/private keys! Google "password less ssh"
Extra 2 – Give some thought to folder organization
Conclusions
- Please don't crash the server
- Please don't hog the server without reason (especially RAM and disk IO!)
- Consider something like emacs and ESS for quick programming in R
- R is pretty fast if you program it right, and there are lots of packages and tricks to make it faster
- Consider something like iPython or knitr (+/- github) to document your work and auto-generate reports on long projects
- Sequencing data is big, slow and unwieldy. But it is very informative!
Acknowledgements
- Simon Myers: supervisor
- Jonathan Flint, Richard Mott: close collaborators
- Oliver Venn: recombination work for wild mice
- Kiran Garimella: GATK, github
- Cai Na: pre-processing pipeline
- Winni Kretzschmar: ESS, many other things
- Amelie Baud, Binnaz Yalcin, Xiangchao Gan and many others for the wild mice