Breaking Up is Hard to Do: Migrating R Project to CHTC Brian S. Yandell UW-Madison 22 November 2013.


Topics in this talk
science
– Fisher permutation on correlated tests & traits
– DAG causal tests using genetics
mechanics of scaling up calculations
– rewrite R for parallel steps
– organize shell scripts to schedule jobs
– process results

hotspots
Jax SysGen: Yandell © 2013

hotspot permutation test (Breitling et al. Jansen 2008 PLoS Genetics)
for original dataset and each permuted set:
– set single trait LOD threshold T
  use Churchill-Doerge (1994) permutations
– count number of traits (N) with LOD above T
  count for every locus (marker or pseudomarker)
  smooth counts if markers are dense
find count with at most 5% of permuted sets above (critical value) as count threshold
conclude original counts above threshold are real

Genetic architecture of gene expression in 6 tissues. A: Tissue-specific panels illustrate the relationship between the genomic location of a gene (y-axis) and where that gene’s mRNA shows an eQTL (LOD > 5), as a function of genome position (x-axis). Circles represent eQTLs that showed either cis-linkage (black) or trans-linkage (colored) according to LOD score. Genomic hot spots, where many eQTLs map in trans, are apparent as vertical bands that show either tissue selectivity (e.g., Chr 6 in the islet) or are present in all tissues (e.g., Chr 17). B: The total number of eQTLs identified in 5 cM genomic windows is plotted for each tissue; the total number of eQTLs across all positions is shown in the upper right corner of each panel. The peak number of eQTLs exceeding 1000 per 5 cM is shown for islets (Chrs 2, 6 and 17), liver (Chrs 2 and 17) and kidney (Chr 17).

Tissue-specific hotspots with eQTL and SNP architecture
Are these hotspots real?

Single trait permutation threshold T
Churchill Doerge (1994)
Null distribution of max LOD
– Permute single trait separate from genotype
– Find max LOD over genome
– Repeat 1000 times
Find 95% permutation threshold T
Identify interesting peaks above T in data
Controls genome-wide error rate (GWER)
– Chance of detecting at least one peak above T when no QTL is present

Single trait permutation schema
[Diagram: phenotype, genotypes, LOD over genome, max LOD]
1. shuffle phenotypes to break QTL
2. repeat 1000 times and summarize
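The single-trait permutation schema can be sketched in base R. This is a minimal illustration on simulated data, not the code used in the talk: the backcross-style 0/1 genotypes, the independent markers, the sample sizes and the helper `lod.scan()` are all assumptions made for the example. For a single two-level genotype the regression LOD is computed from the correlation as -(n/2) log10(1 - r^2).

```r
set.seed(1)
n <- 100        # individuals (assumed)
n.mar <- 50     # markers (assumed, simulated as independent)
n.perm <- 200   # fewer than the 1000 on the slide, to keep the sketch fast

# Simulated backcross-style genotypes (0/1) and one trait with a QTL at marker 10
geno <- matrix(rbinom(n * n.mar, 1, 0.5), n, n.mar)
pheno <- rnorm(n) + 0.8 * geno[, 10]

# Hypothetical helper: LOD at every marker from single-marker regression
lod.scan <- function(y, geno) {
  r <- cor(y, geno)
  -(length(y) / 2) * log10(1 - r^2)
}

# 1. shuffle phenotypes to break QTL; 2. repeat and keep max LOD over genome
max.lod <- replicate(n.perm, max(lod.scan(sample(pheno), geno)))

# 95% permutation threshold T
T.lod <- unname(quantile(max.lod, 0.95))

# Markers whose observed LOD exceeds T
obs.lod <- lod.scan(pheno, geno)
peaks <- which(obs.lod > T.lod)
```

Shuffling the whole phenotype vector leaves the genotype matrix untouched, which is what breaks the trait-marker association while keeping the marker structure.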

Hotspot count threshold N(T)
Breitling et al. Jansen (2008)
Null distribution of max count above T
– Find single-trait 95% LOD threshold T
– Find max count of traits with LODs above T
– Repeat 1000 times
Find 95% count permutation threshold N
Identify counts of LODs above T in data
– Locus-specific counts identify hotspots
Controls GWER in some way

Hotspot permutation schema
[Diagram: phenotypes, genotypes, LOD at each locus for each phenotype over genome, count LODs at locus over threshold T]
1. shuffle phenotypes by row to break QTL, keep correlation
2. repeat 1000 times and summarize max count N over genome
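A sketch of the hotspot count threshold, again on simulated data. The sample sizes, the fixed single-trait threshold `T.lod = 3` (which in practice would come from single-trait permutations) and the helpers `lod.matrix()` and `max.count()` are assumptions for illustration only.

```r
set.seed(2)
n <- 100; n.mar <- 40; n.trait <- 30; n.perm <- 100   # assumed sizes

geno <- matrix(rbinom(n * n.mar, 1, 0.5), n, n.mar)

# Correlated traits: a shared component is added to every column
shared <- rnorm(n)
pheno <- matrix(rnorm(n * n.trait), n, n.trait) + shared
# Simulated hotspot: marker 15 affects the first 10 traits
pheno[, 1:10] <- pheno[, 1:10] + 1.5 * geno[, 15]

# Hypothetical helper: traits x markers matrix of single-marker LODs
lod.matrix <- function(pheno, geno) {
  r <- cor(pheno, geno)
  -(nrow(pheno) / 2) * log10(1 - r^2)
}

T.lod <- 3   # assumed single-trait LOD threshold

# Hypothetical helper: max over loci of the number of traits with LOD above T
max.count <- function(pheno, geno, T.lod) {
  max(colSums(lod.matrix(pheno, geno) > T.lod))
}

# Shuffle phenotypes by row: breaks QTL, keeps correlation among traits
perm.counts <- replicate(
  n.perm,
  max.count(pheno[sample(n), , drop = FALSE], geno, T.lod)
)

# 95% count threshold N, then locus-specific counts in the unpermuted data
N.thr <- unname(quantile(perm.counts, 0.95))
obs.counts <- colSums(lod.matrix(pheno, geno) > T.lod)
hotspots <- which(obs.counts > N.thr)
```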

permutation across traits (Breitling et al. Jansen 2008 PLoS Genetics)
[Diagram: gene expression by strain and marker by strain, shuffled the right way and the wrong way]
break correlation between markers and traits
but preserve correlation among traits
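The difference between the two ways of shuffling can be checked directly. In this small simulated example (all names and sizes are assumptions), permuting whole rows keeps the trait correlation matrix unchanged, while permuting each trait column separately destroys it.

```r
set.seed(3)
n <- 50
pheno <- matrix(rnorm(n * 3), n, 3)
pheno[, 2] <- pheno[, 1] + 0.3 * pheno[, 2]   # traits 1 and 2 strongly correlated

# Right way: one shuffle of individuals applied to all traits together
right <- pheno[sample(n), ]

# Wrong way: each trait shuffled independently
wrong <- apply(pheno, 2, sample)

same.cor <- isTRUE(all.equal(cor(right), cor(pheno)))   # correlation preserved
cor.before <- cor(pheno)[1, 2]
cor.wrong <- cor(wrong)[1, 2]                           # typically near zero
```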

Yeast study
120 individuals
6000 traits
250 markers
1000 permutations
1.8 * 10^10 linear models
doable (barely) on a laptop

Mouse study
500 individuals
30,000 traits * 6 tissues
2000 markers
1000 permutations
1.8 * 10^13 linear models
1000 x more than yeast study
need to parallelize

CHTC mechanics

CHTC mechanics
1. figure out how to do small scale science
   – R code, then R package
2. diagram flow
3. refactor code to scale up
4. small scale tests
5. ramp up

why automate?
Evolution of data
– code breaks
– code improves (and breaks again)
– data breaks (typos, mistakes, etc.)
– data improves (new results, altered database)
big data is not static
metadata is key
– versions, annotation,...

figure out how to do small scale science
build R code pieces
– use Rproject or emacs or …
organize as R package
– read “Writing R Extensions” many times
document, document, document
– ## comments in code at all steps
– use prompt() and create function manual pages
– include error checking with clear messages
– write vignette/markdown of example run
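As a small illustration of the documentation and error-checking advice, here is a hypothetical helper (`count.above()` is not from the talk) with `##` comments, clear error messages, and a manual-page skeleton generated by `prompt()`. Passing `filename = NA` asks `prompt()` to return the skeleton as a list rather than write an .Rd file.

```r
## count.above: number of traits with LOD above a threshold at each locus
## lod: numeric matrix, traits in rows and loci in columns
## threshold: single LOD threshold
count.above <- function(lod, threshold) {
  ## check inputs early, with messages that say what is expected
  if (!is.matrix(lod) || !is.numeric(lod))
    stop("'lod' must be a numeric matrix (traits in rows, loci in columns)")
  if (!is.numeric(threshold) || length(threshold) != 1 || is.na(threshold))
    stop("'threshold' must be a single non-missing number")
  colSums(lod > threshold)
}

## skeleton of a manual page, returned as a list instead of written to disk
rd <- prompt(count.above, filename = NA)

## example run
example.lod <- matrix(c(1, 5, 6, 2), 2, 2)
example.counts <- count.above(example.lod, 4)
```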

diagram flow
calculations for one set of data
– one trait: anova tests across many predictors
  predictor = genotype at chromosome location
  test statistic = log likelihood = LOD score
– repeat for 1000s of traits (= molecular signals)
– record test statistic distribution for each predictor
  only need tail distribution
  maybe only number of traits above threshold
– find maximum by quantile across anova statistics
do same for multiple (1000s) of permutations
estimate permutation distribution (quantiles)

refactor code to scale up
overview mechanics
– have simple calls at top for later scheduling needs
– put parallel functions in their own file
– document, document, document for you and others
parallel.qtlhot(phase, index, …, dirpath = ".")
– one master function to guide phases
– each phase has needed arguments, etc.
– dirpath useful for testing, but use "." for CHTC sandboxes
parallel.error(number, phase, index)
– write "RESULT.phase.index" file with number
parallel.message(number)
– output informative message (0 = "OK")
qtlhot.phaseN(dirpath, index, …)
– N = phase number
– called by parallel.qtlhot(), hidden from user except N=0
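The master-function pattern might look roughly like the sketch below. The functions `run.phase()`, `phase.result()` and `phase.message()` are hypothetical stand-ins written for this example; they mimic the roles described for parallel.qtlhot(), parallel.error() and parallel.message() but are not the package's code, and the error codes and phase bodies are invented placeholders.

```r
# Hypothetical message table: 0 = "OK", other codes are assumptions
phase.message <- function(number) {
  msgs <- c("0" = "OK", "1" = "unknown phase", "2" = "phase failed")
  msg <- msgs[as.character(number)]
  if (is.na(msg)) "unrecognized error code" else unname(msg)
}

# Write a "RESULT.phase.index" file holding the status number
phase.result <- function(number, phase, index, dirpath = ".") {
  file <- file.path(dirpath, paste("RESULT", phase, index, sep = "."))
  writeLines(as.character(number), file)
  message(phase.message(number))
  invisible(file)
}

# One master function to guide phases; phase bodies are placeholders
run.phase <- function(phase, index = 1, ..., dirpath = ".") {
  phases <- list(
    "1" = function(dirpath, index, ...) TRUE,
    "2" = function(dirpath, index, ...) sqrt(index),
    "3" = function(dirpath, index, ...) stop("no Phase2 files found")
  )
  fn <- phases[[as.character(phase)]]
  if (is.null(fn)) {
    phase.result(1, phase, index, dirpath)
    return(invisible(1))
  }
  status <- tryCatch({ fn(dirpath, index, ...); 0 }, error = function(e) 2)
  phase.result(status, phase, index, dirpath)
  invisible(status)
}

# Local test in a scratch folder; on CHTC the sandbox would be "."
demo.dir <- tempfile("soar")
dir.create(demo.dir)
ok <- run.phase(2, index = 7, dirpath = demo.dir)
failed <- run.phase(3, index = 1, dirpath = demo.dir)
unknown <- run.phase(9, index = 1, dirpath = demo.dir)
```

Writing the status to a small file per phase and index gives the scheduler something simple to test for, without parsing R output.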

parallel phases
qtlhot.phaseN(dirpath, index, …)
0: initialization (for scheduling before CHTC)
1: setup objects needed in phase 2 (one processor)
– read raw data
– create useful stuff (random sequence for index?)
– write Phase1.RData object
2: map phase: parallel permutation (send to CHTC sandboxes)
– load Phase1.RData object
– permute or otherwise shuffle as needed using index
– do calculations for one set of data
– write Phase2.index.RData objects
3: reduce phase: merge permutations (one processor)
– load Phase1.RData object
– loop through Phase2.*.RData objects
– write Phase3.RData object
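The three file-based phases can be tried out on one machine before anything is sent to CHTC. The sketch below follows the file names on the slide (Phase1.RData, Phase2.index.RData, Phase3.RData); the functions `phase1()`, `phase2()` and `phase3()`, the simulated data, and the summary saved in each phase are assumptions for illustration.

```r
# Phase 1: set up objects needed in phase 2, including one seed per index
phase1 <- function(dirpath, n.perm = 5) {
  set.seed(4)
  n <- 60
  geno <- matrix(rbinom(n * 20, 1, 0.5), n, 20)   # simulated raw data
  pheno <- matrix(rnorm(n * 10), n, 10)
  seeds <- sample.int(1e6, n.perm)                 # random sequence for index
  save(geno, pheno, seeds, file = file.path(dirpath, "Phase1.RData"))
}

# Phase 2 (map): one permutation per index, each run independent of the others
phase2 <- function(dirpath, index, T.lod = 2) {
  load(file.path(dirpath, "Phase1.RData"))
  set.seed(seeds[index])
  perm <- sample(nrow(pheno))
  r <- cor(pheno[perm, , drop = FALSE], geno)
  lod <- -(nrow(pheno) / 2) * log10(1 - r^2)
  max.count <- max(colSums(lod > T.lod))
  save(max.count, file = file.path(dirpath, paste0("Phase2.", index, ".RData")))
}

# Phase 3 (reduce): loop through Phase2.*.RData objects and merge
phase3 <- function(dirpath) {
  load(file.path(dirpath, "Phase1.RData"))
  files <- list.files(dirpath, pattern = "^Phase2\\..*\\.RData$",
                      full.names = TRUE)
  counts <- vapply(files, function(f) {
    env <- new.env()
    load(f, envir = env)
    env$max.count
  }, numeric(1), USE.NAMES = FALSE)
  save(counts, file = file.path(dirpath, "Phase3.RData"))
  counts
}

# Small scale test, run serially in a scratch folder
work.dir <- tempfile("phases")
dir.create(work.dir)
phase1(work.dir, n.perm = 5)
for (index in 1:5) phase2(work.dir, index)
merged <- phase3(work.dir)
```

Because each phase 2 run reads only Phase1.RData and its own index, the loop over `index` is the part that can be handed to separate CHTC jobs.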

CHTC use: one “small” project
Open Science Grid Glidein Usage (4 feb 2012)
[Table: group, hours, percent; the hours and percent values are missing from the transcript]
1 BMRB
2 Biochem_Attie
3 Statistics_Wahba

art of map phase
want CHTC runs that are ~2 hours
if longer, break up into smaller chunks
– keep track of randomization sequence, etc.
– build merge into R code or DAGman scheme
if shorter, combine sets as series
– 1000 runs of 10 sets rather than 10,000 small runs
– use R code to organize objects as needed
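Combining sets as a series is mostly bookkeeping. A minimal sketch, assuming 10,000 permutations grouped 10 per run and one stored seed per permutation so the randomization sequence can be reproduced; `run.chunk()` and its toy calculation are placeholders for the real work.

```r
n.perm <- 10000
sets.per.run <- 10

# Assign each permutation to a run: 1000 runs of 10 sets
perm.id <- seq_len(n.perm)
run.id <- ceiling(perm.id / sets.per.run)
chunks <- split(perm.id, run.id)

# Keep track of the randomization sequence: one seed per permutation
set.seed(7)
seeds <- sample.int(.Machine$integer.max, n.perm)

# Placeholder for one CHTC run: process the sets of one chunk in series
run.chunk <- function(index) {
  vapply(chunks[[index]], function(i) {
    set.seed(seeds[i])
    max(rnorm(5))          # stand-in for the real calculation
  }, numeric(1))
}

n.runs <- length(chunks)
first <- run.chunk(3)
again <- run.chunk(3)      # same seeds, so the same results
```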

art of reduce phase
each map phase run may have sizeable object
– you have subtle summaries needed across runs
– you don’t yet know what you want to save
summarize but don’t throw out checkpoints
test each phase as you go in modular way
when satisfied, remove big unneeded objects
– save space on submit node (fills quickly)
– decide what you need to keep longer term
– can often rerun CHTC if need to reproduce stuff

phase 0 CHTC initialization
use a script! document steps!
./R --no-save --args islet Male m2 < SOAR.args0.R
may have to combine multiple data sources
reduce object size to minimal needed
create local folder with subfolders for CHTC
move objects into place
copy folder to submit node
submit job to CHTC

initialization script
./R --no-save --args islet Male m2 < SOAR.args0.R
Contents of SOAR.args0.R:
    args <- commandArgs(TRUE)
    tissue <- args[1]
    subset.sex <- args[2]
    runnum <- args[3]
    ## SOAR.phase0.R does the initialization
    source("SOAR.phase0.R")

post-CHTC processing
use a script! document steps!
keep track of CHTC submit job ID
check that final object is there
– have backup plan to do phase 3 offline
load Phase3.RData object
do post-processing, plots, summaries
save useful stuff
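Checking for the final object and falling back to an offline phase 3 could be wrapped in one function. This is a sketch under assumptions: `load.phase3()` and the `fallback` argument are hypothetical, and the fallback used in the demonstration just writes a toy object.

```r
# Load Phase3.RData if present; otherwise run a fallback that creates it
load.phase3 <- function(dirpath = ".", fallback = NULL) {
  file <- file.path(dirpath, "Phase3.RData")
  if (!file.exists(file)) {
    if (is.null(fallback))
      stop("Phase3.RData not found in ", dirpath, "; rerun phase 3 offline")
    message("Phase3.RData missing; running phase 3 offline")
    fallback(dirpath)
  }
  env <- new.env()
  load(file, envir = env)
  as.list(env)      # named list of the saved objects
}

# Demonstration in a scratch folder with a toy offline phase 3
post.dir <- tempfile("post")
dir.create(post.dir)
toy.phase3 <- function(dirpath) {
  counts <- c(4, 7, 5)
  save(counts, file = file.path(dirpath, "Phase3.RData"))
}
result <- load.phase3(post.dir, fallback = toy.phase3)
</gr_insert_placeholder>
```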

post-processing script
./R --no-save --args islet Male m2 209 < SOAR.args1.R
Contents of SOAR.args1.R:
    args <- commandArgs(TRUE)
    tissue <- args[1]
    subset.sex <- args[2]
    runnum <- args[3]
    job <- args[4]
    ## SOAR.post.R does the post processing
    source("SOAR.post.R")

Breitling Method

Breitling et al (2008) hotspot size thresholds from permutations

quality vs. quantity in hotspots (Chaibub Neto et al Genetics)
detect single trait with very large LOD
– control FWER across genome and all traits
find small hotspots with very significant traits
– all traits have large LODs at same locus
– maybe one strongly disrupted signal pathway?
use sliding LOD threshold across hotspot sizes
– small LOD threshold (~4) for large hotspots
– large LOD threshold (~8) for small hotspots

rethinking the approach
For a hotspot of size N, what threshold T(N) is just large enough to declare 5% significance?
N = 1 (single trait)
– What threshold T(1) is needed to declare any single peak significant?
– valid across all traits and whole genome
Chaibub Neto E, Keller MP, Broman AF, Attie AD, Jansen RC, Broman KW, Yandell BS, Quantile-based permutation thresholds for QTL hotspots. Genetics (tent. accepted).
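One way to read T(N) is as the 95% permutation quantile of the largest N-th ranked LOD found at any locus. The sketch below implements that reading on simulated data; it is an interpretation of the sliding-threshold idea, not the published implementation, and the helper `nth.largest()`, the sizes and the data are assumptions.

```r
set.seed(5)
n <- 80; n.mar <- 30; n.trait <- 40; n.perm <- 50   # assumed sizes
sizes <- c(1, 5, 10, 20)                            # hotspot sizes N to examine

geno <- matrix(rbinom(n * n.mar, 1, 0.5), n, n.mar)
pheno <- matrix(rnorm(n * n.trait), n, n.trait) + rnorm(n)   # correlated traits

lod.matrix <- function(pheno, geno) {
  r <- cor(pheno, geno)                 # traits x markers
  -(nrow(pheno) / 2) * log10(1 - r^2)
}

# For each size N: the N-th largest LOD at each locus, maximized over the genome
nth.largest <- function(lod, sizes) {
  sorted <- apply(lod, 2, sort, decreasing = TRUE)   # sort traits within locus
  apply(sorted[sizes, , drop = FALSE], 1, max)
}

# Permute rows (keeps trait correlation), one column of results per permutation
perm.stats <- replicate(
  n.perm,
  nth.largest(lod.matrix(pheno[sample(n), , drop = FALSE], geno), sizes)
)

# Sliding thresholds: T(1) is the strictest, T(N) relaxes as N grows
T.N <- apply(perm.stats, 1, quantile, probs = 0.95)
names(T.N) <- paste0("N=", sizes)
```

With N = 1 this reduces to a single threshold valid across all traits and the whole genome, matching the first case on the slide.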

Chaibub Neto sliding LOD thresholds
[Plot annotations: single trait significant; 50-trait hotspot significant]

sliding LOD method

causal networks

BxH ApoE-/- chr 2: causal architecture
[Figure annotations: hotspot; 12 causal calls]

BxH ApoE-/- causal network for transcription factor Pscdbp
[Figure annotation: causal trait]
unpublished work of Elias Chaibub Neto

basic idea of QTLnet
iterate between finding QTL and network
genetic architecture given causal network
– trait y depends on parents pa(y) in network
– QTL for y found conditional on pa(y)
  Parents pa(y) are interacting covariates for QTL scan
causal network given genetic architecture
– build (adjust) causal network given QTL
– each direction change may alter neighbor edges

edge direction: which is causal?
[Figure annotation: due to QTL]

graph complexity with node parents
[Diagrams: a node with parents pa1, pa2 and offspring of1, of2, of3; a node with parents pa1, pa3 and offspring of1, of2, of3]

how many node parents?
how many edges per node? (fan-in)
– few parents directly affect one node
– many offspring affected by one node
BIC computations by maximum number of parents
[Table partly lost in the transcript; lost digits and labels are shown as …]
#      …        …        …        …        all
…      …,300    2,560    3,820    4,660    5,…
…      …        …        …        …        …M
…      …        …        …M       18.6M    16.1B
…      …        …M       26.7M    157M     22.0T
…      …        …M       107M     806M     28.1Q
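The growth in BIC computations can be worked out with a short function, under an assumption that is not stated on the slide: each of p nodes needs one BIC per candidate parent set, and a node can draw at most k parents from the other p - 1 nodes, giving p * sum_{j=0..k} choose(p - 1, j) computations. For 10 nodes this formula gives 2,560, 3,820 and 4,660 for at most 4, 5 and 6 parents, and for 30 nodes and at most 6 parents it gives about 18.6M, which agrees with entries that survive in the table. The node counts used below are likewise assumed.

```r
# Assumed counting rule: one BIC per node per parent set of size <= max.parents
bic.count <- function(n.nodes, max.parents) {
  k <- min(max.parents, n.nodes - 1)
  n.nodes * sum(choose(n.nodes - 1, 0:k))
}

nodes <- c(10, 20, 30, 40, 50)   # assumed row labels
limits <- 3:6                    # assumed column labels

tab <- sapply(limits, function(k) sapply(nodes, bic.count, max.parents = k))
tab <- cbind(tab, sapply(nodes, function(p) bic.count(p, p - 1)))
dimnames(tab) <- list(nodes = nodes, max.parents = c(limits, "all"))

# Print with thousands separators
print(format(tab, big.mark = ",", scientific = FALSE), quote = FALSE)
```

The "all" column is p * 2^(p - 1), which is why the slide recommends capping the number of parents.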

BIC computation
each trait (node) has a linear model
– Y ~ QTL + pa(Y) + other covariates
BIC = LOD – penalty
– BIC balances data fit to model complexity
– penalty increases with number of parents
limit complexity by allowing only 3-4 parents
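The "LOD minus penalty" form can be illustrated with ordinary linear models. In this sketch the penalty is taken to be (k/2) log10(n) for k extra parameters, which makes the score equal to the drop in R's `BIC()` relative to the null model divided by 2 ln(10); that scaling, the helper `bic.lod()` and the simulated data are assumptions, since the slide does not spell out the penalty.

```r
set.seed(6)
n <- 200
qtl <- rbinom(n, 1, 0.5)
pa1 <- rnorm(n)                 # a true parent of y
pa2 <- rnorm(n)                 # an unrelated candidate parent
y <- 0.5 * qtl + 0.7 * pa1 + rnorm(n)

# Hypothetical score on the LOD scale: fit improvement minus complexity penalty
bic.lod <- function(fit, null) {
  n <- nobs(fit)
  lod <- (n / 2) * log10(deviance(null) / deviance(fit))
  k <- length(coef(fit)) - length(coef(null))
  lod - (k / 2) * log10(n)
}

null <- lm(y ~ 1)
fits <- list(
  qtl.only = lm(y ~ qtl),
  one.parent = lm(y ~ qtl + pa1),
  two.parents = lm(y ~ qtl + pa1 + pa2)
)
scores <- sapply(fits, bic.lod, null = null)

# Same quantity from R's BIC(), rescaled from natural log to log10
check <- sapply(fits, function(f) (BIC(null) - BIC(f)) / (2 * log(10)))
```

Larger scores are better on this scale, so adding a parent only pays off when its gain in LOD exceeds the added penalty.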

parallel phases for larger projects
[Diagram: phase flow; surviving labels: b, 2.1 …, m, 4.1 …, 5]
Phase 1: identify parents
Phase 2: compute BICs
Phase 3: store BICs
Phase 4: run Markov chains
Phase 5: combine results

parallel implementation
R/qtlnet available at
Condor cluster: chtc.cs.wisc.edu
– System Of Automated Runs (SOAR)
  ~2000 cores in pool shared by many scientists
  automated run of new jobs placed in project
[Figure panels: Phase 4, Phase 2]

single edge updates
[Plot: 100,000 runs, with burnin marked]

neighborhood edge reversal
Grzegorczyk M. and Husmeier D. (2008) Machine Learning 71 (2-3),
[Diagram labels: orphan nodes, reverse edge, find new parents, select edge, drop edge, identify parents]

neighborhood for reversals only
[Plot: 100,000 runs, with burnin marked]