HTCondor for machine learning in biology

Slides:



Advertisements
Similar presentations
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
Advertisements

High Throughput Computing and Protein Structure Stephen E. Hamby.
Cyberinfrastructure and Scientific Collaboration: Application of a Virtual Team Performance Framework to GLOW II Teams Sara Kraemer Chris Thorn Value-Added.
Cheminformatics II Apr 2010 Postgrad course on Comp Chem Noel M. O’Boyle.
Jeffery Loo NLM Associate Fellow ’03 – ’05 chemicalinformaticsforlibraries.
Epistasis Analysis Using Microarrays Chris Workman.
Overview of Bioinformatics A/P Shoba Ranganathan Justin Choo National University of Singapore A Tutorial on Bioinformatics.
Welcome to HTCondor Week #14 (year #29 for our project)
Miron Livny Computer Sciences Department University of Wisconsin-Madison Harnessing the Capacity of Computational.
Expanding Access to OpenClinica Data via LabKey Server Peter Hussey 2013 OpenClinica Global Conference Founding Partner, LabKey Software June 21, 2013.
Hideko Mills, Manager of IT Research Infrastructure
Introduction to Chemoinformatics Irene Kouskoumvekaki Associate Professor December 12th, 2012 Biological Sequence Analysis course.
Cry1wild type. A multicamera image acquisition platform In different rooms for different purposes we have approximately 20 of these CCD cameras equipped.
Interpreting Microarray Expression Data Using Text Annotating the Genes Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik.
Custom Spotfire Applications for use in Drug Discovery Chris Louer Team Leader, Cheminformatics © 2001, GlaxoSmithKline, Inc. - All Rights Reserved.
Computing and Communications and Biology Molecular Communication; Biological Communications Technology Workshop Arlington, VA 20 February 2008 Jeannette.
Miron Livny Center for High Throughput Computing Computer Sciences Department University of Wisconsin-Madison Open Science Grid (OSG)
Copyright OpenHelix. No use or reproduction without express written consent1.
Function first: a powerful approach to post-genomic drug discovery Stephen F. Betz, Susan M. Baxter and Jacquelyn S. Fetrow GeneFormatics Presented by.
A quick introduction to Oncinfo Lab Dr. Habil Zare, PhD PI of Oncinfo Lab Department of Computer Science Texas State University 18 September 2015.
Integrating the Bioinformatic Technology Group into your research programme Introduction People and Skills Examples Integrating the BTG Contacts BHRC Away.
An Integrated Computational Framework for Systems Biology Ram Samudrala University of Washington How does the genome of an organism specify its behaviour.
Biological Signal Detection for Protein Function Prediction Investigators: Yang Dai Prime Grant Support: NSF Problem Statement and Motivation Technical.
BBN Technologies Copyright 2009 Slide 1 The S*QL Plugin for Cytoscape Visual Analytics on the Web of Linked Data Rusty (Robert J.) Bobrow Jeff Berliner,
Center for Predictive Computational Phenotyping (CPCP): Training Plans May 15, 2015 Debora Treu and Whitney Sweeney Center for Predictive Computational.
Why Write A Grant? Elaine M. Hylek, MD, MPH Professor of Medicine Associate Director, Education and Training Division BU CTSI Section of General Internal.
COMP24111: Machine Learning Ensemble Models Gavin Brown
Introduction to Chemoinformatics and Drug Discovery Irene Kouskoumvekaki Associate Professor February 15 th, 2013.
I NSTITUTE for G ENOMIC B IOLOGY Nathan D. Price Department of Chemical and Biomolecular Engineering Center for Biophysics and Computational Biology Institute.
Use of Machine Learning in Chemoinformatics
1 Modelling and Simulation EMBL – Beyond Molecular Biology Physics Computational Biology Chemistry Medicine.
신기술 접목에 의한 신약개발의 발전전망과 전략 LGCI 생명과학 기술원. Confidential LGCI Life Science R&D 새 시대 – Post Genomic Era Genome count ‘The genomes of various species including.
Multi-scale network biology model & the model library 多尺度网络生物学模型 -- 兼论模型库的建立与应用 Jianghui Xiong 熊江辉
Page 1 Computer-aided Drug Design —Profacgen. Page 2 The most fundamental goal in the drug design process is to determine whether a given compound will.
Comparing TensorFlow Deep Learning Performance Using CPUs, GPUs, Local PCs and Cloud Pace University, Research Day, May 5, 2017 John Lawrence, Jonas Malmsten,
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
The CompTox Chemistry Dashboard: an informational data hub at the
High-Throughput Machine Learning from EHR Data
Python for data analysis Prakhar Amlathe Utah State University
Journal club Jun , Zhen.
Licenses and Interpreted Languages for DHTC Thursday morning, 10:45 am
Matt Link Associate Vice President (Acting) Director, Systems
Computer Science Department, University of Missouri, Columbia
An Artificial Intelligence Approach to Precision Oncology
MODELLING INTERACTOMES
Miron Livny John P. Morgridge Professor of Computer Science
APPLICATIONS OF BIOINFORMATICS IN DRUG DISCOVERY
Condor: Job Management
Learning Sequence Motif Models Using Expectation Maximization (EM)
Inferring Models of cis-Regulatory Modules using Information Theory
Basic machine learning background with Python scikit-learn
Advanced Bioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2018 Anthony Gitter
Machine Learning Ali Ghodsi Department of Statistics
Identifying Signaling Pathways
Eukaryotic Gene Finding
Dean Martin Cadwallader Dean of the Graduate School
PABIO 590B Advanced Topics in Bioinformatics
Genomes and Their Evolution
A Short Tutorial on Causal Network Modeling and Discovery
Linking Genetic Variation to Important Phenotypes
From Data to Therapies Research in Xinghua Lu’s Lab
Graph Neural Networks Amog Kamsetty January 30, 2019.
Adam P. Deveau, Victoria L. Bentley, Jason N. Berman 
A Suppression Strategy for Antibiotic Discovery
Harnessing the Antioxidant Power with ARE-Inducing Compounds
Satisfying the Ever Growing Appetite of HEP for HTC
Adam P. Deveau, Victoria L. Bentley, Jason N. Berman 
SYGNALing a Red Light for Glioblastoma
Artificial Intelligence and Trusted Microelectronics
Bioinformatic analyses suggest that PI3K/AKT signaling may be a key downstream pathway of tazarotene signaling. Bioinformatic analyses suggest that PI3K/AKT.
Presentation transcript:

HTCondor for machine learning in biology Anthony Gitter University of Wisconsin-Madison Morgridge Institute for Research May 22, 2018 gitter@biostat.wisc.edu www.biostat.wisc.edu/~gitter @anthonygitter These slides are licensed under CC BY 4.0 by Anthony Gitter. See links for third-party image attribution.

Drug discovery and GPU computing Image: Morgridge Moayad Alnammi Shengchao Liu Sam Gelman

High-throughput chemical screening Given a protein of interest, identify chemicals that may have the ability to control the protein Image: Norah Trent

> 1 million compounds Computational chemical prioritization New protein target > 1 million compounds Suggested test order 5* 30390 201 980000 2* 458 … Chemical compound ≈ key Protein ≈ lock

Chemical representations for machine learning Chemception 2-D searching tutorial Wikipedia SMILES PubChem See Ching et al. 2018

Training neural networks We’ve seen 2-100x speedups on GPUs Some jobs still require days on GPUs

Expanding our GPU capacity UW grid Cooley condor_annex

Software dependencies Chemical screening software System libraries NVIDIA driver XXX.YY

Tradeoffs of conda Pros: Easy to install Anaconda Environments are shareable Easy to update packages in environment Cons: Still depend on system libraries conda_submit gpu-job.sub condor install numpy

Predictive models perform well in experimental tests Train on 75k chemicals, PriA-SSB inhibition Choose among many possible models Models select 250 of 25k new chemicals 64 active chemicals in the 25k Model we selected is the best, finds 40 of the 62 Random forest outperforms the neural networks Now testing on much larger chemical libraries

Gene networks and single cell expression Atul Deshpande

What are the relationships among genes inside a cell? Measure snapshots of gene abundance Gene C “Pseudo” time Abundance Gene D Gene A Gene B Gene C Gene D

For each gene, learn which genes control it Gene A A B C D X Gene B Gene C Gene D

Divide network inference into many small computational jobs Gene C “Pseudo” time Abundance Gene D D A ? B C 100 batches of genes X 10 random bootstraps X 100 parameter combinations

Acknowledgments UW-Madison, CHTC Researchers, collaborators Lauren Michael Christina Koch Miron Livny Todd Miller Aaron Moate Tim Cartwright Jaime Frey Greg Thain Neil Van Lysel Dakota Chambers Researchers, collaborators Moayad Alnammi Atul Deshpande Sam Gelman Shengchao Liu Other group members Spencer Ericksen Scott Wildman Michael Hoffmann Andrew Voter James Keck Ron Stewart

Funding and computing resources UW-Madison Center for High Throughput Computing NVIDIA NIH Commons Credits Center for Predictive Computational Phenotyping NIH U54 AI117924 UW Carbone Cancer Center NIH P30 CA014520 UW-Madison VCRGE with funding from WARF PhRMA Foundation Morgridge Institute for Research NSF CAREER 1553206