Project JUNIOR: Drug Discovery using Azure Project JUNIOR: Drug Discovery using Azure April 6-7, 2010 Redmond, Washington.

Slides:



Advertisements
Similar presentations
Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY
Advertisements

Configuration management
Analysis of High-Throughput Screening Data C371 Fall 2004.
Chapter 5 Multiple Linear Regression
This material is approved for public release. Distribution is limited by the Software Engineering Institute to attendees. Sponsored by the U.S. Department.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
PlanetLab Operating System support* *a work in progress.
Pankaj Kumar Qinglan Zhang Sagar Davasam Sowjanya Puligadda Wei Liu
Simon Woodman Hugo Hiden Paul Watson Jacek Cala. Outline 1. What is e-Science Central? 2. Architecture and Features 3. Workflows and Applications.
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
Cheminformatics II Apr 2010 Postgrad course on Comp Chem Noel M. O’Boyle.
Cloud Computing for Chemical Property Prediction Paul Watson School of Computing Science Newcastle University, UK Microsoft Cloud.
General Mining Issues a.j.m.m. (ton) weijters Overfitting Noise and Overfitting Quality of mined models (some figures are based on the ML-introduction.
7/14/2015EECS 584, Fall MapReduce: Simplied Data Processing on Large Clusters Yunxing Dai, Huan Feng.
CS Instance Based Learning1 Instance Based Learning.
A Hadoop MapReduce Performance Prediction Method
MultiJob PanDA Pilot Oleynik Danila 28/05/2015. Overview Initial PanDA pilot concept & HPC Motivation PanDA Pilot workflow at nutshell MultiJob Pilot.
Classification and Prediction: Regression Analysis
Workflow API and workflow services A case study of biodiversity analysis using Windows Workflow Foundation Boris Milašinović Faculty of Electrical Engineering.
Cloud Computing Systems Lin Gu Hong Kong University of Science and Technology Sept. 21, 2011 Windows Azure—Overview.
Measuring zSeries System Performance Dr. Chu J. Jong School of Information Technology Illinois State University 06/11/2012 Sponsored in part by Deer &
Automate Microsoft Azure Ross Sponholtz Mark Ghazai.
Databases with Scalable capabilities Presented by Mike Trischetta.
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
6/1/2001 Supplementing Aleph Reports Using The Crystal Reports Web Component Server Presented by Bob Gerrity Head.
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
 Cloud computing  Workflow  Workflow lifecycle  Workflow design  Workflow tools : xcp, eucalyptus, open nebula.
Department of Computer Science Engineering SRM University
Monitoring Latency Sensitive Enterprise Applications on the Cloud Shankar Narayanan Ashiwan Sivakumar.
Cristian Urs and Ben Riveira. Introduction The article we chose focuses on improving the performance of Genetic Algorithms by: Use of predictive models.
Copyright © 2008 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 3: Operating Systems Computer Science: An Overview Tenth Edition.
Chapter 9 Neural Network.
Mostafa Abdollahi Mazandaran University Of Science And Technology January 2011.
Configuration Management (CM)
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Tekin Bicer Gagan Agrawal 1.
Using Neural Networks to Predict Claim Duration in the Presence of Right Censoring and Covariates David Speights Senior Research Statistician HNC Insurance.
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
Jeff Howbert Introduction to Machine Learning Winter Regression Linear Regression.
The Future of the iPlant Cyberinfrastructure: Coming Attractions.
Use of Machine Learning in Chemoinformatics Irene Kouskoumvekaki Associate Professor December 12th, 2012 Biological Sequence Analysis course.
1 Resource Provisioning Overview Laurence Field 12 April 2015.
LHCb Software Week November 2003 Gennady Kuznetsov Production Manager Tools (New Architecture)
SIMO SIMulation and Optimization ”New generation forest planning system” Antti Mäkinen & Jussi Rasinmäki Dept. of Forest Resource Management.
Virtual Screening C371 Fall INTRODUCTION Virtual screening – Computational or in silico analog of biological screening –Score, rank, and/or filter.
Jeff Howbert Introduction to Machine Learning Winter Regression Linear Regression Regression Trees.
An Investigation of Commercial Data Mining Presented by Emily Davis Supervisor: John Ebden.
Towards large-scale parallel simulated packings of ellipsoids with OpenMP and HyperFlow Monika Bargieł 1, Łukasz Szczygłowski 1, Radosław Trzcionkowski.
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Use of Machine Learning in Chemoinformatics
Enabling the Cloud OS Today  New high-density Web Sites with elastic cloud scaling and complete dev-ops experiences  New rich IaaS experience for self-service.
Configuring Print Services Lesson 7. Print Sharing Print device sharing is another one of the most basic applications for which local area networks were.
Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.
Let's build a VMM service template from A to Z in one hour Damien Caro Technical Evangelist Microsoft Central & Eastern Europe
Data Summit 2016 H104: Building Hadoop Applications Abhik Roy Database Technologies - Experian LinkedIn Profile:
1
Energy Management Solution
SQL Database Management
MASS Java Documentation, Verification, and Testing
Progress on NA61/NA49 software virtualisation Dag Toppe Larsen Wrocław
Chilimbi, et al. (2014) Microsoft Research
Introduction to R Programming with AzureML
Energy Management Solution
Performance Load Testing Case Study – Agilent Technologies
Dev Test on Windows Azure Solution in a Box
Haiyan Meng and Douglas Thain
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Virtual Screening.
Chap. 7 Regularization for Deep Learning (7.8~7.12 )
Object Detection Implementations
Presentation transcript:

Project JUNIOR: Drug Discovery using Azure Project JUNIOR: Drug Discovery using Azure April 6-7, 2010 Redmond, Washington

Computing Science – Paul Watson, Hugo Hiden, Simon Woodman, Jacek Cala, Martyn Taylor Northern Institute for Cancer Research – David Leahy, Dominic Searson, Valdimir Sykora Microsoft – Christophe Poulain

What are the properties of this molecule? Toxicity Solubility Biological Activity Perform experiments Time consuming Expensive Ethical constraints

Predict likely properties based on similar molecules CHEMBL Database: data on 622,824 compounds, collected from 33,956 publications WOMBAT-PK Database:data on 1230 compounds, for over 13,000 clinical measurements WOMBAT Database:data on 251,560 structures, for over 1,966 targets What is the relationship between structure and activity? All these databases contain structure information and numerical activity data

Quantitative Structure Activity Relationship f() Activity ≈ More accurately, Activity related to a quantifiable structural attribute Activity ≈ f( logP, number of atoms, shape....) Currently > 3,000 recognised attributes

Aim: To generate models for as much of the freely available data as possible and make it available on so that researchers can generate predictions for their own molecules

QSAR is a Regression exercise Build ModelsValidate ModelsPredict activities

QSAR is a Regression exercise Build ModelsValidate ModelsPredict activities Select a “training set” of compounds CompoundsActivities Select descriptors most related to activity Descriptor Values – logP, shape Calculate descriptor values which compounds? which descriptors? which model? Calculate model parameters XY y = mx + c

Build ModelsValidate ModelsPredict activities Calculate same set of descriptors Descriptors Use model to estimate activities Estimate Select a new set of compounds not used during model building Compounds Actual Compare the estimated activities with the actual activities Error Keep model if error acceptable... what measure?

Build ModelsValidate ModelsPredict activities Use the model to estimate activities for new compounds f (212, 0.9, 9.8) = 5.7 Use Discard

QSAR Process requires many choices – Which descriptors? – Which modelling algorithm? – What model testing strategy? Quality of result depends on make correct choices – All runs are different Discovery Bus manages this process – Apply everything approach

Linear regression Neural Network Partial Least Squares Classification Trees Correlation analysis Genetic algorithms Random selection Java CDK descriptors C++ CDL descriptors Random split 80:20 split Add to database Partition training & test data Calculate descriptors Select descriptors Build model Manages the many model generation paths

Legacy system – Uses Oracle stored procedures and shell scripts – Models built using R – Designed to scale using agents on multiple machines Hard to use and maintain – Undocumented – Specific library and OS versions Runs on Amazon VMs AIM: Extend Discovery Bus using Azure – New agent types – Make use of more computing resources Move towards more maintainable system

“QSPRDiscover&Test”

Move computationally intensive tasks to Cloud – Descriptor calculation – Descriptor selection – Model building Keep Discovery Bus for management Apply this system to CHEMBL and WOMBAT databases MOAQ – Mother Of All QSAR – Model everything in the databases in one shot

Bus Master Planner Calculation Agents e-Science Central Workflows “Proxy” Agent Agent code executes on multiple Azure nodes

~70 Mins 2 Workers ~20 Mins10 Workers ~15 Mins 15 Workers Azure CPU Utilisation:

No  Number of Azure Nodes Processing time (minutes)

Scales to ~20 worker nodesWant scalability to at least 100 Improves calculation time, but still takes too long for a run: Model validation and admin tasks form a long tail

Bus Master Planner Calculation Agents e-Science Central Workflows “Proxy” Agent Agent code executes on multiple Azure nodes Not sending enough work fast enough to Azure Add more admin capable agents Architecture? No improvement Tops out at over 400 concurrent workflows Discovery Bus?

Bus Master Planner Discovery Bus? Optimise Amazon VM configuration Standard VMHigh CPU VM EBS database Local disk database Custom file transfer NFS file transfer Significant gains Planner optimisation Takes ~1 second to plan a request to Azure Feature / limitation of Discovery Bus – always “active” Multiple planning agents Flatten / simplify plan structure 20 Nodes Nodes

Current setup scales to 50 nodes – Run two parallel Discovery Bus instances – Feeds 100 Azure nodes Moving more of the Admin tasks to Azure / e-Science Central – Move model validation out of Discovery Bus – More co-ordination outside Discovery Bus Work in progress – Azure utilisation increasing (average ~60% over entire run)

Before Optimisation: After Optimisation:

Before Optimisation: After Optimisation:

Moving to the Cloud exposes architectural weaknesses – Not always where expected – Stress test individual components in isolation Good utilisation needs the right kind of jobs – Computationally expensive – Modest data transfer sizes – Easy to run in parallel Be prepared to modify architecture However – We are dramatically shortening calculation time – We do have a system that can expand on demand

Finish runs on 100 nodes using 2 Discovery Busses – Publish results on Move more code away from Discovery Bus – Many tasks already on Azure – Fixing planning issue will be complex – Move planning to e-Science Central / Azure Generic model building / management environment in the Cloud

Microsoft External Research Project funders for 2009 / 2010 Roger Barga, Savas Parastatidis, Tony Hey Paul Appleby, Mark Holmes Christophe Poulain