A pilot application. Microsoft eScience Workshop 2008, 12/9/2008. Robert Bukowski and Jarek Pillardy, Computational Biology Service Unit, Cornell University.

Computational Biology Service Unit (CBSU)
- Provides computational support to biologists at Cornell University
- Maintains several Windows-based compute clusters, making them available to the Cornell community and users worldwide
- Convenience of access to HPC is a major issue

BioHPC.org: a popular web-based (ASP.NET) interface to HPC clusters created by CBSU (see our poster)

A web service-based interface?
- Would allow HPC applications to be incorporated into analysis pipelines
- Would allow convenient user interfaces other than web forms, such as Excel

Microsoft Computational Finance Server (CompFin)
- Recently developed by Microsoft HPC++ Labs for computational finance applications
- Deployment and execution platform for HPC
- Web service-based
- Features an Excel 2007 user interface
As a "proof of principle" and feasibility test, we decided to adapt a few computational biology applications to CompFin.
Our pilot application: STRUCTURE [J. K. Pritchard et al., Genetics 155, 945 (2000); D. Falush et al., Genetics 164, 1567 (2003)], one of the most popular population genetics programs run on CBSU clusters (via our web interface BioHPC.org)

Outline
- What is STRUCTURE?
- What is CompFin?
- STRUCTURE at CompFin
- Conclusions

What is STRUCTURE?
Objective: split a group of individuals into populations (or clusters) based on known genetic characteristics of the individuals
Method: model-based clustering
Input:
- X: genomic data (alleles at several loci for a set of individuals)
- K: the guessed number of populations
Model variables (multi-dimensional vectors):
- Z: assignment of individuals to populations
- P: allele frequencies within populations
Probability of observing X: Pr(X | P,Z)
Which (P,Z) "fit the data" best? Look at the posterior probability distribution
Pr(Z,P | X) ∝ Pr(X | Z,P) Pr(Z) Pr(P)
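
To make the likelihood term Pr(X | P,Z) concrete, here is a minimal Python sketch. It assumes the simplest possible setting (haploid individuals, no admixture, independent loci), and the function name and data layout are illustrative choices, not taken from the STRUCTURE code.

```python
import math

def log_likelihood(X, Z, P):
    """Return log Pr(X | P, Z) under a no-admixture, haploid model (assumed here).

    X[i][l]    : allele index observed for individual i at locus l
    Z[i]       : population index assigned to individual i
    P[k][l][a] : frequency of allele a at locus l in population k
    """
    total = 0.0
    for i, genotype in enumerate(X):
        k = Z[i]
        for l, allele in enumerate(genotype):
            # Loci are treated as independent, so log-probabilities add up.
            total += math.log(P[k][l][allele])
    return total

# Toy data: three individuals, two loci, two alleles per locus.
X = [[0, 0], [0, 0], [1, 1]]
P = [
    [[0.9, 0.1], [0.9, 0.1]],  # population 0 mostly carries allele 0
    [[0.1, 0.9], [0.1, 0.9]],  # population 1 mostly carries allele 1
]

good = log_likelihood(X, [0, 0, 1], P)  # assignment that matches the data
bad = log_likelihood(X, [1, 1, 0], P)   # assignment that contradicts it
```

An assignment Z that groups individuals with similar alleles yields a higher likelihood (here `good > bad`), which is what the posterior rewards.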

What is STRUCTURE?
Pr(Z,P | X) is estimated by Markov Chain Monte Carlo (MCMC) simulation, which produces a chain of samples
(Z,P)^(1), (Z,P)^(2), ..., (Z,P)^(N)
Output: various quantities (summary statistics) derived from Pr(Z,P | X), e.g.:
- Inferred ancestry of individuals (a list of probabilities of each individual belonging to each population; roughly, the average Z)
- Inferred allele frequencies within populations (roughly, the average P)
STRUCTURE is a "legacy code"; input and output are in text files
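
The MCMC idea can be sketched as a small Gibbs sampler that alternates between drawing P given (X, Z) and Z given (X, P), then averages Z over the retained sweeps. This is a simplified illustration and not the STRUCTURE implementation: it assumes haploid data, no admixture, a uniform prior on Z and a Dirichlet(1, ..., 1) prior on allele frequencies, and all names are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def gibbs(X, K, n_alleles, sweeps, burn_in, seed=0):
    """Return per-individual membership frequencies (roughly, the average Z)."""
    rng = random.Random(seed)
    n, n_loci = len(X), len(X[0])
    Z = [rng.randrange(K) for _ in range(n)]
    member = [[0] * K for _ in range(n)]
    kept = 0
    for sweep in range(sweeps):
        # Step 1: sample allele frequencies P given X and Z.
        P = []
        for k in range(K):
            pop = []
            for l in range(n_loci):
                counts = [1.0] * n_alleles  # Dirichlet(1, ..., 1) prior (assumed)
                for i in range(n):
                    if Z[i] == k:
                        counts[X[i][l]] += 1.0
                pop.append(sample_dirichlet(counts, rng))
            P.append(pop)
        # Step 2: sample assignments Z given X and P (uniform prior on Z).
        for i in range(n):
            weights = []
            for k in range(K):
                w = 1.0
                for l in range(n_loci):
                    w *= P[k][l][X[i][l]]
                weights.append(w)
            Z[i] = rng.choices(range(K), weights=weights)[0]
        # Accumulate summary statistics after the burn-in period.
        if sweep >= burn_in:
            kept += 1
            for i in range(n):
                member[i][Z[i]] += 1
    return [[c / kept for c in row] for row in member]

X = [[0, 0, 0], [0, 0, 0], [1, 1, 1], [1, 1, 1]]
membership = gibbs(X, K=2, n_alleles=2, sweeps=200, burn_in=50, seed=1)
```

Each row of `membership` is the fraction of retained sweeps in which that individual was assigned to each population. Population labels are arbitrary and may swap between runs, which is one reason repeated runs need to be compared with care.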

What is STRUCTURE?
For a given dataset X, multiple independent simulations are usually needed:
- For different numbers of populations (K), to infer the best one
- With the same K, to make sure results are consistent
- With different MCMC control parameters
Each of the multiple simulations is long (hours to days)
STRUCTURE analysis is an HPC task!
It would benefit from an Excel user interface
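
The need for many independent runs maps naturally onto a list of cluster tasks. The sketch below enumerates one task per combination of K, replicate and burn-in length; the field names and the seed scheme are assumptions made for illustration only.

```python
from itertools import product

def build_tasks(k_values, replicates, burn_in_options, base_seed=1000):
    """Enumerate independent simulation tasks, one per (K, replicate, burn-in)."""
    tasks = []
    combos = product(k_values, range(replicates), burn_in_options)
    for idx, (k, rep, burn_in) in enumerate(combos):
        tasks.append({
            "K": k,
            "replicate": rep,
            "burn_in": burn_in,
            "seed": base_seed + idx,  # distinct seed per task (assumed scheme)
        })
    return tasks

# Three values of K, three replicates each, one burn-in setting: nine tasks.
tasks = build_tasks(k_values=[2, 3, 4], replicates=3, burn_in_options=[10000])
```

Because the tasks do not communicate, they can be handed to a job scheduler in any order and run concurrently.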

What is CompFin?
- API: a .NET programmer's interface which abstracts from implementation details of the job scheduler and storage
- Web services to submit/monitor jobs and retrieve output data
- Taskpane (Excel add-in): a client consuming the above web services
- SharePoint Server for storage of Excel templates and model binaries and for job management
- MS SQL Server for data storage (other physical storage implementations are also possible)
- Cluster running Windows Server 2008 with HPC Server 2008 (or Windows Server 2003 with CCS)
- SQL database of historical market data (accessible using Financial APIs)

What does it take to deploy a CompFin application?
- Prepare an Excel 2007 template workbook with XML-mapped input/output tables
- Prepare a C# wrapper code (a "model") which uses CompFin's API to
  - handle XML input/output by converting to/from Data Contracts
  - partition the job into multiple tasks and seamlessly interact with the job scheduler
- Upload the C# assembly (with all necessary binaries) and the Excel template workbook to the SharePoint site
[Diagram. Excel 2007 template workbook: table(s) with input data, table(s) with output data, XML maps, Taskpane. Web service: input (XML), output (XML), SQL. C# wrapper: [DataContract]s, launch tasks, [ResultsDataContract]; each task creates input txt files, launches structure.exe and parses output txt files.]
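
The three steps each task performs (create input text files, launch structure.exe, parse output text files) can be sketched as follows. The actual wrapper is written in C# against the CompFin API; this Python version only illustrates the flow. The command-line flags (-K, -i, -o, -D) are assumptions to be checked against the STRUCTURE documentation, and the file layouts used here are hypothetical simplifications, not STRUCTURE's real formats.

```python
import subprocess
from pathlib import Path

def write_input(path, genotypes):
    """Write one 'label allele allele ...' line per individual (simplified layout)."""
    lines = [" ".join([label] + [str(a) for a in alleles])
             for label, alleles in genotypes]
    Path(path).write_text("\n".join(lines) + "\n")

def build_command(exe, k, infile, outfile, seed):
    """Assemble the command line; the flag names are assumptions."""
    return [exe, "-K", str(k), "-i", str(infile), "-o", str(outfile), "-D", str(seed)]

def parse_membership(text):
    """Parse 'label q1 q2 ...' lines (hypothetical output layout)."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        result[fields[0]] = [float(q) for q in fields[1:]]
    return result

def run_task(workdir, genotypes, k, seed, exe="structure.exe", runner=subprocess.run):
    """Create the input file, launch the executable, and parse its output."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    infile = workdir / "input.txt"
    outfile = workdir / "output.txt"
    write_input(infile, genotypes)
    runner(build_command(exe, k, infile, outfile, seed), check=True)
    return parse_membership(outfile.read_text())
```

Passing the process launcher in as `runner` keeps the wrapper testable without the real executable: a stand-in function can write a canned output file in its place.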

Running a CompFin application
[Diagram. User's laptop: IE, Excel. SharePoint: Excel template, C# wrapper + binaries, job repository. Web services: job launch, monitoring, results retrieval. Compute cluster: job scheduler, C# + binaries, input XML, SQL, API.]
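
From the client's point of view, running an application amounts to three web service interactions: launch the job, monitor it, and retrieve the results. The sketch below shows that loop against an in-memory stand-in. The class, its method names and the status strings are hypothetical and do not reflect the actual CompFin web service interface.

```python
import time

class FakeJobService:
    """In-memory stand-in for the launch/monitor/retrieve services (hypothetical)."""

    def __init__(self, polls_until_done=3):
        self._remaining = {}
        self._polls = polls_until_done
        self._next_id = 0

    def submit(self, input_xml):
        self._next_id += 1
        self._remaining[self._next_id] = self._polls
        return self._next_id

    def status(self, job_id):
        # Report "Running" for the first few polls, then "Finished".
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "Running"
        return "Finished"

    def results(self, job_id):
        return "<results job='%d'/>" % job_id

def run_job(service, input_xml, poll_seconds=0.0, max_polls=100):
    """Submit a job, poll until it finishes, then fetch the output XML."""
    job_id = service.submit(input_xml)
    for _ in range(max_polls):
        if service.status(job_id) == "Finished":
            return service.results(job_id)
        time.sleep(poll_seconds)
    raise TimeoutError("job %d did not finish" % job_id)

output = run_job(FakeJobService(), "<input/>")
```

Since each simulation runs for hours to days, a real client would poll at long intervals or return control to the user and check back later rather than block.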

STRUCTURE at CompFin

Output information from XML maps is visualized using
- pivot tables
- pivot charts
- VB macros

CompFin as a platform for computational biology
Pros:
- Powerful Excel user interface
- Easy deployment
- On-site (on-cluster) data storage (not used here, but with great potential for data-intensive applications, such as Next Generation Sequencing data analysis)
- CompFin was developed with the idea of bringing computational power to the data (rather than data to computational power)
Directions of future development:
- Currently, input/output data transfer is through Excel only. Basic file transfer functionality is needed, because raw biological data are usually too big or not "pretty" enough to be put into Excel
- Output transfer from on-cluster SQL storage to Excel XML maps is not very efficient for large datasets (although it was greatly improved as a result of this project)
- The user needs a domain account on the cluster, which is good for a small, closed organization but less suitable for an open university research environment

We acknowledge support from
- Microsoft HPC Institute program
- Microsoft Research
and collaboration with the MS HPC Team
- Richard Ciapala
- Daniel Simon