Panel Session The Challenges at the Interface of Life Sciences and Cyberinfrastructure and how should we tackle them? Chris Johnson, Geoffrey Fox, Shantenu Jha, Judy Qiu

Life Sciences & Cyberinfrastructure
- Enormous increase in scale of data generation, vast data diversity and complexity
- Development, improvement and sustainability of 21st Century tools, databases, algorithms & cyberinfrastructure
- Past: 1 PI (Lab/Institute/Consortium) = 1 Problem
- Future: Knowledge ecologies and new metrics to assess scientists & outcomes (lab's capabilities vs. ideas/impact)
- Unprecedented opportunities for scientific discovery and solutions to major world problems

Some Statistics
- 10,000-fold improvement in sequencing vs. 16-fold improvement in computing over Moore's Law
- 11% Reproducibility Rate (Amgen) and up to 85% Research Waste (Chalmers)
- /-9 % of Misidentified Cancer Lines
- One out of 3 Proteins Unannotated (Unknown Function)

Opportunities and Challenges
- New transformative ways of doing data-enabled / data-intensive / data-driven discovery in life sciences.
- Identification of research issues and high-potential projects to advance the impact of data-enabled life sciences on the pressing needs of global society.
- Challenges to development, improvement, sustainability and reproducibility, and criteria to evaluate success.
- Education and training for next-generation data scientists.

Largely Data for Life Sciences
- How do we move data to computing?
- Does data have co-located compute resources (cloud?)
- Do we want HDFS-style data storage?
- Or is data in a storage system supporting a wide area file system shared by nodes of the cloud?
- Or is data in a database (SciDB or SkyServer)?
- Or is data in an object store like OpenStack Swift or S3?
- Relative importance of large shared data centers versus instrument- or computer-generated, individually owned data?
- How often is data read (presumably written once!)
  - Which data is most important? Raw or processed to some level?
- Is there a metadata challenge?
- How important is data security and privacy?

Largely Computing for Life Sciences
- Relative importance of data analysis and simulation
- Do we want Clouds (cost effective and elastic) OR Supercomputers (low latency)?
- What is the role of Campus Clusters/resources?
- Do we want large cloud budgets in federal grants?
- How important is fault tolerance/autonomic computing?
- What are special Programming Model issues?
  - Software as a Service such as "Blast on demand"
  - Is R (cloud R, parallel R) critical?
  - What about Excel, Matlab?
  - Is MapReduce important?
  - What about Pig Latin?
- What about visualization?

SALSA HPC Group School of Informatics and Computing Indiana University

Outline
- Iterative MapReduce Programming Model
- Interoperability of HPC and Cloud
- Reproducibility of eScience

NCSA Summer School Workshop, July 26-30, 2010
Students learning about Twister & Hadoop MapReduce technologies, supported by FutureGrid.
Institutions shown: University of Arkansas, Indiana University, University of California at Los Angeles, Penn State, Iowa, Univ. Illinois at Chicago, University of Minnesota, Michigan State, Notre Dame, University of Texas at El Paso, IBM Almaden Research Center, Washington University, San Diego Supercomputer Center, University of Florida, Johns Hopkins

Intel’s Application Stack

Architecture stack (diagram):
- Applications: Support Scientific Simulations (Data Mining and Data Analysis); Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping
- Services and Workflow: Security, Provenance, Portal
- Programming Model: High Level Language
- Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
- Storage: Distributed File Systems, Data Parallel File System, Object Store
- Infrastructure: Linux HPC Bare-system, Windows Server HPC Bare-system, Amazon Cloud, Azure Cloud, Grid Appliance, Virtualization
- Hardware: CPU Nodes, GPU Nodes, Virtualization

MapReduce Programming Model
Moving Computation to Data; Scalable; Fault Tolerance
- Simple programming model
- Excellent fault tolerance
- Moving computations to data
- Works very well for data intensive pleasingly parallel applications
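
The programming model named on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop or Twister code: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `map_reduce`) and the toy reads are assumptions made for the example. Each input record is mapped independently, which is the property that makes such jobs pleasingly parallel.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(sequence, k=3):
    """Map step: emit (k-mer, 1) for every length-k window of one sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Group intermediate values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then reduce each group."""
    # No record depends on another during the map step.
    intermediate = chain.from_iterable(mapper(record) for record in records)
    return dict(reducer(key, values) for key, values in shuffle(intermediate).items())


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGT"]  # toy input, not data from the talk
    print(map_reduce(reads, map_kmers, reduce_counts))
```

In a real framework the map calls would run on the nodes that already hold the data, and a failed map task could simply be rerun, since it has no side effects on other records.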

Sequence visualization pipeline (diagram):
- Gene Sequences (N = 1 Million) -> Select Reference
- Reference Sequence Set (M = 100K) -> Pairwise Alignment & Distance Calculation, O(N^2) -> Distance Matrix -> Multi-Dimensional Scaling (MDS) -> Reference Coordinates x, y, z
- N - M Sequence Set (900K) -> Interpolative MDS with Pairwise Distance Calculation -> N - M Coordinates x, y, z
- Reference Coordinates and N - M Coordinates -> Visualization (3D Plot)
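
A rough count shows why the pipeline splits the data into a reference set and an interpolated remainder. The sketch below uses the slide's N = 1 million and M = 100K and assumes, for illustration only, that each of the N - M remaining sequences is compared against all M reference sequences; the actual interpolation may use fewer comparisons. The function names are hypothetical.

```python
def pairwise_count(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def interpolated_count(n, m):
    """Distances needed when only m reference sequences get full pairwise treatment.

    Assumption: each of the n - m remaining sequences is compared
    against all m reference sequences.
    """
    return pairwise_count(m) + (n - m) * m


N = 1_000_000  # all gene sequences (from the slide)
M = 100_000    # reference sequence set (from the slide)

full = pairwise_count(N)
reduced = interpolated_count(N, M)

print(f"full pairwise:             {full:,}")
print(f"reference + interpolation: {reduced:,}")
print(f"reduction factor:          {full / reduced:.2f}")
```

Under that assumption the distance count drops from roughly 5 x 10^11 to roughly 9.5 x 10^10, and the quadratic term now applies only to the reference set.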

Input Data Size: 680k
Sample Data Size: 100k
Out-Sample Data Size: 580k
Test Environment: PolarGrid with 100 nodes, 800 workers.
Figure panels: 100k sample data; 680k data

Building Virtual Clusters Towards Reproducible eScience in the Cloud
Separation of concerns between two layers:
- Infrastructure Layer: interactions with the Cloud API
- Software Layer: interactions with the running VM
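
The two-layer split can be sketched as two small classes that never call into each other's concerns. Everything here is hypothetical and in-memory: `InfrastructureLayer`, `SoftwareLayer`, `build_virtual_cluster`, the cloud names and the image name are assumptions for illustration, and a real infrastructure layer would call a cloud provider's API instead of fabricating host names.

```python
class InfrastructureLayer:
    """Talks to one cloud's API; here it only fabricates host names."""

    def __init__(self, cloud_name):
        self.cloud_name = cloud_name

    def launch(self, image, count):
        # A real implementation would request `count` VMs booted from `image`.
        return [f"{self.cloud_name}-{image}-{i}" for i in range(count)]


class SoftwareLayer:
    """Talks to a running VM; knows nothing about which cloud started it."""

    def configure(self, host, recipes):
        # A real implementation would run configuration management on the host.
        return {"host": host, "run_list": list(recipes)}


def build_virtual_cluster(infra, software, image, count, recipes):
    """Launch VMs through the infrastructure layer, then configure each one."""
    hosts = infra.launch(image, count)
    return [software.configure(host, recipes) for host in hosts]


if __name__ == "__main__":
    software = SoftwareLayer()
    for cloud in ("cloud-a", "cloud-b"):  # placeholder cloud names
        cluster = build_virtual_cluster(
            InfrastructureLayer(cloud), software, "base", 2, ["hadoop", "cloudburst"]
        )
        print(cluster)
```

Because only the infrastructure layer differs between clouds, the same software configuration is applied everywhere, which is the property the slide ties to reproducibility.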

Design and Implementation
- Equivalent machine images (MI) built in separate clouds
- Common underpinning in separate clouds for software installations and configurations
- Configuration management used for software automation
- Extend to Azure

Running CloudBurst on Hadoop
Running CloudBurst on a 10 node Hadoop Cluster:
    knife hadoop launch cloudburst 9
    echo '{"run_list": "recipe[cloudburst]"}' > cloudburst.json
    chef-client -j cloudburst.json
CloudBurst on a 10, 20, and 50 node Hadoop Cluster
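
The echo step on this slide writes a small JSON file that chef-client then reads with -j. A minimal sketch of generating that file programmatically is shown here, which avoids shell-quoting mistakes. The helper name `write_node_json` is hypothetical, and the sketch assumes the `run_list` key with the array form; the slide itself passes a single string.

```python
import json
import tempfile
from pathlib import Path


def write_node_json(path, recipes):
    """Write a Chef-style node JSON file whose run_list names the given recipes."""
    payload = {"run_list": [f"recipe[{name}]" for name in recipes]}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n")
    return payload


if __name__ == "__main__":
    # Written to a temporary directory so the example leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "cloudburst.json"
        write_node_json(target, ["cloudburst"])
        print(target.read_text())
```

The resulting file would be passed to chef-client with the same -j option the slide uses.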