Getting Beyond the Filesystem: New Models for Data Intensive Scientific Computing. Douglas Thain, University of Notre Dame. HEC FSIO Workshop, 6 August 2009.
Presentation transcript:

The Standard Model: (1) upload data from disk into the file system, (2) submit 1M jobs to the batch system, (3) the jobs access the data through the file system. The user has to guess at what the filesystem is good at. The FS has to guess what the user is going to do next.

Understand what the user is trying to accomplish. …not what they are struggling to run on this machine right now.

The Context: Biometrics Research
Computer Vision Research Lab at Notre Dame.
Kinds of research questions:
– How does human variation affect matching?
– What’s the reliability of 3D versus 2D?
– Can we reliably extract a good iris image from an imperfect video, and use it for matching?
A great context for systems research:
– Committed collaborators, lots of data, problems of unbounded size, some time sensitivity, real impact.
– Acquiring TB of data per month, sustained 90% CPU utilization.
– Always composing workloads that break something.

BXGrid Schema
Scientific Metadata:
Type  Subject  Eye    Color  FileID
Iris  S100     Right  Blue   10486
Iris  S100     Left   Blue   10487
Iris  S203     Right  Brown  24304
Iris  S203     Left   Brown  24305
General Metadata (per file): fileid = …, size = 300K, type = jpg, sum = abc123…
Immutable Replicas (per file): replicaid=423 state=ok; replicaid=105 state=ok; replicaid=293 state=creating; replicaid=102 state=deleting

BXGrid Abstractions
S = Select( color = “brown” )
B = Transform( S, G )
M = AllPairs( A, B, F )
[Figure: images are selected by eye color, each selected image is transformed by G, and the two resulting sets are compared all-to-all by F to produce a ROC curve.]
H. Bui et al, “BXGrid: A Repository and Experimental Abstraction”, Special Issue on e-Science, Journal of Cluster Computing.
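
The slide's Select/Transform/AllPairs chain can be read as ordinary function composition. Below is a minimal plain-Python sketch of that composition; the helper names and record layout are illustrative assumptions, not the actual BXGrid API.

    # Illustrative composition of the BXGrid abstractions; record layout and
    # helper names are assumptions, not the real BXGrid interface.

    def select(records, **attrs):
        # Select: keep only records whose metadata matches the given attributes.
        return [r for r in records if all(r.get(k) == v for k, v in attrs.items())]

    def transform(records, g):
        # Transform: apply a conversion function G to every selected record.
        return [g(r) for r in records]

    def all_pairs(set_a, set_b, f):
        # AllPairs: compare every element of A against every element of B with F.
        return [[f(a, b) for b in set_b] for a in set_a]

    # Mirroring the slide:
    # S = select(repository, color="brown")
    # B = transform(S, G)
    # M = all_pairs(A, B, F)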

An abstraction is a declarative specification of both the data and computation in a batch workload. (Think SQL for clusters.)

AllPairs( set A, set B, func F )
[Figure: every element A0…An of set A is compared against every element B0…Bn of set B by function F.]
Moretti et al, “All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids”, IEEE TPDS, to appear in 2009.

> AllPairs IrisA IrisB MyComp.exe

How can All-Pairs do better?
– No demand paging! Distribute the data via spanning tree.
– Metadata lookup can be cached forever.
– File locations can be cached optimistically.
– Compute the right number of CPUs to use (a simple cost model is sketched below).
– Check for data integrity, but on failure, don’t abort, just migrate.
– On a multicore CPU, use multiple threads and a cache-aware access algorithm.
– Decide at runtime whether to run here on 4 cores, or run on the entire cluster.
[Chart: turnaround time versus number of CPUs.]
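
As one illustration of choosing the right number of CPUs, here is a minimal sketch of the kind of turnaround-time model such a system can evaluate before dispatching work. The formula and constants are illustrative assumptions, not the published All-Pairs model.

    # Illustrative turnaround-time model for choosing how many hosts to use.
    # All parameters are assumptions for the sketch, not measured values.

    def turnaround_time(num_hosts, num_jobs, job_time, data_size_mb,
                        bandwidth_mbps=100.0, dispatch_overhead=1.0):
        # Data must first be distributed to every host; this worst-case sketch
        # charges one full copy per host, even if a spanning tree overlaps transfers.
        transfer = num_hosts * data_size_mb / bandwidth_mbps
        # Jobs then run in parallel across hosts, plus a per-job dispatch cost.
        compute = (num_jobs / num_hosts) * (job_time + dispatch_overhead)
        return transfer + compute

    def best_host_count(max_hosts, **workload):
        # Pick the host count that minimizes predicted turnaround time.
        return min(range(1, max_hosts + 1),
                   key=lambda n: turnaround_time(n, **workload))

    # Example: 10,000 jobs of 30 seconds each over a 500 MB data set.
    # print(best_host_count(500, num_jobs=10000, job_time=30, data_size_mb=500))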

What’s the Upshot?
– The end user gets scalable performance, always at the “sweet spot” of the system.
– It works despite many different failure modes, including uncooperative admins.
– A 60K x 60K All-Pairs ran on CPUs over one week, the largest known experiment on public data in the field.
– The All-Pairs abstraction is 4X more efficient than the standard model.

SomePairs( set A, set B, pairs P, func F )
[Figure: F is applied only to the explicitly listed pairs of elements from A and B, e.g. P = (0,1) (0,2) (1,1) (1,2) (2,0) (2,2) (2,3).]
Moretti et al, “Scalable Modular Genome Assembly on Campus Grids”, under review in 2009.
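
For contrast with AllPairs, here is a minimal sketch of the SomePairs semantics: only the explicitly listed index pairs are evaluated. Plain Python; the names are illustrative.

    # Illustrative SomePairs semantics: apply F only to the listed (i, j) pairs,
    # rather than to the full cross product that AllPairs evaluates.

    def some_pairs(set_a, set_b, pairs, f):
        return {(i, j): f(set_a[i], set_b[j]) for (i, j) in pairs}

    # Example with the pair list from the slide:
    # P = [(0, 1), (0, 2), (1, 1), (1, 2), (2, 0), (2, 2), (2, 3)]
    # results = some_pairs(A, B, P, F)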

Wavefront( R[x,0], R[0,y], F(x,y,d) )
[Figure: given the boundary cells R[x,0] and R[0,y], each interior cell R[x,y] is computed by F from its three already-computed neighbors, so results fill the matrix in diagonal waves.]
Yu et al, “Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions”, IEEE HPDC 2009.
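
A minimal sequential sketch of the recurrence that Wavefront evaluates; a distributed implementation can run all cells on the same anti-diagonal in parallel, but the dependency structure is the same. The function and argument names are illustrative.

    # Illustrative Wavefront recurrence: each R[x][y] depends on R[x-1][y],
    # R[x][y-1], and R[x-1][y-1], so cells on one anti-diagonal are independent.
    # This sketch computes them sequentially for clarity.

    def wavefront(col0, row0, f):
        n, m = len(col0), len(row0)              # result matrix is n x m
        r = [[None] * m for _ in range(n)]
        r[0] = list(row0)                        # boundary R[0,y]
        for x in range(n):
            r[x][0] = col0[x]                    # boundary R[x,0]
        for x in range(1, n):
            for y in range(1, m):
                r[x][y] = f(r[x - 1][y], r[x][y - 1], r[x - 1][y - 1])
        return r

    # Example: a simple edit-distance-style recurrence over 10x10 boundaries.
    # R = wavefront(list(range(10)), list(range(10)),
    #               lambda a, b, d: min(a, b, d) + 1)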

Directed Graph

part1 part2 part3: input.data split.py
        ./split.py input.data

out1: part1 mysim.exe
        ./mysim.exe part1 > out1

out2: part2 mysim.exe
        ./mysim.exe part2 > out2

out3: part3 mysim.exe
        ./mysim.exe part3 > out3

result: out1 out2 out3 join.py
        ./join.py out1 out2 out3 > result

Thain and Moretti, “Abstractions for Cloud Computing with Condor”, in Handbook of Cloud Computing Services, in press.
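
A minimal sketch of the execution semantics behind such a directed graph: a rule can run once all of its input files exist, and independent rules may run concurrently. The in-memory rule format below is an assumption for illustration, not the actual tool's parser, and this sketch runs ready rules sequentially.

    # Illustrative DAG execution: run each rule only after all of its inputs
    # exist. Rules are (targets, inputs, command) triples; a real engine would
    # parse the Make-style file and dispatch independent rules in parallel.

    import subprocess

    rules = [
        (["part1", "part2", "part3"], ["input.data", "split.py"], "./split.py input.data"),
        (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 > out1"),
        (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 > out2"),
        (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 > out3"),
        (["result"], ["out1", "out2", "out3", "join.py"], "./join.py out1 out2 out3 > result"),
    ]

    def run_dag(rules, existing):
        done = set(existing)                 # files assumed to exist already
        pending = list(rules)
        while pending:
            ready = [r for r in pending if all(i in done for i in r[1])]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
            for targets, inputs, command in ready:
                subprocess.run(command, shell=True, check=True)
                done.update(targets)
                pending.remove((targets, inputs, command))

    # run_dag(rules, existing={"input.data", "split.py", "mysim.exe", "join.py"})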

Contributions
Repository for Scientific Data:
– Provide a better match with end user goals.
– Simplify scalability, management, recovery…
Abstractions for Large Scale Computing:
– Expose the actual data needs of the computation.
– Simplify scalability, consistency, recovery…
– Enable transparent portability from single CPU to multicore to cluster to grid.
Are there other useful abstractions?

HECURA Summary
Impact on real science:
– Production quality data repository used to produce datasets accepted by NIST and published to the community. “Indispensable”
– Abstractions used to scale up a student’s daily “scratch” work to 500 CPUs; increased data coverage of published results by two orders of magnitude.
– Planning to adapt the prototype to a Civil Engineering CDI project.
Impact on computer science:
– Open source tools that run on Condor, SGE, and multicore, recently with independent users.
– 1 chapter, 2 journal papers, 4 conference papers.
Broader Impact:
– Integration into courses; used by other projects at ND.
– Undergraduate co-authors from ND and elsewhere.

Participants
Faculty:
– Douglas Thain and Patrick Flynn
Graduate Students:
– Christopher Moretti (Abstractions)
– Hoang Bui (Repository)
– Li Yu (Multicore)
Undergraduates:
– Jared Bulosan
– Michael Kelly
– Christopher Lyon (North Carolina A&T Univ)
– Mark Pasquier
– Kameron Srimoungchanh (Clemson Univ)
– Rachel Witty

For more information:
– The Cooperative Computing Lab
– Cloud Computing and Applications, October 20-21, Chicago IL
This work was supported by the National Science Foundation via grant CCF