Cooperative Computing for Data Intensive Science
Douglas Thain, University of Notre Dame
NSF Bridges to Engineering 2020 Conference, 12 March 2008

What is Cooperative Computing?
By combining our computing and storage resources, we can attack problems larger than any of us could alone: I can use your computer when it is idle, and vice versa. (Most computers are idle about 90 percent of the day.)
Also known as grid computing, distributed computing, metacomputing, volunteer computing, and so on.

Who Needs Cooperative Computing?
Many fields of study rely on simulation and data processing to conduct science: physics, chemistry, biology, engineering, finance, sociology, computer science.
More computing means better results:
– NOT high performance: speed up one program.
– High throughput: produce as many results as possible over the next day, week, or year.

Cooperative Computing Lab
We design and build distributed systems that help people attack BIG problems.
We work directly with end users to make sure that our solutions affect the real world.
We operate a modest computing system as both a production service and a research testbed: currently about 500 CPUs and 300 disks.
CS research challenges: scalability, robustness, usability, debugging, and performance.

What Makes this Challenging?
The programming model: "I want to process 10 TB of data on 100 machines, then distribute it across 20 disks, then view the best results on my workstation."
Fault tolerance: something is always broken!
Performance robustness: there is always one slowpoke.
Debugging: my job runs correctly here, but not there...!?
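To make the throughput-oriented programming model concrete, here is a minimal Python sketch that uses the standard library's process pool as a stand-in for a real batch system. The input paths, the analyze() function, and the scoring rule are hypothetical; in practice each task would be dispatched to a cluster node rather than a local process.

```python
# High-throughput pattern: many independent tasks over partitioned data,
# then pull back only the best results to the workstation.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def analyze(chunk: Path) -> float:
    """Stand-in for the real per-chunk analysis; returns a quality score."""
    return float(len(chunk.read_bytes()))

def main():
    # Pre-partitioned input; in the real setting these chunks total ~10 TB
    # and each task runs on one of ~100 machines.
    chunks = sorted(Path("input_data").glob("chunk-*"))
    with ProcessPoolExecutor() as pool:
        scores = dict(zip(chunks, pool.map(analyze, chunks)))
    # Keep only the best results to inspect locally.
    best = sorted(scores, key=scores.get, reverse=True)[:10]
    for chunk in best:
        print(chunk, scores[chunk])

if __name__ == "__main__":
    main()
```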

An Example Collaboration: Biometrics Research and Distributed Systems

A Common Pattern in Biometrics
Compare every image in a set against every other image with a similarity function F.
Sample workload: 4,000 images, 256 KB each, 1 s per F: about 185 CPU-days.
Future workload: roughly 60,000 images, 1 MB each, 0.1 s per F: about 4,166 CPU-days.
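A quick back-of-the-envelope check of the CPU-day figures above, assuming an all-pairs workload in which every image is compared against every other:

```python
# Verify the workload sizes quoted on the slide (all-pairs: n * n comparisons).
def cpu_days(n_images: int, seconds_per_compare: float) -> float:
    comparisons = n_images * n_images
    return comparisons * seconds_per_compare / 86_400   # 86,400 seconds per day

print(cpu_days(4_000, 1.0))    # ~185   CPU-days (sample workload)
print(cpu_days(60_000, 0.1))   # ~4,167 CPU-days (future workload)
```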

Non-Expert User Using 500 CPUs
Try 1: Each F is a batch job. Failure: dispatch latency >> F runtime.
Try 2: Each row is a batch job. Failure: too many small operations on the filesystem.
Try 3: Bundle all files into one package. Failure: everyone loads 1 GB at once.
Try 4: User gives up and attempts to solve an easier or smaller problem.
[Slide diagram: a head node (HN) dispatching F jobs to cluster CPUs for each attempt.]
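To see why Try 1 fails, compare scheduling overhead against useful work. A minimal sketch; the 30-second dispatch latency is an illustrative assumption, not a measured value:

```python
# If dispatching a batch job costs far more time than the job itself runs,
# overhead dominates: thousands of CPU-days of scheduling for ~185 CPU-days of work.
dispatch_latency = 30.0            # seconds of queueing/scheduling per job (assumed)
runtime_per_f    = 1.0             # seconds of useful work per comparison
comparisons      = 4_000 * 4_000   # one batch job per F in Try 1

overhead_days = comparisons * dispatch_latency / 86_400
work_days     = comparisons * runtime_per_f   / 86_400
print(f"useful work: {work_days:,.0f} CPU-days, scheduling overhead: {overhead_days:,.0f} CPU-days")
```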

All-Pairs Production System
A web portal and All-Pairs engine backed by 300 active storage units (500 CPUs, 40 TB of disk).
1 – Upload F and S into the web portal.
2 – Invoke AllPairs(F, S).
3 – Distribute the data by spanning tree in O(log n) steps.
4 – Choose an optimal partitioning and submit batch jobs.
5 – Collect and assemble results.
6 – Return the result matrix to the user.
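A minimal sketch of what the All-Pairs abstraction computes and how the result matrix might be partitioned into blocks, one per batch job. The block size, file names, and comparison function below are illustrative stand-ins, not the production engine's actual implementation:

```python
# All-Pairs: given a set S and a comparison function F, compute M[i][j] = F(S[i], S[j]).
# A production engine runs each block of the matrix as a batch job near the data;
# here a local loop over blocks stands in for that dispatch.
from itertools import product

def all_pairs(S, F, block=2):
    n = len(S)
    M = [[None] * n for _ in range(n)]
    # Enumerate square sub-blocks of the result matrix; each (bi, bj) block
    # could be submitted as one batch job.
    for bi, bj in product(range(0, n, block), repeat=2):
        for i in range(bi, min(bi + block, n)):
            for j in range(bj, min(bj + block, n)):
                M[i][j] = F(S[i], S[j])
    return M

if __name__ == "__main__":
    faces = ["a.img", "b.img", "c.img", "d.img"]     # stand-ins for image files
    similarity = lambda x, y: float(x == y)          # stand-in for a real matcher
    for row in all_pairs(faces, similarity):
        print(row)
```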

Some Results on Real Workload

Collaboration is Where the Interesting Problems Are! (Cooperative Computing Provides the Resources)

What Makes a Collaboration Work?
Like a marriage? (Old joke.)
First, a show of commitment: go after some low-hanging fruit and publish it.
A proposal for funding only succeeds if you have already started working together.
You need very concrete goals: your partner may not share your idea of an interesting tangent.
Students sometimes need a big push to leave their comfort zone and work together.

For more information…
Douglas Thain – Cooperative Computing Lab
– Apply for the Summer 2008 REU
– Supported by NSF Grants CCF and CNS