Considering Time in Designing Large-Scale Systems for Scientific Computing Nan-Chen Chen 1 Sarah S. Poon 2 Lavanya Ramakrishnan 2 Cecilia R. Aragon 1,2.


Considering Time in Designing Large-Scale Systems for Scientific Computing. Nan-Chen Chen¹, Sarah S. Poon², Lavanya Ramakrishnan², Cecilia R. Aragon¹,². ¹Department of Human Centered Design & Engineering, University of Washington; ²Lawrence Berkeley National Laboratory

High Performance Computing (HPC) (= supercomputers). National Energy Research Scientific Computing Center (NERSC): 133,824 CPU cores, 357 TB memory, 5,000 users.

Impact of the NERSC HPC systems: top journal cover stories per year; 10 Nobel Prizes; journal publications per year; years of simulation data generated in a year; years of simulation data generated in a day.

Impact of the NERSC HPC systems: years of simulation data generated in a year; years of simulation data generated in a day. "Exascale machines" will be coming out in 2025.

Increased speed, increased efficiency? The speed and complexity of HPC machines lead to misunderstandings and breakdowns for users, and thus to inefficiency and difficulties. How can we better consider user-related aspects in HPC design?

Time as a lens. By focusing on the temporal aspects, it "makes us speak in a different language, ask different questions, and use a different framework in the methodological aspects of our research" (Ancona et al., 2001). Time in current HPC design: for HPC machines, clock time, CPU time, and floating point operations per second (flops); for users, an open question (?).
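As a minimal illustration of the machine-centric notions of time named on this slide, clock (wall) time and CPU time can be measured and contrasted directly. This sketch is my own addition, not from the talk: wall time keeps advancing while a process waits, while CPU time advances only while it computes.

```python
import time

def measure(work):
    """Measure wall-clock time vs. CPU time for a callable.

    Wall time advances while the process sleeps or waits in a queue;
    CPU time only advances while the process is actually computing.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)

# Sleeping consumes wall time but almost no CPU time, which is why
# the two metrics can diverge so sharply for queued HPC jobs.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

The gap between the two numbers is exactly the kind of waiting (queueing, I/O, failures) that machine-centric metrics like flops do not capture for users.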

An exemplar HPC workflow: apply for allocation → prepare jobs → submit to queues → jobs are running → archive outputs.

An exemplar HPC workflow: apply for allocation → prepare jobs → submit to queues → jobs are running → archive outputs. HPC machines: how many CPU hours does NERSC have in total? Users: how many hours do I have to work on this project?

An exemplar HPC workflow: apply for allocation → prepare jobs → submit to queues → jobs are running → archive outputs. HPC machines: how to schedule jobs to best utilize the system? Users: how can I get my work done faster?
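The "prepare jobs" and "submit to queues" steps of this workflow typically mean writing a batch script for the system's scheduler. A minimal sketch of composing one in Python (the directives shown are Slurm-style; the account string, queue behavior, and command are my hypothetical examples, not NERSC specifics from the talk):

```python
def make_batch_script(job_name, hours, nodes, command, account="m0000"):
    """Compose a minimal Slurm-style batch script for an HPC job.

    The `account` field is what ties a submitted job back to the
    CPU-hour allocation the project applied for; `hours` and `nodes`
    bound how much of that allocation a single run can consume.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",   # charged against the project allocation
        f"#SBATCH --time={hours:02d}:00:00",
        f"#SBATCH --nodes={nodes}",
        command,
    ])

script = make_batch_script("climate-sim", hours=12, nodes=64,
                           command="srun ./simulate input.nml")
print(script)
```

A script like this would be handed to the scheduler (e.g. `sbatch script.sh`); the gap between the "submit to queues" and "jobs are running" stages of the slide's workflow is the queue wait users experience.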

Time in CSCW literature. Time is not just a mechanistic metric: it is "sets of practices, which are bound up with time-reckoning and time-keeping technologies, but which vary and are shaped by different times, places and communities" (Glennie and Thrift, 2009). "Distributed collective practices not only have rhythms, but in some fundamental sense are rhythms" (Jackson et al., 2011).

Methodology. Six-month field study at a research center where scientists use NERSC machines: 26 interviews with 15 people (13 male, 2 female; 4 domain scientists, 7 computer engineers, 4 HPC facility staff; 5 to 25 years of experience with HPC), plus occasional direct observation and shadowing.

Finding 1: Time cost in preparing jobs. Getting codes to run really fast on HPC takes time to learn and to do, and it is not what scientists are interested in. "I am not really interested in making a script that takes an hour, run in 10 minutes. I am interested in taking a script that runs three days, and running in one, or less... Where my interests are, is making the intractable problem, tractable; not making the tractable problems faster, because they're tractable, who cares?" [Domain Scientist A]

Finding 2: Variability and uncertainty in execution. Variability and uncertainty in the system mean it can take people a long time to debug long run-times or failures. "You don't always get the same result when you do something twice... Sometimes I will run something literally without changing anything, resubmit the same job again. It will have failed once. It will run successfully the second time." [Domain Scientist C]
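The coping strategy Domain Scientist C describes, resubmitting an identical job after a transient failure, can be sketched as a bounded retry loop. This is my illustrative Python sketch, not code from the study; `submit` is a hypothetical stand-in for whatever actually launches the job, and I assume transient failures surface as `RuntimeError`:

```python
def run_with_resubmission(submit, max_attempts=3):
    """Retry a flaky job submission a bounded number of times.

    Mirrors the observed behavior: the same job, resubmitted without
    any change, may fail once and then succeed, so a few identical
    attempts are worth making before giving up.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except RuntimeError as err:  # assumed transient-failure signal
            last_error = err
    raise last_error

# Simulated flaky job: fails the first time, succeeds the second.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("node failure")
    return "output.dat"

print(run_with_resubmission(flaky_job))  # → output.dat
```

The design point is that each silent retry hides real wall-clock time from the user: exactly the uncertainty cost this finding names.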

Finding 3: Time to handle system upgrades. System upgrades enhance performance, but they can lead to compatibility issues that take scientists a long time to fix. "Every time there's an operating system upgrade, it hurts us badly. We haven't gone through any of them without some kind of scar. Sometimes it's really bad. This one is really bad. It may be weeks or months before we actually can run again." [Domain Scientist A]

Theme 1: Conflicts between temporal rhythms. Code fast vs. run fast; handle issues caused by upgrades vs. upgrade the system for performance. Temporal rhythm conflicts in collaboration (Jackson et al., 2011): identifying temporal rhythms and their conflicts in the HPC ecosystem.

Theme 2: Challenges in communication between HPC machines and users. Surfacing states and intentions is critical (Ackerman 2000; Aragon et al. 2008; Bardram 2000; Begole 2002; Dourish & Button 1998; Kusunoki & Sarcevic 2015; Landgren 2006; Mazmanian & Erickson 2014; Mazmanian, Erickson & Harmon 2015).

Theme 3: Collective time. Temporal rhythms connect HPC machines and users. Technology can shape the ways time is organized (Ackerman 2000; Lindley 2015; Orlikowski & Yates 2002).

Final takeaway. It is critical to consider user-related aspects in designing large-scale systems for scientific computing. Using time as a lens helps us identify important design spaces in this large-scale sociotechnical ecosystem. Open questions for designers:
–What can be designed to help resolve temporal rhythm conflicts?
–How can we better communicate states and intentions in the ecosystem?
–Which designs can support and shape people's understanding of time and temporal rhythms in the ecosystem in a collective way?

Acknowledgments. This work was funded by the Office of Science, Office of Advanced Scientific Computing Research (ASCR) of the U.S. Department of Energy under Contract Number DE-AC02-05CH11231 and award number DE-SC. Thanks to all the participants.

UDA (Usable Data Abstractions) Project Blog. More questions/comments? (UTC+8 Taipei Time.) Nan-Chen Chen, Sarah S. Poon, Lavanya Ramakrishnan, Cecilia R. Aragon