Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmer’s Productivity
Uzi Vishkin


Presentation transcript:

Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmer’s Productivity
Uzi Vishkin
Common wisdom [cf. tribal lore collected by DARPA HPCS, 2005]:
- Programming for parallelism is easy.
- It is the programming for performance that makes it hard.

Reinvention of Computing for Many-Core Parallelism Requires Addressing Productivity
Uzi Vishkin
A less fatalistic position:
- Programming for parallelism is easy.
- But the difficulty of programming for performance depends on the system.

Productivity in Parallel Computing
The large parallel machines story
- Funding of productivity: $650M, HProductivityCS, ~2002.
- Met # Gflops goals: up by 1000X since mid-90’s; Exascale talk & plans.
- Met power goals. Also: groomed eloquent spokespeople.
- Progress on productivity: no agreed benchmarks. No spokesperson. Elusive! In fact, not much has changed since: “as intimidating and time consuming as programming in assembly language” (NSF Blue Ribbon Committee, 2003) or even “parallel software crisis”, CACM.
- Common sense engineering: untreated bottleneck → diminished returns on improvements → bottleneck becomes more critical.
- Next 10 years: new specific programs on flops and power. What about productivity?!
- Reality: economic island. Cleared by marketing: DOE applications.
Enter: mainstream many-cores
- Every CS major should be able to program many-cores.

Coherence Issue
“When you come to a fork in the road, take it!” - Yogi Berra
Camp 1
- Many of the best US minds opt for occupations that do not involve programming.
- NSF tries to lure them to CS in HS by: (1) presenting the steady march and broad reach of computing across the sciences, industries, culture and society, correcting the current narrow focus on programming in the introductory course [New Programs Aim to Lure Young Into Digital Jobs, NYTimes, 12/09]; (2) productivity; (3) computational thinking.
Camp 2
- Power/performance → reinvent mainstream computing for parallelism.
- Vendors try to build many-cores that require decomposition-first programming. Railroading to productivity “disaster area”. Hacking. Insufficient support from parallel algorithms design & analysis. Short on outreach/productivity/abstraction.
Unintended outcome of “taking the fork” (prod vs. power/perf)
- Camp cheerleaders: core CS (alg design & analysis style) is radical. Peer review favors both sides over center. Centrists as extremists is an oxymoron!
- Building wrong expectations among prospective CS majors. Disappointment will lead to “Get me out of this major”.
- Pool of CS majors to be engaged in decomposition-first is too limited (after subtracting the lured-to-breadth-over-programming and the core).
Consequences of “taking the fork” surrealism
- Eventual casualties: # students, credibility & productivity.
- Research/comparison of several holistic parallel platforms could: (i) prevent much of the damage, (ii) build up the real diversity needed for natural selection, and (iii) advise the NSF on programs that otherwise could cancel one another.

Lessons from Invention of Computing
“It should be noted that in comparing codes four viewpoints must be kept in mind, all of them of comparable importance:
- Simplicity and reliability of the engineering solutions required by the code;
- Simplicity, compactness and completeness of the code;
- Ease and speed of the human procedure of translating mathematically conceived methods into the code [“COMPUTATIONAL THINKING”], and also of finding and correcting errors in coding or of applying to it changes that have been decided upon at a later stage;
- Efficiency of the code in operating the machine near its full intrinsic speed.”
H. Goldstine, J. von Neumann. Planning and coding problems for an electronic computing instrument, 1947
Take home
- Comparing codes is a pivotal and broad issue.
- Concern for productivity (development-time) is as old as computing.
- Human process: intellectual/algorithm/planning plus skill/coding.
- Contrast with: tendency to understand HW upgrade from application code (even if machine not yet built, A. Ghuloum, Intel, CACM 9/09), an unreasonable expectation from application code developers.

How was the “human procedure” addressed?
Answer: Basically, by Abstraction and Induction
1. General-purpose computing is about a platform for your future (whatever) program. As opposed to a specific application, a general method for the human procedure was key.
2. GvN47 based coding on mathematical induction (known for math proofs and as an axiom of the natural numbers).
3. It worked for establishing serial computing. This method led to simplicity, compactness and completeness of the resulting code.
References:
- Knuth67, The Art of Computer Programming. Vol. 1: Fundamental Algorithms. Chapter 1: Basic concepts. 1.1 Algorithms. 1.2 Math Prelims, Math Induction.
Algorithms: 1. Finiteness. 2. Definiteness. 3. Input. 4. Output. 5. Effectiveness.
Gold standards
- Definiteness: Induction
- Effectiveness: “Uniform cost criterion” [AHU74] abstraction

“Killer app” for general-purpose many cores: Let the app-dreamers do their magic
- Oxymoron? General-purpose: no one application in particular.
- Not really: if possible, a killer application would be helpful. However, it is wrong as a condition for progress.
- General-purpose computing is an infrastructure for the IT sector and the economy.
- The general-purpose computing infrastructure has been realized by the software spiral (the cyclic process of hardware improvements leading to software improvements that lead back to hardware improvements and so on; Andy Grove, Intel).
- Instituting a parallel software spiral is a killer application for many-cores: as in the past, app-dreamers will invent uses → not surprisingly, the killer application is also an infrastructure.
- Government has a role in building infrastructure → instituting a parallel software spiral merits government funding.
- However, insufficient empowerment for: creating and developing alternative platforms to the point of establishing their merit.

Serial Abstraction & A Parallel Counterpart
Example
- Rudimentary abstraction that made serial computing simple: that any single instruction available for execution in a serial program executes immediately.
- Abstracts away different execution time for different operations (e.g., memory hierarchy). Used by programmers to conceptualize serial computing and supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
- Rudimentary abstraction for making parallel computing simple: that indefinitely many instructions, which are available for concurrent execution, execute immediately, dubbed Immediate Concurrent Execution (ICE) → step-by-step (inductive) explication of the instructions available next for concurrent execution. # processors not even mentioned. Falls back on the serial abstraction if 1 instruction/step.
- What could I do in parallel at each step assuming unlimited hardware?
[Figure: two plots of # ops against time. Serial execution, based on the serial abstraction: Time = Work. Parallel execution, based on the parallel abstraction: Time << Work. Work = total # ops.]
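To make the Work versus Time distinction concrete, here is a minimal Python sketch that sums a list round by round. In each round it lists every addition that is available for concurrent execution, as ICE suggests, and counts the round as one time step. The function name `ice_sum` and the choice of summation as the example are illustrative assumptions, not taken from the slides, and the rounds are simulated serially rather than run on parallel hardware.

```python
def ice_sum(values):
    """Sum values in rounds; return (total, work, depth).

    work  = total number of additions performed
    depth = number of rounds, i.e. time under the ICE abstraction
    """
    level = list(values)
    work = 0
    depth = 0
    while len(level) > 1:
        # Every pairwise addition in this round is independent of the others,
        # so under ICE the whole round counts as a single time step.
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        next_level = [a + b for a, b in pairs]
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd element is carried to the next round
        work += len(pairs)
        depth += 1
        level = next_level
    total = level[0] if level else 0
    return total, work, depth


if __name__ == "__main__":
    total, work, depth = ice_sum(range(1, 9))
    print(total, work, depth)
```

For eight inputs the sketch performs 7 additions in 3 rounds, so work stays the same as in a serial loop while the number of steps drops to the logarithm of the input size.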

CACM’10: Using simple abstraction to guide the reinvention of computing for parallelism
- [Overall: old Work-Depth description. Only “minimalist abstraction”: ICE builds only on induction, itself a rudimentary concept.]
- [SV82] conjectured that the rest (full PRAM algorithm) is just a matter of skill.
- Lots of evidence that “work-depth” works. Used as framework in PRAM algorithms texts: JaJa-92, KKT-01.
- ICE in line with PRAM: only really successful parallel algorithmic theory. Latent, though not widespread, knowledgebase.
- Widely agreed: work & depth are necessary. Jury is out on: what else. Our position: as little as possible.
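The slides do not spell out how work and depth relate to running time on a machine with a fixed number of processors. A standard scheduling bound, usually attributed to Brent, fills that step. As a sketch, with the symbols W (total operations), D (number of ICE steps) and p (processors) assumed here for illustration:

```latex
\[
  T_p \;\le\; \frac{W}{p} + D
\]
% Example: summing n numbers with a balanced binary tree
\[
  W = n - 1, \qquad D = \lceil \log_2 n \rceil, \qquad
  T_p \;\le\; \frac{n - 1}{p} + \lceil \log_2 n \rceil
\]
```

Read this way, an algorithm designer only needs to keep W close to the serial operation count and D small; the number of processors enters afterwards, at scheduling time.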

Workflow from parallel algorithms to programming versus trial-and-error
[Diagram comparing two workflows. Labels, in the order they appear: Option 1: PAT; Rethink algorithm: take better advantage of cache; Hardware. PAT; Tune; Hardware. Option 2: Parallel algorithmic thinking (ICE/WD/PRAM); Compiler. Further labels: Insufficient inter-thread bandwidth?; Domain decomposition, or task decomposition; Program; Prove correctness; Still correct.]
- Is Option 1 good enough for the parallel programmer’s model?
- Options 1B and 2 start with a PRAM algorithm, but not option 1A.
- Options 1A and 2 represent workflow, but not option 1B.
- Not possible in the 1990s. Possible now: why settle for less?

Mark Twain on the PRAM
“We should be careful to get out of an experience only the wisdom that is in it, and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again, and that is well; but also she will never sit down on a cold one anymore.” - Mark Twain
- PRAM algorithms did not become standard CS knowledge because of a “hot stove-lid”: no 1990s implementable computer architecture allowed programmers to look at a computer as a PRAM.
- The XMT changed that.
- PS: NVidia happy to report success with 2 PRAM algorithms in IPDPS09. Great to see that from a major vendor. [These 2 algorithms are decomposition-based, unlike most PRAM algorithms. Freshmen programmed the same 2 algorithms on our XMT machine.]

The Parallel Programmer’s Productivity Landscape
Postulation: a continental divide
- How different can productivity of many-core architectures be? Answer: very!
- Metaphor: dropping rain a short distance apart. Very different outcomes. Think of programmer’s productivity as the cost of producing usable water.
- The decomposition-first programming side requires domain-decomposition or task-decomposition, which have not worked in spite of big investment. (Looks greener, since invested; what if it goes to the ocean while the arid side goes to Sweetwater?)
- Work-depth initial abstraction is decomposition-free. (Arid, under-invested.) Requires a leap of faith for investment.
[Figure labels: Decomposition-first programming → Ocean; Work-depth programming → Great Lakes.]

Validation of Ease of Programming To Date
1. Comparison with MPI by DARPA-HPCS SW Eng leaders [HochsteinBasiliVGilbert].
2. Teachability demonstrated so far [TorbertVTzurEllison, SIGCSE’10 to appear]:
- To freshman class with 11 non-CS students. Some prog. assignments: median finding, merge-sort, integer-sort & sample-sort.
Other teachers:
- Magnet HS teacher. Downloaded simulator, assignments, class notes from XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do not just for embarrassingly parallel. Teaches also OpenMP, MPI, CUDA. Lookup keynote at + interview with teacher.
- High school & middle school (some 10 year olds) students from underrepresented groups, by HS Math teacher.
Teachability: necessary (but not sufficient) condition for ease-of-programming, itself a necessary (but not sufficient) condition for productivity. Hence, teachability is as good a benchmark as any out there for productivity.

Conclusion
- Want future mainstream programmers to embrace general-purpose parallelism (every CS major; for common SW architectures). Yet, in the past:
- Insufficient evidence on productivity. Yet, history of repeated surprise: parallel machines repel programmers.
Research Drivers
1. Empower select holistic (HW+SW) parallel platforms for merit-based comparison. Imagine a new world with the given platform. Consider all aspects: e.g., is it sufficient for reinstating the SW spiral? Is the barrier-to-entry for creative applications low enough? How will the CS curriculum look? Who will be attracted to study CS? Then, gather evidence:
2. Methodically compare productivity (development-time, run-time) of platforms.
→ Ownership stake role for Indian partner (Prof. PJ Narayan, IIIT, Hyderabad): India is the largest producer of SW. New platform requires sufficient Indian interest. Lead benchmarking/comparison for productivity, etc.
For session
- Coming from algorithms, computer vision and computational biology, compare select platforms for performance, productivity (development-time and run-time), and overall for reinstating the SW spiral.
- Benchmark algorithms and applications based on their inherent parallelism for future machine platforms, as opposed to using existing code written for yesterday’s (serial or parallel) machines.
- Issue: How to benchmark for productivity?

Not just a theory. XMT: prototyped HW & SW
- Never a successful general-purpose parallel computer (easy to program, good speedups, up & down scalable). IF you could program it → great speedups. Motivation: fix the IF.
- Original explicit multi-threaded (XMT) architecture [SPAA98].
- 64-core, 75 MHz FPGA prototype [SPAA’07, Computing Frontiers’08].
- Interconnection network for 128-core. 9mm x 5mm, IBM 90nm process. 400 MHz prototype [HotInterconnects’07].
- Same design as 64-core FPGA. 10mm x 10mm, IBM 90nm process. 150 MHz prototype.
- The design scales to cores on-chip

Programmer’s Model: Engineering Workflow
- Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in synchronous model.
- SPMD reduced synchrony:
  - Threads advance at own speed, not lockstep.
  - Main construct: spawn-join block. Note: can start any number of processes at once. Can express locality (“decomposition-second”).
  - Prefix-sum (ps). Independence of order semantics (IOS).
  - Establish correctness & complexity by relating to WD analyses.
  - Circumvents “The problem with threads”, e.g., [Lee].
- Tune (compiler or expert programmer): (i) length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL08]
- Trial & error contrast: similar start → while insufficient inter-thread bandwidth do {rethink algorithm to take better advantage of cache}.
[Diagram: spawn, join, spawn, join.]
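The spawn-join block and the prefix-sum primitive can be illustrated with array compaction, in which each thread that holds a nonzero element claims a distinct output slot. The sketch below is a loose Python emulation and not XMT code: `PrefixSumCounter`, `spawn_join` and `compact` are hypothetical names introduced here, the prefix-sum is imitated with a lock-protected fetch-and-add, and one OS thread per element is used only for clarity. The result is correct whatever order the threads run in, which is the point of independence of order semantics.

```python
import threading


class PrefixSumCounter:
    """Hypothetical stand-in for a prefix-sum (ps) primitive: atomic fetch-and-add."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def ps(self, increment):
        """Add increment to the counter and return the value it held before."""
        with self._lock:
            old = self._value
            self._value += increment
            return old

    @property
    def value(self):
        return self._value


def spawn_join(low, high, body):
    """Run body(i) for every i in [low, high] as its own thread, then wait for all."""
    threads = [threading.Thread(target=body, args=(i,)) for i in range(low, high + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the join: no code after this runs until every thread is done


def compact(a):
    """Copy the nonzero elements of a into a new list, in arbitrary order."""
    b = [None] * len(a)
    counter = PrefixSumCounter()

    def body(i):
        if a[i] != 0:
            slot = counter.ps(1)  # each caller receives a distinct slot index
            b[slot] = a[i]

    spawn_join(0, len(a) - 1, body)
    return b[:counter.value]


if __name__ == "__main__":
    print(sorted(compact([0, 3, 0, 7, 5, 0])))
```

The output order depends on thread scheduling, so the usage line sorts the result before printing; only the set of compacted values is guaranteed.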

Performance
- Simulation of 1024 processors: 100X on standard benchmark suite for VHDL gate-level simulation [GV06].
- [SPAA’09]: ~10X relative to Intel Core 2 Duo with 64-processor XMT; same silicon area as 1 commodity processor (core).
- Promise of 100X with 1024 processors also for irregular, fine-grained parallelism, with up- and down-scalability.

Some Credits
- Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli, Beliz Saybasili, Alex Tzannes.
- Recent grads: Aydin Balkan, Mike Horak, Xingzhi Wen.
- Industry design experts (pro-bono).
- Rajeev Barua, Compiler. Co-advisor of 2 CS grad students. NSF grant.
- Gang Qu, VLSI and Power. Co-advisor.
- Steve Nowick, Columbia U., Asynch computing. Co-advisor. NSF team grant.
- Ron Tzur, U. Colorado, K12 Education. Co-advisor. NSF seed funding.
- K12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School; 2009 Summer Camp, Montgomery County Public Schools.
- Marc Olano, UMBC, Computer graphics. Co-advisor.
- Tali Moreshet, Swarthmore College, Power. Co-advisor.
- Bernie Brooks, NIH. Co-advisor.
- Marty Peckerar, Microelectronics.
- Igor Smolyaninov, Electro-optics.
- Funding: NSF, NSA (2008: deployed XMT computer), NIH.
- 6 issued patents. More patent applications.
- Informal industry partner: Intel.
- Reinvention of Computing for Parallelism: selected for Maryland Research Center of Excellence (MRCE) by USM. Not yet funded. 17 members, including UMBC, UMBI, UMSOM. Mostly applications.