Joint UIUC/UMD Parallel Algorithms/Programming Course. David Padua, University of Illinois at Urbana-Champaign; Uzi Vishkin, University of Maryland (speaker); Jeffrey C. Carver, University of Alabama

Motivation 1/4
Programmers of today's parallel machines must overcome three productivity busters, beyond just identifying operations that can be executed in parallel:
(i) follow the often-difficult 4-step programming-for-locality recipe: decomposition, assignment, orchestration, and mapping [CS99]
(ii) reason about concurrency in threads, e.g., race conditions (illustrated in the sketch below)
(iii) for machines such as GPUs, which fall behind on serial (or low-parallelism) code, whole programs must be highly parallel
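A minimal sketch of buster (ii), assuming C with OpenMP (the threading model used later in the course); the variable names are illustrative, not from the course materials. Both loops attempt the same count, but the first loses updates to a data race:

```c
#include <stdio.h>

int main(void) {
    long racy = 0, safe = 0;

    /* Data race: threads read-modify-write `racy` concurrently,
       so increments can be lost and the result is nondeterministic. */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++)
        racy++;

    /* Fixed: the atomic directive makes each increment indivisible. */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic
        safe++;
    }

    printf("racy = %ld, safe = %ld (expect 1000000)\n", racy, safe);
    return 0;
}
```

Built with `gcc -fopenmp`, the first loop typically prints a value below 1000000; spotting and repairing such hazards is exactly the reasoning burden the slide refers to.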

Motivation 2/4: Commodity computer systems
If you want your program to run significantly faster … you're going to have to parallelize it. Parallelism is the only game in town. But where are the players?
"The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle" — D. Patterson, IEEE Spectrum 7/2010
Only heroic programmers can exploit the vast parallelism in current machines — report by CSTB, U.S. National Academies, 2011
An education agenda must: (i) recognize this reality, (ii) adapt to it, and (iii) identify broad-impact opportunities for education

Motivation 3/4: Technical Objectives
Parallel computing exists to provide speedups over serial computing. Its emerging democratization ⇒ the general body of CS students & graduates must be capable of achieving good speedups.
What is at stake? A general-purpose computer that can be programmed effectively by too few programmers, or that requires excessive learning, ⇒ application SW development costs more, weakening the market potential of more than just the computer: traditionally, economists look to the manufacturing sector to better the recovery prospects of the economy. Software production is the quintessential 21st-century mode of manufacturing. These prospects are at peril if most programmers are unable to design effective software for mainstream computers.

Motivation 4/4: Possible Roles for Education
Facilitator. Prepare & train students and the workforce for a future dominated by parallelism.
Testbed. Experiment with vertical approaches and refine them to identify the most cost-effective ways of achieving speedups.
Benchmark. Given a vertical approach, identify the developmental stage at which it can be taught. Rationale: ease of learning/teaching is a necessary (though not sufficient) condition for ease-of-programming.

The joint inter-university course
UIUC: Parallel Programming for Science and Engineering; prof: DP. UMD: Parallel Algorithms; prof: UV. Student population: upper-division undergrads and graduate students of diverse majors and backgrounds. ~1/2 of the fall 2010 sessions were held jointly by videoconferencing.
Objectives and outcomes
1. Demonstrate the logistical and educational feasibility of a real-time co-taught course. Outcome: overall success, minimal glitches. Helped alert students that success on material taught by the other prof is just as important.
2. Compare OpenMP using an 8-processor SMP against PRAM/XMTC using a 64-processor XMT (<1/4 of the silicon area of 2 SMP processors).

Joint sessions
DP taught OpenMP programming and provided parallel-architecture knowledge. UV taught parallel (PRAM) algorithms, plus ~20 minutes of XMTC programming. 3 joint programming assignments.
Non-shared sessions
UIUC: mostly MPI; submitted more OpenMP programming assignments. UMD: more parallel algorithms; dry homework on design & analysis of parallel algorithms; submitted a more demanding XMTC programming assignment.
JC: anonymous questionnaire filled in by the students, accessed by DP and UV only after all grades were posted, per IRB guidelines.

Rank approaches for achieving (hard) speedups
Breadth-first search (BFS) example, 42 students in the fall 2010 joint UIUC/UMD course (a sketch of the algorithm follows below):
- <1x speedups using OpenMP on the 8-processor SMP
- 7x-25x speedups on the 64-processor XMT FPGA prototype
Questionnaire: all students but one ranked XMTC ahead of OpenMP for achieving speedups.
In view of this evidence: are we really ready for standards?
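For context, a minimal level-synchronous BFS sketch in C/OpenMP; this is a common textbook formulation with hypothetical names, not the course's actual assignment code, and it assumes a CSR graph representation:

```c
#include <stdio.h>

/* Level-synchronous BFS over a CSR graph: neighbors of vertex v are
   adj[off[v] .. off[v+1]-1]. level[v] = BFS depth of v, or -1 if unvisited. */
void bfs(int n, const int *off, const int *adj, int src, int *level) {
    for (int v = 0; v < n; v++) level[v] = -1;
    level[src] = 0;
    int expanded = 1;
    for (int d = 0; expanded; d++) {       /* one iteration per BFS level */
        expanded = 0;
        /* All vertices of the current frontier are expanded in parallel. */
        #pragma omp parallel for reduction(|:expanded)
        for (int v = 0; v < n; v++) {
            if (level[v] != d) continue;
            for (int e = off[v]; e < off[v + 1]; e++) {
                int w = adj[e];
                if (level[w] == -1) {      /* concurrent writers all store
                                              d + 1, so the race is benign
                                              in effect */
                    level[w] = d + 1;
                    expanded = 1;
                }
            }
        }
    }
}

int main(void) {
    /* Path graph 0-1-2-3 in CSR form. */
    int off[] = {0, 1, 3, 5, 6};
    int adj[] = {1, 0, 2, 1, 3, 2};
    int level[4];
    bfs(4, off, adj, 0, level);
    for (int v = 0; v < 4; v++) printf("level[%d] = %d\n", v, level[v]);
    return 0;
}
```

The slide's comparison is between platforms and programming models, not this particular code, which is offered only to ground the discussion of what the students implemented.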

Parallel Random-Access Machine/Model
PRAM: n synchronous processors, all having unit-time access to a shared memory.
Reactions: you've got to be kidding, this is way:
- Too easy
- Too difficult: why even mention processors? What to do with n processors? How to allocate processors to instructions?

Immediate Concurrent Execution (ICE)
The 'Work-Depth framework' [SV82], adopted in parallel algorithms texts [J92, KKT01]. Example: pairwise parallel summation. 1st round for 8 elements: in parallel, 1st+2nd, 3rd+4th, 5th+6th, 7th+8th (see the sketch below).
ICE is the basis for architecture specs: V., Using Simple Abstraction to Reinvent Computing for Parallelism, CACM 1/2011. Similar to the role of the stored program & program counter in architecture specs for serial computing.
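A minimal C/OpenMP sketch of the pairwise (balanced-tree) summation just described; the use of OpenMP rather than XMTC, and all names, are illustrative assumptions. Each round performs its independent additions in parallel, one ICE/PRAM step per round, for log2(n) rounds:

```c
#include <stdio.h>

/* Pairwise (balanced binary tree) summation, as on the slide:
   round 1 adds a[0]+a[1], a[2]+a[3], ...; each round halves the
   number of active elements. Assumes n is a power of 2. */
long tree_sum(long a[], int n) {
    for (int stride = 1; stride < n; stride *= 2) {
        /* All additions within a round are independent: one ICE step. */
        #pragma omp parallel for
        for (int i = 0; i < n; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}

int main(void) {
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %ld (expect 36)\n", tree_sum(a, 8));
    return 0;
}
```

Roughly, XMTC would express each round's parallel loop with its spawn construct instead of the OpenMP pragma; the algorithmic content is the same.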

Feasible for many-cores: Algorithms ⇒ Programming (programmer's workflow, rudimentary yet stable compiler)
PRAM-On-Chip HW prototypes:
- 64-core, 75MHz FPGA of XMT [SPAA98..CF08]
- Toolchain: compiler + simulator [HIPS]
- core interconnection network: IBM 90nm, 9mm×5mm, MHz [HotI07]
- FPGA design ⇒ ASIC: IBM 90nm, 10mm×10mm, 150 MHz
Architecture scales to cores on-chip.
XMT homepage: or search: 'XMT'

Has the study of PRAM algorithms helped XMT programming?
Majority of UIUC students: No. UMD students: Strong Yes, reinforced by written explanations.
Discussion: Exposure of UIUC students to PRAM algorithms and XMT programming was much more limited, and their understanding of this material was not challenged by analytic homework or exams. For the same programming challenges, the performance of UIUC and UMD students was similar.
Must students be exposed to a minimal amount of parallel algorithms and their programming, and be properly challenged on analytic understanding, in order to internalize their merit? If yes: tension with the pressure on parallel computing courses to cover a hodge-podge of programming paradigms & architecture backgrounds.

More issues/lessons
Recall the titles of the courses at UIUC/UMD: should we use class time only for algorithms, or also for programming? Algorithms offer a high level of abstraction, allowing more advanced problems to be covered. Note: understanding was tested only for UMD students.
We made do with already-assigned courses. Next time: a more homogeneous population, e.g., a CS grad class. If interested in taking part, please let us know.
General lesson: the IRB requires pre-submission of all questionnaires, so planning must be complete by then.

Conclusion
For parallelism to succeed serial computing in the mainstream, the first experience of students must:
- demonstrate solid hard speedups
- be trauma-free
Beyond education: objective rankings of approaches for achieving hard speedups provide a clue for curing the ills of the field.

Course homepages
agora.cs.illinois.edu/display/cs420fa10/Home and
For a summary of the PRAM/XMT education approach: includes teaching experience extending from middle school to graduate courses; course material (class notes, programming assignments, video presentations of a full-day tutorial and a full-semester graduate course); a software toolchain (compiler and cycle-accurate simulator, HIPS 5/20) available for free download; and the XMT hardware.

How I teach parallel algorithms at different developmental stages
Graduate: in class, the same PRAM algorithms course as in prior decades, plus complexity-style dry HW. <20 minutes of XMTC programming. 6 programming assignments with hard-speedup targets, including parallel graph connectivity and XMT performance tuning.
Upper-division undergraduate: less dry HW, less programming; still demand hard speedups.
Freshmen/HS [SIGCSE'10]: minimal/no dry HW. Same problems as in the freshman serial programming course.
⇒ Understanding of parallel algorithms needs to be enforced & validated by programming; otherwise most students will get very little from it.

What about architecture education?
We badly need parallel architectures that make parallel thinking easier. In the happy days of serial computing, stored program + program counter ⇒ a wall between architecture and algorithms ⇒ algorithms got low priority. Not now!
A trigger for XMT: the brilliant incompetence that ECE faculty never teach undergrad algorithms courses, while one can be an algorithms researcher and teach architecture courses … ⇒ XMT.
Reality: few regularly teach both architecture and (grad) algorithms courses, not to mention parallel algorithms. But why rely on accidents?! Teach the next generation of architecture students to master both, so that they can be better architects.
"Very different thought styles are used for one and the same problem more often than are very closely related ones" — 1935, Ludwik Fleck ('the Turing' of the sociology of science)