Joint UIUC/UMD Parallel Algorithms/Programming Course
David Padua, University of Illinois at Urbana-Champaign
Uzi Vishkin, University of Maryland (speaker)
Jeffrey C. Carver, University of Alabama
Motivation 1/4
Programmers of today's parallel machines must overcome 3 productivity busters, beyond just identifying operations that can be executed in parallel:
(i) impose the often difficult 4-step programming-for-locality recipe: decomposition, assignment, orchestration, and mapping [CS99]
(ii) reason about concurrency in threads, e.g., race conditions
(iii) for machines such as GPUs, which fall behind on serial (or low-parallelism) code, make whole programs highly parallel
Motivation 2/4: Commodity computer systems
If you want your program to run significantly faster … you're going to have to parallelize it.
Parallelism: only game in town. But where are the players?
- "The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle" (D. Patterson, IEEE Spectrum 7/2010)
- Only heroic programmers can exploit the vast parallelism in current machines (Report by CSTB, U.S. National Academies 2011)
An education agenda must: (i) recognize this reality, (ii) adapt to it, and (iii) identify broad-impact opportunities for education.
Motivation 3/4: Technical Objectives
Parallel computing exists to provide speedups over serial computing.
Its emerging democratization means that the general body of CS students & graduates must be capable of achieving good speedups.
What is at stake?
A general-purpose computer that can be programmed effectively by too few programmers, or that requires excessive learning, makes application SW development cost more, weakening the market potential of more than just the computer:
- Traditionally, economists look to the manufacturing sector for bettering the recovery prospects of the economy.
- Software production is the quintessential 21st-century mode of manufacturing.
- These prospects are in peril if most programmers are unable to design effective software for mainstream computers.
Motivation 4/4: Possible Roles for Education
- Facilitator. Prepare & train students and the workforce for a future dominated by parallelism.
- Testbed. Experiment with vertical approaches and refine them to identify the most cost-effective ways of achieving speedups.
- Benchmark. Given a vertical approach, identify the developmental stage at which it can be taught.
Rationale: Ease of learning/teaching is a necessary (though not sufficient) condition for ease of programming.
The joint inter-university course
UIUC: Parallel Programming for Science and Engineering, Prof: DP
UMD: Parallel Algorithms, Prof: UV
Student population: upper-division undergrads and graduate students. Diverse majors and backgrounds.
About half of the fall 2010 sessions were held jointly by videoconferencing.
Objectives
1. Demonstrate the logistical and educational feasibility of a real-time, joint, co-taught course.
Outcome: Overall success. Minimal glitches. Helped to alert students that success on material taught by the other prof is just as important.
2. Compare OpenMP on an 8-processor SMP against PRAM/XMTC on a 64-processor XMT (<1/4 of the silicon area of 2 SMP processors).
Joint sessions
- DP taught OpenMP programming. Provided parallel architecture knowledge.
- UV taught parallel (PRAM) algorithms, plus ~20 minutes of XMTC programming.
- 3 joint programming assignments.
Non-shared sessions
- UIUC: mostly MPI. Students submitted more OpenMP programming assignments.
- UMD: more parallel algorithms. Dry homework on design & analysis of parallel algorithms. Students submitted a more demanding XMTC programming assignment.
JC: Anonymous questionnaire filled out by the students. Accessed by DP and UV only after all grades were posted, per IRB guidelines.
Rank approaches for achieving (hard) speedups
Breadth-first-search (BFS) example
42 students in the fall 2010 joint UIUC/UMD course:
- <1x speedups using OpenMP on the 8-processor SMP
- 7x-25x speedups on the 64-processor XMT FPGA prototype
Questionnaire
All students but one ranked XMTC ahead of OpenMP.
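For readers unfamiliar with the BFS example, the following is a minimal serial sketch in Python of the level-synchronous structure that parallel BFS programs typically exploit. It is not the course's OpenMP or XMTC assignment code; the function name `bfs_levels` and the small sample graph are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def bfs_levels(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Return the BFS level (distance from source) of every reachable vertex."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # All vertices in the current frontier are independent of each other.
        # A parallel version could process them concurrently; this sketch
        # simply loops over them serially.
        for u in frontier:
            for v in adj.get(u, []):
                # In a parallel version this check-and-set would need to be
                # atomic, so that each vertex is claimed exactly once.
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Hypothetical sample graph, given as adjacency lists.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = bfs_levels(graph, 0)
```

The available parallelism at each step equals the frontier size, which can be small and irregular. That is one plausible reason BFS is a hard case for obtaining speedups.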
Has the study of PRAM algorithms helped XMT programming?
- Majority of UIUC students: No
- UMD students: Strong Yes, supported by written explanations
Discussion
Exposure of UIUC students to PRAM algorithms and XMT programming was much more limited. Their understanding of this material was not challenged by analytic homework or exams.
For the same programming challenges, the performance of UIUC and UMD students was similar.
Must students be exposed to a minimal amount of parallel algorithms and their programming, and be properly challenged on analytic understanding, to internalize their merit?
If yes: tension with the pressure on parallel computing courses to cover a hodge-podge of programming paradigms & architecture backgrounds.
More Issues/lessons
Recall the titles of the courses at UIUC/UMD: should we use class time only for algorithms, or also for programming?
Algorithms: high level of abstraction. Allows covering more advanced problems.
Note: Understanding was tested only for UMD students.
We made do with already assigned courses. Next time: a more homogeneous population, e.g., a CS grad class. If interested in taking part, please let us know.
General lesson: IRB requires pre-submission of all questionnaires. Planning must be complete by then.
Conclusion
For parallelism to succeed serial computing in the mainstream, the first experience of students must:
- demonstrate solid hard speedups
- be trauma-free
Beyond education
If broadly done, objective ranking of approaches for achieving hard speedups, through education and by other means, provides a clue for curing the ills of the field.
Course homepages
agora.cs.illinois.edu/display/cs420fa10/Home and
For summary of the PRAM/XMT education approach:
Includes teaching experience extending from middle school to graduate courses, course material (class notes, programming assignments, video presentations of a full-day tutorial and a full-semester graduate course), a software toolchain (compiler and cycle-accurate simulator [HIPS 5/20]) available for free download, and the XMT hardware.