Programmability and Portability Problems? Time for Hardware Upgrades
Uzi Vishkin
~2003: Wall Street-traded companies gave up the safety of the only paradigm that worked for them, for parallel computing.
Yet to see: an easy-to-program, fast general-purpose many-core computer for single-task completion time.
2009
Develop application SW in 2009 for 2010s many-cores, or wait?
Portability/investment questions:
- Will 2009 code be supported in the 2010s?
- Development-hours in 2009 vs. the 2010s?
- Maintenance in the 2010s?
- Performance in the 2010s?
Good news: Vendors open up to ~40 years of parallel computing. Also SW to match vendors’ HW (2009 acquisitions). Also: new starts.
However: They picked the wrong part: parallel architectures are a disaster area for programmability. In any case, their programming is too constrained. Contrast with general-purpose serial computing, which “set the serial programmer free”. The current direction drags general-purpose computing to an unsuccessful paradigm.
My main point: Need to reproduce the serial success for many-core computing.
The business food chain: SW developers serve customers, NOT machines. If HW developers do not get used to the idea of serving SW developers, guess what will happen to the customers of their HW.
Technical points
Will overview/note:
- What does it mean to “set free” parallel algorithmic thinking?
- Architecture functions/abilities that achieve that
- HW features supporting them
Vendors must provide such functions. Simple way: just add these features.
Example of HW feature: Prefix-Sum
1500 cars enter a gas station with 1000 pumps.
- Direct, in unit time, a car to EVERY pump.
- Direct, in unit time, a car to EVERY pump becoming available.
Proposed HW solution: prefix-sum functional unit (HW enhancement of Fetch&Add). SPAA’97 + US Patent.
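The gas-station example can be mimicked in software. The sketch below is an illustration only, not the proposed hardware unit: `PrefixSumBase`, `ps`, and `assign_pumps` are hypothetical names, and a lock stands in for the atomicity that a prefix-sum functional unit would provide in unit time. Each car performs one `ps` with increment 1 and treats the returned old value as its pump number.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PrefixSumBase:
    """Software stand-in for a prefix-sum base (assumption: a lock models
    the atomicity that the hardware unit would provide)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def ps(self, increment):
        """Atomically add `increment` to the base and return the OLD value."""
        with self._lock:
            old = self._value
            self._value += increment
            return old


def assign_pumps(num_cars, num_pumps, workers=8):
    """Give each arriving car a unique ticket; tickets below num_pumps are pumps."""
    base = PrefixSumBase(0)

    def car(car_id):
        # Every car gets a distinct ticket, whatever order the cars arrive in.
        return car_id, base.ps(1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        tickets = list(pool.map(car, range(num_cars)))

    assignment = {c: t for c, t in tickets if t < num_pumps}
    waiting = [c for c, t in tickets if t >= num_pumps]
    return assignment, waiting


if __name__ == "__main__":
    assignment, waiting = assign_pumps(1500, 1000)
    print(len(assignment), "cars at pumps,", len(waiting), "cars waiting")
```

Which car gets which pump depends on arrival order, but every pump is used exactly once and no two cars share a pump.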
Objective for programmer’s model
Emerging: not sure, but analysis should be work-depth. Why not design for your analysis (as in serial)?
Guiding question: “What could I do in parallel at each step, assuming unlimited hardware?”
[SV82] conjectured that the rest (full PRAM algorithm) is just a matter of skill.
Lots of evidence that this “work-depth methodology” works. Used as the framework in PRAM algorithms textbooks: JaJa-92, Keller-Kessler-Traeff-01.
Only really successful parallel algorithmic theory. Latent, though not widespread, knowledge base.
[Figure: #ops vs. time. Serial Paradigm: Time = Work. Natural (Parallel) Paradigm: Work = total #ops; Time << Work.]
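A small worked example of work-depth accounting, offered as a sketch rather than anything from the slides: summing n numbers with a balanced binary tree. The function name `tree_sum_work_depth` is hypothetical. Each round performs every addition that could run in parallel given unlimited hardware; work counts all additions, depth counts rounds.

```python
def tree_sum_work_depth(values):
    """Sum a non-empty sequence by pairwise rounds.

    Returns (total, work, depth), where work is the number of additions
    and depth is the number of parallel rounds.
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    work = 0
    depth = 0
    while len(level) > 1:
        pairs = len(level) // 2
        # All additions in this round are independent of one another,
        # so with unlimited hardware they take one time unit together.
        nxt = [level[2 * i] + level[2 * i + 1] for i in range(pairs)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next round
        work += pairs
        depth += 1
        level = nxt
    return level[0], work, depth


if __name__ == "__main__":
    print(tree_sum_work_depth(range(8)))  # (28, 7, 3)
```

For 1000 inputs the same routine reports work 999 and depth 10: a serial loop needs time equal to the work, while the work-depth view exposes a time of only about log2(n) steps.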
Hardware prototypes of PRAM-On-Chip
XMT big idea in a nutshell: design for work-depth.
1) 1 operation now; any #ops next time unit.
2) No need to program for locality beyond use of local thread variables, post work-depth.
3) Enough interconnection network bandwidth.
64-core, 75 MHz FPGA prototype [SPAA’07, Computing Frontiers’08].
Original explicit multi-threaded (XMT) architecture [SPAA98] (Cray started to use “XMT” 7+ years later).
Interconnection network for 128-core: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects’07].
Same design as 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype.
The design scales to cores on-chip
XMT: A PRAM-On-Chip Vision
IF you could program a current manycore: great speedups. XMT: fix the IF.
XMT: designed from the ground up to address that for on-chip parallelism, unlike matching current HW.
Today’s position: replicate functions.
Tested HW & SW prototypes. Software release of full XMT environment.
SPAA’09: ~10X relative to Intel Core 2 Duo.
For more info: Google “XMT”.
Programmer’s Model: Workflow Function
- Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in synchronous model.
- SPMD reduced synchrony
  - Main construct: spawn-join block. Can start any number of processes at once. Threads advance at own speed, not lockstep.
  - Prefix-sum (ps). Independence of order semantics (IOS).
  - Establish correctness & complexity by relating to WD analyses.
  - Circumvents “The problem with threads”, e.g., [Lee].
- Tune (compiler or expert programmer): (i) length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]
- Trial & error contrast: similar start; while insufficient inter-thread bandwidth do {rethink algorithm to take better advantage of cache}.
[Figure: alternating spawn and join blocks.]
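The spawn-join block and ps can be imitated in ordinary Python to show independence of order semantics. This is a sketch under stated assumptions, not XMTC and not the XMT toolchain: `spawn`, `ps`, and `compact` are hypothetical helpers, a thread pool stands in for the virtual threads, and a lock stands in for the prefix-sum unit. The example compacts the non-zero entries of an array; each virtual thread claims an output slot with one `ps`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def spawn(low, high, body, workers=4):
    """Run body(tid) for every tid in [low, high]; returning acts as the join."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a thread.
        list(pool.map(body, range(low, high + 1)))


def compact(a):
    """Copy the non-zero entries of `a` into a dense list (order not fixed)."""
    b = [None] * len(a)
    base = [0]                # shared prefix-sum base
    lock = threading.Lock()   # assumption: models the atomic ps primitive

    def ps(increment):
        with lock:
            old = base[0]
            base[0] += increment
            return old

    def body(tid):
        # Threads advance at their own speed; no lockstep is assumed.
        if a[tid] != 0:
            b[ps(1)] = a[tid]  # unique slot regardless of thread order

    spawn(0, len(a) - 1, body)
    return b[:base[0]]


if __name__ == "__main__":
    print(sorted(compact([0, 5, 0, 3, 7, 0])))  # [3, 5, 7]
```

The order of elements in the result may differ between runs, but the set of elements and the final count do not, which is the property that lets correctness be argued from the work-depth description.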
Ease of Programming
Benchmark: Can any CS major program your manycore? - cannot really avoid it.
Teachability demonstrated so far:
- To freshman class with 11 non-CS students. Some programming assignments: merge-sort, integer-sort & sample-sort.
Other teachers:
- Magnet HS teacher. Downloaded simulator, assignments, class notes from XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do so not just for embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up keynote at + interview with teacher.
- High school & middle school (some 10-year-olds) students from underrepresented groups, taught by HS math teacher.
Conclusion
XMT provides a viable answer to the biggest challenges for the field:
- Ease of programming
- Scalability (up & down)
Facilitates code portability.
Preliminary evaluation shows good results for the XMT architecture versus the state-of-the-art Intel Core 2 platform. ICPP’08 paper compares with GPUs.
Easy to build. 1 student in 2+ yrs: hardware design + FPGA-based XMT computer in slightly more than two years (time to market; implementation cost).
Replicate functions, perhaps by replicating solutions.
Software release
Allows you to use your own computer for programming in an XMT environment and experimenting with it, including:
a) Cycle-accurate simulator of the XMT machine
b) Compiler from XMTC to that machine
Also provided: extensive material for teaching or self-studying parallelism, including:
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of 9/15/07 HS tutorial (300 minutes)
(iv) Video recording of grad Parallel Algorithms lectures (30+ hours)
Or just Google “XMT”.
Q&A
Question: Why do PRAM-type parallel algorithms matter, when we can get by with existing serial algorithms and parallel programming methods like OpenMP on top of them?
Answer: With the latter, you need a strong-willed Comp. Sci. PhD to come up with an efficient parallel program in the end. With the former (study of parallel algorithmic thinking and PRAM algorithms), high school kids can write efficient (more efficient if fine-grained & irregular!) parallel programs.