Programmability and Portability Problems? Time for Hardware Upgrades (Uzi Vishkin)

Presentation transcript:

Programmability and Portability Problems? Time for Hardware Upgrades
Uzi Vishkin
~2003: Wall Street-traded companies gave up the safety of the only paradigm that had worked for them (the serial paradigm) for parallel computing.
Yet to see: an easy-to-program, fast, general-purpose many-core computer for single-task completion time.

2009: Develop application software in 2009 for the many-cores of the 2010s, or wait?
Portability/investment questions:
- Will 2009 code still be supported in the 2010s?
- Development hours in 2009 vs. the 2010s?
- Maintenance in the 2010s?
- Performance in the 2010s?
Good news: Vendors are opening up to ~40 years of parallel computing, and to software that matches their hardware (2009 acquisitions). Also: new starts.
However: they picked the wrong part. Parallel architectures are a disaster area for programmability, and in any case their programming is too constrained. Contrast this with general-purpose serial computing, which "set the serial programmer free". The current direction drags general-purpose computing into an unsuccessful paradigm.
My main point: we need to reproduce the serial success for many-core computing.
The business food chain: software developers serve customers, NOT machines. If hardware developers do not get used to the idea of serving software developers, guess what will happen to the customers of their hardware.

Technical points
This talk will overview/note:
- What does it mean to "set free" parallel algorithmic thinking?
- The architecture functions/abilities that achieve that
- The hardware features supporting them
Therefore: vendors must provide such functions. A simple way: just add these features.

Example of a HW feature: Prefix-Sum
1500 cars enter a gas station with 1000 pumps.
- Direct, in unit time, a car to EVERY pump.
- Direct, in unit time, a car to EVERY pump becoming available.
Proposed HW solution: a prefix-sum functional unit (a HW enhancement of Fetch&Add). SPAA'97 + US Patent.
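To make the gas-station analogy concrete, here is a minimal plain-C sketch of the intended semantics (my illustration; not the XMT hardware interface): every arriving car issues one prefix-sum request on a shared base and receives a unique ticket, which determines its pump. A C11 atomic fetch-and-add stands in for the prefix-sum functional unit; what the sketch cannot show is the point of the proposed unit, namely that many such requests are served in roughly unit time rather than serialized.

    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_PUMPS 1000
    #define NUM_CARS  1500

    /* Shared base of the prefix-sum: each request atomically reads the
     * current value and advances it by its increment (here, 1).  This
     * emulates only the functional behavior of the prefix-sum unit
     * (a HW enhancement of Fetch&Add), not its unit-time guarantee. */
    static atomic_int next_ticket = 0;

    /* Called once per arriving car; even if many cars call this
     * concurrently, each receives a distinct ticket. */
    static int take_ticket(void) {
        return atomic_fetch_add(&next_ticket, 1);
    }

    int main(void) {
        for (int car = 0; car < NUM_CARS; car++) {
            int t = take_ticket();
            if (t < NUM_PUMPS)
                printf("car %d -> pump %d now\n", car, t);
            else
                /* Toy model: later cars wait for the pump with the
                 * same index modulo NUM_PUMPS to become available. */
                printf("car %d -> waits for pump %d\n", car, t % NUM_PUMPS);
        }
        return 0;
    }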

Objective for the programmer's model
Emerging: not sure what the model will be, but the analysis should be work-depth. Why not design for your analysis? (as in serial computing)
[SV82] conjectured that the rest (the full PRAM algorithm) is just a matter of skill. There is plenty of evidence that this "work-depth methodology" works: it is used as the framework in the PRAM algorithms textbooks JaJa-92 and Keller-Kessler-Traeff-01, and it is the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.
The guiding question: what could I do in parallel at each step, assuming unlimited hardware?
[Figure: ops-vs-time diagrams] Serial paradigm: Time = Work, where Work = total # of operations. Natural (parallel) paradigm: Time << Work.
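As a small worked example of work-depth reasoning (my own illustration, in the spirit of the textbooks cited above): summing n numbers by a balanced binary tree performs n-1 additions in total (work) but needs only about log2(n) parallel steps (depth), so with unlimited hardware Time << Work. In the C sketch below, each pass of the outer loop is one such step; all additions inside a pass are independent and could run simultaneously.

    #include <stdio.h>

    #define N 8   /* assume a power of two, to keep the sketch short */

    int main(void) {
        int a[N] = {3, 1, 4, 1, 5, 9, 2, 6};
        int work = 0, depth = 0;

        /* Balanced-binary-tree summation: one outer iteration per tree
         * level ("parallel step"); the inner additions of a level are
         * mutually independent. */
        for (int stride = 1; stride < N; stride *= 2) {
            for (int i = 0; i < N; i += 2 * stride) {
                a[i] += a[i + stride];
                work++;
            }
            depth++;
        }
        /* Prints: sum = 31, work = 7 ops, depth = 3 steps */
        printf("sum = %d, work = %d ops, depth = %d steps\n", a[0], work, depth);
        return 0;
    }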

Hardware prototypes of PRAM-On-Chip
The XMT big idea in a nutshell, design for work-depth:
1) One operation now; any number of operations in the next time unit.
2) No need to program for locality beyond the use of local thread variables, post work-depth.
3) Enough interconnection network bandwidth.
Prototypes:
- 64-core, 75 MHz FPGA prototype [SPAA'07, Computing Frontiers'08] of the original explicit multi-threaded (XMT) architecture [SPAA98] (Cray started to use the name "XMT" 7+ years later).
- Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects'07].
- Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype.
The design scales to 1,000+ cores on-chip.

XMT: A PRAM-On-Chip Vision
IF you could program a current many-core, you would get great speedups. XMT: fix the IF.
XMT is designed from the ground up to address that for on-chip parallelism, unlike approaches that merely match current hardware.
Today's position: replicate these functions. Tested hardware & software prototypes; software release of the full XMT environment. SPAA'09: ~10X relative to an Intel Core 2 Duo.
For more info: Google "XMT".

Programmer's Model: Workflow Function
1. Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in the synchronous model.
2. SPMD reduced synchrony:
- Main construct: the spawn-join block. Can start any number of processes at once; threads advance at their own speed, not in lockstep.
- Prefix-sum (ps). Independence of order semantics (IOS).
- Establish correctness & complexity by relating back to the work-depth analysis.
- Circumvents "The problem with threads", e.g., [Lee].
3. Tune (compiler or expert programmer): (i) length of sequences of round trips to memory, (ii) QRQW, (iii) work-depth. [VCL07]
Contrast with trial & error: a similar start, then "while inter-thread bandwidth is insufficient do { rethink the algorithm to take better advantage of cache }".
[Figure: spawn-join, spawn-join blocks.]
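To illustrate the spawn-join plus prefix-sum pattern and the independence-of-order semantics, here is a hedged plain-C approximation (deliberately not XMTC syntax) of array compaction, a standard example in the XMT materials: copy the nonzero elements of A into B in whatever order the threads happen to run. In XMT, the loop body would be the body of a spawn block with one virtual thread per index, and the atomic fetch-and-add below stands in for the ps instruction on a shared base register.

    #include <stdatomic.h>
    #include <stdio.h>

    #define N 10

    int main(void) {
        int A[N] = {0, 7, 0, 3, 0, 0, 9, 0, 1, 0};
        int B[N];
        atomic_int base = 0;              /* stand-in for the ps base register */

        /* Stand-in for spawn(0, N-1): in XMT each index would be an
         * independent virtual thread; the body is written so that any
         * execution order gives a correct compaction (IOS). */
        for (int i = 0; i < N; i++) {
            if (A[i] != 0) {
                int e = atomic_fetch_add(&base, 1);  /* stand-in for ps(e, base) */
                B[e] = A[i];
            }
        }

        int count = atomic_load(&base);
        printf("compacted %d nonzero elements:", count);
        for (int i = 0; i < count; i++)
            printf(" %d", B[i]);
        printf("\n");
        return 0;
    }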

Ease of Programming
Benchmark: can any CS major program your many-core? You cannot really avoid this question.
Teachability demonstrated so far:
- A freshman class with 11 non-CS students. Some programming assignments: merge-sort, integer-sort & sample-sort.
Other teachers:
- A magnet HS teacher downloaded the simulator, assignments, and class notes from the XMT page and was self-taught. Recommends: teach XMT first. It is the easiest to set up (simulator), program, and analyze, with the ability to anticipate performance (as in serial), and not just for embarrassingly parallel problems. The teacher also teaches OpenMP, MPI, and CUDA. See the keynote and the interview with the teacher.
- High school & middle school students (some 10-year-olds) from underrepresented groups, taught by a HS math teacher.

Conclusion
XMT provides a viable answer to the biggest challenges for the field:
- Ease of programming
- Scalability (up & down), which facilitates code portability
Preliminary evaluation shows good results for the XMT architecture versus a state-of-the-art Intel Core 2 platform; an ICPP'08 paper compares with GPUs.
Easy to build: one student produced the hardware design plus an FPGA-based XMT computer in slightly more than two years, which bodes well for time to market and implementation cost.
Replicate the functions, perhaps by replicating the solutions.

Software release
Lets you use your own computer to program in an XMT environment and experiment with it, including:
a) A cycle-accurate simulator of the XMT machine
b) A compiler from XMTC to that machine
Also provided: extensive material for teaching or self-studying parallelism, including
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of the 9/15/07 HS tutorial (300 minutes)
(iv) Video recording of the graduate Parallel Algorithms lectures (30+ hours)
Or just Google "XMT".

Q&A
Question: Why do PRAM-type parallel algorithms matter, when we can get by with existing serial algorithms and parallel programming methods like OpenMP on top of them?
Answer: With the latter, you need a strong-willed Comp. Sci. PhD to come up with an efficient parallel program at the end. With the former (the study of parallel algorithmic thinking and PRAM algorithms), high school kids can write efficient parallel programs (more efficient still when the problem is fine-grained & irregular!).