Many-cores: Supercomputer-on-chip. How many? And how? (How not to?) Ran Ginosar, Technion, Mar 2010.

Presentation transcript:

1 Many-cores: Supercomputer-on-chip. How many? And how? (How not to?) Ran Ginosar, Technion, Mar 2010

2 Disclosure: I am a co-inventor / co-founder of Plurality, based on 30 years of (on/off) research and on the MSc work of Dr. Nimrod Bayer (~1990). The research presented here is forward-looking, but it takes advantage of Plurality tools and data.

3 Many-cores. CMP / multi-core is “more of the same”: several high-end, complex, powerful processors; each processor manages itself and can execute the OS; good for many unrelated tasks (e.g. Windows); reasonable on 2–8 processors, then it breaks. Many-cores: 100 – 1,000 – 10,000 cores; useful for heavy compute-bound tasks; so far (50 years) many disasters, but there is light at the end of the tunnel.

4 Agenda: review 4 cases, analyze them, and discuss how NOT to make a many-core.

5 Many many-core contenders: Ambric, Aspex Semiconductor, ATI GPGPU, BrightScale, ClearSpeed Technologies, Coherent Logix Inc., CPU Technology Inc., Element CXI, Elixent/Panasonic, IBM Cell, IMEC, Intel Larrabee, Intellasys, IP Flex, MathStar, Motorola Labs, NEC, Nvidia GPGPU, PACT XPP, Picochip, Plurality, Rapport Inc., Recore, Silicon Hive, Stream Processors Inc., Tabula, Tilera (many are dead / dying / will die / should die).

6 PACT XPP: German company, since 1999. Martin Vorbach, an ex-user of Transputers. [Figure: 42x Transputer mesh, 1980s.]

7 PACT XPP (96 elements)

8 PACT XPP die photo

9 PACT: static mapping, circuit-switched reconfigured NoC.

10 PACT ALU-PAE

11 PACT static task mapping, and a debug tool for that.

12 PACT analysis: fine-granularity computing; heterogeneous processors; static mapping → complex programming; circuit-switched NoC → static reconfigurations → complex programming; limited parallelism; doesn't scale easily.

13 picoChip: UK company, inspired by the Transputers of the 1980s (David May). [Figure: 42x Transputer mesh, 1980s.]

14 322x 16-bit LIW RISC

15–23 [Figure-only slides; no transcript text.]

24 picoChip: static task mapping → compile.

25 picoChip analysis: MIMD, fine granularity, homogeneous cores; static mapping → complex programming; circuit-switched NoC → static reconfigurations → complex programming; doesn't scale easily. Can we create / debug / understand a static mapping on 10K cores?

26 Tilera: USA company, based on MIT's RAW (A. Agarwal); heavy DARPA funding, university IP. Classic homogeneous MIMD on a mesh NoC: “upgraded” Transputers with “powerful” uniprocessor features, caches, complex communications; the “tiles era”.

27 Tiles: powerful processor; high frequency (~1 GHz); high power (0.5 W); 5-mesh NoC (P-M / P-P / P-IO); 2.5 levels of cache (L1 + L2; a tile can fetch from the L2 of others); variable access time: 1 / 7 / 70 cycles.

28 Caches kill performance. A cache is great for a single processor: it exploits locality (in time and space). On many-cores, however, locality only happens locally; other (shared) data are buried elsewhere. Caches help speed up the parallel (local) phases, but Amdahl [1967] tells us the challenge is NOT the parallel phases: one needs to program many long tasks over the same data. Hard! That is the software gap in parallel programming.
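As a back-of-the-envelope illustration of Amdahl's point (my numbers, not the talk's), a minimal C sketch of speedup = 1 / ((1 - p) + p / N) shows how little 256 cores help once any serial phase remains:

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - p) + p / n),
       p = parallelizable fraction, n = number of cores.
       Illustrative values only. */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        const double p[] = { 0.90, 0.99, 0.999 };
        for (int i = 0; i < 3; i++)
            printf("p = %.3f -> speedup on 256 cores = %.0fx\n",
                   p[i], amdahl(p[i], 256));
        return 0;  /* prints roughly 10x, 72x, 204x */
    }

Even with 99% of the work parallelized, 256 cores deliver only about 72x; hence the emphasis on the serial and scheduling phases rather than the parallel ones.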

29 Array of processors, MIMD / SIMD; total 5+ MB of memory, in distributed caches; high power (~27 W). Die photo.

30 Tilera allows statics: pre-programmed streams span multiple processors (static mapping).

31 Tilera co-mapping: code, memory, routing.

32 Tilera static-mapping debugger.

33 Tilera analysis: achieves good performance; bad on power; hard to scale; hard to program.

34 Plurality: Israel; Technion research (since the 1980s).

35 Architecture, part I: “anti-local” addressing by interleaving; MANY banks / ports, hence negligible conflicts and fine granularity; NO PRIVATE MEMORY; tightly coupled memory, equi-distant (1 cycle each way); fast combinational NoC. [Block diagram: processors P connect through a P-to-M resolving NoC to the shared memory, with external memory behind it.]
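A minimal sketch of the “anti-local” interleaving idea; the bank count and the modulo mapping are my assumptions for illustration, not Plurality's documented scheme:

    #include <stdio.h>

    #define NUM_BANKS 64u   /* assumed bank count, for illustration only */

    typedef struct { unsigned bank; unsigned offset; } bank_addr_t;

    /* Consecutive word addresses land in consecutive banks, so any core's
       accesses spread across the whole shared memory instead of piling up
       in one "local" bank. */
    static bank_addr_t map_address(unsigned word_addr) {
        bank_addr_t m = { word_addr % NUM_BANKS, word_addr / NUM_BANKS };
        return m;
    }

    int main(void) {
        for (unsigned a = 0; a < 4; a++) {
            bank_addr_t m = map_address(a);
            printf("word %u -> bank %u, offset %u\n", a, m.bank, m.offset);
        }
        return 0;
    }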

36 Architecture, part II: adds a scheduler and a P-to-S scheduling NoC; low-latency parallel scheduling enables fine granularity. [Block diagram as in part I, plus the scheduler: processors P, P-to-M resolving NoC, shared memory, external memory, scheduler, P-to-S scheduling NoC. Memory properties as before: “anti-local” interleaved addressing, many banks / ports, negligible conflicts, no private memory, tightly coupled equi-distant memory, fast combinational NoC.]

37 Floorplan.

38 Other possible floor-plans? NoC to Instruction Memory NoC to Data Memory

39 Programming model: compile the program into a task-dependency graph (the ‘task map’) plus task codes; task maps are loaded into the scheduler, task codes are loaded into memory. Task template:

    regular / duplicable / join-fork task xxx( dependencies )
    {
        … INSTANCE …
    }

[Block diagram as before: processors, P-to-M resolving NoC, shared memory, external memory, scheduler, P-to-S scheduling NoC.]

40 Fine-grain parallelization: convert (independent) loop iterations

    for ( i=0; i<10000; i++ ) { a[i] = b[i]*c[i]; }

into parallel tasks

    duplicable task XX(…) { ii = INSTANCE; a[ii] = b[ii]*c[ii]; }

All tasks, or any subset, can be executed in parallel.
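A plain-C analogue of the duplicable task above (not Plurality's actual API; the dispatch loop merely stands in for the hardware scheduler handing instances to free cores):

    #include <stddef.h>

    enum { N = 10000 };
    static float a[N], b[N], c[N];

    /* One "instance" = one independent loop iteration, identified only by
       its instance number, so any subset of instances can run in parallel. */
    static void task_xx(size_t instance) {
        a[instance] = b[instance] * c[instance];
    }

    int main(void) {
        /* Sequential stand-in for the scheduler's parallel dispatch. */
        for (size_t i = 0; i < N; i++)
            task_xx(i);
        return 0;
    }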

41 Task map example (2D FFT). Legend: duplicable task, conditional task, join/fork task.

42 Another task map (linear solver)

43 Linear Solver: Simulation snap-shots

44 Architectural benefits:
- Shared, uniform (equi-distant) memory: no worry which core does what; no core has an advantage because it already holds the data.
- Many-bank memory + fast P-to-M NoC: low latency; no bottleneck accessing shared memory.
- Fast scheduling of tasks to free cores (many at once): enables fine-grain data parallelism, impossible in other architectures due to task-scheduling overhead and data locality.
- Any core can do any task equally well on short notice: scales automatically.
- Programming model: intuitive to programmers; easy for an automatic parallelizing compiler.

45 Target design (no silicon yet): 256 cores; 500 MHz for 2 MB of shared memory, slower for 20 MB; access time 2 cycles (+); 3 Watts. Designed to be attractive to programmers (simple), scalable, and to fight Amdahl's rule.
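Back-of-the-envelope, assuming one operation per core per cycle (an assumption the slide does not state): 256 cores × 500 MHz is about 128 G ops/s at 3 W, i.e. roughly 40 G ops/s per watt.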

46 Analysis

47 Two (analytic) approaches to many-cores: (1) how many cores can fit into a fixed-size VLSI chip? (2) one core is of fixed size; how many such cores can be integrated? The following analysis employs approach (1).

48 The VLSI-aware many-core (crude) analysis

49 The VLSI-aware many-core (crude) analysis. For N cores on a fixed-size chip: power ∝ 1/√N, perf ∝ √N, freq ∝ 1/√N, perf/power ∝ N.
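Worked example under these scalings: quadrupling the core count from 1K to 4K doubles √N, so aggregate performance doubles, total power halves, and performance per watt improves 4x, i.e. linearly in N.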

50 A modified analysis

51 The modified analysis: power ∝ √N, perf ∝ √N, freq ∝ 1/√N, perf/power ∝ const.
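Under the modified assumptions the same 1K-to-4K step doubles both performance and power, so performance per watt stays constant: more cores still buy throughput, but no longer buy efficiency.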

52 Things we shouldn't do in many-cores:
- No processor-sensitive code, no heterogeneous processors.
- No speculation: no speculative execution; no speculative storage (aka cache); no speculative latency (aka packet-switched or circuit-switched NoC).
- No bottlenecks: no scheduling bottleneck (aka OS); no issue bottlenecks (aka multithreading); no memory bottlenecks (aka local storage); no programming bottlenecks (no multithreading / GPGPU / SIMD / static mappings / heterogeneous processors / …).
- No statics: no static task mapping; no static communication patterns.

53 Conclusions:
- Powerful processors are inefficient; the principles of the high-end CPU are damaging: speculative anything, cache, locality, hierarchy.
- Complexity harms (when exposed): hard to program, doesn't scale.
- Hacking (static anything) is hacking: hard to program, doesn't scale.
- Keep it simple, stupid [Pythagoras, 520 BC].

54 Research issues:
- Scalability. Memory and P-M NoC: async / multicycle; add local memories, and how to reduce power there? Scheduler and P-S NoC: active / distributed; schedule a large number of tasks together? Shared memory as a level-1 cache to external memory; special instruction memories.
- Programming model. A new PRAM, a special case of functional? Hierarchical task map? Other task maps? Dynamic task map? Algorithm / program mapping to tasks; combine HPC and GP models?
- Power delivery.
