GPUs and the Future of Parallel Computing
Authors: Stephen W. Keckler et al. (NVIDIA), IEEE Micro, 2011
Taewoo Lee, 2013.05.24

roboticist@voice.korea.ac.kr

Three Challenges for Parallel-Computing Chips
- Limited power budget
- Bandwidth gap between computation and memory
- Parallel programmability
http://voice.korea.ac.kr

Computers have been constrained by power and energy rather than area
- The power budget is limited: about 150 W for desktops and 3 W for mobile devices (because of leakage and cooling)
- The number of transistors per chip has increased continuously (Moore's law)
  - Total power consumption has increased as well

Computers have been constrained by power and energy rather than area
- E.g. a supercomputer
  - Power budget = 20 MW
  - Target compute capability = 10^18 Flops = 1 exaFlops
  - Energy per Flop = 20 × 10^-12 J = 20 pJ/Flop
- However, a modern CPU (Intel's Westmere) needs 1700 pJ/Flop (double precision, 130 W / 77 GFlops)
- A GPU (Fermi architecture) needs 225 pJ/Flop (single precision, 130 W / 665 GFlops)
- Energy per Flop must therefore fall by about 85× for the CPU and 11× for the GPU
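The energy-per-Flop arithmetic on this slide can be reproduced with a short Python sketch. The helper name `pj_per_flop` is illustrative and not taken from the paper. Dividing the quoted 130 W by 665 GFlops gives roughly 195 pJ/Flop rather than 225, so the GPU figure below is used as quoted instead of being recomputed.

```python
def pj_per_flop(watts, flops_per_sec):
    """Energy per floating-point operation, in picojoules."""
    return watts / flops_per_sec * 1e12

# Exascale target: 20 MW budget at 10^18 Flops
target = pj_per_flop(20e6, 1e18)

# CPU figure from the slide: 130 W at 77 GFlops (double precision)
westmere = pj_per_flop(130, 77e9)

# GPU figure taken as quoted on the slide (assumption: not recomputed here)
fermi_quoted = 225.0

print(f"target:   {target:.0f} pJ/Flop")
print(f"Westmere: {westmere:.0f} pJ/Flop, {westmere / target:.0f}x over target")
print(f"Fermi:    {fermi_quoted:.0f} pJ/Flop, {fermi_quoted / target:.0f}x over target")
```

The CPU ratio comes out slightly under 85 because the slide rounds 130 W / 77 GFlops up to 1700 pJ/Flop before dividing.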

Energy efficiency will require reducing both instruction-execution and data-movement overheads
- Instruction overheads
  - Modern CPUs are optimized for single-thread performance, e.g. branch prediction, out-of-order execution, and large primary instruction and data caches
  - As a result, much of the energy goes to overheads: data supply, instruction supply, and control
  - To get higher throughput, future architectures must spend their energy on more useful work (i.e. computation)

Energy efficiency will require reducing both instruction-execution and data-movement overheads
- Today, a double-precision fused multiply-add (DFMA) consumes around 50 pJ
- Data movement also dissipates a lot of energy
  - E.g. the energy to read three 64-bit source operands and write one destination operand
    - to SRAM: 14 pJ × 4 = 56 pJ (about the same as one DFMA)
    - to memory 10 mm farther away: 56 × 6 pJ
    - to external DRAM: 56 × 200 pJ
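As a rough illustration of how quickly operand movement outgrows the arithmetic itself, the figures on this slide can be tabulated as follows. The constant and dictionary names are made up for this sketch; the numbers are the slide's.

```python
SRAM_ACCESS_PJ = 14     # one 64-bit SRAM access, from the slide
DFMA_PJ = 50            # one double-precision fused multiply-add, from the slide
OPERAND_ACCESSES = 4    # three source reads + one destination write

sram_pj = SRAM_ACCESS_PJ * OPERAND_ACCESSES

# Multipliers (x6 for 10 mm, x200 for DRAM) are the slide's figures
costs = {
    "DFMA compute": DFMA_PJ,
    "operands via local SRAM": sram_pj,
    "operands via memory 10 mm away": sram_pj * 6,
    "operands via external DRAM": sram_pj * 200,
}

for name, pj in costs.items():
    print(f"{name:32s} {pj:6d} pJ  ({pj / DFMA_PJ:6.1f}x DFMA)")
```

Under these numbers a single DFMA whose operands all come from external DRAM spends over two hundred times more energy on movement than on the computation.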

Energy efficiency will require reducing both instruction-execution and data-movement overheads
- With the scaling projection to 10 nm, the ratios between DFMA, on-chip SRAM, and off-chip DRAM access energy stay relatively constant
- However, the relative energy cost of 10 mm global wires goes up to 23 times the DFMA energy (wire capacitance remains constant)
  - Feature size vs. relative power consumption of wire (DFMA : wire): 1 : 6.2 → 1 : 23 (3.6 : 1)
- Because communication dominates energy, both within the chip and across the external memory interface, energy-efficient architectures must decrease the amount of data movement by exploiting locality

The bandwidth gap between computation and memory is severe, and the power consumed by data movement is also serious
- The bandwidth gap between computation and memory keeps growing
- Narrowing this gap is very important
- Despite the relatively narrow memory bandwidth, chip-to-chip power consumption is very high
  - DRAM at max. bandwidth: 175 GB/sec × 20 pJ/bit = 28 W, plus 21 W for signaling = 49 W
  - 49 W accounts for 20% of the total GPU TDP (thermal design power)
- Again, reducing data movement is necessary
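The DRAM power estimate on this slide follows from bandwidth × bits per byte × energy per bit. A minimal sketch, with constant names chosen for illustration:

```python
BANDWIDTH_BYTES_PER_S = 175e9   # peak DRAM bandwidth, from the slide
DRAM_PJ_PER_BIT = 20            # DRAM access energy per bit, from the slide
SIGNALING_W = 21                # signaling power, from the slide

# bytes/s -> bits/s -> pJ/s -> W
dram_w = BANDWIDTH_BYTES_PER_S * 8 * DRAM_PJ_PER_BIT * 1e-12
total_w = dram_w + SIGNALING_W

# If 49 W is 20% of the TDP, the implied TDP is total_w / 0.20
implied_tdp_w = total_w / 0.20

print(f"DRAM access: {dram_w:.0f} W")
print(f"DRAM + signaling: {total_w:.0f} W")
print(f"Implied GPU TDP: {implied_tdp_w:.0f} W")
```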

To cope with the bandwidth gap, architects are trying
- Multichip modules (MCMs)
- On-chip DRAM (to reduce latency)
- CPU + GPU on one chip (to reduce transfer overheads), although sharing bandwidth between the CPU and GPU can worsen bandwidth contention
- 3D chip stacking
- Deeper memory hierarchies
- Better bandwidth utilization
  - Coalescing
  - Prefetching
  - Data compression (more data per transaction)
http://www.extremetech.com/computing/95319-ibm-and-3m-to-stack-100-silicon-chips-together-using-glue
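To see why coalescing improves bandwidth utilization, the sketch below counts how many aligned memory segments a group of threads touches. The group size of 32 threads, the 128-byte segment, and the 4-byte element are assumptions chosen for illustration; they are not stated in the slides, and `transactions` is a hypothetical helper.

```python
def transactions(addresses, segment_bytes=128):
    """Count the distinct aligned memory segments touched by a group of threads."""
    return len({addr // segment_bytes for addr in addresses})

THREADS = 32   # assumed group size
WORD = 4       # assumed bytes per element

# Coalesced: thread t reads element t, so neighbours read adjacent words
coalesced = [t * WORD for t in range(THREADS)]

# Strided: thread t reads element 32 * t, so each thread lands 128 bytes apart
strided = [t * WORD * 32 for t in range(THREADS)]

print("coalesced:", transactions(coalesced), "transaction(s)")
print("strided:  ", transactions(strided), "transaction(s)")
```

In this model both patterns deliver the same 128 useful bytes, but the strided one needs one memory transaction per thread instead of one for the whole group.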

For parallel programmability, programmers must be able to
- Represent data access patterns and data placement (the memory model is no longer flat; coalesced access matters)
- Deal with thousands of threads
- Choose which kind of processing core their tasks run on (heterogeneity will increase)
Also, coherence and consistency should be relaxed to facilitate greater memory-level parallelism
- The cost of the coherence protocol is too high
- Solution: give programmers selective coherence

To cope with these challenges
- Limited power budget
- Bandwidth gap between computation and memory
- Parallel programmability

Echelon: A Research GPU Architecture
- Goals
  - Double-precision performance: 16 TFlops
  - Memory bandwidth: 1.6 TB/sec
  - Power budget: 150 W
  - Energy: 20 pJ/Flop
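Two ratios fall directly out of these goals and tie them back to the earlier challenges. This is a back-of-the-envelope sketch with illustrative constant names; the slide does not say whether the 20 pJ/Flop goal is measured at peak or sustained throughput, so the second figure is only a sanity check.

```python
PEAK_FLOPS = 16e12       # double-precision goal, Flops
MEM_BW_BYTES = 1.6e12    # memory bandwidth goal, bytes/sec
POWER_W = 150            # power budget, W

# Bytes of memory traffic available per floating-point operation
bytes_per_flop = MEM_BW_BYTES / PEAK_FLOPS

# Energy per Flop if the whole budget were spent at the peak rate
pj_per_flop_at_peak = POWER_W / PEAK_FLOPS * 1e12

print(f"{bytes_per_flop:.2f} bytes/Flop")
print(f"{pj_per_flop_at_peak:.2f} pJ/Flop at peak")
```

At about a tenth of a byte per Flop, nearly all operands have to come from on-chip storage, which is why the architecture leans so heavily on locality.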

Echelon Block Diagram: Chip-Level Architecture
- 64 tiles
  - Each tile consists of 4 throughput-optimized cores (TOCs), i.e. GPU-style cores for throughput-oriented parallel tasks
- 16 DRAM memory controllers (MCs)
- 8 latency-optimized cores (LOCs), i.e. CPUs for the operating system and the serial portions of programs
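The chip-level counts can be captured in a small configuration record to derive totals. The class and field names are hypothetical, and the per-TOC throughput assumes the 16 TFlops goal is spread evenly over all TOCs, which the slides do not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EchelonChip:
    """Component counts from the chip-level block diagram slide."""
    tiles: int = 64
    tocs_per_tile: int = 4
    memory_controllers: int = 16
    locs: int = 8

    @property
    def total_tocs(self) -> int:
        return self.tiles * self.tocs_per_tile

chip = EchelonChip()
PEAK_FLOPS = 16e12  # double-precision goal from the Echelon goals slide

tocs_per_mc = chip.total_tocs // chip.memory_controllers
gflops_per_toc = PEAK_FLOPS / chip.total_tocs / 1e9  # assumes an even split

print(f"{chip.total_tocs} TOCs, {tocs_per_mc} per memory controller")
print(f"{gflops_per_toc:.1f} GFlops per TOC")
```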

Echelon Block Diagram: Throughput Tile Architecture
- 4 TOCs per tile
- Each TOC has secondary on-chip storage
- This storage may be on-chip DRAM

Characteristics of a TOC: MIMD + SIMD, Configurable and Sharable SRAM, and LIW per Lane
- Temporal SIMT
  - Divergent code: MIMD
  - Non-divergent code: SIMT (more energy-efficient)
- Two-level register files
  - Operand register file (ORF) for producer-consumer relationships between subsequent instructions
  - Main register file (MRF)
- Multilevel scheduling
  - 4 active and 60 on-deck sets (64 threads in total)
- Lane Memory

Malleable Memory System
- Selective SRAM
  - H/W-controlled cache + scratchpads (S/W-controlled cache)
  - The ratio can be set by programmers, e.g. 16 KB/48 KB or 48 KB/16 KB (64 KB total)
  - Where to inherit from can also be set by programmers (GMEMs, L2, ranges)
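One way to picture the programmer-selected split is as a small validated configuration. This is only a sketch of the idea: `SramSplit` and its fields are hypothetical names, not an Echelon or CUDA API, and the only constraint modeled is that the two parts sum to the 64 KB total given on the slide.

```python
from dataclasses import dataclass

TOTAL_SRAM_KB = 64  # total from the slide's 16 KB/48 KB example

@dataclass(frozen=True)
class SramSplit:
    """A cache/scratchpad partition of the configurable SRAM."""
    cache_kb: int        # hardware-controlled cache
    scratchpad_kb: int   # software-controlled scratchpad

    def __post_init__(self):
        if self.cache_kb < 0 or self.scratchpad_kb < 0:
            raise ValueError("sizes must be non-negative")
        if self.cache_kb + self.scratchpad_kb != TOTAL_SRAM_KB:
            raise ValueError(f"split must total {TOTAL_SRAM_KB} KB")

# The two example splits from the slide
scratch_heavy = SramSplit(cache_kb=16, scratchpad_kb=48)
cache_heavy = SramSplit(cache_kb=48, scratchpad_kb=16)

print(scratch_heavy)
print(cache_heavy)
```

A kernel with irregular accesses might favour the cache-heavy split, while one that stages tiles of data by hand might favour the scratchpad-heavy one.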

To make writing a parallel program as easy as writing a sequential program
- Unified memory addressing
  - One address space spanning LOCs and TOCs, as well as multiple Echelon chips
- Selective memory coherence
  - First place data in a coherence domain; later remove coherence to get better performance (energy, execution time)
- H/W fine-grained thread creation
  - Automated fine-grained parallelization by H/W
http://www.hardwarecanucks.com/reviews/processors/huma-amds-new-heterogeneous-unified-memory-architecture/

This work is licensed under a Creative Commons Attribution 3.0 Unported License.
