Tomorrow's Computing Engines
Symposium on High-Performance Computer Architecture, February 3, 1998
William J. Dally, Computer Systems Laboratory, Stanford University
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war. Computer architects tend to always design the last computer:
- old programs
- old technology assumptions
Some Previous “Wars” (1/3)
- Reliable Router (1994)
- Torus Routing Chip (1985)
- MARS Router (1984)
- Network Design Frame (1988)
Some Previous “Wars” (2/3)
- MDP Chip
- J-Machine
- Cray T3D
- MAP Chip
Some Previous “Wars” (3/3)
Tomorrow’s Computing Engines
- Driven by tomorrow's applications: media
- Constrained by tomorrow's technology
90% of Desktop Cycles Will Be Spent on ‘Media’ Applications by 2000
Quote from a talk abstract by Scott Kirkpatrick of IBM.
Media applications include:
- video encode/decode
- polygon- and image-based graphics
- audio processing: compression, music, speech recognition/synthesis
- modulation/demodulation at audio and video rates
These applications involve stream processing, as does radar processing: SAR, STAP, MTI, ...
Typical Media Kernel: Image Warp and Composite
- Read 10,000 pixels from memory
- Perform integer operations on each pixel
- Test each pixel
- Write the ~3,000 result pixels that pass back to memory
- Little reuse of data fetched from memory: each pixel is used once
- Little interaction between pixels: very insensitive to operation latency
- The challenge is to maximize bandwidth
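As a rough illustration of the kernel's stream character, here is a minimal sketch in Python; the warp function, threshold, and pixel counts are illustrative stand-ins, not the actual kernel.

```python
import numpy as np

def warp_and_composite(src, warp, threshold):
    """Sketch of a stream-style warp-and-composite kernel.
    Each pixel is read once, transformed with a few integer
    operations, tested, and conditionally written: no data reuse
    and no interaction between pixels, so throughput is limited
    by memory bandwidth rather than operation latency."""
    out = []
    for pixel in src:            # read each of ~10,000 pixels once
        p = warp(int(pixel))     # integer ops on each pixel
        if p > threshold:        # test each pixel
            out.append(p)        # only passing (~3,000) pixels written back
    return np.array(out)

# Hypothetical usage with a made-up warp function and threshold.
src = np.random.randint(0, 256, size=10_000)
result = warp_and_composite(src, lambda p: (3 * p) >> 2, 64)
```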
Telepresence: A Driving Application
Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display 3D Scene
Most kernels:
- latency insensitive
- high ratio of arithmetic to memory references
Tomorrow’s Technology is Wire Limited
- Lots of devices
- A little faster
- Slow wires
Technology scaling makes communication the scarce resource
1997: 0.35 µm process, 64 Mb DRAM, 16 64-bit FP processors at 400 MHz; 18 mm die, 12,000 wire tracks, 1 clock to cross the chip.
2007: 0.10 µm process, 4 Gb DRAM, 1K 64-bit FP processors at 2.5 GHz; 32 mm die, 90,000 wire tracks, 20 clocks to cross the chip.
On-chip wires are getting slower
With linear scaling factor $s$ per generation ($s = 0.5$ below):
- Feature size: $x_2 = s\,x_1$ (0.5×)
- Wire resistance per unit length: $R_2 = R_1/s^2$ (4×)
- Wire capacitance per unit length: $C_2 = C_1$ (1×)
- Delay of a fixed-length wire $y$: $t_w = RCy^2$, so $t_{w2} = t_{w1}/s^2$ (4×)
- Relative to gate delay: $t_{w2}/t_{g2} = t_{w1}/(t_{g1}s^3)$ (8×)
- Velocity on a repeated wire: $v = 0.5\,(t_g RC)^{-1/2}$ m/s, so $v_2 = v_1 s^{1/2}$ (0.7×)
- Distance per gate delay: $v t_g = 0.5\,(t_g/(RC))^{1/2}$ m/gate, so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35×)
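As a quick numeric check, this sketch just evaluates the scaling relations above at s = 0.5:

```python
# Evaluate the wire-scaling relations for one generation, s = 0.5.
s = 0.5

ratios = {
    "R per unit length (R2/R1)":         1 / s**2,  # 4x
    "C per unit length (C2/C1)":         1.0,       # 1x
    "fixed-length wire delay (tw2/tw1)": 1 / s**2,  # 4x
    "wire delay vs. gate delay":         1 / s**3,  # 8x
    "repeated-wire velocity (v2/v1)":    s**0.5,    # ~0.7x
    "distance per gate delay":           s**1.5,    # ~0.35x
}
for name, val in ratios.items():
    print(f"{name:36s} {val:5.2f}x")
```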
Bandwidth and Latency of Modern VLSI
[Log-log plot: bandwidth and latency vs. size, from 1 to 10⁵. Bandwidth falls and latency rises steeply once communication crosses the chip boundary.]
Architecture for Locality: Exploit High On-Chip Bandwidth
[Block diagram: off-chip RAM connects over 2 GB/s of pin bandwidth to an on-chip vector register file (50 GB/s), which feeds on the order of 10⁴ 32-bit ALUs through a switch (500 GB/s).]
Tomorrow’s Computing Engines
- Aimed at media processing: stream based, latency tolerant, low precision, little reuse, lots of conditionals
- Use the large number of devices available on future chips
- Make efficient use of scarce communication resources: a bandwidth hierarchy, no centralized resources
- Approach the performance of a special-purpose processor
Why do Special-Purpose Processors Perform Well?
- Lots (100s) of ALUs
- Fed by dedicated wires/memories
Care and Feeding of ALUs
[Diagram: an instruction cache and IP supply instruction bandwidth to the IR; registers supply data bandwidth to the ALU. The 'feeding' structure dwarfs the ALU itself.]
Three Key Problems
- Instruction bandwidth
- Data bandwidth
- Conditional execution
A Bandwidth Hierarchy
[Block diagram: four SDRAM banks (1.6 GB/s) feed a streaming memory system; a vector register file (50 GB/s) feeds ALU clusters, 13 ALUs per cluster (500 GB/s).]
- Solves the data bandwidth problem
- Matched to the bandwidth curve of the technology
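A back-of-the-envelope check of what the hierarchy demands of a program (a sketch using only the bandwidth figures on this slide):

```python
# Bandwidths of the three levels of the hierarchy (GB/s).
sdram, vrf, clusters = 1.6, 50.0, 500.0

# To keep the ALUs busy, each word fetched from SDRAM must be
# referenced many times at the levels above it:
print(f"register file : SDRAM = {vrf / sdram:.0f} : 1")       # ~31 : 1
print(f"ALU clusters  : SDRAM = {clusters / sdram:.0f} : 1")  # ~312 : 1
print(f"ALU clusters  : regs  = {clusters / vrf:.0f} : 1")    #   10 : 1
```

Media kernels fit this profile: a high ratio of arithmetic to memory references, concentrated in the registers and clusters.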
A Streaming Memory System
[Diagram: address generators issue references through a crossbar to per-bank reorder queues, one in front of each SDRAM bank.]
Streaming Memory Performance
- Exploit latency insensitivity for improved bandwidth
- 1.75:1 performance improvement from a relatively short reorder queue
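A minimal sketch of the reordering idea, assuming a simple open-row DRAM model; the queue contents and timing constants below are made up:

```python
from collections import deque, namedtuple

Ref = namedtuple("Ref", "row col")

def next_reference(queue, open_row, t_hit=1, t_miss=4):
    """Serve the next DRAM reference from a short reorder queue,
    preferring one that hits the currently open row. Because the
    streams are latency-insensitive, serving references out of
    order costs nothing and raises effective bandwidth."""
    for i, ref in enumerate(queue):
        if ref.row == open_row:      # row hit: fast column access
            del queue[i]
            return ref, open_row, t_hit
    ref = queue.popleft()            # no hit: pay a row activation
    return ref, ref.row, t_miss

# Hypothetical trace with interleaved rows that strict FIFO order
# would thrash on (almost every access a row miss).
q = deque(Ref(r, c) for c in range(4) for r in (0, 1))
open_row, cycles = 0, 0
while q:
    _, open_row, t = next_reference(q, open_row)
    cycles += t
print(cycles)   # 11 cycles reordered vs. 29 in FIFO order
```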
Compound Vector Operations: One Instruction Does Lots of Work
[Diagram: a compound vector instruction (e.g., LD Vd Vx, Op) enters a control store with µIP, which expands it into µinstructions driving the memory address generators and the vector register file V0–V7, with an (Op, Ra, Rb) field per ALU.]
One 50 b compound vector instruction replaces 300 b/µinst × 20 µinst/op × 1,000 el/vec = 6 × 10⁶ b of instruction bandwidth.
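To make the amortization concrete, a sketch; the fused multiply-add-and-clamp operation is a hypothetical example, not the machine's actual instruction set:

```python
def compound_vector_op(vx, vy):
    """One compound vector 'instruction': load two vectors, then
    multiply-add and clamp every element. Per-element control comes
    from a local microprogram (the control store), so the global
    instruction stream supplies ~50 bits per vector rather than
    per-element microinstructions."""
    return [min(max(3 * x + y, 0), 255) for x, y in zip(vx, vy)]

# Control expansion from the slide: one 50 b compound instruction
# stands in for 300 b/uinst * 20 uinst/op * 1000 el/vec of control.
print(300 * 20 * 1000)                        # 6,000,000 bits
out = compound_vector_op(range(1000), range(1000))
```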
Scheduling by Simulated Annealing
- List scheduling assumes global communication; it does poorly when communication is exposed
- View scheduling as a CAD problem (place and route):
  - generate a naïve 'feasible' schedule
  - iteratively improve the schedule by moving operations
[Diagram: ready operations are placed onto an ALU × time grid.]
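A minimal sketch of the annealing loop: generic simulated annealing over schedules, with the cost function, move generator, and cooling constants left as placeholders rather than the actual scheduler:

```python
import math, random

def anneal(schedule, cost, random_move, t0=10.0, cooling=0.995, steps=20_000):
    """Iteratively improve a schedule by moving operations.
    `random_move` perturbs a feasible schedule (e.g., relocates one
    operation to another ALU/time slot); `cost` measures schedule
    length plus exposed-communication penalties. Worse moves are
    accepted with probability exp(-delta/T), letting the search
    escape the local minima greedy list scheduling gets stuck in."""
    best = cur = schedule
    temp = t0
    for _ in range(steps):
        cand = random_move(cur)
        delta = cost(cand) - cost(cur)
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            cur = cand
            if cost(cur) < cost(best):
                best = cur
        temp *= cooling                       # cooling schedule
    return best
```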
Typical Annealing Schedule
[Plot: schedule cost falls from 166 to 13 over the annealing run; an annotation marks where the energy function was changed.]
Conventional Approaches to Data-Dependent Conditional Execution
[Diagram, two panels. Data-dependent branch: branching on x>0 speculates down one path and loses ~1000 in-flight operations on a misprediction ('whoops'). Predication: y=(x>0) guards each operation (B, C, J, K) with 'if y' or 'if ~y', so nested conditionals give an exponentially decreasing duty factor.]
Zero-Cost Conditionals
Most approaches to conditional operations are costly:
- Branching control flow: dead issue slots on mispredicted branches
- Predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' goes idle
Conditional vectors instead append an element to an output stream depending on a case variable.
[Diagram: a case stream in {0,1} steers each element of the result stream to output stream 0 or output stream 1.]
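In stream terms the mechanism looks roughly like this sketch (hypothetical values and case test); the payoff is that each dense output stream is then processed at full duty factor by its own kernel:

```python
def conditional_split(result_stream, case_stream):
    """Append each element of `result_stream` to output stream 0
    or 1 according to its case bit. Unlike branching there is no
    misprediction loss, and unlike predication no execution slots
    sit idle, because both outputs stay dense."""
    out0, out1 = [], []
    for value, case in zip(result_stream, case_stream):
        (out1 if case else out0).append(value)
    return out0, out1

# Hypothetical usage: elements that pass a test go to stream 1.
values = [12, 200, 7, 90, 255, 3]
cases  = [int(v > 64) for v in values]        # case stream in {0, 1}
fail, passed = conditional_split(values, cases)
```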
Application Sketch - Polygon Rendering
[Pipeline diagram: vertices (X, Y, RGB, UV) are assembled into triangles (V1, V2, V3); triangle setup emits spans (Y, X1, X2, RGB1, ΔRGB, UV1, ΔUV); span interpolation emits pixels (X, Y, RGB, UV); texturing produces final pixels (X, Y, RGB).]
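A structural sketch of the same pipeline as chained stream kernels; the record fields follow the diagram, while the kernel bodies are trivial stand-ins rather than real rasterization:

```python
def assemble(vertices):                  # vertex stream -> triangles
    it = iter(vertices)
    return list(zip(it, it, it))         # (V1, V2, V3)

def setup(triangles):                    # triangles -> span stream
    return [dict(y=0, x1=0, x2=4,        # (Y, X1, X2, RGB1, dRGB, UV1, dUV)
                 rgb1=(0, 0, 0), drgb=(1, 1, 1),
                 uv1=(0.0, 0.0), duv=(0.0, 0.0))
            for _ in triangles]

def interpolate(spans):                  # spans -> pixel stream
    return [dict(x=s["x1"] + i, y=s["y"], uv=s["uv1"],
                 rgb=tuple(c + d * i for c, d in zip(s["rgb1"], s["drgb"])))
            for s in spans for i in range(s["x2"] - s["x1"])]

def texture(pixels):                     # pixels -> textured pixels
    return [(p["x"], p["y"], p["rgb"]) for p in pixels]   # (X, Y, RGB)

verts = [((0, 0), (0, 0, 0), (0.0, 0.0))] * 3   # one placeholder triangle
out = texture(interpolate(setup(assemble(verts))))
```

Each stage reads one stream and writes the next, so intermediate records can stay in the on-chip register hierarchy.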
Status
- Working simulator of Imagine
- Simple kernels running on the simulator: FFT
- Applications being developed: depth extraction, video compression, polygon rendering, image-based graphics
- Circuit/layout studies underway
Acknowledgements
Students/staff: Don Alpert (Intel), Chris Buehler (MIT), J.P. Grossman (MIT), Brad Johanson, Ujval Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter Mattson, John Owens, Scott Rixner
Helpful suggestions: Henry Fuchs (UNC), Pat Hanrahan, Tom Knight (MIT), Marc Levoy, Leonard McMillan (MIT), John Poulton (UNC)
Conclusion: Work Toward Tomorrow's Computing Engines
- Targeted toward media processing: streams of low-precision samples, little reuse, latency tolerant
- Matched to the capabilities of communication-limited technology: explicit bandwidth hierarchy, explicit communication between units, communication exposed
- Insight, not numbers