Tomorrow's Computing Engines
Symposium on High-Performance Computer Architecture, February 3, 1998
William J. Dally, Computer Systems Laboratory, Stanford University
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war. Computer architects tend to always design the last computer:
- old programs
- old technology assumptions
Some Previous “Wars” (1/3)
- Reliable Router (1994)
- Torus Routing Chip (1985)
- MARS Router (1984)
- Network Design Frame (1988)
Some Previous “Wars” (2/3)
- MDP Chip
- J-Machine
- Cray T3D
- MAP Chip
Some Previous “Wars” (3/3)
Tomorrow’s Computing Engines
- Driven by tomorrow's applications: media
- Constrained by tomorrow's technology
90% of Desktop Cycles Will Be Spent on ‘Media’ Applications by 2000
Quote from a talk abstract by Scott Kirkpatrick of IBM.
Media applications include:
- video encode/decode
- polygon- and image-based graphics
- audio processing: compression, music, speech recognition/synthesis
- modulation/demodulation at audio and video rates
These applications involve stream processing, as does radar processing: SAR, STAP, MTI, ...
Typical Media Kernel: Image Warp and Composite
- Read 10,000 pixels from memory
- Perform integer operations on each pixel
- Test each pixel
- Write the ~3,000 result pixels that pass back to memory
- Little reuse of data fetched from memory: each pixel is used once
- Little interaction between pixels: very insensitive to operation latency
- The challenge is to maximize bandwidth
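As a rough illustration of the kernel's stream character, here is a minimal sketch in Python; the warp function, threshold, and pixel counts are illustrative stand-ins, not the actual kernel.

```python
import numpy as np

def warp_and_composite(src, warp, threshold):
    """Sketch of a stream-style warp-and-composite kernel.
    Each pixel is read once, transformed with a few integer
    operations, tested, and conditionally written: no data reuse
    and no interaction between pixels, so throughput is limited
    by memory bandwidth rather than operation latency."""
    out = []
    for pixel in src:            # read each of ~10,000 pixels once
        p = warp(int(pixel))     # integer ops on each pixel
        if p > threshold:        # test each pixel
            out.append(p)        # only passing (~3,000) pixels written back
    return np.array(out)

# Hypothetical usage with a made-up warp function and threshold.
src = np.random.randint(0, 256, size=10_000)
result = warp_and_composite(src, lambda p: (3 * p) >> 2, 64)
```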
Telepresence: A Driving Application
Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display 3D Scene
Most kernels:
- latency insensitive
- high ratio of arithmetic to memory references
Tomorrow’s Technology is Wire Limited
- Lots of devices
- A little faster
- Slow wires
Technology scaling makes communication the scarce resource
1997: 0.35 µm process, 64 Mb DRAM, 16 64-bit FP processors at 400 MHz; 18 mm die, 12,000 wire tracks, 1 clock to cross the chip.
2007: 0.10 µm process, 4 Gb DRAM, 1K 64-bit FP processors at 2.5 GHz; 32 mm die, 90,000 wire tracks, 20 clocks to cross the chip.
On-chip wires are getting slower
With linear scaling factor $s$ per generation ($s = 0.5$ below):
- Feature size: $x_2 = s\,x_1$ (0.5×)
- Wire resistance per unit length: $R_2 = R_1/s^2$ (4×)
- Wire capacitance per unit length: $C_2 = C_1$ (1×)
- Delay of a fixed-length wire $y$: $t_w = RCy^2$, so $t_{w2} = t_{w1}/s^2$ (4×)
- Relative to gate delay: $t_{w2}/t_{g2} = t_{w1}/(t_{g1}s^3)$ (8×)
- Velocity on a repeated wire: $v = 0.5\,(t_g RC)^{-1/2}$ m/s, so $v_2 = v_1 s^{1/2}$ (0.7×)
- Distance per gate delay: $v t_g = 0.5\,(t_g/(RC))^{1/2}$ m/gate, so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35×)
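As a quick numeric check, this sketch just evaluates the scaling relations above at s = 0.5:

```python
# Evaluate the wire-scaling relations for one generation, s = 0.5.
s = 0.5

ratios = {
    "R per unit length (R2/R1)":         1 / s**2,  # 4x
    "C per unit length (C2/C1)":         1.0,       # 1x
    "fixed-length wire delay (tw2/tw1)": 1 / s**2,  # 4x
    "wire delay vs. gate delay":         1 / s**3,  # 8x
    "repeated-wire velocity (v2/v1)":    s**0.5,    # ~0.7x
    "distance per gate delay":           s**1.5,    # ~0.35x
}
for name, val in ratios.items():
    print(f"{name:36s} {val:5.2f}x")
```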
Bandwidth and Latency of Modern VLSI
[Log-log plot: bandwidth and latency vs. size, from 1 to 10⁵. Bandwidth falls and latency rises steeply once communication crosses the chip boundary.]
Architecture for Locality: Exploit High On-Chip Bandwidth
[Block diagram: off-chip RAM connects over 2 GB/s of pin bandwidth to an on-chip vector register file (50 GB/s), which feeds on the order of 10⁴ 32-bit ALUs through a switch (500 GB/s).]
Tomorrow’s Computing Engines
- Aimed at media processing: stream based, latency tolerant, low precision, little reuse, lots of conditionals
- Use the large number of devices available on future chips
- Make efficient use of scarce communication resources: a bandwidth hierarchy, no centralized resources
- Approach the performance of a special-purpose processor
Why do Special-Purpose Processors Perform Well?
- Lots (100s) of ALUs
- Fed by dedicated wires/memories
Care and Feeding of ALUs
[Diagram: an instruction cache and IP supply instruction bandwidth to the IR; registers supply data bandwidth to the ALU. The 'feeding' structure dwarfs the ALU itself.]
Three Key Problems
- Instruction bandwidth
- Data bandwidth
- Conditional execution
A Bandwidth Hierarchy
[Block diagram: four SDRAM banks (1.6 GB/s) feed a streaming memory system; a vector register file (50 GB/s) feeds ALU clusters, 13 ALUs per cluster (500 GB/s).]
- Solves the data bandwidth problem
- Matched to the bandwidth curve of the technology
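A back-of-the-envelope check of what the hierarchy demands of a program (a sketch using only the bandwidth figures on this slide):

```python
# Bandwidths of the three levels of the hierarchy (GB/s).
sdram, vrf, clusters = 1.6, 50.0, 500.0

# To keep the ALUs busy, each word fetched from SDRAM must be
# referenced many times at the levels above it:
print(f"register file : SDRAM = {vrf / sdram:.0f} : 1")       # ~31 : 1
print(f"ALU clusters  : SDRAM = {clusters / sdram:.0f} : 1")  # ~312 : 1
print(f"ALU clusters  : regs  = {clusters / vrf:.0f} : 1")    #   10 : 1
```

Media kernels fit this profile: a high ratio of arithmetic to memory references, concentrated in the registers and clusters.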
A Streaming Memory System
[Diagram: address generators issue references through a crossbar to per-bank reorder queues, one in front of each SDRAM bank.]
Streaming Memory Performance
- Exploit latency insensitivity for improved bandwidth
- 1.75:1 performance improvement from a relatively short reorder queue
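A minimal sketch of the reordering idea, assuming a simple open-row DRAM model; the queue contents and timing constants below are made up:

```python
from collections import deque, namedtuple

Ref = namedtuple("Ref", "row col")

def next_reference(queue, open_row, t_hit=1, t_miss=4):
    """Serve the next DRAM reference from a short reorder queue,
    preferring one that hits the currently open row. Because the
    streams are latency-insensitive, serving references out of
    order costs nothing and raises effective bandwidth."""
    for i, ref in enumerate(queue):
        if ref.row == open_row:      # row hit: fast column access
            del queue[i]
            return ref, open_row, t_hit
    ref = queue.popleft()            # no hit: pay a row activation
    return ref, ref.row, t_miss

# Hypothetical trace with interleaved rows that strict FIFO order
# would thrash on (almost every access a row miss).
q = deque(Ref(r, c) for c in range(4) for r in (0, 1))
open_row, cycles = 0, 0
while q:
    _, open_row, t = next_reference(q, open_row)
    cycles += t
print(cycles)   # 11 cycles reordered vs. 29 in FIFO order
```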
Compound Vector Operations: One Instruction Does Lots of Work
[Diagram: a compound vector instruction (e.g., LD Vd Vx, Op) enters a control store with µIP, which expands it into µinstructions driving the memory address generators and the vector register file V0–V7, with an (Op, Ra, Rb) field per ALU.]
One 50 b compound vector instruction replaces 300 b/µinst × 20 µinst/op × 1,000 el/vec = 6 × 10⁶ b of instruction bandwidth.
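To make the amortization concrete, a sketch; the fused multiply-add-and-clamp operation is a hypothetical example, not the machine's actual instruction set:

```python
def compound_vector_op(vx, vy):
    """One compound vector 'instruction': load two vectors, then
    multiply-add and clamp every element. Per-element control comes
    from a local microprogram (the control store), so the global
    instruction stream supplies ~50 bits per vector rather than
    per-element microinstructions."""
    return [min(max(3 * x + y, 0), 255) for x, y in zip(vx, vy)]

# Control expansion from the slide: one 50 b compound instruction
# stands in for 300 b/uinst * 20 uinst/op * 1000 el/vec of control.
print(300 * 20 * 1000)                        # 6,000,000 bits
out = compound_vector_op(range(1000), range(1000))
```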
Scheduling by Simulated Annealing
- List scheduling assumes global communication; it does poorly when communication is exposed
- View scheduling as a CAD problem (place and route):
  - generate a naïve 'feasible' schedule
  - iteratively improve the schedule by moving operations
[Diagram: ready operations are placed onto an ALU × time grid.]
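A minimal sketch of the annealing loop: generic simulated annealing over schedules, with the cost function, move generator, and cooling constants left as placeholders rather than the actual scheduler:

```python
import math, random

def anneal(schedule, cost, random_move, t0=10.0, cooling=0.995, steps=20_000):
    """Iteratively improve a schedule by moving operations.
    `random_move` perturbs a feasible schedule (e.g., relocates one
    operation to another ALU/time slot); `cost` measures schedule
    length plus exposed-communication penalties. Worse moves are
    accepted with probability exp(-delta/T), letting the search
    escape the local minima greedy list scheduling gets stuck in."""
    best = cur = schedule
    temp = t0
    for _ in range(steps):
        cand = random_move(cur)
        delta = cost(cand) - cost(cur)
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            cur = cand
            if cost(cur) < cost(best):
                best = cur
        temp *= cooling                       # cooling schedule
    return best
```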
Typical Annealing Schedule
[Plot: schedule cost falls from 166 to 13 over the annealing run; an annotation marks where the energy function was changed.]
Conventional Approaches to Data-Dependent Conditional Execution
[Diagram, two panels. Data-dependent branch: branching on x>0 speculates down one path and loses ~1000 in-flight operations on a misprediction ('whoops'). Predication: y=(x>0) guards each operation (B, C, J, K) with 'if y' or 'if ~y', so nested conditionals give an exponentially decreasing duty factor.]
Zero-Cost Conditionals
Most approaches to conditional operations are costly:
- Branching control flow: dead issue slots on mispredicted branches
- Predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' goes idle
Conditional vectors instead append an element to an output stream depending on a case variable.
[Diagram: a case stream in {0,1} steers each element of the result stream to output stream 0 or output stream 1.]
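In stream terms the mechanism looks roughly like this sketch (hypothetical values and case test); the payoff is that each dense output stream is then processed at full duty factor by its own kernel:

```python
def conditional_split(result_stream, case_stream):
    """Append each element of `result_stream` to output stream 0
    or 1 according to its case bit. Unlike branching there is no
    misprediction loss, and unlike predication no execution slots
    sit idle, because both outputs stay dense."""
    out0, out1 = [], []
    for value, case in zip(result_stream, case_stream):
        (out1 if case else out0).append(value)
    return out0, out1

# Hypothetical usage: elements that pass a test go to stream 1.
values = [12, 200, 7, 90, 255, 3]
cases  = [int(v > 64) for v in values]        # case stream in {0, 1}
fail, passed = conditional_split(values, cases)
```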
Application Sketch - Polygon Rendering
[Pipeline diagram: vertices (X, Y, RGB, UV) are assembled into triangles (V1, V2, V3); triangle setup emits spans (Y, X1, X2, RGB1, ΔRGB, UV1, ΔUV); span interpolation emits pixels (X, Y, RGB, UV); texturing produces final pixels (X, Y, RGB).]
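A structural sketch of the same pipeline as chained stream kernels; the record fields follow the diagram, while the kernel bodies are trivial stand-ins rather than real rasterization:

```python
def assemble(vertices):                  # vertex stream -> triangles
    it = iter(vertices)
    return list(zip(it, it, it))         # (V1, V2, V3)

def setup(triangles):                    # triangles -> span stream
    return [dict(y=0, x1=0, x2=4,        # (Y, X1, X2, RGB1, dRGB, UV1, dUV)
                 rgb1=(0, 0, 0), drgb=(1, 1, 1),
                 uv1=(0.0, 0.0), duv=(0.0, 0.0))
            for _ in triangles]

def interpolate(spans):                  # spans -> pixel stream
    return [dict(x=s["x1"] + i, y=s["y"], uv=s["uv1"],
                 rgb=tuple(c + d * i for c, d in zip(s["rgb1"], s["drgb"])))
            for s in spans for i in range(s["x2"] - s["x1"])]

def texture(pixels):                     # pixels -> textured pixels
    return [(p["x"], p["y"], p["rgb"]) for p in pixels]   # (X, Y, RGB)

verts = [((0, 0), (0, 0, 0), (0.0, 0.0))] * 3   # one placeholder triangle
out = texture(interpolate(setup(assemble(verts))))
```

Each stage reads one stream and writes the next, so intermediate records can stay in the on-chip register hierarchy.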
Status
- Working simulator of Imagine
- Simple kernels running on the simulator: FFT
- Applications being developed: depth extraction, video compression, polygon rendering, image-based graphics
- Circuit/layout studies underway
Acknowledgements
Students/staff: Don Alpert (Intel), Chris Buehler (MIT), J.P. Grossman (MIT), Brad Johanson, Ujval Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter Mattson, John Owens, Scott Rixner
Helpful suggestions: Henry Fuchs (UNC), Pat Hanrahan, Tom Knight (MIT), Marc Levoy, Leonard McMillan (MIT), John Poulton (UNC)
Conclusion: Work Toward Tomorrow's Computing Engines
- Targeted toward media processing: streams of low-precision samples, little reuse, latency tolerant
- Matched to the capabilities of communication-limited technology: explicit bandwidth hierarchy, explicit communication between units, communication exposed
- Insight, not numbers