1 Tomorrow’s Computing Engines
February 3, 1998, Symposium on High-Performance Computer Architecture
William J. Dally, Computer Systems Laboratory, Stanford University
billd@csl.stanford.edu

2 Focus on Tomorrow, not Yesterday
–Generals tend to always fight the last war
–Computer architects tend to always design the last computer: old programs, old technology assumptions

3 Some Previous “Wars” (1/3)
–MARS Router (1984)
–Torus Routing Chip (1985)
–Network Design Frame (1988)
–Reliable Router (1994)

4 Some Previous “Wars” (2/3)
–MDP Chip
–J-Machine
–Cray T3D
–MAP Chip

5 Some Previous “Wars” (3/3)

6 Tomorrow’s Computing Engines
–Driven by tomorrow’s applications: media
–Constrained by tomorrow’s technology

7 90% of Desktop Cycles Will Be Spent on ‘Media’ Applications by 2000 (quote from Scott Kirkpatrick of IBM, talk abstract)
Media applications include:
–video encode/decode
–polygon and image-based graphics
–audio processing: compression, music, speech recognition/synthesis
–modulation/demodulation at audio and video rates
These applications involve stream processing. So do:
–radar processing: SAR, STAP, MTI...

8 Typical Media Kernel: Image Warp and Composite
–Read 10,000 pixels from memory
–Perform 100 16-bit integer operations on each pixel
–Test each pixel
–Write the 3,000 result pixels that pass to memory
Little reuse of data fetched from memory: each pixel is used once. Little interaction between pixels: very insensitive to operation latency. The challenge is to maximize bandwidth.
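As a purely illustrative sketch (not code from the talk), the C fragment below has the shape of such a warp-and-composite kernel: each pixel is read once, transformed with a short run of 16-bit integer arithmetic standing in for the ~100 operations, tested, and appended to the output only if it passes. The pixel format, the arithmetic, and the threshold are invented for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stream kernel: each input pixel is read once, transformed
 * with short 16-bit integer arithmetic, tested, and appended to the output
 * only if it passes.  Returns the number of pixels written. */
size_t warp_composite(const uint16_t *in, size_t n_in,
                      uint16_t *out, uint16_t threshold)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_in; i++) {
        uint16_t p = in[i];
        /* Stand-in for ~100 16-bit ops per pixel (warp, filter, blend). */
        uint16_t warped     = (uint16_t)((p * 3u + 17u) >> 1);
        uint16_t composited = (uint16_t)(warped ^ (warped >> 4));
        /* Test each pixel; only passing pixels are written back. */
        if (composited > threshold)
            out[n_out++] = composited;
    }
    return n_out;   /* e.g., ~3,000 of 10,000 inputs */
}
```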

9 Telepresence: A Driving Application
Pipeline: acquire 2D images → extract depth (3D images) → segmentation → model extraction → compression → channel → decompression → rendering → display of the 3D scene.
Most kernels are latency insensitive, with a high ratio of arithmetic to memory references.

10 Tomorrow’s Technology is Wire Limited
–Lots of devices
–A little faster
–Slow wires

11 Technology scaling makes communication the scarce resource
–1997: 0.35 µm, 64Mb DRAM, 16 64-bit FP processors at 400MHz; an 18mm chip spans 12,000 wire tracks and 1 clock
–2007: 0.10 µm, 4Gb DRAM, 1K 64-bit FP processors at 2.5GHz; a 32mm chip spans 90,000 wire tracks and 20 clocks

12 On-chip wires are getting slower
Scale feature size by s, so wire cross-section x2 = s·x1 (0.5× for s = 0.5):
–R2 = R1/s² (4×); C2 = C1 (1×)
–wire delay t_w = R·C·y², so for a fixed length y, t_w2 = R2·C2·y² = t_w1/s² (4×)
–relative to gate delay: t_w2/t_g2 = t_w1/(t_g1·s³) (8×)
–signal velocity v = 0.5·(t_g·R·C)^(−1/2) m/s: v2 = v1·s^(1/2) (0.7×)
–distance per gate delay v·t_g = 0.5·(t_g/(R·C))^(1/2) m/gate: v2·t_g2 = v1·t_g1·s^(3/2) (0.35×)
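The velocity scaling follows directly from the slide’s own definitions; a brief worked derivation (assuming, as above, per-unit-length R scaling as 1/s², per-unit-length C roughly constant, and gate delay t_g scaling as s):

```latex
v_2 = \tfrac{1}{2}\,(t_{g2} R_2 C_2)^{-1/2}
    = \tfrac{1}{2}\,\Bigl(s\,t_{g1}\cdot\tfrac{R_1}{s^{2}}\cdot C_1\Bigr)^{-1/2}
    = v_1\,s^{1/2},
\qquad
v_2\,t_{g2} = v_1 s^{1/2}\cdot s\,t_{g1} = v_1 t_{g1}\,s^{3/2}.
```

With s = 0.5 these give the 0.7× and 0.35× factors quoted on the slide.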

13 Bandwidth and Latency of Modern VLSI
[Figure: log-log plot of bandwidth and latency versus size, with the chip boundary marked.]

14 Architecture for Locality
Exploit high on-chip bandwidth:
–off-chip RAM: pin bandwidth, 2GB/s
–vector register file: 50GB/s
–switch among 104 32-bit ALUs: 500GB/s

15 Tomorrow’s Computing Engines
Aimed at media processing:
–stream based
–latency tolerant
–low precision
–little reuse
–lots of conditionals
Use the large number of devices available on future chips. Make efficient use of scarce communication resources:
–bandwidth hierarchy
–no centralized resources
Approach the performance of a special-purpose processor.

16 Why do Special-Purpose Processors Perform Well?
–Lots (100s) of ALUs
–Fed by dedicated wires/memories

17 Care and Feeding of ALUs
Each ALU needs both data bandwidth and instruction bandwidth; the ‘feeding’ structure (registers, instruction cache, IR, IP) dwarfs the ALU itself.

18 Three Key Problems
–Instruction bandwidth
–Data bandwidth
–Conditional execution

19 A Bandwidth Hierarchy
–SDRAM streaming memory: 1.6GB/s
–Vector register file: 50GB/s
–ALU cluster: 500GB/s (13 ALUs per cluster)
Solves the data bandwidth problem; matched to the bandwidth curve of the technology.

20 A Streaming Memory System
[Diagram: address generators issue stream references through a crossbar to multiple SDRAM banks; replies return through reorder queues.]
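A purely illustrative sketch (not the hardware design) of the reorder-queue idea: stream element i maps to address base + i·stride, bank replies may return out of order, and a small in-order window releases elements back to the consuming stream in sequence. The structure names and the 16-entry depth are invented.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define REORDER_SLOTS 16   /* illustrative reorder-queue depth */

struct reorder_queue {
    uint32_t data[REORDER_SLOTS];
    bool     valid[REORDER_SLOTS];
    size_t   head;          /* sequence number of next element to release */
};

/* Strided address generation: element i of the stream lives at base + i*stride. */
static inline uint64_t stream_addr(uint64_t base, uint64_t stride, size_t i)
{
    return base + i * stride;
}

/* A bank reply arrives, possibly out of order, tagged with its sequence number. */
static bool rq_insert(struct reorder_queue *q, size_t seq, uint32_t value)
{
    if (seq < q->head || seq >= q->head + REORDER_SLOTS)
        return false;                     /* outside the window: retry later */
    q->data[seq % REORDER_SLOTS]  = value;
    q->valid[seq % REORDER_SLOTS] = true;
    return true;
}

/* Release the next in-order element to the consuming stream, if present. */
static bool rq_pop(struct reorder_queue *q, uint32_t *out)
{
    size_t slot = q->head % REORDER_SLOTS;
    if (!q->valid[slot])
        return false;
    *out = q->data[slot];
    q->valid[slot] = false;
    q->head++;
    return true;
}
```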

21 Streaming Memory Performance
Exploit latency insensitivity for improved bandwidth: a 1.75:1 performance improvement from a relatively short reorder queue.

22 Compound Vector Operations
One instruction does lots of work. [Diagram: a compound vector instruction (opcode plus vector register fields V0–V7, e.g., LD Vd, Vx) drives the memory address generators, the VRF, and a control store (µIP) that issues per-ALU microinstructions of the form Op, Ra, Rb.]
1 CV instruction (50 bits) expands inside the machine to microinstructions of roughly 300 bits × 20 µinst/op × 1000 elements/vector ≈ 6 × 10⁶ bits.
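Spelling out the arithmetic on the slide, one fetched compound-vector instruction stands in for

```latex
300\ \tfrac{\text{bits}}{\mu\text{inst}}\times 20\ \tfrac{\mu\text{inst}}{\text{op}}\times 1000\ \tfrac{\text{elements}}{\text{vector}} = 6\times10^{6}\ \text{bits}
```

of internal instruction bandwidth, versus the 50 bits actually fetched: a ratio of about 1.2 × 10⁵ : 1.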

23 Scheduling by Simulated Annealing
–List scheduling assumes global communication and does poorly when communication is exposed
–View scheduling as a CAD problem (place and route): generate a naïve ‘feasible’ schedule, then iteratively improve it by moving operations (ready operations placed on an ALUs × time grid)
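A toy sketch of such an annealing loop, purely illustrative and not the scheduler described in the talk: operations hold an (ALU, time-slot) placement, the cost function penalizes schedule length, inter-ALU communication, and ordering violations, and random moves are accepted by the usual Metropolis rule. The data structures, constants, and cost terms are all invented for the example.

```c
#include <stdlib.h>
#include <math.h>

#define N_OPS   32
#define N_ALUS  4
#define N_SLOTS 64

struct op { int alu, slot, dep; };   /* dep = index of the producing op, or -1 */

/* Cost = schedule length, plus a penalty for each value that must cross
 * between ALUs (communication is exposed, not free), plus a large penalty
 * for placing a consumer at or before its producer. */
static int cost(const struct op *ops)
{
    int makespan = 0, penalty = 0;
    for (int i = 0; i < N_OPS; i++) {
        if (ops[i].slot > makespan) makespan = ops[i].slot;
        if (ops[i].dep >= 0) {
            if (ops[ops[i].dep].alu != ops[i].alu)   penalty += 4;
            if (ops[ops[i].dep].slot >= ops[i].slot) penalty += 100;
        }
    }
    return makespan + penalty;
}

void anneal(struct op *ops)
{
    double T = 10.0;
    int cur = cost(ops);
    while (T > 0.01) {
        for (int step = 0; step < 1000; step++) {
            int i = rand() % N_OPS;
            struct op saved = ops[i];
            ops[i].alu  = rand() % N_ALUS;       /* try moving op i elsewhere */
            ops[i].slot = rand() % N_SLOTS;
            int next = cost(ops);
            if (next <= cur ||
                exp((cur - next) / T) > (double)rand() / RAND_MAX)
                cur = next;                       /* accept the move */
            else
                ops[i] = saved;                   /* reject: undo */
        }
        T *= 0.95;                                /* cool */
    }
}
```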

24 Typical Annealing Schedule
[Plot: schedule quality over the course of an annealing run; an annotation marks where the energy function was changed.]

25 Conventional Approaches to Data-Dependent Conditional Execution
–Data-dependent branch: run A, test x > 0, branch to B or C, then continue with J or K
–Speculation: on a mispredicted branch the speculatively issued work is thrown away (“whoops”); the speculative loss scales as depth × width, on the order of ~1000 operations
–Predication: compute y = (x > 0) and issue both sides under ‘if y’ / ‘if ~y’ guards (B, C, J, K all occupy issue slots), giving an exponentially decreasing duty factor for nested conditionals

26 Zero-Cost Conditionals
Most approaches to conditional operations are costly:
–branching control flow: dead issue slots on mispredicted branches
–predication (SIMD select, masked vectors): a large fraction of execution ‘opportunities’ go idle
Conditional vectors: append each element to an output stream selected by a case variable (a result stream and a {0,1} case stream are split into output stream 0 and output stream 1).
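An illustrative sketch of the conditional-stream idea in plain C (names invented): each result element is routed to exactly one output stream according to its case bit, so no execution slots are spent on elements the downstream kernel will never touch.

```c
#include <stdint.h>
#include <stddef.h>

/* Split a result stream into two dense output streams according to a
 * {0,1} case stream.  Every appended element is live work for the kernel
 * that consumes that stream; nothing is masked off or discarded later. */
void conditional_split(const uint32_t *result, const uint8_t *case_bits,
                       size_t n,
                       uint32_t *out0, size_t *n0,
                       uint32_t *out1, size_t *n1)
{
    *n0 = *n1 = 0;
    for (size_t i = 0; i < n; i++) {
        if (case_bits[i])
            out1[(*n1)++] = result[i];
        else
            out0[(*n0)++] = result[i];
    }
}
```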

27 Application Sketch: Polygon Rendering
[Diagram: a vertex stream (V1, V2, V3 with X, Y, RGB, UV per vertex) is transformed into a span stream (Y, X1, X2, RGB1, ΔRGB, UV1, ΔUV), then into a pixel stream (X, Y, RGB, UV), and finally, after texturing, into textured pixels (X, Y, RGB).]
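For concreteness, a small sketch (invented for this transcript, not from the talk) of the span-to-pixel stage: each span record carries a start value and a per-pixel delta for its interpolated quantities, and the kernel emits one pixel record per covered x position. Texture coordinates are omitted for brevity.

```c
#include <stddef.h>

/* Hypothetical record formats for the span and pixel streams. */
struct span  { int y, x1, x2; float r, g, b, dr, dg, db; };
struct pixel { int x, y; float r, g, b; };

/* Expand one span into pixels by interpolating colour across [x1, x2).
 * Returns the number of pixel records appended to the output stream. */
size_t rasterize_span(const struct span *s, struct pixel *out)
{
    float r = s->r, g = s->g, b = s->b;
    size_t n = 0;
    for (int x = s->x1; x < s->x2; x++) {
        out[n].x = x;  out[n].y = s->y;
        out[n].r = r;  out[n].g = g;  out[n].b = b;
        r += s->dr;  g += s->dg;  b += s->db;
        n++;
    }
    return n;
}
```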

28 Status
–Working simulator of Imagine
–Simple kernels running on the simulator: FFT
–Applications being developed: depth extraction, video compression, polygon rendering, image-based graphics
–Circuit/layout studies underway

29 Acknowledgements
Students/Staff: Don Alpert (Intel), Chris Buehler (MIT), J.P. Grossman (MIT), Brad Johanson, Ujval Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter Mattson, John Owens, Scott Rixner
Helpful suggestions: Henry Fuchs (UNC), Pat Hanrahan, Tom Knight (MIT), Marc Levoy, Leonard McMillan (MIT), John Poulton (UNC)

30 Conclusion
Work toward tomorrow’s computing engines, targeted at media processing:
–streams of low-precision samples
–little reuse
–latency tolerant
Matched to the capabilities of communication-limited technology:
–explicit bandwidth hierarchy
–explicit communication between units
–communication exposed
Insight, not numbers.

