Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine Leonid Oliker Future Technologies Group Computational Research Division LBNL.


1 Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker
Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/~oliker
Sourav Chatterji, Jason Duell, Manikandan Narayanan

2 Motivation
- Commodity cache-based SMP clusters perform at a small % of peak for memory-intensive problems (esp. irregular problems)
- But the “gap” between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)
- Power and packaging are becoming significant bottlenecks
- Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
- Alternative architectures allow tighter integration of processor & memory
- Can we build HPC systems w/ high-end media processor technology?
  - VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit large bandwidth potential
  - IMAGINE: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters

3 Motivation
- General-purpose processors are badly suited for data-intensive ops
  - Large caches not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP are inefficient
  - Power consumption
- Application-specific ASICs: good, but expensive/slow to design
- Solution: general-purpose “memory-aware” processors
  - Large number of ALUs: to exploit data parallelism
  - Huge memory bandwidth: to keep ALUs busy
  - Concurrency: overlap memory w/ computation

4 VIRAM Overview
- MIPS core (200 MHz)
- Main memory system
  - 8 banks w/ 13 MB of on-chip DRAM
  - Large 6.4 GBytes/s on-chip peak bandwidth
- Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single issue, in order
- Low power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops
  - 1.6 Gflops (single precision)
- Fabricated by IBM: taped out 02/2003
- To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
- We use a simulator with Cray’s vcc compiler

5 VIRAM Vector Lanes
- Parallel lane design has advantages in performance, design complexity, scalability
- Each lane has 2 ALUs (1 for FP) and receives identical control signals
- Vector instructions specify 64-way parallelism; hardware executes 8-way
- 8 KB vector register file partitioned into 32 vector registers
- Variable data widths: 4 lanes for 64-bit, 8 lanes for 32-bit, 16 lanes for 16-bit
  - When the data width is cut in half, the # of elems per register (and peak) doubles
- Limitations: no 64-bit FP & compiler doesn’t generate fused MADD
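The lane figures above appear to line up with the peak rates quoted on the VIRAM Overview slide. A minimal sketch of that arithmetic follows, assuming peak ops = lanes x ALUs per lane x clock. That formula and all names in the code are illustrative assumptions, not something the slides state; only the constants come from the slides.

```python
# Constants quoted on the slides; the formulas below are an assumed model.
CLOCK_HZ = 200e6              # MIPS core / vector unit clock (200 MHz)
ALUS_PER_LANE = 2             # each lane has 2 ALUs
FP_ALUS_PER_LANE = 1          # only 1 of them handles floating point
REGISTER_FILE_BYTES = 8 * 1024
NUM_VECTOR_REGISTERS = 32
LANES_BY_WIDTH = {64: 4, 32: 8, 16: 16}  # data width in bits -> lanes


def peak_gops(width_bits):
    """Assumed peak integer rate: lanes x ALUs per lane x clock."""
    return LANES_BY_WIDTH[width_bits] * ALUS_PER_LANE * CLOCK_HZ / 1e9


def peak_gflops_single():
    """Assumed peak 32-bit FP rate: one FP ALU per 32-bit lane."""
    return LANES_BY_WIDTH[32] * FP_ALUS_PER_LANE * CLOCK_HZ / 1e9


def elems_per_register(width_bits):
    """Elements held by one vector register at the given data width."""
    bytes_per_register = REGISTER_FILE_BYTES // NUM_VECTOR_REGISTERS
    return bytes_per_register * 8 // width_bits


if __name__ == "__main__":
    for width in (64, 32, 16):
        print(width, peak_gops(width), elems_per_register(width))
    print("single-precision GFlop/s:", peak_gflops_single())
```

Under this model a register holds 64 elements at 32 bits, which matches the 64-way parallelism a vector instruction specifies.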

6 VIRAM Power Efficiency
- Comparable performance with lower clock rate
- Large power/performance advantage for VIRAM from PIM technology, data-parallel execution model

7 Stream Processing
- Stream: ordered set of records (homogeneous, arbitrary data type)
- Stream programming: data is streams, computation is kernels
- Kernels loop through all stream elements (sequential order)
  - Perform a compound (multiword) operation on each stream elem
  - Vectors perform a single arithmetic op on each vector elem (then store in reg)
- Example: stereo depth extraction
- Data and functional parallelism
- High computation rate
- Little data reuse
- Producer-consumer and spatial locality
- Ex: multimedia, signal processing, graphics

8 Imagine Overview
- “Vector VLIW” processor
- Coprocessor to off-chip host processor
- 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
- Central 128 KB Stream Register File (SRF) @ 32 GB/s
  - SRF can overlap computation with memory (double buffering)
  - SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s off-chip
- 544 GB/s intercluster communication
- Host sends instructions to stream controller (SC); SC issues commands to on-chip modules

9 Imagine Arithmetic Clusters
- 400 MHz clock, 8 clusters w/ 6 FUs each (48 FUs total)
- Reads/writes streams to SRF
- Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratch, & 1 comm unit
- 32-bit arch: subword operations support 16- and 8-bit data (no 64-bit support)
- Local registers on functional units hold 16 words each (total 1.5 KB)
- Clusters receive VLIW-style instructions broadcast from microcontroller

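10 VIRAM and Imagine
- Imagine: order of magnitude higher performance
- VIRAM: twice the memory bandwidth, less power consumption
- Notice the peak Flop/Word ratios

| | VIRAM | IMAGINE Memory | IMAGINE SRF |
|---|---|---|---|
| Bandwidth (GB/s) | 6.4 | 2.7 | 32 |
| Peak 32-bit flops (GF/s) | 1.6 | 20 | 20 |
| Peak Flop/Word | 1 | 30 | 2.5 |

| | VIRAM | IMAGINE |
|---|---|---|
| Speed (MHz) | 200 | 400 |
| Chip area | 15x18 mm | 12x12 mm |
| Data widths (bits) | 64/32/16 | 32/16/8 |
| Transistors | 130 x 10^6 | 21 x 10^6 |
| Power consumption | 2 Watts | 10 Watts |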
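The peak Flop/Word row can be reproduced from the bandwidth and peak-flop rows. The sketch below assumes a word is 32 bits (4 bytes), which is not stated on the slide but is consistent with the 32-bit peak figures; the function and constant names are illustrative.

```python
WORD_BYTES = 4  # assumption: one word = 32 bits


def flops_per_word(peak_gflops, bandwidth_gbytes_per_s):
    """Peak flops available per word delivered at peak bandwidth."""
    gwords_per_s = bandwidth_gbytes_per_s / WORD_BYTES
    return peak_gflops / gwords_per_s


if __name__ == "__main__":
    print("VIRAM          :", flops_per_word(1.6, 6.4))
    print("Imagine memory :", flops_per_word(20.0, 2.7))
    print("Imagine SRF    :", flops_per_word(20.0, 32.0))
```

A high ratio means a kernel must perform many operations per word fetched before the ALUs, rather than the memory system, become the limit. This is why the off-chip Imagine figure (about 30) matters for low-intensity codes.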

11 SQMAT Architectural Probe
- Sqmat: scalable synthetic probe; controls computational intensity, vector length
- Imagine stream model requires a large # of ops per word to amortize memory references
  - Poor use of SRF, no producer-consumer locality
  - Long streams help hide memory latency, but only 7% of algorithmic peak
- VIRAM: performs well for low ops/word (40% when L=256)
  - Vector pipeline overlaps comp/mem; on-chip DRAM (high bandwidth, low latency)
- Figure: 3x3 Matrix Multiply

12 SQMAT: Performance Crossover
- Large number of ops/word: N^10 where N = 3x3
- Crossover point: L=64 (cycles), L=256 (MFlop)
- Imagine's power becomes apparent: almost 4x VIRAM at L=1024
  - Codes at this end of the spectrum greatly benefit from the Imagine arch

13 VIRAM/Imagine Optimization
- Example optimization: RGB→YIQ conversion from EEMBC
  - Input format: R1 G1 B1 R2 G2 B2 R3 G3 B3 …
  - Required format: R1 R2 R3 … G1 G2 G3 … B1 B2 B3 …
- Optimization strategy: speed up the slower of computation or memory
  - Restructure computation for better kernel performance (memory is waiting for ALUs)
  - Add more computation for better memory performance (ALUs are memory starved)
- Subtle overlap effects: vector chaining, stream double buffering
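The layout change from interleaved to planar data can be sketched in a few lines. This is only an illustration of the data movement, not the EEMBC kernel: the function names are hypothetical, and the conversion uses Python's standard `colorsys.rgb_to_yiq` on floats in [0, 1], whereas the benchmark presumably works on integer pixel data with its own coefficients.

```python
import colorsys


def deinterleave_rgb(pixels):
    """Split R1 G1 B1 R2 G2 B2 ... into planar (R..., G..., B...) lists.

    Each slice below is a stride-3 access, the pattern the slides call
    a strided load.
    """
    if len(pixels) % 3 != 0:
        raise ValueError("expected a multiple of 3 samples")
    return pixels[0::3], pixels[1::3], pixels[2::3]


def rgb_planes_to_yiq(reds, greens, blues):
    """Convert planar RGB (floats in [0, 1]) to planar YIQ."""
    ys, is_, qs = [], [], []
    for r, g, b in zip(reds, greens, blues):
        y, i, q = colorsys.rgb_to_yiq(r, g, b)
        ys.append(y)
        is_.append(i)
        qs.append(q)
    return ys, is_, qs


if __name__ == "__main__":
    interleaved = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # one white, one black pixel
    r, g, b = deinterleave_rgb(interleaved)
    print(rgb_planes_to_yiq(r, g, b))
```

Once the data is planar, the per-component arithmetic runs over contiguous elements, which is the form a vector or stream kernel wants.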

14 VIRAM RGB→YIQ Optimization
- VIRAM: poor memory performance
  - Strided accesses (~1/2 performance): RGBRGBRGB… → strided loads → RRR…GGG…BBB…
  - Only 4 address generators for 8 addresses (sufficient for 64-bit)
  - Word operations on byte data (1/4 performance)
- Optimization: replace strided w/ unit-stride access, using in-register shuffle
  - Increased computational overhead (packing and unpacking)

15 VIRAM RGB→YIQ Results
- Used functional units instead of memory to extract components, increasing the computational overhead

| VIRAM | Kernel (cycles) | Memory (cycles) |
|---|---|---|
| Unoptimized | 114 | 95 |
| Optimized | 108 | 17 |

Chunk size: 64

16 Imagine RGB→YIQ Optimization
- Imagine bottleneck is computation, due to a poor ALU schedule (left): unoptimized, 15 cycles per pixel
- Software pipelining makes the VLIW schedule denser (right): optimized, 8 cycles per pixel

17 Imagine RGB→YIQ Results

| Imagine | Kernel (cycles) | Memory (cycles) |
|---|---|---|
| Unoptimized | 2153 | 1167 |
| Optimized | 1147 | 1165 |

Chunk size: 1024
- Optimized kernel takes only ½ the cycles per element
- Memory is now the new bottleneck

18 EEMBC Benchmark
- Vec-add: one add/elem, performance limited by memory system
- RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use available bandwidth)
- Grayfilter: difficult to implement efficiently on Imagine (sliding 3x3 window)
- Autocorr: uses short streams, Imagine host latency is high

| Benchmark | Width VIRAM/Imagine | Application Area | Remarks |
|---|---|---|---|
| Vec addition | 32/32 bits | Microbenchmark | c[i]=a[i]+b[i] |
| RGB→YIQ | 32/32 bits | EEMBC Consumer | Color conversion |
| RGB→CMYK | 16/8 bits | EEMBC Consumer | Color conversion |
| Gray Filter | 16/32 bits | EEMBC Consumer | 3x3 convolution |
| Autocorrelation | 16/32 bits | EEMBC Telecom | Dot product |

19 Scientific Kernels: SPMV Performance
- Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
- LSHAPE: finite element matrix; LARGEDIS: pseudo-random nnz
- Imagine lacks irregular access; reorder matrix before kernel
- VIRAM better suited for this class of apps (low comp/mem)

| Matrix | Rows / NNZ | Metric | VIRAM CRS | VIRAM SegSum | VIRAM Ellpck | Imagine CRS | Imagine Streams | Imagine Ellpck |
|---|---|---|---|---|---|---|---|---|
| LSHAPE | 1008 / 6958 | % Peak | 2.8% | 7.4% | 31% | 1.1% | 0.8% | 1.2% |
| | | Cycles | 67K | 24K | 5.6K | 40K | 48K | 38K |
| | | MFlop/s | 44 | 118 | 496 | 136 | 114 | 149 |
| LARGEDIS | 10000 / 117820 | % Peak | 3.2% | 8.4% | 32% | 1.5% | 0.6% | 6.3% |
| | | Cycles | 802K | 567K | 641K | 742K | 1840K | 754K |
| | | MFlop/s | 91 | 135 | 511 | 192 | 77 | 870 |
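For readers unfamiliar with the storage formats named in the table, the sketch below shows sparse matrix-vector multiply in CRS (compressed row storage) and in an Ellpack-style padded layout. It is a plain reference illustration with hypothetical function names, not the VIRAM or Imagine code; the segmented-sum and stream variants are not shown.

```python
def spmv_crs(values, col_idx, row_ptr, x):
    """y = A @ x with A in CRS form; inner loop length varies per row."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indexed (gather) access to x
        y.append(acc)
    return y


def crs_to_ellpack(values, col_idx, row_ptr):
    """Pad every row to the longest row's length (zero value, column 0)."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)), default=0)
    ell_values, ell_cols = [], []
    for r in range(n_rows):
        start, end = row_ptr[r], row_ptr[r + 1]
        pad = width - (end - start)
        ell_values.append(list(values[start:end]) + [0.0] * pad)
        ell_cols.append(list(col_idx[start:end]) + [0] * pad)
    return ell_values, ell_cols


def spmv_ellpack(ell_values, ell_cols, x):
    """y = A @ x with A in padded form; each j-step spans all rows."""
    n_rows = len(ell_values)
    width = len(ell_values[0]) if n_rows else 0
    y = [0.0] * n_rows
    for j in range(width):
        for i in range(n_rows):  # uniform-length loop, vectorizable over rows
            y[i] += ell_values[i][j] * x[ell_cols[i][j]]
    return y


if __name__ == "__main__":
    # A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
    values, col_idx, row_ptr = [4.0, 1.0, 2.0, 3.0, 5.0], [0, 2, 1, 0, 2], [0, 2, 3, 5]
    x = [1.0, 2.0, 3.0]
    print(spmv_crs(values, col_idx, row_ptr, x))
    print(spmv_ellpack(*crs_to_ellpack(values, col_idx, row_ptr), x))
```

The padded layout trades extra multiplies by zero for long, regular loops over rows, which is one plausible reading of why the Ellpck columns score highest on VIRAM.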

20 Scientific Kernels: Complex QR Decomposition
- A = QR, with Q orthogonal & R upper triangular
- Blocked Householder variant, rich in level 3 BLAS ops
- Complex elems increase ops/word & locality (1 MUL = 6 ops)
- VIRAM uses CLAPACK port (insertion of vector directives)
- Imagine: complex indexing of matrix stream (each iteration a smaller matrix)
- Imagine over 10 GFlop/s (19x VIRAM): well suited for this arch
- Low VIRAM performance due to strided access and compiler limitations

| Complex QR Decomposition | VIRAM | Imagine |
|---|---|---|
| % of Peak | 34.1% | 65.5% |
| Total cycles | 5189K | 712K |
| MFlop/s | 546 | 10480 |

Matrix: MITRE RT_STRAP, 192x96 complex
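The "1 MUL = 6 ops" figure follows from expanding a complex multiply into real arithmetic: four multiplies plus two additions/subtractions. A minimal sketch, with illustrative names:

```python
REAL_OPS_PER_COMPLEX_MUL = 6  # 4 multiplies + 2 add/subtract


def complex_multiply(a_re, a_im, b_re, b_im):
    """(a_re + i*a_im) * (b_re + i*b_im) using only real operations."""
    re = a_re * b_re - a_im * b_im  # 2 multiplies, 1 subtract
    im = a_re * b_im + a_im * b_re  # 2 multiplies, 1 add
    return re, im


if __name__ == "__main__":
    print(complex_multiply(1.0, 2.0, 3.0, 4.0))  # (1+2i)(3+4i)
```

Since the two operands occupy four words in total, each complex multiply does six real operations on four loaded words, a higher ops/word ratio than a real multiply gets from its two operands.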

21 Overview
- Significantly different balance of memory organization
- Relative performance depends on computational intensity
- Programming complexity is high for both approaches, although VIRAM is based on established vector technology
- For well-suited applications, the IMAGINE processor can sustain over 10 GFlop/s (simulated results)
- A large # of homogeneous computations is required to sufficiently saturate IMAGINE, while VIRAM can operate on small vector sizes
- IMAGINE can take advantage of producer-consumer locality
- Both present significant reductions in power and space
- May be used as coprocessors in future-generation architectures

22 Next Generation
- CODE: next generation of VIRAM
  - More functional units / faster clock speed
  - Local registers per unit instead of a single register file
  - Looking more like Imagine…
- Multi-VIRAM architecture: network interface issues?
- Brook: new language for Imagine
  - Eliminates exposure of hardware details (# of clusters)
- Streaming Supercomputer: multi-Imagine configuration
  - Streams can be used for functional/data parallelism
- Currently evaluating DIVA architecture

