Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker, Future Technologies Group, Computational Research Division, LBNL
Sourav Chatterji, Jason Duell, Manikandan Narayanan

Motivation
- Commodity cache-based SMP clusters perform at a small percentage of peak for memory-intensive problems (especially irregular ones)
- The "gap" between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr)
- Power and packaging are becoming significant bottlenecks
- Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
- Alternative architectures allow tighter integration of processor and memory

Can we build HPC systems with high-end media-processor technology?
- VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit its large bandwidth potential
- IMAGINE: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters

Motivation
- General-purpose processors are badly suited to data-intensive operations:
  - Large caches are not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP are inefficient
  - High power consumption
- Application-specific ASICs: good, but expensive and slow to design
- Solution: general-purpose "memory-aware" processors
  - Large number of ALUs to exploit data parallelism
  - Huge memory bandwidth to keep the ALUs busy
  - Concurrency: overlap memory access with computation

VIRAM Overview
- MIPS core (200 MHz)
- Main memory system: 8 banks with 13 MB of on-chip DRAM; large 6.4 GB/s on-chip peak bandwidth
- Cache-less vector unit
  - An energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single-issue, in-order
  - Low power consumption: 2.0 W
- Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 GFlop/s (single precision)
- Fabricated by IBM; taped out 02/2003
- To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
- We use a simulator with Cray's vcc compiler

VIRAM Vector Lanes
- The parallel lane design has advantages in performance, design complexity, and scalability
- Each lane has 2 ALUs (1 for FP) and receives identical control signals
- Vector instructions specify 64-way parallelism; the hardware executes 8-way
- 8 KB vector register file, partitioned into 32 vector registers
- Variable data widths: 4 lanes for 64-bit data, 8 virtual lanes for 32-bit, 16 for 16-bit
  - Each time the data width is cut in half, the number of elements per register (and peak performance) doubles
- Limitations: no 64-bit FP, and the compiler does not generate fused MADD
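To make the 64-way/8-way distinction concrete, here is a minimal strip-mining sketch in plain C (hypothetical scalar code, not vcc output): one vector instruction covers up to MVL=64 iterations for 32-bit data, which the lanes then execute 8 elements at a time.

```c
#include <stddef.h>

#define MVL 64  /* maximum vector length for 32-bit data on VIRAM */

/* Strip-mined SAXPY: each inner loop corresponds to one vector
 * instruction operating on up to MVL elements at once. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i += MVL) {
        size_t vl = (n - i < MVL) ? (n - i) : MVL;  /* set vector length */
        for (size_t j = 0; j < vl; j++)             /* one vector op */
            y[i + j] += a * x[i + j];
    }
}
```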

VIRAM Power Efficiency
- Comparable performance at a lower clock rate
- VIRAM's large power/performance advantage comes from PIM technology and the data-parallel execution model

Stream Processing
- Stream: an ordered set of records (homogeneous, arbitrary data type)
- Stream programming: data is expressed as streams, computation as kernels
  - A kernel loops through all stream elements (in sequential order)
  - It performs a compound (multi-word) operation on each stream element
  - Vectors, by contrast, perform a single arithmetic operation on each vector element (then store the result back in a register)
  - Example: stereo depth extraction
- Characteristics: data and functional parallelism, high computation rate, little data reuse, producer-consumer and spatial locality
- Example domains: multimedia, signal processing, graphics
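A minimal sketch of the distinction in plain C (the record type and kernel are hypothetical; Imagine kernels are actually written in a kernel language, not C):

```c
/* A stream is an ordered set of homogeneous records. */
typedef struct { float x, y, z; } Record;

/* Stream kernel: one compound (multi-word) operation per record. */
void stream_kernel(const Record *in, float *out, int n)
{
    for (int i = 0; i < n; i++)                   /* sequential pass */
        out[i] = in[i].x * in[i].y + in[i].z;     /* compound op     */
}

/* Vector style: the same work as a sequence of single arithmetic ops,
 * each applied across a whole vector, with results kept in registers. */
void vector_style(const float *x, const float *y, const float *z,
                  float *out, float *tmp, int n)
{
    for (int i = 0; i < n; i++) tmp[i] = x[i] * y[i];   /* vmul */
    for (int i = 0; i < n; i++) out[i] = tmp[i] + z[i]; /* vadd */
}
```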

Imagine Overview
- A "vector VLIW" processor
- Coprocessor to an off-chip host processor
- 8 arithmetic clusters controlled in SIMD with VLIW instructions
- Central 128 KB Stream Register File (SRF) with 32 GB/s of bandwidth
  - The SRF can overlap computation with memory access (double buffering)
  - The SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s of off-chip bandwidth
- 544 GB/s of inter-cluster communication
- The host sends instructions to the stream controller; the SC issues commands to the on-chip modules

Imagine Arithmetic Clusters
- 400 MHz clock, 8 clusters with 6 functional units each (48 FUs total)
- Clusters read and write streams through the SRF
- Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratchpad, and 1 communication unit
- 32-bit architecture: subword operations support 16- and 8-bit data (no 64-bit support)
- Local registers on the functional units hold 16 words each (1.5 KB total)
- Clusters receive VLIW-style instructions broadcast from the microcontroller

VIRAM and Imagine
- Imagine has an order of magnitude higher peak performance
- VIRAM has more than twice the memory bandwidth and lower power consumption
- Note the peak Flop/Word ratios

                        VIRAM        Imagine (memory)  Imagine (SRF)
  Bandwidth (GB/s)      6.4          2.7               32
  Peak Fl (32-bit)      1.6 GF/s     20 GF/s           20 GF/s
  Peak Fl/Wd            1            ~30               2.5
  Speed (MHz)           200          400
  Chip area             15x18 mm     12x12 mm
  Data widths (bits)    64/32/16     32/16/8
  Transistors           130 x 10^6   21 x 10^6
  Power consumption     2 Watts      10 Watts
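The Flop/Word ratios above follow directly from peak rate divided by word bandwidth, assuming 4-byte words: VIRAM 1.6 GF/s / (6.4 GB/s / 4 B) = 1.0; Imagine from memory 20 / (2.7/4) ≈ 30; Imagine from the SRF 20 / (32/4) = 2.5. The higher the ratio, the more arithmetic an application must perform per word fetched to approach peak.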

SQMAT Architectural Probe
- Sqmat: a scalable synthetic probe with controllable computational intensity and vector length
- Imagine's stream model requires a large number of ops per word to amortize memory references
  - Poor use of the SRF, no producer-consumer locality
  - Long streams help hide memory latency, but achieve only 7% of algorithmic peak
- VIRAM performs well at low ops/word (40% of peak when L=256)
  - The vector pipeline overlaps computation and memory; on-chip DRAM gives high bandwidth and low latency
- Core kernel: 3x3 matrix multiply
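A minimal sketch of the probe's inner kernel as we understand it from the description above (plain C; the function and parameter names are ours): each of the L matrices is squared M times, so computational intensity scales with M while L sets the vector/stream length.

```c
#define N 3  /* matrix dimension (3x3) */

/* Square each of the L matrices M times in place; the ops/word
 * ratio grows with M, the vector/stream length with L. */
void sqmat(float a[][N][N], int L, int M)
{
    float t[N][N];
    for (int k = 0; k < L; k++)
        for (int m = 0; m < M; m++) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    float s = 0.0f;
                    for (int p = 0; p < N; p++)
                        s += a[k][i][p] * a[k][p][j];
                    t[i][j] = s;
                }
            for (int i = 0; i < N; i++)       /* write the square back */
                for (int j = 0; j < N; j++)
                    a[k][i][j] = t[i][j];
        }
}
```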

SQMAT: Performance Crossover
- Large number of ops per word: each 3x3 matrix N is raised to the 10th power
- Crossover point: L=64 (cycles), L=256 (MFlop/s)
- Imagine's power becomes apparent: almost 4x VIRAM at L=1024
  - Codes at this end of the spectrum benefit greatly from the Imagine architecture

VIRAM/Imagine Optimization
- Example optimization: RGB→YIQ conversion from EEMBC
  - Input format: R1 G1 B1 R2 G2 B2 R3 G3 B3 ...
  - Required format: R1 R2 R3 ... G1 G2 G3 ... B1 B2 B3 ...
- Optimization strategy: speed up the slower of computation or memory
  - Restructure the computation for better kernel performance when memory is waiting on the ALUs
  - Add more computation for better memory performance when the ALUs are memory-starved
- Subtle overlap effects: vector chaining, stream double buffering
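For reference, a scalar C version of the conversion using the standard NTSC coefficients (the EEMBC kernel itself uses fixed-point arithmetic, so treat this as a sketch). Note the interleaved rgb[3*p+...] reads, which become the strided accesses discussed on the next slide:

```c
/* RGB -> YIQ, standard NTSC coefficients. Input is interleaved
 * RGBRGB...; output is planar Y..., I..., Q... */
void rgb_to_yiq(const unsigned char *rgb,
                float *y, float *i_, float *q, int n)
{
    for (int p = 0; p < n; p++) {
        float r = rgb[3*p], g = rgb[3*p + 1], b = rgb[3*p + 2];
        y[p]  =  0.299f*r + 0.587f*g + 0.114f*b;
        i_[p] =  0.596f*r - 0.274f*g - 0.322f*b;
        q[p]  =  0.211f*r - 0.523f*g + 0.312f*b;
    }
}
```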

VIRAM RGB→YIQ Optimization
- VIRAM: poor memory performance
  - Strided accesses (~1/2 performance): RGBRGBRGB... loaded with stride 3 → RRR...GGG...BBB...
  - Only 4 address generators for 8 addresses (sufficient for 64-bit data, not 32-bit)
  - Word operations on byte data (1/4 performance)
- Optimization: replace strided accesses with unit-stride accesses, using an in-register shuffle
  - This increases computational overhead (packing and unpacking)
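A scalar sketch of the idea (hypothetical plain C; on VIRAM the separation is done with in-register shuffles rather than scalar indexing): the interleaved input is traversed sequentially with unit stride, and extra ALU work separates the components instead of strided loads.

```c
/* Read interleaved RGB sequentially (unit stride) and unpack into
 * planar R/G/B arrays, trading ALU work for strided memory access. */
void deinterleave_rgb(const unsigned char *rgb, unsigned char *r,
                      unsigned char *g, unsigned char *b, int n)
{
    for (int p = 0; p < n; p++) {   /* rgb is traversed sequentially */
        r[p] = rgb[3*p];
        g[p] = rgb[3*p + 1];
        b[p] = rgb[3*p + 2];
    }
}
```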

VIRAM RGB→YIQ Results
- Functional units, rather than memory, are used to extract the components, increasing the computational overhead

  VIRAM          Kernel (cycles)   Memory (cycles)
  Unoptimized    114               95
  Optimized      108               17
  Chunk size: 64

Imagine RGB→YIQ Optimization
- The Imagine bottleneck is computation, due to a poor ALU schedule
  - Unoptimized: 15 cycles per pixel
- Software pipelining makes the VLIW schedule denser
  - Optimized: 8 cycles per pixel
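A minimal illustration of software pipelining in plain C (a hypothetical two-stage example, not the actual Imagine kernel): the loop body is split into stages so that stage 1 of iteration k can be scheduled alongside stage 2 of iteration k-1, filling otherwise empty VLIW slots.

```c
/* Two-stage software pipeline: the multiply of iteration k overlaps
 * the add/store of iteration k-1, densifying a VLIW schedule. */
void swp(const float *a, const float *b, float *c, int n)
{
    if (n <= 0) return;
    float t = a[0] * b[0];          /* prologue: stage 1 of iteration 0 */
    for (int k = 1; k < n; k++) {
        float u = a[k] * b[k];      /* stage 1 of iteration k   */
        c[k - 1] = t + 1.0f;        /* stage 2 of iteration k-1 */
        t = u;
    }
    c[n - 1] = t + 1.0f;            /* epilogue: stage 2 of the last iteration */
}
```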

Imagine RGB→YIQ Results
- The optimized kernel takes only half the cycles per element (8 vs. 15 cycles per pixel, chunk size 1024)
- Memory is now the new bottleneck

EEMBC Benchmarks
- Vec-add: one add per element; performance limited by the memory system
- RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use the available bandwidth)
- Gray filter: difficult to implement efficiently on Imagine (sliding 3x3 window); see the sketch below
- Autocorrelation: uses short streams; Imagine's host latency is high

  Benchmark        Width (VIRAM/Imagine)  Application Area  Remarks
  Vector addition  32/32 bits             Microbenchmark    c[i]=a[i]+b[i]
  RGB→YIQ          32/32 bits             EEMBC Consumer    Color conversion
  RGB→CMYK         16/8 bits              EEMBC Consumer    Color conversion
  Gray filter      16/32 bits             EEMBC Consumer    3x3 convolution
  Autocorrelation  16/32 bits             EEMBC Telecom     Dot product
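A plain-C sketch of the gray filter's access pattern (the uniform 3x3 weights are a stand-in; the benchmark's actual coefficients may differ): each output pixel reads a sliding 3x3 window, so neighboring "records" overlap, which is awkward to express as a stream of independent elements.

```c
/* 3x3 convolution with uniform weights: each output pixel depends
 * on a sliding window of 9 inputs, so stream records overlap. */
void gray_filter(const unsigned char *in, unsigned char *out,
                 int w, int h)
{
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            int s = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    s += in[(y + dy) * w + (x + dx)];
            out[y * w + x] = (unsigned char)(s / 9);
        }
}
```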

Scientific Kernels: SpMV Performance
- Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
- LSHAPE: a finite element matrix; LARGEDIS: pseudo-random nonzeros
- Imagine lacks irregular access, so the matrix is reordered before the kernel executes
- VIRAM is better suited to this class of applications (low computation relative to memory traffic)

  Matrix    Metric   VIRAM CRS  VIRAM SegSum  VIRAM Ellpack  Imagine CRS  Imagine Streams  Imagine Ellpack
  LSHAPE    % Peak   2.8%       7.4%          31%            1.1%         0.8%             1.2%
            Cycles   67K        24K           5.6K           40K          48K              38K
  LARGEDIS  % Peak   3.2%       8.4%          32%            1.5%         0.6%             6.3%
            Cycles   802K       567K          641K           742K         1840K            754K
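For reference, the CRS (compressed row storage) variant of the kernel in standard C form (a textbook formulation, not the exact benchmark code); the x[col[j]] gather is the irregular access that Imagine's stream model handles poorly:

```c
/* Sparse matrix-vector multiply, y = A*x, in CRS format:
 * val/col hold the nonzeros, ptr[i]..ptr[i+1] bounds row i. */
void spmv_crs(int nrows, const int *ptr, const int *col,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < nrows; i++) {
        float sum = 0.0f;
        for (int j = ptr[i]; j < ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* irregular gather from x */
        y[i] = sum;
    }
}
```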

Scientific Kernels: Complex QR Decomposition
- A = QR, where Q is orthogonal and R is upper triangular
- Blocked Householder variant: rich in level-3 BLAS operations
- Complex elements increase ops/word and locality (1 complex MUL = 6 ops)
- VIRAM uses a CLAPACK port (insertion of vector directives)
- Imagine: complex indexing of the matrix stream (each iteration operates on a smaller matrix)
- Imagine sustains over 10 GFlop/s (19x VIRAM); the kernel is well suited to this architecture
  - Low VIRAM performance is due to strided access and compiler limitations

  Matrix: MITRE RT_STRAP, 192x96 complex
                 VIRAM    Imagine
  % of Peak      34.1%    65.5%
  Total Cycles   5189K    712K
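The "1 complex MUL = 6 ops" count refers to the standard formulation, 4 real multiplies plus 2 real adds, as this small C99 illustration shows:

```c
#include <complex.h>

/* (a+bi)(c+di) = (ac - bd) + (ad + bc)i: 4 multiplies + 2 adds,
 * i.e. 6 flops per complex multiply, raising ops/word and locality. */
static inline float complex cmul(float complex u, float complex v)
{
    float a = crealf(u), b = cimagf(u);
    float c = crealf(v), d = cimagf(v);
    return (a*c - b*d) + (a*d + b*c) * I;
}
```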

Overview
- The two designs embody significantly different balances of memory organization
- Relative performance depends on computational intensity
- Programming complexity is high for both approaches, although VIRAM builds on established vector technology
- For well-suited applications, the Imagine processor can sustain over 10 GFlop/s (simulated results)
- A large amount of homogeneous computation is required to saturate Imagine, while VIRAM can operate on small vector sizes
- Imagine can take advantage of producer-consumer locality
- Both offer significant reductions in power and space
- Both may be used as coprocessors in future-generation architectures

Next Generation
- CODE: the next generation of VIRAM
  - More functional units / faster clock speed
  - Local registers per unit instead of a single register file
  - Looking more like Imagine...
- Multi-VIRAM architecture: network interface issues?
- Brook: a new language for Imagine
  - Eliminates exposure of hardware details (e.g., the number of clusters)
- Streaming Supercomputer: a multi-Imagine configuration
  - Streams can be used for functional and data parallelism
- Currently evaluating the DIVA architecture