Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting DLP SIMD architectures.

Presentation transcript:

Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting DLP SIMD architectures

6/12/2015 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
[Figure: flexibility versus efficiency. Design points shown: programmable CPU, programmable DSP, application-specific instruction set processor (ASIP), application-specific processor.]

SIMD Performance
[Figure: computational efficiency [MOPS/W] versus feature size [um] for application-specific cores, programmable processors, and SIMD. Source: [Roza].]

What are we talking about?
ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
VLIW = Very Long Instruction Word architecture
Instruction format: operation 1 | operation 2 | operation 3 | operation 4 | operation 5

SIMD: Topics Overview
Enhance performance: architecture methods
Data Level Parallelism
– Application area
– Subword parallelism
Locally connected SIMDs
– Xetal
Fully connected SIMDs
– Imagine
Communication in SIMD processors
– RCSIMD
– DCSIMD

Enhance performance: 3 architecture methods
(Super)-pipelining
Powerful instructions
– MD-technique: multiple data operands per operation
– MO-technique: multiple operations per instruction
Multiple instruction issue

Characteristics of Media Applications
Poorly matched to conventional architectures
– Caches
– Instruction-Level Parallelism
– Few arithmetic units
Well-matched to modern VLSI technology
– Lots (100’s) of ALUs fit on a single chip
Communication bandwidth is the scarce resource

Architecture methods: Powerful Instructions (1)
MD-technique: multiple data operands per operation
SIMD: Single Instruction Multiple Data
Vector instruction: the loop
for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];
becomes the single vector expression c = a + 5*b
Assembly:
set vl,64
ldv v1,0(r2)
mulvi v2,v1,5
ldv v1,0(r1)
addv v3,v1,v2
stv v3,0(r3)
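As a rough illustration of what the vector assembly on this slide does, the sketch below walks through the same five vector instructions in plain Python. It assumes that r1, r2 and r3 point at a, b and c respectively (the slide does not say so explicitly), and the strip loop for inputs longer than the vector length is an addition for illustration, not something shown in the slides. The function name `vector_axpy` is hypothetical.

```python
def vector_axpy(a, b, scale=5, vl=64):
    """Compute c = a + scale*b, processing at most vl elements per vector step.

    Mirrors the slide's instruction sequence; the register-to-array mapping
    (r1 -> a, r2 -> b, r3 -> c) is an assumption.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    c = []
    for start in range(0, len(a), vl):       # set vl,64 (one strip per pass)
        v1 = b[start:start + vl]             # ldv   v1,0(r2)
        v2 = [scale * x for x in v1]         # mulvi v2,v1,5
        v1 = a[start:start + vl]             # ldv   v1,0(r1)
        v3 = [x + y for x, y in zip(v1, v2)] # addv  v3,v1,v2
        c.extend(v3)                         # stv   v3,0(r3)
    return c
```

With 64 elements the loop body runs once, so five vector instructions replace 64 scalar iterations of load, multiply, load, add and store.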

Architecture methods: Powerful Instructions (1)
SIMD computing
Exploit data locality of e.g. image processing applications
Effect on code size?
Effect on power consumption?
[Figure: SIMD execution method. Over time, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are executed across node 1, node 2, ..., node K.]

Architecture methods: Powerful Instructions (1)
Sub-word parallelism
– SIMD on restricted scale
– Used for multi-media instructions
– Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
Examples
– MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow, Trimedia II
– Example: $\sum_{i=1}^{4} |a_i - b_i|$
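The sum-of-absolute-differences example on this slide can be sketched on packed 16-bit lanes as follows. This is only a software model of the idea: the lane order (lane 0 in the low bits) and all function names are assumptions for illustration, and a real multimedia instruction would do the four lane operations in one step in hardware.

```python
MASK16 = 0xFFFF  # one 16-bit lane


def pack4x16(lanes):
    """Pack four unsigned 16-bit values into one 64-bit word (lane 0 lowest)."""
    if len(lanes) != 4:
        raise ValueError("exactly four lanes expected")
    word = 0
    for i, value in enumerate(lanes):
        if not 0 <= value <= MASK16:
            raise ValueError("lane value does not fit in 16 bits")
        word |= value << (16 * i)
    return word


def unpack4x16(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (16 * i)) & MASK16 for i in range(4)]


def packed_absdiff(wa, wb):
    """Lane-wise |a_i - b_i|; nothing carries across a lane boundary."""
    out = 0
    for i in range(4):
        a = (wa >> (16 * i)) & MASK16
        b = (wb >> (16 * i)) & MASK16
        out |= abs(a - b) << (16 * i)
    return out


def sad4(wa, wb):
    """Sum of the four lane-wise absolute differences."""
    return sum(unpack4x16(packed_absdiff(wa, wb)))
```

The point of the model is that the 64-bit datapath is reused unchanged; only the carry chain is cut at the lane boundaries.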

LC-SIMD
LC-SIMD (locally connected; e.g. Xetal, Imap) -> long communication delays: shift operations
[Figure: PE0, PE1, PE2, ..., PE319 on a shared instruction bus; memory with one wide port.]

FC-SIMD
FC-SIMD (fully connected; Imagine) -> expensive communication network
[Figure: PE0, PE1, PE2, ..., PE319 on a shared instruction bus, joined by a fully connected communication network.]
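The trade-off between the locally connected and fully connected organisations can be made concrete with a small cost model. The sketch below assumes a linear array of 320 PEs (PE0 to PE319, as labelled in the figures), one neighbour hop per shift operation, and a single step for any transfer over a full network. These costs are simplifying assumptions for illustration, not figures from the slides.

```python
NUM_PES = 320  # PE0 .. PE319, as labelled in the figures


def lc_steps(src, dst):
    """Shift operations needed in a locally connected array (one hop per shift)."""
    return abs(dst - src)


def fc_steps(src, dst):
    """Transfer steps over a fully connected network (assumed single step)."""
    return 0 if src == dst else 1


def shift_right(values, fill=None):
    """One SIMD shift: every PE passes its value to its right-hand neighbour."""
    return [fill] + values[:-1]
```

Under these assumptions the worst case for the locally connected array is `lc_steps(0, NUM_PES - 1)`, i.e. 319 shifts, against one step for the fully connected network, which pays for that with a much larger interconnect.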

LC: Xetal
Objectives
High degree of system integration -> CMOS imaging + DSP -> low-cost camera systems
Low power consumption -> mobile & remote sensing
Flexibility -> programmable DSP and control functions

Xetal Architecture

Global Controller
Tuned for Xetal architecture functions:
– loop/iteration control
– system synchronization
– exposure-time control
– white balancing
– ...

Xetal Architecture

Parallel Processing (SIMD)
– 2 columns per processor
– neighbour communication
– low-speed clock (16 MHz)
– clock gating
– shared address decoding
– minimal memory read access
-> LOW POWER

Parallel Processing (Contd.)

Xetal Specs & Performance

Simulation Results (1-input)

Simulation Results (1-output)

Simulation Results (2)

Imagine
Combining DLP (SIMD) and ILP (VLIW)
– top level: SIMD
– per PE: VLIW

Imagine: Representative Applications
– Stereo Depth Extraction
– Polygon Rendering
– MPEG Encoding/Decoding
[Figure labels: Encoded 2D Data, 2D Video Stream, Encode/Decode, Render.]

Stream Processing
Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (60 arithmetic ops per memory reference)
[Figure: stream input data (Image 0, Image 1), convolve kernels, SAD kernel, output data (Depth Map).]

Stream Architecture Provides Data Bandwidth Hierarchy
Peak BW: SDRAM 2 GB/s; Stream Register File 32 GB/s; ALU Cluster 544 GB/s
[Figure: SDRAM, Stream Register File, and ALU Clusters under SIMD/VLIW control.]
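The three peak bandwidth figures on this slide are easier to compare as ratios. The sketch below simply normalises them to the SDRAM level. Attaching 544 GB/s to the ALU cluster level follows the order of the labels in the transcript and is an assumption about the original figure.

```python
# Peak bandwidths from the slide, in GB/s.
PEAK_BW_GBPS = {
    "SDRAM": 2,
    "Stream Register File": 32,
    "ALU Cluster": 544,
}


def bandwidth_ratios(levels, base="SDRAM"):
    """Express every level's peak bandwidth relative to the base level."""
    return {name: bw / levels[base] for name, bw in levels.items()}
```

Normalised this way the hierarchy reads 1 : 16 : 272, which is why an application needs to keep most of its traffic inside the clusters and the stream register file rather than in SDRAM.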

Application Data: Bandwidth Usage
[Figure: application bandwidth usage across the hierarchy: SDRAM (2 GB/s), Stream Register File (32 GB/s), ALU Cluster (544 GB/s).]

Stream Register File: Details
SRF: single-ported 128 KB SRAM (1024 x 32W)
Stream buffers: 32W/cycle
Arbiter
To/From: Arithmetic Clusters, I/O, Interprocessor communication, and Main Memory
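The SRF organisation and its capacity can be cross-checked with a one-line calculation. The sketch assumes that "W" means a 32-bit (4-byte) word, which the slide does not state explicitly; the function names are hypothetical.

```python
def srf_words(rows=1024, words_per_row=32):
    """Total number of words in a rows x words_per_row SRAM."""
    return rows * words_per_row


def srf_bytes(rows=1024, words_per_row=32, bytes_per_word=4):
    """Capacity in bytes, assuming 4-byte words."""
    return srf_words(rows, words_per_row) * bytes_per_word
```

Under that assumption 1024 x 32 words is 32K words, or 128 KB, which agrees with both the 128 KB figure here and the 32 kW figure given for the SRF in the Imagine block diagram.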

Arithmetic Cluster: Details
Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
– 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
– 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
[Figure: adders (+), multipliers (*), and a divider (/) with local register files, a cross point, the CU, the intercluster network, and the From SRF / To SRF ports.]
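The difference between latency and issue rate for the divider can be worked through with a simple pipeline formula: the first result appears after the latency, and each further result follows one initiation interval later. The sketch below applies it to the slide's FDIV numbers. Treating the 4-cycle units as accepting one operation per cycle is an assumption, since the slide only calls them "pipelined".

```python
def pipelined_cycles(n_ops, latency, interval):
    """Cycles until the last of n_ops back-to-back operations completes.

    latency  -- cycles from issue to result for one operation
    interval -- cycles between successive issues on the same unit
    """
    if n_ops <= 0:
        return 0
    return latency + (n_ops - 1) * interval
```

For ten independent divides on one unit this gives 17 + 9 * 7 = 80 cycles, against 170 cycles if the divider were not pipelined at all.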

The Imagine Stream Processor
[Figure: Host Processor, Stream Controller, Microcontroller (2K VLIW instructions), ALU Cluster 0 to ALU Cluster 7, Stream Register File (32 kW SRAM), Streaming Memory System with SDRAM, and Network Interface to the network.]

Imagine Floorplan
22 million transistors
500 MHz
TI GS30KA:
– 0.15 µm L drawn
– 0. µm L eff
– CMOS process

Imagine Programming Environment
StereoDepthExtraction(…) {
  // Load Input Images
  ...
  // Run Kernels
  convolve7x7(RawImage, ConvImage);
  convolve3x3(ConvImage, Conv2Image);
  ...
  // Store Output
}
Convolve7x7(…) {
  ...
  while (!In.empty()) {
    ...
    p0 = k0 * in10;
    p12 = k21 * in32;
    p34 = k43 * in54;
    p56 = k65 * in76;
    sum = (p0 + p12) + (p34 + p56);
    ...
  }
}

RC-SIMD
Imagine supports a full interconnect between PEs
Do we need this expensive interconnect?
Alternative: RC-SIMD

Basic template of communication architecture
[Figure: PEs on a shared instruction bus, with elements S0, S1, S2 between neighbouring PEs (labels PE1, PE2, PE3 legible).]

Example: 4-tap filter
Operations: LD 0, * C0, LD +1, * C1, LD +2, * C2, LD +3, * C3, + (sum), ST
Schedule (table columns in the slide: PE 0, PE 1, PE 2, PE 3):
cycle  operation
0      LD 0
1      * C0
2      LD +1
3      * C1
4      LD +2
5      * C2
6      LD +3
7      * C3
8      sum
9      ST

Example
Schedule over cycles 0 to 9: LD 0, * C0, LD +1, * C1, LD +2, * C2, LD +3, * C3, sum, ST
Resource sharing conflict. How to solve?
Pipeline (shift 1 cycle)
[Figure: PE0 to PE3 with switches S0, S1, S2, each with inputs 1 and 0.]

RC-SIMD: Basic architecture
Schedule with delay-line
cycle  PE 0   PE 1   PE 2   PE 3
0      LD 0   -      -      -
1      * C0   LD 0   -      -
2      LD +1  * C0   LD 0   -
3      * C1   LD +1  * C0   LD 0
4      LD +2  * C1   LD +1  * C0
5      * C2   LD +2  * C1   LD +1
6      LD +3  * C2   LD +2  * C1
7      * C3   LD +3  * C2   LD +2
8      sum    * C3   LD +3  * C2
9      ST     sum    * C3   LD +3
10     -      ST     sum    * C3
11     -      -      ST     sum
12     -      -      -      ST
[Figure: PE0 to PE3 with switches S0, S1, S2 (inputs 1 and 0) and a delay element.]
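The delay-line schedule on this slide follows a simple rule: every PE runs the same operation sequence as PE 0, one cycle later than its left-hand neighbour. The sketch below generates such a table from the PE 0 sequence of the 4-tap filter example. The function name and the `delay` parameter are illustrative assumptions, and the sketch does not check for resource conflicts.

```python
# Operation sequence of PE 0 in the 4-tap filter example.
PE0_OPS = ["LD 0", "* C0", "LD +1", "* C1", "LD +2",
           "* C2", "LD +3", "* C3", "sum", "ST"]


def skewed_schedule(ops, num_pes, delay=1):
    """Build a cycle x PE table where PE k starts k*delay cycles after PE 0.

    Idle slots are marked with "-".
    """
    total_cycles = len(ops) + delay * (num_pes - 1)
    table = []
    for cycle in range(total_cycles):
        row = []
        for pe in range(num_pes):
            idx = cycle - delay * pe  # position of this PE in its own sequence
            row.append(ops[idx] if 0 <= idx < len(ops) else "-")
        table.append(row)
    return table


if __name__ == "__main__":
    for cycle, row in enumerate(skewed_schedule(PE0_OPS, 4)):
        print(cycle, *row, sep="\t")
```

For 4 PEs the table has 10 + 3 = 13 rows (cycles 0 to 12). The same rule applied to 320 PEs stretches the schedule by 319 cycles, which is the skew between the first and last PE.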

Conflict model
Schedule PE0 (using FACTS)
Node: resource usage
Sequence edge: timing dependency
Fact tools
Move problem from hardware to software
[Figure: conflict graph with nodes such as Ld +2, S0, S1; PE0 to PE3 with switches S0, S1, S2 (inputs 1 and 0) and a delay element.]

Basic architecture
Valid schedule
cycle  PE 0   PE 1   PE 2   PE 3
0      LD +3  -      -      -
1      LD +1  LD +3  -      -
2      LD 0   LD +1  LD +3  -
3      * C0   LD 0   LD +1  LD +3
4      LD +2  * C0   LD 0   LD +1
5      * C1   LD +2  * C0   LD 0
6      * C2   * C1   LD +2  * C0
7      * C3   * C2   * C1   LD +2
8      sum    * C3   * C2   * C1
9      ST     sum    * C3   * C2
10     -      ST     sum    * C3
11     -      -      ST     sum
12     -      -      -      ST
[Figure: PE0 to PE3 with switches S0, S1, S2 (inputs 1 and 0) and a delay element.]

Drawback
– Cycle delay between PE0 & PE319
– Size of conflict model (compile time)
[Figure: staggered issue of Ins 1 to Ins 4 across PE 0, PE 1, PE 2, PE 3, ..., PE 319.]

Update Architecture
[Figure: issue of Ins 1 to Ins 4 over Cycle 1 to Cycle 7 across PE 0 to PE 6.]

Updated RC-SIMD Architecture

Results of mapping several kernels
Table columns: Kernel; Number of operations; Number of cycles in LC-SIMD; Number of cycles in FC-SIMD; Number of cycles in RC-SIMD; Communication overhead in LC-SIMD (%); Cycle improvement compared to LC-SIMD
Kernels: 4-tap filter, Image sub-sampling, Convolution 7x, Haar filter, FFT

Imap

Imap

Difficult SIMD Applications
Algorithms need dynamic communication:
– lens distortion
– bucket processing
– mirroring, …

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7, connected by Bus_0, Bus_1, Bus_2. Example transfers: PE_6 -> PE_3, PE_4 -> PE_2.]
Message format: V | dst-add | data | src-add

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7, connected by Bus_0, Bus_1, Bus_2.]
Larger distance: PE_7 -> PE_1

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7, connected by Bus_0, Bus_1, Bus_2. Competing transfers: PE_7 -> PE_5, PE_6 -> PE_2.]
Priority

DC-SIMD: arbitration
Message format: V | des-add | data | src-add
Write: give priority to further PEs
Read: PE n, PE n+1, PE n+2
Next reg. select (ab):
source  ab
v       00
n+2     01
n+1     10
n       11
a = v'.2'
b = a'.v' + a.1'
Selection conditions:
n+2 : 2.v'
n+1 : (2+v)'.1
n   : (1+2+v)'.0
Buffer instruction: PEid
[Figure labels: PE, xor, Read data.]
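The select logic on this slide can be checked against its own encoding table with a short model. The sketch below evaluates a = v'.2' and b = a'.v' + a.1' and maps the resulting (a, b) pair to a source. Reading v as the valid bit of the message already in flight, and 2 and 1 as request flags from PE n+2 and PE n+1, is an interpretation of a partly garbled slide, and the function names are hypothetical.

```python
# Encoding table from the slide: (a, b) -> selected source.
SOURCES = {(0, 0): "v", (0, 1): "n+2", (1, 0): "n+1", (1, 1): "n"}


def select_bits(v, req2, req1):
    """Evaluate a = v'.2' and b = a'.v' + a.1' for the given inputs."""
    a = (not v) and (not req2)
    b = ((not a) and (not v)) or (a and (not req1))
    return int(a), int(b)


def selected_source(v, req2, req1):
    """Name of the source chosen by the select bits."""
    return SOURCES[select_bits(v, req2, req1)]
```

Evaluated this way the equations give a strict priority order v, then n+2, then n+1, then n, which is consistent with the slide's rule of giving priority to the PEs that are further away.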

Measurements: Area overhead

Measurements: Performance

Measurements: required instruction buffer size

In PS3: CELL architecture

CELL Highlights
Observed clock speed: > 4 GHz
Peak performance (single precision): > 256 GFlops
Peak performance (double precision): > 26 GFlops
Area: 221 mm2
Technology: 90 nm SOI
Total number of transistors: 234M
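The single-precision peak figure can be reproduced from the clock speed with a back-of-the-envelope calculation. The sketch assumes 8 SIMD processing elements, 4 single-precision lanes each, and a fused multiply-add counted as 2 floating-point operations per lane per cycle. None of these three numbers appear on the slide; they are stated here only as assumptions for the arithmetic.

```python
def peak_gflops(units=8, lanes=4, flops_per_lane=2, clock_ghz=4.0):
    """Peak GFlops = units x lanes x flops per lane per cycle x clock (GHz)."""
    return units * lanes * flops_per_lane * clock_ghz
```

With the defaults this is 8 x 4 x 2 x 4 = 256 GFlops, matching the slide's "> 256 GFlops" at "> 4 GHz".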

Conclusions
SIMD nicely matches
– Image applications: data-level parallelism
– VLSI efficiency: copy-paste of simple elements
So
– Very efficient architecture for image processing
– Low power! Also by trading off clock vs performance
But
– Programmer is burdened with vector thinking
– Vectorizing compilers are not good at recognizing opportunities for vector execution
– Need for a “control” processor for control code and if-then-else
Communication is a problem:
– Dimensioned for peak BW requirements -> RCSIMD
– Unable to perform indirect PE addressing -> DCSIMD