Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting DLP SIMD architectures.


6/12/2015 Processor Architectures and Program Mapping, H. Corporaal, J. van Meerbergen, and B. Mesman
[Figure: flexibility versus efficiency for a programmable CPU, a programmable DSP, an application-specific instruction set processor (ASIP), and an application-specific processor]

SIMD Performance
[Figure: computational efficiency [MOPS/W] (10^0 to 10^6) versus feature size [µm] (2 down to 0.07) for application-specific cores, programmable processors, and SIMD. Source: Roza]

What are we talking about?
- ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
- VLIW = Very Long Instruction Word architecture
- Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

SIMD: Topics Overview
- Enhance performance: architecture methods
- Data Level Parallelism
  - Application area
  - Subword parallelism
- Locally connected SIMDs
  - Xetal
- Fully connected SIMDs
  - Imagine
- Communication in SIMD processors
  - RCSIMD
  - DCSIMD

Enhance performance: 3 architecture methods
- (Super)-pipelining
- Powerful instructions
  - MD-technique: multiple data operands per operation
  - MO-technique: multiple operations per instruction
- Multiple instruction issue

Characteristics of Media Applications
- Poorly matched to conventional architectures
  - Caches
  - Instruction-Level Parallelism
  - Few arithmetic units
- Well-matched to modern VLSI technology
  - Lots (100's - 1000's) of ALUs fit on a single chip
- Communication bandwidth is the scarce resource

Architecture methods: Powerful Instructions (1)
- MD-technique: multiple data operands per operation
- SIMD: Single Instruction Multiple Data

Vector instruction:
    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];
becomes the single vector expression
    c = a + 5*b

Assembly:
    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
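The mapping from the scalar loop to the vector sequence can be sketched in plain C. This is only an illustration under assumptions: the function names are hypothetical, and the register roles (r1 points to a, r2 to b, r3 to c) are inferred from the order of the assembly on the slide. Each inner loop stands in for one vector instruction operating on up to 64 elements.

```c
#include <assert.h>
#include <stddef.h>

#define VL 64  /* maximum vector length, as in "set vl,64" */

/* Scalar reference: one multiply and one add issued per element. */
void vec_scalar(int *c, const int *a, const int *b, size_t n)
{
    assert(c && a && b);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 5 * b[i];
}

/* Strip-mined form: each pass of the outer loop corresponds to one
   run of the vector instruction sequence over at most VL elements. */
void vec_stripmined(int *c, const int *a, const int *b, size_t n)
{
    assert(c && a && b);
    for (size_t base = 0; base < n; base += VL) {
        size_t vl = (n - base < VL) ? (n - base) : VL;          /* set vl      */
        int v1[VL], v2[VL], v3[VL];                             /* vector regs */
        for (size_t i = 0; i < vl; i++) v1[i] = b[base + i];    /* ldv v1,0(r2)   */
        for (size_t i = 0; i < vl; i++) v2[i] = v1[i] * 5;      /* mulvi v2,v1,5  */
        for (size_t i = 0; i < vl; i++) v1[i] = a[base + i];    /* ldv v1,0(r1)   */
        for (size_t i = 0; i < vl; i++) v3[i] = v1[i] + v2[i];  /* addv v3,v1,v2  */
        for (size_t i = 0; i < vl; i++) c[base + i] = v3[i];    /* stv v3,0(r3)   */
    }
}
```

Both functions should produce identical results for any n; the strip-mined version only makes explicit that six vector instructions replace 64 iterations of the scalar loop body.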

Architecture methods: Powerful Instructions (1)
SIMD computing
- Exploit data locality of e.g. image processing applications
- Effect on code size?
- Effect on power consumption?
[Figure: SIMD execution method; over time, instruction 1 to instruction n are each executed by node 1 to node K]

Architecture methods: Powerful Instructions (1)
Sub-word parallelism
- SIMD on a restricted scale
- Used for multimedia instructions
- Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
Examples
- MMX, SUN-VIS, HP MAX-2, AMD K7/Athlon 3DNow!, Trimedia II
- Example: Σ_{i=1..4} |a_i - b_i|
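The sum-of-absolute-differences example can be illustrated with a small C emulation of a packed operation on four 16-bit lanes held in one 64-bit word. This is a sketch, not the semantics of any specific instruction set listed above; the function names and the lane order (lane 0 in the low bits) are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned 16-bit lanes into one 64-bit word, lane 0 in the low bits. */
uint64_t pack4_u16(const uint16_t v[4])
{
    assert(v);
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)v[i] << (16 * i);
    return w;
}

/* Emulate one packed operation: sum over the four lanes of |a_i - b_i|.
   Hardware would compute the four differences in parallel in a single ALU pass. */
uint32_t sad4_u16(uint64_t a, uint64_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t ai = (uint16_t)(a >> (16 * i));
        uint16_t bi = (uint16_t)(b >> (16 * i));
        sum += (ai > bi) ? (uint32_t)(ai - bi) : (uint32_t)(bi - ai);
    }
    return sum;
}
```

The point of the packed form is that one 64-bit datapath handles four pixel-sized operands per instruction, which is why the technique suits multimedia kernels such as block matching.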

LC-SIMD
LC-SIMD (locally connected; e.g. Xetal, Imap)
→ long communication delays: shift operations
[Figure: PE0, PE1, PE2, ..., PE319 on a shared instruction bus, connected to memory through one wide port]
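Why local connections imply long delays can be seen in a small model: if each PE can only read its direct neighbour, fetching a value from distance d takes d shift steps. The sketch below assumes one shift per cycle and a fixed boundary value at the array edge; both the function names and those assumptions are illustrative and not taken from Xetal or Imap.

```c
#include <assert.h>
#include <stddef.h>

/* One shift step: every PE receives the value held by its right-hand
   neighbour; the last PE receives the boundary value `fill`. */
void shift_from_right(int *pe, size_t n, int fill)
{
    assert(pe != NULL || n == 0);
    for (size_t i = 0; i + 1 < n; i++)
        pe[i] = pe[i + 1];
    if (n > 0)
        pe[n - 1] = fill;
}

/* Give every PE the value that sat d positions to its right.
   Returns the number of shift steps used, which equals d. */
size_t fetch_at_distance(int *pe, size_t n, size_t d, int fill)
{
    for (size_t s = 0; s < d; s++)
        shift_from_right(pe, n, fill);
    return d;
}
```

In this model the communication cost grows linearly with distance, which is the overhead that the fully connected and reconfigurable variants later in the deck try to avoid.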

FC-SIMD
FC-SIMD (fully connected; Imagine)
→ expensive communication network
[Figure: PE0, PE1, PE2, ..., PE319 on a shared instruction bus, linked by a fully connected communication network]

LC: Xetal Objectives
- High degree of system integration → CMOS imaging + DSP → low-cost camera systems
- Low power consumption → mobile & remote sensing
- Flexibility → programmable DSP and control functions

[Figure: Xetal Architecture]

Global Controller
Tuned for Xetal architecture functions:
- loop/iteration control
- system synchronization
- exposure-time control
- white balancing
- ...


Parallel Processing (SIMD)
- 2 columns/processor
- neighbour communication
- low-speed clock (16 MHz)
- clock gating
- shared address decoding
- minimal memory read access
→ LOW POWER

Parallel Processing (Contd.)

Xetal Specs & Performance

Simulation Results (1-input)

Simulation Results (1-output)

Simulation Results (2)

Imagine
Combining DLP (SIMD) and ILP (VLIW)
- top level: SIMD
- per PE: VLIW

Imagine: Representative Applications
- Stereo Depth Extraction
- Polygon Rendering
- MPEG Encoding/Decoding
[Figure labels: 2D video stream, encode/decode, encoded 2D data, render]

Stream Processing
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels not dependent on other output pixels)
- Compute intensive (60 arithmetic ops per memory reference)
[Figure: stream program with input data Image 0 and Image 1, each passing through convolve kernels, followed by a SAD kernel that produces the output Depth Map. Legend: kernel, stream, input data, output data]

Stream Architecture Provides Data Bandwidth Hierarchy
Peak BW:
- SDRAM to Stream Register File: 2 GB/s
- Stream Register File to ALU clusters: 32 GB/s
- Within the ALU clusters: 544 GB/s
[Figure: SDRAM, Stream Register File, and ALU clusters under SIMD/VLIW control]

Application Data: Bandwidth Usage
[Figure: bandwidth hierarchy of SDRAM (2 GB/s), Stream Register File (32 GB/s), and ALU cluster (544 GB/s)]

Stream Register File: Details
- SRF: single-ported 128 KB SRAM (1024 x 32W)
- Stream buffers
- 32W/cycle
- Arbiter
- To/from: arithmetic clusters, I/O, interprocessor communication, and main memory

Arithmetic Cluster: Details
[Figure: arithmetic cluster with adders (+), multipliers (*), a divider (/), a CU attached to the intercluster network, local register files, a cross point, and ports from and to the SRF]
Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
- 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
- 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
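The difference between latency and issue rate quoted for FDIV can be made concrete with the usual pipelining formula: n back-to-back independent operations finish after latency + (n - 1) * interval cycles. The helper below is a hypothetical sketch of that arithmetic; the assumption that the 4-cycle units accept a new operation every cycle is mine, since the slide only says they are pipelined.

```c
#include <assert.h>

/* Cycles needed to complete n independent operations on a unit with the
   given result latency and initiation interval (cycles between issues). */
unsigned long pipelined_cycles(unsigned long n, unsigned latency, unsigned interval)
{
    assert(latency > 0 && interval > 0);
    if (n == 0)
        return 0;
    return latency + (n - 1) * (unsigned long)interval;
}
```

With the FDIV numbers from the slide, `pipelined_cycles(10, 17, 7)` covers ten divisions, compared with 10 * 17 cycles if the divider were not pipelined at all.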

The Imagine Stream Processor
[Figure: block diagram of the Imagine Stream Processor]
- Host processor
- Stream controller
- Microcontroller: 2K VLIW instructions
- ALU clusters 0 to 7
- Stream Register File: 32 kW SRAM
- Streaming memory system, connected to SDRAM
- Network interface, connected to the network

Imagine Floorplan
- 22 million transistors
- 500 MHz
- TI GS30KA:
  - 0.15 µm L drawn
  - 0.  µm L eff
  - CMOS process

Imagine Programming Environment

Stream-level program:
    StereoDepthExtraction(…) {
        // Load Input Images
        ...
        // Run Kernels
        convolve7x7(RawImage, ConvImage);
        convolve3x3(ConvImage, Conv2Image);
        ...
        // Store Output
    }

Kernel-level program:
    Convolve7x7(…) {
        ...
        while (!In.empty()) {
            ...
            p0  = k0  * in10;
            p12 = k21 * in32;
            p34 = k43 * in54;
            p56 = k65 * in76;
            sum = (p0 + p12) + (p34 + p56);
            ...
        }
    }

RC-SIMD
- Imagine supports a full interconnect between PEs
- Do we need this expensive interconnect?
- Alternative: RC-SIMD

Basic template of communication architecture
[Figure: PE0 to PE3 on a shared instruction bus, linked through switches S0, S1, and S2, each with settings 0 and 1]

Example: 4-tap filter
Operations: LD 0, * C0, LD +1, * C1, LD +2, * C2, LD +3, * C3, +, ST

Schedule (same instruction on PE 0, PE 1, PE 2, PE 3):
cycle  operation
0      LD 0
1      * C0
2      LD +1
3      * C1
4      LD +2
5      * C2
6      LD +3
7      * C3
8      sum
9      ST

Example
- The 10-cycle schedule above (LD 0 in cycle 0 through ST in cycle 9) runs in lockstep on PE 0 to PE 3
- Resource sharing conflict
- How to solve? Pipeline (shift 1 cycle)
[Figure: PE0 to PE3 linked through switches S0, S1, and S2, each with settings 0 and 1]

RC-SIMD: Basic architecture
Schedule with delay-line

cycle  PE 0    PE 1    PE 2    PE 3
0      LD 0    -       -       -
1      * C0    LD 0    -       -
2      LD +1   * C0    LD 0    -
3      * C1    LD +1   * C0    LD 0
4      LD +2   * C1    LD +1   * C0
5      * C2    LD +2   * C1    LD +1
6      LD +3   * C2    LD +2   * C1
7      * C3    LD +3   * C2    LD +2
8      sum     * C3    LD +3   * C2
9      ST      sum     * C3    LD +3
10     -       ST      sum     * C3
11     -       -       ST      sum
12     -       -       -       ST

[Figure: PE0 to PE3 linked through switches S0, S1, and S2, with a delay element in the instruction path]
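The delay-line schedule follows a simple rule: PE p executes in cycle t the operation that PE 0 executed in cycle t - p. A minimal C sketch of that rule is given below; the function names are hypothetical, and the operation list is the 4-tap filter sequence from the slides.

```c
#include <assert.h>

/* The 4-tap filter instruction sequence, in issue order. */
static const char *const OPS[] = {
    "LD 0", "* C0", "LD +1", "* C1", "LD +2",
    "* C2", "LD +3", "* C3", "sum", "ST"
};
enum { N_OPS = sizeof OPS / sizeof OPS[0] };

/* Index of the operation that `pe` executes in `cycle`, or -1 if it is idle.
   Each PE sees the instruction stream one cycle later than its left neighbour. */
int op_at(int cycle, int pe)
{
    int k = cycle - pe;
    return (k >= 0 && k < N_OPS) ? k : -1;
}

/* Printable name for the same slot; "-" marks an idle slot. */
const char *op_name(int cycle, int pe)
{
    int k = op_at(cycle, pe);
    return (k < 0) ? "-" : OPS[k];
}

/* Total schedule length when n_pe processing elements are chained. */
int total_cycles(int n_pe)
{
    assert(n_pe > 0);
    return N_OPS + n_pe - 1;
}
```

For four PEs this gives 13 cycles (cycle 0 to cycle 12); the skew between the first and the last PE is always n_pe - 1 cycles, so it grows with the width of the array.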

Conflict model
- Schedule PE0 (using FACTS)
- Node: resource usage
- Sequence edge: timing dependency
- FACTS tools
- Move the problem from hardware to software
[Figure: conflict graph for Ld +2 with the settings of switches S0, S1, and S2, next to the delay-line architecture of PE0 to PE3]

Basic architecture
Valid schedule

cycle  PE 0    PE 1    PE 2    PE 3
0      LD +3   -       -       -
1      LD +1   LD +3   -       -
2      LD 0    LD +1   LD +3   -
3      * C0    LD 0    LD +1   LD +3
4      LD +2   * C0    LD 0    LD +1
5      * C1    LD +2   * C0    LD 0
6      * C2    * C1    LD +2   * C0
7      * C3    * C2    * C1    LD +2
8      sum     * C3    * C2    * C1
9      ST      sum     * C3    * C2
10     -       ST      sum     * C3
11     -       -       ST      sum
12     -       -       -       ST

[Figure: PE0 to PE3 linked through switches S0, S1, and S2, with a delay element in the instruction path]

Drawback
- 319 cycles between PE0 and PE319
- Size of conflict model (compile time)
[Figure: instructions Ins 1 to Ins 4 skewed by one cycle per PE across PE 0, PE 1, PE 2, PE 3, ..., PE 319]

Update Architecture
[Figure: instructions Ins 1 to Ins 4 issued over cycle 1 to cycle 7 across PE 0 to PE 6]

Updated RC-SIMD Architecture

Results of mapping several kernels

Kernel              Num.  Cycles in  Cycles in  Cycles in  Comm. overhead  Cycle improvement
                    ops   LC-SIMD    FC-SIMD    RC-SIMD    in LC-SIMD      (compared to LC-SIMD)
4-tap filter        8     10         8          8          2               20%
Image sub-sampling  21    26         21                    5               19%
Convolution 7x7     98    126        98                    28              22%
Haar filter         162   216        162                   54              25%
FFT                 26    -                     28         -               -
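For the four rows that list both values, the improvement percentage appears to equal the LC-SIMD communication overhead divided by the LC-SIMD cycle count. The helper below is a hypothetical sketch of that check; the relationship is inferred from the numbers and is not stated on the slide.

```c
#include <assert.h>

/* Communication overhead as a percentage of the LC-SIMD cycle count,
   rounded to the nearest integer using integer arithmetic only. */
int overhead_percent(int lc_cycles, int overhead_cycles)
{
    assert(lc_cycles > 0);
    assert(overhead_cycles >= 0 && overhead_cycles <= lc_cycles);
    return (200 * overhead_cycles + lc_cycles) / (2 * lc_cycles);
}
```

Applied to the table, the 4-tap filter has 2 overhead cycles out of 10, the image sub-sampling kernel 5 out of 26, the 7x7 convolution 28 out of 126, and the Haar filter 54 out of 216.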

Imap


Difficult SIMD Applications
Algorithms need dynamic communication:
- lens distortion
- bucket processing
- mirroring, ...

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on buses Bus_0, Bus_1, and Bus_2]
- Example transfers: PE_6 → PE_3, PE_4 → PE_2
- Message format: V | dst-add | data | src-add

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on buses Bus_0, Bus_1, and Bus_2]
- Larger distance: PE_7 → PE_1

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on buses Bus_0, Bus_1, and Bus_2]
- Competing transfers: PE_7 → PE_5, PE_6 → PE_2
- Priority

DC-SIMD: arbitration
- Message format: V | des-add | data | src-add
- Write: give priority to further PEs
- Read: PE n, PE n+1, PE n+2
- Buffer instruction: PEid
[Figure labels: PE, xor, read data, next reg., select (ab)]

Select encoding:
source  ab
v       00
n+2     01
n+1     10
n       11

Select logic (' denotes complement; 0, 1, 2 denote requests from PE n, PE n+1, PE n+2):
a = v'.2'
b = a'.v' + a.1'

Grant conditions:
n+2: 2.v'
n+1: (2+v)'.1
n:   (1+2+v)'.0
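The select logic on this slide can be read as a fixed-priority arbiter. The C sketch below encodes one reading of the flattened table, and that reading is an assumption: a valid message already on the bus (v) selects 00, then a request from PE n+2 selects 01, then PE n+1 selects 10, and otherwise PE n selects 11, with a = v'.2' and b = a'.v' + a.1'. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Two-bit select code (a is the high bit, b the low bit); encoding assumed. */
typedef enum {
    SEL_V  = 0,  /* 00: pass the valid message already on the bus */
    SEL_N2 = 1,  /* 01: take the message from PE n+2 */
    SEL_N1 = 2,  /* 10: take the message from PE n+1 */
    SEL_N  = 3   /* 11: take the message from PE n   */
} select_t;

/* v:  a valid message is already travelling on the bus.
   r2: PE n+2 requests the bus.  r1: PE n+1 requests the bus. */
select_t arbitrate(bool v, bool r2, bool r1)
{
    bool a = !v && !r2;                  /* a = v'.2'          */
    bool b = (!a && !v) || (a && !r1);   /* b = a'.v' + a.1'   */
    return (select_t)((a << 1) | b);
}
```

Under this reading the request of PE n only wins when nothing farther along is competing, which matches the slide's remark about giving priority to further PEs.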

Measurements: Area overhead

Measurements: Performance

Measurements: required instruction buffer size

In PS3: CELL architecture

CELL Highlights
- Observed clock speed: > 4 GHz
- Peak performance (single precision): > 256 GFlops
- Peak performance (double precision): > 26 GFlops
- Area: 221 mm²
- Technology: 90 nm SOI
- Total number of transistors: 234M

Conclusions
SIMD nicely matches
- Image applications: data-level parallelism
- VLSI efficiency: copy-paste of simple elements
So
- Very efficient architecture for image processing
- Low power! Also by trading off clock vs performance
But
- Programmer is burdened with vector thinking
- Vectorizing compilers are not good at recognizing opportunities for vector execution
- Need for a "control" processor for control code and if-then-else
Communication is a problem:
- Dimensioned for peak BW requirements -> RCSIMD
- Unable to perform indirect PE addressing -> DCSIMD
