Published by Chantal Gehrig. Modified over 6 years ago.
1
Lecture on High Performance Processor Architecture (CS05162)
DLP Architecture Case Study: Stream Processor
Xu Guang
Fall 2007
University of Science and Technology of China
Department of Computer Science and Technology
CS258 S99
2
Discussion Outline
  Motivation
  Related work
  Imagine
  Conclusion
  TPA-PD
  Future work
2018/11/13 CS of USTC
3
Motivation
VLSI technology
  More ALUs fit on a chip, so computation is relatively cheap
Keeping them fed is hard
  The problem is bandwidth
    Energy
    Delay
4
Motivation
Data-level parallel (DLP) applications
  Media applications
  Real-time graphics
  Signal processing
  Video processing
  Scientific computing
Application characteristics
  Dense computation
  Parallelism
  Locality
5
Motivation
Application characteristics
  Poorly matched to conventional architectures
    Cache
    Instruction-level parallelism
    Few arithmetic units
  Well matched to modern VLSI technology
    Lots (hundreds) of ALUs fit on a single chip
    Communication bandwidth is the scarce resource
6
Related work
Vector
  Large set of temporary values
DSP
  Signal register file
GPU
  Special purpose
General-purpose processor
  ILP
  Cache
7
Related work
Imagine
  Prototype HW
    2002 prototype processor
    256 mm² die in 150 nm, 21M transistors
    Collaboration with TI ASIC
  SW based on StreamC / KernelC
    Stream scheduler
    Communication scheduler
Merrimac
  Stream supercomputer
  For scientific applications
  No prototype
8
Related work
Cell
  Prototype HW
  221 mm² die in 90 nm, 234M transistors
AMD & ATI
  Stream processor
NUDT
  Fei Teng 64
USTC
  ACSA TPA
9
Programming model
Stream
  Organized as a sequence of records
  Records may be simple or complex
  Ordered
  Finite-length
Stream vs. array
  A stream is used in order
10
Stream type
Basic stream: an array of records
Derived stream: a reference to a subset of records in a basic stream
  stream<type> name = basic-stream (start, end, data Dependence, access pattern);
Records are laid out contiguously in the address space
11
Stream type
Sequential access pattern: y = x (start, end)
Strided access pattern: y = x (start, end, data Dependence, stride)
Indexed access pattern: y = x (start, end, data Dependence, index Stream)
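The three access patterns can be sketched in Python as follows. This is only an illustration, not StreamC syntax: the function names, the half-open range [start, end), and the choice to treat index-stream entries as offsets from start are all assumptions, and the data-dependence argument from the slide is left out. The sketch also copies records, whereas a derived stream on the slide is a reference into the basic stream.

```python
def sequential(x, start, end):
    """Records start .. end-1 of basic stream x, in order."""
    return [x[i] for i in range(start, end)]


def strided(x, start, end, stride):
    """Every stride-th record in [start, end)."""
    return [x[i] for i in range(start, end, stride)]


def indexed(x, start, end, index_stream):
    """Records picked by an index stream.

    Assumption: each index is an offset from start and must stay
    inside [start, end).
    """
    out = []
    for offset in index_stream:
        i = start + offset
        if not (start <= i < end):
            raise IndexError("index %d outside the derived range" % offset)
        out.append(x[i])
    return out
```

With a ten-record basic stream, `strided(x, 0, 10, 3)` touches records 0, 3, 6, and 9, while `indexed(x, 2, 8, [3, 0, 5])` gathers records 5, 2, and 7.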
12
Programming model
Stream program
  A kernel is a program that performs the same set of operations on each input stream record and produces one or more output streams.
  Express a computation as streams flowing through kernels.
  Represent applications as a set of computation kernels that consume and produce data streams.
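A rough Python sketch of "streams flowing through kernels" is given here. The helper `run_kernel` and the two toy kernels `scale` and `add` are hypothetical names chosen for illustration; generators stand in for streams so that the intermediate stream is consumed record by record rather than stored in full.

```python
def run_kernel(kernel, *input_streams):
    """Apply the same kernel to matching records of each input stream."""
    for records in zip(*input_streams):
        yield kernel(*records)


def scale(x):
    """Toy kernel: one input record, one output record."""
    return 2 * x


def add(a, b):
    """Toy kernel: two input records, one output record."""
    return a + b


def program(stream_a, stream_b):
    """Stream program: stream_a -> scale -> add <- stream_b."""
    scaled = run_kernel(scale, stream_a)   # producer
    return list(run_kernel(add, scaled, stream_b))  # consumer
```

Calling `program([1, 2, 3], [10, 20, 30])` passes each scaled record straight to the `add` kernel, which is the producer-consumer pattern the stream model is built around.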
13
Programming example
Stereo depth extractor application
[Figure: Image 0 and Image 1 (input data) each pass through two convolve kernels; a SAD kernel combines the two filtered streams into the Depth Map (output data).]
Operations within a kernel operate on local data
Streams expose data parallelism
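The kernel graph of the stereo depth extractor can be approximated in one dimension with a short Python sketch. It is a simplification under stated assumptions, not the Imagine implementation: image rows stand in for streams, the filter taps, window size, and maximum shift are arbitrary, and the function names are hypothetical.

```python
def convolve(row, taps):
    """Kernel: filter one image row with a small 1-D filter (no padding)."""
    n = len(taps)
    return [sum(row[i + j] * taps[j] for j in range(n))
            for i in range(len(row) - n + 1)]


def sad_depth(left, right, max_disp, window):
    """Kernel: for each window position in left, pick the shift into right
    (0 .. max_disp) with the smallest sum of absolute differences (SAD)."""
    depth = []
    for i in range(len(left) - window + 1):
        best_d, best_cost = 0, None
        for d in range(min(max_disp, i) + 1):
            cost = sum(abs(left[i + k] - right[i - d + k])
                       for k in range(window))
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        depth.append(best_d)
    return depth


def depth_row(left, right, taps_a, taps_b, max_disp=3, window=3):
    """Two convolve kernels per image, then SAD over the filtered rows.
    Both rows are assumed to have the same length."""
    f_left = convolve(convolve(left, taps_a), taps_b)
    f_right = convolve(convolve(right, taps_a), taps_b)
    return sad_depth(f_left, f_right, max_disp, window)
```

Each kernel reads only its input rows and a few local temporaries, and every row of the image can be processed independently, which is the data parallelism the stream formulation exposes.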
14
Programming example
Vector add
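The listing for this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of a vector add split into a kernel-level part and a stream-level part; it is not the original StreamC/KernelC code, and the names `vadd_kernel` and `vadd` are assumptions.

```python
def vadd_kernel(a, b):
    """Kernel level: one output record from one record of each input."""
    return a + b


def vadd(stream_a, stream_b):
    """Stream level: feed matching records of both streams to the kernel."""
    if len(stream_a) != len(stream_b):
        raise ValueError("input streams must have the same length")
    return [vadd_kernel(a, b) for a, b in zip(stream_a, stream_b)]
```

The split mirrors the two programming levels of the stream model: the kernel describes the work per record, and the stream-level code decides which streams flow through it.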
15
Why Organize an Application This Way?
Expose parallelism at three levels
  ILP within kernels
  DLP across stream elements
  TLP across sub-streams and across kernels
  Keeps ‘easy’ parallelism easy
Expose locality in two ways
  Within a kernel: kernel locality
  Between kernels: producer-consumer locality
Put another way, stream programs make communication explicit
16
Overall Imagine block diagram
[Figure: block diagram of the Imagine Stream Processor, showing the Host Processor, Stream Controller, Microcontroller, Stream Register File, ALU Clusters 0 to 7, Streaming Memory System with SDRAM, and Network Interface.]
17
Instructions of Imagine
Stream level
18
Instructions of Imagine
Kernel level
  Integer/float arithmetic
  Bitwise logic and comparison
  Data permutation
  Stream in/out
  Loop control
  Operate on packed data like short vectors (SIMD)
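Operating on packed data means treating one 32-bit word as a short vector of narrower values. The Python sketch below adds two 16-bit halves independently. The wraparound behaviour and the name `packed_add16` are assumptions made for illustration; the slide does not say whether Imagine's packed operations wrap or saturate.

```python
MASK16 = 0xFFFF


def packed_add16(a, b):
    """Add two 32-bit words as pairs of 16-bit lanes.

    Each lane wraps around on overflow (an assumption), and a carry out
    of the low lane never reaches the high lane.
    """
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo
```

For example, adding `0x0001FFFF` and `0x00010001` gives `0x00020000`: the low lane overflows to zero without disturbing the high lane.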
19
Architecture of Imagine
Host Interface
  All interactions between the host processor and the Imagine core occur via the Host Interface
  Stream instructions can be loaded onto Imagine
  Several status words can be read from Imagine
  Individual data words can be read from Imagine
  Entire data streams can be transferred to/from Imagine
20
Architecture of Imagine
Stream controller
Responsibilities
  Handles the data flow between, and control of, all of the modules on the chip
  Controls which stream instructions to issue and when they are executed
Parameters
  32-entry instruction queue
21
Architecture of Imagine
SRF
Responsibilities
  Source and destination for all memory operations
  The source and sink of data to the arithmetic clusters and the network router
Parameters
  8 banks
  128 KB
  Software controlled
  SDR
  SCR
  Reads/writes a block at a time
22
Architecture of Imagine SRF
23
Architecture of Imagine SRF
24
Architecture of Imagine
Memory System
Responsibilities
  Load data from memory to SRF
  Store data from SRF to memory
Parameters
  4 memory banks
  2 address generators
  MSCR
25
Architecture of Imagine Memory System
26
Architecture of Imagine
Microcontroller
Responsibilities
  Passing parameters from/to the host interface
  Loading microprograms into its microcode store
  Controlling the execution of the microprograms on the arithmetic clusters
Parameters
  1024 × VLIW
  32 × 32-bit register file (UCRF)
  2 × 1-bit register file (UCONDRF)
27
Architecture of Imagine Microcontroller
28
Architecture of Imagine
Cluster
Responsibilities
  8 clusters perform identical operations in parallel
  Controlled by the microcontroller
Parameters
  Each cluster has 3 adders, 2 multipliers, 1 divider, 1 scratchpad, 1 jukebox and 1 valid unit, and 1 comm unit
  Internal data path width of 32 bits
  Each functional unit has its own local register files (LRF)
  All functional units accept 32-bit inputs and produce 32-bit results
  For floating-point operations, the units use IEEE floating-point format
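The statement that 8 clusters perform identical operations in parallel can be pictured with a small Python model. The round-robin assignment of records to clusters (record i goes to cluster i mod 8) is an assumption for illustration, as are the names `simd_execute` and `cluster_of`; the model is sequential Python and only mimics the lockstep grouping.

```python
NUM_CLUSTERS = 8  # cluster count taken from the slide


def cluster_of(record_index):
    """Assumed round-robin mapping of stream records to clusters."""
    return record_index % NUM_CLUSTERS


def simd_execute(kernel, stream):
    """Process the stream in lockstep batches of NUM_CLUSTERS records."""
    out = []
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]
        # In one step every cluster applies the same operation to its own
        # record; a short final batch leaves some clusters idle.
        out.extend(kernel(record) for record in batch)
    return out
```

A ten-record stream therefore takes two lockstep steps, with only clusters 0 and 1 doing useful work in the second one.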
29
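Architecture of Imagine
Cluster
[Figure: cluster datapath with adder (+), multiplier (*), and divider (/) units, Local Register Files, a Cross Point switch, paths From SRF and To SRF, and a CU connected to the Intercluster Network.]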
30
Streams expose Kernel Locality missed by Vectors
Stream
  Traverse operations first
    All operations for one record, then next record
    Smaller working set of temporary values
  Store and access whole records as a unit
    Spatial locality of memory references
Vector
  Traverse records first
    All records for one operation, then next operation
    Large set of temporary values
  Group like-elements of records into vectors
    Read one word of each record at a time
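The two traversal orders can be compared with a small Python sketch that computes a*b + c for every record. The record layout (a, b, c), the function names, and the way temporaries are counted are assumptions made for illustration.

```python
def stream_order(records):
    """Operations first: finish one record before touching the next."""
    out = []
    peak_temps = 0
    for a, b, c in records:
        t = a * b                      # temporary lives for one record only
        peak_temps = max(peak_temps, 1)
        out.append(t + c)
    return out, peak_temps


def vector_order(records):
    """Records first: finish one operation over all records, then the next."""
    a = [r[0] for r in records]        # like-elements grouped into vectors
    b = [r[1] for r in records]
    c = [r[2] for r in records]
    t = [x * y for x, y in zip(a, b)]  # temporary as long as the stream
    out = [x + y for x, y in zip(t, c)]
    return out, len(t)
```

Both functions return the same results, but the stream order keeps one live temporary at a time while the vector order holds one per record, which is the working-set difference described above.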
31
Streams expose Kernel Locality missed by Vectors
32
Mapping App to Imagine
Mapping
33
Mapping App to Imagine
Compile: stream level
34
Mapping App to Imagine
Kernel level
35
Mapping App to Imagine
Before
36
Mapping App to Imagine
Stream inst
  Memop: load stream input from memory to SRF
  Memop: load stream ucode from memory to SRF
37
Mapping App to Imagine
SRF data
38
Mapping App to Imagine
Stream inst
  Load ucode: fetch ucode from SRF to microcode store
39
Mapping App to Imagine
Stream inst
  Cluster op: execute ucode vadd
40
Mapping App to Imagine
Stream inst
  Memop: store stream output from SRF to memory
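The stream instructions on the preceding slides can be strung together in a toy Python model of the vadd example. Everything here is a simplification under stated assumptions: dictionaries stand in for memory, the SRF, and the microcode store, the microcode is a Python function, and the name `run_vadd` and the trace strings are hypothetical.

```python
def run_vadd(a, b):
    """Replay the stream-instruction sequence for a vector add."""
    memory = {"a": list(a), "b": list(b), "vadd_ucode": lambda x, y: x + y}
    srf = {}
    microcode_store = {}
    trace = []

    # Memop: load the input streams from memory to the SRF
    srf["a"], srf["b"] = memory["a"], memory["b"]
    trace.append("memop: input -> SRF")

    # Memop: load the kernel microcode from memory to the SRF
    srf["vadd_ucode"] = memory["vadd_ucode"]
    trace.append("memop: ucode -> SRF")

    # Load ucode: move the microcode from the SRF to the microcode store
    microcode_store["vadd"] = srf["vadd_ucode"]
    trace.append("load ucode: SRF -> microcode store")

    # Cluster op: run the kernel, reading and writing streams in the SRF
    kernel = microcode_store["vadd"]
    srf["out"] = [kernel(x, y) for x, y in zip(srf["a"], srf["b"])]
    trace.append("cluster op: vadd")

    # Memop: store the output stream from the SRF back to memory
    memory["out"] = srf["out"]
    trace.append("memop: SRF -> output")

    return memory["out"], trace
```

The trace makes the point of the walkthrough visible: the clusters only ever touch the SRF, and memory traffic happens in whole-stream transfers before and after the kernel runs.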
41
Performance
Bandwidth hierarchy
  SDRAM to Stream Register File: 2 GB/s
  Stream Register File to ALU clusters: 32 GB/s
  Within the ALU clusters: 544 GB/s
42
Performance
Bandwidth demand of stream programs fits the bandwidth hierarchy of the architecture
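The three bandwidth figures quoted for the hierarchy (2, 32, and 544 GB/s) can be turned into ratios with a few lines of Python. The assignment of each figure to a level (memory, SRF, clusters) is an assumption read off the slide, and the `fits` helper with its per-level demand dictionary is a hypothetical way of expressing "demand fits the hierarchy".

```python
# Peak bandwidth per level in GB/s; the level names are assumptions.
BANDWIDTH_GB_S = {"memory": 2.0, "srf": 32.0, "clusters": 544.0}


def relative_bandwidth(levels, base="memory"):
    """Bandwidth of every level relative to the base level."""
    return {name: bw / levels[base] for name, bw in levels.items()}


def fits(demand_gb_s, levels=BANDWIDTH_GB_S):
    """True if the demand at every level stays within that level's peak."""
    return all(demand_gb_s[name] <= levels[name] for name in levels)
```

`relative_bandwidth(BANDWIDTH_GB_S)` gives a 1 : 16 : 272 ratio, so a program fits the hierarchy when only a small fraction of its data references need to go all the way to memory.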
43
Performance
[Figures: performance of 16-bit kernels, floating-point kernels, 16-bit applications, and floating-point applications.]
44
Performance
Power
GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9
45
Conclusion
Performance
  Compound stream operations realize >10 GOPS on key applications
  Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
  Three-level register hierarchy gives 2-10 GOPS/W
46
Conclusion
Disadvantage
  Programming model
    Applications must be rewritten
    Programmers need to know details of the hardware
47
TPA-PD
Motivation
Tiled
  Wire delay constraints make the centralized structures that dominate today’s designs difficult
  Architectural partitioning encourages regularity and re-use
Application
  Media applications
  Scientific computing
  Irregular control and data access
48
TPA-PD
49
TPA-PD
Instruction set
  Stream level
  Kernel level
    Explicit Data Graph Execution (EDGE)
    Block-oriented
    Direct target encoding
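Direct target encoding means that an instruction names the consumers of its result instead of naming source registers, so a block executes in dataflow order. The toy Python interpreter below illustrates the idea only. The instruction format, the two-operand restriction, and the name `run_block` are assumptions, and nothing here reflects the actual TPA-PD instruction set.

```python
import operator
from collections import deque

OPS = {"add": operator.add, "mul": operator.mul}


def run_block(block, block_inputs):
    """Execute one block of two-operand instructions in dataflow order.

    block:        {instr_id: {"op": name, "targets": [(dest, slot), ...]}}
    block_inputs: [(value, dest, slot), ...] operands arriving from outside
    A dest of "out" writes a block output named by slot.
    """
    operands = {instr_id: {} for instr_id in block}
    outputs = {}
    ready = deque()

    def deliver(value, dest, slot):
        if dest == "out":
            outputs[slot] = value
            return
        operands[dest][slot] = value
        if len(operands[dest]) == 2:   # both operands have arrived
            ready.append(dest)

    for value, dest, slot in block_inputs:
        deliver(value, dest, slot)

    while ready:
        instr_id = ready.popleft()
        instr = block[instr_id]
        result = OPS[instr["op"]](operands[instr_id][0], operands[instr_id][1])
        for dest, slot in instr["targets"]:   # forward result to consumers
            deliver(result, dest, slot)
    return outputs
```

For instance, a block with an `add` that targets both operand slots of a `mul` computes (a + b) * (a + b) without any shared register file: each instruction fires as soon as its operands arrive.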
50
TPA-PD
No centralized control
51
TPA-PD
Programming model
  Binary translator
  StreamC/KernelC
  Static
    VLIW (bin) for input
    New inst (bin) for output
52
Future work
TPA-PD
  Architecture
  Instruction set
  C simulator
  RTL simulator
53
Thank you