Lecture on High Performance Processor Architecture (CS05162)

DLP Architecture Case Study: Stream Processor
Xu Guang
Fall 2007
University of Science and Technology of China
Department of Computer Science and Technology

Discussion Outline
  Motivation
  Related work
  Imagine
  Conclusion
  TPA-PD
  Future work
2018/11/13 CS of USTC

Motivation
  VLSI technology
    More ALUs fit on a chip; computation is relatively cheap
    Keeping them fed is hard
  The problem is bandwidth
    Energy
    Delay

Motivation
  Data-level parallel (DLP) applications
    Media applications
    Real-time graphics
    Signal processing
    Video processing
    Scientific computing
  Application characteristics
    Dense computation
    Parallelism
    Locality

Motivation
  Application characteristics
    Poorly matched to conventional architectures
      Cache
      Instruction-level parallelism
      Few arithmetic units
    Well matched to modern VLSI technology
      Lots (100s) of ALUs fit on a single chip
      Communication bandwidth is the scarce resource

Related work
  Vector
    Large set of temporary values
  DSP
    Signal register file
  GPU
    Special purpose
  General-purpose processor
    ILP
    Cache

Related work
  Imagine
    Prototype HW
      2002 prototype processor
      256 mm² die in 150 nm, 21M transistors
      Collaboration with TI ASIC
    SW based on StreamC / KernelC
      Stream scheduler
      Communication scheduler
  Merrimac
    Stream supercomputer
    For scientific applications
    No prototype

Related work
  Cell
    Prototype HW
    221 mm² in 90 nm, 234M transistors
  AMD & ATI
    Stream processor
  NUDT
    Fei Teng 64
  USTC
    ACSA TPA

Programming model
  Stream
    Organized as a sequence of records
      Simple or complex records
    Ordered
    Finite-length
  Vs. array
    Ordered use

Stream type
  Basic stream: an array of records
  Derived stream: a reference to a subset of records in a basic stream
    stream<type> name = basic-stream (start, end, data Dependence, access pattern);
  Records are laid out contiguously in the address space

Stream type
  Sequential access pattern: y = x (start, end)
  Strided access pattern: y = x (start, end, data Dependence, stride)
  Indexed access pattern: y = x (start, end, data Dependence, index Stream)
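
The three access patterns can be illustrated with a minimal Python sketch. It is not StreamC syntax. The function names are hypothetical, `end` is assumed to be exclusive, indices in the index stream are assumed to be offsets from `start`, and the data Dependence argument from the slide is left out. The sketch also copies records, whereas a derived stream on the slide is a reference into the basic stream.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sequential(basic: Sequence[T], start: int, end: int) -> List[T]:
    """Records start .. end-1 of the basic stream, in order (end assumed exclusive)."""
    return list(basic[start:end])


def strided(basic: Sequence[T], start: int, end: int, stride: int) -> List[T]:
    """Every stride-th record in [start, end)."""
    return list(basic[start:end:stride])


def indexed(basic: Sequence[T], start: int, end: int,
            index_stream: Sequence[int]) -> List[T]:
    """Records selected by an index stream; each index is an offset from start (assumption)."""
    window = basic[start:end]
    return [window[i] for i in index_stream]


x = list(range(10, 20))               # basic stream: 10, 11, ..., 19
y_seq = sequential(x, 2, 6)           # [12, 13, 14, 15]
y_str = strided(x, 0, 10, 3)          # [10, 13, 16, 19]
y_idx = indexed(x, 2, 8, [3, 0, 3])   # window is 12..17, so [15, 12, 15]
```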

Programming model
  Stream program
    A kernel is a program that performs the same set of operations on each input stream record and produces one or more output streams
    Express a computation as streams flowing through kernels
    Represent applications as a set of computation kernels that consume and produce data streams
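
As a rough illustration of streams flowing through kernels, the Python sketch below chains two kernels with generators. `run_kernel`, `scale`, and `add` are hypothetical names chosen for this example, not part of StreamC/KernelC.

```python
from typing import Callable, Iterable, Iterator


def run_kernel(kernel: Callable, *inputs: Iterable) -> Iterator:
    """Apply the same kernel to each record of the input streams, in order."""
    for records in zip(*inputs):
        yield kernel(*records)


def scale(r: int) -> int:
    """Kernel 1: one input stream, one output stream."""
    return 2 * r


def add(a: int, b: int) -> int:
    """Kernel 2: two input streams, one output stream."""
    return a + b


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# The output stream of `scale` is consumed directly by `add`;
# the intermediate stream is never written back to a large array.
scaled = run_kernel(scale, a)
result = list(run_kernel(add, scaled, b))   # [12, 24, 36, 48]
```

Using generators keeps the intermediate stream between the two kernels small, which loosely mirrors the producer-consumer relationship between kernels.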

Programming example
  Stereo depth extractor application
  [Figure: Image 0 and Image 1 (input data) each pass through two convolve kernels; a SAD kernel combines the two results into the Depth Map (output data)]
  Operations within a kernel operate on local data
  Streams expose data parallelism

Programming example
  Vector add
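
The code on the original slide is not preserved in this transcript. As a stand-in, here is a minimal Python sketch of a vector-add kernel written in the stream style: read one record from each input stream, add, write one record to the output stream. The names `vadd_kernel` and `vadd` are hypothetical, and the equal-length check is an assumption of this sketch.

```python
from typing import List, Sequence


def vadd_kernel(in_a: Sequence[float], in_b: Sequence[float]) -> List[float]:
    """Kernel body: loop over the records of both input streams."""
    out: List[float] = []
    for a, b in zip(in_a, in_b):   # stream in
        out.append(a + b)          # stream out
    return out


def vadd(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Stream-level wrapper: streams are finite-length, so lengths must match."""
    if len(a) != len(b):
        raise ValueError("input streams must have the same length")
    return vadd_kernel(a, b)


c = vadd([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])   # [1.5, 2.5, 3.5]
```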

Why Organize an Application This Way?
  Expose parallelism at three levels
    ILP within kernels
    DLP across stream elements
    TLP across sub-streams and across kernels
    Keeps 'easy' parallelism easy
  Expose locality in two ways
    Within a kernel: kernel locality
    Between kernels: producer-consumer locality
  Put another way, stream programs make communication explicit

Overall Imagine block diagram
  [Figure: block diagram showing the Host Processor, the Imagine Stream Processor (Stream Controller, Microcontroller, Stream Register File, ALU Clusters 0-7, Network Interface, Streaming Memory System), and SDRAM]

Instructions of Imagine
  Stream level

Instructions of Imagine
  Kernel level
    Integer/float arithmetic
    Bitwise logic and comparison
    Data permutation
    Stream in/out
    Loop control
    Operate on packed data like short vectors (SIMD)

Architecture of Imagine: Host Interface
  All interactions between the host processor and the Imagine core occur via the Host Interface
    Stream instructions can be loaded onto Imagine
    Several status words can be read from Imagine
    Individual data words can be read from Imagine
    Entire data streams can be transferred to/from Imagine

Architecture of Imagine: Stream controller
  Responsibilities
    Handles the data flow between, and the control of, all modules on the chip
    Controls which stream instructions are issued and when they are executed
  Parameters
    32-entry instruction queue

Architecture of Imagine: SRF
  Responsibilities
    Source and destination for all memory operations
    Source and sink of data for the arithmetic clusters and the network router
  Parameters
    8 banks
    128 KB
    Software control
    SDR
    SCR
    Reads/writes a block at a time

Architecture of Imagine: Memory System
  Responsibilities
    Load data from memory to the SRF
    Store data from the SRF to memory
  Parameters
    4 memory banks
    2 address generators
    MSCR

Architecture of Imagine: Microcontroller
  Responsibilities
    Passing parameters from/to the host interface
    Loading microprograms into its microcode store
    Controlling the execution of the microprograms on the arithmetic clusters
  Parameters
    1024 × VLIW
    32 × 32-bit regfiles (UCRF)
    2 × 1-bit regfiles (UCONDRF)

Architecture of Imagine: Cluster
  Responsibilities
    8 clusters perform identical operations in parallel
    Controlled by the microcontroller
  Parameters
    Each cluster has
      3 adders
      2 multipliers
      1 divider
      1 scratchpad
      1 jukebox
      1 valid unit
      1 comm unit
    Internal data path width of 32 bits
    Each functional unit has its own local register file (LRF)
    All functional units accept 32-bit inputs and produce 32-bit results
    For floating-point operations, the units use IEEE floating-point format
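
The "identical operations in parallel" behaviour can be pictured with a small Python sketch. It assumes, for illustration only, that record i of a stream is handled by cluster i mod 8; `simd_map` and `cluster_of` are hypothetical names, and the inner loop stands in for work that the hardware would do simultaneously.

```python
from typing import Callable, List, Sequence

NUM_CLUSTERS = 8   # from the slide: 8 clusters


def cluster_of(record_index: int) -> int:
    """Cluster assumed to process a given record (round-robin assignment)."""
    return record_index % NUM_CLUSTERS


def simd_map(op: Callable[[int], int], stream: Sequence[int]) -> List[int]:
    """Process the stream in batches of 8 records, one record per cluster."""
    out: List[int] = [0] * len(stream)
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]
        # Every cluster applies the same operation to its own record.
        for cluster, record in enumerate(batch):
            out[base + cluster] = op(record)
    return out


squares = simd_map(lambda r: r * r, list(range(10)))
```

With 10 records the second batch is only partly filled, so six of the eight clusters would have no record to process in that step.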

Architecture of Imagine: Cluster
  [Figure: cluster datapath showing adder (+), multiplier (*), and divider (/) units, the CU, local register files, a cross point, the intercluster network, and the paths from and to the SRF]

Streams expose Kernel Locality missed by Vectors
  Stream
    Traverse operations first
      All operations for one record, then next record
      Smaller working set of temporary values
    Store and access whole records as a unit
      Spatial locality of memory references
  Vector
    Traverse records first
      All records for one operation, then next operation
      Large set of temporary values
    Group like elements of records into vectors
      Read one word of each record at a time
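
The difference in working-set size can be made concrete with a Python sketch that computes (a + b) * c in both traversal orders and counts how many temporaries are live at once. The function names are hypothetical, and the vector version operates on whole arrays for simplicity; a real vector machine would work on register-length chunks.

```python
from typing import List, Sequence, Tuple


def stream_order(a: Sequence[int], b: Sequence[int],
                 c: Sequence[int]) -> Tuple[List[int], int]:
    """All operations for one record, then the next record."""
    out: List[int] = []
    peak_temporaries = 0
    for ai, bi, ci in zip(a, b, c):
        t = ai + bi                      # one temporary, live only for this record
        peak_temporaries = max(peak_temporaries, 1)
        out.append(t * ci)
    return out, peak_temporaries


def vector_order(a: Sequence[int], b: Sequence[int],
                 c: Sequence[int]) -> Tuple[List[int], int]:
    """All records for one operation, then the next operation."""
    t = [ai + bi for ai, bi in zip(a, b)]   # one temporary per record
    peak_temporaries = len(t)
    out = [ti * ci for ti, ci in zip(t, c)]
    return out, peak_temporaries


a, b, c = [1, 2, 3, 4], [10, 20, 30, 40], [2, 2, 2, 2]
s_out, s_peak = stream_order(a, b, c)
v_out, v_peak = vector_order(a, b, c)
```

Both orders produce the same result; only the number of intermediate values that must be held at the same time differs.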

Mapping App to Imagine
  Mapping

Mapping App to Imagine
  Compile: stream level

Mapping App to Imagine
  Compile: kernel level

Mapping App to Imagine
  Before

Mapping App to Imagine
  Stream inst
    Memop: load stream input from memory to SRF
    Memop: load stream ucode from memory to SRF

Mapping App to Imagine
  SRF data

  Load ucode: fetch ucode from SRF to microcode store

Mapping App to Imagine
  Stream inst
    Cluster op: execute ucode vadd

    Memop: store stream output from SRF to memory
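
The sequence of stream instructions in these slides can be traced with a toy Python model. The dictionaries standing in for memory, the SRF, and the microcode store, and all function and stream names, are hypothetical; two input streams are assumed because the kernel is a vector add.

```python
from typing import Callable, Dict, List

memory: Dict[str, object] = {
    "input_a": [1, 2, 3],
    "input_b": [4, 5, 6],
    "ucode_vadd": lambda a, b: a + b,   # stands in for the vadd microcode
}
srf: Dict[str, object] = {}                  # stream register file
microcode_store: Dict[str, Callable] = {}
trace: List[str] = []


def mem_load(name: str) -> None:
    """Memop: memory -> SRF."""
    srf[name] = memory[name]
    trace.append(f"memop load {name}")


def ucode_load(name: str) -> None:
    """Load ucode: SRF -> microcode store."""
    microcode_store[name] = srf[name]
    trace.append(f"ucode load {name}")


def cluster_op(ucode: str, inputs: List[str], output: str) -> None:
    """Cluster op: run the kernel on every record of the input streams in the SRF."""
    kernel = microcode_store[ucode]
    srf[output] = [kernel(*rec) for rec in zip(*(srf[n] for n in inputs))]
    trace.append(f"cluster op {ucode}")


def mem_store(name: str) -> None:
    """Memop: SRF -> memory."""
    memory[name] = srf[name]
    trace.append(f"memop store {name}")


mem_load("input_a")
mem_load("input_b")
mem_load("ucode_vadd")
ucode_load("ucode_vadd")
cluster_op("ucode_vadd", ["input_a", "input_b"], "output")
mem_store("output")
```

After the last instruction, `memory["output"]` holds the element-wise sum, and `trace` lists the instructions in issue order.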

Performance
  Bandwidth hierarchy
    SDRAM: 2 GB/s
    Stream Register File: 32 GB/s
    ALU cluster: 544 GB/s
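
The ratios between the three levels follow directly from the figures on the slide. The short Python calculation below only restates that arithmetic; the variable names are chosen for this example.

```python
# Bandwidth figures from the slide, in GB/s.
bandwidth_gb_s = {
    "SDRAM": 2,
    "Stream Register File": 32,
    "ALU cluster": 544,
}

base = bandwidth_gb_s["SDRAM"]

# Bandwidth of each level relative to SDRAM.
ratios = {level: bw / base for level, bw in bandwidth_gb_s.items()}

# Step from the SRF to the ALU clusters.
srf_to_cluster = bandwidth_gb_s["ALU cluster"] / bandwidth_gb_s["Stream Register File"]
```

This gives a ratio of 1 : 16 : 272 across the hierarchy, with a 17x step from the SRF to the clusters. A program that wants to keep the clusters busy therefore has to satisfy most of its data references from the SRF and the clusters rather than from SDRAM.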

Performance
  Bandwidth demand of stream programs fits the bandwidth hierarchy of the architecture

Performance
  [Figure: performance charts for 16-bit kernels, floating-point kernel, 16-bit applications, and floating-point application]

Performance
  Power
    GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

Conclusion
  Performance
    Compound stream operations realize >10 GOPS on key applications
    Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
  Power
    Three-level register hierarchy gives 2-10 GOPS/W

Conclusion
  Disadvantage
    Programming model
      Applications must be rewritten
      Programmers need to know details of the hardware

TPA-PD Motivation
  Tiled
    Wire delay constraints
    Centralized structures that dominate today's designs become difficult
    Architectural partitioning encourages regularity and re-use
  Application
    Media applications
    Scientific computing
    Irregular control and data access

TPA-PD Instruction set
  Stream level
  Kernel level
    Explicit Data Graph Execution (EDGE)
    Block-oriented
    Direct target encoding

TPA-PD
  No centralized control

TPA-PD Programming model
  StreamC/KernelC
  Binary translator
    Static
    VLIW (bin) for input
    New inst (bin) for output

Future work
  TPA-PD
    Architecture
    Instruction set
    C simulator
    RTL simulator

Thank you
