Lecture on High Performance Processor Architecture (CS05162)


1 Lecture on High Performance Processor Architecture (CS05162)
DLP Architecture Case Study: Stream Processor. Xu Guang, Fall 2007. University of Science and Technology of China, Department of Computer Science and Technology

2 Discussion Outline
Motivation, Related work, Imagine, Conclusion, TPA-PD, Future work
2018/11/13 CS of USTC

3 Motivation
VLSI technology: more ALUs, so computation is relatively cheap.
Keeping them fed is hard. The problem is bandwidth: energy and delay.

4 Motivation
Data-level parallel (DLP) applications: media applications, real-time graphics, signal processing, video processing, scientific computing.
Application characteristics: dense computation, parallelism, locality.

5 Motivation
Application characteristics poorly match conventional architectures: caches, instruction-level parallelism, few arithmetic units.
They are well matched to modern VLSI technology: lots (100's) of ALUs fit on a single chip, and communication bandwidth is the scarce resource.

6 Related work
Vector: large set of temporary values.
DSP: signal register file.
GPU: special purpose.
General-purpose processor: ILP, cache.

7 Related work
Imagine
Prototype HW: 2002 prototype processor; 256 mm² die in 150 nm, 21M transistors; collaboration with TI ASIC.
SW: based on StreamC/KernelC; stream scheduler; communication scheduler.
Merrimac
Stream supercomputer for scientific applications; no prototype.

8 Related work
Cell: prototype HW, stream processor; 221 mm² in 90 nm, 234M transistors.
AMD & ATI: stream processor.
NUDT: Fei Teng 64.
USTC: ACSA TPA.

9 Programming model
Stream: organized as a sequence of records (simple or complex); ordered; finite-length.
Stream vs. array: ordered use.

10 Stream type
Basic stream: an array of records.
Derived stream: a reference to a subset of records in a basic stream.
stream<type> name = basic-stream (start, end, data Dependence, access pattern);
The records are laid out contiguously in the address space.

11 Stream type
Sequential access pattern: y = x (start, end)
Strided access pattern: y = x (start, end, data Dependence, stride)
Indexed access pattern: y = x (start, end, data Dependence, index Stream)
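
The three access patterns can be illustrated with a minimal Python sketch. It is not StreamC syntax: the function names are hypothetical, the data-dependence argument is left out, and the range is assumed to be half-open (start included, end excluded), which the slide does not specify.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sequential(basic: Sequence[T], start: int, end: int) -> List[T]:
    """Derived stream holding records start..end-1 in their original order."""
    return list(basic[start:end])


def strided(basic: Sequence[T], start: int, end: int, stride: int) -> List[T]:
    """Derived stream holding every stride-th record in [start, end)."""
    return list(basic[start:end:stride])


def indexed(basic: Sequence[T], start: int, end: int,
            index_stream: Sequence[int]) -> List[T]:
    """Derived stream whose records are picked by offsets relative to start.

    Assumption: each index is an offset into the window [start, end).
    """
    window = basic[start:end]
    for i in index_stream:
        if not 0 <= i < len(window):
            raise IndexError(f"index {i} falls outside the window")
    return [window[i] for i in index_stream]
```

For a basic stream x = [0, 1, ..., 9], sequential(x, 2, 5) selects records 2 to 4, strided(x, 0, 10, 3) selects every third record, and indexed(x, 2, 8, [3, 0, 5]) gathers records in the order given by the index stream.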

12 Programming model
Stream program
A kernel is a program that performs the same set of operations on each input stream record and produces one or more output streams.
Express a computation as streams flowing through kernels.
Represent applications as a set of computation kernels that consume and produce data streams.
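
A rough Python sketch of this model follows. It is an illustration only, not StreamC/KernelC: run_kernel, scale and add are hypothetical names, and a kernel is modelled as a plain function applied to one record from each input stream at a time.

```python
from typing import Callable, Iterable, Iterator


def run_kernel(kernel: Callable, *streams: Iterable) -> Iterator:
    """Apply the same kernel to each record of the input streams, in order."""
    for records in zip(*streams):
        yield kernel(*records)


def scale(record: int) -> int:
    """Hypothetical kernel: doubles each record."""
    return 2 * record


def add(left: int, right: int) -> int:
    """Hypothetical kernel: adds matching records of two streams."""
    return left + right


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# The stream program: a flows through scale, and the result flows into add.
scaled = run_kernel(scale, a)
result = list(run_kernel(add, scaled, b))
```

The intermediate stream scaled is consumed by the next kernel as it is produced, which is the producer-consumer pattern the stream model is built around.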

13 Programming example
Stereo depth extractor application
Dataflow: Image 0 -> convolve -> convolve -> SAD; Image 1 -> convolve -> convolve -> SAD; SAD -> Depth Map.
Input data: Image 0 and Image 1. Output data: Depth Map.
Operations within a kernel operate on local data. Streams expose data parallelism.

14 Programming example: vector add

15 Why Organize an Application This Way?
Expose parallelism at three levels: ILP within kernels; DLP across stream elements; TLP across sub-streams and across kernels. Keeps 'easy' parallelism easy.
Expose locality in two ways: within a kernel (kernel locality); between kernels (producer-consumer locality).
Put another way, stream programs make communication explicit.

16 Overall Imagine block diagram
Imagine Stream Processor: Stream Controller, Microcontroller, Stream Register File, ALU Clusters 0 to 7, Streaming Memory System, Network Interface.
External: Host Processor, SDRAM.

17 Instructions of Imagine
Stream level

18 Instructions of Imagine
Kernel level: integer/float arithmetic; bitwise logic and comparison; data permutation; stream in/out; loop control.
Operate on packed data like short vectors (SIMD).
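
To make the packed-data point concrete, here is a small Python sketch of a packed add on two 16-bit lanes held in one 32-bit word. The lane width, the wraparound behaviour on overflow, and the helper names are all assumptions for illustration; the slide does not describe Imagine's actual packed formats or overflow handling.

```python
MASK16 = 0xFFFF


def pack16(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16(word: int) -> tuple:
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def packed_add16(a: int, b: int) -> int:
    """Add the two lanes independently; a carry never crosses lanes.

    Assumption: each lane wraps modulo 2**16 on overflow.
    """
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return (hi << 16) | lo
```

One 32-bit operation thus does the work of two 16-bit additions, which is why packed operations matter for 16-bit media workloads.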

19 Architecture of Imagine: Host Interface
All interactions between the host processor and the Imagine core occur via the Host Interface.
Stream instructions can be loaded onto Imagine.
Several status words can be read from Imagine.
Individual data words can be read from Imagine.
Entire data streams can be transferred to/from Imagine.

20 Architecture of Imagine: Stream controller
Responsibilities: handles the data flow between, and the control of, all of the modules on the chip; controls which stream instructions to issue and when they are executed.
Parameters: 32-entry instruction queue.

21 Architecture of Imagine: SRF
Responsibilities: source and destination for all memory operations; source and sink of data for the arithmetic clusters and the network router.
Parameters: 8 banks; 128 KB; software controlled; SDR; SCR; reads/writes a block at a time.

22 Architecture of Imagine SRF

23 Architecture of Imagine SRF

24 Architecture of Imagine: Memory System
Responsibilities: load data from memory to the SRF; store data from the SRF to memory.
Parameters: 4 memory banks; 2 address generators; MSCR.

25 Architecture of Imagine Memory System

26 Architecture of Imagine: Microcontroller
Responsibilities: passing parameters from/to the host interface; loading microprograms into its microcode store; controlling the execution of the microprograms on the arithmetic clusters.
Parameters: 1024×VLIW; 32×32-bit regfiles (UCRF); 2×1-bit regfiles (UCONDRF).

27 Architecture of Imagine Microcontroller

28 Architecture of Imagine: Cluster
Responsibilities: 8 clusters perform identical operations in parallel, controlled by the microcontroller.
Parameters: each cluster has 3 adders, 2 multipliers, 1 divider, 1 scratchpad, 1 jukebox, 1 valid unit and 1 comm unit.
Internal data path width is 32 bits. Each functional unit has its own local register files (LRF). All functional units accept 32-bit inputs and produce 32-bit results. For floating-point operations, the units use the IEEE floating-point format.
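
The idea of eight clusters executing identical operations in lockstep can be sketched in Python as below. The round-robin assignment of record i to cluster i mod 8 is an assumption made for this illustration, and simd_step and cluster_of are hypothetical names; the sketch models only the control pattern, not the functional units.

```python
from typing import Callable, List

NUM_CLUSTERS = 8  # from the slide: 8 clusters perform identical operations


def cluster_of(record_index: int) -> int:
    """Assumed mapping: records are dealt to clusters round-robin."""
    return record_index % NUM_CLUSTERS


def simd_step(op: Callable[[int], int], stream: List[int]) -> List[int]:
    """Process the stream in batches of NUM_CLUSTERS records.

    Within a batch, every cluster applies the same operation to its own
    record, mimicking a single instruction broadcast by the microcontroller.
    """
    out = []
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]
        out.extend(op(record) for record in batch)
    return out
```

A stream of 10 records therefore takes two lockstep steps: the first fills all eight clusters, the second uses only two of them.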

29 Architecture of Imagine: Cluster
Diagram labels: functional units (+, *, /), CU, local register files, cross point, intercluster network, from SRF, to SRF.

30 Streams expose Kernel Locality missed by Vectors
Stream: traverse operations first (all operations for one record, then the next record); smaller working set of temporary values; store and access whole records as a unit; spatial locality of memory references.
Vector: traverse records first (all records for one operation, then the next operation); large set of temporary values; group like elements of records into vectors; read one word of each record at a time.
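
The difference in traversal order, and its effect on the number of live temporaries, can be shown with a small Python sketch. The computation (a*b + c, then squared) and the way temporaries are counted are illustrative assumptions, not taken from the slides.

```python
from typing import List, Tuple

Record = Tuple[int, int, int]


def stream_order(records: List[Record]) -> Tuple[List[int], int]:
    """All operations for one record, then the next record.

    Only two scalar temporaries (t1, t2) are live at any time.
    """
    out, peak_temps = [], 0
    for a, b, c in records:
        t1 = a * b
        t2 = t1 + c
        peak_temps = max(peak_temps, 2)
        out.append(t2 * t2)
    return out, peak_temps


def vector_order(records: List[Record]) -> Tuple[List[int], int]:
    """All records for one operation, then the next operation.

    Each intermediate is a whole vector, so temporaries grow with n.
    """
    t1 = [a * b for a, b, _ in records]
    t2 = [x + c for x, (_, _, c) in zip(t1, records)]
    out = [x * x for x in t2]
    peak_temps = len(t1) + len(t2)  # both vectors are live while t2 is built
    return out, peak_temps
```

Both orders produce the same results; the stream order keeps its temporaries small enough to stay in local registers, while the vector order needs storage proportional to the stream length.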

31 Streams expose Kernel Locality missed by Vectors

32 Mapping App to Imagine: mapping

33 Mapping App to Imagine: compile, stream level

34 Mapping App to Imagine: kernel level

35 Mapping App to Imagine: before

36 Mapping App to Imagine: stream instruction
Memop: load the input stream from memory to the SRF.
Memop: load the ucode stream from memory to the SRF.

37 Mapping App to Imagine: SRF data

38 Mapping App to Imagine: stream instruction
Load ucode: fetch the ucode from the SRF into the microcode store.

39 Mapping App to Imagine: stream instruction
Cluster op: execute the vadd ucode.

40 Mapping App to Imagine: stream instruction
Memop: store the output stream from the SRF to memory.
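
The stream-instruction sequence on slides 36 to 40 can be traced with a toy Python model. Everything here is a simplification for illustration: memory, the SRF and the microcode store are plain dictionaries, the function names are hypothetical, and the vadd microcode is stood in for by a Python function.

```python
def run_vadd_example() -> list:
    """Replay the stream instructions for a vector add and return the result."""
    memory = {
        "a": [1, 2, 3, 4],
        "b": [10, 20, 30, 40],
        "vadd_ucode": lambda x, y: x + y,  # stand-in for the kernel microcode
    }
    srf = {}          # stream register file
    ucode_store = {}  # microcontroller's microcode store

    def memop_load(name):
        srf[name] = memory[name]          # memory -> SRF

    def load_ucode(name):
        ucode_store[name] = srf[name]     # SRF -> microcode store

    def cluster_op(kernel, inputs, output):
        fn = ucode_store[kernel]
        in_streams = [srf[i] for i in inputs]
        srf[output] = [fn(*recs) for recs in zip(*in_streams)]  # SRF -> SRF

    def memop_store(name):
        memory[name] = srf[name]          # SRF -> memory

    memop_load("a")
    memop_load("b")
    memop_load("vadd_ucode")
    load_ucode("vadd_ucode")
    cluster_op("vadd_ucode", ["a", "b"], "c")
    memop_store("c")
    return memory["c"]
```

Note that the clusters only ever read from and write to the SRF; memory is touched solely by the load and store memops at the two ends of the sequence.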

41 Performance: bandwidth hierarchy
SDRAM (2 GB/s) -> Stream Register File (32 GB/s) -> ALU clusters (544 GB/s)

42 Performance
The bandwidth demand of stream programs fits the bandwidth hierarchy of the architecture.
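
As a quick worked check of what "fits" means, the Python sketch below computes the ratios between the three levels using the figures from slide 41 (2, 32 and 544 GB/s). The bottleneck helper and the sample traffic numbers are hypothetical and only illustrate the reasoning.

```python
BANDWIDTH_GB_PER_S = {
    "SDRAM": 2,
    "Stream Register File": 32,
    "ALU clusters": 544,
}


def bandwidth_ratios(levels: dict) -> dict:
    """Bandwidth of each level relative to SDRAM."""
    base = levels["SDRAM"]
    return {name: bw / base for name, bw in levels.items()}


def bottleneck(traffic_gb: dict, levels: dict = BANDWIDTH_GB_PER_S) -> str:
    """Level that takes longest to move its share of the traffic."""
    return max(traffic_gb, key=lambda name: traffic_gb[name] / levels[name])


ratios = bandwidth_ratios(BANDWIDTH_GB_PER_S)
```

The ratios come out as 1 : 16 : 272, so a program keeps the ALUs busy only if each word fetched from SDRAM is reused roughly 16 times through the SRF and 272 times inside the clusters.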

43 Performance
Charts: 16-bit kernels, floating-point kernels, 16-bit applications, floating-point applications.

44 Performance: power
GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

45 Conclusion
Performance: compound stream operations realize >10 GOPS on key applications; can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board).
Power: three-level register hierarchy gives 2-10 GOPS/W.

46 Conclusion
Disadvantage: the programming model. Applications must be rewritten, and programmers need to know details of the hardware.

47 TPA-PD Motivation
Tiled: wire-delay constraints make the centralized structures dominating today's designs difficult; architectural partitioning encourages regularity and re-use.
Application: media applications; scientific computing; irregular control and data access.

48 TPA-PD

49 TPA-PD Instruction set
Stream level; kernel level.
Explicit Data Graph Execution (EDGE): block-oriented; direct target encoding.

50 TPA-PD
No centralized control

51 TPA-PD Programming model
Binary translator; StreamC/KernelC; static.
VLIW (bin) for input; new inst (bin) for output.

52 Future work
TPA-PD architecture; instruction set; C simulator; RTL simulator.

53 Thank you

