Lecture on High Performance Processor Architecture (CS05162)

Presentation transcript:

Lecture on High Performance Processor Architecture (CS05162)
DLP Architecture Case Study: Stream Processor
Xu Guang, xuguang5@mail.ustc.edu.cn
Fall 2007
University of Science and Technology of China, Department of Computer Science and Technology
CS258 S99

Discussion Outline
- Motivation
- Related work
- Imagine
- Conclusion
- TPA-PD
- Future work
2018/11/13 CS of USTC

Motivation
- VLSI technology
  - More ALUs; computation is relatively cheap
  - Keeping them fed is hard
- The problem is bandwidth
  - Energy
  - Delay

Motivation
- Data-level parallel (DLP) applications
  - Media applications
  - Real-time graphics
  - Signal processing
  - Video processing
  - Scientific computing
- Application characteristics
  - Dense computation
  - Parallelism
  - Locality

Motivation
- Application characteristics
  - Poorly matched to conventional architectures
    - Cache
    - Instruction-level parallelism
    - Few arithmetic units
  - Well matched to modern VLSI technology
    - Lots (100s to 1000s) of ALUs fit on a single chip
    - Communication bandwidth is the scarce resource

Related work
- Vector: large set of temporary values
- DSP: signal register file
- GPU: special-purpose
- General-purpose processor: ILP, cache

Related work
- Imagine
  - Prototype HW
    - 2002 prototype processor
    - 256 mm² die in 150 nm, 21M transistors
    - Collaboration with TI ASIC
  - SW based on StreamC/KernelC
    - Stream scheduler
    - Communication scheduler
- Merrimac
  - Stream supercomputer
  - For scientific applications
  - No prototype

Related work
- Cell
  - Prototype HW
  - Stream processor
  - 221 mm² in 90 nm, 234M transistors
- AMD & ATI
  - Stream processor
- NUDT
  - Fei Teng 64
- USTC
  - ACSA TPA

Programming model
- Stream
  - Organized as a sequence of records (simple or complex)
  - Ordered, finite-length
- Stream vs. array
  - Ordered use

Stream type
- Basic stream: an array of records
- Derived stream: a reference to a subset of records in a basic stream
  - stream<type> name = basic-stream(start, end, dataDependence, access pattern);
- Records are laid out contiguously in the address space

Stream type
- Sequential access pattern: y = x(start, end)
- Strided access pattern: y = x(start, end, dataDependence, stride)
- Indexed access pattern: y = x(start, end, dataDependence, indexStream)
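
The three access patterns can be modelled in a few lines of Python. This is only an illustrative sketch, not StreamC syntax: the function names, the half-open [start, end) range, and the choice to interpret index-stream entries as offsets from start are all assumptions, and the data-dependence argument is left out.

```python
# Illustrative model of derived streams (not StreamC syntax).
# A basic stream is a list of records; a derived stream references a
# subset of those records selected by an access pattern.

def sequential(basic, start, end):
    """Records start..end-1 in order (half-open range is an assumption)."""
    return [basic[i] for i in range(start, end)]

def strided(basic, start, end, stride):
    """Every stride-th record in [start, end)."""
    return [basic[i] for i in range(start, end, stride)]

def indexed(basic, start, end, index_stream):
    """Records chosen by an index stream of offsets relative to start."""
    return [basic[start + i] for i in index_stream if 0 <= i < end - start]

x = list(range(100, 110))            # basic stream of 10 records
print(sequential(x, 2, 6))           # [102, 103, 104, 105]
print(strided(x, 0, 10, 3))          # [100, 103, 106, 109]
print(indexed(x, 0, 10, [7, 1, 4]))  # [107, 101, 104]
```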

Programming model
- Stream program
- Kernel: a program that performs the same set of operations on each input stream record and produces one or more output streams
- Express a computation as streams flowing through kernels
- Represent applications as a set of computation kernels that consume and produce data streams
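
A rough way to picture "streams flowing through kernels" is a chain of Python generators. The kernel names and the arithmetic they perform are made up for illustration; the point is only that each kernel applies the same operations to every record and hands its output stream to the next kernel.

```python
# Each kernel applies the same operations to every record of its input
# stream and yields an output stream. Kernel names are hypothetical.

def scale_kernel(in_stream, factor):
    for record in in_stream:
        yield record * factor

def offset_kernel(in_stream, delta):
    for record in in_stream:
        yield record + delta

def stream_program(data):
    # The output stream of one kernel is consumed directly by the next,
    # so intermediate records never need to be written back to `data`.
    scaled = scale_kernel(data, 2)
    shifted = offset_kernel(scaled, 1)
    return list(shifted)

print(stream_program([1, 2, 3]))  # [3, 5, 7]
```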

Programming example
- Stereo depth extractor application
[Figure: kernel pipeline. Image 0 and Image 1 (input data) each pass through two convolve kernels and feed a SAD kernel, which produces the Depth Map (output data)]
- Operations within a kernel operate on local data
- Streams expose data parallelism

Programming example
- Vector add
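
As a stand-in for a vector-add kernel, the following Python sketch adds two input streams record by record. It is not StreamC/KernelC source; the function name and the equal-length check are assumptions made for the example.

```python
def vadd_kernel(a_stream, b_stream):
    """Add two input streams record by record into one output stream."""
    if len(a_stream) != len(b_stream):
        raise ValueError("input streams must have the same length")
    # The same operation (one add) is applied to every pair of records.
    return [a + b for a, b in zip(a_stream, b_stream)]

print(vadd_kernel([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```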

Why Organize an Application This Way?
- Expose parallelism at three levels
  - ILP within kernels
  - DLP across stream elements
  - TLP across sub-streams and across kernels
  - Keeps ‘easy’ parallelism easy
- Expose locality in two ways
  - Within a kernel: kernel locality
  - Between kernels: producer-consumer locality
- Put another way, stream programs make communication explicit

Overall
[Figure: Imagine block diagram. Labels: Host Processor, Stream Controller, Microcontroller, ALU Cluster 0 to ALU Cluster 7, Stream Register File, Streaming Memory System, SDRAM, Network Interface, Imagine Stream Processor]

Instructions of Imagine
- Stream level

Instructions of Imagine
- Kernel level
  - Integer/float arithmetic
  - Bitwise logic and comparison
  - Data permutation
  - Stream in/out
  - Loop control
  - Operate on packed data like short vectors (SIMD)
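
Operating on packed data means treating one machine word as a short vector of narrower lanes. The sketch below assumes two 16-bit lanes in a 32-bit word with wraparound arithmetic; the lane width, the wraparound behaviour, and all function names are assumptions for illustration.

```python
MASK16 = 0xFFFF

def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack16(word):
    """Split a 32-bit word into its (high, low) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16

def packed_add16(a, b):
    """Add two 32-bit words as two independent 16-bit lanes."""
    a_hi, a_lo = unpack16(a)
    b_hi, b_lo = unpack16(b)
    # Each lane wraps on its own; no carry crosses from low to high lane.
    return pack16(a_hi + b_hi, a_lo + b_lo)

w = packed_add16(pack16(1, 0xFFFF), pack16(2, 1))
print(unpack16(w))  # (3, 0)
```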

Architecture of Imagine
- Host Interface
  - All interactions between the host processor and the Imagine core occur via the Host Interface
  - Stream instructions can be loaded onto Imagine
  - Several status words can be read from Imagine
  - Individual data words can be read from Imagine
  - Entire data streams can be transferred to/from Imagine

Architecture of Imagine
- Stream controller
  - Responsibilities
    - Handles the data flow between, and the control of, all of the modules on the chip
    - Controls which stream instructions to issue and when they are executed
  - Parameters
    - 32-entry instruction queue

Architecture of Imagine
- SRF
  - Responsibilities
    - Source and destination for all memory operations
    - The source and sink of data to the arithmetic clusters and the network router
  - Parameters
    - 8 banks
    - 128 KB
    - Software control
    - SDR
    - SCR
    - Read/write a block at a time

Architecture of Imagine
- SRF
[Figure: SRF]

Architecture of Imagine
- Memory System
  - Responsibilities
    - Load data from memory to SRF
    - Store data from SRF to memory
  - Parameters
    - 4 memory banks
    - 2 address generators
    - MSCR

Architecture of Imagine
- Memory System

Architecture of Imagine
- Microcontroller
  - Responsibilities
    - Passing parameters from/to the host interface
    - Loading microprograms into its microcode store
    - Controlling the execution of the microprograms on the arithmetic clusters
  - Parameters
    - 1024 × VLIW
    - 32 × 32-bit regfiles (UCRF)
    - 2 × 1-bit regfiles (UCONDRF)

Architecture of Imagine
- Microcontroller

Architecture of Imagine
- Cluster
  - Responsibilities
    - 8 clusters perform identical operations in parallel
    - Controlled by the microcontroller
  - Parameters
    - Each cluster has 3 adders, 2 multipliers, 1 divider, 1 scratchpad, 1 jukebox and 1 valid unit, and 1 comm unit
    - Internal data path width of 32 bits
    - Each functional unit has its own local register files (LRF)
    - All functional units accept 32-bit inputs and produce 32-bit results
    - For floating-point operations, the units use IEEE floating-point format
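
The statement that 8 clusters perform identical operations in parallel can be pictured with a small lockstep model. Only the cluster count comes from the slide; the round-robin assignment of record i to cluster i % 8 and the function name are assumptions made for the sketch.

```python
NUM_CLUSTERS = 8  # from the slide; everything else here is an assumption

def run_simd(kernel_op, stream):
    """Apply kernel_op to a stream, NUM_CLUSTERS records per step.

    Record i is handled by cluster i % NUM_CLUSTERS (an assumed
    round-robin assignment), and every cluster executes the same
    operation in each step.
    """
    out = []
    steps = 0
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]
        out.extend(kernel_op(r) for r in batch)  # one lockstep step
        steps += 1
    return out, steps

result, steps = run_simd(lambda r: r * r, list(range(20)))
print(steps)       # 3
print(result[:4])  # [0, 1, 4, 9]
```

With 20 records the model needs three steps: two full batches of 8 and one partial batch of 4, in which some clusters have no record to process.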

Architecture of Imagine
- Cluster
[Figure: cluster datapath. Labels: +, *, /, CU, Local Register File, Cross Point, Intercluster Network, From SRF, To SRF]

Streams expose Kernel Locality missed by Vectors
- Stream
  - Traverse operations first
    - All operations for one record, then next record
    - Smaller working set of temporary values
  - Store and access whole records as a unit
    - Spatial locality of memory references
- Vector
  - Traverse records first
    - All records for one operation, then next operation
    - Large set of temporary values
  - Group like-elements of records into vectors
    - Read one word of each record at a time
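
The difference in working-set size between the two traversal orders can be seen on a toy computation, (a + b) * c per record. The records and function names below are made up; the sketch only counts how many intermediate values are alive at once under each ordering.

```python
records = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]  # made-up (a, b, c) records

def stream_order(recs):
    """All operations for one record, then the next record."""
    out, peak_temps = [], 0
    for a, b, c in recs:
        t = a + b                      # one temporary live at a time
        peak_temps = max(peak_temps, 1)
        out.append(t * c)
    return out, peak_temps

def vector_order(recs):
    """One operation across all records, then the next operation."""
    temps = [a + b for a, b, _ in recs]  # a whole vector of temporaries
    out = [t * c for t, (_, _, c) in zip(temps, recs)]
    return out, len(temps)

print(stream_order(records))  # ([9, 54, 135], 1)
print(vector_order(records))  # ([9, 54, 135], 3)
```

Both orders produce the same results, but the vector order keeps one temporary per record alive between operations, while the stream order needs only one at a time.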

Streams expose Kernel Locality missed by Vectors

Mapping App to Imagine
- Mapping

Mapping App to Imagine
- Compile: stream level

Mapping App to Imagine
- Kernel level

Mapping App to Imagine
- Before

Mapping App to Imagine
- Stream inst
  - Memop: load stream input from memory to SRF
  - Memop: load stream ucode from memory to SRF

Mapping App to Imagine
- SRF data

Mapping App to Imagine
- Stream inst
  - Load ucode: fetch ucode from SRF to microcode store

Mapping App to Imagine
- Stream inst
  - Cluster op: execute ucode vadd

Mapping App to Imagine
- Stream inst
  - Memop: store stream output from SRF to memory
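
The stream-instruction sequence walked through on the preceding slides (move input and ucode into the SRF, move ucode into the microcode store, run vadd on the clusters, move the output back to memory) can be replayed with a toy model. The dictionaries and function names are hypothetical stand-ins for hardware state, not Imagine's actual instruction encoding.

```python
# Toy model of the vadd stream-instruction sequence. The dictionaries and
# function names are hypothetical stand-ins for hardware state.
memory = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "ucode": "vadd"}
srf = {}                # stream register file
microcode_store = {}    # kernel code held by the microcontroller

def memop_load(name):
    """Memop: copy a stream from memory into the SRF."""
    srf[name] = memory[name]

def load_ucode(name):
    """Load ucode: move kernel code from the SRF to the microcode store."""
    microcode_store[name] = srf.pop(name)

def cluster_op(ucode, in0, in1, out):
    """Cluster op: run the loaded kernel on two SRF streams."""
    if microcode_store[ucode] != "vadd":
        raise ValueError("only the vadd kernel is modelled")
    srf[out] = [x + y for x, y in zip(srf[in0], srf[in1])]

def memop_store(name):
    """Memop: copy the output stream from the SRF back to memory."""
    memory[name] = srf[name]

memop_load("a")                       # stream input: memory -> SRF
memop_load("b")
memop_load("ucode")                   # stream ucode: memory -> SRF
load_ucode("ucode")                   # SRF -> microcode store
cluster_op("ucode", "a", "b", "out")  # clusters execute vadd
memop_store("out")                    # stream output: SRF -> memory
print(memory["out"])                  # [11, 22, 33, 44]
```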

Performance
- Bandwidth hierarchy
  - SDRAM: 2 GB/s
  - Stream Register File: 32 GB/s
  - ALU cluster: 544 GB/s
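
To make the steepness of the hierarchy concrete, the ratios between levels can be computed directly. The pairing of each figure with a level (2 GB/s for SDRAM, 32 GB/s for the Stream Register File, 544 GB/s at the ALU clusters) is read from the slide's figure labels and should be treated as an assumption.

```python
# Bandwidth figures from the slide; the level-to-number pairing is assumed.
bandwidth_gb_s = {
    "SDRAM": 2,
    "Stream Register File": 32,
    "ALU cluster": 544,
}

base = bandwidth_gb_s["SDRAM"]
ratios = {level: bw / base for level, bw in bandwidth_gb_s.items()}
print(ratios)
# {'SDRAM': 1.0, 'Stream Register File': 16.0, 'ALU cluster': 272.0}
```

Under that reading, each word fetched from SDRAM would have to be reused about 16 times from the SRF, and about 272 times inside the clusters, to keep every level of the hierarchy busy.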

Performance
- Bandwidth demand of stream programs fits the bandwidth hierarchy of the architecture

Performance
[Figure: performance chart. Labels: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]

Performance
- Power
[Figure: power efficiency in GOPS/W. Values: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9]

Conclusion
- Performance
  - Compound stream operations realize >10 GOPS on key applications
  - Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
- Power
  - Three-level register hierarchy gives 2-10 GOPS/W

Conclusion
- Disadvantage
  - Programming model
    - Applications must be rewritten
    - Programmers need to know details of the hardware

TPA-PD
- Motivation
  - Tiled
    - Wire-delay constraints
    - The centralized structures that dominate today’s designs become difficult
    - Architectural partitioning encourages regularity and reuse
  - Application
    - Media applications
    - Scientific computing
    - Irregular control and data access

TPA-PD

TPA-PD
- Instruction set
  - Stream level
  - Kernel level
    - Explicit Data Graph Execution (EDGE)
    - Block-oriented
    - Direct target encoding

TPA-PD
- No centralized control

TPA-PD
- Programming model
  - Binary translator
  - StreamC/KernelC
  - Static VLIW (bin) for input
  - New inst (bin) for output

Future work
- TPA-PD
  - Architecture
  - Instruction set
  - C simulator
  - RTL simulator

Thank you