Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Topic 9 Memory Banks
Prof. Zhang Gang gzhang@tju.edu.cn
School of Computer Sci. & Tech., Tianjin University, Tianjin, P. R. China

Memory Banks
A memory bank is built for timely and logical data access:
(1) Each logical unit of storage is arranged in a consecutive configuration so that all data can be accessed quickly.
(2) Interleaved memory is another memory-bank format. It allows data to be accessed even faster by placing the corresponding components of memory at the same position across a series of chips, so data can be retrieved across parallel strips instead of being indexed all on one chip.
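The interleaving idea above can be sketched in a few lines. This is an illustrative example, not from the slides: with low-order interleaving, consecutive word addresses map to consecutive banks, so sequential accesses proceed in parallel across banks. The bank count of 8 is an assumption for the demo.

```python
# Low-order interleaving: consecutive word addresses land in
# consecutive banks, so sequential accesses can overlap.
NUM_BANKS = 8  # assumed bank count for this sketch

def bank_and_offset(word_address):
    """Return (bank number, offset within that bank)."""
    return word_address % NUM_BANKS, word_address // NUM_BANKS

# Eight consecutive words fall in eight different banks:
for addr in range(8):
    print(addr, bank_and_offset(addr))
```

With this mapping, address 9 falls in bank 1 at offset 1, i.e. the stream wraps around the banks once every NUM_BANKS words.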

Memory Banks
The behavior of the load/store vector unit is significantly more complicated than that of the arithmetic functional units.
The start-up time for a load is the time to get the first word from memory into a register.
If the rest of the vector can be supplied without stalling, the vector initiation rate equals the rate at which new words are fetched or stored.
The initiation rate is not necessarily one clock cycle, because memory bank stalls can reduce effective throughput.

Memory Banks
Start-up penalties on load/store units are typically higher than those on arithmetic units: over 100 clock cycles on many processors.
For VMIPS we assume a start-up time of 12 clock cycles, the same as the Cray-1.
To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data.
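The figures above imply a simple timing model (a sketch under the slides' assumptions, not a quote from them): once the 12-cycle start-up latency is paid, one word arrives per clock, so an n-element vector load takes roughly start-up + n - 1 cycles when no bank stalls occur.

```python
# Simple vector-load timing model: pay the start-up latency once,
# then receive one word per clock (assuming no bank stalls).
START_UP = 12  # VMIPS load start-up time, per the slides

def vector_load_cycles(n, start_up=START_UP):
    """Cycles to load an n-element vector at one word per clock."""
    return start_up + n - 1

print(vector_load_cycles(64))  # 12 + 63 = 75 cycles
```

For long vectors the start-up cost is amortized: 12 cycles is 16% of a 64-element load but under 2% of a 1024-element one.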

Memory Banks
The memory system must be designed to support high bandwidth for vector loads and stores:
Spread accesses across multiple banks
Control bank addresses independently
Load or store non-sequential words
Support multiple vector processors sharing the same memory
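Why non-sequential (strided) accesses need independently addressed banks can be seen with a small calculation. This is my own illustration (the function name and the gcd argument are not from the slides): a stream with stride s over b interleaved banks only ever touches b / gcd(s, b) distinct banks, so an unlucky stride can funnel every access into one bank.

```python
# How many distinct banks does a strided access stream reach?
# Addresses i*s hit bank (i*s) mod b, which cycles through
# b / gcd(s, b) distinct banks.
from math import gcd

def banks_touched(stride, num_banks):
    """Distinct banks visited by accesses at the given stride."""
    return num_banks // gcd(stride, num_banks)

# Stride 1 spreads over all 8 banks; stride 8 hammers a single bank.
print(banks_touched(1, 8), banks_touched(8, 8))
```

Stride 2 over 8 banks reaches only 4 of them, halving the available bandwidth, which is why vector memory systems often use a prime or otherwise conflict-resistant number of banks.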

Example of Memory Banks
The largest configuration of a Cray T90 (the Cray T932) has 32 processors, each generating 4 loads and 2 stores per cycle.
The processor cycle time is 2.167 ns and the SRAM cycle time is 15 ns. How many memory banks are needed?
Each cycle the processors generate 32 x 6 = 192 memory accesses, and each SRAM bank stays busy for 15 / 2.167 ≈ 7 processor cycles, so 192 x 7 = 1344 banks are needed!
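The slide's arithmetic can be reproduced as a quick check (a sketch; variable names are mine):

```python
# Cray T932 bank-count calculation from the slide.
import math

processors = 32
refs_per_cycle = 4 + 2            # 4 loads + 2 stores per processor, per cycle
cpu_cycle_ns = 2.167
sram_cycle_ns = 15.0

accesses_per_cycle = processors * refs_per_cycle       # 32 x 6 = 192
busy_cycles = math.ceil(sram_cycle_ns / cpu_cycle_ns)  # each bank busy ~7 CPU cycles
banks_needed = accesses_per_cycle * busy_cycles

print(banks_needed)  # 1344
```

The logic: 192 new accesses arrive every processor cycle, and each bank is unavailable for 7 cycles once accessed, so 192 x 7 banks are needed to avoid stalling.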

Exercises
What is a memory bank?
What is an interleaved memory?
What is the meaning of the start-up time for a load?
Why must a memory system support high bandwidth for vector loads and stores?