CS 812 High Level Design & Modeling of Digital Systems MEMORY SYNTHESIS Bhuvan Middha (csu98133) Arun Kejariwal (eeu98172)
Presentation Plan Motivation Impact of Memory Architecture Decisions Optimizations in Memory Synthesis Memory Assignment of array variables Scratch-Pad Memory Conclusion References
Motivation The rate of performance improvement is different for the CPU and for memory. [Figure: CPU speed vs. memory speed plotted against year]
Impact on Processor Pipeline [Figure: three overlapping instructions flowing through the IF, Dec, ALU, MEM, WB pipeline stages] Clock cycle is determined by the slowest pipeline stage
Impact of Memory Architecture Decisions Area: 50-70% of an ASIC/ASIP may be memory Performance: 10-90% of system performance may be memory related Power: 25-40% of system power may be memory related
Issues in Memory Synthesis Number of distributed registers Number of register files Number of register file ports On-chip or off-chip memory Cache parameters Cache vs. scratch pad Number of memory ports Memory bus bandwidth Data organization and partitioning
Optimizations in Memory Synthesis Code optimizations: Read-Modify-Write (R-M-W) mode, clustering of scalar variables, reordering, hoisting, loop transformations, memory assignment of array variables Hardware optimizations: scratch pad, banking
Storing Multi-dimensional Arrays: Row-major int X[4][4]; [Figure: the logical 4x4 array mapped to physical memory locations 0-15 in row-major order]
Storing Multi-dimensional Arrays: Column-major int X[4][4]; [Figure: the logical 4x4 array mapped to physical memory locations 0-15 in column-major order]
Storing Multi-dimensional Arrays: Tile-based int X[4][4]; [Figure: the logical 4x4 array mapped to physical memory locations 0-15 tile by tile]
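To make the three layouts concrete, here is a minimal C sketch (not from the slides) that computes the linear memory offset of X[i][j] for a 4x4 array under each layout; the 2x2 tile size is an assumption, since the slide does not fix one.

#include <stdio.h>

#define N 4
#define T 2   /* assumed tile edge; the slide does not specify one */

int row_major(int i, int j)    { return i * N + j; }   /* rows are contiguous    */
int column_major(int i, int j) { return j * N + i; }   /* columns are contiguous */

/* Tiles are laid out row-major, and elements inside a tile are row-major. */
int tile_based(int i, int j) {
    int tile_row = i / T, tile_col = j / T;          /* which tile              */
    int in_row   = i % T, in_col   = j % T;          /* position inside the tile */
    int tile_index = tile_row * (N / T) + tile_col;
    return tile_index * (T * T) + in_row * T + in_col;
}

int main(void) {
    printf("X[1][2]: row-major %d, column-major %d, tile-based %d\n",
           row_major(1, 2), column_major(1, 2), tile_based(1, 2));
    return 0;
}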
Array Layout and Data Cache int a[1024]; int b[1024]; int c[1024]; ... for (i = 0; i < N; i++) c[i] = a[i] + b[i]; [Figure: a, b and c placed back to back in memory, so a[i], b[i] and c[i] all map to the same line of the direct-mapped, 512-word data cache] Problem: every access leads to a cache miss
Data Alignment int a[1024]; int b[1024]; int c[1024]; ... for (i = 0; i < N; i++) c[i] = a[i] + b[i]; [Figure: a DUMMY pad inserted after a and after b shifts b and c so that a[i], b[i] and c[i] map to different lines of the direct-mapped, 512-word data cache] Data alignment avoids cache conflicts
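A minimal sketch of the padding idea, assuming one word per cache line, a 512-word direct-mapped cache indexed by word address mod 512, and a linker that places these globals contiguously in declaration order (none of this is guaranteed by C itself):

int a[1024];
int pad1[1];          /* one-word pad: shifts b by one cache set      */
int b[1024];
int pad2[1];          /* one-word pad: shifts c by one more cache set */
int c[1024];

void vadd(int n) {
    /* a[i] maps to set i % 512, b[i] to (i+1) % 512, c[i] to (i+2) % 512,
     * so the three accesses in each iteration no longer evict one another. */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}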
Data Layout Transformation Splitting structs into individual arrays (must account for pointer arithmetic and dereferencing) Clustering of arrays
Motivating Example struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } Loop 1 averages the p[i].a fields; Loop 2 updates p[i].b and q[i]
Cache Performance: Loop 1 struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 1 - lines 0-3 hold p[0].a p[0].b through p[3].a p[3].b; the p[i].b words fetched alongside p[i].a are useless data]
Cache Performance: Loop 2 struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 2 - lines 0-1 hold p[0].a p[0].b and p[1].a p[1].b, line 2 holds q[0] q[1]; the p[i].a words are useless data]
Cache Performance struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } Loop 1: 1000 cache misses for p[i].a Loop 2: 1500 cache misses (1000 misses for p[i].b, 500 misses for q[i]) Cache miss rate: 62.5% (2500 misses out of 4000 accesses)
Transformed Data Layout Original: struct x { int a; int b; } p [1000]; int q [1000]; Transformed: struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; }
Cache Performance: Loop 1 struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 1 - line 0 holds a[0] a[1], line 1 holds a[2] a[3]; no useless data in the cache]
Cache Performance: Loop 2 struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 2 - line 0 holds r[0].q r[0].b, line 1 holds r[1].q r[1].b; no useless data in the cache]
Cache Performance struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; } Loop 1: 500 cache misses for a[i] (each fetched line now holds two useful words) Loop 2: 1000 cache misses Cache miss rate: 37.5% (1500 misses out of 4000 accesses)
Clustering of Arrays int a[16], b[16], c[16] for i = 0 to 7 a[i] = b[i+3] + 3 for j = 0 to 15 a[j] = b[j] * c[j] [Figure: memory layout of a, b and c (16 words each), annotated 8 + 16 = 24]
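One common reading of array clustering is to interleave arrays that are always accessed at the same index into a single array of records, so that each fetched line carries useful data from both; the C sketch below illustrates that idea on the slide's loops, but it is only an illustration and not necessarily the exact transformation the slide's figure depicts.

struct bc { int b, c; };   /* b[i] and c[i] are used together in the second loop */

int a[16];
struct bc bc[16];          /* replaces int b[16], c[16] */

void kernel(void) {
    for (int i = 0; i < 8; i++)
        a[i] = bc[i + 3].b + 3;       /* was: a[i] = b[i+3] + 3 */
    for (int j = 0; j < 16; j++)
        a[j] = bc[j].b * bc[j].c;     /* was: a[j] = b[j] * c[j] */
}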
Scratch Pad Memory Data memory residing on chip Address space is disjoint from that of off-chip memory Shares the same address and data bus as off-chip memory Guaranteed small access time, since there are no read/write misses
Memory Address Space [Figure: the CPU sees one address space; addresses 0 to P-1 map to the on-chip memory (1-cycle access), while addresses P to N-1 map to off-chip memory reached through the on-chip data cache - 1 cycle on a hit, 10-20 cycles on a miss]
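As a concrete illustration of the disjoint address ranges, a small, heavily used array can be bound to the on-chip range at link time. The sketch below assumes a GCC-style toolchain and a linker script that maps a section named .scratchpad to addresses 0..P-1; the array names and sizes are placeholders.

/* coeffs is small and heavily reused, so it is placed in the on-chip SRAM
 * range (0 .. P-1, 1-cycle access); samples stays in off-chip memory
 * (P .. N-1) and is reached through the data cache (1 cycle on a hit,
 * 10-20 cycles on a miss). */
int coeffs[64] __attribute__((section(".scratchpad")));  /* on-chip scratch pad */
int samples[1 << 16];                                    /* off-chip, cached    */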
Scratch Pad Model Organization of the scratch pad memory No tag comparison is needed A priori knowledge of the memory objects is an added advantage The scratch pad memory consists of a data array unit, a decoder unit and a peripheral unit
Why Scratchpad? Unordered array variables and scalars lead to a large number of conflict misses in the cache Accesses are data dependent, so data layout techniques are ineffective Example (a compilable sketch follows): char BrightnessLevel[512][512] int Hist[256] for i = 0 to 511 for j = 0 to 511 level = BrightnessLevel[i][j] Hist[level] = Hist[level] + 1
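A C version of the histogram example: the index level is known only at run time, so no static data layout can keep the Hist accesses from conflicting in the cache, while the small, heavily accessed Hist array is an ideal scratch-pad candidate.

char BrightnessLevel[512][512];   /* large image: stays in off-chip memory       */
int  Hist[256];                   /* small, touched every iteration: scratch pad */

void histogram(void) {
    for (int i = 0; i < 512; i++)
        for (int j = 0; j < 512; j++) {
            /* The index depends on the image data, so it cannot be predicted
             * or realigned at compile time. */
            int level = (unsigned char)BrightnessLevel[i][j];
            Hist[level] = Hist[level] + 1;
        }
}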
Data Partitioning Minimize the interference between different variables in the data cache Partitioning of variables is governed by the following code characteristics: scalar variables and constants, size of arrays, lifetime of variables, access frequency of variables, loop conflicts
Access Frequency of Variables and Loop Conflicts Variable Access Count (VAC) Interference Access Count (IAC) Interference Factor (IF): IF(u) = VAC(u) + IAC(u) Map variables with high IF values into the scratch pad memory Loop Conflict Factor (LCF) Map variables with high LCF values to the scratch pad memory
Formulation of the Partitioning Problem Total Conflict Factor (TCF): TCF(u) = IF(u) + LCF(u) Given a set of n arrays with corresponding TCF values, find an optimal subset such that the total size <= the size of the SRAM and the total TCF value is maximized Similar to the knapsack problem, except that several arrays with non-intersecting lifetimes can share the same SRAM space (a sketch follows)
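A minimal C sketch of the basic 0/1-knapsack view of the problem (ignoring the lifetime-sharing refinement noted above); the capacity, array sizes and TCF values are placeholders.

#include <string.h>

#define SRAM_WORDS 1024   /* assumed scratch-pad capacity (placeholder) */

/* dp[c] = best total TCF achievable using at most c words of SRAM. */
static long dp[SRAM_WORDS + 1];

/* size[i]: words occupied by array i; tcf[i]: its Total Conflict Factor.
 * Returns the maximum total TCF of a subset of arrays that fits in SRAM. */
long choose_arrays(int n, const int size[], const long tcf[]) {
    memset(dp, 0, sizeof dp);
    for (int i = 0; i < n; i++)                      /* each array used 0 or 1 times */
        for (int c = SRAM_WORDS; c >= size[i]; c--)  /* classic 0/1 knapsack order   */
            if (dp[c - size[i]] + tcf[i] > dp[c])
                dp[c] = dp[c - size[i]] + tcf[i];
    return dp[SRAM_WORDS];
}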
Conclusion Z-buffering - graphics Stream buffers - data prefetching Stride prediction tables - predict memory references Inter-array windowing - multi-dimensional arrays
References Books: P. Panda, N. Dutt, A. Nicolau – Memory Issues in Embedded Systems-on-Chip: Optimization and Exploration, Kluwer Academic Publishers, 1999; F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle – Custom Memory Management Methodology, Kluwer Academic Publishers, 1998 Survey Paper: P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle – Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, April 2001