Performance Modeling and Validation of the C66x DSP Multilevel Cache Memory System
Rama Venkatasubramanian, Pete Hippleheuser, Oluleye Olorode, Abhijeet Chachad, Dheera Balasubramanian, Naveen Bhoria, Jonathan Tran, Hung Ong and David Thompson
Texas Instruments Inc, Dallas TX

Pre-silicon Performance Validation
Improved IPC
– Better energy efficiency
Processor memory systems are becoming more and more complex
– Multicore vs. clock speed: the trend seen in industry
Memory systems are becoming difficult to validate.
Cost of a bug fix: increases exponentially the longer the bug goes undetected through the design flow.
Performance validation goal: identify and fix all performance bugs during the design development phase.
– Modeling and validation of a multi-level memory system is complex.
Novelty of this work:
– A unique latency-crediting scheme allows pre-silicon performance validation with minimal increase in CPU simulation time.
– A reusable performance validation framework across the DV stack (multiple levels of design verification).

C66x DSP Memory System Architecture
(Figure: block diagram of the C66x DSP core with fetch/dispatch/execute stages, register files A and B, L/M/S/D functional units, 32KB L1P SRAM/cache, 32KB L1D SRAM/cache, 1MB L2 SRAM/cache, DMA, prefetch, embedded debug, emulation, interrupt controller, and power management.)
Two levels of on-die caches:
– 32KB direct-mapped L1P instruction cache
– 32KB 2-way set-associative writeback L1D cache
– 1MB 4-way private unified L2 cache
L1/L2 configurable as SRAM, cache, or both.
Controllers operate at the CPU clock rate to minimize CPU read latency.
DMA:
– Slave DMA engine
– Internal DMA engine
Stream-based prefetch engine.
Coherency: all-inclusive coherent memory system.
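As a concrete (and purely illustrative) rendering of these parameters, the sketch below encodes the hierarchy as a small Python table of the kind an expected-latency model can be driven from. The field names, line size, and derived counts are our assumptions, not TI specifications:

    # Illustrative sketch: per-core memory hierarchy parameters from this slide.
    # Field names are our own; the 64-byte line size is an assumption.
    MEMORY_HIERARCHY = {
        "L1P": {"size_kb": 32,   "assoc": 1, "policy": "read-only"},  # direct mapped
        "L1D": {"size_kb": 32,   "assoc": 2, "policy": "writeback"},
        "L2":  {"size_kb": 1024, "assoc": 4, "policy": "unified"},
    }

    def line_count(level, line_bytes=64):
        """Number of cache lines at a level (line size assumed, not from the slide)."""
        cfg = MEMORY_HIERARCHY[level]
        return cfg["size_kb"] * 1024 // line_bytes

    print(line_count("L2"))  # 16384 lines under the 64-byte assumption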

Performance Bottlenecks
Typical architectural constraints in a processor memory system:
Memory system pipeline stalls
– Stalls due to movement of data (controlled by the availability of buffer space)
– Stall conditions to avoid a hazard scenario
Arbitration points
– Memory access arbitrated between multiple requestors, or data arbitration on a shared bus
FIFOs
Bank stalls and bank conflicts
– Bank conflicts arise because burst-mode SRAMs are used to implement the memories (see the sketch after this list)
Bandwidth management architecture
– Bandwidth requirement dictated by an application (real-time applications)
– A minimum bandwidth may have to be guaranteed
Miscellaneous stalls
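To make the bank-conflict class of stalls concrete, here is a toy Python sketch; the bank count and the address-to-bank mapping are assumptions for illustration, not the C66x banking scheme:

    # Toy bank-conflict check; NUM_BANKS and the interleave granularity are assumed.
    NUM_BANKS = 8
    INTERLEAVE_BYTES = 16

    def bank_of(addr):
        """Map a byte address to a bank under simple low-order interleaving."""
        return (addr // INTERLEAVE_BYTES) % NUM_BANKS

    def conflicts(addr_a, addr_b):
        """Two concurrent accesses stall each other iff they hit the same bank."""
        return bank_of(addr_a) == bank_of(addr_b)

    print(conflicts(0x0000, 0x0080))  # True: both map to bank 0, so one access stalls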

Performance Validation Framework
Implementation:
– Theoretical analysis based on the system microarchitecture
– The model framework was developed in the "Specman-e" language
– Overlaid on top of the functional verification environment
– Complex, but scalable architecture
Overall goal:
– Identify any performance bottlenecks in the memory system pipeline
– Measure worst-case latency for all transactions
– Ensure there are no blocking/hang scenarios at arbitration points
– Re-usable framework across the DV stack

Performance Validation Framework (contd.)
The model probes into the design: all controllers, the internal interfaces, etc.
It measures the number of cycles for each transaction, whether initiated by the CPU, DMA, or cache operations.
Stalls at arbitration points, bandwidth management, etc. are tracked.
A novel latency-credit-based transaction modeling system was developed to determine the true latency incurred by a transfer in the system.
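The actual model is written in Specman-e on top of the functional DV environment; the Python sketch below only re-expresses the bookkeeping idea, with all names and structure assumed. Each transfer accumulates its raw cycle count plus a list of creditable stall cycles observed at the probes:

    # Sketch of per-transaction latency bookkeeping (illustrative, not the TI model).
    from dataclasses import dataclass, field

    @dataclass
    class Transfer:
        initiator: str                 # "CPU", "DMA", or a cache operation
        start_cycle: int
        end_cycle: int = 0
        credits: list = field(default_factory=list)  # (probe, stall_cycles) pairs

        def raw_latency(self):
            """Total flight time observed inside the memory system."""
            return self.end_cycle - self.start_cycle

        def credit(self, probe, stall_cycles):
            """Record stall cycles seen at a pipeline stage or arbitration point."""
            self.credits.append((probe, stall_cycles))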

Example 1 – Single Traffic Stream
CPU load from L2 SRAM: a miss in the L1I cache sends the request to the unified L2 controller.
– Ex: the transaction goes through A3, P0, and P1, reads the data from L2 SRAM, and the data is returned to the program memory controller.
The flight time for the entire transfer is calculated inside the memory system.
(Figure: pipeline stages are rectangles; arbitration points are circles. A FIFO is shown for illustration purposes.)
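Continuing the Transfer sketch above, the single-stream walk of this example might be recorded as follows; the cycle counts and stall values are invented for illustration:

    # Invented cycle numbers for the A3 -> P0 -> P1 walk of Example 1.
    t = Transfer(initiator="CPU", start_cycle=100)
    t.credit("A3", 2)   # lost two cycles arbitrating against another requestor
    t.credit("P0", 0)   # no stall at this pipeline stage
    t.credit("P1", 1)   # one buffer-full stall cycle
    t.end_cycle = 112
    print(t.raw_latency())  # 12-cycle flight time inside the memory system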

Latency Crediting Methodology
The model tracks the transfer through the system, along with buffer space availability in the pipeline stages and arbitration points.
Assume:
– Total flight time for the transfer within the L2 controller = t_L2lat
– Pipeline stall cycles = t_I0, t_I1, t_P0, t_P1, etc.
– Arbitration stall cycles = t_A0, t_A3, etc.
– Unused arbitration stall cycles = t_A1 = t_A2 = 0 (arbitration path not taken)
– The adjusted latency t_AdjLat inside the L2 controller for this transfer is:
  t_AdjLat = t_L2lat - (t_I0 + t_I1 + t_P0 + t_P1 + ...) - (t_A0 + t_A3 + ...)
Ideally, the adjusted latency for the transfer should equal the pipeline depth inside the controller.
– Measuring latency to that level of accuracy would require a cycle-accurate performance validation model, which is impractical.
Hence the adjusted latency for each transfer is measured and checked to be within an acceptable latency defined by the architecture.
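In code form (continuing the Transfer sketch above), the adjusted latency subtracts every credited stall cycle from the raw flight time, and the checker only requires the result to stay within an architecture-defined bound; the bound used here is a placeholder:

    # Adjusted latency = raw flight time minus every credited stall cycle.
    def adjusted_latency(transfer):
        return transfer.raw_latency() - sum(c for _, c in transfer.credits)

    def check(transfer, acceptable_latency):
        """Flag a transfer whose credited latency exceeds the architectural bound."""
        assert adjusted_latency(transfer) <= acceptable_latency, \
            f"possible performance bug on {transfer.initiator} path"

    check(t, acceptable_latency=9)  # 12 - (2 + 0 + 1) = 9 cycles: passes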

Example 2: Multiple Concurrent Traffic Streams
Three concurrent streams:
– A CPU program read from L2 SRAM
– A CPU data read over the MDMA path (through the FIFO)
– A coherence transaction, say a writeback-invalidate operation, which arbitrates for the L2 cache, checks for a hit or miss, writes back the data (through the MDMA path), and invalidates the cache entry
The model has to be aware of the interactions of the pipeline stages and apply credits accordingly.
There are millions of functional tests in the regression suite; every conceivable traffic type is inferred by the model and tracked.
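The attribution subtlety can be sketched as well, continuing the same illustrative model: when two transfers collide at a shared arbitration point, only the transfer that actually waited receives a credit:

    # Attribution sketch (assumed, not the TI scheme): the arbitration loser is
    # credited for the cycles it spent waiting; the winner accrues nothing.
    def arbitrate(winner, loser, probe, wait_cycles):
        loser.credit(probe, wait_cycles)  # loser's extra latency is excused
        # winner proceeds uncredited: its path was genuinely unobstructed

    prog_read = Transfer(initiator="CPU", start_cycle=200)
    coh_op    = Transfer(initiator="COHERENCE", start_cycle=200)
    arbitrate(winner=prog_read, loser=coh_op, probe="L2-arb", wait_cycles=4)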

Performance Bug Identification
The data collected for each transaction type is plotted per memory controller interface or per transaction type.
If a latency value is above a certain checker value, it is either a design bug or incorrect modeling in the performance validation environment, which is fixed and re-analyzed.
– Checkers are modeled based on theoretical analysis; outliers are analyzed.
Over a period of time, the resulting plot shows the minimum and maximum number of cycles spent by any given transfer type in that particular memory controller across the various stimuli provided by the testbench.
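A minimal sketch of that outlier screen, with placeholder checker values standing in for the theoretically derived bounds:

    # Sketch of the outlier screen: per transaction type, compare observed
    # adjusted latencies to a checker bound. Bound values here are invented.
    CHECKER_BOUND = {"CPU-L2-read": 9, "DMA-L2-write": 14}

    def screen(samples):
        """samples: list of (txn_type, adjusted_latency). Returns the outliers."""
        return [(t, lat) for t, lat in samples if lat > CHECKER_BOUND[t]]

    print(screen([("CPU-L2-read", 7), ("CPU-L2-read", 11)]))
    # [('CPU-L2-read', 11)] -> either a design bug or a modeling error to fix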

Bandwidth Analysis and Validation
The C66x DSP supports various programmable bandwidth settings.
Theoretical expectations are calculated for the various bandwidth settings when multiple requestors arbitrate for a resource:
– Example: CPU and DMA traffic arbitrate for the L2 SRAM resource.
– The various configurations and the resulting throughput are tabulated.
(Figure: CPU and DMA requests arbitrated at the L2 SRAM bandwidth-management arbiter, with a table of bandwidth configurations.)
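For a simple weighted arbiter, the theoretical expectation is direct to compute; the sketch below assumes a plain proportional-weight scheme with invented weights, not the actual C66x bandwidth-management programming model:

    # Expected share of a contended resource under assumed arbitration weights.
    def expected_share(weights, total_bw_bytes_per_cycle):
        """weights: {requestor: weight}. Returns ideal bytes/cycle per requestor."""
        total = sum(weights.values())
        return {r: total_bw_bytes_per_cycle * w / total for r, w in weights.items()}

    # e.g. CPU weighted 3:1 over DMA on a 32 B/cycle L2 SRAM port (weights invented)
    print(expected_share({"CPU": 3, "DMA": 1}, 32))  # {'CPU': 24.0, 'DMA': 8.0}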

Bandwidth Validation (contd.)
Efficiency of bandwidth allocation:
– Improves energy efficiency in the system
– Targeted stress tests were written to exercise full-bandwidth scenarios on all the interfaces
With bandwidth arbitration enabled, the total bandwidth utilized is plotted per requestor.
– Ex: the bandwidth that each requestor (CPU, DMA, and the coherence engine) gets when they access the same resource, L2 SRAM, concurrently
The total available L2 SRAM bandwidth is 32 bytes/cycle, but when all three requestors are accessing L2 SRAM, the L2 controller provides a maximum of only 24 bytes/cycle, which may or may not be the architectural intent.
Scenarios like this are highlighted to the design team for review, and the architecture is revised during the design phase accordingly.
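The finding above reduces to simple accounting, sketched here with invented per-requestor numbers:

    # Accounting behind the slide's finding; per-requestor figures are invented.
    measured = {"CPU": 12, "DMA": 8, "COHERENCE": 4}   # observed bytes/cycle
    port_max = 32                                      # total L2 SRAM bandwidth
    utilized = sum(measured.values())                  # 24 bytes/cycle
    if utilized < port_max:
        print(f"only {utilized}/{port_max} B/cycle utilized -> flag for review")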

Validation of Cache Coherency Operations
The C66x DSP core memory system supports block and global cache operations:
– Global cache operations
– Block cache operations
The DSP core supports a snoop interface between the data memory controller and the L2 controller to support the all-inclusive coherent memory system.
– For the snoop operations, the latency of each snoop transaction is tracked and reviewed against architectural expectations.
The latency of each cache coherency operation is a function of the cache size, the number of clean/dirty/valid lines in the cache, the block word count, and the number of empty lines in the cache.
For different step sizes of cache size and block size, the total number of cycles taken for each operation is determined and a formula is derived.
– The formula is used by the performance validation model with random stimuli, so that whenever a cache operation is initiated, the latency and credits are checked against their respective formulae. An example sketch follows.
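A hedged stand-in for such a derived formula, written as a linear model in the parameters the slide names; the coefficients are placeholders, not the values derived for the C66x:

    # Placeholder linear model for coherence-operation latency; alpha/beta/gamma
    # and the fixed overhead are invented, not the TI-derived coefficients.
    def expected_coherence_cycles(dirty_lines, clean_lines, block_words,
                                  alpha=2, beta=1, gamma=4, overhead=10):
        return (overhead + alpha * dirty_lines
                + beta * clean_lines + gamma * block_words)

    # Checker use: compare the tracked cache-op latency against the formula.
    assert expected_coherence_cycles(8, 24, 4) == 10 + 16 + 24 + 16  # == 66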

Conclusion
Post-silicon performance validation identifies performance issues very late in the design development cycle and can prove very costly.
– There is an ever-increasing need to detect performance issues early, so that the cost of the design fixes needed to resolve them is minimized.
The "Specman-e" model overlays on top of the functional simulation framework:
– It collates traffic information for every transfer in the system.
– It computes the total latency incrementally and also calculates the expected latency, either from the theoretical equations or from default values based on pipeline depth.
Numerous performance bugs were identified and fixed during the design development phase.
The performance validation model probes into the design and is reused across all levels of the DV stack with minimal simulation time overhead.
– The framework can guarantee that performance is validated across the entire design/system instead of just the unit-level functional verification environment.
Furthermore, between different revisions of the processor cores, if a feature added at a later stage or a functional bug fix introduces a performance bug, the performance model checker will fail, thus catching any performance issue created by design changes.

Q&A
Thank you