SystemC Simulation Based Memory Controller Optimization

Presentation transcript:

SystemC Simulation Based Memory Controller Optimization Primary Author: Ashutosh Pandey Secondary Author(s): Nitin Gupta, Amit Garg Presenter: Ashutosh Pandey Company/Organization: Synopsys

Agenda
Background
Challenges – System Level
Memory Controller Architecture – An Example
Optimization & Configuration Requirements
Methodology
A Case Study
Conclusion

Background
SDRAM controllers are an integral part of today's System-on-Chip (SoC) designs. SDRAM access performance is one of the primary bottlenecks. The memory controller is responsible for optimizing SDRAM accesses across the system and for optimizing utilization of the JEDEC interface.

Challenges – System Level
Early design-space and architecture exploration. System-level optimization for the targeted use cases: interconnect configuration, memory hierarchy (buffers, caches, on-chip/off-chip memories), and memory architecture optimization. Meeting bandwidth/latency requirements for each application/master in the system. System-level architecture and design for the targeted use cases and applications. SDRAM hardware architecture optimization.

Memory Controller Architecture – An Example
[Block diagram: a sample memory controller. AXI ports feed a request multiplexer and an RdData/WrRsp demultiplexer; requests pass through a command queue and port arbiter to the scheduler and memory access controller, which drives the SDRAM over the JEDEC interface. A programmable interface configures the controller; the system side exposes AXI interfaces.]

Optimization & Configuration – Requirements
System-level visibility (end-to-end latency/throughput). Correlation of memory accesses with system traffic. Visualization and analysis of memory-interface activity. Root-cause analysis for various bottlenecks/limitations. SDRAM architecture exploration.

Methodology
1. Specify system constraints such as latency, throughput, or utilization.
2. Simulate and analyze constraint violations.
3. Analyze system characteristics to identify the bottleneck(s).
4. Investigate to identify the root cause of the problem.
5. Re-configure the system to address the bottleneck(s).
6. Re-run and re-analyze the refined configuration until the constraints are satisfied.
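The steps above form a simulate/analyze/refine loop. A minimal Python sketch of that loop is shown below; `simulate`, `refine`, and the constraint checks are illustrative placeholders, not part of the Synopsys flow:

```python
# Hedged sketch of the methodology loop: simulate, check constraints,
# refine the configuration, and repeat until every constraint passes.
# `simulate`, `refine`, and `constraints` are illustrative placeholders.

def optimize(config, simulate, refine, constraints, max_iters=10):
    """Iterate simulate -> analyze -> re-configure until constraints hold."""
    for _ in range(max_iters):
        metrics = simulate(config)                # run the performance model
        violated = [name for name, ok in constraints.items()
                    if not ok(metrics)]           # analyze constraint violations
        if not violated:
            return config, metrics                # all constraints satisfied
        config = refine(config, violated)         # address the bottleneck(s)
    raise RuntimeError("constraints not met within iteration budget")
```

With a toy latency model that improves as a configuration knob (here, page size) grows, the loop converges in a few iterations, mirroring the case study's refinement cycle.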

Memory Controller Optimization – A Case Study
[Block diagram: CORE0 and CORE1 connect over an AXI bus to the memory controller (AXI port interface (XPI), arbiter, scheduler), which drives a DDR3 SDRAM over the SDRAM interface.]
Objective: optimize the memory controller to achieve the desired latencies for CORE0. Optimization for throughput and utilization is also possible.

System Level – Latency Analysis
[Chart: cumulative average duration of a read transaction, decomposed per component.]

System Level – Latency Analysis
Analysis result for the round-robin arbitration scheme: the average delay for CORE0 transactions in the memory controller is 262 ns, of which the arbiter alone contributes 100 ns. The average SDRAM access delay for CORE0 is 72 ns.
[Charts: CORE0 memory access latency, interconnect latencies, delay in the different memory-controller components for CORE0 transactions, and SDRAM access delays for CORE0.]

Priority-Based Arbitration for the Memory Controller Arbiter
Result for the priority-based arbitration scheme: CORE0 memory access latency reduces from 428 ns to ~310 ns. The average SDRAM access delay for CORE0 is ~68 ns (previously 72 ns), but it still accounts for 22% of the total latency.
[Chart: delay in the different memory-controller components for CORE0 transactions.]
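The two schemes compared above can be contrasted with a minimal sketch (an illustration, not the controller's actual arbiter logic): round-robin rotates the grant across ports, while fixed priority always favors the designated port, which is what cut CORE0's arbiter delay here.

```python
# Minimal arbiter sketches. `requests[i]` is True if port i has a
# pending request; both functions return the granted port (or None).

def round_robin_pick(requests, last_granted):
    """Grant the next requesting port after `last_granted`, cyclically."""
    n = len(requests)
    for i in range(1, n + 1):
        port = (last_granted + i) % n
        if requests[port]:
            return port
    return None  # no port is requesting

def priority_pick(requests, priority_order):
    """Grant the first requesting port in fixed priority order."""
    for port in priority_order:
        if requests[port]:
            return port
    return None
```

Under contention, round-robin alternates grants between CORE0 and CORE1, whereas fixed priority grants CORE0 whenever it requests, trading CORE1 latency for CORE0 latency.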

Memory Channel Utilization Analysis for CORE0
[Charts: detailed memory channel utilization for the entire system; COMMAND and DATA phase utilization for CORE0 only.]
Page HIT = 8.4%, page MISS = 12.315%. For an optimal architecture, HIT % should be much greater than MISS %.

Initial Inferences
A large percentage of accesses result in page misses, causing increased access latency, inefficient usage of the JEDEC interface, and higher power consumption due to additional precharges and activates. Possible reasons for the inefficiency: mapping of application addresses to memory addresses, page policy, page crossovers, and rank crossovers.
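Address mapping, the first suspect listed above, determines which rank, bank, and row a system address lands in. A hedged sketch of a typical column/bank/row/rank decode follows; the field widths are assumptions for illustration, since real controllers make this mapping programmable:

```python
# Illustrative address decode: low bits select the column, then the
# bank, then the row, then the rank. Field widths are assumed here.

def decode(addr, col_bits=10, bank_bits=3, row_bits=14):
    col = addr & ((1 << col_bits) - 1)
    addr >>= col_bits
    bank = addr & ((1 << bank_bits) - 1)
    addr >>= bank_bits
    row = addr & ((1 << row_bits) - 1)
    rank = addr >> row_bits
    return rank, bank, row, col
```

Two addresses that differ only in the row field decode to the same bank but different rows, which is exactly the conflict pattern the utilization analysis identifies.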

Memory Channel Utilization Analysis for CORE0
Most misses are due to transactions to the same bank but different rows.
[Chart: COMMAND and DATA phase utilization for CORE0, with the command phase broken down by the different causes of a MISS.]

Refined Inferences
Almost all page misses in this system are caused by transactions to the same bank but a different page. Possible solutions: change the traffic pattern (not always possible), or use a memory with a bigger page size.
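The effect of a bigger page can be seen with a toy single-bank row-buffer model (a sketch for intuition, not the DDR3 model used in the study):

```python
# Toy single-bank row-buffer model: one page can be open at a time.
# Accessing the open page is a HIT; any other page in the bank is a
# MISS and forces a precharge + activate before the access.

def hit_miss_counts(addresses, page_bytes):
    open_page = None
    hits = misses = 0
    for addr in addresses:
        page = addr // page_bytes
        if page == open_page:
            hits += 1
        else:
            misses += 1          # precharge the old page, activate the new one
            open_page = page
    return hits, misses
```

A stream that alternates between two 1 KB regions of the same bank misses on every access with a 1 KB page, but after doubling the page size both regions share one open page and every access after the first hits.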

Increasing Page Size from 1 KB to 2 KB
The command-setup phase drops sharply from 41.262% to 15%. The increased page size yields the desired performance: zero page misses; all memory accesses result in page hits.

Effect of Increasing Page Size on Overall Delay
CORE0 memory access latency reduces from 310 ns to ~222 ns, and the SDRAM access delay for CORE0 reduces from 68 ns to 52 ns.

Conclusions
System-level performance analysis enables detection of performance problems. Detailed data-path visibility enables identification of hot spots, e.g. the arbitration scheme and the SDRAM hit/miss ratio. Analyzing the hot spots reveals the root causes. Systematic refinement yields an optimal architecture for the targeted use cases.

Q&A