Nuno Neves1,2, Pedro Tomás1,2, Nuno Roma1,2


Adaptive In-Cache Streaming for Efficient Data Management in General Purpose Processors
Nuno Neves1,2, Pedro Tomás1,2, Nuno Roma1,2
Email: {nuno.neves,pedro.tomas,nuno.roma}@inesc-id.pt
1INESC-ID Investigação e Desenvolvimento, Rua Alves Redol 9, 1000-029 Lisboa, Portugal
2Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001 Lisboa, Portugal

1. Introduction
Conventional address-based models, which rely on cache structures to mitigate the memory-wall problem, often struggle with memory-bound applications or with arbitrarily complex data patterns that can hardly be captured by prefetching mechanisms. Stream-based techniques have proven effective at tackling such limitations, although they are not well suited to all types of applications. To mitigate the limitations of both communication paradigms, an efficient unification is proposed, by means of a novel in-cache stream paradigm capable of seamlessly adapting the communication between address-based and stream-based models. A new dynamic data-pattern descriptor-graph specification, capable of handling regular, arbitrarily complex data patterns, was designed to improve main-memory bandwidth utilization through data reutilization and reorganization techniques.

2. In-Cache Stream Communication
- Morphable in-cache stream controllers, supporting memory-addressed and packed-stream data accesses;
- Memory-aware stream management controller (SMC), deploying efficient memory-access optimization techniques (bandwidth optimization, data reorganization and reutilization, and in-time stream manipulation).

3. Set-Associative Cache Hybridization
An n-way set-associative memory is simultaneously managed by two independent modules: a hybrid cache controller and a stream controller. Typical memory-addressed accesses are handled by the cache controller, using arbitrary replacement and write policies.
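The way-level hybridization can be illustrated with a minimal software sketch (the class and method names below are hypothetical; the actual controllers are hardware modules). A subset of the n ways keeps serving addressed cache accesses, while the remaining ways can be morphed into dedicated stream buffers:

```python
# Minimal sketch of way-level cache/stream hybridization (hypothetical names).
# Ways reserved by the stream controller stop participating in cache lookups.

class HybridSetAssociativeMemory:
    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # ways[w][s] holds a (tag, data) pair, or None when invalid
        self.ways = [[None] * num_sets for _ in range(num_ways)]
        self.stream_ways = set()  # ways currently lent to the stream controller

    # --- stream controller side -------------------------------------------
    def reserve_stream_way(self):
        """Morph one cache way into a stream buffer (invalidating its lines)."""
        for w in range(self.num_ways):
            if w not in self.stream_ways:
                self.stream_ways.add(w)
                self.ways[w] = [None] * self.num_sets  # flush/invalidate
                return w
        raise RuntimeError("no way available for streaming")

    def release_stream_way(self, w):
        """Return a stream-buffer way to the cache controller."""
        self.stream_ways.discard(w)
        self.ways[w] = [None] * self.num_sets

    # --- cache controller side --------------------------------------------
    def cache_lookup(self, tag, set_index):
        """Addressed access: only non-stream ways participate in the lookup."""
        for w in range(self.num_ways):
            if w in self.stream_ways:
                continue
            line = self.ways[w][set_index]
            if line is not None and line[0] == tag:
                return line[1]  # hit
        return None  # miss

    def cache_fill(self, tag, set_index, data):
        """Fill on miss; a trivial placeholder replacement policy over cache ways."""
        candidates = [w for w in range(self.num_ways) if w not in self.stream_ways]
        victim = candidates[set_index % len(candidates)]
        self.ways[victim][set_index] = (tag, data)
```

In the actual design, reserving and releasing ways is what allows the communication model to adapt at run time between purely addressed, purely streamed, and mixed operation.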
The stream controller adapts and reuses the resources of the n-way set-associative cache memory as dedicated stream buffers.

4. Dynamic Descriptor Graph Specification
Independently of their application domain, deterministic algorithms are characterized by complex memory access patterns that can be represented by an n-dimensional affine function of the form y(x1, ..., xn) = offset + Σ (stride_k · x_k), summed over k = 1, ..., n. By combining several such functions, any deterministic data pattern can be described, independently of its complexity. A new Dynamic Descriptor Graph specification encodes any number of n-dimensional affine functions and chains them together in a graph-like structure to describe arbitrarily complex memory access patterns.

5. Memory-Aware Stream Generation
Data streams are generated by a memory-aware SMC, which relies on the descriptor specification, processed by a dedicated PDC module. A burst controller and a reorder buffer optimize memory bandwidth and exploit data reorganization and reutilization through in-time stream manipulation.

6. Data-Pattern Generation Efficiency
The proposed specification allows a steady one-address-per-cycle memory access generation, while requiring up to 8100x less description memory space than state-of-the-art solutions.

7. Memory Bandwidth Optimization
The combination of the efficient memory access generation with the memory-aware burst and buffering optimizations raises the throughput of a typical DDR memory close to its theoretical maximum. Memory throughput and access latency were optimized for a DDR3 module accessed via a 100 MHz AXI bus.

8. Performance and Energy Efficiency
The observed performance increases, averaging 65x, and the measured energy savings of up to 91% result in overall processing energy-efficiency (EDP) improvements as high as 245x.

Acknowledgments: This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under Project UID/CEC/50021/2013 and Grant SFRH/BD/100697/2014.
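The descriptor-driven address generation of Sections 4 and 5 can be sketched as follows (field names are hypothetical, and the hardware generates one address per cycle rather than via a software generator). Each descriptor encodes one n-dimensional affine function, addr(i1, ..., in) = offset + Σ stride_k · i_k with 0 <= i_k < size_k, and may chain to a successor, forming a graph that describes arbitrarily complex deterministic patterns:

```python
# Minimal sketch of descriptor-graph address generation (hypothetical names).
# Each Descriptor is one n-dimensional affine access function; chaining
# descriptors composes them into more complex deterministic patterns.

class Descriptor:
    def __init__(self, offset, dims, successor=None):
        self.offset = offset        # base address of this pattern
        self.dims = dims            # list of (stride, size) pairs, one per dimension
        self.successor = successor  # next descriptor in the graph, if any

def generate_addresses(desc):
    """Walk the descriptor chain, yielding one address per step."""
    while desc is not None:
        indices = [0] * len(desc.dims)
        while True:
            yield desc.offset + sum(s * i for (s, _), i in zip(desc.dims, indices))
            # n-dimensional counter increment, innermost dimension first
            for d in range(len(indices)):
                indices[d] += 1
                if indices[d] < desc.dims[d][1]:
                    break
                indices[d] = 0
            else:
                break  # all dimensions wrapped: this pattern is exhausted
        desc = desc.successor
```

For example, Descriptor(offset=0, dims=[(1, 4), (16, 2)]) describes two rows of four consecutive words spaced 16 words apart (addresses 0-3 and 16-19); chaining a successor appends a further pattern to the same stream.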