INSTITUTE OF COMPUTING TECHNOLOGY DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance Dang Tang, Yungang Bao, Weiwu.

Presentation transcript:

INSTITUTE OF COMPUTING TECHNOLOGY
DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance
Dang Tang, Yungang Bao, Weiwu Hu, Mingyu Chen
Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

The Role of I/O
I/O is ubiquitous: loading binary files (disk → memory), browsing the web and streaming media (network → memory), …
I/O is significant: many commercial applications, such as databases, are I/O intensive.

State-of-the-Art I/O Technologies
I/O bus: 20GB/s (PCI-Express 2.0, HyperTransport 3.0, QuickPath Interconnect)
I/O devices: SSD RAID, 1.2GB/s; 10GE, 1.25GB/s; Fusion-io, 8GB/s and 1M IOPS (2KB random, 70/30 read/write mix)

Direct Memory Access (DMA)
DMA is used for I/O operations in all modern computers.
DMA allows I/O subsystems to access system memory independently of the CPU.
Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs.

Outline: Revisiting I/O; DMA Cache Design; Evaluations; Conclusions

An Example of Disk Read: DMA Receiving Operation
[Figure: DMA engine, CPU, and memory (driver buffer, descriptor, kernel buffer), with steps ① to ④]
Cache access latency: ~20 cycles
Memory access latency: ~200 cycles
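The two latency figures on this slide suggest why it matters where DMA data lands. As a rough illustration, the sketch below computes the average latency of CPU accesses to I/O data under a simple two-level model (each access either hits in a cache or goes to memory). The function name and the hit fractions are illustrative assumptions, not measurements from the talk.

```python
CACHE_LATENCY_CYCLES = 20    # approximate figure quoted on the slide
MEMORY_LATENCY_CYCLES = 200  # approximate figure quoted on the slide


def average_access_latency(cache_hit_fraction):
    """Average cycles per access, assuming a two-level hit-or-miss model."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be in [0, 1]")
    miss_fraction = 1.0 - cache_hit_fraction
    return (cache_hit_fraction * CACHE_LATENCY_CYCLES
            + miss_fraction * MEMORY_LATENCY_CYCLES)


if __name__ == "__main__":
    # Hypothetical hit fractions, chosen only to show the trend.
    for hit in (0.0, 0.5, 1.0):
        print(hit, average_access_latency(hit))
```

Under this model, every access served from memory instead of a cache costs about ten times as much, which is the gap that cache-based DMA schemes try to close.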

Direct Cache Access [Ram-ISCA05]
[Figure: DMA engine, CPU, and memory (driver buffer, descriptor, kernel buffer), with steps ① to ④]
This is a typical shared-cache scheme.
Prefetch-Hint Approach [Kumar-Micro07]

Problems of the Shared-Cache Scheme
Cache pollution
Cache thrashing
Not suitable for other I/O
→ Performance degrades when DMA requests are large (>100KB) for the "Oracle + TPC-H" application.
To address this problem properly, we need to investigate the characteristics of I/O data.

I/O Data vs. CPU Data
[Figure labels: MemCtrl, HMTT, I/O Data, CPU Data, I/O Data + CPU Data]

A Short Advertisement for HMTT [Bao-Sigmetrics08]
HMTT is a hardware/software hybrid memory trace tool.
It supports the DDR2 DIMM interface on multiple platforms.
It collects full-system off-chip memory traces.
It provides traces with semantic information, e.g., virtual address, process ID, and I/O operation.
It can collect traces of commercial applications, e.g., Oracle and web servers.
[Figure: the HMTT system]

Characteristics of I/O Data (1)
[Figure panels: % of memory references to I/O data; % of references of various I/O types]

Characteristics of I/O Data (2)
[Figure: I/O request size distribution]

Characteristics of I/O Data (3)
[Figure: sequential access in I/O data]
Compared with CPU data, I/O data is very regular.

Characteristics of I/O Data (4)
Reuse distance (RD) is measured as LRU stack distance.
[Figure: CDF of reuse distance; a point (n, x%) on the curve means x% of references have RD <= n]
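For readers unfamiliar with the metric, reuse distance as LRU stack distance can be sketched as follows. This is a minimal illustration and not the authors' tooling. It assumes a 0-based convention (the number of distinct blocks touched between two references to the same block) and a CDF taken over re-referenced blocks only. The function names and the sample trace are hypothetical.

```python
def reuse_distances(trace):
    """Return the LRU stack distance of each reference (None on first touch)."""
    stack = []       # least recently used first, most recently used last
    distances = []
    for block in trace:
        if block in stack:
            position = stack.index(block)
            # Blocks above `position` were touched since the last reference.
            distances.append(len(stack) - 1 - position)
            stack.pop(position)
        else:
            distances.append(None)  # cold reference, no reuse distance
        stack.append(block)
    return distances


def reuse_distance_cdf(distances, n):
    """Fraction of re-references whose reuse distance is <= n."""
    finite = [d for d in distances if d is not None]
    if not finite:
        return 0.0
    return sum(1 for d in finite if d <= n) / len(finite)


if __name__ == "__main__":
    sample = ["a", "b", "c", "a", "b", "b"]  # hypothetical block trace
    rd = reuse_distances(sample)
    print(rd)
    print(reuse_distance_cdf(rd, 0), reuse_distance_cdf(rd, 2))
```

The list-based stack is quadratic in the worst case. It is adequate for illustration, but a tree-based structure would be needed for full memory traces.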

Characteristics of I/O Data (5)
[Figure labels: DMA-W, CPU-R, CPU-RW, CPU-W, DMA-R]

Rethinking I/O and DMA Operation
In I/O-intensive applications, 20~40% of memory references are to I/O data.
The characteristics of I/O data differ from those of CPU data:
there is an explicit produce-consume relationship for I/O data;
the reuse distance of I/O data is smaller than that of CPU data;
references to I/O data are primarily sequential.
→ Separate I/O data from CPU data.

Separating I/O Data and CPU Data
[Figure: before separating vs. after separating]


DMA Cache Design Issues
Write policy
Cache coherence
Replacement policy
Prefetching
[Figure: Dedicated DMA Cache (DDC)]

DMA Cache Design Issues: Write Policy
Adopt a write-allocate policy.
Either a write-back or a write-through policy can be used.

DMA Cache Design Issues: Cache Coherence
IO-ESI protocol for the write-through (WT) policy
IO-MOESI protocol for the write-back (WB) policy
The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of state transitions are exchanged.

A Big Issue
How can we prove the correctness of integrating heterogeneous cache coherence protocols in one system?

A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98]
[Figure: global state transitions across the DMA cache and the CPU caches. Legible labels: O S+ I+ (√), M S+ I+ (X), E I+, M I+, S+ I+, with transitions R|E, W|*, R|I]

Global State Cache Coherence Theorem
Given N (N>1) well-defined cache protocols, they do not conflict if and only if no conflict global state exists in the global state transition machine.
5 global states, all valid (√): S+ I+, E I*, I*, M I*, O S* I*
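The theorem reduces correctness to the absence of conflict global states. The sketch below only illustrates what such a check could look like for concrete global states (one state per cache for a single block). The conflict rule used here is the usual single-writer invariant of MOESI-style protocols and is an assumption. The talk's exact definition of a conflict global state, and the symbolic exploration of reachable states in the cited method, are not reproduced. All names are hypothetical.

```python
def is_conflict(global_state):
    """Return True if a tuple of per-cache states violates the assumed invariant.

    Assumed rule: a Modified or Exclusive copy must be the only valid copy,
    and at most one cache may hold the block in the Owned state.
    """
    valid = [s for s in global_state if s != "I"]
    has_exclusive_copy = any(s in ("M", "E") for s in valid)
    owner_count = sum(1 for s in valid if s == "O")
    if has_exclusive_copy and len(valid) > 1:
        return True
    return owner_count > 1


def conflict_free(global_states):
    """True if none of the supplied global states is a conflict state."""
    return not any(is_conflict(gs) for gs in global_states)


if __name__ == "__main__":
    # Three-cache instances of the five state families listed on the slide.
    listed = [("S", "S", "I"), ("E", "I", "I"), ("I", "I", "I"),
              ("M", "I", "I"), ("O", "S", "I")]
    print(conflict_free(listed))
    print(is_conflict(("M", "S", "I")))
```

A complete check would enumerate every global state reachable through the combined transition rules of both protocols, rather than test a hand-written list.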

MOESI + ESI
6 global states, all valid (√): S+ I+, E_C I*, I*, M_C I*, E_D I*, O_C S* I*

DMA Cache Design Issues: Replacement Policy
An LRU-like replacement policy, ordered by line state:
1. Invalid
2. Shared
3. Owned
4. Exclusive
5. Modified
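One plausible reading of the numbered list is an eviction preference: Invalid lines are replaced first and Modified lines last, with recency breaking ties inside a state. The slide does not spell this out, so the sketch below is an assumption, and the names `EVICTION_RANK` and `choose_victim` are hypothetical.

```python
# Lower rank is evicted first; the ordering follows the list on the slide.
EVICTION_RANK = {"I": 1, "S": 2, "O": 3, "E": 4, "M": 5}


def choose_victim(cache_set):
    """Pick the tag to evict from one cache set.

    cache_set is a list of (tag, state, last_use) tuples, where last_use is a
    timestamp and a smaller value means less recently used.
    """
    if not cache_set:
        raise ValueError("cache_set must not be empty")
    victim = min(cache_set, key=lambda line: (EVICTION_RANK[line[1]], line[2]))
    return victim[0]


if __name__ == "__main__":
    lines = [("a", "M", 1), ("b", "S", 5), ("c", "S", 3), ("d", "E", 0)]
    print(choose_victim(lines))
```

Ranking by state before recency favors evicting lines that need no write-back, at the cost of sometimes discarding a recently used Shared line.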

DMA Cache Design Issues: Prefetching
Adopt straightforward sequential prefetching.
Prefetching is triggered by a cache miss.
Four blocks are fetched at a time.
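Miss-triggered sequential prefetching can be sketched in a few lines. The model below assumes that the four blocks include the missing block itself and that the cache has unlimited capacity. Both are simplifications for illustration, and the slide does not say whether the four blocks include the missing one. The function names are hypothetical.

```python
PREFETCH_DEGREE = 4  # "fetch 4 blocks at a time", per the slide


def blocks_to_fetch(miss_block, degree=PREFETCH_DEGREE):
    """Blocks brought in on a miss: the missing block and the ones after it."""
    return [miss_block + offset for offset in range(degree)]


def count_misses(trace, degree=PREFETCH_DEGREE):
    """Count misses for a block-number trace in an unbounded cache model."""
    cached = set()
    misses = 0
    for block in trace:
        if block not in cached:
            misses += 1
            cached.update(blocks_to_fetch(block, degree))
    return misses


if __name__ == "__main__":
    sequential = list(range(16))  # hypothetical sequential DMA transfer
    print(count_misses(sequential), count_misses(sequential, degree=1))
```

On a purely sequential trace this model misses once per four blocks, which is why regular I/O access patterns suit such a simple prefetcher.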

Design Complexity vs. Design Cost
Dedicated DMA Cache (DDC)
Partition-Based DMA Cache (PBDC)


Speedup of Dedicated DMA Cache

% of Valid Prefetched Blocks
DMA caches can exhibit impressively high prefetching accuracy.
This is because I/O data has a very regular access pattern.

Performance Comparisons
Although PBDC does not require additional on-chip storage, it can achieve about 80% of DDC's performance improvement.


Conclusions
We have proposed a DMA cache technique to separate I/O data from CPU data.
We adopt a global state method for integrating heterogeneous cache protocols.
Experimental results show that DMA cache schemes perform better than existing approaches that use unified, shared caches for I/O data and CPU data.
Open problems remain, e.g.:
Can I/O data go directly to the L1 cache?
How should heterogeneous caches be designed for different types of data?
How can the MC be optimized with awareness of I/O?

Thanks! Questions?

Design Complexity of PBDC

More References on Cache Coherence Protocol Verification
Fong Pong, Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. Journal of the ACM (JACM), v.45 n.4, July 1998.
Fong Pong, Michel Dubois. Verification techniques for cache coherence protocols. ACM Computing Surveys (CSUR), v.29 n.1, March 1997.