Latency Reduction Techniques for Remote Memory Access in ANEMONE
Mark Lewandowski
Department of Computer Science, Florida State University
Outline
- Introduction
- Architecture / Implementation
  - Adaptive NEtwork MemOry engiNE (ANEMONE)
  - Reliable Memory Access Protocol (RMAP)
  - Two-Level LRU Caching
  - Early Acknowledgments
- Experimental Results
- Future Work
- Related Work
- Conclusions
Introduction
- Virtual memory performance is bound by slow disks.
- The state of computing today lends itself to remote memory sharing:
  - Gigabit Ethernet
  - Machines on a LAN have lots of free memory
- The improvements to ANEMONE presented here yield higher performance than both disk and the original ANEMONE system.
[Figure: memory hierarchy: Registers, Cache, Memory, ANEMONE, Disk]
Contributions
- Pseudo Block Device (PBD)
- Reliable Memory Access Protocol (RMAP), replacing NFS
- Early acknowledgments: a shortcut communication path
- Two-level LRU-based caching: client and Memory Engine
ANEMONE Architecture

Component       ANEMONE (NFS)                          ANEMONE
Client          NFS swapping                           Pseudo Block Device (PBD)
                Swap daemon cache                      Client cache
Memory Engine   No caching                             Engine cache
                Must wait for server to receive page   Early ACKs
Memory Server   Communicates with Memory Engine        Communicates with Memory Engine
Architecture
[Figure: system architecture: client module on the client, RMAP protocol between client and Engine, Engine cache on the Memory Engine]
Pseudo Block Device
- Provides a transparent interface between the swap daemon and ANEMONE.
- Requires no kernel modification.
- Handles READ/WRITE requests in order of arrival; no expensive elevator algorithm is needed, since remote memory has no seek penalty. (See the sketch below.)
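A minimal userspace sketch of this in-order servicing, assuming a simple linked-list FIFO; the real PBD is a Linux kernel block driver that forwards requests over RMAP rather than to a disk, and all names below are hypothetical.

/* FIFO servicing of block requests, in arrival order (sketch). */
#include <stdio.h>
#include <stdlib.h>

enum op { OP_READ, OP_WRITE };

struct request {
    enum op         op;
    unsigned long   block;    /* 4 KB block number */
    struct request *next;
};

/* FIFO queue: requests are serviced strictly in the order they arrive;
 * no elevator (seek-ordering) pass is performed, since remote memory
 * has no seek penalty. */
static struct request *head, *tail;

static void enqueue(struct request *r)
{
    r->next = NULL;
    if (tail) tail->next = r; else head = r;
    tail = r;
}

static struct request *dequeue(void)
{
    struct request *r = head;
    if (r) { head = r->next; if (!head) tail = NULL; }
    return r;
}

int main(void)
{
    for (unsigned long b = 0; b < 4; b++) {   /* queue a few requests */
        struct request *r = malloc(sizeof *r);
        r->op = (b & 1) ? OP_WRITE : OP_READ;
        r->block = b;
        enqueue(r);
    }
    struct request *r;
    while ((r = dequeue())) {                 /* serve in arrival order */
        printf("%s block %lu\n", r->op == OP_READ ? "READ" : "WRITE", r->block);
        free(r);
    }
    return 0;
}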
Reliable Memory Access Protocol (RMAP)
[Figure: protocol stack: swap daemon and application over transport over IP over Ethernet, with RMAP sitting alongside the IP layer]
- Lightweight
- Reliable
- Flow control
- The protocol sits next to the IP layer to give the swap daemon quick access to pages.
RMAP
- Window-based protocol.
- Requests are served as they arrive.
- Messages (see the sketch below):
  - REG/UNREG: register/unregister the client with the ANEMONE cluster
  - READ/WRITE: send/receive data to/from ANEMONE
  - STAT: retrieve statistics from the ANEMONE cluster
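The wire format is not given in the slides, so the following is a hypothetical sketch of what an RMAP header and its message types might look like; field names and sizes are illustrative only.

/* Hypothetical sketch of an RMAP message header; the actual wire
 * format is not specified in the slides. */
#include <stdint.h>

enum rmap_type {
    RMAP_REG,    /* register the client with the ANEMONE cluster */
    RMAP_UNREG,  /* unregister the client                        */
    RMAP_READ,   /* fetch a 4 KB page from remote memory         */
    RMAP_WRITE,  /* store a 4 KB page to remote memory           */
    RMAP_STAT,   /* retrieve statistics from the cluster         */
    RMAP_ACK     /* acknowledgment (including early ACKs)        */
};

struct rmap_hdr {
    uint8_t  type;    /* one of enum rmap_type                  */
    uint32_t seq;     /* sequence number, for reliable delivery */
    uint32_t offset;  /* page offset within the swap device     */
    uint16_t window;  /* advertised flow-control window         */
} __attribute__((packed));

/* Window-based flow control: allow a send only while the number of
 * unacknowledged requests in flight is below the advertised window. */
static int window_open(uint32_t next_seq, uint32_t last_acked, uint16_t window)
{
    return next_seq - last_acked <= window;  /* wrap-safe unsigned math */
}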
Why do we need a cache?
- It is the natural analogue of on-disk buffers.
- Caching reduces network traffic.
- It decreases latency; write latencies benefit the most.
- It buffers requests before they are sent over the wire.
Basic Cache Structure
- A FIFO queue is used to keep track of the LRU page.
- A hashtable is used for fast page lookups.
(A minimal sketch of this structure follows.)
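A minimal sketch of that structure, assuming a bucket-chained hash table plus a doubly linked recency list whose head is the LRU victim; sizes and names are illustrative, not the actual ANEMONE code.

#include <stddef.h>

#define NBUCKETS  1024
#define PAGE_SIZE 4096

struct cpage {
    unsigned long  offset;          /* swap offset = lookup key      */
    unsigned char  data[PAGE_SIZE];
    struct cpage  *hnext;           /* hash bucket chain             */
    struct cpage  *prev, *next;     /* recency list; head = LRU page */
};

static struct cpage *bucket[NBUCKETS];
static struct cpage *lru_head, *lru_tail;

static unsigned hash(unsigned long off) { return (unsigned)(off % NBUCKETS); }

/* Unlink a page from the recency list. */
static void list_del(struct cpage *p)
{
    if (p->prev) p->prev->next = p->next; else lru_head = p->next;
    if (p->next) p->next->prev = p->prev; else lru_tail = p->prev;
}

/* Append a page at the most-recently-used end. */
static void list_append(struct cpage *p)
{
    p->prev = lru_tail;
    p->next = NULL;
    if (lru_tail) lru_tail->next = p; else lru_head = p;
    lru_tail = p;
}

/* Insert a new page: hash bucket for lookups, MRU end for recency. */
static void cache_insert(struct cpage *p)
{
    unsigned b = hash(p->offset);
    p->hnext = bucket[b];
    bucket[b] = p;
    list_append(p);
}

/* O(1) expected lookup; a hit is promoted to the MRU end. */
static struct cpage *cache_lookup(unsigned long off)
{
    for (struct cpage *p = bucket[hash(off)]; p; p = p->hnext)
        if (p->offset == off) {
            list_del(p);
            list_append(p);
            return p;
        }
    return NULL;
}

/* Evict the LRU page (list head); the caller writes it back or frees it. */
static struct cpage *cache_evict(void)
{
    struct cpage *victim = lru_head;
    if (victim) {
        struct cpage **pp = &bucket[hash(victim->offset)];
        while (*pp != victim) pp = &(*pp)->hnext;
        *pp = victim->hnext;
        list_del(victim);
    }
    return victim;
}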
ANEMONE Cache Details
- Client cache: 16 MB, write-back, memory allocated at load time.
- Engine cache: 80 MB, write-through, memory partially allocated at load time; sk_buffs are copied when they arrive at the Engine.
(The two write policies are contrasted in the sketch below.)
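To make the policy difference concrete, here is a hedged sketch of the two write paths; cache_insert and send_to_memory_server are stubs standing in for the real cache and RMAP forwarding code, not actual ANEMONE functions.

#include <stdio.h>

struct cpage { unsigned long offset; int dirty; };

static void cache_insert(struct cpage *p)          { (void)p; /* hash + LRU list */ }
static void send_to_memory_server(struct cpage *p) { printf("page %lu -> memory server\n", p->offset); }

/* Client cache (write-back): the write completes once the page is
 * cached; dirty pages reach the Engine only when they are evicted. */
static void client_write(struct cpage *p)
{
    cache_insert(p);
    p->dirty = 1;                   /* flushed later, on eviction */
}

/* Engine cache (write-through): the page is cached for future reads
 * and forwarded to a memory server immediately. */
static void engine_write(struct cpage *p)
{
    cache_insert(p);
    p->dirty = 0;
    send_to_memory_server(p);
}

int main(void)
{
    struct cpage a = { 42, 0 }, b = { 43, 0 };
    client_write(&a);               /* no network traffic yet     */
    engine_write(&b);               /* goes over the wire at once */
    return 0;
}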
Early Acknowledgments
- Reduce client wait time.
- Can reduce write latency by up to 200 µs per write request.
- Early ACK performance is limited by the small RMAP window size.
- A small pool (~200) of sk_buffs is maintained for forward ACKing. (See the sketch below.)
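A sketch of the shortcut on the Engine's write path, under the assumption that the Engine ACKs as soon as the page is safely buffered in a preallocated buffer; all names are illustrative.

#include <stdio.h>

#define ACK_POOL 200                /* small pool kept for early ACKs */

struct skb { int in_use; };         /* stand-in for a kernel sk_buff  */
static struct skb pool[ACK_POOL];

static struct skb *pool_get(void)
{
    for (int i = 0; i < ACK_POOL; i++)
        if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
    return NULL;                    /* exhausted: fall back to normal ACK */
}

static void send_ack(const char *kind) { printf("ACK -> client (%s)\n", kind); }
static void forward_to_server(void)    { printf("page -> memory server\n"); }

static void engine_handle_write(void)
{
    struct skb *buf = pool_get();
    if (buf) {
        send_ack("early");          /* client unblocks up to ~200 us sooner */
        forward_to_server();        /* completes after the client moved on  */
        buf->in_use = 0;
    } else {
        forward_to_server();
        send_ack("normal");         /* ACK only after the server has the page */
    }
}

int main(void) { engine_handle_write(); return 0; }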
Experimental Testbed
The experimental testbed is configured with 400,000 blocks (4 KB pages) of remote memory (~1.6 GB).
Experimental Description
- Latency: 100,000 read/write requests, sequential and random.
- Application run times: Quicksort and POV-Ray, single and multiple processes; execution times measured.
- Cache performance: measured cache hit rates at the client and the Engine.
Sequential Read
Sequential Write
Random Read
Random Write
Single Process Performance
- The single process size is increased by 100 MB in each iteration.
- Quicksort: 298% performance increase over disk, 226% over the original ANEMONE.
- POV-Ray: 370% performance increase over disk, 263% over the original ANEMONE.
Multiple Process Performance
- The number of 100 MB processes is increased by 1 in each iteration.
- Quicksort: 710% increase over disk, 117% over the original ANEMONE.
- POV-Ray: 835% increase over disk, 115% over the original ANEMONE.
Client Cache Performance
- Each hit saves ~500 µs.
- The POV-Ray hit rate saves ~270 seconds on the 1200 MB test.
- The Quicksort hit rate saves ~45 seconds on the 1200 MB test.
- The swap daemon's prefetching interferes with cache hit rates.
Engine Cache Performance
- The cache hit rate levels out at ~10%.
- POV-Ray does not exceed 10% because it performs over 3x as many page swaps as Quicksort.
- The Engine cache saves up to 1000 seconds on the 1200 MB POV-Ray test.
Future Work
- More extensive testing
- Aggressive caching algorithms
- Data compression
- Page fragmentation
- P2P RDMA over Ethernet
- Scalability and fault tolerance
Related Work
- Global Memory System [feeley95]: implements a global memory management algorithm over ATM; does not directly address virtual memory.
- Reliable Remote Memory Pager [markatos96] and Network RAM Disk [flouris99]: TCP sockets.
- Samson [stark03]: Myrinet; does not perform caching.
- Remote Memory Model [comer91]: implements a custom protocol; guarantees in-order delivery.
Conclusions
- ANEMONE does not modify the client OS or applications.
- Performance increases by up to 263% for single processes.
- Performance increases by up to 117% for multiple processes.
- Improved caching is a promising line of research, but more aggressive algorithms are required.
Questions?
Appendix A: Quicksort Memory Access Patterns
Appendix B: POV-Ray Memory Access Patterns