Published by Alia Greenwood. Modified over 9 years ago.
1
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems
Tayler H. Hetherington ɣ, Timothy G. Rogers ɣ, Lisa Hsu *, Mike O’Connor *, Tor M. Aamodt ɣ
ɣ UBC (University of British Columbia), * AMD
In Proc. 2012 ACM/IEEE Int’l Symp. on Performance Analysis of Systems and Software (ISPASS)
Image credit: Rich Miler – www.datacenterknowledge.com
2
Motivation
Server farms require a lot of power
– Need for efficient, cost-effective solutions
– GPU/APUs
New types of workloads
– Non-HPC
– Server applications
Server applications
– Memcached
Programmer’s initial intuition into an application’s behavior
Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU
Image credit: Bruno Giussani – www.wired.com
3
Background
Memcached
*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar
4
Memcached - Compatible with GPU?
Irregular control flow
Irregular memory access patterns
Large memory requirements
Highly input data dependent
5
Porting Memcached
Simple key-value lookup
[Figure: GET request flow on the server: Hash, Memory, Key Comparison, Return Hit/Miss, with hash chaining followed on a miss]
READ (GET) requests on GPU
WRITE (SET) requests on CPU
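The lookup flow on this slide (hash the key, read the bucket, compare keys, follow the chain on a mismatch, return hit or miss) can be sketched as below. This is a minimal illustration, not Memcached's actual data layout or the authors' GPU kernel: the `Item` and `ChainedTable` names are hypothetical, and Python's built-in `hash` stands in for the real hash function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    key: bytes
    value: bytes
    next: Optional["Item"] = None  # next item in the same bucket's chain


class ChainedTable:
    """Toy hash table with chaining (hypothetical, for illustration only)."""

    def __init__(self, num_buckets: int = 8) -> None:
        self.buckets: List[Optional[Item]] = [None] * num_buckets

    def _index(self, key: bytes) -> int:
        # Assumption: built-in hash() stands in for the real hash function.
        return hash(key) % len(self.buckets)

    def set(self, key: bytes, value: bytes) -> None:
        idx = self._index(key)
        item = self.buckets[idx]
        while item is not None:
            if item.key == key:  # key already stored: overwrite in place
                item.value = value
                return
            item = item.next
        # Key not found: push a new item onto the front of the chain.
        self.buckets[idx] = Item(key, value, self.buckets[idx])

    def get(self, key: bytes) -> Optional[bytes]:
        item = self.buckets[self._index(key)]
        while item is not None:
            if item.key == key:  # key comparison succeeded: hit
                return item.value
            item = item.next     # mismatch: follow the hash chain
        return None              # end of chain: miss
```

The loop in `get` is where both irregularities named in the deck show up: the number of iterations depends on the chain length for that key, and each `item.next` can point anywhere in memory.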
6
Porting Memcached - Batching
[Figure: the GET request flow (Hash, Memory, Key Comparison, Return Hit/Miss, with hash chaining on a miss) repeated for many requests side by side, up to n]
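A rough sketch of the batching idea: GET requests are queued until a batch is full, then the whole batch is looked up at once. The `GetBatcher` and `lookup_batch` names are hypothetical, a plain dictionary stands in for the key-value store, and a Python loop stands in for the GPU kernel in which each batch entry would be handled by its own work item.

```python
from typing import Dict, List, Optional


def lookup_batch(store: Dict[bytes, bytes],
                 keys: List[bytes]) -> List[Optional[bytes]]:
    # On a GPU, index i of the batch would map to work item i.
    # Here an ordinary loop stands in for the parallel kernel.
    return [store.get(k) for k in keys]


class GetBatcher:
    """Collects GET keys and processes them one batch at a time."""

    def __init__(self, store: Dict[bytes, bytes], batch_size: int) -> None:
        self.store = store
        self.batch_size = batch_size
        self.pending: List[bytes] = []

    def submit(self, key: bytes) -> Optional[List[Optional[bytes]]]:
        """Queue a GET; return the batch results once the batch is full."""
        self.pending.append(key)
        if len(self.pending) < self.batch_size:
            return None  # still waiting for more requests
        return self.flush()

    def flush(self) -> List[Optional[bytes]]:
        """Process whatever is queued, even if the batch is not full."""
        keys, self.pending = self.pending, []
        return lookup_batch(self.store, keys)
```

A larger `batch_size` exposes more parallel work per launch, but the first request in a batch waits until the last one arrives, which is the throughput versus latency trade-off the deck returns to later.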
7
Porting Memcached
Main Goals
– Increase request throughput
– Keep request latency reasonable
Main Challenges
– Irregular memory access patterns
– Irregular control flow
– Data transfer overheads
8
Methodology
Hardware
– AMD Radeon HD 5870 (Discrete)
– AMD Llano A8-3850 (Fusion)
– AMD Zacate E-350 (Fusion)
Simulators
– GPGPU-Sim v3.x
– In-house GPU control flow simulator
Testing and Simulation
– Traces of Wikipedia accesses
9
Porting Memcached
Memory Access
One request per work item
Data accesses for GET requests are input data dependent
Data can be anywhere in memory
– Poor performance on GPU?
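One way to make the concern concrete is to count how many distinct cache lines a group of work items touches in a single load. The sketch below is a simplified model, not the authors' measurement method: the 64-byte line size, the group of 64 work items, and the example addresses are all assumptions chosen for illustration.

```python
from typing import Iterable


def cache_lines_touched(addresses: Iterable[int], line_size: int = 64) -> int:
    """Number of distinct cache lines covered by one load per work item."""
    return len({addr // line_size for addr in addresses})


# 64 work items reading consecutive 4-byte words: 256 bytes in total.
coalesced = [i * 4 for i in range(64)]

# 64 work items each reading from an unrelated location, as happens when
# every work item chases a different key's hash bucket.
scattered = [i * 4096 for i in range(64)]

print(cache_lines_touched(coalesced))  # 4
print(cache_lines_touched(scattered))  # 64
```

In this model the scattered pattern needs sixteen times as many memory transactions for the same number of work items, which is why input-dependent addresses raise the question on the slide.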
10
Porting Memcached
Memory Divergence
11
Porting Memcached
Control Flow
Recall the control flow graph
Many branch outcomes are input data dependent
[Figure: work item IDs 1-5 start together, then split into smaller groups (such as 1, 2, 5 and 3, 4) as branches resolve]
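The cost of data-dependent branches can be illustrated with a very simplified SIMD efficiency model: work items that share hardware lanes must step through every basic block that any of them executes, and lanes not on that path sit idle. The function name, the block names, and the assumption that every block costs the same and is visited at most once are all illustrative choices, not taken from the authors' simulator.

```python
from typing import Dict, Sequence


def simd_efficiency(paths: Sequence[Sequence[str]]) -> float:
    """Fraction of lane slots doing useful work under a toy divergence model.

    paths[i] lists the basic blocks executed by work item i. Every block
    executed by at least one work item occupies all lanes for one step.
    """
    if not paths or not any(paths):
        raise ValueError("need at least one work item with a non-empty path")
    width = len(paths)
    active: Dict[str, int] = {}
    for path in paths:
        for block in set(path):  # count each block once per work item
            active[block] = active.get(block, 0) + 1
    issued_slots = len(active) * width
    return sum(active.values()) / issued_slots


# Hypothetical example: work items 1, 2, 5 hit; work items 3, 4 miss.
hit = ["hash", "compare", "hit"]
miss = ["hash", "compare", "miss"]
print(simd_efficiency([hit, hit, miss, miss, hit]))  # 0.75
```

Here four blocks are issued across five lanes (20 slots), but only 15 slots do useful work, because the hit and miss blocks each run with some lanes masked off.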
12
Porting Memcached
Control Flow
[Figure: values of 15%, 40%, 62%, and 29% shown for individual regions; Overall 51%]
13
Porting Memcached
Data Management
Dynamic memory manager
Transfer memory regions to device
Virtual addresses different on host and device
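Because a transferred region sits at a different virtual address on the device, raw host pointers stored inside it are not valid there. One common way to handle this, shown here as a sketch and not necessarily what the authors implemented, is to express each pointer as an offset from the start of the region and rebase it against the device-side base address. The function name and the example addresses are hypothetical.

```python
def rebase(host_addr: int, host_base: int, device_base: int,
           region_size: int) -> int:
    """Translate a host pointer into the matching device address.

    Assumes the region was copied byte for byte, so an object keeps the
    same offset from the region base on both sides.
    """
    offset = host_addr - host_base
    if not 0 <= offset < region_size:
        raise ValueError("pointer does not fall inside the transferred region")
    return device_base + offset


# Hypothetical addresses: an item 0x40 bytes into a 4 KiB region.
device_addr = rebase(host_addr=0x7F0000001040,
                     host_base=0x7F0000001000,
                     device_base=0x2000,
                     region_size=4096)
print(hex(device_addr))  # 0x2040
```

Storing offsets instead of pointers inside the region avoids the translation step entirely, at the cost of an extra addition on every dereference.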
14
Porting Memcached
Data Transfer Reduction
Fusion Systems
– Physical shared memory region between host and device
– Zero-copy data
Discrete Systems
– Possible transfer reduction techniques
  Reduction in unnecessary transfers
  Acyclic data transfers (Overlap comm. with comp.)
  Automatic data transfer frameworks
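The benefit of overlapping communication with computation can be estimated with a simple two-stage pipeline model, sketched below. It assumes every batch has the same transfer time and the same compute time, and that one transfer can run while one batch is being computed. The function names and the timing values are illustrative and are not measurements from the talk.

```python
def serial_time(num_batches: int, t_transfer: float, t_compute: float) -> float:
    """Total time when each batch is transferred, then computed, in turn."""
    return num_batches * (t_transfer + t_compute)


def overlapped_time(num_batches: int, t_transfer: float,
                    t_compute: float) -> float:
    """Total time when the next transfer overlaps the current computation.

    The first transfer and the last computation cannot be hidden; every
    step in between is bounded by the slower of the two stages.
    """
    if num_batches == 0:
        return 0.0
    slower_stage = max(t_transfer, t_compute)
    return t_transfer + (num_batches - 1) * slower_stage + t_compute


# Illustrative numbers only: 4 batches, 2.0 ms transfer, 1.0 ms compute.
print(serial_time(4, 2.0, 1.0))      # 12.0
print(overlapped_time(4, 2.0, 1.0))  # 9.0
```

In this model overlap hides only the faster stage, so when transfers dominate, the remaining cost is still roughly the total transfer time. That is consistent with the deck's point that zero-copy on Fusion systems removes the transfer altogether.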
15
Porting Memcached
16
Results
Radeon HD 5870
~8000 requests yield the highest ratio of throughput to latency
17
Summary
Programmer intuition doesn’t always paint the whole picture
We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system
Data transfer overheads can have a large impact on overall performance
Thank you – Questions?