Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached. Bohua Kou, Jing Gao

Presentation transcript:


Overview: motivation and introduction to Memcached servers; performance analysis on existing servers; identifying inefficiencies and bottlenecks; the Thin Servers with Smart Pipes (TSSP) architecture; TSSP performance results; conclusion and questions.

Internet Service Workload: Internet services and their data are growing rapidly. Database servers alone cannot maintain performance at a justifiable cost. Other types of infrastructure are needed to allow modern internet services to scale.

Memcached Servers: Memcached is a distributed key-value store implemented using a hash table. Keys are unique, and each key maps to only a single server at any time. It is easy to scale, because connecting to additional servers is simple, and it is attractive because the server interface is general.
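The "each key maps to only a single server" property can be sketched as a client-side hash over the key. This is a minimal illustration only: the modulo scheme, the `pick_server` name, and the server addresses are assumptions made for the example, not something the slides specify, and production clients often use consistent hashing instead.

```python
import hashlib


def pick_server(key: str, servers: list[str]) -> str:
    """Map a key to exactly one server (hypothetical modulo scheme)."""
    if not servers:
        raise ValueError("at least one server is required")
    # Hash the key so that keys spread roughly evenly over the servers.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Hypothetical server list, for illustration only.
servers = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
chosen = pick_server("user:42", servers)
```

With a plain modulo scheme like this one, adding a server remaps most keys, which is one reason consistent hashing is a common alternative.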

Workloads: Memcached behavior can vary considerably based on the size and access frequency of the objects it stores. Five workloads were designed to capture a wide range of behavior and to explore bottlenecks and inefficiencies in existing servers. FixedSize: fixed object size and uniform popularity distribution; small objects place the greatest stress on Memcached performance. MicroBlog: object size and popularity distribution based on a sample of “tweets” collected from Twitter. Wiki: uses the entire Wikipedia database; each object represents an individual article in HTML format. ImgPreview: samples thumbnail photos and associated view counts from Flickr. FriendFeed: same as MicroBlog.

Workload Characteristics

System Under Test: two different server systems, a high-end Xeon-based server and a low-power Atom-based server; three different classes of network interface card (NIC); a total of 21 different system configurations.

Simulation Results (CPU Bottlenecks): On the left, requests per second for GET requests to fixed-size objects; the gap becomes larger when the object size is below 1 KB. On the right, the performance bottleneck shifts from the network to the CPU. Xeon-class systems operate at less than 1/8 of their theoretical instruction throughput; Atom-class systems operate at 1/16 of theirs.

Simulation Results (Caching Bottlenecks): The L1 instruction cache suffers badly from misses, while the L1 data cache and the L2 cache perform moderately well.

Simulation Results (TLB and Branch Prediction Bottlenecks): TLB behavior for the Xeon is comparable to that reported in recent characterization work. The Atom's ITLB is insufficient, which causes most of its instruction-fetch stalls. Branch misprediction rates are high for the Atom because its branch predictor is less capable than the Xeon's.

Simulation Results (Impact of NIC quality)

Simulation Results: Even though the Atom is a low-power processor with a significantly lower peak power than the Xeon, its power efficiency is worse.

Overcoming Bottlenecks. Bottleneck: the GET operation, the most latency- and throughput-critical Memcached task, with a GET/SET ratio of up to 30:1. Microarchitectural solution: Thin Servers with Smart Pipes (TSSP), which shifts GET handling from the core to a hardware pipeline and uses UDP in place of TCP. Thin servers = embedded-class cores. Smart pipes = an integrated, on-die NIC with nearby hardware.
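To see why offloading GET alone covers most of the traffic, the 30:1 ratio can be turned into a request share. The sketch below assumes, for simplicity, that GET and SET are the only request types; the function name is made up for the example.

```python
def get_fraction(gets: int, sets: int) -> float:
    """Share of requests that are GETs, assuming only GETs and SETs exist."""
    return gets / (gets + sets)


# At the 30:1 GET/SET ratio quoted on the slide:
share = get_fraction(30, 1)
print(f"{share:.1%}")  # prints "96.8%"
```

Under that assumption, roughly 97% of requests could be served by the hardware pipeline at the peak ratio, leaving only the remainder for software.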

TSSP Architecture Overview: an integrated NIC and accelerator. Integrated NIC: handles incoming requests. MMU: virtual address translation. Memcached accelerator: responds to GET requests without software interaction. All other request types and memory management are handled by software. Atom-based low-power core. Diagram labels: GET request; all the other requests.

NIC and Memcached Accelerator. Diagram labels: Data Center Network (TCP); NIC (routes based on IP address, port, and protocol); Memcached Accelerator (processes GET requests); System MMU; SoC fabric; hardware-traversable hash table.
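The routing decision described on these slides can be sketched as a small dispatch function. Everything concrete here is an assumption for illustration: the `Packet` fields, the address `10.0.0.5`, and the idea that the command is already parsed are hypothetical; 11211 is simply Memcached's conventional default port. The slides only state that routing is based on IP address, port, and protocol, and that GETs go to the accelerator while other requests go to software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    dst_ip: str
    dst_port: int
    protocol: str  # "udp" or "tcp"
    command: str   # e.g. "get", "set", "delete" (assumed already parsed)


# Hypothetical address and port of the Memcached service on this node.
MEMCACHED_IP = "10.0.0.5"
MEMCACHED_PORT = 11211


def route(packet: Packet) -> str:
    """Return "accelerator" for UDP GETs to the service, else "software"."""
    matches_service = (
        packet.dst_ip == MEMCACHED_IP
        and packet.dst_port == MEMCACHED_PORT
        and packet.protocol == "udp"
    )
    if matches_service and packet.command == "get":
        return "accelerator"
    # All other request types and all non-matching traffic fall back to the core.
    return "software"


fast = route(Packet("10.0.0.5", 11211, "udp", "get"))
slow = route(Packet("10.0.0.5", 11211, "tcp", "set"))
```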

Hardware and Software Data Structure: A variable-length key of 1-255 bytes is mapped to a 64-bit identifier, key'. Memcached's lookup structure (the hash table) is split from the key and value storage (slab-allocated memory). There are 4 possible slots for a given hash. Each access needs to update the key's last-access timestamp. Hardware manages all access to the hash table to avoid expensive synchronization between hardware and software.
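A software model of this lookup structure might look like the sketch below. It is only an illustration under stated assumptions: the choice of BLAKE2b to derive the 64-bit identifier, the bucket count, the least-recently-accessed replacement policy, and the full-key check against the value storage are not taken from the slides, and the Python list standing in for slab-allocated memory never frees old entries.

```python
import hashlib
import time

SLOTS_PER_BUCKET = 4  # "4 possible slots for a given hash"
NUM_BUCKETS = 1024    # hypothetical table size


def key_id(key: bytes) -> int:
    """Reduce a 1-255 byte key to a 64-bit identifier (hash choice is assumed)."""
    if not 1 <= len(key) <= 255:
        raise ValueError("key must be 1-255 bytes")
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")


class SlotTable:
    def __init__(self) -> None:
        # Each slot is None or (identifier, value index, last-access time).
        self.buckets = [[None] * SLOTS_PER_BUCKET for _ in range(NUM_BUCKETS)]
        self.values = []  # stand-in for slab-allocated key/value storage

    def set(self, key: bytes, value: bytes) -> None:
        ident = key_id(key)
        bucket = self.buckets[ident % NUM_BUCKETS]
        # Prefer a slot already holding this identifier, then an empty slot.
        target = next((i for i, s in enumerate(bucket) if s and s[0] == ident), None)
        if target is None:
            target = next((i for i, s in enumerate(bucket) if s is None), None)
        if target is None:
            # Assumed policy: replace the least recently accessed slot.
            target = min(range(SLOTS_PER_BUCKET), key=lambda i: bucket[i][2])
        self.values.append((key, value))
        bucket[target] = (ident, len(self.values) - 1, time.monotonic())

    def get(self, key: bytes):
        ident = key_id(key)
        bucket = self.buckets[ident % NUM_BUCKETS]
        for i, slot in enumerate(bucket):
            if slot is not None and slot[0] == ident:
                stored_key, value = self.values[slot[1]]
                if stored_key != key:
                    return None  # identifier collision: treat as a miss
                # Refresh the last-access timestamp on every hit.
                bucket[i] = (ident, slot[1], time.monotonic())
                return value
        return None


table = SlotTable()
table.set(b"user:42", b"hello")
```

Keeping only fixed-size identifiers in the table means a lookup touches at most four small slots before following a single reference into the value storage.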

Evaluation Results: implementation, area, power, and performance. Implementation: Memcached as an FPGA appliance. Area: 8000 look-up tables, about 2% of the FPGA. Power: estimated TSSP total power = SoC Xilinx Zynq platform + non-CPU components of the Atom-based system. Performance: measured as processing time for the Memcached workloads, e.g. FixedSize; greater improvements are expected if implemented as an ASIC rather than an FPGA. Table: TSSP energy efficiency comparison for the FixedSize workload.

Conclusion: Several system bottlenecks were identified for both high-performance and low-power CPUs. The proposed TSSP design combines a low-power embedded-class core with a Memcached accelerator. Potential 6-16X improvement in energy efficiency over existing servers.

Discussion: The evaluation of TSSP used only the FixedSize workload for comparison; is that enough? The analysis of TSSP power consumption is weak; is the estimated power reliable? TSSP also modifies the software; is the change easy to make?