Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.

Name: Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.
Uploaded: 2017-08-20T20:16:10+00:00
Duration: PTM6S55
Channel: Erick Knight
Description: Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.

Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached
Bohua Kou Jing gao

Overview
Motivation and introduction to Memcached servers
Performance analysis of existing servers
Identifying inefficiencies and bottlenecks
Thin Servers with Smart Pipes (TSSP) architecture
TSSP performance results
Conclusion and questions

Internet Service Workload
Internet services and data are growing rapidly
Database servers alone cannot maintain performance at a justifiable cost
Other types of infrastructure are needed to allow modern internet services to scale

Memcached Servers
Distributed key-value stores
Implemented using a hash table; keys are unique, and each key maps to only a single server at any time
Easy to scale (connecting to additional servers is simple)
Attractive because the server interface is general
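As an illustration of "each key maps to only a single server at any time", here is a minimal client-side sketch. The slides do not say how keys are assigned to servers; the hash-modulo scheme, the function name `server_for_key`, and the host names are all assumptions made for this example.

```python
import hashlib

def server_for_key(key: str, servers: list[str]) -> str:
    """Pick the single server responsible for a key.

    Hypothetical scheme: hash the key and take it modulo the server count.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

# Hypothetical server list; 11211 is memcached's default port.
servers = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
owner = server_for_key("user:42", servers)
```

With plain modulo hashing, adding a server remaps most keys, which is why real clients often use consistent hashing instead; the sketch only shows the one-key-one-server property.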

Workloads
Memcached behavior can vary considerably with the size and access frequency of the objects it stores
Five workloads were designed to capture a wide range of behavior and to explore bottlenecks and inefficiencies in existing servers
FixedSize: fixed object size and uniform popularity distribution; small objects place the greatest stress on Memcached performance
MicroBlog: object size and popularity distribution based on a sample of “tweets” collected from Twitter
Wiki: uses the entire Wikipedia database; each object represents an individual article in HTML format
ImgPreview: sample thumbnail photos and associated view counts from Flickr
FriendFeed: same as MicroBlog

Workload Characteristics

System Under Test
Two different server systems: a high-end Xeon-based server vs. a low-power Atom-based server
Three different classes of network interface card (NIC)
A total of 21 different system configurations

Simulation Results (CPU Bottlenecks)
On the left: requests per second for GET requests to fixed-size objects; the gap becomes larger when the object size is below 1 KB
On the right: the performance bottleneck shifts from the network to the CPU
Xeon-class systems operate at less than 1/8 of their theoretical instruction throughput
Atom-class systems: 1/16 of their theoretical instruction throughput

Simulation Results (Caching Bottlenecks)
L1 instruction-cache miss rates are very high
The L1 data cache and the L2 cache perform moderately well

Simulation Results (TLB and Branch Prediction Bottlenecks)
TLB behavior on the Xeon is comparable to recent characterization work
The Atom's ITLB is insufficient, which causes most of its instruction-fetch stalls
Branch misprediction rates are high on the Atom because its branch predictor is less capable than the Xeon's

Simulation Results (Impact of NIC quality)

Simulation Results
Even though the Atom is a low-power processor with a significantly lower peak power than the Xeon, its power efficiency is worse

Overcoming Bottlenecks:
Bottleneck: the GET operation
The most latency- and throughput-critical Memcached task
Up to a 30:1 GET/SET ratio
Microarchitecture solution: Thin Servers with Smart Pipes (TSSP)
Shift GET handling from the core to a hardware pipeline
Use UDP instead of TCP
Thin Servers = embedded-class cores
Smart Pipes = integrated, on-die NIC with nearby hardware
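To make the UDP point concrete, the sketch below builds a single-datagram GET request. The slides do not describe the wire format; this assumes the standard memcached UDP framing (an 8-byte header of request ID, sequence number, datagram count, and a reserved field, followed by the text-protocol command). The function name `build_udp_get` is hypothetical.

```python
import struct

def build_udp_get(key: str, request_id: int) -> bytes:
    """Build one memcached UDP datagram carrying a text-protocol GET.

    Assumed header layout (four big-endian 16-bit fields):
    request ID, sequence number, total datagrams, reserved (must be 0).
    """
    header = struct.pack("!HHHH", request_id, 0, 1, 0)
    return header + b"get " + key.encode("ascii") + b"\r\n"

packet = build_udp_get("user:42", request_id=7)
```

A request like this fits in one datagram and needs no connection state, which is one plausible reason a fixed hardware pipeline is easier to build for UDP GETs than for TCP.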

TSSP Architecture Overview
Integrated NIC and accelerator
Integrated NIC: handles incoming requests
MMU: virtual address translation
Memcached accelerator: responds to GET requests without software interaction
All other request types and memory management are handled by software
Atom-based low-power core
Figure labels: GET request; all the other requests

NIC and Memcached Accelerator
NIC: routes based on IP address, port, and protocol
Memcached accelerator: processes GET requests
Hardware-traversable hash table
Figure blocks: data center network (TCP), NIC, Memcached accelerator, system MMU, SoC fabric
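The routing decision can be sketched as a small classifier over the three fields the slide names. This is only an illustration: the `Packet` type, the `route` function, the destination labels, and the rule that UDP traffic to memcached's default port goes to the accelerator are assumptions, not details taken from the slides.

```python
from dataclasses import dataclass

MEMCACHED_PORT = 11211  # memcached's default port; assumed for this sketch

@dataclass(frozen=True)
class Packet:
    dst_ip: str
    dst_port: int
    protocol: str  # "udp" or "tcp"

def route(packet: Packet, local_ip: str) -> str:
    """Return a hypothetical destination: 'accelerator', 'core', or 'drop'."""
    if packet.dst_ip != local_ip:
        return "drop"  # not addressed to this server
    if packet.protocol == "udp" and packet.dst_port == MEMCACHED_PORT:
        return "accelerator"  # candidate GET traffic for the hardware pipeline
    return "core"  # everything else is handled by software

gets = route(Packet("10.0.0.1", 11211, "udp"), local_ip="10.0.0.1")
other = route(Packet("10.0.0.1", 11211, "tcp"), local_ip="10.0.0.1")
```

Because the decision uses only header fields, it can be made before any payload is parsed; requests the accelerator cannot serve would still have to be passed on to the software path.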

Hardware and Software Data Structure
Variable-length key -> 64-bit identifier (key’)
Memcached's lookup structure (hash table) is split from the key and value storage (slab-allocated memory)
4 possible slots for a given hash
The key's last-access timestamp must be updated
Hardware manages all access to the hash table to avoid expensive synchronization between hardware and software
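A minimal software model of such a lookup structure is sketched below. It reads "4 possible slots for a given hash" as a 4-slot bucket, which is one interpretation; the hash function, the class and method names, and the integer `value_ref` standing in for a slab-memory location are assumptions. The sketch also does not re-check the full key after an identifier match, which a real design would need to handle identifier collisions.

```python
import hashlib
import time

SLOTS_PER_BUCKET = 4  # from the slide: 4 possible slots for a given hash

def key_id(key: bytes) -> int:
    """Reduce a variable-length key to a 64-bit identifier (hash choice assumed)."""
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

class HashTable:
    """Lookup structure only; values live elsewhere and are referenced by index."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets
        # Each slot is None or a list [identifier, value_ref, last_access].
        self.buckets = [[None] * SLOTS_PER_BUCKET for _ in range(num_buckets)]

    def set(self, key: bytes, value_ref: int) -> bool:
        ident = key_id(key)
        bucket = self.buckets[ident % self.num_buckets]
        for slot in bucket:  # overwrite an existing entry first
            if slot is not None and slot[0] == ident:
                slot[1], slot[2] = value_ref, time.monotonic()
                return True
        for i, slot in enumerate(bucket):  # otherwise take the first free slot
            if slot is None:
                bucket[i] = [ident, value_ref, time.monotonic()]
                return True
        return False  # all 4 slots in use; eviction is not modeled here

    def get(self, key: bytes):
        ident = key_id(key)
        for slot in self.buckets[ident % self.num_buckets]:
            if slot is not None and slot[0] == ident:
                slot[2] = time.monotonic()  # update the last-access timestamp
                return slot[1]
        return None

table = HashTable(num_buckets=8)
table.set(b"user:42", value_ref=3)
found = table.get(b"user:42")
```

Note that even a GET writes to the table (the timestamp), which illustrates why the slide has a single owner, the hardware, manage all hash-table accesses.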

Evaluation Results
Implementation: Memcached as an FPGA appliance
Area: 8000 lookup tables (LUTs), ~2% of the FPGA
Power: estimated TSSP total power = SoC Xilinx Zynq platform + non-CPU components of the Atom-based system
Performance: measured processing time for the Memcached workloads, e.g. FixedSize
Greater improvements are expected if implemented as an ASIC rather than an FPGA
Table: TSSP energy efficiency comparison for the FixedSize workload

Conclusion
Identified several system bottlenecks for high-performance and low-power CPUs
Proposed the TSSP design: low-power embedded-class core + Memcached accelerator
Potential 6x to 16x improvement in energy efficiency over existing servers

Discussion
The evaluation of TSSP used only the FixedSize workload for comparison; is that enough?
The analysis of TSSP power consumption is weak. Is the estimated power reliable?
TSSP also modifies the software; is the change easy to make?
