Fast Accesses to Big Data in Memory and Storage Systems


Fast Accesses to Big Data in Memory and Storage Systems (Xiaodong Zhang, Ohio State University)

Fast Data Accesses in In-Memory Key-Value Stores

Information Processing without Tables: Key-Value Stores
A simple but effective method in data processing where a data record (or a value) is stored and retrieved with its associated key:
- variable types and lengths of records (values)
- simple or no schema
- easy software development for many applications
Key-value stores have been widely used in production systems.

Simple and Easy Interfaces of Key-Value Stores
value = get(key)
put(key, value)
Keys and values are variable-length; for example, get("John_age") returns 38. (Slide figure: a client issues get(key) against the server-side index.)
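To make the interface concrete, here is a minimal C++ sketch of the get/put calls described on this slide, built on an ordinary hash map. It is illustrative only and not the code of any particular key-value store.

#include <string>
#include <unordered_map>

// put(key, value): store or overwrite a variable-length value under a key.
// get(key, &value): return true and fill value if the key exists.
class KVStore {
public:
    void put(const std::string& key, const std::string& value) {
        index_[key] = value;                       // the index maps keys to values
    }
    bool get(const std::string& key, std::string* value) const {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        *value = it->second;
        return true;
    }
private:
    std::unordered_map<std::string, std::string> index_;
};

// Usage matching the slide: put("John_age", "38"); then get("John_age", &v) yields "38".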

Key-Value Store: A Data Model Supporting Scale-out
Commands and their operations:
- GET(key): read a value
- SET(key, value): write a KV item
- DEL(key): delete a KV item
Requests are routed by Hash(key) -> server ID across the data servers, giving finer control over availability (see http://www.couchbase.com/nosql-resources/what-is-no-sql).
Big data applications demand infrastructures for data collecting, organizing, and management, and functionality-richer systems, such as relational databases and file systems, can be built above a key-value store.
- It is horizontally scalable: it removes the scalability bottleneck constraining distributed databases and file systems.
- It can potentially deliver (very) high performance.
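The Hash(key) -> server ID routing above is what makes the model horizontally scalable. A hedged sketch follows; the hash choice and the function name are ours, not from the slides.

#include <cstdint>
#include <string>
#include <functional>

// Map a key to one of num_servers data servers.
uint32_t server_for_key(const std::string& key, uint32_t num_servers) {
    // Any uniform hash works here; std::hash is used purely for illustration.
    return static_cast<uint32_t>(std::hash<std::string>{}(key) % num_servers);
}

// Adding servers only changes num_servers (or the hash ring, if consistent
// hashing is used), which is what removes the scale-out bottleneck.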

Workflow of an In-Memory Key-Value Store: network processing, memory management, index operations, and value access.

Workflow of a Typical Key-Value Store (GET / SET / DELETE)
- Network processing: TCP/IP processing, request parsing, extracting the key (GET/DELETE) or the key and value (SET)
- Index operations: search in the index (GET), insert into the index (SET), delete from the index (DELETE)
- Memory management (SET): if memory is full, evict; then allocate space for the new item
- Access value (GET): read and send the value
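As a rough illustration of these stages, the sketch below maps each command to the corresponding steps. All types and names are hypothetical; the real data path (TCP/IP handling, eviction, and so on) is omitted.

#include <string>
#include <unordered_map>
#include <iostream>

enum class Op { GET, SET, DEL };
struct Request { Op op; std::string key; std::string value; };

void handle(std::unordered_map<std::string, std::string>& index, const Request& req) {
    switch (req.op) {                         // network processing has already parsed op/key/value
    case Op::GET: {                           // index operation: search
        auto it = index.find(req.key);
        std::cout << (it != index.end() ? it->second : "NOT_FOUND") << "\n";  // access value
        break;
    }
    case Op::SET:                             // memory management: allocate (evict if full), then insert
        index[req.key] = req.value;
        break;
    case Op::DEL:                             // index operation: delete
        index.erase(req.key);
        break;
    }
}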

Where Does Time Go in a KV-Store? (MICA [NSDI'14])
(Slide figure: execution-time breakdown across network processing & memory management, index operations, and value access.) Index operations become one of the major bottlenecks.

Data Workflow of Key-Value Stores: a query goes through network processing & memory management, then the index operation on a hash table, which incurs random memory accesses, and finally the value access.

Random Memory Accesses for Indexing Are Expensive
(Slide figure: measured sequential vs. random memory access performance on an Intel Xeon E5-2650v2 with 1600 MHz DDR3 memory; random accesses are several times slower than sequential accesses.)
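The gap can be reproduced with a simple pointer-chasing micro-benchmark like the sketch below (our code, not the slide's benchmark). The array must be much larger than the CPU caches, and absolute numbers will depend on the machine.

#include <vector>
#include <numeric>
#include <random>
#include <chrono>
#include <utility>
#include <iostream>

int main() {
    const size_t N = 1u << 26;                    // ~64M entries (~512 MB), far larger than CPU caches
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sequential traversal: prefetcher-friendly, mostly DRAM row-buffer hits.
    auto t0 = std::chrono::steady_clock::now();
    size_t sum = 0;
    for (size_t i = 0; i < N; ++i) sum += next[i];
    auto t1 = std::chrono::steady_clock::now();

    // Sattolo's algorithm: turn next[] into a single-cycle permutation so each
    // load depends on the previous one (a classic pointer-chasing latency test).
    std::mt19937_64 rng(42);
    for (size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    size_t pos = 0;
    auto t2 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i) pos = next[pos];   // random, dependent accesses
    auto t3 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "sequential: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms, "
              << "random: "     << std::chrono::duration_cast<ms>(t3 - t2).count() << " ms "
              << "(sum=" << sum << ", pos=" << pos << ")\n";
}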

Reason: Long Latency Comes from Row Buffer Misses
(Slide figure: DRAM access timing on the processor bus, showing precharge, row access, and column access of one cache line through the row buffer into the DRAM core.)
- Row buffer hits: repeated accesses within the same DRAM page
- Row buffer misses: random accesses to different pages

Why CPUs Cannot Accelerate Random Accesses (the index operations)
- Caching: the working set is large (~100 GB) while CPU caches are small (~10 MB)
- Prefetching: the next memory address is hard to predict
- Multithreading: the number of hardware-supported threads is limited, and concurrency is bounded by the number of Miss Status Holding Registers (MSHRs) when there are large volumes of outstanding misses

High Throughput Is Desirable for Processing Big Data
Throughput measures the capability of a system to process a growing amount of requests on an increasingly large data set.
Throughput of existing KV-stores (MOPS), all aiming for low latency:
- MICA (CMU research prototype): 71
- MassTree (MIT research prototype): 14
- RAMCloud (Stanford research prototype): 6
- Memcached (open-source software): 1.3
Key-value store latency (milliseconds): under 1 ms is acceptable in practice, e.g. at Facebook and Amazon.
Our goal: maximize throughput subject to an acceptable latency.

Accelerating Index Operations with GPUs
Mega-KV addresses two issues: a large number of requests and random-access delay.
- Network processing: user-space I/O, multiget, UDP
- Memory management: bitmap, optimistic concurrent access
- Index operations: offloaded to and accelerated by GPUs
- Access value: prefetching

CPU vs. GPU (control logic and caches vs. massive ALUs)
- Intel Xeon E5-2650v2: 2.3 billion transistors, 8 cores, 59.7 GB/s memory bandwidth
- Nvidia GTX 780: 7 billion transistors, 2,304 cores, 288.4 GB/s memory bandwidth

Two Advantages of GPUs for Key-Value Stores
- Massive parallel processing units: to process a large number of simple and independent memory-access operations
- Massive hiding of random memory access latency: GPUs can effectively hide memory latency with massive hardware threads and zero-overhead, hardware-supported thread scheduling
(Speaker notes) We find that GPUs have two advantages for KV stores. First, they have massive processing units: the operations in KV stores are simple, independent memory accesses, and the thousands of GPU cores are ideal for processing them in parallel. Second, they can massively hide memory access latency: as shown previously, index operations involve many random memory accesses, and GPUs can effectively hide them with hardware multithreading.

Two Unique Advantages of GPUs for Key-Value Stores
- Massive GPU processing units: to process a large number of simple, independent memory-access operations
- Massive hiding of random memory access latency: GPUs effectively hide memory latency with massive hardware threads and zero-overhead, hardware-supported thread scheduling
(Slide figure: on a GPU core, when thread A issues a memory request after a cache miss, the instruction buffer switches to thread B, then to thread C, and so on, so memory latency overlaps with useful work.)
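A minimal CUDA sketch of this idea: each query becomes an independent thread, and while one warp waits on its random global-memory load the scheduler runs other resident warps. The table layout and names are illustrative, not Mega-KV's actual kernel.

#include <cstdint>

// One hash-table slot: a compressed key signature plus the value's location.
struct Slot { uint32_t signature; uint32_t location; };

// Each thread handles one query with a single random global-memory load.
// (A real cuckoo hash table would probe two candidate buckets; see later slides.)
__global__ void probe_kernel(const Slot* table, uint32_t table_mask,
                             const uint32_t* query_hashes, uint32_t* out_locations,
                             int num_queries) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_queries) return;
    uint32_t h = query_hashes[i];
    Slot s = table[h & table_mask];        // long-latency random load; while this thread
                                           // waits, the SM runs other resident warps
    out_locations[i] = (s.signature == h) ? s.location : 0xFFFFFFFFu;  // miss sentinel
}

// Host side: oversubscribe every SM so there are always warps ready to run, e.g.
//   probe_kernel<<<(num_queries + 255) / 256, 256>>>(table, mask, hashes, out, num_queries);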

Mega-KV System Framework
- Pre-processing (CPU): network processing and memory management for incoming requests
- Parallel processing in GPU: the index operations
- Post-processing (CPU): read & send value, transmitting the responses (TX)
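A hedged host-side sketch of one iteration of this pipeline, assuming the device buffers and the CUDA stream were already allocated with cudaMalloc and cudaStreamCreate. The real system overlaps the three stages on separate CPU cores and CUDA streams rather than running them back to back, and the kernel body below is only a placeholder for the actual index operations.

#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// One batch of requests accumulated by the pre-processing stage.
struct Batch {
    std::vector<uint32_t> h_signatures;   // fixed-length key signatures (host side)
    std::vector<uint32_t> h_locations;    // value locations returned by the GPU
    uint32_t *d_signatures = nullptr;     // device buffers, assumed cudaMalloc'ed elsewhere
    uint32_t *d_locations  = nullptr;
    cudaStream_t stream    = nullptr;     // assumed created with cudaStreamCreate
};

// Placeholder for the real index-operation kernels (search/insert/delete).
__global__ void search_kernel(const uint32_t* sigs, uint32_t* locs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) locs[i] = sigs[i];         // a real kernel would probe the hash table
}

void run_batch(Batch& b) {
    int n = static_cast<int>(b.h_signatures.size());
    size_t bytes = static_cast<size_t>(n) * sizeof(uint32_t);
    b.h_locations.resize(n);

    // 1. Pre-processing (CPU) happened before this call: network processing,
    //    memory management, and packing key signatures into h_signatures.

    // 2. GPU processing: PCIe transfer in, index kernels, PCIe transfer out.
    cudaMemcpyAsync(b.d_signatures, b.h_signatures.data(), bytes,
                    cudaMemcpyHostToDevice, b.stream);
    search_kernel<<<(n + 255) / 256, 256, 0, b.stream>>>(b.d_signatures, b.d_locations, n);
    cudaMemcpyAsync(b.h_locations.data(), b.d_locations, bytes,
                    cudaMemcpyDeviceToHost, b.stream);
    cudaStreamSynchronize(b.stream);

    // 3. Post-processing (CPU): use h_locations to read each value from host
    //    memory and send the response ("Read & Send Value", then TX).
}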

Challenges of Offloading Index Operations to GPUs
1. GPU memory capacity is small (~10 GB), while the keys may take hundreds of GBs.
2. PCIe bandwidth is low: PCIe generally becomes the GPU's bottleneck when bulk data has to be transferred.
3. Handling variable-length input is inefficient on GPUs: either search along the input buffer or transfer an additional offset buffer, plus variable-length string comparisons.
4. Linked lists are inefficient on GPUs: the locks needed to insert/delete linked items hinder parallelism, and traversal causes more random memory accesses.

Our Solutions
- Compress each variable-length input key into a fixed-length signature before sending it to the GPU: addresses challenges 2 and 3 (PCIe bandwidth and variable-length data).
- A GPU-optimized cuckoo hash table that stores only key signatures and value locations: addresses challenges 1, 3, and 4 (GPU memory capacity, variable-length data, and linked lists).
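The sketch below illustrates the two ideas with invented constants and layout: variable-length keys are compressed on the CPU into fixed-length signatures, and the GPU-resident table stores only (signature, value location) pairs in set-associative buckets with two candidate buckets per key, cuckoo style.

#include <cstdint>
#include <cstddef>

// (1) Compress a variable-length key into a 32-bit signature plus two candidate
// bucket indices (FNV-1a and the alternate-bucket formula are illustrative only).
static void compress_key(const char* key, size_t len, uint32_t num_buckets,
                         uint32_t* signature, uint32_t* bucket1, uint32_t* bucket2) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; ++i) { h ^= (uint8_t)key[i]; h *= 1099511628211ULL; }
    *signature = (uint32_t)(h >> 32);
    *bucket1   = (uint32_t)(h % num_buckets);
    *bucket2   = (*bucket1 ^ (*signature * 0x5bd1e995u)) % num_buckets;  // alternate bucket
}

// (2) Cuckoo-style table: each bucket holds a few (signature, location) slots, so a
// lookup touches at most two buckets, and inserts displace entries to the alternate
// bucket instead of chaining through linked lists.
constexpr int SLOTS_PER_BUCKET = 8;
struct Bucket {
    uint32_t signature[SLOTS_PER_BUCKET];
    uint32_t location[SLOTS_PER_BUCKET];   // offset of the full KV item in host memory
};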

High Throughput Comes from a Large Batch Size, Subject to Acceptable Latencies

System scalability is the result of balancing latency and throughput: as the number of concurrent requests grows, throughput rises but so does latency, and the sweet spot is the point that maximizes throughput while latency remains acceptable. (Slide figure, translated from Chinese: throughput and latency plotted against the number of concurrent processes, with the optimal point marked.)
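One simple way to see the trade-off (the notation is ours, not from the slides): with batch size B and GPU batch-processing time t(B),

\[
  \text{Throughput}(B) = \frac{B}{t(B)}, \qquad
  \text{Latency}(B) \approx t_{\text{accumulate}}(B) + t(B).
\]

Because fixed kernel-launch and PCIe overheads are amortized, t(B) grows sublinearly in B, so throughput rises with the batch size, but the accumulation and processing delays rise as well; the optimal point is the largest batch whose latency is still acceptable.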

A Significant Execution-Time Reduction by Fast Indexing
(Slide figure: per-query execution time of a CPU-based KV store vs. Mega-KV, broken down into network processing & memory management, index operations, and value access; fast indexing shrinks the index-operation portion.)

Reaching a Record-High Throughput
We measure throughput with 95% GET and 5% SET queries under both uniform and skewed key distributions. With 8-byte keys and 8-byte values, Mega-KV reaches up to 160 MOPS. Across the data sets, it is 1.9x to 2.8x (about 2x) faster than the fastest CPU-based key-value stores.

Low and Acceptable Latency Compared with Facebook
(Slide figure: distribution of Mega-KV's round-trip latency at its maximum throughput of 160 MOPS.) The 50th and 95th percentile latencies are 256 and 390 microseconds, respectively, whereas Facebook reports 300 and 1,200 microseconds for the same percentiles. Our latency is therefore well within what production systems require.

The Mega-KV software is open to the public (VLDB '15): Mega-KV is open source at http://kay21s.github.io/megakv/

Summary: restructuring basic data structures to respond to big data
- Preserving buffer-cache locality under fast writes: LSbM-tree
- Balancing network transfer and I/O bandwidth: RCFile/ORC
- Accelerating index operations with GPUs: Mega-KV