Scalable Multi-Access Flash Store for Big Data Analytics
Sang-Woo Jun, Ming Liu, Kermin Fleming*, Arvind
Massachusetts Institute of Technology
*Intel, work done while at MIT
This work is funded by Quanta and Samsung. We also thank Xilinx for their generous hardware donations.
February 27, 2014
Big Data Analytics
Analysis of previously unimaginable amounts of data provides deep insight.
Examples:
- Analyzing Twitter data to predict flu outbreaks weeks before the CDC
- Analyzing personal genomes to determine predisposition to disease
- Data mining scientific datasets to extract accurate models
Big data analytics is likely to be the biggest economic driver for the IT industry for the next five or more years.
Big Data Analytics
Data so large and complex that traditional processing methods are no longer effective:
- Too large to fit reasonably in fast RAM
- Random-access intensive, making prefetching and caching ineffective
- Often stored in the secondary storage of multiple machines in a cluster
Storage system and network performance become first-order concerns.
Designing Systems for Big Data
Because of the poor random-read latency of disks, analytics systems were traditionally optimized to reduce disk accesses or to make them sequential.
The widespread adoption of flash storage is changing this picture.
Modern systems need to balance all aspects of the system to achieve optimal performance.
Designing Systems for Big Data
Network optimizations:
- High-performance network links (e.g., Infiniband)
Storage optimizations:
- High-performance storage devices (e.g., FusionIO)
- In-store computing (e.g., Smart SSD)
- Network-attached storage devices (e.g., QuickSAN)
Software optimizations:
- Hardware implementations of network protocols
- Flash-aware distributed file systems (e.g., CORFU)
- FPGA- or GPU-based accelerators (e.g., Netezza)
Cross-layer optimization is required for optimal performance.
Current Architectural Solution
Previously insignificant inter-component communications become significant.
When accessing remote storage:
- Read from flash into RAM
- Transfer from local RAM to remote RAM
- Transfer to the accelerator and back
[Diagram: two motherboards, each with a CPU, RAM, FusionIO storage, and an FPGA, connected by Infiniband]
Hardware and software latencies are additive, creating potential bottlenecks at multiple transport points (see the sketch below).
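As a rough illustration of why additive latencies matter, the sketch below sums hypothetical per-stage latencies for one remote flash read in the conventional architecture. All numbers are illustrative assumptions, not measurements from the prototype.

```c
#include <stdio.h>

/* Hypothetical per-stage latencies (microseconds) for one remote flash
 * read in the conventional architecture. These values are illustrative
 * assumptions only, not measurements. */
struct stage { const char *name; double latency_us; };

int main(void) {
    struct stage stages[] = {
        { "flash read into remote RAM",      100.0 },
        { "remote storage software stack",    50.0 },
        { "network transfer (RAM to RAM)",    20.0 },
        { "local software stack",             50.0 },
        { "copy to accelerator and back",     20.0 },
    };
    double total = 0.0;
    for (size_t i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        printf("%-35s %8.1f us\n", stages[i].name, stages[i].latency_us);
        total += stages[i].latency_us;   /* latencies simply add up */
    }
    printf("%-35s %8.1f us\n", "end-to-end (additive)", total);
    return 0;
}
```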
BlueDBM Architecture Goals
- Low-latency, high-bandwidth access to scalable distributed storage
- Low-latency access to FPGA acceleration
Many interesting problems do not fit in DRAM but can fit in flash.
[Diagram: two motherboards, each with a CPU, RAM, and an FPGA-attached flash store; the FPGAs are connected directly by a dedicated network]
BlueDBM System Architecture
The FPGA manages the flash, the network, and the host link.
[Diagram: 20 nodes, each consisting of 1TB of flash and a programmable controller (Xilinx VC707) attached to a host PC over PCIe; the controllers are connected by a controller network of ~8 high-speed (12Gb/s) GTX serial links, and the host PCs by Ethernet]
A proof-of-concept prototype is functional; the full system is expected to be constructed by summer 2014.
Inter-Controller Network
BlueDBM provides a dedicated storage network between controllers:
- Low-latency, high-bandwidth serial links (GTX)
- A hardware implementation of a simple packet-switched protocol allows flexible topologies
- Extremely low protocol overhead (~0.5μs per hop)
- The protocol supports virtual channels to ease development
[Diagram: programmable controllers (Xilinx VC707), each with 1TB of flash, connected by high-speed (12Gb/s) GTX serial links]
A sketch of a possible packet format follows.
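The slides do not give the actual packet format. As a minimal sketch, assuming a design where each link carries packets tagged with a destination node and a virtual-channel ID, a header might look like the following; all field names and widths are hypothetical.

```c
#include <stdint.h>

/* Hypothetical header for the packet-switched inter-controller protocol.
 * Field names and widths are assumptions for illustration; the real
 * BlueDBM protocol is implemented in FPGA logic and is not specified at
 * this level of detail in the slides. */
typedef struct {
    uint8_t  dst_node;      /* destination controller (0..N-1)         */
    uint8_t  src_node;      /* originating controller                   */
    uint8_t  virtual_chan;  /* virtual channel, e.g. requests vs. data  */
    uint8_t  flags;         /* e.g. last-packet marker                  */
    uint16_t length;        /* payload bytes in this packet             */
    uint16_t tag;           /* request tag for matching responses       */
} net_header_t;

/* A controller forwards a packet toward dst_node using a routing table;
 * with ~0.5us of overhead per hop, total network latency grows roughly
 * linearly with hop count. */
static inline int next_hop(const uint8_t routing_table[], const net_header_t *h) {
    return routing_table[h->dst_node];  /* output link index for this destination */
}
```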
Controller as a Programmable Accelerator
BlueDBM provides a platform for implementing accelerators on the datapath of the storage device:
- Adds almost no latency, because no additional data transport is needed
- Compression or filtering accelerators can deliver more throughput than the host-side link (i.e., PCIe) allows
Example: word counting in a document. Counting is done inside the accelerator, and only the result needs to be transported to the host (see the sketch below).
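To make the data-reduction argument concrete, the sketch below models an in-path word-counting filter in software: it consumes a page streamed from flash and emits only a count, so the host link carries a few bytes instead of the whole page. This is an illustrative model only; the actual accelerator is FPGA logic on the controller.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of an in-path word-counting accelerator. It consumes a
 * page streamed from flash and produces only a count, so the host link
 * carries a few bytes instead of the full page. */
uint32_t count_words_in_page(const uint8_t *page, size_t page_bytes) {
    uint32_t count = 0;
    int in_word = 0;
    for (size_t i = 0; i < page_bytes; i++) {
        if (isspace(page[i]) || page[i] == '\0') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            count++;                  /* a new word starts here */
        }
    }
    return count;                     /* only this value crosses PCIe */
}
```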
Two-Part Implementation of Accelerators
Accelerators are implemented both before and after network transport:
- Local accelerators parallelize across nodes
- Global accelerators have a global view of the data
Virtualized interfaces to storage and network are provided at both locations.
[Diagram: local accelerators (e.g., count words) sit above the flash at each node; their results cross the network to a global accelerator (e.g., aggregate count) attached to the host]
A sketch of this split appears below.
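A minimal sketch of the local/global split for the word-count example, assuming hypothetical helper functions for reading flash pages; the intent is only to show that each node reduces its own data locally and the global accelerator aggregates small partial results.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; these names are assumptions for illustration. */
extern size_t   node_page_count(int node);
extern size_t   read_local_page(int node, size_t page, uint8_t *buf, size_t cap);
extern uint32_t count_words_in_page(const uint8_t *page, size_t page_bytes);

/* Local accelerator: runs at every node, next to its flash. */
uint32_t local_word_count(int node) {
    uint8_t page[8192];
    uint32_t partial = 0;
    for (size_t p = 0; p < node_page_count(node); p++) {
        size_t n = read_local_page(node, p, page, sizeof page);
        partial += count_words_in_page(page, n);   /* reduce locally */
    }
    return partial;              /* only 4 bytes cross the network */
}

/* Global accelerator: aggregates the small per-node results. */
uint64_t global_word_count(const uint32_t *partials, int num_nodes) {
    uint64_t total = 0;
    for (int i = 0; i < num_nodes; i++)
        total += partials[i];
    return total;
}
```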
Prototype System (Proof-of-Concept)
The prototype system is built using Xilinx ML605 boards and a custom-built flash daughterboard:
- 16GB capacity and 80MB/s bandwidth per node
- Network over GTX high-speed transceivers
The flash board is five years old, from an earlier project.
Prototype System (Proof-of-Concept)
Because the ML605 has only one SMA port pinned out, the prototype is forced into a switched-fabric topology with dedicated hub nodes.
Prototype Network Performance
Prototype Bandwidth Scaling
[Graph: aggregate bandwidth scaling with node count; annotations mark the maximum PCIe bandwidth and the region only reachable by the accelerator]
Prototype Latency Scaling
Benefits of In-Path Acceleration
- Less software overhead
- Not bottlenecked by the flash-to-host link
We experimented using hash-based word counting, comparing three configurations (see the comparison sketch below):
- In-path: words are counted on the way out of storage
- Off-path: data is copied back to the FPGA for counting
- Software: counting is done in software on the host
Both hardware configurations are being built.
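As a back-of-the-envelope illustration of why the in-path configuration avoids the host-link bottleneck, the sketch below tallies how many bytes cross PCIe per flash page in each configuration. The page size and the assumption that off-path counting sends each page to the host and back are illustrative, not measured properties of the prototype.

```c
#include <stdio.h>

/* Illustrative model of PCIe traffic per flash page for the three
 * hash-counting configurations. The 8KB page size and exact data paths
 * are assumptions for illustration. */
int main(void) {
    const long page_bytes   = 8192;  /* assumed flash page size */
    const long result_bytes = 4;     /* a single 32-bit count   */

    /* In-path: counting happens on the way out of storage;
     * only the result crosses PCIe. */
    long in_path  = result_bytes;

    /* Off-path: the page crosses PCIe to the host, is sent back to the
     * FPGA for counting, and the result returns. */
    long off_path = page_bytes + page_bytes + result_bytes;

    /* Software: the page crosses PCIe once and is counted on the host. */
    long software = page_bytes;

    printf("PCIe bytes per page: in-path=%ld off-path=%ld software=%ld\n",
           in_path, off_path, software);
    return 0;
}
```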
Work in Progress
Improved flash controller and file system:
- Wear leveling, garbage collection
- Distributed file system as an accelerator
Application-specific optimizations:
- Optimized data routing/network protocol
- Storage/link compression
Database acceleration:
- Offload database operations to the FPGA
- Accelerated filtering, aggregation, and compression could improve performance
Application exploration:
- Including medical applications and Hadoop
Thank You
Takeaway: Analytics systems can be made more efficient by using reconfigurable fabric to manage data routing between system components.
[Diagram: two nodes, each with flash, an FPGA, and a CPU + RAM, with the FPGAs managing the network between them]
FPGA Resource Usage
New System Improvements
A new 20-node cluster will exist by summer:
- New flash board with 1TB of storage at 1.6GB/s per node, for 20TB of flash storage in total
- 8 GTX serial links (12Gb/s each) pinned out as SATA ports, allowing flexible topologies
- Xilinx VC707 FPGA boards, rack-mounted for ease of handling
[Diagram: rack-mounted nodes 0 through 5, …]
The aggregate numbers are worked out in the short calculation below.
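A quick worked calculation of the aggregate numbers implied by this slide; the only assumption beyond the slide is that the raw link rate is quoted before encoding and protocol overhead.

```c
#include <stdio.h>

int main(void) {
    const int    nodes             = 20;
    const double flash_tb_per_node = 1.0;   /* 1TB flash per node            */
    const double flash_gbs_node    = 1.6;   /* GB/s flash bandwidth per node */
    const int    links_per_node    = 8;
    const double link_gbit         = 12.0;  /* Gb/s per GTX serial link      */

    printf("total flash capacity : %.0f TB\n", nodes * flash_tb_per_node);
    printf("aggregate flash bw   : %.0f GB/s\n", nodes * flash_gbs_node);
    /* Raw network rate per node, ignoring encoding/protocol overhead. */
    printf("raw network per node : %.0f Gb/s (~%.0f GB/s)\n",
           links_per_node * link_gbit, links_per_node * link_gbit / 8.0);
    return 0;
}
```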
Example: Data-Driven Science
Massive amounts of scientific data are being collected. Instead of building a hypothesis and then collecting relevant data, scientists extract models from massive datasets.
University of Washington N-body workshop:
- Multiple snapshots of particles in a simulated universe
- Interactive analysis provides insights to scientists
- ~100GB, not very large by today's standards
Example queries: "Visualize clusters of particles moving faster than X"; "Visualize particles that were destroyed after being hotter than Y"
Workload Characteristics
- Large volume of data to examine
- Access pattern is unpredictable; often only a random subset of snapshots is accessed
[Diagram: a sequential scan selects rows of interest in Snapshot 1 using the "hotter than Y" predicate; the "destroyed?" check then accesses matching rows of Snapshot 2 at random intervals]
A sketch of such a query appears below.
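A minimal sketch of the access pattern the diagram describes, with hypothetical particle structures and lookup helpers: a sequential scan of one snapshot selects rows of interest, and each match triggers a random-access lookup into a later snapshot, which is exactly the pattern that defeats prefetching and caching.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical particle record and snapshot accessors, for illustration. */
typedef struct { uint64_t id; float temperature; bool destroyed; } particle_t;

extern size_t            snapshot_size(int snapshot);
extern const particle_t *snapshot_row(int snapshot, size_t row);      /* sequential-friendly */
extern const particle_t *snapshot_lookup(int snapshot, uint64_t id);  /* random access       */

/* Count particles hotter than y in snapshot s1 that are destroyed by s2. */
size_t hot_then_destroyed(int s1, int s2, float y) {
    size_t count = 0;
    for (size_t r = 0; r < snapshot_size(s1); r++) {            /* sequential scan of s1 */
        const particle_t *p = snapshot_row(s1, r);
        if (p->temperature > y) {
            const particle_t *q = snapshot_lookup(s2, p->id);   /* random access into s2 */
            if (q == NULL || q->destroyed)
                count++;
        }
    }
    return count;
}
```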
Constraints in System Design
Latency is important in a random-access-intensive scenario. The components of a computer system have widely different latency characteristics.
"even when DRAM in the cluster is ~40% the size of the entire database, [... we could only process] 900 reads per second, which is much worse than the 30,000" ("Using HBase with ioMemory", FusionIO)
Existing Solutions - 1
Baseline: a single PC with DRAM and disk.
- The dataset exceeds normal DRAM capacity
- Data on disk is accessed randomly, so disk seek time bottlenecks performance
Solution 1: More DRAM (expensive, limited capacity)
Solution 2: Faster storage (flash storage provides fast random access, e.g., PCIe flash cards like FusionIO)
Solution 3: More machines
Existing Solutions - 2
A cluster of machines over a network:
- With enough machines, the network becomes a bottleneck: the network fabric is slow and the software stack adds overhead
- Fitting everything in DRAM is expensive
- There is a trade-off between the network and storage bottlenecks
Solution 1: A high-performance network such as Infiniband
Existing Solutions - 3
A cluster of expensive machines:
- Lots of DRAM
- FusionIO storage
- Infiniband network
Now computation might become the bottleneck.
Solution 1: FPGA-based application-specific accelerators
Previous Work
Software optimizations:
- Hardware implementations of network protocols
- Flash-aware distributed file systems (e.g., CORFU)
- Accelerators on FPGAs and GPUs
Storage optimizations:
- High-performance storage devices (e.g., FusionIO)
Network optimizations:
- High-performance network links (e.g., Infiniband)
BlueDBM Goals
BlueDBM is a distributed storage architecture with the following goals:
- Low-latency, high-bandwidth storage access
- Low-latency hardware acceleration
- Scalability via an efficient network protocol and topology
- Multi-accessibility via efficient routing and controller design
- Application compatibility via a normal storage interface
[Diagram: a motherboard with CPU and RAM, and an FPGA connecting flash and the network]
BlueDBM Node Architecture
The address mapper determines the location of the requested page:
- All nodes must agree on the mapping
- It is currently defined programmatically, to maximize parallelism by striping pages across nodes (a sketch follows)
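As a minimal sketch of what "striping across nodes" could look like, assuming a simple modulo mapping over a global page address; the actual mapping in BlueDBM is programmatically defined and may differ.

```c
#include <stdint.h>

/* Hypothetical striped address mapper: consecutive global page addresses
 * are spread round-robin across nodes so that large scans hit all nodes
 * in parallel. The real BlueDBM mapping only needs to be agreed on by
 * all nodes; this modulo scheme is just one simple choice. */
typedef struct { uint32_t node; uint64_t local_page; } page_location_t;

static inline page_location_t map_page(uint64_t global_page, uint32_t num_nodes) {
    page_location_t loc;
    loc.node       = (uint32_t)(global_page % num_nodes); /* which node holds it   */
    loc.local_page = global_page / num_nodes;             /* page within that node */
    return loc;
}
```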
Prototype System
The topology of the prototype is restricted because the ML605 has only one SMA port, so the network requires hub nodes.
Controller as a Programmable Accelerator
BlueDBM provides a platform for implementing accelerators on the datapath of the storage device:
- Adds almost no latency, because no additional data transport is needed
- Compression or filtering accelerators can deliver more throughput than the host-side link (i.e., PCIe) allows
Example: word counting in a document. The controller can be programmed to count words and return only the result.
[Diagram: flash feeding an accelerator at each node, which feeds the host server]
Accelerator Configuration Comparisons