Continuous Retrieval of Replicated Data from Heterogeneous Storage Arrays 9/10/2014 Nihat Altiparmak and Ali Saman Tosun Mascots 2014.



9/10/2014, N. Altiparmak, MASCOTS 2014, University of Louisville, USA

Outline
- Background
  - Big Data, Storage Arrays, Distributed and Heterogeneous Storage Architectures
  - Replicated Declustering and Retrieval
- Continuous Retrieval Techniques
  - Batching, conservative, adaptive
- Evaluation

Big Data
- The total amount of data in the digital universe today is on the order of zettabytes (~10^21 B), and it is constantly growing
  - A couple of exabytes (~10^18 B) of new information is created every day through sensors, Internet transactions, e-mails, social media, video surveillance, genome sequencing, etc.
- Many organizations store this data to enable breakthrough discoveries and innovation in science, engineering, medicine, commerce, national security, etc.
  - Spent some time in a start-up receiving 2 petabytes (~10^15 B) of data every month
- As data grows, disk I/O performance needs further attention since it can significantly limit the performance and scalability of applications
- Especially for high performance parallel I/O, efficient storage and retrieval of data is crucial

Storage Arrays
- One way to achieve scalable storage and high performance I/O is to use storage arrays
- A storage array is a group of disk drives that collectively acts as a single storage system
  - Multiple disk drives
  - Controller (CPU + Memory)
  - Single EMC Symmetrix VMAX
    - 240 disk drives
    - Four quad-core 2.33 GHz Intel Xeon processors
    - Up to 128 GB of memory
  - It is possible to connect multiple VMAX arrays
    - Up to 2400 drives and 1 TB of memory
- Costs millions of dollars

Storage Arrays
- Traditionally, storage arrays are composed of rotating Hard Disk Drives (HDD)
  - 7.2K Revolutions Per Minute (RPM)
  - 10K RPM
  - 15K RPM
- Solid-state Drive (SSD)
  - Uses flash memory packages
  - Same interface as HDD, easily replaceable
  - Faster start-up, fast random access, low power consumption, silent operation, less heat, shock resistance
  - Expensive, wears out, limited capacity, slower sequential write

Flash and Hybrid Arrays
- Flash Arrays: entirely based on flash technology
  - Some flash arrays currently available: Nimbus S-Class, Nimbus E-Class, RamSan 810, Violin 6000, Violin 3000
- Hybrid Storage Arrays: balance cost and performance (SSD + HDD)
  - Better performance compared to homogeneous HDD based storage arrays, cheaper than homogeneous SSD based flash arrays
  - Some hybrid storage arrays currently available: EqualLogic PS6100XS, Zebi Storage Arrays, Adaptec Hybrid RAID Solutions
[Figure: Violin 3200 Flash Array]

Distributed and Heterogeneous Storage Architecture
[Figure: three kinds of storage arrays in one architecture: a hybrid storage array (15K RPM HDDs and SSD), a flash array (SSD), and an HDD storage array (10K RPM HDDs)]

Declustering for High Performance Parallel I/O
[Figure: data buckets declustered across Disk 0, Disk 1, Disk 2, Disk 3, ...; annotated "One Disk Access"]
Declustering schemes:
- Disk Modulo [Du’82]
- Field-wise Exclusive OR [Kim’88]
- Hilbert [Faloutsos’93]
- Generalized Fibonacci [Prabhakar’98]
- AOPT: Almost Optimal [Atallah’00]
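As a rough illustration of the first scheme in this list, the sketch below applies the Disk Modulo rule (bucket (i, j) is stored on disk (i + j) mod N) to two small range queries. The grid size, the four-disk array, the queries, and the function names are all hypothetical choices made for this example; they are not taken from the slides.

```python
from collections import Counter

def disk_modulo(bucket, n_disks):
    """Disk Modulo: bucket (i, j) is stored on disk (i + j) mod N."""
    i, j = bucket
    return (i + j) % n_disks

def parallel_accesses(query, n_disks):
    """Disks work in parallel, so the cost of a query is the largest
    number of its buckets that land on any single disk."""
    per_disk = Counter(disk_modulo(b, n_disks) for b in query)
    return max(per_disk.values())

N_DISKS = 4  # hypothetical array size
row_query = [(0, j) for j in range(4)]                       # 1 x 4 range query
square_query = [(i, j) for i in range(2) for j in range(2)]  # 2 x 2 range query

# Every bucket of the row query sits on a different disk: one access.
print(parallel_accesses(row_query, N_DISKS))
# Buckets (0, 1) and (1, 0) share disk 1, so the square query needs two.
print(parallel_accesses(square_query, N_DISKS))
```

The second query is the kind of case where a single copy of the data cannot reach the one-access ideal, which is one reason to consider replication.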

Replication
- Replication is a common technique used for redundancy and better performance in declustering schemes
- Several replicated declustering schemes were proposed recently
  - [Chen’03], [Ferhat.’04], [Tosun’04 and ’05], [Frikken’02 and ’05], [Oktay’09], [Turk’12]
- Optimal Response Time Retrieval (Replica Selection) Problem
  - N disks and |Q| buckets
  - Each bucket can be replicated among multiple disks
  - Find a retrieval schedule minimizing the retrieval time of the query Q
[Figure: a query (Q) shown on Replica 1 and Replica 2]
- Retrieval using the first copy requires two disk accesses
- We can use the second copy to retrieve Q in one access
- Which replica should be used for the best performance?

How to Solve the Basic Retrieval Problem
- Max-flow solution [Chen’93]
  [Figure: flow network from source s through bucket vertices [0,0], [0,1], [1,0], [1,1], [2,0], [2,1] and disk vertices to sink t]
  - Check whether max-flow = |Q| (6 in this example)
  - If not, increment the capacities of the disk-t edges and call max-flow again
  - O(|Q|) max-flow calls in the worst case
- Assumptions of the basic problem:
  1. Disks are homogeneous
  2. No initial load
  3. No network delay
- Generalized max-flow solution [Altiparmak’12 and ’13]
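The max-flow procedure on this slide can be sketched as follows, under the three basic assumptions (homogeneous disks, no initial load, no network delay). This is an illustrative sketch, not the authors' implementation: it uses a plain Edmonds-Karp max-flow, rebuilds the network for each capacity value instead of incrementally raising capacities, and the replica placement in `replicas` is made up for the example.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp on residual capacities cap[u][v]; mutates cap."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:          # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:              # walk back from t to s
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck

def build_network(replicas, disk_capacity):
    """s -> bucket (capacity 1) -> replica disks (1) -> t (disk_capacity)."""
    cap = defaultdict(lambda: defaultdict(int))
    for bucket, disks in replicas.items():
        cap["s"][("b", bucket)] = 1
        for d in disks:
            cap[("b", bucket)][("d", d)] = 1
            cap[("d", d)]["t"] = disk_capacity
    return cap

def optimal_schedule(replicas):
    """Raise the disk-t capacity until the flow reaches |Q|."""
    for c in range(1, len(replicas) + 1):
        cap = build_network(replicas, c)
        if max_flow(cap, "s", "t") == len(replicas):
            # A saturated bucket->disk edge means that disk serves the bucket.
            schedule = {b: d for b, disks in replicas.items()
                        for d in disks if cap[("b", b)][("d", d)] == 0}
            return c, schedule

# Hypothetical placement: six buckets, each stored on two of three disks.
replicas = {(0, 0): [0, 1], (0, 1): [0, 2], (1, 0): [0, 1],
            (1, 1): [0, 2], (2, 0): [1, 2], (2, 1): [1, 2]}
accesses, schedule = optimal_schedule(replicas)
print(accesses, schedule)
```

In this made-up placement, reading only the first-listed copy of every bucket would put four buckets on disk 0, whereas the max-flow schedule spreads the six buckets two per disk.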

Continuous Retrieval
- Max-flow guarantees the optimal retrieval schedule of a given (single) request
- In reality, requests arrive continuously
- Finding the retrieval schedules individually might not result in the best performance
[Figure: request queues in front of devices]

Continuous Retrieval
- We focus on optimizing continuous disk requests
- Multiple trade-offs are considered:
  - Batching for better load balancing and smaller Service Time vs. immediately retrieving requests for shorter Waiting Time
  - Using a maximum flow based retrieval algorithm guaranteeing the optimal Service Time vs. a faster retrieval heuristic with lower Execution Time
- Minimize the Average Response (Elapsed) Time of disk requests considering their Waiting Time, Execution Time, and Service Time
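One way to write this objective down, assuming that the response time of a request is simply the sum of its three components (the slide lists the components but does not state the formula), is shown below. The symbols W_i, E_i, S_i, and n are notation introduced here for illustration.

```latex
% R_i: response time of request i, over n requests
% W_i: waiting time, E_i: execution time of the scheduling algorithm,
% S_i: service time at the disks
R_i = W_i + E_i + S_i,
\qquad
\bar{R} = \frac{1}{n} \sum_{i=1}^{n} R_i
```

Read this way, batching aims to lower S_i at the cost of a larger W_i, while a fast heuristic lowers E_i at the possible cost of a larger S_i.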

Batching
- When a new request arrives:
  - If the storage system is idle, determine the retrieval schedule
  - Else, batch the incoming requests
- Lower total Service Time (better load balancing)
- Extra Waiting Time
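The batching rule can be sketched as a small event-driven scheduler. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the `on_idle` callback is assumed to be invoked by the storage system when it finishes its current work, and a simple shortest-queue heuristic stands in for whichever retrieval algorithm (such as max-flow) is actually used.

```python
from collections import Counter

def shortest_queue(buckets, replicas):
    """Placeholder heuristic: send each bucket to its least-loaded replica."""
    load, plan = Counter(), {}
    for b in buckets:
        disk = min(replicas[b], key=lambda d: load[d])
        load[disk] += 1
        plan[b] = disk
    return plan

class BatchingScheduler:
    def __init__(self, replicas):
        self.replicas = replicas   # bucket -> list of disks holding a copy
        self.pending = []          # buckets that arrived while the system was busy
        self.busy = False

    def on_arrival(self, buckets):
        """Schedule right away if idle, otherwise add to the batch."""
        if self.busy:
            self.pending.extend(buckets)   # this is the extra Waiting Time
            return None
        self.busy = True
        return shortest_queue(buckets, self.replicas)

    def on_idle(self):
        """Schedule the whole batch at once when the current work finishes."""
        batch, self.pending = self.pending, []
        self.busy = bool(batch)
        return shortest_queue(batch, self.replicas) if batch else None

# Hypothetical replica placement on three disks.
replicas = {"a": [0, 1], "b": [0, 1], "c": [1, 2], "d": [0, 2]}
sched = BatchingScheduler(replicas)
first = sched.on_arrival(["a"])        # system idle: scheduled immediately
sched.on_arrival(["b"])                # system busy: batched
sched.on_arrival(["c", "d"])           # system busy: batched
batched = sched.on_idle()              # b, c, d scheduled together
print(first, batched)
```

Scheduling b, c, and d together lets the scheduler see all three at once, which is where the better load balancing comes from.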

Immediate-conservative
- When a new request arrives, immediately determine the retrieval schedule using the initial load information of the disks
  - Eliminates the Waiting Time introduced by the batching strategy
  - Expected to yield a larger total Service Time
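A minimal sketch of the immediate-conservative idea follows: each request is scheduled on arrival against the current per-disk load, and earlier decisions are never revisited. The names are hypothetical, the load is measured as a count of outstanding buckets, and a shortest-queue choice stands in for the actual retrieval algorithm.

```python
class ImmediateConservative:
    def __init__(self, replicas, n_disks):
        self.replicas = replicas       # bucket -> list of disks holding a copy
        self.load = [0] * n_disks      # outstanding buckets per disk (initial load)

    def on_arrival(self, buckets):
        """Schedule now, on top of whatever is already queued at each disk."""
        plan = {}
        for b in buckets:
            disk = min(self.replicas[b], key=lambda d: self.load[d])
            self.load[disk] += 1
            plan[b] = disk             # final: never rescheduled later
        return plan

    def on_retrieved(self, disk):
        """A disk finished one bucket."""
        self.load[disk] -= 1

# Hypothetical placement: bucket "a" has two copies, bucket "b" only one.
sched = ImmediateConservative({"a": [0, 1], "b": [0]}, n_disks=2)
plan_a = sched.on_arrival(["a"])   # tie between disks 0 and 1, picks disk 0
plan_b = sched.on_arrival(["b"])   # must use disk 0, which is already loaded
print(plan_a, plan_b, sched.load)
```

In this example disk 0 ends up with both buckets while disk 1 stays idle, which illustrates why a larger total Service Time is expected.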

Immediate-adaptive
- Allows rescheduling of the previously scheduled but non-retrieved buckets
- When a new request arrives, immediately determine the retrieval schedule using the initial loads and non-retrieved buckets
- These non-retrieved buckets are combined with the new request, providing more flexibility and resulting in better total Service Time
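The rescheduling step can be sketched as below. This is an illustration only: the names are hypothetical, `in_service` is an assumed input giving the per-disk load that can no longer be moved, and a greedy pass that places the buckets with the fewest copies first stands in for the actual retrieval algorithm.

```python
class ImmediateAdaptive:
    def __init__(self, replicas):
        self.replicas = replicas   # bucket -> list of disks holding a copy
        self.queued = {}           # bucket -> disk, scheduled but not yet retrieved

    def on_arrival(self, buckets, in_service):
        """Pool the non-retrieved buckets with the new request and reschedule.

        in_service[d] is the load on disk d that is already being served
        and therefore cannot be rescheduled.
        """
        pool = list(self.queued) + list(buckets)
        pool.sort(key=lambda b: len(self.replicas[b]))  # least flexible first
        load = list(in_service)
        self.queued = {}
        for b in pool:
            disk = min(self.replicas[b], key=lambda d: load[d])
            load[disk] += 1
            self.queued[b] = disk
        return dict(self.queued)

    def on_retrieved(self, bucket):
        """The bucket has been read and can no longer be rescheduled."""
        self.queued.pop(bucket, None)

# Hypothetical placement: bucket "a" has two copies, bucket "b" only one.
sched = ImmediateAdaptive({"a": [0, 1], "b": [0]})
plan_1 = sched.on_arrival(["a"], in_service=[0, 0])   # "a" goes to disk 0
plan_2 = sched.on_arrival(["b"], in_service=[0, 0])   # "a" is moved to disk 1
print(plan_1, plan_2)
```

Because "a" was still waiting when "b" arrived, it can be moved to disk 1, so each disk serves one bucket instead of disk 0 serving two.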

Evaluation
- Simulations using real world traces
  - Exchange, TPC-E, TPC-C traces
  - Around 1K, 25K, and 100K requests per second, respectively
  - Up to 2K, 120, and 200 buckets in each request, respectively
- Homogeneous and heterogeneous storage configurations using real disk parameters
- Used several retrieval algorithms/heuristics
  - Max-flow, random, shortest queue, online, etc.

Exchange
[Figure: results for the Exchange trace; the plot is not preserved in the transcript]

References
[Altiparmak’12] N. Altiparmak and A. S. Tosun. Integrated maximum flow algorithm for optimal response time retrieval of replicated data, in ICPP’12.
[Altiparmak’13] N. Altiparmak and A. S. Tosun. Generalized optimal response time retrieval of replicated data from storage arrays. ACM Transactions on Storage, vol. 9, no. 2, pp. 5:1–5:36, Jul. 2013.
[Atallah’00] M. J. Atallah and S. Prabhakar. (Almost) optimal parallel block access for range queries, in PODS’00.
[Chen’93] L. T. Chen and D. Rotem. Optimal response time retrieval of replicated data, in PODS’94.
[Chen’03] C.-M. Chen and C. Cheng. Replication and retrieval strategies of multidimensional data on parallel disks, in CIKM’03.
[Du’82] H. C. Du and J. S. Sobolewski. Disk allocation for cartesian product files on multiple-disk systems. ACM Trans. on Database Systems, 7(1):82–101, March 1982.
[Faloutsos’93] C. Faloutsos and P. Bhagwat. Declustering using fractals, in PDIS’93.
[Ferhat.’04] H. Ferhatosmanoglu, A. S. Tosun, and A. Ramachandran. Replicated declustering of spatial data, in PODS’04.
[Frikken’02] K. Frikken, M. J. Atallah, S. Prabhakar, and R. Safavi-Naini. Optimal parallel I/O for range queries through replication, in DEXA’02.
[Frikken’05] K. Frikken. Optimal distributed declustering using replication, in ICDT’05.
[Kim’88] M. H. Kim and S. Pramanik. Optimal file distribution for partial match retrieval, in SIGMOD’88.
[Oktay’09] K. Yasin Oktay, A. Turk, and C. Aykanat. Selective replicated declustering for arbitrary queries, in Euro-Par’09.
[Prabhakar’98] S. Prabhakar, K. Abdel-Ghaffar, D. Agrawal, and A. El Abbadi. Cyclic allocation of two-dimensional data, in ICDE’98.
[Tosun’04] A. S. Tosun. Replicated declustering for arbitrary queries, in SAC’04.
[Tosun’05] A. S. Tosun. Design theoretic approach to replicated declustering, in ITCC’05.
[Turk’12] A. Turk, K. Y. Oktay, and C. Aykanat. Query-log aware replicated declustering. IEEE Transactions on Parallel and Distributed Systems, vol. 99, no. PrePrints, 2012.

Thank You! Any Questions?