Decentralized Distributed Storage System for Big Data
Presenter: Wei Xie
Data-Intensive Scalable Computing Laboratory (DISCL), Computer Science Department, Texas Tech University
Texas Tech University 2016 Symposium on Big Data
Outline
Trends in big data and cloud storage
Decentralized storage techniques
UniStore project at Texas Tech
Big Data Storage Requirements
Large capacity: hundreds of terabytes of data and more
Performance-intensive: demanding big data analytics applications, real-time response
Data protection: protect hundreds of terabytes of data from loss
Why Data Warehousing Fails for Big Data
Data warehousing has been used to process very large data sets for decades
A core component of Business Intelligence
Not designed to handle unstructured data (log files, social media, etc.)
Not designed for real-time, fast response
Comparison
Traditional data warehousing problem: retrieve the sales figures of a particular item, stored in a database, across a chain of retail stores
Big data problem: cross-reference sales of a particular item with weather conditions at the time of sale, or with various customer details, and retrieve that information quickly
Big Data Storage Trends
Scale-out storage
A number of compute/storage elements connected via a network
Capacity and performance can be added incrementally
Not limited by the RAID controller
Big Data Storage Trends
Scale-out NAS
NAS: network-attached storage
Scale-out offers more flexible capacity/performance expansion (add NAS nodes instead of adding disks to the slots of a single NAS box)
A parallel/distributed file system (e.g., Hadoop) handles the scale-out NAS
Products: EMC Isilon, Hitachi Data Systems, Data Direct Networks hScaler, IBM SONAS, HP X9000, and NetApp Data ONTAP
Big Data Storage Trends
Object storage
Flat namespace instead of the hierarchical namespace of a file system
Objects are identified by IDs
Better scalability and performance for very large numbers of objects
Example: Amazon S3
Hyperscale architecture
Mainly used at large infrastructure sites by Facebook, Google, Microsoft, and Amazon
Scale-out DAS: direct-attached storage, i.e., commodity enterprise servers with locally attached storage devices
Redundancy: fail over an entire server instead of individual components
Hadoop runs on top of a cluster of DAS to support big data analytics
Part of the Software-Defined Storage platform
Commercial product: EMC's ViPR
Hyper-converged Storage
Compute, network, storage, and virtualization tightly integrated
Buy a hardware box and get all you need
Vendors: VMware, Nutanix, Nimboxx
Scale-out Storage: Centralized vs. Decentralized
Centralized storage cluster
  Metadata server, storage servers, and interconnections
  Scalability is bounded by the metadata server
  Multi-site distributed storage?
  Redundancy achieved by RAID
Decentralized storage cluster
  No metadata server to limit scalability
  Multi-site, geographically distributed
  Data replicated across servers, racks, or sites
Decentralized Storage
How to distribute data across nodes/servers/disks?
  P2P-based protocol
  Distributed hash table
Advantages
  Incremental scalability: build a small cluster and expand in the future
  Self-organizing
  Redundancy
Issues
  Data migration upon data center expansion and failures
  Handling heterogeneous servers
Decentralized Storage: Consistent Hashing
[Figure: consistent hashing ring. Keys and servers are mapped onto the same ring with the SHA-1 function, and each server holds the keys between its predecessor and itself. Initially server 1 holds D1, server 2 holds D2, and server 3 holds D3; after server 4 joins, server 4 holds D1, servers 2 and 3 keep D2 and D3, and server 1 holds nothing.]
Properties of Consistent Hashing
Balance: each server owns an equal portion of the keys
Smoothness: when the k-th server is added, only the 1/k fraction of keys located between it and its predecessor server needs to be migrated
Fault tolerance: multiple copies of each key; if one server goes down, the next successor takes over with only a small change to the cluster view, and balance still holds (see the sketch below)
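To make these properties concrete, here is a minimal consistent-hashing ring in Python, with virtual nodes for balance and successor walking for replication. It is an illustrative sketch only; the class and parameter names (HashRing, vnodes, replicas) are ours and are not taken from Sheepdog or Unistore.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Consistent-hashing ring with virtual nodes and successor replication."""

    def __init__(self, servers, vnodes=64, replicas=3):
        self.vnodes = vnodes      # virtual nodes per server smooth out imbalance (balance)
        self.replicas = replicas  # each key is stored on this many distinct servers
        self.ring = []            # sorted list of (position, server)
        for s in servers:
            self.add_server(s)

    def _hash(self, name):
        # SHA-1 maps both keys and server names onto the same circular space
        return int(hashlib.sha1(name.encode()).hexdigest(), 16)

    def add_server(self, server):
        # Only the keys between the new server's positions and their predecessors
        # move: roughly a 1/k fraction for the k-th server (smoothness).
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def locate(self, key):
        """Return the distinct servers that hold the replicas of `key`."""
        distinct = len({s for _, s in self.ring})
        idx = bisect_right(self.ring, (self._hash(key), ""))  # first position clockwise of the key
        owners = []
        while len(owners) < min(self.replicas, distinct):
            server = self.ring[idx % len(self.ring)][1]
            if server not in owners:   # walk successors, skipping duplicates (fault tolerance)
                owners.append(server)
            idx += 1
        return owners

ring = HashRing(["server-1", "server-2", "server-3"])
print(ring.locate("object-42"))   # e.g. ['server-2', 'server-3', 'server-1']
```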
Unistore Overview
Goal: build a unified storage architecture (Unistore) for cloud storage systems with the co-existence and efficient integration of heterogeneous HDDs and SCM (Storage Class Memory) devices
Based on a decentralized, consistent-hashing-based storage system: Sheepdog
[Architecture diagram: a characterization component covering workloads (access patterns) and devices (bandwidth, throughput, block erasure, concurrency, wear-leveling); an I/O pattern component (random/sequential, read/write, hot/cold); I/O functions (Write_to_SSD, Read_from_SSD, Write_to_HDD); and a data placement component whose placement algorithm is a modified consistent hash]
Background: Heterogeneous Storage
Heterogeneous storage environment with distinct throughput
  NVMe SSD: 2000 MB/s or more
  SATA SSD: ~500 MB/s
  Enterprise HDD: ~150 MB/s
Large SSDs are becoming available, but they are still expensive
  1.2 TB NVMe Intel 750 costs $1000
  1 TB SATA Samsung 640 EVO costs $…; still considerably more costly than HDDs
SSDs still co-exist with HDDs as accelerators instead of replacing them
Background: How to Use SSDs in Cloud-scale Storage
Traditional way of using SCMs (i.e., SSDs) in cloud-scale distributed storage: as a cache layer
  Caching/buffering generates extensive writes to the SSD, which wears out the device
  Needs a fine-tuned caching/buffering scheme
  Does not fully utilize the capacity of SSDs
  The capacity of SSDs is growing fast
Tiered storage
  Data placed on SSD or HDD servers according to requirements: throughput, latency, access frequency
  Data transferred between tiers when the requirements change
Tiered-CRUSH
CRUSH ensures data is placed across multiple independent locations to improve data availability
Tiered-CRUSH integrates storage tiering into the CRUSH data placement
Tiered-CRUSH
Virtualized volumes have different access patterns
The access frequency of objects is recorded per volume; hotter data is more likely to be placed on faster tiers (see the sketch below)
Fair storage utilization is maintained
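The sketch below illustrates the general idea of tier-aware placement: pick a tier from the recorded volume heat, then place replicas deterministically inside that tier. It is a simplified, hypothetical sketch (rendezvous hashing inside the tier, made-up thresholds and server names), not the actual Tiered-CRUSH algorithm, which operates on CRUSH hierarchies and weights.

```python
import hashlib

# Tier map and heat thresholds are illustrative assumptions, not values from Tiered-CRUSH.
TIERS = {
    "nvme_ssd": ["nvme-1", "nvme-2"],
    "sata_ssd": ["ssd-1", "ssd-2", "ssd-3"],
    "hdd":      ["hdd-1", "hdd-2", "hdd-3", "hdd-4"],
}

def choose_tier(volume_heat):
    """Hotter volumes are steered toward faster tiers (made-up thresholds)."""
    if volume_heat > 0.8:
        return "nvme_ssd"
    if volume_heat > 0.4:
        return "sata_ssd"
    return "hdd"

def place(object_id, volume_heat, replicas=2):
    """Deterministically pick `replicas` distinct servers inside the chosen tier.

    Servers are ranked by a per-object hash (rendezvous hashing), which keeps
    placement roughly uniform within the tier and stable across lookups."""
    tier = choose_tier(volume_heat)
    ranked = sorted(TIERS[tier],
                    key=lambda s: hashlib.sha1(f"{object_id}:{s}".encode()).hexdigest(),
                    reverse=True)
    return tier, ranked[:replicas]

# A hot volume lands on the NVMe tier, a cold one on the HDD tier.
print(place("vol3/obj17", volume_heat=0.9))
print(place("vol8/obj02", volume_heat=0.1))
```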
Tiered-CRUSH: Evaluation
Implemented in a benchmark tool compiled with the CRUSH library functions
Simulation showed that data distribution uniformity can be maintained
Simulation shows a 1.5 to 2x improvement in overall bandwidth in our experimental settings
[Table: device name, number, capacity (GB), and read bandwidth (MB/s) of the Samsung NVMe SSD, Samsung SATA SSD, and Seagate HDD used in the experiments; numeric values not shown]
Pattern-directed Replication
Trace object I/O requests when applications are executed for the first time
Trace analysis, correlation finding, and object grouping (see the sketch below)
Reorganize objects for replication in the background
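As a rough illustration of the trace-analysis step, the sketch below counts how often pairs of objects are accessed close together in a recorded trace; pairs above a threshold become candidates for replica reorganization. The window size, threshold, and function name are assumptions for the example, not the actual Unistore analysis.

```python
from collections import Counter
from itertools import combinations

def correlated_pairs(trace, window=8, min_count=3):
    """Find pairs of objects that appear close together in an I/O trace.

    `trace` is the ordered list of object IDs recorded during the first run of
    an application; `window` and `min_count` are illustrative tuning knobs."""
    pair_counts = Counter()
    for i in range(len(trace)):
        seen = sorted(set(trace[i:i + window]))   # objects within one sliding window
        for a, b in combinations(seen, 2):
            pair_counts[(a, b)] += 1              # crude count; overlapping windows recount pairs
    return [pair for pair, c in pair_counts.items() if c >= min_count]

trace = ["o1", "o2", "o9", "o1", "o2", "o5", "o1", "o2", "o9", "o1", "o2"]
print(correlated_pairs(trace))   # e.g. [('o1', 'o2'), ('o1', 'o9'), ...]
```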
Version Consistent Hashing Scheme
Build versions into the consistent hashing
Avoid data migration when nodes are added or a node fails
Maintain efficient data lookup (see the sketch below)
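One way to read this idea, sketched below: keep every version of the ring, stamp each object with the ring version that was current when it was written, and resolve lookups against that version, so membership changes do not force migration. This is our hedged interpretation, not the Unistore implementation; all names here (VersionedRing, build_ring) are illustrative.

```python
import hashlib
from bisect import bisect_right

def build_ring(servers, vnodes=32):
    """One immutable ring version built from the current server membership."""
    return sorted((int(hashlib.sha1(f"{s}#{i}".encode()).hexdigest(), 16), s)
                  for s in servers for i in range(vnodes))

class VersionedRing:
    def __init__(self, servers):
        self.servers = list(servers)
        self.versions = [build_ring(self.servers)]   # version 0

    def add_server(self, server):
        # A membership change creates a new ring version; existing data stays put.
        self.servers.append(server)
        self.versions.append(build_ring(self.servers))

    def current_version(self):
        return len(self.versions) - 1

    def owner(self, key, version):
        """Server owning `key` under the ring version recorded at write time."""
        ring = self.versions[version]
        pos = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return ring[bisect_right(ring, (pos, "")) % len(ring)][1]

vr = VersionedRing(["s1", "s2", "s3"])
v_old = vr.current_version()          # objects written now remember version 0
vr.add_server("s4")                   # no migration: old objects still resolve via version 0
print(vr.owner("obj-7", v_old), vr.owner("obj-7", vr.current_version()))
```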
Conclusions
Decentralized storage is becoming the standard in cloud storage
The Tiered-CRUSH algorithm achieves better I/O performance and higher data availability at the same time for heterogeneous storage systems
The version consistent hashing scheme improves the manageability of the data center
PRS (the pattern-directed replication scheme) achieves high-performance data replication by reorganizing the placement of data replicas
Thank you! Questions?
Visit discl.cs.ttu.edu for more details
Texas Tech University 2016 Symposium on Big Data