DuraStore – Achieving Highly Durable Data Centers

DuraStore – Achieving Highly Durable Data Centers
Lead faculty: Yong Chen
Students: Wei Xie; Jiang Zhou (postdoctoral researcher)
Affiliation: Cloud and Autonomic Computing Center (CAC) & Data-Intensive Scalable Computing Laboratory
Presenter: Wei Xie

Data-Intensive Scalable Computing Laboratory

Faculty/staff members: Dr. Yong Chen, Dr. Yu Zhuang, Dr. Dong Dai, Dr. Jiang Zhou
Ph.D. student members: Mr. Ghazanfar Ali, Ms. Elham Hojati, Mr. John Leidel, Ms. Neda Tavakoli, Mr. Xi Wang, Mr. Wei Xie, Mr. Wei Zhang
Masters/UG student members: Ms. Priyanka Kumari, Mr. Ali Nosrati, Mr. Yuan Cui, Mr. Frank Conlon, Ms. Yan Mu, Mr. Zachary Hansen, Mr. Yong Wu, Mr. Alex Weaver
Website: http://discl.cs.ttu.edu/

Mission statement: Broad research interests in parallel and distributed computing, high-performance computing, and cloud computing, with a focus on building scalable computing systems for data-intensive applications in high-performance scientific computing and high-end enterprise computing.

Projects (over $2M in external grants over the past 5 years, from NSF/DOE/DOD):
- OpenSoC HPC: Open Source, Extensible High Performance Computing Platform (DOD/LBNL)
- Compute on Data Path: Combating Data Movement in HPC (NSF)
- Unistore: A Unified Storage Architecture (NSF-IUCRC/Nimboxx)
- Development of a Data-Intensive Scalable Computing Instrument (NSF)
- Decoupled Execution Paradigm for Data-Intensive High-End Computing (NSF)
- Active Object Storage for Big Data Applications (DOE/ANL)

Outline
- Project goals, motivations, challenges
- Project overview
- Project team members
- Background and related research
- Overview of project tasks
- Activities and outcomes
- Deliverables and benefits
- LIFE form input

Project Goals, Motivations, Challenges

Goal: Design a data center storage system that is significantly more durable under correlated failure events.

Motivations:
- Correlated failure events (e.g., a site-wide power outage) affect many nodes at once
- The random replication used in mainstream storage systems has a high probability of data loss under such events
- Recovering data lost to correlated failures is costly
- Both industry and research groups have reported this problem

Challenges:
- Recently proposed durability-aware data replication schemes improve durability, but sacrifice load balance, scalability, and many other desirable properties of storage systems

Project Overview

Core component: durability-aware data replication/erasure coding
- Uses copysets and combinatorial analysis to minimize the probability of data loss under correlated failures
- Compatible with existing data replication/erasure coding schemes as a plug-in replacement
- Data store software and storage infrastructure must be modified for compatibility

Architecture: a durable storage cluster built from a distributed/parallel data store (HDFS, GFS, RAMCloud, Ceph) combined with durability-aware data replication/erasure coding.

Project Team Members

Faculty:
- Dr. Yong Chen, Assistant Professor of CS, Texas Tech University
- Dr. Jiang Zhou, Postdoctoral Researcher of CS, Texas Tech University

Students:
- Wei Xie, CS Ph.D. Candidate, Texas Tech University

Expertise:
- Storage systems
- Parallel/distributed file systems
- Cloud software stack
- Data models and data management

Background and Related Research

Correlated failures in data centers:
- When a cluster-wide failure event such as a power outage occurs, about 0.5-1% of nodes fail to reboot (as reported by Yahoo! and LinkedIn)
- Recovering the lost data takes a long time, much of it spent locating the lost data

Copyset data replication scheme:
- A copyset is the set of servers to which a data chunk is replicated
- The total number of unique copysets in the system determines the probability of data loss under a correlated failure
- Minimizing the total number of unique copysets can reduce the data-loss probability dramatically (e.g., from nearly 100% to about 0.1%), as the sketch below illustrates

Geo-replication:
- Replication at a remote site to achieve better durability
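To make the copyset idea concrete, here is a minimal Python sketch (illustrative only, not project code; all function names are hypothetical) contrasting how many unique copysets uniform random replication produces versus a simple disjoint-grouping scheme in the spirit of MinCopysets:

```python
import random

def random_replication_copysets(num_nodes, num_chunks, r=3, seed=0):
    """Collect the distinct copysets (sets of r nodes holding all
    replicas of some chunk) produced by uniform random placement."""
    rng = random.Random(seed)
    return {frozenset(rng.sample(range(num_nodes), r))
            for _ in range(num_chunks)}

def grouped_copysets(num_nodes, r=3):
    """Partition nodes into disjoint groups of size r: one simple way
    to minimize the number of unique copysets (MinCopysets-style)."""
    return {frozenset(range(i, i + r))
            for i in range(0, num_nodes - r + 1, r)}

if __name__ == "__main__":
    N, CHUNKS, R = 60, 10_000, 3
    print(len(random_replication_copysets(N, CHUNKS, R)))  # roughly 8,700
    print(len(grouped_copysets(N, R)))                     # exactly N // R = 20
```

With 60 nodes, 10,000 chunks, and 3 replicas, random placement yields thousands of distinct copysets (approaching C(60,3) = 34,220 as the chunk count grows), while the grouped scheme yields only 20, so far fewer simultaneous-failure patterns can destroy every replica of some chunk.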

Overview of Project Tasks
- Task 1: Analyze and model mainstream data stores in terms of data durability
- Task 2: Design a new data replication or erasure coding scheme
- Task 3: Apply the proposed data replication or erasure coding scheme to the targeted systems
- Task 4: Test durability and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme

Task 1: Analysis and Modeling

Analyze and model mainstream data stores in terms of data durability:
- Data redundancy scheme (replication or erasure coding)
- Correlated-failure durability model (one plausible form of such a model is sketched below)
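As a concrete illustration of what such a model can look like (an assumption here, in the spirit of the copyset literature, not necessarily the exact model the project will adopt): suppose a correlated event simultaneously fails F of N nodes, the replication factor is R, and the system uses S distinct copysets. Then the probability that some chunk loses all of its replicas is approximately

$$ P(\text{loss}) \approx 1 - \left(1 - \binom{F}{R} \Big/ \binom{N}{R}\right)^{S} $$

because $\binom{F}{R}/\binom{N}{R}$ is the probability that one particular copyset falls entirely inside the failed set. Random replication pushes S toward min(#chunks, $\binom{N}{R}$), while copyset-minimizing schemes hold S near N/R, which is where the durability gap comes from.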

Task 2: Design

Design a new data replication or erasure coding scheme:
- Enhance data durability without sacrificing load balance or scalability
- For load balance and scalability, consistent-hashing-based replication is a good candidate (see the sketch after this list)
- It is challenging to improve the durability of consistent-hashing-based replication
- Solutions will be developed based upon extensive prior R&D
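For reference, a minimal Python sketch of consistent-hashing-based replication (illustrative only; the class and parameter names are hypothetical, and MD5 is used purely as a convenient uniform hash): each key hashes onto a ring of virtual nodes and is replicated to the first r distinct physical nodes found clockwise.

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    # MD5 used only for illustration; any uniform hash works.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing with successor-list replication."""

    def __init__(self, nodes, vnodes=4, r=3):
        self.r = r
        # Each physical node owns several virtual points on the ring.
        self.ring = sorted((ring_hash(f"{n}#{v}"), n)
                           for n in nodes for v in range(vnodes))

    def replicas(self, key: str):
        """Walk clockwise from the key's ring position and return the
        first r distinct physical nodes encountered."""
        start = bisect_right(self.ring, (ring_hash(key), ""))
        out = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in out:
                out.append(node)
            if len(out) == self.r:
                break
        return out

ring = HashRing([f"node{i}" for i in range(12)])
print(ring.replicas("chunk-42"))  # three distinct physical nodes
```

The durability challenge is visible here: each distinct successor sequence on the ring forms a copyset, so with many virtual nodes the number of unique copysets grows quickly. A durability-aware scheme must rein that in without disturbing the ring's load balance and elasticity.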

Task 3: Implementation

Apply the proposed data replication or erasure coding scheme to the targeted systems:
- Target systems: Sheepdog, Ceph, and HDFS
- A prototype system with the newly proposed durable data replication or erasure coding scheme is planned
- The code developed will be free to use

Task 4: Evaluation

Test durability and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme:
- Small-scale tests will be conducted
- Large-scale simulation will complement the tests (a minimal sketch of such a simulation follows)
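As a sketch of what a correlated-failure simulation might look like (an illustration under assumed parameters, not the project's actual simulator), the function below estimates the data-loss probability for a given set of copysets when a fixed fraction of nodes fails to reboot:

```python
import random

def simulate_loss_probability(num_nodes, copysets, failed_frac=0.01,
                              trials=10_000, seed=1):
    """Monte Carlo estimate of the probability that a correlated
    failure event (a random failed_frac of nodes failing to reboot)
    wipes out every replica of at least one chunk, i.e. that the
    failed set entirely covers some copyset."""
    rng = random.Random(seed)
    num_failed = max(1, round(num_nodes * failed_frac))
    losses = 0
    for _ in range(trials):
        failed = set(rng.sample(range(num_nodes), num_failed))
        if any(cs <= failed for cs in copysets):
            losses += 1
    return losses / trials
```

Feeding in the copysets produced by random replication versus a copyset-minimizing scheme (e.g., the hypothetical generators sketched earlier) reproduces the kind of durability gap reported in the copyset literature.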

Activities and Outcomes

Activities:
- Preliminary study and prior R&D in this space
- Modeling of mainstream data store systems

Outcomes:
- Knowledge and findings from a systematic study of correlated-failure durability
- A mathematical model and a simulator for simulating data loss in data centers under correlated failures
- A new data replication algorithm or erasure coding scheme for DuraStore, along with prototype implementation source code

Deliverables and Benefits

Deliverables:
- One or more correlated-failure-durability-aware data replication or erasure coding schemes for data center storage
- Evaluation results from simulation and from experiments with the implementation of the proposed schemes
- Reports/papers presenting the design and evaluation results

Benefits:
- Access to the DuraStore design and prototype, and free use of the IP generated
- Enhanced productivity and utilization of cloud storage systems
- Lower operational cost for data centers
- Reduced maintenance and troubleshooting manpower and resources for data centers
- Simpler fault-tolerance design for data centers
- Collaboration with faculty/staff and graduate students
- Access to reports and papers
- Recruitment and spin-off opportunities

Preliminary Results and Publications
- W. Xie and Y. Chen. Elastic Consistent Hashing for Distributed Storage Systems. IPDPS'17.
- J. Zhou, W. Xie, D. Dai, and Y. Chen. Pattern-Directed Replication Scheme for Heterogeneous Object-based Storage. CCGrid'17.
- J. Zhou, W. Xie, J. Noble, K. Echo, and Y. Chen. SUORA: A Scalable and Uniform Data Distribution Algorithm for Heterogeneous Storage Systems. NAS'16.
- W. Xie, J. Zhou, M. Reyes, J. Noble, and Y. Chen. Two-Mode Data Distribution Scheme for Heterogeneous Storage in Data Centers. BigData'15.

Code:
- Sheepdog with Elastic Consistent Hashing (IPDPS'17 paper): http://discl.cs.ttu.edu/gitlab/xiewei/ElasticConsistentHashing
- Copyset consistent hashing on lib-ch-placement: http://xiewei@discl.cs.ttu.edu/gitlab/xiewei/ch-placement-copyset.git

LIFE Form Input

Please take a moment to fill out your L.I.F.E. forms:
- Go to http://www.iucrc.com
- Select "Cloud and Autonomic Computing Center", then select the "IAB" role
- What do you like about this project? What would you change? (Please include all relevant feedback.)