1
DuraStore – Achieving Highly Durable Data Centers
Lead faculty: Yong Chen
Students: Wei Xie, Jiang Zhou (postdoctoral researcher)
Cloud and Autonomic Computing Center (CAC) & Data-Intensive Scalable Computing Laboratory
Presenter: Wei Xie
2
Data-Intensive Scalable Computing Laboratory
Faculty/staff members: Dr. Yong Chen, Dr. Yu Zhuang, Dr. Dong Dai, Dr. Jiang Zhou
Ph.D. student members: Mr. Ghazanfar Ali, Ms. Elham Hojati, Mr. John Leidel, Ms. Neda Tavakoli, Mr. Xi Wang, Mr. Wei Xie, Mr. Wei Zhang
Masters/UG student members: Ms. Priyanka Kumari, Mr. Ali Nosrati, Mr. Yuan Cui, Mr. Frank Conlon, Ms. Yan Mu, Mr. Zachary Hansen, Mr. Yong Wu, Mr. Alex Weaver
Website:
Mission statement: broad research interests in parallel and distributed computing, high-performance computing, and cloud computing, with a focus on building scalable computing systems for data-intensive applications in high-performance scientific computing and high-end enterprise computing
Projects (over $2M in external grants in the past 5 years, from NSF/DOE/DOD):
- OpenSoC HPC: Open Source, Extensible High Performance Computing Platform (DOD/LBNL)
- Compute on Data Path: Combating Data Movement in HPC (NSF)
- Unistore: A Unified Storage Architecture (NSF-IUCRC/Nimboxx)
- Development of a Data-Intensive Scalable Computing Instrument (NSF)
- Decoupled Execution Paradigm for Data-Intensive High-End Computing (NSF)
- Active Object Storage for Big Data Applications (DOE/ANL)
3
Outline
- Project goals, motivations, challenges
- Project overview
- Project team members
- Background and related research
- Overview of project tasks
- Activities and outcomes
- Deliverables and benefits
- LIFE form input
4
Project Goals, Motivations, Challenges
Goal
- Design a data center storage system that is significantly more durable under correlated failure events
Motivations
- Correlated failure events (e.g., a site-wide power outage) take down many nodes at once
- The random replication used in mainstream storage systems has a high probability of data loss under such events
- Recovering data lost to correlated failures is costly
- Both industry and research groups have reported this problem
Challenges
- Recently emerged durability-aware data replication schemes improve durability but sacrifice load balance, scalability, and many other desirable features of storage systems
5
Project Overview
- Core component: durability-aware data replication/erasure coding
- Uses copysets and combinatorial theory to minimize the probability of data loss under correlated failures
- Compatible with existing data replication/erasure coding schemes as a plug-in replacement (a sketch of the idea follows below)
- The data store software and storage infrastructure must be modified for compatibility
- Architecture layers: durable distributed/parallel data store (HDFS, GFS, RAMCloud, Ceph) → durability-aware data replication/erasure coding → durable storage cluster
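As one way to picture the plug-in idea, the sketch below (Python, with hypothetical class and method names) puts the baseline random scheme and a copyset-limiting scheme behind a single placement interface; the copyset pool is carved out of a few node permutations, loosely following the Copysets approach of Cidon et al. (USENIX ATC'13). It illustrates the concept only, not the project's implementation.

```python
import random
from abc import ABC, abstractmethod

class PlacementPolicy(ABC):
    """Common interface, so replication schemes are interchangeable."""

    @abstractmethod
    def place(self, chunk_id: int) -> list:
        """Return the node IDs holding the replicas of a chunk."""

class RandomPlacement(PlacementPolicy):
    """Baseline: every chunk draws its replica set independently, so
    almost every chunk creates a new unique copyset."""

    def __init__(self, num_nodes, r=3):
        self.num_nodes, self.r = num_nodes, r

    def place(self, chunk_id):
        rng = random.Random(chunk_id)  # deterministic per chunk
        return rng.sample(range(self.num_nodes), self.r)

class CopysetPlacement(PlacementPolicy):
    """Durability-aware: chunks are confined to a small pre-built pool
    of copysets carved out of a few random node permutations."""

    def __init__(self, num_nodes, r=3, permutations=2, seed=0):
        rng = random.Random(seed)
        self.copysets = []
        for _ in range(permutations):
            order = list(range(num_nodes))
            rng.shuffle(order)
            # cut each permutation into consecutive groups of r nodes
            self.copysets += [order[i:i + r]
                              for i in range(0, num_nodes - r + 1, r)]

    def place(self, chunk_id):
        return self.copysets[chunk_id % len(self.copysets)]
```

In the original Copysets scheme the primary replica is still placed by the normal placement function and only the remaining replicas are constrained to a copyset containing it; the sketch simplifies that detail away.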
6
Project Team Members
Faculty
- Dr. Yong Chen, Assistant Professor of CS, Texas Tech University
- Dr. Jiang Zhou, Postdoctoral Researcher of CS, Texas Tech University
Students
- Wei Xie, CS Ph.D. Candidate, Texas Tech University
Expertise
- Storage systems
- Parallel/distributed file systems
- Cloud software stack
- Data models and data management
7
Background and Related Research
Correlated failures in data centers
- When a cluster-wide failure event such as a power outage occurs, about 0.5-1% of the nodes fail to reboot (per reports from Yahoo! and LinkedIn)
- Recovering from the resulting data loss takes a long time (locating the lost data is itself slow)
Copyset data replication scheme
- A copyset is the set of servers that a data chunk is replicated to
- The total number of unique copysets in the system determines the probability of data loss under correlated failures
- Minimizing the total number of unique copysets can significantly reduce the data loss probability (e.g., from nearly 100% to about 0.1%); the toy experiment below illustrates why
Geo-replication
- Replication to a remote site to achieve better durability
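To make the copyset argument concrete, here is a toy Monte Carlo experiment (a sketch with made-up parameters, not the project's model): data is lost whenever every node of some copyset is among the nodes that fail to reboot. Random replication accumulates a unique copyset for almost every chunk, so a handful of failed nodes almost always covers one; a minimal disjoint partition of the nodes keeps the pool, and hence the loss probability, small.

```python
import random
from itertools import combinations

R = 3  # replication factor

def random_copysets(num_chunks, num_nodes, seed=1):
    """Random replication: each chunk samples its own replica set."""
    rng = random.Random(seed)
    return {tuple(sorted(rng.sample(range(num_nodes), R)))
            for _ in range(num_chunks)}

def minimal_copysets(num_nodes):
    """A single disjoint partition of the nodes: about N/R copysets,
    the minimum that can still hold R replicas of every chunk."""
    return {tuple(range(i, i + R))
            for i in range(0, num_nodes - R + 1, R)}

def loss_probability(copysets, num_nodes, node_failure_rate,
                     trials=5000, seed=2):
    """Estimate P(some copyset fails entirely) when each node fails
    independently during a correlated event such as a power outage."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        failed = sorted(n for n in range(num_nodes)
                        if rng.random() < node_failure_rate)
        if any(c in copysets for c in combinations(failed, R)):
            losses += 1
    return losses / trials

# Toy cluster: 201 nodes, 5% fail to reboot (rate exaggerated to keep
# the example small; reports cite 0.5-1% on much larger clusters).
N = 201
print(loss_probability(random_copysets(100_000, N), N, 0.05))  # loss in most trials
print(loss_probability(minimal_copysets(N), N, 0.05))          # loss in ~1% of trials
```

A known trade-off from the copyset literature: fewer copysets make loss events rarer, but each event loses more data at once.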
8
Overview of Project Tasks
Task 1: Analyze and model mainstream data stores in terms of data durability
Task 2: Design a new data replication or erasure coding scheme
Task 3: Apply the proposed data replication or erasure coding scheme to the targeted systems
Task 4: Test the durability and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme
9
Task 1: Analysis and Modeling
Task 1: Analysis and modeling of mainstream data stores in terms of data durability
- Data redundancy scheme (replication or erasure coding)
- Correlated-failure durability model (a starting-point sketch follows below)
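One plausible starting point for such a model (an assumption-laden sketch, not the project's actual formulation): condition on F simultaneous failures among N nodes, observe that a specific copyset of R nodes is wholly failed with probability C(F,R)/C(N,R), and approximate the failures of distinct copysets as independent.

```python
from math import comb

def p_one_copyset_fails(n, f, r):
    """P(a specific copyset of r nodes lies inside f uniformly random
    failed nodes out of n) = C(f, r) / C(n, r)."""
    return comb(f, r) / comb(n, r)

def p_data_loss(num_copysets, n, f, r=3):
    """Approximate P(at least one of `num_copysets` copysets fails),
    treating copysets as independent; an exact analysis would need
    inclusion-exclusion over overlapping copysets."""
    return 1 - (1 - p_one_copyset_fails(n, f, r)) ** num_copysets

# 5000-node cluster, 1% (50 nodes) fail simultaneously, 3 replicas.
# Random replication with millions of chunks approaches one copyset
# per chunk; a copyset scheme keeps the pool near N in size.
print(p_data_loss(10_000_000, 5000, 50))  # random replication: ~1.0
print(p_data_loss(5000, 5000, 50))        # small copyset pool: ~0.5%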
10
Task 2: Design
Task 2: Design a new data replication or erasure coding scheme
- Enhance data durability without sacrificing load balance or scalability
- For load balance and scalability, consistent-hashing-based replication is a good candidate (a bare-bones version is sketched below)
- It is challenging to improve the durability of consistent-hashing-based replication
- Solutions will be developed based upon extensive prior R&D
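For reference, a bare-bones consistent-hashing replica chooser (hypothetical names, a sketch rather than any particular system's code): chunks hash onto a ring and replicas go to the next r distinct servers clockwise. It also hints at why durability is hard here: every ring position induces a different successor set, so unique copysets proliferate instead of staying in a small controlled pool.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Bare-bones consistent hashing with virtual nodes: replicas of a
    chunk are the first r distinct servers clockwise from its hash, so
    adding or removing a server only remaps nearby chunks."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((self._hash(f"{node}#{v}"), node)
                           for node in nodes for v in range(vnodes))
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def replicas(self, chunk_key, r=3):
        i = bisect_right(self.hashes, self._hash(chunk_key))
        picked = []
        while len(picked) < r:  # walk clockwise, skipping duplicate servers
            node = self.ring[i % len(self.ring)][1]
            if node not in picked:
                picked.append(node)
            i += 1
        return picked

ring = HashRing([f"server{i}" for i in range(10)])
print(ring.replicas("chunk-42"))  # e.g. ['server3', 'server7', 'server0']
```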
11
Task 3: Implementation
Task 3: Apply the proposed data replication or erasure coding scheme to the targeted systems
- Target systems: Sheepdog, Ceph, and HDFS
- A prototype system with the newly proposed durable data replication or erasure coding scheme is planned
- The code developed will be free to use
12
Task 4: Evaluation
Task 4: Test the durability and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme
- Small-scale tests will be conducted
- Large-scale simulations will complement the tests (one slice of such a harness is sketched below)
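One slice of the load-balance evaluation could be scripted like this (a hypothetical harness, not the project's test plan): any placement function is fed to a common driver that reports the coefficient of variation of per-node load, where 0 means perfectly balanced.

```python
import random
import statistics
from collections import Counter

def load_imbalance(place, num_chunks, num_nodes):
    """Coefficient of variation of chunks stored per node (0 = perfectly
    balanced) for an arbitrary placement function chunk_id -> [nodes]."""
    load = Counter()
    for chunk in range(num_chunks):
        for node in place(chunk):
            load[node] += 1
    counts = [load.get(n, 0) for n in range(num_nodes)]
    return statistics.pstdev(counts) / statistics.mean(counts)

# Example: random 3-way replication over 100 nodes.
N = 100
def random_place(chunk_id, r=3):
    return random.Random(chunk_id).sample(range(N), r)

print(f"CV of per-node load: {load_imbalance(random_place, 50_000, N):.3f}")
```

The same driver could compare the proposed durability-aware scheme against the random baseline, flagging any balance lost in exchange for fewer copysets.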
13
Activities and Outcomes
Activities
- Preliminary study and prior R&D in this space
- Modeling of mainstream data store systems
Outcomes
- Knowledge and findings from a systematic study of correlated-failure durability
- A mathematical model and a simulator for simulating data loss in data centers under correlated failures
- A new data replication algorithm or erasure coding scheme for DuraStore, with prototype implementation source code
14
Deliverables and Benefits
Deliverables
- One or more correlated-failure-aware data replication or erasure coding schemes for data center storage
- Evaluation results from the simulations and experiments with the implementation of the proposed schemes
- Reports/papers presenting the design and evaluation results
Benefits
- Access to the DuraStore design and prototype, and free use of the IP generated
- Enhanced productivity and utilization of cloud storage systems
- Lower operational costs for data centers
- Less maintenance and troubleshooting manpower and fewer resources needed by data centers
- Simpler fault-tolerance design for data centers
- Collaboration with faculty/staff and graduate students
- Access to reports and papers
- Recruitment and spin-off opportunities
15
Preliminary Results and Publications
- W. Xie and Y. Chen. Elastic Consistent Hashing for Distributed Storage Systems. IPDPS'17.
- J. Zhou, W. Xie, D. Dai, and Y. Chen. Pattern-Directed Replication Scheme for Heterogeneous Object-based Storage. CCGrid'17.
- J. Zhou, W. Xie, J. Noble, K. Echo, and Y. Chen. SUORA: A Scalable and Uniform Data Distribution Algorithm for Heterogeneous Storage Systems. NAS'16.
- W. Xie, J. Zhou, M. Reyes, J. Noble, and Y. Chen. Two-Mode Data Distribution Scheme for Heterogeneous Storage in Data Centers. BigData'15.
Open-source code:
- Sheepdog with Elastic Consistent Hashing (IPDPS'17 paper):
- Copyset consistent hashing on lib-ch-placement:
16
LIFE Form Input
Please take a moment to fill out your L.I.F.E. forms.
Select “Cloud and Autonomic Computing Center”, then select the “IAB” role.
- What do you like about this project?
- What would you change? (Please include all relevant feedback.)