HBase MTTR, Stripe Compaction and Hoya


HBase MTTR, Stripe Compaction and Hoya. Ted Yu (tyu@hortonworks.com)

About myself: I have been working on HBase for 3 years; became a committer & PMC member in June 2011.

Outline: overview of HBase recovery; HDFS issues; stripe compaction; HBase on YARN; Q & A.

We're in a distributed system. It is hard to distinguish a slow server from a dead one, so everything, or nearly everything, is based on timeouts. Smaller timeouts mean more false positives. HBase copes well with false positives, but they always have a cost: the lower the timeouts, the better.

HBase components for recovery

Recovery in action

Recovery process (diagram: ZooKeeper heartbeats the region servers/DataNodes; master, region servers and ZK handle region assignment; region servers and DataNodes perform data recovery; the client reconnects):
- Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply.
- Region assignment: the master reallocates the regions to the other servers.
- Failure recovery: read the WAL and rewrite the data.
- The client drops the connection to the dead server and goes to the new one.

Failure detection (0.96):
- Set the ZooKeeper session timeout to 30s instead of the old 180s default. Beware of GC pauses, but lower values are possible. ZooKeeper detects the errors sooner than the configured timeout.
- The 0.96 HBase scripts delete the server's ZK node when the server is kill -9ed, so detection time becomes ~0. The same mechanism can be used by any monitoring tool.
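
For illustration, a minimal sketch (Java, via the client Configuration API) of lowering the session timeout; the property name is the standard HBase one, but tune the value against your GC behavior:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class ZkTimeoutSketch {
        static Configuration withFastDetection() {
            Configuration conf = HBaseConfiguration.create();
            // 30s instead of the old 180s default: faster failure detection,
            // but a long GC pause could now expire a healthy server's session.
            conf.setInt("zookeeper.session.timeout", 30000);
            return conf;
        }
    }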

With faster region assignment:
- Detection: from 180s to 30s
- Data recovery: around 10s
- Reassignment: from tens of seconds to seconds

DataNode crash is expensive!
- One replica of the WAL edits is on the crashed DataNode, so about 33% of the reads during the region server recovery will go to it. Many writes will go to it as well (the smaller the cluster, the higher the probability).
- The NameNode re-replicates the data (maybe terabytes) that was on this node to restore the replica count, and it starts this work only after a long timeout (10 minutes by default).
- HDFS writes locally first, so when you lose a region server you have just lost one of the three replicas of the WAL, and new writes may still select the dead DataNode for their pipeline.
- Recovery means reading the WAL and writing new data. HDFS marks a server as dead only after 10 minutes; don't change that (ever heard of a replication storm?). HBase recovery is slowed down by trying to read from and write to dead DataNodes, so it can take more than 10 minutes.

HDFS – Stale mode. A DataNode can be in one of three states:
- Live: as today, used for reads & writes, using locality.
- Stale: no heartbeat for 30 seconds (can be less). Not live, not dead: used only if necessary, last priority for reads, excluded from writes.
- Dead: no heartbeat for 10 minutes (don't change this). As today: not used.
Stale mode is available in all HDFS branches (1, 2 & 3). The node must be marked stale before the HBase recovery happens to save the extra seconds. And actually, it's better to do the HBase recovery before HDFS re-replicates the terabytes of data that were on this node.
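
A minimal sketch of the NameNode-side settings behind stale mode (property names from the HDFS-3703 line of work; verify them for your HDFS branch):

    import org.apache.hadoop.conf.Configuration;

    class StaleModeSketch {
        static Configuration staleMode() {
            Configuration conf = new Configuration();
            // Mark a DataNode stale after 30s without a heartbeat.
            conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
            // Stale nodes become the last resort for reads...
            conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
            // ...and are excluded from new write pipelines.
            conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
            return conf;
        }
    }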

Results: we do more reads and writes to HDFS during the recovery. Multiple failures are still possible, but stale mode still plays its role. Also set the DFS timeout to 30s: this limits the effect of two failures in a row; the cost of the second failure is 30s if you were unlucky.
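
The slide's "dfs.timeout" presumably refers to the DFS socket timeouts; a hedged sketch follows (these property names vary across Hadoop versions, so treat them as assumptions to verify):

    import org.apache.hadoop.conf.Configuration;

    class DfsTimeoutSketch {
        static Configuration shortTimeouts() {
            Configuration conf = new Configuration();
            // Give up on reads from an unresponsive DataNode after 30s.
            conf.setInt("dfs.socket.timeout", 30000);
            // Give up on writes to an unresponsive pipeline after 30s as well.
            conf.setInt("dfs.datanode.socket.write.timeout", 30000);
            return conf;
        }
    }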

Here is the client. The client can still be connected to the dead region server: thanks to TCP, it needs a timeout of its own. The default is one minute, and it is often increased (scanners, coprocessors: many reasons). So you can end up in the awkward situation where everything is recovered on the cluster, but the client is still waiting for an answer from the dead server.

The client: you want the client to be patient, because retrying while the system is already loaded is not good. You also want the client to learn about region servers dying and to be able to react immediately. And you want the solution to be scalable.
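
As a sketch of "patient but responsive" client settings (standard HBase client properties; the exact values are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class PatientClientSketch {
        static Configuration patientClient() {
            Configuration conf = HBaseConfiguration.create();
            // A generous RPC timeout is fine once the client learns about
            // dead servers out of band (see the next slide).
            conf.setInt("hbase.rpc.timeout", 120000);
            // Back off between retries so a loaded cluster is not hammered.
            conf.setInt("hbase.client.pause", 1000);
            conf.setInt("hbase.client.retries.number", 10);
            return conf;
        }
    }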

Scalable solution: the master notifies the client with a cheap multicast message carrying the "dead servers" list, sent 5 times for safety (off by default). On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout.
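
A sketch of enabling the multicast status publisher; the hbase.status.* names below match my reading of the 0.96-era defaults, so verify them for your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class StatusPublishSketch {
        static Configuration multicastDeadServers() {
            Configuration conf = HBaseConfiguration.create();
            // Master multicasts the dead-servers list; off by default.
            conf.setBoolean("hbase.status.published", true);
            // Multicast group and port (assumed defaults).
            conf.set("hbase.status.multicast.address.ip", "226.1.1.3");
            conf.setInt("hbase.status.multicast.address.port", 16100);
            return conf;
        }
    }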

Faster recovery (HBASE-7006).
Previous algorithm: read the WAL files, write new HFiles, then tell the region server it got new HFiles. This puts pressure on the NameNode (remember: avoid putting pressure on the NameNode).
New algorithm: read the WAL and write directly to the region server; we're done. We have seen great improvements in our tests.
TBD: assign the WAL to a region server local to a replica.

Distributed log splitting (diagram): WAL files in HDFS interleave edits from many regions (e.g. <region2:edit1><region1:edit2> ... <region3:edit1> ...). Several region servers read the WAL files in parallel and write one split-log file per region (Splitlog-file-for-region1/2/3) back to HDFS, each containing only that region's edits for the newly assigned region to replay.

Distributed log replay (diagram): as before, several region servers read the WAL files, but instead of first writing intermediate per-region files (Recovered-file-for-region1/2/3) to HDFS, they replay the edits directly to the region servers now hosting the recovered regions.

Write during recovery: concurrent writes are allowed during the WAL replay, since the same memstore serves both the replayed edits and the new writes. Your effective recovery time becomes the failure detection time: at most 30s, likely less! Caveat: HBASE-8701, WAL edits need to be applied in receiving order.
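
A minimal sketch of turning distributed log replay on (the property came with the HBASE-7006-era work; verify the name for your release):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class LogReplaySketch {
        static Configuration distributedLogReplay() {
            Configuration conf = HBaseConfiguration.create();
            // Replay WAL edits straight into the recovering regions' memstores
            // instead of writing intermediate split files.
            conf.setBoolean("hbase.master.distributed.log.replay", true);
            return conf;
        }
    }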

MemStore flush. Real life: some tables are updated for a while and then left alone, leaving a non-empty memstore and therefore more data to recover. It is now possible to guarantee that we don't keep a MemStore full of old data, which improves real-life MTTR and helps online snapshots.
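
A sketch of the periodic-flush knob that keeps idle memstores from holding old data (standard property; the one-hour value is the usual default):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class PeriodicFlushSketch {
        static Configuration periodicFlush() {
            Configuration conf = HBaseConfiguration.create();
            // Flush a memstore whose oldest edit is over an hour old, even if
            // it is far from full, so idle tables leave little to recover.
            conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600000L);
            return conf;
        }
    }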

.META. There is no -ROOT- table in 0.95/0.96, but .META. failures are critical. A lot of small improvements: for example, the server now tells the client when a region has moved, so the client can avoid going to .META. And a big one: the .META. WAL is managed separately to allow an immediate recovery of .META.; together with the new MemStore flush, this ensures a quick recovery.

Data locality post recovery. HBase performance depends on data locality, and after a recovery you've lost it, which is bad for performance. Here come region groups: assign 3 favored region servers to every region, and on failure assign the region to one of the secondaries. The data-locality issue is minimized on failures.
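
A hedged sketch of wiring in favored-node assignment; the balancer class comes from the HBASE-7932 line of work, so treat the class name as an assumption to verify:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class FavoredNodesSketch {
        static Configuration favoredNodes() {
            Configuration conf = HBaseConfiguration.create();
            // Balancer that assigns each region three favored region servers;
            // on failure the region moves to a secondary with local replicas.
            conf.set("hbase.master.loadbalancer.class",
                "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
            return conf;
        }
    }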

Discoveries from cluster testing:
- HDFS-5016: the heartbeating thread blocks under some failure conditions, leading to loss of DataNodes. To summarize: the responder thread is stuck in a flush call, the writer thread is stuck calling join() on the responder thread, and FSDataset.recoverRbw holds the FSDataset lock while waiting on that join(). Since the FSDataset lock, which is crucial for the DataNode, is held, the heartbeat thread and the data transceiver threads are blocked waiting on it.
- HBASE-9039: parallel assignment and distributed log replay during recovery; region splitting during distributed log replay may hinder recovery.

Compactions example: the MemStore fills up and files are flushed; when enough files accumulate, they are compacted. (Diagram: writes go into the MemStore, which flushes HFiles to HDFS; the accumulated HFiles are then compacted.)

But compactions cause slowdowns: they look like lots of I/O for no apparent benefit. (Chart: example effect on reads during a compaction; note the better average afterwards.)

Key ways to improve compactions:
- Read from fewer files: separate files by row key, version, time, etc., which allows a large number of files to be present uncompacted.
- Don't compact data you don't need to compact, for example old data in OpenTSDB-like systems; this obviously results in less I/O.
- Make compactions smaller, without too much I/O amplification or too many files; this results in fewer compaction-related outages.
HBase works better with a few large regions; however, large compactions cause unavailability.

Stripe compactions (HBASE-7667): somewhat like LevelDB, partition the keys inside each region/store, but with only one level (plus an optional L0). Compared to regions, this partitioning is more flexible; the default is a number of roughly equal-sized stripes. To read, just read the relevant stripes, plus L0 if present. (Diagram: HFiles laid out along the row-key axis of a region, with stripe boundaries ccc, eee, ggg, iii between the region start and end keys and L0 files on top; a get touches only the matching stripe plus L0.)
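
A sketch of enabling stripe compaction; hbase.hstore.engine.class is the documented switch, while the stripe-count property name is my assumption to verify:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class StripeCompactionSketch {
        static Configuration stripes() {
            Configuration conf = HBaseConfiguration.create();
            // Swap the default store engine for the striped one (HBASE-7667).
            conf.set("hbase.hstore.engine.class",
                "org.apache.hadoop.hbase.regionserver.StripeStoreEngine");
            // Ask for a fixed number of ~equal-sized stripes per region.
            conf.setInt("hbase.store.stripe.fixed.count", 4);
            return conf;
        }
    }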

Stripe compactions – writes: data is flushed from the MemStore into several files, and each stripe compacts separately most of the time. (Diagram: the MemStore flushing into per-stripe HFiles in HDFS.)

Stripe compactions – other:
- Why L0? Bulk-loaded files go to L0, and flushes can also go into single L0 files (to avoid tiny files); several L0 files are then compacted into striped files.
- Deletes can be dropped when compacting one entire stripe plus L0, so there is no need for major compactions, ever.
- Two stripes can be compacted together to rebalance if unbalanced; this is very rare, however, and unbalanced stripes are not a huge deal.
- Stripe boundaries could be used to improve region splits in the future.

Stripe compactions - performance: on EC2 (c1.xlarge), preload, then measure random-read performance; LoadTestTool with deletes and overwrites, measuring random reads. (Charts elided.)

HBase on YARN. Hoya is a YARN application, and all its components are YARN services. The input is a cluster specification, persisted as a JSON document on HDFS. HDFS and ZooKeeper are shared by multiple cluster instances, and a cluster can be stopped and later resumed.
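
For flavor, a purely hypothetical cluster-specification document; the field names below are invented for illustration and are not Hoya's actual schema:

    {
      "name": "test-cluster",
      "roles": {
        "master": { "instances": 1, "yarn.memory": "1024" },
        "worker": { "instances": 4, "yarn.memory": "2048" }
      },
      "zookeeper.quorum": "zk1:2181,zk2:2181,zk3:2181",
      "hdfs.cluster.dir": "hdfs://nn:8020/user/hoya/cluster/test-cluster"
    }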

Hoya architecture:
- Hoya client: parses the command line, executes local operations, talks to the HoyaMasterService.
- HoyaMasterService: the AM service; deploys the HBase master locally.
- HoyaRegionService: installs and executes a region server.

HBase master service deployment: the HoyaMasterService is requested to create a cluster; a local HBase directory is chosen for the expanded image; a user-supplied config dir overwrites the files in the conf directory; the HBase conf is patched with the hostname of the master; the HoyaMasterService then monitors reporting from the ResourceManager.

Failure handling: RegionService failures trigger new region server instances; MasterService failures do not trigger a restart. The RegionService monitors the master's ZK node, and the MasterService monitors the state of the HBase master.

Runtime classpath dependencies

Q & A Thanks!