Christopher Shain, Software Development Lead, Tresata
Tresata: Hadoop for Financial Services. First completely Hadoop-powered analytics application. Widely recognized as a "Big Data Startup to Watch". Winner of the 2011 NCTA Award for Emerging Tech Company of the Year. Based in Charlotte, NC. We are hiring!
Software Development Lead at Tresata. Background in Financial Services IT: end-user applications, data warehousing, ETL.
What is HBase? From the HBase website: "HBase is the Hadoop database." I think this is a confusing statement.
'Database', to many, means: transactions, joins, indexes. HBase has none of these (more on this later).
HBase is a data storage platform designed to work hand-in-hand with Hadoop: distributed, failure-tolerant, semi-structured, low-latency, strictly consistent, HDFS-aware, "NoSQL".
There is a need for a low-latency, distributed datastore with unlimited horizontal scale: Hadoop (MapReduce) doesn't provide low latency, and traditional RDBMSs don't scale out horizontally.
November 2006: Google BigTable whitepaper published. February 2007: initial HBase prototype. October 2007: first 'usable' HBase. January 2008: HBase becomes an Apache subproject of Hadoop. March 2009: HBase. May 10th, 2010: HBase becomes an Apache Top Level Project.
Use cases: web indexing, social graph, messaging, etc.
HBase is written almost entirely in Java; JVM clients are first-class citizens. [Architecture diagram: JVM clients talk directly to the HBase Master and RegionServers; non-JVM clients go through a proxy (Thrift or REST).]
All data is stored in Tables. Table rows have exactly one Key, and all rows in a table are physically ordered by key. Tables have a fixed number of Column Families (more on this later!). Each row can have many Columns in each column family. Each column has a set of values, each with a timestamp. Each row:family:column:timestamp combination represents the coordinates of a Cell.
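A minimal sketch of addressing a cell by those coordinates with the classic Java client (the "Users" table and the Attributes:Age column are hypothetical names for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinates {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Users"); // hypothetical table

    // row : family : column : timestamp -> cell value
    Put put = new Put(Bytes.toBytes("Chris"));   // row key
    put.add(Bytes.toBytes("Attributes"),         // column family
            Bytes.toBytes("Age"),                // column qualifier
            System.currentTimeMillis(),          // timestamp (version)
            Bytes.toBytes(30));                  // cell value
    table.put(put);
    table.close();
  }
}
```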
Column families are defined by the Table. A Column Family is a group of related columns with its own name. All columns must be in a column family. Each row can have a completely different set of columns for a given column family:

Row   | Column Family | Columns
Chris | Friends       | Friends:Bob
Bob   | Friends       | Friends:Chris, Friends:James
James | Friends       | Friends:Bob
Rows are not exactly the same as rows in a traditional RDBMS. Key: a byte array (usually a UTF-8 string). Data: Cells, qualified by column family, column, and timestamp (timestamps not shown here).

Row Key: Chris
Column Family (defined by the Table) | Column (defined by the Row; may vary between rows) | Cell (created with Columns)
Attributes                           | Attributes:Age                                     | 30
Attributes                           | Attributes:Height                                  | 68
Friends                              | Friends:Bob                                        | 1 (Bob's a cool guy)
Friends                              | Friends:Jane                                       | 0 (Jane and I don't get along)
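Reading that row back with the classic Java client (a sketch; the "Users" table and column names follow the examples above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Users"); // hypothetical table

    Get get = new Get(Bytes.toBytes("Chris"));
    Result result = table.get(get);

    // Cells are addressed by family + qualifier; values are plain byte arrays
    int age = Bytes.toInt(result.getValue(Bytes.toBytes("Attributes"), Bytes.toBytes("Age")));
    byte[] bob = result.getValue(Bytes.toBytes("Friends"), Bytes.toBytes("Bob"));
    System.out.println("Age: " + age + ", has a Friends:Bob cell: " + (bob != null));
    table.close();
  }
}
```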
All cells are created with a timestamp. The column family defines how many versions of a cell to keep. Updates always create a new cell. Deletes create a tombstone (more on that later). Queries can include an "as-of" timestamp to return point-in-time values.
HBase deletes are a form of write called a "tombstone", indicating that "beyond this point, any previously written value is dead". Old values can still be read using point-in-time queries:

Timestamp | Write Type      | Resulting Value | Point-in-Time Value "as of" T+1
T+0       | PUT ("Foo")     | "Foo"           |
T+1       | PUT ("Bar")     | "Bar"           |
T+2       | DELETE          | (none)          | "Bar"
T+3       | PUT ("Foo Too") | "Foo Too"       | "Bar"
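A sketch of that sequence with the classic Java client (the "Demo" table and column names are hypothetical). Note that whether deleted cells stay visible to time-range reads depends on the HBase version and the column family's KEEP_DELETED_CELLS setting, and tombstoned values disappear for good at the next major compaction:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class Tombstones {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Demo"); // hypothetical table
    byte[] row = Bytes.toBytes("row1");
    byte[] fam = Bytes.toBytes("f");
    byte[] col = Bytes.toBytes("value");
    long t = System.currentTimeMillis();

    table.put(new Put(row).add(fam, col, t,     Bytes.toBytes("Foo")));      // T+0
    table.put(new Put(row).add(fam, col, t + 1, Bytes.toBytes("Bar")));      // T+1
    Delete del = new Delete(row);
    del.deleteColumns(fam, col, t + 2);  // tombstone: versions <= T+2 are dead
    table.delete(del);
    table.put(new Put(row).add(fam, col, t + 3, Bytes.toBytes("Foo Too")));  // T+3

    // Point-in-time read "as of" T+1 (the max stamp is exclusive)
    Get asOf = new Get(row);
    asOf.setTimeRange(0, t + 2);
    Result r = table.get(asOf);
    System.out.println(Bytes.toString(r.getValue(fam, col))); // "Bar", per the table above
    table.close();
  }
}
```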
Requirement: Store real-time stock tick data
Requirement: Accommodate many simultaneous readers & writers
Requirement: Allow for reading of the current price for any ticker at any point in time

Ticker | Timestamp | Sequence | Bid | Ask
IBM    | 09:15:03  | …        | …   | …
MSFT   | 09:15:04  | …        | …   | …
GOOG   | 09:15:04  | …        | …   | …
IBM    | 09:15:04  | …        | …   | …
Historical Prices:
Keys | Column          | Data Type
PK   | Ticker          | Varchar
PK   | Timestamp       | DateTime
PK   | Sequence_Number | Integer
     | Bid_Price       | Decimal
     | Ask_Price       | Decimal

Latest Prices:
Keys | Column    | Data Type
PK   | Ticker    | Varchar
     | Bid_Price | Decimal
     | Ask_Price | Decimal
Row Key: [Ticker].[Rev_Timestamp].[Rev_Sequence_Number]
Family:Columns: Prices:Bid, Prices:Ask

HBase throughput will scale linearly with the number of nodes. There is no need to keep a separate "latest price" table: a scan starting at "ticker" will always return the latest price row first, as sketched below.
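A sketch of that read path with the classic Java client (the exact key format, the "Ticks" table name, and the double encoding of prices are assumptions based on the slide):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class LatestTick {
  // Reverse timestamp and sequence so newer ticks sort first within a ticker;
  // zero-pad so the string row keys sort lexicographically like numbers
  static String rowKey(String ticker, long ts, long seq) {
    return String.format("%s.%019d.%019d", ticker, Long.MAX_VALUE - ts, Long.MAX_VALUE - seq);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Ticks"); // hypothetical table

    // Scanning from the bare ticker prefix: the first row returned is the
    // one with the most recent (reversed-smallest) timestamp
    Scan scan = new Scan(Bytes.toBytes("IBM."));
    ResultScanner scanner = table.getScanner(scan);
    Result latest = scanner.next();
    if (latest != null) {
      byte[] bid = latest.getValue(Bytes.toBytes("Prices"), Bytes.toBytes("Bid"));
      // Assumes bids were written with Bytes.toBytes(double)
      System.out.println("Latest IBM bid: " + Bytes.toDouble(bid));
    }
    scanner.close();
    table.close();
  }
}
```

Writing each tick under rowKey(ticker, ts, seq) is what makes this work: the newest tick for a ticker always sorts first, so the "latest price" query and the historical point-in-time query use the same table.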
HBase scales horizontally. It needs to split data over many RegionServers. Regions are the unit of scale.
All HBase tables are broken into 1 or more Regions. Regions have a start row key and an end row key. Each Region lives on exactly one RegionServer; RegionServers may host many Regions. When a RegionServer dies, the Master detects this and assigns its Regions to other RegionServers.
-META- Table:
Table | Region                | Server
Users | "Aaron" – "George"    | Node01
Users | "George" – "Matthew"  | Node02
Users | "Matthew" – "Zachary" | Node01

"Users" Table:
Row keys in Region "Aaron" – "George":    "Aaron", "Bob", "Chris"
Row keys in Region "George" – "Matthew":  "George"
Row keys in Region "Matthew" – "Zachary": "Matthew", "Nancy", "Zachary"
Deceptively simple
[Architecture diagram: a ZooKeeper cluster, the HBase Master plus a backup HBase Master, and RegionServers; JVM clients connect directly, non-JVM clients go through a proxy (Thrift or REST).]
ZooKeeper: keeps track of which server is the current HBase Master. HBase Master: keeps track of the Region-to-RegionServer mapping, manages the -ROOT- and .META. tables, and is responsible for updating ZooKeeper when these change.
RegionServer: stores table regions. Clients: need to be smarter than RDBMS clients. They first connect to ZooKeeper to find the RegionServer for a given Table/Region, then connect directly to that RegionServer to interact with the data. All connections are over Hadoop RPC; non-JVM clients use a proxy (Thrift or REST (Stargate)). A minimal connection sketch follows.
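With the classic Java client (the quorum hosts and "Users" table name are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class Connect {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The client only needs the ZooKeeper quorum, not the Master's address:
    // region locations are resolved via ZooKeeper, -ROOT-, and .META.
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
    HTable table = new HTable(conf, "Users"); // locates regions transparently
    System.out.println("Connected to: " + Bytes.toString(table.getTableName()));
    table.close();
  }
}
```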
-ROOT- Table:
Row key: .META.,[region]
Columns: info:regioninfo, info:server, info:serverstartcode
info:server points to the RegionServer hosting the .META. region.

.META. Table:
Row key: [table],[region start key],[region id]
Columns: info:regioninfo, info:server, info:serverstartcode
info:server points to the RegionServer hosting the user table region.

Regular User Table: … whatever you store …
The HBase Master is not necessarily a single point of failure (SPOF). Multiple masters can be running; the current 'active' Master is controlled via ZooKeeper (make sure you have enough ZooKeeper nodes!). The Master is not needed for client connectivity: clients connect directly to ZooKeeper to find Regions, and everything the Master does can be put off until one is elected.
[Diagram: a ZooKeeper quorum of ZooKeeper nodes, with the current HBase Master and a standby HBase Master both registered.]
HBase tolerates RegionServer failure when running on HDFS, since data is replicated by HDFS (the dfs.replication setting). There are lots of issues around fsync and failure before data is flushed; some are probably still not fixed. Thus, data can still be lost if a node fails after a write. The HDFS NameNode is still a SPOF, even for HBase.
The Write-Ahead Log (WAL) is similar to the log in many RDBMSs. By default, all operations are written to the log before being considered 'committed' (this can be overridden for 'disposable' fast writes, as sketched below). The log can be replayed when a region is moved to another RegionServer. There is one WAL per RegionServer.

Write path: Writes → WAL (flushed periodically, 10s by default) → MemStore (flushed when it gets too big) → HFile.
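A sketch of the 'disposable fast writes' override with the classic Java client (the "Metrics" table is hypothetical; use with care, since unflushed edits are lost if the RegionServer dies):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FastWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Metrics"); // hypothetical table

    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(1L));
    // Skip the WAL entirely: faster, but this edit only survives a
    // RegionServer crash if the MemStore has already been flushed to an HFile
    put.setWriteToWAL(false);
    table.put(put);
    table.close();
  }
}
```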
[Diagram: a RegionServer hosts Regions and one Log (WAL). Each Region holds one Store per column family; each Store has a MemStore and a set of StoreFiles (HFiles). Everything is written through the HDFS client as blocks on HDFS DataNodes.]
A RegionServer is not guaranteed to be on the same physical node as its data. Compaction causes the RegionServer to write preferentially to the local node, but this is a function of the HDFS client, not HBase.
All data is in memory initially (the MemStore). HBase is a write-only system: modifications and deletes are just writes with later timestamps, a function of HDFS being append-only. Eventually old writes need to be discarded. There are 2 types of compactions: minor and major.
All HBase edits are initially stored in memory (the MemStore). Flushes occur when the MemStore reaches a certain size: by default 67,108,864 bytes (64 MB), controlled by the hbase.hregion.memstore.flush.size configuration property. Each flush creates a new HFile.
Minor compactions are triggered when a certain number of HFiles have been created for a given Region Store (plus some other conditions): by default 3 HFiles, controlled by the hbase.hstore.compactionThreshold configuration property (see the configuration sketch below). A minor compaction compacts the most recent HFiles into one and, by default, uses the RegionServer-local HDFS node. It does not eliminate deletes and only touches the most recent HFiles. NOTE: all column families are compacted at once (this might change in the future).
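An hbase-site.xml sketch showing both properties at the default values named above (tune these for your workload):

```xml
<!-- hbase-site.xml: MemStore flush and minor-compaction tuning (defaults shown) -->
<configuration>
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>67108864</value> <!-- flush the MemStore to a new HFile at 64 MB -->
  </property>
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>3</value> <!-- minor-compact once a Store has this many HFiles -->
  </property>
</configuration>
```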
Major compactions are triggered every 24 hours (with a random offset) or manually; large HBase installations usually leave this to manual operation. A major compaction re-writes all HFiles into one, processes deletes, eliminates tombstones, and erases earlier entries.
HBase does not have transactions. However: row-level modifications are atomic, so all modifications to a row succeed or fail as a unit (as sketched below). Gets are consistent for a given point in time, but Scans may return 2 rows from different points in time. All data read has been 'durably stored', which does NOT mean flushed to disk: it can still be lost!
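A sketch of row-level atomicity (the "Accounts" table and its columns are hypothetical): a single Put touching several columns of one row is applied as a unit.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Accounts"); // hypothetical table

    // Both cells land in the same row, so readers see either neither
    // or both: modifications to one row succeed or fail as a unit
    Put put = new Put(Bytes.toBytes("account-42"));
    put.add(Bytes.toBytes("Balances"), Bytes.toBytes("Available"), Bytes.toBytes(100L));
    put.add(Bytes.toBytes("Balances"), Bytes.toBytes("Pending"),   Bytes.toBytes(25L));
    table.put(put);
    table.close();
  }
}
```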
DO: Design your schema for linear range scans on your most common queries; scans are the most efficient way to query a lot of rows quickly. DON'T: Use more than 2 or 3 column families; some operations (flushing and compacting) act on all of a region's column families at once. DO: Be aware of the relative cardinality of column families; wildly differing cardinality leads to sparsity and poor scan performance.
DO: Be mindful of the size of your row and column keys; they are used in indexes and queries and can get quite large. DON'T: Use monotonically increasing row keys; they can lead to hotspots on writes. DO: Store timestamp keys in reverse; rows in a table are read in key order, and you usually want the most recent first.
DO: Query single rows using exact-match on key (Gets), or Scans for multiple rows; scans allow efficient I/O vs. multiple Gets. DON'T: Use regex-based or non-prefix column filters; they are very inefficient. DO: Tune the scan cache and batch size parameters; this drastically improves performance when returning lots of rows, as sketched below.
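A scan-tuning sketch with the classic Java client (the "Ticks" table follows the earlier example; the numbers are illustrative starting points, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TunedScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Ticks"); // hypothetical table

    Scan scan = new Scan(Bytes.toBytes("IBM."));
    scan.setCaching(500); // rows fetched per RPC round-trip to the RegionServer
    scan.setBatch(100);   // max columns per Result, for very wide rows

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process each row...
      }
    } finally {
      scanner.close(); // always release the scanner's server-side resources
    }
    table.close();
  }
}
```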
Requirement: Store an arbitrary set of preferences for all users
Requirement: Each user may choose to store a different set of preferences
Requirement: Preferences may be of different data types (Strings, Integers, etc.)
Requirement: Developers will add new preference options all the time, so we shouldn't need to modify the database structure when adding them
One possible RDBMS solution: a Key/Value table, with all values stored as strings. Flexible, but wastes space:

Keys | Column          | Data Type
PK   | UserID          | Int
PK   | PreferenceName  | Varchar
     | PreferenceValue | Varchar
The HBase solution: store all preferences in a Preferences column family, with the preference name as the column name and the preference value as a (serialized) byte array. The HBase client library provides methods for serializing many common data types:

Row Key | Family:Column         | Value
Chris   | Preferences:Age       | 30
Chris   | Preferences:Hometown  | "Mineola, NY"
Joe     | Preferences:Birthdate | 11/13/1987
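A sketch of this schema with the classic Java client (the "Users" table name and the serialization choices are assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class Preferences {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Users"); // hypothetical table

    byte[] prefs = Bytes.toBytes("Preferences");

    // Each preference is just a column: adding a new preference
    // requires no change to the table's structure
    Put chris = new Put(Bytes.toBytes("Chris"));
    chris.add(prefs, Bytes.toBytes("Age"),      Bytes.toBytes(30));            // int
    chris.add(prefs, Bytes.toBytes("Hometown"), Bytes.toBytes("Mineola, NY")); // String
    table.put(chris);

    Put joe = new Put(Bytes.toBytes("Joe"));
    joe.add(prefs, Bytes.toBytes("Birthdate"), Bytes.toBytes("11/13/1987")); // stored as a String here
    table.put(joe);
    table.close();
  }
}
```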