COSC6376 Cloud Computing Lecture 7: Bigtable Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston
Outline Hadoop Bigtable HBase
Projects
Sample Projects Support video processing using HDFS and MapReduce Image processing using cloud Security services using cloud Web analytics using cloud Cloud-based MPI Novel applications of cloud-based storage New pricing model Cyber-physical system with cloud as the backend Bioinformatics using MapReduce
Hadoop DFS (HDFS) http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html Mimics GFS: same assumptions, highly similar design, different names: Master → NameNode, Chunkserver → DataNode, Chunk → block, Operation log → EditLog
Working with HDFS /usr/local/hadoop/ : installation directory bin/ : scripts for starting/stopping the system conf/ : configuration files log/ : system log files Installation tutorials: Single node: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ Cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
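Beyond the shell scripts in bin/, applications usually talk to HDFS through the Java FileSystem API. A minimal sketch is shown below; the path /user/demo/hello.txt is an illustrative placeholder, and the Configuration object picks up core-site.xml / hdfs-site.xml from the conf/ directory on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads the cluster settings from conf/
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/user/demo/hello.txt");     // hypothetical path for the example
        try (FSDataOutputStream out = fs.create(p)) {  // namenode allocates blocks, datanodes store them
            out.writeUTF("hello HDFS");
        }
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println(in.readUTF());
        }
    }
}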
In-Memory Accelerator for Hadoop
HDFS on different storage devices
PCM Emerging NVM technology that can replace Flash and DRAM Much higher density; much better scalability; can do multi-level cells Non-volatile, fast reads (~50 ns), slow and energy-hungry writes; limited lifetime (~10^8 writes per cell), no leakage
Bigtable Fay Chang et al., Google
Global Picture
Why Bigtable? RDBMS performance is good for transaction processing, but for very-large-scale analytic processing the available solutions are commercial, expensive, and specialized. Very-large-scale analytic processing: big queries, typically range or table scans; big databases (100s of TB)
Why Bigtable? (2) MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra operations, may be a cost-effective solution. Sharding is not a solution to scale open-source RDBMS platforms: application-specific, labor-intensive (re)partitioning
Bigtable BigTable is a distributed storage system for managing structured data. Designed to scale to a very large size Petabytes of data across thousands of servers Used for many Google projects Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, … Flexible, high-performance solution for all of Google’s products
BigTable Distributed multi-level map Fault-tolerant, persistent Scalable Thousands of servers Terabytes of in-memory data Petabytes of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically Servers adjust to load imbalance Often want to examine data changes over time E.g. contents of a web page over multiple crawls
Building Blocks Building blocks: Google File System (GFS): raw storage Scheduler: schedules jobs onto machines Lock service: distributed lock manager MapReduce: simplified large-scale data processing BigTable uses of building blocks: GFS: stores persistent data (SSTable file format for storage of data) Scheduler: schedules jobs involved in BigTable serving Lock service: master election MapReduce: often used to read/write BigTable data
Google File System Large-scale distributed “filesystem” Master: responsible for metadata Chunk servers: responsible for reading and writing large chunks of data Chunks replicated on 3 machines, master responsible for ensuring replicas exist
Basic Data Model A BigTable is a sparse, distributed, persistent, multi-dimensional sorted map: (row, column, timestamp) -> cell contents Good match for most Google applications
WebTable Example Want to keep copy of a large collection of web pages and related information Use URLs as row keys Various aspects of web page as column names Store contents of web pages in the contents: column under the timestamps when they were fetched.
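To make the map concrete, here is a toy in-memory sketch of the WebTable model in Java. Nothing in it is Bigtable's real API; it only mirrors the row -> column -> timestamp -> value nesting, with versions kept newest first.

import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ToyWebTable {
    // row key -> column name -> timestamp (descending) -> cell value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> cells = new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    public String latest(String row, String column) {
        NavigableMap<Long, String> versions = cells.getOrDefault(row, new TreeMap<>()).get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        ToyWebTable t = new ToyWebTable();
        t.put("com.cnn.www", "contents:", 3L, "<html>v3...");    // page contents at crawl time 3
        t.put("com.cnn.www", "contents:", 2L, "<html>v2...");
        t.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");     // anchor text from a referring page
        System.out.println(t.latest("com.cnn.www", "contents:")); // prints <html>v3...
    }
}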
Rows Name is an arbitrary string Access to data in a row is atomic Row creation is implicit upon storing data Rows ordered lexicographically Rows close together lexicographically usually reside on one or a small number of machines
Rows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines. Can exploit this property by selecting row keys that give good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu vs. edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
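A sketch of the usual trick: reverse the hostname components so pages from the same domain become lexicographic neighbours. reverseDomain is a hypothetical helper for illustration, not part of any Bigtable API.

public class RowKeys {
    // "math.gatech.edu" -> "edu.gatech.math": same-domain keys sort next to each other
    static String reverseDomain(String host) {
        String[] parts = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            key.append(parts[i]);
            if (i > 0) key.append('.');
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseDomain("math.gatech.edu")); // edu.gatech.math
        System.out.println(reverseDomain("phys.gatech.edu")); // edu.gatech.phys
    }
}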
Columns Columns have a two-level name structure: family:optional_qualifier Column family: unit of access control; has associated type information Qualifier: gives unbounded columns; additional levels of indexing, if desired
Timestamps Used to store different versions of data in a cell New writes default to the current time, but timestamps for writes can also be set explicitly by clients Lookup options: "Return most recent K values" "Return all values in timestamp range (or all values)" Column families can be marked with attributes: "Only retain most recent K values in a cell" "Keep values until they are older than K seconds"
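The "return most recent K values" lookup and the "retain only K versions" garbage-collection policy both fall out naturally if a cell's versions are kept newest first, as in this small illustrative sketch (not the real API):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CellVersions {
    // timestamp -> value, newest first
    private final NavigableMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    void write(String value) { versions.put(System.currentTimeMillis(), value); }   // default: current time
    void write(long explicitTs, String value) { versions.put(explicitTs, value); }  // client-chosen timestamp

    List<String> mostRecent(int k) {                       // "return most recent K values"
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            if (out.size() == k) break;
            out.add(e.getValue());
        }
        return out;
    }

    void retainMostRecent(int k) {                         // GC policy: keep only the newest K versions
        while (versions.size() > k) versions.pollLastEntry();
    }
}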
SSTable Immutable, sorted file of key-value pairs Chunks of data plus an index Index is of block ranges, not values [Diagram: an SSTable composed of 64 KB blocks plus a block index]
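A minimal sketch of how the block index is used: the index maps each block's first key to its position, so a lookup does a floor search on the index and then reads within a single block. The on-disk format is faked here with in-memory maps.

import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ToySSTable {
    // each "block" is a small sorted run of key/value pairs (~64 KB in the real format)
    private final List<NavigableMap<String, String>> blocks;
    // index: first key of each block -> block number (kept in memory once the SSTable is opened)
    private final NavigableMap<String, Integer> index = new TreeMap<>();

    ToySSTable(List<NavigableMap<String, String>> blocks) {
        this.blocks = blocks;
        for (int i = 0; i < blocks.size(); i++) {
            index.put(blocks.get(i).firstKey(), i);
        }
    }

    String get(String key) {
        Map.Entry<String, Integer> e = index.floorEntry(key); // block whose first key <= key
        if (e == null) return null;
        return blocks.get(e.getValue()).get(key);             // read confined to that one block
    }
}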
Tablet Contains some range of rows of the table Built out of multiple SSTables [Diagram: a tablet covering the row range aardvark..apple, backed by two SSTables, each made of 64 KB blocks plus an index]
Table Multiple tablets make up the table SSTables can be shared Tablets do not overlap, SSTables can overlap [Diagram: two tablets covering aardvark..apple and apple_two_E..boat, sharing some of their SSTables]
Architecture Client library Single master server Tablet servers
Bigtable Master Assigns tablets to tablet servers Detects addition and expiration of tablet servers Balances tablet server load. Tablets are distributed randomly on nodes of the cluster for load balancing. Handles garbage collection Handles schema changes
Bigtable Tablet Servers Each tablet server manages a set of tablets Typically ten to a thousand tablets, each 100-200 MB by default Handles read and write requests to its tablets Splits tablets that have grown too large Master responsible for load balancing and fault tolerance Uses Chubby to monitor the health of tablet servers and restart failed servers
A 3-level Hierarchy 1st level: a file stored in Chubby contains the location of the root tablet, i.e., a directory of ranges (tablets) and associated meta-data. The root tablet never splits. 2nd level: each meta-data tablet contains the locations of a set of user tablets. 3rd level: a set of SSTable identifiers for each tablet.
A 3-level Hierarchy Each meta-data row stores ~1 KB of data. With 128 MB meta-data tablets, the three-level scheme addresses 2^34 tablets (2^61 bytes in 128 MB tablets), roughly 2 exabytes.
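The arithmetic behind those numbers, written out under the stated assumptions (~1 KB per meta-data row, 128 MB meta-data and user tablets):

\[
\frac{128\,\mathrm{MB}}{1\,\mathrm{KB/row}} = 2^{27-10} = 2^{17}\ \text{rows per meta-data tablet},
\qquad
2^{17}\ \text{meta-data tablets} \times 2^{17}\ \text{user tablets each} = 2^{34}\ \text{tablets},
\qquad
2^{34} \times 128\,\mathrm{MB} = 2^{61}\,\mathrm{B} \approx 2.3\ \mathrm{EB}.
\]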
Editing a Table Mutations are logged, then applied to an in-memory version (memtable) Logfile stored in GFS [Diagram: inserts and deletes flow into the memtable of a tablet that is also backed by SSTables]
Chubby A persistent and distributed lock service. Consists of 5 active replicas; one replica is the master and serves requests. The service is functional when a majority of the replicas are running and in communication with one another, i.e., when there is a quorum. Implements a name service that consists of directories and files.
Bigtable and Chubby Bigtable uses Chubby to: Ensure there is at most one active master at a time, Store the bootstrap location of Bigtable data (Root tablet), Discover tablet servers and finalize tablet server deaths, Store Bigtable schema information (column family information), Store access control list. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.
Tablet Assignment Each tablet is assigned to one tablet server at a time. The master server keeps track of the set of live tablet servers and the current assignment of tablets to servers. It also keeps track of unassigned tablets. When a tablet is unassigned, the master assigns it to a tablet server with sufficient room.
API Metadata operations: Create/delete tables and column families, change metadata Writes (atomic): Set(): write cells in a row DeleteCells(): delete cells in a row DeleteRow(): delete all cells in a row Reads: Scanner: read arbitrary cells in a bigtable Each row read is atomic Can restrict returned rows to a particular range Can ask for just data from 1 row, all rows, etc. Can ask for all columns, just certain column families, or specific columns
API Examples: Write/Modify atomic row modification No support for (RDBMS-style) multi-row transactions
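Bigtable's client library is not public; as an approximation, here is what the same kind of single-row mutation looks like in HBase's Java API (HBase is covered later in the outline). Table and column names are illustrative; mutateRow applies the Put and the Delete to one row atomically, and there is still no multi-row transaction.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RowMutations;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table webtable = conn.getTable(TableName.valueOf("webtable"))) {

            byte[] row = Bytes.toBytes("com.cnn.www");

            // write a new anchor and delete an old one, atomically within the row
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("www.c-span.org"), Bytes.toBytes("CNN"));
            Delete del = new Delete(row);
            del.addColumns(Bytes.toBytes("anchor"), Bytes.toBytes("www.abc.com"));

            RowMutations rm = new RowMutations(row);
            rm.add(put);
            rm.add(del);
            webtable.mutateRow(rm);   // single-row atomicity; no multi-row transactions
        }
    }
}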
API Examples: Read Return sets can be filtered using regular expressions: anchor: com.cnn.*
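Again using HBase as a stand-in for the non-public Bigtable API, a scan that returns only anchor: columns whose qualifier matches com.cnn.* could look roughly like this (table and family names are illustrative):

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table webtable = conn.getTable(TableName.valueOf("webtable"))) {

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("anchor"));                   // restrict to the anchor: family
            scan.setFilter(new QualifierFilter(CompareOperator.EQUAL,
                    new RegexStringComparator("com\\.cnn\\..*")));     // qualifier regex filter

            try (ResultScanner results = webtable.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));    // each row read is atomic
                }
            }
        }
    }
}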
Tablet Serving “Log Structured Merge Trees” Image Source: Chang et al., OSDI 2006
Tablet Representation [Diagram: writes go to an append-only commit log on GFS and to a random-access write buffer in memory (memtable); reads see the memtable merged with SSTables on GFS] SSTable: immutable on-disk ordered map from string to string String keys: <row, column, timestamp> triples
Client Write & Read Operations Write operation arrives at a tablet server: Server ensures the client has sufficient privileges for the write operation (Chubby), A log record is appended to the commit log file, Once the write commits, its contents are inserted into the memtable. Read operation arrives at a tablet server: Server ensures the client has sufficient privileges for the read operation (Chubby), Read is performed on a merged view of (a) the SSTables that constitute the tablet, and (b) the memtable.
Write Operations As writes execute, the size of the memtable increases. Once the memtable reaches a threshold: The memtable is frozen, A new memtable is created, The frozen memtable is converted to an SSTable and written to GFS.
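A compressed sketch of this path: append the mutation to the commit log, apply it to a sorted in-memory memtable, and flush the frozen memtable as an immutable SSTable once a size threshold is crossed; reads merge the memtable with the SSTables, newest data first. Everything here is illustrative, with GFS and the SSTable format reduced to in-memory stand-ins.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ToyTablet {
    private static final int FLUSH_THRESHOLD = 4;                     // tiny threshold for illustration

    private final List<String> commitLog = new ArrayList<>();         // stands in for the log on GFS
    private NavigableMap<String, String> memtable = new TreeMap<>();  // sorted in-memory write buffer
    private final List<NavigableMap<String, String>> sstables = new ArrayList<>(); // newest last

    void write(String key, String value) {
        commitLog.add(key + "=" + value);          // 1) record the mutation in the commit log
        memtable.put(key, value);                  // 2) apply it to the memtable
        if (memtable.size() >= FLUSH_THRESHOLD) {  // 3) minor compaction: freeze and flush
            sstables.add(Collections.unmodifiableNavigableMap(memtable));
            memtable = new TreeMap<>();
        }
    }

    String read(String key) {                      // merged view: memtable first, then newest SSTables
        if (memtable.containsKey(key)) return memtable.get(key);
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String v = sstables.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }
}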
Compactions Minor compaction: Converts the memtable into an SSTable Reduces memory usage and log traffic on restart Merging compaction: Reads the contents of a few SSTables and the memtable, and writes out a new SSTable Reduces the number of SSTables Major compaction: A merging compaction that results in only one SSTable No deletion records, only live data
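Merging and major compactions are essentially a merge of sorted inputs in which the newest value for each key wins. A tiny sketch of that idea follows; deletion records, which a major compaction drops entirely, are modeled here as null values.

import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ToyCompaction {
    // inputs ordered newest first (memtable, then SSTables from newest to oldest)
    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> newestFirst,
                                                boolean major) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> source : newestFirst) {
            for (Map.Entry<String, String> e : source.entrySet()) {
                if (!merged.containsKey(e.getKey())) {
                    merged.put(e.getKey(), e.getValue());   // newer value or deletion marker shadows older ones
                }
            }
        }
        if (major) {
            merged.values().removeIf(v -> v == null);       // drop deletion records: only live data remains
        }
        return merged;
    }
}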
Refinements: Locality Groups Can group multiple column families into a locality group A separate SSTable is created for each locality group in each tablet Segregating column families that are not typically accessed together enables more efficient reads In WebTable, page metadata can be in one group and the contents of the page in another group
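HBase's nearest equivalent is the column family: each family gets its own store files, and properties such as on-disk compression are set per family. A hedged sketch of creating the WebTable split this way (table name, family names, and the compression choice are all illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateWebtable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableDescriptorBuilder table = TableDescriptorBuilder.newBuilder(TableName.valueOf("webtable"))
                // small, frequently read page metadata in one "group"
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("language"))
                // bulky page bodies in another, compressed on disk
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("contents"))
                        .setCompressionType(Compression.Algorithm.GZ)
                        .build());

            admin.createTable(table.build());
        }
    }
}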
Refinements: Compression Many opportunities for compression Similar values in the same row/column at different timestamps Similar values in different columns Similar values across adjacent rows Two-pass custom compression scheme First pass: compress long common strings across a large window Second pass: look for repetitions in a small window Speed emphasized, but good space reduction (10-to-1)