1
BigTable and Google File System
Presented by: Ayesha Fawad 10/07/2014
2
Overview Google File System Basics Design Chunks Replicas Clusters
Client
3
Overview Google File System chunk server Master server Shadow Master
Read Request Workflow Write Request Workflow Built-in Functions Limitations
4
Overview BigTable Introduction What is BigTable? Design
example Rows and Tablets Columns and Column Families
5
Overview BigTable Timestamp Cells Data Structure SSTables and Logs
example Cells Data Structure SSTables and Logs Tablet Table
6
Overview BigTable Cluster Chubby How to find a row Mutations
BigTable Implementation BigTable Building Blocks Architecture
7
Overview BigTable Master server Tablet server Client Library
In case of Failure? Recovery Process Compactions Refinement
8
Overview BigTable Interactions between GFS and BigTable API
Why use BigTable? Why not any other Database? Application Design CAP
9
Overview BigTable Google Services using BigTable BigTable Derivatives
Colossus Comparison
10
Overview Google App Engine Introduction GAE Data store
Unsupported Actions Entities Models Queries Indexes
11
Overview Google App Engine GQL Transactions Data store Software Stack
GUI Main Data store options Competitors
12
Overview Google App Engine Hard Limits Free Quotas
Cloud Data Storage options
13
Google File System Presented by: Ayesha Fawad 10/07/2014
14
Basics Originated in 2003. GFS is designed for system-to-system interaction, not user-to-system interaction. It runs on a network of inexpensive machines running the Linux operating system.
15
Design GFS relies on distributed computing to provide users the infrastructure they need to create, access, and alter data. Distributed computing networks several computers together and takes advantage of their individual resources in a collective way: each computer contributes some of its resources, such as memory, processing power, and hard drive space, to the overall network, turning the entire network into one massive computer in which each individual machine acts as a processor and data storage device.
16
Design Autonomic computing: a concept in which computers are able to diagnose problems and solve them in real time without human intervention. The challenge for the GFS development team was to design an autonomic monitoring system that could work across a huge network of computers. Simplification: GFS offers basic commands like open, create, read, write, and close, plus a few specialized commands like append and snapshot.
17
Design Checkpoints can include application-level checksums
Readers verify and process only the file region up to the last checkpoint, which is known to be in a defined state. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's point of view. GFS relies on appends rather than overwrites.
18
Chunks Files on GFS tend to be very large (multi-gigabyte range). GFS handles this by breaking files up into chunks of 64 MB each, which is good for scans, streams, archives, and shared queues. Each chunk has a unique 64-bit ID number called a chunk handle. Because all file chunks are the same size, resource allocation is simplified: the master can check which machines are near capacity, check which machines are underused, and balance the workload by moving chunks from one machine to another.
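Since every chunk is a fixed 64 MB, a client can turn a byte offset within a file into a chunk index with plain integer arithmetic before asking the master for that chunk's location. A minimal sketch; the function name and tuple layout are illustrative, not the real GFS client interface:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset within a file to (chunk index, offset inside that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Byte 200,000,000 falls inside the third chunk (index 2).
print(locate(200_000_000))  # (2, 65782272)
```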
19
Replicas Two categories:
Primary replica: the chunk replica that a chunk server sends to a client. Secondary replicas: serve as backups on other chunk servers. The master decides which replica will act as primary and which as secondary. When a client changes the data in a chunk, the master informs the chunk servers holding secondary replicas that they must copy the new chunk from the primary chunk server to stay current.
20
Design (figure)
21
Clusters Google has organized GFS into simple networks of computers called clusters. A cluster contains three kinds of entities: clients, a master server, and chunk servers.
22
Client Clients: any entity making a request
GFS was developed by Google for its own use. Clients can be other computers or computer applications.
23
Chunk server Chunk servers: the workhorses of the cluster; they store the 64 MB file chunks
send requested chunks directly to the client; the number of replicas is configurable
24
Master server Master server: the coordinator for the cluster
maintains the operation log; keeps track of metadata describing chunks; handles chunk garbage collection, re-replication on chunk server failures, and chunk migration to balance load and disk space; does not store the actual chunks
25
Master server Upon start-up, the master server polls all the chunk servers
The chunk servers respond with information about the data they contain, location details, and space details
26
Shadow Master Shadow master servers stay up to date by reading the primary master's operation log and by polling chunk servers. If anything goes wrong with the primary master, a shadow server can take over. GFS ensures shadow master servers run on different machines (in case of hardware failure). Shadow servers lag behind the primary master by fractions of a second. They provide limited services in parallel with the master; these services are limited to reads.
27
Shadow Master (figure)
28
Read Request Workflow 1. Client sends a read request for a particular file to the master
29
Read Request Workflow 2. Master responds with the location of the primary replica, where the client can find that particular file
30
Read Request Workflow 3. Client contacts the chunk server directly
31
Read Request Workflow 4. Chunk server sends the replica to the client
32
Write Request Workflow
1. Client sends the request to the master server
33
Write Request Workflow
2. Master responds with the locations of the primary and secondary replicas
34
Write Request Workflow
3. Client sends the write data to all the replicas, closest one first regardless of primary or secondary (pipelined)
35
Write Request Workflow
4. Once the data is received by the replicas, the client instructs the primary replica to begin the write; the primary assigns consecutive serial numbers to each of the file changes (mutations)
36
Write Request Workflow
5. After the primary applies the mutations to its own data, it sends the write requests to all the secondary replicas
37
Write Request Workflow
6. Secondary replicas complete the write and report back to the primary replica
38
Write Request Workflow
7. Primary sends confirmation to the client; if that does not work, the master will identify the affected replica as garbage
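The seven steps above can be condensed into a small simulation. The classes and method names below are invented for illustration and are not the real GFS interfaces; in the actual protocol the secondaries apply the data they already buffered in step 3, in the serial order chosen by the primary.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = []   # data pushed by the client (step 3), not yet committed
        self.chunk = []    # committed mutations in serial-number order

    def push(self, data):
        self.buffer.append(data)

    def apply(self, serial, data):
        self.chunk.append((serial, data))
        return True

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def commit(self):
        ok = True
        for data in self.buffer:
            serial = self.next_serial          # step 4: primary picks the mutation order
            self.next_serial += 1
            self.apply(serial, data)           # step 5: apply locally
            for s in self.secondaries:         # step 5: forward to secondaries
                ok &= s.apply(serial, data)    # step 6: secondaries acknowledge
        self.buffer.clear()
        return ok                              # step 7: confirmation back to the client

secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
for r in [primary] + secondaries:
    r.push(b"record-1")                        # step 3: client pushes data to every replica
print(primary.commit())                        # True when all replicas applied the mutation
```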
39
Mutations (figure)
40
Mutations Consistent: a file region is consistent if all clients will always see the same data, regardless of which replica is read. Defined: a region is defined after a file data mutation if it is consistent and clients see what the mutation wrote in its entirety.
41
Built-in Functions Master and Chunk replication
Streamlined recovery process. Rebalancing. Stale replica detection. Garbage removal (configurable). Checksumming: each 64 MB chunk is broken into blocks of 64 KB, and each block has its own 32-bit checksum; checksums are verified and compared on reads, which prevents data corruption from going unnoticed.
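A rough sketch of the per-block checksumming described above, using CRC-32 as a stand-in 32-bit checksum (GFS does not publish its exact checksum function, so the choice here is an assumption):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # each 64 MB chunk is checksummed in 64 KB blocks

def block_checksums(chunk: bytes) -> list[int]:
    """Compute a 32-bit checksum for every 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, expected: list[int]) -> bool:
    """Recompute checksums on read and compare against the stored ones."""
    return block_checksums(chunk) == expected

data = b"x" * (3 * BLOCK_SIZE)
sums = block_checksums(data)
assert verify(data, sums)
assert not verify(data[:-1] + b"y", sums)  # a single corrupted byte is detected
```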
42
Limitations Suited for batch-oriented applications that prefer high sustained bandwidth over low latency, e.g. web crawling. A single point of failure is unacceptable for latency-sensitive applications, e.g. Gmail or YouTube. The single master is a scanning bottleneck. Consistency problems.
43
BigTable Presented by: Ayesha Fawad 10/07/2014
44
Introduction Created by Google in 2005.
Maintained as a proprietary, in-house technology. Some technical details were disclosed at a USENIX symposium in 2006. It has been used by Google services since 2005.
45
What is BigTable? It is a distributed storage system
It can be spread across multiple nodes but appears to be one large table. It is not a database design; it is a storage design model.
46
What is BigTable? Map: BigTable is a collection of (key, value) pairs, where the key identifies a row and the value is the set of columns.
47
What is BigTable? Sparse
Different rows in the table may use different columns, with many of the columns empty for a particular row.
48
What is BigTable? Column-oriented
It can operate on a set of attributes (columns) for all tuples. Storing each column contiguously on disk allows more records per disk block and reduces disk I/O. The underlying assumption is that in most cases not all columns are needed for data access. In an RDBMS implementation, each "row" is usually stored contiguously on disk.
49
Example webpages (figure)
50
Example webpages { "com.cnn.www" => { "contents" => "html…", "anchor" => { "cnnsi.com" => "CNN", "my.look.ca" => "CNN.com" } } }
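The same row rendered as a nested Python dict, with made-up timestamped versions added to foreshadow the (row, column, timestamp) indexing described on the following slides; the values are purely illustrative:

```python
# Conceptual view: row key -> column family -> qualifier -> {timestamp: value}
webtable = {
    "com.cnn.www": {
        "contents": {"": {6: "<html>…v3", 5: "<html>…v2", 3: "<html>…v1"}},
        "anchor": {"cnnsi.com": {9: "CNN"}, "my.look.ca": {8: "CNN.com"}},
    },
}

# Most recent version of the page contents for row "com.cnn.www".
versions = webtable["com.cnn.www"]["contents"][""]
print(versions[max(versions)])  # "<html>…v3"
```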
51
What is BigTable? It is semi-structured Map (key value pair)
Different rows in the same table can have different columns. The key is a string, so, unlike an array index, it is not required to be sequential.
52
What is BigTable? Lexicographically sorted: data is sorted by key
Structure keys in a way that sorting brings related data together, e.g. edu.villanova.cs, edu.villanova.law, edu.villanova.www
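A quick sketch of the reversed-hostname convention behind that example: reversing the domain components makes pages from the same domain sort next to each other (the helper name is made up):

```python
def row_key(hostname: str) -> str:
    """Reverse the domain components so related pages sort adjacently."""
    return ".".join(reversed(hostname.split(".")))

hosts = ["www.villanova.edu", "cs.villanova.edu", "law.villanova.edu", "www.cnn.com"]
print(sorted(row_key(h) for h in hosts))
# ['com.cnn.www', 'edu.villanova.cs', 'edu.villanova.law', 'edu.villanova.www']
```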
53
What is BigTable? Persistent
When a certain amount of data has accumulated in memory, BigTable makes it persistent by storing the data in the Google File System.
54
What is BigTable? Multi-dimensional
Data is indexed by row key, column name, and timestamp; it acts like a map and looks like a table with many rows (keys) and many columns, each cell carrying a timestamp. For example: URLs as row keys, web-page metadata as column names, web-page contents in a column, and timestamps recording when each page was fetched.
55
Design Data is indexed by row key, column name and time stamp
(row: string, column: string, time: int64) → string. Each value in the map is an uninterpreted array of bytes. BigTable offers the client some control over data layout and format: a careful choice of schema can control the locality of data, and the client decides how to serialize the data.
56
Row The row key can be up to 64 KB. The row range for a table is dynamically partitioned; each row range is called a tablet, the unit of distribution and load balancing. Clients can select row keys to get better locality of data accesses: reads of short row ranges are efficient and typically require communication with only a small number of machines.
57
Row Every read or write of data under a single row key is atomic (regardless of the different columns being read or written in the row)
There is no such guarantee across rows. BigTable supports single-row transactions, which perform atomic read-modify-write sequences on data stored under a single row key, but it does not support general transactions across row keys.
58
Row with Example (figure highlighting the row key)
59
Column and Column Families
Column keys are grouped into sets called column families, named as family:qualifier. Data stored in the same column family usually has the same data type, and data in the same column family is indexed and compressed together. The number of distinct column families is kept small, e.g. a family for the language used on a web page.
60
Column with Example (figure highlighting the columns)
61
Column Families with Example (figure; column family keys have the form family:qualifier)
62
Timestamp 64-bit integers
Multiple timestamps exist in each cell to record the various versions of the data as it is created and modified. The most recent version is accessible first; clients can choose garbage-collection options (e.g. keep only the last N versions) or ask for specific timestamps. Timestamps are assigned either by BigTable (in microseconds) or by the client application.
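A minimal sketch of the "keep only the newest N versions" garbage-collection option for one cell, with versions keyed by an int64-style timestamp (the in-memory layout is an assumption for illustration):

```python
def gc_keep_newest(cell: dict[int, bytes], n: int = 3) -> dict[int, bytes]:
    """Drop all but the n most recent timestamped versions of a cell."""
    newest = sorted(cell, reverse=True)[:n]
    return {ts: cell[ts] for ts in newest}

cell = {1: b"v1", 5: b"v2", 9: b"v3", 12: b"v4"}
print(gc_keep_newest(cell, n=3))  # {12: b'v4', 9: b'v3', 5: b'v2'}
```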
63
Timestamp with Example (figure highlighting the timestamps)
64
Cells (figure highlighting the cells)
65
Mutations First, mutations are logged in a log file
The log file is stored in GFS. Then the mutations are applied to an in-memory version called the memtable.
66
Mutations (figure)
67
Data Structure Two data structures are stored in GFS:
Logs and Sorted String Tables (SSTables). The data structure is defined using protocol buffers (a data description language), used to avoid the inefficiency of converting data from one format to another, e.g. between Java and .NET data formats.
68
SSTables and Logs In memory, BigTable provides mutable key-value storage; once the log or in-memory table reaches a certain limit, the changes are made persistent in GFS, where they become immutable. All in-memory transactions are saved in GFS as segments called logs. After the changes reach a certain size (the amount you want to keep in memory), they are cleaned out: the data is compacted into a series of SSTables and then sent out as chunks to GFS.
69
SSTables and Logs An SSTable provides a persistent, immutable, ordered map from keys to values. A sequence of blocks forms an SSTable, and each SSTable stores one block index; when the SSTable is opened, the index is loaded into memory and specifies each block's location.
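A toy sketch of that structure: the memtable's keys are sorted, split into fixed-size blocks, and a small index of first keys locates the right block on lookup. The class, the two-entry block size, and the in-memory representation are illustrative, not BigTable's actual file format:

```python
import bisect

class ToySSTable:
    """Immutable, ordered key-value map split into blocks with an in-memory index."""
    def __init__(self, memtable: dict[str, str], block_size: int = 2):
        items = sorted(memtable.items())                     # SSTables are key-ordered
        self.blocks = [items[i:i + block_size]
                       for i in range(0, len(items), block_size)]
        self.index = [block[0][0] for block in self.blocks]  # first key of each block

    def get(self, key: str):
        i = bisect.bisect_right(self.index, key) - 1         # index points at the candidate block
        if i < 0:
            return None
        return dict(self.blocks[i]).get(key)

memtable = {"edu.villanova.www": "v1", "com.cnn.www": "v2", "edu.villanova.cs": "v3"}
sst = ToySSTable(memtable)
print(sst.get("edu.villanova.cs"))  # "v3"
print(sst.get("org.example"))       # None
```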
70
SSTables and Logs (figure: a sequence of 64 KB blocks followed by an index of block ranges)
71
Tablet A tablet is a range of rows of a table
It contains multiple SSTables, and tablets are assigned to tablet servers. (Figure: a tablet covering the row range aardvark–apple, built from SSTables made of 64 KB blocks plus their indexes.)
72
Table Multiple tablets form a Table
SSTables can overlap, but tablets do not overlap. (Figure: two adjacent tablets covering the row ranges aardvark–apple and apple–boat, each built from SSTables.)
73
Cluster A BigTable cluster stores tables, and each table consists of tablets
Initially a table contains one tablet; as the table grows, multiple tablets are created. Tablets are assigned to tablet servers; each tablet exists at only one server, while a server holds multiple tablets, each typically 100–200 MB.
74
How to find a Row? (figure)
75
How to find a Row? The client reads the location of the root tablet from a Chubby file. The root tablet contains the locations of the METADATA tablets (the root tablet never splits), and each METADATA tablet contains the locations of user tablets.
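A sketch of that three-level lookup, with Chubby, the root tablet, and the METADATA tablet modeled as plain dicts; every name and location string here is invented for illustration:

```python
# Toy cluster: each "server" just holds the tablet data it serves.
servers = {
    "server-1": {"METADATA": "server-2"},                         # root tablet
    "server-2": {("webtable", "edu.villanova.zzz"): "server-9"},  # one METADATA tablet
}
chubby = {"/bigtable/root-location": "server-1"}                  # Chubby file

def find_tablet(table: str, row_key: str) -> str:
    """Chubby -> root tablet -> METADATA tablet -> user-tablet location."""
    root_server = chubby["/bigtable/root-location"]   # 1. read the Chubby file
    meta_server = servers[root_server]["METADATA"]    # 2. root tablet points at a METADATA tablet
    for (tbl, end_row), user_server in servers[meta_server].items():
        if tbl == table and row_key <= end_row:       # 3. pick the range covering the row key
            return user_server
    raise KeyError(f"no tablet covers {table}/{row_key}")

print(find_tablet("webtable", "edu.villanova.cs"))    # server-9
```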
76
BigTable Architecture (figure)
77
BigTable Implementation
BigTable has three components: the master server; tablet servers, which are dynamically added or removed to handle the workload; and the client library, which links the master server, the many tablet servers, and all clients. (Chubby, a distributed lock service, is used alongside these and is described on the following slides.)
78
BigTable Implementation (figure)
79
BigTable Building Blocks
Google File System: stores persistent state. Scheduler: schedules jobs involved in serving BigTable. Lock Service: master election and location bootstrapping. MapReduce: used to read/write BigTable data.
80
Chubby Distributed Lock Service
Its namespace consists of directories and files, which are used as locks and provide mutual exclusion. Highly available: one elected master and five active replicas, with Paxos maintaining consistency among the replicas. Reads and writes are atomic.
81
Chubby Responsible for: ensuring there is only one active master
storing the bootstrap location of BigTable data; discovering tablet servers; storing BigTable schema information; storing access control lists
82
Chubby Client Library Responsible for:
providing consistent caching of Chubby files. Each Chubby client maintains a session with the Chubby service, and every client's session has a lease expiration time. If the client is unable to renew its session lease within that time, the session expires and all of its locks and open handles are lost.
83
Master Server
At start-up: acquires a unique master lock in Chubby; discovers live tablet servers in Chubby; discovers existing tablet assignments; scans the METADATA table to learn the set of tablets. Responsible for: adding or deleting tablet servers based on demand; assigning tablets to tablet servers; monitoring and balancing tablet server load; garbage collection of files in GFS; checking each tablet server for the status of its lock. In case of failure: if its session with Chubby is lost, the master kills itself and an election can take place to find a new master.
84
Tablet Server Starts up:
acquires an exclusive lock on a uniquely named file in a specific Chubby directory. Responsible for: managing tablets; splitting tablets that grow beyond a certain size; serving reads and writes, for which clients communicate directly with the tablet server. In case of failure: if it loses its exclusive lock, the tablet server stops serving; if the Chubby file still exists, it will attempt to reacquire the lock; if the file no longer exists, the tablet server kills itself, restarts, and joins the pool of unassigned tablet servers.
85
Tablet Server Failure (figure slides)
87
Tablet Server Recovery Process
Read the metadata containing the SSTables and redo points: the METADATA table contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into the commit logs. Apply the updates in the commit log from the redo points onward to reconstruct the memtable.
88
Tablet Server Recovery Process
Read and write requests arriving at the tablet server are checked to make sure they are well formed, and a permission file in Chubby is checked to ensure authorization. For a write operation, all mutations are written to the commit log and finally a group commit is used. For a read operation, the read is executed on a merged view of the sequence of SSTables and the memtable.
89
Compactions When the in-memory memtable is full
Minor compaction: converts the memtable into an SSTable; reduces memory usage and log traffic on restart. Merging compaction: reduces the number of SSTables; a good place to apply the policy "keep only N versions". Major compaction: a merging compaction that results in only one SSTable, with no deletion records, only live data.
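A toy merging compaction over dict-based SSTables: newer tables win on key conflicts, and deletion markers are dropped only when the compaction is major. The tombstone convention and names are assumptions for the sketch:

```python
TOMBSTONE = None  # illustrative deletion marker

def merging_compaction(sstables, major=False):
    """Merge several key-sorted SSTables (oldest first); the newest value wins."""
    merged = {}
    for table in sstables:        # later (newer) tables overwrite earlier ones
        merged.update(table)
    if major:                     # a major compaction keeps only live data
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return dict(sorted(merged.items()))

old = {"a": "1", "b": "2", "c": "3"}
new = {"b": "2'", "c": TOMBSTONE}
print(merging_compaction([old, new]))              # {'a': '1', 'b': "2'", 'c': None}
print(merging_compaction([old, new], major=True))  # {'a': '1', 'b': "2'"}
```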
90
Refinement
Locality groups: clients can group multiple column families together into a locality group. Compression: applied to each SSTable block separately; uses Bentley and McIlroy's scheme plus a fast compression algorithm. Caching for read performance: uses a Scan Cache and a Block Cache. Bloom filters: reduce the number of disk accesses.
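A minimal Bloom filter sketch showing how a tablet server could skip SSTables that definitely do not contain a requested row/column pair; the bit-array size and hashing scheme here are arbitrary choices for the example, not BigTable's:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
print(bf.might_contain("edu.villanova.www/contents:"))   # almost certainly False: skip that SSTable
```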
91
Refinement
Commit-log implementation: rather than one commit log per tablet, use a single log per tablet server. Exploiting SSTable immutability: no need to synchronize accesses to the file system when reading SSTables; concurrency control over rows becomes efficient; deletes work like garbage collection, removing obsolete SSTables; and it enables quick tablet splits, since the parent's SSTables are shared by the children.
92
Interactions between GFS and BigTable
The persistent state of a collection of rows (a tablet) is stored in GFS. Writes: incoming writes are recorded in memory in the memtable, where they are sorted and buffered; after they reach a certain size, they are stored as a sequence of SSTables (persistent storage in GFS).
93
Interactions between GFS and BigTable
Reads: information can be in the memtable or in SSTables, so we need to consider how to avoid stale information; all tables are sorted, so it is easy to find the most recent version. Recovery: to recover a tablet, the tablet server reconstructs the memtable by reading its metadata and redo points.
94
API BigTable APIs provide functions for:
Creating/deleting tables and column families; changing cluster, table, and column family metadata such as access control rights. Client applications can write or delete values, look up values from individual rows, and iterate over a subset of the data. The API also supports single-row transactions, allows cells to be used as integer counters, and can execute client-supplied scripts in the address space of the servers.
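The paper presents this API in C++; the sketch below mimics the same flow with a toy in-memory table so it can actually run. Every class and method name here (RowMutation, Table, apply, scan) is invented for illustration and is not Google's client library:

```python
class RowMutation:
    """Collects set/delete operations that will be applied atomically to one row."""
    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))

class Table:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def apply(self, mutation):
        # All ops in one RowMutation hit a single row, so they are trivially atomic here.
        row = self.rows.setdefault(mutation.row_key, {})
        for op, column, value in mutation.ops:
            if op == "set":
                row[column] = value
            else:
                row.pop(column, None)

    def scan(self, start, end):
        # Iterate over a lexicographic row range, as a real scanner would.
        for key in sorted(self.rows):
            if start <= key < end:
                yield key, self.rows[key]

table = Table()
m = RowMutation("com.cnn.www")
m.set("anchor:cnnsi.com", "CNN")
m.delete("anchor:www.abc.com")
table.apply(m)

for key, columns in table.scan("com.cnn.", "com.cnn.z"):
    print(key, columns)  # com.cnn.www {'anchor:cnnsi.com': 'CNN'}
```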
95
Why use BigTable? Scale is Large
More than 100 TB of satellite image data. Millions of users and thousands of queries per second, with latency to manage. Billions of URLs and billions and billions of pages, each page with many versions.
96
Why not any other Database?
An in-house solution is cheaper. The scale is too large for most databases, and the cost is too high. The same system can be used across different projects, which again lowers the cost. With relational databases we expect ACID transactions, but it is impossible to guarantee consistency while also providing high availability and network partition tolerance (the CAP theorem).
97
CAP (figure slides illustrating the CAP theorem)
99
Application Design Reminders
The timestamp is an int64, so the application needs to plan for the case where multiple clients update the same cell at the same time. At the application level, you need to know the data structures supported by GFS, to avoid format conversions.
100
Google Services using BigTable
Used as a database by: Google Analytics, Google Earth, Google App Engine Datastore, Google Personalized Search.
101
BigTable Derivatives Apache HBase, a database built to run on top of the Hadoop Distributed File System (HDFS). Cassandra, which originated at Facebook Inc. Hypertable, an open source technology and an alternative to HBase.
102
Colossus GFS is more suited for batch operations
Colossus is a revamped file system suited for real-time operations. It is used by a newer search infrastructure called "Caffeine", which enables Google to update its search index in real time. In Colossus there are many masters operating at the same time. A number of changes have already been made to open-source Hadoop to make it look more like Colossus.
103
Comparison (figure-only slides)
115
Google App Engine Presented by: Ayesha Fawad 10/07/2014
116
Introduction Also known as GAE or App Engine
Preview started in April 2008; came out of preview in September 2011. It is a PaaS (platform as a service) that allows developing and hosting web applications in Google-managed data centers. The default choice for storage is a NoSQL solution.
117
Introduction Language independent, with plans to support more languages
Automatic scaling: automatically allocates more resources to handle additional demand. It is free up to a certain level of resources (storage, bandwidth, or instance hours) required by the application. Does not allow joins.
118
Introduction Applications are Sandboxed across multiple servers
Sandboxing is a security mechanism for running untested code with restricted resources, for the safety of the host system. Reliable: a Service Level Agreement of 99.5% uptime; can sustain multiple data center failures.
119
GAE Data store It is built on top of BigTable
Follows a hierarchical structure. It is a schema-less object data store designed to scale for high performance. Queries are served from pre-built indexes. Entities of the same kind are not required to have the same set of properties.
120
Does Not Support Join operations
Inequality filtering on multiple properties. Filtering data based on the results of a subquery.
121
Entities Also known as Objects in App Engine Data store
Each entity is uniquely identified by its own key. An entity's key path begins with a root entity and proceeds from parent to child. Every entity belongs to an entity group.
122
Models Model is the superclass for data model definitions, defined in google.appengine.ext.db. Entities of a given kind are represented by instances of the corresponding model class, as in the sketch below.
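A small example in the old Python App Engine SDK style. db.Model and the property classes are the real google.appengine.ext.db API; the Greeting kind and its properties are made up for illustration:

```python
from google.appengine.ext import db

class Greeting(db.Model):
    """Each instance becomes one datastore entity of kind 'Greeting'."""
    author = db.StringProperty()
    content = db.TextProperty(required=True)
    date = db.DateTimeProperty(auto_now_add=True)

greeting = Greeting(author="ayesha", content="Hello, BigTable!")
key = greeting.put()          # writes the entity and returns its key
fetched = Greeting.get(key)   # reads it back by key
```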
123
Queries A Data store Query retrieves
entities from the data store; a query operates on entity values and keys to meet a specified set of conditions. The Datastore API provides a Query class for constructing queries and a PreparedQuery class for fetching and returning entities from the data store. Filters and sort orders can be applied to queries, as in the example below.
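Continuing the illustrative Greeting model from the previous sketch, a filtered and sorted query built with Model.all() (a real google.appengine.ext.db call) looks like this:

```python
# Query via Model.all(): filter on a property, sort descending, fetch one page.
query = Greeting.all().filter("author =", "ayesha").order("-date")
for g in query.fetch(limit=10):   # returns up to 10 Greeting entities
    print(g.author, g.date, g.content)
```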
124
Indexes An index is defined on a list of properties of an entity kind. An index table contains a column for every property specified in the index's definition. The data store identifies the index that corresponds to the query's kind, filter properties, filter operators, and sort orders. App Engine predefines an index on each property of each kind; these indexes are sufficient for simple queries.
125
GQL GQL is a SQL-like language for retrieving entities or keys from the App Engine data store, for example as shown below.
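The same query expressed in GQL through db.GqlQuery (a real API); the Greeting kind and property names remain the illustrative ones used above:

```python
from google.appengine.ext import db

# :1 is a bound parameter, so no string interpolation is needed.
query = db.GqlQuery(
    "SELECT * FROM Greeting WHERE author = :1 ORDER BY date DESC", "ayesha")
for greeting in query.fetch(10):
    print(greeting.content)
```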
126
Transactions A transaction is a set of data store operations on one or more entities. It is atomic, meaning transactions are never partially applied, and it provides isolation and consistency. Transactions are required when users may attempt to create or update an entity with the same string ID at the same time. It is also possible to queue transactions. A sketch follows below.
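A get-or-create sketch using the real db.run_in_transaction helper; the Greeting kind and key name are the illustrative ones from earlier:

```python
from google.appengine.ext import db

def get_or_create(key_name, author):
    """Either sees the existing entity or creates it, never a partial mix of both."""
    greeting = Greeting.get_by_key_name(key_name)
    if greeting is None:
        greeting = Greeting(key_name=key_name, author=author, content="(empty)")
        greeting.put()
    return greeting

# The whole function runs atomically; on contention the datastore retries it.
entity = db.run_in_transaction(get_or_create, "welcome-post", "ayesha")
```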
127
Data store Software Stack
128
Data store Software Stack
App Engine Datastore: schema-less storage, advanced query engine. Megastore: multi-row transactions, simple indexes/queries, strict schema. BigTable: distributed key/value store, on top of a next-generation distributed file system.
129
GUI https://appengine.google.com
Everything done through the console can also be done through the command line (appcfg)
130
GUI The console is organized into sections: Main, Data, Administration, Billing
131
GUI (Main) Dashboard: shows all metrics related to your application, including versions, resources and usage, and much more
132
GUI (Main) Instances: total number of instances, availability (e.g. dynamic), average latency, average memory, and much more
133
GUI (Main) Logs: detailed information that helps in resolving any issue, and much more
134
GUI (Main) Versions: number of versions, the default version setting, deployment information, the ability to delete a specific version, and much more
135
GUI (Main) Backends: like a worker role, a piece of business logic that does not have a user interface, and much more
136
GUI (Main) Cron Jobs: time-based jobs, which can be defined in an XML or YAML file, and much more
137
GUI (Main) Task Queues: you can create multiple task queues; the first one automatically becomes the default; they can be defined in an XML or YAML file; and much more
138
GUI (Main) Quota Details: detailed metrics of the resources being used, e.g. storage, memcache, mail, etc.; shows the daily quota and rate details of what the client is billed for; and much more
139
Data store Options High-Replication: uses the Paxos algorithm
Multi-master reads and writes provide the highest level of availability (99.999% SLA). Certain queries will be eventually consistent, and there is some latency due to multi-master writing. Reads come from the fastest (local) source and are transactional.
140
Data store Options Master/Slave
Offers strong consistency over availability for all reads and queries: data is written to a single master data center and then replicated asynchronously to the other (slave) data centers. 99.9% SLA; reads come from the master only.
141
Competitors App Engine offers better infrastructure for hosting applications in terms of administration and scalability, while other hosting services offer more flexibility for applications in terms of languages and configuration.
142
Hard Limits
143
Free Quotas
144
Cloud Data Storage Options
145
References Reference to Bigtable Reference to Google File System