Lecture 6. NoSQL and Bigtable

COSC6376 Cloud Computing
Lecture 6. NoSQL and Bigtable
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Outline: NoSQL, HW2, Bigtable

SQL vs. NoSQL
Data storage
  SQL: Follows the relational model, which consists of rows and columns. Each row holds all the information about one specific entity, and each column is a distinct data point.
  NoSQL: Follows a non-relational model, i.e. the data is not stored in tabular form but in small chunks termed collections. A collection can hold graphs, key-value pairs, or documents.
Schemas and flexibility
  SQL: Schemas are fixed and defined before data entry.
  NoSQL: Schemas are dynamic and can be altered at runtime.
Scalability
  SQL: An RDBMS typically scales vertically, by adding CPU, memory, or storage to a single server, which is considered expensive.
  NoSQL: A non-RDBMS typically scales horizontally: data is spread across many cheap servers, which can be cost-effective.
ACID compliance (Atomicity, Consistency, Isolation, Durability)
  SQL: RDBMSs are generally ACID compliant.
  NoSQL: NoSQL systems are generally not ACID compliant.
CAP theorem (Consistency, Availability, Partition tolerance)
  SQL: Traditional RDBMSs generally do not expose the CAP trade-off as a design choice.
  NoSQL: NoSQL databases often let you choose which two of the three properties to prioritize.

Pros
Mostly open source
Horizontal scalability: there is no need for complex joins, and data can be easily sharded and processed in parallel
Support for Map/Reduce: a simple paradigm that allows computation to scale across a cluster of computing nodes
No need to develop a fine-grained data model, which saves development time
Very fast for adding new data and for simple operations/queries
No need to make significant changes in code when the data structure is modified
Ability to store complex data types (for document-based solutions) in a single item of storage

Cons
Immaturity: still lots of rough edges
Possible database administration issues: NoSQL often sacrifices features that are present in SQL solutions “by default” for the sake of performance
No indexing support in many solutions: some, like MongoDB, do have indexes, but they are not as powerful as in SQL solutions
No ACID
Complex consistency models, such as eventual consistency
The CAP theorem states that it is not possible to achieve consistency, availability, and partition tolerance at the same time
NoSQL vendors are trying to make their solutions as fast as possible, and consistency is the most typical trade-off

Types of noSQL

OpenStreetMap

OSM Size and Growth
Current data: c. 0.5 to 1 TB
Current and historical data: 5.15 TB, growing at 1 TB per annum (source: Planet OSM)
OSM Historical: every version ever of everything in the database, including now-deleted items
[Figure: hardware growth. Source: OSM]

NoSQL Spatial
Implementations that add spatial capabilities to NoSQL databases:
  SpatialHadoop, Hadoop GIS, ESRI tools for Hadoop
  SpatialSpark, GeoTrellis
  GeoMesa, GeoWave
  MongoDB (extension)
  GeoCouch
Geographic data is problematic for databases because it has at least two dimensions, X and Y. Imagine a list of names or numbers: it is easy to order them because the key is one-dimensional. If I have a list of keys from 1 to 1 billion, I know that 1 to 10 million are on computer A, the next block is on computer B, and so on, and I can fetch 1 to 500,000 easily because they are all on the same computer. That does not work directly for two-dimensional space, so the 2D coordinates need to be converted into a one-dimensional key for efficiency.
Space-filling curves map 2D onto 1D so that we do not need to query every node in the cluster when we want to query a geographic area.
[Figure: Z-order curve on the left, Hilbert curve on the right]
Geohashing is a form of Z-order curve.
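The Z-order idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interleave_bits` and the 4 x 4 grid are made up for the example, and real systems such as geohash libraries quantize longitude/latitude first and encode the result differently.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) code of the integer grid cell (x, y)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return code


# Sort the cells of a small 4 x 4 grid by their one-dimensional Z-order key.
cells = [(x, y) for y in range(4) for x in range(4)]
z_ordered = sorted(cells, key=lambda c: interleave_bits(*c))
```

Cells that are close together in 2D tend to receive nearby keys, so a range scan over the key covers a compact area instead of touching every node.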

Yelp Dataset Containing 1,100k reviews of 42k businesses written by 190k users in five cities, namely Phoenix, Las Vegas, Madison, Waterloo in Canada, and Edinburgh in the UK, over 9 years from 2005 to present.

Yelp Dataset

Map Yelp Reviews

Tools https://github.com/Yelp/dataset-examples

Bigtable Fay Chang, et

Global Picture

Why Bigtable?
RDBMS performance is good for transaction processing, but for very large scale analytic processing the solutions are commercial, expensive, and specialized.
Very large scale analytic processing:
  Big queries, typically range or table scans
  Big databases (100s of TB)

Why Bigtable? (2)
MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution.
Sharding is not a solution for scaling open-source RDBMS platforms:
  Application specific
  Labor-intensive (re)partitioning

Bigtable
BigTable is a distributed storage system for managing data.
Designed to scale to a very large size: petabytes of data across thousands of servers
Used for many Google projects: Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, …
Flexible, high-performance solution for all of Google’s products

BigTable
Distributed multi-level map
Fault-tolerant, persistent
Scalable:
  Thousands of servers
  Terabytes of in-memory data
  Petabytes of disk-based data
  Millions of reads/writes per second, efficient scans
Self-managing:
  Servers can be added/removed dynamically
  Servers adjust to load imbalance
Often want to examine data changes over time, e.g. contents of a web page over multiple crawls

Building Blocks
Building blocks:
  Google File System (GFS): raw storage
  Scheduler: schedules jobs onto machines
  Lock service: distributed lock manager
  MapReduce: simplified large-scale data processing
BigTable's use of the building blocks:
  GFS: stores persistent data (SSTable file format for storage of data)
  Scheduler: schedules jobs involved in BigTable serving
  Lock service: master election
  MapReduce: often used to read/write BigTable data

Basic Data Model
A BigTable is a sparse, distributed, persistent, multi-dimensional sorted map:
(row, column, timestamp) -> cell contents
Good match for most Google applications

WebTable Example Want to keep copy of a large collection of web pages and related information Use URLs as row keys Various aspects of web page as column names Store contents of web pages in the contents: column under the timestamps when they were fetched.
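As a rough illustration of the (row, column, timestamp) -> cell contents map, here is a toy in-memory version in Python. The class `SparseTable`, its methods, and the sample row key and values are assumptions made for this sketch; it says nothing about how Bigtable itself is implemented.

```python
class SparseTable:
    """Toy sparse map: (row, column, timestamp) -> cell contents."""

    def __init__(self):
        self.rows = {}  # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, timestamp, value):
        # Rows and columns are created implicitly on first write.
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the value at `timestamp`, or the most recent one if omitted."""
        versions = self.rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self):
        """Yield cells in sorted row and column order, newest version first."""
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                for ts in sorted(self.rows[row][column], reverse=True):
                    yield row, column, ts, self.rows[row][column][ts]


# Hypothetical WebTable-style rows: URL as row key, page contents per fetch time.
webtable = SparseTable()
webtable.put("com.cnn.www", "contents:", 3, "<html>v1</html>")
webtable.put("com.cnn.www", "contents:", 5, "<html>v2</html>")
webtable.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
```

Only cells that were written take up space, which is what "sparse" means here.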

Rows
Name is an arbitrary string
Access to data in a row is atomic
Row creation is implicit upon storing data
Rows are ordered lexicographically
Rows close together lexicographically are usually on one or a small number of machines

Rows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines. Can exploit this property by selecting row keys so they get good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
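The effect of the row-key choice can be checked with a short Python sketch. The helper name `row_key` is an assumption for this example; the hostnames are the ones from the slide.

```python
def row_key(hostname: str) -> str:
    """Reverse the dot-separated components of a hostname."""
    return ".".join(reversed(hostname.split(".")))


hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]

# Sorting plain hostnames interleaves the two universities.
plain_order = sorted(hosts)

# Sorting reversed hostnames keeps each university's pages adjacent.
reversed_order = sorted(row_key(h) for h in hosts)
```

With reversed keys, a scan over the prefix edu.gatech covers a single contiguous row range.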

Columns
Columns have a two-level name structure: family:optional_qualifier
Column family:
  Unit of access control
  Has associated type information
Qualifier gives unbounded columns
  Additional levels of indexing, if desired

Timestamps
Used to store different versions of data in a cell
New writes default to the current time, but timestamps for writes can also be set explicitly by clients
Lookup options:
  “Return most recent K values”
  “Return all values in timestamp range (or all values)”
Column families can be marked with attributes:
  “Only retain most recent K values in a cell”
  “Keep values until they are older than K seconds”
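The two retention attributes can be sketched as simple filters over the versions of one cell. The function names and the sample timestamps are assumptions for illustration; they are not Bigtable settings or API calls.

```python
def retain_most_recent(versions, k):
    """Keep the k newest (timestamp, value) pairs, newest first."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:k]


def retain_newer_than(versions, max_age_seconds, now):
    """Keep pairs whose age (now - timestamp) is at most max_age_seconds."""
    return [(ts, v) for ts, v in versions if now - ts <= max_age_seconds]


# One cell with four versions; timestamps are in seconds and purely illustrative.
cell = [(100, "a"), (160, "b"), (220, "c"), (300, "d")]
latest_two = retain_most_recent(cell, 2)
recent = retain_newer_than(cell, max_age_seconds=100, now=310)
```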

API
Metadata operations:
  Create/delete tables and column families, change metadata
Writes (atomic):
  Set(): write cells in a row
  DeleteCells(): delete cells in a row
  DeleteRow(): delete all cells in a row
Reads:
  Scanner: read arbitrary cells in a bigtable
  Each row read is atomic
  Can restrict returned rows to a particular range
  Can ask for data from just 1 row, all rows, etc.
  Can ask for all columns, just certain column families, or specific columns

API Examples: Write/Modify
atomic row modification
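A minimal Python sketch of what an atomic row modification means: several edits to one row are collected and then applied as a single step. The classes `Row` and `RowMutation` and the column names are hypothetical stand-ins written for this example; this is not the Bigtable client library.

```python
import threading


class Row:
    """One table row: a dict of column -> value guarded by a lock."""

    def __init__(self):
        self.cells = {}
        self.lock = threading.Lock()


class RowMutation:
    """Collects edits to one row and applies them as a single atomic step."""

    def __init__(self, row):
        self.row = row
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))

    def apply(self):
        # While the lock is held, readers that also take the lock
        # never observe a half-applied mutation.
        with self.row.lock:
            for kind, column, value in self.ops:
                if kind == "set":
                    self.row.cells[column] = value
                else:
                    self.row.cells.pop(column, None)
        self.ops = []


row = Row()
row.cells["anchor:www.abc.com"] = "ABC"

mutation = RowMutation(row)
mutation.set("anchor:www.c-span.org", "CNN")   # write a new anchor
mutation.delete("anchor:www.abc.com")          # delete an old anchor
mutation.apply()
```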

API Examples: Read
Return sets can be filtered using regular expressions: anchor: com.cnn.*
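Filtering a return set by regular expression might look like the following Python sketch. The function `scan_columns`, the sample row, and the exact pattern are assumptions made for illustration and do not reproduce the Bigtable scanner API.

```python
import re


def scan_columns(row, pattern):
    """Return the {column: value} entries of `row` whose full name matches `pattern`."""
    regex = re.compile(pattern)
    return {col: val for col, val in row.items() if regex.fullmatch(col)}


# Hypothetical row: anchors from several referring sites plus the page contents.
row = {
    "anchor:com.cnn.www": "CNN",
    "anchor:com.cnn.money": "CNN Money",
    "anchor:org.c-span.www": "C-SPAN",
    "contents:": "<html>...</html>",
}
cnn_anchors = scan_columns(row, r"anchor:com\.cnn\..*")
```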

HBase is an open-source, distributed, column-oriented database built on top of HDFS based on BigTable!

HBase is .. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.

Why HBase?
HBase is a Bigtable clone.
It is open source.
It has a good community and promise for the future.
It is developed on top of the Hadoop platform and integrates well with it, which helps if you are using Hadoop already.

HBase Is Not …
No join operators.
Limited atomicity and transaction support: HBase supports multiple batched mutations of single rows only.
Data is unstructured and untyped.
Not accessed or manipulated via SQL: programmatic access via Java, REST, or Thrift APIs; scripting via JRuby.

HBase benefits over RDBMS
No real indexes
Automatic partitioning
Scales linearly and automatically with new nodes
Commodity hardware
Fault tolerance
Batch processing

Testing
$ hbase shell
> create 'test', 'data'
0 row(s) in seconds
> list
test
1 row(s) in seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp= , value=value1
row2 column=data:2, timestamp= , value=value2
row3 column=data:3, timestamp= , value=value3
3 row(s) in seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in seconds
> list
0 row(s) in seconds

Connecting to HBase
Java client:
  get(byte [] row, byte [] column, long timestamp, int versions);
Non-Java clients:
  Thrift server hosting an HBase client instance; sample Ruby, C++, and Java (via Thrift) clients
  REST server hosts an HBase client
TableInput/OutputFormat for MapReduce:
  HBase as MR source or sink
HBase Shell:
  ./bin/hbase shell YOUR_SCRIPT

Bigtable Applications

Application 1: Google Analytics
Enables webmasters to analyze traffic patterns at their web sites. Statistics such as:
  Number of unique visitors per day and page views per URL per day
  Percentage of users who made a purchase given that they earlier viewed a specific page
How? A small JavaScript program that the webmaster embeds in their web pages. Every time the page is visited, the program is executed. The program records the following information about each request:
  User identifier
  The page being fetched

Application 1: Google Analytics
Two of the Bigtables:
Raw click table (~200 TB)
  A row for each end-user session.
  Row name includes the website’s name and the time at which the session was created.
  This clusters sessions that visit the same web site and keeps them in chronological order.
  Compression factor of 6-7.
Summary table (~20 TB)
  Stores predefined summaries for each web site.
  Generated from the raw click table by periodically scheduled MapReduce jobs.
  Each MapReduce job extracts recent session data from the raw click table.
  Row name includes the website’s name, and the column family is the aggregate summaries.
  Compression factor is 2-3.

Application 2: Google Earth & Maps
Functionality: pan, view, and annotate satellite imagery at different resolution levels.
One Bigtable stores raw imagery (~70 TB):
  Row name is a geographic segment. Names are chosen to ensure adjacent geographic segments are clustered together.
  Column family maintains sources of data for each segment.

Google File System
Large-scale distributed “filesystem”
Master: responsible for metadata
Chunk servers: responsible for reading and writing large chunks of data
Chunks are replicated on 3 machines; the master is responsible for ensuring replicas exist

SSTable
Immutable, sorted file of key-value pairs
Chunks of data plus an index
  Index is of block ranges, not values
[Figure: an SSTable made of 64K blocks plus an index and a Bloom filter]
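A block index over a sorted file can be sketched as follows. The class `SSTable` below is a toy written for this example, with tiny in-memory blocks standing in for 64K blocks on disk; indexing each block by its last key is an assumption of the sketch, not a statement about the real file format.

```python
from bisect import bisect_left


class SSTable:
    """Toy immutable sorted table: blocks of key-value pairs plus a block index."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        self.blocks = tuple(
            tuple(pairs[i:i + block_size]) for i in range(0, len(pairs), block_size)
        )
        # The index holds one entry per block (its last key), not one per value.
        self.index = [block[-1][0] for block in self.blocks]

    def get(self, key):
        i = bisect_left(self.index, key)   # first block whose last key is >= key
        if i == len(self.blocks):
            return None                    # key is beyond the last block
        for k, v in self.blocks[i]:        # "read" that single block
            if k == key:
                return v
        return None


table = SSTable({"aardvark": 1, "apple": 2, "boat": 3, "cat": 4, "dog": 5})
```

A lookup binary-searches the small index and then reads one block, instead of scanning the whole file.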

Tablet
Contains some range of rows of the table
Built out of multiple SSTables
[Figure: a tablet with start key aardvark and end key apple, built from two SSTables, each made of 64K blocks plus an index]

Table
Multiple tablets make up the table
SSTables can be shared
Tablets do not overlap, SSTables can overlap
[Figure: two tablets, covering aardvark to apple and apple_two_E to boat, over four SSTables]
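Because tablets cover non-overlapping row ranges, finding the tablet for a row key is a binary search over the range boundaries. The split points, tablet names, and the sentinel end key below are hypothetical and chosen only to echo the keys in the figure.

```python
from bisect import bisect_left

# Each tablet is identified by its last (inclusive) row key; ranges do not overlap.
# "\uffff" is a made-up sentinel meaning "end of table" for this sketch.
tablet_end_keys = ["apple", "boat", "\uffff"]
tablet_names = ["tablet-0", "tablet-1", "tablet-2"]


def find_tablet(row_key: str) -> str:
    """Return the tablet whose row range contains row_key."""
    return tablet_names[bisect_left(tablet_end_keys, row_key)]
```

For example, `find_tablet("apple_two_E")` lands in the second tablet, since that key sorts after "apple" and before "boat".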

Chubby
A persistent and distributed lock service.
Consists of 5 active replicas; one replica is the master and serves requests.
The service is functional when a majority of the replicas are running and in communication with one another, i.e. when there is a quorum.
Implements a name service that consists of directories and files.

Bigtable and Chubby
Bigtable uses Chubby to:
  Ensure there is at most one active master at a time
  Store the bootstrap location of Bigtable data (root tablet)
  Discover tablet servers and finalize tablet server deaths
  Store Bigtable schema information (column family information)
  Store access control lists
If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

Tablet Assignment
Each tablet is assigned to one tablet server at a time. The master server keeps track of the set of live tablet servers and the current assignments of tablets to servers. It also keeps track of unassigned tablets. When a tablet is unassigned, the master assigns it to a tablet server with sufficient room.

Bigtable Master
Assigns tablets to tablet servers
Detects addition and expiration of tablet servers
Balances tablet server load; tablets are distributed randomly on nodes of the cluster for load balancing
Handles garbage collection
Handles schema changes

Bigtable Tablet Servers
Each tablet server manages a set of tablets
  Typically between ten and a thousand tablets
  Each MB by default
Handles read and write requests to the tablets
Splits tablets that have grown too large
Master is responsible for load balancing and fault tolerance
  Uses Chubby to monitor the health of tablet servers and restart failed servers

A 3-level Hierarchy
1st level: a file stored in Chubby contains the location of the root tablet, i.e. a directory of ranges (tablets) and associated metadata. The root tablet never splits.
2nd level: each metadata tablet contains the location of a set of user tablets.
3rd level: a set of SSTable identifiers for each tablet.

A 3-level Hierarchy
Each metadata row stores ~1 KB of data.
With 128 MB tablets, the three-level scheme addresses $2^{34}$ tablets ($2^{61}$ bytes in 128 MB tablets).
That is about 2.3 exabytes (roughly 2,300 petabytes).
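The addressing arithmetic can be checked directly, assuming exactly 1 KB per metadata row and 128 MB per tablet as on the slide (the variable names are made up for this check):

```python
KB = 2**10
MB = 2**20

tablet_size = 128 * MB          # 2**27 bytes per tablet
metadata_row_size = 1 * KB      # ~1 KB of location data per tablet

# One metadata tablet can describe this many tablets.
rows_per_metadata_tablet = tablet_size // metadata_row_size     # 2**17

# The root tablet points at 2**17 metadata tablets, each pointing at 2**17 user tablets.
user_tablets = rows_per_metadata_tablet ** 2                     # 2**34

addressable_bytes = user_tablets * tablet_size                   # 2**61
addressable_exabytes = addressable_bytes / 10**18                # decimal exabytes
```

So the hierarchy addresses 2^34 tablets, or 2^61 bytes, which is a little over 2 exabytes.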

Editing a Table
Mutations are logged, then applied to an in-memory version (the memtable)
Logfile stored in GFS
[Figure: inserts and deletes flowing into the memtable of a tablet covering apple_two_E to boat, backed by two SSTables]

Tablet Serving “Log Structured Merge Trees”
Image Source: Chang et al., OSDI 2006

Tablet Representation
[Figure: writes go to an append-only log on GFS and to a random-access write buffer in memory; reads combine the write buffer with the SSTables on GFS]
SSTable: immutable on-disk ordered map from string -> string
String keys: <row, column, timestamp> triples

Client Write & Read Operations
A write operation arrives at a tablet server:
  The server ensures the client has sufficient privileges for the write operation (access control, Chubby)
  A log record is generated to the commit log file
  Once the write commits, its contents are inserted into the memtable
A read operation arrives at a tablet server:
  The server ensures the client has sufficient privileges for the read operation (Chubby)
  The read is performed on a merged view of (a) the SSTables that constitute the tablet and (b) the memtable
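The write and read paths can be sketched with a toy tablet server in Python. Everything here is a simplification under stated assumptions: the class `ToyTabletServer` is hypothetical, a list stands in for the commit log in GFS, plain dicts stand in for SSTables, and the privilege check is omitted.

```python
class ToyTabletServer:
    """Toy model of one tablet: commit log + memtable + immutable SSTables."""

    def __init__(self, sstables=None):
        self.commit_log = []                  # stand-in for the log file in GFS
        self.memtable = {}                    # newest data, held in memory
        self.sstables = list(sstables or [])  # dicts standing in for SSTables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the mutation in the log
        self.memtable[key] = value            # 2. then insert it into the memtable

    def read(self, key):
        # Merged view: the memtable wins, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None


server = ToyTabletServer(sstables=[{"a": "old", "b": "1"}])
server.write("a", "new")
```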

Write Operations
As writes execute, the size of the memtable increases.
Once the memtable reaches a threshold:
  The memtable is frozen
  A new memtable is created
  The frozen memtable is converted to an SSTable and written to GFS

Compactions
Minor compaction:
  Converts the memtable into an SSTable
  Reduces memory usage and log traffic on restart
Merging compaction:
  Reads the contents of a few SSTables and the memtable, and writes out a new SSTable
  Reduces the number of SSTables
Major compaction:
  Merging compaction that results in only one SSTable
  No deletion records, only live data
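The three compaction types can be sketched over toy SSTables represented as sorted tuples of pairs. The function names and the `TOMBSTONE` marker are assumptions for this example; real compactions stream data from disk rather than building dicts in memory.

```python
TOMBSTONE = object()   # marker left behind by a delete until a major compaction


def minor_compaction(memtable):
    """Freeze a memtable (dict) into a sorted, immutable SSTable (tuple of pairs)."""
    return tuple(sorted(memtable.items(), key=lambda kv: kv[0]))


def merging_compaction(sstables):
    """Merge SSTables (oldest first) into one; newer entries override older ones."""
    merged = {}
    for table in sstables:
        merged.update(dict(table))
    return tuple(sorted(merged.items(), key=lambda kv: kv[0]))


def major_compaction(sstables):
    """Merge everything into a single SSTable and drop deletion records."""
    return tuple(
        (k, v) for k, v in merging_compaction(sstables) if v is not TOMBSTONE
    )


older = minor_compaction({"a": 1, "b": 2})
newer = minor_compaction({"b": TOMBSTONE, "c": 3})   # "b" was deleted later
```

A merging compaction of `older` and `newer` still carries the deletion record for "b"; only the major compaction can discard it, because no older SSTable remains that might still hold the deleted value.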

Refinements: Locality Groups
Can group multiple column families into a locality group
A separate SSTable is created for each locality group in each tablet
Segregating column families that are not typically accessed together enables more efficient reads
In WebTable, page metadata can be in one group and the contents of the page in another group

Refinements: Compression
Many opportunities for compression:
  Similar values in the same row/column at different timestamps
  Similar values in different columns
  Similar values across adjacent rows
Two-pass custom compression scheme:
  First pass: compress long common strings across a large window
  Second pass: look for repetitions in a small window
Speed emphasized, but good space reduction (10-to-1)

Refinements: Bloom Filters
A read operation has to read from disk when the desired SSTable isn’t in memory
Reduce the number of accesses by specifying a Bloom filter
  Allows us to ask if an SSTable might contain data for a specified row/column pair
  A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations
  Use implies that most lookups for non-existent rows or columns do not need to touch disk

Bloom Filters

Approximate set membership problem
Suppose we have a set $S = \{s_1, s_2, \ldots, s_m\} \subseteq U$, where $U$ is the universe
Represent $S$ in such a way that we can quickly answer “Is $x$ an element of $S$?”
To take as little space as possible, we allow false positives (i.e. $x \notin S$, but we answer yes)
If $x \in S$, we must answer yes

Bloom filters
Consist of an array $A[n]$ of $n$ bits (space) and $k$ independent random hash functions $h_1, \ldots, h_k : U \to \{0, 1, \ldots, n-1\}$
1. Initially set the array to 0
2. $\forall s \in S$, set $A[h_i(s)] = 1$ for $1 \le i \le k$ (an entry can be set to 1 multiple times; only the first time has an effect)
3. To check if $x \in S$, we check whether all locations $A[h_i(x)]$ for $1 \le i \le k$ are set to 1
   If not, clearly $x \notin S$
   If all $A[h_i(x)]$ are set to 1, we assume $x \in S$

[Figure: a bit array that is initially all 0. Each element of S (x1, x2) is hashed k times, and each hash location is set to 1.]

If only 1s appear, conclude that y is in S
This may yield a false positive
[Figure: the bit array with the bits set by x1 and x2]
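The three steps can be written out as a small Python class. This is a sketch, not a tuned implementation: the class name `BloomFilter` is made up, and deriving the k hash functions by salting SHA-256 with the index i is an assumption of this example that only approximates the "independent random hash functions" of the definition.

```python
import hashlib


class BloomFilter:
    """Bit array of n bits with k hash functions; no false negatives."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits               # step 1: all bits start at 0

    def _positions(self, item: str):
        # Assumption: hash function i is SHA-256 over the string "i:item".
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item: str) -> None:
        for pos in self._positions(item):      # step 2: set A[h_i(s)] = 1
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Step 3: answer "yes" only if every A[h_i(x)] is 1.
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter(n_bits=1024, k=3)
bloom.add("row1/data:1")
bloom.add("row2/data:2")
```

A "no" from `might_contain` is always correct, so the disk read can be skipped; a "yes" may be a false positive and still requires reading the SSTable.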

BigTable – Bloom Filters
Drastically reduces the number of disk seeks required for read operations !

Benchmarks
