COSC6376 Cloud Computing Lecture 6. NoSQL and Bigtable Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston
Outline NoSQL HW2 Bigtable
SQL vs. NoSQL
Data Storage. SQL: follows the relational model, which comprises rows and columns. Each row holds all the information about one specific entity, and columns are distinct data points. NoSQL: follows a non-relational model, i.e. the data is not stored in tabular form; instead it is stored in units such as collections, which may be graphs, key-value pairs, or documents.
Schemas and Flexibility. SQL: schemas are fixed before data entry. NoSQL: schemas are dynamic and can be altered at runtime.
Scalability. SQL: scaling an RDBMS is typically vertical, which means moving to a more powerful single server, and so is considered expensive. NoSQL: scaling is horizontal; data is spread across multiple cheap servers, which can be cost-effective.
ACID Compliance [Atomicity, Consistency, Isolation, Durability]. SQL: most RDBMSs are ACID compliant. NoSQL: many NoSQL systems relax ACID guarantees.
CAP Theorem [Consistency, Availability, Partition Tolerance]. SQL: a traditional RDBMS generally favors consistency and does not expose the CAP trade-off as a choice. NoSQL: NoSQL databases let you choose between consistency and availability when a network partition occurs, as per the theorem.
Pros Mostly open source. Horizontal scalability: there is no need for complex joins, and data can be easily sharded and processed in parallel. Support for Map/Reduce: it is a simple paradigm that allows computation to scale across a cluster of computing nodes. No need to develop a fine-grained data model: it saves development time. Very fast for adding new data and for simple operations/queries. No need to make significant changes in code when the data structure is modified. Ability to store complex data types (for document-based solutions) in a single item of storage.
Cons Immaturity: still lots of rough edges. Possible database administration issues: NoSQL often sacrifices features that are present in SQL solutions “by default” for the sake of performance. Little or no indexing support: some solutions like MongoDB have indexes, but they are not as powerful as in SQL solutions. No ACID. Complex consistency models: eventual consistency. The CAP theorem states that it is not possible to achieve consistency, availability, and partition tolerance at the same time. NoSQL vendors are trying to make their solutions as fast as possible, and consistency is the most typical trade-off.
Types of noSQL
HW2
OpenstreetMap
OSM Size and Growth Current Data – c. 0.5 – 1 TB Current and Historical Data – 5.15TB Growing at 1TB per annum Source: Planet OSM http://planet.openstreetmap.org OSM Historical – every version ever of everything in the database, including now deleted items Hardware growth Source: OSM http://munin.openstreetmap.org/openstreetmap/katla.openstreetmap/postgres_size_openstreetmap_9_1_main.html
noSQL Spatial Implementations that add spatial capabilities to NoSQL databases: SpatialHadoop, Hadoop GIS, ESRI tools for Hadoop; SpatialSpark, GeoTrellis; Geomesa, Geowave; MongoDB (extension); Geocouch. Geographic data is problematic for databases because it is at minimum 2-dimensional (X and Y). Imagine a list of names or numbers: it is easy to order them because the key is one-dimensional. So if I have keys from 1 to 1 billion, I know that 1-10 million are on computer A, 10-20 million on B, etc., and I can get 1-500,000 easily as they are on the same computer. But I can't do that for 2-dimensional space, so we need to convert it into 1d space for efficiency. Space-filling curves try to map 2d onto 1d so that we don't need to query every node in the cluster if we want to query a geographic area. Figure: Z-order curve on left, Hilbert curve on right. Geohashing is a form of Z-order curve.
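The 2d-to-1d mapping can be illustrated with a minimal Z-order (Morton) sketch. It assumes non-negative integer grid coordinates; the function name `interleave_bits` and the 4x4 grid are illustrative choices, not part of any of the tools listed above.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


# Sort the cells of a 4x4 grid by their one-dimensional Z-order key.
cells = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(cells, key=lambda c: interleave_bits(*c))
print(ordered[:4])  # the four cells of the lower-left 2x2 quadrant come first
```

Cells that are close in 2d tend to get nearby keys, so a range partition on the key keeps many geographic queries on a small number of nodes.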
Yelp Dataset Containing 1,100k reviews of 42k businesses written by 190k users in five cities, namely Phoenix, Las Vegas, Madison, Waterloo in Canada, and Edinburgh in the UK, over 9 years from 2005 to present.
Yelp Dataset
Map Yelp Reviews
Tools https://github.com/Yelp/dataset-examples https://github.com/stev-0/osm-Hbase
Bigtable Fay Chang, et al @google.com
Global Picture
Why Bigtable? RDBMS performance is good for transaction processing, but for very large scale analytic processing the solutions are commercial, expensive, and specialized. Very large scale analytic processing means: big queries, typically range or table scans; big databases (100s of TB).
Why Bigtable? (2) MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution. Sharding is not a solution for scaling open source RDBMS platforms: it is application specific and requires labor-intensive (re)partitioning.
Bigtable BigTable is a distributed storage system for managing data. Designed to scale to a very large size Petabytes of data across thousands of servers Used for many Google projects Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, … Flexible, high-performance solution for all of Google’s products
BigTable Distributed multi-level map Fault-tolerant, persistent Scalable Thousands of servers Terabytes of in-memory data Petabyte of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically Servers adjust to load imbalance Often want to examine data changes over time E.g. Contents of a web page over multiple crawls
Building Blocks Building blocks: Google File System (GFS): raw storage. Scheduler: schedules jobs onto machines. Lock service: distributed lock manager. MapReduce: simplified large-scale data processing. BigTable's use of the building blocks: GFS: stores persistent data (SSTable file format for storage of data). Scheduler: schedules jobs involved in BigTable serving. Lock service: master election. MapReduce: often used to read/write BigTable data.
Basic Data Model A BigTable is a sparse, distributed, persistent multi-dimensional sorted map: (row, column, timestamp) -> cell contents. Good match for most Google applications.
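As a rough illustration of the sorted-map model, here is a single-process toy in Python. The class name `TinyTable` and its methods are hypothetical and are not Bigtable's API; the sketch only shows how keeping (row, column, timestamp) keys sorted makes newest-version lookups and row-range scans cheap.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy sorted map keyed by (row, column, timestamp)."""

    def __init__(self):
        self._keys = []    # kept sorted; timestamps negated so newest sorts first
        self._cells = {}   # (row, column, -timestamp) -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return (timestamp, value) for the newest version, or None."""
        i = bisect_left(self._keys, (row, column))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            key = self._keys[i]
            return (-key[2], self._cells[key])
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield (row, column, -neg_ts, self._cells[self._keys[i]])
            i += 1


t = TinyTable()
t.put("com.cnn.www", "contents:", 3, "<html>v3")
t.put("com.cnn.www", "contents:", 5, "<html>v5")
t.put("com.example", "contents:", 4, "<html>ex")
print(t.get("com.cnn.www", "contents:"))   # newest version of the cell
```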
WebTable Example Want to keep copy of a large collection of web pages and related information Use URLs as row keys Various aspects of web page as column names Store contents of web pages in the contents: column under the timestamps when they were fetched.
Rows Name is an arbitrary string. Access to data in a row is atomic. Row creation is implicit upon storing data. Rows are ordered lexicographically. Rows close together lexicographically are usually on one or a small number of machines.
Rows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines. Can exploit this property by selecting row keys so they get good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
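The effect of the reversed-hostname row keys in the example can be checked with a few lines of Python. The helper name `row_key` is an illustrative assumption.

```python
def row_key(hostname: str) -> str:
    """Reverse the dot-separated components so pages of one domain sort together."""
    return ".".join(reversed(hostname.split(".")))


hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
plain_order = sorted(hosts)                        # interleaves the two universities
reversed_order = sorted(row_key(h) for h in hosts) # groups each university together
print(plain_order)
print(reversed_order)
```

With reversed keys, all gatech rows are adjacent in the sorted order, so a scan over one domain touches a short row range.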
Columns Columns have a two-level name structure: family:optional_qualifier. Column family: unit of access control; has associated type information. Qualifier: gives unbounded columns; additional levels of indexing, if desired.
Timestamps Used to store different versions of data in a cell New writes default to current time, but timestamps for writes can also be set explicitly by clients Lookup options: “Return most recent K values” “Return all values in timestamp range (or all values)” Column families can be marked w/ attributes: “Only retain most recent K values in a cell” “Keep values until they are older than K seconds”
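The two retention attributes can be sketched as simple filters over the versions of one cell. This is a minimal illustration that assumes versions are held as a dict from timestamp (in seconds) to value; the function names are hypothetical.

```python
def keep_most_recent(versions: dict, k: int) -> dict:
    """Retain only the k newest versions of a cell."""
    newest = sorted(versions, reverse=True)[:k]
    return {ts: versions[ts] for ts in newest}


def keep_newer_than(versions: dict, now: int, max_age: int) -> dict:
    """Retain versions whose age (now - timestamp) is at most max_age seconds."""
    return {ts: v for ts, v in versions.items() if now - ts <= max_age}


cell = {1: "a", 2: "b", 3: "c"}
print(keep_most_recent(cell, 2))
print(keep_newer_than(cell, now=10, max_age=8))
```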
API Metadata operations: create/delete tables and column families, change metadata. Writes (atomic): Set(): write cells in a row. DeleteCells(): delete cells in a row. DeleteRow(): delete all cells in a row. Reads: Scanner: read arbitrary cells in a bigtable. Each row read is atomic. Can restrict returned rows to a particular range. Can ask for just data from 1 row, all rows, etc. Can ask for all columns, just certain column families, or specific columns.
API Examples: Write/Modify atomic row modification
API Examples: Read Return sets can be filtered using regular expressions: anchor: com.cnn.*
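The idea of filtering returned columns by regular expression can be mimicked in plain Python. This is only a sketch of the concept, not Bigtable's Scanner API; the row contents, the pattern, and the helper `filter_columns` are illustrative assumptions.

```python
import re

# One hypothetical WebTable row: column name -> value.
row = {
    "anchor:cnnsi.com": "CNN",
    "anchor:my.look.ca": "CNN.com",
    "contents:": "<html>...",
}


def filter_columns(row: dict, pattern: str) -> dict:
    """Keep only the columns whose full name matches the regular expression."""
    rx = re.compile(pattern)
    return {col: val for col, val in row.items() if rx.fullmatch(col)}


print(filter_columns(row, r"anchor:.*\.com"))
```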
HBase is an open-source, distributed, column-oriented database built on top of HDFS based on BigTable!
HBase is .. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.
Why HBase ? HBase is a Bigtable clone. It is open source It has a good community and promise for the future It is developed on top of and has good integration for the Hadoop platform, if you are using Hadoop already.
HBase Is Not … No join operators. Limited atomicity and transaction support: HBase supports multiple batched mutations of single rows only. Data is unstructured and untyped. Not accessed or manipulated via SQL: programmatic access via Java, REST, or Thrift APIs; scripting via JRuby.
HBase benefits over RDBMS No real indexes. Automatic partitioning. Scales linearly and automatically with new nodes. Commodity hardware. Fault tolerance. Batch processing.
Testing
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds
Connecting to HBase Java client: get(byte [] row, byte [] column, long timestamp, int versions); Non-Java clients: Thrift server hosting HBase client instance; sample ruby, c++, & java (via thrift) clients; REST server hosts HBase client. TableInput/OutputFormat for MapReduce: HBase as MR source or sink. HBase Shell: ./bin/hbase shell YOUR_SCRIPT
Bigtable Applications
Application 1: Google Analytics Enables webmasters to analyze traffic pattern at their web sites. Statistics such as: Number of unique visitors per day and the page views per URL per day, Percentage of users that made a purchase given that they earlier viewed a specific page. How? A small JavaScript program that the webmaster embeds in their web pages. Every time the page is visited, the program is executed. Program records the following information about each request: User identifier The page being fetched
Application 1: Google Analytics Two of the Bigtables Raw click table (~ 200 TB): A row for each end-user session. Row name includes the website's name and the time at which the session was created, so sessions that visit the same web site are clustered together and sorted in chronological order. Compression factor of 6-7. Summary table (~ 20 TB): Stores predefined summaries for each web site. Generated from the raw click table by periodically scheduled MapReduce jobs. Each MapReduce job extracts recent session data from the raw click table. Row name includes the website's name and the column family is the aggregate summaries. Compression factor is 2-3.
Application 2: Google Earth & Maps Functionality: Pan, view, and annotate satellite imagery at different resolution levels. One Bigtable stores raw imagery (~ 70 TB): Row name is a geographic segment. Names are chosen to ensure adjacent geographic segments are clustered together. Column family maintains sources of data for each segment.
Google File System Large-scale distributed “filesystem” Master: responsible for metadata Chunk servers: responsible for reading and writing large chunks of data Chunks replicated on 3 machines, master responsible for ensuring replicas exist
SSTable Immutable, sorted file of key-value pairs. Chunks of data plus an index. Index is of block ranges, not values. Figure: an SSTable consisting of 64K blocks, an index, and a Bloom filter.
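A block index over a sorted file can be sketched as follows. This toy keeps everything in memory and uses tiny blocks; the class name `TinySSTable`, the block size, and the `blocks_read` counter are illustrative assumptions, not the real SSTable format.

```python
from bisect import bisect_right


class TinySSTable:
    """Immutable sorted key-value pairs split into blocks, with a block index."""

    def __init__(self, sorted_items, block_size=2):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # The index stores the first key of each block, not every key.
        self.index = [block[0][0] for block in self.blocks]
        self.blocks_read = 0   # counts simulated block reads

    def get(self, key):
        i = bisect_right(self.index, key) - 1   # last block starting at or before key
        if i < 0:
            return None                         # key sorts before the first block
        self.blocks_read += 1                   # only one block is touched per lookup
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None


items = [("aardvark", 1), ("ant", 2), ("apple", 3), ("bat", 4), ("boat", 5)]
sst = TinySSTable(items)
print(sst.get("bat"), sst.blocks_read)
```

Because the index holds block ranges, a lookup binary-searches the small index and then reads a single block.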
Tablet Contains some range of rows of the table. Built out of multiple SSTables. Figure: a tablet (Start: aardvark, End: apple) built from two SSTables, each with 64K blocks and an index.
Table Multiple tablets make up the table. SSTables can be shared. Tablets do not overlap; SSTables can overlap. Figure: two tablets (aardvark to apple, apple_two_E to boat) built from four SSTables.
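Because tablets cover non-overlapping, sorted row ranges, finding the tablet for a row key is a binary search over the tablets' end keys. The sketch below assumes each tablet covers the range (previous end key, its own end key]; the end keys and server names are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical tablets, identified by their inclusive end row key, in sorted order.
tablet_end_keys = ["apple", "boat", "zebra"]
tablet_servers = ["ts1", "ts2", "ts3"]


def locate(row: str) -> str:
    """Return the server holding the tablet whose row range contains `row`."""
    i = bisect_left(tablet_end_keys, row)   # first tablet whose end key >= row
    if i == len(tablet_end_keys):
        raise KeyError(f"no tablet covers row {row!r}")
    return tablet_servers[i]


print(locate("aardvark"), locate("apple_two_E"))
```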
Chubby A persistent and distributed lock service. Consists of 5 active replicas, one replica is the master and serves requests. Service is functional when majority of the replicas are running and in communication with one another – when there is a quorum. Implements a nameservice that consists of directories and files.
Bigtable and Chubby Bigtable uses Chubby to: Ensure there is at most one active master at a time, Store the bootstrap location of Bigtable data (Root tablet), Discover tablet servers and finalize tablet server deaths, Store Bigtable schema information (column family information), Store access control list. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.
Tablet Assignment Each tablet is assigned to one tablet server at a time. Master server keeps track of the set of live tablet servers and current assignments of tablets to servers. Also keeps track of unassigned tablets. When a tablet is unassigned, master assigns the tablet to a tablet server with sufficient room.
Bigtable Master Assigns tablets to tablet servers Detects addition and expiration of tablet servers Balances tablet server load. Tablets are distributed randomly on nodes of the cluster for load balancing. Handles garbage collection Handles schema changes
Bigtable Tablet Servers Each tablet server manages a set of tablets. Typically between ten and a thousand tablets. Each 100-200 MB by default. Handles read and write requests to the tablets. Splits tablets that have grown too large. Master is responsible for load balancing and fault tolerance. Uses Chubby to monitor health of tablet servers, restart failed servers.
A 3-level Hierarchy 1st Level: A file stored in Chubby contains the location of the root tablet, i.e., a directory of ranges (tablets) and associated meta-data. The root tablet never splits. 2nd Level: Each meta-data tablet contains the location of a set of user tablets. 3rd Level: A set of SSTable identifiers for each tablet.
A 3-level Hierarchy Each meta-data row stores ~ 1KB of data. With 128 MB tablets, the three-level scheme addresses 2^34 tablets (2^61 bytes in 128 MB tablets). That is about 2 exabytes (roughly 2,000 petabytes).
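The addressing capacity follows from simple arithmetic, shown here as a short Python check. It assumes exactly 1 KB per meta-data row and exactly 128 MB per tablet, as on the slide.

```python
TABLET_BYTES = 128 * 2**20        # 128 MB tablet = 2**27 bytes
ROW_BYTES = 2**10                 # ~1 KB per meta-data row

rows_per_tablet = TABLET_BYTES // ROW_BYTES            # entries one meta-data tablet can hold
addressable_tablets = rows_per_tablet ** 2             # root tablet -> meta-data tablets -> user tablets
addressable_bytes = addressable_tablets * TABLET_BYTES # total user data addressed

print(rows_per_tablet == 2**17, addressable_tablets == 2**34, addressable_bytes == 2**61)
print(addressable_bytes / 10**18, "exabytes")
```

The root tablet points to up to 2^17 meta-data tablets, each of which points to up to 2^17 user tablets, which gives the 2^34 figure.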
Editing a Table Mutations are logged, then applied to an in-memory version. Logfile stored in GFS. Figure: inserts and deletes flow into the memtable of a tablet (apple_two_E to boat), which sits on top of the tablet's SSTables.
Tablet Serving “Log Structured Merge Trees” Image Source: Chang et al., OSDI 2006
Tablet Representation Figure: writes go to an append-only log on GFS and to a write buffer in memory (random-access); reads see the write buffer together with the SSTables on GFS. SSTable: immutable on-disk ordered map from string to string. String keys: <row, column, timestamp> triples.
Client Write & Read Operations Write operation arrives at a tablet server: Server ensures the client has sufficient privileges for the write operation (access control, Chubby), A log record is generated to the commit log file, Once the write commits, its contents are inserted into the memtable. Read operation arrives at a tablet server: Server ensures client has sufficient privileges for the read operation (Chubby), Read is performed on a merged view of (a) the SSTables that constitute the tablet, and (b) the memtable.
Write Operations As writes execute, size of memtable increases. Once memtable reaches a threshold: Memtable is frozen, A new memtable is created, Frozen memtable is converted to an SSTable and written to GFS.
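The write and read paths described on the last two slides can be sketched with a toy tablet in Python. Everything here is an in-memory stand-in: the list `log` plays the role of the commit log on GFS, tuples play the role of SSTables, and the class name `TinyTablet` and the threshold of two entries are illustrative assumptions.

```python
class TinyTablet:
    """Toy tablet: commit log, memtable, and a list of immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.log = []            # stand-in for the commit log on GFS
        self.memtable = {}       # in-memory write buffer
        self.sstables = []       # oldest first; each is an immutable sorted tuple
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))    # 1. log the mutation first
        self.memtable[key] = value       # 2. then apply it to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        frozen = self.memtable           # freeze the current memtable
        self.memtable = {}               # start a new, empty memtable
        self.sstables.append(tuple(sorted(frozen.items())))  # write out as an SSTable

    def read(self, key):
        """Merged view: memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


tablet = TinyTablet()
tablet.write("a", 1)
tablet.write("b", 2)     # reaches the threshold and triggers a flush
tablet.write("a", 3)     # newer value lives in the memtable
print(tablet.read("a"), tablet.read("b"), len(tablet.sstables))
```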
Compactions Minor compaction: converts the memtable into an SSTable; reduces memory usage and log traffic on restart. Merging compaction: reads the contents of a few SSTables and the memtable, and writes out a new SSTable; reduces the number of SSTables. Major compaction: a merging compaction that results in only one SSTable; no deletion records, only live data.
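A major compaction can be sketched as a merge that lets newer entries win and then drops deletion records. The sketch assumes each SSTable is given as a dict and that deletions are marked with a sentinel object named `TOMBSTONE`; both are illustrative choices.

```python
TOMBSTONE = object()   # hypothetical marker for a deletion record


def major_compaction(sstables):
    """Merge SSTables (oldest first) into one sorted list of live key-value pairs."""
    merged = {}
    for sstable in sstables:          # later (newer) SSTables overwrite older entries
        merged.update(sstable)
    # Deletion records are dropped, so only live data remains.
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)


older = {"a": 1, "b": 2}
newer = {"b": TOMBSTONE, "c": 3}
print(major_compaction([older, newer]))
```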
Refinements: Locality Groups Can group multiple column families into a locality group. Separate SSTable is created for each locality group in each tablet. Segregating column families that are not typically accessed together enables more efficient reads. In WebTable, page metadata can be in one group and contents of the page in another group.
Refinements: Compression Many opportunities for compression: Similar values in the same row/column at different timestamps. Similar values in different columns. Similar values across adjacent rows. Two-pass custom compression scheme: First pass: compress long common strings across a large window. Second pass: look for repetitions in a small window. Speed emphasized, but good space reduction (10-to-1).
Refinements: Bloom Filters Read operation has to read from disk when desired SSTable isn’t in memory. Reduce number of accesses by specifying a Bloom filter, which allows us to ask if an SSTable might contain data for a specified row/column pair. Small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations. Use implies that most lookups for non-existent rows or columns do not need to touch disk.
Bloom Filters
Approximate set membership problem Suppose we have a set S = {s1, s2, ..., sm} ⊆ universe U. Represent S in such a way that we can quickly answer “Is x an element of S?” To take as little space as possible, we allow false positives (i.e. x ∉ S, but we answer yes). If x ∈ S, we must answer yes.
Bloom filters Consist of an array A[n] of n bits (space), and k independent random hash functions h1,…,hk : U --> {0,1,..,n-1}. 1. Initially set the array to 0. 2. ∀ s ∈ S, A[hi(s)] = 1 for 1 ≤ i ≤ k (an entry can be set to 1 multiple times; only the first time has an effect). 3. To check if x ∈ S, we check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1. If not, clearly x ∉ S. If all A[hi(x)] are set to 1, we assume x ∈ S.
Figure: start with an array of all 0s. Each element of S (x1, x2) is hashed k times, and each hash location is set to 1.
Figure: if only 1s appear at the k hash locations of y, conclude that y is in S. This may yield a false positive.
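The three steps above translate into a short Python sketch. Deriving the k hash functions from SHA-256 with a per-function prefix is an assumption made for illustration; any k independent hash functions into {0, ..., n-1} would do, and the class name `BloomFilter` is likewise illustrative.

```python
import hashlib


class BloomFilter:
    """Bit array A[n] with k hash functions, supporting add and membership checks."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits               # step 1: array starts as all 0s

    def _positions(self, item: str):
        # Assumption: simulate h_1..h_k by hashing the item with k different prefixes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item: str) -> None:
        for p in self._positions(item):        # step 2: set A[h_i(s)] = 1
            self.bits[p] = 1

    def might_contain(self, item: str) -> bool:
        # Step 3: any 0 means "definitely not in S"; all 1s means "probably in S".
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(n_bits=1024, k=3)
bf.add("x1")
bf.add("x2")
print(bf.might_contain("x1"), bf.might_contain("x2"))
```

A `False` answer is always correct, which is why a tablet server can skip reading an SSTable from disk; a `True` answer may occasionally be a false positive and still requires the read.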
BigTable – Bloom Filters Drastically reduces the number of disk seeks required for read operations !
Benchmarks