Column-Based
Column-based databases are sometimes described as a sparse, multidimensional, distributed, persistent sorted map.
Here "map" means a collection of (key, value) pairs in which each key is mapped to a value.
What distinguishes column-based storage from key-value stores is that the keys are multidimensional: each key is derived from several components, namely table name, row key, column, and timestamp.
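The "sorted map with a multidimensional key" idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the key layout (table, row key, column, timestamp) follows the slide, while the integer timestamps and the value 'Johnny' are made up for illustration.

```python
# Toy model: one flat map whose key has four components.
# Key layout follows the slide; timestamps and 'Johnny' are illustrative only.
cells = {
    ("EMPLOYEE", "row2", "Name:Fname", 1): "Peter",
    ("EMPLOYEE", "row1", "Name:Fname", 1): "John",
    ("EMPLOYEE", "row1", "Name:Fname", 2): "Johnny",
    ("EMPLOYEE", "row1", "Details:Job", 1): "Wrestler",
}

# "Sorted": visiting keys in order keeps all cells of one row next to each other.
sorted_keys = sorted(cells)

# "Sparse": a cell that was never written has no entry and costs no storage.
missing = cells.get(("EMPLOYEE", "row2", "Details:Job", 1))
```

A plain key-value store would treat each key as one opaque string; here the components stay separate, so the data can be ordered and sliced by row, by column, or by version.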
Column-Based Examples

Google's BigTable, which holds its data in the Google File System (runs Gmail and Google Docs).
Apache's open source HBase, which often holds its data in the Hadoop Distributed File System or Amazon's Simple Storage Service (S3).
Cassandra is also sometimes classed as a column-based NoSQL system. https://en.wikipedia.org/wiki/Cassandra
HBase Data Model

HBase organizes data into concepts including namespaces, tables, column families, column qualifiers, columns, rows, and data cells.
A column is a combination of (column family : column qualifier).
Data is stored in a self-describing way by associating columns with data values, where the data values are strings (HBase itself stores them as uninterpreted byte arrays).
Each data item also has a timestamp, and there can be multiple versions of a data item.
Each data item also has a unique key for fast access; these keys identify cells in the storage system.
Note that the terms table, row, and column are not used in the same way as in relational databases.
HBase Data Model Continued

Tables and Rows:
Data is stored in tables, and each table has its own name.
Each data item is a self-describing row with a unique row key.
Row keys are strings that can be lexicographically ordered (so only orderable characters are allowed).

Columns:
Each table can have one or more column families.
Each column family has a name and must be specified at table creation.
When data is added, each data item can be associated with a column qualifier.
Column qualifiers are part of the self-describing model in that they can be different for each data item.
A column is just a combination of a column family and a column qualifier.
The concept of a column family allows for vertical partitioning: the columns of one family are generally accessed together, so they can be stored together and apart from the other families.
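The relationship between column families and column qualifiers can be sketched with nested dictionaries. This is a minimal illustration, not HBase's API: the helper names `put` and `read_family` and the nested layout are assumptions; the family names and values come from the EMPLOYEE example in this deck.

```python
from collections import defaultdict

# Column families are fixed when the table is created.
FAMILIES = {"Name", "Address", "Details"}

# Assumed layout: row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, column, value):
    """Store a value under 'family:qualifier'; the family must already exist."""
    family, qualifier = column.split(":", 1)
    if family not in FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table[row_key][family][qualifier] = value

def read_family(family):
    """Vertical partition: return one family's columns for every row that has any."""
    return {row: dict(fams[family]) for row, fams in table.items() if family in fams}

put("row1", "Name:Fname", "John")
put("row3", "Name:Fname", "Anakin")
put("row3", "Name:EvilNickname", "Darth Vader")  # qualifier used by row3 only
put("row3", "Address:Homeworld", "Tatooine")
```

Qualifiers are free-form per row (only row3 has EvilNickname), while an unknown family is rejected. Reading a single family never touches the data of the other families, which is the point of vertical partitioning.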
HBase Data Model Continued

Versioning:
Each data item has an associated timestamp, and there can be multiple versions of each data item.
The timestamp can be user-provided or automatically generated.

Cells:
A cell is the basic data item in HBase.
The key of a cell is a combination of the table name, row key, column family, column qualifier, and timestamp.
If no timestamp is provided on a read, the most recent matching cell is retrieved.

Namespaces:
A namespace is a collection of tables that are typically used together.
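The versioning rule (several timestamped versions per cell, newest returned by default) can be sketched as follows. The function names `put_version` and `get_version`, the timestamps 100 and 200, and the value 'Improving' are hypothetical; this is an in-memory stand-in, not how HBase stores versions internally.

```python
# Assumed layout: (row_key, column) -> {timestamp: value}
versions = {}

def put_version(row_key, column, value, timestamp):
    """Writing to an existing cell adds a version instead of overwriting."""
    versions.setdefault((row_key, column), {})[timestamp] = value

def get_version(row_key, column, timestamp=None):
    """Return the value at `timestamp`, or the most recent one if it is omitted."""
    history = versions.get((row_key, column))
    if not history:
        return None
    if timestamp is None:
        timestamp = max(history)  # newest version wins by default
    return history.get(timestamp)

put_version("row2", "Details:Review", "Pathetic", 100)
put_version("row2", "Details:Review", "Improving", 200)

latest = get_version("row2", "Details:Review")
older = get_version("row2", "Details:Review", timestamp=100)
```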
HBase Storage

Each HBase table is divided into a number of regions.
Each region holds a range of the row keys (which is why row keys need to be lexicographically ordered).
Each region is divided into stores.
Each column family is assigned to one store in one region.
Regions are assigned to region servers (storage nodes).
A master server is responsible for managing the region servers and for splitting a table into regions.
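Because each region covers a contiguous range of row keys, finding the region for a key is a search over the sorted region start keys. The sketch below assumes made-up split points and server names; real region boundaries are chosen by HBase itself.

```python
import bisect

# Hypothetical split points: region i covers [start_key[i], start_key[i + 1]).
REGION_START_KEYS = ["", "row2", "row5"]
# Hypothetical assignment of regions to region servers.
REGION_SERVERS = ["server-a", "server-b", "server-a"]

def region_for(row_key):
    """Index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1

def server_for(row_key):
    """Region server currently responsible for row_key."""
    return REGION_SERVERS[region_for(row_key)]
```

The ordering is lexicographic, not numeric: 'row10' sorts before 'row2', so in this sketch it lands in the first region. Row key design therefore decides which rows end up stored next to each other.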
Using HBase

HBase only provides low-level CRUD (Create, Read, Update, Delete) operations.
It is the responsibility of the application to implement more complex operations (such as joins).

Creating a table:
    create 'EMPLOYEE', 'Name', 'Address', 'Details'
    EMPLOYEE is the table name
    Name, Address, Details are the column families

Inserting a cell:
    put 'EMPLOYEE', 'row1', 'Name:Fname', 'John'
    row1 is the unique row key
    Name is the column family
    Fname is the column qualifier
    John is the value
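Since joins are left to the application, a client typically reads rows from one table and looks up the matching rows in another. The sketch below shows such a client-side join on plain dictionaries. The SUPERVISOR table, its Info:Office column, and the office values are hypothetical and exist only for this illustration; the employee values come from the EMPLOYEE example.

```python
# Stand-ins for rows already read from two tables: row key -> {column: value}.
employees = {
    "row1": {"Name:Fname": "John"},  # no supervisor recorded
    "row2": {"Name:Fname": "Peter", "Details:Supervisor": "J. Jonah Jameson"},
    "row3": {"Name:Fname": "Anakin", "Details:Supervisor": "The Emperor"},
}
# Hypothetical second table, keyed by supervisor name.
supervisors = {
    "J. Jonah Jameson": {"Info:Office": "Newsroom"},
    "The Emperor": {"Info:Office": "Throne room"},
}

def join_on_supervisor(employees, supervisors):
    """Inner join done by the client: one lookup in `supervisors` per employee row."""
    joined = []
    for row_key, columns in sorted(employees.items()):
        supervisor = columns.get("Details:Supervisor")
        supervisor_columns = supervisors.get(supervisor)
        if supervisor_columns is None:
            continue  # no matching row, so the employee is dropped from the result
        joined.append(
            (row_key, columns["Name:Fname"], supervisor, supervisor_columns["Info:Office"])
        )
    return joined

result = join_on_supervisor(employees, supervisors)
```

Against a real cluster each lookup would be a separate read, so the cost of the join grows with the number of rows on the driving side.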
More HBase Insertions

    put 'EMPLOYEE', 'row1', 'Name:Fname', 'John'
    put 'EMPLOYEE', 'row1', 'Name:Lname', 'Cena'
    put 'EMPLOYEE', 'row1', 'Name:Nickname', 'John Cena'
    put 'EMPLOYEE', 'row1', 'Details:Job', 'Wrestler'
    put 'EMPLOYEE', 'row1', 'Details:Review', 'Good'

    put 'EMPLOYEE', 'row2', 'Name:Fname', 'Peter'
    put 'EMPLOYEE', 'row2', 'Name:Lname', 'Parker'
    put 'EMPLOYEE', 'row2', 'Name:Nickname', 'Spiderman Sympathizer'
    put 'EMPLOYEE', 'row2', 'Details:Job', 'Photographer'
    put 'EMPLOYEE', 'row2', 'Details:Supervisor', 'J. Jonah Jameson'
    put 'EMPLOYEE', 'row2', 'Details:Review', 'Pathetic'

    put 'EMPLOYEE', 'row3', 'Name:Fname', 'Anakin'
    put 'EMPLOYEE', 'row3', 'Name:Lname', 'Skywalker'
    put 'EMPLOYEE', 'row3', 'Name:Nickname', 'Annie'
    put 'EMPLOYEE', 'row3', 'Name:EvilNickname', 'Darth Vader'
    put 'EMPLOYEE', 'row3', 'Details:Job', 'Sith Lord'
    put 'EMPLOYEE', 'row3', 'Details:Supervisor', 'The Emperor'
    put 'EMPLOYEE', 'row3', 'Details:Review', 'Breathless'
    put 'EMPLOYEE', 'row3', 'Address:Homeworld', 'Tatooine'
More HBase CRUD Operations

Reading data:
    scan 'EMPLOYEE'
    Returns all of the data in a table
    get 'EMPLOYEE', 'row2'
    Returns all of the data for one row

Updating data:
    Same as inserting data (using "put")

Deleting data:
    delete 'EMPLOYEE', 'row2', 'Name:Fname', 1417521848375
    EMPLOYEE is the table name
    row2 is the row key
    Name:Fname is the column
    1417521848375 is the timestamp
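The behaviour of scan, get, put-as-update, and delete-by-timestamp can be imitated with a small in-memory class. `ToyTable` is a hypothetical stand-in written for this illustration and is not the HBase client API; the timestamp 1417521848999 and the value 'Pete' are made up. The toy removes exactly the named version, whereas real HBase records a delete marker and its handling of other versions may differ.

```python
class ToyTable:
    """In-memory stand-in for one table; not the HBase client API."""

    def __init__(self):
        self.cells = {}  # (row_key, column, timestamp) -> value

    def put(self, row_key, column, value, timestamp):
        # Insert and update are the same operation: a new version is written.
        self.cells[(row_key, column, timestamp)] = value

    def scan(self):
        """Every cell in key order, like scanning the whole table."""
        return sorted(self.cells.items())

    def get(self, row_key):
        """Every cell that belongs to one row."""
        return [(key, value) for key, value in self.scan() if key[0] == row_key]

    def delete(self, row_key, column, timestamp):
        """Remove one version of one cell, if it exists."""
        self.cells.pop((row_key, column, timestamp), None)


employee = ToyTable()
employee.put("row2", "Name:Fname", "Peter", 1417521848375)
employee.put("row2", "Name:Fname", "Pete", 1417521848999)  # update: second version
employee.put("row1", "Name:Fname", "John", 1417521848375)

employee.delete("row2", "Name:Fname", 1417521848375)  # removes only the older version
remaining = employee.get("row2")
```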
Why would you ever use Column-Based Databases?

1. You have huge amounts of data
2. Your data doesn't have strict structure
3. You need to vertically partition your data
4. You like Greek architecture