Lecture 11: Other NoSql Instructor: Weidong Shi (Larry), PhD

COSC6376 Cloud Computing
Computer Science Department, University of Houston

Outline
Cassandra
Memcache

Cassandra

What Cassandra is…
Cassandra is a massively scalable, decentralized, structured data store (aka database).
The name: Cassandra was a prophetess in Troy during the Trojan War. Her predictions were always true, but never believed.

Cassandra: BigTable/Dynamo Hybrid
Originally from Facebook
Written by an original Dynamo developer
Now an Apache project
Facebook now uses HBase
Netflix is based on Cassandra
Written in Java
Follows the BigTable data model: column-oriented
Uses the Dynamo eventual consistency model
Uses Apache Thrift as its API
  An interface definition language
  Define and create services for numerous languages

Thrift
Created at Facebook along with Cassandra
A cross-language, service-generation framework
Binary protocol (like Google Protocol Buffers)
Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...

Thrift
What a service framework has to handle:
  Sending requests, getting results
  Waiting for requests (known location, known port)
  Communication protocol, data format
Alternatives:
  SOAP: XML, XML, and more XML
  CORBA: over-designed and heavyweight
  COM: embraced mainly in Windows client software
  Pillar: slick, but no versioning/abstraction
  Protocol Buffers etc.: closed-source Google deliciousness (closed when Thrift was designed; Protocol Buffers has since been open-sourced)

Principle of Operation
1. Define data types and service interfaces: create a Thrift file, e.g. demo.thrift
2. Run the Thrift code generator tool (written in C++) to build the Thrift platform files: Demo.php, Demo.cpp, Demo.py, Demo.java
3. Create the server and client applications: the server implements the services and the client calls them
4. Run the server

Projects Using Thrift
Cassandra, ThriftDB, Scribe, Hadoop / HBase, Facebook

Approaches of influence
BigTable
  Sparse map data model
  GFS, Chubby, et al
Dynamo
  O(1) distributed hash table (DHT)
  BASE (aka eventual consistency)
  Client-tunable consistency/availability
Cassandra ~= BigTable + Dynamo

Design Goals
High availability
Eventual consistency
  Trade off strong consistency in favor of high availability
Incremental scalability
Optimistic replication
“Knobs” to tune tradeoffs between consistency, durability and latency
Low total cost of ownership
Minimal administration

Proven at Web 2.0 scale
Facebook stores 150 TB of data on 150 nodes
Used at Twitter, Rackspace, Mahalo, Reddit, Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX, and others

Cassandra Data Model

Typical NoSQL API
Basic API access:
get(key) -- Extract the value given a key
put(key, value) -- Create or update the value given its key
delete(key) -- Remove the key and its associated value
execute(key, operation, parameters) -- Invoke an operation on the value (given its key), where the value is a special data structure (e.g. List, Set, Map, etc.)
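The four calls above can be illustrated with a minimal in-memory sketch. This is not the API of any particular product: the class name `KeyValueStore`, the operation names (`list_append`, `set_add`, `map_put`) and the sample key are all hypothetical, chosen only to show how `execute` differs from a plain `put`.

```python
from typing import Any, Callable, Dict


class KeyValueStore:
    """Toy in-memory store exposing get/put/delete/execute (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        # Hypothetical registry of named operations that execute() can apply
        # to a structured value (list, set, map) in place.
        self._ops: Dict[str, Callable[..., Any]] = {
            "list_append": lambda value, item: value.append(item),
            "set_add": lambda value, item: value.add(item),
            "map_put": lambda value, k, v: value.__setitem__(k, v),
        }

    def get(self, key: str) -> Any:
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        # Creates the key or overwrites its current value.
        self._data[key] = value

    def delete(self, key: str) -> None:
        # Removing a missing key is a no-op.
        self._data.pop(key, None)

    def execute(self, key: str, operation: str, *parameters: Any) -> Any:
        # Applies the named operation to the stored value without
        # shipping the whole value back to the caller.
        return self._ops[operation](self._data[key], *parameters)


store = KeyValueStore()
store.put("followers:eben", ["alison"])
store.execute("followers:eben", "list_append", "larry")
followers = store.get("followers:eben")
store.delete("followers:eben")
```

The point of `execute` is that the store, not the client, modifies the structure, so the client does not need a read-modify-write round trip.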

Data Model
[Figure: a keyspace (with settings, e.g. partitioner) contains column families (with settings, e.g. type [Std]); each column family contains columns; each column has a name, a value, and a clock.]

Data Model
Keyspace
  Uppermost namespace
  Typically one per application
ColumnFamily
  Associates records of a similar kind
  Record-level atomicity
  Indexed
Column
  Basic unit of storage

Keyspace
~= database
Typically one per application
Some settings are configurable only per keyspace

Column
Column: the smallest data element, a tuple with a name and a value
Each column has 3 parts:
  name: determines the sort order used in queries
  value
  timestamp: a long (clock)
Here’s a column represented in JSON-ish notation:
{ // this is a column
  name: " Address",
  value:
  timestamp:
}
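A minimal sketch of the three-part column, assuming an integer timestamp and a "latest timestamp wins" rule when two versions of the same column meet (the rule the consistency slides describe for reads). The `Column` class, the `reconcile` helper and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str       # determines sort order within a row
    value: str
    timestamp: int  # the "clock"; the unit is an assumption of this sketch


def reconcile(a: Column, b: Column) -> Column:
    """Return the newer of two versions of the same column."""
    assert a.name == b.name, "only versions of the same column can be reconciled"
    return a if a.timestamp >= b.timestamp else b


columns = [
    Column("phone", "555-0100", 10),
    Column("city", "Houston", 11),
    Column("age", "30", 12),
]
# Columns come back ordered by name, not by insertion order.
sorted_names = [c.name for c in sorted(columns, key=lambda c: c.name)]

winner = reconcile(Column("city", "Houston", 11), Column("city", "Austin", 20))
```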

Column Family
Groups records of a similar kind
Not the same kind, because CFs are sparse tables
Example:
UserProfile = { // this is a ColumnFamily
  phatduckk: { // this is the key to this Row inside the CF
    // now we have an infinite # of columns in this row
    username: "phatduckk",
    phone: "(900) "
  }, // end row
  ieure: { // this is the key to another row in the CF
    // now we have another infinite # of columns in this row
    username: "ieure",
    phone: "(888) ",
    age: "66",
    gender: "undecided"
  },
}

Column Family
key123: user=eben, nickname=The Situation
key456: user=alison, icon= , n=42
Think of it as a hashmap or associative array
Each row is uniquely identifiable by its key
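The "hashmap of rows" view can be written down directly as a nested dictionary. This is a sketch of the logical model only, not of how Cassandra stores data; the helper `get_column` is hypothetical, and the column whose value is missing on the slide is left out.

```python
# A standard column family as a map: row key -> {column name -> value}.
# Rows are sparse: each row carries only the columns it actually has.
column_family = {
    "key123": {"user": "eben", "nickname": "The Situation"},
    "key456": {"user": "alison", "n": "42"},
}


def get_column(cf, row_key, column_name, default=None):
    """Look up one column of one row; missing rows or columns yield default."""
    return cf.get(row_key, {}).get(column_name, default)


nickname = get_column(column_family, "key123", "nickname")
missing = get_column(column_family, "key456", "nickname")
```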

Super Column
Super columns group columns under a common name
A SuperColumn is a tuple with a name and a value, where the value is a map containing an unbounded number of Columns (a map of columns)

Super Column
{ // this is a SuperColumn
  name: "homeAddress",
  // with an infinite list of Columns
  value: {
    // note the key is the name of the Column
    street: {name: "street", value: "1234 x street", timestamp: },
    city: {name: "city", value: "san francisco", timestamp: },
    zip: {name: "zip", value: "94107", timestamp: },
  }
}

Super Column Family
A column family can be of type standard or super
Standard column family: each row contains a map of normal columns
Super column family: each row contains a map of super columns

Super column family
AddressBook = { // this is a ColumnFamily of type Super
  phatduckk: { // this is the key to this row inside the Super CF
    // the key here is the name of the owner of the address book
    // now we have an infinite # of super columns in this row
    // the keys inside the row are the names for the SuperColumns
    // each of these SuperColumns is an address book entry
    friend1: {street: "8th street", zip: "90210", city: "Beverly Hills", state: "CA"},
    // this is the address book entry for John in phatduckk's address book
    John: {street: "Howard street", zip: "94404", city: "FC", state: "CA"},
    Kim: {street: "X street", zip: "87876", city: "Balls", state: "VA"},
    Tod: {street: "Jerry street", zip: "54556", city: "Cartoon", state: "CO"},
    Bob: {street: "Q Blvd", zip: "24252", city: "Nowhere", state: "MN"},
    ...
    // we can have an infinite # of SuperColumns (aka address book entries)
  }, // end row
  ieure: { // this is the key to another row in the Super CF
    // all the address book entries for ieure
    joey: {street: "A ave", zip: "55485", city: "Hell", state: "NV"},
    William: {street: "Armpit Dr", zip: "93301", city: "Bakersfield", state: "CA"},
  },
}
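Compared with a standard column family, a super column family adds one level of nesting: row key, then super column name, then column name. The sketch below expresses part of the AddressBook example as nested dictionaries; the helper names are hypothetical and the structure is only the logical view, not a storage format.

```python
# Super column family: row key -> {super column name -> {column name -> value}}
address_book = {
    "phatduckk": {
        "John": {"street": "Howard street", "zip": "94404", "city": "FC", "state": "CA"},
        "Bob": {"street": "Q Blvd", "zip": "24252", "city": "Nowhere", "state": "MN"},
    },
    "ieure": {
        "William": {"street": "Armpit Dr", "zip": "93301", "city": "Bakersfield", "state": "CA"},
    },
}


def get_subcolumn(scf, row_key, super_column, column, default=None):
    """Three-level lookup: row, then super column, then column."""
    return scf.get(row_key, {}).get(super_column, {}).get(column, default)


def entries_in_state(scf, row_key, state):
    """Names of the super columns (address book entries) in a given state."""
    row = scf.get(row_key, {})
    return sorted(name for name, cols in row.items() if cols.get("state") == state)


john_zip = get_subcolumn(address_book, "phatduckk", "John", "zip")
ca_entries = entries_in_state(address_book, "phatduckk", "CA")
```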

Datamodel explained by example (Twitter)

Example - Twitter

Example - Twitter Supercolumn family

Write Operations
A client issues a write request to a random node in the Cassandra cluster.
The “Partitioner” determines the nodes responsible for the data.
Locally, write operations are logged and then applied to an in-memory version.
The commit log is stored on a dedicated disk local to the machine.
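How a partitioner might map a key to its responsible nodes can be sketched with a hash ring. The assumptions here are mine, not the slides': tokens are derived with MD5, each node owns a single token, and replicas are the next nodes clockwise on the ring. Node names and the replication factor of 3 are hypothetical.

```python
import bisect
import hashlib


def token(text: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node is placed on the ring at the token of its name.
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str):
        """Return the nodes responsible for key: first node clockwise, then its successors."""
        start = bisect.bisect_right(self.tokens, token(key))
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes)
owners = ring.replicas("key123")
```

Because the mapping depends only on the key and the ring, any node that receives the request can compute the same owner list and forward the write.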

Write Operations
No locks in the critical path
No reads
No seeks
Append support
Fast sequential disk access
Atomic within a column family
≈ 0.2 ms
[Figure: a write goes to the commit log and to the memtable; when a threshold is reached, the memtable is written out as an SSTable.]
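The local write path (log first, then memtable, then flush to an SSTable at a threshold) can be sketched as follows. Everything is held in Python lists and dicts for illustration; the threshold of 3 entries is arbitrary, and a real node would also recycle commit log segments after a flush, which this sketch does not do.

```python
class Node:
    """Toy model of one node's local write path."""

    def __init__(self, memtable_threshold=3):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in-memory version of recent writes
        self.sstables = []     # each SSTable is an immutable, key-sorted list
        self.threshold = memtable_threshold

    def write(self, key, value):
        # 1. Append to the commit log (sequential write, no read, no seek).
        self.commit_log.append((key, value))
        # 2. Apply to the in-memory table.
        self.memtable[key] = value
        # 3. Once the memtable is large enough, flush it as a sorted SSTable.
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


node = Node(memtable_threshold=3)
for key, value in [("K5", "e"), ("K1", "a"), ("K3", "c"), ("K2", "b")]:
    node.write(key, value)
```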

Compaction
[Figure: several sorted SSTables (for example one holding K2, K4, K5, K10 and one holding K1, K2, K3, with some entries marked DELETED) are merge-sorted into a single sorted data file (K1, K2, K3, K4, K5, K10, K30). Alongside the data file are a Bloom filter and an index file (K1 offset, K5 offset, K30 offset) that is loaded in memory.]
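The merge sort at the heart of compaction can be sketched with `heapq.merge`. The entry layout `(key, timestamp, value)`, the use of integer keys (so that 10 sorts after 5), and the `TOMBSTONE` marker are assumptions of this sketch. It also drops deleted entries immediately, whereas a real system keeps deletion markers for a grace period.

```python
import heapq

TOMBSTONE = None  # hypothetical deletion marker used as a value


def compact(sstables):
    """Merge key-sorted SSTables into one; entries are (key, timestamp, value)."""
    merged = []
    # heapq.merge streams the inputs in (key, timestamp) order without
    # loading everything into one big list and re-sorting it.
    for key, ts, value in heapq.merge(*sstables, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, ts, value)  # same key: the later timestamp wins
        else:
            merged.append((key, ts, value))
    # Entries whose newest version is a deletion marker are removed.
    return [entry for entry in merged if entry[2] is not TOMBSTONE]


older = [(2, 1, "v2-old"), (4, 1, "v4"), (5, 1, "v5"), (10, 1, "v10")]
newer = [(1, 2, "v1"), (2, 2, "v2-new"), (3, 2, "v3")]
newest = [(4, 3, TOMBSTONE), (30, 3, "v30")]
compacted = compact([older, newer, newest])
```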

Reads
Any node
Read repair
Memtable
Bloom filter field to determine whether a provided key is in the SSTable
Index field for quick read
≈ 15 ms
[Figure: a read consults the memtable and the SSTables; each SSTable has a Bloom filter (Bf) and an index (Idx).]
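A sketch of how a Bloom filter lets a read skip SSTables that cannot contain the key. The filter parameters (1024 bits, 3 hashes), the SHA-256 based hashing and the `SSTable` class are assumptions made for illustration. The important property is that the filter can return false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the (simulated) disk access
        return self.rows.get(key)


def read(key, memtable, sstables):
    """Check the memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):
        value = table.get(key)
        if value is not None:
            return value
    return None


memtable = {"K9": "v9"}
sstables = [SSTable({"K1": "v1", "K2": "v2"}), SSTable({"K2": "v2-new", "K5": "v5"})]
```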

Read Operations
[Figure: the client sends a query to the Cassandra cluster. The full query goes to the closest replica (Replica A), which returns the result; digest queries go to Replica B and Replica C, which return digest responses. The result is returned to the client, with read repair if the digests differ.]
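The digest comparison and read repair can be sketched like this. The `Replica` class, the MD5 digest over "timestamp:value", and the choice to repair synchronously before returning are assumptions of the sketch; the slides only state that repair happens when digests differ.

```python
import hashlib


def digest(value, timestamp):
    """Short fingerprint of a stored version (the format is an assumption)."""
    return hashlib.md5(f"{timestamp}:{value}".encode("utf-8")).hexdigest()


class Replica:
    def __init__(self, name, value, timestamp):
        self.name = name
        self.value = value
        self.timestamp = timestamp

    def read(self):
        return self.value, self.timestamp

    def read_digest(self):
        return digest(self.value, self.timestamp)

    def write(self, value, timestamp):
        # Only accept versions newer than what is already stored.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp


def read_with_repair(closest, others):
    """Full read from the closest replica, digest reads from the rest."""
    value, timestamp = closest.read()
    expected = digest(value, timestamp)
    if all(replica.read_digest() == expected for replica in others):
        return value  # all replicas agree
    # Digests differ: fetch full versions, pick the newest, push it to everyone.
    replicas = [closest] + list(others)
    value, timestamp = max((r.read() for r in replicas), key=lambda vt: vt[1])
    for replica in replicas:
        replica.write(value, timestamp)
    return value


a = Replica("A", "old", 1)
b = Replica("B", "new", 2)
c = Replica("C", "old", 1)
result = read_with_repair(a, [b, c])
```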

Cassandra and Consistency (reads)
Eventual consistency was discussed previously
Cassandra has tunable read and write consistency
Read consistency levels:
  One: Return from the first node that responds
  Quorum: Query all nodes and, once a majority of nodes have responded, return the value with the latest timestamp
  All: Query all nodes and, once all nodes have responded, return the value with the latest timestamp. An unresponsive node will fail the read

Cassandra and Consistency (writes)
Write consistency levels:
  Zero: Ensure nothing. Asynchronous write done in the background
  Any: Ensure that the write is written to at least 1 node
  One: Ensure that the write is written to at least 1 node’s commit log and memory table before acknowledging the client
  Quorum: Ensure that the write goes to N/2 + 1 nodes, where N is the number of replicas
  All: Ensure that the write goes to all nodes. An unresponsive node will fail the write
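The write levels reduce to "how many acknowledgements are needed before the client is answered". A small sketch, where `required_acks` and `write_succeeds` are hypothetical helper names and the replica count is passed in explicitly:

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed at a given consistency level."""
    level = level.upper()
    if level == "ZERO":
        return 0
    if level in ("ANY", "ONE"):
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # integer division: a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def write_succeeds(level: str, replication_factor: int, acks_received: int) -> bool:
    return acks_received >= required_acks(level, replication_factor)


quorum_of_3 = required_acks("QUORUM", 3)
quorum_of_4 = required_acks("QUORUM", 4)
```

With 3 replicas a quorum write tolerates one unresponsive node, while ALL tolerates none.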

Architecture

Tombstones
Relational applications often use a “soft delete”: instead of actually executing a delete SQL statement, the application issues an update statement that changes a value in a column called something like “deleted”.
In Cassandra, the equivalent is called a tombstone. When you execute a delete operation, the data is not immediately deleted. Instead, it’s treated as an update operation that places a tombstone on the value.
A tombstone is a deletion marker that is required to suppress older data in SSTables until compaction can run.
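Why a marker is needed rather than a physical delete can be seen in a small sketch: older versions of the value still exist (as they would in older SSTables), and only the newer tombstone keeps them from resurfacing. The `Table` class and its version list are hypothetical stand-ins for data spread across SSTables.

```python
TOMBSTONE = object()  # hypothetical deletion marker


class Table:
    """Keeps every written version, as immutable SSTables would until compaction."""

    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value)

    def put(self, key, value, timestamp):
        self.versions.setdefault(key, []).append((timestamp, value))

    def delete(self, key, timestamp):
        # A delete is just another write: it records a tombstone.
        self.put(key, TOMBSTONE, timestamp)

    def get(self, key):
        versions = self.versions.get(key)
        if not versions:
            return None
        _, value = max(versions, key=lambda version: version[0])
        # The newest version decides; a tombstone hides all older data.
        return None if value is TOMBSTONE else value


table = Table()
table.put("user:1", "eben", timestamp=1)
table.delete("user:1", timestamp=2)
after_delete = table.get("user:1")
# A stale write with an older timestamp arrives late (e.g. from a lagging replica).
table.put("user:1", "eben-stale", timestamp=1)
after_stale_write = table.get("user:1")
```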

Hinted Handoff
An optimization technique for data writes on replicas
When a write is made and a replica node for the key is down:
  Cassandra writes a hint to a live replica node
  That replica node will remind the downed node of the changes once it is back online
Hinted handoff reduces write latency when a replica is temporarily down
Hinted handoff provides high write availability at the cost of consistency
A hinted write does NOT count towards consistency level requirements for ONE, QUORUM, or ALL
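A sketch of the hint mechanism under simplifying assumptions: hints are kept in memory on the first live replica, and replay is triggered by an explicit call rather than by failure detection. The `Replica` class and the function names are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # pending (target replica, key, value) entries

    def store(self, key, value):
        self.data[key] = value


def write(key, value, replicas):
    """Write to live replicas; leave hints for the ones that are down."""
    live = [r for r in replicas if r.up]
    down = [r for r in replicas if not r.up]
    for replica in live:
        replica.store(key, value)
    if live:
        for target in down:
            live[0].hints.append((target, key, value))
    # Only real replica writes are counted; hints do not count as acks.
    return len(live)


def replay_hints(holder):
    """Deliver stored hints to targets that are back online."""
    remaining = []
    for target, key, value in holder.hints:
        if target.up:
            target.store(key, value)
        else:
            remaining.append((target, key, value))
    holder.hints = remaining


a, b, c = Replica("A"), Replica("B"), Replica("C")
c.up = False
acks = write("k", "v", [a, b, c])
missed_while_down = "k" not in c.data
c.up = True
replay_hints(a)
```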

MySQL Comparison (> 50 GB data)
            Writes (average)   Reads (average)
MySQL       ~300 ms            ~350 ms
Cassandra   0.12 ms            15 ms

Lessons Learnt
Add fancy features only when absolutely required.
Many types of failures are possible.
Big systems need proper systems-level monitoring.
Value simple designs.

Memcache

Memcache
Memcache is not a database.
Memcache is a distributed cache system.
Memcache is not meant to provide any backup support.
It's all about simple reads and writes.
Memcache is very fast.

Memcache users
LiveJournal, Wikipedia, Flickr, Twitter, YouTube, Digg, WordPress, Craigslist, Facebook (around 200 dedicated memcache servers)

Memcache
Memcache is an in-memory key-value store for small chunks of arbitrary data (strings, objects)

> in-memory (volatile) key-value store
  $memcache->set('unique_key', $value, $flag, $expiration_time);
    $flag = 0 / MEMCACHE_COMPRESSED to store the item compressed.
    $expiration_time = 0 (never expire) / 30 (30 seconds) etc.
  $memcache->get('unique_key');
  NOTE: A missing key doubles the fetch time (the cache lookup is paid on top of the fetch from the original source).

> distributed memory caching system (you can use more than one server to cache your data)
  $memcache->addServer('host1', 11211);
  $memcache->addServer('host2', 11211);
  $memcache->addServer('host3', 11211);
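The two ideas on this slide, expiry and spreading keys over several servers, can be sketched in Python. This is a model of the behaviour, not the memcached protocol or any client library: `CacheServer`, `CacheClient`, the CRC32-modulo server choice and the injectable clock are all assumptions of the sketch.

```python
import time
import zlib


class CacheServer:
    """One in-memory (volatile) cache node."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.items = {}  # key -> (value, expires_at or None)

    def set(self, key, value, expiration_time=0):
        # 0 means "never expire", mirroring the slide's $expiration_time.
        expires_at = None if expiration_time == 0 else self.clock() + expiration_time
        self.items[key] = (value, expires_at)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self.clock() >= expires_at:
            del self.items[key]  # lazily drop expired items on access
            return None
        return value


class CacheClient:
    """The client, not the servers, decides which server holds a key."""

    def __init__(self):
        self.servers = []

    def add_server(self, server):
        self.servers.append(server)

    def _pick(self, key):
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def set(self, key, value, expiration_time=0):
        self._pick(key).set(key, value, expiration_time)

    def get(self, key):
        return self._pick(key).get(key)


now = [1000.0]  # fake clock so the example does not need to sleep
client = CacheClient()
for _ in range(3):
    client.add_server(CacheServer(clock=lambda: now[0]))

client.set("unique_key", "value", 30)
hit = client.get("unique_key")
now[0] += 31
expired = client.get("unique_key")
```

Because the key alone determines the server, every client configured with the same server list finds the same item without the servers talking to each other.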

What can you store in Memcache?
Results of database calls, API calls (XML as string), page rendering (HTML as string), etc.
NOTE: Objects are serialized before being stored in memcache

Caching

Caching
Use memcache
  In-memory key-value store
  Distributed
What can be cached?
  Common queries, results of database calls
  Page rendering (HTML as string)
  Sessions
What are the benefits?
  Decreases load on the DB
  Faster response than the DB
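The usual way to get these benefits is the cache-aside pattern: look in the cache first and go to the database only on a miss. In this sketch a plain dictionary stands in for memcache and `slow_db_query` is a hypothetical stand-in for a real database call.

```python
db_calls = []


def slow_db_query(user_id):
    """Hypothetical stand-in for an expensive database call."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def cached_query(cache, key, loader, *args):
    """Cache-aside: return the cached value, or load it and remember it."""
    value = cache.get(key)
    if value is None:
        # Cache miss: pay for the cache lookup plus the database call.
        value = loader(*args)
        cache[key] = value
    return value


cache = {}
first = cached_query(cache, "user:42", slow_db_query, 42)
second = cached_query(cache, "user:42", slow_db_query, 42)
```

The second lookup never reaches the database, which is where the reduced DB load and faster responses come from.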

Caching

Additional NoSQL DBs

What else?
MongoDB, Voldemort, Riak / Basho, CouchDB, Hibari, Virtuoso
Many, many others!

MongoDB
Document-oriented
Documents stored as JSON objects
All writes and reads are through the master
Written in C++
Native Python bindings
Simple configuration

MemcacheDB
Memcached with persistence
Uses the Memcache API
Uses Berkeley DB
Master/slave
  Read from any slave
  Write only to the master

HyperTable
Open source
Inspired by Google's BigTable
Runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS)
Written almost entirely in C++
Developed in-house at Zvents Inc.

Voldemort
Dynamo clone by LinkedIn
Eventually consistent
Multiple versions may be returned on a Get
Uses Berkeley DB for persistence
Thrift interface
Written in Java

Datastores
[Table with columns Replication, Consistency, CAP, Data Model, Range Queries; only the following cells are recoverable]
Cassandra: Yes, Eventual, AP, Column oriented
HBase: Strong, CP
Hypertable:
MemcacheDB: Key/Value
MongoDB: Document
MySQL: Relational
Voldemort: No

Eventual Consistency
Apps can see inconsistent data if they are not careful about the choice of R and W
An app might not see its own writes, or successive reads might see a row’s state jump back and forth in time
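The slide does not define R and W. The sketch below assumes the usual Dynamo-style reading: N replicas, each read contacts R of them and each write contacts W of them. Under that assumption a read is guaranteed to see the latest write exactly when every possible read set overlaps every possible write set, which happens when R + W > N. The brute-force check confirms the rule for small N; the function names are hypothetical.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """The rule of thumb: read and write sets must share a node when R + W > N."""
    return r + w > n


def always_intersect(n: int, r: int, w: int) -> bool:
    """Brute force: does every choice of R readers meet every choice of W writers?"""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


# With N = 3: R = 1, W = 1 can miss the latest write; R = 2, W = 2 cannot.
weak = always_intersect(3, 1, 1)
strong = always_intersect(3, 2, 2)
```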

List of NoSQL databases [122+]
Wide Column Store / Column Families: HBase, Cassandra, Hypertable, Cloudata, Cloudera, Amazon SimpleDB
Document Stores: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB
Key Value / Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, Keyspace, Berkeley DB, MemcacheDB, Faircom C-Tree, Mnesia, LightCloud, Hibari, HamsterDB, STSdb, Pincaster, RaptorDB
Eventually Consistent Key Value Stores: Amazon Dynamo, Voldemort, Dynomite, KAI
Graph Databases: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink, Virtuoso, VertexDB, FlockDB
Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, Caching, ZODB, NEO, PicoLisp, Sterling
More and more databases
