NoSQL: Know Your Enemy
Shelly Noll
SRT Solutions, Ann Arbor, MI
shelly.noll@srtsolutions.com
@shellynoll
Disclaimer
- There is a lot of disagreement about this topic
- Everything I say could be wrong, depending on who you ask
- Even if it's right today, it will probably be wrong soon
What is NoSQL?
A database management system with the following features:
- Queries do not use SQL
- Does not guarantee ACID properties
- Fault-tolerant, distributed architecture

- The term was coined by Carlo Strozzi in 1998 to describe a database he created that did not expose a SQL interface
- The term was co-opted in 2009, when Eric Evans of Rackspace and Johan Oskarsson of Last.fm organized an event to discuss the growing trend of open-source, distributed databases
CAP Theorem
There are a couple of basic theories we need to talk about to understand the difference between relational and NoSQL databases.
- Consistency: all nodes see the same data at the same time
- Availability: every request receives a success/failure response
- Partition Tolerance: the system operates despite failure of part of the system
A distributed system can satisfy any two of these guarantees at the same time, but not all three.
ACID vs BASE
Instead of the ACID properties found in relational databases, NoSQL has something different. What is the opposite of an acid? NoSQL databases exhibit BASE properties.
ACID
- Atomicity: all or nothing
- Consistency: data must adhere to the schema and rules
- Isolation: no transaction interferes with another
- Durability: permanency
BASE
- Basically Available: the application works basically all the time
- Soft State: the data does not have to be consistent all the time
- Eventual Consistency: the data will be in some known state eventually
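The soft-state and eventual-consistency ideas can be illustrated with a toy, in-memory sketch. Everything here is hypothetical (the `Replica` and `EventuallyConsistentStore` names, the single-primary write path, and the explicit `sync()` call standing in for background replication); it is not how any particular NoSQL product works.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Keep the newest version only (last-write-wins).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class EventuallyConsistentStore:
    """Hypothetical store: writes hit one replica, others catch up later."""

    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        self.pending = []  # replication messages not yet delivered
        self.clock = 0     # simple version counter

    def put(self, key, value):
        self.clock += 1
        primary, *others = self.replicas
        primary.write(key, value, self.clock)
        for replica in others:
            self.pending.append((replica, key, value, self.clock))

    def get(self, key, replica_index=0):
        return self.replicas[replica_index].read(key)

    def sync(self):
        # Stand-in for asynchronous replication finishing.
        while self.pending:
            replica, key, value, version = self.pending.pop(0)
            replica.write(key, value, version)


store = EventuallyConsistentStore(["a", "b"])
store.put("status", "hello")
stale = store.get("status", replica_index=1)  # replica "b" has not caught up
store.sync()
fresh = store.get("status", replica_index=1)  # now consistent
```

Between `put` and `sync` the two replicas disagree (soft state); after `sync` they converge (eventual consistency).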
ACID vs BASE
ACID
- Strong consistency
- Isolation
- Focus on "commit"
- Nested transactions
- Conservative (pessimistic)
- Difficult to change schema
BASE
- Weak consistency
- Best effort
- Approximate answer OK
- Aggressive (optimistic)
- Simpler
- Faster
- Easier to change

Consistency: adheres to the rules
Isolation: transactions do not interfere
Dr. Eric A. Brewer (2000) http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
Why Did This Happen???
Data-related reasons
- Avoidance of unneeded complexity
- Avoidance of object-relational mapping
- Avoidance of making schema changes
Performance-related reasons
- Higher throughput
- Horizontal scalability and running on commodity hardware
- Complexity and cost of setting up database clusters

- Complexity: consider Twitter. You have users, status updates, relationships between users, direct messages, and not much else.
- Object-relational mapping: object-oriented programmers have to create a layer in their applications that takes the data from the database and transforms it into objects the application can use. This also creates overhead in syncing the state of the objects in memory with the entities in the database, which is expensive and time-consuming. NoSQL APIs look more like the objects programmers use.
- NoSQL compromises reliability for better performance.
Database Types
- Key-Value
- Graph
- Document Store
- Column Store
Database type disagreement
Classifications compared: Stephen Yen, Ken North, Rick Cattell, Jonathan Ellis, Wikipedia
- Amazon SimpleDB: Entity-Attribute-Value Data Store, Document Store
- Apache Hadoop: Tabular
- Cassandra: Wide Columnar Store, Extensible Record Store, Columnfamily, Eventually-Consistent Key-Value Store
- Google Bigtable: Key-Value Store
- HBase
- HyperTable
- Redis: Data-Structures Server, Collection, Key-Value Cache
Key-Value
- Data is stored in a schema-less way with a key and a value
- Limited querying capability
- Values can usually be of any data type, or could be a serialized object
- Variations
  - Eventually consistent
  - Hierarchical
  - Ordered
  - Key-value cache (in RAM or on disk)
Examples: Memcached, Redis, Riak (Basho), Voldemort
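As a rough illustration of the key-value model, here is a minimal in-memory sketch. The `KeyValueStore` class and its method names are made up for this example and do not correspond to the API of Memcached, Redis, Riak, or Voldemort; JSON is assumed as the serialization format.

```python
import json


class KeyValueStore:
    """Hypothetical schema-less store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are stored as serialized blobs; the store never inspects them.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)


kv = KeyValueStore()
kv.put("user:42", {"name": "Ada", "followers": 3})  # a serialized object
kv.put("counter", 7)                                # any data type
user = kv.get("user:42")
```

Note what is missing: there is no way to ask "which users have more than two followers?" without knowing the keys up front, which is the limited querying capability mentioned above.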
Popular Key-Value stores

| Name | Vendor | Language | Used By |
|---|---|---|---|
| Memcached | Danga | C | LiveJournal, YouTube, Reddit, Zynga, Facebook, Twitter |
| Redis | VMware | ANSI C | GitHub, Craigslist, Blizzard, Digg, Twitter, Flickr, Stack Overflow |
| Riak | Basho | Erlang, C, C++, JavaScript | Comcast, Mozilla, AOL, Ask.com |
| Voldemort | LinkedIn | Java | |
Graph
- Based on graph theory
- Data is stored as nodes (entities), properties, and edges (relationships)
- Allows for calculations between nodes
  - Shortest distance between nodes
  - Analysis of relationships
Examples: AllegroGraph, FlockDB, GraphDB, InfiniteGraph, Neo4j, OrientDB
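The "shortest distance between nodes" calculation can be sketched with a breadth-first search over a small adjacency list. The node names, the `follows`-style edges, and the `shortest_distance` helper are all invented for illustration; real graph databases expose this through their own query languages and APIs.

```python
from collections import deque

# Nodes (entities) with properties.
nodes = {
    "alice": {"kind": "user"},
    "bob": {"kind": "user"},
    "carol": {"kind": "user"},
    "dave": {"kind": "user"},
    "erin": {"kind": "user"},
}

# Edges (relationships), stored as an adjacency list.
edges = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
    "erin": [],
}


def shortest_distance(edges, start, goal):
    """Return the number of hops from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


hops = shortest_distance(edges, "alice", "dave")
```

Here `alice` reaches `dave` through `bob` and `carol`, while `erin` has no relationships and is unreachable.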
Popular graph databases

| Name | Vendor | Language | Used By |
|---|---|---|---|
| AllegroGraph | Franz, Inc. | Lisp | Pfizer, Ford, Kodak, NASA, DoD |
| FlockDB | Twitter | | |
| GraphDB | Sones | .NET | |
| InfiniteGraph | Objectivity | | CIA, DoD |
| Neo4j | Neo Technology | Java | Adobe, Cisco |
| OrientDB | Apache | | A bunch of small companies no one's heard of |
Document Store
- Stores document-oriented or semi-structured data
- Documents may be encoded as XML, YAML, JSON, BSON, PDF, MS Word, MS Excel, etc.
- Documents are not required to adhere to a standard schema
- Offers a query language to retrieve documents based on content
Examples: Amazon SimpleDB, Apache CouchDB, Lotus Notes, MongoDB
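To make "no standard schema" and "retrieve documents based on content" concrete, here is a small sketch using plain Python dictionaries as documents. The `find` helper and the sample `posts` collection are hypothetical and only support equality matching; actual document stores offer much richer query languages.

```python
def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Documents in one collection do not have to share the same fields.
posts = [
    {"_id": 1, "author": "ada", "title": "Hello", "tags": ["intro"]},
    {"_id": 2, "author": "bob", "title": "CAP notes"},
    {"_id": 3, "author": "ada", "body": "No title here", "likes": 4},
]

ada_posts = find(posts, author="ada")
ada_ids = [doc["_id"] for doc in ada_posts]
```

Unlike the key-value case, the query looks inside the stored values, and documents missing a field simply do not match.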
Popular Document stores

| Name | Vendor | Language | Used By |
|---|---|---|---|
| CouchDB | Apache | Erlang | Various Facebook applications |
| MongoDB | 10gen | C++ | MTV Networks, Craigslist, Foursquare |
| SimpleDB | Amazon | | |
Column store
- Stores data in a tabular format
- Different names for the exact same thing
  - Wide Columnar Store
  - ColumnFamily
  - Tabular
  - Entity-Attribute-Value Data Store
  - Extensible Record Store
  - Multivalue
  - BigTable
Examples: Apache Hadoop, Cassandra, Google Bigtable, HBase, HyperTable
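One common way to picture a column-family layout is as nested maps: row key, then column family, then column name. The sketch below assumes that picture; the `ColumnFamilyTable` class and the `profile`/`stats` families are invented for illustration and do not reflect the storage format of any listed product.

```python
class ColumnFamilyTable:
    """Hypothetical sparse table: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, family, column, value):
        families = self._rows.setdefault(row_key, {})
        families.setdefault(family, {})[column] = value

    def get(self, row_key, family):
        # Return a copy of one family's columns; empty dict if absent.
        return dict(self._rows.get(row_key, {}).get(family, {}))


users = ColumnFamilyTable()
users.put("u1", "profile", "name", "Ada")
users.put("u1", "profile", "city", "Ann Arbor")
users.put("u2", "profile", "name", "Bob")
users.put("u2", "stats", "followers", 10)

u1_profile = users.get("u1", "profile")
u1_stats = users.get("u1", "stats")
```

Rows are sparse: `u1` has no `stats` columns at all, and nothing had to be declared ahead of time to add the `city` column.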
Popular column stores

| Name | Vendor | Language | Used By |
|---|---|---|---|
| Bigtable | Google | | Google File System |
| Cassandra | Apache | Java | Netflix, Twitter, Constant Contact, Reddit, Digg |
| Hadoop | | | Yahoo! |
| HBase | | | Facebook's messaging platform |
| HyperTable | Zvents | C++ | Baidu |
MapReduce
- An algorithm for dividing work across a distributed system
- Breaks a big task into smaller tasks that can be done in parallel
- Map query: maps the input into a final format
- Reduce query: operates over a set of results
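The classic word-count example gives a feel for the map and reduce steps. This single-process sketch only imitates the shape of the algorithm; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) are illustrative, and in a real distributed system the map calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one input document into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the set of values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    mapped = []
    for doc in documents:  # each document is an independent small task
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["NoSQL is not SQL", "SQL is everywhere"])
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread across many machines.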
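Comparisons

| Database type | Performance | Scalability | Flexibility | Complexity |
|---|---|---|---|---|
| Key-Value Stores | High | High | High | None |
| Column Stores | High | High | Moderate | Low |
| Document Stores | High | Variable (High) | High | Low |
| Graph Databases | Variable | Variable | High | High |
| Relational Databases | Variable | Variable | Low | Moderate |

Ben Scofield (2010) http://nosql.mypopescu.com/post/396337069/presentation-nosql-codemash-an-interesting-nosql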
MongoDB example
Where wouldn't you use NoSQL?
- Data is critical to the function of the business/application
- Data has a strong and/or slowly changing schema
- Need true transactional capabilities
- Need data mining capabilities
- Set-based updates
Examples: banking apps, healthcare apps, enterprise apps
Where would you use NoSQL?
- Heavy read/write
- Single-user
- Simple, non-structured data
- Lack of interconnected data
- Doesn't matter if it takes a while to get the data consistent
- Data is not critical
Examples: social networking apps, mobile apps
Future of NoSQL
- UnSQL
  - A query language for NoSQL databases
  - Does not have a data definition language
- Acquisition of NoSQL databases by larger companies
  - Similar to what happened in the BI space, where IBM, Microsoft, and HP acquired smaller players