IT Monitoring WG
Technology for Storage/Analysis
28 November 2011
Grid Technology, CERN IT Department, CH-1211 Geneva 23, Switzerland
NoSQL Overview
  Highlights
    – Non-relational
    – Distributed, easy replication support
    – Open-source
    – Horizontally scalable, high scalability
    – Simple API
  Use cases
    – Large data volumes
    – Extreme query workloads
    – Schema evolution
The Zoo of Solutions
Classification (data model)
  NoSQL key-value based
    – BerkeleyDB, Dynamo, Voldemort, Redis, Scalaris, etc.
  NoSQL column/tabular based
    – Hadoop, Cassandra, HBase, Hive, Hypertable, etc.
  NoSQL document based
    – MongoDB, CouchDB, SimpleDB, Riak, etc.
  Relational DBMS
    – Oracle, MySQL, etc.
  Column based DBMS
    – Vertica, Infobright, LucidDB, etc.
NoSQL Key-value Store
  Data items stored and paired with a key
  Data accessible by a hash map
  Fast storage/retrieval of simple data by primary key
  Complex queries are not straightforward
  Modeling applications can get complicated
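The trade-off on this slide can be illustrated with a minimal Python sketch. It is not the API of any product listed earlier; the `KeyValueStore` class, its method names, and the host records are hypothetical, chosen only to show that a lookup by primary key is a single hash-map access while any query on the stored values has to scan everything.

```python
from typing import Any, Callable, Dict, List, Optional


class KeyValueStore:
    """Toy in-memory key-value store backed by a dict (a hash map)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Store the value under its primary key.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Primary-key lookup: one hash-map access, None if the key is absent.
        return self._data.get(key)

    def scan(self, predicate: Callable[[Any], bool]) -> List[str]:
        # There is no secondary index, so a query on the values
        # must inspect every stored item.
        return sorted(k for k, v in self._data.items() if predicate(v))


# Hypothetical monitoring records, keyed by host name.
store = KeyValueStore()
store.put("host-a", {"site": "CERN", "load": 0.7})
store.put("host-b", {"site": "CERN", "load": 0.2})
store.put("host-c", {"site": "RAL", "load": 0.9})

by_key = store.get("host-b")                   # fast: direct key access
busy = store.scan(lambda v: v["load"] > 0.5)   # slow path: full scan
print(by_key)
print(busy)
```

A real key-value store adds persistence, replication and partitioning on top of this access pattern, but the modeling constraint is the same: anything that is not a key lookup has to be built by the application.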
NoSQL Document Store
  More complex and meaningful data structures
  Based on versioned structured documents
  Values associated with keys are full documents
  The documents are stored in formats like JSON
  Provides more modeling flexibility
  Good for incomplete datasets
  Easy to map data from object-oriented software
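As a rough illustration of these points, the sketch below stores whole JSON documents under a key and bumps a revision number on every save. The `DocumentStore` class, the `_rev` field name, and the job records are assumptions made for this example; real document stores implement versioning and storage in their own ways.

```python
import json
from typing import Any, Dict


class DocumentStore:
    """Toy document store: each key maps to a full JSON document with a revision."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}  # key -> serialized JSON text

    def save(self, key: str, doc: Dict[str, Any]) -> int:
        # Each save replaces the whole document and increments its revision.
        previous = self._docs.get(key)
        rev = json.loads(previous)["_rev"] + 1 if previous else 1
        self._docs[key] = json.dumps({**doc, "_rev": rev})
        return rev

    def load(self, key: str) -> Dict[str, Any]:
        # Deserialize the stored JSON text back into nested Python objects.
        return json.loads(self._docs[key])


# Hypothetical job records; documents under different keys need not share fields.
store = DocumentStore()
rev1 = store.save("job-42", {"state": "running", "site": "CERN"})
rev2 = store.save(
    "job-42",
    {"state": "done", "site": "CERN", "metrics": {"cpu_s": 310, "tags": ["reco"]}},
)
store.save("job-43", {"state": "queued"})  # incomplete record: no site, no metrics

print(rev1, rev2)
print(store.load("job-42")["metrics"]["tags"])
print("site" in store.load("job-43"))
```

Nested dictionaries and lists map directly onto the document, which is why object-oriented application data tends to fit this model with little translation, and why a record with missing fields needs no schema change.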
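NoSQL Document Store
  Comparison of MongoDB and CouchDB
    – Programming language: C++ (MongoDB), Erlang (CouchDB)
    – HDFS support: No, GridFS instead (MongoDB), No (CouchDB)
    – Document format: BSON (MongoDB), JSON (CouchDB)
    – Query method: Object-based (MongoDB), JavaScript MapReduce (CouchDB)
    – Best use: Dynamic queries (MongoDB), Pre-defined queries and less dynamic data (CouchDB)
    – Supported / used by: Foursquare, SourceForge (MongoDB), Several websites (CouchDB)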
NoSQL Column Store
  Each key is associated with many attributes
  Data stored as column families (similar to a namespace for a set of related attributes)
  Best known through Google's BigTable implementation
  Used by the largest and best supported NoSQL implementations
  Stores and processes very large amounts of data
  Very high throughput
  Strong partitioning support
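The column-family idea can be sketched as a nested map: row key, then column family, then column name. The code below is a simplified illustration under that assumption, not the storage layout of BigTable or of any system named in this deck; the `ColumnFamilyTable` class and the node records are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


class ColumnFamilyTable:
    """Toy column-family table: row key -> column family -> column -> value."""

    def __init__(self, families: Iterable[str]) -> None:
        # Column families are declared up front; columns inside them are free-form.
        self._families = set(families)
        self._rows: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row_key: str, family: str, column: str, value: Any) -> None:
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][family][column] = value

    def get_family(self, row_key: str, family: str) -> Dict[str, Any]:
        # Read one group of related attributes without touching the others.
        # .get() avoids creating empty rows for keys that were never written.
        return dict(self._rows.get(row_key, {}).get(family, {}))


# Hypothetical node records: rows are sparse, each stores only what it has.
table = ColumnFamilyTable(["status", "metrics"])
table.put("node-1", "status", "state", "up")
table.put("node-1", "metrics", "cpu", 0.4)
table.put("node-1", "metrics", "mem", 0.8)
table.put("node-2", "metrics", "cpu", 0.1)

metrics_1 = table.get_family("node-1", "metrics")
status_2 = table.get_family("node-2", "status")
print(metrics_1)
print(status_2)
```

Grouping related attributes into a family lets a reader fetch only the family it needs, and the row key is a natural unit for splitting data across machines.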
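NoSQL Column Store
  Comparison of Cassandra, HBase, Hypertable and Hive
    – Programming language: Java (Cassandra, HBase, Hive), C++ (Hypertable)
    – HDFS support: Yes
    – Batch processing: No / Yes
    – Query method: MapReduce (Cassandra), MapReduce (HBase), HQL (Hypertable), HiveQL (Hive)
    – Best use: Real-time write (Cassandra), Real-time read/write (HBase), - (Hypertable), Complex queries (Hive)
    – Supported / used by: Facebook, Reddit, Digg (Cassandra); Facebook, Adobe, Yahoo, Twitter (HBase); Baidu (Hypertable); Facebook, Amazon (Hive)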
Final Considerations
  Start prototyping with a few use cases
    – Take a few use cases spanning different groups
    – One use case based on a NoSQL document store
    – One or two use cases based on a NoSQL column store
    – Each use case should involve 2+ groups
    – Try to maximize the collaboration between groups
  Get feedback from the NoSQL team
    – Status of their work
    – Plan the next steps together
Final Considerations
  Do not forget NoSQL distributions (such as Cloudera)
  Do not forget (R)DBMS!