Hadoop and NoSQL at Thomson Reuters Prepared by: Cory Vogel and Dan Dressel July 2012
Agenda Hadoop NoSQL Infrastructure Lessons Learned HBase Cassandra MongoDB
What does Hadoop mean to TR? Large batch processing Shared Infrastructure Scalable w/o down time Fault tolerant Lower cost More efficient than traditional data processing
Hardware Roles Master Node Data Node Edge Node Function NameNode CheckPoint JobTracker TaskTracker HDFS CLI PIG Hive HUE… Requirements Enterprise hardware RAID Redundant Power Commodity Hardware JBOD
Recommended Commodity Hardware Hardware Recommended Commodity Initial Standard Current Standard CPU 4 CPU (Intel or AMD) 8 CPU (Intel) 12 CPU (Intel) Memory 4 GB 24 GB 96 GB Disk (4) 1TB 7.2k (12) 1TB 7.2k Network Single 1GbE Dual 10GbE Power Supply Single Dual Server Cabinet Dedicated Shared (TOR networking)
Performance Results Metric Commodity Standard CPU High WIO (spindle bound) Great for different work loads Memory Very small heaps More is better NIC Bottle Neck Peaks around 300MBs Cabinet Expensive (pre-allocate space) Cost Effective (shared space with 10GbE)
Hardware Comparison Node Teragen Terasort DL160 CPU: 8 Cores Mem: 24GB Disk: 4x1TB NIC: 1GbE 1hrs, 26mins, 41sec 2hrs, 56mins, 9sec DL180 Disk: 12x450GB 13mins, 39sec 33mins, 56sec
Big Data Stacks Hadoop / Hbase Cassandra / MongoDB Hardware Standard Disk Layout JBOD RAID10 for Data Performance 6X Improvement 5X Improvement
NoSQL Database Management Systems Next Generation Databases Non-relational Distributed Open-source Horizontally scalable Source: http://nosql-database.org
NoSQL Database Management Systems Proof-of-Concepts Data Store Model Foundation Support , Training, Consulting HBase Wide Column Store Google’s Bigtable Cloudera Cassandra Amazon’s Dynamo DataStax MongoDB Document Store 10gen
HBase Data Model Table Column Family Row Column HBase Description Relational (like) Table Namespace for Column Families Schema Column Family Container for an ordered collection of rows and columns Table / Segment Row Ordered collection of columns Column Column name / Value / Timestamp (Cell) Table Column Family Row Column
HBase Data Model HBase Example Column names are not fixed Table pmart_detail pmart_summary Column Family DETAIL SUM_PRODUCT_COMMAND_HOUR Row Key b4d433f42a00a59b831000007c020da0i 1343066400,60,{product,command}, {WestlawNext,GetDocumentSummaryRequest} Column Name product elapsedTime.TOTAL Column Value WestlawNext 63196 Column Timestamp 1343066400429000 1343066400291111 Column names are not fixed Each individual column has a timestamp Column versions allowed CLI Examples: put ‘pmart_detail’,‘0026d257460737a77310’,‘DETAIL’:’product ‘,‘WestlawNext’; get ‘pmart_detail’,‘0026d257460737a77310’,‘DETAIL’:’product‘;
Cassandra Data Model Cassandra Description Relational (like) Keyspace Namespace for Column Families Schema Super Column Family Container for an ordered collection of rows Table / Segment Column Family Row Ordered collection of columns Super Column Ordered collection of Sub-Columns Column Column name / Value / Timestamp Keyspace Super Column Family Row Super Column Column Value Column Family
Cassandra Data Model Cassandra Example Keyspace ks Column Family cf Row xxxx_20120723 Column Name 09:10:25 Column Value k=v,k=v,k=v Column Timestamp 1343066400429000 (client provided) Each row can contain as many as 2B columns Column names are not fixed Each individual column has a timestamp CLI Examples: use ks; set cf[‘xxxx_20120723’][‘09:10:25‘]=‘k=v,k=v,k=v’; get cf[‘xxxx_20120723’][‘09:10:25 '] ;
MongoDB Data Model Database Collection Document MongoDB Description Relational (like) Database Container structure for collection of documents Tablespace Collection Named grouping of documents Table Document BSON (Binary JavaScript Object Notation) documents with dynamic schemas Row Database Collection Document
MongoDB Data Model MongoDB Example Database people Collection names Document {"addresses":[ {"state":"MN","zip":“55101","street_type":"CT","street_name":“XXXXXX WAY","county":“RAMSEY", "city":“ST. PAUL,"full_street1":"2140 XXXXXX WAY COURT","house_nbr":"2140"}, {"state":"MN","zip":“55101,"zip_ext":“1101","street_type":"DR","street_name":“XXXX","county":“RAMSEY","city":“ST. PAUL","full_street1":"1109 XXXX DR","house_nbr":“1109“}], "phones":[], "birthdays":["19491101"], "names":[ {"last_name":“XXXXX","middle_name":“WILLIAM","first_name":“JAMES"}, {"last_name":“XXXXX”,"middle_name":“W","first_name":“JAMES,"full_name":“XXXXX, JAMES W"}, {"last_name":“XXXXX“,"middle_name":“W.","first_name":“JAMES“}} 16M document size limit No fixed schema Structured document
Replication Options Database Media Protection Data Replication to Additional Servers HBase Hadoop Distributed File System (Triple Store) Master/Slave Master/Master Cyclic Replication (Asynchronous) Cassandra RAID 10 Peer-to-Peer MongoDB Replica sets
Consistency Levels Consistency level for reads: Database Consistency Levels Default Consistency Level HBase Strong consistency (possible exception replication slaves) Strong consistency Cassandra Configurable consistency Eventual Consistency MongoDB Consistency level for reads: Strong consistency – No stale reads possible Eventually consistent / configurable consistency – Stale reads possible Locking – Single row or single document HBase supports read-write row level locking Cassandra BASE – Basically available, soft state, eventual consistency Mongo supports atomic operations within a single document
NoSQL Database Management System Considerations Use NoSQL Database Management Systems when low latency row/record level processing is required Pick the correct hardware stack (CPU, Memory, Disk, Network) Confirm data consistency levels match business expectations Design row keys and column keys to match known search requirements Understand data access control options are very limited Plan for replicated copies of the data
Q & A