Hadoop and NoSQL at Thomson Reuters


1 Hadoop and NoSQL at Thomson Reuters
Prepared by: Cory Vogel and Dan Dressel, July 2012

2 Agenda
Hadoop: Infrastructure, Lessons Learned
NoSQL: HBase, Cassandra, MongoDB

3 What does Hadoop mean to TR?
- Large batch processing
- Shared infrastructure
- Scalable without downtime
- Fault tolerant
- Lower cost
- More efficient than traditional data processing

4 Hardware Roles
Roles: Master Node, Data Node, Edge Node
Function: NameNode, CheckPoint, JobTracker (Master Node); TaskTracker, HDFS (Data Node); CLI, PIG, Hive, HUE… (Edge Node)
Requirements: Enterprise hardware, RAID, redundant power; commodity hardware, JBOD

5 Hardware
Hardware | Recommended Commodity | Initial Standard | Current Standard
CPU | 4 CPU (Intel or AMD) | 8 CPU (Intel) | 12 CPU (Intel)
Memory | 4 GB | 24 GB | 96 GB
Disk | (4) 1TB 7.2k | (12) 1TB 7.2k
Network | Single 1GbE | Dual 10GbE
Power Supply | Single | Dual
Server Cabinet | Dedicated | Shared (TOR networking)

6 Performance Results
Metric | Commodity | Standard
CPU | High WIO (spindle bound) | Great for different workloads
Memory | Very small heaps | More is better
NIC | Bottleneck | Peaks around 300 MB/s
Cabinet | Expensive (pre-allocate space) | Cost effective (shared space with 10GbE)

7 Hardware Comparison
Node | Teragen | Terasort
DL160 (CPU: 8 cores, Mem: 24GB, Disk: 4x1TB, NIC: 1GbE) | 1 hr 26 min 41 sec | 2 hr 56 min 9 sec
DL180 (Disk: 12x450GB) | 13 min 39 sec | 33 min 56 sec
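The two rows of timings are easier to compare as ratios. The short sketch below only redoes the arithmetic on the durations reported on the slide; the helper name `to_seconds` and the dictionary layout are illustrative choices, not part of the original deck.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Timings as reported on the slide.
dl160 = {"teragen": to_seconds(1, 26, 41), "terasort": to_seconds(2, 56, 9)}
dl180 = {"teragen": to_seconds(0, 13, 39), "terasort": to_seconds(0, 33, 56)}

# Speedup = old runtime / new runtime.
speedup = {job: dl160[job] / dl180[job] for job in dl160}

for job, factor in speedup.items():
    print(f"{job}: {factor:.2f}x faster on the DL180")
# teragen: 6.35x faster on the DL180
# terasort: 5.19x faster on the DL180
```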

8 Big Data Stacks
Stacks: Hadoop / HBase and Cassandra / MongoDB
Hardware: Standard
Disk Layout: JBOD (Hadoop / HBase); RAID10 for data (Cassandra / MongoDB)
Performance: 6X improvement (Hadoop / HBase); 5X improvement (Cassandra / MongoDB)

9 NoSQL Database Management Systems
Next Generation Databases
- Non-relational
- Distributed
- Open-source
- Horizontally scalable
Source:

10 NoSQL Database Management Systems: Proof-of-Concepts
Data Store | Model | Foundation | Support, Training, Consulting
HBase | Wide Column Store | Google's Bigtable | Cloudera
Cassandra | Wide Column Store | Amazon's Dynamo | DataStax
MongoDB | Document Store | | 10gen

11 HBase Data Model
Hierarchy: Table > Column Family > Row > Column
HBase | Description | Relational (like)
Table | Namespace for Column Families | Schema
Column Family | Container for an ordered collection of rows and columns | Table / Segment
Row | Ordered collection of columns
Column | Column name / Value / Timestamp (Cell)

12 HBase Data Model: HBase Example
Table | pmart_detail | pmart_summary
Column Family | DETAIL | SUM_PRODUCT_COMMAND_HOUR
Row Key | b4d433f42a00a59b c020da0i | ,60,{product,command}, {WestlawNext,GetDocumentSummaryRequest}
Column Name | product | elapsedTime.TOTAL
Column Value | WestlawNext | 63196
Column Timestamp
- Column names are not fixed
- Each individual column has a timestamp
- Column versions allowed
CLI Examples:
put 'pmart_detail', '0026d a77310', 'DETAIL:product', 'WestlawNext'
get 'pmart_detail', '0026d a77310', 'DETAIL:product'
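The three properties listed on this slide (free-form column names, a timestamp per cell, multiple versions) can be made concrete with a small in-memory model. This is a toy sketch, not the HBase client API: the class `ToyTable`, its method signatures, and the row key `row-1` are all hypothetical, chosen only to mirror the shape of the shell `put` and `get` commands.

```python
import time
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for one HBase table (illustrative only)."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # column families are fixed up front
        # row key -> "family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        # Qualifiers are not declared anywhere: any name is accepted.
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column, versions=1):
        """Return up to `versions` cells as (timestamp, value), newest first."""
        cells = self.rows.get(row_key, {}).get(column, {})
        return sorted(cells.items(), reverse=True)[:versions]

table = ToyTable("pmart_detail", ["DETAIL"])
table.put("row-1", "DETAIL:product", "Westlaw", timestamp=1)
table.put("row-1", "DETAIL:product", "WestlawNext", timestamp=2)
table.put("row-1", "DETAIL:elapsedTime", "63196", timestamp=2)  # new column, no schema change

print(table.get("row-1", "DETAIL:product"))              # newest version only
print(table.get("row-1", "DETAIL:product", versions=2))  # both versions
```

Only the column family has to exist before a write; the qualifier after the colon is created on first use, which is what "column names are not fixed" means in practice.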

13 Cassandra Data Model
Diagram labels: Keyspace, Super Column Family, Column Family, Row, Super Column, Column, Value
Cassandra | Description | Relational (like)
Keyspace | Namespace for Column Families | Schema
Super Column Family | Container for an ordered collection of rows | Table / Segment
Column Family
Row | Ordered collection of columns
Super Column | Ordered collection of Sub-Columns
Column | Column name / Value / Timestamp

14 Cassandra Data Model: Cassandra Example
Keyspace: ks
Column Family: cf
Row: xxxx_
Column Name: 09:10:25
Column Value: k=v,k=v,k=v
Column Timestamp: (client provided)
- Each row can contain as many as 2B columns
- Column names are not fixed
- Each individual column has a timestamp
CLI Examples:
use ks;
set cf['xxxx_ ']['09:10:25'] = 'k=v,k=v,k=v';
get cf['xxxx_ ']['09:10:25'];
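Because a row is an ordered collection of columns and the example uses a time of day as the column name, a range of column names behaves like a time-window query. The sketch below models a single wide row in plain Python under that assumption. `WideRow`, `set`, and `slice` are hypothetical names rather than Cassandra client calls, and the rule that the write with the highest timestamp wins is modeled in its simplest form.

```python
from bisect import bisect_left, bisect_right, insort

class WideRow:
    """Toy model of one wide row: columns kept sorted by name."""

    def __init__(self):
        self._names = []  # sorted column names
        self._cells = {}  # column name -> (value, timestamp)

    def set(self, name, value, timestamp):
        current = self._cells.get(name)
        if current is None:
            insort(self._names, name)
        elif timestamp < current[1]:
            return  # an older write never replaces a newer one
        self._cells[name] = (value, timestamp)

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish, in order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, finish)
        return [(n, self._cells[n][0]) for n in self._names[lo:hi]]

row = WideRow()
row.set("09:10:25", "k=v,k=v,k=v", timestamp=100)
row.set("08:59:59", "k=a", timestamp=101)
row.set("09:45:00", "k=b", timestamp=102)
row.set("09:10:25", "stale", timestamp=50)  # ignored: older client timestamp

print(row.slice("09:00:00", "09:59:59"))
# [('09:10:25', 'k=v,k=v,k=v'), ('09:45:00', 'k=b')]
```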

15 MongoDB Data Model
Hierarchy: Database > Collection > Document
MongoDB | Description | Relational (like)
Database | Container structure for collection of documents | Tablespace
Collection | Named grouping of documents | Table
Document | BSON (Binary JavaScript Object Notation) documents with dynamic schemas | Row

16 MongoDB Data Model: MongoDB Example
Database: people
Collection: names
Document:
{"addresses":[
  {"state":"MN","zip":"55101","street_type":"CT","street_name":"XXXXXX WAY","county":"RAMSEY","city":"ST. PAUL","full_street1":"2140 XXXXXX WAY COURT","house_nbr":"2140"},
  {"state":"MN","zip":"55101","zip_ext":"1101","street_type":"DR","street_name":"XXXX","county":"RAMSEY","city":"ST. PAUL","full_street1":"1109 XXXX DR","house_nbr":"1109"}],
 "phones":[],
 "birthdays":[" "],
 "names":[
  {"last_name":"XXXXX","middle_name":"WILLIAM","first_name":"JAMES"},
  {"last_name":"XXXXX","middle_name":"W","first_name":"JAMES","full_name":"XXXXX, JAMES W"},
  {"last_name":"XXXXX","middle_name":"W.","first_name":"JAMES"}]}
- 16 MB document size limit
- No fixed schema
- Structured document
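"No fixed schema" shows up inside this one document: the sub-documents under `names` do not all carry the same fields. The sketch below uses only Python's standard `json` module on a trimmed copy of the slide's document (the `addresses` and `birthdays` arrays are left out for brevity) and reports which fields are shared and which are optional. It does not talk to MongoDB, and the variable names are illustrative.

```python
import json

# Trimmed copy of the example document from the slide.
doc = json.loads("""
{
  "phones": [],
  "names": [
    {"last_name": "XXXXX", "middle_name": "WILLIAM", "first_name": "JAMES"},
    {"last_name": "XXXXX", "middle_name": "W", "first_name": "JAMES",
     "full_name": "XXXXX, JAMES W"},
    {"last_name": "XXXXX", "middle_name": "W.", "first_name": "JAMES"}
  ]
}
""")

key_sets = [set(entry) for entry in doc["names"]]
common = set.intersection(*key_sets)      # fields every entry has
optional = set().union(*key_sets) - common  # fields only some entries have

print("common fields:  ", sorted(common))
print("optional fields:", sorted(optional))
# common fields:   ['first_name', 'last_name', 'middle_name']
# optional fields: ['full_name']
```

Code that reads such documents has to treat fields like `full_name` as possibly absent, for example with `entry.get("full_name")`.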

17 Replication Options
Database | Media Protection | Data Replication to Additional Servers
HBase | Hadoop Distributed File System (Triple Store) | Master/Slave, Master/Master, Cyclic Replication (Asynchronous)
Cassandra | RAID 10 | Peer-to-Peer
MongoDB | RAID 10 | Replica sets

18 Consistency Levels
Database | Consistency Levels | Default Consistency Level
HBase | Strong consistency (possible exception: replication slaves) | Strong consistency
Cassandra | Configurable consistency | Eventual consistency
MongoDB
Consistency level for reads:
- Strong consistency: no stale reads possible
- Eventually consistent / configurable consistency: stale reads possible
Locking (single row or single document):
- HBase supports read-write row-level locking
- Cassandra: BASE (basically available, soft state, eventual consistency)
- Mongo supports atomic operations within a single document
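The slide does not say how "configurable consistency" is tuned. One common way to reason about it in Dynamo-style stores is the replica-overlap rule: with N replicas, a read that contacts R of them is guaranteed to see the latest acknowledged write to W of them when R + W > N. The sketch below applies that rule. The function names are hypothetical, and ONE, QUORUM and ALL are used here simply as labels for 1, a majority, and N replicas.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1

def reads_see_latest_write(n, r, w):
    """True when every read set of size r overlaps every write set of size w."""
    return r + w > n

N = 3  # replication factor assumed for this example
levels = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}

for write_name, w in levels.items():
    for read_name, r in levels.items():
        verdict = "no stale reads" if reads_see_latest_write(N, r, w) else "stale reads possible"
        print(f"write={write_name:<6} read={read_name:<6} -> {verdict}")
```

With N = 3, writing and reading at QUORUM gives 2 + 2 > 3, so reads cannot be stale; writing and reading at ONE gives 1 + 1 = 2, so they can.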

19 NoSQL Database Management System Considerations
- Use NoSQL Database Management Systems when low-latency row/record-level processing is required
- Pick the correct hardware stack (CPU, memory, disk, network)
- Confirm data consistency levels match business expectations
- Design row keys and column keys to match known search requirements
- Understand that data access control options are very limited
- Plan for replicated copies of the data
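The advice about designing row keys around known searches can be illustrated with a composite key. In a store that keeps rows sorted by key, putting the most frequently filtered fields first turns a search into a contiguous prefix scan. The sketch below is a plain-Python model under that assumption: the separator, the field order (product, command, hour), the hour format, the second command name, and both function names are hypothetical and not taken from the deck.

```python
from bisect import bisect_left

SEP = "|"  # hypothetical separator; must not occur inside the fields

def make_key(product, command, hour):
    """Build a composite row key with the most selective search fields first."""
    return SEP.join([product, command, hour])

def prefix_scan(sorted_keys, prefix):
    """Return every key starting with prefix from a sorted key list."""
    matches = []
    for key in sorted_keys[bisect_left(sorted_keys, prefix):]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        matches.append(key)
    return matches

keys = sorted([
    make_key("WestlawNext", "GetDocumentSummaryRequest", "2012070109"),
    make_key("WestlawNext", "GetDocumentSummaryRequest", "2012070110"),
    make_key("WestlawNext", "SearchRequest", "2012070109"),
    make_key("OtherProduct", "SearchRequest", "2012070109"),
])

# All hours for one product and command: a single contiguous range.
print(prefix_scan(keys, "WestlawNext|GetDocumentSummaryRequest|"))
```

A search by hour alone would not be served by this key order and would need a full scan, which is why the key layout has to follow the known access patterns.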

20 Q & A

