Hadoop and NoSQL at Thomson Reuters


Hadoop and NoSQL at Thomson Reuters Prepared by: Cory Vogel and Dan Dressel July 2012

Agenda
- Hadoop: Infrastructure, Lessons Learned
- NoSQL: HBase, Cassandra, MongoDB

What does Hadoop mean to TR?
- Large batch processing
- Shared infrastructure
- Scalable without downtime
- Fault tolerant
- Lower cost
- More efficient than traditional data processing

Hardware Roles | Master Node | Data Node | Edge Node
Function | NameNode, CheckPoint, JobTracker | TaskTracker, HDFS | CLI, PIG, Hive, HUE…
Requirements | Enterprise hardware, RAID, redundant power | Commodity hardware, JBOD

Recommended Commodity Hardware
Hardware | Recommended Commodity | Initial Standard | Current Standard
CPU | 4 CPU (Intel or AMD) | 8 CPU (Intel) | 12 CPU (Intel)
Memory | 4 GB | 24 GB | 96 GB
Disk | (4) 1TB 7.2k | (12) 1TB 7.2k
Network | Single 1GbE | Dual 10GbE
Power Supply | Single | Dual
Server Cabinet | Dedicated | Shared (TOR networking)

Performance Results
Metric | Commodity | Standard
CPU | High WIO (spindle bound) | Great for different workloads
Memory | Very small heaps | More is better
NIC | Bottleneck | Peaks around 300 MB/s
Cabinet | Expensive (pre-allocate space) | Cost effective (shared space with 10GbE)

Hardware Comparison
Node | Teragen | Terasort
DL160 (CPU: 8 Cores, Mem: 24GB, Disk: 4x1TB, NIC: 1GbE) | 1 hr 26 min 41 sec | 2 hr 56 min 9 sec
DL180 (Disk: 12x450GB) | 13 min 39 sec | 33 min 56 sec
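
The run times on the Hardware Comparison slide can be turned into speedup ratios with a few lines of arithmetic. The sketch below only restates the slide's numbers; the helper names `to_seconds` and `speedup` are illustrative, and the slide lists only the disk layout for the DL180, so no other hardware differences are assumed here.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s run time into total seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Run times copied from the Hardware Comparison slide.
RUNS = {
    "Teragen": {"DL160": to_seconds(1, 26, 41), "DL180": to_seconds(0, 13, 39)},
    "Terasort": {"DL160": to_seconds(2, 56, 9), "DL180": to_seconds(0, 33, 56)},
}


def speedup(benchmark: str) -> float:
    """Ratio of DL160 run time to DL180 run time for one benchmark."""
    run = RUNS[benchmark]
    return run["DL160"] / run["DL180"]


if __name__ == "__main__":
    for name in RUNS:
        print(f"{name}: {speedup(name):.2f}x faster on the DL180")
```

On these figures Teragen comes out at roughly 6.35x and Terasort at roughly 5.19x.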

Big Data Stacks | Hadoop / HBase | Cassandra / MongoDB
Hardware | Standard
Disk Layout | JBOD | RAID10 for Data
Performance | 6X Improvement | 5X Improvement

NoSQL Database Management Systems
Next Generation Databases
- Non-relational
- Distributed
- Open-source
- Horizontally scalable
Source: http://nosql-database.org

NoSQL Database Management Systems: Proof-of-Concepts
Data Store | Model | Foundation | Support, Training, Consulting
HBase | Wide Column Store | Google’s Bigtable | Cloudera
Cassandra | Wide Column Store | Amazon’s Dynamo | DataStax
MongoDB | Document Store | | 10gen

HBase Data Model
Diagram labels: Table, Column Family, Row, Column
HBase | Description | Relational (like)
Table | Namespace for Column Families | Schema
Column Family | Container for an ordered collection of rows and columns | Table / Segment
Row | Ordered collection of columns
Column | Column name / Value / Timestamp (Cell)

HBase Data Model
HBase Example
Table | pmart_detail | pmart_summary
Column Family | DETAIL | SUM_PRODUCT_COMMAND_HOUR
Row Key | b4d433f42a00a59b831000007c020da0i | 1343066400,60,{product,command},{WestlawNext,GetDocumentSummaryRequest}
Column Name | product | elapsedTime.TOTAL
Column Value | WestlawNext | 63196
Column Timestamp | 1343066400429000 | 1343066400291111
- Column names are not fixed
- Each individual column has a timestamp
- Column versions allowed
CLI Examples (HBase shell; the column is written as 'family:qualifier'):
    put 'pmart_detail', '0026d257460737a77310', 'DETAIL:product', 'WestlawNext'
    get 'pmart_detail', '0026d257460737a77310', 'DETAIL:product'
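
To make the cell model on this slide concrete, here is a minimal in-memory sketch in Python. It is not an HBase client: `TinyTable` is a hypothetical toy class, and the older "Westlaw" version and the `DETAIL:command` column are invented for illustration. It only mirrors the three properties the slide names: column names are not fixed, every cell carries a timestamp, and multiple versions of a cell can coexist.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory stand-in for one HBase table (illustration only)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one cell version; existing versions are kept."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cells = self._rows.get(row, {}).get(column, {})
        return sorted(cells.items(), reverse=True)[:versions]


table = TinyTable()
row_key = "0026d257460737a77310"
# Hypothetical older version, added only to show versioning.
table.put(row_key, "DETAIL:product", "Westlaw", 1343066400291111)
table.put(row_key, "DETAIL:product", "WestlawNext", 1343066400429000)
# Column names are not fixed: a new qualifier needs no schema change.
# (Hypothetical column, not taken from the slide.)
table.put(row_key, "DETAIL:command", "GetDocumentSummaryRequest", 1343066400429000)

latest = table.get(row_key, "DETAIL:product")
history = table.get(row_key, "DETAIL:product", versions=2)
```

Here `latest` holds only the newest cell, while `history` returns both versions ordered by timestamp, newest first.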

Cassandra Data Model
Diagram labels: Keyspace, Super Column Family, Column Family, Row, Super Column, Column, Value
Cassandra | Description | Relational (like)
Keyspace | Namespace for Column Families | Schema
Super Column Family, Column Family | Container for an ordered collection of rows | Table / Segment
Row | Ordered collection of columns
Super Column | Ordered collection of Sub-Columns
Column | Column name / Value / Timestamp

Cassandra Data Model
Cassandra Example
Keyspace | ks
Column Family | cf
Row | xxxx_20120723
Column Name | 09:10:25
Column Value | k=v,k=v,k=v
Column Timestamp | 1343066400429000 (client provided)
- Each row can contain as many as 2 billion columns
- Column names are not fixed
- Each individual column has a timestamp
CLI Examples:
    use ks;
    set cf['xxxx_20120723']['09:10:25'] = 'k=v,k=v,k=v';
    get cf['xxxx_20120723']['09:10:25'];
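
The example row uses a time of day as the column name, which suggests a wide-row layout where one row holds many time-ordered columns. The sketch below is a toy Python model of that idea, not a Cassandra client: `WideRow` is a hypothetical class, and every column other than `09:10:25` is invented. It assumes column names are compared as plain strings, so zero-padded HH:MM:SS names sort chronologically.

```python
import bisect


class WideRow:
    """Toy wide row: columns are kept sorted by name (illustration only)."""

    def __init__(self):
        self._names = []  # sorted column names
        self._cells = {}  # column name -> (value, timestamp)

    def set(self, name, value, timestamp):
        """Insert or overwrite one column."""
        if name not in self._cells:
            bisect.insort(self._names, name)
        self._cells[name] = (value, timestamp)

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._cells[n][0]) for n in self._names[lo:hi]]


row = WideRow()  # stands in for row 'xxxx_20120723'
row.set("09:10:25", "k=v,k=v,k=v", 1343066400429000)
# Hypothetical extra columns, inserted out of order on purpose.
row.set("10:02:13", "k=v", 1343066400429002)
row.set("09:45:00", "k=v", 1343066400429001)

nine_oclock_hour = row.slice("09:00:00", "09:59:59")
```

Because the columns stay sorted, reading one hour of data is a contiguous slice rather than a scan of the whole row.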

MongoDB Data Model
Diagram labels: Database, Collection, Document
MongoDB | Description | Relational (like)
Database | Container structure for collection of documents | Tablespace
Collection | Named grouping of documents | Table
Document | BSON (Binary JavaScript Object Notation) documents with dynamic schemas | Row

MongoDB Data Model
MongoDB Example
Database | people
Collection | names
Document:
    {
      "addresses": [
        {"state": "MN", "zip": "55101", "street_type": "CT", "street_name": "XXXXXX WAY", "county": "RAMSEY", "city": "ST. PAUL", "full_street1": "2140 XXXXXX WAY COURT", "house_nbr": "2140"},
        {"state": "MN", "zip": "55101", "zip_ext": "1101", "street_type": "DR", "street_name": "XXXX", "county": "RAMSEY", "city": "ST. PAUL", "full_street1": "1109 XXXX DR", "house_nbr": "1109"}
      ],
      "phones": [],
      "birthdays": ["19491101"],
      "names": [
        {"last_name": "XXXXX", "middle_name": "WILLIAM", "first_name": "JAMES"},
        {"last_name": "XXXXX", "middle_name": "W", "first_name": "JAMES", "full_name": "XXXXX, JAMES W"},
        {"last_name": "XXXXX", "middle_name": "W.", "first_name": "JAMES"}
      ]
    }
- 16 MB document size limit
- No fixed schema
- Structured document
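
The two constraints on this slide, no fixed schema and a per-document size limit, can be illustrated without a database. The Python sketch below is only an approximation under stated assumptions: it reads the slide's "16M" as 16 MiB, and it measures UTF-8 JSON length as a stand-in for the real BSON size, which will differ. `approx_size` and `fits` are hypothetical helper names, and both sample documents are cut-down versions of the slide's example.

```python
import json

# Assumption: the slide's "16M" limit is read as 16 MiB.
MAX_DOC_BYTES = 16 * 1024 * 1024


def approx_size(doc: dict) -> int:
    """Approximate stored size via UTF-8 JSON length (BSON size will differ)."""
    return len(json.dumps(doc).encode("utf-8"))


def fits(doc: dict) -> bool:
    """True if the approximate size is within the assumed limit."""
    return approx_size(doc) <= MAX_DOC_BYTES


person = {
    "names": [{"last_name": "XXXXX", "first_name": "JAMES"}],
    "phones": [],
    "birthdays": ["19491101"],
}
# No fixed schema: a second document may carry different fields.
other = {
    "names": [{"full_name": "XXXXX, JAMES W"}],
    "addresses": [{"state": "MN", "zip": "55101"}],
}

if __name__ == "__main__":
    for doc in (person, other):
        print(sorted(doc), approx_size(doc), fits(doc))
```

A check like this is most useful for documents that grow over time, such as one that keeps appending to an embedded array.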

Replication Options
Database | Media Protection | Data Replication to Additional Servers
HBase | Hadoop Distributed File System (Triple Store) | Master/Slave, Master/Master, Cyclic Replication (Asynchronous)
Cassandra | RAID 10 | Peer-to-Peer
MongoDB | RAID 10 | Replica sets
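
Triple storage in HDFS has a direct capacity cost: each block is kept three times, so unique data is at most one third of raw disk. The Python sketch below works that through for the 12x450GB disk layout shown earlier in the deck. The node count of 10 is a made-up figure for illustration, the function name is hypothetical, and the estimate ignores space reserved for the operating system and temporary job output.

```python
def usable_capacity_gb(disks_per_node: int, disk_gb: int, nodes: int,
                       replication_factor: int = 3) -> float:
    """Upper bound on unique data a cluster can hold, in GB."""
    raw_gb = disks_per_node * disk_gb * nodes
    return raw_gb / replication_factor


if __name__ == "__main__":
    # 12 x 450 GB disks per node (from the deck); 10 nodes is an assumption.
    print(usable_capacity_gb(12, 450, nodes=10))
```

With these inputs, 54,000 GB of raw disk holds at most 18,000 GB of unique data.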

Consistency Levels
Database | Consistency Levels | Default Consistency Level
HBase | Strong consistency (possible exception: replication slaves) | Strong consistency
Cassandra | Configurable consistency | Eventual consistency
MongoDB | |
Consistency level for reads:
- Strong consistency: no stale reads possible
- Eventually consistent / configurable consistency: stale reads possible
Locking (single row or single document):
- HBase supports read-write row-level locking
- Cassandra is BASE: basically available, soft state, eventual consistency
- Mongo supports atomic operations within a single document
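
"Configurable consistency" is usually reasoned about with the replica-overlap rule from Dynamo-style systems: with N replicas, a read that contacts R of them is guaranteed to include the latest acknowledged write if that write reached W replicas and R + W > N. The Python sketch below encodes only that rule. It is not taken from the slides, the function names are illustrative, and it ignores failure handling and repair.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("ONE / ONE:      ", read_sees_latest_write(n, 1, 1))
    print("QUORUM / QUORUM:", read_sees_latest_write(n, q, q))
    print("ONE / ALL:      ", read_sees_latest_write(n, 1, n))
```

With three replicas, reading and writing at ONE allows stale reads, while QUORUM for both, or ALL on the write side, does not.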

NoSQL Database Management System Considerations
- Use NoSQL Database Management Systems when low-latency row/record-level processing is required
- Pick the correct hardware stack (CPU, Memory, Disk, Network)
- Confirm data consistency levels match business expectations
- Design row keys and column keys to match known search requirements
- Understand that data access control options are very limited
- Plan for replicated copies of the data

Q & A