Real-Time Analytics with NewSQL: Why Hadoop is not enough

Presentation transcript:

Real-Time Analytics with NewSQL: Why Hadoop is not enough
Raj Bains, Director of Product Management

Agenda
- SQL on Hadoop
- NewSQL with customer examples
- When to use which technology
- NewSQL compared
- Operations – the big problem with big data

Scale-out: The Architecture of the Cloud
- Technologies: NewSQL (scale-out SQL), SQL warehouses, NoSQL, Hadoop
- Workloads: high-volume simple transactions, system-of-record transactions, real-time analytics, fast analytics on old data, batch analytics on massive data sets

What goes around, comes around… SQL is cool again!!

Batch jobs via MapReduce
- Apache Hive
- ✓ Fault tolerance ✓ Scales to petabytes ✓ Schema flexibility

Real-time query response on the data warehouse
- Cloudera Impala
- Apache Drill (MapR)
- Presto (Facebook)
- Shark/Spark (UC Berkeley AMPLab)
- Stinger initiative and Tez (Hortonworks)
- IBM Big SQL
- Pivotal HAWQ
- ? Fault tolerance ? Scale to petabytes ? Schema flexibility

Transactional database on HBase? Unproven

Example: Cloudera Impala Performance

Impala Performance Update: Now Reaching DBMS-Class Speed
http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Impala with columnar storage (Parquet) beats Hive (not saying much) and reaches the performance of other columnar stores on TPC-DS.

TPC Benchmark™ DS (TPC-DS): The New Decision Support Benchmark Standard
TPC-DS models decision support systems that:
- Examine large volumes of data
- Give answers to real-world business questions
- Execute queries of various operational requirements and complexities (e.g., ad-hoc, reporting, iterative OLAP, data mining)
- Are characterized by high CPU and IO load
- Are periodically synchronized with source OLTP databases through database maintenance functions

NewSQL Promise: Scale-out SQL operational database

NewSQL Basics
- Operational databases
- Scale-out of NoSQL
- ACID properties
- Distributed transactions

NewSQL Add-ons
- Real-time analytics
- In-memory
- Geo-distribution
- Online schema changes

GOOGLE F1
“We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”
Google is encouraging developers to switch to SQL “for low-latency OLTP queries, large OLAP queries, and everything in between.”

ClustrixDB Introduction

REAL WORKLOADS
- HIGH-SCALE TRANSACTIONS: linear scalability for writes/updates/reads. Double nodes → double transactions/sec
- REAL-TIME ANALYTICS: linear speedup for analytics. Double nodes → half the query time
- SCALE-OUT: add nodes as demand grows
- SELF-MANAGING, BUILT-IN FAULT TOLERANCE
- ACID, SQL AND MYSQL

This slide conveys what we believe are the key characteristics of the ideal database for real-world workloads and the cloud. In other words, this is the “wish list” for the ideal database.

Key points to emphasize:

Scale-out SQL is the way to go. Clustrix offers a scale-out SQL database that lets you simply add more nodes to your cluster as demand grows so you can serve more users, transactions and data.

High-scale transactions. Clustrix delivers high transactional query throughput with near-linear scale at virtually any data set size and concurrency for all real-world query workloads.

Real-time analytics. You can run analytic queries against your main database (while running transactions) to get real-time insights and operational intelligence. Clustrix uses Massively Parallel Processing (MPP), which uses multiple cores across nodes in parallel to speed up your analytic queries.

SQL, MySQL and ACID. With Clustrix, you get ACID guarantees and the full power of a SQL interface. Our database is wire-compatible with MySQL, which means that you can use your existing application code and connectors with Clustrix.

Self-healing. Clustrix is easy to install and automates fault tolerance. Clustrix is built to be self-managing and simplifies operations, allowing DBAs to focus on high-value tasks and greatly reducing the cost of ownership.

Customer proven. Clustrix has been serving production workloads since 2008. We power dozens of large-scale production customers all around the world. Our largest customers have datasets with billions of rows, multiple terabytes of data, and non-trivial transactional workloads approaching 100,000 TPS in production.

Superior service. Clustrix provides services that our customers love. Our DBA-on-demand service provides deep technical insight. Managed services in DBaaS monitor your database to find issues before you do.

Clustrix Design
- Massively parallel query processing
- Intelligent data distribution
- Shared-nothing architecture: every node accepts SQL and has its own query compiler, data map and database engine
- Simple queries: fielded by any node, routed to the data node
- Complex queries: split into query fragments, with fragments processed in parallel
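The two query paths on this slide can be illustrated with a small in-process model. This is only a sketch under stated assumptions: the modulo-based `owner_node` data map, the `likes` column and the thread pool standing in for cluster nodes are all hypothetical and are not taken from ClustrixDB.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 3  # hypothetical cluster size


def owner_node(user_id: int) -> int:
    """Hypothetical data map: which node holds the row for this key."""
    return user_id % NUM_NODES


# Shared nothing: each node stores only its own slice of the rows.
nodes = [dict() for _ in range(NUM_NODES)]
for user_id in range(1, 13):
    nodes[owner_node(user_id)][user_id] = {"id": user_id, "likes": user_id * 10}


def point_lookup(user_id: int) -> dict:
    """Simple query: any node may accept it, then it is routed to the owner."""
    return nodes[owner_node(user_id)][user_id]


def partial_sum(node_rows: dict) -> int:
    """Query fragment: aggregate only the rows stored on one node."""
    return sum(row["likes"] for row in node_rows.values())


def total_likes() -> int:
    """Complex query: run one fragment per node in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        return sum(pool.map(partial_sum, nodes))


print(point_lookup(7))
print(total_likes())
```

The point lookup touches exactly one node, while the aggregate touches every node but only ships one partial result per node back to the coordinator.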

Scaling SQL to 29+ Million Users, without a DBA

The application: social discovery (dating) and matchmaking
- Users: 29+ million
- Logins: 10 million a day
- User messages: 15 million a day
- Likes: 4 million a day

“We have not run into scaling issues anymore. As we’ve needed capacity we just add nodes and see linear growth.” Nicolas Van Eenaeme, CIO, MassiveMedia

Frequent complex query in the application: 7-way join with group by and sort
- Tables: user_cxxxxxxx (1.9 TB table), user_email, user, user_photo, user_photo_detail, user_blocked, user_friends

The database
- Transactions: 4.4 billion a day
- Avg. latency: 5-10 ms
- Cores: 168 x 2
- Memory: 1 TB x 2
- SSD: 23 TB x 2
- Raw reads / writes: 4.69 / 1.08 petabytes a month
- Online schema change on user_contactlist, a ~2 TB table
- Running on ClustrixDB for 3 years

© 2014
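To put the daily figures on this slide in per-second terms, the short calculation below converts them to average rates. It assumes load is spread evenly over 24 hours, which real traffic is not, so peak rates would be higher; the helper name `per_second` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(events_per_day: float) -> float:
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


# Daily volumes as quoted on the slide.
daily = {
    "transactions": 4_400_000_000,
    "user messages": 15_000_000,
    "logins": 10_000_000,
    "likes": 4_000_000,
}

for name, count in daily.items():
    print(f"{name}: {per_second(count):,.0f} per second on average")
```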

Real-Time Analytics for Ad Exchanges
6.9 billion ad impressions a day. Bids in < 50 ms.

Previous setup
- Master struggling to ingest high-volume data, clickstreams
- Complex 15-slave network with lag and inconsistent data

With ClustrixDB
- Scale-out cluster with multi-master replication
- All data is synchronized and live for analytics
- Ad agencies and DSPs make bidding strategies and run reports to monitor them

“Reports went from up to 4 hours to 15 seconds, making customers happy.” Ken Kwan, CTO

[Diagram labels: www.abcd.com ad; supply side platforms; ad exchanges; demand side platforms; ad agencies; advertisers]

NOMORERACK: Availability and Growth in the Cloud
- One of the fastest growing e-commerce companies in the US, offering daily deals
- 1023% growth in revenue
- 15-20x traffic peaks in the holidays
- Complex reporting/analytics queries
- Cyber Monday: 600% revenue spike, 3x database traffic
- Scaled from 6 nodes (48 cores) to 14 nodes (112 cores)

SQL on Hadoop or NewSQL? NewSQL is a better fit in real-time

HADOOP AND THE DATA WAREHOUSE: When to use which
Dr Amr Awadallah (Cloudera) and Dan Graham (Teradata)
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Hadoop_and_the_Data_Warehouse_Whitepaper.pdf

Fictional company: CostCutter Utilities
- 10 million households
- 21.6 billion sensor readings per quarter
- Analyze this data together, in real-time
- 21.6 billion * 200 bytes = 4.32 trillion bytes, about 3.9 terabytes in binary units

NewSQL is a better fit
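The sizing arithmetic on this slide can be checked in a few lines. The input figures are the slide's own; the variable names are illustrative, and the per-household breakdown is derived here rather than stated in the deck.

```python
# Figures quoted on the slide for the fictional CostCutter Utilities.
HOUSEHOLDS = 10_000_000
READINGS_PER_QUARTER = 21_600_000_000
BYTES_PER_READING = 200

total_bytes = READINGS_PER_QUARTER * BYTES_PER_READING
readings_per_household = READINGS_PER_QUARTER // HOUSEHOLDS

print(f"total bytes per quarter: {total_bytes:,}")
print(f"decimal terabytes (10^12 bytes): {total_bytes / 10**12:.2f}")
print(f"binary terabytes (2^40 bytes):  {total_bytes / 2**40:.2f}")
print(f"readings per household per quarter: {readings_per_household}")
```

The 3.9 terabyte figure matches the binary-unit result. The per-household count works out to 2,160 readings a quarter, which is consistent with one reading per hour over 90 days.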

Architecture with NewSQL for Real-time Analytics
Real-time analytics on live operational data

[Diagram labels: NewSQL; customer data; metadata; users, files; commerce data; machine data; social data; log data; ETL; retire; processed data, insights; Hadoop; EDW]

NewSQL: Scale-out SQL
- Transactions (OLTP)
- Real-time analytics
- High availability (production)
- Geo-distributed OLTP (production)
- In-memory real-time analytics (add-on to production)
- In-memory OLTP
- ETL for analytics (add-on to production)
- Miscellaneous: DBShards, ScaleBase, … Auto-sharding, storage engines and other tools on top of legacy databases

ClustrixDB Horizontal Slicing vs. Sharding
- 4 active partition configuration
- Client or load balancer
- No single point of failure by design
- Single command to add/remove nodes
- Load evenly distributed across cluster on node loss
- All copies are consistent – no master-slave lag

Availability in Production
Is your database production ready??
- Five-nines availability allows about 26 seconds of downtime per month
- No human intervention – fix bug is possible
- Strict accounting: any downtime or slow time is counted, whether it is a database issue or a customer process issue
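The monthly downtime budget quoted on this slide follows directly from the availability target. A minimal check, assuming a 30-day month (the function name and default are illustrative):

```python
def downtime_budget_seconds(nines: int, days: float = 30.0) -> float:
    """Allowed downtime for an availability of N nines over a period of days."""
    unavailability = 10.0 ** (-nines)  # five nines -> 0.00001
    return days * 24 * 60 * 60 * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_seconds(n):.1f} seconds per 30-day month")
```

At five nines the budget is just under 26 seconds a month, which is why the slide rules out recovery procedures that depend on a person reacting.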

So, NewSQL scale-out SQL can deliver:
- Massive transaction volume at low cost
- Real-time analytics on real-time data
- High availability in the cloud

TRENDS
- Fast data ingest with in-memory
- Richer analytics
- More JSON

QUESTIONS

Joins: Data Distribution

Sharding: co-located indexes
- Node 1, TABLE USERS (id, name, rest): (2, John, …), (4, John, …), (6, Tom, …). INDEX NAME (name → id): John → 2, 4; Tom → 6
- Node 2, TABLE USERS (id, name, rest): (3, John, …), (5, Jake, …), (7, Gopi, …). INDEX NAME (name → id): John → 3; Jake → 5; Gopi → 7

Slicing (Clustrix): independently distributed indexes
- Node 1, TABLE USERS (id, name, rest): (2, John, …), (4, John, …), (6, Tom, …). INDEX NAME (name → id): John → 2, 4, 3
- Node 2, TABLE USERS (id, name, rest): (3, John, …), (5, Jake, …), (7, Gopi, …). INDEX NAME (name → id): Gopi → 7; Jake → 5; Tom → 6

Joins: In Action
What happens for a 10,000 x 100 join?

Sharding: joins are broadcasts
- TABLE PRODUCT (product, name): (2, John)
- The name must be looked up in every node's local index: INDEX NAME (John → 2, 4; Tom → 6) and INDEX NAME (Gopi → 7; Jake → 5; John → 3)
- Terribly slow and not scalable
- Schema has to be designed based on joins

Slicing: joins are scalable
- TABLE PRODUCT (product, name): (2, John)
- The name is looked up in the one index slice that holds it: INDEX NAME (John → 2, 4, 3) or INDEX NAME (Gopi → 7; Jake → 5; Tom → 6)
- Scales
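The difference between the two index layouts in these join slides can be sketched with the sample USERS rows. This is a toy model, not ClustrixDB's implementation: the two-node cluster, the placement functions `row_node` and `index_node`, and the assumption that id 4 belongs to John (read off the index entries) are all illustrative.

```python
NUM_NODES = 2

# Sample rows from the slides (id -> name).
USERS = {2: "John", 4: "John", 6: "Tom", 3: "John", 5: "Jake", 7: "Gopi"}


def row_node(user_id: int) -> int:
    """Hypothetical row placement: even ids on node 0, odd ids on node 1."""
    return user_id % NUM_NODES


def index_node(name: str) -> int:
    """Hypothetical index placement by name, independent of row placement."""
    return sum(name.encode()) % NUM_NODES


# Sharding: each node indexes only the rows it stores (co-located index).
local_indexes = [dict() for _ in range(NUM_NODES)]
# Slicing: index entries are distributed by the indexed value itself.
index_slices = [dict() for _ in range(NUM_NODES)]
for uid, name in USERS.items():
    local_indexes[row_node(uid)].setdefault(name, []).append(uid)
    index_slices[index_node(name)].setdefault(name, []).append(uid)


def find_sharded(name: str):
    """Lookup by name must be broadcast: any node may hold matching rows."""
    ids = []
    for index in local_indexes:
        ids.extend(index.get(name, []))
    return sorted(ids), len(local_indexes)  # (matches, index nodes contacted)


def find_sliced(name: str):
    """Lookup by name goes to the single slice that owns that name."""
    ids = index_slices[index_node(name)].get(name, [])
    return sorted(ids), 1  # (matches, index nodes contacted)


probes = ["John", "Tom", "Jake"]  # names coming from the other join input
print("sharded index lookups:", sum(find_sharded(n)[1] for n in probes))
print("sliced index lookups: ", sum(find_sliced(n)[1] for n in probes))
```

In the sharded layout every probe row costs one index lookup per node, so the work grows with cluster size. In the sliced layout each probe costs one index lookup regardless of how many nodes there are.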

NewSQL Revisited: VoltDB
- Data distribution is similar to ClustrixDB
- Fast OLTP: in-memory; reduced locking and latching
- Analytics: no MVCC, so reads will block writes or be non-ACID
- Plug-and-play compatibility
- Java stored procedures
- Tool ecosystem

[Diagram labels: S1, S2]

NewSQL Revisited: NuoDB
- Focus on OLTP and geo-distributed OLTP
- Data is moved to the node that needs it, in small pieces
- Data is moved back to storage nodes for commit
- Data (and ownership) is moved across nodes if other nodes need to use it

[Diagram labels: three transaction nodes; storage node; S1, S2]

NewSQL Revisited: MemSQL
- In-memory with MVCC
- Two-tier architecture and some restrictions
- Aggregators are cluster-aware; leaf nodes are not cluster-aware and hold shards
- Data is pulled to aggregator nodes for some queries
- Some queries are pushed down to leaf nodes
- JSON support
- Availability through DB-level master-slaving

[Diagram labels: two aggregators (cluster aware); four leaf nodes; S1]