Mixing Low Latency with Analytical Workloads for Customer Experience Management
Neil Ferguson, Development Lead, NICE Systems

Introduction
- Causata was founded in 2009 and developed Customer Experience Management software
- Based in Silicon Valley, but with its engineering team in London
- Causata was acquired by NICE Systems in August 2013

This talk will mostly focus on our HBase-based data platform:
- Challenges we encountered migrating from our proprietary data store
- Performance optimizations we have made
- General observations about using HBase in production

NICE / Causata Overview
- Two main use cases, and therefore two access patterns for our data
- Real-time offer management:
  - Involves predicting something about a customer based on their profile
  - For example, predicting whether somebody is a high-value customer when deciding whether to offer them a discount
  - Typically involves low-latency (< 50 ms) access to an individual customer’s profile

NICE / Causata Overview
- Analytics:
  - Involves getting a large set of profiles matching certain criteria
  - For example, finding all of the people who have spent more than $100 in the last month
  - Involves streaming access to large samples of data (typically millions of rows/sec per node)
  - Often ad hoc
- Deployed both on-premise and as SaaS

Some History
- Started building our platform around 4½ years ago
- Started on MySQL
  - Latency was too high when reading large profiles
  - Write throughput was too low with large data sets
- Built our own custom data store
  - Performed well (it was built for our specific needs)
  - But non-standard, with significant maintenance costs
- Moved to HBase last year
  - Industry standard; lower maintenance costs
  - Can perform well!

Our Data
- All data is stored as Events, each of which has:
  - A type (for example, “Product Purchase”)
  - A timestamp
  - An identifier (who the event belongs to)
  - A set of attributes, each of which has a type and value(s), for example:
    - “Product Price” -> …
    - “Product Category” -> “Shoes”, “Footwear”
- Only raw data is stored (not pre-aggregated)
- Typical sizes are tens of TB
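As a rough sketch, the event model above might be expressed as follows (class and field names are illustrative, not Causata’s actual code):

    import java.util.List;
    import java.util.Map;

    // One raw, non-aggregated event, as described on this slide.
    public final class Event {
        private final String type;        // e.g. "Product Purchase"
        private final long timestampMs;   // when the event occurred
        private final String profileId;   // who the event belongs to
        // Attributes are multi-valued, matching the "Product Category" example.
        private final Map<String, List<String>> attributes;

        public Event(String type, long timestampMs, String profileId,
                     Map<String, List<String>> attributes) {
            this.type = type;
            this.timestampMs = timestampMs;
            this.profileId = profileId;
            this.attributes = attributes;
        }

        public String type() { return type; }
        public long timestampMs() { return timestampMs; }
        public String profileId() { return profileId; }
        public Map<String, List<String>> attributes() { return attributes; }
    }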

Our Storage
- Event table (row-oriented):
  - Stores data clustered by user profile
  - Used for low-latency retrieval of individual profiles for offer management, and for bulk queries for analytics
- Index table (“column-oriented”):
  - Stores data clustered by attribute type
  - Used for bulk queries (scanning) for analytics
- Identity Graph:
  - Stores a graph of cross-channel identifiers for a user profile
  - Stored as an in-memory column family in the Event table
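The talk does not show the actual key schema, but one hypothetical layout for the two tables’ row keys is sketched below; the component ordering is the point, and a real scheme would also fix the width of, or delimit, the string components:

    import org.apache.hadoop.hbase.util.Bytes;

    public final class RowKeys {
        // Event table: profile id first keeps each profile's events
        // contiguous, so one profile is fetched with a single short scan.
        static byte[] eventKey(String profileId, long timestampMs) {
            return Bytes.add(Bytes.toBytes(profileId), Bytes.toBytes(timestampMs));
        }

        // Index table: attribute type first keeps all values of one
        // attribute contiguous, suiting column-style bulk scans.
        static byte[] indexKey(String attributeType, long timestampMs, String profileId) {
            return Bytes.add(Bytes.toBytes(attributeType),
                             Bytes.toBytes(timestampMs),
                             Bytes.toBytes(profileId));
        }
    }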

Maintaining Locality
- Data locality (with the HBase client) gives around a 60% throughput increase:
  - A single node can scan around 1.6 million rows/second with the Region Server on a separate machine
  - The same node can scan around 2.5 million rows/second with the Region Server on the local machine

Maintaining Locality
- Custom region splitter: ensures that (where possible) event tables and index tables are split at the same point
  - Tables are divided into buckets, and split at bucket boundaries
- Custom load balancer: ensures that index table data is balanced to the same Region Server as the corresponding event table data
- All upstream services are locality-aware
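A minimal sketch of bucket-aligned split points, assuming both tables prefix their row keys with a fixed-width two-byte bucket id (the prefix scheme is an assumption, not the talk’s exact mechanism):

    public final class BucketSplits {
        // Split points for numBuckets buckets, one per bucket boundary.
        // Creating the event and index tables with the same split points
        // keeps their region boundaries aligned, so the custom load
        // balancer only has to co-locate matching buckets.
        static byte[][] splitKeys(int numBuckets) {
            byte[][] splits = new byte[numBuckets - 1][];
            for (int b = 1; b < numBuckets; b++) {
                splits[b - 1] = new byte[] { (byte) (b >>> 8), (byte) b };
            }
            return splits;
        }
    }

Both tables would then be created with these split points (for example via HBaseAdmin.createTable(descriptor, splitKeys)).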

Querying Causata
For each customer who has spent more than $100, get product views in the last week from now:

    SELECT C.product_views_in_last_week
    FROM Customers C
    WHERE C.timestamp = now()
    AND total_spend > 100;

Querying Causata
For each customer who has spent more than $100, get product views in the last week from when they purchased something:

    SELECT C.product_views_in_last_week
    FROM Customers C, Product_Purchase P
    WHERE C.timestamp = P.timestamp
    AND C.profile_id = P.profile_id
    AND C.total_spend > 100;

Query Engine
- Raw data is stored in HBase; queries are typically performed against aggregated data
- Need to scan billions of rows, and aggregate on the fly
- Many parallel scans are performed:
  - Across machines (obviously)
  - Across regions (and therefore disks)
  - Across cores
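For illustration, a simplified region-parallel scan with the plain 0.94-era HBase client might look like the sketch below; the production engine aggregates on the fly rather than just counting, and also fans out across machines:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Pair;

    public final class ParallelScan {
        // Runs one scan per region in parallel. HTable is not thread-safe,
        // so each task opens its own instance.
        static long countRows(final Configuration conf, final String tableName,
                              int threads) throws Exception {
            HTable meta = new HTable(conf, tableName);
            Pair<byte[][], byte[][]> keys = meta.getStartEndKeys(); // one range per region
            meta.close();

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            for (int i = 0; i < keys.getFirst().length; i++) {
                final byte[] start = keys.getFirst()[i];
                final byte[] stop = keys.getSecond()[i];
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        HTable table = new HTable(conf, tableName);
                        Scan scan = new Scan(start, stop);
                        scan.setCaching(1000); // fetch rows in large batches per RPC
                        ResultScanner scanner = table.getScanner(scan);
                        long count = 0;
                        try {
                            for (Result r : scanner) count++; // a real engine aggregates here
                        } finally {
                            scanner.close();
                            table.close();
                        }
                        return count;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) total += f.get();
            pool.shutdown();
            return total;
        }
    }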

Parallelism
(Benchmark chart omitted from the transcript.) Setup: single Region Server, local client, all rows returned to the client, disk-bound workload (disk cache cleared before the test); ~1 billion rows scanned in total, ~15 bytes per row (on disk, compressed); 2 x 6-core Intel 2.67 GHz CPUs, 4 x 10k RPM SAS disks, 48 GB RAM.

Query Engine
- Queries can optionally skip uncompacted data (based on HFile timestamps)
  - Allows result freshness to be traded for performance
- Multiple columns are combined into one
  - Very important to understand exactly how data is stored in HBase
  - Trade-off between performance and ease of understanding
- Short-circuit reads are turned on
  - Available from 0.94
  - Make sure HDFS checksums are turned off (and HBase checksums are turned on)
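These settings normally live in hbase-site.xml (and hdfs-site.xml); a programmatic sketch of the two properties in question, using the standard Hadoop/HBase property names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public final class ReadPathConfig {
        static Configuration create() {
            Configuration conf = HBaseConfiguration.create();
            // Short-circuit reads: the Region Server reads local HDFS
            // blocks directly instead of going through the DataNode.
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            // HBase-level checksums (0.94+): HBase verifies its own
            // checksums, so the read path can skip HDFS checksum
            // verification and save one disk access per block.
            conf.setBoolean("hbase.regionserver.checksum.verify", true);
            return conf;
        }
    }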

Request Prioritization
- All requests to HBase go through a single thread pool
  - This allows requests to be prioritized according to their sensitivity to latency
- “Real-time” (latency-sensitive) requests are treated specially
- Real-time request latency is monitored continuously, and more resources are allocated if deadlines are not met
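A minimal sketch of one way to build such a pool around a priority queue (the continuous latency monitoring and resource re-allocation are omitted; names are illustrative):

    import java.util.concurrent.*;

    public final class PrioritizedExecutor {
        enum Priority { REAL_TIME, BULK } // REAL_TIME sorts first

        // Tasks carry a priority; the queue serves REAL_TIME ahead of BULK.
        static abstract class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
            final Priority priority;
            PrioritizedTask(Priority priority) { this.priority = priority; }
            public int compareTo(PrioritizedTask other) {
                return priority.compareTo(other.priority);
            }
        }

        // Submit work with execute() and PrioritizedTask instances; submit()
        // would wrap tasks in FutureTasks, which are not Comparable.
        static ThreadPoolExecutor create(int threads) {
            return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                    new PriorityBlockingQueue<Runnable>());
        }
    }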

Major Compactions
- Major compactions can stomp all over other requests, affecting real-time performance
- We disable automatic major compactions and schedule them for off-peak times
- Regions are compacted individually, allowing time-bounded incremental compaction
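A sketch of region-by-region compaction with the 0.94-era client, assuming automatic major compactions have been disabled by setting hbase.hregion.majorcompaction to 0; the per-run cap is an assumed policy, not the talk’s exact mechanism:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;

    public final class IncrementalCompactor {
        // Requests major compaction for up to maxRegions regions per run;
        // the rest are picked up on the next off-peak run. Note that
        // majorCompact() is asynchronous (it queues the request), so a
        // production scheduler would also track completion.
        static void compactSome(Configuration conf, String tableName, int maxRegions)
                throws IOException, InterruptedException {
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTable table = new HTable(conf, tableName);
            try {
                int issued = 0;
                for (HRegionInfo region : table.getRegionsInfo().keySet()) {
                    if (issued++ >= maxRegions) break;
                    admin.majorCompact(region.getRegionNameAsString());
                }
            } finally {
                table.close();
                admin.close();
            }
        }
    }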

HBase in Production
- HBase is relatively young, in database years
- Distributed computing is hard
- Writing a database is hard
- Writing a distributed database is very, very hard

HBase in Production
- HBase is operationally challenging
  - There are quite a few moving parts
  - Configuration is difficult
  - Error messages aren’t always clear (and are not always something that we need to worry about)
  - Monitoring is important
  - Operations folks need to be trained adequately
  - Choose your distro carefully
- HBase doesn’t cope very well if you try to throw too much at it
  - You may need to throttle the rate of insertion (though it can cope happily with tens of thousands of rows per second per region server)
  - Limit the size of insertion batches
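A crude sketch of throttled, size-limited batch insertion along these lines; batch size and pacing are assumptions to tune, not recommended values:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    public final class ThrottledWriter {
        private static final int BATCH_SIZE = 500;            // rows per batch (tune)
        private static final long MIN_BATCH_INTERVAL_MS = 50; // pacing (tune)

        static void write(Configuration conf, String tableName, List<Put> puts)
                throws IOException, InterruptedException {
            HTable table = new HTable(conf, tableName);
            try {
                List<Put> batch = new ArrayList<Put>(BATCH_SIZE);
                long lastFlush = 0;
                for (Put put : puts) {
                    batch.add(put);
                    if (batch.size() >= BATCH_SIZE) {
                        long wait = lastFlush + MIN_BATCH_INTERVAL_MS - System.currentTimeMillis();
                        if (wait > 0) Thread.sleep(wait); // crude rate limit
                        table.put(batch);
                        batch.clear();
                        lastFlush = System.currentTimeMillis();
                    }
                }
                if (!batch.isEmpty()) table.put(batch);
            } finally {
                table.close();
            }
        }
    }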

HBase in Production
- By choosing HBase, you are choosing consistency over availability
(CAP diagram: HBase sits on the consistency + partition-tolerance side of the triangle.)

HBase in Production
- Expect some of your data to be unavailable for periods of time
  - Failure detection is difficult!
  - Data is typically unavailable for between 1 and 15 minutes, depending on your configuration
- You may wish to buffer incoming data somewhere
- Consider the impact of unavailability on your users carefully

We’re Hiring!
- We’re hiring Java developers, Machine Learning developers, and a QA lead in London to work on our Big Data platform
- Email me, or come and talk to me

THANK YOU
neil.ferguson at nice dot com
Web: