How to Architect Big Data Apps with the Lambda Architecture

Presentation transcript:

How to Architect Big Data Apps with the Lambda Architecture. October 2014. Altan Khendup – Big Data Architect; Ron Bodkin – Founder. Think Big, a Teradata company. © 2014 Teradata

Real-Time. Low latency: query response, data refresh, end-to-end response … nanoseconds, milliseconds, seconds, or minutes depending on your problem. Two basic patterns: strategic insight (decision support) and process execution (system of engagement/operational analytics). Storm: 100’s of milliseconds to 5 seconds. Copyright 2013-2014 Think Big, a Teradata Company

Real-time Demand Growing. Many users are looking to gain valuable insights from both batch and real-time systems. User characteristics: they do not always understand the complexities of tackling this challenge; they want to use familiar, easy-to-use interfaces wherever possible; they want best practices for integrating real-time (current) and batch (historical) data; and they are often not aware of all the options and the trade-offs among them.

Enter Lambda Architecture. It provides a common architectural pattern for discussion and a clearer picture of the complexities typically found in most organizations. Some challenges in tackling the Lambda architecture: a complete Lambda implementation requires more than a single system and typically multiple components, e.g. batch/cold storage via Hadoop, real-time/current data via Storm, and query via business analysis using a database. There are also challenges in delivering results to the business: coordination across the stack is very difficult; delivering quality results back to the organization is very important; it takes a lot of knowledge, expertise, and technology to tackle; and it is not typically a first step in a Big Data implementation.

Background of Lambda Architecture. A reference architecture for Big Data systems, designed by Nathan Marz (Twitter). Defined as a system that runs arbitrary functions on arbitrary data: “query = function(all data)”. Design principles: human fault tolerance, immutability, computability. Lambda layers: Batch contains the immutable, constantly growing master dataset. Speed deals only with new data and compensates for the high-latency updates of the serving layer. Serving loads and exposes the combined view of data so that it can be queried. Lambda = an architectural pattern for talking about the complexity of dealing with real-time and historical datasets. Overall use: prescriptive/predictive uses rely on some dimension of real-time. Use cases: CPG (consumer goods companies looking at what customers are doing in real time and making adjustments); Medical (real-time medical sensors, treatment, and labs for critical patient care); Financial (credit risk and transaction fraud); Manufacturers (IoT/telematics: getting information from their plants and logistics, cross-referencing it to inventory, and making adjustments to the supply chain).
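The three layers described on this slide can be sketched in a few lines. This is a minimal illustration, not the speakers' implementation: the page-view events, the variable names, and the counting query are all assumptions chosen to make “query = function(all data)” concrete.

```python
from collections import Counter

# Hypothetical (user_id, page) events; names and data are illustrative only.
master_dataset = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]  # batch layer input: immutable, append-only
recent_events = [("u3", "/home"), ("u1", "/cart")]                    # arrived after the last batch run


def batch_view(all_data):
    """Batch layer: recompute the view from the whole master dataset."""
    return Counter(page for _, page in all_data)


def speed_view(new_data):
    """Speed layer: cover only the data the batch view has not seen yet."""
    return Counter(page for _, page in new_data)


def query(page, batch, speed):
    """Serving: answer a query by combining the batch and real-time views."""
    return batch[page] + speed[page]


batch = batch_view(master_dataset)
speed = speed_view(recent_events)
home_views = query("/home", batch, speed)
```

Because the master dataset is never mutated, a bug in `batch_view` can be fixed and the view recomputed from scratch, which is the sense in which the design is tolerant of human error.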

Overview of Lambda Architecture. A general architecture that covers how Lambda works overall and is able to address real-time and historical data. Layers: Speed (real-time/current data streams; Spark, Storm, etc.); Batch (historical data layer); Serving (takes the current and historical data, merges the results, and provides them to the organization). Real-world experience/strategy: do not tackle all of the data, but rather the necessary segments of business functionality, called queries. Data can be tackled per query, hence the idea of “query focused datasets” or QFDs. This allows for more focused results and faster speed gains.

USE CASE - MEDICAL

Challenges in Medical Data. Every year, more than a million people from all 50 states and nearly 150 countries come for care. Health data tends to be “wide”, not “deep”. New data types are becoming more important: unstructured data and real-time streaming. It is generally a challenge to move from retrospective “BI” viewing to event-based and predictive analytics usage.

Optimize an existing Natural Language Processing pipeline in support of critical Colorectal Surgery (move to tens of thousands of documents processed). Replace an existing free-text search facility used by Clinical Web Service for colorectal cancer (move search to milliseconds).

Overall Architecture

Operational Statistics. Current Storm throughput is up to 1.5 million documents per hour. An average of 140,000 HL7 messages are actually processed per day, with an average latency of 60 milliseconds from ingest to persistence. An average of 50,000 documents pass through annotators per day, versus 5,000 historically. Actual annotation of documents is up to 6 times faster than previously accomplished. Free-text search use cases that took over 30 minutes on the old infrastructure complete in milliseconds in ElasticSearch. The HL7 volume actually processed is driven by “pull” requests from users, not by actual processing power. HL7 messages are large XML-based documents, much larger than, say, JSON or others (roughly 800k-900k in size), and contain significant data related to medical information. End goal: an architecturally-driven, internally-owned technology stack that blends an event-based processing fabric, a real-time processing framework, a multi-destination distillation hub, “classic” BI delivery techniques, “services-based” delivery techniques, and a “serendipitous” discovery environment. These are mutually supportive components that combine in delivering novel clinical solutions.

Implementing Lambda. Challenges: multiple layers; lots of events and data; complex; lots of different languages and data structures; difficult to maintain; lots of moving pieces/components/technologies; lots of changes for the business. Need for a practical Lambda approach: based on real-world implementations; metadata model (events and data); discrete data (query focused datasets); data convergence (holistic query focused dataset).

Active Executor Lambda Framework. The serving layer coordinates bringing this data together and creating a holistic view of the data. Teradata understands some form of event and the corresponding coordination of events to bring the data across the layers to the serving layer. Components: a general metadata model for data lineage and transformations; a merge of the data into a holistic dataset so that it can be served to consumers; a context component that allows events, data, and requests to be held together; a rules engine that allows for determinations based on sensing patterns; workflow/dataflow for execution of the necessary processing on data. Save on constant re-computation: data is snapshotted/versioned; calculations are done on these versions; the framework can work with various data structures and Hadoop components; full re-computation can be deferred and used to verify or replace specific snapshots.
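The idea of saving on re-computation with snapshotted/versioned data can be sketched as follows. This is a hypothetical illustration only: `SnapshotStore`, its methods, and the running-total calculation are assumptions, and the sketch says nothing about how the Active Executor framework is actually built.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotStore:
    """Hypothetical store keeping a versioned running total over appended batches."""
    events: list = field(default_factory=list)     # full record of every value seen
    snapshots: list = field(default_factory=list)  # snapshots[i] = total as of version i

    def append(self, batch):
        """Derive the next version from the previous snapshot plus the new batch only."""
        previous = self.snapshots[-1] if self.snapshots else 0
        self.events.extend(batch)
        self.snapshots.append(previous + sum(batch))
        return len(self.snapshots) - 1  # version number of the new snapshot

    def verify_latest(self):
        """Deferred full re-computation: check the latest snapshot and replace it if wrong."""
        if not self.snapshots:
            return True
        full = sum(self.events)
        matches = self.snapshots[-1] == full
        self.snapshots[-1] = full
        return matches


store = SnapshotStore()
v0 = store.append([1, 2, 3])  # version 0 computed from the first batch
v1 = store.append([4])        # version 1 reuses version 0 instead of rescanning everything
```

The incremental path touches only new data, while `verify_latest` can run at a convenient time as a correctness check on the cheaper path.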

Real Time and Lambda

Simpler Instantiations of Lambda. KISS: real-time isn’t free! 1 hour vs. 5 min vs. seconds, and it may not be meaningful anyhow. Is there a robot or a human in the loop? Simpler instantiations: micro-batch feeds & real-time queries; embarrassingly parallel speed layer; transient speed layer; … one database for speed & serving (RDBMS or NoSQL).

Use Case: Cross-Channel Behavior Analytics. Understanding consumer purchase behavior across more than one touch point to drive holistic results. Each channel for consumer marketing and engagement has siloed applications and analytic tools. Correlating behavior across channels (e.g., web, mobile, call center, in store, email, social) to understand customer journeys allows better engagement. Common goals: increased response rates, increased share of wallet, reduced churn, focus on high-value customers, increased customer satisfaction. Challenges: data volumes, correlation/sessionization, feature discovery.
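Sessionization, named among the challenges above, is commonly done by splitting each customer's events wherever the gap between consecutive events exceeds a timeout. The sketch below assumes events are (user_id, timestamp_in_seconds, channel) tuples and uses a 30-minute gap; both the record shape and the threshold are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Group cross-channel events into per-user sessions split on inactivity gaps."""
    by_user = defaultdict(list)
    for user, ts, channel in events:
        by_user[user].append((ts, channel))

    sessions = {}
    for user, items in by_user.items():
        items.sort()  # order each user's events by timestamp
        user_sessions = [[items[0]]]
        for prev, cur in zip(items, items[1:]):
            if cur[0] - prev[0] > gap_seconds:
                user_sessions.append([cur])    # gap too long: start a new session
            else:
                user_sessions[-1].append(cur)  # same journey, possibly another channel
        sessions[user] = user_sessions
    return sessions


events = [
    ("u1", 0, "web"),
    ("u1", 600, "mobile"),
    ("u1", 5000, "email"),
    ("u2", 100, "web"),
]
sessions = sessionize(events)
```

Here the web and mobile events for `u1` fall into one session, which is the kind of cross-channel correlation the slide describes.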

Real-Time Queries Pattern. [Diagram: Events (web server…), Queue (Kafka etc.), Micro-batch (Hadoop), Query/Serving (HBase/Teradata/Hive…)] Many analytics use cases can be handled with update latencies of a few minutes. Micro-batching allows for dramatic efficiency improvements … and can extend to updates per event with additional infrastructure. Pre-aggregation (HBase, MPP, etc.) can serve many users. Hadoop query (Hive 0.13+ / Tez, Impala etc.) is emerging.
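A rough sketch of the micro-batch and pre-aggregation steps follows. It assumes events are (timestamp_in_seconds, key) tuples drained from a queue and uses a plain dict as a stand-in for the serving table; the function names and the 5-minute window are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and pre-aggregate counts per key."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


def apply_to_serving(serving, batch_counts):
    """Fold one pre-aggregated micro-batch into the serving table (a dict stand-in)."""
    for key, count in batch_counts.items():
        serving[key] = serving.get(key, 0) + count


events = [(0, "a"), (10, "a"), (299, "b"), (300, "a")]
batches = micro_batches(events, window_seconds=300)

serving = {}
for _, counts in batches.items():
    apply_to_serving(serving, counts)
```

Writing one aggregated row per key per window, rather than one write per event, is where the efficiency gain mentioned on the slide would come from in this sketch.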

Use Case: Recommendations. Recommendations rely on recent activity (purchases, content viewed, product interest, support issues), trends/fashion, and long-term propensity (relationship history, micro-segments, social…). The opportunity is to integrate deep insight into behavior and the social graph, building product recommendations/person/next best offer that’s maximally effective. All A/B tested.

Embarrassingly Parallel Speed Layer Pattern. [Diagram: Events (web server…), Queue (Kafka etc.), Micro-batch (Hadoop), NoSQL/Speed (HBase/Mongo…)] Many operational use cases can be distributed across an app server farm. Batch-computed views are pushed to NoSQL. Read NoSQL, update, respond & write to NoSQL can be done quickly. No need for streaming analytics/computation.
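The read, update, respond, and write cycle on this slide can be sketched per request. `KeyValueStore` below is a hypothetical in-memory stand-in for a NoSQL store such as HBase or MongoDB, not a real client API, and the profile fields are invented for illustration. The sketch ignores concurrent writers; a real store would need atomic or conditional updates.

```python
class KeyValueStore:
    """Hypothetical in-memory stand-in for a NoSQL key-value store."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


def load_batch_view(store, batch_view):
    """Push batch-computed profiles into the store, one record per user."""
    for user_id, profile in batch_view.items():
        store.put(user_id, dict(profile))


def handle_event(store, user_id, item, keep=3):
    """Read the user's record, update it with the new event, respond, and write it back."""
    profile = store.get(user_id, {"segment": "unknown", "recent": []})
    recent = (profile["recent"] + [item])[-keep:]  # keep only the latest few items
    updated = {**profile, "recent": recent}
    response = {"user": user_id, "segment": updated["segment"], "recent": recent}
    store.put(user_id, updated)
    return response


store = KeyValueStore()
load_batch_view(store, {"u1": {"segment": "high_value", "recent": []}})
first = handle_event(store, "u1", "shoes")
```

Each request touches a single key, so any app server in the farm can handle it independently, which is why no streaming computation is needed in this pattern.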

Conclusions. There are many kinds of real-time problems. No one Big Data technology solves all the problems. Lambda architecture provides a powerful way to solve the more sophisticated ones. There are simpler approaches for simpler problems, which may be a step towards Lambda.

We’re Hiring! thinkbig.teradata.com Booth #324

Thank you! Altan Khendup (@madmongol), Ron Bodkin (@ronbodkin)