© 2014 MapR Technologies 1 Ted Dunning. © 2014 MapR Technologies 2 Me, Us Ted Dunning, MapR Chief Application Architect, Apache Member –Committer PMC.

Presentation transcript:

© 2014 MapR Technologies 1 Ted Dunning

© 2014 MapR Technologies 2 Me, Us
Ted Dunning, MapR Chief Application Architect, Apache Member
–Committer, PMC member: Zookeeper, Drill, others
–Mentor for Flink, Beam (née Dataflow), Drill, Storm, Zeppelin
–VP Incubator
–Bought the beer at the first HUG
MapR
–Produces first converged platform for big and fast data
–Includes data platform (files, streams, tables) + open source
–Adds major technology for performance, HA, industry standard APIs

© 2014 MapR Technologies 3 New book on Apache Flink Download free pdf courtesy of MapR Technologies mapr.com/flink-book

© 2014 MapR Technologies 4 Agenda
Why streaming first architecture
What does fast mean?
How do I make something fast?
Minor pause for reality check
First steps … heavy bottlenecks
Real results
Deeper insights

© 2014 MapR Technologies 5 Is this really a revolutionary moment?

© 2014 MapR Technologies 6 Scenario: Profile Database

© 2014 MapR Technologies 7 The task

© 2014 MapR Technologies 8 Traditional Solution

© 2014 MapR Technologies 9 What Happens Next?

© 2014 MapR Technologies 10 What Happens Next?

© 2014 MapR Technologies 11 How to Get Service Isolation

© 2014 MapR Technologies 12 New Uses of Data

© 2014 MapR Technologies 13 Scaling Through Isolation

© 2014 MapR Technologies 14 For this to work (socially), streaming has to be faster than almost any requirement

© 2014 MapR Technologies 15 So how do we make something go really fast?

© 2014 MapR Technologies 16

© 2014 MapR Technologies 17

© 2014 MapR Technologies 18 Well, perhaps not quite so simple?

© 2014 MapR Technologies 19 Recommendations

© 2014 MapR Technologies 20 User Generated Content

© 2014 MapR Technologies 21 Yahoo Streaming Benchmark

© 2014 MapR Technologies 22

© 2014 MapR Technologies 23

© 2014 MapR Technologies 24

© 2014 MapR Technologies 25 What we do at MapR

© 2014 MapR Technologies 26 Evolution of Data Storage
(Chart labels: Functionality, Compatibility, Scalability; Linux, POSIX)
Over decades of progress, Unix-based systems have set the standard for compatibility and functionality

© 2014 MapR Technologies 27 Evolution of Data Storage
(Chart labels: Functionality, Compatibility, Scalability; Linux, POSIX, Hadoop)
Hadoop achieves much higher scalability by trading away essentially all of this compatibility

© 2014 MapR Technologies 28 Evolution of Data Storage
(Chart labels: Functionality, Compatibility, Scalability; Linux, POSIX, Hadoop)
MapR enhanced Apache Hadoop by restoring the compatibility while increasing scalability and performance

© 2014 MapR Technologies 29 Evolution of Data Storage
(Chart labels: Functionality, Compatibility, Scalability; Linux, POSIX, Hadoop)
Adding converged tables and streams enhances the functionality of the base file system

© 2014 MapR Technologies 30

© 2014 MapR Technologies 31 Key Ideas
Convergence of files, tables, streams into single platform
–All forms of persistence share common implementation base
Very high abstraction from hardware … no need to provision clusters for tables and files
–Common disaster recovery, security, availability models for files, directories, tables and streams
Very high performance levels

© 2014 MapR Technologies 32 Key Issues
MapR itself is heavily threaded internally (as many as 50k threads/core)
MapR client can have multiple internal threads
Ordering boundaries require serialization, locks or memory contention
–At client level and also within single stream/topic/partition
Replication, splitting, data location completely automated by default, explicit control available
MapR Streams and Flink are in same cluster, but some shuffles still required

© 2014 MapR Technologies 33 Initial Configuration
10 nodes in cluster
1 Flink task manager / node
72 partitions in impressions stream
Each task manager spawns 72 generator threads
(Diagram: 10x72 threads, 72 partitions)
At full speed, partition insert points wander around cluster to avoid hot-spotting
MapR client connection shared by all threads in task manager. Having more client connections could help
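The fan-in implied by this configuration can be worked out directly. The sketch below only restates the slide's numbers; the variable names are illustrative, and the even spread of threads over partitions is an assumption that the slide does not state.

```python
# Figures taken from the slide; the names are illustrative only.
NODES = 10
TASK_MANAGERS_PER_NODE = 1
GENERATOR_THREADS_PER_TASK_MANAGER = 72
PARTITIONS = 72

# Total generator threads in the cluster (the "10x72 threads" on the slide).
total_threads = NODES * TASK_MANAGERS_PER_NODE * GENERATOR_THREADS_PER_TASK_MANAGER

# One client connection is shared by every thread in a task manager,
# so this many threads compete for each connection.
threads_per_client_connection = GENERATOR_THREADS_PER_TASK_MANAGER

# Assumption: threads are spread evenly over partitions.
threads_per_partition = total_threads / PARTITIONS

print(total_threads, threads_per_client_connection, threads_per_partition)
```

Under these assumptions, 720 threads write into 72 partitions, and 72 threads share each client connection.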

© 2014 MapR Technologies 34 Tuning #1
Large number of threads and single client connection per node caused massive contention at serialization point inside client
Switched to 3 Flink task managers per node
2 task managers each run 1 producer thread
–More data pushed by 1 thread than previously sent by 72
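One way to see why a single producer thread can be enough is to picture an asynchronous client that accepts records without blocking and batches them in the background. The sketch below is a minimal illustration of that pattern, not the MapR client API: `AsyncBatchingClient`, `produce`, and the batch size are all hypothetical, and the "network send" is replaced by appending to a list.

```python
import queue
import threading


class AsyncBatchingClient:
    """Hypothetical stand-in for an asynchronous stream client (not the MapR API)."""

    _STOP = object()  # sentinel that tells the sender thread to finish

    def __init__(self, batch_size=100):
        self._queue = queue.Queue()
        self._batch_size = batch_size
        self.batches = []  # stands in for batches sent over the network
        self._sender = threading.Thread(target=self._run)
        self._sender.start()

    def send(self, record):
        # Returns immediately; the caller never waits on I/O.
        self._queue.put(record)

    def close(self):
        # Flush whatever is queued, then stop the background sender.
        self._queue.put(self._STOP)
        self._sender.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self.batches.append(batch)
                batch = []
        if batch:
            self.batches.append(batch)


def produce(n_records, batch_size=100):
    """A single producer thread pushing n_records through the async client."""
    client = AsyncBatchingClient(batch_size)
    for i in range(n_records):
        client.send(i)
    client.close()
    return client.batches
```

With one producer and one background sender there is no lock fought over by dozens of threads, and record order is preserved because there is a single FIFO queue.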

© 2014 MapR Technologies 35 Tuning #2
Effective cluster-wide parallelism limited by 72 partitions in stream
Increasing to 300 partitions substantially improved performance
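The partition limit can be expressed as a simple bound. The sketch below assumes that each partition is read by at most one worker at a time (a common ordering rule in partitioned logs, assumed here rather than stated on the slide); the worker count of 400 is a hypothetical figure chosen only to exceed both partition counts.

```python
def effective_parallelism(workers, partitions):
    """Upper bound on concurrently busy workers when each partition
    has at most one reader at a time (an assumption of this sketch)."""
    return min(workers, partitions)


def partitions_per_node(partitions, nodes):
    """Average number of partitions each node has to serve."""
    return partitions / nodes


NODES = 10                  # from the Initial Configuration slide
HYPOTHETICAL_WORKERS = 400  # illustrative only

before = effective_parallelism(HYPOTHETICAL_WORKERS, 72)
after = effective_parallelism(HYPOTHETICAL_WORKERS, 300)
print(before, after, partitions_per_node(300, NODES))
```

With 300 partitions over 10 nodes, the average works out to 30 partitions per node.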

© 2014 MapR Technologies 36 The consumer
Initial tuning had 72 consumer threads per node
Final tuning used single consumer thread per Flink task manager

© 2014 MapR Technologies 37 The Shuffle / Group-by
Shuffles were also run by the single consumer task manager
Even with shuffle, consumer processes balanced producer processes
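A shuffle followed by a group-by can be sketched as hash routing plus a per-worker count. This is a generic illustration rather than Flink's implementation; the record shape (a key such as a campaign identifier paired with a value), the function names, and the CRC32 hash are assumptions made for the example.

```python
import zlib
from collections import Counter


def shuffle(records, n_workers):
    """Route each (key, value) record to a worker chosen by a stable hash,
    so that all records with the same key land on the same worker."""
    buckets = [[] for _ in range(n_workers)]
    for key, value in records:
        worker = zlib.crc32(key.encode("utf-8")) % n_workers
        buckets[worker].append((key, value))
    return buckets


def group_count(bucket):
    """Group-by on one worker: count the records seen for each key."""
    return Counter(key for key, _ in bucket)


# Hypothetical input: events keyed by campaign identifier.
events = [("campaign-%d" % (i % 5), i) for i in range(100)]
per_worker = [group_count(b) for b in shuffle(events, 3)]
```

Because routing depends only on the key, each worker's counts are complete for the keys it owns and no cross-worker merge is needed.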

© 2014 MapR Technologies 38 Tuning #3
In separate experiments, number of campaigns was increased to 1e6 from original 100
This caused bottleneck to shift massively to data export step
Serving results directly from Flink memory avoids this step
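The difference between exporting results and serving them from memory can be illustrated with a toy per-campaign counter. This is a sketch under stated assumptions, not how Flink stores or serves state: `CampaignCounts`, `export_rows`, and `query` are hypothetical names, and the export is modelled simply as the list of rows that would have to be written out.

```python
from collections import defaultdict


class CampaignCounts:
    """Toy in-memory state: one counter per campaign (hypothetical class)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def add(self, campaign_id):
        self._counts[campaign_id] += 1

    def export_rows(self):
        # Export path: every campaign becomes a row for an external store,
        # so the work grows with the number of campaigns.
        return list(self._counts.items())

    def query(self, campaign_id):
        # Serving path: answer a single lookup directly from memory.
        return self._counts.get(campaign_id, 0)


def rows_per_export(n_campaigns):
    state = CampaignCounts()
    for campaign_id in range(n_campaigns):
        state.add(campaign_id)
    return len(state.export_rows())
```

In this model the export writes one row per campaign, so moving from 100 to 1e6 campaigns multiplies the export work by 10,000, while a single `query` still touches one entry.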

© 2014 MapR Technologies 39 Final Comparisons
Final result for tuning was 250% improvement
No serious optimization was required, however

© 2014 MapR Technologies 40 The Moral
Default of 10 partitions per topic is fine for large-scale multi-tenancy, but special purpose applications may need tuning to higher levels (we ended up with 30 partitions per node)
Asynchronous client gives effective threading with small number of producer threads; large number of producer threads was counter-productive
Net speedup of 250% with tuning, so far
Gut feel is that there is ~4x more performance still to come

© 2014 MapR Technologies 43 Streaming Architecture by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly) Free signed hard copies at MapR booth at Flink Forward

© 2014 MapR Technologies 44 Short Books by Ted Dunning & Ellen Friedman Published by O’Reilly in For sale from Amazon or O’Reilly Free e-books currently available courtesy of MapR Download pdfs: mapr.com/ebooks-pdf

© 2014 MapR Technologies 45 Thank You!

© 2014 MapR Technologies 46 Q & A
Engage with us! maprtech, MapR, maprtech, mapr-technologies