Berkeley Data Analytics Stack (BDAS) Overview


Berkeley Data Analytics Stack (BDAS) Overview
Ion Stoica, UC Berkeley
March 7, 2013

What is Big Data Used For?

Reports, e.g.:
- Track business processes, transactions
Diagnosis, e.g.:
- Why is user engagement dropping? Why is the system slow?
- Detect spam, worms, viruses, DDoS attacks
Decisions, e.g.:
- Decide what feature to add
- Decide what ad to show
- Block worms, viruses, ...

People use big data, first, to generate reports that track and explain business processes and transactions. Second, they use it for diagnosis: answering questions such as "why is user engagement dropping?" and "why is the system slow?", or detecting spam, worms, and DDoS attacks. Most importantly, they use it to make decisions: improving a business process, deciding what features to add to a product, deciding which ad to show, or blocking spam once it is identified. The development of the BDAS stack is driven by the belief that data is only as useful as the decisions it enables.

Data Processing Goals

- Low-latency (interactive) queries on historical data: enable faster decisions. E.g., identify why a site is slow and fix it.
- Low-latency queries on live data (streaming): enable decisions on real-time data. E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 seconds).
- Sophisticated data processing: enable "better" decisions. E.g., anomaly detection, trend analysis.

What does this mean? We want low response time on historical data, since the faster we can make a decision the better. We want the ability to query live data, since decisions on real-time data beat decisions on stale data. Finally, we want sophisticated processing over massive data since, in principle, processing more data leads to better decisions.

Today's Open Analytics Stack...

...mostly focused on large on-disk datasets: great for batch, but slow.

Today's open analytics stack provides sophisticated processing on massive data, but it is slow. It consists of roughly four layers: Application, Data Processing, Storage, and Infrastructure. Except for Google, Microsoft, and to some extent Amazon, this is the de facto standard stack for data analysis today; it is used by Yahoo!, Facebook, and Twitter, to name a few.

Goals: Batch, Interactive, Streaming

One stack to rule them all!
- Easy to combine batch, streaming, and interactive computations
- Easy to develop sophisticated algorithms
- Compatible with existing open source ecosystem (Hadoop/HDFS)

Our Approach: Support Interactive and Streaming Computations

Aggressive use of memory. Why?
- Memory transfer rates >> disk or even SSDs, and the gap is growing, especially w.r.t. disk
- Many datasets already fit into memory: the inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory (e.g., 1 TB = 1 billion records @ 1 KB each)
- Memory density (still) grows with Moore's law; RAM/SSD hybrid memories are on the horizon

A high-end datacenter node today: 16 cores; 128-512 GB RAM at 40-60 GB/s; 1-4 TB of SSD at 1-4 GB/s (x4 disks); 10-30 TB of disk at 0.2-1 GB/s (x10 disks); 10 Gbps network.

Our Approach: Support Interactive and Streaming Computations

Increase parallelism. Why? Reducing work per node improves latency: a result that takes time T is returned in time Tnew < T.

Techniques:
- Low-latency parallel scheduler that achieves high locality
- Optimized parallel communication patterns (e.g., shuffle, broadcast)
- Efficient recovery from failures and straggler mitigation

Our Approach: Support Interactive and Streaming Computations

Trade between result accuracy and response time. Why?
- In-memory processing alone does not guarantee interactive query processing: e.g., it takes ~10s of seconds just to scan 512 GB of RAM!
- The gap between memory capacity (doubles every ~18 months) and memory transfer rate (doubles every ~36 months) keeps increasing

Challenge: accurately estimate error and running time for arbitrary computations.
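As a sanity check on the scan-time claim, here is a back-of-the-envelope calculation using the node parameters from the previous slide (512 GB of RAM, and an assumed mid-range transfer rate of 50 GB/s, the midpoint of the 40-60 GB/s figure):

```python
# Time to scan a full node's RAM at memory bandwidth.
ram_gb = 512          # RAM capacity from the node spec above
bandwidth_gb_s = 50   # assumed mid-range memory transfer rate (40-60 GB/s)
scan_seconds = ram_gb / bandwidth_gb_s
print(scan_seconds)   # 10.24 -- even a pure in-memory scan is not interactive
```

This is why in-memory processing alone is not enough for interactive queries, and why BlinkDB (later in the stack) trades accuracy for response time.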

Our Approach

- Easy to combine batch, streaming, and interactive computations: a single execution model that supports all computation models
- Easy to develop sophisticated algorithms: powerful Python and Scala shells; high-level abstractions for graph-based and ML algorithms
- Compatible with the existing open source ecosystem (Hadoop/HDFS): interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ...); support existing execution models (e.g., Hive, GraphLab)

Berkeley Data Analytics Stack (BDAS)

- Application: new apps such as AMP-Genomics (a genomics pipeline) and Carat. Building real applications drives the features and design of the lower layers.
- Data Processing: in-memory processing; trade between time, quality, and cost
- Storage (Data Management): efficient data sharing across frameworks
- Infrastructure (Resource Management): share infrastructure across frameworks (multi-programming for datacenters)

The Berkeley AMPLab

AMP = Algorithms, Machines, People. "Launched" January 2011 on a 6-year plan: 8 CS faculty, ~40 students, 3 software engineers. Organized for collaboration.

The Berkeley AMPLab

Funding: XData, the CISE Expedition Grant, industrial founding sponsors, and 18 other sponsors.

Goal: the next generation of the open source analytics stack for industry & academia: the Berkeley Data Analytics Stack (BDAS).

Berkeley Data Analytics Stack (BDAS)

The stack is organized into layers:
- Application
- Data Processing
- Data Management
- Resource Management (Infrastructure)

Berkeley Data Analytics Stack (BDAS): Existing Stack Components

- Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI, ...
- Data Management: HDFS
- Resource Management

Mesos [Released, v0.9]

Our instantiation of the resource management layer is Mesos:
- A management platform that allows multiple frameworks (Hadoop, HIVE, Pig, HBase, Storm, MPI, ...) to share a cluster
- Compatible with the existing open analytics stack
- Deployed in production at Twitter on 3,500+ servers

One Framework Per Cluster: Challenges

- Inefficient resource usage: e.g., Hadoop cannot use available resources from MPI's cluster; no opportunity for statistical multiplexing
- Hard to share data: copying or remote access is expensive
- Hard to cooperate: e.g., it is not easy for MPI to use data generated by Hadoop

Hence the need to run multiple frameworks on the same cluster.

Solution: Mesos

A common resource sharing layer that abstracts ("virtualizes") resources to frameworks, enabling diverse frameworks to share a cluster. Mesos takes us from uniprogramming (one framework per cluster) to multiprogramming (the entire cluster shared among diverse frameworks such as Hadoop and MPI).
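The sharing idea can be sketched with a toy model (hypothetical names, not the actual Mesos API): the sharing layer offers free CPUs to each framework in turn, and each framework accepts only what it currently needs, so one framework's idle capacity is multiplexed to the others.

```python
# Toy sketch of resource sharing between frameworks (illustrative only).
class Cluster:
    def __init__(self, cpus):
        self.free_cpus = cpus

    def offer(self, framework, cpus):
        """Offer up to `cpus` of the free CPUs; the framework accepts a subset."""
        granted = framework.accept(min(cpus, self.free_cpus))
        self.free_cpus -= granted
        return granted

class Framework:
    def __init__(self, demand):
        self.demand = demand          # CPUs this framework still wants

    def accept(self, offered):
        taken = min(offered, self.demand)
        self.demand -= taken
        return taken

cluster = Cluster(cpus=100)
hadoop, mpi = Framework(demand=70), Framework(demand=50)
a = cluster.offer(hadoop, 60)         # Hadoop takes 60 CPUs
b = cluster.offer(mpi, 40)            # MPI takes the remaining 40
print(a, b, cluster.free_cpus)        # 60 40 0 -- the cluster is fully used
```

With one framework per cluster, the 40 CPUs MPI uses here would have sat idle in Hadoop's dedicated cluster; sharing lets both run on the same 100 nodes.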

Dynamic Resource Sharing

A 100-node cluster experiment: the graph shows the number of CPUs allocated to each of four frameworks over a 25-minute macrobenchmark. Frameworks take advantage of the cluster when other frameworks have a lull in activity. For instance, around time 1050s the Torque framework finishes running its MPI job and, with no more MPI jobs queued, releases its share (until another MPI job is queued around time 1250s). The two Hadoop jobs take advantage of the resources Torque is not using, getting more work done and increasing the average fraction of the cluster allocated at any time.

Spark [Released, v0.7]

- In-memory framework for interactive and iterative computations
- Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
- Scala interface; Java and Python APIs

Our Solution: Resilient Distributed Datasets (RDDs)

- A partitioned collection of records
- Immutable; can be created only through deterministic operations on other RDDs
- The handle of each RDD stores its lineage: the sequence of operations that created the RDD
- Recovery: use the lineage information to rebuild the RDD

RDD Example

- Two-partition RDD A = {A1, A2} stored on disk
- Apply f() and cache: RDD B = A.f(), with partitions {B1, B2}
- Shuffle, and apply g(): RDD C = B.g(), with partitions {C1, C2}
- Aggregate using h(): D = C.h()

RDD Example (continued)

C1 is lost due to a node failure before h() is computed.

RDD Example (continued)

C1 is lost due to a node failure before h() is computed. Using the lineage, C1 is reconstructed, eventually on a different node, and D = C.h() completes.
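The recovery story above can be mimicked with a toy, pure-Python lineage model (illustrative only; this is not Spark's RDD implementation, and it handles only narrow, per-partition dependencies, whereas the g() step above involves a shuffle): each dataset remembers the operation and the parent that produced it, so a lost partition can be recomputed instead of being restored from a replica.

```python
# Toy lineage: each "RDD" stores its parent and the function that produced it.
class ToyRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of partitions (None = lost)
        self.parent, self.fn = parent, fn

    def map_partitions(self, fn):
        # A deterministic, per-partition transformation; records its lineage.
        return ToyRDD([fn(p) for p in self.partitions], parent=self, fn=fn)

    def recover(self, i):
        """Recompute lost partition i from the parent via the recorded lineage."""
        self.partitions[i] = self.fn(self.parent.partitions[i])

A = ToyRDD([[1, 2], [3, 4]])                         # two-partition input
B = A.map_partitions(lambda p: [x * 10 for x in p])  # B = A.f()
C = B.map_partitions(lambda p: [x + 1 for x in p])   # C = B.g()

C.partitions[0] = None      # simulate losing C1 in a node failure
C.recover(0)                # rebuild C1 from B1 using the lineage
print(C.partitions)         # [[11, 21], [31, 41]]
```

Because every RDD is immutable and produced deterministically, recomputation yields exactly the partition that was lost; no checkpoint or replica of C1 was needed.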

PageRank Performance [figure: 54 GB Wikipedia dataset]

Other Iterative Algorithms [figure: time per iteration (s), 100 GB datasets]

Spark Community

- 3,000 people attended online training in August
- 500+ meetup members
- 14 companies contributing
- spark-project.org

Spark Streaming [Alpha Release]

- Large-scale streaming computation
- Ensures exactly-once semantics
- Integrated with Spark: unifies batch, interactive, and streaming computations!

Existing Streaming Systems

Traditional streaming systems have an event-driven, record-at-a-time processing model:
- Each node has mutable state
- For each record, the node updates its state and sends new records downstream
- State is lost if a node dies!

As records arrive one at a time, each node's mutable state is updated and newly generated records are pushed to downstream nodes. Making this stateful stream processing fault-tolerant is challenging.

Spark: Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop the live data stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- The processed results of the RDD operations are returned in batches

Spark: Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs:
- Batch sizes as low as half a second; latency ~1 second
- Potential for combining batch processing and stream processing in the same system
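The chop-into-batches idea can be sketched in plain Python (illustrative only; this is not the Spark Streaming API): timestamped records from a live stream are grouped into fixed-interval micro-batches, and each batch is then processed as one small, deterministic batch job, here a simple count.

```python
# Toy discretized stream: group timestamped records into 1-second batches,
# then run an ordinary batch computation (a count) on each batch.
from itertools import groupby

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]  # (time, record)
interval = 1.0

def batches(records, interval):
    # Records arrive in time order, so consecutive records with the same
    # batch index group together.
    keyed = [(int(t // interval), rec) for t, rec in records]
    return {k: [r for _, r in grp] for k, grp in groupby(keyed, key=lambda kv: kv[0])}

counts = {k: len(v) for k, v in batches(stream, interval).items()}
print(counts)   # {0: 2, 1: 2, 2: 1}
```

Because each batch is an immutable, deterministically derived dataset (an RDD in real Spark Streaming), the lineage-based recovery shown earlier applies to streaming state as well.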

Performance

Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency. Tested with 100 streams of data on 100 EC2 instances with 4 cores each.

Comparison with Storm and S4

Higher throughput than Storm:
- Spark Streaming: 670k records/second/node
- Storm: 115k records/second/node
- Apache S4: 7.5k records/second/node

Spark Streaming offers similar speed while providing fault-tolerance and consistency guarantees that these systems lack.

Fast Fault Recovery

Recovers from faults/stragglers within 1 second.

Shark [Released, v0.2]

- HIVE over Spark: SQL-like interface (supports Hive 0.9)
- Up to 100x faster for in-memory data, and 5-10x faster for on-disk data, in tests on clusters of hundreds of nodes

Conviva Warehouse Queries (1.7 TB) [figure: per-query results 1.1, 0.8, 0.7, 1.0]

Spark & Shark are available now on EMR!

Tachyon [Alpha Release, this Spring]

- High-throughput, fault-tolerant in-memory storage
- Interface compatible with HDFS
- Support for Spark and Hadoop

BlinkDB [Alpha Release, this Spring]

- Large-scale approximate query engine
- Allows users to specify error or time bounds
- A preliminary prototype is being tested at Facebook
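The approximate-query idea can be illustrated with a uniform sample and a confidence bound (a sketch of the underlying statistical principle only, not BlinkDB's engine or API): answer the query on a small sample, and report an error bar alongside the estimate so the user can trade accuracy for response time.

```python
# Toy approximate query: estimate a mean from a sample with a ~95% error bound.
import random, statistics

random.seed(0)
table = [random.gauss(100, 15) for _ in range(100_000)]   # the "full table"

def approx_mean(rows, sample_size):
    sample = random.sample(rows, sample_size)
    mean = statistics.fmean(sample)
    # 1.96 standard errors gives a ~95% confidence interval on the mean.
    err = 1.96 * statistics.stdev(sample) / sample_size ** 0.5
    return mean, err

est, err = approx_mean(table, 10_000)    # scan only 10% of the rows
print(est, "+/-", err)
```

Scanning a 10% sample cuts the work by 10x while the error bound stays small (it shrinks as 1/sqrt(sample size)), which is why bounded-error answers can be returned in bounded time.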

SparkGraph [Alpha Release, this Spring]

- GraphLab API and toolkits on top of Spark
- Fault tolerance by leveraging Spark

MLbase [In development]

- Declarative approach to ML
- Develop scalable ML algorithms
- Make ML accessible to non-experts

Compatible with the Open Source Ecosystem

Support existing interfaces whenever possible:
- Hive interface and shell (Shark)
- GraphLab API (SparkGraph)
- HDFS API (Tachyon)
- Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos

Compatible with the Open Source Ecosystem

Use existing interfaces whenever possible:
- Accept inputs from Kafka, Flume, Twitter, TCP sockets, ...
- Support the Hive API
- Support the HDFS API, S3 API, and Hive metadata

Summary

A holistic approach to the next generation of Big Data challenges!
- Support interactive and streaming computations: in-memory, fault-tolerant storage abstraction; low-latency scheduling; ...
- Easy to combine batch, streaming, and interactive computations: the Spark execution engine supports all computation models
- Easy to develop sophisticated algorithms: Scala interface; APIs for Java, Python, Hive QL, ...; new frameworks targeted at graph-based and ML algorithms
- Compatible with the existing open source ecosystem
- Open source (Apache/BSD) and fully committed to releasing high-quality software: a three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)

Thanks!

Hive Architecture

Shark Architecture