Tyson Condie.

Presentation transcript:

Tyson Condie

Data is Everywhere: easier and cheaper than ever to collect; data grows faster than Moore's law.
Thanks to Hadoop, it is now easier and cheaper than ever to collect data. The data we collect is not only massive but is also projected to grow exponentially. According to an IDC report, the data we produce is expected to grow faster than Moore's law. (IDC report*)

The New Gold Rush: everyone wants to extract value from data, big companies and startups alike. Huge potential, already demonstrated by Google, Facebook, …, but untapped by most organizations: “We have lots of data but no one is looking at it!”
Everyone collects data with one goal in mind: extract value from it. However, there is a big gap between this aspirational goal and reality. On one hand, companies like Google, Facebook, and others have demonstrated that there can be huge value in the data. On the other hand, most companies do little with their data, if anything, or at least not as much as they would like.

Extracting Value from Data is Hard: data is massive, unstructured, and dirty. Questions are complex, e.g., predict the future. Processing and analysis tools are still in their “infancy”. We need tools that are faster, more sophisticated, and easier to use.
This is because it is fundamentally hard to extract value from data. Data is massive, …

Turning Data into Value. Insights and diagnosis, e.g.: Why is user engagement dropping? Why is the system slow? Detect spam, DDoS attacks. Decisions, e.g.: what feature to add to a product; personalized medical treatment; what ads to show; what actors to cast for “House of Cards”. Data is only as useful as the decisions it enables.
Let's be more concrete about what people mean by turning data into value. First, they use it to generate reports to track and better understand business processes and transactions. Second, they use it to diagnose problems and answer questions such as: Why is user engagement dropping? Why is the system slow? They also use it to detect spam, worms, or DDoS attacks. But most importantly, they use it to make decisions, such as improving a business process, deciding what features to add to the product, deciding what ad to show, or, once spam is identified, blocking it. Thus, the development of the BDAS stack is driven by the belief that “data is only as useful as the decisions you can take based on that data”.

What do We Need? Interactive queries: enable human-in-the-loop decisions (Big Data workbench, explore data in real time). Streaming queries: enable automated real-time decisions, e.g., fraud detection, detecting DDoS attacks. Sophisticated data processing: enable “better” decisions, e.g., anomaly detection, trend analysis.
So what does this mean? It means that we want low response time on historical data, since the faster we can make a decision the better. We want the ability to run queries on live data, since decisions based on real-time data are better than decisions based on stale data. Finally, we want to perform sophisticated processing on massive data since, in principle, processing more data leads to better decisions.

The Need For Unification. Today's state-of-the-art analytics stack: data (e.g., logs) feeds three separate stacks: batch (ad-hoc queries on historical data), interactive queries (interactive queries on historical data), and streaming (real-time analytics).
Challenge 1: need to maintain three stacks. Expensive and complex. Hard to compute consistent metrics across stacks.

The Need For Unification. Today's state-of-the-art analytics stack: data (e.g., logs) feeds three separate stacks: batch (ad-hoc queries on historical data), interactive queries (interactive queries on historical data), and streaming (real-time analytics).
Challenge 2: hard and slow to share data across stacks, e.g., hard to perform interactive queries on streamed data.

Our Goal: Unified Big Data runtime. Batch, streaming, interactive: a single framework! Support batch, streaming, and interactive computations in a unified framework. Easy to develop sophisticated algorithms (e.g., graph and ML algorithms).

Resource Managers: Cloud Operating System. Manage the resources of a machine cluster (cloud). Tenants coordinate with the resource manager (RM) to allocate resources for running tasks; e.g., a MapReduce job would obtain resources from the RM to execute its map/reduce tasks. A few alternative designs: Apache YARN (the resource manager introduced in Hadoop version 2), Apache Mesos, Google Omega, Facebook Corona. Goal: broaden the scope of Big Data applications.
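The tenant/RM interaction described on this slide can be sketched in a few lines. This is only a toy model under stated assumptions: `ToyResourceManager`, `request`, `release`, and the tenant names are hypothetical and are not the YARN, Mesos, Omega, or Corona APIs. It shows nothing more than tenants asking a central manager for slots and handing them back when their tasks finish.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster resource manager (not a real API)."""
    free_slots: int
    granted: dict = field(default_factory=dict)  # tenant name -> slots held

    def request(self, tenant: str, slots: int) -> int:
        """Grant up to `slots` slots; return how many were actually granted."""
        got = min(slots, self.free_slots)
        self.free_slots -= got
        self.granted[tenant] = self.granted.get(tenant, 0) + got
        return got

    def release(self, tenant: str, slots: int) -> None:
        """Return slots to the shared pool once the tenant's tasks finish."""
        back = min(slots, self.granted.get(tenant, 0))
        self.granted[tenant] = self.granted.get(tenant, 0) - back
        self.free_slots += back


rm = ToyResourceManager(free_slots=10)
map_slots = rm.request("mapreduce-job", 6)      # the batch tenant asks first
stream_slots = rm.request("streaming-job", 6)   # gets only what is left
rm.release("mapreduce-job", map_slots)          # finished tasks free their slots
```

A real resource manager also handles scheduling policy, locality, and failures; the sketch leaves all of that out on purpose.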

The Challenge: !?!?!?! Stack: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning on top of YARN / HDFS.

The Challenge: fault tolerance, high-throughput networking. Stack: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning on top of YARN / HDFS.

The Challenge: load spikes, elastic resource needs. Stack: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning on top of YARN / HDFS.

The Challenge: user-friendly toolkits, low-latency networking. Stack: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning on top of YARN / HDFS.

The Challenge: complex functions/data, iterative dataflow. Stack: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning on top of YARN / HDFS.

REEF: Retainable Evaluator Execution Framework. Stack: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning on top of REEF, on top of YARN / HDFS.

Unified Big Data Runtime Stack, top to bottom: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning; Domain Specific Language (DSL); Physical Data Parallel Operators; REEF; YARN / HDFS.

REEF: http://reef-project.org
Centralized control plane for building a distributed data plane.
Services: Storage (big buffer manager, operator access methods); Network (message passing, e.g., sending statistics; bulk transfers, e.g., large-scale shuffle); State Management (checkpoints, data lineage).
Job Driver: user code executed on YARN's Application Master (control plane).
Task: user code executed within an Evaluator (data plane).
Evaluator: execution environment for Tasks. One Evaluator is bound to one YARN Container.
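The relationship between Driver, Evaluator, and Task can be illustrated with a toy model. This is a sketch under stated assumptions, not the REEF API: the class names mirror the slide's vocabulary, but `Driver.submit`, `Evaluator.run`, the `state` dictionary, and the two example tasks are hypothetical. It also assumes, as the name "Retainable Evaluator" suggests, that an Evaluator outlives a single Task, so state left by one Task is visible to the next.

```python
from typing import Callable, List


class Evaluator:
    """Toy execution environment for Tasks, standing in for one container."""

    def __init__(self, container_id: str) -> None:
        self.container_id = container_id
        self.state: dict = {}  # kept between Tasks because the Evaluator is retained

    def run(self, task: Callable[[dict], object]) -> object:
        # Data plane: user code runs inside the Evaluator.
        return task(self.state)


class Driver:
    """Toy control plane: decides which Task runs on which Evaluator."""

    def __init__(self, container_ids: List[str]) -> None:
        self.evaluators = [Evaluator(cid) for cid in container_ids]

    def submit(self, task: Callable[[dict], object]) -> list:
        # Run the same Task on every retained Evaluator and collect results.
        return [ev.run(task) for ev in self.evaluators]


def load_task(state: dict) -> int:
    """Hypothetical first Task: cache some data in the Evaluator."""
    state["data"] = [1, 2, 3]
    return len(state["data"])


def sum_task(state: dict) -> int:
    """Hypothetical second Task: reuse the cached data without reloading."""
    return sum(state["data"])


driver = Driver(["container-1", "container-2"])
loaded = driver.submit(load_task)
totals = driver.submit(sum_task)
```

The point of the sketch is the split of responsibilities: the Driver only makes placement decisions, while data stays inside the Evaluators between Tasks.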

Summary. Everyone collects but few extract value from data. Unification of computation and programming models (batch, interactive, streaming) to efficiently analyze data and make sophisticated, real-time decisions. REEF provides OS functionalities used to develop higher-level Big Data applications. The long-term goal is to unify the batch, interactive, and streaming computation models and to provide domain-specific toolkits to data scientists.

Scalable Analytics Institute http://scai.cs.ucla.edu

ScAI Projects: Big Data systems; graph-based analytics; language design for Big Data and data streams; mining high-dimensional data; user and quality modeling in Big Data.