Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Presentation transcript:

Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY

Data is Everywhere
- Easier and cheaper than ever to collect
- Data grows faster than Moore’s law (IDC report*)

The New Gold Rush
- Everyone wants to extract value from data
  - Big companies & startups alike
- Huge potential
  - Already demonstrated by Google, Facebook, …
- But untapped by most companies
  - “We have lots of data but no one is looking at it!”

Extracting Value from Data Is Hard
- Data is massive, unstructured, and dirty
- Questions are complex
- Processing and analysis tools are still in their “infancy”
- Need tools that are
  - Faster
  - More sophisticated
  - Easier to use

Turning Data into Value
- Insights, diagnosis, e.g.,
  - Why is user engagement dropping?
  - Why is the system slow?
  - Detect spam, DDoS attacks
- Decisions, e.g.,
  - Decide what feature to add to a product
  - Personalized medical treatment
  - Decide when to change an aircraft engine part
  - Decide what ads to show
- Data is only as useful as the decisions it enables

What Do We Need?
- Interactive queries: enable faster decisions
  - E.g., identify why a site is slow and fix it
- Queries on streaming data: enable decisions on real-time data
  - E.g., fraud detection, detect DDoS attacks
- Sophisticated data processing: enable “better” decisions
  - E.g., anomaly detection, trend analysis

Our Goal
- Support batch, streaming, and interactive computations in a unified framework
- Easy to develop sophisticated algorithms (e.g., graph, ML algorithms)
Diagram labels: Batch, Interactive, Streaming, Single Framework!

The Need For Unification
- Today’s state-of-the-art analytics stack
- Challenge 1: need to maintain three stacks
  - Expensive and complex
  - Hard to compute consistent metrics across stacks
Diagram: data (e.g., logs) feeds three separate stacks (Batch, Streaming, Interactive queries), serving ad-hoc queries on historical data, real-time analytics, and interactive queries on historical data.

The Need For Unification
- Today’s state-of-the-art analytics stack
- Challenge 2: hard/slow to share data, e.g.,
  - Hard to perform interactive queries on streamed data
Diagram: data (e.g., logs) feeds three separate stacks (Batch, Streaming, Interactive queries), serving ad-hoc queries on historical data, real-time analytics, and interactive queries on historical data.

Spark
- Unifies batch, streaming, and interactive computation
- Easy to build sophisticated applications
  - Supports iterative, graph-parallel algorithms
  - Powerful APIs in Scala, Python, Java
Diagram components: Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib
Diagram labels: Streaming; Batch, Interactive; Interactive; Data-parallel, Iterative; Sophisticated algos.

An Analogy
- First cellular phones
- Unified device (smartphone)
- Specialized devices
- Better phone, better GPS, better games

An Analogy
- Batch processing
- Unified system
- Specialized systems

Turning Data into Value, Examples
- Unify real-time and historical data analysis
  - Easier to build and maintain
  - Cheaper to operate
  - Easier to get insights, faster decisions
- Unify streaming and machine learning
  - Faster diagnosis, decisions (e.g., better ad targeting)
- Unify graph processing and ETLs
  - Faster to get social network insights (e.g., improve user experience)

Unify Real-time & Historical Analysis
- Single implementation (stack) providing
  - Streaming
  - Batch (pre-computing results)
  - Interactive computations/queries
Diagram components: Spark Streaming, Shark SQL, BlinkDB

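Unify Real-time & Historical Analysis
- Batch and streaming code are virtually the same
  - Easy to develop and maintain consistency

    // Snippet 1: count words from a file (batch).
    // `sc` is assumed to be an existing SparkContext.
    val file = sc.textFile("hdfs://.../pagecounts-*.gz")
    val words = file.flatMap(line => line.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // RDDs have no print(); show the first few results instead.
    wordCounts.take(10).foreach(println)

    // Snippet 2: count words from a network stream, every 10s (streaming).
    // This is a separate program: it reuses the names `words` and `wordCounts`.
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The remaining constructor arguments are elided on the original slide.
    val ssc = new StreamingContext(args(0), "NetCount", Seconds(10),..)
    val lines = ssc.socketTextStream("localhost", 3456)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()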
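
The point of the slide, that one piece of logic can serve both batch and streaming and so produce consistent metrics, can be sketched without Spark at all. The example below is only an illustration using plain Scala collections, not the Spark API: `UnifiedWordCountSketch`, `countWords`, `merge`, and the sample data are hypothetical names introduced here, and a stream is modeled simply as a sequence of small batches.

```scala
object UnifiedWordCountSketch {
  // The word-count logic, written once (hypothetical helper).
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Combine two partial results by summing the counts per word.
  def merge(x: Map[String, Int], y: Map[String, Int]): Map[String, Int] =
    y.foldLeft(x) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Batch: apply the logic to the whole data set at once.
  val historical: Seq[String] = Seq("a b a", "b c")
  val batchCounts: Map[String, Int] = countWords(historical)

  // Streaming (modeled as micro-batches): apply the same logic to each
  // small batch as it arrives and merge the partial results.
  val microBatches: Seq[Seq[String]] = Seq(Seq("a b"), Seq("a", "b c"))
  val streamTotals: Map[String, Int] =
    microBatches.map(countWords).foldLeft(Map.empty[String, Int])(merge)

  def main(args: Array[String]): Unit = {
    println(batchCounts)
    println(streamTotals)
    println(batchCounts == streamTotals)
  }
}
```

Because both paths call the same `countWords`, the batch and streaming totals agree on the same input, which is the consistency that is hard to get when each stack has its own implementation.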

Unify Streaming and ML
- Sophisticated, real-time diagnosis & decisions, e.g.,
  - Fraud detection
  - Detect denial of service attacks
  - Early notification of service degradation and failures
Diagram components: Spark Streaming, MLlib
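
As a rough illustration of what "early notification of service degradation" might look like, the sketch below flags latency values that deviate sharply from a running mean. It is an assumption-laden stand-in, not Spark Streaming or MLlib code: the object `StreamingAnomalySketch`, the `Stats` class, the `detect` function, the warm-up of 5 samples, and the threshold of 3 standard deviations are all hypothetical choices made for this example. The running statistics use Welford's online algorithm.

```scala
object StreamingAnomalySketch {
  // Running mean and variance, updated one value at a time
  // (Welford's online algorithm).
  final case class Stats(n: Long, mean: Double, m2: Double) {
    def update(x: Double): Stats = {
      val n1 = n + 1
      val delta = x - mean
      val mean1 = mean + delta / n1
      Stats(n1, mean1, m2 + delta * (x - mean1))
    }
    // Sample standard deviation; 0.0 until at least two values were seen.
    def stddev: Double = if (n < 2) 0.0 else math.sqrt(m2 / (n - 1))
  }

  // Returns the values that lie more than `threshold` standard deviations
  // from the mean of the values seen before them. The first 5 values are
  // never flagged (hypothetical warm-up period).
  def detect(values: Seq[Double], threshold: Double = 3.0): Seq[Double] = {
    val start = (Stats(0L, 0.0, 0.0), Vector.empty[Double])
    val (_, flagged) = values.foldLeft(start) { case ((stats, out), x) =>
      val isAnomaly =
        stats.n >= 5 &&
          stats.stddev > 0 &&
          math.abs(x - stats.mean) > threshold * stats.stddev
      (stats.update(x), if (isAnomaly) out :+ x else out)
    }
    flagged
  }

  def main(args: Array[String]): Unit = {
    val latenciesMs = Seq(100.0, 101.0, 99.0, 100.0, 102.0, 98.0, 500.0, 100.0)
    println(detect(latenciesMs))
  }
}
```

In a unified stack, the same kind of model could in principle be trained on historical data and applied to the live stream; here the "model" is just a running mean and standard deviation.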

Unify Graph Processing and ETL
- Graph-parallel systems (e.g., Pregel, GraphLab)
  - Fast and scalable, but…
  - … inefficient for graph creation, post-processing
- GraphX: unifies graph processing and ETL
Diagram components: Spark, GraphX

Unify Graph Processing and ETL
Diagram comparing two pipelines:
- GraphLab with Hadoop: graph creation (Hadoop), graph algorithms, post-processing
- GraphX: graph creation (Spark), post-processing (Spark)
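
The three stages in the diagram (graph creation, graph algorithm, post-processing) can be sketched as one small program. This is a plain-Scala illustration under stated assumptions, not the GraphX API: `GraphPipelineSketch`, `parseEdges`, `inDegrees`, `topK`, and the `src,dst` input format are hypothetical, and in-degree counting stands in for a real graph algorithm such as PageRank.

```scala
object GraphPipelineSketch {
  // ETL / graph creation: parse raw "src,dst" lines into edges,
  // dropping rows that do not have exactly two non-empty fields.
  def parseEdges(rawLines: Seq[String]): Seq[(String, String)] =
    rawLines.map(_.split(",").map(_.trim)).collect {
      case Array(src, dst) if src.nonEmpty && dst.nonEmpty => (src, dst)
    }

  // Graph step: number of incoming edges per vertex.
  def inDegrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.groupBy(_._2).map { case (vertex, incoming) => (vertex, incoming.size) }

  // Post-processing: the k vertices with the highest in-degree,
  // ties broken alphabetically.
  def topK(degrees: Map[String, Int], k: Int): Seq[(String, Int)] =
    degrees.toSeq.sortBy { case (vertex, degree) => (-degree, vertex) }.take(k)

  def main(args: Array[String]): Unit = {
    val raw = Seq("alice,bob", "carol,bob", "bob,alice", "malformed", "dave, bob")
    println(topK(inDegrees(parseEdges(raw)), 2))
  }
}
```

All three stages share the same in-memory data structures, so nothing has to be written out and reloaded between systems, which is the inefficiency the slide attributes to pairing a graph engine with a separate ETL system.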

“Crossing the Chasm”

Cloudera Partnership
- Integrate Spark with Cloudera Manager
- Spark will become part of CDH
- Enterprise-class support and professional services available for Spark

We are Committed to…
- … open source
  - We believe that any successful analytics stack will be open source
- … improving integration with Hadoop
  - Enable every Hadoop user to take advantage of Spark
- … working with partners to make Apache Spark successful for enterprise customers

Summary
- Everyone collects data, but few extract value from it
- Unification of computation and programming models is key to
  - Efficiently analyzing data
  - Making sophisticated, real-time decisions
- Spark is unique in unifying
  - batch, interactive, streaming computation models
  - data-parallel and graph-parallel programming models
- Many use cases in the rest of the program!
Diagram labels: Batch, Interactive, Streaming, Spark