Iterative Computing on Massive Data Sets


Iterative Computing on Massive Data Sets
William Rory Kronmiller

Why Do I Care?

Some Companies Using Spark
Alibaba, Amazon, Baidu, Berkeley AMPLab, CERN, Concur, Databricks, eBay, Facebook, NASA JPL, OpenTable, Tencent, TripAdvisor

Some Things You Can Do With Spark
Real-time advertisement selection, genome sequencing, network intrusion detection (IDS), fraud detection.
https://www.mapr.com/blog/game-changing-real-time-use-cases-apache-spark-on-hadoop

Okay, But What Is It?

Spark: The Distributed System
- UC Berkeley research project
- Designed to run on a cluster of unreliable machines
- Resilient Distributed Datasets (RDDs): collections that are partitioned and parallel, immutable, recoverable, LAZY, and in-memory

Spark originated as a UC Berkeley project, initially built just to show off Berkeley's distributed job scheduler, Mesos. Like the projects leading up to it, it is designed under the assumption that worker machines will fail. The core of the system is the Resilient Distributed Dataset (RDD), a programming abstraction you can think of as a partitioned collection of tuples, somewhat similar to a view in a database. You write Spark programs by transforming one or more RDDs into another RDD, a very functionally oriented programming model. RDDs are lazy and immutable and keep track of their "lineage," so nodes can fail without the computation having to stop, and less checkpointing is needed. Materialized RDDs can be cached in memory for efficient iterative and ad-hoc computation, and lazy evaluation also makes pipelining easier. tl;dr: RDDs let you perform ad-hoc and iterative computation on massive data sets using a cluster of consumer-grade hardware.
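To make the laziness concrete, here is a minimal spark-shell-style sketch (assumptions: `sc` is the SparkContext the shell provides, and the HDFS path is a made-up placeholder). Nothing below touches the cluster yet; it only builds a lineage graph:

```scala
// Transformations only describe a lineage; no computation runs yet.
val lines  = sc.textFile("hdfs:///data/events.log") // placeholder path
val errors = lines.filter(_.contains("ERROR"))
val words  = errors.flatMap(_.split("\\s+"))
```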

Output Operators Trigger Materialization
- Sometimes called output operations (DStreams), also called actions (RDDs)
- Examples: reduce(), collect(), count(), foreach(), take(), saveAsTextFile()

Output operators, a.k.a. actions, are the operations on RDDs that trigger materialization: the functions you call when you are done transforming the data and want to see the results. If your job has no output operator, it will not do anything.
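Continuing the sketch above, adding an action is what actually runs the pipeline (paths are again placeholders):

```scala
// An action at the end of the lineage forces materialization.
val errors = sc.textFile("hdfs:///data/events.log")  // placeholder path
               .filter(_.contains("ERROR"))

println(errors.count())                      // runs the whole pipeline
errors.take(5).foreach(println)              // brings a small sample to the driver
errors.saveAsTextFile("hdfs:///out/errors")  // each partition written in parallel
```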

Spark High-Level Design
Spark can be broken up into three main components:
- Driver: manages a particular Spark job
- Cluster Manager: manages a pool of worker machines servicing one or more Spark jobs
- Worker: provides compute resources for parallel tasks and memory for RDD partitions
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html

Spark Driver
The core of a Spark job:
- Talks to the cluster
- Creates and schedules tasks
- Tracks computation progress
- Hosts the WebUI

To make a Spark job, you write a Scala, Java, or Python program that hooks into the Spark libraries. The program consists of a series of transformations on RDDs and ends with some kind of output operation. You submit it to a Spark driver (e.g. a program running on your laptop with VPN access to a Spark cluster). The driver breaks the program up into a directed acyclic graph of tasks (RDD transformations), gets resources from the cluster, and schedules the tasks. It also hosts a WebUI for viewing the progress of the computation, usually at http://localhost:4040.
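A hedged sketch of what such a program looks like as a standalone Scala job you would package and hand to the cluster; the input path arrives as an argument and the job name is invented:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  def main(args: Array[String]): Unit = {
    // The driver: builds the RDD lineage, asks the cluster manager
    // for executors, and schedules the resulting tasks.
    val conf = new SparkConf().setAppName("MyJob")
    val sc   = new SparkContext(conf)

    // A trivial pipeline: total character count of the input file.
    val totalChars = sc.textFile(args(0)).map(_.length.toLong).reduce(_ + _)
    println(s"total characters: $totalChars")

    sc.stop()
  }
}
```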

Cluster Manager
Spark runs on multiple cluster managers:
- Mesos
- YARN
- Spark Standalone

A nice feature of Spark: it was built from day one to play nice with other computing frameworks. It can run on any of these cluster managers because all it needs is a way of getting resources on which to schedule tasks. It was originally built to run on Mesos, but can now also run on YARN or in "standalone mode," where it serves as its own resource manager.
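As a sketch (Spark 2.x style, with placeholder host names and ports), switching cluster managers is mostly a matter of the master URL handed to the driver:

```scala
import org.apache.spark.SparkConf

// The same job can target different cluster managers purely by
// changing the master URL (hosts and ports are placeholders):
val conf = new SparkConf()
  .setAppName("MyJob")
  .setMaster("spark://master-host:7077")      // Spark standalone
  // .setMaster("yarn")                       // Hadoop YARN
  // .setMaster("mesos://master-host:5050")   // Mesos
  // .setMaster("local[*]")                   // one machine, for development
```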

Spark Executor
- Hosts RDD partitions
- Performs computations
- Ideally runs on the HDFS host storing its partition

Not much more needs to be said about executors: data locality reduces latency, so run Spark executors on the machines where the data is stored if practical.

Spark SQL
- RDDs as SQL tables
- DataFrame: similar to R data frames ¯\_(ツ)_/¯; untyped :(
- Dataset: typed :); has become a superset of DataFrame
- Compatible with Hive and HiveQL
- Compatible with JDBC and ODBC through the Thrift server

Spark SQL adds support for high-performance SQL-like operations, letting RDDs be represented like tables in a SQL database. A DataFrame is untyped: you access columns by name using strings. A Dataset is typed: you access columns using regular Java/Scala dot-operator syntax. Datasets are now a superset of DataFrames, with a DataFrame being a Dataset of type Row. You can also run plain SQL queries against Spark and connect over ODBC.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
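A small spark-shell-style illustration of the APIs (assuming a Spark 2.x shell, which predefines the `spark` SparkSession); the Person class and sample rows are invented for the example:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

// Dataset: typed; fields are checked at compile time.
val people = Seq(Person("Ada", 36), Person("Linus", 28)).toDS()

// DataFrame is just Dataset[Row]: untyped, columns looked up by name.
val df = people.toDF()
df.filter($"age" > 30).show()

// Or register the data as a table and use plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```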

MLlib
- Machine learning library for Spark
- Huge; just Google it
- https://goo.gl/5m0ijJ
- https://goo.gl/uM5CLv
- Could be its own UPE IO

Spark has a rich suite of machine learning libraries, and its high-performance iterative computation is well suited to ML. MLlib is probably one of the biggest focuses for the researchers developing Spark at the moment, so it is too big to get into much detail here. Some things you can do: k-means clustering, PCA, TF-IDF, Word2Vec, linear regression, decision trees, collaborative filtering, and much more.
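As one small taste, a k-means sketch against the DataFrame-based spark.ml API (spark-shell style; the four 2-D points are toy data, and a real job would load features from storage):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Four toy 2-D points forming two obvious clusters.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
).map(Tuple1.apply)

val data  = spark.createDataFrame(points).toDF("features")
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```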

GraphX
- Graph processing on Spark
- Edge computation
- Vertex-centric computation
- Pregel API
- https://goo.gl/YM0ruw

GraphX lets you treat RDDs as collections of vertices and edges and then perform graph computation on them. It could also be its own UPE IO.
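A minimal GraphX sketch, assuming the spark-shell `sc`; the four vertices and two edges are invented, and connected components stands in here for the fancier Pregel-style computations:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A tiny property graph: (vertexId, label) pairs plus labeled edges.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 4L, "follows")))
val graph    = Graph(vertices, edges)

// Vertex-centric computation: label each vertex with the smallest
// vertex id reachable from it (its connected component).
graph.connectedComponents().vertices.collect().foreach(println)
```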

Word Count Demo
(Live demo: word count and TF-IDF, with an explanation of what's going on.)
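For reference, the word-count half of the demo is roughly the following spark-shell sketch (the corpus path is a placeholder):

```scala
// The classic word count: split lines into words, pair each word
// with 1, then sum the counts per word across the cluster.
val counts = sc.textFile("hdfs:///data/corpus.txt")  // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(20).foreach(println)
```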

Postgres vs Spark
PostgreSQL:
- Unlimited database size; 32 TB table size; 1.6 TB row size
- SQL databases are good for: transactions, low latency, small-to-medium data sets, one big server (probably), and efficiency from decades of query optimizers

Spark:
- Good for: ETL, data mining, machine learning, cloud environments
- 2014 Daytona GraySort contest winner: 100 TB in 23 minutes, 1000 TB in 243 minutes

Some of you might be wondering why you would use something like Spark rather than a SQL database. It is not really a valid comparison: Spark is meant for data mining and processing and has lots of tools and features for that, while databases are meant to process transactions and to store and update data. I encourage people to Google how to use Spark with Postgres.
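If you do want to combine the two, Spark can read a Postgres table over JDBC. In this sketch the URL, table name, and credentials are all placeholders, and the Postgres JDBC driver jar is assumed to be on the classpath:

```scala
// Pull a Postgres table into a DataFrame, then analyze it with Spark.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop") // placeholder
  .option("dbtable", "orders")                          // placeholder
  .option("user", "analyst")
  .option("password", "secret")
  .load()

orders.groupBy("status").count().show()
```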

Hadoop MapReduce vs Spark
- Both perform batch computation
- Both support the Map/Reduce paradigm
- Both are distributed and fault-tolerant
- Both use HDFS
- Spark RDDs enable iterative computation

Some of you might also be wondering how Spark compares to MapReduce. This is an easier comparison: Spark is compatible with HDFS and more capable than MapReduce. If you encounter a stack using MapReduce that works…you're fine. If you have a choice, I'd suggest Spark for its performance and flexibility.
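A sketch of why caching matters for iteration; the path is a placeholder and the algorithm is a toy (gradient descent toward the mean), but the shape, one in-memory dataset reused across many passes, is the real point, and it is exactly what MapReduce's write-to-disk-between-jobs model makes expensive:

```scala
// cache() keeps the parsed data in executor memory, so each pass
// reads RAM instead of re-reading HDFS.
val xs = sc.textFile("hdfs:///data/values.txt").map(_.toDouble).cache()

var m = 0.0
for (_ <- 1 to 10) {
  // Gradient of the average of 0.5 * (m - x)^2; each pass is a
  // full cluster job over the cached RDD.
  val grad = xs.map(x => m - x).mean()
  m -= 0.5 * grad
}
println(s"estimated mean: $m")
```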

Spark Streaming

Micro-Batch Model
Spark uses batch computation:
- Take a bucket of data
- Transform the bucket of data
- …
- Profit

What about unbounded streams of data? Micro-batches: take a bucket of data every N seconds.

Spark Streaming is one of Spark's most powerful features: near-real-time analysis on unbounded data. How? Take a stream, slice it up into buckets, and treat each bucket like a regular Spark batch.

DStreams
- Micro-batch programming abstraction
- An unbounded collection of RDDs
- Still parallel, fault-tolerant, etc.
- Still lazy

Spark provides a programming abstraction for micro-batches called DStreams, which are collections of RDDs.
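A minimal DStream sketch in the spirit of the streaming demo (spark-shell style: `sc` is predefined; the socket host and port are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches every 5 seconds from a TCP socket.
val ssc   = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

// The batch word-count pipeline, re-run on each micro-batch's RDD.
lines.flatMap(_.split("\\s+"))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()   // output operation: triggers execution every interval

ssc.start()
ssc.awaitTermination()
```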

Streaming DEMO

Try This At Home https://databricks.com/try-databricks

References
spark.apache.org
cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
databricks.com
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#00%20Welcome%20to%20Databricks.html
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
sparkhub.databricks.com
deepspace.jpl.nasa.gov
mapr.com/blog/game-changing-real-time-use-cases-apache-spark-on-hadoop
databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html
jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
mapr.com/blog/how-get-started-using-apache-spark-graphx-scala
blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html