Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC

Presentation transcript:

Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC 2012

Why Spark?
- Another system for big data analytics
- Isn't MapReduce good enough?
  - Simplifies batch processing on large commodity clusters
  - But: expensive save to disk for fault tolerance

Why Spark?
- MapReduce can be expensive for some applications, e.g.,
  - Iterative
  - Interactive
- Lacks efficient data sharing
- Specialized frameworks did evolve for different programming models
  - Bulk Synchronous Processing (Pregel)
  - Iterative MapReduce (HaLoop)
  - …

Solution: Resilient Distributed Datasets (RDDs)
- Immutable, partitioned collection of records
- Built through coarse-grained transformations (map, join …)
- Can be cached for efficient reuse
- Apply the same operations to many pieces of data

[Figure: a chain of RDDs built from HDFS through Read, Map, and Reduce steps, with a Cache that is read for reuse]
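The properties listed on this slide can be illustrated with a small in-memory model. This is a minimal sketch in plain Python, not Spark's API: the class `ToyRDD` and its methods are hypothetical names chosen for illustration. It shows a collection that is split into partitions, cannot be modified in place, and is only ever transformed by applying one function to every record.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions):
        # Store tuples of tuples so that records cannot be changed in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_list(cls, records, num_partitions):
        # Round-robin split of the records into partitions.
        return cls([records[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Coarse-grained: the same function is applied to every record,
        # and the result is a new dataset rather than an update of this one.
        return ToyRDD([[fn(x) for x in p] for p in self._partitions])

    def filter(self, pred):
        return ToyRDD([[x for x in p if pred(x)] for p in self._partitions])

    def collect(self):
        # Concatenate all partitions into one list.
        return [x for p in self._partitions for x in p]

    def num_partitions(self):
        return len(self._partitions)


numbers = ToyRDD.from_list(list(range(10)), num_partitions=3)
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
```

Each transformation returns a new `ToyRDD` and leaves `numbers` untouched, which is the property that later makes recomputation safe.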

Solution: Resilient Distributed Datasets (RDDs)
- Immutable, partitioned collection of records
- Built through coarse-grained, ordered transformations (map, join …)
- Fault recovery? Lineage!
  - Log the coarse-grained operation applied to a partitioned dataset
  - Simply recompute the lost partition if a failure occurs
  - No cost if no failure

[Figure: the same chain of RDDs (HDFS, Read, Map, Reduce, Cache), annotated with its lineage]

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data.
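Lineage-based recovery can be sketched in a few lines. The toy below is an illustration under stated assumptions, not how Spark is implemented: `LineageRDD`, `lose`, and the `computations` counter are hypothetical, and a plain list stands in for stable storage such as HDFS. Each dataset remembers its parent and the operation that produced it, so a lost partition is rebuilt by replaying that operation on the matching parent partition only.

```python
class LineageRDD:
    """Hypothetical toy dataset that remembers how it was built."""

    def __init__(self, num_partitions, parent=None, op=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent       # dataset this one was derived from
        self.op = op               # coarse-grained function: partition -> partition
        self.source = source       # root only: partition index -> records
        self.materialized = {}     # partition index -> records held in memory
        self.computations = 0      # how many partitions were (re)computed

    def map(self, fn):
        # Record the operation in the lineage; nothing is computed yet.
        return LineageRDD(self.num_partitions, parent=self,
                          op=lambda part: [fn(x) for x in part])

    def partition(self, i):
        if i not in self.materialized:
            if self.parent is None:
                records = self.source(i)                     # read stable storage
            else:
                records = self.op(self.parent.partition(i))  # replay lineage
            self.materialized[i] = records
            self.computations += 1
        return self.materialized[i]

    def lose(self, i):
        # Simulate a failure that drops one in-memory partition.
        self.materialized.pop(i, None)


stable = [[1, 2], [3, 4], [5, 6]]          # stands in for files on HDFS
base = LineageRDD(3, source=lambda i: list(stable[i]))
doubled = base.map(lambda x: 2 * x)
plus_one = doubled.map(lambda x: x + 1)

before = [plus_one.partition(i) for i in range(3)]

plus_one.lose(1)                           # partition 1 is lost in two datasets
doubled.lose(1)
recovered = plus_one.partition(1)          # rebuilt from lineage
```

Only partition 1 of `doubled` and `plus_one` is recomputed. The other partitions, and all of `base`, are left alone, and a run with no failure does no extra work.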

What can you do with Spark?
- RDD operations
  - Transformations, e.g., filter, join, map, group-by …
  - Actions, e.g., count, print …
- Control
  - Partitioning
  - Persistence
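The split between transformations and actions, and the effect of persistence, can be sketched as follows. This is a toy model under assumptions, not Spark's API: `LazyRDD`, `read_source`, and the `reads` counter are hypothetical names. Transformations only describe a new dataset, actions force the computation, and a persisted dataset is computed once and then reused.

```python
class LazyRDD:
    """Hypothetical toy dataset with lazy transformations and eager actions."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function producing the records
        self._persist = False
        self._cache = None

    # Transformations: return a new dataset and run nothing.
    def map(self, fn):
        return LazyRDD(lambda: [fn(x) for x in self._records()])

    def filter(self, pred):
        return LazyRDD(lambda: [x for x in self._records() if pred(x)])

    # Control: ask for the computed records to be kept for reuse.
    def persist(self):
        self._persist = True
        return self

    def _records(self):
        if self._persist:
            if self._cache is None:
                self._cache = self._compute()
            return self._cache
        return self._compute()

    # Actions: trigger the computation and return a result.
    def count(self):
        return len(self._records())

    def collect(self):
        return list(self._records())


reads = {"n": 0}

def read_source():
    # Stands in for an expensive read from storage.
    reads["n"] += 1
    return ["INFO ok", "ERROR disk", "ERROR net", "INFO done"]

lines = LazyRDD(read_source)
errors = lines.filter(lambda s: s.startswith("ERROR")).persist()
reads_after_transformations = reads["n"]      # still 0: nothing has run yet

n_errors = errors.count()                     # first action reads the source
kinds = errors.map(lambda s: s.split()[1]).collect()   # reuses the persisted data
```

Without the `persist()` call, the second action would read the source again.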

Partitioning
- PageRank
  - Joins take place repeatedly
  - Good partitioning reduces shuffles

[Figure: Links (url, neighbors) is joined with Ranks (url, ranks) to produce Contributions, which produce new Ranks (url, ranks); the cycle repeats]
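Why partitioning matters for PageRank can be seen in a small simulation. The sketch below is plain Python under assumptions: the three-page graph, the CRC32-based `partition_of` function, and the damping constant 0.85 are illustrative choices, not taken from the slides. Links and ranks are split with the same partitioner, so the repeated join between them is local to each partition, and only the contributions have to move between partitions.

```python
import zlib

NUM_PARTITIONS = 2
DAMPING = 0.85   # assumed damping factor, commonly used in PageRank examples

def partition_of(url):
    # Deterministic hash partitioner (hypothetical choice for this sketch).
    return zlib.crc32(url.encode("utf-8")) % NUM_PARTITIONS

def partition_by_key(pairs):
    # Place each (key, value) pair in the partition chosen by its key.
    parts = [dict() for _ in range(NUM_PARTITIONS)]
    for key, value in pairs:
        parts[partition_of(key)][key] = value
    return parts

# Links (url, neighbors) and Ranks (url, rank) share one partitioner.
links = partition_by_key([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])])
ranks = partition_by_key([(url, 1.0) for url in ("a", "b", "c")])

def pagerank_step(links, ranks):
    incoming = [dict() for _ in range(NUM_PARTITIONS)]
    for p in range(NUM_PARTITIONS):
        # The join of links and ranks is partition-local: both sides of
        # the join for a given url live in the same partition p.
        for url, neighbors in links[p].items():
            share = ranks[p][url] / len(neighbors)
            for dest in neighbors:
                # Only contributions are routed to another partition.
                target = incoming[partition_of(dest)]
                target[dest] = target.get(dest, 0.0) + share
    # New ranks are produced with the same partitioning as links.
    return [
        {url: (1 - DAMPING) + DAMPING * incoming[p].get(url, 0.0)
         for url in links[p]}
        for p in range(NUM_PARTITIONS)
    ]

for _ in range(10):
    ranks = pagerank_step(links, ranks)

final_ranks = {url: rank for part in ranks for url, rank in part.items()}
```

Because the new ranks come out partitioned the same way as `links`, the next iteration's join is again local, which is the saving the slide refers to.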

Generality
- RDDs allow unification of different programming models
  - Stream Processing
  - Graph Processing
  - Machine Learning
  - …

Gather-Apply-Scatter on GraphX
- Graph represented in a table: Triplets, Vertices, Neighbors (example graph with vertices A, B, C, D)
- Gather at A: Group-By
- Apply: Map
- Scatter: Join with Triplets
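The mapping from Gather-Apply-Scatter to table operations can be sketched in plain Python. This is an illustration under assumptions, not GraphX code: the edge list, the vertex values, and the update rule (old value plus the sum of incoming values) are hypothetical, since the slides only show the vertex names. Gather becomes a group-by on the destination vertex, Apply becomes a map over the vertex table, and Scatter becomes a join of the new vertex values back onto the edges.

```python
from collections import defaultdict

# Vertex table and edge table (hypothetical values and edges).
vertices = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]   # (src, dst)

def build_triplets(vertices, edges):
    # Join: attach the source and destination values to every edge.
    return [(src, vertices[src], dst, vertices[dst]) for src, dst in edges]

def gather(triplets):
    # Group-by destination vertex and sum the values sent along each edge.
    sums = defaultdict(float)
    for src, src_val, dst, dst_val in triplets:
        sums[dst] += src_val
    return dict(sums)

def apply_step(vertices, gathered):
    # Map: compute each vertex's new value from its old value and its sum.
    return {v: val + gathered.get(v, 0.0) for v, val in vertices.items()}

def scatter(new_vertices, edges):
    # Join: push the updated vertex values back into the triplets table.
    return build_triplets(new_vertices, edges)

triplets = build_triplets(vertices, edges)
gathered = gather(triplets)                 # A receives from B, C, and D
new_vertices = apply_step(vertices, gathered)
new_triplets = scatter(new_vertices, edges)
```

The resulting `new_triplets` table is the input to the next round, so an iterative graph algorithm becomes a loop over these three table operations.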

Summary
- RDDs provide a simple and efficient programming model
- Generalized to a broad set of applications
- Leverages the coarse-grained nature of parallel algorithms for failure recovery