CS639: Data Management for Data Science


CS639: Data Management for Data Science
Lecture 11: Spark
Theodoros Rekatsinas

Logistics/Announcements Questions on PA3?

Today’s Lecture
1. MapReduce Implementation
2. Spark

MapReduce Implementation

Recall: The MapReduce Abstraction for Distributed Algorithms
[Diagram: distributed data storage feeds parallel map tasks; a shuffle phase groups the map output by key; parallel reduce tasks produce the final result]

MapReduce: what happens in between?

MapReduce: the complete picture

Step 1: Split input files into chunks (shards)

Step 2: Fork processes

Step 3: Run Map Tasks

Step 4: Create intermediate files

Step 4a: Partitioning

Step 5: Reduce Task - sorting

Step 6: Reduce Task - reduce

Step 7: Return to user
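The seven steps above can be sketched as a single-process Python simulation (a toy model for intuition, not the real Hadoop implementation): the input is split into shards, map tasks emit key/value pairs, a partition function routes each key to a reducer, and each reducer sorts and reduces its keys before returning to the user.

```python
from collections import defaultdict

def run_mapreduce(shards, map_fn, reduce_fn, num_reducers=2):
    """Toy single-process simulation of the MapReduce steps above."""
    # Steps 3-4: run map tasks, writing into intermediate partitions
    intermediate = [defaultdict(list) for _ in range(num_reducers)]
    for shard in shards:                      # Step 1/3: one map task per shard
        for key, value in map_fn(shard):      # map emits (key, value) pairs
            r = hash(key) % num_reducers      # Step 4a: partition by key
            intermediate[r][key].append(value)

    # Steps 5-6: each reduce task sorts its keys, then reduces
    output = {}
    for part in intermediate:
        for key in sorted(part):                     # Step 5: sort by key
            output[key] = reduce_fn(key, part[key])  # Step 6: reduce
    return output                                    # Step 7: return to user

# Word count expressed with this skeleton:
shards = ["the quick brown fox", "the lazy dog the end"]
counts = run_mapreduce(
    shards,
    map_fn=lambda shard: [(w, 1) for w in shard.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(counts["the"])  # 3
```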

MapReduce: the complete picture We need a distributed file system!

2. Spark

Intro to Spark
Spark is at heart a different implementation of the MapReduce programming model; what sets it apart is that it keeps intermediate data in main memory rather than writing it to disk between stages.
In Spark we write programs in terms of operations on resilient distributed datasets (RDDs).
RDD (simple view): a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel.
RDD (complex view): an interface for data transformation; an RDD refers to data stored either in persistent storage (e.g., HDFS), in cache (memory, memory+disk, or disk only), or in another RDD.

RDDs in Spark

MapReduce vs Spark

RDDs
Partitions are recomputed on failure or cache eviction.
Metadata stored for the interface:
Partitions – the set of data splits associated with this RDD
Dependencies – the list of parent RDDs involved in the computation
Compute – a function that computes a partition of this RDD given the parent partitions from the Dependencies
Preferred Locations – the best places to put computation on each partition (data locality)
Partitioner – how the data is split into partitions
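The interface above can be sketched as a toy Python class (an illustration of the RDD interface, not Spark's actual implementation): each RDD records its partitions, its parent dependencies, and how to compute a partition from its parents, which is exactly the lineage information used to recompute a partition that is lost.

```python
class ToyRDD:
    """Toy model of the RDD interface: Partitions, Dependencies, Compute."""
    def __init__(self, partitions=None, parent=None, fn=None):
        self.dependencies = [parent] if parent else []  # parent RDDs
        self.fn = fn                                    # per-element transform
        self._partitions = partitions                   # only source RDDs hold data

    def num_partitions(self):
        if self._partitions is not None:
            return len(self._partitions)
        return self.dependencies[0].num_partitions()

    def compute(self, i):
        """Compute partition i from the parent partitions (the lineage).
        Can be re-run from scratch on failure or cache eviction."""
        if self._partitions is not None:        # source RDD: return the data split
            return self._partitions[i]
        parent_data = self.dependencies[0].compute(i)
        return [self.fn(x) for x in parent_data]

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)       # lazy: records lineage only

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

rdd = ToyRDD(partitions=[[1, 2], [3, 4]])       # a source RDD with 2 partitions
squared = rdd.map(lambda x: x * x)              # nothing computed yet
print(squared.collect())  # [1, 4, 9, 16]
```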

RDDs

DAG
Directed Acyclic Graph – the sequence of computations performed on the data
Node – an RDD partition
Edge – a transformation applied on top of the data
Acyclic – the graph cannot loop back to an earlier partition
Directed – a transformation is an action that transitions a data partition from one state to another (from A to B)

Example: Word Count

Spark Architecture

Spark Components

Spark Driver
Entry point of the Spark shell (Scala, Python, R)
The place where the SparkContext is created
Translates RDDs into the execution graph
Splits the graph into stages
Schedules tasks and controls their execution
Stores metadata about all the RDDs and their partitions
Brings up the Spark WebUI with job information

Spark Executor
Stores data in cache, in the JVM heap or on HDDs
Reads data from external sources
Writes data to external sources
Performs all the data processing

DAG Scheduler

More RDD Operations

Spark’s secret is really the RDD abstraction

Typical NoSQL architecture

CAP theorem for NoSQL
What the CAP theorem really says: if you cannot limit the number of faults, requests can be directed to any server, and you insist on serving every request you receive, then you cannot possibly be consistent.
How it is commonly interpreted: you must always give something up: consistency, availability, or tolerance to failures and reconfiguration.

CAP theorem for NoSQL

Sharding of data

Replica Sets

How does NoSQL vary from RDBMS?

Benefits of NoSQL

Benefits of NoSQL

Drawbacks of NoSQL

Drawbacks of NoSQL

ACID or BASE