CS639: Data Management for Data Science

Slides:



Advertisements
Similar presentations
Spark: Cluster Computing with Working Sets
Advertisements

Spark Fast, Interactive, Language-Integrated Cluster Computing.
Distributed Computations
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture.
MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
MapReduce and the New Software Stack CHAPTER 2 1.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Data Engineering How MapReduce Works
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
MapReduce: simplified data processing on large clusters Jeffrey Dean and Sanjay Ghemawat.
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
CC Procesamiento Masivo de Datos Otoño Lecture 8: Apache Spark (Core)
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
Hadoop Aakash Kag What Why How 1.
Hadoop MapReduce Framework
MapReduce Types, Formats and Features
Spark Presentation.
Scalable systems.
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Lecture 3. MapReduce Instructor: Weidong Shi (Larry), PhD
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
Distributed Systems CS
COS 418: Distributed Systems Lecture 1 Mike Freedman
湖南大学-信息科学与工程学院-计算机与科学系
Cse 344 May 2nd – Map/reduce.
February 26th – Map/Reduce
Hadoop Basics.
Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA
Cse 344 May 4th – Map/Reduce.
CS110: Discussion about Spark
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
Distributed System Gang Wu Spring,2018.
Interpret the execution mode of SQL query in F1 Query paper
CS 345A Data Mining MapReduce This presentation has been altered.
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
Distributed Systems CS
Introduction to MapReduce
Introduction to Spark.
CS639: Data Management for Data Science
CS639: Data Management for Data Science
5/7/2019 Map Reduce Map reduce.
Apache Hadoop and Spark
Fast, Interactive, Language-Integrated Cluster Computing
COS 518: Distributed Systems Lecture 11 Mike Freedman
MapReduce: Simplified Data Processing on Large Clusters
Lecture 29: Distributed Systems
CS639: Data Management for Data Science
Presentation transcript:

CS639: Data Management for Data Science Midterm Review 2: MapReduce and NoSQL Theodoros Rekatsinas

Today’s Lecture Review Relational Databases and Relational Algebra Next Lecture: Review MapReduce and NoSQL systems

The Map Reduce Abstraction for Distributed Algorithms Distributed Data Storage Map (Shuffle) Reduce

The Map Reduce Abstraction for Distributed Algorithms Distributed Data Storage map map map map map map Map (Shuffle) Reduce reduce reduce reduce reduce

The Map Reduce Abstraction for Distributed Algorithms Distributed Data Storage map map map map map map Map (Shuffle) Reduce reduce reduce reduce reduce

Map Reduce Shift (docID=1, v1) (docID=2, v2) (docID=3, v3) …. (w1, 1)

The Map Reduce Abstraction for Distributed Algorithms MapReduce is a high-level programming model and implementation for large-scale parallel data processing Like RDBMS adopt the the relational data model, MapReduce has a data model as well

MapReduce’s Data Model Files! A File is a bag of (key, value) pairs A bag is a multiset A map-reduce program: Input: a bag of (inputkey, value) pairs Output: a bag of (outputkey, value) pairs

All the user needs to define are the MAP and REDUCE functions User input All the user needs to define are the MAP and REDUCE functions Execute proceeds in multiple MAP – REDUCE rounds MAP – REDUCE = MAP phase followed REDUCE

MAP Phase Step 1: the MAP phase User provides a MAP-function: Input: (input key, value) Output: bag of (intermediate key, value) System applies the map function in parallel to all (input key, value) pairs in the input file

REDUCE Phase Step 2: the REDUCE phase User provides a REDUCE-function: Input: (intermediate key, bag of values) Output: (intermediate key, values) The system will group all pairs with the same intermediate key, and passes the bag of values to the REDUCE function

MapReduce Programming Model Input & Output: each a set of key/value pairs Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) Processes input key/value pair Produces set of intermediate pairs reduce (out_key, list(intermediate_value)) -> (out_key, list(out_values)) Combines all intermediate values for a particular key Produces a set of merged output values (usually just one)

MapReduce: what happens in between?

MapReduce: the complete picture

Step 1: Split input files into chunks (shards)

Step 2: Fork processes

Step 3: Run Map Tasks

Step 4: Create intermediate files

Step 4a: Partitioning

Step 5: Reduce Task - sorting

Step 6: Reduce Task - reduce

Step 7: Return to user

MapReduce: the complete picture We need a distributed file system!

2. Spark

Intro to Spark Spark is really a different implementation of the MapReduce programming model What makes Spark different is that it operates on Main Memory Spark: we write programs in terms of operations on resilient distributed datasets (RDDs). RDD (simple view): a collection of elements partitioned across the nudes of a cluster that can be operated on in parallel. RDD (complex view): RDD is an interface for data transformation, RDD refers to the data stored either in persisted store (HDFS) or in cache (memory, memory+disk, disk only) or in another RDD

RDDs in Spark

MapReduce vs Spark

Partitions are recomputed on failure or cache eviction RDDs Partitions are recomputed on failure or cache eviction Metadata stored for interface: Partitions – set of data splits associated with this RDD Dependencies – list of parent RDDs involved in computation Compute – function to compute partition of the RDD given the parent partitions from the Dependencies Preferred Locations – where is the best place to put computations on this partition (data locality) Partitioner – how the data is split into partitions

RDDs

DAG Directed Acyclic Graph – sequence of computations performed on data Node – RDD partition Edge – transformation on top of the data Acyclic – graph cannot return to the older partition Directed – transformation is an action that transitions data partitions state (from A to B)

Example: Word Count

Spark Architecture

Spark Components

Typical NoSQL architecture

CAP theorem for NoSQL What the CAP theorem really says: If you cannot limit the number of faults and requests can be directed to any server and you insist on serving every request you receive then you cannot possibly be consistent How it is interpreted: You must always give something up: consistency, availability or tolerance to failure and reconfiguration

CAP theorem for NoSQL

Sharding of data

Replica Sets

How does NoSQL vary from RDBMS?

Benefits of NoSQL

Benefits of NoSQL

Drawbacks of NoSQL

Drawbacks of NoSQL

ACID or BASE