RDDs and Spark.

The paper itself
A great model for a systems paper:
Talks about something that is useful and used by many real users
Argues not just that the techniques are good, but also that the limitations are not fundamentally bad
Extensive experiments to back it up; impressive performance numbers always help
Won the best paper award at NSDI '12

Memory vs. Disk (borrowed latency numbers)
L1 cache reference: 0.5 ns
Branch mispredict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Compress 1 KB with Zippy: 10,000 ns
Send 2 KB over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA -> Netherlands -> CA: 150,000,000 ns

Spark Primitives vs. MapReduce
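As a rough illustration (not part of the original slide, which presumably showed a comparison figure), here is a minimal word count written with Spark's RDD primitives in Scala. The input/output paths and the local master URL are assumptions for the sketch; in plain MapReduce the same job needs a Mapper class, a Reducer class, and job-configuration boilerplate.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; a real deployment would point at a cluster instead.
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val counts = sc.textFile("hdfs://input/path")   // hypothetical input path
      .flatMap(line => line.split("\\s+"))          // map side: split lines into words
      .map(word => (word, 1))                       // emit (word, 1)
      .reduceByKey(_ + _)                           // reduce side: sum counts per word

    counts.saveAsTextFile("hdfs://output/path")     // hypothetical output path
    sc.stop()
  }
}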

Disadvantages of MapReduce
1. Extremely rigid data flow: only Map then Reduce; other flows (joins, unions, splits, chains) are constantly hacked in
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct (a sketch follows below)
3. Semantics are hidden inside the map and reduce functions, which makes programs difficult to maintain, extend, and optimize
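To make point 2 concrete, here is a minimal sketch (not from the slides) of a join plus filter plus projection expressed as single RDD primitives in Scala; in plain MapReduce each of these would be a hand-written map/reduce pair. The toy data and threshold are assumptions.

import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch").setMaster("local[*]"))

    // Hypothetical toy data standing in for two large datasets.
    val visits   = sc.parallelize(Seq(("a.com", "alice"), ("b.com", "bob")))
    val pageRank = sc.parallelize(Seq(("a.com", 0.9), ("b.com", 0.4)))

    val result = visits.join(pageRank)                       // join: (url, (user, rank))
      .filter { case (_, (_, rank)) => rank > 0.5 }          // filter on rank
      .map { case (url, (user, _)) => (user, url) }          // projection to (user, url)

    result.collect().foreach(println)
    sc.stop()
  }
}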

Not the first time! Similar proposals have been made to natively support other relational operators on top of MapReduce.
PIG: imperative style, like Spark. From Yahoo!

Another Example: PIG
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(urlVisits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
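For comparison, a hedged sketch (not from the slides) of roughly the same dataflow with Spark's RDD API in Scala; the comma-separated file layout, column positions, and paths are assumptions, and the per-category top 10 is computed with a simple groupByKey rather than a tuned implementation.

import org.apache.spark.{SparkConf, SparkContext}

object TopUrlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("top-urls").setMaster("local[*]"))

    // visits(user, url, time) and urlInfo(url, category, pRank), assumed comma-separated.
    val visits = sc.textFile("/data/visits")
      .map(_.split(","))
      .map(f => (f(1), f(0)))                                          // (url, user)
    val urlInfo = sc.textFile("/data/urlInfo")
      .map(_.split(","))
      .map(f => (f(0), f(1)))                                          // (url, category)

    val visitCounts = visits.mapValues(_ => 1L).reduceByKey(_ + _)     // count visits per url
    val byCategory = visitCounts.join(urlInfo)                         // (url, (count, category))
      .map { case (url, (count, category)) => (category, (url, count)) }

    val topUrls = byCategory.groupByKey()                              // top 10 urls per category
      .mapValues(_.toSeq.sortBy(-_._2).take(10))

    topUrls.saveAsTextFile("/data/topUrls")
    sc.stop()
  }
}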

Another Example: DryadLINQ
string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(separator));
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));
var ordered = counts.OrderByDescending(x => x[2]);
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");
[Execution plan graph from the slide: Get -> SelectMany (SM) -> GroupBy (G) -> Select (S) -> OrderBy (O) -> Take]

Not the first time! Similar proposals have been made to natively support other relational operators on top of MapReduce. Unlike Spark, most of them cannot have datasets persist across queries.
PIG: imperative style, like Spark. From Yahoo!
DryadLINQ: imperative programming interface. From Microsoft.
HIVE: SQL-like. From Facebook.
HadoopDB: SQL-like (hybrid of MapReduce and databases). From Yale.
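The persistence point is the crux: an RDD can be kept in memory and reused by many later computations. A minimal sketch in Scala, assuming a hypothetical log file path (this mirrors the log-mining style of example commonly used for Spark, not code from the slides):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))

    val errors = sc.textFile("/data/logs")                   // hypothetical input path
      .filter(_.contains("ERROR"))
      .persist(StorageLevel.MEMORY_ONLY)                     // keep the filtered dataset in memory

    // Several "queries" reuse the same in-memory dataset instead of
    // re-reading and re-filtering the raw input each time.
    val total    = errors.count()
    val timeouts = errors.filter(_.contains("timeout")).count()
    val sample   = errors.take(5)

    println(s"$total errors, $timeouts timeouts; sample: ${sample.mkString(" | ")}")
    sc.stop()
  }
}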

Spark: Control
Spark leaves control of data, algorithms, and persistence to the user. Is this a good idea?

Spark: Control
Spark leaves control of data, algorithms, and persistence to the user. Is this a good idea?
Good idea: the user may know which datasets need to be used, and how
Bad idea: the system may be better placed to optimize and schedule computation across nodes
This is the standard declarative vs. imperative argument

What are other ways Spark can be optimized?

What are other ways Spark can be optimized?
Be more declarative than imperative: relational query optimization, reordering predicates
Caching and fault tolerance only when needed
Careful scheduling
Careful partitioning, co-location, and persistence (see the sketch below)
Indexes
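As an illustration of the partitioning/co-location/persistence point above, a hedged sketch in Scala of a simplified PageRank-style loop; the toy link data, number of partitions, and constants are assumptions. The user, not the system, chooses the partitioner and the caching here.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[*]"))
    val partitioner = new HashPartitioner(8)                 // user-chosen partitioning

    val links = sc.parallelize(Seq(("a.com", "b.com"), ("b.com", "a.com")))
      .partitionBy(partitioner)                              // co-locate by url
      .cache()                                               // user-chosen persistence

    var ranks = links.mapValues(_ => 1.0)                    // inherits links' partitioner

    // Because links and ranks share a partitioner, each join below is
    // co-partitioned and avoids a full shuffle.
    for (_ <- 1 to 3) {
      val contribs = links.join(ranks).map { case (_, (dest, rank)) => (dest, rank) }
      ranks = contribs.reduceByKey(partitioner, _ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach(println)
    sc.stop()
  }
}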