First steps in SparkR Mikael Huss SciLifeLab / Stockholm University 16 February, 2015.


Resilient Distributed Datasets (RDDs): data sets have a lineage. Example from the original RDD paper (nsdi_zaharia.pdf).

SparkR overview (by Shivaram Venkataraman & Zongheng Yang from the AMPLab): SparkR reimplements lapply so that it works on RDDs, and implements other RDD transformations in R.
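The lapply-on-RDDs idea can be illustrated with a minimal sketch. This assumes the early amplab-extras SparkR API from the time of this talk (sparkR.init, parallelize, collect), not the later DataFrame-based SparkR:

```r
library(SparkR)
sc <- sparkR.init(master="local[*]")      # local SparkContext
nums <- parallelize(sc, 1:10, 2L)         # distribute a vector as an RDD in 2 slices
squares <- lapply(nums, function(x) x^2)  # SparkR's lapply maps lazily over RDD elements
collect(squares)                          # brings the results back as an R list
```

Nothing runs on the workers until an action such as collect() or count() is called; lapply only records the transformation.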

SparkR example (on a single node). Also check out this "AmpCamp" exercise.

library(SparkR)
Sys.setenv(SPARK_MEM="1g")
sc <- sparkR.init(master="local[*]")   # create a SparkContext

lines <- textFile(sc=sc, path="rodarummet.txt")
take(lines, 2)
count(lines)

words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
take(words, 5)

wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
res <- collect(counts)
df <- data.frame(matrix(unlist(res), nrow=length(res), byrow=TRUE))
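For a small input, the distributed word count can be sanity-checked against plain R on the driver; the example lines below are made up for illustration:

```r
# Plain-R word count, equivalent in spirit to the Spark pipeline above
lines <- c("roda rummet roda", "rummet")
words <- unlist(strsplit(lines, " "))
as.data.frame(table(words))  # one row per word with its count
```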

Installing SparkR (on a single node)
All-in-one? Installing Spark first:
- Docker
- Amazon AMIs (note: US East is the region you want)
- But really, all you need to do is download a binary distribution.
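Fetching a prebuilt binary can even be done from within R; the version and mirror URL below are assumptions, so substitute the release you actually want:

```r
# Download and unpack a prebuilt Spark distribution (URL/version are placeholders)
url <- "http://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz"
download.file(url, destfile="spark.tgz")
untar("spark.tgz")   # unpacks a directory containing bin/spark-shell
```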

Installing SparkR (on a single node) After downloading, you should be able to simply run spark-shell

Installing SparkR (on a single node)
Now we have Spark itself, but what about the SparkR part? You need to install the rJava package. Try:
install.packages("rJava")
Doesn't work? If you are on Ubuntu, try:
apt-get install r-cran-rjava
Not on Ubuntu, or still doesn't work? (I feel your pain.) Fiddle around with R CMD javareconf and look for relevant StackOverflow questions.
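Once rJava is installed, a quick smoke test can confirm that R can actually start a JVM before you go on to SparkR; if .jinit() errors, the Java configuration is the likely culprit:

```r
# Sanity check for the rJava <-> Java setup
library(rJava)
.jinit()                                   # initialize the JVM
s <- .jnew("java/lang/String", "it works") # create a Java object from R
.jcall(s, "I", "length")                   # call a Java method from R
```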

Installing SparkR (on a single node)
Assuming you have successfully installed rJava:
library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")
... and you should be ready to go, e.g. with the word count example shown earlier!

Installing SparkR (on multiple nodes)
On Amazon EC2: note that it is not super easy to install SparkR afterwards! I found these notes helpful.
Standalone mode: install Spark separately on each node.
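In standalone mode, the SparkContext is then pointed at the cluster's master URL instead of local mode; the host name below is a placeholder:

```r
library(SparkR)
# "spark://master-host:7077" is a placeholder for your standalone master's URL
sc <- sparkR.init(master="spark://master-host:7077",
                  sparkEnvir=list(spark.executor.memory="1g"))
```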

That's it! There is a lot more detail on how to use Spark elsewhere (nothing about SparkR there, though).