BIG DATA IN R
Rahi Ashokkumar Patel, U00295928
AGENDA
What is Big Data?
Introduction to Hadoop
Hadoop Distributed File System
MapReduce and Hadoop MapReduce
How to link R and Hadoop?
What is Big Data?
Extremely large datasets that are hard to handle with relational databases:
Storage/Cost
Search/Performance
Analytics and Visualization
Such datasets require parallel processing on hundreds of machines.
IBM characterizes the Big Data challenges as the four V's: Volume, Velocity, Variety, and Veracity.
For example, Facebook has more than 800 million active users interacting with more than 1 billion objects, generating about 2.5 petabytes of user data per day.
What is Hadoop?
Hadoop was created by Doug Cutting and Mike Cafarella.
Apache Hadoop is an open-source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.
Operates on both structured and unstructured data.
Has a large and active ecosystem.
Handles thousands of nodes and petabytes of data.
Enables scalable, cost-effective, flexible, fault-tolerant solutions.
Introduction to Hadoop
Components of Hadoop: Mahout, Pig, Hive, HBase, Sqoop, Oozie, Flume.
Hadoop Distributed File System (HDFS)
Characteristics of HDFS:
Fault tolerant
Runs on commodity hardware
Able to handle large datasets
Master-slave paradigm
Write-once file access only
HDFS components:
NameNode
DataNode
Secondary NameNode
HDFS: Hadoop Distributed FS
Block size = 64 MB; replication factor = 3
MapReduce
Derived from Google's MapReduce, a patented Google framework for distributed processing of large datasets.
A programming model for processing large datasets distributed across a large cluster.
MapReduce is the heart of Hadoop.
MapReduce components:
JobTracker
TaskTracker
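To make the programming model concrete, here is a minimal word-count sketch of the map, shuffle, and reduce steps in plain R, with no Hadoop involved; the sample lines are invented for illustration.

# Toy word count showing the map -> shuffle -> reduce flow in plain R
lines <- c("big data in r", "r and hadoop", "big data and hadoop")
# Map: emit a (word, 1) pair for every word on every line
words  <- unlist(strsplit(lines, "\\s+"))
mapped <- data.frame(key = words, value = 1, stringsAsFactors = FALSE)
# Shuffle: group the emitted values by key (Hadoop does this between map and reduce)
grouped <- split(mapped$value, mapped$key)
# Reduce: sum the values for each key
counts <- sapply(grouped, sum)
print(sort(counts, decreasing = TRUE))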
MapReduce components
Why R + Hadoop?
R performs all calculations in memory, so the data it can handle is limited by available RAM.
It is easy to write MapReduce programs in R.
The Hadoop framework allows parallel processing of massive amounts of data.
Using R with Hadoop therefore gives statistical calculations horizontal scalability.
How to link R and Hadoop?
Three ways to link R and Hadoop:
RHIPE
RHadoop
Hadoop Streaming
RHadoop
RHadoop is an open-source collection of R packages for performing data analytics on the Hadoop platform via R functions.
Its packages are designed around Hadoop's main features, HDFS and MapReduce (plus HBase):
rhdfs: provides Hadoop HDFS access from R.
rmr: provides Hadoop MapReduce interfaces to R.
rhbase: handles data in the HBase distributed database through R.
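A minimal sketch of how these packages are typically loaded and initialized; the paths are placeholders for a CDH-style installation and the tiny job is invented for illustration (the full WordCount example appears on the following slides).

# Placeholder path -- adjust to your own Hadoop installation
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
# (rmr2 usually also needs HADOOP_STREAMING set; see the WordCount slide)
library(rhdfs)
library(rmr2)
hdfs.init()                      # rhdfs: connect R to HDFS
hdfs.ls("/user/hduser")          # list HDFS files, like 'hadoop fs -ls'
ints <- to.dfs(1:10)             # rmr2: write a small R object to HDFS
job  <- mapreduce(input = ints,  # run a trivial MapReduce job
                  map = function(k, v) keyval(v, v^2))
from.dfs(job)                    # read the results back into R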
Hadoop Streaming
This utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or reducer.
It is supported by languages such as R, Python, Ruby, Bash, Perl, and so on; we will use the R language invoked from a bash script.
There is also an R package named HadoopStreaming, developed for performing data analysis on Hadoop clusters; with the help of R scripts, it provides an interface to Hadoop Streaming from R.
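As an illustration of the stdin/stdout contract that Hadoop Streaming expects, a word-count mapper written as a standalone R script might look roughly like this; the file name mapper.R is hypothetical, and a matching reducer script would be passed to the streaming jar alongside it.

#!/usr/bin/env Rscript
# mapper.R (hypothetical): word-count mapper for Hadoop Streaming.
# Streaming sends input lines on stdin and expects "key<TAB>value" lines on stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(line, "\\s+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)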
RHIPE
R and Hadoop Integrated Programming Environment (RHIPE) is a free and open-source project.
RHIPE is widely used for performing Big Data analysis via D&R (Divide and Recombine) analysis: huge data is divided, processed in parallel on a distributed network to produce intermediate outputs, and those intermediate outputs are finally recombined into a result set.
RHIPE is designed to carry out D&R analysis on complex Big Data in R on the Hadoop platform.
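For flavor only, a word count in RHIPE might look roughly like the sketch below. This is an assumption-based illustration of RHIPE's expression-style map/reduce interface (rhinit, rhcollect, rhwatch), not verbatim documentation, and the HDFS paths are hypothetical.

library(Rhipe)
rhinit()   # connect the R session to the Hadoop cluster (assumed entry point)
# Map: RHIPE evaluates this expression over blocks of input records (map.values)
map <- expression({
  lapply(map.values, function(line) {
    for (w in unlist(strsplit(line, "\\s+"))) rhcollect(w, 1)
  })
})
# Reduce: pre/reduce/post expressions accumulate the count for each key
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)
# Run the job (paths are hypothetical)
out <- rhwatch(map = map, reduce = reduce,
               input = "/user/hduser/wordcount/data",
               output = "/user/hduser/wordcount/rhipe_out")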
MapReduce with RHadoop
[Architecture diagram] R talks to HDFS through rhdfs, to MapReduce through rmr2 (via the Streaming API), and to HBase through rhbase (via the HBase Thrift gateway).
MapReduce WordCount Example
Move File to HDFS
# Put data into HDFS
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/contrib/streaming/hadoop-streaming-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/hduser/wordcount/data")
hdfs.put("wc_input.txt", "/user/hduser/wordcount/data")
# Equivalent shell commands:
$ hadoop fs -mkdir /user/hduser/wordcount/data
$ hadoop fs -put wc_input.txt /user/hduser/wordcount/data
Wordcount Mapper and Reducer
# Mapper: split each input line into words and emit a (word, 1) pair per word
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
# Reducer: sum the counts emitted for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out  <- file.path(hdfs.root, 'out')
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text",
            map = map, reduce = reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing = TRUE)][1:10]
# Equivalent shell command:
$ hadoop fs -cat /user/hduser/wordcount/out/part | sort -k 2 -nr | head -n 10
sapply vs. MapReduce Benchmark
# In-memory computation with sapply
> a.time <- proc.time()
> small.ints2 = 1:100000
> result.normal = sapply(small.ints2, function(x) x^2)
> proc.time() - a.time
# The same computation as an rmr2 MapReduce job
> b.time <- proc.time()
> small.ints = to.dfs(1:100000)
> result = mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> proc.time() - b.time
sapply: elapsed seconds
MapReduce: elapsed 102.755 seconds
Conclusion
Each of the approaches presented here has benefits and limitations.
Using R with Streaming raises no installation problems, whereas RHIPE and RHadoop require some effort to set up on the cluster.
Client-side integration with R is high for RHIPE and RHadoop but missing for Streaming.
RHIPE and RHadoop allow users to define and call their own map and reduce functions within R, while Streaming uses a command-line approach in which the map and reduce functions are passed as arguments.
For simple MapReduce jobs the straightforward solution is Streaming, but it is limited to text-only input files. For more complex jobs, RHIPE or RHadoop is the better choice.
References
Glen Mules, Warren Pettit, Introduction to MapReduce Programming, September 2013.
John Maindonald, W. John Braun, Introduction and Hadoop Overview, Lab Course: Databases & Cloud Computing, University of Freiburg, 2012.
Vignesh Prajapati, Big Data Analytics with R and Hadoop, November 2013.
Tom White, Hadoop: The Definitive Guide, Second Edition, October 2010.
Thank You