
RHadoop rev 2 2014-07-07

RHadoop
A collection of packages that allows integration of R with HDFS and MapReduce: Hadoop provides the storage while R brings the analysis.
Just a library: not a special run-time, not a different language, not a special-purpose language.
Incrementally port your code and use all packages.
Requires R installed and configured on all nodes in the cluster.

Prerequisites
Installation of a Hadoop cluster
Installation of R
Installation of the RHadoop packages
Environment variables: HADOOP_CMD, HADOOP_STREAMING
As configured in the Sandbox:
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar
export HADOOP_CMD=/usr/bin/hadoop
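A minimal sketch of one way to install the packages on a node, assuming the rmr2 and rhdfs tarballs have been downloaded from the RHadoop releases (package versions and file names are illustrative):

# rmr2 and rhdfs are not on CRAN; install their CRAN dependencies first.
install.packages(c("rJava", "RJSONIO", "itertools", "digest", "Rcpp",
                   "functional", "reshape2", "stringr", "plyr", "caTools"))
# Then install the downloaded RHadoop tarballs from source:
install.packages("rmr2_3.1.0.tar.gz", repos = NULL, type = "source")
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")

Repeat on every node of the cluster, since map and reduce tasks run R locally on each node.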

RHadoop Packages
rhdfs: interface for reading and writing files from/to an HDFS cluster
rmr2: interface to MapReduce through R
rhbase: interface to HBase

rhdfs
As Hadoop MapReduce programs take their input from and write their output to HDFS, it is necessary to access HDFS from the R console. With rhdfs the R programmer can easily perform read and write operations on distributed data files. Basically, the rhdfs package calls the HDFS API in the backend to operate on data stored in HDFS.

rhdfs Functions
File manipulation: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
File read/write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
Directory: hdfs.dircreate, hdfs.mkdir
Utility: hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
Initialization: hdfs.init, hdfs.defaults
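A minimal sketch tying a few of these together, assuming HADOOP_CMD is set; the file and HDFS paths are illustrative:

library(rhdfs)
hdfs.init()                            # connect using HADOOP_CMD
hdfs.put("sales.csv", "/tmp/sales")    # copy a local file into HDFS
hdfs.ls("/tmp/sales")                  # confirm it arrived
reader = hdfs.line.reader("/tmp/sales/sales.csv")
head(reader$read())                    # read a few lines back
reader$close()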

rmr2
rmr2 is an R interface that provides Hadoop MapReduce functionality inside the R environment. The R programmer just divides the application logic into map and reduce phases and submits it with the rmr2 methods. rmr2 then calls the Hadoop Streaming MapReduce API with several job parameters, such as input directory, output directory, mapper, and reducer, to run the R MapReduce job over the Hadoop cluster.

rhbase
An R interface for operating on HBase data stored across the distributed network, accessed via a Thrift server. The rhbase package provides methods for initialization, read/write, and table manipulation operations.
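A hedged sketch of typical rhbase calls, assuming a Thrift server is running on localhost:9090; the table, column, and value names are illustrative:

library(rhbase)
hb.init(host = "localhost", port = 9090)   # connect to the Thrift server
hb.list.tables()                           # what tables exist?
# insert one cell: row key, column(s), value(s)
hb.insert("stocks", list(list("AAPL", c("price:close"), list("100.5"))))
hb.get("stocks", "AAPL")                   # read the row back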

Our First mapreduce Job
Compute the first thousand squares.
Regular R implementation:
small.ints = 1:1000
sapply(small.ints, function(x) x^2)
mapreduce equivalent:
library('rhdfs')
library('rmr2')
hdfs.init()
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
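To check the output, the job's result can be read back into memory with from.dfs (covered below); a small sketch:

out = from.dfs(
  mapreduce(
    input = small.ints,
    map = function(k, v) cbind(v, v^2)))
head(values(out))   # rows of v and v^2 pairs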

to.dfs
It is not possible to write out big data with to.dfs in a scalable way; it is useful for writing test cases, learning, and debugging. to.dfs can put the data in a file of your own choosing, but if you don't specify one it will create temp files and clean them up when done. The return value is something we call a big data object: you can assign it to variables, pass it to other rmr functions and mapreduce jobs, or read it back in. It is a stub, that is, the data is not in memory; it holds only some information that helps find and manage the data. This way you can refer to very large data sets whose size exceeds memory limits.
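For example, to keep the data at a path of your own choosing rather than a temp file (the path /tmp/squares is illustrative):

squares = to.dfs(keyval(1:10, (1:10)^2), output = "/tmp/squares")
# squares is a stub referring to /tmp/squares, not the data itself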

mapreduce
The mapreduce function takes as input a set of named parameters:
input: input path or variable
input.format: specification of input format
output: output path or variable
map: map function
reduce: reduce function
The map and reduce functions present the usual interface: a call to keyval(k, v) inside the map and reduce function is used to emit intermediate and output key-value pairs respectively.
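A skeleton call showing the named parameters together (paths illustrative); it counts how many times each input line occurs:

mapreduce(
  input = "/tmp/in",
  input.format = "text",
  output = "/tmp/out",
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(vv)))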

from.dfs
from.dfs is complementary to to.dfs and returns a key-value pair collection that can be passed to mapreduce jobs or read into memory. Watch out: it will fail for big data! from.dfs is useful whenever a mapreduce job produces something of reasonable size, like a summary, that can fit in memory and needs to be inspected to decide on the next steps, or to visualize it. It is much more important than to.dfs in production work.
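The returned collection can be split apart with the keys and values helpers; continuing the earlier to.dfs sketch:

result = from.dfs("/tmp/squares")   # the path written by to.dfs above
keys(result)     # 1 2 3 ... 10
values(result)   # 1 4 9 ... 100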

Our Second mapreduce Job
Creates a sample from the binomial distribution and counts how many times each outcome occurred.
library('rhdfs')
library('rmr2')
hdfs.init()
groups = rbinom(32, n = 50, prob = 0.4)
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, length(vv))))
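For comparison, the same tally in ordinary in-memory R is a one-liner:

table(rbinom(32, n = 50, prob = 0.4))   # counts per outcome, no cluster needed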

WordCount in R
wordcount = function(input, output = NULL, pattern = " ") {
  wc.map = function(., lines) {
    keyval(
      unlist(strsplit(x = lines, split = pattern)),
      1)}
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))}
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)}
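A hedged usage sketch; /tmp/books is an illustrative HDFS path holding plain text:

out = from.dfs(wordcount("/tmp/books"))
counts = data.frame(word = keys(out), count = values(out))
head(counts[order(-counts$count), ])   # most frequent words first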

Reading delimited data
tsv.reader = function(con, nrecs) {
  lines = readLines(con, 1)
  if(length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(
      sapply(delim, function(x) x[1]),    # first field is the key
      sapply(delim, function(x) x[-1]))}} # remaining fields are the value
# wire the custom reader into an input format for mapreduce:
tsv.format = make.input.format(format = tsv.reader, mode = "text")
freq.counts = mapreduce(
  input = tsv.data,   # an HDFS path or big data object with tab-separated records
  input.format = tsv.format,
  map = function(k, v) keyval(v[1,], 1),
  reduce = function(k, vv) keyval(k, sum(vv)))
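The reader can be sanity-checked locally with a textConnection standing in for the HDFS stream (a hedged sketch; the sample line is made up):

con = textConnection("key1\tfieldA\tfieldB")
tsv.reader(con, 1)   # keyval("key1", the remaining fields)
close(con)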

Reading named columns
tsv.reader = function(con, nrecs) {
  lines = readLines(con, 1)
  if(length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(
      sapply(delim, function(x) x[1]),
      data.frame(
        location = sapply(delim, function(x) x[2]),
        name = sapply(delim, function(x) x[3]),
        value = sapply(delim, function(x) x[4])))}}
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) {
    filter = (v$name == "blarg")
    keyval(k[filter], log(as.numeric(v$value[filter])))},
  reduce = function(k, vv) keyval(k, mean(vv)))
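What the job computes, expressed as an equivalent in-memory operation (a hedged sketch; the data frame is illustrative):

df = data.frame(k = c("a", "a", "b"),
                name = c("blarg", "blarg", "other"),
                value = c("1", "2", "3"))
sub = df[df$name == "blarg", ]   # the map-side filter
tapply(log(as.numeric(sub$value)), as.character(sub$k), mean)   # mean log value per key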