MapReduce Compiler RHadoop


MapReduce Compiler RHadoop Bhumika Patel (101047616) Sagar Patel (101053635)

Agenda
- Introduction
- Analysing Big Data
- Introduction to R
- Overview of Hadoop and MapReduce
- RHadoop
- How to use RHadoop
- Why Hadoop with R?
- Word count example
- References

Introduction BIG DATA! We have lots of big data nowadays, but analysing it is a problem: it is hard to find true value, and easy to produce false positives. Big data is most useful if you can do something with it, and analysing massive data is not an easy task. Big data can guide you to a more accurate prediction of the future, but it should not be taken at face value; a human element is needed to process, analyze, and draw conclusions. Rushing decisions based on a subset of the data, or thinking fast without analysing it fully, can lead to false positives. The more data you have, the harder it can sometimes be to find true value in it.

How can we analyse this BIG DATA? One possible solution is to combine the power of the R programming language with Hadoop, i.e. RHadoop. We will look further at the features of R and the power of R with Hadoop.

How can we analyse this BIG DATA? There are a number of ways to use R with Hadoop, including:
1. Hadoop streaming
2. RHadoop, an R/Hadoop integration
3. RHIPE (pronounced "hree-pay")

Introduction to R R is an open source language and environment for statistical computing and graphics. R provides a wide variety of statistical and graphical techniques, and is highly extensible. It includes a large, coherent, integrated collection of intermediate tools for data analysis.

Introduction to R Useful features of R:
- Effective programming language
- Relational database support
- Data analytics
- Data visualization
- Extension through the vast library of R packages
- Easier to learn than Java, yet powerful

Introduction to R R has various built-in as well as extended functions for statistical, machine learning, and visualization tasks, such as:
- Data extraction
- Data cleaning
- Data loading
- Data transformation
- Statistical analysis
- Predictive modelling
- Data visualization

Overview of Hadoop Apache Hadoop is an open source Java framework for processing and storing extremely large datasets on large clusters of commodity hardware. Apache Hadoop has two main features: HDFS (Hadoop Distributed File System) for storage, and MapReduce for processing. I am giving only a quick overview of Hadoop, as we are all familiar with it by now (I guess so).

Overview of Hadoop There are four core modules included in the basic framework from the Apache Foundation:
1. Hadoop Common – the libraries and utilities used by the other Hadoop modules.
2. Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
3. Yet Another Resource Negotiator (YARN) – provides resource management for the processes running on Hadoop.
4. MapReduce – a parallel processing software framework.

Why is Hadoop important?
- Storage and processing speed: the ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that is a key consideration. Hadoop's distributed computing model processes big data fast: the more computing nodes you use, the more processing power you have.
- Flexibility: unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Fault tolerance: data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Scalability: you can easily grow your system to handle more data simply by adding nodes. Little administration is required.
- Low cost: the open-source framework is free and uses commodity hardware to store large quantities of data.
Fig 1: Advantages of Hadoop [3]

What is Hadoop MapReduce? The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform:
- Map: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- Reduce: takes the output from a map as input and combines those data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
In the map step, a master node takes the input, partitions it into smaller subproblems, and distributes them to worker nodes. After the map step has taken place, the master node takes the answers to all of the subproblems and combines them to produce the output.
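The two phases can be illustrated in plain R, without any cluster: map emits a (key, value) pair per word, and reduce combines the values that share a key. This is only a toy sketch of the idea, not RHadoop code:

```r
# Toy illustration of the MapReduce idea in base R (no Hadoop involved).
lines <- c("the cat", "the dog")

# Map: break each line into words; each word stands for a (word, 1) pair
pairs <- unlist(strsplit(lines, split = " "))   # "the" "cat" "the" "dog"

# Shuffle + Reduce: group by key (word) and count the pairs per key
counts <- table(pairs)
print(counts)   # cat: 1, dog: 1, the: 2
```

The same structure (split, emit pairs, group by key, aggregate) is what the RHadoop word count script later in these slides distributes across a cluster.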

What is RHadoop? RHadoop is a bridge between R and Hadoop: R, a language and environment for statistically exploring data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. RHadoop is built out of three R packages:
1. rmr
2. rhdfs
3. rhbase

How to use Hadoop with R?
- rmr: a package that allows an R developer to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster.
  install.packages("filepath/rmr2_3.3.1.tar.gz", type = "source", repos = NULL)
- rhdfs: this package provides basic connectivity to the Hadoop Distributed File System.
  install.packages("filepath/rhdfs_1.0.8.tar.gz", type = "source", repos = NULL)
- rhbase: this package provides basic connectivity to the HBase distributed database.
  install.packages("filepath/rhbase_1.0.8.tar.gz", type = "source", repos = NULL)
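Once installed, the packages must be loaded and pointed at the Hadoop installation before any job runs. A minimal sketch of that setup; the two paths are placeholders that depend on your own Hadoop distribution:

```r
# Tell rmr2/rhdfs where Hadoop lives (paths vary by installation)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

library(rmr2)    # MapReduce from R
library(rhdfs)   # HDFS connectivity
hdfs.init()      # must be called before any hdfs.* function is used
```

On the Cloudera VM used later in the word count example, the environment variables are often already set; elsewhere they usually are not.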

Why Hadoop with R?
- Strength of R: the ability to analyze data using a rich library of packages.
- Weakness of R: it falls short when it comes to working on very large datasets.
- Strength of Hadoop: the ability to store and process very large amounts of data, in the TB and even PB range.

Why Hadoop with R? In the combined RHadoop system, R takes care of the data analysis operations, with preliminary functions such as data loading, exploration, analysis, and visualization, while Hadoop takes care of parallel data storage as well as computing power against the distributed data.
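For experimenting with this division of labour, rmr2 also ships a "local" backend that runs mapreduce() entirely inside the current R session, which is handy for debugging map and reduce functions on small data before submitting to a real cluster. A sketch, assuming rmr2 is installed:

```r
library(rmr2)

# Run MapReduce jobs in-process instead of on a Hadoop cluster;
# to.dfs() then writes to a local temp file rather than HDFS.
rmr.options(backend = "local")

small <- to.dfs(1:10)
squares <- mapreduce(input = small,
                     map = function(k, v) keyval(v, v^2))
from.dfs(squares)   # key/value pairs: each number paired with its square
```

Switching rmr.options(backend = "hadoop") back restores distributed execution without changing the job code.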

Word count - RHadoop Script [2]

library(rmr2)

mytext <- to.dfs(readLines("/home/cloudera/Downloads/tryme.txt"))
countmapper <- function(key, line) {
  word <- unlist(strsplit(line, split = " "))
  keyval(word, 1)
}
mr <- mapreduce(
  input = mytext,
  map = countmapper,
  reduce = function(k, v) {
    keyval(k, length(v))
  }
)
out <- from.dfs(mr)
head(as.data.frame(out))

to.dfs() loads the file into HDFS. strsplit() splits the elements of a character vector into substrings according to the matches to the given split string; unlist() then simplifies the result into a single vector containing every word occurring in the line. The reducer counts the occurrences of each word with length(). head() returns the first part of the result.

Word Count – RHadoop Script
- Map phase: once divided, the datasets are assigned to task trackers to perform the Map phase. The map function is applied to the data, emitting the mapped key and value pairs as the output of the Map phase.
- Reduce phase: the master node then collects the answers to all the subproblems and combines them to form the output: the answer to the problem it was originally trying to solve.
Fig 2: MapReduce word count process [4]

THANK YOU

References
[1] Vignesh Prajapati, Big Data Analytics with R and Hadoop
[2] http://jeremy.kiwi.nz/2015/07/23/hadoop.html
[3] https://www.sas.com/en_us/insights/big-data/hadoop.html
[4] https://cs.calvin.edu/courses/cs/374/exercises/12/lab/