Big Data, Bigger Data & Big R Data

Big Data, Bigger Data & Big R Data Birmingham R Users Meeting 23rd April 2013 Andy Pryke

Big Data, Bigger Data & Big R Data Birmingham R Users Meeting 23rd April 2013 Andy Pryke Andy@The-Data-Mine.co.uk / @AndyPryke

My Bias…
- I work in commercial data mining, data analysis and data visualisation
- Background in computing and artificial intelligence
- Use R to write programs which analyse data

What is Big Data?
Depends who you ask. Answers are often “too big to ….”
- …load into memory
- …store on a hard drive
- …fit in a standard database
Plus
- “Fast changing”
- Not just relational

My “Big Data” Definition
“Data collections big enough to require you to change the way you store and process them.” - Andy Pryke

Data Size Limits in R
- Standard R packages use a single thread, with data held in memory (RAM)
- See help("Memory-limits")
- Vectors limited to about 2 billion (2^31 - 1) items
- Memory limit of ~128 TB
- Servers with 1 TB+ memory are available
- Also, Amazon EC2 servers up to 244 GB
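As a rough illustration of the limits above, the following sketch checks the largest standard R integer and estimates how much RAM a numeric vector needs. The variable names are made up for this example, and the 8-bytes-per-element figure applies to double-precision vectors only.

```r
## Largest value of a standard R integer (2^31 - 1),
## which is where the "2 billion items" figure comes from
max_items <- .Machine$integer.max

## A numeric (double) vector needs about 8 bytes per element, plus a small header
n <- 1e6
x <- numeric(n)
size_bytes <- as.numeric(object.size(x))

## Estimated RAM for a numeric vector at the 2^31 - 1 item limit,
## expressed in units of 2^30 bytes
est_gb <- max_items * 8 / 2^30

print(max_items)
print(size_bytes)
print(est_gb)
```

So a single full-length numeric vector already needs roughly 16 GB of RAM, well before any copies made during analysis.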

Overview
- Problems using R with Big Data
- Processing data on disk
- Hadoop for parallel computation and Big Data storage / access
- “In Database” analysis
- What next for Birmingham R User Group?

Background: R matrix class
- Built in (package base)
- Stored in RAM
- “Dense” (takes up memory to store zero values)
Can be replaced by…

Sparse / Disk Based Matrices
- Matrix – Package Matrix. Sparse. In RAM
- big.matrix – Package bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
- Analysis – Packages irlba, bigalgebra, biganalytics (R-Forge list) etc.
More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
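To make the dense versus sparse difference concrete, here is a minimal sketch comparing the memory used by a base R matrix and a Matrix-package sparse matrix holding the same values. The size n and the variable names are arbitrary choices for this example, and it assumes the Matrix package is installed (it ships with standard R distributions).

```r
library(Matrix)  # provides sparseMatrix()

n <- 2000

## Dense base R matrix: every cell, including the zeros, takes 8 bytes
dense <- matrix(0, nrow = n, ncol = n)
diag(dense) <- 1

## Sparse equivalent: only the n non-zero entries (and their positions) are stored
sparse <- sparseMatrix(i = 1:n, j = 1:n, x = rep(1, n), dims = c(n, n))

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
print(c(dense = dense_bytes, sparse = sparse_bytes))

## Both objects represent the same matrix
same_values <- all(as.matrix(sparse) == dense)
print(same_values)
```

The saving depends entirely on how many zeros the data contains; for a mostly non-zero matrix the sparse form can be larger than the dense one.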

Commercial Versions of R
- Revolution Analytics have specialised versions of R for parallel execution & big data
- I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages
- Plenty more info here

Background: Hadoop
- Parallel data processing environment based on Google’s “MapReduce” model
- “Map” – divide up the data and send it to multiple nodes for processing
- “Reduce” – combine the results
Plus:
- Hadoop Distributed File System (HDFS)
- HBase – Distributed database like Google’s BigTable
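The Map and Reduce steps can be imitated on a single machine in base R. This is only a sketch of the idea, not Hadoop itself: the four chunks stand in for nodes, and the data and variable names are invented for the example.

```r
## Toy data standing in for a large dataset
x <- 1:1000

## "Map": divide the data into 4 chunks (one per pretend node)
## and compute a partial result on each chunk independently
chunks   <- split(x, cut(seq_along(x), breaks = 4, labels = FALSE))
partials <- lapply(chunks, function(chunk) sum(chunk^2))

## "Reduce": combine the partial results into the final answer
total <- Reduce(`+`, partials)

## The combined result matches the single-machine calculation
print(total == sum(x^2))
```

Because each chunk is processed without reference to the others, the lapply() call is the part a cluster could run in parallel.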

RHadoop – Revolution Analytics
- Packages: rmr2, rhbase, rhdfs
- Example code using RMR (R Map-Reduce)
- R and Hadoop – Step by Step Tutorials
- Install and Demo RHadoop (Google for more of these online)
- Data Hacking with RHadoop

RHadoop

    library(rmr2)

    wc.map <- function(., lines) {
      ## split "lines" of text into a vector of individual "words"
      words <- unlist(strsplit(x = lines, split = " "))
      keyval(words, 1)  ## each word occurs once
    }

    wc.reduce <- function(word, counts) {
      ## Add up the counts, grouping them by word
      keyval(word, sum(counts))
    }

    wordcount <- function(input, output = NULL) {
      mapreduce(input = input, output = output, input.format = "text",
                map = wc.map, reduce = wc.reduce, combine = TRUE)
    }

E.g. Function Output

    ## In, 1
    ## the, 1
    ## beginning, 1
    ## ...
    ## the, 2345
    ## word, 987
    ## beginning, 123
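The same word count logic can be tried without a Hadoop cluster. The sketch below uses only base R to mimic the three stages (map, group by key, reduce); it does not call rmr2, and the sample text and variable names are invented for illustration.

```r
lines <- c("In the beginning", "the word", "the beginning of the word")

## Map: emit one (word, 1) pair per word
words  <- unlist(strsplit(lines, split = " "))
counts <- rep(1, length(words))

## Shuffle: group the emitted counts by key (the word)
grouped <- split(counts, words)

## Reduce: add up the counts for each word
result <- sapply(grouped, sum)
print(result)
```

In a real MapReduce job the grouping step is done by the framework between the map and reduce functions, which is why the rmr2 version only defines those two.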

Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue – easy way to distribute CPU intensive work
- Uses Amazon’s Elastic Map Reduce service, which costs money.
- Not designed for big data, but easy and fun.
Example follows…

segue example

    # first, let's generate a 10-element list of
    # 999 random numbers + 1 NA:
    > myList <- getMyTestList()

    # Take the mean of each set of 999 numbers, locally and on the cluster
    # (myCluster is a segue cluster created beforehand)
    > outputLocal <- lapply(myList, mean, na.rm=T)
    > outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
    RUNNING - 2011-01-04 15:16:57
    RUNNING - 2011-01-04 15:17:27
    RUNNING - 2011-01-04 15:17:58
    WAITING - 2011-01-04 15:18:29

    ## Check local and cluster results match
    > all.equal(outputEmr, outputLocal)
    [1] TRUE

    # The key is the emrlapply() function. It works just like lapply(),
    # but automagically spreads its work across the specified cluster
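For comparison, the same lapply-style pattern can be tried on a single machine with the parallel package that ships with R. This is a local analogue rather than segue or Elastic Map Reduce: the two-worker cluster and the stand-in for getMyTestList() are assumptions made for this sketch.

```r
library(parallel)

## Hypothetical stand-in for getMyTestList():
## a 10-element list, each holding 999 random numbers + 1 NA
set.seed(1)
myList <- lapply(1:10, function(i) c(rnorm(999), NA))

## Run locally in a single process
outputLocal <- lapply(myList, mean, na.rm = TRUE)

## Run the same call across 2 local worker processes
cl <- makeCluster(2)
outputPar <- parLapply(cl, myList, mean, na.rm = TRUE)
stopCluster(cl)

## Check serial and parallel results match
print(all.equal(outputPar, outputLocal))
```

As with emrlapply(), the appeal is that only the function name and a cluster argument change relative to plain lapply().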

Oracle R Connector for Hadoop
- Integrates with Oracle Database, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
- Map-Reduce is very similar to the rmr example
- Documentation lists examples for Linear Regression, k-means and working with graphs, amongst others
- Introduction to Oracle R Connector for Hadoop
- Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)

Teradata Integration
- Package: teradataR
- Teradata offer in-database analytics, accessible through R
- These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions

What Next?
I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.
“R” you interested?