Big Data, Bigger Data & Big R Data
Birmingham R Users Meeting, 23rd April 2013
Andy Pryke


My Bias…
- I work in commercial data mining, data analysis and data visualisation
- Background in computing and artificial intelligence
- Use R to write programs which analyse data

What is Big Data?
Depends who you ask. Answers are often "too big to…"
- …load into memory
- …store on a hard drive
- …fit in a standard database
Plus:
- Fast changing
- Not just relational

My Big Data Definition
"Data collections big enough to require you to change the way you store and process them."
- Andy Pryke

Data Size Limits in R
- Standard R packages use a single thread, with data held in memory (RAM)
- See help("Memory-limits")
- Vectors limited to 2 billion items (2^31 - 1)
- Memory limit of ~128 TB
- Servers with 1 TB+ memory are available
- Also, Amazon EC2 servers with up to 244 GB
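
The limits above can be turned into a quick back-of-envelope check before loading a data set. This is a minimal sketch in base R; `estimate_gb()` is a hypothetical helper (not part of any package), and it assumes 8 bytes per numeric value while ignoring per-object overhead.

```r
# Largest length of a classic (non-long) vector: 2^31 - 1 elements.
max_len <- .Machine$integer.max

# Hypothetical helper: rough RAM needed for a dense numeric table,
# assuming 8 bytes per double and ignoring object overhead.
estimate_gb <- function(n_rows, n_cols, bytes_per_cell = 8) {
  n_rows * n_cols * bytes_per_cell / 1024^3
}

# Measured size of a real vector, to compare against the 8-bytes rule.
x <- numeric(1e6)
measured_mb <- as.numeric(object.size(x)) / 1024^2

cat("Max classic vector length:", max_len, "\n")
cat("1e6 doubles, measured MB:", round(measured_mb, 2), "\n")
cat("1e9 rows x 10 cols, estimated GB:", round(estimate_gb(1e9, 10), 1), "\n")
```

If the estimate is well beyond the RAM you have, that is a hint to look at the disk-based or distributed options covered later in the talk.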

Overview
- Problems using R with Big Data
- Processing data on disk
- Hadoop for parallel computation and Big Data storage / access
- In-database analysis
- What next for Birmingham R User Group?

Background: R matrix class
matrix
- Built in (package base)
- Stored in RAM
- Dense (takes up memory to store zero values)
Can be replaced by…

Sparse / Disk Based Matrices
- Matrix – Package Matrix. Sparse. In RAM
- big.matrix – Package bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
- Analysis – Packages irlba, bigalgebra, biganalytics (R-Forge list) etc.
- More details? Large-Scale Linear Algebra with R, Bryan W. Lewis, Boston R Users Meetup
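
To make the dense versus sparse difference concrete, here is a small sketch using the Matrix package (a recommended package that ships with standard R installations). The matrix size and the one-non-zero-per-row pattern are arbitrary choices for illustration, not taken from the talk.

```r
library(Matrix)  # recommended package shipped with standard R installs

set.seed(1)
n <- 1000

# A 1000 x 1000 matrix with exactly one non-zero value per row.
sparse <- sparseMatrix(i = 1:n, j = sample(n), x = runif(n), dims = c(n, n))

# The same values as a base R matrix, with every zero stored explicitly.
dense <- as.matrix(sparse)

size_sparse <- as.numeric(object.size(sparse))
size_dense  <- as.numeric(object.size(dense))

cat("dense bytes: ", size_dense, "\n")
cat("sparse bytes:", size_sparse, "\n")
```

The sparse object only stores the non-zero entries and their positions, so its size grows with the number of non-zeros rather than with rows times columns.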

Commercial Versions of R
- Revolution Analytics have specialised versions of R for parallel execution & big data
- I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages
- Plenty more info here

Background: Hadoop
- Parallel data processing environment based on Google's MapReduce model
- Map – divide up the data and send it to multiple nodes for processing
- Reduce – combine the results
Plus:
- Hadoop Distributed File System (HDFS)
- HBase – distributed database like Google's BigTable
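
The Map and Reduce steps can be imitated in plain R on a single machine, which may help before looking at real Hadoop code. This is only an in-memory sketch of the idea: `map_fn`, `reduce_fn` and the sample text are made up for illustration, and nothing here touches Hadoop or HDFS.

```r
text_lines <- c("in the beginning", "the word", "in the end")

# Map: each "node" turns its chunk of lines into (word, 1) pairs.
map_fn <- function(chunk) {
  words <- unlist(strsplit(chunk, " ", fixed = TRUE))
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}

# Reduce: combine all the values that share a key.
reduce_fn <- function(values) sum(values)

# Divide the data into two chunks, as if sending them to two nodes.
chunks <- split(text_lines, c(1, 1, 2))
mapped <- do.call(rbind, lapply(chunks, map_fn))

# Shuffle: group the values by key, then reduce each group.
grouped <- split(mapped$value, mapped$key)
counts  <- vapply(grouped, reduce_fn, integer(1))
print(counts)
```

In a real cluster the chunks live on different machines and the grouping step moves data over the network, which is the part Hadoop manages for you.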

RHadoop – Revolution Analytics
- Packages: rmr2, rhbase, rhdfs
- Example code using RMR (R Map-Reduce)
- R and Hadoop – Step by Step Tutorials
- Install and Demo RHadoop (Google for more of these online)
- Data Hacking with RHadoop

RHadoop

    library(rmr2)  # provides keyval() and mapreduce()

    wc.map <- function(., lines) {
      ## split "lines" of text into a vector of individual "words"
      words <- unlist(strsplit(x = lines, split = " "))
      keyval(words, 1)  ## each word occurs once
    }

    wc.reduce <- function(word, counts) {
      ## Add up the counts, grouping them by word
      keyval(word, sum(counts))
    }

    wordcount <- function(input, output = NULL) {
      mapreduce(
        input = input,
        output = output,
        input.format = "text",
        map = wc.map,
        reduce = wc.reduce,
        combine = TRUE)
    }

E.g. output of the map function:

    ## In, 1
    ## the, 1
    ## beginning, 1
    ## ...

E.g. output of the reduce function:

    ## the, 2345
    ## word, 987
    ## beginning, 123
    ## ...

Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue – easy way to distribute CPU intensive work
- Uses Amazon's Elastic Map Reduce service, which costs money.
- Not designed for big data, but easy and fun.
Example follows…

segue

    # first, let's generate a 10-element list of
    # 999 random numbers + 1 NA:
    > myList <- getMyTestList()

    # Take the mean of each set of 999 numbers
    > outputLocal <- lapply(myList, mean, na.rm=T)
    > outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
    RUNNING :16:57
    RUNNING :17:27
    RUNNING :17:58
    WAITING :18:29

    ## Check local and cluster results match
    > all.equal(outputEmr, outputLocal)
    [1] TRUE

    # The key is the emrlapply() function. It works just like lapply(),
    # but automagically spreads its work across the specified cluster
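
The same "drop-in replacement for lapply()" pattern can be tried locally, with no Amazon account, using the parallel package that ships with R. This is a stand-in for illustration only, not segue itself: `mclapply()` spreads the work over local cores rather than an Elastic Map Reduce cluster, and the test list is generated here because `getMyTestList()` is not shown in the slides.

```r
library(parallel)  # ships with base R

set.seed(42)
# Ten vectors of 999 random numbers plus one NA, mirroring the slide's test list.
my_list <- replicate(10, c(rnorm(999), NA), simplify = FALSE)

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

output_local    <- lapply(my_list, mean, na.rm = TRUE)
output_parallel <- mclapply(my_list, mean, na.rm = TRUE, mc.cores = n_cores)

# Check the sequential and parallel results match.
print(all.equal(output_parallel, output_local))
```

Because the calling convention is the same as lapply(), code written this way should be easy to move between a local run and a cluster-backed function such as emrlapply().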

Oracle R Connector for Hadoop
- Integrates with Oracle Db, Oracle Big Data Appliance (sounds expensive!) & HDFS
- Map-Reduce is very similar to the rmr example
- Documentation lists examples for Linear Regression, k-means, working with graphs amongst others
- Introduction to Oracle R Connector for Hadoop
- Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)

Teradata Integration
- Package: teradataR
- Teradata offer in-database analytics, accessible through R
- These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions

What Next?
- I propose an informal big data Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.
- R you interested?