Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago Materials: Materials:
Map-Reduce is from functional programming Hadoop on Azure 2 // function returns 1 if i is prime, 0 if not: let isPrime(i) =... // sums 2 numbers: let sum(x, y) = return x + y // count the number of primes in 1..N: let countPrimes(N) = let L = [ 1.. N ] // [ 1, 2, 3, 4, 5, 6,... ] let T = map isPrime L // [ 0, 1, 1, 0, 1, 0,... ] let count = reduce sum T // 42 return count // function returns 1 if i is prime, 0 if not: let isPrime(i) =... // sums 2 numbers: let sum(x, y) = return x + y // count the number of primes in 1..N: let countPrimes(N) = let L = [ 1.. N ] // [ 1, 2, 3, 4, 5, 6,... ] let T = map isPrime L // [ 0, 1, 1, 0, 1, 0,... ] let count = reduce sum T // 42 return count
Hadoop: ◦ Created by to drive internet search ◦ Parallelism ◦ Data partitioning ◦ Fault tolerance 3 BIG Data BIG Data page hits
Freely-available framework for big data ◦ Based on concept of Map-Reduce: 4 BIG data BIG data Map Reduce R R map function reduce intermediate results......
5 Map Sort Reduce Merge [,, … ] [, … ] R R Data Map Sort Map Sort [,,, … ] [,, … ]
We’ll be working with Chicago crime data… ◦ ◦ GB 5M rows 1 GB 5M rows
Compute top-10 crimes… IUCR Count IUCR = Illinois Uniform Crime Codes Uniform-Crime-R/c7ck-438e IUCR = Illinois Uniform Crime Codes Uniform-Crime-R/c7ck-438e
Hadoop on Azure… 8 Hadoop on Azure // Javascript version: var map = function (key, value, context) { var values = value.split(","); context.write(values[4], 1); }; var reduce = function (key, values, context) { var sum = 0; while (values.hasNext()) { sum += parseInt(values.next()); } context.write(key, sum); }; // Javascript version: var map = function (key, value, context) { var values = value.split(","); context.write(values[4], 1); }; var reduce = function (key, values, context) { var sum = 0; while (values.hasNext()) { sum += parseInt(values.next()); } context.write(key, sum); };
Rich ecosystem around Hadoop ◦ Pig ◦ Hive ◦ HBASE ◦ … Hadoop on Azure 9 // interactive PIG with explicit Map-Reduce functions: pig.from("CC-from-2001.txt"). mapReduce("IUCR-Count.js", "IUCR, Count:long"). orderBy("Count DESC"). take(10). to("output-from-2001") // interactive PIG with explicit Map-Reduce functions: pig.from("CC-from-2001.txt"). mapReduce("IUCR-Count.js", "IUCR, Count:long"). orderBy("Count DESC"). take(10). to("output-from-2001") // interactive PIG without explicit Map-Reduce: schema = "ID,Case Number,Date,Block,IUCR,..." pig.from("CC-from-2001.txt", schema). groupBy("IUCR"). select("group, SUM($1.Count"). orderBy("Count DESC"). take(10). to("output-from-2001") // interactive PIG without explicit Map-Reduce: schema = "ID,Case Number,Date,Block,IUCR,..." pig.from("CC-from-2001.txt", schema). groupBy("IUCR"). select("group, SUM($1.Count"). orderBy("Count DESC"). take(10). to("output-from-2001")
Microsoft is offering free access to Hadoop ◦ Request Hadoop connector for Excel ◦ Process data using Hadoop, analyze/visualize using Excel Hadoop on Azure 10
Freely-available plugin for Excel 2010 ◦ Turns Excel into an in-memory database ◦ More precisely, turns spreadsheet into an OLAP cube Note: ◦ If you have 32-bit Excel, install 32-bit PowerPivot ◦ If you have 64-bit Excel, install 64-bit PowerPivot ◦ GBs of data will require 64-bit ◦ [ How to tell what version of Excel you have? File menu, help… ] Big Data Processing, Cheap 11
PowerPivot… ◦ Install ◦ PowerPivot menu ◦ PowerPivot Window ◦ Get Data... ◦ PivotTable… 12 Big Data Processing, Cheap
15 Big Data Processing, Cheap
Presenter: Joe Hummel ◦ ◦ Materials: Keep an eye for final release of: ◦ Hadoop on Azure ◦ Hadoop on Windows ◦ PowerView plugin for Excel Big Data Processing, Cheap