Published by Sharlene Rose. Modified over 9 years ago.
1
Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.html
Email: joe@joehummel.net
2
Hadoop on Azure
A little history…
Why Hadoop?
How it works
Demos
Summary
3
Map-Reduce is from functional programming

// function returns 1 if i is prime, 0 if not:
let isPrime(i) = ...

// sums 2 numbers:
let sum(x, y) = return x + y

// count the number of primes in 1..N:
let countPrimes(N) =
    let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
    let T = map isPrime L       // [ 0, 1, 1, 0, 1, 0, ... ]
    let count = reduce sum T    // 42
    return count
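The pseudocode above can be sketched in runnable JavaScript. This is a minimal illustration, not the presenter's code: the slide elides the body of isPrime, so the trial-division version here is an assumption, and the built-in Array map and reduce methods stand in for the functional-language primitives.

```javascript
// Returns 1 if i is prime, 0 if not.
// ASSUMPTION: the slide does not show this body; simple trial division is used.
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the primes in 1..N with a map step followed by a reduce step.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, 3, ..., N]
  const T = L.map((i) => isPrime(i));                   // [0, 1, 1, 0, 1, 0, ...]
  return T.reduce(sum, 0);                              // add up the 1s
}

console.log(countPrimes(100)); // prints 25
```

Each call to isPrime is independent of the others, which is what lets the map step be spread across many machines.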
4
Created by Google to drive internet search
  ◦ BIG data: scalable to TBs and beyond
  ◦ Parallelism: to get the performance
  ◦ Data partitioning: to drive the parallelism
  ◦ Fault tolerance: at this scale, machines are going to crash, a lot…
[Diagram: BIG Data → page hits]
5
Search engines: Google, Yahoo, Bing
Facebook
Twitter
Financials
Health industry
Insurance
Credit card companies
Just about any company collecting user data…
6
Freely available framework for big data
  ◦ http://hadoop.apache.org/
Based on the concept of Map-Reduce:
[Diagram: BIG data → Map (map function) → intermediate results → Reduce → R]
7
[Diagram: Mapper and Reducer]
8
[Diagram: Data is split across parallel Map and Sort stages; their sorted key/value lists are merged and passed to Reduce, which produces the results R]
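The stages in this diagram can be modeled in memory to make the data flow concrete. The sketch below is not Hadoop code: every function name is hypothetical, the "mappers" run one after another instead of in parallel, and the merge step simply groups values by key and then orders the groups. The input letters are made-up sample data.

```javascript
// Hypothetical in-memory model of the Map -> Sort -> Merge -> Reduce stages.

// Split the input round-robin into `parts` chunks, one per simulated mapper.
function partition(records, parts) {
  const chunks = Array.from({ length: parts }, () => []);
  records.forEach((rec, i) => chunks[i % parts].push(rec));
  return chunks;
}

// Run the map function over one chunk and sort its output pairs by key.
// mapFn must return an array of [key, value] pairs for each record.
function mapAndSort(chunk, mapFn) {
  const pairs = chunk.flatMap(mapFn);
  return pairs.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
}

// Combine the outputs of all mappers: group values by key, then order by key.
function merge(sortedRuns) {
  const groups = new Map();
  for (const [key, value] of sortedRuns.flat()) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort((a, b) => (a[0] < b[0] ? -1 : 1));
}

// Drive the whole pipeline and call reduceFn once per key.
function runMapReduce(records, mapFn, reduceFn, parts = 2) {
  const runs = partition(records, parts).map((c) => mapAndSort(c, mapFn));
  return merge(runs).map(([key, values]) => [key, reduceFn(key, values)]);
}

// Example: count how often each letter occurs.
const counts = runMapReduce(
  ["b", "a", "b", "c", "a", "b"],
  (letter) => [[letter, 1]],
  (key, values) => values.reduce((s, v) => s + v, 0)
);
console.log(JSON.stringify(counts)); // [["a",2],["b",3],["c",1]]
```

In a real cluster the partitions live on different machines, so the sort and merge steps are what bring all values for one key to the same reducer.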
9
Netflix data-mining…
[Diagram: Netflix Movie Reviews (.txt) → Netflix Data Mining App → Average rating…]

movieid,userid,rating,date
1,2390087,3,2005-09-06
217,5567801,5,2006-01-03
42,1121098,3,2006-03-25
1,8972234,5,2003-12-02
…
10
[Diagram: Data → parallel Map and Sort stages → Merge → Reduce → R]
11
To compute average rating for every movie:

// JavaScript version:
var map = function (key, value, context) {
    var values = value.split(",");
    // field 0 contains movieid, field 2 the rating:
    context.write(values[0], values[2]);
};

var reduce = function (key, values, context) {
    var sum = 0;
    var count = 0;
    // values iterates over every rating emitted for this movieid:
    while (values.hasNext()) {
        count++;
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum / count);
};
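To try the map and reduce functions above without a cluster, they can be driven by small stand-ins for the objects Hadoop passes in. The sketch below is only a local approximation: makeContext, makeIterator, and runLocally are hypothetical helpers, not part of any Hadoop API, and the input is the four sample rows from the Netflix slide with the header row left out.

```javascript
// map and reduce as shown on the slide.
var map = function (key, value, context) {
  var values = value.split(",");
  context.write(values[0], values[2]); // movieid, rating
};

var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// HYPOTHETICAL stand-in for the context object: collects written pairs.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}

// HYPOTHETICAL stand-in for the values iterator handed to reduce.
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

// Map every line, group the mapped values by key, then reduce each group.
function runLocally(lines) {
  const mapped = makeContext();
  lines.forEach((line, n) => map(n, line, mapped));

  const groups = new Map();
  for (const [k, v] of mapped.pairs) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }

  const reduced = makeContext();
  for (const [k, vs] of groups) reduce(k, makeIterator(vs), reduced);
  return reduced.pairs;
}

const sample = [
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02",
];
console.log(JSON.stringify(runLocally(sample))); // [["1",4],["217",5],["42",3]]
```

Movie 1 has ratings 3 and 5, so its average is 4. If the data file included the header row, parseInt would return NaN for the "rating" field, so the header would need to be filtered out first.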
12
Upload data to HDFS
  ◦ Hadoop file system
Write map / reduce functions
  ◦ default is to use Java
  ◦ most languages supported: C, C++, C#, JavaScript, Python, …
Compile and upload code
  ◦ For Java, you upload a .jar file
  ◦ For others, an .exe or script
Submit MapReduce job
Wait for job to complete
13
Hadoop is for:
  ◦ Queries against big datasets
  ◦ Embarrassingly parallel problems (the solution must fit into the map-reduce framework)
  ◦ Non-real-time demands
Hadoop is not for:
  ◦ Small datasets (< 1GB?)
  ◦ Sub-second / real-time needs (though clearly Google makes it work)
14
We’ll be working with Chicago crime data…
  ◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
  ◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html
  ◦ 1 GB, 5M rows
15
Compute top-10 crimes…

IUCR    Count
0486    366903
0820    308074
…
0890    166916

IUCR = Illinois Uniform Crime Reporting codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
16
Hadoop on Azure…
Supports traditional Hadoop usage
  ◦ Upload data
  ◦ Write MapReduce program
  ◦ Submit job
Additional features:
  ◦ Allows access to persistent data from Azure Storage Vault
  ◦ Provides interactive JavaScript console
  ◦ Built-in higher-level query languages (PIG, HIVE)
17
// JavaScript version:
var map = function (key, value, context) {
    var values = value.split(",");
    // field 4 contains the IUCR code; emit a count of 1 per crime:
    context.write(values[4], 1);
};

var reduce = function (key, values, context) {
    var sum = 0;
    // add up the 1s emitted for this IUCR code:
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

Output:
0486    366903
0820    308074
…
18
// interactive PIG with explicit Map-Reduce functions:
pig.from("asv://datafiles/CC-from-2001.txt").
    mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
    orderBy("Count DESC").
    take(10).
    to("output-from-2001")

// visualize the results:
file = fs.read("output-from-2001/part-r-00000")
data = parse(file.data, "IUCR, Count:long")
graph.bar(data)
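The orderBy("Count DESC") and take(10) steps in this pipeline amount to a descending sort followed by a truncation. A plain JavaScript sketch of that idea follows; it is not the PIG console API, topN is a hypothetical helper, and the input is just the three IUCR counts shown on the slides, listed in arbitrary order.

```javascript
// Reducer output as [IUCR, count] pairs (three rows taken from the slides).
const counts = [
  ["0890", 166916],
  ["0486", 366903],
  ["0820", 308074],
];

// HYPOTHETICAL helper: sort by count, descending, and keep the first n rows.
// The input array is copied first so the caller's data is not reordered.
function topN(rows, n) {
  return [...rows].sort((a, b) => b[1] - a[1]).slice(0, n);
}

const top = topN(counts, 10);
console.log(JSON.stringify(top));
// [["0486",366903],["0820",308074],["0890",166916]]
```

With fewer than ten rows, slice simply returns everything that is there, so the same call works on the full reducer output.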
19
Microsoft is offering free access to Hadoop
  ◦ Request invitation @ http://www.hadooponazure.com/
Hadoop connector for Excel
  ◦ Process data using Hadoop, analyze/visualize using Excel
21
Hadoop is all about big data processing
  ◦ Scalable, parallel, fault-tolerant
Easy to understand programming model
  ◦ Map-Reduce
  ◦ But then solution must fit into this framework…
Rich ecosystem developing around Hadoop
  ◦ Technologies: PIG, HIVE, HBase, …
  ◦ Companies: Cloudera, Hortonworks, MapR, …
22
Presenter: Joe Hummel
  ◦ Email: joe@joehummel.net
  ◦ Materials: http://www.joehummel.net/downloads.html
For more info:
  ◦ http://www.hadooponazure.com/
  ◦ http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
  ◦ Overview, including how to access via .NET API: http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/