Published by Pamela Carroll. Modified over 9 years ago.
1
Hadoop as a Service
Big Data tools for the Windows Azure cloud platform
Boston Azure / Microsoft DevBoston, 07-Feb-2012
Bill Wilder, http://blog.codingoutloud.com, @codingoutloud
Boston Azure User Group, http://www.bostonazure.org, @bostonazure
Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license http://creativecommons.org/licenses/by-nc-sa/3.0/
2
Bill Wilder
– Windows Azure MVP
– Windows Azure Consultant
– Boston Azure User Group Founder
– Cloud Architecture Patterns book (due 2012)
3
We will consider…
1. How might we build a simple Word Frequency Counter?
2. What are Map and Reduce?
3. How do we scale our Word Frequency Counter?
– Hint: we might use Hadoop
4. How does Windows Azure make Hadoop easier with “Hadoop as a Service”?
– CTP
4
“Five exabytes of data created every two days. As much as from the dawn of civilization up until 2003.”
- Eric Schmidt (CEO of Google at the time)
5
“Big Data” Challenge: the Three Vs
Volume: lots of it already
Velocity: more of it every day
Variety: many sources, many formats
6
Short History of Hadoop
1. Inspired by:
– Google Map/Reduce paper – http://research.google.com/archive/mapreduce.html
– Google File System (GFS) – Goals: distributed, fault tolerant, fast enough
2. Born in: Lucene Nutch project
– Built in Java
– Hadoop cluster appears as single über-machine
7
Hadoop: batch processing, big data
Batch, not real-time or transactional
Scale out with commodity hardware
Big customers like LinkedIn and Yahoo!
– Clusters with 10s of Petabytes (pssst… these fail… daily)
Import data from Azure Blob, Data Market, S3
– Or from files, like we will do in our example
8
Word Frequency Counter – how?
The “hello world” of Hadoop / MapReduce
– But we start without Hadoop / MapReduce
Input: large corpus
– Wikipedia extract, for example
– Can scale into petabytes (PB)
Output: list of words, ordered by frequency
    the  31415
    be   9265
    to   3589
    of   793
    and  238
    …
9
Simple Word Frequency Counter

    // Requires: System, System.Collections.Generic, System.IO, System.Linq, System.Text.RegularExpressions

    // Read in all text
    const string file = @"e:\dev\azure\hadoop\wordcount\davinci.txt";
    var text = File.ReadAllText(file);

    // Parse out words (MAP)
    var matches = Regex.Matches(text, @"\b[\w]*\b");

    // Normalize & Sort; the pattern can match empty strings, so filter them out
    var words = (from m in matches.Cast<Match>()
                 where !string.IsNullOrEmpty(m.Value)
                 orderby m.Value.ToLower()
                 select m.Value.ToLower()).ToArray();

    // How many times does each word appear (REDUCE)
    var wordCounts = new Dictionary<string, int>();
    foreach (var word in words)
    {
        if (wordCounts.ContainsKey(word)) wordCounts[word]++;
        else wordCounts.Add(word, 1);
    }

    // Output
    foreach (var wc in wordCounts)
        Console.WriteLine(wc.Key + " : " + wc.Value);

Output:
    aware : 7
    away : 99
    awning : 2
    awoke : 1
    axes : 16
    axil : 3
    axiom : 2
10
Map
“Apply a function to each element in this list and return a new list”
Reduce
“Apply a function collectively to all elements in this list and return the final answer.”
11
Map Example 1
    square(x) { return x*x }
    { 1, 2, 3, 4 } → { 1, 4, 9, 16 }
Reduce Example 1
    sum(x, y) { return x + y }
    { 1, 2, 3, 4 } → 10
12
Map Example 1
    square(x) { return x*x }
    { 1, 2, 3, 4 } → { square(1), square(2), square(3), square(4) } → { 1, 4, 9, 16 }
Reduce Example 1
    sum(x, y) { return x + y }
    { 1, 2, 3, 4 } → sum(sum(sum(1,2),3),4) → 10
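Example 1 can be sketched in Java with the standard Stream API. The class and method names (`MapReduceExample1`, `squareAll`, `sumAll`) are made up for illustration, and the fold starts from the identity value 0, which the slide does not show.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapReduceExample1 {
    // Map: apply square to each element and return a new list.
    static List<Integer> squareAll(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce: fold the whole list into one value with sum,
    // starting from the identity 0 (an assumption of this sketch).
    static int sumAll(List<Integer> xs) {
        return xs.stream().reduce(0, (x, y) -> x + y);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(squareAll(input)); // [1, 4, 9, 16]
        System.out.println(sumAll(input));    // 10
    }
}
```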
13
Map Example 2
    strlen(s) { return s.Length }
    { "Celtics", "Bruins" } → { 7, 6 }
Reduce Example 2
    strlen_sum(x, y) { return x.Length + y.Length }
    { "Celtics", "Bruins" } → 13
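The slide's strlen_sum takes two strings, which works for a two-element list. For longer lists, one possible generalization (an assumption here, not something the slide states) is to map each string to its length first and then reduce the lengths with a plain sum. The names `MapReduceExample2` and `totalLength` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class MapReduceExample2 {
    // Map step: strlen(s) for each string.
    // Reduce step: add up the mapped lengths, starting from 0.
    static int totalLength(List<String> words) {
        return words.stream()
                    .map(String::length)
                    .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(totalLength(Arrays.asList("Celtics", "Bruins"))); // 13
    }
}
```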
14
Map Example 3 (the fancy one)
    fancy_mapper(s) {
        if (s == "SkipMe") return null;
        return ToLower(s) + ", " + s.Length;
    }
    { "Will", "Dan", "SkipMe", "Kevin", "T.J." } → { "will, 4", "dan, 3", "kevin, 5", "t.j., 4" }
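A minimal Java sketch of the fancy mapper follows. It assumes that a null result means "emit nothing for this element", so nulls are filtered out after mapping; the class name `FancyMapper` and the helper `mapAll` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class FancyMapper {
    // Returns null for elements that should be skipped.
    static String fancyMapper(String s) {
        if (s.equals("SkipMe")) return null;
        return s.toLowerCase() + ", " + s.length();
    }

    // Apply the mapper to every element and drop the skipped (null) results.
    static List<String> mapAll(List<String> input) {
        return input.stream()
                    .map(FancyMapper::fancyMapper)
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(mapAll(Arrays.asList("Will", "Dan", "SkipMe", "Kevin", "T.J.")));
    }
}
```

The output list can be shorter than the input list, which is how this mapper differs from the first two examples.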
15
Problems with Word Counter?
What happens if our data is… 1 GB, 1 TB, 1 PB, …
What happens if our data is… Images, videos, tweets, Facebook updates, …
What happens if our processing is… Complex, multiple steps, …
16
Simplified Example: Word Frequency Counter
Which word appears most frequently, and how many times?
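On a single machine, with no Hadoop involved, the question can be answered with a small Java sketch like the one below. The tokenization rule (lower-case, split on non-word characters) and the names `MostFrequentWord` and `mostFrequent` are assumptions made for illustration; ties are broken arbitrarily.

```java
import java.util.HashMap;
import java.util.Map;

class MostFrequentWord {
    // Returns the (word, count) entry with the highest count, or null for empty input.
    static Map.Entry<String, Integer> mostFrequent(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) best = e;
        }
        return best;
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> top = mostFrequent("the Bruins are the best in the NHL");
        System.out.println(top.getKey() + " : " + top.getValue()); // the : 3
    }
}
```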
17
Workflow
1. Setup
2. Map
3. Shuffle
4. Reduce
5. Celebrate
18
Hadoop Cluster
1 MASTER NODE: Job Tracker (“the boss”)
Many SLAVE NODES: Task Tracker on each
HDFS on all nodes
19
Step 1. Setup
(Assumes: you’ve installed Hadoop on a cluster of computers)
You supply:
1. Map and Reduce logic
– This is “code”, packaged in a Java JAR file
– Other language support exists, more coming
2. A big pile of input files
– “Get the data here”
– For Word Frequency Counter, we might use Wikipedia or Project Gutenberg files
3. Go!
20
Step 2. Map
Job Tracker distributes your Mapper and Reducer
– To every node
Job Tracker distributes your data
– To some nodes (at least 3) in 64 MB chunks
Task Tracker on each node calls Mapper
– Repeatedly until done; lots of parallelism
Job Tracker watches for problems
– Manages retries, failures, optimistic attempts
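To get a feel for the chunk arithmetic, here is a rough sketch. It assumes the 64 MB chunk size from the slide and reads "at least 3" as three stored copies of each chunk; the names `BlockMath`, `chunkCount`, and `storedCopies` are hypothetical.

```java
class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of chunks needed for a file: ceiling of fileBytes / chunkBytes.
    static long chunkCount(long fileBytes, long chunkBytes) {
        return (fileBytes + chunkBytes - 1) / chunkBytes;
    }

    // Total chunk copies stored across the cluster for a given replica count.
    static long storedCopies(long chunks, int replicas) {
        return chunks * replicas;
    }

    public static void main(String[] args) {
        long chunks = chunkCount(1024 * MB, 64 * MB);   // a 1 GB input file
        System.out.println(chunks);                     // 16
        System.out.println(storedCopies(chunks, 3));    // 48
    }
}
```

Each chunk is a natural unit of work for one Mapper call, which is where the parallelism comes from.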
21
Mapper’s Job is Simple*
Read input, write output – that’s all
– Function signature: Map(text)
– Parses the text and returns { key, value }
– Map(“a b a foo”) returns {a, 1}, {b, 1}, {a, 1}, {foo, 1}
* for Word Frequency Counter!
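The Map(text) signature above can be sketched in plain Java without any Hadoop classes. The class name `SimpleMapper` and the use of `Map.Entry` to represent a { key, value } pair are choices made for this illustration only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class SimpleMapper {
    // Split the text on whitespace and emit a (word, 1) pair for every token.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("a b a foo")); // [a=1, b=1, a=1, foo=1]
    }
}
```

Note that the mapper does no counting: the repeated word "a" is emitted twice, each time with the value 1.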
22
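Actual Java Map Function

    // Fields of the enclosing Mapper class, as in the standard Apache WordCount example.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for each token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);  // the value is an IntWritable, matching the reducer's input
        }
    }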
23
Step 3. Shuffle Shuffle collects all data from Map, organizes by key, and redistributes data to the nodes for Reduce
24
Step 3. Shuffle – example
Mapper input: “the Bruins are the best in the NHL”
Mapper output:
    { the, 1 } { the, 1 } { the, 1 } { Bruins, 1 } { best, 1 } { NHL, 1 } { are, 1 } { in, 1 }
Shuffle transforms this into Reducer input:
    { are, [ 1 ] } { in, [ 1 ] } { Bruins, [ 1 ] } { best, [ 1 ] } { the, [ 1, 1, 1 ] } { NHL, [ 1 ] }
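The grouping part of the shuffle can be imitated in memory as follows. This is only a single-machine sketch: a real shuffle also moves data between nodes, which is not modeled here. The names `ShuffleSketch`, `pair`, and `shuffle` are hypothetical, and a `TreeMap` is used so that keys come out in sorted order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // Convenience helper for building a (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Collect all values that share a key into one list per key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = Arrays.asList(
            pair("the", 1), pair("Bruins", 1), pair("are", 1), pair("the", 1),
            pair("best", 1), pair("in", 1), pair("the", 1), pair("NHL", 1));
        System.out.println(shuffle(mapped));
    }
}
```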
25
Step 4. Reduce
Output from Step 3. Shuffle has been distributed to datanodes
Your “Reducer” is called on local data
– Repeatedly, until all complete
– Tasks run in parallel on nodes
This is very simple for Word Frequency Counter!
– Function signature: Reduce(key, values[])
– Adds up all the values and returns { key, sum }
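The Reduce(key, values[]) signature can likewise be sketched in plain Java, with no Hadoop classes. The class name `SimpleReducer` and the `Map.Entry` return type are assumptions for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class SimpleReducer {
    // Add up all values for one key and return the (key, sum) pair.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        System.out.println(reduce("the", Arrays.asList(1, 1, 1))); // the=3
    }
}
```

Because each call sees every value for exactly one key, calls for different keys are independent and can run in parallel.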
26
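Actual Java Reduce Function

    // Field of the enclosing Reducer class, as in the standard Apache WordCount example.
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up every count emitted for this key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);  // emit (word, total)
    }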
27
Step 5. Celebrate!
You are done – you have your output
In a more complex scenario, you might repeat the steps
– Hadoop tool ecosystem knows how to do this
There are other projects in the Hadoop ecosystem for …
– Multi-step jobs
– Managing a data warehouse
– Supporting ad hoc querying
– And more!
28
www.hadooponazure.com demo
29
There’s a LOT MORE to the Hadoops…
Hadoop streaming interface allows other languages
– C#
HIVE (HiveQL)
Pig (Pig Latin language)
Cascading.org
Commercial companies dedicated:
– HortonWorks
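Hadoop streaming mappers are ordinary programs that read input lines on standard input and write key/value lines on standard output, by default with a tab between key and value. The sketch below shows that contract for the word counter. It is written in Java only to stay consistent with the other code in this deck; in practice the point of streaming is to write this in another language. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class StreamingStyleMapper {
    // Read lines from 'in' and write one "word<TAB>1" line per token to 'out'.
    static void mapLines(BufferedReader in, PrintWriter out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    out.print(itr.nextToken() + "\t1\n");
                }
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // When run as a program: stdin in, stdout out.
        mapLines(new BufferedReader(new InputStreamReader(System.in)),
                 new PrintWriter(System.out));
    }
}
```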
30
Questions? Comments? More information?
31
BostonAzure.org
Boston Azure cloud user group
Focused on Microsoft’s PaaS cloud platform
Last Thursday, monthly, 6:00-8:30 PM at NERD
– Food; wifi; free; great topics; growing community
Boston Azure Boot Camp: June 2012 (planning)
Follow on Twitter: @bostonazure
More info or to join our Meetup.com group: http://www.bostonazure.org
32
Contact Me
Looking for …
– consulting help with Windows Azure Platform?
– someone to bounce Azure or cloud questions off?
– a speaker for your user group or company technology event?
Just Ask!
Bill Wilder @codingoutloud http://blog.codingoutloud.com
33
Hadoop
A tool to economically create value from the Three Vs
New tool
– Complements, rather than displaces, existing data analysis tools
There are other approaches, but Hadoop is winning
– Emerging as standard (remember XP vs. Scrum?)
– Microsoft embracing (http://www.hadooponazure.com/)
Large Ecosystem
– Apache Hadoop project
– HBASE, HIVE, Pig, Mahout, …
– Hadoop-specific companies: HortonWorks, Cloudera
34
Microsoft working with HortonWorks to port to Windows and enable on Windows Azure
Looking to make JavaScript, C# first-class languages
– Native language is Java
– Community support for Python and others
35
HIVE
HIVE QL – SQL-like
Creates MapReduce job in the background
Remember: Hadoop is BATCH oriented
Hive Excel Plugin
Hive Interactive Console in Azure
36
MapReduce is Functional Programming Like C#!