Hadoop as a Service Boston Azure / Microsoft DevBoston 07-Feb-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license

1 Hadoop as a Service Boston Azure / Microsoft DevBoston 07-Feb-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license http://creativecommons.org/licenses/by-nc-sa/3.0/ Boston Azure User Group http://www.bostonazure.org @bostonazure Bill Wilder http://blog.codingoutloud.com @codingoutloud Big Data tools for the Windows Azure cloud platform

2 Windows Azure MVP Windows Azure Consultant Boston Azure User Group Founder Cloud Architecture Patterns book (due 2012) Bill Wilder

3 We will consider… 1. How might we build a simple Word Frequency Counter? 2. What are Map and Reduce? 3. How do we scale our Word Frequency Counter? – Hint: we might use Hadoop 4. How does Windows Azure make Hadoop easier with “Hadoop as a Service”? – CTP

4 “Five exabytes of data created every two days” - Eric Schmidt (Google CEO at the time) That is as much as was created from the dawn of civilization up until 2003

5 “Big Data” Challenge: Three Vs Volume → lots of it already Velocity → more of it every day Variety → many sources, many formats

6 Short History of Hadoop 1. Inspired by: Google Map/Reduce paper – http://research.google.com/archive/mapreduce.html Google File System (GFS) – Goals: distributed, fault tolerant, fast enough 2. Born in: Lucene Nutch project Built in Java Hadoop cluster appears as single über-machine

7 Hadoop: batch processing, big data Batch, not real-time or transactional Scale out with commodity hardware Big customers like LinkedIn and Yahoo! – Clusters with 10s of Petabytes (pssst… these fail… daily) Import data from Azure Blob, Data Market, S3 – Or from files, like we will do in our example

8 Word Frequency Counter – how? The “hello world” of Hadoop / MapReduce – But we start without Hadoop / MapReduce Input: large corpus – Wikipedia extract, for example – May run into petabytes (PB) Output: list of words, ordered by frequency: the 31415, be 9265, to 3589, of 793, and 238, …

9 Simple Word Frequency Counter
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

const string file = @"e:\dev\azure\hadoop\wordcount\davinci.txt";
// Read in all text
var text = File.ReadAllText(file);
// Parse out words
var matches = Regex.Matches(text, @"\b[\w]*\b");
// Normalize & sort (empty matches are filtered out)
var words = (from m in matches.Cast<Match>()
             where !string.IsNullOrEmpty(m.Value)
             orderby m.Value.ToLower()
             select m.Value.ToLower()).ToArray();
// How many times does each word appear?
var wordCounts = new Dictionary<string, int>();
foreach (var word in words)
{
    if (wordCounts.ContainsKey(word)) wordCounts[word]++;
    else wordCounts.Add(word, 1);
}
foreach (var wc in wordCounts)
    Console.WriteLine(wc.Key + " : " + wc.Value);
Output: aware : 7, away : 99, awning : 2, awoke : 1, axes : 16, axil : 3, axiom : 2
Labels on the slide: MAP, REDUCE

10 Map “Apply a function to each element in this list and return a new list” Reduce “Apply a function collectively to all elements in this list and return the final answer.”

11 Map Example 1 square(x) { return x*x } { 1, 2, 3, 4 } → { 1, 4, 9, 16 } Reduce Example 1 sum(x, y) { return x + y } { 1, 2, 3, 4 } → 10

12 Map Example 1 square(x) { return x*x } { 1, 2, 3, 4 } → { 1, 4, 9, 16 } = { square(1), square(2), square(3), square(4) } Reduce Example 1 sum(x, y) { return x + y } { 1, 2, 3, 4 } → 10 = sum(sum(sum(1,2),3),4)
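
The two definitions above translate almost directly into Java streams. This is a minimal sketch for illustration, not code from the talk; the class name `MapReduceBasics` and its method names are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class MapReduceBasics {
    // Map: apply square to each element and return a new list.
    static List<Integer> squareAll(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce: fold the list pairwise with sum, i.e. sum(sum(sum(1,2),3),4).
    static int sumAll(List<Integer> xs) {
        return xs.stream().reduce(Integer::sum).orElse(0);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(squareAll(input)); // [1, 4, 9, 16]
        System.out.println(sumAll(input));    // 10
    }
}
```

Map keeps the list the same length; reduce collapses it to a single value.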

13 Map Example 2 strlen(s) { return s.Length } { “Celtics”, “Bruins” } → { 7, 6 } Reduce Example 2 strlen_sum(total, s) { return total + s.Length } (total starts at 0) { “Celtics”, “Bruins” } → 13
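
A hedged Java sketch of the string-length example. One assumption to flag: the reducer here takes a running total (an integer, starting at 0) plus the next string, so that it still works for lists longer than two elements. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StringLengths {
    // Map: each string becomes its length.
    static List<Integer> lengths(List<String> names) {
        return names.stream().map(String::length).collect(Collectors.toList());
    }

    // Reduce: running total plus the next string's length, starting from 0.
    // The third argument merges partial totals if the stream runs in parallel.
    static int totalLength(List<String> names) {
        return names.stream().reduce(0, (total, s) -> total + s.length(), Integer::sum);
    }

    public static void main(String[] args) {
        List<String> teams = Arrays.asList("Celtics", "Bruins");
        System.out.println(lengths(teams));     // [7, 6]
        System.out.println(totalLength(teams)); // 13
    }
}
```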

14 Map Example 3 (the fancy one) fancy_mapper(s) { if (s == “SkipMe”) return null; return ToLower(s) + “, “ + s.Length; } { “Will”, “Dan”, “SkipMe”, “Kevin”, “T.J.” } → { “will, 4”, “dan, 3”, “kevin, 5”, “t.j., 4” }
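
The fancy mapper shows that a mapper may drop inputs as well as transform them. A small Java sketch of the same idea follows; the names are hypothetical, and filtering out the `null` results after mapping is an assumption about how skipped items leave the output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.stream.Collectors;

final class FancyMapper {
    // Returns null for items to skip, otherwise "lowercased, length".
    static String fancyMapper(String s) {
        if (s.equals("SkipMe")) return null;
        return s.toLowerCase(Locale.ROOT) + ", " + s.length();
    }

    // Applies the mapper to every element and drops the skipped (null) results.
    static List<String> mapAll(List<String> input) {
        return input.stream()
                .map(FancyMapper::fancyMapper)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(mapAll(Arrays.asList("Will", "Dan", "SkipMe", "Kevin", "T.J.")));
        // [will, 4, dan, 3, kevin, 5, t.j., 4]
    }
}
```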

15 Problems with Word Counter? What happens if our data is… 1 GB, 1 TB, 1 PB, … What happens if our data is… Images, videos, tweets, Facebook updates, … What happens if our processing is… Complex, multiple steps, …

16 Simplified Example Word Frequency Counter Which word appears most frequently, and how many times?

17 Workflow 1. Setup 2. Map 3. Shuffle 4. Reduce 5. Celebrate

18 Hadoop Cluster 1 MASTER NODE (“the boss”) – Job Tracker Many SLAVE NODES – Task Tracker on each HDFS on all nodes

19 Step 1. Setup (Assumes: you’ve installed Hadoop on a cluster of computers) You supply: 1. Map and Reduce logic – This is “code” – packaged in a Java JAR file – Other language support exists, more coming 2. A big pile of input files – “Get the data here” – For Word Frequency Counter, we might use Wikipedia or Project Gutenberg files 3. Go!

20 Step 2. Map Job Tracker distributes your Mapper and Reducer – To every node Job Tracker distributes your data – To some nodes (at least 3) in 64 MB chunks Task Tracker on each node calls Mapper – Repeatedly until done; lots of parallelism Job Tracker watches for problems – Manages retries, failures, optimistic attempts
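
To get a feel for the numbers, here is a back-of-the-envelope sketch of how many chunks one input file becomes. It assumes the 64 MB chunk size and 3 copies per chunk mentioned on the slide; both are configurable on a real cluster, and the class and method names are made up for this example.

```java
final class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of chunks needed to hold a file (ceiling division).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Total chunk copies stored across the cluster.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024 * MB;
        System.out.println(blockCount(oneGb, 64 * MB));      // 16
        System.out.println(replicaCount(oneGb, 64 * MB, 3)); // 48
    }
}
```

Each chunk can be handed to a separate Mapper call, which is where the parallelism comes from.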

21 Mapper’s Job is Simple* Read input, write output – that’s all – Function signature: Map(text) – Parses the text and returns { key, value } – Map(“a b a foo”) returns {a, 1}, {b, 1}, {a, 1}, {foo, 1} * for Word Frequency Counter!
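
The `Map(text)` signature can be tried out in plain Java with no Hadoop dependency. This is only a sketch: `WordMapper` is a hypothetical name, and each `{ key, value }` pair is modelled as a `Map.Entry`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

final class WordMapper {
    // Splits the text on whitespace and emits { word, 1 } for every token.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("a b a foo")); // [a=1, b=1, a=1, foo=1]
    }
}
```

Note that the mapper does no counting: "a" appears twice in the output, each time with the value 1.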

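22 Actual Java Map Function
// Fields on the enclosing Mapper class, as in the standard Apache Hadoop
// WordCount example (not shown on the slide):
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // Emit { word, 1 }; the value must be an IntWritable, not the string "1"
        context.write(word, one);
    }
}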

23 Step 3. Shuffle Shuffle collects all data from Map, organizes by key, and redistributes data to the nodes for Reduce

24 Step 3. Shuffle – example Mapper input: “the Bruins are the best in the NHL” Mapper output: { the, 1 } { the, 1 } { the, 1 } { Bruins, 1 } { best, 1 } { NHL, 1 } { are, 1 } { in, 1 } Shuffle transforms this into Reducer input: { are, [ 1 ] } { in, [ 1 ] } { Bruins, [ 1 ] } { best, [ 1 ] } { the, [ 1, 1, 1 ] } { NHL, [ 1 ] }
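
The grouping that Shuffle performs can be simulated on one machine. The sketch below is an in-memory stand-in, not Hadoop's implementation: real Shuffle also moves data between nodes. `ShuffleSketch` and its methods are hypothetical names, and the `TreeMap` is used only to make the printed order predictable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class ShuffleSketch {
    // Stand-in for the Map step: emits { word, 1 } per token.
    static List<Map.Entry<String, Integer>> mapWords(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle: collect every value emitted for the same key into one list.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(mapWords("the Bruins are the best in the NHL")));
        // {Bruins=[1], NHL=[1], are=[1], best=[1], in=[1], the=[1, 1, 1]}
    }
}
```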

25 Step 4. Reduce Output from Step 3. Shuffle has been distributed to datanodes Your “Reducer” is called on local data – Repeatedly, until all complete – Tasks run in parallel on nodes This is very simple for Word Frequency Counter! – Function signature: Reduce(key, values[]) – Adds up all the values and returns { key, sum }
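
Putting the steps together, the following single-machine sketch runs Map, Shuffle and Reduce in sequence and then answers the earlier question of which word appears most often. It is an illustration under simplifying assumptions (whitespace tokenizing, everything in memory, hypothetical names throughout), not how Hadoop schedules the work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class WordCountPipeline {
    // Reduce(key, values[]): add up all the values for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> wordCount(String text) {
        // Map + Shuffle: file a 1 under each word for every occurrence.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
        }
        // Reduce: one call per key, producing { key, sum }.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    // The entry with the highest count; throws if counts is empty.
    static Map.Entry<String, Integer> mostFrequent(Map<String, Integer> counts) {
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount("the Bruins are the best in the NHL");
        System.out.println(counts);               // {Bruins=1, NHL=1, are=1, best=1, in=1, the=3}
        System.out.println(mostFrequent(counts)); // the=3
    }
}
```

Because each `reduce` call sees only one key and its values, calls for different keys can run on different nodes at the same time.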

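26 Actual Java Reduce Function
// Field on the enclosing Reducer class, as in the standard Apache Hadoop
// WordCount example (not shown on the slide):
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    // Emit { key, sum }
    context.write(key, result);
}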

27 Step 5. Celebrate! You are done – you have your output In a more complex scenario, you might repeat the steps – Hadoop tool ecosystem knows how to do this There are other projects in the Hadoop ecosystem for … – Multi-step jobs – Managing a data warehouse – Supporting ad hoc querying – And more!

28 www.hadooponazure.com demo

29 There’s a LOT MORE to the Hadoops… Hadoop streaming interface allows other languages – C# HIVE (HiveQL) Pig (Pig Latin language) Cascading.org Commercial companies dedicated: – HortonWorks

30 Questions? Comments? More information?

31 BostonAzure.org Boston Azure cloud user group Focused on Microsoft’s PaaS cloud platform Last Thursday, monthly, 6:00-8:30 PM at NERD – Food; wifi; free; great topics; growing community Boston Azure Boot Camp: June 2012 ( planning ) Follow on Twitter: @bostonazure More info or to join our Meetup.com group: http://www.bostonazure.org

32 Contact Me Looking for … consulting help with Windows Azure Platform? someone to bounce Azure or cloud questions off? a speaker for your user group or company technology event? Just Ask! Bill Wilder @codingoutloud http://blog.codingoutloud.com

33 Hadoop A tool to economically create value from the Three Vs New tool – Complements, rather than displaces, existing data analysis tools There are other approaches, but Hadoop is winning – Emerging as standard (remember XP vs. Scrum?) – Microsoft embracing (http://www.hadooponazure.com/) Large Ecosystem – Apache Hadoop project – HBASE, HIVE, Pig, Mahout, … – Hadoop-specific companies: HortonWorks, Cloudera

34 Microsoft working with HortonWorks to port to Windows and enable on Windows Azure Looking to make JavaScript, C# first-class languages – Native language is Java – Community support for Python and others

35 HIVE HIVE QL – SQL-like Creates MapReduce job in the background Remember: Hadoop is BATCH oriented Hive Excel Plugin Hive Interactive Console in Azure

36 MapReduce is Functional Programming Like C#!

