Hadoop as a Service Boston Azure / Microsoft DevBoston 07-Feb-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license

Slides:



Advertisements
Similar presentations
Intro to Map-Reduce Feb 21, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
Advertisements

Overview of MapReduce and Hadoop
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
R and HDInsight in Microsoft Azure
What’s New in Windows Azure A platform overview + how it can fit into my development shop today… New England Microsoft Dev Group 06-June-2013 (6:30-8:30.
Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why.
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Intro to Map-Reduce Feb 4, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
An Introduction to MapReduce: Abstractions and Beyond! -by- Timothy Carlstrom Joshua Dick Gerard Dwan Eric Griffel Zachary Kleinfeld Peter Lucia Evan May.
Introduction to Apache Hadoop CSCI 572: Information Retrieval and Search Engines Summer 2010.
Hadoop as a Service Boston Azure 29-March-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
HADOOP ADMIN: Session -2
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago Materials:
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
SEMINAR ON Guided by: Prof. D.V.Chaudhari Seminar by: Namrata Sakhare Roll No: 65 B.E.Comp.
HAMS Technologies 1
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Whirlwind Tour of Hadoop Edward Capriolo Rev 2. Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High.
An Introduction to HDInsight June 27 th,
The exponential growth of data –Challenges for Google,Yahoo,Amazon & Microsoft in web search and indexing The volume of data being made publicly available.
Big Data for Relational Practitioners Len Wyatt Program Manager Microsoft Corporation DBI225.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
Before we start, please download: VirtualBox: – The Hortonworks Data Platform: –
Map-Reduce Big Data, Map-Reduce, Apache Hadoop SoftUni Team Technical Trainers Software University
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.
Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
Distributed Systems Lecture 3 Big Data and MapReduce 1.
This is a free Course Available on Hadoop-Skills.com.
By: Joel Dominic and Carroll Wongchote 4/18/2012.
Apache Hadoop on Windows Azure Avkash Chauhan
Big Data Anton Boyko. Agenda What is Big Data? Why Big Data? How to Big Data?
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Cloud-Native Architecture Patterns (Or… why your pre-cloud architecture won’t work so well in the cloud) Azure Florida Association 28-March-2012 Boston.
Hadoop Introduction. Audience Introduction of students – Name – Years of experience – Background – Do you know Java? – Do you know linux? – Any exposure.
Big Data is a Big Deal!.
Introduction to Google MapReduce
Hadoop Aakash Kag What Why How 1.
HADOOP ADMIN: Session -2
An Open Source Project Commonly Used for Processing Big Data Sets
Hadoopla: Microsoft and the Hadoop Ecosystem
Introduction to MapReduce and Hadoop
Hadoop Clusters Tess Fulkerson.
Central Florida Business Intelligence User Group
MIT 802 Introduction to Data Platforms and Sources Lecture 2
Hadoop Basics.
Overview of big data tools
Lecture 16 (Intro to MapReduce and Hadoop)
Charles Tappert Seidenberg School of CSIS, Pace University
MIT 802 Introduction to Data Platforms and Sources Lecture 2
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Presentation transcript:

Hadoop as a Service Boston Azure / Microsoft DevBoston 07-Feb-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license Boston Azure User Group Bill Wilder Big Data tools for the Windows Azure cloud platform

Windows Azure MVP Windows Azure Consultant Boston Azure User Group Founder Cloud Architecture Patterns book (due 2012) Bill Wilder

We will consider… 1.How might we build a simple Word Frequency Counter? 2.What are Map and Reduce? 3.How do we scale our Word Frequency Counter? – Hint: we might use Hadoop 4.How does Windows Azure make Hadoop easier with “Hadoop as a Service” – CTP

Five exabytes of data created every two days - Eric Schmidt (CEO Google at the time) As much as from the dawn of civilization up until 2003

Three Vs Volume  lots of it already Velocity  more of it every day Variety  many sources, many formats “Big Data” Challenge

Short History of Hadoop ////// 1. Inspired by: Google Map/Reduce paper – Google File System (GFS) – Goals: distributed, fault tolerant, fast enough 2. Born in: Lucene Nutch project Built in Java Hadoop cluster appears as single über- machine

Hadoop: batch processing, big data Batch, not real-time or transactional Scale out with commodity hardware Big customers like LinkedIn and Yahoo! – Clusters with 10s of Petabytes (pssst… these fail… daily) Import data from Azure Blob, Data Market, S3 – Or from files, like we will do in our example

Word Frequency Counter – how? The “hello world” of Hadoop / MapReduce – But we start without Hadoop / MapReduce Input: large corpus – Wikipedia extract for example – Can handle into PB Output: list of words, ordered by frequency the be9265 to3589 of 793 and238…

Simple Word Frequency Counter const string file var text = File.ReadAllText(file); var matches = var words = (from m in matches.Cast () where !string.IsNullOrEmpty(m.Value) orderby m.Value.ToLower() select m.Value.ToLower()).ToArray(); var wordCounts = new Dictionary (); foreach (var word in words) { if (wordCounts.ContainsKey(word)) wordCounts[word]++; else wordCounts.Add(word, 1); } foreach (var wc in wordCounts) Console.WriteLine(wc.Key + " : " + wc.Value);  Read in all text  Parse out words  Normalize & Sort  How many times does each word appear aware : 7 away : 99 awning : 2 awoke : 1 axes : 16 axil : 3 axiom : 2 Output   REDUCE MAP

Map “Apply a function to each element in this list and return a new list” Reduce “Apply a function collectively to all elements in this list and return the final answer.”

Map Example 1 square(x) { return x*x } { 1, 2, 3, 4 }  { 1, 4, 9, 16 } Reduce Example 1 sum(x, y) { return x + y } { 1, 2, 3, 4 }  10

Map Example 1 square(x) { return x*x } { 1, 2, 3, 4 }  { 1, 4, 9, 16 } { square(1), square(2), square(3), square(4) } Reduce Example 1 sum(x, y) { return x + y } { 1, 2, 3, 4 }  10 sum(sum(sum(1,2),3),4)

Map Example 2 strlen(s) { return s.Length } { “Celtics”, “Bruins” }  { 7, 6 } Reduce Example 2 strlen_sum(x, y) { return x.Length + y.Length } { “Celtics”, “Bruins” }  13

Map Example 3 (the fancy one) fancy_mapper(s) { if (s == “SkipMe”) return null; return ToLower(s) + “, “ + s.Length; } { “Will”, “Dan”, “SkipMe”, “Kevin”, “T.J.” }  { “will, 4”, “dan, 3”, “kevin, 5”, “t.j., 4” }

Problems with Word Counter? What happens if our data is… 1 GB, 1 TB, 1 PB, … What happens if our data is… Images, videos, tweets, Facebook updates, … What happens if our processing is… Complex, multiple steps, …

Simplified Example Word Frequency Counter Which word appears most frequently, and how many times?

Workflow 1.Setup 2.Map 3.Shuffle 4.Reduce 5.Celebrate

Hadoop Cluster 1 MASTER NODEMany SLAVE NODES - Job Tracker- Task Tracker on each “the boss” HDFS on all nodes

Step 1. Setup (Assumes: you’ve installed Hadoop on a cluster of computers) You supply: 1.Map and Reduce logic – This is “code” – packaged in a Java JAR file – Other language support exists, more coming 2.A big pile of input files – “Get the data here” – For Word Frequency Counter, we might use Wikipedia or Project Gutenberg files 3.Go!

Step 2. Map Job Tracker distributes your Mapper and Reducer – To every node Job Tracker distributes your data – To some nodes (at least 3) in 64 MB chunks Task Tracker on each node calls Mapper – Repeatedly until done; lots of parallelism Job Tracker watches for problems – Manages retries, failures, optimistic attempts

Mapper’s Job is Simple* Read input, write output – that’s all – Function signature: Map(text) – Parses the text and returns { key, value } – Map(“a b a foo”) returns {a, 1}, {b, 1}, {a, 1}, {foo, 1} * for Word Frequency Counter!

Actual Java Map Function public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, “1”); }

Step 3. Shuffle Shuffle collects all data from Map, organizes by key, and redistributes data to the nodes for Reduce

Step 3. Shuffle – example Mapper input: “the Bruins are the best in the NHL” Mapper output: { the,1 } { the,1 }{ the, 1 } { Bruins,1 } { best,1 }{ NHL, 1 } { are, 1 } { in, 1 } Shuffle transforms this into Reducer input: { are,[ 1 ] } { in, [ 1 ] } { Bruins, [ 1 ] } { best, [ 1 ] } { the, [ 1, 1, 1 ] }{ NHL, [ 1 ] }

Step 4. Reduce Output from Step 3. Shuffle has been distributed to datanodes Your “Reducer” is called on local data – Repeatedly, until all complete – Tasks run in parallel on nodes This is very simple for Word Frequency Counter! – Function signature: Reduce(key, values[]) – Adds up all the values and returns { key, sum }

Actual Java Reduce Function public void reduce(Text key, Iterable values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }

Step 5. Celebrate! You are done – you have your output In more complex scenario, might repeat – Hadoop tool ecosystem knows how to do this There are other projects in the Hadoop ecosystem for … – Multi-step jobs – Managing a data warehouse – Supporting ad hoc querying – And more!

demo

There’s a LOT MORE to the Hadoops… Hadoop streaming interface allows other languages – C# HIVE (HiveQL) Pig (Pig Latin language) Cascading.org Commercial companies dedicated: – HortonWorks

Questions? Comments? More information? ?

BostonAzure.org Boston Azure cloud user group Focused on Microsoft’s PaaS cloud platform Last Thursday, monthly, 6:00-8:30 PM at NERD – Food; wifi; free; great topics; growing community Boston Azure Boot Camp: June 2012 ( planning ) Follow on More info or to join our Meetup.com group:

Contact Me Looking for … consulting help with Windows Azure Platform? someone to bounce Azure or cloud questions off? a speaker for your user group or company technology event? Just Ask! Bill

Hadoop A tool to economically create value from the Three Vs New tool – Complements, rather than displaces, existing data analysis tools There are other approaches, but Hadoop is winning – Emerging as standard (remember XP vs. Scrum?) – Microsoft embracing ( Large Ecosystem – Apache Hadoop project – HBASE, HIVE, Pig, Mahout, … – Hadoop-specific companies: HortonWorks, Cloudera

Microsoft working with HortonWorks to port to Windows and enable on Windows Azure Looking to make JavaScript, C# first-class languages – Native language is Java – Community support for Python and others

HIVE HIVE QL – SQL-like Creates MapReduce job in the background Remember: Hadoop is BATCH oriented Hive Excel Plugin Hive Interactive Console in Azure

MapReduce is Functional Programming Like C#!