MSBIC Hadoop Series Implementing MapReduce Jobs Bryan Smith

MSBIC Hadoop Series
Learn the basics of Hadoop through a combination of demonstration and lecture. Session participants are invited to follow along using emulation environments and Azure-based clusters; setting these up is covered in the first session.

Schedule:
March – Getting Started
April – Understanding the File System
May – Implementing MapReduce Jobs
June – Querying the Data with Hive
July – Processing the Data with Pig
August – On Vacation
September – Hadoop & MS BI
October – To Be Announced
November – Loading Social Media Data
December – DW Integration

Today’s Session
Objectives:
1. Understand Basics of MapReduce
2. Implement a MapReduce Job
3. Introduce Tez

Sample File: How Many Evens & Odds?
Step 1, map( ): each value in the file is tagged with a key: odd, even, odd, even, odd, even, odd, even, odd
Step 2, group by key (key, value[ ]): odd {1,3,5,7,9}; even {2,4,6,8}
Step 3, reduce( ): odd 5; even 4
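The three steps on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not the Hadoop or .NET API; the function names (`map_number`, `group_by_key`, `reduce_count`) are hypothetical, and the sample file is assumed to hold the integers 1 through 9, which is what the grouped values on the slide suggest.

```python
from collections import defaultdict

def map_number(n):
    """Map step: emit a (key, value) pair tagging the number as odd or even."""
    return ("even" if n % 2 == 0 else "odd", n)

def group_by_key(pairs):
    """Group step: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_count(key, values):
    """Reduce step: collapse each key's list of values into a count."""
    return (key, len(values))

# Assumed sample file contents: the integers 1..9.
sample_file = range(1, 10)

mapped = [map_number(n) for n in sample_file]
grouped = group_by_key(mapped)
counts = dict(reduce_count(k, v) for k, v in grouped.items())

print(grouped)  # {'odd': [1, 3, 5, 7, 9], 'even': [2, 4, 6, 8]}
print(counts)   # {'odd': 5, 'even': 4}
```

On a real cluster the map and reduce calls run as separate tasks on different nodes and the grouping is done by the framework between them; here everything runs in one process so the data flow is easy to follow.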

[Diagram labels: Sample Files, Name Node, Data Node, XYZ, Job, Map Task, Reduce Task, P0, P1]

Implementing MapReduce using .NET
Add the following packages:
- Microsoft .NET Map Reduce API for Hadoop
- Microsoft .NET API for Hadoop WebClient
- Windows Azure Storage (if running against Azure HDInsight)
Add the following directives:
  using Microsoft.Hadoop;
  using Microsoft.Hadoop.MapReduce;
  using Microsoft.Hadoop.WebClient.WebHCatClient;
If running against Azure HDInsight, change the project’s Target Platform to x64.

MapReduce Demo

Goodbye, MapReduce
- Distributable for Scale
- Resistant to Failure
- “Easy” to Program
- Disk Liberal/Memory Conservative
- Rigid Step Sequencing

MapReduce as a Graph
[Diagram labels: Map, Reduce, Map, Reduce; Vertex; Edge]

Tez: An Alternative Model
[Diagram labels: Directed Acyclic Graph (DAG); Vertex; Vertex; Edge]
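To make the vertex-and-edge picture concrete, here is a minimal Python sketch that models a chained MapReduce pipeline and a more general DAG as dependency graphs, then lists which vertices could run at the same time. It uses only the standard-library `graphlib` module (Python 3.9+) and does not call Tez or Hadoop; the vertex names (`map_1`, `scan_a`, `join`, and so on) and the `execution_waves` helper are hypothetical, chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Each entry maps a vertex to the set of vertices it depends on,
# i.e. its incoming edges. All vertex names are hypothetical.
chained_mapreduce = {
    "map_1": set(),
    "reduce_1": {"map_1"},
    "map_2": {"reduce_1"},
    "reduce_2": {"map_2"},
}

general_dag = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}

def execution_waves(graph):
    """Return lists of vertices that can run together, in dependency order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # vertices whose inputs are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves

linear_waves = execution_waves(chained_mapreduce)
dag_waves = execution_waves(general_dag)

print(linear_waves)  # [['map_1'], ['reduce_1'], ['map_2'], ['reduce_2']]
print(dag_waves)     # [['scan_a', 'scan_b'], ['join'], ['aggregate']]
```

The chained MapReduce graph is a straight line, so every wave holds exactly one vertex. In the general DAG the two scans have no edge between them, so they fall into the same wave and could run side by side.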

MapReduce vs. Tez
MapReduce:
- Focused on Disk
- Rigid, Linear Step Sequencing
- Supports Hadoop Streaming
Tez:
- Focused on Memory
- Flexible, Parallel Step Sequencing
- ???

Guidance on MapReduce & Tez
- Most of your work will be at higher levels, i.e. Pig & Hive
- Movement from MapReduce will benefit performance & be transparent to you
- Apache Tez in HDP 2.1
- HDInsight lags a few months
- Microsoft a key contributor

Today’s Session
Objectives:
1. Understand Basics of MapReduce
2. Implement a MapReduce Job
3. Introduce Tez

For Next Session
Topic:
- Querying Data with Hive
- Implement a Hive table and query it using HQL
Requested Action(s):
- Come with working HDInsight Emulator
- Load sample data sets into HDFS on Emulator