Apache Hadoop & MapReduce


Apache Hadoop & MapReduce
Wanjiang Qian

History
Google whitepaper -> Implementation
GFS (The Google File System) -> Hadoop Distributed File System (HDFS)
MapReduce: Simplified Data Processing on Large Clusters -> MapReduce
Bigtable: A Distributed Storage System for Structured Data -> Apache HBase

What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing. Core Hadoop has two main systems:
Hadoop Distributed File System (data storage): self-healing, high-bandwidth clustered storage.
MapReduce (processing): distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
Ways to program Hadoop:
Java MapReduce: the most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop).
Streaming MapReduce: lets you develop in any programming language that can read standard input and write standard output, but with slightly lower performance and less flexibility than native Java MapReduce. (Hadoop Pipes is a separate C++ interface, not another name for Streaming.)
Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
Pig Latin: a high-level language from Yahoo, suitable for batch data flow workloads.
Hive: a SQL interpreter from Facebook; it also includes a metastore that maps files to their schemas and associated SerDes (serializers/deserializers).

Hadoop ecosystem
[Diagram: Pig, Hive, MapReduce, Impala, and HBase on top of the Hadoop Distributed File System (HDFS), labeled with a "Select * from" query fragment. Others: Mahout, Hue.]

Why Hadoop? (Reading 1 TB of data)
1 machine: 4 I/O channels, each channel 100 MB/sec = about 45 minutes
10 machines: 4 I/O channels per machine, each channel 100 MB/sec = about 4.5 minutes
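The arithmetic behind these figures can be checked with a short calculation. This is a rough sketch, assuming 1 TB = 1024 × 1024 MB and that read throughput scales linearly with the number of machines; the function and variable names are illustrative only.

```python
# Back-of-the-envelope estimate of how long it takes to scan 1 TB.
# Assumptions: 1 TB = 1024 * 1024 MB; all channels read in parallel at full speed.

def scan_minutes(total_mb, machines, channels_per_machine=4, mb_per_sec_per_channel=100):
    """Return the minutes needed to read total_mb across all channels."""
    aggregate_mb_per_sec = machines * channels_per_machine * mb_per_sec_per_channel
    return total_mb / aggregate_mb_per_sec / 60

ONE_TB_IN_MB = 1024 * 1024

one_machine = scan_minutes(ONE_TB_IN_MB, machines=1)
ten_machines = scan_minutes(ONE_TB_IN_MB, machines=10)

print(f"1 machine:   {one_machine:.1f} minutes")   # 43.7 minutes
print(f"10 machines: {ten_machines:.1f} minutes")  # 4.4 minutes
```

The results (about 44 and 4.4 minutes) match the slide's rounded figures of 45 and 4.5 minutes.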

Why Hadoop?
Cheaper: scales to petabytes or more
Faster: parallel data processing
Better: suited for particular types of big data problems
Companies using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, IBM, The New York Times

[Diagram: NameNode and JobTracker (master daemons); DataNode and TaskTracker (worker daemons)]

NameNode file metadata:
/user/wq/data1.txt -> 1,2,3
/user/2a/data2.txt -> 4,5
[Diagram: blocks 1 to 5 spread across the DataNodes, each block number appearing three times: 3 5 4 2 3 5 1 5 3 2 4 1 1 4 2]
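The metadata on this slide can be modeled as two lookups: file path to block IDs, and block ID to the DataNodes holding a replica. The sketch below is a toy model, not HDFS code. The file-to-block mapping comes from the slide; the DataNode names (`dn1` to `dn5`) and the grouping of the block numbers into five nodes of three blocks each are assumptions made for illustration.

```python
# Toy model of NameNode metadata (not the real HDFS data structures).

# From the slide: which blocks make up each file.
file_to_blocks = {
    "/user/wq/data1.txt": [1, 2, 3],
    "/user/2a/data2.txt": [4, 5],
}

# Hypothetical placement: the slide's block numbers read in groups of three,
# one group per assumed DataNode. Each block ends up with three replicas.
datanode_blocks = {
    "dn1": [3, 5, 4],
    "dn2": [2, 3, 5],
    "dn3": [1, 5, 3],
    "dn4": [2, 4, 1],
    "dn5": [1, 4, 2],
}

def locate(path):
    """Return {block_id: [DataNodes holding a replica]} for a file."""
    return {
        block: sorted(dn for dn, blocks in datanode_blocks.items() if block in blocks)
        for block in file_to_blocks[path]
    }

locations = locate("/user/wq/data1.txt")
print(locations)
# {1: ['dn3', 'dn4', 'dn5'], 2: ['dn2', 'dn4', 'dn5'], 3: ['dn1', 'dn2', 'dn3']}
```

A client reading the file would ask for this mapping first, then fetch each block from one of the listed DataNodes.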

[Diagram: pipelined write of file block A; blocks A, B, C; DataNodes 1 through 8]
Client to NameNode: "I want to write file block A"
NameNode to client: "OK, write to nodes 1, 5, 8"
NameNode file metadata: File = Blk A, DN: 1,5,8
Client to DataNode 1 (TCP): Ready command, 5+8
DataNode 1 to DataNode 5: Ready command, 8
DataNode 5 to DataNode 8: Ready command
Pipelined write: DataNode 1 -> DataNode 5 -> DataNode 8
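The pipelined write can be sketched as each DataNode storing the block and forwarding it to the next node in the list. This is a simplified simulation, not the HDFS client API: the `DataNode` class and its `write` method are hypothetical, and real HDFS streams a block in small packets rather than as one piece.

```python
# Toy simulation of an HDFS pipelined write (hypothetical classes, not HDFS code).

class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> bytes stored on this node

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it down the pipeline.

        Returns the IDs of every node that stored the block, in pipeline order.
        """
        self.blocks[block_id] = data
        acked = [self.node_id]
        if downstream:
            next_node, rest = downstream[0], downstream[1:]
            acked += next_node.write(block_id, data, rest)
        return acked

datanodes = {i: DataNode(i) for i in range(1, 9)}

# Step 1: the NameNode picks the target nodes (1, 5, 8 on the slide).
targets = [datanodes[i] for i in (1, 5, 8)]

# Step 2: the client sends the block only to the first node in the pipeline.
acks = targets[0].write("A", b"contents of block A", targets[1:])
print(acks)  # [1, 5, 8]
```

The client talks to a single DataNode, so replication does not multiply the client's outgoing traffic.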

Sample HDFS shell commands
bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -chmod

How does MapReduce work?
[Diagram: JobTracker, TaskTracker 1, TaskTracker 2]

MapReduce example
"Map" step: Each worker node applies the "map()" function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), so that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
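The three steps can be illustrated with a word count that runs in a single Python process. This is a minimal sketch of the programming model only, not Hadoop code: the function names (`map_fn`, `shuffle`, `reduce_fn`) and the sample input are made up for illustration, and nothing here is distributed.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)

# Two input splits, standing in for data local to two worker nodes.
splits = [["the quick brown fox", "the lazy dog"], ["the fox"]]

mapped = [pair for split in splits for line in split for pair in map_fn(line)]
grouped = shuffle(mapped)
counts = dict(reduce_fn(key, values) for key, values in grouped.items())

print(counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real cluster each split would be mapped on the node that stores it, and each key group could be reduced on a different node, since the reduce calls are independent of one another.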

Thanks
http://en.wikipedia.org/wiki/MapReduce
http://en.wikipedia.org/wiki/Hadoop
http://hadoop.apache.org/