Introduction to Hadoop. Trend Micro R&D Lab (趨勢科技研發實驗室). Copyright 2009 - Trend Micro Inc.

Presentation transcript:

Introduction to Hadoop
Trend Micro R&D Lab (趨勢科技研發實驗室)

Outline
Introduction to Hadoop project
HDFS (Hadoop Distributed File System) overview
MapReduce overview
Reference

What is Hadoop?
Hadoop is a cloud computing platform for storing and processing vast amounts of data.
Apache top-level project
Open source
Hadoop Core includes
  – Hadoop Distributed File System (HDFS)
  – MapReduce framework
Hadoop subprojects
  – HBase, ZooKeeper, …
Written in Java. Client interfaces come in C++, Java, and shell scripting.
Runs on
  – Linux, Mac OS X, Windows, and Solaris
  – Commodity hardware
[Figure: stack diagram showing a cluster of machines, HDFS, MapReduce, HBase, and cloud applications]

A Brief History of Hadoop
  – First MapReduce library written at Google
  – Google File System paper published
  – Google MapReduce paper published
  – Doug Cutting reports that Nutch now uses a new MapReduce implementation
  – Hadoop code moves out of Nutch into a new Lucene sub-project
  – Google Bigtable paper published

A Brief History of Hadoop (continued)
  – First HBase code drop from Mike Cafarella
  – Yahoo! running Hadoop on a 1000-node cluster
  – Hadoop made an Apache Top Level Project

Who uses Hadoop?
Yahoo!
  – More than 100,000 CPUs in ~20,000 computers running Hadoop
Google
  – University Initiative to Address Internet-Scale Computing Challenges
Amazon
  – Builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools
  – Processes millions of sessions daily for analytics
IBM
  – Blue Cloud Computing Clusters
Trend Micro
  – Threat solution research
More…

Hadoop Distributed File System Overview

HDFS Design
Single namespace for the entire cluster
Very large distributed file system
  – 10K nodes, 100 million files, 10 PB
Data coherency
  – Write-once-read-many access model
  – A file, once created, written, and closed, need not be changed.
  – Appending writes to existing files (in the future)
Files are broken up into blocks
  – Typically 128 MB block size
  – Each block is replicated on multiple DataNodes
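The block arithmetic on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function names are made up for this example, the 128 MB block size is the typical value quoted above, and the replication factor of 3 is assumed as the default.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted on the slide
REPLICATION = 3                 # assumed default replication factor


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for file_size bytes (the last block may be partial)."""
    if file_size <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Total bytes consumed across the cluster when every block is replicated."""
    return file_size * replication


one_gib = 1024 ** 3
print(block_count(one_gib))             # 8 blocks of 128 MB
print(raw_storage(one_gib) // one_gib)  # 3 GiB of raw disk for 1 GiB of data
```

A file one byte larger than 1 GiB would need a ninth, nearly empty block; in this sketch only the bytes actually written are counted toward storage.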

HDFS Design (cont.)
Moving computation is cheaper than moving data
  – Data locations are exposed so that computations can move to where the data resides
File replication
  – Default is 3 copies.
  – Configurable by clients
Assumes commodity hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
Streaming data access
  – High throughput of data access rather than low latency of data access
  – Optimized for batch processing
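The "move computation to the data" idea can be illustrated with a toy scheduler. This is a minimal sketch, not Hadoop's scheduler: `pick_worker` and the node names are hypothetical, and the only rule modelled is "prefer an idle worker that already holds a replica of the block".

```python
def pick_worker(replica_nodes, idle_workers):
    """Choose a worker for a task that reads one block.

    replica_nodes: set of node names holding a replica of the block.
    idle_workers:  list of node names currently free to run a task.
    Returns (worker, is_local); worker is None if nobody is idle.
    """
    # Prefer a node that already stores the block, so no data crosses the network.
    for worker in idle_workers:
        if worker in replica_nodes:
            return worker, True
    # Otherwise fall back to any idle node, which must fetch the block remotely.
    if idle_workers:
        return idle_workers[0], False
    return None, False


print(pick_worker({"dn2", "dn5"}, ["dn1", "dn5"]))  # ('dn5', True)
print(pick_worker({"dn2"}, ["dn1"]))                # ('dn1', False)
```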

NameNode
Manages the file system namespace
  – Maps a file name to a set of blocks
  – Maps a block to the DataNodes where it resides
Cluster configuration management
Replication engine for blocks

NameNode Metadata
Metadata in memory
  – The entire metadata is in main memory
  – No demand paging of metadata
Types of metadata
  – List of files
  – List of blocks for each file
  – List of DataNodes for each block
  – File attributes, e.g. creation time, replication factor
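The metadata types listed above amount to two in-memory maps: file name to blocks, and block to DataNodes. The following sketch shows that shape with plain dictionaries; the class and method names are hypothetical and are not the NameNode's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Per-file attributes kept in memory (illustrative subset)."""
    replication: int
    creation_time: float
    block_ids: list = field(default_factory=list)


class NameNodeMetadata:
    """Toy model of the two mappings the NameNode holds in main memory."""

    def __init__(self):
        self.files = {}            # file name -> FileEntry (list of blocks, attributes)
        self.block_locations = {}  # block id  -> set of DataNode names

    def add_file(self, name, block_ids, replication=3, creation_time=0.0):
        self.files[name] = FileEntry(replication, creation_time, list(block_ids))
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, block_id, datanode):
        """Record that a DataNode holds a replica of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, name):
        """Resolve a file name to [(block id, [DataNodes])] in block order."""
        entry = self.files[name]
        return [(b, sorted(self.block_locations[b])) for b in entry.block_ids]


meta = NameNodeMetadata()
meta.add_file("/foodir/myfile.txt", ["blk_1", "blk_2"])
meta.report_block("blk_1", "dn1")
meta.report_block("blk_1", "dn2")
meta.report_block("blk_2", "dn3")
print(meta.locate("/foodir/myfile.txt"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn3'])]
```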

NameNode Metadata (cont.)
A transaction log (called the EditLog)
  – Records file creations, file deletions, etc.
FsImage
  – The entire namespace, the mapping of blocks to files, and file system properties are stored in a file called FsImage.
  – The NameNode can be configured to maintain multiple copies of FsImage and EditLog.
Checkpoint
  – Occurs when the NameNode starts up
  – Reads FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage
  – Then flushes the new version out into a new FsImage
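The checkpoint steps described above (load the image, replay the log, write a new image) can be sketched as follows. This is a deliberately simplified model: the namespace is just a set of paths, the log holds only hypothetical `("create", path)` and `("delete", path)` records, and nothing is read from or written to disk.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, editlog):
    """Merge an FsImage and an EditLog into a new FsImage.

    Returns (new_fsimage, new_editlog); the log is empty afterwards because
    every transaction is now reflected in the image.
    """
    namespace = set(fsimage)       # load the last saved image into memory
    for edit in editlog:           # replay transactions in the order they were logged
        apply_edit(namespace, edit)
    return frozenset(namespace), []


old_image = {"/foodir", "/foodir/a.txt"}
log = [("create", "/foodir/b.txt"), ("delete", "/foodir/a.txt")]
new_image, new_log = checkpoint(old_image, log)
print(sorted(new_image))  # ['/foodir', '/foodir/b.txt']
print(new_log)            # []
```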

Secondary NameNode
Copies FsImage and EditLog from the NameNode to a temporary directory
Merges FsImage and EditLog into a new FsImage in the temporary directory
Uploads the new FsImage to the NameNode
  – The transaction log on the NameNode is purged
[Figure: FsImage and EditLog merged into FsImage (new)]

NameNode Failure
The NameNode is a single point of failure (SPOF)
Transaction log stored in multiple directories
  – A directory on the local file system
  – A directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution

DataNode
A block server
  – Stores data in the local file system (e.g. ext3)
  – Stores metadata of a block (e.g. CRC)
  – Serves data and metadata to clients
Block report
  – Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data
  – Forwards data to other specified DataNodes
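The three DataNode duties on this slide (store blocks with a checksum, report blocks, forward writes down a pipeline) can be modelled in a few lines. This is a sketch under simplifying assumptions: blocks live in a dictionary rather than on a local file system, one CRC-32 covers the whole block, and the class and method names are hypothetical.

```python
import zlib


class DataNode:
    """Toy DataNode: keeps blocks in memory with a CRC-32 as block metadata."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> (data bytes, crc32 of data)

    def write(self, block_id, data, pipeline=()):
        """Store a block locally, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = (data, zlib.crc32(data))
        if pipeline:
            pipeline[0].write(block_id, data, pipeline[1:])

    def read(self, block_id):
        """Serve a block, verifying it against the stored checksum first."""
        data, crc = self.blocks[block_id]
        if zlib.crc32(data) != crc:
            raise IOError(f"checksum mismatch for {block_id} on {self.name}")
        return data

    def block_report(self):
        """List of all block ids held, as would be sent periodically to the NameNode."""
        return sorted(self.blocks)


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
dn1.write("blk_1", b"hello hdfs", pipeline=(dn2, dn3))
print([dn.block_report() for dn in (dn1, dn2, dn3)])
# [['blk_1'], ['blk_1'], ['blk_1']]
```

The client hands the data to the first node only; each node passes it to the next, which is what lets a single write produce all replicas.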

HDFS - Replication
Default is 3x replication.
The block size and replication factor are configured per file.
The block placement algorithm is rack-aware.
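The slide says only that placement is rack-aware; it does not spell out the policy. As an illustration, the sketch below assumes one commonly described policy for three replicas: the first on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function name, the topology format, and the deterministic choice of nodes are all assumptions of this example (a real cluster would pick among candidates at random).

```python
def place_replicas(writer, topology):
    """Pick three nodes for a block's replicas.

    writer:   name of the node writing the block.
    topology: dict mapping node name -> rack name.
    Assumed policy: writer's node, a node on a remote rack,
    then a second node on that same remote rack.
    """
    local_rack = topology[writer]
    remote_nodes = sorted(n for n, rack in topology.items() if rack != local_rack)
    if not remote_nodes:
        raise ValueError("need at least two racks for rack-aware placement")
    second = remote_nodes[0]
    remote_rack = topology[second]
    peers = [n for n in remote_nodes if topology[n] == remote_rack and n != second]
    if not peers:
        raise ValueError("remote rack needs at least two nodes")
    return [writer, second, peers[0]]


topology = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
print(place_replicas("n1", topology))  # ['n1', 'n3', 'n4']
```

Spreading replicas over two racks means the block survives the loss of an entire rack, while keeping two replicas on one rack limits cross-rack write traffic.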

User Interface
API
  – Java API
  – A C language wrapper (libhdfs) for the Java API is also available
POSIX-like commands
  – hadoop dfs -mkdir /foodir
  – hadoop dfs -cat /foodir/myfile.txt
  – hadoop dfs -rm /foodir/myfile.txt
HDFS admin
  – bin/hadoop dfsadmin -safemode
  – bin/hadoop dfsadmin -report
  – bin/hadoop dfsadmin -refreshNodes
Web interface
  – Ex:

Web Interface ( )

Web Interface ( ): browse the file system

POSIX-like commands

Latest API – Java API

MapReduce Overview

Brief History of MapReduce
A demand for large-scale data processing at Google
The folks at Google discovered certain common themes in processing those large inputs
  – Many machines are needed
  – Two basic operations on the input data: Map and Reduce
Google's idea is inspired by the map and reduce functions commonly used in functional programming.
Functional programming: MapReduce
  – Map [1,2,3,4] – (*2) → [2,4,6,8]
  – Reduce [1,2,3,4] – (sum) → 10
  – Divide and conquer paradigm
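The two functional-programming primitives on this slide exist directly in Python's standard library, so the slide's example can be run as written. Doubling is applied element by element (map), and the sum folds the list into a single value (reduce).

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]

# Map: apply the same function to every element independently.
doubled = list(map(lambda x: x * 2, numbers))

# Reduce: combine all elements into one value with a binary operation.
total = reduce(add, numbers)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```

Because each map call touches only one element, the map step can be split across many machines, which is the property MapReduce exploits.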

MapReduce
MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
Users specify
  – a map function that processes a key/value pair to generate a set of intermediate key/value pairs
  – a reduce function that merges all intermediate values associated with the same intermediate key
The MapReduce framework handles the details of execution.

Application Area
The MapReduce framework is for computing certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster.
Examples
  – Large-scale data processing such as search, indexing, and sorting
  – Data mining and machine learning on large data sets
  – Analysis of huge access logs in large portals
More applications

MapReduce Programming Model
Users define two functions
map (K1, V1) → list(K2, V2)
  – takes an input pair and produces a set of intermediate key/value pairs
reduce (K2, list(V2)) → list(K3, V3)
  – accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values
[Figure: MapReduce logical flow, with a shuffle/sorting stage]
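The two signatures above can be exercised with a tiny single-process driver. This is a sketch of the programming model only, not of Hadoop's API: `run_mapreduce` and the sensor-reading example are hypothetical, and everything runs in memory on one machine.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle/sort, and reduce over (K1, V1) records in one process."""
    # Map phase: each input pair yields a list of intermediate (K2, V2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle: group all intermediate values by their key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Sort by key, then reduce each (K2, list(V2)) group to output pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


# Hypothetical job: find the maximum reading per sensor from "sensor,value" lines.
def map_fn(_line_no, line):
    sensor, value = line.split(",")
    return [(sensor, int(value))]


def reduce_fn(sensor, values):
    return [(sensor, max(values))]


records = [(0, "a,3"), (1, "b,7"), (2, "a,9")]
print(run_mapreduce(records, map_fn, reduce_fn))  # [('a', 9), ('b', 7)]
```

The user writes only `map_fn` and `reduce_fn`; grouping and ordering by key belong to the framework, which is what lets a real implementation distribute them.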

MapReduce Execution Flow
[Figure: execution flow diagram; the shuffle/sorting stage shuffles by key, then sorts by key]

Word Count Example
Word count problem:
  – Consider the problem of counting the number of occurrences of each word in a large collection of documents.
Input:
  – A collection of (document name, document content)
Expected output:
  – A collection of (word, count)
Users define:
  – Map: (offset, line) ➝ [(word1, 1), (word2, 1), ...]
  – Reduce: (word, [1, 1, ...]) ➝ [(word, count)]
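The map and reduce signatures for word count translate almost literally into code. The sketch below runs in a single process and is not written against Hadoop's API; the function names are made up for the example, words are split on whitespace only, and the grouping loop stands in for the framework's shuffle/sort step.

```python
from collections import defaultdict


def map_words(offset, line):
    """Map: (offset, line) -> [(word, 1), ...]; the offset is unused here."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, ones):
    """Reduce: (word, [1, 1, ...]) -> [(word, count)]."""
    return [(word, sum(ones))]


def word_count(lines):
    """Drive map, group-by-key, and reduce over a list of text lines."""
    groups = defaultdict(list)
    offset = 0
    for line in lines:
        for word, one in map_words(offset, line):
            groups[word].append(one)   # shuffle: collect values per word
        offset += len(line) + 1        # offset of the next line, counting the newline
    result = []
    for word in sorted(groups):        # sort by key before reducing
        result.extend(reduce_counts(word, groups[word]))
    return result


print(word_count(["hello hadoop", "hello world"]))
# [('hadoop', 1), ('hello', 2), ('world', 1)]
```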

Word Count Execution Flow

Monitoring the execution from the web console

Reference
Hadoop project official site
Google File System paper
Google MapReduce paper
Google: MapReduce in a Week
Google: Cluster Computing and MapReduce – http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html