Workshop on Basics & Hands-On. Kapil Bhosale, M.Tech (CSE), Walchand College of Engineering, Sangli (worked on Hadoop at Tibco).

Parallel programming. Parallel programming is used to improve performance and efficiency. In a parallel program, the processing is broken up into parts that run concurrently on different machines. Supercomputers can be replaced by large clusters of CPUs; these CPUs may be on the same machine, or they may be spread across a network of computers.

Parallel programming (figure: a supercomputer, web graphic by Janet E. Ward, 2000, contrasted with a cluster of desktops).

Map/Reduce and Hadoop. What is Map/Reduce? Map/Reduce is a programming model, introduced by Google, for processing large data sets; it is generally used for data-intensive tasks and is typically used to do distributed computing on clusters of computers. – The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the Map/Reduce framework is not the same as in their original forms. How are Hadoop and Map/Reduce related? – Hadoop is an implementation of the Map/Reduce computational paradigm, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

What is Hadoop? Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. (Why the name "Hadoop"? It is not an acronym; Doug Cutting named the project after his son's toy elephant.) It includes: – Map/Reduce: the computational model – HDFS: the Hadoop Distributed File System. Yahoo! is the biggest contributor. Here is what makes it especially useful: – Scalable: it can reliably store and process petabytes. – Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands). – Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure. – It works on commodity hardware.

An open-source Apache project, implemented in Java; an Apache Top Level Project. Core (15 committers):  HDFS  MapReduce. The community of contributors is growing, though contributions to HDFS and MapReduce still come mostly from Yahoo!. You can contribute too!

Big Data - Wide availability of data-gathering tools - Huge networks of sensors - Social networks - And the data is unstructured or semi-structured - Need to analyze the data - Need to find useful information in the data - Need distributed tools.

What does it do? Hadoop implements Google's MapReduce, using HDFS. MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop's target is to run on clusters on the order of 10,000 nodes.

Hadoop: Assumptions. It is written with large clusters of computers in mind and is built around the following assumptions: – Hardware will fail. – Processing will be run in batches; thus there is an emphasis on high throughput. – Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. – Applications need a write-once-read-many access model. – Moving computation is cheaper than moving data. – Commodity hardware.

Apache Hadoop Wins Terabyte Sort Benchmark (July 2008). One of Yahoo!'s Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. The sort used 1800 maps and 1800 reduces, and allocated enough buffer memory to hold the intermediate data in memory. The cluster had 910 nodes; 2 quad-core 2.0 GHz CPUs per node; 4 SATA disks per node; 8 GB of RAM per node; 1-gigabit Ethernet on each node; 40 nodes per rack; 8-gigabit Ethernet uplinks from each rack to the core; Red Hat Enterprise Linux Server Release 5.1; Sun Java JDK 1.6.0_05-b13. Later that same year Google recorded 68 seconds; Yahoo! then used Hadoop again to reduce the time to 62 seconds.

Example Applications and Organizations Using Hadoop. Amazon: builds Amazon's product search indices and processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes. Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2×4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search. AOL: used for a variety of things, from statistics generation to running advanced algorithms for behavioral analysis and targeting; a cluster of 50 machines, each with dual dual-core Intel Xeon processors, 16 GB of RAM, and 800 GB of disk, giving a total of 37 TB of HDFS capacity. Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; a 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage. FOX Interactive Media: a 3×20-machine cluster (8 cores/machine, 2 TB/machine storage) and a 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining, and machine learning. University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.

Main Components of Hadoop: HDFS and MapReduce.

What is HDFS? The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences are significant: – it is highly fault-tolerant and is designed to be deployed on low-cost hardware; – it provides high-throughput access to application data and is suitable for applications that have large data sets; – it relaxes a few POSIX requirements to enable streaming access to file system data; – it is part of the Apache Hadoop Core project (https://hadoop.apache.org/).
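
For orientation, here is a minimal HDFS client sketch using the org.apache.hadoop.fs.FileSystem API; the path and file contents are illustrative assumptions, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file into HDFS, read it back, and show its replication factor.
public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // reads core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);               // the configured file system (HDFS)
    Path p = new Path("/tmp/hello.txt");                // illustrative path
    try (FSDataOutputStream out = fs.create(p, true)) { // overwrite if present
      out.writeUTF("hello, HDFS");
    }
    try (FSDataInputStream in = fs.open(p)) {
      System.out.println(in.readUTF());
    }
    System.out.println("replication = " + fs.getFileStatus(p).getReplication());
  }
}

Run against a real cluster, fs.create() transparently splits the file into blocks and replicates them across DataNodes.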

MapReduce Paradigm. A programming model developed at Google; sort/merge-based distributed computing. Initially it was intended for Google's internal search/indexing application, but it is now used extensively by other organizations (e.g., Yahoo!, Amazon.com, IBM, etc.). It is functional-style programming (cf. LISP) that is naturally parallelizable across a large cluster of workstations or PCs.

NameNode Metadata. Metadata in memory: – the entire metadata is in main memory – no demand paging of metadata. Types of metadata: – list of files – list of blocks for each file – list of DataNodes for each block – file attributes, e.g., creation time, replication factor. A transaction log records file creations, file deletions, etc.
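
As a mental model only (these are not Hadoop's real classes; all names are illustrative), the in-memory structures could be sketched as:

import java.util.List;
import java.util.Map;

// Toy model of the NameNode's in-memory metadata. Everything lives in RAM
// (no demand paging); a transaction (edit) log on disk records mutations
// such as file creations and deletions.
class NameNodeMetadataSketch {
  static class BlockInfo {
    long blockId;
    List<String> dataNodes;      // DataNodes currently holding a replica
  }
  static class FileInfo {
    long creationTime;           // file attributes...
    short replicationFactor;
    List<BlockInfo> blocks;      // the blocks that make up the file, in order
  }
  Map<String, FileInfo> files;   // list of files, keyed by full path
}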

DataNode A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to Clients Block Report – Periodically sends a report of all existing blocks to the NameNode Facilitates Pipelining of Data – Forwards data to other specified DataNodes

Block Placement. Current strategy: – one replica on the local node – second replica on a remote rack – third replica on that same remote rack – additional replicas are placed randomly. Clients read from the nearest replica. It would be desirable to make this policy pluggable.
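
A toy sketch of this strategy, illustrative only (not Hadoop's actual BlockPlacementPolicy); it assumes at least two racks, at least as many distinct nodes as replicas, and ignores load and capacity:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PlacementSketch {
  static class Node {
    final String name, rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
  }

  static List<Node> place(Node writer, List<Node> cluster, int replicas, Random rnd) {
    List<Node> chosen = new ArrayList<>();
    chosen.add(writer);                                    // replica 1: the local node
    List<Node> remote = new ArrayList<>();
    for (Node n : cluster) if (!n.rack.equals(writer.rack)) remote.add(n);
    Node second = remote.get(rnd.nextInt(remote.size()));  // replica 2: a remote rack
    chosen.add(second);
    if (replicas >= 3) {                                   // replica 3: same remote rack
      for (Node n : remote) {
        if (n != second && n.rack.equals(second.rack)) { chosen.add(n); break; }
      }
    }
    while (chosen.size() < replicas) {                     // extras: random distinct nodes
      Node n = cluster.get(rnd.nextInt(cluster.size()));
      if (!chosen.contains(n)) chosen.add(n);
    }
    return chosen;
  }
}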

NameNode Failure. The NameNode is a single point of failure. The transaction log is stored in multiple directories: – a directory on the local file system – a directory on a remote file system. A real fault-aware solution still needs to be developed.

Hadoop Map/Reduce. The Map/Reduce programming model: – a framework for distributed processing of large data sets – pluggable user code runs in a generic framework. It is a common design pattern in data processing: cat * | grep | sort | uniq -c | cat > file corresponds to input | map | shuffle | reduce | output. Natural fits: – log processing – web search indexing – ad-hoc queries.

How does MapReduce work? The runtime partitions the input and provides it to different Map instances: Map(key, value) → (key', value'). The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets all the pairs with the same key'. Each Reduce produces a single (or zero) file of output. Map and Reduce are user-written functions.

First Hadoop Example. The de facto example: word count. (Google originally invented the paradigm for inverted-index calculation.) Input: the quick brown fox / the fox ate the mouse / how now brown cow. Output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3).

Word Count Dataflow

Mapper In action
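
A minimal word count mapper to accompany this slide, sketched with the Hadoop 2.x org.apache.hadoop.mapreduce API (class names are illustrative, not necessarily those on the original slide):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits the intermediate pair (word, 1) for every word in each input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate pair: (word, 1)
    }
  }
}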

Reducer In action
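
A matching reducer sketch, under the same assumptions as the mapper above: it sums the 1s emitted for each word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives all counts for one word and emits (word, total).
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    total.set(sum);
    context.write(word, total);   // final pair: (word, total count)
  }
}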

Main Class In action
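
And a driver sketch that wires the two together, assuming the Hadoop 2.x Job API; input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits the word count job.
public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional local pre-aggregation (see the combiner slide below)
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}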

Partitioning function. By default, your reduce tasks will be distributed evenly using a hash(intermediate-key) mod N function. You can specify a custom partitioning function. This is useful for locality: for example, if the key is a URL and you want all URLs belonging to a single host to be processed on a single machine.
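
A hypothetical partitioner for the URL-by-host case above (the class is illustrative, not from the slides, and assumes keys are well-formed URLs); it would be registered with job.setPartitionerClass(HostPartitioner.class):

import java.net.URI;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every URL belonging to the same host to the same reduce task.
public class HostPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text url, IntWritable value, int numReduceTasks) {
    String host = URI.create(url.toString()).getHost();
    if (host == null) host = url.toString();          // fall back to the raw key
    return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}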

Combiner function. After the map phase, the mapper transmits its entire intermediate data file over the network to the reducers. This file is often highly compressible, so the user can specify a combiner function. It is just like a reduce function, except it is run by the mapper before the data is passed to the reducer (for word count, the reducer class itself can be reused as the combiner, as in the driver sketch above).

Hadoop is critical to Yahoo!'s business: ads optimization, content optimization, search index, content feed processing. When you visit Yahoo!, you are interacting with data processed with Hadoop!

More Ideas for Research: – Hadoop log analysis: failure prediction and root-cause analysis – Hadoop data rebalancing: based on access patterns and load – Best use of flash memory?

Practical Installation: – Hadoop single-node cluster formation – Hadoop multi-node cluster formation (a minimal single-node configuration sketch follows).
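
A rough single-node (pseudo-distributed) configuration sketch, assuming Hadoop 2.x property names (fs.defaultFS; older releases used fs.default.name) and the default NameNode port 9000:

<!-- core-site.xml: where clients find the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: on a single node, one copy of each block is enough -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

A multi-node cluster uses the same files, with the NameNode's hostname in fs.defaultFS and a higher dfs.replication (the default is 3).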

Thank you all. Any queries?

Map in Lisp (Scheme). (map f list1 [list2 list3 …]). (map square '(1 2 3 4)) → (1 4 9 16). (reduce + '(1 4 9 16)) → (+ 16 (+ 9 (+ 4 1))) → 30. (reduce + (map square (map – l1 l2))). map takes a unary operator (e.g., square); reduce takes a binary operator (e.g., +).

MapReduce à la Google. map(key, val) is run on each item in the set – emits new-key/new-val pairs. reduce(key, vals) is run for each unique key emitted by map() – emits the final output.