
Presentation transcript:

- Need for a new processing platform (Big Data)
- Origin of Hadoop
- What Hadoop is and what it is not
- Hadoop architecture
- Hadoop components (Common/HDFS/MapReduce)
- Hadoop ecosystem
- When should we go for Hadoop?
- Real-world use cases
- Questions

- What is Big Data?
  - Twitter (over ~7 TB/day)
  - Facebook (over ~10 TB/day)
  - Google (over ~20 PB/day)
- Where does it come from?
- Why go to so much trouble?
  - Information is everywhere, but where is the knowledge?
- Existing systems (vertical scalability)
- Why Hadoop (horizontal scalability)?

- Seminal whitepapers by Google in 2004 on a new programming paradigm for handling data at internet scale
- Hadoop started as part of the Nutch project
- In Jan 2006, Doug Cutting started working on Hadoop at Yahoo
- Factored out of Nutch in Feb 2006
- First release of Apache Hadoop in September 2007
- Jan: Hadoop became a top-level Apache project

Hadoop vendors and distributions:
- Amazon
- Cloudera
- MapR
- Hortonworks
- Microsoft Windows Azure
- IBM InfoSphere BigInsights
- Datameer
- EMC Greenplum HD Hadoop distribution
- Hadapt

- Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware
- Written in Java
- Open source, distributed under the Apache license
- Core components: Hadoop Common, HDFS, and MapReduce

- Framework for running applications on large clusters of commodity hardware
- Scale: petabytes of data on thousands of nodes
- In the Hadoop ecosystem, the processing logic (code) travels to the nodes that hold the data, rather than the data travelling to the code
- Components
  - Storage: HDFS (Hadoop Distributed File System), with the NameNode, Secondary NameNode, and DataNode
  - Processing: MapReduce, with the JobTracker and TaskTracker

Hadoop is not:
- A replacement for existing data warehouse systems
- A file system
- An online transaction processing (OLTP) system
- A replacement for all programming logic
- A database

- High-level view of the architecture: NameNode (NN), DataNode (DN), JobTracker (JT), TaskTracker (TT)

- NameNode: stores and manages all metadata about the data on the cluster, so it is the single point of contact for HDFS.
- JobTracker: runs on the master node (often alongside the NameNode on small clusters). It schedules and monitors the map and reduce tasks of the jobs submitted to the cluster rather than executing them itself.
- Secondary NameNode: periodically merges the file system change history (edit log) into a checkpoint of the metadata held by the NameNode. It keeps a copy of the metadata but is not a hot standby for the NameNode.
- DataNode: holds the actual data.
- Default block size on a DataNode: 64 MB.
- TaskTracker: runs the tasks assigned by the JobTracker, preferably on data stored locally.

- Hadoop Distributed File System (HDFS)
- Default storage for the Hadoop cluster
- NameNode/DataNode
- The file system namespace (similar to a local file system)
- Master/slave architecture (1 master, n slaves)
- Virtual, not physical
- Provides configurable replication (user-specified)
- Data is stored as blocks (64 MB by default, but configurable) spread across the nodes
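
The block size and replication points above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 64 MB block size comes from the slide, while the replication factor of 3 and the function name `hdfs_footprint` are assumptions made for the example.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted in the slides
REPLICATION = 3      # assumed replication factor, used here for illustration


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    # A file is cut into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the cluster.
    replicas = blocks * replication
    # A partly filled last block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

Under these assumptions a 1000 MB file is split into 16 blocks, stored as 48 block replicas, and occupies 3000 MB of raw cluster storage.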

The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup storage for the blocks and constantly report to the NameNode to keep the metadata current.
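
The division of labour described here can be modelled in a few lines. This is a toy, in-memory sketch and not HDFS code: the class `ToyNameNode`, its methods, and the block and node names are all hypothetical.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of the metadata a NameNode keeps (not real HDFS code)."""

    def __init__(self):
        self.file_to_blocks = {}                 # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes with a replica

    def add_file(self, path, block_ids):
        """Record how a file is broken down into blocks."""
        self.file_to_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return each block of the file with the DataNodes that hold it."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_to_blocks[path]]


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
    nn.block_report("dn1", ["blk_1", "blk_2"])
    nn.block_report("dn2", ["blk_1"])
    print(nn.locate("/logs/a.txt"))
```

The point of the sketch is that the NameNode holds only the mapping; the block contents never pass through it.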

After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
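
As a rough illustration of this assignment step, the sketch below hands out one map task per input block and prefers a TaskTracker that already holds the block. It is a simplified, hypothetical scheduler (the function `assign_map_tasks` and the node names are invented for the example); the real JobTracker also schedules reduce tasks and handles slots, retries, and failures.

```python
def assign_map_tasks(block_hosts, tasktrackers):
    """Assign one map task per block, preferring a node that stores the block.

    block_hosts: block id -> list of nodes holding a replica of that block.
    tasktrackers: list of nodes that can run tasks.
    """
    load = {tt: 0 for tt in tasktrackers}
    assignment = {}
    for block_id, hosts in block_hosts.items():
        # Data-local candidates: TaskTrackers that already store this block.
        local = [h for h in hosts if h in load]
        # Fall back to any TaskTracker if no replica is on a TaskTracker node.
        candidates = local or tasktrackers
        # Pick the least-loaded candidate (ties go to the first one listed).
        chosen = min(candidates, key=lambda tt: load[tt])
        assignment[block_id] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    blocks = {
        "blk_1": ["n1", "n2"],
        "blk_2": ["n1", "n3"],
        "blk_3": ["n1"],
        "blk_4": ["n9"],  # replica lives on a node that runs no TaskTracker
    }
    print(assign_map_tasks(blocks, ["n1", "n2", "n3"]))
```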

Typically, large Hadoop clusters are arranged in racks, and network traffic between nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance. A default installation assumes all the nodes belong to the same rack.
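
The idea of spreading replicas over racks can be sketched as follows. This is a deliberately simplified policy written for illustration, not the placement algorithm HDFS actually uses; the function `place_replicas`, the node names, and the rack names are assumptions.

```python
def place_replicas(writer, topology, replication=3):
    """Choose nodes for the replicas of one block.

    topology: node -> rack name. The first replica stays on the writer's node,
    further replicas go to racks without a replica, then to any free node.
    """
    chosen = [writer]
    used_racks = {topology[writer]}
    others = sorted(n for n in topology if n != writer)

    # Pass 1: spread across racks that hold no replica yet (fault tolerance).
    for node in others:
        if len(chosen) == replication:
            break
        if topology[node] not in used_racks:
            chosen.append(node)
            used_racks.add(topology[node])

    # Pass 2: fill any remaining slots with nodes not chosen so far.
    for node in others:
        if len(chosen) == replication:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen


if __name__ == "__main__":
    two_racks = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
    print(place_replicas("n1", two_racks))
    # With a single rack, as in a default installation, only node choice matters.
    one_rack = {"n1": "default-rack", "n2": "default-rack", "n3": "default-rack"}
    print(place_replicas("n1", one_rack))
```

With two racks the sketch puts at least one replica on each rack, so losing a whole rack does not lose the block.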

- Framework provided by Hadoop to process large amounts of data in parallel across a cluster of machines
- A job typically comprises three classes: a Mapper class, a Reducer class, and a Driver class
- TaskTracker/JobTracker
- The reduce phase starts only after the map phase is done
- Takes (k, v) pairs and emits (k, v) pairs
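
The (k, v) flow listed above can be imitated on a single machine. The sketch below is plain Python rather than the Hadoop API: `mapper`, `reducer`, and `run_job` are hypothetical stand-ins for the Mapper, Reducer, and Driver classes, and word counting is chosen only as a familiar example.

```python
from collections import defaultdict


def mapper(_key, line):
    """Map: take one (offset, line) pair and emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce: take (word, [1, 1, ...]) and emit (word, total)."""
    yield word, sum(counts)


def run_job(records):
    """Driver stand-in: run map, then shuffle, then reduce, in that order."""
    # Map phase: every input record is processed independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle: group all intermediate values by key once the map phase is done.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per distinct key.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output


if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
    print(run_job(lines))
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; the ordering of the three phases is the part this sketch preserves.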

Hadoop can run in three modes:
- Standalone mode
- Pseudo-distributed mode
- Fully distributed mode

- Need to process multi-petabyte datasets
- Nodes fail every day
  - Failure is expected, rather than exceptional.
  - The number of DataNodes in a cluster is not constant.
- Need a common infrastructure
  - Efficient, reliable, open source (Apache license)
- Workloads are I/O bound, not CPU bound
- Since the processing is distributed, high-end processors are not needed

- Very large distributed file system
- Thousands of nodes, millions of files, petabytes of data
- Assumes commodity hardware
- Files are replicated to handle hardware failure
- Detects failures and recovers from them
- Runs in user space on heterogeneous operating systems
- Robustness: it all depends on the heartbeat. Every 3 seconds each DataNode sends a heartbeat to the NameNode; if the heartbeats stop, the NameNode marks the node as dead and re-replication of its data starts automatically.
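
The heartbeat mechanism can be sketched as a small bookkeeping class. The 3-second interval is taken from the slide; the 30-second timeout, the class `HeartbeatMonitor`, and the node names are assumptions chosen for illustration and do not reflect actual Hadoop settings or code.

```python
HEARTBEAT_INTERVAL_S = 3   # interval quoted in the slides
DEAD_AFTER_S = 30          # hypothetical timeout, chosen only for this example


class HeartbeatMonitor:
    """Toy version of the liveness tracking a NameNode performs."""

    def __init__(self, dead_after_s=DEAD_AFTER_S):
        self.dead_after_s = dead_after_s
        self.last_seen = {}  # DataNode name -> time of last heartbeat (seconds)

    def heartbeat(self, datanode, now):
        """Record that `datanode` reported in at time `now`."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            dn for dn, seen in self.last_seen.items()
            if now - seen > self.dead_after_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.heartbeat("dn2", 0)  # dn2 reports once, then goes silent
    for t in range(0, 31, HEARTBEAT_INTERVAL_S):
        monitor.heartbeat("dn1", t)  # dn1 keeps reporting every 3 seconds
    print(monitor.dead_nodes(now=33))
```

Timestamps are passed in explicitly so the example is deterministic; once a node appears in `dead_nodes`, the blocks it held would be the ones needing new replicas.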

When to go for Hadoop:
- The data is too large for existing systems
- Processes are independent
- Online analytical processing (OLAP)
- Better scalability
- Parallelism
- Unstructured data

Real-world use cases:
- Clickstream analysis
- Sentiment analysis
- Recommendation engines
- Ad targeting
- Search quality