Grid Computing at Yahoo!
Sameer Paranjpye, Mahadev Konar
Yahoo!



Condor Week 2007

Outline
Introduction
  – What do we mean by ‘Grid’?
  – Technology Overview
Technologies
  – HDFS
  – Hadoop Map-Reduce
  – Hadoop on Demand

Introduction

What do we mean by ‘Grid’?
Computing platform that can support many distributed applications
  – Runs on dedicated clusters of commodity PCs (a Grid)
  – Hardware can be dynamically allocated to a “job”
  – Plan to support many applications per Grid
Good for batch data processing
  – Log processing
  – Document analysis and indexing
  – Web graphs and crawling
Large scale is a primary design goal
  – 10,000 PCs per Grid is a design goal (1000 now)
  – Very large data (10 petabytes of storage is a design goal)
      – 100+ TB inputs to a single job
      – Bandwidth to data is a significant design driver
Large production deployments
  – The number of CPUs that can be applied gates what you can do
  – Several clusters of 1000s of nodes

Technology Overview
Hadoop (our primary Grid project)
  – An open source Apache project, started by Doug Cutting
  – HDFS, a distributed file system
  – Implementation of the Map-Reduce programming model
HOD (Hadoop-on-Demand)
  – Adaptor that runs Hadoop tools on batch systems
  – Hadoop expressed as a parallel job
  – Manages setup, startup, shutdown and cleanup of Hadoop
  – Currently supports Condor and Torque

Technologies

HDFS - Hadoop Distributed FS
Very large distributed file system
  – We plan to support 10k nodes and 10 PB of data
  – Current deployment of 1k+ nodes, 1 PB of data
Assumes commodity hardware that fails
  – Files are replicated to handle hardware failure
  – Checksums for corruption detection and recovery
  – Continues operation as nodes / racks are added / removed
Optimized for fast batch processing
  – Data location is exposed to allow computation to move to the data
  – Stores data in chunks on every node in the cluster
  – Provides VERY high aggregate bandwidth
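The replication and checksum bullets above can be illustrated with a small in-memory sketch. This is not HDFS code: the chunk size, the replication factor, the round-robin placement and the names `store` and `read` are all assumptions chosen to keep the example short. It only shows why a read can survive one failed node and one corrupted replica.

```python
import zlib

CHUNK_SIZE = 8      # bytes per chunk; tiny on purpose (assumption, not an HDFS default)
REPLICATION = 3     # replicas per chunk (assumption)


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store(data: bytes, nodes: list, replication: int = REPLICATION):
    """Place each chunk on `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    chunks = split_into_chunks(data)
    checksums = [zlib.crc32(c) for c in chunks]
    placement = {}                       # chunk index -> list of node names
    storage = {n: {} for n in nodes}     # node name -> {chunk index: bytes}
    for i, chunk in enumerate(chunks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for node in placement[i]:
            storage[node][i] = chunk
    return placement, storage, checksums


def read(placement, storage, checksums) -> bytes:
    """Reassemble the file, skipping replicas that are missing or fail their checksum."""
    out = []
    for i in range(len(checksums)):
        for node in placement[i]:
            chunk = storage.get(node, {}).get(i)
            if chunk is not None and zlib.crc32(chunk) == checksums[i]:
                out.append(chunk)
                break
        else:
            raise IOError(f"no healthy replica for chunk {i}")
    return b"".join(out)


data = b"grid computing at yahoo"
placement, storage, checksums = store(data, ["n1", "n2", "n3", "n4"])
del storage["n1"]                 # simulate a failed node
storage["n2"][0] = b"XXXXXXXX"    # simulate a corrupted replica of chunk 0
recovered = read(placement, storage, checksums)
```

Because `placement` records where every chunk lives, a scheduler could use the same table to run work on a node that already holds the data, which is the "move computation to the data" idea from the slide.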

Hadoop DFS Architecture
[Diagram: Clients send metadata ops to the Namenode and perform I/O against Datanodes in Rack 1 and Rack 2. Namenode metadata (name, replicas, …): /home/sameerp/foo, 3, …; /home/sameerp/docs, 4, …]

Hadoop Map-Reduce
Implementation of the Map-Reduce programming model
  – Framework for distributed processing of large data sets
  – Resilient to nodes failing and joining during a job
  – Great for web data and log processing
Pluggable user code runs in a generic reusable framework
  – Input records are transformed, sorted and combined to produce a new output
  – All actions pluggable / configurable
A reusable design pattern
  Input | Map | Shuffle | Reduce | Output
  (example) cat * | grep | sort | uniq -c > file
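The Input | Map | Shuffle | Reduce | Output pattern can be sketched in a few lines of single-process Python. This is a toy model, not the Hadoop API: `map_reduce`, `grep_get` and `count` are hypothetical names, and everything runs in memory on one machine. The mapper plays the role of `grep`, the grouping step plays the role of `sort`, and the reducer plays the role of `uniq -c`.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run pluggable map and reduce functions over an iterable of records."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: group values by key; keys are processed in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: one call per key with all of that key's values.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def grep_get(line):
    """Mapper: keep only lines containing 'GET' (the `grep` step)."""
    if "GET" in line:
        yield (line, 1)


def count(key, values):
    """Reducer: count occurrences of each distinct line (the `uniq -c` step)."""
    return (key, sum(values))


log = ["GET /index", "GET /about", "POST /login", "GET /index"]
result = map_reduce(log, grep_get, count)
# result == [("GET /about", 1), ("GET /index", 2)]
```

Swapping in a different mapper or reducer changes the job without touching `map_reduce`, which is the sense in which the user code is pluggable and the framework is reusable.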

HOD (Hadoop on Demand)
Adaptor that enables Hadoop use with batch schedulers
  – Provisions Hadoop clusters on demand
  – Scheduling is handled by resource managers like Condor
  – Requests N nodes from a resource manager and provisions them with a Hadoop cluster
Condor interaction
  – User specifies: number of nodes, workload to launch
  – HOD generates class-ads for the Hadoop master and slaves and submits them as Condor jobs
  – Cluster comes up when the jobs start running
  – HOD launches the workload
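As a rough illustration of the Condor interaction described above, the sketch below builds two Condor submit descriptions, one for a master and one for the slaves. It is not HOD's actual output: the script names `start-hadoop-master.sh` and `start-hadoop-slave.sh`, the one-master-plus-(N-1)-slaves split, and the helper names are all hypothetical. Only the `universe`, `executable`, `arguments` and `queue` submit commands are standard Condor syntax.

```python
def submit_description(executable: str, count: int, arguments: str = "") -> str:
    """Build a minimal Condor submit description that queues `count` jobs."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments = {arguments}")
    lines.append(f"queue {count}")
    return "\n".join(lines) + "\n"


def plan_cluster(num_nodes: int) -> dict:
    """Split a request for `num_nodes` into one master job and the slave jobs.

    Assumption: one node runs the Hadoop master and the rest run slaves.
    """
    if num_nodes < 2:
        raise ValueError("need at least one master and one slave")
    return {
        "master": submit_description("start-hadoop-master.sh", 1),
        "slaves": submit_description("start-hadoop-slave.sh", num_nodes - 1),
    }


plan = plan_cluster(100)
```

In this sketch each string in `plan` would be written to a file and handed to `condor_submit`; the cluster exists only for as long as those jobs are running.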

HOD (Hadoop on Demand)
HOD shell
  – User interface to HOD is a command shell
  – Workloads are specified as command lines
  – Example:
      % bin/hod -c hodconf -n 100
      >> run hadoop-streaming.jar -mapper 'grep condor' -reducer 'uniq -c' -input /user/sameerp/data -output /user/sameerp/condor
Work in progress
  – Data affinity for workloads
  – Implementation of elastic workloads
  – Software distribution via BitTorrent
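To make the streaming example concrete, here is an in-process emulation of what a `grep condor` mapper and a `uniq -c` reducer compute. It is only a sketch: a real streaming job runs external programs over stdin/stdout across many nodes, whereas `run_streaming`, `grep_mapper` and `uniq_c_reducer` are hypothetical Python stand-ins that treat each whole line as the key.

```python
from itertools import groupby


def grep_mapper(lines, pattern="condor"):
    """Mapper stand-in for `grep condor`: pass through matching lines only."""
    for line in lines:
        if pattern in line:
            yield line


def uniq_c_reducer(sorted_lines):
    """Reducer stand-in for `uniq -c`: count each run of identical lines."""
    for line, run in groupby(sorted_lines):
        yield f"{sum(1 for _ in run)} {line}"


def run_streaming(lines, mapper, reducer):
    """Map, sort the mapper output (the shuffle), then reduce."""
    mapped = list(mapper(lines))
    shuffled = sorted(mapped)
    return list(reducer(shuffled))


data = [
    "condor_submit job",
    "hadoop start",
    "condor_q",
    "condor_submit job",
]
output = run_streaming(data, grep_mapper, uniq_c_reducer)
# output == ["1 condor_q", "2 condor_submit job"]
```

The reducer relies on its input being sorted, just as `uniq -c` only merges adjacent duplicates; the sort between the two stages is what the framework contributes.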

Hadoop on Condor
Clients launch jobs
Condor dynamically allocates clusters
HOD is used to start Hadoop Map-Reduce on the cluster
Map-Reduce reads/writes data from HDFS
When done
  – Results are stored in HDFS and/or returned to the client
  – Condor reclaims nodes
[Diagram: Client 1 and Client 2, Condor, a dynamic Hadoop Map-Reduce cluster, and HDFS]

Other things in the works
Record I/O
  – Define a structure once, use it in C, Java, Python…
  – Export it in a binary or XML format
Streaming
  – A simple way to use existing Unix filters and/or stdin/stdout programs in any language with Map-Reduce
Pig (Y! Research)
  – Higher-level data manipulation language, uses Hadoop
  – Data analysis tasks expressed as queries, in the style of SQL or relational algebra
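The Record I/O idea of defining a structure once and exporting it in binary or XML can be sketched as follows. This is not the Hadoop Record I/O definition language or its wire format: `PageRecord`, its fields and the byte layout are assumptions made for illustration, and the sketch covers only the Python side of what the slide describes as a multi-language facility.

```python
import struct
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class PageRecord:
    """A hypothetical record: one structure, two export formats."""
    url: str
    hits: int

    def to_binary(self) -> bytes:
        # Layout (assumption): 4-byte big-endian length, UTF-8 url, 8-byte signed hits.
        encoded = self.url.encode("utf-8")
        return struct.pack(">I", len(encoded)) + encoded + struct.pack(">q", self.hits)

    @classmethod
    def from_binary(cls, blob: bytes) -> "PageRecord":
        (length,) = struct.unpack_from(">I", blob, 0)
        url = blob[4:4 + length].decode("utf-8")
        (hits,) = struct.unpack_from(">q", blob, 4 + length)
        return cls(url, hits)

    def to_xml(self) -> str:
        root = ET.Element("PageRecord")
        ET.SubElement(root, "url").text = self.url
        ET.SubElement(root, "hits").text = str(self.hits)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text: str) -> "PageRecord":
        root = ET.fromstring(text)
        return cls(root.findtext("url"), int(root.findtext("hits")))


record = PageRecord("http://example.com/", 42)
binary_copy = PageRecord.from_binary(record.to_binary())
xml_copy = PageRecord.from_xml(record.to_xml())
```

Both round trips return a record equal to the original, so the same definition can feed a compact binary pipeline or a human-readable XML dump.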

The End