Experiments in Utility Computing: Hadoop and Condor Sameer Paranjpye Y! Web Search.

Condor Week 2006

Outline
Introduction
– Application environment, motivation, development principles
Hadoop and Condor
– Description, Hadoop-Condor interaction

Introduction

Web Search Application Environment
Data intensive distributed applications
– Crawling, document analysis and indexing, web graphs, log processing, …
– Highly parallel workloads
– Bandwidth to data is a significant design driver
Very large production deployments
– Several clusters of 100s-1000s of nodes
– Lots of data (billions of records; input/output of 10s of TB in a single run)

Why Condor and Hadoop?
To date, our utility computing efforts have used a command-and-control model
– Closed, “cathedral” style development
– Custom-built, proprietary solutions
Hadoop and Condor
– An experimental effort to leverage open source for infrastructure components
– Current deployment: a cluster supporting research computations, with multiple users running ad-hoc, experimental programs

Vision: Layered Platform, Open APIs
– Batch Scheduling (Condor, SGE, SLURM, …)
– Distributed Store (HDFS, Lustre, Ibrix, …)
– Programming Models (MPI, DAG, MW, MR, …)
– Applications (Crawl, Index, …)

Development Philosophy
Adopt, Collaborate, Extend
– Open source commodity software
– Open APIs for interoperability
– Identify and use existing robust platform components
– Engage the community and participate in developing nascent and emerging solutions

Hadoop and Condor

Hadoop
An open source project developing:
– A distributed store
– An implementation of the Map/Reduce programming model
Led by Doug Cutting
Implemented in Java
Alpha (0.1) release available for download from the Apache distribution
Genesis
– Lucene and Nutch (open source search)
– Hadoop factors out the distributed compute/storage infrastructure
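The Map/Reduce model mentioned above can be illustrated with a minimal word-count sketch in plain Java. This is a single-process toy, not the Hadoop API; the class and method names are invented for illustration.

```java
import java.util.*;

public class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key, then sum the values per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        System.out.println(reduce(map(input)));
        // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In the real framework, the map and reduce calls run on many nodes in parallel and the "shuffle" moves intermediate pairs across the network; the data flow, however, is the same.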

Hadoop DFS
A distributed storage system
– Files are divided into uniformly sized blocks and distributed across cluster nodes
– Blocks are replicated for failover
– Checksums enable corruption detection and recovery
– The DFS exposes the details of block placement so that computation can be migrated to the data
Notable differences from mainstream DFS work
– A single ‘storage + compute’ cluster vs. separate clusters
– A simple I/O-centric API vs. attempts at POSIX compliance
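The block-oriented design above can be sketched in a few lines of Java: split a byte stream into fixed-size checksummed blocks, then assign each block to several nodes. The block size, replication factor, and round-robin placement here are illustrative stand-ins, not HDFS's actual values or its rack-aware policy.

```java
import java.util.*;
import java.util.zip.CRC32;

public class BlockPlacementSketch {
    static final int BLOCK_SIZE = 4;   // toy value; real HDFS blocks are tens of MB
    static final int REPLICATION = 3;

    // Split a byte stream into fixed-size blocks, each with a CRC32 checksum
    // so a reader can detect corruption and fall back to another replica.
    static List<long[]> toBlocks(byte[] data) {
        List<long[]> blocks = new ArrayList<>();  // each entry: {offset, length, checksum}
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            blocks.add(new long[]{off, len, crc.getValue()});
        }
        return blocks;
    }

    // Assign each block to REPLICATION distinct nodes (a round-robin
    // placeholder for a real rack-aware placement policy).
    static Map<Integer, List<String>> place(int nBlocks, List<String> nodes) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < nBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add(nodes.get((b + r) % nodes.size()));
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] file = "hello hdfs".getBytes();  // 10 bytes -> 3 blocks of size <= 4
        List<long[]> blocks = toBlocks(file);
        System.out.println(place(blocks.size(), List.of("d1", "d2", "d3", "d4")));
    }
}
```

Because the placement map is exposed (rather than hidden behind a POSIX layer), a scheduler can consult it and run a compute task on one of the nodes already holding the block.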

Hadoop DFS Architecture
Master/slave architecture
DFS master: the “Namenode”
– Manages all filesystem metadata
– Controls read/write access to files
– Manages block replication
DFS slaves: the “Datanodes”
– Serve read/write requests from clients
– Perform replication tasks on instruction from the namenode

Hadoop DFS Architecture
[Diagram: clients send metadata operations to the namenode, whose metadata records entries such as (/home/sameerp/foo, 3 replicas, …) and (/home/sameerp/docs, 4 replicas, …); client I/O goes directly to datanodes spread across Rack 1 and Rack 2]
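The namenode's role in the diagram amounts to maintaining two maps: file paths to block IDs, and block IDs to the datanodes holding replicas. A minimal sketch of that metadata service, with invented names and no persistence or locking:

```java
import java.util.*;

public class NamenodeSketch {
    // File metadata: path -> ordered list of block IDs.
    final Map<String, List<Long>> files = new HashMap<>();
    // Replica map: block ID -> datanodes currently holding a copy.
    final Map<Long, Set<String>> replicas = new HashMap<>();
    long nextBlockId = 0;

    // Create a file entry and allocate block IDs for it.
    List<Long> create(String path, int nBlocks) {
        List<Long> ids = new ArrayList<>();
        for (int i = 0; i < nBlocks; i++) ids.add(nextBlockId++);
        files.put(path, ids);
        return ids;
    }

    // A datanode reports that it holds a replica of a block.
    void reportReplica(long blockId, String datanode) {
        replicas.computeIfAbsent(blockId, k -> new TreeSet<>()).add(datanode);
    }

    // Answer a client metadata query: which nodes hold each block of the file?
    // The client then reads/writes block data directly from those datanodes.
    List<Set<String>> locate(String path) {
        List<Set<String>> out = new ArrayList<>();
        for (long id : files.get(path)) out.add(replicas.getOrDefault(id, Set.of()));
        return out;
    }

    public static void main(String[] args) {
        NamenodeSketch nn = new NamenodeSketch();
        nn.create("/home/sameerp/foo", 2);
        nn.reportReplica(0, "datanode-a");
        nn.reportReplica(0, "datanode-b");
        nn.reportReplica(1, "datanode-c");
        System.out.println(nn.locate("/home/sameerp/foo"));
        // [[datanode-a, datanode-b], [datanode-c]]
    }
}
```

The key design point the diagram makes is that only metadata flows through the master; bulk data moves between clients and datanodes directly, so the namenode is not an I/O bottleneck.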

Benchmarks

Deployment
Research cluster of 600 nodes
– A billion+ web pages
– Several months’ worth of logs
– 10s of TB of data
– Multiple users running ad-hoc research computations: crawl experiments, various kinds of log analysis, …
– Commodity platform: Intel/AMD, Linux, locally attached SATA drives
A testbed for the open source approach
– Still early days; the deployment has exposed many bugs
Future releases will
– First stabilize at the current size
– Then scale to a larger number of nodes

Hadoop-Condor Interactions
DFS makes data locations available to applications
Applications generate job descriptions (class-ads) to schedule jobs close to the data
Extensions enable Hadoop programming models to run in the scheduler universe
– Master/Worker and MPI-universe-like meta-scheduling
Condor enables sharing among applications
– Priority, accounting, and quota mechanisms manage resource allocation among users and apps
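The "schedule jobs close to data" step can be sketched as: query the DFS for a block's replica hosts, then emit a Condor submit description whose Rank expression prefers those machines. The machine names and the exact submit attributes chosen here are illustrative; a real integration would use the site's actual host names and policy.

```java
import java.util.*;
import java.util.stream.Collectors;

public class LocalityAdSketch {
    // Build a Condor submit description whose Rank expression scores a
    // machine higher if it holds a replica of the job's input block.
    // (In ClassAds, boolean terms evaluate to 0/1, so the sum counts matches.)
    static String submitAd(String executable, List<String> replicaHosts) {
        String rank = replicaHosts.stream()
                .map(h -> "(Machine == \"" + h + "\")")
                .collect(Collectors.joining(" + "));
        return "Executable = " + executable + "\n"
             + "Universe   = vanilla\n"
             + "Rank       = " + rank + "\n"
             + "Queue\n";
    }

    public static void main(String[] args) {
        // Hypothetical hosts d and e, as in the schedule-on-(d,e) example below.
        System.out.println(submitAd("crawl_task", List.of("d", "e")));
    }
}
```

Rank only expresses a preference, so the job still runs elsewhere if the data-holding nodes are busy; a stricter policy could put the same expression in Requirements instead.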

Hadoop-Condor Interactions
[Diagram: HDFS reports data locations (d, e) to scheduler-universe applications; the applications submit classads requesting scheduling on d and e, and Condor performs the resource allocation across the datanodes holding the replicated blocks]

The End