
Distributed Computing Overview

Agenda
– What is distributed computing
– Why distributed computing
– Common architectures
– Best practices
– Case studies: Condor, Hadoop (HDFS and MapReduce)

What is Distributed Computing/System?
Distributed computing
– A field of computer science that studies distributed systems.
– The use of distributed systems to solve computational problems.
Distributed system
– Wikipedia: several autonomous computational entities, each of which has its own local memory; the entities communicate with each other by message passing.
– Operating System Concepts: the processors communicate with one another through various communication lines, such as high-speed buses or telephone lines; each processor has its own local memory.

What is Distributed Computing/System?
Distributed program
– A program that runs in a distributed system.
Distributed programming
– The process of writing distributed programs.

What is Distributed Computing/System?
Common properties
– Fault tolerance: when one or more nodes fail, the system as a whole can keep working, possibly with degraded performance. This requires checking the status of each node.
– Each node plays a partial role: each computer has only a limited, incomplete view of the system and may know only one part of the input.
– Resource sharing: each user can share the computing power and storage resources in the system with other users.
– Load sharing: dispatching tasks across the nodes spreads the load over the whole system.
– Easy to expand: adding nodes should take as little effort as possible, ideally none.

Why Distributed Computing?
The nature of the application
Performance
– Computing intensive: the task consumes a lot of computation time, for example calculating the digits of π.
– Data intensive: the task deals with a large amount of data or very large files, for example Facebook or the LHC (Large Hadron Collider).
Robustness
– No SPOF (single point of failure).
– Other nodes can execute the same task that was running on a failed node.

Common Architectures
Communicate and coordinate work among concurrent processes
– Processes communicate by sending/receiving messages.
– Message passing can be synchronous or asynchronous (a sketch follows below).
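A minimal sketch of the two styles, using ordinary Java threads and queues in place of real network channels: with a SynchronousQueue the sender blocks until the receiver takes the message (a rendezvous, i.e. synchronous), while with a LinkedBlockingQueue the sender deposits the message and continues immediately (asynchronous).

```java
import java.util.concurrent.*;

public class MessagePassingDemo {
    static void demo(BlockingQueue<String> channel, String label) throws InterruptedException {
        Thread sender = new Thread(() -> {
            try {
                channel.put(label + ": hello");              // may block, depending on the queue
                System.out.println(label + ": sender done");
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        sender.start();
        Thread.sleep(200);                                   // let the sender run first
        System.out.println("received " + channel.take());
        sender.join();
    }

    public static void main(String[] args) throws InterruptedException {
        demo(new SynchronousQueue<>(), "sync");              // sender blocks until take()
        demo(new LinkedBlockingQueue<>(), "async");          // sender returns immediately
    }
}
```

In the synchronous run, "sender done" only prints after the receiver's take(); in the asynchronous run it prints during the sleep, before anyone has received the message.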

Common Architectures
Master/slave architecture
– Master/slave is a communication model in which one device or process has unidirectional control over one or more other devices.
– Database replication: the source database can be treated as the master and the destination database as a slave.
Client-server
– For example, web browsers and web servers (see the sketch below).
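As a concrete sketch of the client-server style (the port number is arbitrary): the server process owns a resource and waits for requests; any client connects, sends a request, and reads the reply, which is the same pattern a web browser and web server follow at larger scale.

```java
import java.io.*;
import java.net.*;

// A minimal echo server: serves one request per connection, forever.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                try (Socket client = server.accept();        // block until a client connects
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println("echo: " + in.readLine());   // reply to the request
                }
            }
        }
    }
}
```

A matching client can be as simple as opening `new Socket("localhost", 9000)` and using the same reader/writer pair to send a line and print the reply.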

Common Architectures
Data-centric architecture
– Uses a standard, general-purpose relational database management system rather than customized in-memory or file-based data structures and access methods.
– Uses dynamic, table-driven logic rather than logic embodied in previously compiled programs.
– Uses stored procedures (logic running in the database) rather than logic running in middle-tier application servers.
– Uses shared databases as the basis for communication between parallel processes, rather than direct inter-process communication via message passing.

Best Practices
Data intensive or computing intensive?
– Consider the data size and the amount of data, i.e. the attributes of the data you consume.
Computing intensive
– We can move the data to the nodes where we execute the jobs.
Data intensive
– We can split/replicate the data across different nodes, then execute our tasks on those nodes.
– Reduce data movement when executing tasks; the master nodes need to know the data locations.
No data loss when incidents happen
– SAN (Storage Area Network)
– Data replication on different nodes
Synchronization
– When splitting a task across different nodes, how can we make sure the subtasks are synchronized?

Best Practices
Robustness
– The system stays safe when one or more nodes fail.
– The system needs to recover once the failed nodes are back online, with little or no further action needed. Condor: restart the daemon.
– Failure detection: when any node fails, the master nodes can detect the situation, e.g. by heartbeat detection (a sketch follows below).
– Applications/users don't need to know that a partial failure happened; tasks are restarted on other nodes for the users.
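A minimal heartbeat-detection sketch (the class, method names, and timeout are illustrative, not taken from Condor or Hadoop): every worker reports in periodically, and the master treats a node as failed once no heartbeat has arrived within the timeout, at which point it can restart that node's tasks elsewhere.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10_000;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Called periodically by each worker node (e.g. over RPC in a real system).
    public void heartbeat(String nodeId) {
        lastSeen.put(nodeId, System.currentTimeMillis());
    }

    // Called by the master's checker loop; false means "reschedule this node's tasks".
    public boolean isAlive(String nodeId) {
        Long t = lastSeen.get(nodeId);
        return t != null && System.currentTimeMillis() - t < TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor();
        m.heartbeat("node-1");
        System.out.println(m.isAlive("node-1")); // true: reported just now
        System.out.println(m.isAlive("node-2")); // false: never reported
    }
}
```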

Best Practices
Network issues
– Bandwidth: think about the bandwidth cost of copying files between nodes, which arises whenever we want to execute a task on nodes that do not hold the data.
Scalability
– Easy to expand. Hadoop: modify the configuration and start the daemon.
Optimization
– What can we do if the performance of some nodes is not good? Monitor the performance of each node, based on information exchanges such as heartbeats or logs, and resume the same task on other nodes.

Best Practices
Applications/users
– Shouldn't need to know how the nodes communicate with each other.
– User mobility: users can access the system from designated entry points, or from anywhere. Grid: the UI (user interface) machine. Condor: the submit machine.

Case study - Condor
Condor
– Designed for computing-intensive jobs.
– Queuing policy: matches tasks to computing nodes.
– Resource classification: each resource can advertise its attributes, and the master classifies resources accordingly (a sample submit description follows below).
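In practice a user describes a job in a submit description file and hands it to condor_submit; the requirements expression is a ClassAd that the central manager matches against the attributes each execution machine advertises. A minimal sketch, where the executable name and the resource limits are placeholders:

```
universe     = vanilla
executable   = compute_pi          # the computing-intensive program to run
arguments    = 1000000
output       = pi.$(Process).out
error        = pi.$(Process).err
log          = pi.log
# Matchmaking: run only on machines whose advertised ClassAd attributes match.
requirements = (OpSys == "LINUX") && (Memory >= 1024)
queue 10                           # submit 10 instances of this task
```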


Case study - Condor
Roles
– Central manager: the collector of information, and the negotiator between resources and resource requests.
– Execution machine: responsible for executing Condor tasks.
– Submit machine: responsible for submitting Condor tasks.
– Checkpoint server: responsible for storing all checkpoint files for the tasks.

Case study - Condor
Robustness
– If one execution machine fails, we can execute the same task on other nodes.
– Recovery: only the daemon needs to be restarted when the failed nodes are back online.

Case study - Condor
Resource sharing
– Each Condor user can share computing power with other Condor users.
Synchronization
– Users need to take care of it themselves: users can run MPI jobs in a Condor pool, but they must think about synchronization and deadlock.
Failure detection
– The central manager knows when a node fails, based on the update notifications the nodes send.
Scalability
– Only a few commands are needed when new nodes come online.

Case study - Hadoop
HDFS
– NameNode: manages the file system namespace, regulates access to files by clients, and determines the mapping of blocks to DataNodes.
– DataNode: manages the storage attached to the node it runs on, saves CRC checksums, and sends heartbeats to the NameNode. Each file is split into chunks (blocks), and each chunk is stored on several DataNodes.
– Secondary NameNode: responsible for merging the fsImage and the EditLog.
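A minimal client-side sketch using the HDFS Java API, where the NameNode URI and the paths are placeholders: the client contacts the NameNode only for metadata such as block locations, then reads and writes the blocks on the DataNodes directly.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/input.txt");
        fs.copyFromLocalFile(new Path("input.txt"), file); // split into blocks and replicated
        fs.setReplication(file, (short) 3);                // request 3 replicas per block
        System.out.println("size: " + fs.getFileStatus(file).getLen());
        fs.close();
    }
}
```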


MapReduce Framework
– JobTracker: responsible for dispatching jobs to the TaskTrackers, and for job management such as scheduling and removal.
– TaskTracker: responsible for executing jobs; the TaskTracker usually launches a separate JVM to execute each task.
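The canonical example of the model is word count, essentially as it appears in the Hadoop documentation: the framework splits the input among the TaskTrackers, each map task emits (word, 1) pairs, and the pairs are grouped by key before the reduce tasks sum them.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {                  // emit (word, 1) for every token
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // sum the counts per word
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);        // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```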

Case study - Hadoop [diagram from Tom White, Hadoop: The Definitive Guide]

Case study - Hadoop
Data replication
– Data are replicated to different nodes, which reduces the possibility of data loss and enables data locality: a job is sent to the node where its data are.
Robustness
– If one DataNode fails, we can get the data from other nodes.
– If one TaskTracker fails, we can start the same task on a different node.
– Recovery: only the daemon needs to be restarted when the failed nodes are back online.

Case study - Hadoop
Resource sharing
– Each Hadoop user can share computing power and storage space with other Hadoop users.
Synchronization
– No synchronization is required.
Failure detection
– The NameNode/JobTracker knows when a DataNode/TaskTracker fails, based on heartbeats.

Case study - Hadoop
Scalability
– Only a few commands are needed when new nodes come online.
Optimization
– A speculative task is launched only when a task takes too much time on one node; the slower copy is killed once the other one finishes (see the sketch below).
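Speculative execution is on by default but can be tuned per job; a sketch using the org.apache.hadoop.mapreduce API (whether to disable it, e.g. for reducers with external side effects, is a judgment call):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "demo");
        // Allow backup copies of unusually slow map tasks on other nodes.
        job.setMapSpeculativeExecution(true);
        // Disable backups for reduce tasks, e.g. when they write to external systems.
        job.setReduceSpeculativeExecution(false);
    }
}
```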

References
– Tom White, Hadoop: The Definitive Guide
– Silberschatz and Galvin, Operating System Concepts

Backup slides

Message passing - synchronous vs. asynchronous

Case study – Condor (all related daemons)
– condor_master: keeps all the rest of the Condor daemons running on each machine.
– condor_startd: represents a given resource and enforces the policy that resource owners configure, which determines under what conditions remote jobs will be started, suspended, resumed, vacated, or killed. When the startd is ready to execute a Condor job, it spawns the condor_starter.
– condor_starter: spawns the remote Condor job on a given machine.
– condor_schedd: represents resource requests to the Condor pool.
– condor_shadow: runs on the submit machine and acts as the resource manager for a remotely executing job.
– condor_collector: collects all the information about the status of a Condor pool.
– condor_negotiator: performs all the matchmaking within the Condor system.
– condor_kbdd: notifies the condor_startd of keyboard and mouse activity by the machine owner.
– condor_ckpt_server: stores and retrieves checkpoint files.
– condor_quill: builds and manages a database that represents a copy of the Condor job queue.

Case study – Condor (all related daemons)
– condor_had: implements high availability of a pool's central manager by monitoring the communication of the necessary daemons.
– condor_replication: assists the condor_had daemon by keeping an updated copy of the pool's state.
– condor_transferer: accomplishes the task of transferring a file.
– condor_lease_manager: manages leases in a persistent manner; leases are represented by ClassAds.
– condor_rooster: wakes hibernating machines based upon configuration details.
– condor_shared_port: listens for incoming TCP packets on behalf of the Condor daemons.