Distributed Computing using CloudLab


Distributed Computing using CloudLab
Linh B. Ngo, Clemson University

Introduction
Distributed and Cluster Computing (CPSC 3620)
- Offered twice per academic year
- Average class size: 40-45 students
- Required junior-level class (typically taken in senior year)
Contents: infrastructure/system-oriented, performance/efficiency
- High Performance Computing: MPI
- Big Data Computing: Hadoop MapReduce, Apache Spark, HPCC Systems

Computing Resources
Palmetto Supercomputer:
- 2000+ nodes, open to all faculty/students
- No administrative access
- Cannot share nodes among students to support group assignments
- Preemption by node owners
CloudLab:
- Limited resources for large-scale studies
- Administrative access
- Ease of collaboration
- No preemption

Computing Resources
Combine both local computing resources and CloudLab.
Learning outcomes through CloudLab:
- Administrative skills for distributed systems
- In-depth understanding of distributed systems
Learning outcomes through Palmetto:
- Basic understanding of parallel application development
- Impacts of scaling and efficiency on larger systems

Tutorial
Set up environments for distributed computing on CloudLab:
- MPI: two-node cluster, OpenMPI
- Hadoop: three-node cluster, Hortonworks Distribution

Both nodes should have the same configuration; no dedicated network link between them is needed, since each node has a public IP.

You can save multiple versions of your topology and instantiate the version that you want to launch. The launching procedure is similar to the OpenStack tutorial.

MPI on CloudLab
On each node:
  sudo apt-get update
  sudo apt-get install libibnetdisc-dev
  sudo nano /etc/environment
Append the OpenMPI bin and lib directories to the paths in /etc/environment:
  PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/mpiuser/.openmpi/bin"
  LD_LIBRARY_PATH="/lib:/usr/lib:/usr/local/lib:/home/mpiuser/.openmpi/lib"
Create the MPI user:
  sudo adduser mpiuser

MPI on CloudLab
As mpiuser, download and build OpenMPI:
  ssh mpiuser@localhost
  wget https://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.1.tar.gz
  tar xzf openmpi-1.8.1.tar.gz
  cd openmpi-1.8.1
  ./configure --prefix="/home/mpiuser/.openmpi"
  make
  make install

MPI on CloudLab
Set up passwordless SSH between the nodes (as mpiuser):
  ssh-keygen -t rsa
  cd .ssh
  cp id_rsa.pub authorized_keys
  ssh-copy-id -i id_rsa.pub <hostname of the other node>
Create a machinefile listing the nodes:
  nano nodelist
    <hostname of first node>
    <hostname of second node>
    …

Example program (each rank reports the node it is running on):

#include <stdio.h>
#include <unistd.h>
#include <sys/utsname.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* look up the hostname of the node this rank is running on */
    struct utsname uts;
    uname(&uts);
    printf("%d at %s\n", rank, uts.nodename);

    MPI_Finalize();
    return 0;
}

MPI on CloudLab
Compile the program and copy it, along with the machinefile, to the other node:
  mpicc gethostname.c -o gethostname
  scp gethostname mpiuser@<the other node>:/home/mpiuser
  scp nodelist mpiuser@<the other node>:/home/mpiuser
Run with two processes:
  mpirun -np 2 -machinefile nodelist ./gethostname
  mpirun -np 2 -machinefile nodelist --map-by node ./gethostname
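Depending on how slots are assigned, the first mpirun may place both ranks on the first node; --map-by node maps ranks round-robin across nodes instead. Given the printf in the example program, the expected output is one line per rank (a sketch, using the hostname placeholders from above):
  0 at <hostname of first node>
  1 at <hostname of second node>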

Assignment Ideas
- Develop a work queue using various allocation strategies: normal, cyclic, and dynamic (a sketch of the dynamic strategy follows below).
- Set up an MPI cluster with nodes on separate sites, with reduced network connectivity between them, and evaluate the performance of the different allocation strategies.
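A minimal sketch of the dynamic (master-worker) allocation strategy in MPI. The task itself (squaring an integer), the task count, and the tags are placeholders for illustration, not part of the original assignment:

/* dynamic work queue: rank 0 hands out tasks on demand */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   100   /* placeholder task count */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {               /* master */
        int next = 0, active = 0, result;
        MPI_Status st;
        /* seed every worker with one task, or stop it if none remain */
        for (int w = 1; w < size; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* whichever worker finishes first gets the next task */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                       /* worker */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;  /* placeholder "work" */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

For the normal and cyclic strategies, each rank would instead compute its own task indices up front (contiguous blocks vs. a round-robin stride of size), with no master process needed.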

Hadoop on CloudLab
Enterprise Hadoop: Hortonworks
http://hortonworks.com/hdp/downloads/

Hadoop on CloudLab
The nodes should have the same configuration on bare-metal PCs; no dedicated network link between them is needed, since each node has a public IP.

On each node
SSH onto the node from Palmetto and change to root: sudo su -
Execute the following commands (enable NTP, disable the firewall and SELinux):
  chkconfig --list ntpd
  chkconfig ntpd on
  service ntpd start
  chkconfig iptables off
  /etc/init.d/iptables stop
  setenforce 0

On each node
Set up the Ambari repository:
  wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.2/ambari.repo -O /etc/yum.repos.d/ambari.repo
On the namenode:
  yum -y install ambari-server
  yum -y install ambari-agent
On the datanodes:
  yum -y install ambari-agent

On the namenode
Set up the Ambari server:
  ambari-server setup
Select the default for all questions; select 1 for the JDK version.
When done, start the Ambari server:
  ambari-server start

On each node
Use vim to edit /etc/ambari-agent/conf/ambari-agent.ini and change:
  hostname=<hostname of namenode as shown in list view of CloudLab>
Start the Ambari agent:
  ambari-agent start
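For reference, the part of /etc/ambari-agent/conf/ambari-agent.ini being edited looks like this (a sketch; the hostname value is the placeholder above, and all other settings stay at their defaults):
  [server]
  hostname=<hostname of namenode as shown in list view of CloudLab>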

Ambari Server (log in with admin/admin)
Walk through the cluster install wizard in the Ambari web UI (screenshots in the original slides).
Assuming you have the Ambari agents up and running, select the services to install:
  HDFS
  YARN + MapReduce2
  Tez
  ZooKeeper
  Ambari Metrics

Edit configuration as you see fit

Deploy …

Warnings due to lack of space and failed checks can be ignored.

HDFS and YARN service dashboards (screenshots in the original slides)

Tutorial
Create an HDFS home directory for your user (as the hdfs superuser):
  sudo su hdfs
  hdfs dfs -mkdir /user/<username>
  hdfs dfs -chown <username>:<username> /user/<username>
  exit    (back to <username>)
  hdfs dfs -ls /user/
Get the sample data and run the wordcount example:
  git clone https://github.com/clemsoncoe/Introduction-to-Hadoop-data.git
  cd Introduction-to-Hadoop-data
  hdfs dfs -put gutenberg-shakespeare.txt /user/<username>/
  yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-2.7.1.2.3.6.0-3796.jar wordcount gutenberg-shakespeare.txt output/
  hdfs dfs -ls output
  hdfs dfs -cat output/part-r-00000
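The wordcount output in part-r-00000 is a tab-separated list of words and their counts, one pair per line; a format-only sketch (the actual words and counts depend on the input):
  <word>    <count>
  <word>    <count>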

Assignment Ideas
- Deploy a Hadoop cluster and upload a large data set (airline on-time performance data: http://stat-computing.org/dataexpo/2009/the-data.html).
- Examine the performance of Hadoop MapReduce as data nodes are killed or added to the cluster.
- Examine the performance of Hadoop MapReduce when data nodes are located on different sites.