Issues on the operational cluster 1 Up to 4.4x times variation of the execution time on 169 cores Using -O2 optimization flag Using IBM MPI without efficient.

Slides:



Advertisements
Similar presentations
The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce.
Advertisements

Copyright © 2008 SAS Institute Inc. All rights reserved. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks.
Program Analysis and Tuning The German High Performance Computing Centre for Climate and Earth System Research Panagiotis Adamidis.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
Spark: Cluster Computing with Working Sets
Search at Scale Hadoop, Katta & Solr the smores of Search.
Large Scale File Distribution Troy Raeder & Tanya Peters.
K.Harrison CERN, 23rd October 2002 HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION - Overview of LHCb distributed production system - Configuration.
The Difficulties of Distributed Data Douglas Thain Condor Project University of Wisconsin
Scalability Module 6.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Presented by: Alvaro Llanos E.  Motivation and Overview  Frangipani Architecture overview  Similar DFS  PETAL: Distributed virtual disks ◦ Overview.
Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
Recent Developments and Applications
CS492: Special Topics on Distributed Algorithms and Systems Fall 2008 Lab 3: Final Term Project.
Installation and Integration of Virtual Clusters onto Pragma Grid NAIST Nara, Japan Kevin Lam 06/28/13.
HTCondor workflows at Utility Supercomputing Scale: How? Ian D. Alderman Cycle Computing.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
Artdaq Introduction artdaq is a toolkit for creating the event building and filtering portions of a DAQ. A set of ready-to-use components along with hooks.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data.
High Availability NFS on Linux Winson Wang Hewlett-Packard Company Cupertino, CA Tel:
Block1 Wrapping Your Nugget Around Distributed Processing.
MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,
1 Multiprocessor and Real-Time Scheduling Chapter 10 Real-Time scheduling will be covered in SYSC3303.
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
Condor and DRBL Bruno Gonçalves & Stefan Boettcher Emory University.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Operating System 2 Overview. OPERATING SYSTEM OBJECTIVES AND FUNCTIONS.
© 2012 IBM Corporation Platform Computing 1 IBM Platform Cluster Manager Data Center Operating System April 2013.
Distributed System Concepts and Architectures Services
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
NCEP ESMF GFS Global Spectral Forecast Model Weiyu Yang, Mike Young and Joe Sela ESMF Community Meeting MIT, Cambridge, MA July 21, 2005.
Distributed System Services Fall 2008 Siva Josyula
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE Site Architecture Resource Center Deployment Considerations MIMOS EGEE Tutorial.
CIS250 OPERATING SYSTEMS Chapter One Introduction.
Construction methods and monitoring in meta-cluster systems LIT, JINR Korenkov V.V, Mitsyn V.V, Chkhaberidze D.V, Belyakov D.V.
OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel ECE1747 – Parallel Programming Vicky Tsang.
Hadoop Joshua Nester, Garrison Vaughan, Calvin Sauerbier, Jonathan Pingilley, and Adam Albertson.
N ATIONAL E NERGY R ESEARCH S CIENTIFIC C OMPUTING C ENTER 1 Scaling Up User Codes on the SP David Skinner, NERSC Division, Berkeley Lab.
HPC HPC-5 Systems Integration High Performance Computing 1 Application Resilience: Making Progress in Spite of Failure Nathan A. DeBardeleben and John.
PROOF tests at BNL Sergey Panitkin, Robert Petkus, Ofer Rind BNL May 28, 2008 Ann Arbor, MI.
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
PROOF Benchmark on Different Hardware Configurations 1 11/29/2007 Neng Xu, University of Wisconsin-Madison Mengmeng Chen, Annabelle Leung, Bruce Mellado,
Experimental Perspectives on Lasso-related Algorithms on Parallel Computing Frameworks
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Alien and GSI Marian Ivanov. Outlook GSI experience Alien experience Proposals for further improvement.
Introduction to HPC Debugging with Allinea DDT Nick Forrington
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
Apache Ignite Compute Grid Research Corey Pentasuglia.
Map reduce Cs 595 Lecture 11.
Applied Operating System Concepts
OpenPBS – Distributed Workload Management System
By Chris immanuel, Heym Kumar, Sai janani, Susmitha
Network Load Balancing Functionality
Practical aspects of multi-core job submission at CERN
Introduction to MapReduce and Hadoop
Software Engineering Introduction to Apache Hadoop Map Reduce
The Basics of Apache Hadoop
Cloud Computing.
PU. Setting up parallel universe in your pool and when (not
Presentation transcript:

Issues on the operational cluster 1 Up to 4.4x times variation of the execution time on 169 cores Using -O2 optimization flag Using IBM MPI without efficient mapping Using IBM MPI parameters Issues with the filesystem I/O issues

Variation of the execution time 2 Four faulty nodes influence the performance of the cluster Including all the nodes the execution time was between 75 and 330 seconds Exclude the following nodes #BSUB –R “select[hname!=100]” #BSUB –R “select[hname!=102]” #BSUB –R “select[hname!=104]” #BSUB –R “select[hname!=105]” The execution time without the above nodes is around to seconds. The reboot of the nodes seems to solved the problem.

Using -O2 optimization flag 3 The usage of –O3 was crashing the model (without -g). We added the –fp-model strict flag to take care the floating point operations and disable optimizations that can influence the results. Not efficient results, but not trusted also

Issues with the hardware?? 4 Issue with the node 98

Issues with the filesystem 5 In some cases no binary files are created with the output of the model Error: failed.LSF error (res): Stale NFS file handle Think about automatic submit of failed jobs?

I/O issues 6 Using IBM MPI environment variables 6 Machine GB M/sec (using only one node from the switch) n Machine GB M/sec (using two nodes from the switch) n n I/O bottlenecks easy to occur with this network configuration

Using IBM MPI environment variables 7 MP_EAGER_LIMIT=128K MP_BINDPROC=yes MP_TASK_AFFINITY=core MP_RC_MAX_QP=4096

Remarks 8 Tested both NMMB versions with and without ESMF on the new machine On operational ESMF NMMB, disable ESMF_Log calls Synchronize the time across the nodes with NTP configuration We suggest to do some more tests to run simulations with more balanced distribution of the nodes across the switches