Plans for Exploitation of the ORNL Titan Machine
Richard P. Mount
ATLAS Distributed Computing Technical Interchange Meeting, May 17, 2013

Titan Overview

The Oak Ridge Leadership Computing Facility (OLCF) has completed the first phase of an upgrade of the Jaguar system that will result in a hybrid-architecture Cray XK7 system named Titan, with a peak theoretical performance of more than 20 petaflops. Titan will be the first major supercomputing system to use a hybrid architecture, one that combines conventional 16-core AMD Opteron CPUs with unconventional NVIDIA Kepler graphics processing units (GPUs). The combination of CPUs and GPUs will allow Titan and future systems to overcome the power and space limitations inherent in previous generations of high-performance computers.

Titan Specifications

Titan system configuration:
• Architecture: Cray XK7
• Processor: 16-core AMD Opteron
• Cabinets: 200
• Nodes: 18,688
• Cores/node: 16
• Total cores: 299,008 Opteron cores
• Memory/node: 32 GB (CPU) + 6 GB (GPU)
• Memory/core: 2 GB
• Interconnect: Gemini
• GPUs: 18,688 NVIDIA K20 Keplers
• Speed: 20+ PF
• Square footage: 4,352 sq ft
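As a rough cross-check of these figures, the totals follow directly from the per-node numbers. The short Python sketch below reproduces them; note that the per-GPU and per-core peak-performance values used to estimate the aggregate speed are typical published numbers assumed here for illustration, not figures taken from the slide.

# Back-of-envelope check of the Titan configuration listed above.
# The per-GPU and per-core peak figures are assumptions (typical published
# numbers for a K20-class Kepler and a 2.2 GHz Opteron core), not values
# from the slide.

nodes = 18_688
cores_per_node = 16
mem_per_node_gb = 32               # CPU memory; each node adds 6 GB on the GPU

total_cores = nodes * cores_per_node
total_cpu_mem_tb = nodes * mem_per_node_gb / 1024

gpu_peak_tflops = 1.31             # assumed double-precision peak per Kepler
cpu_core_peak_gflops = 8.8         # assumed: 2.2 GHz x 4 FLOP/cycle per core

peak_pflops = (nodes * gpu_peak_tflops
               + total_cores * cpu_core_peak_gflops / 1000) / 1000

print(f"Total Opteron cores: {total_cores:,}")           # 299,008, as listed
print(f"Total CPU memory:    {total_cpu_mem_tb:.0f} TB")
print(f"Estimated peak:      {peak_pflops:.1f} PF (consistent with '20+ PF')")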

Current Status

Announcement: Beginning Friday, April 12th at 8:00 am, Titan was removed from production while the OLCF conducts the final tests to ensure that the system is ready for your science experiments. We expect Titan to be unavailable to users for final checkout and testing through mid- to late May. Further updates will be sent in the weekly messages and posted on the …

Titan Overview

• Operating System: Titan employs the Cray Linux Environment as its OS. This consists of a full-featured version of Linux on the login nodes and a Compute Node Linux microkernel on the compute nodes. The microkernel is designed to minimize partition overhead, allowing scalable, low-latency global communications.

• File Systems: The OLCF's center-wide Lustre file system, named Spider, is available on Titan for computational work. With over 52,000 clients and 10 PB of disk space, it is the largest-scale Lustre file system in the world.
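Because Spider is a shared Lustre file system, how large files are striped across storage targets can matter for I/O throughput. The following is a minimal sketch using the standard Lustre lfs client tool (assumed to be available on the login nodes, as on any Lustre mount); the directory path is a placeholder, not a documented Titan location.

# Minimal sketch: inspect and adjust Lustre striping for a directory on a
# center-wide Lustre file system such as Spider. Assumes the standard 'lfs'
# client tool is in PATH; the path below is a placeholder.
import subprocess

scratch_dir = "/path/to/lustre/user-work-area"   # placeholder path

# Show the current stripe settings for the directory.
subprocess.run(["lfs", "getstripe", scratch_dir], check=True)

# Stripe new files in this directory across 8 storage targets, which
# typically helps throughput for large simulation output files.
subprocess.run(["lfs", "setstripe", "-c", "8", scratch_dir], check=True)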

(from the) Titan User Guide

Cray supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Titan nodes as existing in one of three categories: login nodes, service nodes, or compute nodes.

Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. When a job is submitted to the batch system, the job submission script is first executed on a service node. Any job submitted to the batch system is handled in this way, including interactive batch jobs (e.g. via qsub -I). On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance. Only User Work (Lustre) and Project Work (Lustre) storage areas are available to compute nodes.

Login and service nodes use AMD's Istanbul-based processor, while compute nodes use the newer Interlagos-based processor. Interlagos-based processors include instructions not found on Istanbul-based processors, so …
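The flow described here (submit from a login node, the batch script runs on a service node, and aprun launches the binary on compute nodes) can be made concrete with a small sketch that writes a PBS script and submits it with qsub. The account name, resource request, and executable are placeholders, and the exact directives should be checked against the Titan User Guide rather than taken from this sketch.

# Minimal sketch of the submission flow described above. The account name,
# node count, walltime, and executable are placeholders; check the Titan
# User Guide for the exact PBS directives in use.
import subprocess
import textwrap

job_script = textwrap.dedent("""\
    #!/bin/bash
    #PBS -A PROJ123                # placeholder project/account
    #PBS -l nodes=2                # two compute nodes (16 cores each)
    #PBS -l walltime=01:00:00

    # Only the Lustre User Work / Project Work areas are visible to compute
    # nodes, so the job should be submitted from (and run in) such an area.
    cd $PBS_O_WORKDIR

    # aprun copies ./my_sim to the compute nodes and runs 32 ranks,
    # 16 per node; this script itself executes on a service node.
    aprun -n 32 -N 16 ./my_sim
    """)

with open("titan_job.pbs", "w") as f:
    f.write(job_script)

# Run from a login node; qsub hands the script to the batch system.
subprocess.run(["qsub", "titan_job.pbs"], check=True)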

Timeline of ATLAS Involvement with OLCF

• December 7, 2012
  » RM is called by Ken Read (long ago an L3 grad student, now an ALICE physicist at ORNL and UTK)
  » Ken wants to explore ATLAS/ALICE/nEDM etc./Geant4/OLCF common interests
  » nEDM = neutron Electric Dipole Moment experiment, one of several ORNL activities interested in heavy Geant4 usage
• February 1, 2013
  » RM submits “Implementing and Accelerating Geant4-Based Simulations on Titan” proposal to the ALCC (ASCR Leadership Computing Challenge) program
  » Request: 10M core hours on Titan, plus 1 PB of disk storage, primarily focused on exploring use for ATLAS simulation (about 1/3 of a US T2/year)

Timeline of ATLAS Involvement with OLCF

• February 11, 2013
  » Meeting at ORNL: RM, Alexei, Sergey, Kaushik, SLAC G4 people, OLCF people, nEDM people
  » The OLCF people (e.g. Jack Wells, Director of Science at the ORNL National Center for Computational Sciences) strongly encourage ATLAS efforts to exploit Titan
• March to May 2013
  » ATLAS and Geant4 people succeed in getting accounts on OLCF
• May 2013
  » Result of ALCC application expected
• July 2013
  » Start of 1-year allocation on Titan if the ALCC proposal is approved

Plans

• Leverage the existing (ASCR+HEP funded) “Big PanDA” project, plus ALCF experience, to interface PanDA to Titan.
• Get ATLAS simulation running (ignoring GPUs):
  » SLAC Geant4 group
  » Wei Yang plus SLAC student(s)
  » Need ATLAS simulation expertise
• Start producing useful simulation (H → b bbar?)
• Explore GPU use (see the conceptual sketch after this list):
  » Communications overheads mean that multithreading on the CPU side is required to mask the latency of the GPUs
  » SLAC G4 group is now driving G4 multithreading
  » Not clear when multithreaded G4 will be validated by ATLAS
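The point that CPU-side multithreading is needed to mask GPU latency can be illustrated with a purely conceptual sketch: several CPU threads keep preparing work while a single consumer batches it for the accelerator, so the GPU side is never left waiting on any one thread. Everything below is hypothetical plain Python, with sleeps standing in for Geant4 work and kernel launches; it is not ATLAS, Geant4, or CUDA code.

# Conceptual sketch only: why several CPU threads are needed to keep an
# accelerator busy. gpu_offload() stands in for an asynchronous GPU kernel
# launch; cpu_worker() stands in for per-thread simulation work.
import queue
import threading
import time

work_queue = queue.Queue(maxsize=64)

def cpu_worker(worker_id, n_items):
    # Per-thread CPU work (e.g. preparing tracks) feeding the shared queue.
    for i in range(n_items):
        time.sleep(0.01)                 # stand-in for CPU-side computation
        work_queue.put((worker_id, i))

def gpu_offload(batch):
    # Stand-in for a batched, asynchronous GPU kernel launch.
    time.sleep(0.005 * len(batch))       # pretend transfer + kernel time

def gpu_consumer(total_items, batch_size=8):
    done, batch = 0, []
    while done < total_items:
        batch.append(work_queue.get())
        done += 1
        if len(batch) == batch_size or done == total_items:
            gpu_offload(batch)           # batching amortizes per-item latency
            batch = []

workers = [threading.Thread(target=cpu_worker, args=(w, 20)) for w in range(4)]
consumer = threading.Thread(target=gpu_consumer, args=(4 * 20,))
for t in workers + [consumer]:
    t.start()
for t in workers + [consumer]:
    t.join()
print("all CPU work batched and handed to the (simulated) GPU")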