MURI Hardware Resources
Ray Garcia, Erik Olson
Space Science and Engineering Center, University of Wisconsin - Madison

Resources for Researchers
- CPU cycles
- Memory
- Storage space
- Network
- Software: compilers, models, visualization programs

Original MURI hardware
- 16 Pentium III processors
- Storage server with 0.5 TB
- Gigabit networking
Purpose:
- Provide a working environment for collaborative development
- Enable running of the large multiprocessor MM5 model
- Gain experience working with clustered systems

Capabilities and Limitations
- Successfully ran initial MM5 model runs, algorithm development (fast model), and modeling of GIFTS optics (FTS simulator)
- MM5 model runs for 140 by 140 domains; one 270 by 270 run with very limited time steps
- OpenPBS scheduled hundreds of jobs; idle CPU time was given to FDTD raytracing (see the submission sketch below)
- Expanded to 28 processors using funding from B. Baum, IPO, and others
- However, MM5 model runtime limited domain size, and storage space limited the number of output time steps
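
As a minimal sketch of how a backlog of independent jobs can be fed to an OpenPBS scheduler like the one described above, the following Python snippet submits a set of per-case run scripts with qsub; the script names, job names, and resource request are hypothetical, not the actual MURI job mix.

```python
#!/usr/bin/env python
"""Minimal sketch of batch submission to an OpenPBS scheduler.
The run-script names and the nodes/ppn request are hypothetical."""
import subprocess

# Hypothetical per-case run scripts, e.g. one per MM5 domain configuration.
run_scripts = ["case_%03d.sh" % i for i in range(100)]

for script in run_scripts:
    result = subprocess.run(
        ["qsub",
         "-N", script.replace(".sh", ""),   # job name shown in qstat
         "-l", "nodes=1:ppn=2",             # request one dual-CPU node per job
         script],
        capture_output=True, text=True, check=True)
    # qsub prints the new job identifier on stdout.
    print("submitted", script, "->", result.stdout.strip())
```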

CY2003 Upgrade
- NASA provided funding for 11 dual-Pentium 4 processor nodes: 2.4 GHz CPUs, 4 GB DDR RAM
- Expressly purposed for running large IHOP field program simulations (400 by 400 grid point domain)

Cluster “Mark 2”
Gains:
- Larger-scale model runs and instrument simulations as needed for IHOP
- Terabytes of experimental and simulation data online through NAS-hosted RAID arrays
Limitations to further work at even larger scale:
- Interconnect limitations slowed large model runs
- 32-bit memory limitation on huge model set-up jobs for MM5 and WRF
- Increasing number of small storage arrays

3 Years of Cluster Work
Inexpensive:
- Adding CPUs to the system
Costly:
- Adding users to the system
- Adding storage to the system
Easily understood:
- Matlab
Not so well understood:
- Distributed system (computing, storage) capabilities

Along comes DURIP
- H.L. Huang / R. Garcia DURIP proposal awarded in May
- Purpose: provide hardware for next-generation research and education programs
- Scope: identify computing and storage systems to serve the need to expand simulation, algorithm research, data assimilation, and limited operational product generation experiments

Selecting Computing Hardware
- Cluster options for numerical modeling were evaluated and found to require significant time investment
- Purchased an SGI Altix in fall 2004 after extensive test runs with WRF and MM5
- Itanium2 processors running Linux, 192 GB of RAM, 5 TB of FC/SATA disk
- Recently upgraded to 32 CPUs and 10 TB of storage

SGI Altix Capabilities
- Large, contiguous RAM allows a 1600 by 1600 grid point domain (larger than the CONUS area at 4 km resolution); the largest run so far spans 1070 grid points (see the rough memory estimate below)
- NUMAlink interconnect provides fast turnaround for model runs
- Presents itself as a single 32-CPU Linux machine
- Intel compilers for ease of porting and optimizing Fortran/C on 32-bit and 64-bit hardware
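
To make the memory requirement concrete, here is a rough back-of-envelope estimate; the vertical-level and field counts are assumptions chosen for illustration, not numbers from the actual WRF/MM5 configuration.

```python
#!/usr/bin/env python
"""Back-of-envelope memory estimate for a 1600 x 1600 grid point domain.
Level and field counts are assumptions, not the actual model set-up."""

nx = ny = 1600            # horizontal grid points (from the slide)
nz = 35                   # assumed number of vertical levels
n_fields_3d = 25          # assumed number of 3-D model arrays
bytes_per_value = 4       # single-precision floats

total_bytes = nx * ny * nz * n_fields_3d * bytes_per_value
gib = total_bytes / 2**30
print(f"~{gib:.1f} GiB for the 3-D state alone")

# A 32-bit process can address at most 4 GiB, which is why the older nodes
# could not even set up a domain this large, while the Altix's 192 GB of
# globally addressable RAM holds it with room to spare.
print("fits in a 32-bit address space:", total_bytes < 4 * 2**30)
```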

Storage Class: Home Directory
- Small size, for source code (preferably also held under CVS control) and critical documents
- Nightly incremental backups (see the sketch below)
- Quota enforcement
Current implementation:
- Local disks on the cluster head
- Backup handled by TC (the Technical Computing group)
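
As one way the nightly incremental backups could be realized, here is a minimal sketch using rsync hard-link snapshots; the paths and retention scheme are hypothetical, not the procedure TC actually runs.

```python
#!/usr/bin/env python
"""Nightly incremental home-directory backup sketch using rsync --link-dest.
Paths are hypothetical; this is not the actual TC backup procedure."""
import datetime
import subprocess

SRC = "/home/"                       # hypothetical source tree
DEST = "/backup/home"                # hypothetical backup volume
snapshot = f"{DEST}/{datetime.date.today().isoformat()}"
latest = f"{DEST}/latest"            # symlink to the most recent snapshot

# Unchanged files are hard-linked against the previous snapshot, so every
# nightly snapshot looks complete while only modified files use new space.
subprocess.run(
    ["rsync", "-a", "--delete", "--link-dest", latest, SRC, snapshot],
    check=True)

# Repoint 'latest' at tonight's snapshot for tomorrow's run.
subprocess.run(["ln", "-sfn", snapshot, latest], check=True)
```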

Storage Class: Workspace
- Optimized for speed
- Automatic flushing of unused files (see the sketch below)
- No insurance against disk failure
- Users are expected to move important results to long-term storage
Current implementation:
- RAID5 or RAID0 drive arrays within the cluster systems
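
A minimal sketch of the "automatic flushing of unused files" policy: walk the scratch tree and delete anything not accessed within a retention window. The mount point and window are hypothetical; a production scrubber would also warn users before deleting.

```python
#!/usr/bin/env python
"""Workspace scrubber sketch: remove files not accessed in MAX_AGE_DAYS.
The mount point and retention window are hypothetical."""
import os
import time

WORKSPACE = "/workspace"       # hypothetical scratch mount point
MAX_AGE_DAYS = 30              # hypothetical retention window
cutoff = time.time() - MAX_AGE_DAYS * 86400

for dirpath, _dirnames, filenames in os.walk(WORKSPACE):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            if os.stat(path).st_atime < cutoff:
                os.remove(path)
                print("flushed", path)
        except OSError:
            # File disappeared or is unreadable; skip it.
            pass
```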

Storage Class: Long-term
- Large amount of space
- Redundant, preferably backed up to tape
- Managed directory system, preferably with metadata (see the catalog sketch below)
Current implementation:
- Many project-owned NAS devices with partial redundancy (RAID5)
- NFS “spaghetti”
- Ad-hoc tape backup
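
To illustrate the "managed directory system with metadata" goal, here is a minimal catalog sketch backed by SQLite; the schema, paths, and sample entry are assumptions for illustration, not an existing SSEC system.

```python
#!/usr/bin/env python
"""Minimal long-term dataset catalog sketch backed by SQLite.
Schema, paths, and the sample entry are assumptions for illustration."""
import sqlite3

conn = sqlite3.connect("datasets.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        path        TEXT PRIMARY KEY,   -- location on the archive filesystem
        project     TEXT,               -- owning project, e.g. 'IHOP'
        description TEXT,
        size_bytes  INTEGER,
        archived    TEXT                -- ISO date the dataset was archived
    )""")

# Hypothetical entry; an ingest script would add these as data is archived.
conn.execute(
    "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?, ?)",
    ("/archive/ihop/run_400x400", "IHOP",
     "400 x 400 MM5 simulation output", 750 * 2**30, "2004-06-01"))
conn.commit()

# Locating data then becomes a query instead of remembering NFS mounts.
for path, project in conn.execute("SELECT path, project FROM datasets"):
    print(project, path)
```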

DURIP phase 2: Storage
Long-term storage scaling and management goals:
- Reduce or eliminate NFS “spaghetti”
- Include a hardware phase-in / phase-out strategy in the purchase decision
- Acquire the hardware to seed a Storage Area Network (SAN) in the Data Center, improving uniformity and scalability
- Reduce overhead costs (principally human time)
- Work closely with the Technical Computing group on system setup and operations for a long-term facility

Immediate Options
- Red Hat GFS: size limitations and hardware/software mix-and-match; support costs make up for the free source code
- HP Lustre: more likely to be a candidate for workspace; expensive
- SDSC SRB (Storage Resource Broker): stability, documentation, and maturity at time of testing found to be inadequate
- Apple Xsan: plays well with third-party storage hardware; straightforward to configure and maintain; affordable

Dataset Storage Purchase Plan
- 64-bit storage servers and a metadata server
- QLogic Fibre Channel switch to move data between hosts and drive arrays
- SAN software to provide a distributed filesystem: focusing on Apple Xsan for a 1-3 year span, with a 1-year follow-up assessment and the option of re-competing
- Storage arrays: competing Apple XRAID and Western Scientific Tornado

Target System for 2006
- Scalable dataset storage accessible from clusters, workstations, and the supercomputer
- Backup strategy
- Update existing cluster nodes to ROCKS: simplifies management, improves uniformity, and is proven on other clusters deployed by SSEC
- Retire/repurpose slower cluster nodes
- Reduce bottlenecks to workspace disk
- Improve ease of use and understanding

Long-term Goals
- 64-bit shared-memory system scaled to huge job requirements (Altix)
- Complementary compute farm migrating to x86-64 (Opteron) hardware
- Improved workspace performance
- Scalable storage with full metadata for long-term and published datasets
- Software development tools for multiprocessor algorithm development