Published by Ira Pierce. Modified over 9 years ago.
1
National Energy Research Scientific Computing Center (NERSC)
HPC in a Production Environment
Nicholas P. Cardo
NERSC Center Division, LBNL
November 19, 2003
2
Scientific Computing
– Climate
– Chemistry
– Physics
– Nano-Science
– Genomics
– Molecular Modeling
– Materials
– Simulation of Large Systems
– Algorithms Development
3
System Configuration
– 184 Compute Nodes
– 16 GPFS Nodes
– 4 Service Nodes
– 3 Login Nodes
– 1 Network/Admin Node
Storage:
– 24.7 TB Formatted SSA
– 13 Homes @ ~500 GB
– Scratch @ ~13 TB
Memory per node:
– 4 Nodes @ 64 GB
– 64 Nodes @ 32 GB
– 140 Nodes @ 16 GB
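As a quick cross-check of the configuration figures, the node counts and the memory breakdown can be totalled. This is a minimal sketch that assumes the "Nodes @ GB" entries describe memory per node across all node types; the variable names are illustrative only.

```python
# Node counts by role, as listed on the slide.
node_roles = {
    "compute": 184,
    "gpfs": 16,
    "service": 4,
    "login": 3,
    "network_admin": 1,
}

# (number of nodes, GB per node), assumed to be memory per node.
memory_groups = [(4, 64), (64, 32), (140, 16)]

total_nodes = sum(node_roles.values())
nodes_in_memory_groups = sum(count for count, _ in memory_groups)
total_memory_gb = sum(count * gb for count, gb in memory_groups)

print(f"nodes by role:        {total_nodes}")
print(f"nodes by memory size: {nodes_in_memory_groups}")
print(f"aggregate memory:     {total_memory_gb} GB")
```

Both totals come to 208 nodes, which suggests the memory breakdown covers every node in the system rather than the compute nodes alone.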
4
System Utilization (chart; axis: Hours)
5
Job Size Breakdown (chart; axis: Hours; annotation: Scaling Efforts)
6
Large Jobs (chart; axis: Percent; 50% marker; annotation: Scaling Efforts)
7
System Expanded
March 2003: The System Doubled
Difficult Decision:
– Change in operating model: single large-scale production system
– Cable length limitations required existing hardware to be relocated
– Integration with minimal disruption of service
8
System Configuration
– 380 Compute Nodes (+106%)
– 20 GPFS Nodes (+25%)
– 8 Service Nodes, 6 Login Nodes, 2 Network/Admin Nodes (+100%)
– 44.7 TB SSA Disk (+80%)
– ~33 TB Scratch (+153%)
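The growth percentages on this slide can be reproduced from the two System Configuration slides. The sketch below uses only the figures shown on those slides; the scratch sizes are approximate in the source ("~13 TB", "~33 TB"), so the computed values may differ slightly from the slide's rounded labels.

```python
# Figures from the pre-expansion and post-expansion configuration slides.
before = {
    "compute_nodes": 184,
    "gpfs_nodes": 16,
    "service_nodes": 4,
    "login_nodes": 3,
    "network_admin_nodes": 1,
    "ssa_tb": 24.7,
    "scratch_tb": 13.0,  # approximate in the source
}
after = {
    "compute_nodes": 380,
    "gpfs_nodes": 20,
    "service_nodes": 8,
    "login_nodes": 6,
    "network_admin_nodes": 2,
    "ssa_tb": 44.7,
    "scratch_tb": 33.0,  # approximate in the source
}


def percent_growth(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


growth = {name: percent_growth(before[name], after[name]) for name in before}

for name, pct in growth.items():
    print(f"{name:>20}: +{pct:.1f}%")
```

The computed values (about 106.5%, 25%, 100%, 81%, and 153.8%) line up with the slide's labels if those labels were truncated rather than rounded.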
9
SCSI Disks
– 2 x 36.4 GB SCSI drives
– Mirrored for availability
– 36.4 GB available space
(Diagram labels: rootvg (36.4 GB), 36.4 GB)
10
SSA Disks
– 16 drives per drawer
– RAID 5 for RAS
– Each node twin-tailed to five other nodes in the same frame
– 3 Groups per drawer
(Diagram labels: Hot Spare, hdisk x, hdisk y, hdisk z)
11
Networking (diagram; labels: Login Node, Network Node, Jumbo Frame, Production)
12
Fun Facts
– 39,936 DIMMs
– 7.7 TB Memory
– 832 SCSI Disks
– 29.6 TB SCSI Disks
– 6,656 Processors
– 35 Miles of Cable
– 30 Gigabit Adapters
– 210 SSA Adapters
– 3,440 SSA Disks
– 65.4 TB raw SSA
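Several of these figures can be checked against the expanded configuration and the SCSI Disks slide. The sketch below derives per-node quantities from the slide totals; the per-node values are computed here, not stated in the source, and the use of 1 TB = 1024 GB is an assumption chosen because it reproduces the 29.6 TB figure.

```python
# Node counts from the expanded System Configuration slide.
nodes = {"compute": 380, "gpfs": 20, "service": 8, "login": 6, "network_admin": 2}
total_nodes = sum(nodes.values())

# Totals from the Fun Facts slide.
processors = 6656
scsi_disks = 832
dimms = 39936

# Drive size from the SCSI Disks slide.
scsi_drive_gb = 36.4

processors_per_node = processors / total_nodes
scsi_disks_per_node = scsi_disks / total_nodes
dimms_per_node = dimms / total_nodes

# Assumption: 1 TB = 1024 GB.
scsi_total_tb = scsi_disks * scsi_drive_gb / 1024

print(f"total nodes:         {total_nodes}")
print(f"processors per node: {processors_per_node:g}")
print(f"SCSI disks per node: {scsi_disks_per_node:g}")
print(f"DIMMs per node:      {dimms_per_node:g}")
print(f"SCSI capacity:       {scsi_total_tb:.1f} TB")
```

The two SCSI disks per node agree with the "2 x 36.4 GB SCSI drives" listed on the SCSI Disks slide.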
13
System Utilization (chart; axis: Hours)
14
Job Size Breakdown (chart; axis: Hours)
15
New Batch Configuration
Class of Service: premium, regular, low, interactive, debug
Job Class: pre_128, pre_32, pre_1, reg_128, reg_32, reg_1, reg_1l, interactive, debug, low
Priority: high to low
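The relationship between job classes and classes of service can be sketched as a lookup. The mapping below is an assumption inferred purely from the class names (a `pre_` prefix for premium, a `reg_` prefix for regular); the slide does not spell it out, and `class_of_service` is a hypothetical helper, not part of any batch system.

```python
JOB_CLASSES = [
    "pre_128", "pre_32", "pre_1",
    "reg_128", "reg_32", "reg_1", "reg_1l",
    "interactive", "debug", "low",
]


def class_of_service(job_class):
    """Map a job class to its class of service (assumed from naming only)."""
    if job_class.startswith("pre_"):
        return "premium"
    if job_class.startswith("reg_"):
        return "regular"
    if job_class in ("low", "interactive", "debug"):
        return job_class
    raise ValueError(f"unknown job class: {job_class}")


# Group the job classes under their assumed class of service.
by_service = {}
for job_class in JOB_CLASSES:
    by_service.setdefault(class_of_service(job_class), []).append(job_class)

for service, classes in by_service.items():
    print(f"{service:>12}: {', '.join(classes)}")
```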
16
System Utilization (chart; axis: Hours)
17
Job Size Breakdown (chart; axis: Hours)
18
Large Jobs (chart; axis: Percent; 50% marker; annotation: allocation depletion)
19
Job Efficiency (chart; axis: Hours)
20
Performance Variation
– Performance variation problem detected.
– Original nodes appeared to perform slower than the nodes added to the system.
– Hardware was swapped between original and new nodes, with no improvement.
– Accounting showed that specific commands ran significantly more often on the original nodes.
– Four problem management definitions were found to be deactivated but still executing constantly on the original nodes.
– Analysis performed by NERSC's David Skinner.
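The accounting comparison described on this slide can be illustrated with a small sketch. This is not the analysis that was actually performed: the record format, the node names, the command names, and the threshold are all hypothetical, and the data is made up for illustration.

```python
from collections import Counter


def command_rates(records, node_group):
    """Average number of runs per node for each command, by node group."""
    counts = {}
    nodes_seen = {}
    for node, command in records:
        group = node_group[node]
        counts.setdefault(group, Counter())[command] += 1
        nodes_seen.setdefault(group, set()).add(node)
    return {
        group: {cmd: n / len(nodes_seen[group]) for cmd, n in counter.items()}
        for group, counter in counts.items()
    }


def skewed_commands(rates, suspect="original", baseline="new", ratio=5.0):
    """Commands run at least `ratio` times more often per node on `suspect`."""
    flagged = []
    for cmd, rate in rates.get(suspect, {}).items():
        base = rates.get(baseline, {}).get(cmd, 0.0)
        if base == 0.0 or rate / base >= ratio:
            flagged.append(cmd)
    return sorted(flagged)


# Hypothetical accounting data: (node, command) pairs.
node_group = {"n001": "original", "n002": "original", "n201": "new", "n202": "new"}
records = []
for node in node_group:
    records += [(node, "user_app")] * 3      # ordinary workload on every node
for node in ("n001", "n002"):
    records += [(node, "pm_check")] * 40     # hypothetical runaway command
records.append(("n201", "pm_check"))

rates = command_rates(records, node_group)
print(skewed_commands(rates))
```

Normalizing by the number of nodes in each group keeps the comparison fair when the two groups differ in size.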
21
FY04 System Utilization (chart; axis: Hours)
22
FY04 Job Size Breakdown (chart; axis: Hours)
23
FY04 Large Jobs (chart; axis: Percent; 50% marker)
24
Job Efficiency