Expansion Plans for the Brookhaven Computer Center HEPIX – St. Louis November 7, 2007 Tony Chan - BNL

Background Brookhaven National Lab (BNL) is a U.S. government-funded multi-disciplinary research laboratory. The RACF was formed in the mid-90s to address the computing needs of the RHIC experiments, and became the U.S. Tier 1 Center for ATLAS in the late 90s. The RACF supports HENP and HEP scientific computing efforts and also provides various general services (backup, web, off-site data transfer, Grid, etc.).

Background (cont.) Growing operational complexity: local → global resource. Increasing staffing levels to handle additional responsibilities (nearly 40 FTE). Almost 9 million SI2K of computing capacity. Over 3 PB of disk storage capacity. Over 7 PB of tape storage capacity.

Staff Growth at the RACF

The Growth of the Linux Farm

Total Distributed Storage Capacity

Evolution of Space Usage (chart annotations: capacity of current data center; Intel dual- and quad-core deployed)

Evolution of Power Usage (chart annotation: existing UPS capacity)

Evolution of Power Costs (chart annotations: unexpected decrease in cost/kW-hr; estimates assume 8 cents/kW-hr)
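To make the cost estimate concrete, here is a minimal sketch of how a flat-rate assumption like 8 cents/kW-hr turns into an annual power bill. The 1 MW figure is the UPS-backed capacity mentioned elsewhere in this talk; the average load fraction is a hypothetical illustration, not a measured value.

    # Rough annual power-cost estimate under a flat electricity rate.
    # Assumptions (illustrative only): 1 MW UPS-backed capacity, 70% average load.
    rate_per_kwh = 0.08          # 8 cents/kW-hr, as assumed in the slide
    capacity_kw = 1000           # 1 MW of UPS-backed power
    avg_load_fraction = 0.70     # hypothetical average utilization of that capacity
    hours_per_year = 24 * 365

    annual_kwh = capacity_kw * avg_load_fraction * hours_per_year
    annual_cost = annual_kwh * rate_per_kwh
    print(f"~{annual_kwh:,.0f} kWh/year -> ${annual_cost:,.0f}/year at 8 cents/kWh")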

How did we get in trouble? Bought more hardware than planned because of unexpectedly favorable prices. Failed to prepare adequately for the increase in power/cooling needs (items that require long delivery times). Inefficient use of available infrastructure (4,500 square feet of total space and 1 MW of UPS-backed power). Increasing cost per kW-hr (5 → 7-10 cents) since 2003. Running out of space, power and cooling in the current facility.

What are we doing about it? More efficient use of current data center resources. Emphasize power efficiency in new purchases. Renovating 2,000 sq. ft. and adding 300 kW of power for the RACF (available in October 2008). Building an additional 7,000 sq. ft. with 1.5 MW of power (available in summer 2009).

Improvements to Current Data Center Better layout to maximize floor space. Additional rack-top cooling units for "hot spots". Additional PDU/UPS units (up to 240 kW) to complement the existing UPS. Use of 3-phase power (208V/30A) to maximize the usable capacity of the PDUs.
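As a back-of-the-envelope check, here is a minimal sketch of what a 208V/30A three-phase circuit buys: per-PDU capacity follows the standard three-phase power formula, with the 80% continuous-load derating commonly applied to branch circuits. The per-node wattage is a hypothetical figure for illustration.

    import math

    # Usable capacity of a 3-phase rack PDU on a 208V/30A branch circuit.
    line_voltage = 208      # volts, line-to-line
    breaker_amps = 30       # amps per phase
    derating = 0.8          # 80% continuous-load derating (common practice)

    pdu_capacity_w = math.sqrt(3) * line_voltage * breaker_amps * derating
    print(f"Usable PDU capacity: ~{pdu_capacity_w/1000:.1f} kW")   # ~8.6 kW

    # How many 1-U nodes fit on one such PDU (hypothetical 350 W per node)?
    watts_per_node = 350
    print(f"Nodes per PDU: {int(pdu_capacity_w // watts_per_node)}")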

Rack-Top Cooling Units

Data Center Layout in 2007

Power Efficiency Deploy multi-core processors. Investigate blade servers. DC-powered servers. Virtualization. Mobile data centers. Other power saving techniques.

Multi-core processors First purchase of AMD dual-core (Opteron 265) in 2006: 20% power savings when compared to the previous generation of single-core Intel Xeon processors (3.4 GHz). First purchase of Intel quad-core (Xeon E5335) in 2007. Improved SI2K/Watt should translate into further power savings.
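As an illustration of the SI2K/Watt metric used throughout this talk, here is a minimal sketch comparing two hypothetical worker nodes. The benchmark ratings and wattages below are placeholder numbers, not measurements of the Opteron 265 or Xeon nodes; only the metric itself comes from the slide.

    # SI2K per Watt: the efficiency metric used to compare node generations.
    # All numbers below are hypothetical placeholders for illustration.
    nodes = {
        "single-core node (hypothetical)": {"si2k": 3000, "watts": 400},
        "dual-core node (hypothetical)":   {"si2k": 5500, "watts": 320},
    }

    for name, n in nodes.items():
        print(f"{name}: {n['si2k'] / n['watts']:.1f} SI2K/Watt")

    # Relative power saving for the same delivered SI2K capacity:
    old, new = nodes.values()
    watts_old = old["watts"] / old["si2k"]   # watts per SI2K, old generation
    watts_new = new["watts"] / new["si2k"]   # watts per SI2K, new generation
    print(f"Power per SI2K reduced by {100 * (1 - watts_new / watts_old):.0f}%")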

SI2K per Watt Improvements (chart annotations: AMD dual-core deployed; Intel dual- and quad-core deployed)

Blade Servers Better SI2K/Watt than 1-U servers (51.1 vs. 40.9, according to IBM's power calculator). Increased density and power requirements (up to 17.5 kW/rack) are a big problem. Plan to test blades with real-life ATLAS applications. Hardware capacity (disk, RAM, etc.) is a drawback for blades compared to 1-U servers.
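A minimal sketch of what the 51.1 vs. 40.9 SI2K/Watt figures imply at the rack level, assuming the 17.5 kW/rack blade density quoted in the slide and a hypothetical 10 kW/rack budget for 1-U servers (the 1-U rack power figure is an assumption, not from the slide).

    # Compare rack-level compute capacity: blades vs. 1-U servers.
    blade_si2k_per_watt = 51.1    # from IBM's power calculator (per the slide)
    oneu_si2k_per_watt  = 40.9

    blade_rack_kw = 17.5          # blade rack power density quoted in the slide
    oneu_rack_kw  = 10.0          # hypothetical 1-U rack power budget (assumption)

    blade_rack_si2k = blade_si2k_per_watt * blade_rack_kw * 1000
    oneu_rack_si2k  = oneu_si2k_per_watt  * oneu_rack_kw  * 1000

    print(f"Blade rack: ~{blade_rack_si2k/1e3:.0f} kSI2K at {blade_rack_kw} kW")
    print(f"1-U rack:   ~{oneu_rack_si2k/1e3:.0f} kSI2K at {oneu_rack_kw} kW")
    # More SI2K per rack, but only if the facility can power and cool 17.5 kW/rack.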

DC-powered servers DC-powered servers are made by a few suppliers. Steep up-front costs for a DC power distribution system make them suitable only for large installations or new buildings. An alternative is to place rectifiers between the AC source and the server rack. A DC-powered server with a rectifier from Rackable yielded only 5% savings, which is not very significant.

Virtualization Virtualization may help maximize cluster usage and minimize the need for more hardware. Collapsing multiple applications onto fewer servers and clustering them for failover protection yields power and space savings. Extensive evaluation of Xen and VMware at BNL over the past year. Initial deployment beginning now. Not a cure-all: not recommended for certain applications.
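A minimal sketch of the consolidation arithmetic behind the power and space argument. The utilization figures, per-server wattage, and failover headroom below are hypothetical assumptions for illustration; they are not BNL measurements.

    import math

    # Estimate how many physical hosts a set of lightly loaded service
    # servers could collapse onto, and the resulting power savings.
    # All inputs are hypothetical illustration values.
    n_services = 20               # standalone service servers today
    avg_utilization = 0.10        # each uses ~10% of a host on average
    headroom = 0.30               # keep 30% spare capacity for failover/peaks
    watts_per_host = 350          # hypothetical per-server power draw

    hosts_needed = math.ceil(n_services * avg_utilization / (1 - headroom))
    saved_watts = (n_services - hosts_needed) * watts_per_host

    print(f"Consolidate {n_services} servers onto {hosts_needed} virtualization hosts")
    print(f"Estimated saving: ~{saved_watts/1000:.1f} kW and {n_services - hosts_needed} rack units")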

Mobile Data Centers Used by large financial institutions for high-demand, short-duration needs (Project Blackbox by Sun, among other suppliers). A shipping container with 2,000 cores that is mobile and easy to deploy. Not seriously considered at BNL: issues with protection of sensitive data, integration with existing hardware, and incompatible computing models. Does not fully address our power/space problems.

Other Power Saving Techniques New CPUs have frequency-scaling features (AMD's PowerNow! and Intel's SpeedStep), but they are not used at BNL because most of our servers are utilized at the >80% level. Most suppliers provide low-efficiency (65%-75%) power supplies (PS); high-efficiency PS (>85%) are available at a higher cost. Metered rack PDUs have been required in purchases since 2006, in order to measure and (later) collect historical power information and understand the dynamic power load.
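A minimal sketch of the power-supply efficiency argument: the wall draw is the IT load divided by the supply efficiency, so moving from a ~70% to a >85% supply saves a predictable fraction per server. The per-server load and farm size are hypothetical illustration values.

    # Wall power drawn for a given IT (DC-side) load at two PSU efficiencies.
    it_load_watts = 300          # hypothetical DC-side load of one server
    n_servers = 500              # hypothetical farm size

    def wall_watts(load, efficiency):
        """Power drawn from the wall to deliver `load` watts to the server."""
        return load / efficiency

    low  = wall_watts(it_load_watts, 0.70)   # low-efficiency supply (65%-75%)
    high = wall_watts(it_load_watts, 0.85)   # high-efficiency supply (>85%)

    per_server_saving = low - high
    print(f"Per server: {low:.0f} W vs {high:.0f} W -> saves {per_server_saving:.0f} W")
    print(f"Farm-wide: ~{per_server_saving * n_servers / 1000:.0f} kW saved")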

New Data Center Raised floor increased (12 → 36 inches) for higher cooling air flow. Cable trays (above or below the raised floor) for improved cable management and air flow. Building properly designed (reinforced raised floor, large power and cooling capacities, 13-ft ceilings, proper ventilation and insulation, etc.). Dedicated to meeting RACF computing needs until 2014.

Why We Need Better Cable Management

Data Center Layout in 2009

Beyond 2014 Long-range plan for a 25,000 square foot new data center to serve all of BNL. Part of the BNL long-range plan, but not funded at present. We expect RACF computing requirements to exceed existing data center capacity by 2014, even including the new space available in 2009.

Beyond 2014 (cont.) Computing requirements for the LEP and FNAL programs were underestimated by a factor of 10 early in their programs, with similar underestimates for the RHIC program. It would be wise for the data center infrastructure to have the potential capacity to exceed "long-term" RHIC and ATLAS requirements, even if that capacity is not available right away.

Summary The growth of the RACF in the past few years exposed severe infrastructure problems in the data center. We learned some valuable lessons on maximizing use of existing infrastructure (and some lessons on what NOT to do). We are actively evaluating new technologies and approaches to operating a sustainable data center, and upgrading the existing infrastructure to stretch the facility until 2009.

Summary (cont.) Renovated space and additional power available in October 2008 will provide some breathing room. A new data center dedicated to the RACF will be available in summer 2009 and is expected to meet our needs until 2014. A new facility will be needed after 2014.