IHEP Site Status
Jingyan Shi, Computing Center, IHEP
2015 Spring HEPiX Workshop

2 Outline
Infrastructure update
Site status
Errors we've suffered
Next plan

3 Infrastructure Update -- Since Mar
Local cluster
– CPU cores
  Newly added: 1416
  Total CPU cores of the local cluster:
– Storage
  Newly added: 810 TB
  New device: DELL MD3860F DDP disk arrays
  » Expected to decrease disk rebuild time
  Total storage: 4 PB
– Core switch
  The old Force10 switch was temporarily replaced by one borrowed from the vendor

4 Infrastructure Update -- Since Mar (cont.)
EGI site
– All grid services migrated to VMs on new machines
– All disks replaced by 4 TB * 24 arrays
– All storage servers replaced by new machines
– Total disk capacity increased to 940 TB
– All old worker nodes (1088 CPU cores) will be replaced by new ones

5 Outline
Infrastructure update
Site status
Errors we've suffered
Next plan

6 Scheduler -- HTCondor
A small cluster created and managed by HTCondor last month
– 400 CPU cores
– One experiment (JUNO) supported
– Running well
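As a minimal sketch of how such a cluster is used, a JUNO-style batch job could be described with a submit file along these lines; the executable, log paths and resource requests are illustrative assumptions, not the site's actual configuration.

    # sketch of an HTCondor submit description file (names are hypothetical)
    universe       = vanilla
    executable     = run_juno_sim.sh          # assumed wrapper script
    arguments      = $(Process)
    output         = logs/job_$(Process).out
    error          = logs/job_$(Process).err
    log            = logs/job.log
    request_cpus   = 1
    request_memory = 2048                     # MB
    queue 100

The file would be submitted with condor_submit and watched with condor_q, the standard HTCondor client commands.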

7 EGI site BEIJING-LCG2 Cumulative CPU Time

8 EGI site (cont.)

9 Reliability and Availability

10 IHEP Cloud established
Released on 18 November 2014
Built on OpenStack Icehouse
8 physical machines: capacity for 224 VMs
– 1 control node, 7 compute nodes
– Users apply for and get their VMs online
Three types of VM provided
– User VMs configured the same as login nodes: DNS and IP management, AFS accounts, puppet, NMS, ganglia, …
– User VMs with root access and no public IP
– Administrator VMs with root access and a public IP
Current status
– 172 active VMs, 628 GB memory and 4.7 TB disk
– More resources will be added
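As a rough illustration of the self-service workflow on an Icehouse-era cloud, a user VM can be requested from the command line roughly as follows; the image, flavor, network and instance names are placeholders, not IHEP's actual values.

    # request a user VM on an Icehouse-era OpenStack cloud (names are hypothetical)
    nova boot \
        --image SL6-worker \
        --flavor m1.medium \
        --key-name mykey \
        --nic net-id=<internal-net-uuid> \
        juno-dev01

    nova list                 # wait for the instance to reach ACTIVE
    ssh root@<assigned-ip>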

11 Outline
Infrastructure update
Site status
Errors we've suffered
Next plan

12 Errors we’ve suffered – core switch Service/Device Online: 2006 ~ , So Long Time… Capability problems – Port density – Backplane switching capability Despite of the problems, it had been running stable Suffered Reliability problems in 16*10G cards by the end of last year – No Switching – Loss Packets – No backup cards from the vendor

13 Errors we’ve suffered -- core switch (cont.) Why ? Force10 E1200 should be ended Unstable Network  Unstable Computing Choices Too many choices for vendors Least expensive technically acceptable Production Environment Evolution Production & Evolution vendors and devices Ruijie Networks: Core Switch / RJ18010K,TOR/RJ6200 Huawei Networks: Core Switch / HUAWEI12810 Brocade DELL-Force10 Ruijie won the bid

14 Errors we’ve suffered -- AFS AFS deployment …… afsdb1afsdb2afsdb3 afsfs03 afsfs04Login nodes(16) Conputing nodes(~10000) …… Desktop PC Central Switch(Force 10) 1G switch 10G switch 1. Master database: afsdb1 2. Slave database:afsdb2,afsdb3 3. Fileserver:afsdb2,afsdb3,afsfs03,afsfs04 4. Total size: 7.8TB 1. AFS client installed in all login nodes 2. Tokens when login using PAM 3. UID and GID stored in /etc/passwd file, no password in /etc/shadow 1. AFS cache set 10GB 2. Jobs scheduled to computing nodes access software lib in AFS

15 Errors we’ve suffered – AFS (cont.) AFS status – Used as home directory and software barn All HEP libraries in AFS have copies to make sure data availability Errors we suffered – AFS file service crashed down irregularly (Due to its old version?) Inconsistent replica leaded jobs failure when job read the wrong replica Failed to release replica volume when the fileserver service crashed Solutions – Add monitoring to put AFS file service under surveillance – Plan to upgrade Openafs during summer maintenance Upgrade testing is being ongoing

16 Errors we’ve suffered - network card Most of work nodes had been upgraded from SL5.5 to SL6.5 last year – The drivers of some network cards didn’t work properly Jobs failed due to un-stability of those network cards – Errors disappeared after driver upgrades

17 Outline
Infrastructure update
Site status
Errors we've suffered
Next plan

18 Next plan -- HTCondor
Monitoring and optimization
– Integration with our job monitoring and accounting tool
– Optimization still needs to be done
The scale will be expanded
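One lightweight way to feed an accounting tool, as a sketch: query finished jobs from the HTCondor history and aggregate wall time per user. The attributes are standard ClassAd fields; how the output is ingested is left to the site's own tool.

    # owner and wall time of completed jobs, one line per job
    condor_history -constraint 'JobStatus == 4' -af Owner RemoteWallClockTime

    # current slot totals in the pool
    condor_status -total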

19 Next plan -- Storage
Hardware
– The 5-year-old disk arrays (740 TB) will be replaced by a new set of servers connected to DELL DDP disk arrays

20 Next plan -- Storage (cont.)
Software
– The 3-year-old disk arrays (780 TB) will be reconfigured into a dual-replica file system powered by GlusterFS
  Testing has been done; it is more stable
– To avoid predictable incompatibility with newly purchased hardware, all file system servers and clients will be upgraded to Lustre 2.x over SL6 this year
  Currently Lustre 1.8 over SL5, which lags behind the Linux trend
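As a sketch of the dual-replica layout, a GlusterFS volume keeping two copies of every file could be created roughly as below; the hostnames, brick paths, volume name and Lustre fsname are placeholders.

    # two-way replicated volume across pairs of bricks (names are hypothetical)
    gluster volume create bes-replica replica 2 \
        gfs01:/data/brick1 gfs02:/data/brick1 \
        gfs01:/data/brick2 gfs02:/data/brick2
    gluster volume start bes-replica

    # Lustre 2.x client mount after the upgrade (MGS node and fsname are hypothetical)
    mount -t lustre mgs01@tcp:/besfs /besfs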

21 Next plan -- Monitoring
Flume + Kibana deployed to collect the logs generated by servers and worker nodes
– Logs are collected from online devices
– It helped a lot when the NIC errors happened
– Still needs to be optimized
Nagios has been used as the main monitoring tool
– One Nagios server is not enough to show errors and their recoveries in time
– A new monitoring plan is under way
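A minimal sketch of a Flume agent that ships syslog messages from the nodes into Elasticsearch for Kibana; the port, host names and index name are assumptions, not the production settings.

    # flume agent: syslog in -> memory channel -> Elasticsearch (values are illustrative)
    agent.sources  = syslog-in
    agent.channels = mem
    agent.sinks    = es-out

    agent.sources.syslog-in.type = syslogtcp
    agent.sources.syslog-in.host = 0.0.0.0
    agent.sources.syslog-in.port = 5140
    agent.sources.syslog-in.channels = mem

    agent.channels.mem.type = memory
    agent.channels.mem.capacity = 10000

    agent.sinks.es-out.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
    agent.sinks.es-out.hostNames = es01:9300
    agent.sinks.es-out.indexName = cc-logs
    agent.sinks.es-out.channel = mem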

22 Next plan -- Puppet
The OS and software running on most devices are installed and managed by Puppet
Performance problem
– Long waiting times to upgrade server software
Optimization has been done
– Less than 40 min to upgrade the machines
– More optimization is needed
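As an illustrative sketch, the kind of manifest that keeps a worker node's software at a pinned version looks like this; the class name, package and version are hypothetical, not IHEP's actual modules.

    # hypothetical worker-node class: pin a package and keep its service running
    class cc_worker::afs_client {
      package { 'openafs-client':
        ensure => '1.6.11',          # assumed target version
      }

      service { 'afs':
        ensure    => running,
        enable    => true,
        subscribe => Package['openafs-client'],
      }
    }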

23 Next plan -- network
Dual core switches & firewalls
– High capability, stability, easy management, …
Better backbone bandwidth
– 160 Gbps (4 x 40 Gbps) for layer-2 switches
– 2 x 10 Gbps for storage nodes
Clear zones
– Internal: computing / storage / AFS / DNS / monitoring
– DMZ: public servers / login nodes / …
– Internet: network performance measurement nodes
Two networks
– Data network: high stability & capability
– Management network: high stability

24 Next Plan -- network (cont.)

25 Next plan -- IHEP Cloud
Integration with the local cluster
Jobs submitted to a dedicated queue of the local cluster will be migrated and run on the IHEP Cloud automatically (sketched below)
– Users keep running jobs the same way as with PBS
– No extra modification is required from users
– VMs can absorb short peak demands from the experiments
The VM resources will be expanded or shrunk depending on the amount of local cluster jobs
Under development; to be released next month
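A hypothetical sketch of the elastic-VM idea described above: watch the length of the dedicated batch queue and grow the pool of cloud worker VMs accordingly. The queue name, image, flavor, limits and the use of the nova CLI are all assumptions; shrinking idle workers (nova delete) is omitted for brevity.

    #!/usr/bin/env python
    # grow the cloud worker pool with the backlog of a dedicated PBS queue (sketch only)
    import subprocess

    QUEUE = "vmqueue"        # dedicated local-cluster queue (assumed name)
    MAX_VMS = 50             # upper bound on cloud workers
    JOBS_PER_VM = 4          # job slots provided by one VM

    def queued_jobs():
        """Count jobs waiting in the dedicated queue (torque-style qstat output assumed)."""
        out = subprocess.check_output(["qstat", QUEUE]).decode()
        waiting = 0
        for line in out.splitlines():
            fields = line.split()
            # standard qstat columns: id, name, user, time, state, queue
            if len(fields) >= 6 and fields[-2] == "Q":
                waiting += 1
        return waiting

    def running_vms():
        """Count cloud workers already booted (instances named worker-*, by convention)."""
        out = subprocess.check_output(["nova", "list"]).decode()
        return sum(1 for line in out.splitlines() if "worker-" in line)

    def scale():
        wanted = min(MAX_VMS, (queued_jobs() + JOBS_PER_VM - 1) // JOBS_PER_VM)
        for i in range(running_vms(), wanted):
            # boot one more worker VM; image/flavor names are placeholders
            subprocess.check_call(["nova", "boot", "--image", "SL6-worker",
                                   "--flavor", "m1.medium", "worker-%03d" % i])

    if __name__ == "__main__":
        scale()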

26 Thank you!
Questions?