Status of IHEP Site
Yaodong CHENG, Computing Center, IHEP, CAS
2016 Fall HEPiX Workshop
Outline
- Local computing cluster
- Grid Tier 2 site
- IHEP Cloud
- Internet and domestic network
- Next steps
Local computing cluster
- Supports BESIII, Daya Bay, JUNO, astrophysics experiments, etc.
- Computing
  - ~13,500 CPU cores, 300 GPU cards
  - Mainly managed by Torque/Maui; 1/6 has been migrated to HTCondor
  - HTCondor will replace Torque/Maui this year
- Storage
  - 5 PB of LTO4 tapes managed by CASTOR 1
  - 5.7 PB of Lustre; another 2.5 PB will be added this year
  - 734 TB of GlusterFS with replication
  - 400 TB of EOS
  - 1.2 PB of other disk space
Local computing cluster (2)
- Automatic deployment: Puppet + Foreman
- Monitoring: Icinga (Nagios), Ganglia, Flume + Elasticsearch (a query sketch follows this slide)
- Annual resource utilization rate: ~48%
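Since the monitoring data collected through Flume lands in Elasticsearch, a minimal query sketch with the Elasticsearch Python client is shown below; the host, index pattern and field names are assumptions for illustration, not the actual IHEP monitoring schema.

```python
from elasticsearch import Elasticsearch

# Hypothetical monitoring cluster and index layout (illustration only).
es = Elasticsearch("http://monitor.ihep.ac.cn:9200")

# Average 15-minute load of the worker nodes over the last hour.
resp = es.search(
    index="cluster-metrics-*",
    body={
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "aggs": {"avg_load": {"avg": {"field": "load15"}}},
        "size": 0,
    },
)
print("average load15 over the last hour:",
      resp["aggregations"]["avg_load"]["value"])
```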
HTCondor Cluster (1/2)
- 153 blade servers with 2,500 CPU cores
- New sharing policy to support 3 experiments: BES, JUNO, CMS (see the query sketch after this slide)
  - CPU cores are contributed separately by the experiments
  - Each experiment has its own dedicated CPU cores and contributes some cores to one shared pool
  - Dedicated CPU cores run jobs belonging to the owning experiment
  - Shared CPU cores run jobs from all experiments to absorb peak computing demand
- Dedicated CPU cores: 888; shared CPU cores: 178; maximum CPU cores in use: 1,192
- Resource utilization has improved significantly: ~80%
- Will scale to 6,000 CPU cores in October
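As a rough illustration of how the dedicated/shared split can be inspected, the sketch below queries the pool with the HTCondor Python bindings; the central manager hostname and the "Experiment" ClassAd attribute (with the value "shared" for pooled cores) are assumptions, not the actual IHEP configuration.

```python
import collections
import htcondor  # HTCondor Python bindings

# Central manager hostname is a placeholder.
coll = htcondor.Collector("condor-cm.ihep.ac.cn")

# "Experiment" is an assumed custom ClassAd attribute advertised by each
# slot: the owning experiment (BES, JUNO, CMS) or "shared" for pooled cores.
ads = coll.query(htcondor.AdTypes.Startd, "true",
                 ["Machine", "Experiment", "State", "Cpus"])

total = collections.Counter()
busy = collections.Counter()
for ad in ads:
    owner = ad.get("Experiment", "unknown")
    cpus = ad.get("Cpus", 1)
    total[owner] += cpus
    if ad.get("State") == "Claimed":
        busy[owner] += cpus

for owner in sorted(total):
    print(f"{owner:>8}: {busy[owner]}/{total[owner]} cores busy")
```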
HTCondor Cluster (2/2)
- Central database stores device information
  - The only place where the scheduler, monitoring and Puppet get their configuration information
  - Worker node info includes supported experiments, user priority, current status, etc.
- Worker node info is published via an Apache server in near real time (<5 minutes)
  - HTCondor machines update their configuration via crond (see the sketch below)
  - The cluster keeps working properly even if both the database and Apache crash
- Failed nodes detected by the monitoring system are excluded automatically
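As a rough sketch of the crond-driven update path, the script below pulls a node's configuration fragment from the Apache server and reconfigures the local HTCondor daemons; the URL, file paths and failure behaviour are assumptions rather than the production implementation.

```python
#!/usr/bin/env python3
"""Run from crond on every HTCondor machine (illustrative sketch only)."""
import socket
import subprocess
import urllib.request

# Hypothetical endpoint where Apache publishes per-node configuration
# generated from the central database.
CONFIG_URL = "http://condor-config.ihep.ac.cn/nodes/{}.conf"
LOCAL_CONF = "/etc/condor/config.d/99-central-db.conf"

def update_config(hostname):
    try:
        with urllib.request.urlopen(CONFIG_URL.format(hostname), timeout=30) as resp:
            new_conf = resp.read()
    except OSError:
        # Apache or the database is unreachable: keep the last good local
        # copy, so the cluster keeps working through such failures.
        return
    with open(LOCAL_CONF, "wb") as fh:
        fh.write(new_conf)
    # Ask the local daemons to reread their configuration.
    subprocess.run(["condor_reconfig"], check=False)

if __name__ == "__main__":
    update_config(socket.gethostname())
```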
Hadoop data analysis platform
- In operation for more than one year
  - Old system: 84 CPU cores, 35 TB disk
  - New system: 120 CPU cores, 150 TB disk
- Supports BESIII and cosmic ray experiments
- Test results with a cosmic ray experiment application (medea++ 3.10.02)
  - CPU utilization on Gluster is much lower than on HDFS
  - HDFS job running time is about one-third of that on Gluster/Lustre
  [Charts: job running time and CPU utilization, HDFS vs. Lustre/Gluster]
EOS deployment
- Two instances
- One for batch computing (LHAASO experiment)
  - 3 servers with 86 x 4 TB SATA disks
  - 3 Dell disk array boxes (RAID 6)
  - Each server has a 10 Gb network link
- One for user and public services
  - IHEPBox based on ownCloud: user file synchronization
  - Web server and other public services
  - 4 servers, each with 12 disks and a 1 Gb link
  - Replication mode
- Plan to support more applications
Grid Tier 2 Site (1)
- CMS and ATLAS Tier 2 site
- CPU: 888 cores
  - Intel E5-2680 v3: 696 cores
  - Intel X5650: 192 cores
  - Batch: Torque/Maui
- Storage: 940 TB
  - DPM: 400 TB (5 array boxes, 24 x 4 TB slots each, RAID 6)
  - dCache: 540 TB (8 array boxes with 24 x 4 TB slots and 1 array box with 24 x 3 TB slots, RAID 6)
- Network: 5 Gbps link to Europe via ORIENTplus and 10 Gbps to the US
Grid Tier 2 Site (2)
- Site statistics for the last half year
  - CPU time: 29,445,996 HEPSPEC06 hours
BESIII Distributed Computing
- During the August summer maintenance, the DIRAC server was successfully upgraded from v6r13 to v6r15 (a DIRAC API submission sketch follows the table below)
  - Preparing to support multi-core jobs in the near future
- VMDIRAC has been upgraded to 2.0, which greatly simplifies the procedure for adopting new cloud sites
- A new monitoring system has been put into production, giving a clear view of real-time site status
- Total resources: 14 sites, ~3,000 CPU cores, ~500 TB storage
- Also supports CEPC and other experiments

#   Site Name             CPU Cores   Storage
1   CLOUD.IHEP.cn         210         214 TB
2   CLUSTER.UCAS.cn       152
3   CLUSTER.USTC.cn       200 ~ 600   24 TB
4   CLUSTER.PKU.cn        100
5   CLUSTER.WHU.cn        120 ~ 300   39 TB
6   CLUSTER.UMN.us        768         50 TB
7   CLUSTER.SJTU.cn
8   CLUSTER.IPAS.tw       300
9   GRID.JINR.ru          100 ~ 200   30 TB
10  GRID.INFN-Torino.it   200         60 TB
11  CLOUD.TORINO.it       128
12  CLUSTER.SDU.cn
13  CLUSTER.BUAA.cn
14  GRID.INFN-ReCas.it    50
15  CLOUD.CNIC.cn
16  CLUSTER.NEU.tr
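As a hedged illustration of how jobs reach these sites through DIRAC, the sketch below uses the standard DIRAC Python API; the executable, sandbox files and destination site are placeholders, not the BESIII production workflow.

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)  # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("bes_test_job")
job.setExecutable("simulate.sh", arguments="run0001.cfg")  # placeholder script
job.setOutputSandbox(["std.out", "std.err"])
job.setDestination("CLUSTER.UCAS.cn")  # one of the sites in the table above

dirac = Dirac()
result = dirac.submitJob(job)  # older releases expose this as submit()
if result["OK"]:
    print("Submitted job", result["Value"])
else:
    print("Submission failed:", result["Message"])
```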
IHEP Cloud
- IHEP private cloud provides cloud services for IHEP users and computing demands
  - Based on OpenStack, launched in May 2014
  - Upgraded from Icehouse to Kilo in 2016
  - Single sign-on with IHEP UMT (OAuth 2.0)
- Resources
  - User self-service virtual machine service: 21 computing nodes, 336 CPU cores
  - Dynamic virtual computing cluster: 28 computing nodes, 672 CPU cores
    - Supports BESIII, JUNO, LHAASO, …
  - Will add 768 CPU cores this year
- More details in Cui Tao's talk
VCondor
[Diagram: users submit jobs to the HTCondor pool; VCondor checks the pool load, requests a resource allocation from VQuota according to the resource allocation policies, starts/deletes VMs on OpenStack, OpenNebula, …, and the new VMs join the pool and fetch jobs]
- GitHub: https://github.com/hep-gnu/VCondor
- A simplified control-loop sketch follows
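The sketch below is a highly simplified version of the scale-up/scale-down loop shown in the diagram, using the HTCondor Python bindings and the OpenStack SDK; the quota value, image, flavor, network and naming scheme are assumptions for illustration, and the real logic (including VQuota) lives in the repository linked above.

```python
import time

import htcondor
import openstack

QUOTA = 50                      # assumed per-experiment VM quota (VQuota's role)
VM_PREFIX = "vcondor-worker-"   # assumed naming scheme for managed VMs

conn = openstack.connect(cloud="ihep")   # clouds.yaml entry (assumed)
schedd = htcondor.Schedd()

def idle_jobs():
    """Number of idle jobs in the queue (JobStatus == 1)."""
    return len(schedd.query("JobStatus == 1", ["ClusterId"]))

def managed_vms():
    return [s for s in conn.compute.servers() if s.name.startswith(VM_PREFIX)]

while True:
    idle, vms = idle_jobs(), managed_vms()
    if idle > 0 and len(vms) < QUOTA:
        # Scale up: boot a worker VM, which joins the pool and fetches jobs.
        conn.compute.create_server(
            name=VM_PREFIX + str(int(time.time())),
            image_id=conn.compute.find_image("condor-worker").id,   # assumed image
            flavor_id=conn.compute.find_flavor("m1.large").id,
            networks=[{"uuid": conn.network.find_network("cluster-net").id}],
        )
    elif idle == 0 and vms:
        # Scale down: delete one VM to release quota when the queue drains.
        conn.compute.delete_server(vms[0].id)
    time.sleep(60)
```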
Internet and Domestic Network
- IHEP-EUR: 10 Gbps
- IHEP-USA: 10 Gbps
- IHEP-Asia: 2.5 Gbps
- IHEP-Univ: 10 Gbps
- SDN@IHEP: see Zeng Shan's talk
High Performance Cluster
- A new heterogeneous hardware platform: CPU, Intel Xeon Phi, GPU
- Parallel programming support: MPI, OpenMP, CUDA, OpenCL, … (a minimal MPI example follows this slide)
- Potential use cases: simulation, partial wave analysis, …
- SLURM as the scheduler
  - Test bed is ready: version 16.05
  - Virtual machines: 1 control node, 4 computing nodes
  - Physical servers: 1 control node, 26 computing nodes (2 GPU servers included)
  - Scheduler evaluation underway: two scheduling algorithms evaluated, sched/backfill and sched/builtin
  - Integration with DIRAC underway
- Network architecture and technologies
  - InfiniBand network for the HPC test bed has already been built
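To illustrate the kind of parallel job the test bed targets, here is a minimal MPI example with mpi4py; the partition name and task count in the srun comment are assumptions, not the test bed's actual configuration.

```python
# Minimal MPI example: each rank sums its own slice of a toy problem,
# then rank 0 combines the partial results.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's ID within the job
size = comm.Get_size()      # total number of MPI processes

local_sum = sum(range(rank, 1000, size))
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks computed total = {total}")

# Submitted to the SLURM test bed with something like:
#   srun --partition=hpc --ntasks=32 python mpi_sum.py
```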
High Performance Cluster architecture
[Diagram: remote sites and local users submit jobs through a login farm; the local cluster (1000 CPU cores, 150 GPU cards, 50 Xeon Phi cards) and 700 TB of EOS/NFS storage are interconnected by an InfiniBand FDR 56 Gb/s fabric (Mellanox SX6012/SX6036) and 10/40 Gb/s Ethernet (Brocade 8770), with 112 Gb/s and 80 Gb/s aggregate uplinks]
Next steps
- ~2.5 PB of available storage will be added
- Migration from PBS to HTCondor will be completed by the end of this year
- IHEP will provide routine maintenance services to more remote sites
- Modify CASTOR 1 to integrate new LTO6 tapes
- The HPC cluster will be provided next year
Thank you! chyd@ihep.ac.cn