11th France China Particle Physics Laboratory workshop (FCPPL2018)


11th France China Particle Physics Laboratory workshop (FCPPL2018)
TECHNOLOGIES FOR DATA PROCESSING PLATFORMS FOR HEP EXPERIMENTS: PROGRESS REPORT
Fazhi QI, on behalf of:
IHEP CC: CHEN Gang, LI Weidong, QI Fazhi, WANG Lu, ZHANG Xiaomei, CHEN Yaodong, YAN Tian, SHI Jingyan, DU Ran, ZENG Shan, LI Haibo
CC-IN2P3: Fabio HERNANDEZ, Ghita RAHAL, Vanessa HAMAR, Fabien WERNLI, Mattieu PUEL
CPPM: Andreï TSAREGORODTSEV
22-25 May 2018, Marseille

Background
- Guiding principle: explore technologies of potential interest for the data processing needs of HEP experiments
- Partners: CPPM, IN2P3 computing center, IHEP computing center
- Funding: IHEP and IN2P3, through the FCPPL 2017 call (CNRS-NSFC joint program for international collaboration)

Topics
Topics of interest:
- Exploring alternative ways for managing filesystem metadata
- Automatic data migration based on machine learning
- Experimentation with building blocks for inter-site bulk data transfer
- DIRAC-based computing platform for IHEP experiments
- High-performance computing platforms
- Networking

Exploring alternative ways for managing filesystem metadata

Exploring alternative ways for managing filesystem metadata
Issues with current storage systems:
- Metadata and file operations are tightly coupled; such a closed system is difficult to scale
- Local data and remote data are managed separately
- Traditional RAID technology requires significant time for data recovery in case of a host failure
EOS, an open disk storage system developed by CERN:
- All LHC and most non-LHC experiment data at CERN are stored in EOS, more than 250 PB
- Characteristics: multi-protocol access; secure access with strong authentication; multi-user management

Exploring alternative ways for managing filesystem metadata (EOS at IHEP)
EOS for batch computing:
- 2 instances: LHAASO, HXMT
- FUSE-based client access
- ~1.7 PB capacity, ~60 million files, ~1.3 million directories
- +1 PB capacity planned for 2018 Q3
EOS for IHEPBox:
- A cloud disk service built from OwnCloud and EOS
- ~160 TB capacity, ~8.5 million files, ~1 million directories
(diagram: computing access via AFS accounts and web access via web accounts map to the LHAASO, HXMT and public EOS instances; two separate clusters based on different account systems)
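Since the batch instances are exposed through a FUSE mount, jobs see EOS as an ordinary POSIX filesystem. A minimal sketch of what "FUSE-based client access" means in practice; the mount point and file path below are hypothetical examples, not the actual IHEP namespace layout:

```python
# Sketch: POSIX-style access to an EOS FUSE mount from a batch job.
# Mount point and paths are placeholders.
import os

EOS_MOUNT = "/eos/lhaaso"                          # assumed FUSE mount point
sample = os.path.join(EOS_MOUNT, "user/demo/run001.dat")

# Write a small file through the FUSE client using ordinary file I/O.
os.makedirs(os.path.dirname(sample), exist_ok=True)
with open(sample, "wb") as f:
    f.write(b"\x00" * 1024)

# Read it back; from the job's point of view EOS behaves like a local filesystem.
with open(sample, "rb") as f:
    data = f.read()
print(f"read {len(data)} bytes from {sample}")
```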

Automatic Data Migration based on Machine Learning

Automatic Data Migration based on Machine Learning
- Currently both Lustre and EOS support hierarchical storage management, but data movement between the storage pools is triggered by manually defined rules or by a predefined list of files to transfer.
- Idea: a machine learning model trained on historical access records
- Input: the historical access sequence of a given file
- Output: the next access location and mode of the file
(diagram: the ML model sits between the storage tiers: remote sites, tape, SAS disk, SSD)

Automatic Data Migration based on Machine Learning (cont.)
Historical access patterns:
- Sources: EOS logs, Lustre changelogs, instrumentation scripts
- Organized as sequences of file-operation counter vectors
Machine learning model:
- Training samples come from billions of historical records
- Based on a DNN model
- 50 million EOS access records collected so far
- For predicting the future access location (login nodes vs. computing nodes), the precision is 98%
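A minimal sketch of the kind of DNN classifier described above, assuming each sample is a fixed-length sequence of per-interval operation-counter vectors derived from the access logs and the label is the next access location (login node vs. computing node). The feature layout, layer sizes and the random placeholder data are illustrative, not the actual IHEP model:

```python
# Sketch: DNN predicting the next access location of a file
# (0 = login node, 1 = computing node) from its recent access history.
import numpy as np
import tensorflow as tf

SEQ_LEN = 16      # number of past time windows per file (assumed)
N_COUNTERS = 8    # operation counters per window, e.g. open/read/write/stat (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, N_COUNTERS)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # P(next access from a computing node)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision()])

# Random placeholder data standing in for counter-vector sequences built from the logs.
x = np.random.rand(1000, SEQ_LEN, N_COUNTERS).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1))
model.fit(x, y, epochs=3, batch_size=64, validation_split=0.2)
```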

Automatic Data Migration based on Machine Learning (cont.)
Output of the machine learning model:
- Access mode: read-only or read-write mixed
- Access frequency: hot or cold
- Location: local or remote
Work in progress!
If the ML model can predict the future access mode (read-only or read-write mixed), access frequency (hot/cold) and access location (local or remote), data migration can be triggered automatically.
Table 1: File migration triggered by the predictions of the ML model
- Big file, hot → cold, read-only: disk → tape (down), delete the replica on disk
- Big file, hot → cold, read-write mixed: keep on disk, keep the replica on disk
- Small file, hot → cold, read-only: SSD → disk (down), delete the replica on SSD
- Small file, hot → cold, read-write mixed: keep the replica on SSD
- Cold → hot: disk → SSD (up)
- Has remote access: local → remote (out), delete the local replica; for read-only access, keep the local replica
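A minimal sketch of how such predictions could drive the decisions of Table 1, assuming the model outputs are already mapped to the three labels; the size threshold, function names and tier labels are illustrative placeholders, not the production policy:

```python
# Sketch: map ML predictions onto the migration actions of Table 1.
from dataclasses import dataclass

BIG_FILE_BYTES = 1 << 30   # assumed threshold separating "big" from "small" files

@dataclass
class Prediction:
    mode: str        # "read-only" or "mixed"
    frequency: str   # "hot" or "cold"
    location: str    # "local" or "remote"

def migration_action(size_bytes: int, pred: Prediction) -> str:
    """Return a human-readable migration decision for one file."""
    big = size_bytes >= BIG_FILE_BYTES
    if pred.location == "remote":
        # Remote access predicted: stage out; keep the local copy only for read-only use.
        keep = "keep" if pred.mode == "read-only" else "delete"
        return f"local -> remote (out), {keep} local replica"
    if pred.frequency == "cold":
        if pred.mode == "read-only":
            if big:
                return "disk -> tape (down), delete replica on disk"
            return "SSD -> disk (down), delete replica on SSD"
        return "keep on disk" if big else "keep replica on SSD"
    # Hot (or cold -> hot transition): promote to the faster tier.
    return "disk -> SSD (up)"

print(migration_action(5 << 30, Prediction("read-only", "cold", "local")))
```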

Experimentation with building blocks for inter-site bulk data transfer

Experimentation with building blocks for inter-site bulk data transfer
Goal: is HTTP suitable for bulk data transfer over high-latency network links?
Why HTTP?
- Standard, ubiquitous and future-proof; clients and servers can be programmed in any relevant language; customisable semantics, ...
- Confidentiality and data integrity ensured using standard TLS
Experience in LSST:
- CC-IN2P3 will exchange significant amounts of data daily with the archive center of the instrument, located at NCSA near Chicago, USA, during the foreseen 10 years of operations starting in 2022.
- Asynchronous reading of data stored in Swift and writing to GPFS showed a sustained throughput of 600 MB/s.

Experimentation with building blocks for inter-site bulk data transfer (cont.)
Results so far:
- It is indeed possible to use a modern implementation of the HTTP protocol for bulk data transfer over high-latency links.
- With recent hardware, transport-layer encryption (such as TLS) can be used to guarantee data integrity and confidentiality without a significant performance penalty.
- From the operational point of view, it is interesting to decouple the storage system used for importing/exporting data (OpenStack Swift in our case) from the one used for feeding data to the applications running locally on the compute nodes (GPFS in our case).
Further details: PRACE white paper
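A minimal sketch of an HTTP-over-TLS bulk transfer, streaming an object from a Swift-style HTTP endpoint to a local (e.g. GPFS-mounted) file. The URL, token header, paths and chunk size are hypothetical, and a production setup would add retries and parallel streams:

```python
# Sketch: stream one large object over HTTPS to a local filesystem.
# URL, token and paths are placeholders, not the actual LSST/CC-IN2P3 setup.
import requests

SRC_URL = "https://swift.example.org/v1/AUTH_demo/bulk/file-0001.dat"  # hypothetical
DST_PATH = "/gpfs/scratch/incoming/file-0001.dat"                      # hypothetical
CHUNK = 8 * 1024 * 1024                                                # 8 MiB chunks

headers = {"X-Auth-Token": "REPLACE_ME"}   # Swift-style token auth (assumed)

# TLS is used simply by targeting an https:// URL; certificate verification is on by default.
with requests.get(SRC_URL, headers=headers, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    with open(DST_PATH, "wb") as out:
        for chunk in resp.iter_content(chunk_size=CHUNK):
            out.write(chunk)
print("transfer complete")
```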

DIRAC-based computing platform for IHEP experiments

DIRAC-based computing platform for IHEP experiments
DIRAC instance at IHEP, in production for several years:
- The BESIII distributed computing system was set up and put into production in 2014 to meet the peak needs of BESIII
- Extended to support multiple VOs in 2015, e.g. JUNO, CEPC, etc.
- In 2017, jobs reached ~1 million
- Several kinds of resources: cluster, grid, cloud
- 15 sites from the USA, Italy, Russia, Turkey, Taiwan and Chinese universities (8)

DIRAC-based computing platform for IHEP experiments (cont.)
- Multi-process/multi-thread applications are becoming a trend in IHEP experiments
- Working closely with CPPM on support for multi-core jobs in DIRAC
- Two ways of multi-core job scheduling have been successfully tried and implemented in the IHEP prototype:
  (A) independent pilots for single-core and multi-core jobs
  (B) shared partitionable pilots for all jobs
- The CPU utilization rate has been studied and measured for both approaches
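A minimal sketch of submitting a multi-core job through the DIRAC Python API, assuming a DIRAC release whose Job API exposes setNumberOfProcessors (otherwise the same requirement can be expressed via the NumberOfProcessors JDL parameter). The executable, job name and core count are placeholders:

```python
# Sketch: submit a 4-core job via the DIRAC API.
# Executable and resource requirements are illustrative.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)    # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("multicore-demo")
job.setExecutable("run_simulation.sh", arguments="--threads 4")  # hypothetical payload
job.setCPUTime(86400)
# Ask the WMS to match this job to multi-core slots / partitionable pilots.
job.setNumberOfProcessors(4)   # assumes a DIRAC version providing this helper

result = Dirac().submitJob(job)
print(result)
```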

High-Performance computing platforms

High-Performance computing platforms
Both IHEP and CC-IN2P3 deployed pilot high-performance computing platforms:
- IHEP (2018): 2800 CPU cores (125 worker nodes), 600 TFlops double precision (80 NVIDIA V100 GPUs in 10 hosts), InfiniBand EDR interconnect; integrated into the "HEP job tool" for job submission and query
- CC-IN2P3: 160 CPU cores, 40 NVIDIA Tesla K80 GPUs in 10 hosts, InfiniBand interconnect; programming environment: CUDA, OpenCL, OpenMP, MPI
- Early adopters are getting familiar with these platforms
- The platforms are integrated into the workload management systems of both sites
- Sites are learning how to operate them and attracting users to these new facilities
- Slurm + Docker
- Experience to be shared among site operators
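Since the GPU nodes sit behind a batch scheduler (Slurm, per the slide), users reach them through ordinary batch jobs. A minimal sketch, in Python for consistency with the other examples, that writes and submits a one-GPU Slurm script; the partition name, time limit and payload script are assumptions about the local configuration:

```python
# Sketch: generate and submit a 1-GPU Slurm job from Python.
# Partition name and the training script are hypothetical.
import subprocess
import textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=gpu-demo
    #SBATCH --partition=gpu          # assumed partition name
    #SBATCH --gres=gpu:1             # request one GPU (e.g. V100 at IHEP, K80 at CC-IN2P3)
    #SBATCH --cpus-per-task=4
    #SBATCH --time=02:00:00
    srun python train_model.py       # hypothetical payload
    """)

with open("gpu_job.sh", "w") as f:
    f.write(batch_script)

# sbatch prints "Submitted batch job <id>" on success.
print(subprocess.run(["sbatch", "gpu_job.sh"], capture_output=True, text=True).stdout)
```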

Networking

SDN & Security
Thoughts:
- Minimize the impact on the existing network while adopting a new architecture for network and information security
- Use service-chaining features and bypass security devices (firewalls) for science data traffic
Status:
- Capturing the network traffic with SDN features
- Developing security applications to analyze the network flows

LHCONE in China

LHCONE in China (II)
- Dedicated physical links for LHCONE built from the existing CNGI links
- Two dedicated links (10 Gbps IPv4 + 10 Gbps IPv6) have been set up for LHCONE between IHEP and CSTNet
- The VRF peers to TEIN have been created by CERNET2
- VRF + IBGP deployed at IHEP (IBGP: internal route-forwarding policies deployed within IHEP)
- Currently, the IPs of the CMS and ATLAS servers at IHEP are announced in LHCONE
Status:
- Scientific traffic from IHEP to Europe (CMS and ATLAS experiments) all goes over LHCONE
Plans:
- Set up VRF peering to Internet2 and to ASGC in Hong Kong (HKIX)
- Encourage other universities in China to join LHCONE

People

People
- LI Jianhui, director of the Big Data division of the CAS Computer Network Information Center (CNIC), together with CHEN Gang, QI Fazhi and WANG Lu, visited CC-IN2P3 in March 2017. Dr LI gave a presentation about the CAS initiatives related to big-data research infrastructures.
- Andreï TSAREGORODTSEV and Fabio HERNANDEZ attended the 10th FCPPL workshop held in Beijing in March 2017.
- ZHANG Xiaomei, who is responsible for IHEP's DIRAC-based grid platform, attended the 7th DIRAC Users Workshop in Poland in May 2017 and presented a report about the usage of DIRAC for IHEP experiments.
- A delegation of 7 representatives of CAS visited CC-IN2P3 in September 2017, in the framework of a visit to several European research e-infrastructures. A one-day workshop was organized at CC-IN2P3 to discuss topics relevant to this subject, with the participation of more than 20 experts.
- The 8th DIRAC Users Workshop is being held at CC-IN2P3 this week.

Perspectives

Perspectives
Project submitted to the FCPPL 2018 call. Topics:
- DIRAC: improved integration of cloud and high-performance computing resources
- Machine learning: understanding if and how machine-learning technologies could help us extract information from the data we are collecting at the computer centers
- Security: using SDN technologies to collect and analyze network traffic and to understand network behavior