The SKA Science Data Processor and Regional Centres

1 The SKA Science Data Processor and Regional Centres
Paul Alexander, Leader of the Science Data Processor Consortium

2 SKA: A Leading Big Data Challenge for the 2020s
Data rates across the system, carried over distances of tens to thousands of km:
- Antennas to Digital Signal Processing (DSP): 2020: 5,000 PBytes/day; 2030: 100,000 PBytes/day
- DSP to the High Performance Computing facility (HPC): 2020: 50 PBytes/day; 2030: 10,000 PBytes/day
- HPC processing: 2023: PFlop scale; 2033+: EFlop scale

3 Standard interferometer
An astronomical signal (an EM wave) arriving from direction s is detected and amplified at two antennas separated by baseline B, then digitised, delayed, correlated and integrated. The correlator output is the visibility:
V(B) = ⟨E₁ E₂*⟩ = ∫ I(s) exp(i ω B·s / c) dΩ
Resolution is determined by the maximum baseline: θ_max ≈ λ / B_max
Field of view (FoV) is determined by the size of each dish: θ_dish ≈ λ / D
Processing (calibrate, grid, FFT) turns the correlated visibilities into a sky image.
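To make the "grid, FFT" step concrete, here is a toy sketch (an illustration only, assuming NumPy; the array size, visibility values and nearest-neighbour gridding are simplifications, since real imagers use convolutional gridding kernels): it places a few visibilities on a regular UV grid and inverse-FFTs them into a dirty image.

```python
# Toy sketch of the "grid, FFT" imaging step: not SDP code, just the idea.
import numpy as np

N = 256                                   # image/grid size in pixels
grid = np.zeros((N, N), dtype=complex)    # regular UV grid

# A few simulated visibilities: (u, v) in grid cells, complex visibility value.
visibilities = [(10.0, 3.0, 1 + 0j), (-7.5, 20.0, 0.5 + 0.2j), (30.0, -12.0, 0.8 - 0.1j)]

for u, v, vis in visibilities:
    # Nearest-neighbour gridding (real imagers use convolutional gridding).
    iu = int(round(u)) % N
    iv = int(round(v)) % N
    grid[iv, iu] += vis
    # Hermitian counterpart: the sky is real, so V(-u,-v) = V(u,v)*.
    grid[(-iv) % N, (-iu) % N] += np.conj(vis)

# Inverse FFT of the gridded visibilities gives the (unnormalised) dirty image.
dirty_image = np.fft.fftshift(np.fft.ifft2(grid)).real
print(dirty_image.shape, dirty_image.max())
```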

4 Image formation is computationally intensive
Images are formed by an inverse algorithm similar to that used in MRI, with typical image sizes up to 30k x 30k x 64k voxels. There is an obvious tie-up with ALMA, NGST and the next generation of ground-based optical telescopes: the SKA has a resolution advantage, so it can look in more detail at objects detected in other wavebands. It will also see things that are not visible in the optical, for example the distant dust-enshrouded galaxies seen in the sub-mm regime.

5 Work Flows: Imaging Processing Model
Correlator output passes through RFI excision and phase rotation.

6 One SDP, Two Telescopes
Telescope   Ingest (GB/s)
SKA1_Low    500
SKA1_Mid    1000
In total we eventually need to deploy a system close to 0.25 EFlop of processing.
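A quick back-of-envelope check of what these ingest rates mean per day (using only the figures in the table above):

```python
# Back-of-envelope: what the quoted ingest rates mean per observing day.
SECONDS_PER_DAY = 86_400

ingest_gb_per_s = {"SKA1_Low": 500, "SKA1_Mid": 1000}   # from the table above

for telescope, rate in ingest_gb_per_s.items():
    petabytes_per_day = rate * SECONDS_PER_DAY / 1e6     # 1 PB = 1e6 GB
    print(f"{telescope}: ~{petabytes_per_day:.0f} PB/day into the SDP")
# SKA1_Low: ~43 PB/day, SKA1_Mid: ~86 PB/day, if sustained around the clock.
```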

7 Scope of the SDP

8 The SDP System
Multiple instances (1…N) of the Process Data component, some real time and some batch.

9 Illustrative Computing Requirements
- ~250 PetaFLOP system
- ~200 PetaByte/s aggregate bandwidth to fast working memory
- ~80 PetaByte of storage
- ~1 TeraByte/s sustained write to storage
- ~10 TeraByte/s sustained read from storage
- ~ FLOPS per byte read from storage
- ~2 Bytes/FLOP memory bandwidth
One double-precision FLOP needs ~12 pJ in 28 nm CMOS, probably ~6 pJ in newer technology, roughly the same as moving one bit of data through HBM.
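The energy figure gives a feel for the implied power budget. A rough sketch, using only the numbers quoted on this slide:

```python
# Rough power estimate from the figures on this slide (arithmetic only).
sustained_flops = 250e15          # ~250 PetaFLOP/s
energy_per_flop_j = 12e-12        # ~12 pJ per double-precision FLOP (28 nm CMOS)

arithmetic_power_mw = sustained_flops * energy_per_flop_j / 1e6
print(f"~{arithmetic_power_mw:.0f} MW for the arithmetic alone")   # ~3 MW

# With the newer-technology figure of ~6 pJ/FLOP the same sum gives ~1.5 MW,
# and moving data through HBM costs a comparable amount per bit.
```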

10 SDP High-Level Architecture

11 What does SDP do?
SDP is coupled to the rest of the telescope; we try to make the coupling as loose as possible, but there are some time-critical aspects. For each observation, controlled by a scheduling block:
- Run a real-time (RT) process to ingest data and feed information back
- Schedule a batch processing job for later
Resources must be managed so that the SDP keeps up on a timescale of approximately a week (see the sketch below).
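A minimal sketch of this per-observation pattern (the class and function names are illustrative assumptions, not SDP interfaces): the scheduling block drives a real-time ingest step and queues the batch work, and a simple check keeps the batch backlog within roughly a week.

```python
# Illustrative-only sketch of the per-observation control pattern:
# real-time ingest now, batch processing queued for later.
from collections import deque
from dataclasses import dataclass

@dataclass
class SchedulingBlock:
    block_id: str
    duration_h: float          # observation length
    batch_cost_h: float        # estimated batch processing time

batch_queue: deque = deque()

def run_observation(block: SchedulingBlock) -> None:
    # Real-time part: ingest data and feed calibration info back to the telescope.
    print(f"[RT] ingesting {block.block_id} for {block.duration_h} h")
    # Batch part is only scheduled here, not executed.
    batch_queue.append(block)

def backlog_ok(max_backlog_h: float = 7 * 24) -> bool:
    # Resource-management rule of thumb: queued batch work must fit in ~a week.
    return sum(b.batch_cost_h for b in batch_queue) <= max_backlog_h

run_observation(SchedulingBlock("SB-001", duration_h=6, batch_cost_h=40))
print("backlog within a week?", backlog_ok())
```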

12 Real-time activity

13 Batch activity

14 SDP Architecture and Architectural Drivers
How do we achieve this functionality? The architecture must deliver at scale. Some key considerations:
- SDP is required to evolve over time to support the changing workflows needed to deliver the science
- Different science experiments have different computational costs and resource demands, which can be used for load balancing
- We expect the system to evolve towards more demanding experiments during the operational lifetime

15 Data Driven Architecture

16 Data Driven Architecture
Smaller FFT size at cost of data duplication

17 Data Driven Architecture
Further data parallelism comes from spatial indexing in UVW-space, which is used to balance memory bandwidth per node; some overlap regions on the target grids are needed (see the sketch below).
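The sketch below illustrates the idea of UV-space partitioning with overlap (the tile size and overlap margin are invented for illustration): visibilities are binned into sub-grids, and points near a boundary are duplicated into the neighbouring partition so each target grid can be processed independently.

```python
# Illustrative sketch of spatial (UV) partitioning with overlap regions.
from collections import defaultdict

TILE = 64.0      # size of each UV sub-grid, in wavelengths (illustrative)
OVERLAP = 4.0    # overlap margin so gridding kernels near edges stay local

def tiles_for(u: float, v: float):
    """Yield every tile index whose (expanded) region contains this (u, v)."""
    for du in (-OVERLAP, 0.0, OVERLAP):
        for dv in (-OVERLAP, 0.0, OVERLAP):
            yield (int((u + du) // TILE), int((v + dv) // TILE))

partitions = defaultdict(list)
samples = [(10.0, 62.5, 1 + 0j), (70.0, 5.0, 0.5 + 0j), (63.9, 63.9, 0.2 + 0.1j)]

for u, v, vis in samples:
    for tile in set(tiles_for(u, v)):          # de-duplicate tile indices
        partitions[tile].append((u, v, vis))   # points near edges are duplicated

for tile, points in sorted(partitions.items()):
    print(tile, len(points))
```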

18 Data Driven Architecture

19 How do we get performance and manage data volume?
Approach: build on Big Data concepts. A "data driven", graph-based processing approach is receiving a lot of attention; it is inspired by Hadoop, but designed for our more complex data flow.
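As a toy illustration of "data driven, graph-based" processing (not the SDP execution framework; the pipeline stages are placeholders and it assumes Python 3.9+ for graphlib): each task declares the data it depends on and runs once those data products exist.

```python
# Toy data-driven graph execution: tasks fire when their inputs are available.
from graphlib import TopologicalSorter

def ingest():        return "raw visibilities"
def calibrate(raw):  return f"calibrated({raw})"
def grid(cal):       return f"gridded({cal})"
def image(grd):      return f"image({grd})"

# The processing graph: node -> set of nodes whose outputs it needs.
graph = {"ingest": set(), "calibrate": {"ingest"}, "grid": {"calibrate"}, "image": {"grid"}}
funcs = {"ingest": ingest, "calibrate": calibrate, "grid": grid, "image": image}

data = {}                                    # data products produced so far
for node in TopologicalSorter(graph).static_order():
    inputs = [data[dep] for dep in graph[node]]
    data[node] = funcs[node](*inputs)        # run the task on its input data
    print(f"ran {node} -> {data[node]}")
```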

20 Execution Engine: hierarchical
- Processing Level 1: cluster and data distribution controller, with a relatively low data rate and cadence of messaging; static data distribution exploiting the inherent parallelism
- Processing Level 2: the data islands where the work is done (next slide)
- Staging: aggregation of data products

21 Execution Engine: hierarchical
Processing Level 2 is a Data Island:
- Cluster manager (e.g. Mesos-like)
- Process controller (duplicated for resilience)
- Worker nodes
- Shared file store across the data island
- Task-level parallelism to achieve scaling

22 Data flow for an observation

23 Execution Framework

24 Execution Framework

25 Execution Framework

26 Execution Framework

27 Execution Framework

28 What do we need?
Processing Level 1:
- A custom framework to provide scale-out
Processing Level 2:
- Many similarities to Big Data frameworks
- Need to support multiple execution frameworks:
  - Streaming processing: possibly Apache STORM
  - Workflows with relatively large task sizes, or where data-driven load balancing is straightforward: possibly a development of prototype code
  - "Big Data" workflows with load balancing and resilience: possibly something like Apache SPARK
- The framework manages tasks; tasks need to be pure functions
- Most tasks are domain specific, but they need supporting infrastructure: processing libraries, interfaces to the execution framework, and performance tuning on target hardware

29 Skills and work – e.g. for SPARK
We need to develop for high throughput: a High Throughput Data Analytics Framework (HTDAF) with new data models, links to external processes, and memory management between the framework and those processes; this is likely to have significant applicability. The shared file system needs to be very intelligent about data placement and management, or be tightly coupled with the HTDAF, and in-memory file system / object store systems must be integrated with the cluster file system / object store. This means software development in data analytics, distributed processing systems, the JVM, … (see the sketch below).
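For a flavour of the pure-function, data-parallel style such a framework imposes, here is a minimal Spark sketch (assuming a local pyspark installation; the data and the "calibration" step are purely illustrative, not SDP code):

```python
# A minimal sketch of the pure-function, data-parallel task style that
# Spark-like frameworks expect. Channel/gain values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sdp-sketch").getOrCreate()

# Pretend each record is (frequency_channel, visibility_amplitude).
vis = spark.sparkContext.parallelize(
    [(chan, 1.0 + 0.01 * chan) for chan in range(1000)]
)

# Pure functions only: no shared mutable state, so the framework can
# re-execute tasks freely for load balancing and resilience.
def calibrate(record):
    chan, amp = record
    gain = 0.98                       # illustrative gain
    return (chan % 16, amp * gain)    # key by a coarse "sub-band"

subband_totals = (
    vis.map(calibrate)
       .reduceByKey(lambda a, b: a + b)   # aggregate per sub-band
       .collect()
)
print(subband_totals)
spark.stop()
```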

30 Database services
- Tasks running within the processing framework need metadata and a way to communicate updates / results
- High-performance distributed database; prototyping using Redis
- Needs to interface to the processing requirements
- Schemas
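For example, a task might record metadata and announce results through Redis along these lines (a sketch assuming a local Redis instance and the redis-py client; the key names are illustrative, not an SDP schema):

```python
# Sketch of tasks sharing metadata and publishing status through Redis.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# A task records metadata about the data partition it processed...
r.hset("task:0042", mapping={
    "scheduling_block": "SB-001",
    "state": "done",
    "result": json.dumps({"rms_noise_jy": 1.2e-5}),
})

# ...and notifies other components that new results are available.
r.publish("sdp:task-updates", "task:0042")

print(r.hgetall("task:0042"))
```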

31 File / object store
- Multiple file / object stores, one per "data island"
- Integration into the execution framework
- Resource allocation integrated with the cluster management system; complex resource management
- Migration to / interface with a long-term persistent store is likely to be separate (Hierarchical Storage Manager)

32 Data lifecycle management
Manage the data lifecycle across the cluster and through time, with a strong link to the overall processing and observing schedule. Performance is key:
- Determine initial data placements
- Manage data movement and migration
- Local high performance is obtained by a light-weight layer (e.g. policies) on a file system (see the sketch below)
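A light-weight placement policy of the kind mentioned above might look roughly like this (a sketch with invented island names and capacities; real policies would be driven by the processing and observing schedules):

```python
# Illustrative data-placement policy: put each scheduling block's data on the
# data island with the most free capacity that can still hold it.
islands = {"island-a": 5_000, "island-b": 3_200, "island-c": 4_100}  # free TB

def place(block_id: str, size_tb: int) -> str:
    candidates = {name: free for name, free in islands.items() if free >= size_tb}
    if not candidates:
        raise RuntimeError(f"no island can hold {block_id} ({size_tb} TB)")
    target = max(candidates, key=candidates.get)   # most free space wins
    islands[target] -= size_tb                     # update remaining capacity
    return target

print(place("SB-001", 1_200))   # -> island-a
print(place("SB-002", 4_000))   # -> island-c (island-a now has 3_800 TB free)
```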

33 Hardware Platform

34 Complex Network

35 Performance Prototype Platform
A significant (£400k) STFC/UK investment in a hardware prototype, currently being installed in Cambridge. Example tests:
- automated deployment from containers (GitHub)
- infrastructure management and orchestration
- simulated streamed visibility data from the correlator to N ingest nodes
- basic calibration on ingest
- calibration and imaging with suitable modelling across 2 compute islands
- data products written to preservation (staging)
- basic quality assurance metrics made visible
- distributed database performance
- some automated built-in testing at various levels
- LMC interface prototyping (platform management and telemetry)
- Execution Framework / Big Data prototyping (Apache Spark, etc.)
- scheduling

36 Control layer
Complex functionality; resilience is required here.
- Interface to the cluster manager for resource management
- Prototyping around adding functionality to OpenStack
- Built around various open-source products integrated into a broader framework

37 Control and system components

38 Delivering Data
Components:
- Delivery System User Interfaces
- Delivery System Service Manager
- Delivery System Interfaces Manager
- Tiered Data Delivery Manager
- Science Data Catalogue and Prepare Science Product: this component manages the catalogue of data products and how they are assembled for distribution, including data model conversion if needed

39 Domain specific components
Not discussed in detail here. Some performance tuning; interfacing to the I/O system?

40 Other System Software
Compute operating system software: the OS will be based on a widely adopted community version of Linux such as Scientific Linux (Fermilab et al.) or CentOS (CERN et al.).
Middleware messaging layer: a messaging layer based on TANGO's messaging layer and/or open-source messaging layers (such as ZeroMQ) is possible with effort.
Interface to TANGO: TANGO is the selected product for overall monitor and control, so interfacing of other components to TANGO will be required, e.g. for monitoring information from the cluster manager.
Middleware stack: based on widely adopted versions of community-supported software that will mature up to deployment, such as OpenStack. Effort will be required during the construction period to develop enhancements and SDP-specific customisation, and to maintain those enhancements as the hardware and software requirements evolve. Close collaboration and convergence with large contributors (such as CERN) in the open-source community could reduce long-term maintenance effort.

41 Other activities
The SDP is expected to be delivered by a consortium with strong links to, and involvement from, SKAO and academic partners:
- Consortium building and leadership
- Management
- Software engineering and system engineering
- Integration and acceptance
- Some ongoing support activities after construction

42 Further Processing and Science Extraction at Regional Centres
Archive estimates: SKA1 archive growth rate of ~45–120 PBytes/yr, or larger (~300 PBytes/yr) including discovery space.
Data products: standard products only, i.e. calibrated multi-dimensional sky images, time-series data, and catalogued data for discrete sources (the global sky model) and pulsar candidates. No derived products from science teams: further processing and science extraction happen at Regional Centres.

43 Regional Centres
Regional Centres sit alongside the Global Headquarters and the South African and Australian telescope sites. Their roles:
- Data distribution and access for astronomers
- Provide regional data centres and High Performance Computing services
- Software development
- Support for key science teams
- Technology development and engineering support
They represent significant compute in their own right; in the UK this is likely to be supported by layering on top of a coordinated compute infrastructure (working title UK-T0). Data analytics support covers science analysis, source extraction, Bayesian analysis, joining of catalogues, and visualisation.

44 END

