Published by Derrick Moore. Modified over 9 years ago.
1
The SKA: the world's largest big-data project
Dr Paul Calleja, Director, HPCS
Big Data, Imperial, June 2013
2
HPCS activities & focus
Cambridge HPC Service
Dell HPC Solution Centre
Academic / Industrial HPC Cloud
3
Square Kilometre Array - SKA
Next generation radio telescope
  100 x more sensitive
  1,000,000 x faster
  5 square km of dish over 3000 km
The next big science project
Currently the world's most ambitious IT project
Cambridge leads the computational design
  HPC compute design
  HPC storage design
  HPC operations
4
A continental-sized radio telescope
SKA location
Needs a radio-quiet site
  Very low population density
  Large amount of space
Two sites:
  Western Australia
  Karoo Desert, RSA
5
What is radio astronomy?
[Figure: signal path for a two-antenna interferometer. Labels: astronomical signal (EM wave) from direction s; antennas 1 and 2 separated by baseline B; Detect & amplify; Digitise & delay; Correlate; Integrate; Process (calibrate, grid, FFT); sky image.]
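The "Digitise & delay" and "Correlate" steps named on this slide can be illustrated with a toy example. The sketch below is not the SKA correlator design: it assumes a noise-like signal reaching two antennas with a whole-sample delay between them, and recovers that delay by brute-force cross-correlation. The function name `cross_correlate`, the sample count, the noise level and the delay of 5 samples are all illustrative assumptions.

```python
import random


def cross_correlate(a, b, max_lag):
    """Return {lag: sum of a[i] * b[i + lag]} for lags in [-max_lag, max_lag]."""
    n = len(a)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += a[i] * b[j]
        result[lag] = total
    return result


random.seed(0)        # fixed seed so the sketch is repeatable
TRUE_DELAY = 5        # hypothetical geometric delay, in samples
N_SAMPLES = 2000

# Noise-like "astronomical signal" shared by both antennas.
sky = [random.gauss(0, 1) for _ in range(N_SAMPLES)]

# Antenna 2 sees the same wavefront TRUE_DELAY samples later.
delayed = [0.0] * TRUE_DELAY + sky[:-TRUE_DELAY]

# Each antenna adds its own independent receiver noise (assumed level).
ant1 = [s + random.gauss(0, 0.5) for s in sky]
ant2 = [s + random.gauss(0, 0.5) for s in delayed]

corr = cross_correlate(ant1, ant2, max_lag=20)
best_lag = max(corr, key=corr.get)
print("lag with the strongest correlation:", best_lag)
```

The correlation peaks at the lag that lines the two data streams up, which is why the delay step precedes correlation in the signal path: once the geometric delay is compensated, the common sky signal adds coherently while the independent receiver noise averages down.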
6
Why SKA - key scientific drivers
Are we alone?
Cosmic magnetism
Evolution of galaxies
Pulsar survey, gravity waves
Exploring the dark ages
7
SKA timeline
1995 - 2000   Preliminary ideas and R&D
2000 - 2007   Initial concepts stage
2008 - 2012   System design and refinement of specification
2012          Site selection
2012 - 2015   Pre-construction: 1 yr detailed design, 3 yr production readiness (€90M PEP)
2016 - 2019   10% SKA construction, SKA 1 (€300M)
2019          Operations SKA 1
2019 - 2023   Construction of full SKA, SKA 2 (€1.5B)
2024          Operations SKA 2
8
SKA project structure
[Figure: organisation chart. Labels: SKA Board; Director General; Project Office (OSKAO); Advisory Committees (Science, Engineering, Finance, Funding …); Work Package Consortium 1 … Work Package Consortium n; Locally funded.]
9
Work package breakdown
1. System
2. Science
3. Maintenance and support / Operations plan
4. Site preparation
5. Dishes
6. Aperture arrays
7. Signal transport
8. Data networks
9. Signal processing
10. Science Data Processor
11. Monitor and Control
12. Power
SPO
UK (lead), AU (CSIRO…), NL (ASTRON…), South Africa SKA, Industry (Intel, IBM…)
10
SKA data flow
[Figure: data-flow diagram. Labelled rates: 16 Tb/s, 4 Pb/s, 24 Tb/s, 20 Gb/s, 1000 Tb/s.]
11
Science data processor pipeline
[Figure: SDP pipeline. Incoming data from collectors passes through switches and buffer stores to a bulk store. Stages: Correlator / Beamformer, UV Processor, Image Processor, HPC science processing.]
Imaging: corner turning, coarse delays, fine F-step / correlation, visibility steering, observation buffer, gridding visibilities, imaging, image storage
Non-imaging: corner turning, coarse delays, beamforming / de-dispersion, beam steering, observation buffer, time-series searching, search analysis, object/timing storage
Figure labels (SKA 1 and SKA 2): 10 Pflop, 100 Pflop, 200 Pflop, 1 Eflop, 2.5 Eflop; 3200 GB/s, 128,000 GB/s; 135 PB, 300 PB, 3 EB, 5.40 EB
Axis label: software complexity
12
SKA exascale computing in the desert
The SKA SDP compute facility will be, at the time of deployment, one of the largest HPC systems in existence.
Operational management of large HPC systems is challenging at the best of times, even when they are housed in well-established research centres with good IT logistics and experienced Linux HPC staff.
The SKA SDP will be housed in a desert location with little surrounding IT infrastructure, poor IT logistics and little prior HPC history at the site.
Potential SKA SDP exascale systems are likely to consist of 100,000 nodes, occupy 800 cabinets and consume 30 MW. This is very large: around 5 times the size of today's largest supercomputer, the Cray Titan at Oak Ridge National Laboratory.
The SKA SDP HPC operations will be very challenging.
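As a rough sanity check on the scale quoted on this slide, the sketch below derives per-cabinet and per-node figures from the 100,000 nodes, 800 cabinets and 30 MW, and compares the system with Titan. The Titan numbers (18,688 compute nodes in 200 cabinets) are not in the slides; they are assumed here from publicly reported specifications, and the variable names are illustrative.

```python
# Figures quoted on the slide for a potential SKA SDP exascale system.
SDP_NODES = 100_000
SDP_CABINETS = 800
SDP_POWER_W = 30e6  # 30 MW

# ASSUMPTION: Titan figures are not in the slides; these are the publicly
# reported values for the Cray Titan system.
TITAN_NODES = 18_688
TITAN_CABINETS = 200

nodes_per_cabinet = SDP_NODES / SDP_CABINETS    # packing density
watts_per_node = SDP_POWER_W / SDP_NODES        # average power budget per node
node_ratio = SDP_NODES / TITAN_NODES            # size relative to Titan, by nodes
cabinet_ratio = SDP_CABINETS / TITAN_CABINETS   # size relative to Titan, by cabinets

print(f"Nodes per cabinet:        {nodes_per_cabinet:.0f}")
print(f"Average power per node:   {watts_per_node:.0f} W")
print(f"Nodes relative to Titan:  {node_ratio:.1f}x")
print(f"Cabinets relative to Titan: {cabinet_ratio:.1f}x")
```

Under these assumptions the system is roughly 4 to 5 times Titan depending on whether cabinets or nodes are counted, which is consistent with the "around 5 times" on the slide.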
13
The challenge is tractable
Although the operational aspects of the SKA SDP exascale facility are challenging, they are tractable if dealt with systematically and in collaboration with the HPC community.
14
SKA HPC operations - functional elements
We can describe the operational aspects by functional element:
  Machine room requirements **
  SDP data connectivity requirements
  SDP workflow requirements
  System service level requirements
  System management software requirements **
  Commissioning & acceptance test procedures
  System administration procedure
  User access procedures
  Security procedure
  Maintenance & logistical procedures **
  Refresh procedure
  System staffing & training procedures **
15
Machine room requirements
Machine room infrastructure for exascale HPC facilities is challenging:
  800 racks, 1600 m²
  30 MW IT load
  ~40 kW of heat per rack
Cooling efficiency and heat density management are vital.
Machine infrastructure at this scale is in the £150M bracket, with a design and implementation timescale of 2-3 years.
The power cost alone, at today's prices, is £30M per year.
A desert location presents particular problems for a data centre:
  Hot ambient temperature: difficult for compressor-less cooling
  Lack of water: difficult for compressor-less cooling
  Very dry air: difficult for humidification
  Remote location: difficult for DC maintenance
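The machine room figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes the 30 MW IT load is drawn continuously all year and ignores cooling and distribution overhead, neither of which is stated in the slides; the implied electricity price is therefore only indicative. Variable names are illustrative.

```python
# Figures quoted on the slide.
RACKS = 800
FLOOR_AREA_M2 = 1600
IT_LOAD_KW = 30_000             # 30 MW
ANNUAL_POWER_COST_GBP = 30e6    # GBP 30M per year

# ASSUMPTION: constant full load for a non-leap year, IT load only.
HOURS_PER_YEAR = 24 * 365

kw_per_rack = IT_LOAD_KW / RACKS              # heat to remove per rack
kw_per_m2 = IT_LOAD_KW / FLOOR_AREA_M2        # floor heat density
annual_kwh = IT_LOAD_KW * HOURS_PER_YEAR      # energy drawn per year
implied_gbp_per_kwh = ANNUAL_POWER_COST_GBP / annual_kwh

print(f"Power per rack:        {kw_per_rack:.1f} kW")
print(f"Floor heat density:    {kw_per_m2:.2f} kW per square metre")
print(f"Annual energy:         {annual_kwh / 1e6:.1f} GWh")
print(f"Implied energy price:  GBP {implied_gbp_per_kwh:.3f} per kWh")
```

The per-rack result of 37.5 kW agrees with the "~40 kW of heat per rack" on the slide, and the quoted annual cost corresponds to a price of a little over 11 pence per kWh under the stated assumptions.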
16
System management software
System management software is the vital element in HPC operations.
System management software today does not scale to exascale.
Worldwide coordinated effort to develop system management software for exascale.
Elements of the system management software stack:
  Power management **
  Network management
  Storage management
  Workflow management
  OS
  Runtime environment **
  Security management
  System resilience **
  System monitoring **
  System data analytics **
  Development tools
17
Maintenance logistics
With current HPC technology, the MTBF of hardware and system software results in failure rates of ~2 nodes per week on a cluster of ~600 nodes.
SKA exascale systems are expected to contain ~100,000 nodes, so failure rates of around 300 nodes per week could be realistic.
During system commissioning this will be 3 or 4 times higher.
Fixing nodes quickly is vital, otherwise the system will soon degrade into a non-functional state.
The manual engineering processes for fault detection and diagnosis on 600 nodes will not scale to 100,000 nodes. This needs to be automated by the system software layer.
Scalable maintenance procedures need to be developed between HPC system administrators, system software and smart hands in the DC.
Vendor hardware replacement logistics need to cope with high turnaround rates.
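The failure-rate estimate on this slide follows from scaling the observed rate linearly with node count. The sketch below reproduces that arithmetic, assuming failures are independent and the per-node rate stays the same at scale (the slides do not state this explicitly). The function name `failures_per_week` is illustrative.

```python
HOURS_PER_WEEK = 7 * 24


def failures_per_week(nodes, ref_failures=2, ref_nodes=600):
    """Scale a reference failure rate linearly with the number of nodes.

    Defaults are the slide's figures: ~2 node failures per week on ~600 nodes.
    """
    return ref_failures * nodes / ref_nodes


steady_state = failures_per_week(100_000)

# The slide expects 3 to 4 times the steady-state rate during commissioning.
commissioning_low = 3 * steady_state
commissioning_high = 4 * steady_state

# Average time between node failures somewhere in the system.
hours_between_failures = HOURS_PER_WEEK / steady_state

print(f"Steady state:   {steady_state:.0f} node failures per week")
print(f"Commissioning:  {commissioning_low:.0f} to {commissioning_high:.0f} per week")
print(f"Mean gap:       {hours_between_failures * 60:.0f} minutes between failures")
```

Linear scaling gives about 333 failures per week, in line with the ~300 on the slide, which means a node fails somewhere in the system roughly every half hour. That interval is the practical argument for automating fault detection and diagnosis rather than relying on manual processes.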
18
Staffing levels and training
Providing functional staffing levels and experience at a remote desert location will be challenging.
It's hard enough finding good HPC staff to run small-scale HPC systems in Cambridge; finding orders of magnitude more staff to run much more complicated systems in a remote desert location will be very challenging.
Operational procedures using a combination of remote system administration staff and DC smart hands will be needed.
HPC training programmes need to be implemented to skill up well in advance.
The HPCS, in partnership with the SA national HPC provider and the SKA organisation, is already in the process of building out pan-African HPC training activities.
19
Early Cambridge SKA solution - EDSAC 1