Big Data - Efficient SW Processing

April 3rd 2017

Agenda
- What is Big Data?
- What makes Big Data a Thing?
- Does SKA Qualify as a Big Data System?
- Overview of the Slow Transients Pipeline
- Resources Available & Implementation Challenges
- Very Simple Example on Parallel Computing
- Limits to Parallelisation
- Is Big Data the same as High-Performance Computing?

What is Big Data?
A relatively new buzzword arising from the need to extract knowledge from the immense data volumes created by the ever-increasing digitization of business and social life. Big Data can be seen as three Vs:
- Volume: in the order of terabytes, petabytes or more
- Velocity: real-time or near real-time processing
- Variety: social networks, e-business logs, sensor information, etc.

What makes Big Data a Thing?
- Network Capacity: over the 1990s and 2000s, network capacity expanded at an incredible pace; companies raced to install land and submarine communication cables, delivering a huge excess of bandwidth and enabling interconnectivity as never seen before.
- Storage Capacity: storage capacity kept pace with bandwidth; new discoveries such as giant magnetoresistance increased data density by orders of magnitude, while data centres kept growing equally fast.
- Processing Capacity: the stand-alone CPU reached a point where it could no longer scale as before, but multi-core processors and ever faster networks allowed processing power to keep up with the growth in data size.

Does SKA Qualify as a Big Data System?
- Exabytes/year from the telescope array
- 2x300 petabytes/year through submarine links
- Real-time closed-loop control
- Complex events processed every second or less
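
To give a feel for what a yearly volume means for a link, the short Python sketch below converts petabytes per year into a sustained bit rate. It rests on assumptions that are not in the slides: decimal petabytes (10^15 bytes), a 365-day year, a perfectly steady flow, and a reading of "2x300" as two flows of 300 petabytes/year each. The function name `sustained_gbit_per_s` is made up for this illustration.

```python
# Back-of-the-envelope: sustained link rate needed to move a yearly data volume.
# Assumptions (not from the slides): decimal petabytes, 365-day year, steady flow.

PETABYTE = 10**15                    # bytes, decimal (SI) definition
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 s


def sustained_gbit_per_s(petabytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move the given volume in one year."""
    bits_per_year = petabytes_per_year * PETABYTE * 8
    return bits_per_year / SECONDS_PER_YEAR / 1e9


per_link = sustained_gbit_per_s(300)
print(f"One 300 PB/year flow: about {per_link:.0f} Gbit/s sustained")
print(f"Both flows together:  about {2 * per_link:.0f} Gbit/s sustained")
```

Under these assumptions one 300 petabyte/year flow works out to roughly 76 Gbit/s, around the clock, before any protocol overhead or bursts are taken into account.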

Overview of the Slow Transients Pipeline
- Slow transients last for hours, days or more than a week, but “slow” is misleading
- Samples must be processed fast, e.g. within a second

Resources Available & Implementation Challenges
The essential features of high-performance computing, even if at small scale, are within reach of almost everyone:
- Multi-CPU and/or multi-core machines
- Parallel computing frameworks
- Optimised math libraries
There are important implementation challenges, however:
- Partitioning of data sets
- Work distribution and orchestration
- Error recovery
- Scalability and management
- Usability of parallel computing technologies

Technology used in the Slow Transients Pipeline
- Parallelisation Technology: Threading Building Blocks (TBB)
- Programming Languages: C++11, Python
- Mathematical Libraries: Armadillo, LAPACK, OpenBLAS, FFTW

Very Simple Example on Parallel Computing
Given a random data set of size 12, we can calculate the average of:
- The entire data set, or else
- The averages of non-overlapping subsets of size 2, 3, 4 or 6
The algorithm can be parallelised across 1, 6, 4, 3 or 2 computational cores respectively, in a single machine or a cluster.
On the other hand, this is not feasible for the standard deviation:
- The STDEV of the sub-averages is not the same as the STDEV of the initial set
- Algorithms that depend on the full data set cannot be parallelised this way
- Dependency on the state of a previous iteration also breaks parallelisation
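
The example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the seed, the value range and the helper name `chunks` are arbitrary choices made for the illustration. The population standard deviation (`pstdev`) is used on both sides so the comparison is like for like.

```python
import random
import statistics

random.seed(12)  # fixed seed so the run is repeatable
data = [random.uniform(0.0, 100.0) for _ in range(12)]


def chunks(values, size):
    """Split values into consecutive, non-overlapping subsets of `size`."""
    return [values[i:i + size] for i in range(0, len(values), size)]


full_mean = statistics.fmean(data)
full_stdev = statistics.pstdev(data)

for size in (2, 3, 4, 6):
    # Each subset could be averaged on its own core.
    sub_means = [statistics.fmean(c) for c in chunks(data, size)]
    mean_of_means = statistics.fmean(sub_means)
    stdev_of_means = statistics.pstdev(sub_means)
    print(f"subset size {size}: {len(sub_means)} cores, "
          f"mean of means = {mean_of_means:.6f} (full: {full_mean:.6f}), "
          f"stdev of means = {stdev_of_means:.6f} (full: {full_stdev:.6f})")
```

The mean of the sub-averages matches the full average because all subsets have the same size. The standard deviation of the sub-averages does not match that of the full set, so combining partial results for such a statistic needs more than the sub-averages alone.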

Limits to Parallelisation
- Parallelisation is good, but gains are anything but linear, even when they seem linear as in the previous case
- Time is required to partition the data set, distribute it and collect the results from the fractions, e.g. MapReduce time in Big Data
- We may also be limited in how many partitions we can allocate at once, e.g. 1, 2 and 4 cores are feasible but 3 are not
- The most important aspect to always keep in mind is that “all computers wait at the same speed”!
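
The effect of partitioning, distribution and collection time can be sketched with a toy cost model. Everything in it is an assumption made for illustration, not a measurement and not something taken from the slides: the work is taken to split evenly, the overhead is taken to grow in proportion to the number of cores, and the values of `WORK_S` and `OVERHEAD_S` are invented.

```python
# Toy cost model, not a measurement: every number below is a made-up assumption.

def parallel_time(work_s: float, cores: int, overhead_per_core_s: float) -> float:
    """Elapsed time when work splits evenly and each partition adds fixed overhead."""
    if cores == 1:
        return work_s  # nothing to partition, distribute or collect
    return work_s / cores + overhead_per_core_s * cores


def speedup(work_s: float, cores: int, overhead_per_core_s: float) -> float:
    """Single-core time divided by the modelled parallel time."""
    return work_s / parallel_time(work_s, cores, overhead_per_core_s)


WORK_S = 10.0       # hypothetical single-core run time, in seconds
OVERHEAD_S = 0.05   # hypothetical partition/distribute/collect cost per core

for cores in (1, 2, 4, 8, 16, 32):
    print(f"{cores:2d} cores: speedup {speedup(WORK_S, cores, OVERHEAD_S):5.2f} "
          f"(ideal {cores})")
```

With these invented numbers the speedup stays below the ideal at every core count and starts to fall again somewhere between 8 and 32 cores, once the overhead outweighs the shrinking share of useful work.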

Is Big Data the same as High-Performance Computing?
- Big Data and HPC are more alike than different, but have different roots; Big Data is trendier now!
- Big Data is an offspring of the Internet, while HPC comes from the supercomputer world
- Big Data uses technologies such as Hadoop and Spark, while HPC relies heavily on MPI
- Big Data is more about detecting patterns in user data, while HPC is more about scientific computing
- The basic means, however, are the same: map partitions, distribute workloads, collect partial results, calculate the aggregated result
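
The four shared steps (map partitions, distribute workloads, collect partial results, calculate the aggregated result) can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Hadoop, Spark or MPI work internally; the helper names `partition`, `partial_result` and `parallel_mean` are made up here. A thread pool from the standard library stands in for the workers; for CPU-bound pure Python code, threads in CPython will not run faster than a single core, so a real system would hand the same slices to processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, parts):
    """Map step: cut the data into `parts` consecutive slices of near-equal size."""
    step, extra = divmod(len(values), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        slices.append(values[start:end])
        start = end
    return slices


def partial_result(chunk):
    """Work done by one worker: a (sum, count) pair for its slice."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4):
    chunks = partition(values, workers)
    # Distribute the slices and collect the partial results in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_result, chunks))
    # Aggregate: combine the partial sums and counts into one mean.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
print(parallel_mean(data, workers=3))  # prints 500.5
```

Each worker returns a sum and a count rather than its own average, so the aggregation stays correct even when the slices differ in size.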

Concluding Thoughts
- There is a fundamental difference between "as fast as possible" and "in real-time".
- A multi-core machine will not execute software any faster unless the software is programmed to use multiple cores.
- The challenge is not in dealing with 2 or 2,000,000 cores; it is in dealing with one versus more than one.
- A parallel computing framework will not break the problem down for you; it will only allow you to do that.

SKA Pushing New Boundaries of the Known and Uncovering the Unknown

SKA Ecosystem
[Diagram. Nodes: SKA Radio-telescope; First Level Processing Centre (ZA); Telescope Control System (ZA); Centre for Processing & Transit (PT); Processing Centre (other location); Complementary Ground Observatory; Complementary Space Observatory; Virtual Observatory; Astronomy Events Data Base. Flows: Primary Data Flow (exabytes/year); Secondary Data Flow (2x300 PBytes/year total, 10-15 PBytes/year for PT); Transient’s Coordinates (feedback for control); Transient’s Coordinates (feedback for complementary observation means); Monitoring & Control; Pipeline I/O Data.]

Slow Transients Pipeline
[Diagram. Labels: Subtract Global Sky Model; Visibilities; Residual Visibilities; Gridding; FFT; Find Sources; Image; Transient Sources; Kernels; SKA Radio-telescope First Level Processing Centre (ZA).]

Extra thoughts
Recursion is hard to grasp because, to understand recursion, one first needs to understand recursion.
