Big Data - Efficient SW Processing


Big Data - Efficient SW Processing April 3rd 2017

Agenda: What is Big Data? What makes Big Data a Thing? Does SKA Qualify as a Big Data System? Overview of the Slow Transients Pipeline. Resources Available & Implementation Challenges. A Very Simple Example of Parallel Computing. Limits to Parallelisation. Is Big Data the same as High-Performance Computing?

What is Big Data? A relatively new buzzword arising from the need to extract knowledge from the immense data volumes created by the ever-increasing digitisation of business and social life. Big Data is often characterised by three Vs: Volume – in the order of terabytes, petabytes or more; Velocity – real-time or near real-time processing; Variety – social networks, e-business logs, sensor information, etc.

What makes Big Data a Thing? Network Capacity – over the 90s and 2000s it expanded at an incredible pace; companies raced to install land and submarine communication cables, delivering a huge excess of bandwidth and enabling interconnectivity as never seen before. Storage Capacity – storage followed the pace of bandwidth; discoveries such as giant magneto-resistance increased data density by orders of magnitude, while data centres kept growing equally fast. Processing Capacity – the stand-alone CPU reached a point where it could no longer scale as before, but multi-core processors and ever faster networks allowed processing power to keep pace with the growth of data.

Does SKA Qualify as a Big Data System? Exabytes/year from the telescope array. 2x300 petabytes/year through submarine links. Real-time closed-loop control. Complex events processed every second or less.

Overview of the Slow Transients Pipeline. Slow transients last for hours, days or more than a week, but "slow" is misleading: samples must still be processed fast – e.g. within a second.

Resources Available & Implementation Challenges. The essential features of high-performance computing, even at small scale, are within reach of almost everyone: multi-CPU and/or multi-core machines, parallel computing frameworks, optimised math libraries. There are, however, important implementation challenges: partitioning of data sets, work distribution and orchestration, error recovery, scalability and management, usability of parallel computing technologies.

Technology used in the Slow Transients Pipeline. Parallelisation technology: Threading Building Blocks (TBB). Programming languages: C++11, Python. Mathematical libraries: Armadillo, LAPACK, OpenBLAS, FFTW.

A Very Simple Example of Parallel Computing. Given a random data set of size 12, we can calculate the average of the entire data set, or else the average of the averages of non-overlapping subsets of sizes 2, 3, 4 or 6. The algorithm can thus be parallelised among 1, 6, 4, 3 or 2 computational cores respectively, in a single machine or a cluster. On the other hand, this is not feasible for the standard deviation: the STDEV of the sub-averages is not the same as the STDEV of the initial set. Algorithms that depend on the full data set cannot be parallelised this way. Dependency on previous iteration state also breaks parallelisation.
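The averaging example above can be sketched in a few lines of Python (a minimal illustration, not code from the pipeline; the data set is synthetic):

```python
import random
import statistics

random.seed(0)
data = [random.random() for _ in range(12)]  # random data set of size 12

overall_mean = statistics.mean(data)

# Average of the averages of non-overlapping subsets of size 3
# (i.e. the work could be split among 4 cores).
chunks = [data[i:i + 3] for i in range(0, 12, 3)]
mean_of_means = statistics.mean([statistics.mean(c) for c in chunks])
assert abs(overall_mean - mean_of_means) < 1e-12  # identical result

# The same scheme fails for the standard deviation:
overall_std = statistics.pstdev(data)
std_of_sub_means = statistics.pstdev([statistics.mean(c) for c in chunks])
assert abs(overall_std - std_of_sub_means) > 1e-6  # different values
```

The mean combines cleanly because each subset has the same size, so the sub-results can be aggregated independently; the standard deviation depends on the full data set and does not decompose this way.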

Limits to Parallelisation. Parallelisation is good, but gains are far from linear, even when they seem linear as in the previous case. Time is required to partition the data set, distribute it, and collect the results from the fractions – e.g. the MapReduce time in Big Data. We may also be limited in how many partitions we can allocate at once – e.g. 1, 2 and 4 cores may be feasible while 3 are not. The most important aspect to always keep in mind is that "all computers wait at the same speed"!
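The sub-linear gains can be made concrete with a back-of-the-envelope model: assume a fixed partition/distribute/collect overhead that is paid regardless of the number of cores (the figures below are illustrative assumptions, not measurements):

```python
def speedup(compute_time, overhead, cores):
    """Speedup vs. one core when a fixed overhead is paid for
    partitioning the work and collecting the partial results."""
    parallel_time = compute_time / cores + overhead
    return compute_time / parallel_time

# 10 s of perfectly parallel work, 1 s of fixed overhead:
for cores in (2, 4, 8):
    print(cores, "cores ->", round(speedup(10.0, 1.0, cores), 2), "x")
# Speedup stays well below the core count, and the gap widens as cores grow.
```

Even with work that parallelises perfectly, the fixed coordination cost caps the achievable speedup – with 1 s of overhead here, no number of cores can ever reach 10x.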

Is Big Data the same as High-Performance Computing? Big Data and HPC are more alike than different, but have different roots – Big Data is more trendy now! Big Data is an offspring of the Internet, while HPC comes from the supercomputer world. Big Data uses technologies such as Hadoop and Spark, while HPC relies heavily on MPI. Big Data is more about detecting patterns in user data, while HPC is more about scientific computing. The basic means, however, are the same: map partitions, distribute workloads, collect partial results, calculate the aggregated result.

Concluding Thoughts. There is a fundamental difference between "as fast as possible" and "in real-time". A multi-core machine will not execute software any faster unless the software is programmed to use multiple cores. The challenge is not in dealing with 2 or 2,000,000 cores; it is in dealing with one versus more than one. A parallel computing framework will not break the problem down for you; it will only allow you to do that.

SKA Pushing New Boundaries of the Known and Uncovering the Unknown

SKA Ecosystem (diagram). Components: SKA Radio-telescope; First Level Processing Centre (ZA); Telescope Control System (ZA); Centre for Processing & Transit (PT); Processing Centre (other location); Complementary Ground Observatory; Complementary Space Observatory; Virtual Observatory; Astronomy Events Data Base. Data flows: Primary Data Flow (exabytes/year); Secondary Data Flow (2x300 PBytes/year total, 10-15 PBytes/year to PT); Transient's Coordinates (feedback for control and for complementary observation means); Monitoring & Control; Pipeline I/O Data.

Slow Transients Pipeline (diagram). Stages: Subtract Global Sky Model (Visibilities → Residual Visibilities), Gridding, FFT (→ Image), Find Sources (→ Transient Sources); the processing kernels run at the SKA Radio-telescope First Level Processing Centre (ZA).

Extra thoughts. Recursion is hard to grasp because in order to understand recursion one first needs to understand recursion.