Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data

Presentation transcript:

Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
Data path: Raw Data -> Data -> Information -> Knowledge -> Wisdom -> Decisions
Components: Internet of Things (Smart Grid); Pub-Sub; Storm; Streaming Processing (Iterative MapReduce); Batch Processing (Iterative MapReduce); Archival Storage – NoSQL like Hbase
Also shown: Analytics; System Orchestration / Dataflow / Workflow

Data path: Raw Data -> Data -> Information -> Knowledge -> Wisdom -> Decisions
Components: Data Ingest; Pub-Sub; Storm; Streaming Processing (Bolts); Batch Processing (MapReduce); Archival Storage – Accumulo
Also shown: Analytics; System Orchestration / Dataflow / Workflow
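The two diagrams above route the same ingested events to a streaming path and to archival storage that later feeds batch processing. The following is a minimal in-memory sketch of that idea, not the Storm, HBase, or Accumulo APIs. The class names (`PubSub`, `StreamingMean`, `Archive`), the function `batch_max`, and the meter readings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class PubSub:
    """Tiny in-memory publish-subscribe broker (illustrative stand-in only)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        # Deliver every event to every subscriber.
        for callback in self.subscribers:
            callback(event)


class StreamingMean:
    """Streaming path: updates a running mean per sensor as each event arrives."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_event(self, event):
        sensor, value = event
        self.count[sensor] += 1
        self.total[sensor] += value

    def mean(self, sensor):
        return self.total[sensor] / self.count[sensor]


class Archive:
    """Archival path: keeps raw events so a batch job can scan them later."""

    def __init__(self):
        self.events = []

    def on_event(self, event):
        self.events.append(event)


def batch_max(events):
    """Batch path: one pass over the full archive, maximum reading per sensor."""
    result = {}
    for sensor, value in events:
        result[sensor] = max(value, result.get(sensor, value))
    return result


broker = PubSub()
stream = StreamingMean()
archive = Archive()
broker.subscribe(stream.on_event)   # streaming consumer
broker.subscribe(archive.on_event)  # archival consumer

# Hypothetical smart-grid readings: (sensor id, value).
for event in [("meter-1", 2.0), ("meter-2", 5.0), ("meter-1", 4.0)]:
    broker.publish(event)

print(stream.mean("meter-1"))     # answer available immediately from the stream
print(batch_max(archive.events))  # answer computed later over archived data
```

Both consumers see identical events, so streaming results can be checked or refined by a later batch pass over the archive.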

Big Data HPC

HPC-ABDS Integrated Software
Layer | Big Data ABDS | HPC, Cluster
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus
Libraries | MLlib/Mahout, R, Python | Matlab, Eclipse, Apps
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, SQL, SPARQL | Fortran, C/C++
Streaming | Storm, Kafka, Kinesis | -
Parallel Runtime | MapReduce | MPI/OpenMP/OpenCL
Coordination | Zookeeper | -
Caching | Memcached | -
Data Management | Hbase, Neo4J, MySQL | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Yarn | Slurm
File Systems | HDFS, Object Stores | Lustre
Formats | Thrift, Protobuf | FITS, HDF
Virtualization | OpenStack | Docker, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS

HPC-ABDS Integrated Software
HPCCloud 3.0
Layer | Big Data ABDS | HPC, Cluster
17. Orchestration | Beam, Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
16. Libraries | MLlib/Mahout, TensorFlow, CNTK, R, Python | ScaLAPACK, PETSc, Matlab
15A. High Level Programming | Pig, Hive, Drill | Domain-specific Languages
15B. Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
14B. Streaming | Storm, Kafka, Kinesis | -
13, 14A. Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL
2. Coordination | Zookeeper | -
12. Caching | Memcached | -
11. Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
10. Data Transfer | Sqoop | GridFTP
9. Scheduling | Yarn, Mesos | Slurm
8. File Systems | HDFS, Object Stores | Lustre
1, 11A. Formats | Thrift, Protobuf | FITS, HDF
5. IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | Intelligent CLOUDS | HPC Clusters, Classic SUPERCOMPUTERS
Also listed: CUDA, Exascale Runtime

HPC-ABDS Integrated Software
Layer | HPC-ABDS Stack | HPC, Cluster
17. Orchestration | Beam, Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
16. Libraries | SPIDAL, MLlib/Mahout, TensorFlow, R, Python | ScaLAPACK, PETSc, Matlab
15A. High Level Programming | Pig, Hive, Drill | Domain-specific Languages
15B. Platform as a Service | Twister2, App Engine, Elastic Beanstalk | HPC Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
14B. Streaming | Heron, Kafka, Kinesis | -
13, 14A. Parallel Runtime | Hadoop, Spark, Harp | MPI/OpenMP/OpenCL
2. Coordination | Zookeeper | -
12. Caching | Memcached | -
11. Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
10. Data Transfer | Sqoop, Data Transfer DTP | GridFTP
9. Scheduling | Yarn, Mesos, Kubernetes | Slurm
8. File Systems | HDFS, Object Stores | Lustre
1, 11A. Formats | Thrift, Protobuf | FITS, HDF
5. IaaS | OpenStack, Docker, KVM | Linux, Bare-metal, SR-IOV
Infrastructure | Intelligent CLOUDS | HPC Clusters, Global AI Supercomputer, Classic Supercomputers
Also listed: CUDA, Exascale Runtime

Initial Convergence Software
Layer | Big Data ABDS | HPC, Cluster
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
Streaming | Storm, Kafka, Kinesis | -
Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL
Coordination | Zookeeper | -
Caching | Memcached | -
Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Mesos, Aurora, Yarn | Slurm
File Systems | HDFS, Object Stores | Lustre
Formats | Thrift, Protobuf | FITS, HDF
IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
Also listed: CUDA, Exascale Runtime

4 Forms of MapReduce (correspond to first 4 of Identified Architectures)
(1) Map Only: BLAST Analysis, Local Machine Learning, Pleasingly Parallel
(2) Classic MapReduce: High Energy Physics (HEP) Histograms, Distributed search, Recommender Engines
(3) Iterative Map Reduce or Map-Collective: Expectation maximization, Clustering e.g. K-means, Linear Algebra, PageRank
(4) Point to Point or Map-Communication: Classic MPI, PDE Solvers and Particle Dynamics, Graph Problems
Diagram labels: Input, map, reduce, Iterations, Output, Local, Graph
Software: MapReduce and Iterative Extensions (Spark, Twister); MPI, Giraph; Integrated Systems such as Hadoop + Harp with Compute and Communication model separated
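As a rough illustration of the first two forms, the sketch below contrasts Map Only (independent per-record work) with Classic MapReduce (map, group by key, reduce), using a histogram as the reduce example. It is a single-process toy, not the Hadoop API. The helper names `map_only`, `map_reduce`, `to_bin`, and `count`, the bin width of 10, and the input values are assumptions made for this example.

```python
from collections import defaultdict


def map_only(records, map_fn):
    """Form (1): apply map_fn to each record independently, with no communication."""
    return [map_fn(record) for record in records]


def map_reduce(records, map_fn, reduce_fn):
    """Form (2): map emits (key, value) pairs, a shuffle groups them by key,
    and reduce combines the values for each key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def to_bin(x):
    """Map function: emit (lower edge of a width-10 bin, 1) for one value."""
    return [(int(x // 10) * 10, 1)]


def count(key, values):
    """Reduce function: number of values that fell into this bin."""
    return sum(values)


energies = [3.0, 12.5, 17.0, 25.0, 8.0, 14.0]  # hypothetical measurements

squared = map_only(energies, lambda x: x * x)   # pleasingly parallel step
histogram = map_reduce(energies, to_bin, count)  # histogram via map + reduce

print(squared)
print(histogram)
```

In a distributed runtime the map calls would run on separate workers and the grouping step would move data between them; here both happen in one process so that the data flow stays visible.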

(5) Map Streaming: maps, brokers, Events
(6) Shared memory Map Communicates: Map & Communicate, Shared Memory

6 Data Analysis Architectures
(1) BLAST Analysis, Local Machine Learning, Pleasingly Parallel
(2) High Energy Physics (HEP) Histograms, Web search, Recommender Engines
(3) Expectation maximization, Clustering, Linear Algebra, PageRank
(4) Classic MPI, PDE Solvers and Particle Dynamics, Graph
(5) Streaming images from Synchrotron sources, Telescopes, IoT
(6) Difficult to parallelize asynchronous parallel Graph Algorithms
Software: Classic Hadoop in classes 1) 2); MapReduce and Iterative Extensions (Spark, Twister); MPI, Giraph; Apache Storm; Harp – Enhanced Hadoop; Maps are Bolts
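Clustering appears above as an iterative, map-collective workload. A minimal sketch of that pattern follows, assuming one-dimensional points and a plain Python loop in place of a real collective. It does not use the Harp, Twister, or MPI APIs; `assign_and_sum`, `allreduce`, and `kmeans` are hypothetical names. Each partition computes partial sums (the map step), the partial sums are combined across partitions (the collective step), and the loop repeats with updated centers.

```python
def assign_and_sum(points, centers):
    """Map step on one partition: partial [sum, count] for each nearest center."""
    partial = [[0.0, 0] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
        partial[nearest][0] += p
        partial[nearest][1] += 1
    return partial


def allreduce(partials):
    """Collective step: element-wise sum of the partial results of all partitions."""
    combined = [[0.0, 0] for _ in partials[0]]
    for partial in partials:
        for k, (s, n) in enumerate(partial):
            combined[k][0] += s
            combined[k][1] += n
    return combined


def kmeans(partitions, centers, iterations=10):
    """Iterate map + collective; an empty cluster keeps its previous center."""
    for _ in range(iterations):
        partials = [assign_and_sum(part, centers) for part in partitions]
        combined = allreduce(partials)
        centers = [s / n if n else c for (s, n), c in zip(combined, centers)]
    return centers


# Two hypothetical data partitions, as if held by two workers.
partitions = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
centers = kmeans(partitions, [0.0, 5.0])
print(centers)
```

Only the small per-center sums cross partition boundaries in each iteration, which is why a fast collective operation matters more here than in a single-pass MapReduce job.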

[Figure: K-means clustering performance. Axes: Time (secs), Efficiency, # Cores]
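The K-means slide names time in seconds, efficiency, and number of cores but the transcript does not say how efficiency was computed. One common definition, assumed here, is efficiency relative to a reference run: (reference time x reference cores) / (time x cores). The timings below are hypothetical placeholders, not values from the slide.

```python
def parallel_efficiency(t_ref, p_ref, t, p):
    """Efficiency of a run on p cores taking t seconds, relative to a
    reference run on p_ref cores taking t_ref seconds."""
    return (t_ref * p_ref) / (t * p)


# Hypothetical timings in seconds, keyed by core count (NOT measured data).
timings = {1: 120.0, 2: 62.0, 4: 33.0, 8: 20.0}

for cores, secs in timings.items():
    eff = parallel_efficiency(timings[1], 1, secs, cores)
    print(f"{cores} cores: {secs:.1f} s, efficiency {eff:.2f}")
```

An efficiency of 1.0 means the run scaled perfectly against the reference; values below 1.0 reflect communication and other parallel overhead.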

Software-Defined Distributed System (SDDS) as a Service includes:
Software (Application or Usage), SaaS: Use HPC-ABDS; Class Usages e.g. run GPU & multicore; Applications; Control Robot
Platform, PaaS: Cloud e.g. MapReduce; HPC e.g. PETSc, SAGA; Computer Science e.g. Compiler tools, Sensor nets, Monitors
Infrastructure, IaaS: Software Defined Computing (virtual Clusters); Hypervisor, Bare Metal; Operating System
Network, NaaS: Software Defined Networks; OpenFlow GENI
SDDS-aaS Tools: Provisioning; Image Management; IaaS Interoperability; NaaS, IaaS tools; Expt management; Dynamic IaaS NaaS; DevOps
Dynamic Orchestration and Dataflow
CloudMesh is a SDDSaaS tool that uses Dynamic Provisioning and Image Management to provide custom environments for general target systems. It involves (1) creating, (2) deploying, and (3) provisioning of one or more images in a set of machines on demand. http://mycloudmesh.org/

Figure 3: Dual Convergence System. Data Management Model for Big Data and Simulation.