Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data

Presentation transcript:

Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
Data source: Internet of Things (Smart Grid)
Pub-Sub System
Storm: Streaming Processing (Iterative MapReduce)
Batch Processing (Iterative MapReduce)
Archival Storage: NoSQL like HBase
Orchestration / Dataflow / Workflow
DIKW pipeline: Raw Data → Data → Information → Knowledge → Wisdom → Decisions

Data Ingest
Pub-Sub System
Storm: Streaming Processing (Bolts)
Batch Processing (MapReduce)
Archival Storage: Accumulo
Orchestration / Dataflow / Workflow
DIKW pipeline: Raw Data → Data → Information → Knowledge → Wisdom → Decisions
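The flow on this slide (pub-sub ingest, streaming bolts, archival storage, batch MapReduce) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Broker`, `threshold_bolt`, `batch_mean_load`, the `readings` topic, and the smart-meter records are hypothetical stand-ins, not the APIs of Storm, Kafka, Accumulo, or Hadoop.

```python
from collections import defaultdict, deque


class Broker:
    """In-memory stand-in for the pub-sub system (hypothetical, not a real broker API)."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def consume(self, topic):
        # Drain the topic in arrival order.
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()


def threshold_bolt(event, limit=100.0):
    """Streaming step: tag each raw reading as it arrives."""
    return {**event, "alert": event["load_kw"] > limit}


def batch_mean_load(archive):
    """Batch step: group archived records by meter (map), then average (reduce)."""
    grouped = defaultdict(list)
    for record in archive:
        grouped[record["meter"]].append(record["load_kw"])
    return {meter: sum(vals) / len(vals) for meter, vals in grouped.items()}


def run_pipeline(raw_events):
    broker = Broker()
    archive = []  # stand-in for archival storage
    for event in raw_events:
        broker.publish("readings", event)
    alerts = []
    for event in broker.consume("readings"):
        processed = threshold_bolt(event)
        archive.append(processed)  # every processed event is archived for batch use
        if processed["alert"]:
            alerts.append(processed["meter"])
    return alerts, batch_mean_load(archive)


events = [
    {"meter": "m1", "load_kw": 80.0},
    {"meter": "m2", "load_kw": 120.0},
    {"meter": "m1", "load_kw": 100.0},
]
alerts, means = run_pipeline(events)
```

The streaming path produces decisions per event (the alerts), while the batch path later recomputes aggregate knowledge from the archive (the per-meter means); a real deployment would run the two concurrently under an orchestration layer.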

Big Data and HPC

HPC-ABDS Integrated Software
Layer | Big Data ABDS | HPC, Cluster
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus
Libraries | MLlib/Mahout, R, Python | Matlab, Eclipse, Apps
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, SQL, SPARQL | Fortran, C/C++
Streaming | Storm, Kafka, Kinesis |
Parallel Runtime | MapReduce | MPI/OpenMP/OpenCL
Coordination | Zookeeper |
Caching | Memcached |
Data Management | Hbase, Neo4J, MySQL | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Yarn | Slurm
File Systems | HDFS, Object Stores | Lustre
Formats | Thrift, Protobuf | FITS, HDF
Virtualization | OpenStack | Docker, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS

HPC-ABDS Integrated Software
Layer | Big Data ABDS | HPC, Cluster
17. Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
16. Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
15A. High Level Programming | Pig, Hive, Drill | Domain-specific Languages
15B. Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
14B. Streaming | Storm, Kafka, Kinesis |
13, 14A. Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL
2. Coordination | Zookeeper |
12. Caching | Memcached |
11. Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
10. Data Transfer | Sqoop | GridFTP
9. Scheduling | Yarn | Slurm
8. File Systems | HDFS, Object Stores | Lustre
1, 11A. Formats | Thrift, Protobuf | FITS, HDF
5. IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
CUDA, Exascale Runtime

Initial Convergence Software
Layer | Big Data ABDS | HPC, Cluster
Orchestration | Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
Libraries | MLlib/Mahout, R, Python | ScaLAPACK, PETSc, Matlab
High Level Programming | Pig, Hive, Drill | Domain-specific Languages
Platform as a Service | App Engine, BlueMix, Elastic Beanstalk | XSEDE Software Stack
Languages | Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
Streaming | Storm, Kafka, Kinesis |
Parallel Runtime | Hadoop, MapReduce | MPI/OpenMP/OpenCL
Coordination | Zookeeper |
Caching | Memcached |
Data Management | Hbase, Accumulo, Neo4J, MySQL | iRODS
Data Transfer | Sqoop | GridFTP
Scheduling | Mesos, Aurora, Yarn | Slurm
File Systems | HDFS, Object Stores | Lustre
Formats | Thrift, Protobuf | FITS, HDF
IaaS | OpenStack, Docker | Linux, Bare-metal, SR-IOV
Infrastructure | CLOUDS | SUPERCOMPUTERS
CUDA, Exascale Runtime

4 Forms of MapReduce
(1) Map Only: Pleasingly Parallel; BLAST Analysis; Local Machine Learning
(2) Classic MapReduce: High Energy Physics (HEP) Histograms; Distributed search; Recommender Engines
(3) Iterative Map Reduce or Map-Collective: Expectation maximization; Clustering e.g. K-means; Linear Algebra, PageRank
(4) Point to Point or Map-Communication: Classic MPI; PDE Solvers and Particle Dynamics; Graph Problems
Diagram labels: Input, map, reduce, Iterations, Output, Local, Graph
Software: MapReduce and Iterative Extensions (Spark, Twister); MPI, Giraph
Integrated Systems such as Hadoop + Harp with Compute and Communication model separated
Correspond to first 4 of Identified Architectures
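The first three forms can be illustrated with a minimal single-process sketch in plain Python. It makes no claim about how Hadoop, Spark, Twister, or Harp implement them; the helper names `map_only`, `map_reduce`, `histogram`, and `kmeans_1d` are hypothetical, and the K-means variant is restricted to one dimension to keep the example short.

```python
from collections import defaultdict


def map_only(records, fn):
    """(1) Map Only: each record is processed independently, with no reduce."""
    return [fn(r) for r in records]


def map_reduce(records, mapper, reducer):
    """(2) Classic MapReduce: map to (key, value) pairs, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


def histogram(values, bin_width):
    """Histogram as a classic MapReduce job: key = bin index, value = 1."""
    return map_reduce(values, lambda v: [(int(v // bin_width), 1)], sum)


def kmeans_1d(points, centers, iterations=10):
    """(3) Iterative MapReduce: repeat map (assign) and reduce (average)."""
    for _ in range(iterations):
        def assign(p):
            # Map step: emit (index of nearest center, point).
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            return [(nearest, p)]

        # Reduce step: the new center is the mean of its assigned points.
        means = map_reduce(points, assign, lambda vs: sum(vs) / len(vs))
        # Keep the old center if no point was assigned to it.
        centers = [means.get(i, c) for i, c in enumerate(centers)]
    return centers


squares = map_only([1, 2, 3], lambda x: x * x)
bins = histogram([1, 2, 11, 12, 25], 10)
final_centers = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
```

The iterative form differs from the classic one only in that the reduce output (the centers) feeds the next map phase, which is why frameworks that keep data in memory between iterations suit it better than rerunning independent jobs.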

(5) Map Streaming: Events → brokers → maps
(6) Shared memory Map Communicates: Map & Communicate through Shared Memory

6 Data Analysis Architectures
(1) Pleasingly Parallel; BLAST Analysis; Local Machine Learning
(2) High Energy Physics (HEP) Histograms; Web search; Recommender Engines
(3) Expectation maximization; Clustering; Linear Algebra, PageRank
(4) Classic MPI; PDE Solvers and Particle Dynamics; Graph
(5) Streaming images from Synchrotron sources, Telescopes, IoT
(6) Difficult to parallelize asynchronous parallel Graph Algorithms
Software: MapReduce and Iterative Extensions (Spark, Twister); MPI, Giraph; Apache Storm
Harp: Enhanced Hadoop
Maps are Bolts
Classic Hadoop in classes 1) 2)

[Figure: K-means Clustering. Axes: Time (secs) and Efficiency versus # Cores]

Software-Defined Distributed System (SDDS) as a Service includes:

Infrastructure (IaaS)
- Software Defined Computing (virtual Clusters)
- Hypervisor, Bare Metal
- Operating System

Platform (PaaS)
- Cloud e.g. MapReduce
- HPC e.g. PETSc, SAGA
- Computer Science e.g. Compiler tools, Sensor nets, Monitors

Network (NaaS)
- Software Defined Networks
- OpenFlow GENI

Software (Application Or Usage) (SaaS)
- Use HPC-ABDS
- Class Usages e.g. run GPU & multicore
- Applications
- Control Robot

SDDS-aaS Tools
- Provisioning
- Image Management
- IaaS Interoperability
- NaaS, IaaS tools
- Expt management
- Dynamic IaaS NaaS
- DevOps

CloudMesh is an SDDSaaS tool that uses Dynamic Provisioning and Image Management to provide custom environments for general target systems.
It involves (1) creating, (2) deploying, and (3) provisioning one or more images on a set of machines on demand.

17 Dynamic Orchestration and Dataflow
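The three steps named above (creating, deploying, and provisioning images on demand) can be modeled with a toy sketch. Everything here is an assumption made for illustration: the `Image` and `Machine` classes, the function names, and the image name `hadoop-harp` are hypothetical and are not the CloudMesh API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Image:
    """Hypothetical description of a software environment."""
    name: str
    packages: tuple


@dataclass
class Machine:
    """Hypothetical target machine tracked by the provisioning tool."""
    host: str
    image: Optional[Image] = None
    state: str = "idle"


def create_image(name, packages):
    """Step 1: describe the software environment as an image."""
    return Image(name=name, packages=tuple(packages))


def deploy(image, machines):
    """Step 2: place the image on each target machine."""
    for machine in machines:
        machine.image = image
        machine.state = "deployed"


def provision(machines):
    """Step 3: start every deployed machine and return the hosts now running."""
    ready = []
    for machine in machines:
        if machine.state == "deployed":
            machine.state = "running"
            ready.append(machine.host)
    return ready


def provision_on_demand(name, packages, hosts):
    """Run the three steps in order for a requested set of hosts."""
    machines = [Machine(host=h) for h in hosts]
    image = create_image(name, packages)
    deploy(image, machines)
    return provision(machines), machines


ready, machines = provision_on_demand("hadoop-harp", ["hadoop", "harp"], ["n1", "n2"])
```

Keeping the image description separate from the machines it lands on is what lets one image be reused across different target systems, which is the point of the image management tooling listed above.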