HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC

Presentation transcript:

HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
1st JTC 1 SGBD Meeting, SDSC, San Diego, March 19 2014
Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox
gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

Enhanced Apache Big Data Stack (ABDS)
- ~120 capabilities, >40 of them Apache projects
- Green layers have strong HPC integration opportunities
- Goal: the functionality of ABDS with the performance of HPC

Broad Layers in HPC-ABDS
- Workflow-orchestration
- Application and analytics
- High-level programming
- Basic programming model and runtime: SPMD, streaming, MapReduce, MPI
- Inter-process communication: collectives, point-to-point, publish-subscribe
- In-memory databases/caches
- Object-relational mapping
- SQL and NoSQL, file management
- Data transport
- Cluster resource management (Yarn, Slurm, SGE)
- File systems (HDFS, Lustre, ...)
- DevOps (Puppet, Chef, ...)
- IaaS management from HPC to hypervisors (OpenStack)
- Cross-cutting: message protocols, distributed coordination, security & privacy, monitoring

Getting High Performance on Data Analytics (e.g. Mahout, R, ...)
On the systems side, we have two principles:
- The Apache Big Data Stack, with ~120 projects, has important broad functionality and a vital, large support organization.
- HPC, including MPI, has striking success in delivering high performance, but with a fragile sustainability model.
There are key systems abstractions, which are levels in the HPC-ABDS software stack, where the Apache approach needs careful integration with HPC:
- Resource management
- Storage
- Programming model: horizontally scalable parallelism
- Collective and point-to-point communication
- Support of iteration
- Data interface (not just key-value)
In application areas, we define application abstractions to support graphs/networks, geospatial data, images, etc.

4 Forms of MapReduce
- (a) Map Only (pleasingly parallel): BLAST analysis, parametric sweeps
- (b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search
- (c) Iterative MapReduce: expectation maximization, clustering (e.g. Kmeans), linear algebra, PageRank
- (d) Loosely Synchronous: classic MPI, PDE solvers and particle dynamics
(Figure labels: input, map, reduce, iterations, output, Pij.) Forms (a)-(c) are the domain of MapReduce and its iterative extensions and of Science Clouds; forms (c)-(d) are the domain of MPI and Giraph.
MPI is Map followed by point-to-point or collective communication, as in style (c) plus (d).
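To make style (c) concrete, below is a minimal, self-contained sketch of iterative MapReduce as a 1-D Kmeans in plain Hadoop: the driver resubmits a job per iteration and passes the current centroids through the job configuration. The class names and input layout (one point per line, centroids as a comma-separated string) are illustrative assumptions for this sketch, not code from Twister or Harp.

```java
// Sketch of style (c): iterative MapReduce as a 1-D Kmeans driver loop in plain Hadoop.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeKMeans {

  // Map: assign each 1-D point to the index of its nearest centroid.
  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private double[] centroids;

    @Override
    protected void setup(Context context) {
      String[] parts = context.getConfiguration().get("kmeans.centroids").split(",");
      centroids = new double[parts.length];
      for (int i = 0; i < parts.length; i++) centroids[i] = Double.parseDouble(parts[i]);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      double p = Double.parseDouble(value.toString().trim());
      int best = 0;
      for (int i = 1; i < centroids.length; i++) {
        if (Math.abs(p - centroids[i]) < Math.abs(p - centroids[best])) best = i;
      }
      context.write(new IntWritable(best), new DoubleWritable(p));
    }
  }

  // Reduce: the new centroid is the mean of the points assigned to it.
  public static class AverageReducer
      extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long n = 0;
      for (DoubleWritable v : values) { sum += v.get(); n++; }
      context.write(key, new DoubleWritable(sum / n));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);            // one point per line
    String centroids = args[1];                // e.g. "0.0,5.0,10.0"
    int iterations = Integer.parseInt(args[2]);

    for (int iter = 0; iter < iterations; iter++) {
      conf.set("kmeans.centroids", centroids);
      Path out = new Path(args[0] + "-centroids-" + iter);

      Job job = Job.getInstance(conf, "kmeans-iter-" + iter);
      job.setJarByClass(IterativeKMeans.class);
      job.setMapperClass(AssignMapper.class);
      job.setReducerClass(AverageReducer.class);
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) System.exit(1);

      // Classic MapReduce: the driver reads the new centroids back from HDFS and the
      // next iteration re-reads the whole input; this per-iteration disk traffic is
      // exactly what iterative runtimes (Twister, Harp, Spark) avoid by caching.
      StringBuilder next = new StringBuilder();
      FileSystem fs = FileSystem.get(conf);
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(fs.open(new Path(out, "part-r-00000"))))) {
        String line;
        while ((line = r.readLine()) != null) {
          if (next.length() > 0) next.append(',');
          next.append(line.split("\t")[1]);
        }
      }
      centroids = next.toString();
    }
    System.out.println("Final centroids: " + centroids);
  }
}
```

The same loop written against an iterative runtime keeps the points and centroids resident between iterations, which is the performance difference the later Kmeans benchmarks in this talk measure.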

HPC-ABDS Hourglass
The HPC-ABDS system (middleware) of 120 software projects supports high-performance applications.
System abstractions/standards (the narrow waist of the hourglass):
- Data format and storage
- HPC Yarn for resource management
- Horizontally scalable parallel programming model
- Collective and point-to-point communication
- Support of iteration
Application abstractions/standards: graphs, networks, images, geospatial, ...
SPIDAL (Scalable Parallel Interoperable Data Analytics Library), or high-performance Mahout, R, Matlab, ...

Integrating Yarn with HPC

We are sort of working on Use Cases with HPC-ABDS:
- Use Case 10, Internet of Things: Yarn, Storm, ActiveMQ
- Use Cases 19, 20, Genomics: Hadoop, Iterative MapReduce, MPI; much better analytics than Mahout
- Use Case 26, Deep Learning: high-performance distributed GPU (optimized collectives) with Python front end (planned)
- Variant of Use Cases 26, 27, Image classification using Kmeans: Iterative MapReduce
- Use Case 28, Twitter: optimized index for HBase, Hadoop and Iterative MapReduce
- Use Case 30, Network Science: MPI and Giraph for network structure and dynamics (planned)
- Use Case 39, Particle Physics: Iterative MapReduce (wrote proposal)
- Use Case 43, Radar Image Analysis: Hadoop for multiple individual images, moving to Iterative MapReduce for global integration over "all" images
- Use Case 44, Radar Images: running on Amazon

Features of Harp Hadoop Plug-in
- Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
- Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness
- Collective communication model to support various communication operations on the data abstractions
- Caching with buffer management for memory allocation required from computation and communication
- BSP-style parallelism
- Fault tolerance with checkpointing
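The map-collective model can be illustrated without any Harp code: below is a schematic, self-contained Java sketch in which several "map tasks" each compute a partial result and then take part in an allreduce-style collective, after which every task continues with the combined value. The thread-based simulation and all class and method names are illustrative of the pattern only, not Harp's actual API; in Harp the tasks are distributed Hadoop map tasks operating on its array, key-value and graph abstractions.

```java
// Schematic sketch of the map-collective pattern: local work, then an allreduce,
// then every task continues with the combined result. Illustrative only, not Harp's API.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class MapCollectiveSketch {
  static final int TASKS = 4;
  static final double[] GLOBAL_SUM = new double[2];        // shared "allreduce" buffer: sum, count
  static final CyclicBarrier BARRIER = new CyclicBarrier(TASKS);

  // One "map task": sum its local partition, contribute to the collective, read the result.
  static void mapCollective(int taskId, double[] localPartition)
      throws InterruptedException, BrokenBarrierException {
    double localSum = 0;
    for (double v : localPartition) localSum += v;           // local "map" work

    synchronized (GLOBAL_SUM) {                              // reduce step of the allreduce
      GLOBAL_SUM[0] += localSum;
      GLOBAL_SUM[1] += localPartition.length;
    }
    BARRIER.await();                                         // wait until every task contributed
    double mean = GLOBAL_SUM[0] / GLOBAL_SUM[1];             // broadcast step: each task reads
    System.out.printf("task %d sees global mean %.3f%n", taskId, mean);
  }

  public static void main(String[] args) {
    List<Thread> threads = new ArrayList<>();
    for (int t = 0; t < TASKS; t++) {
      final int id = t;
      final double[] partition = {id + 1.0, id + 2.0, id + 3.0};  // toy local data
      Thread thread = new Thread(() -> {
        try { mapCollective(id, partition); }
        catch (Exception e) { throw new RuntimeException(e); }
      });
      threads.add(thread);
      thread.start();
    }
    for (Thread thread : threads) {
      try { thread.join(); } catch (InterruptedException ignored) { }
    }
  }
}
```

In Harp the shared buffer and barrier are replaced by collective operations over its table abstractions between distributed map tasks, which is what brings ~MPI collective performance to Hadoop jobs.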

Architecture
- Application layer: MapReduce applications and Map-Collective applications
- Framework layer: MapReduce V2, with Harp plugged in to support the Map-Collective applications
- Resource Manager layer: YARN

Performance on Madrid Cluster (8 nodes)
(Chart: increasing communication, identical computation.) The computation is the same in each case because the product of the number of centers and the number of points is held constant; for example, halving the number of points while doubling the number of centers leaves the distance computation unchanged while increasing the centroid data that must be communicated.

(Chart legend, increasing communication with identical computation:)
- Mahout and Hadoop MR: slow due to MapReduce
- Python: slow as scripting
- Spark: Iterative MapReduce, non-optimal communication
- Harp: Hadoop plug-in with ~MPI collectives
- MPI: fastest, as C not Java

Performance of MPI Kernel Operations
Pure Java, as in FastMPJ, is slower than Java interfacing to the C version of MPI.

Use Case 28, Truthy: information diffusion research from Twitter data
Building blocks:
- Yarn
- Parallel query evaluation using Hadoop MapReduce
- Related hashtag mining algorithm using Hadoop MapReduce
- Meme daily frequency generation using MapReduce over index tables
- Parallel force-directed graph layout algorithm using Twister (Harp) iterative MapReduce
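As an illustration of these MapReduce building blocks, here is a minimal sketch of meme daily frequency generation as a Hadoop MapReduce job. It assumes each input line is a tab-separated (date, meme) pair already extracted from the index tables; that layout and the class names are assumptions for the sketch, not Truthy's actual schema or code.

```java
// Sketch of meme daily frequency generation: count occurrences per (date, meme) key.
// Assumes input lines of the form "<yyyy-MM-dd>\t<meme>" (an assumption, not Truthy's schema).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MemeDailyFrequency {

  // Map: emit ((date, meme), 1) for each occurrence.
  public static class DayMemeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text dayMeme = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) return;                  // skip malformed lines
      dayMeme.set(fields[0] + "\t" + fields[1]);      // composite key: date + meme
      context.write(dayMeme, ONE);
    }
  }

  // Reduce (also used as combiner): sum the counts per (date, meme) key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) total += v.get();
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "meme-daily-frequency");
    job.setJarByClass(MemeDailyFrequency.class);
    job.setMapperClass(DayMemeMapper.class);
    job.setCombinerClass(SumReducer.class);           // local pre-aggregation before shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

This kind of single-pass aggregation is a good fit for classic MapReduce; the force-directed graph layout in the next building block is iterative and is what motivates Twister/Harp.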

Use Case 28, Truthy: information diffusion research from Twitter data (continued)
(Charts: two months' data loading for varied cluster size; scalability of the iterative graph layout algorithm on Twister; Hadoop-FS, not indexed, shown alongside.)

Pig Performance
(Chart comparing Hadoop, Harp-Hadoop, Pig + HD1 (Hadoop), and Pig + Yarn.)

Lines of Code
                    Pig Kmeans   Hadoop Kmeans   Pig IndexedHBase     IndexedHBase
                                                 meme-cooccur-count   meme-cooccur-count
Java                ~345         780             152                  ~434
Pig                 10           -               10                   -
Python / Bash       ~40          -               -                    28
Total Lines         395          780             162                  462

DACIDR for Gene Analysis (Use Cases 19, 20)
Deterministic Annealing Clustering and Interpolative Dimension Reduction method (DACIDR).
Uses Hadoop for pleasingly parallel applications and Twister (being replaced by Yarn) for iterative MapReduce applications.
(Simplified flow chart of DACIDR, with stages including: sequences, all-pair sequence alignment, pairwise clustering, multidimensional scaling, streaming, visualization, cluster centers, and adding existing data to find the phylogenetic tree.)

Summarize a Million Fungi Sequences: Spherical Phylogram Visualization
(Figures: spherical phylogram from the new MDS method visualized in PlotViz; RAxML result visualized in FigTree.)

Lessons / Insights
- Integrate (don't compete) HPC with "Commodity Big Data" (Google to Amazon to enterprise data analytics), i.e. improve Mahout; don't compete with it.
- Use Hadoop plug-ins rather than replacing Hadoop.
- The Enhanced Apache Big Data Stack HPC-ABDS has 120 members – please improve it!
- HPC-ABDS integration areas include file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, workflow, and monitoring.