HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC

Presentation transcript:

HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
1st JTC 1 SGBD Meeting, SDSC, San Diego, March 19 2014
Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox
gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

Enhanced Apache Big Data Stack (ABDS)
- ~120 capabilities, >40 of them Apache projects
- Green layers have strong HPC integration opportunities
- Goal: the functionality of ABDS with the performance of HPC

Broad Layers in HPC-ABDS
- Workflow-orchestration
- Application and analytics
- High-level programming
- Basic programming model and runtime: SPMD, streaming, MapReduce, MPI
- Inter-process communication: collectives, point-to-point, publish-subscribe
- In-memory databases/caches
- Object-relational mapping
- SQL and NoSQL, file management
- Data transport
- Cluster resource management (Yarn, Slurm, SGE)
- File systems (HDFS, Lustre, ...)
- DevOps (Puppet, Chef, ...)
- IaaS management from HPC to hypervisors (OpenStack)
- Cross-cutting: message protocols, distributed coordination, security & privacy, monitoring

Getting High Performance on Data Analytics (e.g. Mahout, R, ...)
On the systems side, we have two principles:
- The Apache Big Data Stack, with ~120 projects, has important broad functionality and a vital, large support organization.
- HPC, including MPI, has striking success in delivering high performance, but with a fragile sustainability model.
There are key systems abstractions, which are levels in the HPC-ABDS software stack, where the Apache approach needs careful integration with HPC:
- Resource management
- Storage
- Programming model: horizontally scalable parallelism
- Collective and point-to-point communication
- Support of iteration
- Data interface (not just key-value)
In application areas, we define application abstractions to support graphs/networks, geospatial data, images, etc.

4 Forms of MapReduce
- (a) Map Only (pleasingly parallel): BLAST analysis, parametric sweeps
- (b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search
- (c) Iterative MapReduce: expectation maximization, clustering (e.g. Kmeans), linear algebra, PageRank
- (d) Loosely Synchronous: classic MPI, PDE solvers and particle dynamics
(Figure labels: input, map, reduce, iterations, output, Pij.) Forms (a)-(c) are the domain of MapReduce and its iterative extensions and of Science Clouds; forms (c)-(d) are the domain of MPI and Giraph.
MPI is Map followed by point-to-point or collective communication, as in style (c) plus (d).
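To make style (c) concrete, below is a minimal, self-contained sketch of iterative MapReduce as a 1-D Kmeans in plain Hadoop: the driver resubmits a job per iteration and passes the current centroids through the job configuration. The class names and input layout (one point per line, centroids as a comma-separated string) are illustrative assumptions for this sketch, not code from Twister or Harp.

```java
// Sketch of style (c): iterative MapReduce as a 1-D Kmeans driver loop in plain Hadoop.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeKMeans {

  // Map: assign each 1-D point to the index of its nearest centroid.
  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private double[] centroids;

    @Override
    protected void setup(Context context) {
      String[] parts = context.getConfiguration().get("kmeans.centroids").split(",");
      centroids = new double[parts.length];
      for (int i = 0; i < parts.length; i++) centroids[i] = Double.parseDouble(parts[i]);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      double p = Double.parseDouble(value.toString().trim());
      int best = 0;
      for (int i = 1; i < centroids.length; i++) {
        if (Math.abs(p - centroids[i]) < Math.abs(p - centroids[best])) best = i;
      }
      context.write(new IntWritable(best), new DoubleWritable(p));
    }
  }

  // Reduce: the new centroid is the mean of the points assigned to it.
  public static class AverageReducer
      extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long n = 0;
      for (DoubleWritable v : values) { sum += v.get(); n++; }
      context.write(key, new DoubleWritable(sum / n));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);            // one point per line
    String centroids = args[1];                // e.g. "0.0,5.0,10.0"
    int iterations = Integer.parseInt(args[2]);

    for (int iter = 0; iter < iterations; iter++) {
      conf.set("kmeans.centroids", centroids);
      Path out = new Path(args[0] + "-centroids-" + iter);

      Job job = Job.getInstance(conf, "kmeans-iter-" + iter);
      job.setJarByClass(IterativeKMeans.class);
      job.setMapperClass(AssignMapper.class);
      job.setReducerClass(AverageReducer.class);
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) System.exit(1);

      // Classic MapReduce: the driver reads the new centroids back from HDFS and the
      // next iteration re-reads the whole input; this per-iteration disk traffic is
      // exactly what iterative runtimes (Twister, Harp, Spark) avoid by caching.
      StringBuilder next = new StringBuilder();
      FileSystem fs = FileSystem.get(conf);
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(fs.open(new Path(out, "part-r-00000"))))) {
        String line;
        while ((line = r.readLine()) != null) {
          if (next.length() > 0) next.append(',');
          next.append(line.split("\t")[1]);
        }
      }
      centroids = next.toString();
    }
    System.out.println("Final centroids: " + centroids);
  }
}
```

The same loop written against an iterative runtime keeps the points and centroids resident between iterations, which is the performance difference the later Kmeans benchmarks in this talk measure.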

HPC-ABDS Hourglass
The HPC-ABDS system (middleware) of 120 software projects supports high-performance applications.
System abstractions/standards (the narrow waist of the hourglass):
- Data format and storage
- HPC Yarn for resource management
- Horizontally scalable parallel programming model
- Collective and point-to-point communication
- Support of iteration
Application abstractions/standards: graphs, networks, images, geospatial, ...
SPIDAL (Scalable Parallel Interoperable Data Analytics Library), or high-performance Mahout, R, Matlab, ...

Integrating Yarn with HPC

We are sort of working on Use Cases with HPC-ABDS:
- Use Case 10, Internet of Things: Yarn, Storm, ActiveMQ
- Use Cases 19, 20, Genomics: Hadoop, Iterative MapReduce, MPI; much better analytics than Mahout
- Use Case 26, Deep Learning: high-performance distributed GPU (optimized collectives) with Python front end (planned)
- Variant of Use Cases 26, 27, Image classification using Kmeans: Iterative MapReduce
- Use Case 28, Twitter: optimized index for HBase, Hadoop and Iterative MapReduce
- Use Case 30, Network Science: MPI and Giraph for network structure and dynamics (planned)
- Use Case 39, Particle Physics: Iterative MapReduce (wrote proposal)
- Use Case 43, Radar Image Analysis: Hadoop for multiple individual images, moving to Iterative MapReduce for global integration over "all" images
- Use Case 44, Radar Images: running on Amazon

Features of Harp Hadoop Plug-in
- Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
- Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness
- Collective communication model to support various communication operations on the data abstractions
- Caching with buffer management for memory allocation required from computation and communication
- BSP-style parallelism
- Fault tolerance with checkpointing
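The map-collective model can be illustrated without any Harp code: below is a schematic, self-contained Java sketch in which several "map tasks" each compute a partial result and then take part in an allreduce-style collective, after which every task continues with the combined value. The thread-based simulation and all class and method names are illustrative of the pattern only, not Harp's actual API; in Harp the tasks are distributed Hadoop map tasks operating on its array, key-value and graph abstractions.

```java
// Schematic sketch of the map-collective pattern: local work, then an allreduce,
// then every task continues with the combined result. Illustrative only, not Harp's API.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class MapCollectiveSketch {
  static final int TASKS = 4;
  static final double[] GLOBAL_SUM = new double[2];        // shared "allreduce" buffer: sum, count
  static final CyclicBarrier BARRIER = new CyclicBarrier(TASKS);

  // One "map task": sum its local partition, contribute to the collective, read the result.
  static void mapCollective(int taskId, double[] localPartition)
      throws InterruptedException, BrokenBarrierException {
    double localSum = 0;
    for (double v : localPartition) localSum += v;           // local "map" work

    synchronized (GLOBAL_SUM) {                              // reduce step of the allreduce
      GLOBAL_SUM[0] += localSum;
      GLOBAL_SUM[1] += localPartition.length;
    }
    BARRIER.await();                                         // wait until every task contributed
    double mean = GLOBAL_SUM[0] / GLOBAL_SUM[1];             // broadcast step: each task reads
    System.out.printf("task %d sees global mean %.3f%n", taskId, mean);
  }

  public static void main(String[] args) {
    List<Thread> threads = new ArrayList<>();
    for (int t = 0; t < TASKS; t++) {
      final int id = t;
      final double[] partition = {id + 1.0, id + 2.0, id + 3.0};  // toy local data
      Thread thread = new Thread(() -> {
        try { mapCollective(id, partition); }
        catch (Exception e) { throw new RuntimeException(e); }
      });
      threads.add(thread);
      thread.start();
    }
    for (Thread thread : threads) {
      try { thread.join(); } catch (InterruptedException ignored) { }
    }
  }
}
```

In Harp the shared buffer and barrier are replaced by collective operations over its table abstractions between distributed map tasks, which is what brings ~MPI collective performance to Hadoop jobs.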

Architecture
- Application layer: MapReduce applications and Map-Collective applications
- Framework layer: MapReduce V2, with Harp plugged in to support the Map-Collective applications
- Resource Manager layer: YARN

Performance on Madrid Cluster (8 nodes)
(Chart: increasing communication, identical computation.) The computation is the same in each case because the product of the number of centers and the number of points is held constant; for example, halving the number of points while doubling the number of centers leaves the distance computation unchanged while increasing the centroid data that must be communicated.

(Chart legend, increasing communication with identical computation:)
- Mahout and Hadoop MR: slow due to MapReduce
- Python: slow as scripting
- Spark: Iterative MapReduce, non-optimal communication
- Harp: Hadoop plug-in with ~MPI collectives
- MPI: fastest, as C not Java

Performance of MPI Kernel Operations
Pure Java, as in FastMPJ, is slower than Java interfacing to the C version of MPI.

Use Case 28, Truthy: information diffusion research from Twitter data
Building blocks:
- Yarn
- Parallel query evaluation using Hadoop MapReduce
- Related hashtag mining algorithm using Hadoop MapReduce
- Meme daily frequency generation using MapReduce over index tables
- Parallel force-directed graph layout algorithm using Twister (Harp) iterative MapReduce
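As an illustration of these MapReduce building blocks, here is a minimal sketch of meme daily frequency generation as a Hadoop MapReduce job. It assumes each input line is a tab-separated (date, meme) pair already extracted from the index tables; that layout and the class names are assumptions for the sketch, not Truthy's actual schema or code.

```java
// Sketch of meme daily frequency generation: count occurrences per (date, meme) key.
// Assumes input lines of the form "<yyyy-MM-dd>\t<meme>" (an assumption, not Truthy's schema).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MemeDailyFrequency {

  // Map: emit ((date, meme), 1) for each occurrence.
  public static class DayMemeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text dayMeme = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) return;                  // skip malformed lines
      dayMeme.set(fields[0] + "\t" + fields[1]);      // composite key: date + meme
      context.write(dayMeme, ONE);
    }
  }

  // Reduce (also used as combiner): sum the counts per (date, meme) key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) total += v.get();
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "meme-daily-frequency");
    job.setJarByClass(MemeDailyFrequency.class);
    job.setMapperClass(DayMemeMapper.class);
    job.setCombinerClass(SumReducer.class);           // local pre-aggregation before shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

This kind of single-pass aggregation is a good fit for classic MapReduce; the force-directed graph layout in the next building block is iterative and is what motivates Twister/Harp.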

Use Case 28, Truthy: information diffusion research from Twitter data (continued)
(Charts: two months' data loading for varied cluster size; scalability of the iterative graph layout algorithm on Twister; Hadoop-FS, not indexed, shown alongside.)

Pig Performance
(Chart comparing Hadoop, Harp-Hadoop, Pig + HD1 (Hadoop), and Pig + Yarn.)

Lines of Code
                    Pig Kmeans   Hadoop Kmeans   Pig IndexedHBase     IndexedHBase
                                                 meme-cooccur-count   meme-cooccur-count
Java                ~345         780             152                  ~434
Pig                 10           -               10                   -
Python / Bash       ~40          -               -                    28
Total Lines         395          780             162                  462

DACIDR for Gene Analysis (Use Cases 19, 20)
Deterministic Annealing Clustering and Interpolative Dimension Reduction method (DACIDR).
Uses Hadoop for pleasingly parallel applications and Twister (being replaced by Yarn) for iterative MapReduce applications.
(Simplified flow chart of DACIDR, with stages including: sequences, all-pair sequence alignment, pairwise clustering, multidimensional scaling, streaming, visualization, cluster centers, and adding existing data to find the phylogenetic tree.)

Summarize a Million Fungi Sequences: Spherical Phylogram Visualization
(Figures: spherical phylogram from the new MDS method visualized in PlotViz; RAxML result visualized in FigTree.)

Lessons / Insights
- Integrate (don't compete) HPC with "Commodity Big Data" (Google to Amazon to enterprise data analytics), i.e. improve Mahout; don't compete with it.
- Use Hadoop plug-ins rather than replacing Hadoop.
- The Enhanced Apache Big Data Stack HPC-ABDS has 120 members – please improve it!
- HPC-ABDS integration areas include file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, workflow, and monitoring.