Scalable Parallel Interoperable Data Analytics Library

Presentation transcript:

Scalable Parallel Interoperable Data Analytics Library
Judy Qiu, Bingjing Zhang, Peng Bo, Thomas Wiggins, IPCC@IU

ABSTRACT

SPIDAL (Scalable Parallel Interoperable Data Analytics Library) is a community infrastructure built upon the HPC-ABDS concept for biomolecular simulations, network and computational social science, epidemiology, computer vision, spatial geographical information systems, remote sensing for polar science, and pathology informatics. Illustrating this principle, we have shown that previous standalone enhanced versions of MapReduce can be replaced by Harp, a Hadoop plug-in that offers both data abstractions useful for high-performance iteration and communication using the best available (MPI) approaches, portable to HPC and cloud platforms. This iterative solver would enable robustness, scalability, productivity, and sustainability for data analytics, linking Mahout, MLlib, and DAAL on HPC-Cloud as well as deep learning.

[Figure panels: Harp Collective Communication; Harp-LDA Strong Scaling Test]

SUMMARY

Juliet is a cluster with the Intel Haswell architecture (30 nodes, each with 64 threads, connected with InfiniBand). Our experiments use 3,775,554 Wikipedia documents; the training model includes 1 million words and 10K topics, with alpha and beta both set to 0.01. Harp-LDA is 56% faster than Yahoo! LDA on the Juliet cluster, and its convergence speed over iterations is also improved. The strong scaling test shows that Harp-LDA scales well on Juliet (the Intel Haswell cluster). For the convergence of the LDA topic model: convergence is fast, and Harp-LDA reaches a similar level of perplexity at iteration 200 as Yahoo! LDA does at iterations 250-300. Harp-LDA performs more model synchronizations in each iteration, which may lead to better accuracy and faster convergence.

BACKGROUND

We have tested Harp on Latent Dirichlet Allocation (LDA) with Gibbs sampling over a Wikipedia dataset where the model data is very large. The sparsity of the model is a major challenge, and there is a tradeoff between locking and data replication in shared-memory architectures to support concurrent threading. We apply multi-level synchronous and asynchronous optimizations and achieve better parallel efficiency on the Intel Haswell cluster with 30 nodes (64 threads per node) than on IU's Big Red II supercomputer with 128 nodes (32 threads per node).

CONCLUSIONS

Memory usage and optimizations are essential for large-scale data analytics. Harp is an open-source project from Indiana University developed as a library that plugs into Hadoop, enabling users to run iterative algorithms on both clouds and supercomputers. With the high number of processors in upcoming Intel Xeon Phi architectures, synchronization becomes a critical issue for the efficiency of global machine learning algorithms.

[Figure panels: Performance Comparison between Harp-LDA and Yahoo! LDA; Convergence of LDA Topic Model]

MATERIALS AND METHODS

Harp is a Hadoop plug-in that abstracts communication by transforming the map-reduce programming model into a map-collective model, reducing extraneous iterations and improving performance. It has the following features (the iteration pattern this supports for LDA is sketched after this list):

- Hierarchical data abstraction
- Collective communication model
- Pool-based memory management
- BSP-style computation parallelism
- Fault-tolerance support with checkpointing
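To make the map-collective model concrete, here is a minimal, self-contained Java sketch of the iteration pattern for LDA with collapsed Gibbs sampling: each simulated worker samples topics for its share of the documents, and the point marked in the comments is where a collective operation (an allreduce or rotation of the word-topic table) would synchronize the model each pass. The class name, toy sizes, and synthetic corpus are illustrative assumptions, not the Harp API or the SPIDAL implementation; alpha = beta = 0.01 follows the experiment settings above.

// Minimal sketch of the map-collective iteration for LDA with collapsed Gibbs
// sampling. Names, sizes, and the synthetic corpus are illustrative assumptions;
// this is not the Harp API. alpha = beta = 0.01 as in the experiments above.
import java.util.Random;

public class MapCollectiveLdaSketch {
    static final int V = 50;        // vocabulary size (toy value)
    static final int K = 5;         // number of topics (toy value)
    static final int DOCS = 20;     // number of documents
    static final int LEN = 30;      // tokens per document
    static final int WORKERS = 4;   // simulated map tasks
    static final int ITERS = 10;    // Gibbs sampling iterations
    static final double ALPHA = 0.01, BETA = 0.01;

    public static void main(String[] args) {
        Random rng = new Random(7);
        int[][] docs = new int[DOCS][LEN];     // word ids
        int[][] z = new int[DOCS][LEN];        // topic assignment per token
        int[][] wordTopic = new int[V][K];     // word-topic counts (the model)
        int[] topicTotal = new int[K];         // tokens assigned to each topic
        int[][] docTopic = new int[DOCS][K];   // document-topic counts

        // Random initialization of the synthetic corpus and topic assignments.
        for (int d = 0; d < DOCS; d++) {
            for (int n = 0; n < LEN; n++) {
                docs[d][n] = rng.nextInt(V);
                z[d][n] = rng.nextInt(K);
                wordTopic[docs[d][n]][z[d][n]]++;
                topicTotal[z[d][n]]++;
                docTopic[d][z[d][n]]++;
            }
        }

        for (int iter = 0; iter < ITERS; iter++) {
            // "Map" phase: each worker samples its partition of the documents.
            // Workers run sequentially here; Harp would run them on separate
            // nodes against local replicas of the word-topic table.
            for (int w = 0; w < WORKERS; w++) {
                for (int d = w; d < DOCS; d += WORKERS) {
                    for (int n = 0; n < LEN; n++) {
                        int word = docs[d][n];
                        int old = z[d][n];
                        // Remove the current assignment from the counts.
                        wordTopic[word][old]--; topicTotal[old]--; docTopic[d][old]--;
                        // Collapsed Gibbs: p(k) proportional to
                        // (n_dk + alpha) * (n_wk + beta) / (n_k + V * beta)
                        double[] p = new double[K];
                        double sum = 0.0;
                        for (int k = 0; k < K; k++) {
                            p[k] = (docTopic[d][k] + ALPHA)
                                 * (wordTopic[word][k] + BETA)
                                 / (topicTotal[k] + V * BETA);
                            sum += p[k];
                        }
                        // Draw the new topic from the unnormalized distribution.
                        double u = rng.nextDouble() * sum;
                        int k = 0;
                        double acc = p[0];
                        while (acc < u && k < K - 1) { k++; acc += p[k]; }
                        z[d][n] = k;
                        wordTopic[word][k]++; topicTotal[k]++; docTopic[d][k]++;
                    }
                }
                // "Collective" phase would go here: an allreduce (or rotation) of
                // each worker's partial word-topic table, so every worker sees a
                // synchronized model before the next pass.
            }
            System.out.println("iteration " + iter + " finished");
        }
    }
}

In Harp, the per-worker tables would live on different nodes and the synchronization step would be a collective call; the shared arrays above stand in for that step only to keep the sketch short.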
REFERENCE

[1] Bingjing Zhang, Yang Ruan, Judy Qiu. Harp: Collective Communication on Hadoop. In Proceedings of the IEEE International Conference on Cloud Engineering (IC2E 2015), March 9-13, 2015.
[2] Harp project. Available at http://salsaproj.indiana.edu/harp/index.html
[3] Big Red II. Available at https://kb.iu.edu/d/bcqt