Scalable Parallel Interoperable Data Analytics Library

Presentation transcript:

Scalable Parallel Interoperable Data Analytics Library
Judy Qiu, Bingjing Zhang, Peng Bo, Thomas Wiggins, IPCC@IU

ABSTRACT

SPIDAL (Scalable Parallel Interoperable Data Analytics Library) is a community infrastructure built upon the HPC-ABDS concept for biomolecular simulations, network and computational social science, epidemiology, computer vision, spatial geographical information systems, remote sensing for polar science, and pathology informatics. Illustrating this principle, we have shown that previous standalone enhanced versions of MapReduce can be replaced by Harp, a Hadoop plug-in that offers both data abstractions useful for high-performance iteration and communication using the best available (MPI) approaches, portable to HPC and cloud platforms. This iterative solver would enable robustness, scalability, productivity, and sustainability for data analytics, linking Mahout, MLlib, and DAAL on HPC-Cloud as well as deep learning.

[Figure panels: Harp Collective Communication; Harp-LDA Strong Scaling Test]

SUMMARY

Juliet is a cluster with the Intel Haswell architecture (30 nodes, each with 64 threads, connected with InfiniBand). Our experiments use 3,775,554 Wikipedia documents; the training model includes 1 million words and 10K topics, with alpha and beta both set to 0.01. Harp-LDA is 56% faster than Yahoo! LDA on the Juliet cluster, and its convergence speed over iterations is also improved. The strong scaling test shows that Harp-LDA scales well on Juliet (the Intel Haswell cluster). For the convergence of the LDA topic model: convergence is fast, and Harp-LDA reaches a similar level of perplexity at iteration 200 as Yahoo! LDA does at iterations 250-300. Harp-LDA performs more model synchronizations in each iteration, which may lead to better accuracy and faster convergence.

BACKGROUND

We have tested Harp on Latent Dirichlet Allocation (LDA) with Gibbs sampling over a Wikipedia dataset where the model data is very large. The sparsity of the model is a major challenge, and there is a tradeoff between locking and data replication in shared-memory architectures to support concurrent threading. We apply multi-level synchronous and asynchronous optimizations and achieve better parallel efficiency on the Intel Haswell cluster with 30 nodes (64 threads per node) than on IU's Big Red II supercomputer with 128 nodes (32 threads per node).

CONCLUSIONS

Memory usage and optimizations are essential for large-scale data analytics. Harp is an open-source project from Indiana University developed as a library that plugs into Hadoop, enabling users to run iterative algorithms on both clouds and supercomputers. With the high number of processors in upcoming Intel Xeon Phi architectures, synchronization becomes a critical issue for the efficiency of global machine learning algorithms.

[Figure panels: Performance Comparison between Harp-LDA and Yahoo! LDA; Convergence of LDA Topic Model]

MATERIALS AND METHODS

Harp is a Hadoop plug-in that abstracts communication by transforming the map-reduce programming model into a map-collective model, reducing extraneous iterations and improving performance. It has the following features (the iteration pattern this supports for LDA is sketched after this list):

- Hierarchical data abstraction
- Collective communication model
- Pool-based memory management
- BSP-style computation parallelism
- Fault-tolerance support with checkpointing
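To make the map-collective model concrete, here is a minimal, self-contained Java sketch of the iteration pattern for LDA with collapsed Gibbs sampling: each simulated worker samples topics for its share of the documents, and the point marked in the comments is where a collective operation (an allreduce or rotation of the word-topic table) would synchronize the model each pass. The class name, toy sizes, and synthetic corpus are illustrative assumptions, not the Harp API or the SPIDAL implementation; alpha = beta = 0.01 follows the experiment settings above.

// Minimal sketch of the map-collective iteration for LDA with collapsed Gibbs
// sampling. Names, sizes, and the synthetic corpus are illustrative assumptions;
// this is not the Harp API. alpha = beta = 0.01 as in the experiments above.
import java.util.Random;

public class MapCollectiveLdaSketch {
    static final int V = 50;        // vocabulary size (toy value)
    static final int K = 5;         // number of topics (toy value)
    static final int DOCS = 20;     // number of documents
    static final int LEN = 30;      // tokens per document
    static final int WORKERS = 4;   // simulated map tasks
    static final int ITERS = 10;    // Gibbs sampling iterations
    static final double ALPHA = 0.01, BETA = 0.01;

    public static void main(String[] args) {
        Random rng = new Random(7);
        int[][] docs = new int[DOCS][LEN];     // word ids
        int[][] z = new int[DOCS][LEN];        // topic assignment per token
        int[][] wordTopic = new int[V][K];     // word-topic counts (the model)
        int[] topicTotal = new int[K];         // tokens assigned to each topic
        int[][] docTopic = new int[DOCS][K];   // document-topic counts

        // Random initialization of the synthetic corpus and topic assignments.
        for (int d = 0; d < DOCS; d++) {
            for (int n = 0; n < LEN; n++) {
                docs[d][n] = rng.nextInt(V);
                z[d][n] = rng.nextInt(K);
                wordTopic[docs[d][n]][z[d][n]]++;
                topicTotal[z[d][n]]++;
                docTopic[d][z[d][n]]++;
            }
        }

        for (int iter = 0; iter < ITERS; iter++) {
            // "Map" phase: each worker samples its partition of the documents.
            // Workers run sequentially here; Harp would run them on separate
            // nodes against local replicas of the word-topic table.
            for (int w = 0; w < WORKERS; w++) {
                for (int d = w; d < DOCS; d += WORKERS) {
                    for (int n = 0; n < LEN; n++) {
                        int word = docs[d][n];
                        int old = z[d][n];
                        // Remove the current assignment from the counts.
                        wordTopic[word][old]--; topicTotal[old]--; docTopic[d][old]--;
                        // Collapsed Gibbs: p(k) proportional to
                        // (n_dk + alpha) * (n_wk + beta) / (n_k + V * beta)
                        double[] p = new double[K];
                        double sum = 0.0;
                        for (int k = 0; k < K; k++) {
                            p[k] = (docTopic[d][k] + ALPHA)
                                 * (wordTopic[word][k] + BETA)
                                 / (topicTotal[k] + V * BETA);
                            sum += p[k];
                        }
                        // Draw the new topic from the unnormalized distribution.
                        double u = rng.nextDouble() * sum;
                        int k = 0;
                        double acc = p[0];
                        while (acc < u && k < K - 1) { k++; acc += p[k]; }
                        z[d][n] = k;
                        wordTopic[word][k]++; topicTotal[k]++; docTopic[d][k]++;
                    }
                }
                // "Collective" phase would go here: an allreduce (or rotation) of
                // each worker's partial word-topic table, so every worker sees a
                // synchronized model before the next pass.
            }
            System.out.println("iteration " + iter + " finished");
        }
    }
}

In Harp, the per-worker tables would live on different nodes and the synchronization step would be a collective call; the shared arrays above stand in for that step only to keep the sketch short.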
REFERENCE

[1] Bingjing Zhang, Yang Ruan, Judy Qiu. Harp: Collective Communication on Hadoop. In Proceedings of the IEEE International Conference on Cloud Engineering (IC2E 2015), March 9-13, 2015.
[2] Harp project. Available at http://salsaproj.indiana.edu/harp/index.html
[3] Big Red II. Available at https://kb.iu.edu/d/bcqt