Scalable Parallel Interoperable Data Analytics Library Judy Qiu, Bingjing Zhang, Peng Bo, Thomas Wiggins, IPCC@IU ABSTRACT Harp Collective Communication Harp-LDA Strong Scaling Test SUMMARY SPIDAL (Scalable Parallel Interoperable Data Analytics Library) is a community infrastructure built upon the HPC-ABDS concept for Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics. Illustrating this principle, we have shown that previous standalone enhanced versions of MapReduce can be replaced by Harp (a Hadoop plug-in) that offers both data abstractions useful for high performance iteration and communication using best available (MPI) approaches that are portable to HPC and Cloud. This iterative solver would enable robustness, scalability, productivity, and sustainability for Data Analytics Link Mahout, MLlib, DAAL on HPC-Cloud as well as Deep Learning. Juliet is a cluster with Intel Haswell architecture (30 nodes each with 64 threads and connected with InfiniBand) Our experiments use 3,775,554 Wikipedia documents. The training model includes 1 million words and 10K topics. Alpha and beta are both set to 0.01. Harp-LDA is 56% faster compared with Yahoo!LDA on the Juliet cluster. The convergence speed of Harp-LDA on iterations also improved. Strong scaling test shows Harp-LDA can scale well on Juliet (Intel Haswell cluster). For the Convergence of LDA Topic Model: The convergence is fast. Harp-LDA reaches a similar level of perplexity at iter#200 as Yahoo!LDA does at iter#250-#300. Harp-LDA performs more model synchronizations in each iteration, which may lead to better accuracy and converging speed. BACKGROUND We have tested Harp on Latent Dirichlet Allocation with Gibbs sampling over a wikipedia dataset where model data is very large. The sparsity of the model is a major challenge, and there is a tradeoff between locking and data replication in shared memory architecture to support concurrent threading. We apply multi-level synchronous and asynchronous optimizations and can achieve better parallel efficiency on the Intel Haswell cluster with 30 nodes (64 threads per node), compared to IU’s Big Red II supercomputer with 128 nodes (32 threads per node). CONCLUSIONS Memory usage and optimizations are essential for large-scale data analytics. Harp is an open source project from Indiana University developed as a library that plugs into Hadoop, enabling users to run iterative algorithms on both clouds and supercomputers. With the high number of processors in upcoming Intel Xeon Phi architectures, synchronization becomes a critical issue for the efficiency of global machine learning algorithms. Performance Comparison between Harp-LDA and Yahoo! LDA Convergence of LDA Topic Model MATERIALS AND METHODS REFERENCE Harp is a Hadoop plug-in made to abstract communication by transforming map-reduce programming models into map-collective models, reducing extraneous iterations and improving performance. It has the following features: Hierarchal data abstraction Collective communication model Pool-based memory management BSP-style Computation Parallelism Fault tolerance support with checkpointing [1] Bingjing Zhang, Yang Ruan, Judy Qiu, Harp: Collective Communication on Hadoop, accepted to the proceedings of IEEE International Conference on Cloud Engineering (IC2E2015), March 9-13, 2015. [2] Harp project. Available at http://salsaproj.indiana.edu/harp/index.html [3] Big Red II. Available at https://kb.iu.edu/d/bcqt