Interactive Website (https://dexterrules.github.io/SC-Demo-17/SC-Demo.html)

HPC-ABDS is Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. The concept is illustrated by Harp-DAAL, which is organized in three levels:
High level, usability: Python interface, well-documented and packaged modules
Middle level, data-centric abstractions: computation models and optimized communication patterns
Low level, optimized for performance: HPC kernels from Intel® DAAL and advanced hardware platforms such as Xeon and Xeon Phi
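As a rough picture of this layering, the Java sketch below mocks the separation of concerns: a user-facing entry point delegates to a data-centric layer that partitions data and combines partial results, which in turn calls a low-level compute kernel. None of the names are Harp-DAAL's real API; in the actual stack the top layer is exposed in Python and the bottom layer is an Intel DAAL kernel.

```java
import java.util.Arrays;

// Purely illustrative mock of the three-level design described above; none of
// these class or method names are Harp-DAAL's real API. It shows how a
// high-level call can delegate through a data-centric layer (partition data,
// combine partial results) to a low-level compute kernel.
public class LayeredStackSketch {

    // Low level: stands in for an optimized kernel (Intel DAAL in the real stack).
    static double[] kernelPartialSum(double[][] shard) {
        double sum = 0.0;
        for (double[] row : shard) sum += row[0];
        return new double[]{sum, shard.length};   // {sum, count}
    }

    // Middle level: data-centric abstraction - partition the data, run the
    // kernel per partition, and combine the partial results (in a distributed
    // run this combine step would be a collective such as allreduce).
    static double dataCentricMean(double[][] data, int partitions) {
        double sum = 0.0, count = 0.0;
        for (int p = 0; p < partitions; p++) {
            int start = p * data.length / partitions;
            int end = (p + 1) * data.length / partitions;
            double[] partial = kernelPartialSum(Arrays.copyOfRange(data, start, end));
            sum += partial[0];
            count += partial[1];
        }
        return sum / count;
    }

    // High level: the user-facing entry point (a Python interface in the real stack).
    public static void main(String[] args) {
        double[][] data = {{1.0}, {2.0}, {3.0}, {4.0}};
        System.out.println("Mean = " + dataCentricMean(data, 2));
    }
}
```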

Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions; 10 to 20 nodes of Intel KNL 7250 processors; Harp-DAAL has 15x speedups over Spark MLlib.
Datasets: 500K or 1 million data points of feature dimension 300; running on a single KNL 7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch); Harp-DAAL achieves 3x to 6x speedups.
Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices; 25 nodes of Intel Xeon E5-2670; Harp-DAAL has 2x to 5x speedups over the state-of-the-art MPI-Fascia solution.

Explore Algorithms: Alternating least squares (ALS), PCA, Subgraph Counting, Matrix Factorization-SGD

Solutions to Big Data Problems
Computation Models (Model-Centric Synchronization Paradigm): Locking, Rotation, Asynchronous, Allreduce
Communication Operations (Distributed Memory): broadcast, reduce, allgather, allreduce, regroup, push & pull, rotate, sendEvent, getEvent, waitEvent
High Performance Libraries: Intel® DAAL, MKL
Shared Memory Schedulers: Dynamic Scheduler, Static Scheduler
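Of these computation models, Rotation is the least obvious from the operation names alone: model partitions move between workers rather than being globally combined. The sketch below is a minimal single-process simulation of that idea, assuming nothing about Harp's actual classes or method signatures.

```java
import java.util.Arrays;

// Conceptual single-process simulation of the Rotation computation model:
// worker w holds one model partition at a time; after each local update the
// rotate collective passes partitions around a ring, so over one epoch every
// worker updates every partition without any worker holding the full model.
// This sketches the idea only; it is not Harp's actual API.
public class RotationSketch {
    public static void main(String[] args) {
        int workers = 4;
        double[] modelPartitions = new double[workers];  // partition p of the model
        int[] updatesPerPartition = new int[workers];

        for (int step = 0; step < workers; step++) {       // one epoch = 'workers' steps
            for (int w = 0; w < workers; w++) {
                int p = (w + step) % workers;              // partition worker w holds now
                modelPartitions[p] += 0.1 * (w + 1);       // stand-in for a real model update
                updatesPerPartition[p]++;
            }
            // A rotate operation would now move each partition to the next worker;
            // here the index arithmetic above plays that role.
        }
        System.out.println("Partitions: " + Arrays.toString(modelPartitions));
        System.out.println("Updates per partition: " + Arrays.toString(updatesPerPartition));
    }
}
```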

Example: K-means Clustering with the Allreduce Computation Model (Model / Worker)
When the model size is small: broadcast, reduce, allreduce
When the model size is large but can still be held in each machine's memory: allgather, regroup, push & pull
When the model size cannot be held in each machine's memory: rotate
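The allreduce case can be sketched without any framework: each worker computes partial centroid sums over its data shard, the partial sums are combined (the role of an allreduce collective in Harp), and every worker then derives the same updated centroids. The Java sketch below is a minimal single-process simulation of that pattern; the class and variable names are illustrative, not Harp's API.

```java
import java.util.Arrays;

// Minimal single-process simulation of the allreduce K-means pattern:
// each "worker" computes partial centroid sums over its shard, the partial
// sums are summed (the allreduce step), and every worker then derives the
// same updated centroids. Names are illustrative only.
public class AllreduceKMeansSketch {
    public static void main(String[] args) {
        double[][] data = {{1.0}, {1.2}, {8.0}, {8.4}, {0.8}, {7.6}};
        double[] centroids = {0.0, 10.0};          // k = 2, one-dimensional points
        int workers = 2, k = centroids.length;

        for (int iter = 0; iter < 5; iter++) {
            double[] sums = new double[k];
            int[] counts = new int[k];

            // Each worker processes its shard and produces partial sums.
            for (int w = 0; w < workers; w++) {
                int start = w * data.length / workers;
                int end = (w + 1) * data.length / workers;
                for (int i = start; i < end; i++) {
                    int nearest = 0;
                    for (int c = 1; c < k; c++) {
                        if (Math.abs(data[i][0] - centroids[c])
                                < Math.abs(data[i][0] - centroids[nearest])) {
                            nearest = c;
                        }
                    }
                    // In a distributed run these would be per-worker tables
                    // combined by an allreduce; here they share one array.
                    sums[nearest] += data[i][0];
                    counts[nearest]++;
                }
            }

            // After the (simulated) allreduce, every worker updates its model copy.
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) centroids[c] = sums[c] / counts[c];
            }
        }
        System.out.println("Centroids: " + Arrays.toString(centroids));
    }
}
```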

Result: sample images of the resulting clusters, one row per cluster.

Harp-DAAL: Prototype and Production Code
Source code became available on GitHub in February 2017. Harp-DAAL follows the same standards as DAAL's original code.
Twelve applications: Harp-DAAL K-means, Harp-DAAL MF-SGD, Harp-DAAL MF-ALS, Harp-DAAL SVD, Harp-DAAL PCA, Harp-DAAL Neural Networks, Harp-DAAL Naïve Bayes, Harp-DAAL Linear Regression, Harp-DAAL Ridge Regression, Harp-DAAL QR Decomposition, Harp-DAAL Low Order Moments, Harp-DAAL Covariance
Available at https://dsc-spidal.github.io/harp