Research in Digital Science Center

Presentation transcript:

Research in Digital Science Center
Geoffrey Fox, January 19, 2018
Digital Science Center, Department of Intelligent Systems Engineering
gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/
Judy Qiu, David Crandall, Gregor von Laszewski, Dennis Gannon
Supun Kamburugamuve, Pulasthi Wickramasinghe, Hyungro Lee, Jerome Mitchell
Bo Peng, Langshi Chen, Kannan Govindarajan, Fugang Wang
Internal collaboration: Biology, Physics, SICE
Outside collaborators in funded projects: Arizona, Kansas, Purdue, Rutgers, San Diego Supercomputer Center, SUNY Stony Brook, Virginia Tech, UIUC and Utah
NIST and Fudan University

Digital Science Center Research Activities
  Building SPIDAL Scalable HPC Machine Learning Library
  Applying current SPIDAL in Biology, Network Science (OSoMe), Pathology
  Harp HPC Machine Learning Framework (Qiu)
  Twister2 HPC Event Driven Distributed Programming model (replace Spark)
  Cloud Research and DevOps for Software Defined Systems (von Laszewski)
  Intel Parallel Computing Center @IU (Qiu, Gottlieb)
  Fudan-Indiana Universities' Institute for Transformational High-Performance Big-Data Computing
  Work with NIST on Big Data Standards and non-proprietary Frameworks
  Engineered nanoBIO Node NSF EEC-1720625 with Purdue and UIUC
  Polar (Radar) Image Processing (Crandall); being used in production
  Data analysis of experimental physics scattering results
  IoTCloud: Cloud control of robots, licensed to C2RO (Montreal)
  Big Data on HPC Cloud

Digital Science Center/ISE Infrastructure
  Run computer infrastructure for Cloud and HPC research
    Romeo: 16 K80 and 16 Volta GPUs on 8 Haswell nodes, used in Deep Learning Course E533 and Research (Volta have NVLink)
    Victor/Tempest: 26 nodes, Infiniband/Omnipath, Intel Xeon Platinum 48 core nodes
    Tango: 64 node system with high performance disks (SSD, NVRam = 5x SSD and 25x HDD) and Intel KNL (Knights Landing) manycore (68-72) chips. Omnipath interconnect
    Juliet: 128 node system with two 12-18 core Haswell chips, SSD and conventional HDD disks. Infiniband interconnect
    FutureSystems Bravo, Delta, Echo: old but useful; 48 nodes
    All have HPC networks and all can run HDFS and store data on nodes
  Teach ISE basic and advanced Cloud Computing and big data courses
    E222 Intelligent Systems II (Undergraduate)
    E534 Big Data Applications and Analytics
    E516 Introduction to Cloud Computing
    E616 Advanced Cloud Computing
  Supported by Gary Miksik, Allan Streib
  Switch focus to Docker+Kubernetes
  Use GitHub for all non-FERPA course material. Have collected a large number of open source written-up projects

Engineered nanoBIO Node
Indiana University: Intelligent Systems Engineering, Chemistry, Science Gateways Community Institute
  The Engineered nanoBIO node at Indiana University (IU) will develop a powerful set of integrated computational nanotechnology tools that facilitate the discovery of customized, efficient, and safe nanoscale devices for biological applications.
  Applications and Frameworks will be deployed and supported on nanoHUB.
  Use in undergraduate and masters programs in ISE for Nanoengineering and Bioengineering
    ISE (Intelligent Systems Engineering) as a new department developing courses from scratch (67 defined in first 2 years)
  Research Experiences for Undergraduates throughout year
  Annual engineered nanoBIO workshop
  Summer Camps for Middle and High School Students
  Online (nanoHUB and YouTube) courses with accessible content on nano and bioengineering
  Research and Education tools build on existing simulations, analytics and frameworks: PhysiCell and CompuCell3D
PhysiCell
NP Shape Lab:

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
  Ogres Application Analysis
  HPC-ABDS and HPC-FaaS Software
  Harp and Twister2 Building Blocks
  SPIDAL Data Analytics Library
Software: MIDAS HPC-ABDS

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Qiu/Fox Core SPIDAL Parallel HPC Library with Collective Used
  QR Decomposition (QR): Reduce, Broadcast [DAAL]
  Neural Network: AllReduce [DAAL]
  Covariance: AllReduce [DAAL]
  Low Order Moments: Reduce [DAAL]
  Naive Bayes: Reduce [DAAL]
  Linear Regression: Reduce [DAAL]
  Ridge Regression: Reduce [DAAL]
  Multi-class Logistic Regression: Regroup, Rotate, AllGather
  Random Forest: AllReduce
  Principal Component Analysis (PCA): AllReduce [DAAL]
  DA-MDS: Rotate, AllReduce, Broadcast
  Directed Force Dimension Reduction: AllGather, AllReduce
  Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
  DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
  K-means: AllReduce, Broadcast, AllGather [DAAL]
  SVM: AllReduce, AllGather
  SubGraph Mining: AllGather, AllReduce
  Latent Dirichlet Allocation: Rotate, AllReduce
  Matrix Factorization (SGD): Rotate [DAAL]
  Recommender System (ALS): Rotate [DAAL]
  Singular Value Decomposition (SVD): AllGather [DAAL]
[DAAL] implies integrated on node with Intel DAAL Optimized Data Analytics Library (Runs on KNL!)
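
To make the role of a collective in the listing above concrete, here is a minimal sketch of one K-means iteration built around AllReduce. It is not SPIDAL or Harp code: the workers are simulated as plain Python lists in one process, and the names `allreduce_sum`, `local_stats` and `kmeans_step` are hypothetical helpers introduced only for this illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def allreduce_sum(local_vectors: List[List[float]]) -> List[List[float]]:
    """Simulated AllReduce: every worker ends up with the element-wise sum."""
    total = [sum(values) for values in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def nearest(p: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2,
    )


def local_stats(points: List[Point], centroids: List[Point]) -> List[float]:
    """Per-worker partial result: flattened [sum_x, sum_y, count] per centroid."""
    stats = [0.0] * (3 * len(centroids))
    for p in points:
        k = nearest(p, centroids)
        stats[3 * k] += p[0]
        stats[3 * k + 1] += p[1]
        stats[3 * k + 2] += 1.0
    return stats


def kmeans_step(partitions: List[List[Point]], centroids: List[Point]) -> List[Point]:
    """One iteration: local computation on each partition, then AllReduce."""
    local = [local_stats(part, centroids) for part in partitions]
    reduced = allreduce_sum(local)
    g = reduced[0]  # every worker holds the same global sums after AllReduce
    updated = []
    for k, old in enumerate(centroids):
        n = g[3 * k + 2]
        # Keep the old centroid if no point was assigned to it.
        updated.append((g[3 * k] / n, g[3 * k + 1] / n) if n > 0 else old)
    return updated


# Two simulated workers, each holding its own partition of the data.
partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
print(new_centroids)
```

In a real deployment the `allreduce_sum` call would be replaced by the runtime's own collective (for example an MPI AllReduce), while the per-partition computation stays unchanged.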

Twister2: "Next Generation Grid - Edge - HPC Cloud"
  Original 2010 Twister paper has 928 citations; it was a particular approach to MapCollective iterative processing for machine learning
  Re-engineer current Apache Big Data and HPC software systems as a toolkit
  Support a serverless (cloud-native) dataflow event-driven HPC-FaaS (microservice) framework running across application and geographic domains
  Support all types of data analysis from Parallel Machine Learning to Edge computing
  Build on Cloud best practice but use HPC wherever possible to get high performance
  Smoothly support current paradigms: Hadoop, Spark, Flink, Heron, MPI, DARMA …
  Use interoperable common abstractions but multiple polymorphic implementations, i.e. do not require a single runtime
  Focus on Runtime, but this implies an HPC-FaaS programming and execution model
  This defines a next generation Grid based on data and edge devices, not computing as in the old Grid
  See long paper http://dsc.soic.indiana.edu/publications/Twister2.pdf
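
The idea of a common abstraction with multiple polymorphic implementations can be sketched in a few lines. This is an illustration only and not the Twister2 API: `MapRuntime`, `SequentialRuntime`, `ThreadedRuntime` and `word_lengths` are hypothetical names, and the two runtimes stand in for whatever backends a real toolkit would plug in.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class MapRuntime(ABC):
    """Common abstraction: application code depends only on this interface."""

    @abstractmethod
    def map(self, fn: Callable, items: Sequence) -> List:
        ...


class SequentialRuntime(MapRuntime):
    """Simplest implementation: run everything in the calling thread."""

    def map(self, fn: Callable, items: Sequence) -> List:
        return [fn(x) for x in items]


class ThreadedRuntime(MapRuntime):
    """Alternative implementation backed by a thread pool."""

    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def map(self, fn: Callable, items: Sequence) -> List:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(fn, items))


def word_lengths(runtime: MapRuntime, words: Sequence[str]) -> List[int]:
    """Application code that is unaware of which runtime executes it."""
    return runtime.map(len, words)


words = ["harp", "twister2", "spidal"]
sequential_result = word_lengths(SequentialRuntime(), words)
threaded_result = word_lengths(ThreadedRuntime(workers=2), words)
print(sequential_result, threaded_result)
```

The application function never names a concrete runtime, so a backend can be swapped without touching it.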

Components of Twister2
  Dataflow coordination Points
  Execution Semantics (plan map resources to execution unit)
  Parallel Computing Paradigm
  (Dynamic/Static) Resource Allocation
  Task migration
  Elasticity
  Streaming and FaaS Events
  Task Execution
  Task Scheduling
  Task Graph
  Messages
  Dataflow Communication
  BSP Communication
  Map-Collective
  Static (Batch) Data access/store
  Streaming Data
  Distributed Data Set
  Check Pointing
  Security
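
As a rough illustration of how the Task Graph, Task Scheduling and Task Execution components relate, the sketch below runs a small dataflow graph in dependency order using Kahn's topological sort. It is an assumption-laden toy, not Twister2's scheduler: `run_task_graph` and the three example tasks are hypothetical, and everything runs in one process.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Tuple


def run_task_graph(
    tasks: Dict[str, Callable[[List[Any]], Any]],
    deps: Dict[str, List[str]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Execute tasks so that each runs only after all of its upstream tasks.

    tasks maps a task name to a function taking the list of upstream outputs.
    deps maps a task name to the names of its upstream tasks.
    """
    indegree = {name: len(deps.get(name, [])) for name in tasks}
    downstream: Dict[str, List[str]] = {name: [] for name in tasks}
    for name, upstream in deps.items():
        for up in upstream:
            downstream[up].append(name)

    # Tasks with no unfinished upstream dependencies are ready to run.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    results: Dict[str, Any] = {}
    order: List[str] = []
    while ready:
        name = ready.popleft()
        inputs = [results[up] for up in deps.get(name, [])]
        results[name] = tasks[name](inputs)
        order.append(name)
        for child in downstream[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("task graph contains a cycle")
    return results, order


# A three-stage dataflow: source -> square -> total.
tasks = {
    "source": lambda inputs: [1, 2, 3, 4],
    "square": lambda inputs: [x * x for x in inputs[0]],
    "total": lambda inputs: sum(inputs[0]),
}
deps = {"square": ["source"], "total": ["square"]}
results, order = run_task_graph(tasks, deps)
print(order, results["total"])
```

A distributed runtime would additionally place ready tasks on resources and move data between them; here the ready queue alone decides the execution order.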