NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.

Slides:

Advertisements

Similar presentations

ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.

Advertisements

HPC Pack On-Premises On-premises clusters Ability to scale to reduce runtimes Job scheduling and mgmt via head node Reliability HPC Pack Hybrid.

Undergraduate Poster Presentation Match 31, 2015 Department of CSE, BUET, Dhaka, Bangladesh Wireless Sensor Network Integretion With Cloud Computing H.M.A.

Big Data Open Source Software and Projects Unit 0 Part B: Class Introduction Data Science Curriculum March Geoffrey Fox

HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC

Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,

A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.

Big Data and Clouds: Challenges and Opportunities NIST January Geoffrey Fox

Software Architecture

Data Science at Digital Science October Geoffrey Fox Judy Qiu

Internet of Things (Smart Grid) Storm Archival Storage – NOSQL like Hbase Streaming Processing (Iterative MapReduce) Batch Processing (Iterative MapReduce)

Recipes for Success with Big Data using FutureGrid Cloudmesh SDSC Exhibit Booth New Orleans Convention Center November Geoffrey Fox, Gregor von.

Document Name CONFIDENTIAL Version Control Version No.DateType of ChangesOwner/ Author Date of Review/Expiry The information contained in this document.

Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.

Indiana University Faculty Geoffrey Fox, David Crandall, Judy Qiu, Gregor von Laszewski Data Science at Digital Science Center.

What is it and why it matters? Hadoop. What Is Hadoop? Hadoop is an open-source software framework for storing data and running applications on clusters.

BI 202 Data in the Cloud Creating SharePoint 2013 BI Solutions using Azure 6/20/2014 SharePoint Fest NYC.

Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.

Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit

1 Panel on Merge or Split: Mutual Influence between Big Data and HPC Techniques IEEE International Workshop on High-Performance Big Data Computing In conjunction.

GIS IN THE CLOUD Cloud computing furnishes scalable GIS technology that is maintained off premises and delivered on demand as services via the Internet.

Geoffrey Fox Panel Talk: February

Big Data Analytics and HPC Platforms

Hyungro Lee, Geoffrey C. Fox

CLOUD ARCHITECTURE Many organizations and researchers have defined the architecture for cloud computing. Basically the whole system can be divided into.

Organizations Are Embracing New Opportunities

Big Data is a Big Deal!.

SPIDAL Analytics Performance February 2017

Big Data Enterprise Patterns

Digital Science Center II

Sushant Ahuja, Cassio Cristovao, Sameep Mohta

Smart Building Solution

Introduction to Distributed Platforms

MIDAS- Molecular Dynamics Analysis Tutorial February 2017

Prepared by: Assistant prof. Aslamzai

Tutorial: Big Data Algorithms and Applications Under Hadoop

Status and Challenges: January 2017

Pathology Spatial Analysis February 2017

HPC Cloud Convergence February 2017 Software: MIDAS HPC-ABDS

Smart Building Solution

Bridges and Clouds Sergiu Sanielevici, PSC Director of User Support for Scientific Applications October 12, 2017 © 2017 Pittsburgh Supercomputing Center.

NSF start October 1, 2014 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science Indiana University.

Hadoop Clusters Tess Fulkerson.

Graduation Project Kick-off presentation - SET

Theme 4: High-performance computing for Precision Health Initiative

Some Remarks for Cloud Forward Internet2 Workshop

Department of Intelligent Systems Engineering

Digital Science Center I

I590 Data Science Curriculum August

Applications SPIDAL MIDAS ABDS

High Performance Big Data Computing in the Digital Science Center

Data Science Curriculum March

Tutorial Overview February 2017

Department of Intelligent Systems Engineering

Data Science for Life Sciences Research & the Public Good

A Tale of Two Convergences: Applications and Computing Platforms

Research in Digital Science Center

Scalable Parallel Interoperable Data Analytics Library

Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data

Introduction to Apache

Overview of big data tools

Department of Intelligent Systems Engineering

$1M a year for 5 years; 7 institutions Active:

Brian Matthews STFC EOSCpilot Brian Matthews STFC

PHI Research in Digital Science Center

Agenda Need of Cloud Computing What is Cloud Computing

Panel on Research Challenges in Big Data

Big Data, Simulations and HPC Convergence

Geoffrey Fox High-Performance Big Data Computing: International, National, and Local initiatives COLLABORATORS China and IU: Fudan University, SICE, OVPR.

Convergence of Big Data and Extreme Computing

Presentation transcript:

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS 11/7/2018 gcf@indiana.edu

Status of NSF 1443054 Project Big Data Application Analysis identifies features of data intensive applications that need to be supported in software and represented in benchmarks. This analysis was started for proposal and has been extended to support HPC-Simulations-Big Data convergence. The project is a collaboration between computer and domain scientists in application areas in Biomolecular Simulations, Network Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics. HPC-ABDS as Cloud-HPC interoperable software with performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack was a bold idea developed for proposal. We have successfully delivered and extended this approach, which is one of ideas described in Exascale Big Data report. 11/7/2018

Status of NSF 1443054 Project MIDAS integrating middleware that links HPC and ABDS now has several components including an architecture for Big Data analytics, an integration of HPC in communication and scheduling on ABDS; it also has rules to get high performance Java scientific code. SPIDAL (Scalable Parallel Interoperable Data Analytics Library) now has 20 members with domain specific (general) and core algorithms. Benchmarks. We reached out to database community with keynote and paper at WBDB2015 Benchmarking Workshop. Language: SPIDAL Java runs as fast as C++ Designed and Proposed HPCCloud as hardware-software infrastructure supporting Big Data Big Simulation Convergence Big Data Management via Apache Stack ABDS Big Data Analytics using SPIDAL and other libraries 11/7/2018

Current Challenge: Allow Broad Deployment at Scale Classic Software Engineering: apply lessons like SPIDAL Java uniformly to each SPIDAL library member Package and testing for each component of SPIDAL and MIDAS Offer HPC-ABDS Capabilities as Platform as a Service with each capability specified as Ansible role giving function as a service Will allow HPC, Cloud and converged infrastructure HPCCloud Define good API’s to each function – current Apache libraries not well designed Demonstrate test applications as software as a service on virtual clusters Show that service providers (XSEDE) can deploy on variety of hardware Not clear if HPCCloud infrastructure available as must support roles from HPC and Big Data Comet one of best for us but need broader use of platforms to allow for example biomolecular simulations on Blue Waters 11/7/2018

HPC and/or Cloud 1.0 2.0 3.0 Cloud 1.0: IaaS PaaS Cloud 2.0: DevOps Cloud 3.0: Insight (Solution) as a Service from IBM; server-less computing; event driven function as a service HPC 1.0 and Cloud 1.0 separate ecosystems HPCCloud or HPC-ABDS: Take performance of HPC and functionality of Cloud (Big Data) systems HPCCloud 2.0 Use DevOps to invoke HPC or Cloud software on VM, Docker, HPC infrastructure HPCCloud 3.0 Automate Solution as a Service using HPC-ABDS on varied infrastructure suitable for HPC and Big Data Management and Analytics 11/7/2018

27 Ansible Roles and Re-use in 6 NIST use cases ID 6 NIST Use Cass Hadoop Mesos Spark Storm Pig Hive Drill HDFS HBase Mysql MongoDB RethinkDB Mahout D3, Tableau nltk MLlib Lucene/Solr OpenCV Python Java maven Ganglia Nagios spark supervisord zookeeper AlchemyAPI R 1 NIST Fingerprint Matching x 2 Human and Face Detection 3 Twitter Analysis 4 Analytics for Healthcare Data/Health Informatics 5 Spatial Big Data/Spatial Statistics/Geographic Information Systems 6 Data Warehousing and Data Mining count 5/17/2016

HPCCloud Convergence Architecture Running same HPC-ABDS software across all platforms but data management machines has different balance in I/O, Network and Compute from “model” (data analytics, simulation) machine Note data storage approach: HDFS v. Object Store v. Lustre style file systems is still rather unclear The Model behaves similarly whether from Big Data or Big Simulation. Data Management Model for Big Data and Big Simulation HPCCloud Operational Model matches hardware features with application requirements 11/7/2018

Future Challenge: Support and Encourage Deployment Even as the middleware and analytics are being developed and properly packaged we need to address Supported deployment User training proactive outreach to users and service providers. We need to consider traditional library as well as PaaS/FaaS and SaaS deployments. We will give a 6 hour tutorial on MIDAS and SPIDAL at a European winter school in February 2017 and this will increase our work in this area. 11/7/2018