NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.

Slides:



Advertisements
Similar presentations
ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.
Advertisements

HPC Pack On-Premises On-premises clusters Ability to scale to reduce runtimes Job scheduling and mgmt via head node Reliability HPC Pack Hybrid.
Undergraduate Poster Presentation Match 31, 2015 Department of CSE, BUET, Dhaka, Bangladesh Wireless Sensor Network Integretion With Cloud Computing H.M.A.
Big Data Open Source Software and Projects Unit 0 Part B: Class Introduction Data Science Curriculum March Geoffrey Fox
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Big Data and Clouds: Challenges and Opportunities NIST January Geoffrey Fox
Software Architecture
Data Science at Digital Science October Geoffrey Fox Judy Qiu
Internet of Things (Smart Grid) Storm Archival Storage – NOSQL like Hbase Streaming Processing (Iterative MapReduce) Batch Processing (Iterative MapReduce)
Recipes for Success with Big Data using FutureGrid Cloudmesh SDSC Exhibit Booth New Orleans Convention Center November Geoffrey Fox, Gregor von.
Document Name CONFIDENTIAL Version Control Version No.DateType of ChangesOwner/ Author Date of Review/Expiry The information contained in this document.
Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.
Indiana University Faculty Geoffrey Fox, David Crandall, Judy Qiu, Gregor von Laszewski Data Science at Digital Science Center.
What is it and why it matters? Hadoop. What Is Hadoop? Hadoop is an open-source software framework for storing data and running applications on clusters.
BI 202 Data in the Cloud Creating SharePoint 2013 BI Solutions using Azure 6/20/2014 SharePoint Fest NYC.
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
1 Panel on Merge or Split: Mutual Influence between Big Data and HPC Techniques IEEE International Workshop on High-Performance Big Data Computing In conjunction.
GIS IN THE CLOUD Cloud computing furnishes scalable GIS technology that is maintained off premises and delivered on demand as services via the Internet.
Geoffrey Fox Panel Talk: February
Big Data Analytics and HPC Platforms
Hyungro Lee, Geoffrey C. Fox
CLOUD ARCHITECTURE Many organizations and researchers have defined the architecture for cloud computing. Basically the whole system can be divided into.
Organizations Are Embracing New Opportunities
Big Data is a Big Deal!.
SPIDAL Analytics Performance February 2017
Big Data Enterprise Patterns
Digital Science Center II
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Smart Building Solution
Introduction to Distributed Platforms
MIDAS- Molecular Dynamics Analysis Tutorial February 2017
Prepared by: Assistant prof. Aslamzai
Tutorial: Big Data Algorithms and Applications Under Hadoop
Status and Challenges: January 2017
Pathology Spatial Analysis February 2017
HPC Cloud Convergence February 2017 Software: MIDAS HPC-ABDS
Smart Building Solution
Bridges and Clouds Sergiu Sanielevici, PSC Director of User Support for Scientific Applications October 12, 2017 © 2017 Pittsburgh Supercomputing Center.
NSF start October 1, 2014 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science Indiana University.
Hadoop Clusters Tess Fulkerson.
Graduation Project Kick-off presentation - SET
Theme 4: High-performance computing for Precision Health Initiative
Some Remarks for Cloud Forward Internet2 Workshop
Department of Intelligent Systems Engineering
Digital Science Center I
I590 Data Science Curriculum August
Applications SPIDAL MIDAS ABDS
High Performance Big Data Computing in the Digital Science Center
Data Science Curriculum March
Tutorial Overview February 2017
Department of Intelligent Systems Engineering
Data Science for Life Sciences Research & the Public Good
A Tale of Two Convergences: Applications and Computing Platforms
Research in Digital Science Center
Scalable Parallel Interoperable Data Analytics Library
Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
Introduction to Apache
Overview of big data tools
Department of Intelligent Systems Engineering
$1M a year for 5 years; 7 institutions Active:
Brian Matthews STFC EOSCpilot Brian Matthews STFC
PHI Research in Digital Science Center
Agenda Need of Cloud Computing What is Cloud Computing
Panel on Research Challenges in Big Data
Big Data, Simulations and HPC Convergence
Geoffrey Fox High-Performance Big Data Computing: International, National, and Local initiatives COLLABORATORS China and IU: Fudan University, SICE, OVPR.
Convergence of Big Data and Extreme Computing
Presentation transcript:

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS 11/7/2018 gcf@indiana.edu

Status of NSF 1443054 Project Big Data Application Analysis identifies features of data intensive applications that need to be supported in software and represented in benchmarks. This analysis was started for proposal and has been extended to support HPC-Simulations-Big Data convergence. The project is a collaboration between computer and domain scientists in application areas in Biomolecular Simulations, Network Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics. HPC-ABDS as Cloud-HPC interoperable software with performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack was a bold idea developed for proposal. We have successfully delivered and extended this approach, which is one of ideas described in Exascale Big Data report. 11/7/2018

Status of NSF 1443054 Project MIDAS integrating middleware that links HPC and ABDS now has several components including an architecture for Big Data analytics, an integration of HPC in communication and scheduling on ABDS; it also has rules to get high performance Java scientific code. SPIDAL (Scalable Parallel Interoperable Data Analytics Library) now has 20 members with domain specific (general) and core algorithms. Benchmarks. We reached out to database community with keynote and paper at WBDB2015 Benchmarking Workshop. Language: SPIDAL Java runs as fast as C++ Designed and Proposed HPCCloud as hardware-software infrastructure supporting Big Data Big Simulation Convergence Big Data Management via Apache Stack ABDS Big Data Analytics using SPIDAL and other libraries 11/7/2018

Current Challenge: Allow Broad Deployment at Scale Classic Software Engineering: apply lessons like SPIDAL Java uniformly to each SPIDAL library member Package and testing for each component of SPIDAL and MIDAS Offer HPC-ABDS Capabilities as Platform as a Service with each capability specified as Ansible role giving function as a service Will allow HPC, Cloud and converged infrastructure HPCCloud Define good API’s to each function – current Apache libraries not well designed Demonstrate test applications as software as a service on virtual clusters Show that service providers (XSEDE) can deploy on variety of hardware Not clear if HPCCloud infrastructure available as must support roles from HPC and Big Data Comet one of best for us but need broader use of platforms to allow for example biomolecular simulations on Blue Waters 11/7/2018

HPC and/or Cloud 1.0 2.0 3.0 Cloud 1.0: IaaS PaaS Cloud 2.0: DevOps Cloud 3.0: Insight (Solution) as a Service from IBM; server-less computing; event driven function as a service HPC 1.0 and Cloud 1.0 separate ecosystems HPCCloud or HPC-ABDS: Take performance of HPC and functionality of Cloud (Big Data) systems HPCCloud 2.0 Use DevOps to invoke HPC or Cloud software on VM, Docker, HPC infrastructure HPCCloud 3.0 Automate Solution as a Service using HPC-ABDS on varied infrastructure suitable for HPC and Big Data Management and Analytics 11/7/2018

27 Ansible Roles and Re-use in 6 NIST use cases ID 6 NIST Use Cass Hadoop Mesos Spark Storm Pig Hive Drill HDFS HBase Mysql MongoDB RethinkDB Mahout D3, Tableau nltk MLlib Lucene/Solr OpenCV Python Java maven Ganglia Nagios spark supervisord zookeeper AlchemyAPI R 1 NIST Fingerprint Matching x 2 Human and Face Detection 3 Twitter Analysis 4 Analytics for Healthcare Data/Health Informatics 5 Spatial Big Data/Spatial Statistics/Geographic Information Systems 6 Data Warehousing and Data Mining count 5/17/2016

HPCCloud Convergence Architecture Running same HPC-ABDS software across all platforms but data management machines has different balance in I/O, Network and Compute from “model” (data analytics, simulation) machine Note data storage approach: HDFS v. Object Store v. Lustre style file systems is still rather unclear The Model behaves similarly whether from Big Data or Big Simulation. Data Management Model for Big Data and Big Simulation HPCCloud Operational Model matches hardware features with application requirements 11/7/2018

Future Challenge: Support and Encourage Deployment Even as the middleware and analytics are being developed and properly packaged we need to address Supported deployment User training proactive outreach to users and service providers. We need to consider traditional library as well as PaaS/FaaS and SaaS deployments. We will give a 6 hour tutorial on MIDAS and SPIDAL at a European winter school in February 2017 and this will increase our work in this area. 11/7/2018