Department of Intelligent Systems Engineering

Presentation transcript:

Big Data Panel
Workshop on Big Data Foundations
In conjunction with: 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India
Geoffrey Fox
December 16, 2015
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
2/17/2019

Panel Questions
1) What is HPC's role in Big Data computing?
   HPC brings high performance; conversely, Big Data brings sustainability and functionality to HPC.
2) What are the new and exciting developments in the programming stack, application requirements, and design of large scale infrastructure for Big Data?
   HPC-ABDS
3) What are some of the flagship Big Data applications (current and future) that Big Data research in India should focus on?
   Streaming: http://streamingsystems.org/
4) What initiatives should be taken up by the government, academia, and industry in India to train and develop the next generation of data science professionals? Free and accessible sources of big data are hard to come by. The agencies and companies that have such Big Data do not currently share it. What ideas can be pursued to address this data unlocking problem?
   Data Science: largest IT masters at IU next semester (359). Work as a community to get benchmark/educational datasets.

High Performance Computing Apache Big Data Software Stack
[Figure: software stack diagram; green implies HPC integration]
Meet-up Community: Spark 107,000 in 233 groups; Storm 9,400; Hadoop 40,000 and installed in 32% of company data systems 2013
Amazon: all dead!

Big Data and (Exascale) Simulation Convergence I
Our approach to Convergence is built around two ideas that avoid addressing the hardware directly, as with modern DevOps technology it isn't hard to retarget applications between different hardware systems. Rather, we approach Convergence through applications and software.
This talk has described the Convergence Diamonds that unify Big Simulation and Big Data applications and so allow one to more easily identify good approaches to implement Big Data and Exascale applications in a uniform fashion. This is summarized on Slides III and IV.
The software approach builds on the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept described in Slide II (http://dsc.soic.indiana.edu/publications/HPC-ABDSDescribed_final.pdf, http://hpc-abds.org/kaleidoscope/).
This arranges key HPC and ABDS software together in 21 layers, showing where HPC and ABDS overlap. For example, it introduces a communication layer to allow ABDS runtimes like Hadoop, Storm, Spark and Flink to use the richest high performance capabilities shared with MPI. Generally, it proposes how to use HPC and ABDS software together.
12/14/2015

Things to do for Big Data and (Exascale) Simulation Convergence III
Converge Applications: Separate data and model to classify Applications and Benchmarks across Big Data and Big Simulations to give Convergence Diamonds with many facets
Indicated how to extend Big Data Ogres to Big Simulations by looking separately at model and data in Ogres
Diamonds will have five views or collections of facets: Problem Architecture; Execution; Data Source and Style; Big Data Processing; Big Simulation Processing
Facets cover data, model or their combination (the problem or application)
Note Simulation Processing View has similarities to old parallel computing benchmarks
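
One way to picture the five views is as a small record that groups an application's facets by view, so that a Big Data application and a Big Simulation can be compared facet by facet. The sketch below is only an illustration under stated assumptions: the view names come from the slide, while the class name `ConvergenceDiamond`, its methods, the two example applications and every facet value are hypothetical placeholders, not the actual facet catalogue.

```python
from dataclasses import dataclass, field

# View names are taken from the slide; everything else here is hypothetical.
VIEWS = (
    "Problem Architecture",
    "Execution",
    "Data Source and Style",
    "Big Data Processing",
    "Big Simulation Processing",
)


def _empty_views():
    """Start every diamond with an empty facet set for each of the five views."""
    return {view: set() for view in VIEWS}


@dataclass
class ConvergenceDiamond:
    name: str
    facets: dict = field(default_factory=_empty_views)

    def add_facet(self, view, facet):
        """Record a facet under one of the five views."""
        if view not in self.facets:
            raise KeyError(f"unknown view: {view}")
        self.facets[view].add(facet)

    def shared_with(self, other):
        """Return the facets two applications have in common, grouped by view."""
        shared = {}
        for view in VIEWS:
            common = self.facets[view] & other.facets[view]
            if common:
                shared[view] = common
        return shared


# Hypothetical applications with placeholder facet values.
clustering = ConvergenceDiamond("clustering (hypothetical)")
clustering.add_facet("Problem Architecture", "iterative")
clustering.add_facet("Big Data Processing", "machine learning")

pde_solver = ConvergenceDiamond("PDE solver (hypothetical)")
pde_solver.add_facet("Problem Architecture", "iterative")
pde_solver.add_facet("Big Simulation Processing", "grid-based")

print(clustering.shared_with(pde_solver))
```

Facets shared across the two applications are the places where a common implementation approach might be worth exploring.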

Things to do for Big Data and (Exascale) Simulation Convergence IV
Convergence Benchmarks: we will use benchmarks that cover the facets of the convergence diamonds, i.e. cover big data and simulations
As we separate data and model, compute intensive simulation benchmarks (e.g. solve partial differential equation) will be linked with data analytics (the model in big data)
IU focus: SPIDAL (Scalable Parallel Interoperable Data Analytics Library) with high performance clustering, dimension reduction, graphs, image processing, as well as MLlib, will be linked to core PDE solvers to explore the communication layer of parallel middleware
Maybe integrating data and simulation is an interesting idea in benchmark sets
Convergence Programming Model: Note parameter servers used in machine learning will be mimicked by collective operators invoked on distributed parameter (model) storage
E.g. Harp as Hadoop HPC Plug-in
There should be interest in using Big Data software systems to support exascale simulations
Streaming solutions from IoT to analysis of astronomy and LHC data will drive high performance versions of Apache streaming systems
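
The remark about mimicking parameter servers with collective operators can be sketched in a few lines. The example below is a single-process simulation, not Harp or MPI code: each simulated worker holds a replica of the model and a shard of the data, computes a local gradient, and a simulated allreduce (element-wise sum delivered to every worker) takes the place of pushing to and pulling from a central parameter server. The function names, the one-parameter linear model and the data shards are all assumptions made for illustration.

```python
def allreduce_sum(local_vectors):
    """Simulated allreduce: every worker receives the element-wise sum."""
    total = [sum(values) for values in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def local_gradient(weights, shard):
    """Gradient of 0.5 * sum((w*x - y)^2) over one worker's data shard."""
    w = weights[0]
    return [sum((w * x - y) * x for x, y in shard)]


def train(shards, steps=50, learning_rate=0.1):
    """Data-parallel gradient descent with one model replica per worker."""
    replicas = [[0.0] for _ in shards]
    n_points = sum(len(shard) for shard in shards)
    for _ in range(steps):
        # Each worker computes a gradient from its own shard only.
        grads = [local_gradient(w, s) for w, s in zip(replicas, shards)]
        # The collective replaces the parameter server's aggregate step.
        summed = allreduce_sum(grads)
        # Every worker applies the same update, so replicas stay in sync.
        for weights, grad in zip(replicas, summed):
            weights[0] -= learning_rate * grad[0] / n_points
    return replicas


# Hypothetical data following y = 2x, split across three simulated workers.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
replicas = train(shards)
print(replicas)
```

Because every worker receives the same reduced gradient, no central server holds the model; the model storage is distributed and kept consistent by the collective itself.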

Java MPI performs better than Threads
[Figure: speedup on 128 24-core Haswell nodes, 200K dataset]
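
For readers unfamiliar with the metric on this slide, speedup and parallel efficiency can be computed as below. This is a generic sketch of the standard definitions, not the measurement procedure behind the slide: the node and core counts are from the slide, while the function names and the timings in the usage lines are hypothetical.

```python
def total_cores(nodes, cores_per_node):
    """Number of cores available across the whole cluster."""
    return nodes * cores_per_node


def speedup(baseline_time, parallel_time):
    """How many times faster the parallel run is than the baseline run."""
    return baseline_time / parallel_time


def efficiency(baseline_time, parallel_time, baseline_cores, parallel_cores):
    """Speedup divided by the ideal speedup for the extra cores used."""
    ideal = parallel_cores / baseline_cores
    return speedup(baseline_time, parallel_time) / ideal


# 128 nodes with 24 cores each, as stated on the slide.
cores = total_cores(128, 24)

# Hypothetical timings in seconds, chosen only to exercise the functions.
print(cores, speedup(100.0, 4.0), efficiency(100.0, 4.0, 1, 50))
```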

Things to do for Big Data and (Exascale) Simulation Convergence V
Converge Language: Make Java run as fast as C++ (Java Grande) for computing and communication; see previous slide
Surprising that there is so much Big Data work in industry, but basic high performance Java methodology and tools are missing
Needs some work, as there is no agreed OpenMP for Java parallel threads
OpenMPI supports Java but needs enhancements to get best performance on needed collectives (for C++ and Java)
Convergence Language Grande should support Python, Java (Scala), C/C++ (Fortran)