Department of Intelligent Systems Engineering

Presentation transcript:

Big Data Panel
Workshop on Big Data Foundations
In conjunction with: 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India
Geoffrey Fox
December 16, 2015
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
2/17/2019

Panel Questions
1) What is HPC's role in Big Data computing?
   High performance; conversely, Big Data brings sustainability and functionality to HPC
2) What are the new and exciting developments in the programming stack, application requirements, and design of large scale infrastructure for Big Data?
   HPC-ABDS
3) What are some of the flagship Big Data applications (current and future) that Big Data research in India should focus on?
   Streaming: http://streamingsystems.org/
4) What initiatives should be taken up by the government, academia, and industry in India to train and develop the next generation of data science professionals? Free and accessible sources of big data are hard to come by. The agencies and companies that have such Big Data do not currently share it. What ideas can be pursued to address this data unlocking problem?
   Data Science; largest IT masters at IU next semester (359). Work as a community to get benchmark/educational datasets

High Performance Computing Apache Big Data Software Stack
Green implies HPC Integration
Meet-up Community:
Spark 107,000 in 233 groups
Storm 9,400
Hadoop 40,000 and installed in 32% of company data systems 2013
Amazon: all dead!

Big Data and (Exascale) Simulation Convergence I
Our approach to Convergence is built around two ideas that avoid addressing the hardware directly, since with modern DevOps technology it isn’t hard to retarget applications between different hardware systems. Rather, we approach Convergence through applications and software.
This talk has described the Convergence Diamonds that unify Big Simulation and Big Data applications and so allow one to more easily identify good approaches to implementing Big Data and Exascale applications in a uniform fashion. This is summarized on Slides III and IV.
The software approach builds on the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept described in Slide II (http://dsc.soic.indiana.edu/publications/HPC-ABDSDescribed_final.pdf, http://hpc-abds.org/kaleidoscope/).
This arranges key HPC and ABDS software together in 21 layers, showing where HPC and ABDS overlap. For example, it introduces a communication layer to allow ABDS runtimes like Hadoop, Storm, Spark and Flink to use the richest high performance capabilities shared with MPI. Generally, it proposes how to use HPC and ABDS software together.
12/14/2015

Things to do for Big Data and (Exascale) Simulation Convergence III
Converge Applications: Separate data and model to classify Applications and Benchmarks across Big Data and Big Simulations to give Convergence Diamonds with many facets
Indicated how to extend Big Data Ogres to Big Simulations by looking separately at model and data in Ogres
Diamonds will have five views or collections of facets: Problem Architecture; Execution; Data Source and Style; Big Data Processing; Big Simulation Processing
Facets cover data, model or their combination – the problem or application
Note Simulation Processing View has similarities to old parallel computing benchmarks

Things to do for Big Data and (Exascale) Simulation Convergence IV
Convergence Benchmarks: we will use benchmarks that cover the facets of the convergence diamonds, i.e. cover big data and simulations
As we separate data and model, compute intensive simulation benchmarks (e.g. solve partial differential equation) will be linked with data analytics (the model in big data)
IU focus: SPIDAL (Scalable Parallel Interoperable Data Analytics Library) with high performance clustering, dimension reduction, graphs, image processing as well as MLlib will be linked to core PDE solvers to explore the communication layer of parallel middleware
Maybe integrating data and simulation is an interesting idea in benchmark sets
Convergence Programming Model
Note parameter servers used in machine learning will be mimicked by collective operators invoked on distributed parameter (model) storage
E.g. Harp as Hadoop HPC Plug-in
There should be interest in using Big Data software systems to support exascale simulations
Streaming solutions from IoT to analysis of astronomy and LHC data will drive high performance versions of Apache streaming systems
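
The remark that parameter servers can be mimicked by collective operators on distributed model storage can be illustrated with a small sketch. This is not Harp's or MPI's API: the collective is simulated in a single process, and the names `allreduce_sum`, `local_gradient` and `train`, the toy least-squares model y ≈ w·x, and all data values are assumptions made for illustration. Each worker holds a replica of the model, computes a gradient on its own data shard, and an allreduce gives every worker the same global sum, so no central server is needed.

```python
from typing import List, Sequence, Tuple

Shard = Sequence[Tuple[float, float]]


def allreduce_sum(local_values: Sequence[Sequence[float]]) -> List[List[float]]:
    """Simulated allreduce: every worker receives the element-wise sum.

    A real runtime (MPI, Harp, ...) would do this over the network; here the
    'workers' are just entries of a list (illustrative assumption).
    """
    total = [sum(column) for column in zip(*local_values)]
    return [list(total) for _ in local_values]


def local_gradient(w: float, shard: Shard) -> List[float]:
    """Gradient of the squared error for y ~ w*x on one shard, plus point count."""
    grad = sum(2.0 * (w * x - y) * x for x, y in shard)
    return [grad, float(len(shard))]


def train(shards: Sequence[Shard], steps: int = 200, lr: float = 0.01) -> List[float]:
    """Data-parallel gradient descent with one model replica per worker."""
    replicas = [0.0 for _ in shards]
    for _ in range(steps):
        # Each worker computes its contribution from local data only.
        contributions = [local_gradient(w, s) for w, s in zip(replicas, shards)]
        # The collective replaces the push/pull round trip to a parameter server.
        reduced = allreduce_sum(contributions)
        # Every worker applies the same averaged update to its own replica.
        replicas = [w - lr * g / n for w, (g, n) in zip(replicas, reduced)]
    return replicas


# Hypothetical data: y = 3x, split round-robin over 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
replicas = train(shards)
```

Because every worker sees the same reduced values, the replicas stay identical after each step, which is the property a parameter server would otherwise provide.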

Java MPI performs better than Threads
[Figure: Speedup; 128 24-core Haswell nodes; 200K Dataset]
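
For readers comparing the MPI and threads curves, the usual definitions of speedup and parallel efficiency can be sketched as below. The slide does not state which baseline its speedup is measured against, so the reference time here is an assumption, and the function names and the timings in the usage lines are hypothetical, not measurements from the talk.

```python
def speedup(reference_seconds: float, parallel_seconds: float) -> float:
    """Speedup = reference run time / parallel run time."""
    if reference_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("run times must be positive")
    return reference_seconds / parallel_seconds


def efficiency(reference_seconds: float, parallel_seconds: float, workers: int) -> float:
    """Parallel efficiency = speedup / number of workers (1.0 is ideal)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup(reference_seconds, parallel_seconds) / workers


# Hypothetical timings, for illustration only.
s = speedup(100.0, 2.0)
e = efficiency(100.0, 2.0, 64)
```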

Things to do for Big Data and (Exascale) Simulation Convergence V
Converge Language: Make Java run as fast as C++ (Java Grande) for computing and communication – see previous slide
Surprising that there is so much Big Data work in industry but basic high performance Java methodology and tools are missing
Needs some work as there is no agreed OpenMP for Java parallel threads
OpenMPI supports Java but needs enhancements to get best performance on needed collectives (for C++ and Java)
Convergence Language Grande should support Python, Java (Scala), C/C++ (Fortran)