Department of Intelligent Systems Engineering

Presentation transcript:

Department of Intelligent Systems Engineering
Digital Science Center: Research in High Performance Computing, Distributed Computing and Big Data (PHI)
Geoffrey Fox, October 28, 2016
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
http://dsc.soic.indiana.edu/publications/SPIDAL-DIBBSreport_July2016.pdf
Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

SPIDAL Project
Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
NSF14-43054, started October 1, 2014
Indiana University (Fox, Qiu, Crandall, von Laszewski), Rutgers (Jha), Virginia Tech (Marathe), Kansas (Paden), Stony Brook (Wang), Arizona State (Beckstein), Utah (Cheatham)
A co-design project: software, algorithms, applications
11/8/2018

Co-designing Building Blocks Collaboratively
Software: MIDAS, HPC-ABDS

Main Components of SPIDAL Project
Design and build a scalable high performance data analytics library.
SPIDAL (Scalable Parallel Interoperable Data Analytics Library): scalable analytics for:
  Domain-specific data analytics libraries (mainly from the project).
  Core machine learning libraries (mainly from the community).
  Performance of Java and MIDAS, inter- and intra-node.
NIST Big Data Application Analysis: features of data-intensive applications, deriving 64 Convergence Diamonds. Application Nexus.
HPC-ABDS: Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Software Nexus.
MIDAS: integrating middleware (from the project).
Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics, streaming for robotics, streaming stock analytics.
Implementations: HPC as well as clouds (OpenStack, Docker); convergence with common DevOps tool. Hardware Nexus.

Hardware: Clouds and HPC
Prototype DSC: 128 node Haswell Cluster; 4 node 16GPU Cluster; 64 node Knights Landing Cluster
UITS

HPC-ABDS

HPC-ABDS SPIDAL Project Activities
Green is MIDAS; black is SPIDAL.
Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Heron/Flink and Cloudmesh on HPC cluster
Level 16: Applications: Datamining for molecular dynamics, image processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining
Level 16: Algorithms: Generic and custom for applications (SPIDAL)
Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop, Spark, Flink. Improve inter- and intra-node performance; science data structures
Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark, Flink, Giraph) using HPC runtime technologies, Harp
Level 12: In-memory Database: Redis + Spark used in Pilot-Data Memory
Level 11: Data management: Hbase and MongoDB integrated via use of Beam and other Apache tools; enhance Hbase
Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm
Level 6: DevOps: Python Cloudmesh virtual cluster interoperability

Java MPI performs better than FJ Threads
128 24-core Haswell nodes on SPIDAL 200K DA-MDS code
Plot legend:
  Best FJ Threads intra node; MPI inter node
  Best MPI; inter and intra node
  MPI; inter/intra node; Java not optimized
Speedup compared to 1 process per node on 48 nodes

HTML5 web viewer WebPlotViz
Supports visualization of 3D point sets (typically derived by mapping from abstract spaces) for both the streaming and non-streaming cases.
Simple data management layer.
3D web visualizer with various capabilities such as defining color schemes, point sizes, glyphs, labels.
Core technologies:
  MongoDB management
  Play server-side framework
  Three.js
  WebGL
  JSON data objects
  Bootstrap JavaScript web pages
Open source: http://spidal-gw.dsc.soic.indiana.edu/
~10,000 lines of extra code
Architecture diagram labels: Front end view (Browser); Plot visualization & time series animation (Three.js); Web Request Controllers (Play Framework); Data Layer (MongoDB); Upload; Request Plots; JSON Format Plots; Upload format to JSON Converter; Server; MongoDB
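The "Upload format to JSON Converter" step in the diagram can be illustrated with a minimal sketch. The input layout (one point per line: id, x, y, z, cluster) and the JSON field names used here are assumptions made for illustration only; they are not taken from the WebPlotViz source and may not match its actual formats.

```python
import json


def points_to_json(text, plot_name="example"):
    """Convert whitespace-separated point lines into a JSON plot document.

    Hypothetical input layout: "id x y z cluster" per line.
    Blank lines and lines starting with '#' are skipped.
    """
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        idx, x, y, z, cluster = line.split()[:5]
        points.append({
            "id": int(idx),
            "x": float(x),
            "y": float(y),
            "z": float(z),
            "cluster": int(cluster),  # cluster id could drive the color scheme
        })
    # Field names "name" and "points" are illustrative, not the real schema.
    return json.dumps({"name": plot_name, "points": points})


sample = """
# id x y z cluster
0 0.10 0.20 0.30 1
1 -0.50 0.00 0.25 2
"""
doc = json.loads(points_to_json(sample))
print(len(doc["points"]), "points converted")
```

A document of this shape could then be stored in a document database and served to a browser-side renderer, which is the flow the diagram labels suggest.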

2D Vector Clustering with cutoff at 3σ
Orange star: outside all clusters; yellow circle: cluster centers.
Mass Spectrometer Peak Clustering: Charge 2 sample with 10.9 million points and 420,000 clusters, visualized in WebPlotViz.
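The 3σ cutoff can be sketched as follows. This is a simplified illustration, assuming a single isotropic σ and plain Euclidean distance to the nearest center; the slide does not state how σ is defined per dimension or which clustering algorithm produced the centers, so the function name and parameters are hypothetical.

```python
import math


def assign_with_cutoff(points, centers, sigma, n_sigma=3.0):
    """Label each 2D point with its nearest center index, or None.

    A point farther than n_sigma * sigma from every center is treated as
    lying outside all clusters (the "orange star" case on the slide).
    """
    cutoff = n_sigma * sigma
    labels = []
    for (x, y) in points:
        best_idx, best_dist = None, math.inf
        for idx, (cx, cy) in enumerate(centers):
            d = math.hypot(x - cx, y - cy)
            if d < best_dist:
                best_idx, best_dist = idx, d
        labels.append(best_idx if best_dist <= cutoff else None)
    return labels


centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.5), (9.0, 10.5), (5.0, 5.0)]
labels = assign_with_cutoff(points, centers, sigma=1.0)
print(labels)  # [0, 1, None]
```

The brute-force loop over all centers is only suitable for small examples; at the scale quoted on the slide a spatial index would be needed.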

446K sequences, ~100 clusters
Note the distorted shapes, probably due to imperfect distance measures, e.g. position correlated with sequence length.

Spherical Phylograms
MSA or SWG distances
Figure labels: MSA; RAxML result visualized in FigTree; SWG

Relative Changes in Stock Values using one day values
Expansion of previous data
Plot labels: Mid Cap, Energy, S&P, Dow Jones, Finance, +10%, Origin (0% change)

O(N^2) reduced to O(N) times cluster size
The O(N^2) interactions between the green and purple clusters should be representable by centroids, as in Barnes-Hut.
This is hard: there is no Gauss theorem and no multipole expansion, and the points really lie in a 1000-dimensional space, since they were clustered before the 3D projection.
The O(N^2) green-green and purple-purple interactions have value, but the green-purple ones are "wasted".
"Clean" sample of 446K.
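The size of the saving can be estimated with a small counting sketch. It assumes that every point interacts exactly with the other points in its own cluster and with one centroid per other cluster, and that the 446K points split evenly into 100 clusters; the even split and the function names are illustrative assumptions, and the slide itself notes that justifying the centroid approximation is the hard part.

```python
def exact_pair_count(n):
    """Number of distinct point-point interactions for n points."""
    return n * (n - 1) // 2


def clustered_pair_count(cluster_sizes):
    """Interactions when other clusters are represented by their centroids.

    Within a cluster all pairs are kept; each point additionally
    interacts with one centroid for every other cluster.
    """
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    intra = sum(s * (s - 1) // 2 for s in cluster_sizes)
    point_to_centroid = n * (k - 1)
    return intra + point_to_centroid


# Assumed even split of 446,000 points into 100 clusters.
sizes = [4460] * 100
exact = exact_pair_count(sum(sizes))
approx = clustered_pair_count(sizes)
print(f"exact pairs:       {exact:,}")
print(f"with centroids:    {approx:,}")
print(f"reduction factor:  {exact / approx:.1f}x")
```

Under these assumptions the intra-cluster term dominates, which matches the slide's "O(N) times cluster size" estimate.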