SALSA
Large-Scale Data Analysis Applications
Application areas: Computer Vision, Complex Networks, Bioinformatics, Deep Learning

Data analysis plays an important role in data-driven scientific discovery and commercial services. A guiding principle is that HPC ideas should integrate well with Apache (and other) open source big data technologies (ABDS). ABDS seems the likely winner, as it shows clear vitality and innovation and has a sustainable software model. Our current catalog identifies 200 software subsystems divided into 17 layers. To illustrate this principle, I have shown that earlier standalone enhanced versions of MapReduce can be replaced by a Hadoop plug-in that offers both data abstractions useful for high-performance iteration and communication using the best available (MPI) approaches, which are portable to HPC and Cloud. This iterative solver would bring robustness, scalability, productivity, and sustainability to applications including Computer Vision, Pathology, Information Visualization, Network Science, Remote Sensing, and Physical Simulation, as well as many commercial applications. This variety of applications should allow tests of memory architecture, vectorization, and parallelization approaches on the different Intel systems.

Judy Qiu, Indiana University

Map-Collective Communication Model

We generalize the Map-Reduce concept to Map-Collective, noting that large collectives (high performance data movement) are a distinguishing feature of data intensive and data mining applications.

[Figure: Parallelism Model. MapReduce Model: map tasks (M), Shuffle, reduce tasks (R). Map-Collective Model: map tasks (M) with Collective Communication.]

[Figure: Software Architecture. Application: MapReduce Applications, Map-Collective Applications. Framework: MapReduce V2, Harp. Resource Manager: YARN.]

Hadoop Plugin (on Hadoop and Hadoop 2.2.0)
REEF Architecture
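The Map-Collective idea can be sketched in a few lines of Python. This is a minimal single-process simulation, not Harp's API: the function names (`allreduce_sum`, `map_task`, `kmeans_map_collective`) and the choice of one-dimensional K-means as the iterative workload are assumptions made for illustration. Each simulated worker runs a local map step over its own data partition, and a collective allreduce then gives every worker the same global result, so the next iteration can start without a separate shuffle and reduce stage.

```python
from typing import List


def allreduce_sum(local_vectors: List[List[float]]) -> List[List[float]]:
    """Simulated allreduce: every worker receives the element-wise sum."""
    total = [sum(column) for column in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def map_task(points: List[float], centroids: List[float]) -> List[float]:
    """Local map step: accumulate [sum, count] per centroid, flattened."""
    k = len(centroids)
    acc = [0.0] * (2 * k)
    for p in points:
        nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
        acc[2 * nearest] += p
        acc[2 * nearest + 1] += 1.0
    return acc


def kmeans_map_collective(partitions: List[List[float]],
                          centroids: List[float],
                          iterations: int) -> List[List[float]]:
    """Run 1-D K-means with one simulated worker per data partition."""
    # Every worker starts with its own copy of the centroids.
    worker_centroids = [list(centroids) for _ in partitions]
    for _ in range(iterations):
        # Map phase: each worker processes only its local partition.
        local = [map_task(pts, cen)
                 for pts, cen in zip(partitions, worker_centroids)]
        # Collective phase: all workers obtain the global sums and counts.
        reduced = allreduce_sum(local)
        # Each worker updates its centroids from the same global data.
        worker_centroids = [
            [r[2 * c] / r[2 * c + 1] if r[2 * c + 1] > 0 else old[c]
             for c in range(len(old))]
            for r, old in zip(reduced, worker_centroids)
        ]
    return worker_centroids


if __name__ == "__main__":
    parts = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
    print(kmeans_map_collective(parts, [0.0, 12.0], iterations=5))
```

In a real deployment the allreduce would move data between processes or nodes; here it is a plain function call so that the control flow of the model stays visible.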

Hierarchical Data Abstraction and Collective Communication

We create abstractions and connect to other communities so we can collaborate on common software building blocks.

[Figure: three-level data abstraction hierarchy with collective operations.
Table: Array Table, Edge Table, Message Table, Vertex Table, KeyValue Table.
Partition: Array Partition, Edge Partition, Message Partition, Vertex Partition, KeyValue Partition.
Basic Types: Byte Array, Int Array, Long Array, Double Array, Struct Object; Vertices, Edges, Messages, KeyValues; Array, Commutable.
Collective operation groups shown beside the levels: (1) Broadcast, Send, Gather; (2) Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex; (3) Broadcast, Send.]
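The layering of basic arrays into partitions and partitions into tables can be sketched as follows. The class and function names (`ArrayPartition`, `ArrayTable`, `add_arrays`, `allreduce`) are hypothetical and written for this illustration; they are not taken from Harp or any other library. The sketch assumes that a table merges partitions sharing an ID through a combiner, and that an allreduce over tables leaves every worker with the same merged table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Combiner = Callable[[List[float], List[float]], List[float]]


def add_arrays(a: List[float], b: List[float]) -> List[float]:
    """Combiner that merges two arrays by element-wise addition."""
    return [x + y for x, y in zip(a, b)]


@dataclass
class ArrayPartition:
    """Partition level: a basic array tagged with a partition ID."""
    partition_id: int
    data: List[float]


@dataclass
class ArrayTable:
    """Table level: partitions keyed by ID, merged with a combiner."""
    combiner: Combiner
    partitions: Dict[int, ArrayPartition] = field(default_factory=dict)

    def add_partition(self, part: ArrayPartition) -> None:
        existing = self.partitions.get(part.partition_id)
        if existing is None:
            # Store a copy so tables never share mutable arrays.
            self.partitions[part.partition_id] = ArrayPartition(
                part.partition_id, list(part.data))
        else:
            existing.data = self.combiner(existing.data, part.data)


def allreduce(tables: List[ArrayTable]) -> List[ArrayTable]:
    """Simulated allreduce: each worker ends with the fully merged table."""
    merged = ArrayTable(tables[0].combiner)
    for table in tables:
        for part in table.partitions.values():
            merged.add_partition(part)
    results = []
    for _ in tables:
        copy = ArrayTable(merged.combiner)
        for part in merged.partitions.values():
            copy.add_partition(part)
        results.append(copy)
    return results


if __name__ == "__main__":
    w0 = ArrayTable(add_arrays)
    w0.add_partition(ArrayPartition(0, [1.0, 2.0]))
    w1 = ArrayTable(add_arrays)
    w1.add_partition(ArrayPartition(0, [10.0, 20.0]))
    for table in allreduce([w0, w1]):
        print({pid: p.data for pid, p in table.partitions.items()})
```

Keeping the combiner on the table rather than on the collective means the same allreduce routine could serve other partition types, provided each table knows how to merge its own partitions.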