
FREERIDE-G: A Grid-Based Middleware for Scalable Processing of Remote Data
Leonid Glimcher and Gagan Agrawal
Computer Science and Engineering, The Ohio State University

Abundance of Data
Data generated everywhere:
–Sensors (intrusion detection, satellites)
–Scientific simulations (fluid dynamics, molecular dynamics)
–Business transactions (purchases, market trends)
Analysis needed to translate data into knowledge
Growing data size creates problems
Online repositories are becoming more prominent for data storage

Emergence of Cloud and Utility Computing
Group generating data
–Uses remote resources for storing data
–Already popular with SDSC/SRB
Scientist interested in deriving results from data
–Uses distinct, but also remote, resources for processing
Remote Data Analysis Paradigm
–Data, computation, and user at different locations
–Each unaware of the location of the others

Remote Data Analysis
Advantages:
–Flexible use of resources
–Do not overload the data repository
–No unnecessary data movement
–Avoid caching: process data once it arrives
Challenge – tedious details:
–Data retrieval and caching
–Use of parallel configurations
–Use of heterogeneous resources
–Performance issues
Can a Grid Middleware Ease Application Development for Remote Data Analysis and Yet Provide High Performance?

Our Work
FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)
Enable development of flexible and scalable remote data processing applications
[Diagram: user interacting with the middleware, which spans a repository cluster and a compute cluster]

Challenges
Support use of parallel configurations
–For hosting data and processing data
Transparent data movement
Integration with Grid/Web standards
Resource selection
–Computing resources
–Data replica
Scheduling and load balancing
Data wrapping issues

FREERIDE (G) Processing Structure
KEY observation: most data mining algorithms follow canonical loop
Middleware API:
–Subset of data to be processed
–Reduction object
–Local and global reduction operations
–Iterator
Derived from precursor system FREERIDE

While( ) {
    forall( data instances d ) {
        (I, d') = process(d)
        R(I) = R(I) op d'
    }
    .......
}
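To make the canonical loop concrete, the sketch below shows how k-means clustering (one of the applications evaluated later) could be written against a FREERIDE-style reduction interface. This is a minimal illustration, not the actual FREERIDE-G API: KMeansReduction, closest_centroid, and local_reduce are hypothetical names assumed here; only the loop structure (I, d') = process(d); R(I) = R(I) op d' comes from the slide above.

```cpp
// Hypothetical sketch: k-means local reduction expressed in the canonical
// FREERIDE-style processing structure. All names are illustrative.
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };

// Reduction object: per-cluster running sums and counts.
struct KMeansReduction {
    std::vector<Point>  sums;
    std::vector<size_t> counts;
    explicit KMeansReduction(size_t k) : sums(k, Point{0, 0, 0}), counts(k, 0) {}
};

// process(d): find the index I of the centroid closest to data instance d.
static size_t closest_centroid(const Point& d, const std::vector<Point>& centroids) {
    size_t best = 0;
    double best_dist = 1e300;
    for (size_t i = 0; i < centroids.size(); ++i) {
        double dx = d.x - centroids[i].x, dy = d.y - centroids[i].y, dz = d.z - centroids[i].z;
        double dist = dx * dx + dy * dy + dz * dz;
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;
}

// Local reduction: the body of "forall( data instances d )" for one chunk.
void local_reduce(const std::vector<Point>& chunk,
                  const std::vector<Point>& centroids,
                  KMeansReduction& R) {
    for (const Point& d : chunk) {
        size_t I = closest_centroid(d, centroids);   // (I, d') = process(d)
        R.sums[I].x += d.x;                          // R(I) = R(I) op d'
        R.sums[I].y += d.y;
        R.sums[I].z += d.z;
        R.counts[I] += 1;
    }
}

// A global reduction would merge the KMeansReduction objects from all
// compute nodes and recompute centroids; the outer While( ) loop repeats
// until the centroids stop changing.
```

The idea behind the API is that each compute node runs the local reduction over its retrieved chunks, and only the compact reduction object has to be exchanged for the global reduction step.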

FREERIDE-G Evolution
FREERIDE
–Data stored locally
FREERIDE-G-ADR
–ADR responsible for remote data retrieval
FREERIDE-G-SRB
–SRB responsible for remote data retrieval
FREERIDE-G grid service
–Grid service featuring load balancing and data integration

Evolution
[Diagram: application layered over local data (FREERIDE), ADR (FREERIDE-G-ADR), SRB (FREERIDE-G-SRB), and Globus (FREERIDE-G-GT)]

Classic Data Mining Applications
Tested here:
–Expectation Maximization clustering
–Kmeans clustering
–K Nearest Neighbor search
Previously on FREERIDE:
–Apriori and FPGrowth frequent itemset miners
–RainForest decision tree construction

Scientific Data Processing Applications
VortexPro:
–Finds vortices in volumetric fluid/gas flow datasets
Defect detection:
–Finds defects in molecular lattice datasets
–Categorizes defects into classes
Middleware is useful for a wide variety of algorithms

FREERIDE-G Grid Service
Emergence of Grid leads to computing standards and infrastructures
Data host component already integrated through SRB for remote data access
OGSA – previous standard for Grid Services
WSRF – standard that merges Web and Grid Service definitions
Globus Toolkit is the most fitting infrastructure for Grid Service conversion of FREERIDE-G

FREERIDE-G System Architecture

Compute Node
More compute nodes than data hosts
Each node:
1. Registers IO (from index)
2. Connects to data host
While (chunks to process):
1. Dispatch IO request(s)
2. Poll pending IO
3. Process retrieved chunks
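The sketch below is one plausible shape for this retrieval loop in code. It is an assumption for illustration only: the DataHostConnection type and its methods (register_io, connect, dispatch_requests, poll_completed, chunks_remaining) are invented names, not the actual FREERIDE-G interface.

```cpp
// Hypothetical sketch of the FREERIDE-G compute-node loop: register I/O
// from the index, connect to the data host, then overlap chunk retrieval
// with processing. All types and method names are illustrative.
#include <deque>
#include <vector>

struct Chunk { std::vector<char> bytes; };

struct DataHostConnection {
    void register_io(const std::vector<int>& chunk_ids); // register IO (from index)
    void connect();                                       // connect to data host
    void dispatch_requests(int how_many);                 // dispatch async I/O requests
    std::deque<Chunk> poll_completed();                   // poll pending I/O
    bool chunks_remaining() const;                        // work left or still in flight?
};

void process_chunk(const Chunk& c);  // application-supplied local reduction

void compute_node_main(DataHostConnection& host, const std::vector<int>& my_chunks) {
    host.register_io(my_chunks);                       // 1. register IO (from index)
    host.connect();                                    // 2. connect to data host
    while (host.chunks_remaining()) {                  // while (chunks to process)
        host.dispatch_requests(2);                     //   1. dispatch IO request(s)
        for (const Chunk& c : host.poll_completed())   //   2. poll pending IO
            process_chunk(c);                          //   3. process retrieved chunks
    }
}
```

Dispatching requests before polling lets data transfer from the repository overlap with computation on already-retrieved chunks, which is how a remote-data-analysis setup can hide part of the network latency.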

FREERIDE-G in Action
Data host side: SRB Agent, SRB Master, MCAT
Compute node side:
–I/O registration
–Connection establishment
–While (more chunks to process):
 –I/O request dispatched
 –Pending I/O polled
 –Retrieved data chunks analyzed

Implementation Challenges
Interaction with Code Repository
–Simplified Wrapper and Interface Generator (SWIG)
–XML descriptors of API functions
–Each API function wrapped in its own class
Integration with MPICH-G2
–Supports MPI
–Deployed through Globus components (GRAM)
–Hides potential heterogeneity in service startup and management

Experimental Setup
Organizational Grid:
–Data hosted on Opteron 250 cluster
–Processed on Opteron 254 cluster
–Connected using 2 10 GB optical fibers
Goals:
–Demonstrate parallel scalability of applications
–Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms

Deployment Overhead Evaluation
Clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment:
–Kmeans Clustering with 6.4 GB dataset: 18-20%
–Vortex Detection with 14.8 GB dataset: 17-20%

Previous FREERIDE-G Limitations
Data required to be resident on a single cluster
Storage data format has to be same as processing data format
Processing has to be confined to a single cluster
Load balancing and data integration module helps FREERIDE-G overcome limitations efficiently

Motivating scenario
[Diagram: data distributed across repository nodes A, B, C, and D]

Run-time Load Balancing
1. Based on the unbalanced (initial) distribution, figure out a partitioning scheme (across repositories)
2. Model for load balancing (across processing nodes) based on two components:
–partitioning of data chunks
–data transfer time
Weights used to combine the two components
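The slides do not give the exact formula; one plausible reading (an assumption here, not from the original) of how the two components are combined, using the differences from the other nodes' averages that the algorithm on the next slide computes, is:

$$\mathrm{cost}(c,p) = w_{\mathrm{proc}}\left(T_{\mathrm{proc}}(p) - \bar{T}_{\mathrm{proc}}\right) + w_{\mathrm{trans}}\left(T_{\mathrm{trans}}(c,p) - \bar{T}_{\mathrm{trans}}\right)$$

where $T_{\mathrm{proc}}(p)$ and $T_{\mathrm{trans}}(c,p)$ are the estimated processing and transfer times if chunk $c$ is assigned to node $p$, the bars denote averages over the remaining processing nodes, and $w_{\mathrm{proc}}$, $w_{\mathrm{trans}}$ are the combining weights; $c$ is assigned to the node that minimizes this cost.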

Load Balancing Algorithm
Foreach (chunk c)
    If (no processing node assigned to c)
        Determine transfer costs across all attributes
        // Compare processing costs across all nodes:
        Foreach (processing node p)
            Assume chunk is assigned to p, update processing cost and transfer cost
            Calculate averages for all other processing nodes (except p) for processing cost and transfer cost
            Compute difference in both costs between p and averages
            Add weighted costs together to compute total cost
            If (cost < minimum cost) update minimum
        Assign c to processing node with minimum cost
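A minimal sketch of this assignment loop, under the weighted-cost assumption noted above; the cost-estimation functions, weights, and bookkeeping structures are placeholders invented for illustration, not FREERIDE-G's actual implementation.

```cpp
// Hypothetical sketch of the cost-based chunk assignment described above.
// Assumes at least two processing nodes; all names are illustrative.
#include <limits>
#include <vector>

struct NodeState {
    double processing_cost = 0;  // estimated processing time accumulated so far
    double transfer_cost   = 0;  // estimated transfer time accumulated so far
};

double transfer_time(int c, int p);    // estimated time to move chunk c to node p
double processing_time(int c, int p);  // estimated time for node p to process chunk c

int assign_chunk(int c, std::vector<NodeState>& nodes,
                 double w_proc, double w_trans) {
    int best_node = -1;
    double best_cost = std::numeric_limits<double>::max();

    for (int p = 0; p < (int)nodes.size(); ++p) {
        // Assume the chunk is assigned to p; update its two costs.
        double proc_p  = nodes[p].processing_cost + processing_time(c, p);
        double trans_p = nodes[p].transfer_cost   + transfer_time(c, p);

        // Averages over all other processing nodes (except p).
        double proc_avg = 0, trans_avg = 0;
        for (int q = 0; q < (int)nodes.size(); ++q) {
            if (q == p) continue;
            proc_avg  += nodes[q].processing_cost;
            trans_avg += nodes[q].transfer_cost;
        }
        proc_avg  /= (nodes.size() - 1);
        trans_avg /= (nodes.size() - 1);

        // Weighted sum of the differences from the averages.
        double cost = w_proc * (proc_p - proc_avg) + w_trans * (trans_p - trans_avg);
        if (cost < best_cost) { best_cost = cost; best_node = p; }
    }

    // Commit the assignment to the node with minimum cost.
    nodes[best_node].processing_cost += processing_time(c, best_node);
    nodes[best_node].transfer_cost   += transfer_time(c, best_node);
    return best_node;
}
```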

Approach to Integration
Bandwidth available also used to determine level of concurrency
After all vertical views of chunks are re-distributed to destination repository nodes, use wrapper to convert data
Automatically generated wrapper:
–Input: physical data layout (storage layout)
–Output: logical data layout (application view)
[Figure: attributes (X1, Y1, Z1) ... (Xn, Yn, Zn) shown in a horizontal (row-wise) view and a vertical (column-wise) view]
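To make the wrapper idea concrete, here is a small sketch (again hypothetical, with invented names such as VerticalChunk and wrap_to_logical_view) of converting a chunk stored column-wise (the vertical, storage-layout view of the X, Y, Z attributes) into the row-wise records (horizontal, application view) that a processing function expects.

```cpp
// Hypothetical data-virtualization wrapper: converts a column-wise
// (vertical, storage-layout) chunk into row-wise (horizontal,
// application-view) records. Names and layouts are illustrative.
#include <cstddef>
#include <vector>

// Vertical view: each attribute stored contiguously, X1..Xn, Y1..Yn, Z1..Zn.
struct VerticalChunk {
    std::vector<double> x, y, z;
};

// Horizontal view: one record per data instance, (Xi, Yi, Zi).
struct Record { double x, y, z; };

std::vector<Record> wrap_to_logical_view(const VerticalChunk& storage) {
    std::vector<Record> logical;
    logical.reserve(storage.x.size());
    for (size_t i = 0; i < storage.x.size(); ++i)
        logical.push_back({storage.x[i], storage.y[i], storage.z[i]});
    return logical;
}
```

The same wrapper idea is what allows the processing data format to differ from the repository's storage format, addressing the limitation noted earlier.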

Experimental Setup
Settings:
–Organizational Grid
–Wide Area Network
Goals are to evaluate:
–Scalability
–Load balancing accuracy
–Adaptability to scenarios: compute bound, I/O bound, WAN setting

Setup 1: Organizational Grid
Data hosted on Opteron 250’s (repository cluster, bmi-ri)
Processed on Opteron 254’s (compute cluster, cse-ri)
2 clusters connected through 2 10 GB optical fibers
Both clusters within same city (0.5 mile apart)
Evaluating:
–Scalability
–Adaptability
–Integration overhead

Setup 2: WAN
Data repositories:
–Opteron 250’s (OSU)
–Opteron 258’s (Kent St)
Processed on Opteron 254’s (compute cluster, OSU)
No dedicated link between processing and repository clusters
Evaluating:
–Scalability
–Adaptability

Scalability in Organizational Grid
Vortex Detection (14.8 GB):
–Linear speedup for even data node – compute node scale-up
–Near-linear speedup for uneven compute node scale-up
–Our approach outperforms the initial (unbalanced) configuration
–Comes close to the (statically) balanced configuration

Evaluating Balancing Accuracy
[Charts: EM (25.6 GB) and Kmeans (25.6 GB)]

Conclusions
FREERIDE-G – middleware for scalable processing of remote data
–Integrated with grid computing standards
–Supports load balancing and data integration
–Support for high-end processing
–Ease use of parallel configurations
–Hide details of data movement and caching
–Fine-tuned performance models
–Compliance with remote repository standards