A Grid-Based Middleware for Scalable Processing of Remote Data

Presentation transcript:

A Grid-Based Middleware for Scalable Processing of Remote Data
Leonid Glimcher
Advisor: Gagan Agrawal

Abundance of Data
- Data is generated everywhere: sensors (intrusion detection, satellites), scientific simulations (fluid dynamics, molecular dynamics), business transactions (purchases, market trends)
- Analysis is needed to translate data into knowledge
- Growing data sizes create problems for data mining

Emergence of Grid Computing
- Provides access to resources and data that are otherwise inaccessible
- Aims to provide a stable foundation for distributed services
- Comes at a price: standards are needed to integrate distributed services (Open Grid Services Architecture, Web Services Resource Framework)

Remote Data Analysis
- The Grid is a good fit for remote data analysis
- The details can be very tedious: data retrieval, movement, and caching; parallel data processing; resource allocation; application configuration
- Middleware can be useful to abstract away these details

Our Approach
- Support the development of scalable applications that process remote data using middleware: FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid)
[Figure: a middleware user's application spans a repository cluster and a compute cluster]

FREERIDE-G Middleware
- Realized: Active Data Repository (ADR) based implementation; Storage Resource Broker (SRB) based implementation for compliance with remote repository standards; performance modeling and prediction framework
- Future: compliance with grid computing standards; support for data integration and load balancing

Publications
Published:
- "Parallelizing EM Clustering Algorithm on a Cluster of SMPs", Euro-Par'04
- "Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware", IPDPS'04
- "Parallelizing a Defect Detection and Categorization Application", IPDPS'05
- "FREERIDE-G: Supporting Applications that Mine Remote Data Repositories", ICPP'06
- "A Performance Prediction Framework for Grid-Based Data Mining Applications", IPDPS'07
In submission:
- "A Middleware for Remote Mining of Data From SRB-based Servers", submitted to SC'07
- "Middleware for Data Mining Applications on Clusters and Grids", invited to a special issue of the Journal of Parallel and Distributed Computing (JPDC) on "Parallel Techniques for Information Extraction"

Outline
- Introduction
- FREERIDE-G over SRB
- Performance prediction for FREERIDE-G
- Future directions: standardization as a Grid Service; data integration and load balancing
- Related work
- Conclusion

FREERIDE-G Architecture
[Figure: compute nodes connect to the data host, where an SRB Master (with MCAT) forks SRB Agents to service requests]
- I/O registration; connection establishment
- While (more chunks to process): I/O request dispatched; pending I/O polled; retrieved data chunks analyzed

SRB Data Host
- SRB Master: connection establishment; authentication; forks an agent to service I/O
- SRB Agent: performs remote I/O; services multiple client API requests
- MCAT: catalogs metadata associated with datasets, users, and resources; services metadata queries (through the SRB agent)

SDSC Storage Resource Broker
- A standard for remote data storage and access
- Data can be distributed across organizations and heterogeneous storage systems
- Used to host data for FREERIDE-G; access is provided through its client API

Compute Node
- Typically more compute nodes than data hosts
- Each node: registers I/O (from an index); connects to a data host
- While (chunks remain to process): dispatch I/O request(s); poll pending I/O; process retrieved chunks (see the sketch below)
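To make the retrieval/processing overlap concrete, here is a minimal C++ sketch of such a compute-node loop. The connection, chunk, and polling names are hypothetical stand-ins, not FREERIDE-G's actual API.

```cpp
// Minimal sketch of a compute-node loop (hypothetical names; the real
// middleware registers chunks from an index and uses the SRB client API).
#include <cstdio>
#include <deque>
#include <vector>

struct Chunk { int id; std::vector<char> bytes; };

struct DataHost {
    void dispatch(int /*chunkId*/) { /* would issue an asynchronous read */ }
    bool poll(Chunk & /*out*/)     { return false; /* stub: no data arrives */ }
};

void process(const Chunk & /*c*/)  { /* application-defined local reduction */ }

int main() {
    DataHost host;                        // connection established beforehand
    std::deque<int> toFetch = {0, 1, 2};  // chunk ids registered from the index
    int pending = 0;
    const int window = 2;                 // requests kept in flight

    // Overlap retrieval with processing: keep `window` requests outstanding
    // and process whatever has already arrived.
    while (!toFetch.empty() || pending > 0) {
        while (pending < window && !toFetch.empty()) {
            host.dispatch(toFetch.front());
            toFetch.pop_front();
            ++pending;
        }
        Chunk c;
        if (host.poll(c)) { process(c); --pending; }
        else break;   // stub host never delivers; a real one would block or retry
    }
    std::puts("all chunks processed");
}
```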

FREERIDE-G Processing Structure
- KEY observation: most data mining algorithms follow a canonical loop
- Middleware API: subset of the data to be processed; reduction object; local and global reduction operations; iterator
- Derived from the precursor system FREERIDE

    While (not done) {
      forall (data instances d) {
        I = process(d)
        R(I) = R(I) op d
      }
      ...
    }
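As an illustration of this canonical structure, the sketch below instantiates the loop for a toy 1-D k-means: the sum/count arrays play the role of the reduction object R, and assigning a point to its nearest center is process(d). It is self-contained C++, not the middleware's real API.

```cpp
// Canonical generalized-reduction loop instantiated for 1-D k-means (k = 2).
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data = {1.0, 1.2, 7.9, 8.1, 8.3};
    std::array<double, 2> centers = {0.0, 10.0};

    for (int iter = 0; iter < 10; ++iter) {          // While (not converged)
        std::array<double, 2> sum = {0, 0};          // reduction object R
        std::array<int, 2>    cnt = {0, 0};
        for (double d : data) {                      // forall (data instances d)
            int i = std::fabs(d - centers[0]) <      // I = process(d)
                    std::fabs(d - centers[1]) ? 0 : 1;
            sum[i] += d;                             // R(I) = R(I) op d
            cnt[i] += 1;
        }
        for (int i = 0; i < 2; ++i)                  // global reduction / update
            if (cnt[i]) centers[i] = sum[i] / cnt[i];
    }
    std::printf("centers: %.2f %.2f\n", centers[0], centers[1]);
}
```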

Classic Data Mining Applications
- Tested here: Expectation-Maximization (EM) clustering; k-means clustering; k-nearest-neighbor search
- Previously on FREERIDE: Apriori and FPgrowth frequent itemset miners; RainForest decision tree construction
- The middleware is useful for a wide variety of algorithms

Scientific Applications
- VortexPro: finds vortices in volumetric fluid/gas flow datasets
- Defect detection: finds defects in molecular lattice datasets and categorizes them into classes

Experimental Setup (LAN)
- Single cluster of dual-processor 2.4 GHz Opteron 250 nodes connected through Mellanox InfiniBand (1 Gb)
- Goals: evaluate data-host and compute-node scalability; quantify the overhead of using SRB (vs. ADR and TCP/IP)
- The Active Data Repository implementation uses ADR for data retrieval and TCP/IP streams for data delivery to the compute nodes (for processing)

LAN Results
- Applications: defect detection (16.4 GB); k-means (6.4 GB)
- Observations: in configurations with equal numbers of compute nodes and data hosts, ADR outperforms SRB; in configurations with more compute nodes, SRB wins
- Good scalability with respect to both the number of data hosts and the number of compute nodes

Setup 2: Organizational Grid
- Data hosted on Opteron 250s (as before); processed on Opteron 254s
- Two clusters connected through two 10 Gb optical-fiber links
- Goals: investigate the effects of remote processing; the effect of data chunk size on remote retrieval overhead; the effect of retrieval concurrency on overhead

Effects of Chunk Size
- Applications: vortex detection (13.6 GB); kNN search (6.4 GB)
- Observations: chunk size has no effect on execution time in the LAN; enlarging the chunk size in the organizational grid decreases remote retrieval overheads
- The difference is accounted for by the ability to overlap retrieval with processing

Effects of Concurrent Retrieval
- Applications: defect detection (12.2 GB); k-means clustering (6.4 GB)
- Observations: retrieval and processing are more overlapped; all applications tested benefited from an increase in concurrency level
- Remote retrieval overhead can be reduced effectively

Outline
- Introduction
- FREERIDE-G over SRB
- Performance prediction for FREERIDE-G
- Future directions: standardization as a Grid Service; data integration and load balancing
- Related work
- Conclusion

Motivation for Performance Prediction
- Dataset replica selection is complicated by: cluster size (number of partitions); network speed between the repository and compute clusters
- Compute cluster selection depends on: cluster size (processing parallelism); network speed (data delivery); processing structure
- Modeling performance is essential for selecting data replicas and compute resources

Approach and Processing Structure
- Three phases of execution: retrieval at the data server; data delivery to the compute nodes; parallel processing at the compute nodes
- Special processing structure: generalized reduction

    T_exec = T_disk + T_network + T_compute

Information Needed
- Dataset replica (repository cluster) size
- Compute cluster size
- Transfer rate of the connecting network
- Application profile: 3-stage breakdown; reduction operation characteristics; reduction object size

Initial Phases
- Data retrieval and delivery phases
- Both phases scale with: dataset size; number of storage nodes
- Data delivery also scales with: network transfer rate
- The appropriate component of the execution-time breakdown is scaled using factor ratios (illustrated below)
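One plausible reading of these factor ratios (an assumption on our part; the exact model is given in the IPDPS'07 paper): with dataset size S, storage-node count N, and network transfer rate B, and primes marking the target configuration,

    T'_disk    = T_disk    * (S'/S) * (N/N')
    T'_network = T_network * (S'/S) * (N/N') * (B/B')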

Data Processing Phase
- Processing has two components:
  - Parallel: processing on each node, which scales with the number of cluster nodes and the dataset size
  - Sequential: reduction object communication; global reduction
- The common processing structure is leveraged for accurate processing-time prediction (see the sketch below)
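Putting the three phases together, here is a minimal C++ sketch of the predictor. The specific scaling ratios and all numbers are assumptions for illustration, not the published model.

```cpp
// Sketch of T_exec = T_disk + T_network + T_compute, scaled from a profiled
// run to a target configuration (illustrative ratios, made-up numbers).
#include <cstdio>

struct Profile {
    double tDisk, tNet, tParallel, tSequential; // seconds, from profiling
    double sizeGB; int storageNodes, computeNodes; double rateMBs;
};

double predict(const Profile &base, double sizeGB, int storage,
               int compute, double rateMBs) {
    double sz    = sizeGB / base.sizeGB;              // dataset-size ratio
    double tDisk = base.tDisk * sz * base.storageNodes / storage;
    double tNet  = base.tNet  * sz * (base.storageNodes / (double)storage)
                              * (base.rateMBs / rateMBs);
    double tComp = base.tParallel * sz * base.computeNodes / compute
                 + base.tSequential;                  // sequential part unchanged
    return tDisk + tNet + tComp;
}

int main() {
    Profile base = {40, 120, 300, 5, 6.4, 4, 8, 100}; // profiled run (made up)
    std::printf("predicted: %.1f s\n", predict(base, 12.8, 8, 16, 1000));
}
```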

Predicting for Different Resources
- Requires scaling factors for the 3 phases of computation, derived from a representative set of applications
- The application being modeled does not have to be part of the representative set
- An average is used for scaling: susceptible to outliers, so accuracy is affected

Summarized Prediction Results
- Modeling the reduction in detail pays off
- Changes modeled correctly (very accurate): dataset size; parallel configuration (storage cluster, compute cluster); network speed
- Changes in underlying resources: less accurate

Outline
- Introduction
- FREERIDE-G over SRB
- Performance prediction for FREERIDE-G
- Future directions: standardization as a Grid Service; data integration and load balancing
- Related work
- Conclusion

FREERIDE-G Grid Service
- The emergence of the Grid leads to computing standards and infrastructures
- The data-host component is already integrated through SRB for remote data access
- OGSA: the previous standard for Grid Services; WSRF: the standard that merges Web and Grid Service definitions
- The Globus Toolkit is the most fitting infrastructure for the Grid Service conversion of FREERIDE-G

Data Mining Service Provider
[Figure: the FREERIDE-G user supplies dataset & repository info and a FREERIDE-G configuration through a web server's WSDL API; a launcher/deployer with a resource manager starts data servers (Data Server 1..N) and compute nodes (CN 1..M), backed by parallelization and remote-retrieval support]

Conversion Challenges
- Interfacing the C++ middleware with GT 4.0 (Java): use SWIG to generate a Java Native Interface (JNI) wrapper (see the sketch below)
- Integrating the data mining service through a WSDL interface: encapsulate the service as a resource
- Middleware deployment using GT 4.0: use the Web Services Grid Resource Allocation and Management (WS-GRAM) component for deployment
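For illustration, the C++ side of such a JNI bridge might look like the following. The Java package, class, and function names are made up; SWIG would generate equivalent glue automatically.

```cpp
// Hypothetical JNI bridge exposing a C++ middleware entry point to Java.
#include <jni.h>
#include <string>

// Existing C++ middleware entry point (stub for illustration).
static int freeride_run(const std::string & /*config*/) { return 0; }

// Matches a Java declaration such as:
//   package edu.osu.datagrid;
//   class FreeRideG { static native int run(String config); }
extern "C" JNIEXPORT jint JNICALL
Java_edu_osu_datagrid_FreeRideG_run(JNIEnv *env, jclass, jstring jconfig) {
    const char *cfg = env->GetStringUTFChars(jconfig, nullptr);
    int rc = freeride_run(cfg ? cfg : "");
    env->ReleaseStringUTFChars(jconfig, cfg);
    return rc;
}
```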

Need for Load Balancing & Integration
- Data integration: the physical layout differs from the application's (logical) view; an automatically generated wrapper connects the two views
- Load balancing: horizontal vs. vertical views; uneven distribution of vertical views; different network transfer rates (for delivery of different views)

Run-time Load Balancing
- From the unbalanced (initial) distribution, work out an even partitioning scheme (across repository nodes)
- From the difference between the initial and even distributions, identify data senders and receivers
- Match up sender/receiver pairs
- Organize the multiple concurrent transfers required to alleviate the data delivery bottleneck (a sketch of this matching follows)
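A greedy C++ sketch of the partitioning and sender/receiver matching steps (illustrative only; the middleware's actual policy may differ). Each printed line corresponds to one transfer that can be scheduled concurrently with the others.

```cpp
// Sender/receiver matching for run-time load balancing (greedy sketch).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<long> chunks = {10, 2, 7, 1};   // chunks per repository node
    long total = std::accumulate(chunks.begin(), chunks.end(), 0L);
    long even  = total / (long)chunks.size();   // even partitioning target

    // Nodes above the target send their surplus; nodes below receive.
    for (size_t s = 0, r = 0; s < chunks.size() && r < chunks.size();) {
        if (chunks[s] <= even) { ++s; continue; }   // not a sender
        if (chunks[r] >= even) { ++r; continue; }   // not a receiver
        long n = std::min(chunks[s] - even, even - chunks[r]);
        std::printf("node %zu -> node %zu: %ld chunks\n", s, r, n);
        chunks[s] -= n;                             // one concurrent transfer
        chunks[r] += n;
    }
}
```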

Approach to Integration
- The available bandwidth is also used to determine the level of concurrency
- After all vertical views of chunks are redistributed to the destination repository nodes, a wrapper converts the data (a sketch follows)
- Automatically generated wrapper: input is the physical data layout (storage layout); output is the logical data layout (application view)
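A toy C++ sketch of such a wrapper, assembling separately stored attribute arrays (vertical views) into the row records an application expects. All field names and layouts are made up for illustration.

```cpp
// Layout-conversion wrapper: vertical views (one array per attribute, as if
// retrieved from different repository nodes) -> logical row records.
#include <cstdio>
#include <vector>

struct Record { double x, y; int label; };   // logical (application) view

int main() {
    // Physical (storage) layout: column arrays.
    std::vector<double> xs  = {1.0, 2.0, 3.0};
    std::vector<double> ys  = {4.0, 5.0, 6.0};
    std::vector<int>    lbl = {0, 1, 0};

    std::vector<Record> rows(xs.size());      // assembled logical view
    for (size_t i = 0; i < xs.size(); ++i)
        rows[i] = {xs[i], ys[i], lbl[i]};

    for (const Record &r : rows)
        std::printf("(%.1f, %.1f, %d)\n", r.x, r.y, r.label);
}
```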

Outline
- Introduction
- FREERIDE-G over SRB
- Performance prediction for FREERIDE-G
- Future directions: standardization as a Grid Service; data integration and load balancing
- Related work
- Conclusion

Grid Computing
- Standards: Open Grid Services Architecture; Web Services Resource Framework
- Implementations: Globus Toolkit, WSRF.NET, WSRF::Lite
- Other grid systems: Condor and Legion
- Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST

Data Grids
- Metadata cataloging: Artemis project
- Remote data retrieval: SRB, SEMPLAR, DataCutter, STORM
- Stream processing middleware: GATES, dQUOB
- Data integration: automatic wrapper generation

Remote Data in the Grid
- KnowledgeGrid tool-set: developing data mining workflows (composition)
- GridMiner toolkit: composition of distributed, parallel workflows
- DiscoveryNet layer: creation, deployment, and management of services
- DataMiningGrid framework: distributed knowledge discovery
- Other projects partially overlap in goals with ours

Resource Allocation Approaches
- Three broad categories of resource allocation:
  - Heuristic approaches to mapping
  - Prediction through modeling: statistical estimation/prediction; analytical modeling of the parallel application
  - Simulation-based performance prediction

Conclusions
- FREERIDE-G: middleware for scalable processing of remote data
- Supports high-end processing; eases the use of parallel configurations; hides the details of data movement and caching
- Predictable and scalable performance; compliance with remote repository standards
- Future: compliance with grid computing standards; support for data integration and load balancing