A Grid-Based Middleware for Scalable Processing of Remote Data Leonid Glimcher Advisor: Gagan Agrawal DataGrid Lab
Abundance of Data Data generated everywhere: Sensors (intrusion detection, satellites) Scientific simulations (fluid dynamics, molecular dynamics) Business transactions (purchases, market trends) Analysis needed to translate data into knowledge Growing data size creates problems for data mining DataGrid Lab
Emergence of Grid Computing Provides access to: Resources Data otherwise inaccessible Aimed at providing a stable foundation for distributed services Comes at a price: Standards are needed to integrate distributed services Open Grid Service Architecture Web Service Resource Framework DataGrid Lab
Remote Data Analysis Remote data analysis: the Grid is a good fit Details can be very tedious: Data retrieval, movement and caching Parallel data processing Resource allocation Application configuration Middleware can be useful to abstract away details DataGrid Lab
Our Approach Supporting development of scalable applications that process remote data using middleware – FREERIDE-G (Framework for Rapid Development of Datamining Engines in Grid) (Diagram: middleware user, repository cluster, compute cluster) DataGrid Lab
FREERIDE-G Middleware Realized: Active Data Repository based implementation Storage Resource Broker based implementation (compliance with remote repository standards) Performance modeling and prediction framework Future: Compliance with grid computing standards Support for data integration and load balancing DataGrid Lab
Publications Published: “Parallelizing EM Clustering Algorithm on a Cluster of SMPs”, Euro-Par’04. “Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware”, IPDPS’04. “Parallelizing a Defect Detection and Categorization Application”, IPDPS’05. “FREERIDE-G: Supporting Applications that Mine Remote Data Repositories”, ICPP’06. “A Performance Prediction Framework for Grid-Based Data Mining Applications”, IPDPS’07. In Submission: “A Middleware for Remote Mining of Data From SRB-based Servers”, submitted to SC’07. “Middleware for Data Mining Applications on Clusters and Grids”, invited to special issue of Journal of Parallel and Distributed Computing (JPDC) on “Parallel Techniques for Information Extraction”. DataGrid Lab
Outline Introduction FREERIDE-G over SRB Performance prediction for FREERIDE-G Future directions: Standardization as Grid Service, Data integration and load balancing Related work Conclusion DataGrid Lab
FREERIDE-G Architecture (Architecture diagram: compute nodes and data host. Each compute node performs I/O registration and connection establishment against the data host, whose SRB Master and MCAT set up an SRB Agent for the connection; then, while more chunks remain to process, I/O requests are dispatched, pending I/O is polled, and retrieved data chunks are analyzed on the compute node.) DataGrid Lab
SRB Data Host SRB Master: Connection establishment Authentication Forks agent to service I/O SRB Agent: Performs remote I/O Services multiple client API requests MCAT: Catalogs data associated with datasets, users, resources Services metadata queries (through SRB agent) DataGrid Lab
SDSC Storage Resource Broker A standard for remote data storage and access Data can be distributed across organizations and heterogeneous storage systems Used to host data for FREERIDE-G Access provided through client API DataGrid Lab
Compute Node More compute nodes than data hosts Each node: Registers I/O (from index) Connects to data host While (more chunks to process): Dispatch I/O request(s) Poll pending I/O Process retrieved chunks (a sketch of this loop follows below) DataGrid Lab
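A minimal sketch of this per-node loop with overlapped retrieval and processing. The dispatch_io, poll_io and process_chunk helpers are placeholder names stubbed out so the example compiles; they are not the actual FREERIDE-G or SRB client calls.

#include <cstdio>
#include <queue>
#include <vector>

// Hypothetical chunk descriptor and I/O handle (placeholders, not the real API).
struct Chunk    { int id; std::vector<char> data; };
struct IoHandle { int chunk_id; };

// Stubs standing in for remote I/O and for the application's processing code.
static IoHandle dispatch_io(int chunk_id)              { return IoHandle{chunk_id}; }
static bool     poll_io(const IoHandle& h, Chunk& out) { out.id = h.chunk_id; return true; }
static void     process_chunk(const Chunk& c)          { std::printf("processed chunk %d\n", c.id); }

int main() {
    const int num_chunks    = 8;  // chunks assigned to this compute node (from the index)
    const int max_in_flight = 2;  // concurrency level for remote retrieval

    std::queue<IoHandle> pending;
    int next_chunk = 0, done = 0;

    while (done < num_chunks) {
        // Dispatch I/O requests up to the allowed concurrency level.
        while (next_chunk < num_chunks && (int)pending.size() < max_in_flight)
            pending.push(dispatch_io(next_chunk++));

        // Poll the oldest pending request; once its chunk has arrived, process it,
        // overlapping further retrieval with computation.
        Chunk c;
        if (!pending.empty() && poll_io(pending.front(), c)) {
            pending.pop();
            process_chunk(c);
            ++done;
        }
    }
    return 0;
}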
FREERIDE-G Processing Structure KEY observation: most data mining algorithms follow a canonical loop Middleware API: Subset of data to be processed Reduction object Local and global reduction operations Iterator Derived from precursor system FREERIDE Canonical loop (a C++ sketch follows after this slide):

While () {
  forall (data instances d) {
    I = process(d)
    R(I) = R(I) op d
  }
  …
}

DataGrid Lab
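A self-contained C++ sketch of the generalized reduction structure above. The ReductionObject type and the process/local_reduce names are illustrative assumptions, not the actual FREERIDE-G API; the key point is that all updates go through an associative reduction on the reduction object.

#include <cstdio>
#include <map>
#include <vector>

// The reduction object maps a key produced by process() to an accumulated value.
using ReductionObject = std::map<int, double>;

// process(d): derive the reduction key for one data instance (here, a toy 2-way split).
static int process(double d) { return d < 0.5 ? 0 : 1; }

// op: the associative, commutative local reduction (here, a running sum).
static void local_reduce(ReductionObject& R, int key, double d) { R[key] += d; }

int main() {
    std::vector<double> chunk = {0.1, 0.7, 0.4, 0.9};  // one retrieved data chunk
    ReductionObject R;

    bool converged = false;
    while (!converged) {                    // While()                  -- outer iterations
        for (double d : chunk) {            // forall(data instances d)
            int i = process(d);             // I = process(d)
            local_reduce(R, i, d);          // R(I) = R(I) op d
        }
        // A global reduction would combine per-node reduction objects here.
        converged = true;                   // single pass for the sketch
    }

    for (const auto& kv : R)
        std::printf("key %d -> %.2f\n", kv.first, kv.second);
    return 0;
}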
Classic Data Mining Applications Tested here: Expectation Maximization clustering Kmeans clustering K Nearest Neighbor search Previously on FREERIDE: Apriori and FPGrowth frequent itemset miners RainForest decision tree construction Middleware is useful for a wide variety of algorithms DataGrid Lab
Scientific applications VortexPro: Finds vortices in volumetric fluid/gas flow datasets Defect detection: Finds defects in molecular lattice datasets Categorizes defects into classes DataGrid Lab
Experimental setup (LAN) Single cluster of dual-processor 2.4 GHz Opteron 250 nodes connected through Mellanox Infiniband (1 Gb) Goals: Evaluate data and compute node scalability Quantify overhead of using SRB (vs. ADR and TCP/IP) Active Data Repository implementation uses ADR for data retrieval and TCP/IP streams for data delivery to compute nodes (for processing) DataGrid Lab
LAN results Defect detection (16.4 GB) Kmeans (6.4 GB) Observations: In configurations with equal numbers of compute nodes and data hosts, ADR outperforms SRB In configurations with more compute nodes than data hosts, SRB wins Great scalability with respect to both data hosts and compute nodes DataGrid Lab
Setup 2: Organizational Grid Data hosted on Opteron 250’s (as before) Processed on Opteron 254’s Two clusters connected through two 10 GB optical fiber links Goals: Investigate effects of remote processing Effect of data chunk size on remote retrieval overhead Effect of retrieval concurrency on overhead DataGrid Lab
Effects of chunk size Applications: Vortex detection (13.6 GB) kNN search (6.4 GB) Observations: Chunk size has no effect on execution time in the LAN Enlarging the chunk size in the organizational grid decreases remote retrieval overheads The difference is accounted for by the ability to overlap retrieval with processing DataGrid Lab
Effects of concurrent retrieval Applications: Defect detection (12.2 GB) Kmeans clustering (6.4 GB) Observations: Retrieval and processing are more overlapped All applications tested benefited from an increase in the concurrency level Remote retrieval overhead can be reduced effectively DataGrid Lab
Outline Introduction FREERIDE-G over SRB Performance prediction for FREERIDE-G Future directions: Standardization as Grid Service, Data integration and load balancing Related work Conclusion DataGrid Lab
Motivation for Performance Prediction Dataset replica selection is complicated by: Cluster size (number of partitions) Network speed between repository and compute clusters Compute cluster selection depends on: Cluster size (processing parallelism) Network speed (data delivery) Processing structure Modeling performance is essential for selection of data replicas and compute resources DataGrid Lab
Approach and Processing Structure 3 phases of execution: Retrieval at the data server Data delivery to compute nodes Parallel processing at compute nodes Special processing structure: Generalized reduction T_exec = T_disk + T_network + T_compute DataGrid Lab
Information Needed Dataset replica (repository cluster) size Compute cluster size Connecting network transfer rate Application profile: 3-stage execution time breakdown Reduction operation characteristics Reduction object size (an illustrative container for these inputs follows below) DataGrid Lab
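As a rough illustration, these inputs could be collected in containers like the following; the field names and units are assumptions, not the actual FREERIDE-G profile format.

#include <cstddef>

// Measured application profile from a base run (times in seconds).
struct ApplicationProfile {
    double t_disk;                       // retrieval at the data server
    double t_network;                    // delivery to the compute nodes
    double t_compute_parallel;           // parallelizable processing
    double t_compute_sequential;         // reduction object communication + global reduction
    std::size_t reduction_object_bytes;  // reduction object size
};

// Target resources for which execution time is to be predicted.
struct ResourceConfiguration {
    std::size_t dataset_bytes;  // size of the chosen dataset replica
    int storage_nodes;          // repository cluster size
    int compute_nodes;          // compute cluster size
    double network_mbps;        // transfer rate between repository and compute clusters
};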
Initial Phases Data retrieval and delivery phases Both phases scale with: Dataset size Number of storage nodes Data delivery also scales with: Network transfer rate Appropriate component of execution time breakdown is scaled using factor ratios DataGrid Lab
Data Processing Phase Processing time has 2 components: Parallel processing on each node scales with: number of cluster nodes dataset size Sequential: reduction object communication global reduction The common processing structure is leveraged for accurate processing time prediction (a prediction sketch follows below) DataGrid Lab
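A minimal sketch of how a measured breakdown might be scaled to a target configuration, following the scaling rules on the last two slides: retrieval and delivery scale with dataset size and the number of storage nodes, delivery also with network rate, and the parallel part of processing with dataset size and the number of compute nodes, while the sequential part is left unscaled. The linear-scaling assumptions and all names are illustrative, not the exact prediction model.

#include <cstdio>

static double predict_exec_time(
    // base run breakdown (seconds): disk, network, parallel compute, sequential compute
    double t_disk, double t_net, double t_par, double t_seq,
    // base configuration: dataset size (GB), storage nodes, compute nodes, network rate (Mbps)
    double base_gb, int base_store, int base_comp, double base_mbps,
    // target configuration
    double new_gb, int new_store, int new_comp, double new_mbps) {

    double data_ratio = new_gb / base_gb;

    double disk = t_disk * data_ratio * base_store / (double)new_store;
    double net  = t_net  * data_ratio * (base_store / (double)new_store)
                                      * (base_mbps  / new_mbps);
    double comp = t_par  * data_ratio * base_comp  / (double)new_comp + t_seq;

    return disk + net + comp;   // T_exec = T_disk + T_network + T_compute
}

int main() {
    // Example: base run on 4 storage / 4 compute nodes, predicted for 8 / 8 nodes.
    double t = predict_exec_time(20.0, 15.0, 60.0, 5.0,
                                 6.4, 4, 4, 1000.0,
                                 6.4, 8, 8, 1000.0);
    std::printf("predicted execution time: %.1f s\n", t);
    return 0;
}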
Predicting for Different Resources Requires scaling factors: for the 3 phases of computation derived from a representative set of applications the application being modeled doesn’t have to be part of the representative set The average is used for scaling: susceptible to outliers, so accuracy is affected DataGrid Lab
Summarized Prediction Results Modeling the reduction in detail pays off Correctly models changes in (very accurate): Dataset size Parallel configuration (storage cluster, compute cluster) Network speed Underlying resources (less accurate) DataGrid Lab
Outline Introduction FREERIDE-G over SRB Performance prediction for FREERIDE-G Future directions: Standardization as Grid Service, Data integration and load balancing Related work Conclusion DataGrid Lab
FREERIDE-G Grid Service Emergence of the Grid leads to computing standards and infrastructures Data host component already integrated through SRB for remote data access OGSA – previous standard for Grid Services WSRF – standard that merges Web and Grid Service definitions Globus Toolkit is the most fitting infrastructure for Grid Service conversion of FREERIDE-G DataGrid Lab
Data Mining Service (Architecture diagram: the service provider hosts a web server exposing WSDL API functions; the FREERIDE-G user supplies dataset & repository info and FREERIDE-G configuration; a launcher/deployer and resource manager deploy data servers 1..N and compute nodes 1..M, with parallelization and remote retrieval support.) DataGrid Lab
Conversion Challenges Interfacing C++ middleware with GT 4.0 (Java): SWIG to generate Java Native Interface (JNI) wrapper (see the JNI sketch below) Integrating data mining service through WSDL interface: Encapsulate service as a resource Middleware deployment using GT 4.0: Use Web Services Grid Resource Allocation and Management (WS-GRAM) for deployment DataGrid Lab
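As a rough illustration of the C++-to-Java bridging involved, a hand-written JNI entry point might look like the following. The FreerideBridge class, the runMining method and the freeride_run call are hypothetical; in practice the wrapper code is generated by SWIG rather than written by hand.

#include <jni.h>
#include <string>

// Placeholder for the C++ middleware entry point.
static int freeride_run(const std::string& config) { (void)config; return 0; }

// Matching Java-side declaration (hypothetical):
//   public class FreerideBridge { public native int runMining(String config); }
// loaded with System.loadLibrary("freeride_jni").
extern "C" JNIEXPORT jint JNICALL
Java_FreerideBridge_runMining(JNIEnv* env, jobject /*self*/, jstring jconfig) {
    const char* cfg = env->GetStringUTFChars(jconfig, nullptr);
    if (cfg == nullptr) return -1;              // OutOfMemoryError already pending
    int rc = freeride_run(cfg);                 // hand off to the C++ middleware
    env->ReleaseStringUTFChars(jconfig, cfg);   // release the JVM-owned string copy
    return rc;
}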
Need for Load Balancing & Integration Data integration: Physical layout different from application (logical) view Automatically generated wrapper to make connection between 2 views Load balancing: Horizontal vs. vertical views Uneven distribution of vertical views Different network transfer rates (for delivery of different views) DataGrid Lab
Run-time Load Balancing Based on the unbalanced (initial) distribution, figure out an even partitioning scheme (across repository nodes) Based on the difference between the initial and even distributions, identify data senders/receivers Match up sender/receiver pairs Organize the multiple concurrent transfers required to alleviate the data delivery bottleneck (a sketch of this plan follows below) DataGrid Lab
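A minimal sketch of such a balancing plan, using per-node chunk counts as the measure of load. The Transfer structure and plan_transfers function are illustrative, not the actual implementation; the resulting sender/receiver pairs are what would then be executed as concurrent transfers.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Transfer { int from, to, chunks; };

// Given the initial (unbalanced) chunk counts per repository node, compute an even
// target partitioning and pair surplus nodes (senders) with deficit nodes (receivers).
static std::vector<Transfer> plan_transfers(std::vector<int> chunks) {
    int n = (int)chunks.size(), total = 0;
    for (int c : chunks) total += c;
    std::vector<int> target(n, total / n);
    for (int i = 0; i < total % n; ++i) ++target[i];   // spread the remainder

    std::vector<Transfer> plan;
    int s = 0, r = 0;
    while (true) {
        while (s < n && chunks[s] <= target[s]) ++s;   // next sender (has surplus)
        while (r < n && chunks[r] >= target[r]) ++r;   // next receiver (has deficit)
        if (s >= n || r >= n) break;
        int amount = std::min(chunks[s] - target[s], target[r] - chunks[r]);
        plan.push_back({s, r, amount});                // one transfer in the plan
        chunks[s] -= amount;
        chunks[r] += amount;
    }
    return plan;
}

int main() {
    // Example: 4 repository nodes with an uneven initial chunk distribution.
    for (const Transfer& t : plan_transfers({10, 2, 6, 2}))
        std::printf("node %d sends %d chunk(s) to node %d\n", t.from, t.chunks, t.to);
    return 0;
}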
Approach to Integration Available bandwidth is also used to determine the level of concurrency After all vertical views of chunks are re-distributed to destination repository nodes, use the wrapper to convert the data Automatically generated wrapper: Input – physical data layout (storage layout) Output – logical data layout (application view) (a toy wrapper sketch follows below) DataGrid Lab
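A toy sketch of what such a wrapper does, assuming a dataset stored physically as separate vertical views (one array per attribute) that the application consumes logically as records; the types and names are illustrative, not the automatically generated wrapper.

#include <cstddef>
#include <cstdio>
#include <vector>

// Logical application view: one record per data instance.
struct LogicalRecord { double x, y, z; };

// Convert the physical storage layout (per-attribute arrays) into logical records.
static std::vector<LogicalRecord> physical_to_logical(const std::vector<double>& xs,
                                                      const std::vector<double>& ys,
                                                      const std::vector<double>& zs) {
    std::vector<LogicalRecord> records;
    records.reserve(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i)
        records.push_back({xs[i], ys[i], zs[i]});   // stitch the vertical views together
    return records;
}

int main() {
    // Three vertical views of one chunk, as they might sit on a repository node.
    std::vector<double> xs = {1.0, 2.0}, ys = {0.5, 0.6}, zs = {9.0, 8.0};
    for (const LogicalRecord& r : physical_to_logical(xs, ys, zs))
        std::printf("record: %.1f %.1f %.1f\n", r.x, r.y, r.z);
    return 0;
}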
Outline Introduction FREERIDE-G over SRB Performance prediction for FREERIDE-G Future directions: Standardization as Grid Service, Data integration and load balancing Related work Conclusion DataGrid Lab
Grid Computing Standards: Open Grid Service Architecture, Web Services Resource Framework Implementations: Globus Toolkit, WSRF.NET, WSRF::Lite Other Grid systems: Condor, Legion Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST DataGrid Lab
Data Grids Metadata cataloging: Artemis project Remote data retrieval: SRB, SEMPLAR, DataCutter, STORM Stream processing middleware: GATES, dQUOB Data integration: Automatic wrapper generation DataGrid Lab
Remote Data in Grid KnowledgeGrid tool-set: Developing data mining workflows (composition) GridMiner toolkit: Composition of distributed, parallel workflows DiscoveryNet layer: Creation, deployment, management of services DataMiningGrid framework: Distributed knowledge discovery Other projects partially overlap in goals with ours DataGrid Lab
Resource Allocation Approaches 3 broad categories for resource allocation: Heuristic approach to mapping Prediction through modeling: Statistical estimation/prediction Analytical modeling of parallel applications Simulation-based performance prediction DataGrid Lab
Conclusions FREERIDE-G – middleware for scalable processing of remote data Supports high-end processing Eases use of parallel configurations Hides details of data movement and caching Predictable and scalable performance Compliance with remote repository standards Future: Compliance with grid computing standards Support for data integration and load balancing DataGrid Lab