A Middleware for Developing and Deploying Scalable Remote Mining Services
Leonid Glimcher, Gagan Agrawal
Computer Science and Engineering, DataGrid Lab

Abundance of Data
- Data is generated everywhere:
  - Sensors (intrusion detection, satellites)
  - Scientific simulations (fluid dynamics, molecular dynamics)
  - Business transactions (purchases, market trends)
- Analysis is needed to translate data into knowledge
- Growing data size creates problems for data mining

Grid and Cloud Computing
- Provides access to otherwise inaccessible:
  - Resources
  - Data
- Goal of Grid computing: to provide a stable foundation for distributed services
- Price for such a foundation: standards are needed for distributed service integration (OGSA and WSRF)
- Cloud computing: access resources (data storage and processing) as services available for consumption

Remote Data Analysis
- Remote data analysis: the Grid is a good fit
- Details can be very tedious:
  - Data retrieval, movement and caching
  - Parallel data processing
  - Resource allocation
  - Application configuration
- Middleware can be useful to abstract away these details

Our Approach
- Supporting development of scalable applications that process remote data using middleware: FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)
- [Figure: middleware user, repository cluster, compute cluster]

Outline
- Motivation
- Introduction
- Middleware Overview
- Challenges
- Experimental Evaluation
- Related work
- Conclusion

FREERIDE-G Grid Service
- Grid computing standards and infrastructures: OGSA vs. WSRF (merged Grid and Web Services)
- Data hosts compliant with repository standards (SRB)
- MPICH-G2: pre-WS mechanism to support MPI
- Globus Toolkit and MPICH-G2: most fitting infrastructure for Grid Service conversion of FREERIDE-G

SDSC Storage Resource Broker
- Standard for remote data:
  - Storage
  - Access
- Data can be distributed across organizations and heterogeneous storage systems
- Used to host data for FREERIDE-G
- Access provided through client API

FREERIDE-G Processing Structure
- Key observation: most data mining algorithms follow a canonical loop
- Middleware API:
  - Subset of data to be processed
  - Reduction object
  - Local and global reduction operations
  - Iterator
- Derived from precursor system FREERIDE
- Canonical loop:
    While( ) {
      forall( data instances d) {
        I = process(d)
        R(I) = R(I) op d
      }
      …….
    }
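
The canonical loop above can be sketched in a few lines of Python. This is only an illustration of the reduction pattern, not the FREERIDE-G API: the function names `local_reduction` and `global_reduction`, the dictionary used as the reduction object, and the toy data are all assumptions made for this example. It also assumes `op` is associative and commutative, so that partial results from different nodes can be merged in any order.

```python
from operator import add

def local_reduction(chunk, process, op, identity):
    """One pass of the canonical loop over a single node's share of the data."""
    R = {}  # reduction object (hypothetical representation: a plain dict)
    for d in chunk:
        i = process(d)                     # I = process(d)
        R[i] = op(R.get(i, identity), d)   # R(I) = R(I) op d
    return R

def global_reduction(partials, op, identity):
    """Merge the per-node reduction objects into one result."""
    merged = {}
    for R in partials:
        for i, value in R.items():
            merged[i] = op(merged.get(i, identity), value)
    return merged

# Toy data: two "nodes", each holding one chunk of integers.
chunks = [[1, 2, 3, 4], [5, 6, 7]]
parity = lambda d: d % 2  # maps each element to a slot of the reduction object

partials = [local_reduction(c, parity, add, 0) for c in chunks]
result = global_reduction(partials, add, 0)
print(result)
```

Because each node only touches its own reduction object during the local phase, the local reductions can run in parallel without locking; only the comparatively small reduction objects need to be communicated for the global phase.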

FREERIDE-G Evolution
- FREERIDE: data stored locally
- FREERIDE-G:
  - ADR responsible for remote data retrieval
  - SRB responsible for remote data retrieval
- FREERIDE-G grid service: grid service featuring
  - Load balancing
  - Data integration

Evolution
- [Figure: FREERIDE, FREERIDE-G-ADR, FREERIDE-G-SRB, FREERIDE-G-GT; layer labels: Application, Data, ADR, SRB, Globus]

FREERIDE-G System Architecture

Implementation Challenges I
- Interaction with Code Repository
  - Simplified Wrapper and Interface Generator (SWIG)
  - XML descriptors of API functions
  - Each API function wrapped in its own class
- [Figure labels: C++, Java, C++, SWIG, XML]

Implementation Challenges
- Integration with MPICH-G2 and Globus Toolkit
  - Supports MPI
  - Deployed through pre-WS Globus components
  - Hides potential heterogeneity in:
    - service startup
    - management

FREERIDE-G Applications
- VortexPro: finds vortices in volumetric fluid/gas flow datasets
- Kmeans Clustering: clusters points based on Euclidean distance in an attribute space
- EM Clustering: another distance-based clustering algorithm
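
To make the connection between these applications and the reduction-based processing structure concrete, here is a minimal sketch of one Kmeans iteration written as a local reduction per data chunk followed by a global reduction. It is not the FREERIDE-G implementation: the function names, the use of per-cluster coordinate sums and counts as the reduction object, and the toy points are assumptions chosen for illustration.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def local_reduce(points, centroids):
    """Accumulate per-cluster coordinate sums and counts for one chunk."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        k = nearest(p, centroids)
        counts[k] += 1
        for j, x in enumerate(p):
            sums[k][j] += x
    return sums, counts

def global_reduce(partials):
    """Add up the sums and counts produced by all chunks."""
    sums_list, counts_list = zip(*partials)
    k, dim = len(counts_list[0]), len(sums_list[0][0])
    sums = [[sum(s[c][j] for s in sums_list) for j in range(dim)] for c in range(k)]
    counts = [sum(c[i] for c in counts_list) for i in range(k)]
    return sums, counts

def update(centroids, sums, counts):
    """New centroid = mean of assigned points; empty clusters keep their centroid."""
    return [
        [s / n for s in sums[k]] if n else list(centroids[k])
        for k, n in enumerate(counts)
    ]

# Toy data: two chunks, as if held by two different compute nodes.
chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
centroids = [(1.0, 1.0), (9.0, 1.0)]

sums, counts = global_reduce([local_reduce(c, centroids) for c in chunks])
new_centroids = update(centroids, sums, counts)
print(new_centroids)
```

In a full run this iteration would be repeated until the centroids stop moving; only the sums and counts, not the points themselves, need to be exchanged between nodes.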

Experimental setup
- Organizational Grid:
  - Data hosted on Opteron 250 cluster
  - Processed on Opteron 254 cluster
  - Connected using two 10 GB optical fibers
- Goals:
  - Demonstrate parallel scalability of applications
  - Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
- [Figure: repository cluster, compute cluster]

Scalability Evaluation
- Scalable implementations with respect to numbers of:
  - Compute nodes
  - Data repository nodes
- [Figures: Vortex Detection with 14.8 GB dataset; Kmeans Clustering with 6.4 GB dataset (bottom)]

Scalability Evaluation II
- [Figures: EM Clustering (25.6 GB); Kmeans Clustering (25.6 GB)]
- Sub-linear speedup for scaled-up compute nodes is explained by sequential scheduling of concurrent data reads

Deployment Overhead Evaluation
- There is a small overhead associated with using Globus and MPICH-G2 for middleware deployment
- Kmeans Clustering with 6.4 GB dataset: %.

Deployment Overhead Evaluation II
- Vortex Detection with 14.8 GB dataset: 17-20%
- Overhead:
  - scales with the total execution time
  - doesn't affect overall data processing service scalability
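
One way to read the last point: if the deployment overhead is a roughly constant fraction of execution time, it multiplies every run time by the same factor, so the speedup curve is unchanged. The sketch below checks this arithmetic. The timings are hypothetical numbers invented for the example (they are not measurements from these experiments), and the constant-fraction model is an assumption; only the 20% figure is taken from the upper end of the range reported above.

```python
import math

def speedup(times):
    """Speedup relative to the run with the fewest compute nodes."""
    base = times[min(times)]
    return {p: base / t for p, t in times.items()}

# Hypothetical execution times in seconds, keyed by number of compute nodes.
baseline = {1: 800.0, 2: 420.0, 4: 230.0, 8: 140.0}

# Assumed model: overhead is a constant fraction of the execution time.
overhead_fraction = 0.20
with_overhead = {p: t * (1 + overhead_fraction) for p, t in baseline.items()}

s_base = speedup(baseline)
s_over = speedup(with_overhead)
unchanged = all(math.isclose(s_base[p], s_over[p]) for p in baseline)
print(unchanged)
```

If the overhead were instead a fixed cost per run, it would weigh more heavily on the shorter runs at high node counts and the speedup would degrade, which is why the proportionality matters.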

Grid Computing
- [Figure: THE GRID]
- Standards: Open Grid Services Architecture, Web Services Resource Framework
- Implementations: Globus Toolkit, WSRF.NET, WSRF::Lite
- Other Grid systems: Condor and Legion
- Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST

Data Grids
- Metadata cataloging: Artemis project
- Remote data retrieval: SRB, SEMPLAR, DataCutter, STORM
- Stream processing middleware: GATES, dQUOB
- Data integration: automatic wrapper generation

Remote Data in Grid
- KnowledgeGrid tool-set: developing data mining workflows (composition)
- GridMiner toolkit: composition of distributed, parallel workflows
- DiscoveryNet layer: creation, deployment, management of services
- DataMiningGrid framework: distributed knowledge discovery
- Other projects partially overlap in goals with ours

Conclusions
- FREERIDE-G: middleware for scalable processing of remote data, now as a grid service
  - Complies with grid and repository standards
  - Supports high-end processing
  - Eases use of parallel configurations
  - Hides details of data movement and caching
  - Scalable performance with respect to data and compute nodes
  - Low deployment overheads as compared to the non-grid version

SRB Data Host
- SRB Master:
  - Connection establishment
  - Authentication
  - Forks agent to service I/O
- SRB Agent:
  - Performs remote I/O
  - Services multiple client API requests
- MCAT:
  - Catalogs data associated with datasets, users, resources
  - Services metadata queries (through SRB agent)

Compute Node
- More compute nodes than data hosts
- Each node:
  1. Registers I/O (from index)
  2. Connects to data host
- While (chunks to process):
  1. Dispatch I/O request(s)
  2. Poll pending I/O
  3. Process retrieved chunks
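
The compute node steps above can be sketched as a small dispatch/poll/process loop. This is an illustration only and does not use the SRB client API: `fetch_chunk` is a hypothetical stand-in for a remote read from a data host, a thread pool stands in for the connection to that host, and the `max_outstanding` limit on in-flight requests is an assumption of this example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def fetch_chunk(chunk_id):
    """Hypothetical stand-in for a remote read: returns 4 integers per chunk."""
    return list(range(chunk_id * 4, chunk_id * 4 + 4))

def process_chunk(chunk):
    """Toy analysis step: sum the values in the retrieved chunk."""
    return sum(chunk)

def run_compute_node(chunk_ids, max_outstanding=2):
    todo = list(chunk_ids)   # registered I/O: the chunks this node must read
    pending = set()          # I/O requests dispatched but not yet completed
    total = 0
    # The thread pool plays the role of the connection to the data host.
    with ThreadPoolExecutor(max_workers=max_outstanding) as pool:
        while todo or pending:
            # 1. Dispatch I/O request(s), up to the outstanding-request limit.
            while todo and len(pending) < max_outstanding:
                pending.add(pool.submit(fetch_chunk, todo.pop(0)))
            # 2. Poll pending I/O until at least one request has completed.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            # 3. Process retrieved chunks while other reads stay in flight.
            for future in done:
                total += process_chunk(future.result())
    return total

node_total = run_compute_node(range(3))
print(node_total)
```

Keeping more than one request outstanding lets data retrieval overlap with processing, so the node is not idle while it waits for the next chunk.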

FREERIDE-G in Action
- Data Host: SRB Agent, SRB Master, MCAT
- Compute Node:
  - I/O registration
  - Connection establishment
  - While (more chunks to process):
    - I/O request dispatched
    - Pending I/O polled
    - Retrieved data chunks analyzed

Conversion Challenges
1. Interfacing C++ middleware with GT 4.0 (Java)
   - SWIG to generate Java Native Interface (JNI) wrapper
2. Integrating data mining service through WSDL interface
   - Encapsulate service as a resource
3. Middleware deployment using GT 4.0
   - Use Web Services Grid Resource Allocation and Management (WS-GRAM) for deployment