Download presentation
Presentation is loading. Please wait.
Published byClaire Owen Modified over 9 years ago
1
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing and Deploying Scalable Remote Mining Services Leonid Glimcher Gagan Agrawal
2
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 2DataGrid Lab Abundance of Data Data generated everywhere: Sensors (intrusion detection, satellites) Scientific simulations (fluid dynamics, molecular dynamics) Business transactions (purchases, market trends) Analysis needed to translate data into knowledge Growing data size creates problems for datamining
3
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 3DataGrid Lab Grid and Cloud Computing Provides access to otherwise inaccessible: Resources Data Goal of Grid Computing: to provide a stable foundation for distributed services Price for such foundation: Standards needed for distributed service integration (OGSA & WSRF) Cloud computing: access resources (data storage and processing) as services available for consumption
4
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 4DataGrid Lab Remote Data Analysis Remote data analysis –Grid is a good fit Details can be very tedious: Data retrieval, movement and caching Parallel data processing Resource allocation Application configuration Middleware can be useful to abstract away details
5
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 5DataGrid Lab Our Approach Supporting development of scalable applications that process remote data using middleware – FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid) Repository cluster Compute cluster Middleware user
6
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 6DataGrid Lab Outline Motivation Introduction Middleware Overview Challenges Experimental Evaluation Related work Conclusion
7
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 7DataGrid Lab FREERIDE-G Grid Service Grid computing standards and infrastructures: OGSA vs. WSRF (merged Grid and Web Services) Data hosts compliant with repository standards (SRB) MPICH-G2 -- pre-WS mechanism to support MPI Globus Toolkit and MPICH-G2 -- most fitting infrastructure for Grid Service conversion of FREERIDE-G
8
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 8DataGrid Lab SDSC Storage Resource Broker Standard for remote data: Storage Access Data can be distributed across organizations and heterogeneous storage systems Used to host data for FREERIDE-G Access provided through client API
9
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 9DataGrid Lab FREERIDE-G Processing Structure KEY observation: most data mining algorithms follow canonical loop Middleware API: Subset of data to be processed Reduction object Local and global reduction operations Iterator Derived from precursor system FREERIDE While( ) { forall( data instances d) { I = process(d) R(I) = R(I) op d } ……. }
10
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 10DataGrid Lab FREERIDE-G Evolution FREERIDE data stored locally FREERIDE-G ADR responsible for remote data retrieval SRB responsible for remote data retrieval FREERIDE-G grid service Grid service featuring Load balancing Data integration
11
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 11DataGrid Lab Evolution FREERIDE FREERIDE-G-ADR FREERIDE-G-SRBFREERIDE-G-GT Application Data ADR SRB Globus
12
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 12DataGrid Lab FREERIDE-G System Architecture
13
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 13DataGrid Lab Implementation Challenges I Interaction with Code Repository –Simplified Wrapper and Interface Generator –XML descriptors of API functions –Each API function wrapped in own class C++ Java C++ SWIG XML
14
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 14DataGrid Lab Implementation Challenges Integration with MPICH-G2 and Globus Toolkit Supports MPI Deployed through pre- WS Globus components Hides potential heterogeneity in: – service startup –management
15
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 15DataGrid Lab FREERIDE-G Applications VortexPro: Finds vortices in volumetric fluid/gas flow datasets Kmeans Clustering: Clusters points based on Euclidean distance in an attribute space EM Clustering: Another distance-based clustering algorithm
16
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 16DataGrid Lab Experimental setup Organizational Grid: Data hosted on Opteron 250 cluster Processed on Opteron 254 cluster Connected using 2 10 GB optical fibers Goals: Demonstrate parallel scalability of applications Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms Repository cluster Compute cluster
17
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 17DataGrid Lab Scalability Evaluation Scalable implementations with respect to numbers of: Compute nodes, Data repository nodes. Vortex Detection with 14.8 GB dataset. Kmeans Clustering with 6.4 GB dataset (bottom)
18
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 18DataGrid Lab Scalability Evaluation II EM Clustering (25.6 GB)Kmeans Clustering (25.6 GB) Sub-linear speedup for scaled up compute nodes explained by sequential scheduling of concurrent data reads
19
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 19DataGrid Lab Deployment Overhead Evaluation Clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment. Kmeans Clustering with 6.4 GB dataset: 18- 20%.
20
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 20DataGrid Lab Deployment Overhead Evaluation II Vortex Detection with 14.8 GB dataset: 17-20%. Overhead: scales with the total execution time doesn’t effect overall data processing service scalability
21
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 21DataGrid Lab Outline Motivation Introduction Middleware Overview Challenges Experimental Evaluation Related work Conclusion
22
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 22DataGrid Lab Grid Computing THE GRID Standards: Open Grid Service Architecture Web Services Resource Framework Implementations: Globus toolkit, WSRF.NET, WSRF::Lite Other Grid Systems: Condor and Legion, Workflow composition systems: –GridLab, Nimrod/G (GridBus), Cactus, GRIST
23
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 23DataGrid Lab Data Grids Metadata cataloging: Artemis project Remote data retrieval: SRB, SEMPLAR DataCutter, STORM Stream processing midleware: GATES dQUOB Data integration: Automatic wrapper generation
24
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 24DataGrid Lab Remote Data in Grid KnowledgeGrid tool-set: Developing datamining workflows (composition) GridMiner toolkit: Composition of distributed, parallel workflows DiscoveryNet layer: Creation, deployment, management of services DataMiningGrid framework: Distributed knowledge discovery Other projects partially overlap in goals with ours
25
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 25DataGrid Lab Conclusions FREERIDE-G – middleware for scalable processing of remote data -- now as a grid service Compliance with grid and repository standards Support for high-end processing Ease use of parallel configurations Hide details of data movement and caching Scalable performance with respect to data and compute nodes Low deployment overheads as compared to non- grid version
26
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 26DataGrid Lab SRB Data Host SRB Master: Connection establishment Authentication Forks agent to service I/O SRB Agent: Performs remote I/O Services multiple client API requests MCAT: Catalogs data associated with datasets, users, resources Services metadata queries (through SRB agent)
27
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 27DataGrid Lab Compute Node More compute nodes than data hosts Each node: 1.Registers IO (from index) 2.Connects to data host While (chunks to process) 1.Dispatch IO request(s) 2.Poll pending IO 3.Process retrieved chunks
28
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 28DataGrid Lab FREERIDE-G in Action SRB Agent SRB Master MCAT Data Host I/O Registration Connection establishment While (more chunks to process) I/O request dispatched Pending I/O polled Retrieved data chunks analyzed Compute Node
29
Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 29DataGrid Lab Conversion Challenges 1.Interfacing C++ middleware with GT 4.0 (Java) Swig to generate Java Native Interface (JNI) wrapper 2.Integrating data mining service through WSDL interface Encapsulate service as a resource 3.Middleware deployment using GT 4.0 Use Web Service – Grid Resource Allocation Module for deployment
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.