Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing.

Similar presentations


Presentation on theme: "Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing."— Presentation transcript:

1 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing and Deploying Scalable Remote Mining Services Leonid Glimcher Gagan Agrawal

2 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 2DataGrid Lab Abundance of Data Data generated everywhere: Sensors (intrusion detection, satellites) Scientific simulations (fluid dynamics, molecular dynamics) Business transactions (purchases, market trends) Analysis needed to translate data into knowledge Growing data size creates problems for datamining

3 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 3DataGrid Lab Grid and Cloud Computing Provides access to otherwise inaccessible: Resources Data Goal of Grid Computing: to provide a stable foundation for distributed services Price for such foundation: Standards needed for distributed service integration (OGSA & WSRF) Cloud computing: access resources (data storage and processing) as services available for consumption

4 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 4DataGrid Lab Remote Data Analysis Remote data analysis –Grid is a good fit Details can be very tedious: Data retrieval, movement and caching Parallel data processing Resource allocation Application configuration Middleware can be useful to abstract away details

5 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 5DataGrid Lab Our Approach Supporting development of scalable applications that process remote data using middleware – FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid) Repository cluster Compute cluster Middleware user

6 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 6DataGrid Lab Outline Motivation Introduction Middleware Overview Challenges Experimental Evaluation Related work Conclusion

7 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 7DataGrid Lab FREERIDE-G Grid Service Grid computing standards and infrastructures: OGSA vs. WSRF (merged Grid and Web Services) Data hosts compliant with repository standards (SRB) MPICH-G2 -- pre-WS mechanism to support MPI Globus Toolkit and MPICH-G2 -- most fitting infrastructure for Grid Service conversion of FREERIDE-G

8 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 8DataGrid Lab SDSC Storage Resource Broker Standard for remote data: Storage Access Data can be distributed across organizations and heterogeneous storage systems Used to host data for FREERIDE-G Access provided through client API

9 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 9DataGrid Lab FREERIDE-G Processing Structure KEY observation: most data mining algorithms follow canonical loop Middleware API: Subset of data to be processed Reduction object Local and global reduction operations Iterator Derived from precursor system FREERIDE While( ) { forall( data instances d) { I = process(d) R(I) = R(I) op d } ……. }

10 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 10DataGrid Lab FREERIDE-G Evolution FREERIDE data stored locally FREERIDE-G ADR responsible for remote data retrieval SRB responsible for remote data retrieval FREERIDE-G grid service Grid service featuring Load balancing Data integration

11 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 11DataGrid Lab Evolution FREERIDE FREERIDE-G-ADR FREERIDE-G-SRBFREERIDE-G-GT Application Data ADR SRB Globus

12 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 12DataGrid Lab FREERIDE-G System Architecture

13 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 13DataGrid Lab Implementation Challenges I Interaction with Code Repository –Simplified Wrapper and Interface Generator –XML descriptors of API functions –Each API function wrapped in own class C++ Java C++ SWIG XML

14 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 14DataGrid Lab Implementation Challenges Integration with MPICH-G2 and Globus Toolkit Supports MPI Deployed through pre- WS Globus components Hides potential heterogeneity in: – service startup –management

15 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 15DataGrid Lab FREERIDE-G Applications VortexPro: Finds vortices in volumetric fluid/gas flow datasets Kmeans Clustering: Clusters points based on Euclidean distance in an attribute space EM Clustering: Another distance-based clustering algorithm

16 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 16DataGrid Lab Experimental setup Organizational Grid: Data hosted on Opteron 250 cluster Processed on Opteron 254 cluster Connected using 2 10 GB optical fibers Goals: Demonstrate parallel scalability of applications Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms Repository cluster Compute cluster

17 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 17DataGrid Lab Scalability Evaluation Scalable implementations with respect to numbers of: Compute nodes, Data repository nodes. Vortex Detection with 14.8 GB dataset. Kmeans Clustering with 6.4 GB dataset (bottom)

18 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 18DataGrid Lab Scalability Evaluation II EM Clustering (25.6 GB)Kmeans Clustering (25.6 GB) Sub-linear speedup for scaled up compute nodes explained by sequential scheduling of concurrent data reads

19 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 19DataGrid Lab Deployment Overhead Evaluation Clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment. Kmeans Clustering with 6.4 GB dataset: 18- 20%.

20 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 20DataGrid Lab Deployment Overhead Evaluation II Vortex Detection with 14.8 GB dataset: 17-20%. Overhead: scales with the total execution time doesn’t effect overall data processing service scalability

21 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 21DataGrid Lab Outline Motivation Introduction Middleware Overview Challenges Experimental Evaluation Related work Conclusion

22 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 22DataGrid Lab Grid Computing THE GRID Standards: Open Grid Service Architecture Web Services Resource Framework Implementations: Globus toolkit, WSRF.NET, WSRF::Lite Other Grid Systems: Condor and Legion, Workflow composition systems: –GridLab, Nimrod/G (GridBus), Cactus, GRIST

23 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 23DataGrid Lab Data Grids Metadata cataloging: Artemis project Remote data retrieval: SRB, SEMPLAR DataCutter, STORM Stream processing midleware: GATES dQUOB Data integration: Automatic wrapper generation

24 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 24DataGrid Lab Remote Data in Grid KnowledgeGrid tool-set: Developing datamining workflows (composition) GridMiner toolkit: Composition of distributed, parallel workflows DiscoveryNet layer: Creation, deployment, management of services DataMiningGrid framework: Distributed knowledge discovery Other projects partially overlap in goals with ours

25 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 25DataGrid Lab Conclusions FREERIDE-G – middleware for scalable processing of remote data -- now as a grid service Compliance with grid and repository standards Support for high-end processing Ease use of parallel configurations Hide details of data movement and caching Scalable performance with respect to data and compute nodes Low deployment overheads as compared to non- grid version

26 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 26DataGrid Lab SRB Data Host SRB Master: Connection establishment Authentication Forks agent to service I/O SRB Agent: Performs remote I/O Services multiple client API requests MCAT: Catalogs data associated with datasets, users, resources Services metadata queries (through SRB agent)

27 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 27DataGrid Lab Compute Node More compute nodes than data hosts Each node: 1.Registers IO (from index) 2.Connects to data host While (chunks to process) 1.Dispatch IO request(s) 2.Poll pending IO 3.Process retrieved chunks

28 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 28DataGrid Lab FREERIDE-G in Action SRB Agent SRB Master MCAT Data Host I/O Registration Connection establishment While (more chunks to process) I/O request dispatched Pending I/O polled Retrieved data chunks analyzed Compute Node

29 Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 29DataGrid Lab Conversion Challenges 1.Interfacing C++ middleware with GT 4.0 (Java) Swig to generate Java Native Interface (JNI) wrapper 2.Integrating data mining service through WSDL interface Encapsulate service as a resource 3.Middleware deployment using GT 4.0 Use Web Service – Grid Resource Allocation Module for deployment


Download ppt "Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing."

Similar presentations


Ads by Google