A Middleware for Developing and Deploying Scalable Remote Mining Services (Computer Science and Engineering, DataGrid Lab)

Presentation transcript:

A Middleware for Developing and Deploying Scalable Remote Mining Services
Leonid Glimcher, Gagan Agrawal
Computer Science and Engineering, DataGrid Lab

Abundance of Data
- Data is generated everywhere:
  - Sensors (intrusion detection, satellites)
  - Scientific simulations (fluid dynamics, molecular dynamics)
  - Business transactions (purchases, market trends)
- Analysis is needed to translate data into knowledge
- Growing data size creates problems for data mining

Grid and Cloud Computing
- Provides access to otherwise inaccessible resources and data
- Goal of Grid computing: to provide a stable foundation for distributed services
- Price for such a foundation: standards needed for distributed service integration (OGSA and WSRF)
- Cloud computing: access resources (data storage and processing) as services available for consumption

Remote Data Analysis
- Remote data analysis: the Grid is a good fit
- Details can be very tedious:
  - Data retrieval, movement and caching
  - Parallel data processing
  - Resource allocation
  - Application configuration
- Middleware can be useful to abstract away details

Our Approach
- Supporting development of scalable applications that process remote data using middleware: FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)
- [Figure: repository cluster, compute cluster, middleware user]

Outline
- Motivation
- Introduction
- Middleware Overview
- Challenges
- Experimental Evaluation
- Related work
- Conclusion

FREERIDE-G Grid Service
- Grid computing standards and infrastructures: OGSA vs. WSRF (merged Grid and Web Services)
- Data hosts compliant with repository standards (SRB)
- MPICH-G2: pre-WS mechanism to support MPI
- Globus Toolkit and MPICH-G2: most fitting infrastructure for Grid Service conversion of FREERIDE-G

SDSC Storage Resource Broker
- Standard for remote data storage and access
- Data can be distributed across organizations and heterogeneous storage systems
- Used to host data for FREERIDE-G
- Access provided through client API

FREERIDE-G Processing Structure
- Key observation: most data mining algorithms follow a canonical loop
- Middleware API:
  - Subset of data to be processed
  - Reduction object
  - Local and global reduction operations
  - Iterator
- Derived from precursor system FREERIDE

Canonical loop:

    While ( ) {
      forall (data instances d) {
        I = process(d)
        R(I) = R(I) op d
      }
      ...
    }
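The canonical loop on this slide can be illustrated with a minimal Python sketch. It is not the FREERIDE-G API (the middleware itself is C++); the function names `local_reduction` and `global_reduction`, the dictionary used as the reduction object, and the sample data are all assumptions made for illustration. Each node folds its subset of the data into a local reduction object, and the local objects are then combined in a global reduction.

```python
from typing import Callable, Dict, Hashable, Iterable, List

# Hypothetical reduction object: maps a key I to an accumulated value R(I).
ReductionObject = Dict[Hashable, float]


def local_reduction(
    chunk: Iterable[float],
    process: Callable[[float], Hashable],
    op: Callable[[float, float], float],
    identity: float,
) -> ReductionObject:
    """One pass of the canonical loop over a node's subset of the data."""
    r: ReductionObject = {}
    for d in chunk:
        i = process(d)                      # I = process(d)
        r[i] = op(r.get(i, identity), d)    # R(I) = R(I) op d
    return r


def global_reduction(
    partials: List[ReductionObject],
    combine: Callable[[float, float], float],
) -> ReductionObject:
    """Combine the reduction objects produced independently on each node."""
    merged: ReductionObject = {}
    for r in partials:
        for i, value in r.items():
            merged[i] = combine(merged[i], value) if i in merged else value
    return merged


# Illustrative data: two nodes, each holding one chunk; sum the values by parity.
chunks = [[1, 2, 3, 4], [5, 6, 7, 8]]
add = lambda a, b: a + b
partials = [local_reduction(c, lambda d: d % 2, add, 0) for c in chunks]
result = global_reduction(partials, add)
```

Because the update uses an associative and commutative operation, the order in which data instances are processed does not change the result, which is what allows the work to be split across nodes.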

FREERIDE-G Evolution
- FREERIDE: data stored locally
- FREERIDE-G:
  - ADR responsible for remote data retrieval
  - SRB responsible for remote data retrieval
- FREERIDE-G grid service: grid service featuring
  - Load balancing
  - Data integration

Evolution
[Figure: FREERIDE, FREERIDE-G-ADR, FREERIDE-G-SRB, FREERIDE-G-GT; components shown: Application, Data, ADR, SRB, Globus]

FREERIDE-G System Architecture

Implementation Challenges I
- Interaction with Code Repository
  - Simplified Wrapper and Interface Generator (SWIG)
  - XML descriptors of API functions
  - Each API function wrapped in own class
- [Figure labels: C++, Java, C++, SWIG, XML]

Implementation Challenges
- Integration with MPICH-G2 and Globus Toolkit
  - Supports MPI
  - Deployed through pre-WS Globus components
  - Hides potential heterogeneity in service startup and management

FREERIDE-G Applications
- VortexPro: finds vortices in volumetric fluid/gas flow datasets
- Kmeans Clustering: clusters points based on Euclidean distance in an attribute space
- EM Clustering: another distance-based clustering algorithm
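As an illustration of how a distance-based clustering algorithm fits the reduction structure, the following Python sketch runs one k-means iteration over data split between two nodes. It is a sketch under stated assumptions, not the FREERIDE-G implementation: the helper names (`local_pass`, `merge`, `updated_centers`) and the sample points are hypothetical. The reduction object here holds, for each center, the running coordinate sums and point count.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Reduction = Tuple[List[List[float]], List[int]]  # (per-center sums, per-center counts)


def nearest_center(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to the point in Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def local_pass(points: Sequence[Point], centers: Sequence[Point]) -> Reduction:
    """Local reduction: accumulate sums and counts for one node's points."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        k = nearest_center(p, centers)
        counts[k] += 1
        for j in range(dims):
            sums[k][j] += p[j]
    return sums, counts


def merge(a: Reduction, b: Reduction) -> Reduction:
    """Global reduction: element-wise addition of two reduction objects."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def updated_centers(reduction: Reduction, old: Sequence[Point]) -> List[Point]:
    """New center = mean of assigned points; keep the old center if none were assigned."""
    sums, counts = reduction
    return [tuple(s / n for s in row) if n else prev
            for row, n, prev in zip(sums, counts, old)]


# Illustrative data: two nodes, two initial centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
node_a = [(0.0, 0.0), (0.0, 2.0)]
node_b = [(10.0, 10.0), (10.0, 12.0)]
merged = merge(local_pass(node_a, centers), local_pass(node_b, centers))
new_centers = updated_centers(merged, centers)
```

A full run would repeat this pass, feeding `new_centers` back in, until the centers stop moving.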

Experimental Setup
- Organizational Grid:
  - Data hosted on Opteron 250 cluster
  - Processed on Opteron 254 cluster
  - Connected using two 10 GB optical fibers
- Goals:
  - Demonstrate parallel scalability of applications
  - Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
- [Figure: repository cluster, compute cluster]

Scalability Evaluation
- Scalable implementations with respect to numbers of compute nodes and data repository nodes
- [Figure: Vortex Detection with 14.8 GB dataset; Kmeans Clustering with 6.4 GB dataset (bottom)]

Scalability Evaluation II
- [Figure: EM Clustering (25.6 GB); Kmeans Clustering (25.6 GB)]
- Sub-linear speedup for scaled-up compute nodes is explained by sequential scheduling of concurrent data reads
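One way to see why serialized reads cap the speedup is a toy model in which the data retrieval time stays fixed while only the processing time is divided among compute nodes. This model and its numbers are assumptions for illustration; they are not measurements from the experiments.

```python
def execution_time(nodes: int, read_time: float, compute_time: float) -> float:
    """Toy model: reads are serialized, so only the compute part shrinks with more nodes."""
    return read_time + compute_time / nodes


def speedup(nodes: int, read_time: float, compute_time: float) -> float:
    """Speedup relative to a single compute node under the same toy model."""
    return execution_time(1, read_time, compute_time) / execution_time(nodes, read_time, compute_time)


# Hypothetical split: 20 time units of serialized reads, 80 units of parallelizable compute.
speedup_on_4 = speedup(4, 20.0, 80.0)
speedup_on_8 = speedup(8, 20.0, 80.0)
```

Under these assumed numbers, four nodes give a speedup of 2.5 rather than 4, and doubling to eight nodes adds less than one more unit of speedup, which matches the qualitative sub-linear behaviour the slide describes.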

Deployment Overhead Evaluation
- Clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment
- Kmeans Clustering with 6.4 GB dataset: %.

Deployment Overhead Evaluation II
- Vortex Detection with 14.8 GB dataset: 17-20%
- Overhead:
  - Scales with the total execution time
  - Does not affect overall data processing service scalability

Outline
- Motivation
- Introduction
- Middleware Overview
- Challenges
- Experimental Evaluation
- Related work
- Conclusion

Grid Computing
- Standards: Open Grid Services Architecture, Web Services Resource Framework
- Implementations: Globus Toolkit, WSRF.NET, WSRF::Lite
- Other Grid systems: Condor and Legion
- Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST

Data Grids
- Metadata cataloging: Artemis project
- Remote data retrieval: SRB, SEMPLAR, DataCutter, STORM
- Stream processing middleware: GATES, dQUOB
- Data integration: automatic wrapper generation

Remote Data in Grid
- KnowledgeGrid tool-set: developing data mining workflows (composition)
- GridMiner toolkit: composition of distributed, parallel workflows
- DiscoveryNet layer: creation, deployment, management of services
- DataMiningGrid framework: distributed knowledge discovery
- Other projects partially overlap in goals with ours

Conclusions
- FREERIDE-G: middleware for scalable processing of remote data, now as a grid service
- Compliance with grid and repository standards
- Support for high-end processing:
  - Eases use of parallel configurations
  - Hides details of data movement and caching
- Scalable performance with respect to data and compute nodes
- Low deployment overheads as compared to the non-grid version

SRB Data Host
- SRB Master:
  - Connection establishment
  - Authentication
  - Forks agent to service I/O
- SRB Agent:
  - Performs remote I/O
  - Services multiple client API requests
- MCAT:
  - Catalogs data associated with datasets, users, resources
  - Services metadata queries (through SRB agent)

Compute Node
- More compute nodes than data hosts
- Each node:
  1. Registers I/O (from index)
  2. Connects to data host
- While (chunks to process):
  1. Dispatch I/O request(s)
  2. Poll pending I/O
  3. Process retrieved chunks
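The dispatch, poll, and process loop of a compute node can be sketched in Python with a thread pool standing in for asynchronous remote I/O. This is an illustration only: `fetch_chunk` is a hypothetical stand-in for a read from a data host (the real system uses the SRB client API, which is not shown here), and `max_pending` is an assumed limit on outstanding requests.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Iterable, List


def fetch_chunk(chunk_id: int) -> List[int]:
    """Hypothetical stand-in for a remote read: returns four integers per chunk."""
    return list(range(chunk_id * 4, chunk_id * 4 + 4))


def run_compute_node(chunk_ids: Iterable[int], max_pending: int = 2) -> int:
    """Overlap chunk retrieval with processing; here 'processing' just sums the values."""
    todo = list(chunk_ids)          # registered I/O, as if read from an index
    pending = set()                 # outstanding I/O requests
    total = 0
    with ThreadPoolExecutor(max_workers=max_pending) as pool:
        while todo or pending:
            # 1. Dispatch I/O requests, up to the allowed number in flight.
            while todo and len(pending) < max_pending:
                pending.add(pool.submit(fetch_chunk, todo.pop(0)))
            # 2. Poll pending I/O: block until at least one request completes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            # 3. Process the retrieved chunks.
            for future in done:
                total += sum(future.result())
    return total


total = run_compute_node(range(3))
```

Keeping more than one request in flight lets the node process one chunk while the next is still being retrieved.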

FREERIDE-G in Action
[Figure: a Compute Node interacting with a Data Host (SRB Master, SRB Agent, MCAT). Steps shown: I/O registration; connection establishment; while (more chunks to process): I/O request dispatched, pending I/O polled, retrieved data chunks analyzed]

Conversion Challenges
1. Interfacing C++ middleware with GT 4.0 (Java)
   - SWIG to generate Java Native Interface (JNI) wrapper
2. Integrating data mining service through WSDL interface
   - Encapsulate service as a resource
3. Middleware deployment using GT 4.0
   - Use Web Services Grid Resource Allocation and Management (WS-GRAM) for deployment