Scalable Clustering on the Data Grid Patrick Wendel Moustafa Ghanem Yike Guo Discovery Net Department of Computing Imperial College,

Presentation transcript:

Scalable Clustering on the Data Grid
Patrick Wendel, Moustafa Ghanem, Yike Guo
Discovery Net, Department of Computing, Imperial College, London

20/09/2005, All Hands Meeting, Nottingham
Outline
Discovery Net; data clustering; mining distributed data; description of the strategy; deployment; evaluation; conclusions and future work.

Discovery Net
A multidisciplinary project funded by the EPSRC under the UK e-Science programme (started October 2002, ended March 2005). It developed an infrastructure of knowledge discovery services for integrating and analysing data collected from high-throughput devices and sensors.
Applications: life sciences (high-throughput genomics and proteomics); real-time environmental monitoring (high-throughput dispersed air sensing technology); geo-hazard modelling (earthquake modelling through satellite imagery).
The project covered many areas, including infrastructure, applications and algorithms (text mining). It produced the Discovery Net platform, which aims to integrate, compose, coordinate and deploy knowledge discovery services using workflow technology.

Discovery Net
[Diagram labels: Using Distributed Computing Resources; Scientific Information; Scientific Discovery; Literature; Databases; Operational Data; Images; Instrument Data]
e-Science: large-scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.

Data Clustering
We concentrate on one class of data mining algorithms: clustering, a family of exploratory techniques for finding groups of points that are similar or close to each other. It is a popular analysis technique, useful for exploring, understanding and modelling large data sets.
Two main types of clustering:
Hierarchical: reorganises the data set into a hierarchy of clusters based on their similarity.
Partition/model-based: partitions the data set into a number of clusters, or fits a statistical model (e.g. a mixture of Gaussians) to the data set.
Clustering has been successfully applied to sociological data, image processing and genomic data.

Mining Data on the Grid
The environment for data analysis is changing: from analysing data files held locally (or close to the algorithm), to using remote data sources and remote services through portals, and now towards distributed data execution.
Distributed data sources: data mining processes can now require data spread across multiple organisations.
Service-oriented approach: high-level functionality is now available through well-defined services, instead of low-level (terminal, etc.) access to resources.

Goal
Design a service-oriented distributed data clustering strategy that can be deployed on a Grid environment (i.e. a standards-based, service-oriented, secure distributed environment) and that end users and data analysts can easily deploy against their own data sets.

Requirements 1/2
Performance: analysing the data directly on the data grids through analysis services must be more efficient than gathering all the data on the analyst's desktop.
Accuracy: the strategy should at least provide a model that is more representative of the overall data set than the models built from individual fragments.
Security: the deployed strategy should handle authentication and authorisation consistently throughout.
Privacy: access to the data sources is restricted.

Requirements 2/2
Heterogeneity of resources and connectivity: it is very unlikely that the resources involved in the distributed analysis will be similar, or connected by networks of similar bandwidth.
Loose coupling between the resources participating in the distributed analysis: the analyst has less control over what each data grid or analysis service provides, so the framework should, as far as possible, be unaffected by minor differences in the functionality offered by each site.
Service-oriented approach: deployment of the analysis process should be based on the coordination of high-level services, rather than on a dedicated distributed algorithm (e.g. an MPI implementation).

Current strategy
We restrict the current framework to the case where instances are distributed across fragments but every fragment has the same attributes (roughly, horizontal fragmentation).
The strategy is based on the EM clustering algorithm (an algorithm for fitting a mixture-of-Gaussians model). Hierarchical clustering is inherently complex to distribute, whereas the statistical approach of EM provides a sound basis for defining a model combination strategy.

Approach
1. Generate a clustering model at each data source location (compute near the data).
2. Transfer the partial models in a standard format (PMML) to a combiner site.
3. Normalise the relative weights of each model.
4. Run an EM-based method on the partial models to generate a global model.
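The weight normalisation step can be illustrated with a minimal sketch. The slides do not say how the relative weights are normalised; the version below assumes that each partial model's mixing weights are rescaled by the share of data points held at its site, so that the pooled weights sum to 1. The function name and the input layout are hypothetical.

```python
def normalise_weights(partial_models):
    """Pool the component weights of several partial mixture models.

    partial_models: list of (n_points, weights) pairs, where weights are the
    mixing weights of one site's model and sum to 1.
    ASSUMPTION (not from the slides): each site is weighted by its share of
    the total number of points.
    """
    total_points = sum(n for n, _ in partial_models)
    pooled = []
    for n, weights in partial_models:
        site_share = n / total_points
        pooled.extend(w * site_share for w in weights)
    return pooled


# Site A holds 100 points (two clusters), site B holds 300 points (one cluster).
print(normalise_weights([(100, [0.5, 0.5]), (300, [1.0])]))
```

Under this assumption a component fitted on a large fragment counts for more in the global model than one fitted on a small fragment.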

Combining Cluster Models
The combination method is derived from the EM clustering algorithm itself, adapted to take as input the models generated at each site. Each partial model is treated as a (very) compressed representation of its fragment, similar to the two-step approach of some scalable clustering algorithms. The detailed algorithm and formulae are given in the proceedings.
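To make the idea of running EM over partial models more concrete, here is a one-dimensional sketch. It is not the authors' algorithm (their formulae are in the proceedings and are not reproduced in the slides). It assumes each partial component is summarised by a (weight, mean, variance) triple with weights already normalised to sum to 1, uses the expected log-density of a partial component under a global component in the E-step, and uses weighted moment matching in the M-step. All names are hypothetical.

```python
import math


def combine_models(partials, k, iters=50):
    """Fit k global 1-D Gaussian components to partial components.

    partials: list of (weight, mean, variance), weights summing to 1.
    Returns a list of k (weight, mean, variance) tuples.
    ASSUMPTION: illustrative EM variant, not the published formulae.
    """
    # Seed the global means from partial components spread across the sorted means.
    ordered = sorted(partials, key=lambda c: c[1])
    n = len(ordered)
    glob = [(1.0 / k, ordered[(2 * j + 1) * n // (2 * k)][1],
             ordered[(2 * j + 1) * n // (2 * k)][2]) for j in range(k)]
    for _ in range(iters):
        # E-step: responsibility of each global component for each partial
        # component, from the expected log-density of the partial component.
        resp = []
        for _, mu, var in partials:
            logs = [math.log(a) - 0.5 * math.log(2 * math.pi * s)
                    - (var + (mu - m) ** 2) / (2 * s)
                    for a, m, s in glob]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])
        # M-step: weighted moment matching over the partial components.
        updated = []
        for g in range(k):
            mass = sum(p[0] * r[g] for p, r in zip(partials, resp))
            mean = sum(p[0] * r[g] * p[1] for p, r in zip(partials, resp)) / mass
            var = sum(p[0] * r[g] * (p[2] + (p[1] - mean) ** 2)
                      for p, r in zip(partials, resp)) / mass
            updated.append((mass, mean, var))
        glob = updated
    return glob


# Two sites, each reporting one cluster near 0 and one near 10.
partial_components = [(0.25, 0.0, 1.0), (0.25, 10.0, 1.0),
                      (0.25, 0.2, 1.0), (0.25, 10.2, 1.0)]
for weight, mean, var in combine_models(partial_components, k=2):
    print(round(weight, 3), round(mean, 3), round(var, 3))
```

Only the summaries travel to the combiner site, so the cost of this step depends on the number of partial components rather than on the number of data points.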

Deployment: Discovery Net
The Discovery Net platform is used to build and deploy this framework. Its implementation is based on an open architecture that re-uses common protocols and common infrastructure elements (such as the Globus Toolkit). It also defines its own protocol for workflows, the Discovery Process Markup Language (DPML), which allows data analysis workflows to be defined and executed on distributed resources. The platform comprises a server that stores and schedules the workflows and manages the data, and a thick client that supports workflow construction, giving the end user the ability to define application-specific workflows for tasks such as distributed data mining. The model combiner is implemented as a workflow activity in Discovery Net.

Deployment
[Diagram labels: Data sources (Source A, Source B, Source C); Discovery Net servers; Partial clustering; PMML; Partial models; Combiner site; Global model]

Deployment: Workflow
The Discovery Net client enables the distributed process to be composed visually as a workflow and executed. The execution engine coordinates the distributed execution.

Accuracy Evaluation: Data Distribution
We compare the accuracy of the combined model with the average accuracy of the partial models, both measured against the entire data set (i.e. have we gained accuracy by considering the fragments together?).
Accuracy depends strongly on how the data is distributed among the sites. In the evaluation we introduce a randomness ratio that determines how similar the data distribution is across fragments: 0 means that each site has data drawn from a different distribution; 1 means that the data in all fragments are drawn from the same distribution.
Accuracy is measured by the log-likelihood of the test data set: the likelihood of a data set represents how likely that data is to follow the distribution defined by the model.
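The log-likelihood measure can be sketched for a one-dimensional mixture of Gaussians. This is an illustration only: the model layout (a list of (weight, mean, variance) triples) and the function name are assumptions, and the evaluation in the talk uses multi-dimensional data.

```python
import math


def log_likelihood(points, model):
    """Sum of log mixture densities of the points under a 1-D Gaussian mixture.

    model: list of (weight, mean, variance) triples (hypothetical layout).
    Note: a point far from every component can underflow to density 0,
    in which case math.log raises ValueError.
    """
    total = 0.0
    for x in points:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in model
        )
        total += math.log(density)
    return total


test_points = [0.0, 0.1, -0.2]
good_model = [(1.0, 0.0, 1.0)]  # centred on the test points
poor_model = [(1.0, 5.0, 1.0)]  # centred far from the test points
print(log_likelihood(test_points, good_model) > log_likelihood(test_points, poor_model))
```

A higher (less negative) value means the test data is more plausible under the model, which is how a combined model and a partial model can be compared on the same test set.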

Accuracy Evaluation: Data Distribution
As expected, the ratio has a large effect on the accuracy gained. At low ratios, each fragment is less representative of the complete data set, so the combined model outperforms the partial ones.

Accuracy Evaluation: Number of fragments
(r = 0.2, 10,000 points, 5 clusters)
Accuracy does degrade as the number of fragments increases, but so does the average accuracy of the models generated from individual fragments.

Accuracy Evaluation: Increasing data size
(r = 0.2, d = 5, 5 fragments)
The accuracy of the combined model relative to the partial ones behaves consistently as the data size increases.

Performance Evaluation
Performance evaluation is only partially relevant, because the process does not feed back combined models and the partial models are generated near the data. The heterogeneity of real deployments is also difficult to take into account.
[Chart: time in seconds for an increasing number of fragments]

Performance Evaluation
[Chart: execution time with lower dimensionality and larger data sets]

Conclusions
The results are encouraging in terms of accuracy versus performance, given the constraints. But is the trade-off between accuracy and flexibility (generally the case in distributed data mining) acceptable? The strategy should be part of a wider exploratory process, probably as a first step towards understanding the data set. Because it is part of the Discovery Net platform, the distributed analysis process can be designed simply from the Discovery Net client software.

Future Work
This is a first step towards more generic distributed data mining strategies (classification algorithms, association rules). Evaluation against real data sets is needed.
Possible improvements include: refinement through feedback; a more complex intermediate summary structure for the partial models (e.g. tree structures containing summary information); estimation of the number of clusters (using the Bayesian Information Criterion); and trying other clustering algorithms, of which there are plenty.
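For the suggested estimation of the number of clusters, the Bayesian Information Criterion in its standard form is BIC = p ln(n) - 2 ln(L), where p is the number of free parameters, n the number of points and L the maximised likelihood; the candidate with the lowest BIC is preferred. The sketch below assumes a mixture of Gaussians with full covariance matrices; the function names and the log-likelihood values in the example are made up for illustration and are not results from the talk.

```python
import math


def gmm_param_count(k, d):
    """Free parameters of a k-component, d-dimensional Gaussian mixture
    with full covariances: (k - 1) weights, k*d means, k*d*(d+1)/2 covariances."""
    return (k - 1) + k * d + k * d * (d + 1) // 2


def bic(log_lik, k, d, n):
    """Standard BIC: lower is better."""
    return gmm_param_count(k, d) * math.log(n) - 2.0 * log_lik


# Hypothetical log-likelihoods for models with 2, 3 and 4 clusters
# fitted to 1,000 one-dimensional points.
candidates = {2: -2100.0, 3: -2010.0, 4: -2008.0}
scores = {k: bic(ll, k, d=1, n=1000) for k, ll in candidates.items()}
print(min(scores, key=scores.get))
```

The penalty term p ln(n) grows with the number of clusters, so an extra cluster is accepted only if it improves the log-likelihood enough to pay for its additional parameters.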

Questions?