Project Seminar on STABLE CLUSTERING ALGORITHM TO IDENTIFY CPU USAGE OF COMPUTERS BEHAVIOR IN GRID ENVIRONMENT Under the guidance of Prof. Lakshmi Rajamani (Head of the Department) SUBMITTED BY G. Naresh Kumar ( ) M.Tech(CSE)-III SEM
Contents: Introduction. Motivation. Problem statement. Work done so far. Work to be done. Conclusion. References.
Introduction. Grid computing, or simply grid, is a generic term for technologies designed to make pools of distributed computer resources available on demand. Grid computing has become a well-established method for Internet-based high-performance computing. A grid provides widespread, dynamic, flexible, and coordinated sharing of geographically distributed networked resources among dynamic user groups.
Data mining: Data mining, or knowledge discovery, refers to a variety of techniques that have been developed in the fields of databases, machine learning, and pattern recognition. The process of finding useful patterns and information in raw data is often known as data mining.
Clustering: Clustering is the division of data into groups of similar objects. A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. It is a process of unsupervised learning. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing.
Clustering techniques:
1) Partitioning clustering: PAM, CLARA, CLARANS, K-Means
2) Supervised clustering: K-nearest neighbors
3) Online-mode clustering: ECM, Evoc
4) Fuzzy clustering: Fuzzy c-means
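As an illustration of one partitioning technique from the list, the following is a minimal sketch of K-Means applied to one-dimensional CPU usage readings. It is not the implementation used in this project. The function name `kmeans_1d`, the evenly spread choice of initial centers, and the sample readings are assumptions made for this example only.

```python
def kmeans_1d(values, k, iterations=20):
    """Group 1-D values (e.g. CPU usage percentages) into k clusters.

    Assumes 1 <= k <= len(values).
    """
    # Assumption: initial centers are spread evenly over the sorted values.
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, clusters


# Hypothetical CPU usage readings (percent) from nine machines.
cpu_usage = [5, 7, 9, 48, 50, 52, 90, 92, 95]
centers, clusters = kmeans_1d(cpu_usage, k=3)
print(centers)
print(clusters)
```

With these sample readings, the machines separate into a low, a medium, and a high usage group. Real CPU usage data would usually need more than one feature and a more careful initialization.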
Motivation: In a grid environment, the number of participating computing nodes and users keeps increasing and may reach thousands or millions. The abundance of these resources creates new problems, such as how to collect massive amounts of evolving resource data in real time and extract useful information from it. Moreover, this data is unordered, random, and chaotic, so an ordinary user cannot easily discover knowledge or meaningful information in it. To meet these requirements, clustering is proposed as one of the best ways to process large sets of raw data and turn them into meaningful information.
The Flow of Clustering Process in Grid Environment
Problem statement: Mining clusters in a single large database requires considerable processing power, and the conventional technology used for centralized data mining is no longer suitable for new systems. We apply different clustering methods to CPU usage data to identify the behavior of computers. Finding the stable algorithm requires evaluating dynamicity, accuracy, and the ability to identify stable cluster members. The best of these clustering algorithms will be implemented for better processing and cluster stability in the grid environment. However, the results depend on the threshold value, stability value, and stability hour.
Work done so far: Survey of the existing clustering algorithms. Survey of grid technologies. Installed the GridGain toolkit.
Work to be done: Test different types of clustering algorithms and measure their performance and complexity on a system. Test the clustering algorithms in the grid environment and measure their performance to identify the stable clustering algorithm. Finally, implement the stable clustering algorithm in the grid environment for better processing and cluster stability.
Cluster Stability:
Stability Value: The value (in percent) that bounds the allowed change in cluster radius. For instance, if the stability value is defined as 5%, any cluster whose radius grows or shrinks by less than 5% of its original size is considered stable.
Stability Hour: The amount of time, in hours, that a cluster member must stay in the same cluster to be considered stable. If the stability hour is set to 3 hours, any cluster member that stays in the same cluster for more than this amount of time is considered stable.
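The two definitions above can be sketched as simple checks. This is a minimal illustration, not the project's implementation. The function names and the sample radii are hypothetical, and the radius change is assumed to be measured relative to the original radius.

```python
def radius_change_pct(original_radius, current_radius):
    """Percentage change in cluster radius relative to the original size."""
    if original_radius <= 0:
        raise ValueError("original_radius must be positive")
    return abs(current_radius - original_radius) / original_radius * 100.0


def is_cluster_stable(original_radius, current_radius, stability_value_pct):
    # Stable when the radius grew or shrank by less than the stability value.
    return radius_change_pct(original_radius, current_radius) < stability_value_pct


def is_member_stable(hours_in_same_cluster, stability_hour):
    # Stable when the member stayed in the same cluster for more than
    # the stability hour, as in the definition above.
    return hours_in_same_cluster > stability_hour


# Stability value of 5%: a radius change from 10.0 to 10.4 is about 4%.
print(is_cluster_stable(10.0, 10.4, stability_value_pct=5))
# Stability hour of 3: a member that stayed 4 hours qualifies.
print(is_member_stable(4, stability_hour=3))
```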
Assumptions: A cluster is considered stable based on the stability value, which is predefined by the user, for instance 20%. A cluster member is considered stable if it stays in the same stable cluster continuously for at least two hours. The stability hour is determined by the user.
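The "continuously for at least two hours" condition can be illustrated with a short sketch that scans an hourly assignment history. The data layout (one cluster id per member per hour, oldest first), the function names, and the machine and cluster labels are all assumptions made for this example. Each entry is taken to represent one full hour in that cluster.

```python
def current_streak(assignments):
    """Count the consecutive trailing entries equal to the latest cluster id."""
    if not assignments:
        return 0
    latest = assignments[-1]
    streak = 0
    for cluster_id in reversed(assignments):
        if cluster_id != latest:
            break
        streak += 1
    return streak


def stable_members(hourly_assignments, stable_clusters, stability_hour=2):
    """Return members whose latest unbroken stay in a stable cluster
    lasts at least `stability_hour` hours."""
    result = []
    for member, assignments in hourly_assignments.items():
        # The member must currently be in a cluster already judged stable.
        if not assignments or assignments[-1] not in stable_clusters:
            continue
        if current_streak(assignments) >= stability_hour:
            result.append(member)
    return sorted(result)


# Hypothetical history: one cluster id per hour, oldest first.
history = {
    "pc1": ["A", "A", "A"],  # three continuous hours in A
    "pc2": ["A", "B", "A"],  # moved between clusters, streak is broken
    "pc3": ["B", "B"],       # two continuous hours in B
    "pc4": ["C", "C", "C"],  # continuous, but C is not a stable cluster
}
print(stable_members(history, stable_clusters={"A", "B"}))
```

Here only `pc1` and `pc3` meet both conditions: they are in a stable cluster and have stayed there without interruption for at least two hours.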
Conclusion: The stable clustering algorithm has been evaluated using three main criteria: dynamicity, accuracy, and the ability to identify stable cluster members. It can handle and process massive amounts of data without a significant error rate. From the experiment, we can conclude that the stable clustering algorithm is more dynamic than other existing clustering algorithms.
References:
Ahmar Abbas, "Grid Computing: A Practical Guide to Technology and Applications", Charles River Media Inc.
Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", University of Illinois at Urbana-Champaign, © 2000 Morgan Kaufmann Publishers.
Zhijie Xu, Laisheng Wang, Jiancheng Luo and Jianqin Zhang, "A Modified Clustering Algorithm for Data Mining".
Kee Sim Ee, Chan Huah Yang, Fazilah Haran, "Mining of Resource Usage Using Evoc Algorithm in Grid Environment".
Huimin Wang, Guihua Nie and Kui Fu, "Distributed data mining based on semantic web and grid", 2009 International Conference on Computational Intelligence and Natural Computing.
Ping Luo, Kevin Li, Zhongzhi Shi, Qing He, "Distributed data mining in grid computing environments".
David A. Cieslak, Nitesh V. Chawla, and Douglas L. Thain, "Troubleshooting Thousands of Jobs on Production Grids Using Data Mining Techniques", 9th Grid Computing Conference, 2008.
Thank you