A new clustering tool of Data Mining RAPID MINER.



Introduction To Clustering
Clustering is unsupervised learning, used when historical data with class labels is not available, e.g. when introducing a new product. Example: group (cluster) existing customers based on the time series of their payment history, so that similar customers fall in the same cluster. Key requirement: a good measure of similarity between instances. Typical use: identify micro-markets and develop policies for each.

About The Project
The aim of this project is to devise a new clustering algorithm for data mining. The main functionalities implemented in the system are preprocessing and clustering. In preprocessing, an input file in .xls format can be chosen. Null values, if any, present in the input file are removed to avoid faulty results in the output data sets. Redundant or duplicate data sets are also removed. In clustering, the data is distributed into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters.

Present Tool: Weka
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. The Explorer interface features several panels providing access to the main components of the workbench.
The Preprocess panel has facilities for importing data from a database, a CSV file, etc., and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria.
The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.

Our tool:
In the data preprocessing phase, an MS-Excel file is taken as input; CSV and ARFF files are not used. Excel files are well known and comfortably handled by non-technical people, whereas CSV and ARFF files require the user to be well versed in those formats. Excel support was added by importing a new library, ‘jxl.jar’, into the project.
The input file(s) are first cleaned by removing the null data sets. Null data sets are data sets (records) that contain no information, or fewer values than a threshold (the minimum number of values of required attributes). The number of null data sets is reported to the user of the system.
Next, redundant/duplicate data sets are removed from the file(s). These are data sets whose attribute values are all identical to those of some other data set. They are eliminated before the further process of data mining, and their number is also reported to the user.
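The cleaning pass described above can be sketched as follows. This is a minimal illustration in Python, not the tool's own code: the function name `preprocess`, the parameter `min_values` (standing in for the threshold), and the sample rows are all hypothetical, and the rows are assumed to have already been read from the spreadsheet.

```python
def preprocess(rows, min_values=1):
    """Drop null and duplicate rows.

    Returns (clean_rows, n_null, n_duplicate) so both counts can be
    reported to the user. `min_values` is a hypothetical name for the
    threshold: a row with fewer filled cells counts as a null data set.
    """
    clean, seen = [], set()
    n_null = n_dup = 0
    for row in rows:
        # A cell is "filled" if it is neither None nor a blank string.
        filled = sum(1 for v in row if v is not None and str(v).strip() != "")
        if filled < min_values:
            n_null += 1
            continue
        key = tuple(row)
        if key in seen:  # every attribute value matches an earlier row
            n_dup += 1
            continue
        seen.add(key)
        clean.append(row)
    return clean, n_null, n_dup


# Made-up sample rows, for illustration only.
rows = [
    ("sunny", 85, "no"),
    ("sunny", 85, "no"),   # duplicate of the first row
    (None, None, None),    # no information at all
    ("rainy", 70, "yes"),
    ("", None, "yes"),     # only one value, below the threshold of 2
]
clean, n_null, n_dup = preprocess(rows, min_values=2)
print(len(clean), n_null, n_dup)  # 2 2 1
```

Null rows are checked before duplicates, so two identical empty rows are both counted as null rather than one null and one duplicate; the slides do not say which order the tool uses.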

KD Trees
K-dimensional trees. A space-partitioning data structure. Splitting planes are perpendicular to the coordinate axes. Reduces the average cost of a nearest-neighbor search over n points to O(log n).
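To make the structure concrete, here is a minimal KD-tree sketch in Python with a nearest-neighbor query. It is a textbook-style illustration, not the tool's implementation; the names `Node`, `build_kdtree` and `nearest` and the sample points are assumptions made for this example.

```python
import math


class Node:
    """One KD-tree node: a point plus the axis its splitting plane is perpendicular to."""

    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build_kdtree(points, depth=0):
    """Split on the median along one axis, cycling through the axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def nearest(node, target, best=None):
    """Return the stored point closest to `target` (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```

The pruning test in `nearest` is what lets a query skip whole subtrees instead of comparing the target against every point.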

Clustering
Our clustering algorithm uses a KD tree extensively to improve its time complexity. It differs from the existing approach in how the nearest centers are computed. Efficiency is achieved because the data points do not vary throughout the computation; hence, the data structure does not need to be recomputed at each stage.

K-means Clustering Complexity
The complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
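As a rough worked example of this bound: every iteration compares each of the n points with each of the K centroids, and each comparison touches d attributes. The short Python sketch below just multiplies the four factors. The function name `kmeans_operations` is hypothetical, and the values chosen for K and I are made up for illustration; n = 17 and d = 3 are the weather data statistics quoted in the comparison slide.

```python
def kmeans_operations(n, k, iterations, d):
    """Approximate count of per-attribute operations: n * K * I * d."""
    return n * k * iterations * d


# n = 17 instances and d = 3 attributes; K = 2 and I = 10 are hypothetical.
print(kmeans_operations(n=17, k=2, iterations=10, d=3))  # 1020
```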

K-Means
The K-Means methodology is a commonly used clustering technique. The user starts with a collection of samples and attempts to group them into ‘k’ clusters based on specific distance measurements. The prominent steps of the K-Means clustering algorithm are given below.
1. The algorithm is initiated by creating ‘k’ different clusters. The given sample set is first randomly distributed between these ‘k’ clusters.
2. Next, the distance from each sample to its respective cluster centroid is calculated.
3. Each sample is then moved to the cluster (k′) whose centroid is at the shortest distance from that sample.
Steps 2 and 3 are repeated until no sample changes cluster.

As a first step of the cluster analysis, the user decides on the number of clusters ‘k’. This parameter can take integer values with a lower bound of 1 (in practice, 2 is the smallest relevant number of clusters) and an upper bound equal to the total number of samples. The K-Means algorithm is repeated a number of times to obtain the best clustering solution, each time starting with a random set of initial clusters.
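The K-Means steps and the repeated random restarts can be sketched in Python as follows. This is a plain textbook version, not the tool's KD-tree-based algorithm. The function names, the handling of empty clusters, and the use of the sum of squared errors (SSE) to pick the best run are all assumptions made for this example.

```python
import math
import random


def compute_centers(points, labels, k, rng):
    """Centroid of each cluster; an empty cluster is reseeded with a random sample (assumption)."""
    centers = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            centers.append(rng.choice(points))
    return centers


def kmeans(points, k, rng, max_iter=100):
    # Step 1: randomly distribute the samples between k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        centers = compute_centers(points, labels, k, rng)
        # Steps 2-3: measure distances to the centroids, move each sample to the nearest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no sample moved, so the run has converged
            break
        labels = new_labels
    centers = compute_centers(points, labels, k, rng)
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-Means several times from random starts and keep the lowest-SSE result."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)), key=lambda r: r[2])


# Made-up two-dimensional samples forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centers, sse = best_of_restarts(points, k=2)
print(labels, round(sse, 2))
```

Keeping the run with the lowest SSE is one common way to compare restarts; the slides do not state which criterion the tool uses.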

COMPARISON OF OUR TOOL WITH WEKA
A data set with the following statistics was run on both WEKA and our tool:
Relation = weather
No. of attributes = 3
No. of instances (including redundant/duplicate and null instances) = 17

Limitations
This tool does not provide protection from:
Shared storage failures.
Network service failures.
Operational errors.
Site disasters (unless a geographically dispersed clustering solution has been implemented).

In the near future…
Market analysis
Marketing strategies
Advertisement
Risk analysis and management
Finance and finance investments
Manufacturing and production
Fraud detection and detection of unusual patterns (outliers)
Telecommunication
Financial transactions
Anti-terrorism (!!!)

CONCLUSION
We devised a new algorithm for clustering by considering the following variations:
MS-Excel file(s) are successfully read, handled and processed by the system with the help of the ‘jxl.jar’ library. Working with this library also taught us new features and functionality for handling Excel documents.
Null data sets were removed, along with redundant and duplicate data sets.
The algorithm chooses better starting clusters, i.e. the initial values (or “seeds”) for the clustering algorithm. K-Means itself does not specify how the initial centers are to be selected; in this algorithm they are chosen explicitly.
A filtering algorithm is included, which uses KD trees to speed up each k-means step.
An inappropriate choice of the number of clusters can yield poor results, so the number of clusters is determined carefully for the data set.

References
Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation.
Leo Wanner. Introduction to Clustering Techniques.
Glenn Fung. A Comprehensive Overview of Basic Clustering Algorithms.
Tan, Steinbach, Kumar. Introduction to Data Mining.

Questions/comments…?