Waikato Environment for Knowledge Analysis
Contents
- What is WEKA?
- The Explorer:
  - Preprocess data
  - Data Visualization
  - Classification
  - Clustering
  - Association Rules
  - Attribute Selection
- References and Resources
What is WEKA? Waikato Environment for Knowledge Analysis
It is a data mining / machine learning tool developed by the Department of Computer Science at the University of Waikato, New Zealand. The weka is also a bird found only in New Zealand.
Getting started with WEKA
- Free: GNU GPL license
- Multi-platform (Java): Windows, Mac, Linux
- Easy to install
- Example datasets included
- Documentation, tutorials, and MOOCs
What is WEKA? A data mining tool
- User interface, or integrated into your Java code
- Data filters
- Classification
- Clustering
- Visualization
Input data
- Formats: ARFF (Attribute-Relation File Format), CSV, SQL database
- An ARFF file contains the name of the dataset, the attributes' names, types, and values, and the data itself
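For reference, a minimal ARFF file has this shape (the weather-style attributes below are only illustrative, not a complete WEKA sample dataset):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,64,yes
rainy,71,yes
```

The `@relation` line names the dataset, each `@attribute` line declares a name and a type (nominal values in braces, or `numeric`), and everything after `@data` is one comma-separated instance per line.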
The Explorer
- Preprocessing
- Visualization
- Classification
- Clustering
- Finding associations
- Attribute selection
Explorer: Preprocessing
- 49 different filters
- Supervised and unsupervised
- Applied to attributes or to instances
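To make the idea of an unsupervised attribute filter concrete, here is a standalone sketch of min-max normalization, comparable in spirit to WEKA's Normalize filter (this is not WEKA's implementation, just an illustration of what such a filter computes):

```python
def min_max_normalize(column):
    """Rescale one numeric attribute to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant attribute: map every value to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Illustrative temperatures: the smallest maps to 0.0, the largest to 1.0
temperatures = [64, 71, 85]
print(min_max_normalize(temperatures))
```

In WEKA such a filter is applied to a whole dataset from the Preprocess panel; here the same transformation is shown on a single attribute column.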
Demo: Filters
Demo: Visualization
Explorer: Classification
- 76 different classification algorithms
- Decision trees, instance-based classifiers, support vector machines (SVMs), Bayesian networks, …
Demo: Classifier
Explorer: Clustering
- WEKA contains "clusterers" for finding groups of similar instances in a dataset
- Implemented schemes include k-means, EM, Cobweb, X-means, and FarthestFirst
- Clusters can be visualized and compared to "true" clusters (if given)
- Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
The K-Means Clustering Method
Given k, the k-means algorithm proceeds in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no assignments change
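The four steps can be sketched in a few lines of plain Python (a minimal standalone sketch for 2-D points, not WEKA's SimpleKMeans implementation):

```python
import math
import random

def kmeans(points, k, max_iters=100):
    """Plain k-means on 2-D points, following the four steps above."""
    # Step 1: pick k initial seed points (here: k distinct data points)
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 3: assign each point to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each seed point as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the assignments (hence centroids) no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of hypothetical points
random.seed(1)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, _ = kmeans(pts, 2)
print(sorted(centroids))  # one centroid near each group
```

Note that k-means only finds a local optimum, so the result can depend on the random initial seeds; WEKA exposes the random seed as a parameter for the same reason.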
Demo: Clustering Data
Explorer: Finding associations
- WEKA contains an implementation of the Apriori algorithm for learning association rules
- Works only with discrete data
- Can identify statistical dependencies between groups of attributes, e.g. milk, butter → bread, eggs (with confidence 0.9 and support 2000)
- Apriori can compute all rules that have a given minimum support and exceed a given confidence
Basic Concepts: Frequent Patterns

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

- Itemset: a set of one or more items
- k-itemset: X = {x1, …, xk}
- (Absolute) support, or support count, of X: the number of transactions that contain the itemset X
- (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold
Basic Concepts: Association Rules

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

- Find all rules X → Y with minimum support and confidence
- Support, s: the probability that a transaction contains X ∪ Y
- Confidence, c: the conditional probability that a transaction containing X also contains Y
- Let minsup = 50%, minconf = 50%
- Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
- Association rules (among many more):
  Beer → Diaper (support 60%, confidence 100%)
  Diaper → Beer (support 60%, confidence 75%)
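The support and confidence figures for these rules can be checked directly from the five transactions (a standalone sketch of the definitions, not WEKA's Apriori code):

```python
# The five transactions from the slide, as sets of items
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of lhs -> rhs: P(t contains rhs | t contains lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))       # 0.6  -> the 60% support above
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer -> Diaper, 100%
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper -> Beer, 75%
```

Apriori's contribution is not these definitions but the pruning strategy: every subset of a frequent itemset must itself be frequent, so candidate itemsets can be grown level by level.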
Demo: Association
Explorer: Attribute selection
- A panel for investigating which (subsets of) attributes are the most predictive
- Attribute selection methods have two parts:
  - A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  - An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
- Very flexible: WEKA allows (almost) arbitrary combinations of these two
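Information gain, one of the evaluation methods named above, is simple enough to sketch directly (a standalone illustration on hypothetical toy data, not WEKA's InfoGainAttributeEval):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Entropy reduction from splitting the labels on one attribute."""
    n = len(labels)
    split = {}
    for v, y in zip(attribute_values, labels):
        split.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Hypothetical toy data: a perfectly predictive attribute vs. a useless one
labels  = ["yes", "yes", "no", "no"]
good    = ["a", "a", "b", "b"]      # splits the labels cleanly -> gain 1.0 bit
useless = ["a", "b", "a", "b"]      # tells us nothing           -> gain 0.0 bits
print(info_gain(good, labels))
print(info_gain(useless, labels))
```

An evaluation method like this scores individual attributes; the search method then decides which attributes or subsets to score, which is why WEKA lets you combine the two freely.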
Demo: Attribute Selection
Different ways to use it
- Explorer: preprocessing, clustering, classification, regression analysis, visualization
- Experimenter: analysis and comparison of classifiers
Different ways to use it
- Simple command-line interface
- Can be integrated into your Java code
WEKA's advantages
- Contains many algorithms
- Free (most other data mining tools are very expensive)
- Open source, so adapting it to your own needs is possible
- Constantly under development (not only by the original designers)
Drawbacks
- Limited possibilities for interfacing with other software
- Performance is often sacrificed in favor of portability, design transparency, etc.
- Memory limitations, because the dataset must be loaded into main memory completely
Conclusion
+ Easy to use
+ No programming skills needed
- Limited visualization and statistical tools