Published by Antony Thornton. Modified over 9 years ago.
1
Introduction to Weka
ML Seminar for Rookies, 2012-02-03
Byoung-Hee Kim, Biointelligence Lab, Seoul National University
2
BI? (Predictive) Analytics, Data Mining, Machine Learning, AI
(C) 2007-2012, SNU Biointelligence Lab, http://bi.snu.ac.kr/
3
Analytics as a Mainstream Technology
Figure: Hype Cycle of Emerging Technologies 2010, Gartner
4
Analytics as a Mainstream Technology
5
Components of Data Mining
- Input: concepts, instances, attributes; preparing the input
- Algorithm: inferring rules, statistical modeling, divide-and-conquer, association, linear models, …
- Output: knowledge representation (tables, trees, rules, clusters, …)
- Credibility: evaluating what’s been learned (training and testing, performance, comparing algorithms)
6
Weka as a Must-Have Tool
Reviews on SourceForge.net:
I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool. A must for anyone even marginally interested in machine learning and classification techniques. One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus.
7
Agenda
- Introduction to Weka
  - General information
  - Components: Explorer, Experimenter, KnowledgeFlow, and CLI
- Classification practice with Weka
  - Problem: classifying iris flowers
  - Algorithms: Neural Networks, Decision Trees, Support Vector Machines
  - Evaluation criteria
- Using Experimenter for batch experiments
- Using KnowledgeFlow
- CLI (command line interface) and batch scripts
8
General Information on Weka
- Weka: Data Mining Software in Java
- Weka is a collection of machine learning algorithms for data mining and machine learning tasks.
- With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rule mining, and visualization.
- Weka is open source software issued under the GNU General Public License.
- How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for ‘Weka’ on Google.
9
Components of Weka
- Explorer lets you do various data mining tasks in an interactive, step-by-step way. It is usually the first choice.
- KnowledgeFlow allows you to design configurations for streamed data processing.
- Experimenter allows you to run classification and regression experiments in batch mode:
  - Different parameter settings
  - Various datasets
  - Comparison of models
  - Large-scale statistical experiments
- Simple CLI lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.
- Auxiliary tools are available in the menu.
10
Practice: Classifying Iris Flowers
Figure: Iris virginica, Iris versicolor, Iris setosa
Features for Classification
11
Practice: Classifying Iris Flowers
- Define features (or attributes)
  - Sepal length, sepal width, petal length, petal width
  - Class label: species of the iris flower (setosa, versicolor, or virginica)
- Collect samples to build a dataset
  - 50 samples from each of three species of Iris flowers
  - Data table: 150 samples (or instances) × 5 attributes
  - Fisher developed a linear discriminant model to distinguish the species from each other
- Classification algorithm
  - We will try neural networks, a decision tree, and a support vector machine
- Evaluating performance
  - Various metrics
  - Comparing learned models (algorithm + setting)
12
Terminology
- Features or Attributes
  - Features are the individual measurable properties of the phenomena being observed.
  - Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification.
- Training set / Test set
  - Training set: a set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier
  - Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier
13
Neural Networks
- MLP (Multilayer Perceptron)
- In Weka: classifiers-functions-MultilayerPerceptron
14
Decision Trees
- J48 (Java implementation of C4.5)
- In Weka: classifiers-trees-J48
15
Support Vector Machines
- SMO (sequential minimal optimization) for training SVMs
- In Weka: classifiers-functions-SMO
16
Practice Scenario
- Basic
  - Comparing the performance of algorithms: MultilayerPerceptron vs. J48 vs. SVM
  - Checking the trained model (structure and parameters)
  - Tuning parameters to get better models
  - Understanding ‘Test options’ and ‘Classifier output’ in Weka
- Advanced
  - Building committee machines using ‘meta’ algorithms for classification
  - Preprocessing / data manipulation: applying ‘Filter’
  - Batch experiment with ‘Experimenter’
  - Design and run a batch process with ‘KnowledgeFlow’
17
Dataset for Practice with Weka
- Just open “iris.arff” in the data folder of Weka
18
Data format for Weka (.ARFF)
Header:
    @RELATION iris
    @ATTRIBUTE sepallength REAL
    @ATTRIBUTE sepalwidth REAL
    @ATTRIBUTE petallength REAL
    @ATTRIBUTE petalwidth REAL
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Data (CSV format):
    @DATA
    5.1, 3.5, 1.4, 0.2, Iris-setosa
    4.9, 3.0, 1.4, 0.2, Iris-setosa
    4.7, 3.2, 1.3, 0.2, Iris-setosa
    …
    7.0, 3.2, 4.7, 1.4, Iris-versicolor
    6.4, 3.2, 4.5, 1.5, Iris-versicolor
    6.9, 3.1, 4.9, 1.5, Iris-versicolor
    …
Note: You can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file.
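The note about turning a CSV file into an ARFF file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV has a first row with column names, every column except the class column is numeric, and the function name `csv_to_arff` and its parameters are hypothetical, not part of Weka.

```python
import csv
import io


def csv_to_arff(csv_text, relation, class_column):
    """Prepend an ARFF header to CSV text (assumes a header row of column names)."""
    rows = [r for r in csv.reader(io.StringIO(csv_text), skipinitialspace=True) if r]
    names, data = rows[0], rows[1:]

    lines = ["@RELATION " + relation, ""]
    for i, name in enumerate(names):
        if name == class_column:
            # Nominal attribute: list the distinct labels in order of first appearance.
            labels = list(dict.fromkeys(row[i] for row in data))
            lines.append("@ATTRIBUTE %s {%s}" % (name, ",".join(labels)))
        else:
            # Assumption: every non-class column is numeric.
            lines.append("@ATTRIBUTE %s REAL" % name)

    lines += ["", "@DATA"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


sample = (
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
print(csv_to_arff(sample, "iris", "class"))
```

The data rows are copied unchanged; only the header is new, which is the point of the note.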
19
Neural Networks in Weka
- Load a file that contains the training data by clicking the ‘Open file’ button (‘ARFF’ or ‘CSV’ formats are readable)
- Click the ‘Classify’ tab
- Click the ‘Choose’ button
- Select ‘weka – classifiers – functions – MultilayerPerceptron’
- Click ‘MultilayerPerceptron’ and set parameters for the MLP
- Set parameters for the test
- Click ‘Start’ for learning
20
Some Notes on the Parameter Setting
- Parameter setting = car tuning
  - It needs much experience or many trials
  - You may get worse results if you are unlucky
- Multilayer Perceptron (MLP)
  - Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed
- J48
  - Main parameters: unpruned, numFolds, minNumObj
  - Many parameters control the size of the resulting tree, e.g. confidenceFactor, pruning
- SMO (SVM)
  - Main parameters: c (complexity parameter), kernel, kernel parameters
21
Test Options and Classifier Output
- There are various metrics for evaluation
- Setting the data set used for evaluation
22
How to Evaluate the Performance? (1/2)
- Usually, build a ‘Confusion Matrix’ on the test data set
- Evaluation metrics
  - Accuracy (percent correct)
  - Precision
  - Recall
  - Many other metrics: F-measure, Kappa score, etc.
- For fair evaluation, the ‘cross-validation’ scheme is used
23
How to Evaluate the Performance? (2/2)
Confusion Matrix

                        Real: Positive      Real: Negative
  Prediction: Positive  TP                  FP                    All with positive test
  Prediction: Negative  FN                  TN                    All with negative test
                        All with disease    All without disease   Everyone

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
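The metrics named on these slides follow directly from the four cells of a two-class confusion matrix. The Python sketch below uses the standard textbook definitions. The function name `confusion_metrics` and the counts in the usage lines are made up for illustration and are not results from the seminar.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four cells of a two-class confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                      # fraction classified correctly
    precision = tp / (tp + fp) if tp + fp else 0.0    # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0       # found among real positives
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)      # harmonic mean of the two
    # Cohen's kappa: agreement beyond what the row/column totals give by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Illustrative counts only (not taken from any experiment).
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 3))
```

For a three-class problem such as iris, the same formulas are usually applied per class, treating one class as positive and the rest as negative.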
24
Evaluation Method - Cross Validation
- K-fold cross validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.
  - This is repeated k times, so that each subset serves as the test set exactly once.
- Figure: 6-fold cross validation (subsets D1 to D6, with a different subset held out in each round)
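The k-fold procedure described on this slide can be sketched in plain Python. This is a minimal illustration, not Weka's implementation: the function name `k_fold_splits` and the `seed` parameter are assumptions, and the sketch does not stratify the folds by class.

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # random division of the data set
    folds = [indices[i::k] for i in range(k)]     # k disjoint subsets of near-equal size
    for i in range(k):
        test = folds[i]                           # one subset is held out as the test set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 150 instances (the size of the iris data) in 6 folds, as in the figure.
for round_no, (train, test) in enumerate(k_fold_splits(150, 6), start=1):
    print("round", round_no, "train:", len(train), "test:", len(test))
```

Averaging a metric over the k rounds gives the cross-validated estimate.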
25
Data Manipulation with Filter in Weka
- Attribute: selection, discretize
- Instance: re-sampling, selecting specified folds
26
Using Experimenter in Weka
- Tool for ‘batch’ experiments
- Click ‘New’
- Set experiment type / iteration control
- Set datasets / algorithms
- Select the ‘Run’ tab and click ‘Start’
- If it has finished successfully, click the ‘Analyse’ tab and see the summary
27
Usages of Experimenter
- Model selection for classification/regression
  - Various approaches
    - Repeated training/test set split
    - Repeated cross-validation (cf. double cross-validation)
    - Averaging
  - Comparison between models / algorithms
    - Paired t-test
    - On various metrics: accuracies / RMSE / etc.
- Batch and/or distributed processing
  - Load/save experiment settings
  - http://weka.wikispaces.com/Remote+Experiment
  - Multi-core support: utilize all the cores on a multi-core machine
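To make the paired t-test concrete, here is a minimal Python sketch of the plain paired t statistic computed on per-run scores of two models. The function name `paired_t_statistic` and the accuracy values are hypothetical, chosen only for illustration. Any correction for overlapping training sets across runs is outside the scope of this sketch.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired scores of two models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(variance / n)
    return t, n - 1


# Made-up accuracies of two models on the same five runs (illustration only).
model_a = [0.96, 0.94, 0.95, 0.97, 0.93]
model_b = [0.94, 0.93, 0.92, 0.95, 0.93]
t, df = paired_t_statistic(model_a, model_b)
print("t = %.3f with %d degrees of freedom" % (t, df))
```

The statistic is then compared against a t distribution with the given degrees of freedom to decide whether the difference between the two models is significant.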
28
KnowledgeFlow for Analysis Process Design
(‘Process Flow Diagram’ of SAS® Enterprise Miner)
29
KnowledgeFlow: Example Usage
- Decision tree (J48)
33
Simple CLI
- Example command and result:
    java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"
- You can easily build a command line script for various experiments
- Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
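A batch script for several parameter settings can be as simple as generating one command line per setting. The Python sketch below only builds the argument lists and does not run them. It reuses the `-L`, `-M`, `-N`, `-H`, and `-t` options from the example command above. The function name `build_mlp_commands`, the parameter grid, and the path `data/iris.arff` are assumptions for illustration.

```python
import itertools


def build_mlp_commands(arff_path, learning_rates, momenta, epochs=500):
    """Build one MultilayerPerceptron command (as an argument list) per setting."""
    commands = []
    for learning_rate, momentum in itertools.product(learning_rates, momenta):
        commands.append([
            "java", "weka.classifiers.functions.MultilayerPerceptron",
            "-L", str(learning_rate),   # learning rate
            "-M", str(momentum),        # momentum
            "-N", str(epochs),          # number of training epochs
            "-H", "a",                  # hidden layer setting, as in the example command
            "-t", arff_path,            # training file
        ])
    return commands


# Hypothetical grid and file path; adjust to your own installation.
for command in build_mlp_commands("data/iris.arff", [0.1, 0.3], [0.2, 0.5]):
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run`, assuming Java is installed and `weka.jar` is on the classpath.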
34
Other Open Source ML Software
- KNIME (Konstanz Information Miner): http://www.knime.org/
- RapidMiner (with Weka as its core): http://rapid-i.com/index.php?lang=en
- TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
35
General Information on Weka
- Current version (2012-02-03)
  - Stable version: 3.6.6
  - Developer version: 3.7.5
36
References
- Weka Wiki: http://weka.wikispaces.com/
  - Primer: good starting point
- Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
- Textbook: Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
- Articles
  - Data mining with WEKA, Part 1, Part 2, Part 3, in IBM Technical Library
  - Building a Prediction Program Using Weka, a series in Monthly Maso (월간 마소), July, August, and September 2009 issues; blog, MS Live