Introduction to Weka ML Seminar for Rookies 2012-02-03 Byoung-Hee Kim Biointelligence Lab, Seoul National University.


1 Introduction to Weka ML Seminar for Rookies 2012-02-03 Byoung-Hee Kim Biointelligence Lab, Seoul National University

2 BI? (Predictive) Analytics, Data Mining, Machine Learning, AI (C) 2007-2012, SNU Biointelligence Lab, http://bi.snu.ac.kr/

3 Analytics as a Mainstream Technology
[Figure: Hype Cycle of Emerging Technologies 2010, Gartner]

4 Analytics as a Mainstream Technology

5 Components of Data Mining
- Input: concepts, instances, attributes; preparing the input
- Output: knowledge representation (tables, trees, rules, clusters, …)
- Algorithm: inferring rules, statistical modeling, divide-and-conquer, association, linear models, …
- Credibility: evaluating what’s been learned (training and testing, performance, comparing algorithms)

6 Weka as a Must-Have Tool
Reviews on Sourceforge.net:
- "I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool."
- "A must for anyone even marginally interested in machine learning and classification techniques."
- "One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus."

7 Agenda
- Introduction to Weka
  - General information
  - Components: Explorer, Experimenter, KnowledgeFlow, and CLI
- Classification practice with Weka
  - Problem: classifying iris flowers
  - Algorithms: Neural Networks, Decision Trees, Support Vector Machines
  - Evaluation criteria
  - Using Experimenter for batch experiments
  - Using KnowledgeFlow
  - CLI (command line interface) and batch scripts

8 General Information on Weka
- Weka: Data Mining Software in Java
  - Weka is a collection of machine learning algorithms for data mining and machine learning tasks
  - With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rule mining, and visualization
  - Weka is open-source software issued under the GNU General Public License
  - How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for ‘Weka’ on Google

9 Components of Weka
- Explorer lets you do various data mining tasks in an interactive, step-by-step way. It is usually the first choice.
- KnowledgeFlow lets you design configurations for streamed data processing.
- Experimenter lets you run classification and regression experiments in batch:
  - Different parameter settings
  - Various datasets
  - Comparison of models
  - Large-scale statistical experiments
- Simple CLI lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.
- Auxiliary tools are available in the menu.

10 Practice: Classifying Iris Flower
[Figure: Iris virginica, Iris versicolor, Iris setosa]
Features for Classification

11 Practice: Classifying Iris Flower
- Define features (or attributes)
  - Sepal length, sepal width, petal length, petal width
  - Class label: species of the iris flower (setosa, versicolor, or virginica)
- Collect samples to build a dataset
  - 50 samples from each of three species of Iris flowers
  - Data table: 150 samples (or instances) × 5 attributes
  - Fisher developed a linear discriminant model to distinguish the species from each other
- Classification algorithm
  - We will try neural networks, decision trees, and support vector machines
- Evaluating performance
  - Various metrics
  - Comparing learned models (algorithm + setting)

12 Terminology
- Features or attributes
  - Features are the individual measurable properties of the phenomena being observed
  - Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification
- Training set / test set
  - Training set: a set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier
  - Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier

13 Neural Networks
- MLP (Multilayer Perceptron)
  - In Weka: classifiers-functions-MultilayerPerceptron

14 Decision Trees
- J48 (Java implementation of C4.5)
  - In Weka: classifiers-trees-J48

15 Support Vector Machines
- SMO (sequential minimal optimization) for training SVM
  - In Weka: classifiers-functions-SMO

16 Practice Scenario
- Basic
  - Comparing the performances of algorithms: MultilayerPerceptron vs. J48 vs. SVM
  - Checking the trained model (structure & parameters)
  - Tuning parameters to get better models
  - Understanding ‘Test options’ & ‘Classifier output’ in Weka
- Advanced
  - Building committee machines using ‘meta’ algorithms for classification
  - Preprocessing / data manipulation: applying ‘Filter’
  - Batch experiment with ‘Experimenter’
  - Design & run a batch process with ‘KnowledgeFlow’

17 Dataset for Practice with Weka
- Just open “iris.arff” in the data folder of Weka

18 Data format for Weka (.ARFF)
Header:
    @RELATION iris
    @ATTRIBUTE sepallength REAL
    @ATTRIBUTE sepalwidth REAL
    @ATTRIBUTE petallength REAL
    @ATTRIBUTE petalwidth REAL
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Data (CSV format):
    @DATA
    5.1, 3.5, 1.4, 0.2, Iris-setosa
    4.9, 3.0, 1.4, 0.2, Iris-setosa
    4.7, 3.2, 1.3, 0.2, Iris-setosa
    …
    7.0, 3.2, 4.7, 1.4, Iris-versicolor
    6.4, 3.2, 4.5, 1.5, Iris-versicolor
    6.9, 3.1, 4.9, 1.5, Iris-versicolor
    …
Note: You can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file.
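The note about adding a header to a CSV file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `csv_to_arff` is hypothetical, the first CSV row is assumed to hold the column names, every non-class column is assumed to be numeric, and the class labels are taken from the data in first-seen order.

```python
import csv
import io


def csv_to_arff(csv_text, relation, class_column):
    """Turn CSV text (first row = column names) into ARFF text."""
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]
    names, data = rows[0], rows[1:]
    lines = ["@RELATION " + relation, ""]
    for index, name in enumerate(names):
        if name == class_column:
            # Nominal attribute: list the distinct labels in first-seen order.
            labels = list(dict.fromkeys(row[index] for row in data))
            lines.append("@ATTRIBUTE %s {%s}" % (name, ",".join(labels)))
        else:
            # Assumption: every non-class column is numeric.
            lines.append("@ATTRIBUTE %s REAL" % name)
    lines += ["", "@DATA"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Two hypothetical rows in the same layout as the iris data.
sample_csv = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""
arff_text = csv_to_arff(sample_csv, "iris", "class")
print(arff_text)
```

Files with missing values, quoted strings, or date attributes would need more care than this sketch takes.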

19 Neural Networks in Weka
- Load a file that contains the training data by clicking the ‘Open file’ button (‘ARFF’ and ‘CSV’ formats are readable)
- Click the ‘Classify’ tab
- Click the ‘Choose’ button
- Select ‘weka - classifiers - functions - MultilayerPerceptron’
- Click ‘MultilayerPerceptron’
- Set parameters for MLP
- Set parameters for Test
- Click ‘Start’ for learning

20 Some Notes on the Parameter Setting
- Parameter setting = car tuning
  - It takes much experience or many trials
  - You may get worse results if you are unlucky
- Multilayer Perceptron (MLP)
  - Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed
- J48
  - Main parameters: unpruned, numFolds, minNumObj
  - Many parameters control the size of the resulting tree, e.g. confidenceFactor and pruning
- SMO (SVM)
  - Main parameters: c (complexity parameter), kernel, kernel parameters

21 Test Options and Classifier Output
- Test options: setting the data set used for evaluation
- Classifier output: there are various metrics for evaluation

22 How to Evaluate the Performance? (1/2)
- Usually, build a ‘Confusion Matrix’ on the test data set
- Evaluation metrics
  - Accuracy (percent correct)
  - Precision
  - Recall
  - Many other metrics: F-measure, Kappa score, etc.
- For fair evaluation, the ‘cross-validation’ scheme is used
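To make the metrics above concrete, here is a minimal sketch that builds a multi-class confusion matrix and derives accuracy and Cohen's kappa from it. The function names and the six example predictions are made up for illustration; they are not output from Weka.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's kappa: agreement corrected for chance agreement."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    # Chance agreement from the row (actual) and column (predicted) totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


labels = ["setosa", "versicolor", "virginica"]
# Hypothetical test-set results: one versicolor is mistaken for virginica.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

matrix = confusion_matrix(actual, predicted, labels)
print(matrix, accuracy(matrix), kappa(matrix))
```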

23 How to Evaluate the Performance? (2/2)
- Confusion Matrix

                        Real Positive       Real Negative
  Prediction Positive   TP                  FP                    All with positive test
  Prediction Negative   FN                  TN                    All with negative test
                        All with disease    All without disease   Everyone

- Typically, as recall ↑ precision ↓; conversely, as recall ↓ precision ↑
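The trade-off between recall and precision can be seen by moving a decision threshold over classifier scores. The sketch below counts TP, FP, FN, and TN at two thresholds; the scores, labels, and function names are hypothetical and chosen only to show the effect.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    # Assumes at least one predicted positive and one real positive.
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical scores for six instances; label 1 means "real positive".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]

tp, fp, fn, tn = confusion_counts(labels, scores, threshold=0.75)
strict = precision_recall(tp, fp, fn)

tp, fp, fn, tn = confusion_counts(labels, scores, threshold=0.5)
loose = precision_recall(tp, fp, fn)

print("threshold 0.75: precision=%.2f recall=%.2f" % strict)
print("threshold 0.50: precision=%.2f recall=%.2f" % loose)
```

Lowering the threshold here catches the remaining real positive (recall rises) but also admits one false positive (precision falls).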

24 Evaluation Method - Cross Validation
- K-fold Cross Validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.
  - This is repeated k times, so that each subset serves as the test set exactly once.
[Figure: 6-fold cross validation over subsets D1 to D6, with a different subset held out in each round]
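The k-fold procedure can be sketched as index bookkeeping. This is a plain, unstratified split written for illustration (the function name `k_fold_splits` and the seed are assumptions); it is not a reproduction of how Weka itself partitions the data.

```python
import random


def k_fold_splits(n_instances, k, seed=1):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random division into subsets
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# 150 instances (as in the iris data) split for 6-fold cross validation.
splits = list(k_fold_splits(150, 6))
for round_number, (train, test) in enumerate(splits, start=1):
    print("round %d: %d training, %d test instances"
          % (round_number, len(train), len(test)))
```

Each instance lands in exactly one test set, so averaging the per-round scores uses every instance for testing once.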

25 Data Manipulation with Filter in Weka
- Attribute: selection, discretization
- Instance: re-sampling, selecting specified folds

26 Using Experimenter in Weka
- Tool for ‘Batch’ experiments
  - Click ‘New’
  - Set experiment type / iteration control
  - Set datasets / algorithms
  - Select the ‘Run’ tab and click ‘Start’
  - If it has finished successfully, click the ‘Analyse’ tab and see the summary

27 Usages of Experimenter
- Model selection for classification/regression
  - Various approaches
    - Repeated training/test set split
    - Repeated cross-validation (cf. double cross-validation)
    - Averaging
  - Comparison between models / algorithms
    - Paired t-test
    - On various metrics: accuracies / RMSE / etc.
- Batch and/or distributed processing
  - Load/save experiment settings
  - http://weka.wikispaces.com/Remote+Experiment
  - Multi-core support: utilize all the cores on a multi-core machine
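As a rough illustration of the paired t-test mentioned above, the sketch below computes the plain textbook t statistic from per-run accuracies of two algorithms evaluated on the same splits. The accuracy figures are invented, and the variant that the Experimenter actually applies may differ from this basic form.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Textbook paired t statistic over matched runs (same splits for A and B)."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    mean_difference = statistics.mean(differences)
    # Sample standard deviation of the differences (n - 1 in the denominator).
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_difference / standard_error


# Hypothetical percent-correct values from five matched runs.
accuracy_a = [96, 94, 95, 97, 93]
accuracy_b = [94, 93, 92, 95, 93]

t_value = paired_t_statistic(accuracy_a, accuracy_b)
print("t = %.3f with %d degrees of freedom" % (t_value, len(accuracy_a) - 1))
```

The statistic is then compared against a t distribution with n - 1 degrees of freedom to decide whether the difference is significant.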

28 KnowledgeFlow for Analysis Process Design
(cf. ‘Process Flow Diagram’ of SAS® Enterprise Miner)

29 KnowledgeFlow: Example Usage
- Decision tree (J48)


33 Simple CLI
- Example command and result
    java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"
- You can easily build a command line script for various experiments
- Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
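One way to build such a script is to generate one command line per parameter combination. The sketch below only produces the command strings and does not run them; it reuses the class name and the -L, -M, -N, -H, and -t options from the example command, while the parameter values, the helper name `mlp_commands`, and the relative path `iris.arff` are illustrative assumptions. Running the commands would also require Weka to be on the Java classpath.

```python
import itertools


def mlp_commands(arff_path, learning_rates, hidden_layers):
    """Build one MultilayerPerceptron command per parameter combination."""
    commands = []
    for rate, hidden in itertools.product(learning_rates, hidden_layers):
        commands.append(
            'java weka.classifiers.functions.MultilayerPerceptron '
            '-L %s -M 0.2 -N 500 -H %s -t "%s"' % (rate, hidden, arff_path)
        )
    return commands


# Hypothetical sweep: two learning rates and two hidden-layer settings.
commands = mlp_commands("iris.arff", [0.1, 0.3], ["a", "3"])
for command in commands:
    print(command)
```

Writing the printed lines to a batch or shell file gives a repeatable experiment script.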

34 Other ML Open Source Software
- KNIME (Konstanz Information Miner): http://www.knime.org/
- RapidMiner (with Weka as its core): http://rapid-i.com/index.php?lang=en
- TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html

35 General Information on Weka
- Current version (2012-2-3)
  - Stable version: 3.6.6
  - Developer version: 3.7.5

36 References
- Weka Wiki: http://weka.wikispaces.com/
  - Primer: good starting point
- Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
- Textbook
  - Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
- Articles
  - Data mining with WEKA, Part 1, Part 2, Part 3, in IBM Technical Library
  - Building a prediction program with Weka (Weka 를 이용한 예측프로그램 만들기), serialized in the monthly magazine 월간 마소, July, August, and September 2009 issues; also on the author's blog and MS Live blog

