Introduction to Weka ML Seminar for Rookies 2012-02-03 Byoung-Hee Kim Biointelligence Lab, Seoul National University.


Components of Data Mining
Input -> Algorithm -> Output -> Credibility
- Input: concepts, instances, attributes; preparing the input
- Algorithm: inferring rules, statistical modeling, divide-and-conquer, association, linear models, …
- Output: knowledge representation (tables, trees, rules, clusters, …)
- Credibility: evaluating what's been learned (training and testing, performance, comparing algorithms)
(C) 2007-2012, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Weka as a Must-Have Tool
Reviews on Sourceforge.net:
- "I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool."
- "A must for anyone even marginally interested in machine learning and classification techniques."
- "One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus."

Agenda
- Introduction to Weka
    - General information
    - Components: Explorer, Experimenter, KnowledgeFlow, and CLI
- Classification practice with Weka
    - Problem: classifying iris flowers
    - Algorithms: Neural Networks, Decision Trees, Support Vector Machines
    - Evaluation criteria
    - Using Experimenter for batch experiments
    - Using KnowledgeFlow
    - CLI (command line interface) and batch scripts

General Information on Weka
- Weka: Data Mining Software in Java
    - Weka is a collection of machine learning algorithms for data mining and machine learning tasks
    - With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rule mining, and visualization
    - Weka is open-source software released under the GNU General Public License
    - How to get it: http://www.cs.waikato.ac.nz/ml/weka/ (or just search for 'Weka' on Google)

Components of Weka
- Explorer: lets you carry out various data mining tasks in an interactive, step-by-step way. Usually the first choice.
- KnowledgeFlow: lets you design configurations for streamed data processing.
- Experimenter: lets you run classification and regression experiments in batch
    -Different parameter settings
    -Various datasets
    -Comparison of models
    -Large-scale statistical experiments
- Simple CLI: the command-line interface that lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.
- Auxiliary tools are available in the menu.

Practice: Classifying Iris Flowers
- Define features (or attributes)
    - Sepal length, sepal width, petal length, petal width
    - Class label: species of the iris flower (setosa, versicolor, or virginica)
- Collect samples to build a dataset
    - 50 samples from each of three species of Iris flowers
    - Data table: 150 samples (or instances) × 5 attributes
    - Fisher developed a linear discriminant model to distinguish the species from each other
- Classification algorithm
    - We will try neural networks, a decision tree, and a support vector machine
- Evaluating performance
    - Various metrics
    - Comparing learned models (algorithm + setting)

Terminology
- Features or Attributes
    - Features are the individual measurable properties of the phenomena being observed
    - Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification
- Training set / Test set
    - Training set: a set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier
    - Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier

Practice Scenario
- Basic
    - Comparing the performance of algorithms: MultilayerPerceptron vs. J48 vs. SVM
    - Checking the trained model (structure & parameters)
    - Tuning parameters to get better models
    - Understanding 'Test options' & 'Classifier output' in Weka
- Advanced
    - Building committee machines using 'meta' algorithms for classification
    - Preprocessing / data manipulation: applying 'Filter'
    - Batch experiment with 'Experimenter'
    - Designing & running a batch process with 'KnowledgeFlow'

Data format for Weka (.ARFF)

Header:
    @RELATION iris
    @ATTRIBUTE sepallength REAL
    @ATTRIBUTE sepalwidth REAL
    @ATTRIBUTE petallength REAL
    @ATTRIBUTE petalwidth REAL
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Data (CSV format):
    @DATA
    5.1, 3.5, 1.4, 0.2, Iris-setosa
    4.9, 3.0, 1.4, 0.2, Iris-setosa
    4.7, 3.2, 1.3, 0.2, Iris-setosa
    …
    7.0, 3.2, 4.7, 1.4, Iris-versicolor
    6.4, 3.2, 4.5, 1.5, Iris-versicolor
    6.9, 3.1, 4.9, 1.5, Iris-versicolor
    …

Note: You can easily generate an 'arff' file by adding a header to an ordinary CSV text file.
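The note about adding a header to a CSV file can be sketched in a few lines of Python. This is only an illustration, not part of Weka: the function name `csv_to_arff` and its arguments are hypothetical, and it assumes a headerless CSV in which every column except the last is numeric and the last column is the nominal class label.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attribute_names):
    """Prepend an ARFF header to headerless CSV rows.

    Illustrative helper (not a Weka API). Assumes every column except the
    last is numeric (REAL) and the last column is the nominal class label.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    rows = [[cell.strip() for cell in row] for row in rows]

    # Collect class labels in order of first appearance.
    labels = []
    for row in rows:
        if row[-1] not in labels:
            labels.append(row[-1])

    lines = ["@RELATION " + relation, ""]
    for name in attribute_names[:-1]:
        lines.append("@ATTRIBUTE " + name + " REAL")
    lines.append("@ATTRIBUTE " + attribute_names[-1] + " {" + ",".join(labels) + "}")
    lines += ["", "@DATA"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Two rows taken from the slide's iris excerpt.
sample_csv = """5.1, 3.5, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
"""
arff_text = csv_to_arff(
    sample_csv,
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"],
)
print(arff_text)
```

Because the class values are collected from the data, the nominal attribute lists only the labels that actually occur in the CSV.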

Neural Networks in Weka
1. Load a file that contains the training data by clicking the 'Open file' button ('ARFF' and 'CSV' formats are readable)
2. Click the 'Classify' tab
3. Click the 'Choose' button
4. Select 'weka - classifiers - functions - MultilayerPerceptron'
5. Click 'MultilayerPerceptron' and set the parameters for the MLP
6. Set the parameters for the test ('Test options')
7. Click 'Start' to begin learning

Some Notes on Parameter Setting
- Parameter Setting = Car Tuning
    - It takes much experience or many trials
    - You may get worse results if you are unlucky
- Multilayer Perceptron (MLP)
    - Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed
- J48
    - Main parameters: unpruned, numFolds, minNumObj
    - Many parameters control the size of the resulting tree, e.g. confidenceFactor and pruning
- SMO (SVM)
    - Main parameters: c (complexity parameter), kernel, kernel parameters

How to Evaluate the Performance? (1/2)
- Usually, build a 'Confusion Matrix' on the test data set
- Evaluation metrics
    - Accuracy (percent correct)
    - Precision
    - Recall
    - Many other metrics: F-measure, Kappa score, etc.
- For fair evaluation, the 'cross-validation' scheme is used

How to Evaluate the Performance? (2/2)
- Confusion Matrix

                          Real: Positive     Real: Negative        Total
    Prediction: Positive  TP                 FP                    All with positive test
    Prediction: Negative  FN                 TN                    All with negative test
    Total                 All with disease   All without disease   Everyone

- Trade-off: as recall goes up, precision typically goes down; conversely, as recall goes down, precision typically goes up
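The metrics named on these slides follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions for the two-class case; the function name `binary_metrics` is hypothetical and the counts are made up purely for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much is really positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything really positive, how much was predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a three-class problem such as iris, the same formulas are usually applied per class, treating one class as positive and the other two as negative.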

Evaluation Method - Cross Validation
- K-fold Cross Validation
    - The data set is randomly divided into k subsets.
    - One of the k subsets is used as the 'test set' and the other k-1 subsets are put together to form a 'training set'.
    - This is repeated k times so that each subset serves as the test set exactly once, and the results are averaged.
[Figure: 6-fold cross validation. The data is split into subsets D1 … D6; in each round one subset (D6, then D5, …, then D1) is held out as the test set and the remaining five form the training set.]
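The splitting step of k-fold cross validation can be sketched as follows. This is a plain, unstratified version written for illustration; the function name `k_fold_splits` and the `seed` argument are hypothetical, and it is not a description of how Weka implements cross validation internally.

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 150 instances (the size of the iris data set) and 6 folds, as in the figure.
splits = list(k_fold_splits(n_instances=150, k=6))
for round_no, (train, test) in enumerate(splits, start=1):
    print(f"round {round_no}: {len(train)} training, {len(test)} test instances")
```

Each instance index appears in exactly one test fold, so every sample is used for testing once and for training in the other k-1 rounds.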

Using Experimenter in Weka
- Tool for 'batch' experiments
1. Click 'New'
2. Set the experiment type and iteration control
3. Set the datasets and algorithms
4. Select the 'Run' tab and click 'Start'
5. If it has finished successfully, click the 'Analyse' tab and see the summary

Usages of Experimenter
- Model selection for classification/regression
    - Various approaches
        - Repeated training/test set split
        - Repeated cross-validation (cf. double cross-validation)
        - Averaging
    - Comparison between models / algorithms
        - Paired t-test
        - On various metrics: accuracy, RMSE, etc.
- Batch and/or distributed processing
    - Load/save experiment settings
    - Remote experiments: http://weka.wikispaces.com/Remote+Experiment
    - Multi-core support: utilize all the cores on a multi-core machine
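To make the "paired t-test" comparison concrete, here is a minimal sketch of the textbook paired t statistic computed over matched runs of two models. The function name and the accuracy values are hypothetical, and the tester built into the Experimenter may differ from this textbook form.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a standard paired t-test."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-run accuracies (percent) for two hypothetical models,
# evaluated on the same train/test splits so the runs are paired.
acc_model_a = [96.0, 94.0, 97.0, 95.0, 96.0]
acc_model_b = [93.0, 94.0, 95.0, 92.0, 94.0]
t_value, dof = paired_t_statistic(acc_model_a, acc_model_b)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

The resulting t value is compared against a t distribution with the given degrees of freedom to decide whether the difference between the two models is significant.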

Simple CLI
- Example command:
    java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"
- You can easily build a command-line script for various experiments
- Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
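One way to build such a script is to generate one command line per parameter setting. The sketch below reuses the options from the example command and varies only the learning rate (`-L`); the helper name `build_mlp_commands` and the dataset path `data/iris.arff` are placeholders. It only prints the commands. Running them would require Java and the Weka classes on the classpath, and `shlex.join` quotes arguments POSIX-style, so Windows shells may need different quoting.

```python
import shlex


def build_mlp_commands(dataset_path, learning_rates, momentum=0.2, epochs=500):
    """Build one Weka MultilayerPerceptron command line per learning rate."""
    commands = []
    for rate in learning_rates:
        args = [
            "java", "weka.classifiers.functions.MultilayerPerceptron",
            "-L", str(rate),      # learning rate (varied per run)
            "-M", str(momentum),  # momentum
            "-N", str(epochs),    # training time in epochs
            "-V", "0", "-S", "0", "-E", "20", "-H", "a",  # as in the example command
            "-t", dataset_path,   # training file
        ]
        commands.append(shlex.join(args))
    return commands


# Hypothetical dataset location; adjust to where iris.arff actually lives.
commands = build_mlp_commands("data/iris.arff", [0.1, 0.3, 0.5])
for command in commands:
    print(command)
```

The printed lines can be pasted into a shell or batch file, one experiment per line.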

Other Open-Source ML Software
- KNIME (Konstanz Information Miner): http://www.knime.org/
- RapidMiner (with Weka as its core): http://rapid-i.com/index.php?lang=en
- TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html

References
- Weka Wiki: http://weka.wikispaces.com/
    - Primer: a good starting point
- Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
- Textbook
    - Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
- Articles
    - Data mining with WEKA, Part 1, Part 2, Part 3, in the IBM Technical Library
    - "Building a prediction program with Weka" (Weka 를 이용한 예측프로그램 만들기), a series in Monthly Maso (월간 마소), July, August, and September 2009 issues; also available on the author's blog and MS Live blog
