Introduction to Weka ML Seminar for Rookies 2012-02-03 Byoung-Hee Kim Biointelligence Lab, Seoul National University.


Introduction to Weka ML Seminar for Rookies Byoung-Hee Kim Biointelligence Lab, Seoul National University

(C) , SNU Biointelligence Lab, 2 BI? (Predictive) Analytics Data Mining Machine Learning AI

Analytics as a Mainstream Technology
[Figure: Hype Cycle of Emerging Technologies 2010, Gartner]

Analytics as a Mainstream Technology

Components of Data Mining
- Input: concepts, instances, attributes; preparing the input
- Algorithm: inferring rules, statistical modeling, divide-and-conquer, association, linear models, …
- Output: knowledge representation (tables, trees, rules, clusters, …)
- Credibility: evaluating what's been learned (training and testing, performance, comparing algorithms)

Weka as a Must-Have Tool
Reviews on SourceForge.net:
- "I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool."
- "A must for anyone even marginally interested in machine learning and classification techniques."
- "One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus."

Agenda
- Introduction to Weka
  - General information
  - Components: Explorer, Experimenter, KnowledgeFlow, and CLI
- Classification practice with Weka
  - Problem: classifying iris flowers
  - Algorithms: Neural Networks, Decision Trees, Support Vector Machines
  - Evaluation criteria
  - Using Experimenter for batch experiments
  - Using KnowledgeFlow
  - CLI (command line interface) and batch scripts

General Information on Weka
- Weka: Data Mining Software in Java
  - Weka is a collection of machine learning algorithms for data mining and machine learning tasks
  - With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rule mining, and visualization
  - Weka is open source software issued under the GNU General Public License
  - How to get it: from the Weka website, or just search for ‘Weka’ on Google

Components of Weka
- Explorer lets you do various data mining tasks in an interactive, step-by-step way. It is usually the first choice.
- KnowledgeFlow lets you design configurations for streamed data processing.
- Experimenter lets you run classification and regression experiments in batches:
  - Different parameter settings
  - Various datasets
  - Comparison of models
  - Large-scale statistical experiments
- Simple CLI lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.
- Auxiliary tools are available in the menu.

Practice: Classifying Iris Flowers
[Figure: Iris virginica, Iris versicolor, Iris setosa; features for classification]

Practice: Classifying Iris Flowers
- Define features (or attributes)
  - Sepal length, sepal width, petal length, petal width
  - Class label: species of the iris flower (setosa, versicolor, or virginica)
- Collect samples to build a dataset
  - 50 samples from each of three species of Iris flowers
  - Data table: 150 samples (or instances) x 5 attributes
  - Fisher developed a linear discriminant model to distinguish the species from each other
- Classification algorithm
  - We will try neural networks, decision trees, and support vector machines
- Evaluating performance
  - Various metrics
  - Comparing learned models (algorithm + setting)

Terminology
- Features or attributes
  - Features are the individual measurable properties of the phenomena being observed
  - Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification
- Training set / test set
  - Training set: a set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier
  - Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier

Neural Networks
- MLP (Multilayer Perceptron)
  - In Weka: classifiers - functions - MultilayerPerceptron

Decision Trees
- J48 (Java implementation of C4.5)
  - In Weka: classifiers - trees - J48

Support Vector Machines
- SMO (sequential minimal optimization) for training SVMs
  - In Weka: classifiers - functions - SMO

Practice Scenario
- Basic
  - Comparing the performance of algorithms: MultilayerPerceptron vs. J48 vs. SVM
  - Checking the trained model (structure and parameters)
  - Tuning parameters to get better models
  - Understanding ‘Test options’ and ‘Classifier output’ in Weka
- Advanced
  - Building committee machines using ‘meta’ algorithms for classification
  - Preprocessing / data manipulation: applying ‘Filter’
  - Batch experiments with ‘Experimenter’
  - Designing and running a batch process with ‘KnowledgeFlow’

Dataset for Practice with Weka
- Just open “iris.arff” in the data folder of Weka

Data format for Weka
- Header: the relation name and one attribute declaration per column (sepallength, sepalwidth, petallength, petalwidth, class)
- Data (CSV format):
    5.1, 3.5, 1.4, 0.2, Iris-setosa
    4.9, 3.0, 1.4, 0.2, Iris-setosa
    4.7, 3.2, 1.3, 0.2, Iris-setosa
    …
    7.0, 3.2, 4.7, 1.4, Iris-versicolor
    6.4, 3.2, 4.5, 1.5, Iris-versicolor
    6.9, 3.1, 4.9, 1.5, Iris-versicolor
    …
- Note: you can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file
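The note about adding a header to a CSV file can be sketched in a few lines of Python. This is only an illustration: the helper name csv_to_arff, the relation name "iris", and the NUMERIC type declarations are assumptions made for the sketch, not something taken from the slide, and the two data rows are copied from the table above.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attribute_types):
    """Prepend an ARFF header to comma-separated data rows.

    attribute_types maps each column name to an ARFF type string, for
    example "NUMERIC" or a nominal specification such as "{a,b,c}".
    (Hypothetical helper, written only for this illustration.)
    """
    lines = ["@RELATION " + relation, ""]
    for name, arff_type in attribute_types.items():
        lines.append("@ATTRIBUTE {} {}".format(name, arff_type))
    lines += ["", "@DATA"]
    # Re-emit the rows with whitespace trimmed around each value.
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            lines.append(",".join(value.strip() for value in row))
    return "\n".join(lines) + "\n"


# Two rows taken from the slide; a real file would hold all 150 instances.
iris_rows = """5.1, 3.5, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
"""

# Column names and types are assumptions chosen to match the four
# measurements and the class label described in the slides.
iris_types = {
    "sepallength": "NUMERIC",
    "sepalwidth": "NUMERIC",
    "petallength": "NUMERIC",
    "petalwidth": "NUMERIC",
    "class": "{Iris-setosa,Iris-versicolor,Iris-virginica}",
}

arff_text = csv_to_arff(iris_rows, "iris", iris_types)
print(arff_text)
```

The nominal class attribute lists every allowed label in braces, which is why the header cannot be derived from column names alone.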

Neural Networks in Weka
- Load a file that contains the training data by clicking the ‘Open file’ button (‘ARFF’ and ‘CSV’ formats are readable)
- Click the ‘Classify’ tab
- Click the ‘Choose’ button
- Select ‘weka - classifiers - functions - MultilayerPerceptron’
- Click ‘MultilayerPerceptron’
- Set parameters for the MLP
- Set parameters for the test
- Click ‘Start’ for learning

Some Notes on the Parameter Setting
- Parameter setting = car tuning
  - It takes a lot of experience or many trials
  - You may get worse results if you are unlucky
- Multilayer Perceptron (MLP)
  - Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed
- J48
  - Main parameters: unpruned, numFolds, minNumObj
  - Many parameters control the size of the resulting tree, e.g. confidenceFactor and pruning
- SMO (SVM)
  - Main parameters: c (complexity parameter), kernel, kernel parameters

Test Options and Classifier Output
- Test options: setting the data set used for evaluation
- Classifier output: there are various metrics for evaluation

How to Evaluate the Performance? (1/2)
- Usually, build a ‘confusion matrix’ on the test data set
- Evaluation metrics
  - Accuracy (percent correct)
  - Precision
  - Recall
  - Many other metrics: F-measure, Kappa score, etc.
- For fair evaluation, the ‘cross-validation’ scheme is used

How to Evaluate the Performance? (2/2)
- Confusion matrix

                          Real: Positive     Real: Negative        Total
    Prediction: Positive  TP                 FP                    All with positive test
    Prediction: Negative  FN                 TN                    All with negative test
    Total                 All with disease   All without disease   Everyone

- As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
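The metrics named on the two evaluation slides follow directly from the four cells of the confusion matrix. The sketch below uses the standard textbook definitions; the function name binary_metrics and the example counts are made up for illustration and do not come from the slides or from any Weka output.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much is truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much was predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print("{}: {:.3f}".format(name, value))
```

With these counts, precision is 40/50 and recall is 40/45, so a classifier can score differently on the two even when its accuracy looks good.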

Evaluation Method - Cross Validation
- K-fold cross validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.
  - This is repeated k times so that each subset serves as the test set exactly once, and the results are averaged.
[Figure: 6-fold cross validation over subsets D1 to D6]
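The fold construction described on this slide can be sketched as follows. This is a plain, unstratified split written for illustration; the function name k_fold_splits and the fixed seed are assumptions, and Weka's own fold assignment may differ (for example, by balancing classes across folds).

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_instances))
    # Shuffle once so that the folds are random subsets of the data.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The training set is the union of the other k-1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 150 instances (the size of the iris data set) and 6 folds, as in the figure.
splits = list(k_fold_splits(n_instances=150, k=6))
for train, test in splits:
    print(len(train), len(test))
```

Every instance appears in exactly one test fold, so each one is used for testing once and for training k-1 times.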

Data Manipulation with Filter in Weka
- Attribute: selection, discretization
- Instance: re-sampling, selecting specified folds

Using Experimenter in Weka
- Tool for ‘batch’ experiments
  - Click ‘New’
  - Set the experiment type / iteration control
  - Set the datasets / algorithms
  - Select the ‘Run’ tab and click ‘Start’
  - If it has finished successfully, click the ‘Analyse’ tab and see the summary

Usages of Experimenter
- Model selection for classification/regression
  - Various approaches
    - Repeated training/test set split
    - Repeated cross-validation (cf. double cross-validation)
    - Averaging
  - Comparison between models / algorithms
    - Paired t-test
    - On various metrics: accuracy, RMSE, etc.
- Batch and/or distributed processing
  - Load/save experiment settings
  - Multi-core support: utilize all the cores on a multi-core machine
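To make the paired t-test item concrete, here is a sketch of the plain textbook statistic computed over per-run accuracies of two models evaluated on the same splits. The function name and the accuracy values are hypothetical, and the Experimenter's own tester may apply a correction for repeated resampling that this sketch does not include.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per run")
    # Work on the per-run differences, since the runs are paired.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies of two models on the same five runs.
acc_model_a = [0.96, 0.94, 0.98, 0.95, 0.97]
acc_model_b = [0.93, 0.94, 0.95, 0.92, 0.96]

t_value, dof = paired_t_statistic(acc_model_a, acc_model_b)
print("t = {:.3f} with {} degrees of freedom".format(t_value, dof))
```

The resulting t value is then compared against a t distribution with the returned degrees of freedom to decide whether the difference between the two models is significant.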

KnowledgeFlow for Analysis Process Design
(cf. the ‘Process Flow Diagram’ of SAS® Enterprise Miner)

KnowledgeFlow: Example Usage
- Decision tree (J48)

Simple CLI
- Example command:
    java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"
- You can easily build a command line script to run various experiments
- Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
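One way to build such a batch script is to generate one command per parameter setting. The sketch below reuses only the class name and the -L, -M, -N, -H and -t options that appear in the example command; the helper name mlp_commands, the relative path data/iris.arff, and the chosen learning rates are assumptions for illustration. It only builds and prints the commands, since actually running them would require Java and the Weka classes on the classpath.

```python
def mlp_commands(arff_path, learning_rates, momentum=0.2, epochs=500):
    """Build one MultilayerPerceptron command per learning rate.

    Each command is a list of arguments, which avoids shell quoting
    problems with paths that contain spaces.
    """
    commands = []
    for rate in learning_rates:
        commands.append([
            "java", "weka.classifiers.functions.MultilayerPerceptron",
            "-L", str(rate),        # learning rate
            "-M", str(momentum),    # momentum
            "-N", str(epochs),      # training time (epochs)
            "-H", "a",              # hidden layer setting, as in the slide
            "-t", arff_path,        # training file
        ])
    return commands


# Hypothetical path and learning rates, for illustration only.
commands = mlp_commands("data/iris.arff", [0.1, 0.3, 0.5])
for command in commands:
    print(" ".join(command))
```

Each argument list could be handed to subprocess.run to execute the experiments one after another and collect their output.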

Other ML Open Source Software
- KNIME: Konstanz Information Miner
- RapidMiner: with Weka as its core
- TANAGRA: lyon2.fr/~ricco/tanagra/en/tanagra.html

General Information on Weka
- Current version ( )
  - Stable version:
  - Developer version:

References
- Weka Wiki; the Primer is a good starting point
- Weka online documentation
- Textbook: Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan
- Articles
  - Data mining with WEKA, Part 1, Part 2, Part 3, in the IBM Technical Library
  - Building a prediction program with Weka (Weka 를 이용한 예측프로그램 만들기), serialized in the monthly magazine 마소, July, August, and September 2009 issues; blog, MS Live