Comparing Supervised Machine Learning algorithms using WEKA & Orange Abhishek Iyer.

What is Machine Learning?  In 1959, Arthur Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed  A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E (Tom Mitchell's definition)  Why use machine learning?

Applications of Machine Learning Today  Bioinformatics  Cheminformatics  Data Mining  Character Recognition  Spam Detection  Pattern Recognition  Speech Recognition  Smart Ads  Game Playing, and so on…

Types of Machine Learning Algorithms  Supervised Learning  Unsupervised Learning  Semi-Supervised Learning  Reinforcement Learning

Supervised Learning

Sad state of affairs – Too many options  Linear/Polynomial Regression  Logistic Regression  K-nearest neighbor  Neural Nets  Decision Trees  SVMs  Naïve Bayes  Inductive Logic Programming  …

Purpose of research paper  The primary focus of this research paper is comparing supervised learning algorithms.  Comparisons will be made between Decision Trees, Artificial Neural Networks, Support Vector Machines (SVM), Naïve Bayes, Logistic Regression and k-nearest neighbor to decide which algorithm is the most efficient.  The research will also give insight into the different domains where a certain algorithm performs better than the others.  WEKA, Orange and TunedIT.org are used to compare the algorithms.

Data Sets  Data sets should cover all possible fields of application and not favor any algorithm  Primary source for the data sets: the UCI repository  A number of different data sets were used over the entire experiment  Used the cross-validation option with 10 folds, which partitions each data set into 10 subsamples  Scalability
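The 10-fold partitioning performed by the cross-validation option can be sketched in plain Python (an illustrative sketch, not WEKA's or Orange's implementation; the helper name `kfold_indices` is hypothetical):

```python
def kfold_indices(n, k=10):
    """Partition indices 0..n-1 into k roughly equal folds.

    Each fold serves once as the held-out test set while the
    remaining k-1 folds form the training set.
    """
    return [list(range(n))[i::k] for i in range(k)]

folds = kfold_indices(100, k=10)
assert len(folds) == 10                     # 10 folds
assert sum(len(f) for f in folds) == 100    # every instance used exactly once
```

Averaging the metric over the 10 held-out folds gives a less optimistic estimate than testing on the training data itself.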

Method Followed  Select a data set  Select an algorithm  Generate the confusion matrix and precision values for the algorithm/data set pair using WEKA, Orange and tunedit.org  Repeat the procedure for each algorithm under consideration  Repeat the procedure for all the data sets  Compare the results with the available research material
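The confusion-matrix and precision step of the method can be sketched in plain Python (WEKA and Orange compute these internally; the helper names and the toy labels below are illustrative):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (actual, predicted) label pairs into a nested dict."""
    m = {a: {p: 0 for p in labels} for a in labels}
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def precision(matrix, label):
    """Precision for one class: TP / (TP + FP)."""
    tp = matrix[label][label]
    predicted = sum(matrix[a][label] for a in matrix)
    return tp / predicted if predicted else 0.0

y_true = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]
m = confusion_matrix(y_true, y_pred, ["pos", "neg"])
assert m["pos"]["pos"] == 2          # true positives
assert precision(m, "pos") == 2 / 3  # 2 TP out of 3 predicted positives
```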

Confusion Matrix Details

Precision Values  Data sets: Lung Cancer, Oral Cancer, Blood Cancer  Algorithms: Naïve Bayes, Logistic Regression

Demo  Boosting  Lift

Results: General  Metrics: Precision, F-measure, Lift, TP Rate, Recall, Mean  Algorithms: ANN, BST-DT, SVM, KNN, LOG-REG, NAÏVE-B
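All of the metrics in this comparison can be derived from the binary confusion counts; a plain-Python sketch with illustrative counts (not the experiment's actual numbers):

```python
def metrics(tp, fp, fn, tn):
    """Derive precision, recall/TP rate, F-measure and lift
    from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # identical to TP rate
    f_measure = 2 * precision * recall / (precision + recall)
    base_rate = (tp + fn) / n          # fraction of actual positives
    lift = precision / base_rate       # gain over labeling at random
    return precision, recall, f_measure, lift

p, r, f, lift = metrics(tp=40, fp=10, fn=10, tn=40)
assert p == 0.8 and r == 0.8
assert abs(f - 0.8) < 1e-9
assert lift == 1.6   # predictions are 1.6x better than the base rate
```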

Are We Done? Not Quite  Metrics: Precision, F-measure, Lift, TP Rate, Recall, Mean  Rows: BEST, ANN, BST-DT, SVM, KNN, LOG-REG, NAÏVE-B

Analysis of Algorithms  Algorithms: Boosted Decision Tree, ANN, Naïve Bayes, KNN, SVM, Logistic Regression  Criteria (star ratings): Accuracy, Speed of Learning, Tolerance, Overfitting, Incremental Learning, Speed of Classification, Average

Results: Domain Specific
 Bioinformatics: best ANN, worst Logistic Regression
 Game Playing: best ANN, worst Naïve Bayes
 Data Mining: best SVM, worst Naïve Bayes
 Spam Detection: best ANN, worst Naïve Bayes
 Medicine: best Boosted Decision Tree, worst not clear
 Character Recognition: best ANN, worst not clear
 Physics & Scientific Research: best SVM, worst Logistic Regression

Why? Why? Why?  Naïve Bayes performs well when classifying instances based on prior probabilistic knowledge  SVM is excellent when the training data is separable in the vector space, but it takes a long training time  Artificial Neural Networks are best used where the data is dynamic and vast and there are too many possibilities involved; efficient due to their heuristic implementation  K-nearest neighbor is consistent in almost every area, and excellent in areas where networking is involved  Decision Trees are simple to learn and implement, and very efficient

Challenges Faced  Enormous data  Understanding the machine learning algorithms  Type conversion to the .arff format for WEKA and the .tab or .si format for Orange  Overfitting: the model describes random noise instead of the underlying relationship, due to an excessively complex model with too many parameters  Inductive bias: the assumptions used to predict outputs for inputs not encountered in the training set  Class membership probabilities: uncertainty in classification due to low probabilistic confidence
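The overfitting challenge above can be reproduced in a few lines of plain Python: a 1-nearest-neighbor model memorizes noisy training labels perfectly, yet generalizes worse to fresh data drawn from the same rule (a toy sketch; the data and model are illustrative, not from the experiment):

```python
import random

random.seed(0)

def noisy_label(x):
    """True rule is x > 0.5, but 30% of labels are randomly flipped."""
    y = x > 0.5
    return (not y) if random.random() < 0.3 else y

train_x = [random.random() for _ in range(50)]
train_y = [noisy_label(x) for x in train_x]
test_x = [random.random() for _ in range(50)]
test_y = [noisy_label(x) for x in test_x]

def predict_1nn(x):
    """Memorize the training set: return the label of the nearest point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

train_acc = sum(predict_1nn(x) == y for x, y in zip(train_x, train_y)) / 50
test_acc = sum(predict_1nn(x) == y for x, y in zip(test_x, test_y)) / 50
assert train_acc == 1.0      # the model reproduces the label noise perfectly
assert test_acc < train_acc  # but generalizes worse: overfitting
```

Cross-validation (as used in the experiment) exposes exactly this gap, which is why it is preferred over training-set accuracy.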

The Future  The key question when dealing with machine learning is not whether a learning algorithm is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem.  After gaining a better understanding of the strengths and limitations of each method, the possibility of integrating two or more algorithms to solve a problem should be investigated.  Improve the running time of the algorithms  Possibly build a super-powerful algorithm that could be used in every domain and for any data set

Questions?

References  Andrew Ng, "Machine Learning CS 229", Stanford University  Rich Caruana and Alexandru Niculescu-Mizil, "An Empirical Comparison of Supervised Learning Algorithms", Department of Computer Science, Cornell University, 2006  T. Yang, "Computational Verb Decision Trees", Yang's Scientific Press, 2006  Alexandru Niculescu-Mizil and Rich Caruana, "Obtaining Calibrated Probabilities from Boosting", Department of Computer Science, Cornell University, 2005  Aik Choon Tan and David Gilbert, "An empirical comparison of supervised machine learning techniques in bioinformatics", Department of Computer Science, University of Glasgow, 2003  Sameera Mahajani, "Comparing Data mining techniques for cancer classification", Department of Computer Science, CSU Chico

Annotated Bibliography  1. Aik Choon Tan and David Gilbert, "An empirical comparison of supervised machine learning techniques in bioinformatics", Department of Computer Science, University of Glasgow, 2003. This paper presents the theory and research behind applying supervised machine learning techniques to bioinformatics and the classification of biological data. It shows, with sufficient practical evidence, that none of the supervised machine learning algorithms perform consistently well over the data sets selected. Its observations indicate that a combination of machine learning techniques performs much better than the individual ones, and that performance is highly dependent on the type of training data. The paper also suggests some important points to consider when selecting a supervised machine learning algorithm for a data set.  2. Rich Caruana and Alexandru Niculescu-Mizil, "An Empirical Comparison of Supervised Learning Algorithms", Department of Computer Science, Cornell University, 2006. This paper performs a very detailed empirical comparison of ten machine learning algorithms using eight performance criteria, over a variety of data sets, with well-documented results. It concludes that calibrated boosted trees are the best supervised learning algorithm, followed by random forests, SVMs and neural networks. The paper also discusses recently introduced supervised learning algorithms and how they are more efficient than the older ones, examines the significant variations seen across problems and metrics, and evaluates the situations under which such performance fluctuations occur.