
1 Comparing Supervised Machine Learning algorithms using WEKA & Orange Abhishek Iyer

2 What is Machine Learning?  In 1959, Arthur Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.  A computer is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.  Why use machine learning?

3 Applications of Machine Learning Today
 Bioinformatics
 Cheminformatics
 Data Mining
 Character Recognition
 Spam Detection
 Pattern Recognition
 Speech Recognition
 Smart Ads
 Game Playing, and so on…

4 Types of Machine Learning Algorithms
 Supervised Learning
 Unsupervised Learning
 Semi-Supervised Learning
 Reinforcement Learning

5 Supervised Learning

6 Sad state of affairs – Too many options
 Linear/Polynomial Regression
 Logistic Regression
 K-Nearest Neighbor
 Neural Nets
 Decision Trees
 SVMs
 Naïve Bayes
 Inductive Logic Programming
 …
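Most of the options on this list share a common fit/predict interface in modern toolkits. As a minimal sketch (assuming scikit-learn rather than the WEKA/Orange tooling used in this deck), several of them can be trained and scored on one dataset with identical code:

```python
# Sketch: several supervised learners behind one interface (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)                      # train on the held-in split
    print(name, round(model.score(X_test, y_test), 3))  # accuracy on held-out split
```

The uniform interface is exactly what makes an empirical comparison like this one practical.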

7 Purpose of research paper
 The primary focus of this research paper is comparing supervised learning algorithms.
 Comparisons will be made between Decision Trees, Artificial Neural Networks, Support Vector Machines (SVM), Naïve Bayes, Logistic Regression, and k-Nearest Neighbor to decide which algorithm is the most efficient.
 The research will also give insight into the different domains where a certain algorithm performs better than the others.
 WEKA, Orange, and TunedIT.org will be used to compare the algorithms.

8 Data Sets
 Data sets should cover all possible fields of application and not favor any algorithm.
 Primary source for the data sets: the UCI repository.
 Approximately 55–60 different data sets were used during the entire experiment.
 Used the cross-validation option with 10 folds; this option partitions each data set into 10 subsamples, each used once for testing while the remaining folds train the model.
 Scalability
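The 10-fold cross-validation described above can be sketched as follows (assuming scikit-learn stands in for the WEKA/Orange cross-validation option; the dataset here is a UCI set bundled with the library):

```python
# Sketch of 10-fold cross-validation: the data is split into 10 subsamples,
# and each subsample serves once as the test fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)  # a UCI-repository dataset
scores = cross_val_score(GaussianNB(), X, y, cv=10)  # one score per fold
print(len(scores), round(scores.mean(), 3))
```

Averaging the per-fold scores gives a more stable estimate than a single train/test split, which matters when comparing algorithms across 55–60 data sets.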

9 Method Followed
 Select a data set.
 Select an algorithm.
 Generate the confusion matrix and precision values for the algorithm–data set pair using WEKA, Orange, and tunedit.org.
 Repeat the procedure for each algorithm under consideration.
 Repeat the procedure for all the data sets.
 Compare my results with the research material available.
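The evaluation loop above can be sketched in code: every (data set, algorithm) pair gets a cross-validated precision score, collected into one results table. This is a scikit-learn stand-in for the WEKA/Orange workflow; the specific dataset and the two algorithms shown are illustrative.

```python
# Sketch of the method: loop over (data set, algorithm) pairs,
# recording 10-fold cross-validated precision for each pair.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

datasets = {"breast_cancer": load_breast_cancer(return_X_y=True)}
algorithms = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

results = {}
for dname, (X, y) in datasets.items():
    for aname, algo in algorithms.items():
        scores = cross_val_score(algo, X, y, cv=10, scoring="precision")
        results[(dname, aname)] = scores.mean()
print(results)
```

Extending `datasets` to the full UCI collection turns this into the complete experiment grid.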

10 Confusion Matrix Details
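A small worked example of what such a confusion matrix contains (scikit-learn assumed; the labels below are toy data, not from the experiment):

```python
# For binary labels, confusion_matrix returns rows = actual class,
# columns = predicted class:
#   cm[0][0] = true negatives,  cm[0][1] = false positives
#   cm[1][0] = false negatives, cm[1][1] = true positives
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

All of the metrics reported later (precision, recall, TP rate, lift) are derived from these four counts.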

11 Precision Values

Data Set              Lung Cancer   Oral Cancer   Blood Cancer
Naïve Bayes           0.894         0.837         0.974
Logistic Regression   0.737         0.607         0.962
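Precision values like those above come straight from confusion-matrix counts: precision = TP / (TP + FP). The counts below are hypothetical, chosen only to reproduce the scale of the Naïve Bayes lung-cancer cell:

```python
# Precision: of everything the classifier labeled positive,
# what fraction actually was positive?
def precision(tp, fp):
    return tp / (tp + fp)

# Hypothetical counts: 894 true positives, 106 false positives.
print(precision(894, 106))  # -> 0.894
```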

12 Demo: Boosting and Lift
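The original demo's tooling isn't recorded in the transcript. As a stand-in, boosting can be sketched with AdaBoost over decision stumps, and lift as precision relative to the base rate of positives (scikit-learn assumed; the dataset is illustrative):

```python
# Boosting: combine many weak learners (decision stumps by default)
# into a strong one, reweighting misclassified examples each round.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
boost_scores = cross_val_score(AdaBoostClassifier(n_estimators=100), X, y, cv=10)
print(round(boost_scores.mean(), 3))

# Lift: how much better the classifier's precision is than labeling
# positives at random (the base rate of positives in the data).
def lift(tp, fp, fn, tn):
    prec = tp / (tp + fp)
    base_rate = (tp + fn) / (tp + fp + fn + tn)
    return prec / base_rate

print(lift(3, 1, 1, 3))  # precision 0.75 over base rate 0.5 -> 1.5
```

A lift of 1.0 means the model is no better than chance; the values near 0.95 in the results table indicate precision close to its ceiling on mostly-positive folds.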

13 Results: General

ALGO      Precision  F-measure  Lift   TP Rate  Recall  Mean
ANN       0.957      0.871      0.958  0.892    0.826   0.901
BST-DT    0.958      0.854      0.956  0.919    0.808   0.899
SVM       0.926      0.851      0.947  0.882    0.769   0.875
KNN       0.893      0.820      0.914  0.786    0.706   0.824
LOG-REG   0.832      0.823      0.929  0.714    0.778   0.7946
NAÏVE-B   0.733      0.615      0.786  0.539    0.885   0.712

14 Are We Done? Not Quite

ALGO      Precision  F-measure  Lift   TP Rate  Recall  Mean
BEST      0.958      0.871      0.958  0.919    0.885   0.9182
ANN       0.957      0.871      0.958  0.892    0.826   0.901
BST-DT    0.958      0.854      0.956  0.919    0.808   0.899
SVM       0.926      0.851      0.947  0.882    0.769   0.875
KNN       0.893      0.820      0.914  0.786    0.706   0.824
LOG-REG   0.832      0.823      0.929  0.714    0.778   0.7946
NAÏVE-B   0.733      0.615      0.786  0.539    0.885   0.712
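The F-measure column is the harmonic mean of precision and recall, F = 2PR/(P+R). A quick sanity-check function is below; note that since the tables report columns averaged over many data sets, a row's F-measure need not exactly equal F computed from that row's summary P and R:

```python
# F-measure (F1): harmonic mean of precision and recall.
def f_measure(p, r):
    return 2 * p * r / (p + r)

# Applied to the ANN row's summary precision/recall:
print(round(f_measure(0.957, 0.826), 3))  # -> 0.887 (table reports 0.871)
```

The gap between 0.887 and the tabulated 0.871 is expected when averaging per-dataset F scores rather than recomputing F from averaged P and R.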

15 Analysis of Algorithms

Each algorithm was rated on a star scale across six criteria: Accuracy, Speed of Learning, Tolerance, Overfitting, Incremental Learning, and Speed of Classification. Average scores:

ALGO                    Average
Boosted Decision Tree   3.2
ANN                     2.7
Naïve Bayes             2.8
KNN                     2.7
SVM                     2.3
Logistic Regression     2.1

16 Results: Domain Specific

Application                     Best Algorithm          Worst Algorithm
Bioinformatics                  ANN                     Logistic Regression
Game Playing                    ANN                     Naïve Bayes
Data Mining                     SVM                     Naïve Bayes
Spam Detection                  ANN                     Naïve Bayes
Medicine                        Boosted Decision Tree   Not Clear
Character Recognition           ANN                     Not Clear
Physics & Scientific Research   SVM                     Logistic Regression

17 Why? Why? Why?
 Naïve Bayes performs well when classifying instances based on prior probabilistic knowledge.
 SVMs are excellent when the training data is separable in the vector space, but they take a long training time.
 Artificial Neural Networks are best used where the data is dynamic and vast and there are too many possibilities involved; they are efficient due to their heuristic implementation.
 K-Nearest Neighbor is consistent in almost every area, and excellent in areas where networking is involved.
 Decision Trees are simple to learn and implement, and very efficient.

18 Challenges Faced
 Enormous amounts of data.
 Understanding the machine learning algorithms.
 Type conversion to the .arff format for WEKA and the .tab or .si format for Orange.
 Overfitting: the model describes random noise instead of the underlying relationship, due to an excessively complex model with too many parameters.
 Inductive bias: the assumptions used to predict outputs for inputs that were not encountered in the training set.
 Class membership probabilities: uncertainty in classification due to low probabilistic confidence.
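The type-conversion challenge mentioned above can be sketched as a small hand-rolled CSV-to-ARFF writer. This is a hedged illustration, not WEKA's own converter: it assumes all feature columns are numeric and the last column is the nominal class, which is the simplest common case.

```python
# Minimal CSV -> .arff converter sketch (assumes numeric features,
# nominal class label in the last column).
import csv

def csv_to_arff(csv_path, arff_path, relation="dataset"):
    with open(csv_path) as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write(f"@RELATION {relation}\n\n")
        for name in header[:-1]:                     # feature attributes
            out.write(f"@ATTRIBUTE {name} NUMERIC\n")
        classes = sorted({r[-1] for r in data})      # nominal class values
        out.write(f"@ATTRIBUTE {header[-1]} {{{','.join(classes)}}}\n\n")
        out.write("@DATA\n")
        for r in data:
            out.write(",".join(r) + "\n")
```

A real pipeline would also need to handle missing values (`?` in ARFF), string attributes, and quoting, which is where much of the conversion effort goes.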

19 The Future
 The key question when dealing with machine learning is not whether one learning algorithm is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem.
 After a better understanding of the strengths and limitations of each method, the possibility of integrating two or more algorithms to solve a problem should be investigated.
 Improve the running time of the algorithms.
 Possibly build a super-powerful algorithm that could be used in every domain and on any data set.

20 Questions?

21 References
 Andrew Ng, "Machine Learning CS 229", Stanford University, 2009.
 Rich Caruana and Alexandru Niculescu-Mizil, "An Empirical Comparison of Supervised Learning Algorithms", Department of Computer Science, Cornell University, 2006.
 T. Yang, "Computational Verb Decision Trees", Yang's Scientific Press, 2006.
 Rich Caruana and Alexandru Niculescu-Mizil, "Obtaining Calibrated Probabilities from Boosting", Department of Computer Science, Cornell University, 2006.
 Aik Choon Tan and David Gilbert, "An empirical comparison of supervised machine learning techniques in bioinformatics", Department of Computer Science, University of Glasgow, 2003.
 Sameera Mahajani, "Comparing Data Mining Techniques for Cancer Classification", Department of Computer Science, CSU Chico.

22 Annotated Bibliography

 1. Aik Choon Tan and David Gilbert, "An empirical comparison of supervised machine learning techniques in bioinformatics", Department of Computer Science, University of Glasgow, 2003.
This paper presents the theory and research behind applying supervised machine learning techniques to bioinformatics and the classification of biological data. The paper provides practical evidence that none of the supervised machine learning algorithms performs consistently well over the selected data sets. Its observations show that a combination of machine learning techniques performs much better than individual ones, and that performance is highly dependent on the type of training data. The paper also suggests some important points to consider when selecting a supervised machine learning algorithm for a data set.

 2. Rich Caruana and Alexandru Niculescu-Mizil, "An Empirical Comparison of Supervised Learning Algorithms", Department of Computer Science, Cornell University, 2006.
This paper performs a very detailed empirical comparison of ten machine learning algorithms using eight performance criteria over a variety of data sets, with well-documented results. It finds that calibrated boosted trees are the best supervised learning algorithm, followed by random forests, SVMs, and neural networks. The paper also discusses recently introduced supervised learning algorithms and how they compare to older ones, examines the significant variations seen across problems and metrics, and evaluates the situations under which these performance fluctuations occur.
