Comparing Supervised Machine Learning Algorithms using WEKA & Orange
Abhishek Iyer
What is Machine Learning?
In 1959, Arthur Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. More formally, a computer is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.
Why use machine learning?
Applications of Machine Learning Today
- Bioinformatics
- Cheminformatics
- Data Mining
- Character Recognition
- Spam Detection
- Pattern Recognition
- Speech Recognition
- Smart Ads
- Game Playing
and so on…
Types of Machine Learning Algorithms
- Supervised Machine Learning
- Unsupervised Machine Learning
- Semi-Supervised Learning
- Reinforcement Learning
Supervised Learning
Sad State of Affairs: Too Many Options
- Linear/Polynomial Regression
- Logistic Regression
- K-Nearest Neighbor
- Neural Nets
- Decision Trees
- SVMs
- Naïve Bayes
- Inductive Logic Programming
- …
Purpose of the Research Paper
The primary focus of this research paper will be on comparing supervised learning algorithms. Comparisons will be made between decision trees, artificial neural networks (ANN), support vector machines (SVM), Naïve Bayes, logistic regression, and k-nearest neighbor to decide which algorithm is the most efficient. The research will also give insight into the different domains where a certain algorithm performs better than the others. WEKA, Orange, and TunedIT.org will be used to compare the algorithms.
Data Sets
- Data sets should cover all possible fields of application and not favor any algorithm.
- Primary source for the data sets: the UCI Machine Learning Repository.
- Approximately 55-60 different data sets were used during the entire experiment.
- Used the cross-validation option with 10 folds: each data set is partitioned into 10 subsamples, and each subsample is held out once for testing while the remaining nine train the model (see the sketch below).
- Scalability.
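A minimal sketch of the 10-fold cross-validation described above, using scikit-learn; the file name lung_cancer.csv and its column layout are assumptions for illustration, not the actual experimental files:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical UCI-style data set: numeric features, class label in the last column.
data = np.loadtxt("lung_cancer.csv", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# cv=10 partitions the data into 10 folds; each fold is held out once for
# testing while the remaining nine train the model.
scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="precision")
print(f"mean precision over 10 folds: {scores.mean():.3f}")
```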
Method Followed
1. Select a data set.
2. Select an algorithm.
3. Generate the confusion matrix and precision values for the algorithm-data set pair using WEKA, Orange, and tunedit.org.
4. Repeat the procedure for each algorithm under consideration.
5. Repeat the procedure for all the data sets.
6. Compare my results with the research material available.
Confusion Matrix Details
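As a minimal sketch of how a confusion matrix yields the precision values reported on the following slides, with made-up labels for a binary cancer/no-cancer task (not the actual experimental output):

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical true and predicted labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# Precision = TP / (TP + FP): the fraction of positive predictions
# that are actually positive. Here: 4 / (4 + 1) = 0.8.
print(precision_score(y_true, y_pred))
```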
Precision Values (by data set)

Algorithm            Lung Cancer  Oral Cancer  Blood Cancer
Naïve Bayes          0.894        0.837        0.974
Logistic Regression  0.737        0.607        0.962
Demo: Boosting and Lift
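The live demo is not reproduced here; as a stand-in, a minimal sketch of boosted decision trees (the BST-DT entry in the results) using scikit-learn's AdaBoost on an assumed synthetic data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the demo data set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Boosting fits shallow trees sequentially, each one focusing on the
# examples the previous trees misclassified.
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
)
print(cross_val_score(boosted, X, y, cv=10).mean())
```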
Results: General

ALGO      Precision  F-measure  Lift   TP Rate  Recall  Mean
ANN       0.957      0.871      0.958  0.892    0.826   0.901
BST-DT    0.958      0.854      0.956  0.919    0.808   0.899
SVM       0.926      0.851      0.947  0.882    0.769   0.875
KNN       0.893      0.820      0.914  0.786    0.706   0.824
LOG-REG   0.832      0.823      0.929  0.714    0.778   0.7946
NAÏVE-B   0.733      0.615      0.786  0.539    0.885   0.712
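For reference, the column metrics follow the standard definitions over confusion-matrix counts (TP, FP, FN), with lift comparing precision against the base rate of positives:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
         {\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{Lift} = \frac{\mathrm{Precision}}{P(\mathrm{positive})}
```

Note that TP rate is defined by the same formula as recall for a single class; the separate columns in the table presumably reflect different averaging across classes and data sets.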
Are We Done? Not Quite
BEST collects the best column-wise score achieved by any single algorithm; no one algorithm matches it across all metrics.

ALGO      Precision  F-measure  Lift   TP Rate  Recall  Mean
BEST      0.958      0.871      0.958  0.919    0.885   0.9182
ANN       0.957      0.871      0.958  0.892    0.826   0.901
BST-DT    0.958      0.854      0.956  0.919    0.808   0.899
SVM       0.926      0.851      0.947  0.882    0.769   0.875
KNN       0.893      0.820      0.914  0.786    0.706   0.824
LOG-REG   0.832      0.823      0.929  0.714    0.778   0.7946
NAÏVE-B   0.733      0.615      0.786  0.539    0.885   0.712
Analysis of Algorithms

[Star-rating matrix: each algorithm rated on Accuracy, Speed of Learning, Tolerance, Overfitting, Incremental Learning, and Speed of Classification; the per-cell ratings were lost in extraction.]

Average rating:
ALGO                   Average
Boosted Decision Tree  3.2
ANN                    2.7
Naïve Bayes            2.8
KNN                    2.7
SVM                    2.3
Logistic Regression    2.1
Results: Domain Specific

Application                    Best Algorithm         Worst Algorithm
Bioinformatics                 ANN                    Logistic Regression
Game Playing                   ANN                    Naïve Bayes
Data Mining                    SVM                    Naïve Bayes
Spam Detection                 ANN                    Naïve Bayes
Medicine                       Boosted Decision Tree  Not Clear
Character Recognition          ANN                    Not Clear
Physics & Scientific Research  SVM                    Logistic Regression
Why? Why? Why?
- Naïve Bayes performs well when instances can be classified from prior probabilistic knowledge.
- SVMs are excellent when the training data is well separated in the vector space, but they take a long time to train.
- Artificial neural networks are best where the data is dynamic and vast and there are too many possibilities involved; they are efficient due to their heuristic implementation.
- K-nearest neighbor is consistent in almost every area and excellent in areas where networking is involved.
- Decision trees are simple to learn and implement, and very efficient.
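A minimal sketch of how such head-to-head observations can be reproduced, training the compared classifiers side by side on one assumed synthetic data set (rankings shift with the data, which is exactly why many data sets were used):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the experiment's data sets.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=10).mean():.3f}")
```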
Challenges Faced
- Enormous data.
- Understanding the machine learning algorithms.
- Type conversion to support the .arff format for WEKA and the .tab or .si format for Orange (see the sketch below).
- Overfitting: the model describes random noise instead of the underlying relationship because it is excessively complex, with too many parameters.
- Inductive bias: the assumptions used to predict outputs for inputs that were not encountered in the training set.
- Class membership probabilities: uncertainty in classification due to low probabilistic confidence.
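For the format-conversion challenge, a minimal sketch of writing a numeric-feature CSV out as a WEKA .arff file; the attribute layout (numeric features, nominal class in the last column) is an assumption, and in practice WEKA's own converters handle the edge cases:

```python
import csv

def csv_to_arff(csv_path, arff_path, relation="dataset"):
    """Write a CSV (numeric features, class label in the last column) as .arff."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    classes = sorted({row[-1] for row in data})
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        for name in header[:-1]:
            out.write(f"@attribute {name} numeric\n")
        # The class attribute is nominal: enumerate its observed values.
        out.write(f"@attribute {header[-1]} {{{','.join(classes)}}}\n\n")
        out.write("@data\n")
        for row in data:
            out.write(",".join(row) + "\n")

# Hypothetical usage:
# csv_to_arff("lung_cancer.csv", "lung_cancer.arff", relation="lung-cancer")
```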
The Future
- The key question in machine learning is not whether one learning algorithm is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem.
- After a better understanding of the strengths and limitations of each method, the possibility of integrating two or more algorithms to solve a problem should be investigated.
- Improve the running time of the algorithms.
- Possibly build a super-powerful algorithm that could be used in every domain and for any data set.
Questions?
References
1. Andrew Ng, "Machine Learning (CS 229)", Stanford University, 2009.
2. Rich Caruana and Alexandru Niculescu-Mizil, "An Empirical Comparison of Supervised Learning Algorithms", Department of Computer Science, Cornell University, 2006.
3. T. Yang, "Computational Verb Decision Trees", Yang's Scientific Press, 2006.
4. Rich Caruana and Alexandru Niculescu-Mizil, "Obtaining Calibrated Probabilities from Boosting", Department of Computer Science, Cornell University, 2006.
5. Aik Choon Tan and David Gilbert, "An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics", Department of Computer Science, University of Glasgow, 2003.
6. Sameera Mahajani, "Comparing Data Mining Techniques for Cancer Classification", Department of Computer Science, CSU Chico.
Annotated Bibliography

1. Aik Choon Tan and David Gilbert, "An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics", Department of Computer Science, University of Glasgow, 2003.

This paper presents the theory and research behind the application of supervised machine learning techniques to bioinformatics and the classification of biological data. The paper shows, with substantial practical evidence, that none of the supervised machine learning algorithms perform consistently well over the data sets selected. Its observations show that a combination of machine learning techniques performs much better than the individual ones, and that performance is highly dependent on the type of training data. The paper also suggests some important points to consider while selecting a supervised machine learning algorithm for a data set.

2. Rich Caruana and Alexandru Niculescu-Mizil, "An Empirical Comparison of Supervised Learning Algorithms", Department of Computer Science, Cornell University, 2006.

This paper performs a very detailed empirical comparison of ten machine learning algorithms using eight performance criteria, considering a variety of data sets, with well-documented results. The paper finds that calibrated boosted trees are the best supervised learning algorithm, followed by random forests, SVMs, and neural networks. It also discusses the various new supervised learning algorithms recently introduced and how they are more efficient than the older algorithms. Finally, the paper examines the significant variations seen across problems and metrics and evaluates the situations under which these performance fluctuations occur.