Data Mining: Practical Machine Learning Tools and Techniques
By I. H. Witten, E. Frank and M. A. Hall
DM Finals Study Guide
Rodney Nielsen
Rodney Nielsen, Human Intelligence & Language Technologies Lab

Chapter 1
- Sometimes understanding why certain predictions are being made is more important than the predictions themselves. Why? How can you determine why the predictions are being made? Do some ML algorithms make that easier than others (which)?
- What is the difference between a classification rule and an association rule?
- Be able to interpret the models produced by several algorithms: decision trees, rules, linear classifiers, Naïve Bayes, ANNs, nearest neighbor, bagging, random forests.
- When are results good, okay, bad, etc.? What are the common factors involved in deciding a classifier's (or an individual rule's) value?
- What are search, generalization, and regularization in the context of DM? What are their roles / why are they important?
- Describe concept space in terms of rules. Define a concept description in terms of rules.
- What practical problems are associated with enumerating the entire search space when looking for the right concept description for a dataset?
- What are the three important decisions in a learning algorithm that determine its bias? What is meant by bias in ML/DM? Understand the different aspects of these components of the bias.
- Understand the difference between general-to-specific versus specific-to-general search.
- Evaluate potential ethical issues in DM.
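The classification-rule vs. association-rule distinction above can be made concrete in a few lines. This is a minimal, hypothetical sketch (the instances and rules are invented for illustration, not taken from the book): a classification rule's consequent is always the class attribute (here, play), while an association rule may predict any attribute.

```python
# Hypothetical weather-style instances: attribute -> value.
instances = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]

def rule_stats(antecedent, consequent, data):
    """Coverage: number of instances matching the antecedent;
    accuracy: fraction of those that also match the consequent."""
    covered = [x for x in data
               if all(x.get(a) == v for a, v in antecedent.items())]
    correct = [x for x in covered
               if all(x.get(a) == v for a, v in consequent.items())]
    return len(covered), len(correct) / len(covered)

# Classification rule: IF outlook = sunny THEN play = no
print(rule_stats({"outlook": "sunny"}, {"play": "no"}, instances))  # (2, 1.0)
# Association rule: IF play = yes THEN windy = False
print(rule_stats({"play": "yes"}, {"windy": False}, instances))     # (1, 1.0)
```

Coverage and accuracy computed this way are the same quantities used to judge an individual rule's value.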
Chapter 2
- Concept descriptions.
- How do you use training, validation, and test data? Evaluate different experimental designs.
- Why do supervised classification algorithms typically not place constraints on coverage or accuracy, whereas association rule learning does utilize such constraints?
- Compare and contrast classification, clustering, semi-supervised learning, and active learning.
- Compare and contrast nominal, ordinal, interval, and ratio data.
- Describe the potential significance of missing values.
- Analyze the reasons for errors in the data and the significance of their impacts.
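As a reminder of how the three datasets are used, here is a minimal sketch of a train/validation/test split (the 60/20/20 fractions and the function name are illustrative assumptions, not from the book): the training set fits the model, the validation set tunes parameters and selects models, and the test set is touched once, at the end, to estimate performance on unseen data.

```python
import random

def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then split into training / validation / test.
    Shuffling first avoids ordering artifacts in the original data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

A common mistake this guards against: tuning on the test set, which makes the final estimate optimistically biased.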
Chapter 3
- Analyze the difference between rules generated from a decision tree versus those output by various types of rule learning algorithms.
- Analyze the number of classification rules possible for a dataset versus the number of association rules.
- Describe the role of support and confidence in association rule learning.
- Evaluate the benefit of exceptions in rules.
- Advantages of rules versus decision trees.
- Advantages and disadvantages of instance-based learning.
- Interpret the output of various learning algorithms whose output can be interpreted relatively easily.
- Analyze a learning problem and decide what form of learning or what algorithm is most suitable (restricted to clear scenarios).
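Support and confidence, the two constraints central to association rule learning, are simple fractions. A minimal sketch (the market-basket transactions are a made-up example): support is the fraction of transactions containing all items of the rule, and confidence is the rule's support divided by the support of its antecedent alone.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of the whole rule over
    support of the antecedent."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [frozenset(t) for t in
                [{"bread", "milk"}, {"bread", "butter"},
                 {"bread", "milk", "butter"}, {"milk"}]]

# Rule: bread -> milk
lhs, rhs = frozenset({"bread"}), frozenset({"milk"})
print(support(lhs | rhs, transactions))   # 0.5   (2 of 4 transactions)
print(confidence(lhs, rhs, transactions)) # 2/3   (2 of the 3 bread transactions)
```

Deriving rules from a small dataset amounts to enumerating itemsets and keeping the rules whose support and confidence exceed the chosen thresholds.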
Chapter 5
- Different evaluation techniques: k-fold cross-validation vs. train/test split vs. train/validation/test split vs. bootstrap sampling vs. repeated random resampling (repeated holdout) vs. leave-one-out cross-validation: pros and cons (advantages and disadvantages), ability to predict evaluation results for unseen (future) data.
- Proper creation and use of datasets; common mistakes; trade-offs; pros and cons of stratification.
- Common evaluation performance measures (accuracy, P, R, F1-measure, error rate, relative reduction in the error rate, mean squared error, lift, AUC; know the correlation coefficient conceptually; know minimum description length conceptually), when they should be used, pros and cons.
- Apply cost-sensitive classification given a cost matrix and class probability estimates.
- Given an evaluation metric's result on one set of data, how close can you expect it to be to the result you would get on unseen data?
- When to use Student's t-test versus the paired t-test.
- Interpret a statistical p-value; evaluate different experimental designs.
- If the results look too good to be true, ___________________!
- Precision-recall trade-off.
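Two of the "apply" items above fit in a short sketch. The numbers and the cost matrix below are hypothetical; the calculation pattern is the point: P, R, and F1 come straight from the confusion counts, and cost-sensitive classification predicts the class that minimizes expected cost rather than the most probable class.

```python
def precision_recall_f1(tp, fp, fn):
    """P = tp/(tp+fp), R = tp/(tp+fn), F1 = harmonic mean of P and R."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def min_cost_prediction(probs, cost):
    """probs[j]: estimated probability the true class is j.
    cost[i][j]: cost of predicting i when the truth is j.
    Return the prediction with the lowest expected cost."""
    expected = [sum(p * c for p, c in zip(probs, row)) for row in cost]
    return min(range(len(cost)), key=expected.__getitem__)

print(precision_recall_f1(tp=8, fp=2, fn=4))  # P=0.8, R=2/3, F1=8/11

# Missing a true class-1 instance costs 10x the reverse mistake, so we
# predict class 1 even though class 0 is more probable:
print(min_cost_prediction([0.7, 0.3], [[0, 10], [1, 0]]))  # 1
```

Note how the cost matrix moves the decision threshold: with equal costs the same probability estimates would yield class 0.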
Chapter 4 & 6
- Discriminative versus generative classifiers
- One-versus-all scheme
- Common advantages of one algorithm versus another (not obscure advantages)
Chapter 4 & 6
To the extent we discussed it in class, for the following algorithms, know:
- the pros and cons relative to the others (the benefits, what they are good for, and what causes them problems)
- the typical uses / why you would want to use the algorithm
- the basics of the algorithm
- significant assumptions made
- how to interpret/apply the model generated
- how (if possible) to handle nominal and/or numeric attributes, and if not possible, how to work around the issue
- how to treat missing values
- whether they naturally handle multiclass problems, and if not, how to work around the issue
- whether there are only certain concept spaces for which the base algorithm works, and if so, whether there is a way to work around the issue
- how they stand up to noisy data
- how to avoid overfitting (regularization)
- whether solutions are global optima (max or min of the metric), local optima, or neither
- standard parameters
Chapter 4 & 6
- Naïve Bayes: Bayes' rule; why we typically don't need to compute the denominator P(x) or P(evidence); how to calculate the relevant probabilities given a simple example.
- Decision trees: divide and conquer (you do not need to be able to apply the entropy-based equations); why we use gain ratio rather than just information gain; whether DTs can ever be ambiguous; pros and cons (advantages and disadvantages) of pre-pruning versus post-pruning.
- Covering algorithm: separate and conquer; the reduced-error pruning concept; the rules-with-exceptions concept; whether rule sets can ever be ambiguous.
- Association rules: support, confidence; be able to derive association rules from a simple dataset.
- Linear models: logistic regression; the perceptron algorithm; what the algorithms most often try to minimize; what the basic algorithms use to do that minimization; the kernel concept; maximum margin; what a support vector is.
- Instance-based learning: k-NN; how to measure nearness; normalization; conceptual understanding of and ability to apply kD-trees and ball trees.
- Clustering: the differences between disjoint vs. overlapping, deterministic vs. probabilistic, hierarchical vs. flat, and agglomerative vs. divisive clustering; single linkage vs. complete linkage vs. centroid linkage vs. average linkage; k-means; EM (not computing variance or covariance); how to interpret a dendrogram.
- Principal components analysis: know what this does conceptually.
- Semi-supervised learning and active learning: co-training, self-training.
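For the Naïve Bayes item, here is a minimal sketch of the probability calculation with a tiny hypothetical dataset (the instances are invented; the book's weather data is not reproduced here). It also shows why the denominator P(evidence) is never needed: it is identical for every class, so it cancels when scores are only compared.

```python
from collections import Counter

def naive_bayes_scores(train, query):
    """train: list of (attributes-dict, class) pairs.
    Returns, per class, P(class) * prod_a P(a = query[a] | class),
    with Laplace (add-one) smoothing so unseen values never zero out
    the product.  Scores are unnormalized: dividing by P(evidence)
    would scale every class equally."""
    class_counts = Counter(c for _, c in train)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(train)                       # prior P(class)
        for attr, value in query.items():
            n_values = len({x[attr] for x, _ in train})
            match = sum(1 for x, cc in train
                        if cc == c and x[attr] == value)
            score *= (match + 1) / (n_c + n_values)    # smoothed likelihood
        scores[c] = score
    return scores

train = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
         ({"outlook": "rainy"}, "yes"), ({"outlook": "overcast"}, "yes")]
scores = naive_bayes_scores(train, {"outlook": "sunny"})
print(max(scores, key=scores.get))  # "no"
```

Working through it by hand: P(no) = 0.5, P(sunny|no) = (2+1)/(2+3) = 0.6, giving 0.3; P(yes) = 0.5, P(sunny|yes) = (0+1)/(2+3) = 0.2, giving 0.1; so "no" wins.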