1
5. Machine Learning ENEE 759D | ENEE 459D | CMSC 858Z
Prof. Tudor Dumitraș Assistant Professor, ECE University of Maryland, College Park
2
Today’s Lecture
Where we’ve been:
- Big Data
- Statistics
- MapReduce
- Interpretation of results
Where we’re going today:
- Machine learning
Where we’re going next:
- Part 2 of course: Security and InSecurity in the Real World (2 readings each lecture)
3
Machine Learning
“Systems that automatically learn programs from data” – P. Domingos, CACM 2012
Supervised learning: have inputs and associated outputs
- Learn relationships between them using available training data (also called “labeled data” or “ground truth”)
- Predict future values
- Classification: the output (learned attribute) is categorical
- Regression: the output (learned attribute) is numeric
Unsupervised learning: have only inputs
- Learn “latent” labels
- Clustering: identify natural groups in the data
4
Rules
Weather and golf: want to decide when to play
- Create rules based on attributes
- Example with 1 attribute:
  if (outlook == “rainy”) then play = “no” else play = “yes”
  Errors: 6/14
- Can refine the rule by adding conditions on other attributes
- Create a decision tree
[Table: weather-and-golf training data with attributes outlook, temp, humidity, windy and the outcome play]
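A rough R sketch of this one-attribute rule is shown below; the data frame is an assumption (only the outlook attribute, reconstructed so that the per-value counts match the later slides), not the slide's full table.

weather <- data.frame(
  outlook = factor(rep(c("sunny", "overcast", "rainy"), times = c(5, 4, 5))),
  play    = factor(c("no", "no", "no", "yes", "yes",    # sunny: 2/5 "yes"
                     "yes", "yes", "yes", "yes",        # overcast: 4/4 "yes"
                     "yes", "yes", "yes", "no", "no"))  # rainy: 3/5 "yes"
)
pred <- ifelse(weather$outlook == "rainy", "no", "yes")  # the one-attribute rule
sum(pred != weather$play)                                # 6 errors out of 14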
5
Entropy
Which attribute do we choose at each level?
- Consider two sequences of coin flips: how much information do we get after flipping each coin once?
- We want some function “Information” that satisfies, for independent events: Information(p1 · p2) = Information(p1) + Information(p2)
- Expected information = “entropy”: H = −Σ p_i · log2(p_i)
- Examples:
  - Flipping a coin: in learning the outcome of the coin flip we learned 1 bit of information
  - Rolling a fair die: a die is more unpredictable than a coin
  - Rolling a weighted die with p1..5 = 0.1, p6 = 0.5: a weighted die is less unpredictable than a fair die
- By definition, information must (1) depend only on the probability of the event; (2) be positive; and (3) be additive (the information from the intersection of two independent events must be the sum of the information for each of the two events)
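A quick R check of these examples (the entropy function below computes H = −Σ p_i · log2(p_i)):

entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
entropy(c(0.5, 0.5))            # fair coin: 1 bit
entropy(rep(1/6, 6))            # fair die: log2(6), about 2.58 bits
entropy(c(rep(0.1, 5), 0.5))    # weighted die: about 2.16 bits, less than the fair die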
6
Decision Tree
Weather and golf: at each level, choose the attribute with the highest information gain (the one that reduces the unpredictability the most)
- Before splitting: 9/14 “yes” outcomes => H = 0.94
- Outlook: H = 0.69
  - 4/4 “yes” for overcast (H = 0)
  - 3/5 “yes” for rainy (H = 0.97)
  - 2/5 “yes” for sunny (H = 0.97)
- Temperature: H = 0.91
- Humidity: H = 0.94
- Windy: H = 0.87
- Outlook provides the highest information gain: 0.94 – 0.69 = 0.25
[Table: weather-and-golf training data (outlook, temp, humidity, windy, play)]
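The slide's numbers for outlook can be reproduced with a few lines of R (the per-value counts are taken directly from the slide):

entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
h_before  <- entropy(c(9, 5) / 14)             # 9 "yes", 5 "no": H = 0.94
h_outlook <- (4/14) * entropy(c(4, 0) / 4) +   # overcast: 4/4 "yes", H = 0
             (5/14) * entropy(c(3, 2) / 5) +   # rainy: 3/5 "yes", H = 0.97
             (5/14) * entropy(c(2, 3) / 5)     # sunny: 2/5 "yes", H = 0.97
h_before - h_outlook                           # information gain of outlook: 0.25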
7
Resulting Decision Tree
Putting the decision tree together
- Choose the attribute with the highest information gain
- Create branches for each value of the attribute
- Discretize continuous attributes (choose the partition with the highest gain)
- R package: rpart
- Not a perfect classification (the tree still makes some incorrect decisions)
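A minimal rpart sketch, again using the assumed outlook-only reconstruction of the data; the full table would be fitted the same way with play ~ outlook + temp + humidity + windy:

library(rpart)
weather <- data.frame(
  outlook = factor(rep(c("sunny", "overcast", "rainy"), times = c(5, 4, 5))),
  play    = factor(c("no", "no", "no", "yes", "yes",
                     "yes", "yes", "yes", "yes",
                     "yes", "yes", "yes", "no", "no"))
)
fit <- rpart(play ~ outlook, data = weather, method = "class",
             control = rpart.control(minsplit = 2))   # small data set, so allow small splits
print(fit)                                            # the fitted tree
table(predicted = predict(fit, type = "class"),
      actual    = weather$play)                       # not a perfect classification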
8
Overfitting
Low error on training data and high error on test data
- “If the knowledge and data we have are not sufficient to completely determine the correct classifier, […] we run the risk of just hallucinating a classifier that […] simply encodes random quirks in the data.” – P. Domingos, CACM’12
- Some algorithms can prune the tree to avoid overfitting
[Figure: underfitting vs. overfitting]
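One way to prune with rpart (a hedged sketch; "fit" is an rpart model such as the one in the previous sketch) is to keep the subtree with the lowest cross-validated error from the complexity-parameter table:

library(rpart)
printcp(fit)                                                  # cross-validated error for each subtree size
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                           # prune back to that subtree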
9
Confusion Matrix
How to determine if the classifier does a good job?
- You need a training set (ground truth) and a testing set
- Or you can split your ground truth into two data sets
- Even better: K-fold cross-validation — partition the ground truth into K folds, train on K-1 folds and test on the held-out fold, repeating K times
- You can make a mistake in two different ways:

               True -                                True +
  Predicted -  True Negative (TN): correct decision  False Negative (FN): Type 2 error
  Predicted +  False Positive (FP): Type 1 error     True Positive (TP): correct decision
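A minimal base-R sketch of both ideas ("pred", "truth" and "dataset" are placeholders for the predicted labels, the true labels and the ground-truth data frame):

table(predicted = pred, actual = truth)      # the 2x2 confusion matrix: TN, FN, FP, TP

k <- 10                                      # K-fold cross-validation
folds <- sample(rep(1:k, length.out = nrow(dataset)))
for (i in 1:k) {
  train <- dataset[folds != i, ]             # train the classifier on K-1 folds
  test  <- dataset[folds == i, ]             # evaluate it on the held-out fold
}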
10
Evaluating Results
Is it better to have low FPs or low FNs?
- There is usually a trade-off between FPs and FNs: reducing Type 1 errors causes more Type 2 errors, and vice-versa
- Sensitivity = TP / (TP + FN): ability to identify true positives (also called the true positive rate)
- Specificity = TN / (FP + TN): ability to correctly identify true negatives (also called the true negative rate)
- Precision = TP / (TP + FP): fraction of positive test results that really have the condition (a.k.a. positive predictive value)
- Recall = TP / (TP + FN): fraction of subjects with the condition that return positive tests (a.k.a. sensitivity, true positive rate)
- Can plot a Receiver Operating Characteristic (ROC) curve (R package: ROCR)
[Figure: ROC curves for evaluating keystroke dynamics [Killourhy & Maxion, DSN’09]; TP rate (sensitivity) vs. FP rate (1 – specificity); the equal-error rate is the point where FP = FN]
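A minimal ROCR sketch for plotting an ROC curve ("scores" and "labels" are placeholders: the classifier's scores and the true 0/1 labels for the test set):

library(ROCR)
pred <- prediction(scores, labels)
roc  <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc)                                    # ROC curve: TP rate vs. FP rate
performance(pred, "auc")@y.values[[1]]       # area under the ROC curve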
11
Unsupervised Learning
No ground truth; the goal is to identify patterns that describe the data
- Agglomerative hierarchical clustering (R: hclust): start from individual points and progressively merge nearby clusters
  - Distance metric (e.g. Euclidean, rank correlation, Gower)
  - Linkage: how to aggregate pairwise point distances into cluster distances. Average? Minimum (single)? Maximum (complete)? Variance decrease (Ward)?
- Choose classification or clustering features carefully
Applications of unsupervised learning
- Outlier detection or monitoring – “Is this normal?”
- Classification – “What group is this item most similar to?”
- Compression and communication – send the string “ababababababab” or send “ab*7”
- In all cases, we identify patterns that describe the data and put them to use
[Figure: dendrogram of 1970 cars (features: MPG, weight, drive ratio)]
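A minimal hclust sketch; the built-in mtcars data (1973–74 Motor Trend cars) is used here as a stand-in for the slide's 1970 cars example, with comparable features:

x  <- scale(mtcars[, c("mpg", "wt", "drat")])   # MPG, weight, drive ratio (scaled)
d  <- dist(x, method = "euclidean")             # pairwise distance metric
hc <- hclust(d, method = "ward.D2")             # Ward linkage; also "single", "complete", "average"
plot(hc)                                        # the dendrogram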
12
Additional Machine Learning Resources
Classification
- We saw: decision trees
- Other classifiers: naïve Bayes, Support Vector Machines (SVM)
Natural language processing
- Text mining (R package: tm)
- Sentiment analysis (annotated English wordlists)
Clustering
- We saw: hierarchical clustering
- Other clustering techniques: k-means, k-medoids, time series clustering
- Dimensionality reduction: principal component analysis (PCA)
Machine learning tools
- For R: packages such as rpart, ROCR, hclust and tm from this lecture
- For Hadoop: Mahout
Techniques for time series: dynamic time warping, autoregression, Fourier analysis
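Two of the techniques named above, sketched with base R functions on the mtcars stand-in data (an assumption; any numeric feature matrix works the same way):

x   <- scale(mtcars[, c("mpg", "wt", "drat")])
km  <- kmeans(x, centers = 3)       # k-means: partition the cars into 3 clusters
pca <- prcomp(x)                    # PCA: principal components of the same features
summary(pca)                        # variance explained by each component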
13
Project Peer-Reviews
Pilot project reports
- Reports due today
- Discuss hypothesis (security problem and data analyzed to solve it)
- Feasibility study: report data volume, velocity, variety and quality
- Post report on Piazza
Pilot project peer reviews
- Review at least 2 project reports from other students
- Use skills learned from paper reviews
- Peer reviews are a part of your grade
- Post reviews on Piazza (as follow-ups to the report posts) by Monday
14
Review of Lecture
What did we learn?
- Classification
- Clustering
What’s next?
- Paper discussion: ‘Sex, Lies and Cyber-crime Surveys’
- Next lecture: start of Part 2 of the course (2 readings per lecture)
Deadline reminders
- Pilot project reports due today
- Pilot project reviews due Monday
- Group project proposals due Monday, 09/30