Supervised Learning. Seminar: Social Media Mining, UC3M, May 2017. Lecturer: Carlos Castillo, http://chato.cl/. Sources: CS583 slides by Bing Liu (2017); Supervised Learning course by Mark Herbster (2014); based on: T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2002.
What is “learning” in this context? Computing a functional relationship between an input (x) and an output (y). E.g.: xi = e-mail, yi = { spam, not spam }; xi = tweet, yi = { health-related, not health-related }; xi = handwritten digit, yi = { 0, 1, 2, …, 9 }; xi = news item, yi = [ 0: meaningless, …, 1: important ]. The vector x is usually high-dimensional.
Example: photo of people or not?
Example (cont.) Many problems: doing this efficiently, generalizing well, ...
Formally. Goal: given training data (x1, y1), …, (xn, yn), infer a function f such that f(xi) ≈ yi, and apply it to future data: y = f(x). Binary classification: y ∈ {−1, +1}. Regression: y ∈ ℝ.
Supervised learning algorithms use a training data set Example supervised learning algorithms: Linear regression / logistic regression Decision trees / decision forests Neural networks
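As a hedged illustration (not part of the original slides), here is a minimal sketch of the train-then-predict pattern these algorithms share, using scikit-learn's logistic regression; the tiny feature vectors and labels are invented for the example:

```python
# Minimal sketch of supervised learning with scikit-learn (illustrative only).
# The tiny data set below is invented: each row is a feature vector x_i,
# and y_train holds the corresponding label y_i.
from sklearn.linear_model import LogisticRegression

X_train = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]]  # inputs x_i
y_train = ["spam", "spam", "not spam", "not spam"]          # outputs y_i

model = LogisticRegression()        # could also be a decision tree, etc.
model.fit(X_train, y_train)         # learn f from the (x_i, y_i) pairs

X_new = [[0.2, 0.9]]                # future (unseen) data
print(model.predict(X_new))         # apply f to the new input
```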
What do we want? Collect appropriate training data; this requires some assumptions (e.g., that it is a uniform random sample). Represent inputs appropriately: good feature construction and selection. Learn efficiently, e.g., in time linear in the number of training elements. “Test” efficiently, i.e., operate efficiently at prediction time.
Key goal: generalize. Borges’ “Funes el Memorioso” (1942): “Not only was it difficult for him to see that the generic symbol 'dog' took in all the dissimilar individuals of all shapes and sizes, it irritated him that the 'dog' of three-fourteen in the afternoon, seen in profile, should be indicated by the same noun as the dog at three-fifteen, seen frontally.” To generalize is to forget differences and focus on what is important. Simple models (using fewer features) are preferable in general.
Overfitting and Underfitting. Underfitted models perform poorly on both the training set and the testing set. Overfitted models perform very well on the training set but poorly on the testing set. Source: http://pingax.com/regularization-implementation-r/
Finding a sweet spot: prototypical error curves Inference error = “training error” Estimation error = “testing error”
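A small illustrative sketch (not from the slides) of these curves: fit polynomials of increasing degree to synthetic data and compare training vs. testing error; the data-generating function and noise level are assumptions made for the example:

```python
# Illustrative: train vs. test error as model complexity grows, using
# polynomial regression on synthetic data. A low degree gives high error on
# both halves (underfitting); a very high degree keeps lowering the training
# error while the test error rises (overfitting).
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy target
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]       # simple split

for degree in (1, 3, 9, 15):
    coeffs = P.polyfit(x_tr, y_tr, degree)              # fit on training half
    err_tr = np.mean((P.polyval(x_tr, coeffs) - y_tr) ** 2)
    err_te = np.mean((P.polyval(x_te, coeffs) - y_te) ** 2)
    print(f"degree={degree:2d}  train MSE={err_tr:.3f}  test MSE={err_te:.3f}")
```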
Example: k-NN classifier. Suppose we’re classifying social media postings as “health-related” (green) vs. “not health-related” (red). All messages in the training set can be “pivots”. For a new, unlabeled, unseen message, pick the k pivots that are most similar to it, then do majority voting: green wins => the message is about health; red wins => the message is not about health.
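A minimal sketch of this voting scheme (illustrative, not the lecturer's code); the bag-of-words pivot vectors, labels, and the choice of cosine similarity are assumptions made for the example:

```python
# Minimal k-NN sketch: training messages act as "pivots"; a new message is
# labeled by majority vote among its k most similar pivots.
import numpy as np
from collections import Counter

pivots = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 1], [1, 2, 2]], dtype=float)
labels = ["health", "health", "not-health", "not-health"]

def knn_predict(x, k=3):
    # Cosine similarity between the new message x and every pivot.
    sims = pivots @ x / (np.linalg.norm(pivots, axis=1) * np.linalg.norm(x))
    top_k = np.argsort(sims)[::-1][:k]          # indices of the k most similar
    votes = Counter(labels[i] for i in top_k)   # majority voting
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([2.0, 0.5, 1.0]), k=3))  # expected: 'health'
```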
How large should k be? How to decide?
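One common answer (an assumption here, not stated on the slide) is to try several values of k and keep the one with the best cross-validated accuracy; a minimal scikit-learn sketch on synthetic data:

```python
# Sketch: pick k by 5-fold cross-validation on a synthetic data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for k in (1, 3, 5, 11, 21, 51):
    clf = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(clf, X, y, cv=5).mean()   # 5-fold CV accuracy
    print(f"k={k:2d}  cross-validated accuracy={acc:.3f}")
# Keep the k with the best score; very small k tends to overfit,
# very large k tends to underfit.
```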
Overfitting with k-NN
Decision trees Discriminative model based on per-feature decisions; each internal node is a decision Source: http://blog.akanoo.com/tag/decision-tree-example/
Example (loan application) Class: yes = credit, no = no-credit
Example (loan application). Class: yes = credit, no = no-credit. BEFORE READING THE NEXT SLIDES... Manually build a decision tree for this table. Try to use few internal nodes. You can start with any column (not necessarily the first one).
Simplest decision tree: majority class. A single-node tree that always predicts the majority class (“Yes”). What is its accuracy? (Accuracy = correct / total)
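A tiny sketch of this baseline (illustrative); the label column below is hypothetical, not the slide's actual loan table:

```python
# Majority-class baseline: a "tree" with a single node that always answers
# the most frequent class, and its accuracy = correct / total.
from collections import Counter

labels = ["yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]

majority = Counter(labels).most_common(1)[0][0]      # class to always predict
correct = sum(1 for y in labels if y == majority)
print(majority, correct / len(labels))               # baseline accuracy
```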
Example tree
Is this decision tree unique? No: here is a simpler tree. We want a tree that is both small and accurate: a small tree is easier to understand, and this one also performs better. Finding the optimal tree is NP-hard, so we need to use heuristics.
Basic algorithm (greedy divide-and-conquer). Assume attributes are categorical for now (continuous attributes can be handled too). The tree is constructed in a top-down recursive manner: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes; attributes are selected on the basis of an impurity function (e.g., information gain). Example conditions for stopping partitioning: all examples at a given node belong to the same class; the largest leaf node has min_leaf_size elements or fewer.
Example: information gain Source (this and following slides): http://www.saedsayad.com/decision_tree.htm
Entropy. The entropy of a set S with respect to the class labels is E(S) = − Σ_c p_c log2 p_c, where p_c is the proportion of examples in S that belong to class c.
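A minimal sketch of this formula in Python (illustrative, not from the slides):

```python
# Entropy of a labeled set: E(S) = -sum_c p_c log2 p_c.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ≈ 0.940 for a 9/5 class split
print(entropy(["yes"] * 7))                # ≈ 0.0: a pure set has zero entropy
```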
Expected entropy of splitting on attribute X: E(S, X) = Σ_v P(X = v) · E(S_v), i.e., the entropy of each subset S_v (the examples with X = v) weighted by the fraction of examples that fall into it.
Information gain: Gain(S, X) = E(S) − E(S, X), the reduction in entropy obtained by splitting S on attribute X; the attribute with the highest gain is chosen for the split. See also: http://www.math.unipd.it/~aiolli/corsi/0708/IR/Lez12.pdf
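A minimal sketch combining the last two formulas (illustrative); the toy attribute values and labels below are invented, not the slide's loan table:

```python
# Expected entropy of a split and information gain for a categorical attribute.
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # E(S, X): entropy of each subset S_v, weighted by its relative size.
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    expected = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - expected        # Gain(S, X) = E(S) - E(S, X)

labels = ["yes", "yes", "no", "no", "yes", "no"]
attr   = ["old", "old", "young", "young", "old", "young"]
print(information_gain(attr, labels))        # 1.0 here: a perfect split
```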
Building decision tree recursively on every sub-dataset
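Putting the pieces together, a hedged ID3-style sketch of the recursive construction, under the assumptions of categorical attributes and an information-gain criterion; the rows, attribute names, and min_leaf_size stopping rule are hypothetical, mirroring the conditions listed earlier:

```python
# Greedy top-down tree construction on dict-valued rows (illustrative).
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    rest = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - rest

def build_tree(rows, labels, attrs, min_leaf_size=1):
    # Stop: pure node, no attributes left, or node too small -> majority leaf.
    if len(set(labels)) == 1 or not attrs or len(labels) <= min_leaf_size:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))   # greedy choice
    tree = {"attribute": best, "branches": {}}
    partitions = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        partitions[row[best]][0].append(row)
        partitions[row[best]][1].append(y)
    for value, (sub_rows, sub_labels) in partitions.items():
        tree["branches"][value] = build_tree(
            sub_rows, sub_labels, [a for a in attrs if a != best], min_leaf_size)
    return tree

rows = [{"age": "old", "has_job": "yes"}, {"age": "old", "has_job": "no"},
        {"age": "young", "has_job": "yes"}, {"age": "young", "has_job": "no"}]
labels = ["yes", "no", "yes", "no"]
print(build_tree(rows, labels, ["age", "has_job"]))
```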
There is a lot more in supervised learning! All of these methods require labeled input data (“supervision”). Main practical difficulties: good labeled data can be expensive to get, and efficiency requires careful algorithmic design. Typical problems: sensitivity to incorrectly labeled instances; slow convergence and no guarantee of global optimality; we may want to update a model (online learning); we may want to know what to label next (active learning); overfitting.
Some state-of-the-art methods Text classification: Random forests Image classification: Neural networks
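As a hedged illustration of the text-classification case, a minimal random-forest pipeline in scikit-learn; the tiny corpus, labels, and feature choices (TF-IDF, 100 trees) are assumptions made for the example, not the lecturer's setup:

```python
# Random-forest text classifier sketch: TF-IDF features + 100 trees.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = ["flu shots available at the clinic", "new smartphone released today",
         "tips to lower your blood pressure", "stock markets close higher"]
labels = ["health", "not-health", "health", "not-health"]

clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
clf.fit(texts, labels)
print(clf.predict(["clinic offers free blood pressure checks"]))
```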
Important element: explainability Example: Husky vs Wolf classifier Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?: Explaining the Predictions of Any Classifier." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144. ACM, 2016.