Big Data Infrastructure Week 8: Data Mining (1/4)
CS 489/698 Big Data Infrastructure (Winter 2016), Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo, March 1, 2016
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see for details. These slides are available at

Structure of the Course “Core” framework features and algorithm design Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining

Supervised Machine Learning The generic problem of function induction given sample instances of input and output Classification: output is drawn from a finite set of discrete labels Regression: output is a continuous value This is not meant to be an exhaustive treatment of machine learning! Focus today: classification

Source: Wikipedia (Sorting) Classification

Applications Spam detection Sentiment analysis Content (e.g., genre) classification Link prediction Document ranking Object recognition And much much more! Fraud detection

Supervised Machine Learning [Diagram: training data feeds a machine learning algorithm, which produces a model; at testing/deployment time, the model is applied to new, unseen instances.]

Objects are represented in terms of features: “Dense” features: sender IP, timestamp, # of recipients, length of message, etc. “Sparse” features: contains the term “viagra” in message, contains “URGENT” in subject, etc. Who comes up with the features? How? Feature Representations
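To make the two styles concrete, here is a minimal Scala sketch (the feature names and values are hypothetical, chosen to echo the spam-detection examples above): a dense representation stores a value in every position, while a sparse representation stores only the features that actually occur.

// Hypothetical illustration of dense vs. sparse feature representations.
// Dense: every feature has a slot; most slots are populated.
val dense: Array[Double] = Array(3.0, 1042.0)   // e.g., # of recipients, length of message

// Sparse: only the features that are present are stored, keyed by name.
val sparse: Map[String, Double] = Map(
  "contains:viagra" -> 1.0,
  "subject:URGENT"  -> 1.0
)

// A dot product with a weight map touches only the stored entries.
def sparseDot(features: Map[String, Double], weights: Map[String, Double]): Double =
  features.iterator.map { case (f, v) => v * weights.getOrElse(f, 0.0) }.sum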

Applications Spam detection Sentiment analysis Content (e.g., genre) classification Link prediction Document ranking Object recognition And much much more! Fraud detection Features are highly application-specific!

Components of a ML Solution: Data, Features, Model (logistic regression, naïve Bayes, SVM, random forests, perceptrons, neural networks, etc.), Optimization (gradient descent, stochastic gradient descent, L-BFGS, etc.). What "matters" the most?

No data like more data! (Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007) s/knowledge/data/g;

Limits of Supervised Classification? Why is this a big data problem? Isn’t gathering labels a serious bottleneck? Solution: crowdsourcing Solution: bootstrapping, semi-supervised techniques Solution: user behavior logs Learning to rank Computational advertising Link recommendation The virtuous cycle of data-driven products

Supervised Binary Classification Restrict output label to be binary Yes/No 1/0 Binary classifiers form a primitive building block for multi-class problems One vs. rest classifier ensembles Classifier cascades
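As a sketch of the "one vs. rest" construction (the types and labels here are hypothetical, not from the slides): train one binary scorer per class and predict the class whose scorer is most confident.

// One-vs-rest (sketch): reduce a k-class problem to k binary problems.
// Each scorer answers "this class" vs. "everything else" with a confidence.
type BinaryScorer = Array[Double] => Double

def predictOneVsRest(scorers: Map[String, BinaryScorer], x: Array[Double]): String =
  scorers.maxBy { case (_, score) => score(x) }._1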

The Task
Given: training data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each $\mathbf{x}_i$ is a (sparse) feature vector and $y_i$ is its label
Induce: $f: X \to Y$
Such that loss is minimized: $\arg\min_{f} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i)$
Typically, consider functions of a parametric form, with model parameters $\theta$ and loss function $\ell$: $\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$
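As a small, concrete instance of this objective (assuming a linear model and squared loss purely for illustration), the empirical loss is just an average over the training instances:

// Empirical loss (sketch): linear model f(x; theta) = theta . x under squared loss.
def linearPredict(theta: Array[Double], x: Array[Double]): Double =
  theta.zip(x).map { case (t, xi) => t * xi }.sum

def empiricalLoss(data: Seq[(Array[Double], Double)], theta: Array[Double]): Double =
  data.map { case (x, y) => math.pow(linearPredict(theta, x) - y, 2) }.sum / data.size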

Key insight: machine learning as an optimization problem! (closed form solutions generally not possible)

Gradient Descent: Preliminaries
Rewrite the objective as a function of the parameters: $\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$
Compute the gradient: $\nabla \ell(\theta) = \left( \frac{\partial \ell}{\partial \theta_1}, \frac{\partial \ell}{\partial \theta_2}, \dots, \frac{\partial \ell}{\partial \theta_d} \right)$
The gradient "points" in the fastest increasing "direction", so at any point, taking a small step against the gradient decreases the loss* (*caveats apply)

Gradient Descent: Iterative Update
Start at an arbitrary point, iteratively update: $\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(\theta^{(t)})$
We have: $\ell(\theta^{(0)}) \geq \ell(\theta^{(1)}) \geq \ell(\theta^{(2)}) \geq \dots$
Lots of details: figuring out the step size, getting stuck in local minima, convergence rate, …

Gradient Descent
Repeat until convergence: $\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(\theta^{(t)})$
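A minimal, self-contained sketch of this loop in Scala (the step size, iteration count, and the toy objective are all hypothetical choices):

// Generic gradient descent: theta <- theta - gamma * grad(theta), repeated.
def gradientDescent(grad: Array[Double] => Array[Double],
                    init: Array[Double],
                    gamma: Double = 0.1,
                    iterations: Int = 100): Array[Double] = {
  var theta = init
  for (_ <- 1 to iterations) {
    val g = grad(theta)
    theta = theta.zip(g).map { case (t, gi) => t - gamma * gi }
  }
  theta
}

// Toy example: minimize f(theta) = sum_i theta_i^2, whose gradient is 2 * theta.
val minimum = gradientDescent(theta => theta.map(2 * _), Array(3.0, -4.0))  // converges toward (0, 0)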

Intuition behind the math: new weights = old weights − update based on gradient, i.e., $\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(\theta^{(t)})$

Gradient Descent Source: Wikipedia (Hills)

Lots More Details… Gradient descent is a “first order” optimization technique Often, slow convergence Conjugate techniques accelerate convergence Newton and quasi-Newton methods: Intuition: Taylor expansion Requires the Hessian (square matrix of second order partial derivatives): impractical to fully compute
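To spell out the Taylor-expansion intuition (standard material, added here for completeness rather than taken from the slides): approximate the loss around the current iterate by a quadratic, then jump to that quadratic's minimizer.

$\ell(\theta) \approx \ell(\theta^{(t)}) + \nabla \ell(\theta^{(t)})^{\top} (\theta - \theta^{(t)}) + \tfrac{1}{2} (\theta - \theta^{(t)})^{\top} H(\theta^{(t)}) (\theta - \theta^{(t)})$

Minimizing the right-hand side gives the Newton update, which needs (an approximation to) the inverse Hessian:

$\theta^{(t+1)} = \theta^{(t)} - H(\theta^{(t)})^{-1} \nabla \ell(\theta^{(t)})$

Quasi-Newton methods such as L-BFGS build a low-rank approximation to $H^{-1}$ from recent gradients instead of computing the full matrix.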

Source: Wikipedia (Hammer) Logistic Regression

Logistic Regression: Preliminaries
Given: $D = \{(\mathbf{x}_i, y_i)\}$, with $y_i \in \{0, 1\}$
Let's define the log odds as a linear function of the features: $\ln \frac{\Pr(y = 1 \mid \mathbf{x})}{\Pr(y = 0 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x}$
Interpretation: the score $\mathbf{w} \cdot \mathbf{x}$ is the log of the odds ratio

Relation to the Logistic Function
After some algebra: $\Pr(y = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w} \cdot \mathbf{x}}}{1 + e^{\mathbf{w} \cdot \mathbf{x}}} = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$, and $\Pr(y = 0 \mid \mathbf{x}) = \frac{1}{1 + e^{\mathbf{w} \cdot \mathbf{x}}}$
The logistic function: $f(z) = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}}$

Training an LR Classifier
Maximize the conditional likelihood: $\arg\max_{\mathbf{w}} \prod_{i=1}^{n} \Pr(y_i \mid \mathbf{x}_i; \mathbf{w})$
Define the objective in terms of the conditional log likelihood: $L(\mathbf{w}) = \sum_{i=1}^{n} \ln \Pr(y_i \mid \mathbf{x}_i; \mathbf{w})$
We know $y_i \in \{0, 1\}$, so: $\Pr(y \mid \mathbf{x}; \mathbf{w}) = \Pr(y = 1 \mid \mathbf{x}; \mathbf{w})^{y} \, \Pr(y = 0 \mid \mathbf{x}; \mathbf{w})^{1 - y}$
Substituting: $L(\mathbf{w}) = \sum_{i=1}^{n} \left[ y_i \ln \Pr(y_i = 1 \mid \mathbf{x}_i; \mathbf{w}) + (1 - y_i) \ln \Pr(y_i = 0 \mid \mathbf{x}_i; \mathbf{w}) \right]$

LR Classifier Update Rule
Take the derivative: $\frac{\partial L(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{n} x_{i,j} \left( y_i - \Pr(y_i = 1 \mid \mathbf{x}_i; \mathbf{w}) \right)$
General form for the update rule (gradient ascent on the log likelihood): $\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} + \gamma^{(t)} \nabla L(\mathbf{w}^{(t)})$
Final update rule: $w_j^{(t+1)} \leftarrow w_j^{(t)} + \gamma^{(t)} \sum_{i=1}^{n} x_{i,j} \left( y_i - \Pr(y_i = 1 \mid \mathbf{x}_i; \mathbf{w}^{(t)}) \right)$
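Putting the derivation together, a minimal batch-training sketch in Scala (feature vectors as plain arrays; the step size and iteration count are hypothetical; labels are 0/1 as in the derivation above):

import scala.math.exp

// Pr(y = 1 | x; w) via the logistic function applied to w . x
def prob1(w: Array[Double], x: Array[Double]): Double = {
  val z = w.zip(x).map { case (wi, xi) => wi * xi }.sum
  1.0 / (1.0 + exp(-z))
}

// One batch update: w_j <- w_j + gamma * sum_i x_ij * (y_i - Pr(y_i = 1 | x_i; w))
def lrUpdate(w: Array[Double], data: Seq[(Array[Double], Double)], gamma: Double): Array[Double] = {
  val grad = Array.fill(w.length)(0.0)
  for ((x, y) <- data) {
    val err = y - prob1(w, x)
    for (j <- w.indices) grad(j) += x(j) * err
  }
  w.zip(grad).map { case (wj, gj) => wj + gamma * gj }
}

// Repeat until convergence (a fixed number of iterations in this sketch).
def trainLR(data: Seq[(Array[Double], Double)], dim: Int,
            gamma: Double = 0.1, iterations: Int = 100): Array[Double] =
  (1 to iterations).foldLeft(Array.fill(dim)(0.0)) { (w, _) => lrUpdate(w, data, gamma) }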

Lots more details… Regularization Different loss functions … Want more details? Take a real machine-learning course!

MapReduce Implementation [Diagram: each mapper computes a partial gradient over its portion of the training data; a single reducer sums the partial gradients and updates the model; iterate (run another MapReduce job) until convergence.]
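The decomposition in the figure can be sketched as two plain functions (this is the logical structure only, not the Hadoop API): each mapper computes a partial gradient over its input split, and the single reducer sums the partials and applies the update; the driver launches one job per iteration.

// Logical sketch of the MapReduce decomposition (not actual Hadoop code).
// Mapper: given one split of the data and the current model (e.g., shipped
// via the distributed cache), emit the partial gradient for that split.
def mapPartialGradient(split: Seq[(Array[Double], Double)], w: Array[Double]): Array[Double] = {
  val partial = Array.fill(w.length)(0.0)
  for ((x, y) <- split) {
    val err = y - prob1(w, x)   // prob1 as in the logistic regression sketch above
    for (j <- w.indices) partial(j) += x(j) * err
  }
  partial
}

// Single reducer: sum all partial gradients, then update and emit the model.
def reduceAndUpdate(w: Array[Double], partials: Seq[Array[Double]], gamma: Double): Array[Double] = {
  val grad = partials.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w.zip(grad).map { case (wj, gj) => wj + gamma * gj }
}
// The driver runs one MapReduce job per iteration until convergence.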

Shortcomings
Hadoop is bad at iterative algorithms: high job startup costs; awkward to retain state across iterations
High sensitivity to skew: iteration speed is bounded by the slowest task
Potentially poor cluster utilization: must shuffle all data to a single reducer
Some possible tradeoffs: number of iterations vs. complexity of computation per iteration (e.g., L-BFGS: faster convergence, but more to compute)

Spark Implementation

val points = spark.textFile(...).map(parsePoint).persist()   // parse the training points once and cache them in memory
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y      // per-point contribution to the gradient
  }.reduce((a, b) => a + b)                                   // sum the partial gradients
  w -= gradient                                               // the driver updates the model
}

The map plays the role of the mappers (compute partial gradients) and the reduce plays the role of the single reducer (update the model) from the MapReduce version. What's the difference?

Source: Wikipedia (Japanese rock garden) Questions?