Big Data Infrastructure Week 8: Data Mining (2/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Presentation transcript:

Big Data Infrastructure Week 8: Data Mining (2/4)
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo
March 3, 2016
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License (see for details). These slides are available at

The Task
Given: D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, where \mathbf{x}_i is a (sparse) feature vector and y_i is the label.
Induce: f: X \to Y such that the loss is minimized:
\[ \operatorname*{argmin}_{f} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(\mathbf{x}_i), y_i\big) \]
where \ell is the loss function. Typically, consider functions of a parametric form, with model parameters \theta:
\[ \operatorname*{argmin}_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(\mathbf{x}_i; \theta), y_i\big) \]
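To make the setup concrete with the model used in the rest of this session (this instantiation is an addition, not on the original slide): logistic regression with labels y \in \{-1, +1\} uses the log loss

\[ \ell(\mathbf{w}; \mathbf{x}, y) = \log\!\left(1 + e^{-y\,\mathbf{w}\cdot\mathbf{x}}\right), \qquad \nabla_{\mathbf{w}}\,\ell = \left(\frac{1}{1 + e^{-y\,\mathbf{w}\cdot\mathbf{x}}} - 1\right) y\,\mathbf{x} \]

This is exactly the per-instance gradient that appears in the Spark code below.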

Gradient Descent Source: Wikipedia (Hills)
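The update rule behind this picture, in the notation of the task slide (a standard statement of batch gradient descent; \gamma^{(t)} is the step size):

\[ \theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)}\,\frac{1}{n} \sum_{i=1}^{n} \nabla \ell\big(f(\mathbf{x}_i; \theta^{(t)}),\, y_i\big) \]

Repeat until convergence.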

MapReduce Implementation
[diagram: mappers compute partial gradients over their input splits; a single reducer sums them and updates the model; iterate until convergence]
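A minimal, self-contained Scala sketch of this pattern (illustrative only, not actual Hadoop code; names such as pointGradient, splits, and eta are assumptions of this sketch):

    // Sketch: "mappers" compute partial gradients over their splits,
    // a single "reducer" sums them and updates the model, the driver iterates.
    object BatchGDSketch {
      type Vec = Array[Double]
      def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum
      def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
      def axpy(alpha: Double, x: Vec, y: Vec): Vec =
        x.zip(y).map { case (xi, yi) => alpha * xi + yi }

      // Logistic-loss gradient for one instance (x, y), with y in {-1, +1}.
      def pointGradient(w: Vec, x: Vec, y: Double): Vec = {
        val scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        x.map(_ * scale)
      }

      def main(args: Array[String]): Unit = {
        // Two "input splits", one per mapper.
        val splits: Seq[Seq[(Vec, Double)]] = Seq(
          Seq((Array(1.0, 0.5), 1.0), (Array(-1.0, 0.3), -1.0)),
          Seq((Array(0.8, -0.2), 1.0), (Array(-0.9, -0.7), -1.0)))
        var w: Vec = Array(0.0, 0.0)
        val eta = 0.1
        for (_ <- 1 to 100) {   // iterate until "convergence"
          // "Map" phase: each split computes its partial gradient independently.
          val partials = splits.map(_.map { case (x, y) => pointGradient(w, x, y) }.reduce(add))
          // "Reduce" phase: a single reducer sums partials and updates the model.
          w = axpy(-eta, partials.reduce(add), w)
        }
        println(w.mkString("w = [", ", ", "]"))
      }
    }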

Spark Implementation

    val points = spark.textFile(...).map(parsePoint).persist()
    var w = // random initial vector
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      }.reduce((a, b) => a + b)
      w -= gradient
    }

The map computes partial gradients; the reduce sums them and the driver updates the model. What's the difference?

Gradient Descent Source: Wikipedia (Hills)

Stochastic Gradient Descent Source: Wikipedia (Water Slide)

Batch vs. Online
Gradient Descent: "batch" learning, i.e., update the model after considering all training instances.
Stochastic Gradient Descent (SGD): "online" learning, i.e., update the model after considering each (randomly-selected) training instance. In practice… just as good! Opportunity to interleave prediction and learning!
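A sketch of the SGD inner loop, reusing Vec, axpy, and pointGradient from the MapReduce sketch above (eta, EPOCHS, and data are assumptions of this sketch):

    // SGD: shuffle the instances, then update the model after every single one
    for (_ <- 1 to EPOCHS; (x, y) <- scala.util.Random.shuffle(data))
      w = axpy(-eta, pointGradient(w, x, y), w)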

Practical Notes
The order of the instances is important! Most common implementation: randomly shuffle the training instances, then stream them through the learner. Single- vs. multi-pass approaches. "Mini-batching" as a middle ground between batch and stochastic gradient descent. We've solved the iteration problem! What about the single-reducer problem?
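A sketch of the mini-batch middle ground, under the same assumptions as the previous snippet (BATCH_SIZE is a placeholder):

    // mini-batch SGD: one update per small batch instead of per instance
    for (_ <- 1 to EPOCHS; batch <- scala.util.Random.shuffle(data).grouped(BATCH_SIZE)) {
      val g = batch.map { case (x, y) => pointGradient(w, x, y) }.reduce(add)
      w = axpy(-eta / batch.size, g, w)   // apply the averaged batch gradient
    }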

Ensembles Source: Wikipedia (Orchestra)

Ensemble Learning
Learn multiple models, then combine their results to make a prediction. Why does it work? If errors are uncorrelated, it is unlikely that many classifiers are wrong at the same time; this reduces the variance component of the error. A variety of different combination techniques: majority voting, simple weighted voting, model averaging, … (see the sketch below)
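A sketch of simple weighted voting over an ensemble of linear classifiers (an illustration, reusing Vec and dot from the earlier sketch; pairing each weight vector with a voting weight is an assumption of this sketch):

    // each model is a weight vector plus a voting weight;
    // predict the sign of the weighted vote
    def vote(models: Seq[(Vec, Double)], x: Vec): Double =
      math.signum(models.map { case (w, weight) => weight * math.signum(dot(w, x)) }.sum)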

Practical Notes
Common implementation: train classifiers on different input partitions of the data. Embarrassingly parallel! Contrast with other ensemble techniques, e.g., boosting.

MapReduce Implementation
[diagram: training data split across mappers; each mapper runs a learner, producing one model per partition]

MapReduce Implementation
[diagram: as above, but with a reducer stage; mappers shuffle training instances to reducers, and each reducer runs a learner]

What about Spark? Use mapPartitions, with f: (Iterator[T]) ⇒ Iterator[U], to transform an RDD[T] into an RDD[U]: run the learner over each partition's iterator.
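A hedged sketch of that idea (assuming points: RDD[(Vec, Double)] and the helpers from the earlier snippets; DIM, EPOCHS, and eta are placeholders):

    // train one independent model per partition: the results form an ensemble
    val models = points.mapPartitions { iter =>
      val data = iter.toSeq
      var w: Vec = Array.fill(DIM)(0.0)
      for (_ <- 1 to EPOCHS; (x, y) <- scala.util.Random.shuffle(data))
        w = axpy(-eta, pointGradient(w, x, y), w)   // SGD within the partition
      Iterator.single(w)
    }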

MapReduce Implementation: Details
Two possible implementations: write the model out as "side data", or emit the model as intermediate output.

Case Study: Link Recommendation
[pipeline diagram: follow graph, retweet graph, other log data, … → candidate generation → candidates → classification (using a trained model) → final results]
Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012.

Classifier Training / Making Predictions: just like any other parallel Pig dataflow.
[diagrams: for training, a previous Pig dataflow produces (label, feature vector) pairs, which map/reduce stages feed into a Pig storage function that writes out the model; for predictions, a model-backed UDF maps feature vectors to predictions]

Classifier Training

    training = load 'training.txt' using SVMLightStorage()
      as (target: int, features: map[]);
    store training into 'model/' using FeaturesLRClassifierBuilder();

Logistic regression + SGD (L2 regularization); Pegasos variant (fully SGD or sub-gradient). Want an ensemble?

    training = foreach training generate label, features, RANDOM() as random;
    training = order training by random parallel 5;
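How the ensemble trick works, per the paper: RANDOM() tags each instance with a random key, the order ... by random shuffles the instances on that key, and parallel 5 spreads them across five reducers, so five models are trained independently, one per reducer.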

Making Predictions

    define Classify ClassifyWithLR('model/');
    data = load 'test.txt' using SVMLightStorage()
      as (target: double, features: map[]);
    data = foreach data generate target, Classify(features) as prediction;

Want an ensemble?

    define Classify ClassifyWithEnsemble('model/', 'classifier.LR', 'vote');

Sentiment Analysis Case Study (Lin and Kolcz, SIGMOD 2012)
Binary polarity classification: {positive, negative} sentiment. Independently interesting task; illustrates the end-to-end flow. Use the "emoticon trick" to gather labeled data.
Data: test set of 500k positive / 500k negative tweets from 9/1/2011; training sets of {1m, 10m, 100m} instances from before that date (50/50 split).
Features: sliding-window byte-4grams.
Models: logistic regression with SGD (L2 regularization); ensembles of various sizes (simple weighted voting).
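A sketch of the sliding-window byte-4gram featurization (an illustration of the idea only; the paper's actual feature representation details are not reproduced here):

    // every window of 4 consecutive bytes of the tweet becomes one feature
    def byte4grams(text: String): Seq[Seq[Byte]] =
      text.getBytes("UTF-8").toSeq.sliding(4).toSeq

For example, the windows of "happy" correspond to the byte sequences for "happ" and "appy".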

[chart: accuracy for a single classifier vs. ensembles built from 10m and 100m examples] Ensembles with 10m examples beat a single classifier trained on 100m! Diminishing returns… and the ensemble comes "for free".

Supervised Machine Learning
[diagram: training data → machine learning algorithm → model (training); the model is then applied to unseen instances (testing/deployment)]

Applied ML in Academia
Download interesting dataset (comes with the problem)
Run baseline model: train/test
Build better model: train/test
Does new model beat baseline? Yes: publish a paper! No: try again!

Three Commandments of Machine Learning: Thou shalt not mix training and testing data

Training/Testing Splits
[diagram: the data divided into a training portion and a held-out test portion] What happens if you need more? Cross-validation.

Training/Testing Splits: Cross-Validation
[series of diagrams: the held-out fold rotates through the data, so every partition serves as the test set exactly once]
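A sketch of k-fold splitting in the same Scala style as the earlier snippets (names are assumptions of this sketch):

    // produce (train, test) pairs: each chunk serves exactly once as the test fold
    def folds[T](data: Seq[T], k: Int): Seq[(Seq[T], Seq[T])] = {
      val chunkSize = math.ceil(data.size.toDouble / k).toInt
      val chunks = data.grouped(chunkSize).toSeq
      for (i <- chunks.indices)
        yield (chunks.patch(i, Nil, 1).flatten, chunks(i))
    }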

Applied ML in Academia (recap)
Download interesting dataset (comes with the problem)
Run baseline model: train/test
Build better model: train/test
Does new model beat baseline? Yes: publish a paper! No: try again!

Fantasy: extract features → develop cool ML technique → #Profit.
Reality: What's the task? Where's the data? What's in this dataset? What's all the f#$!* crap? Clean the data. Extract features. "Do" machine learning. Fail, iterate…

It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. – DJ Patil “Data Jujitsu” Source: Wikipedia (Jujitsu)

On finding things…

On finding things…
CamelCase, smallCamelCase, snake_case, camel_Snake, dunder__snake
uid, UserId, userId, userid, user_id, user_Id

On feature extraction…
An actual Java regular expression used to parse log messages at Twitter circa 2010:

    ^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ \\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+) \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* (?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+ (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? (\\s+[-\\w]+)?.*$

Friction is cumulative!

Data Plumbing… Gone Wrong!
[scene: consumer internet company in the Bay Area…]
Frontend Engineer: develops a new feature, adds logging code to capture clicks.
Data Scientist: analyzes user behavior, extracts insights to improve the feature.
Data Scientist: "Okay, let's get going… where's the click data?"
Frontend Engineer: "It's over here…"
Data Scientist: "Well, that's kinda non-intuitive, but okay… Oh, BTW, where's the timestamp of the click?"
Frontend Engineer: "Well, it wouldn't fit, so we had to shoehorn… Hang on, I don't remember…"
Frontend Engineer: "Uh, bad news. Looks like we forgot to log it…"
Data Scientist: "[grumble, grumble, grumble]"

Fantasy: extract features → develop cool ML technique → #Profit.
Reality: What's the task? Where's the data? What's in this dataset? What's all the f#$!* crap? Clean the data. Extract features. "Do" machine learning. Fail, iterate… Finally works!

Congratulations, you're halfway there… Source: Wikipedia (Hills)

Congratulations, you're halfway there… Does it actually work? (A/B testing) Is it fast enough? Good, you're two-thirds there…

Productionize Source: Wikipedia (Oil refinery)

Productionize (plumbing)
What are your jobs' dependencies? How/when are your jobs scheduled? Are there enough resources? How do you know if it's working? Who do you call if it stops working? Infrastructure is critical here!

Takeaway lesson: plumbing matters! Source: Wikipedia (Plumbing)

Questions? Source: Wikipedia (Japanese rock garden)