@delbrians Transfer Learning: Using the Data You Have, not the Data You Want. October, 2013 Brian d’Alessandro.

Presentation transcript:


Motivation

Personalized Spam Filter. Goal: you want to build a personalized SPAM filter. Problem: you have many features but relatively few labeled examples per user. And for new users, you have no examples at all!

Disease Prediction. Goal: predict whether a patient's EKG readings exhibit cardiac arrhythmia. Problem: the condition is generally rare and labeling is expensive. Building a training set that is large enough for a single patient is invasive and costly.

Ad Targeting. Goal: predict whether a person will convert after seeing an ad. Problem: training data costs a lot of money to acquire, conversion events are rare, and feature sets are large.

Common Denominator. Goal: classification on one or more tasks. Problem: little to no training data, but access to other potentially relevant data sources.

Definitions, Notation, Technical Jargon, etc.

Some notation… Source data: the data you have. Target data: the data you need. A Domain is a set of features and their corresponding distribution. A Task is a dependent variable and the model that maps X into Y.

Definition. Given source and target domains, and source and target tasks, transfer learning aims to improve the performance of the target predictive function using the knowledge from the source domain and task, where the source and target domains differ or the source and target tasks differ. Source: Pan and Yang, "A Survey on Transfer Learning", IEEE Transactions on Knowledge and Data Engineering, October 2010.
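In the survey's notation (restoring the formulas that appeared as images on the original slide), the domain, task, and transfer condition can be written as:

```latex
\mathcal{D} = \{\mathcal{X},\, P(X)\}
  \quad \text{(domain: feature space and its marginal distribution)}
\qquad
\mathcal{T} = \{\mathcal{Y},\, f(\cdot)\}
  \quad \text{(task: label space and predictive function)}
```

```latex
\text{Transfer learning improves } f_T(\cdot) \text{ using } \mathcal{D}_S, \mathcal{T}_S,
\quad \text{where} \quad
\mathcal{D}_S \neq \mathcal{D}_T \;\; \text{or} \;\; \mathcal{T}_S \neq \mathcal{T}_T .
```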

In other words… If you don't have enough of the data you need, don't give up: use something else to build a model! (diagram: knowledge flowing from Source to Target)

There's more… (figure: taxonomy of transfer learning settings) Source: "A Survey on Transfer Learning"

Alright, focus… Inductive Transfer: the supervised-learning subset of transfer learning, characterized by having labels in the target data. Source: "A Survey on Transfer Learning"

How does this change things?

Classic vs. Inductive Transfer Learning. Classic train/test: "same" data. Transfer train/test: different data. (diagram: a training layer feeding a testing layer) Inductive transfer learning follows the same train/test paradigm, only we dispense with the assumption that train and test are drawn from the same distribution.

Why does this Work?

The Bias-Variance Tradeoff. A biased model trained on a lot of the 'wrong' data is often better than a high-variance model trained on the 'right' data. Target: P(Y|X) = f(X1 + X2 - 1). Source: P(Y|X) = f(1.1*X1 + 0.9*X2). (plot) Trained on 100 examples of 'right' data: AUC = 0.786. Trained on 100k examples of 'wrong' data: AUC = 0.804. Ground truth: AUC = 0.806.
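The experiment on this slide can be reproduced in miniature. The sketch below is my own (all names and constants are made up, and it uses more features than the slide's two-feature example so the variance effect is visible): one logistic regression is trained on a little correctly-labeled data, another on a lot of data from a slightly perturbed concept, and both are scored by AUC on the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.5, iters=300):
    """Logistic regression via plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def auc(y, scores):
    """Rank-based AUC (continuous scores, so ties are negligible)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

d = 20
w_right = rng.normal(size=d)
w_right *= 2.0 / np.linalg.norm(w_right)              # moderate signal strength
w_wrong = w_right * (1 + 0.1 * rng.normal(size=d))    # a slightly 'wrong' concept

def sample(w, n):
    X = rng.normal(size=(n, d))
    y = (rng.random(n) < sigmoid(X @ w)).astype(float)
    return X, y

X_small, y_small = sample(w_right, 100)     # little 'right' data
X_big, y_big = sample(w_wrong, 20000)       # lots of 'wrong' data
X_test, y_test = sample(w_right, 50000)     # target-distribution test set

auc_small = auc(y_test, X_test @ fit_logreg(X_small, y_small))
auc_big = auc(y_test, X_test @ fit_logreg(X_big, y_big))
print(f"100 'right' examples: AUC = {auc_small:.3f}")
print(f"20k 'wrong' examples: AUC = {auc_big:.3f}")
```

With the fixed seed, the model trained on abundant 'wrong' data should rank test examples better than the one starved of 'right' data, mirroring the slide's AUC gap.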

Transfer Learning in Action

TL in Action: Multi-Task Learning for SPAM Detection. Source: Weinberger, Dasgupta, Langford, Smola (Yahoo! Research), "Feature Hashing for Large Scale Multitask Learning", Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

Multi-task Learning. Multitask learning involves joint optimization over several related tasks that have the same or similar features. Intuition: let individual tasks borrow information from each other. Common approaches: learn joint and task-specific features; hierarchical Bayesian methods; learn a joint subspace over model parameters. (table: pooled training examples with a Task column alongside the label Y and features X_1 … X_k)

SPAM Detection - Method. Learn a user-level SPAM predictor: predict whether an email is SPAM or not SPAM. Methodology: 1. Pool users. 2. Transform features into binary term features. 3. Create user-term interaction features. 4. Hash features for scalability. 5. Learn the model.
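Steps 3 and 4 can be sketched as follows. This is a minimal illustration of the signed hashing trick described in the cited paper, not the authors' code; the bucket count, key format, and function names are made up for the example.

```python
import zlib

def hashed_features(user_id, terms, n_buckets=2**18):
    """Hash global term features AND user-term interaction features
    into one shared fixed-size space (the 'hashing trick')."""
    vec = {}
    for term in terms:
        # one global (pooled) key shared by all users, plus one
        # user-term interaction key personalizing the model
        for key in (term, f"{user_id}^{term}"):
            h = zlib.crc32(key.encode("utf-8"))   # deterministic 32-bit hash
            idx = h % n_buckets
            sign = 1 if (h >> 31) & 1 == 0 else -1  # signed hashing keeps collisions unbiased
            vec[idx] = vec.get(idx, 0) + sign
    return vec

x = hashed_features("user42", ["viagra", "meeting", "free"])
print(len(x), "non-zero entries out of", 2**18, "buckets")
```

Because every user's interaction features land in the same fixed-size space, one weight vector serves all tasks and memory does not grow with the number of users.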

SPAM Detection - Performance. MTL makes individual predictions more accurate; the hashing trick makes MTL scalable. Result: a 33% reduction in spam misclassification. (graph) Source: "Feature Hashing for Large Scale Multitask Learning"

TL in Action: Bayesian Transfer for Online Advertising

The Non-Branded Web, Conversion/Brand Actions. A consumer's online activity… gets recorded like this: (table: UserID, label Y, and binary URL-visit features URL_1 … URL_k)

Two Sources of Data. We collect data via different data streams… Ad Serving (Target): for every impression served, we track user features and whether the user converted after seeing an ad. General Web Browsing (Source): by pixeling the client's web site, we can see who interacts with the site independent of advertising. (two UserID-by-URL-feature tables; the ad-serving data is the expensive one)

The Challenges. Because conversion rates are so low (or don't exist prior to the campaign), getting enough impression training data is expensive or impossible.

Transfer Learning. Intuition: the auxiliary model is biased but much more reliable; use this to inform the real model. The algorithm can learn how much of the auxiliary data to use. Assume the target model, fit on the Ad Serving (Target) data, is high variance, while the source model, fit on the General Web Browsing (Source) data, is biased but stable. Solution: use the source model as a regularization prior for the target model.

Example Algorithm. 1. Run a logistic regression on the source data (using your favorite methodology). 2. Use the results of step 1 as an informative prior for the target model.
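A minimal sketch of this two-step recipe, assuming an L2 penalty centered on the source coefficients (the exact penalty used in the talk is not shown in the transcript); all data and constants below are synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, prior=None, lam=0.0, lr=0.1, iters=500):
    """Logistic regression by gradient descent, with an optional
    Gaussian (L2) prior centered on `prior` instead of on zero."""
    w = np.zeros(X.shape[1]) if prior is None else prior.copy()
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        if prior is not None:
            grad += lam * (w - prior)   # pull the fit toward the source coefficients
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
d = 5
w_true = rng.normal(size=d)
w_biased = w_true * (1 + 0.1 * rng.normal(size=d))  # the 'wrong' source concept

# Step 1: plentiful source data, slightly biased concept.
X_s = rng.normal(size=(5000, d))
y_s = (rng.random(5000) < sigmoid(X_s @ w_biased)).astype(float)
w_source = fit_logreg(X_s, y_s)

# Step 2: scarce target data, regularized toward the source model.
X_t = rng.normal(size=(80, d))
y_t = (rng.random(80) < sigmoid(X_t @ w_true)).astype(float)
w_target = fit_logreg(X_t, y_t, prior=w_source, lam=1.0)
print("source:", np.round(w_source, 2))
print("target:", np.round(w_target, 2))
```

The regularization weight `lam` controls the amount of transfer: at zero the target data stands alone; as it grows, the fit collapses onto the source coefficients.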

Breaking it Down. Start from the standard log-likelihood for logistic regression. Prior knowledge from the source model is transferred via regularization. The amount of transfer is determined by the regularization weight.
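The log-likelihood on this slide appeared as an image and was lost in transcription; a standard way to write the regularized objective described here, with λ the regularization weight and β_S the source-model coefficients, is:

```latex
\max_{\beta}\;
\sum_{i=1}^{n}\Big[\, y_i \log \sigma(\beta^{\top} x_i)
  + (1 - y_i)\,\log\!\big(1 - \sigma(\beta^{\top} x_i)\big) \Big]
\;-\; \lambda\, \lVert \beta - \beta_S \rVert_2^{2}
```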

Performance. On average, combining the source and target models outperforms using either one by itself. Transfer learning is also very robust to the choice of regularization weight. (graph)

Summary

Key Questions: What to transfer? How to transfer? When to transfer?

Thinking in Terms of ROI. "Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets." (Provost and Fawcett, Data Science for Business)

Thinking in Terms of ROI. Transfer learning is another tool for extracting better ROI from existing data assets.

"All models are wrong… some are useful." - George E. P. Box