Machine Learning & Data Mining CS/CNS/EE 155 Lecture 4: Recent Applications of Lasso

Today: Two Recent Applications
1. Cancer Detection
2. Personalization via Twitter
Both are applications of Lasso (and related methods). Think about the data & modeling goals; each introduces some new learning problems. Slide material borrowed from Rob Tibshirani and Khalid El-Arini.

Aside: Convexity
Convex objectives are easy to minimize: it is easy to find global optima! A function is strictly convex if its second derivative is always > 0.
[Figure: a convex function vs. a non-convex function with multiple local optima.]

Aside: Convexity
- All local optima are global optima.
- Strictly convex: unique global optimum.
- Almost all objectives discussed in this course are (strictly) convex: SVMs, LR, Ridge, Lasso… (except ANNs).
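For reference, the definition behind these claims (the standard one; the slide's version was an image): f is convex iff

```latex
f\big(t\,x_1 + (1-t)\,x_2\big) \;\le\; t\,f(x_1) + (1-t)\,f(x_2)
\qquad \forall\, x_1, x_2,\;\; t \in [0,1].
```

Strict convexity requires the inequality to be strict for x_1 ≠ x_2 and t ∈ (0,1), which is what guarantees a unique global minimum.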

Cancer Detection

Gastric (Stomach) Cancer
“Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging”
Livia S. Eberlin, Robert Tibshirani, Jialing Zhang, Teri Longacre, Gerald Berry, David B. Bingham, Jeffrey Norton, Richard N. Zare, and George A. Poultsides
Proceedings of the National Academy of Sciences (2014)
Surgical procedure:
1. Surgeon removes tissue.
2. Pathologist examines tissue under a microscope.
3. If no clear margin, GOTO Step 1.

Gastric (Stomach) Cancer
Drawbacks of the manual procedure above:
- Expensive: requires a pathologist.
- Slow: examination can take up to an hour.
- Unreliable: in 20%-30% of cases the pathologist can't predict on the spot.

Machine Learning to the Rescue!
(Actually just statistics: Lasso originated in the statistics community, but we machine learners love it!)
Train a model to predict cancerous regions!
- Y = {C, E, S} (how do we predict 3 possible labels?)
- What is X?
- What is the loss function?
Basic Lasso (reconstruction below):
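The formula itself was an image in the original slide; this is the standard Lasso regression objective from Lecture 3:

```latex
\operatorname*{argmin}_{w,\,b} \;\sum_{i=1}^{N} \big( y_i - w^\top x_i - b \big)^2 \;+\; \lambda\,\| w \|_1
```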

Mass Spectrometry Imaging
DESI-MSI (Desorption Electrospray Ionization Mass Spectrometry Imaging)
Effectively runs in real time (used to generate x).

[Figure: tissue image. Each pixel is a data point: x via spectroscopy, y via cell-type label.]

[Figure: each pixel has ~11K spectral features; visualizing a few of them.]

Multiclass Prediction
- Multiclass y (one of K possible labels).
- Most common model: replicate the weights, keeping one (w_k, b_k) per class; score all classes; predict via the largest score (equation reconstructed below).
- Loss function?
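The slide's scoring equations were images; the standard form of the prediction rule is:

```latex
f(x \mid w, b) \;=\; \operatorname*{argmax}_{k} \;\; w_k^\top x + b_k
```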

Multiclass Logistic Regression
(Referred to as multinomial log-likelihood by Tibshirani.)
Binary LR has a "log linear" property; the extension to multiclass keeps a (w_k, b_k) for each class. In the binary case, (w_1, b_1) = (-w_{-1}, -b_{-1}), so a single weight vector suffices. (Model reconstructed below.)
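A reconstruction of the multiclass LR model the slide showed as an image (the softmax form is standard):

```latex
P(y = k \mid x) \;=\; \frac{e^{\,w_k^\top x + b_k}}{\sum_{m} e^{\,w_m^\top x + b_m}}
```

With two classes and (w_1, b_1) = (-w_{-1}, -b_{-1}), this reduces to the familiar binary LR sigmoid.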

Multiclass Log Loss
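The formula on this slide was an image; the standard multiclass log loss (negative log-likelihood under the model above) is:

```latex
L(y_i, x_i) \;=\; -\log P(y_i \mid x_i)
\;=\; \log \sum_{m} e^{\,w_m^\top x_i + b_m} \;-\; \big( w_{y_i}^\top x_i + b_{y_i} \big)
```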

Multiclass Log Loss
Suppose x = 1 and ignore b, so the model score for class k is just w_k. Vary one weight while holding the others at 1.
[Figure: log loss as a function of the varied weight, for y = k and y ≠ k.]

Lasso Multiclass Logistic Regression
- Probabilistic model
- Sparse weights
(See the scikit-learn sketch below.)
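A minimal sketch of L1-regularized multinomial logistic regression with scikit-learn. This is not from the original slides; the toy data is a stand-in (the real pixels have ~11K mass-spec features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real data: one row per pixel, cell-type labels C/E/S.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
y = rng.choice(["C", "E", "S"], size=100)

# penalty="l1" gives sparse weights; the "saga" solver supports L1 with a
# multinomial (softmax) objective. C is the inverse regularization strength.
model = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
model.fit(X, y)
print("nonzero weights per class:", (model.coef_ != 0).sum(axis=1))
```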

Back to the Problem
Image the tissue samples. Each pixel is an x:
- 11K features via mass spec
- Computable in real time
- 1 prediction per pixel
y via lab results (~2 weeks turn-around).
[Figure: visualization of all pixels for one feature.]

Learn a Predictive Model
- Training set: 28 tissue samples from 14 patients; cross-validation to select λ (sketch below).
- Test set: 21 tissue samples from 9 patients.
- Test performance (with a ≥ 0.2 margin in probability): [Table: test performance.]
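A hedged sketch of selecting λ by cross-validation, continuing the toy X, y from the previous sketch. The grid and fold count are placeholders; the paper's actual protocol would presumably group folds by patient:

```python
from sklearn.linear_model import LogisticRegressionCV

# Cs=10 tries 10 values of C = 1/lambda on a log scale; cv=5 folds.
# (Plain K-fold shown here; grouping by patient would be more faithful.)
cv_model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="saga",
                                max_iter=5000)
cv_model.fit(X, y)
print("selected C per class:", cv_model.C_)
```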

Lasso yields sparse weights! (Manual inspection feasible!)
With many correlated features, Lasso tends to focus on one of them.

Extension: Local Linearity
A single linear model assumes probability shifts along a straight line, which is often not true.
Approach: cluster based on x, then train a customized model for each cluster (see the sketch below).
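A hypothetical sketch of the cluster-then-fit idea, continuing the toy X, y from above (cluster count and routing are illustrative choices, not the paper's):

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Partition pixels by their features, then fit one sparse model per cluster.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
models = {}
for c in range(5):
    mask = kmeans.labels_ == c
    m = LogisticRegression(penalty="l1", solver="saga", max_iter=5000)
    models[c] = m.fit(X[mask], y[mask])  # assumes each cluster sees >1 class

def predict(x_new):
    # Route a new pixel to its cluster's customized model.
    c = int(kmeans.predict(x_new.reshape(1, -1))[0])
    return models[c].predict(x_new.reshape(1, -1))[0]
```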

Recap: Cancer Detection
Seems awesome! What's the catch?
- Small sample size: tested on only 9 patients.
- Machine learning is only part of the solution: need infrastructure investment, analysis of the scientific legitimacy, etc.
- Social/political/legal questions: if there is a mis-prediction, who is at fault?

Personalization via Twitter

Overloaded by News
≥ 1 million news articles & blog posts are generated every hour.
“Representing Documents Through Their Readers”
Khalid El-Arini, Min Xu, Emily Fox, Carlos Guestrin
Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (2013)

News Recommendation Engine
[Diagram: a corpus of articles is mapped to a vector representation (bag of words, LDA topics, etc.), which drives recommendations to a user.]


Challenge
Most common representations don't naturally line up with user interests:
- Fine-grained representations (bag of words): too specific.
- High-level topics (e.g., from a topic model): too fuzzy and/or vague; can be inconsistent over time.

Goal
Improve recommendation performance through a more natural document representation.

An Opportunity: News is Now Social
In 2012, the Guardian announced that more readers visit its site via Facebook than via Google search.

Badges
[Figure: example Twitter profiles; the terms users choose to describe themselves are “badges”.]

Approach
Learn a document representation based on how readers publicly describe themselves.


Using many tweets, can we learn, via profile badges, that someone who identifies with music reads articles with certain words?
[Figure: the music badge linked to a word cloud of article words.]

Given: a training set of tweeted news articles from a specific period of time (3 million articles).
1. Learn a badge dictionary from the training set.
2. Use the badge dictionary to encode new articles.
[Figure: badge dictionary matrix indexed by badges (e.g., music) and words.]

Advantages
- Interpretable: clear labels; correspond to user interests.
- Higher-level than words.
- Semantically consistent over time. [Figure: the politics badge over time.]



Dictionary Learning
Training data: for each tweeted article, the bag-of-words representation of the document, paired with a vector identifying the badges in the Twitter profile of the tweeter. Normalized!
[Figure: example article words (Fleetwood Mac, Nicks, love, album) and tweeter badges (linux, music, gig, cycling).]

Dictionary Learning
Training objective: jointly fit the “dictionary” B and the “encoding” W, where y_j is the bag-of-words representation of document j and w_j (column j of W) identifies the badges of its tweeter. (Reconstruction below.)
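The objective on the slide was an image; this is a reconstruction of the general sparse-coding shape. The exact regularizers here are an assumption (the paper's actual penalty is the graph-guided one two slides below):

```latex
\min_{B,\,W} \;\sum_{j} \big\| y_j - B\,w_j \big\|_2^2 \;+\; \lambda \sum_{j} \| w_j \|_1
```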

The objective is not convex! (because of the BW term)
- It is convex if we optimize only B or only W (but not both).
- Alternating optimization between B and W (sketch below).
- How to initialize? Initialize the encodings W from the tweeters' profile badges (e.g., linux, music, gig, cycling), then alternate.
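A hedged sketch of the alternating scheme. This is hypothetical helper code, not the paper's implementation: plain L1 (with a nonnegativity assumption) stands in for the paper's graph-guided penalty:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def alternate(Y, W0, n_iters=10, lam=0.1):
    """Y: (n_words, n_docs) word vectors; W0: (n_badges, n_docs) badge indicators."""
    W = W0.copy()
    for _ in range(n_iters):
        # Fix W, fit the dictionary B by (ridge-regularized) least squares.
        B = Ridge(alpha=1e-3, fit_intercept=False).fit(W.T, Y.T).coef_
        # Fix B, re-fit each document's sparse encoding: a Lasso problem.
        lasso = Lasso(alpha=lam, fit_intercept=False, positive=True)
        W = np.column_stack([lasso.fit(B, Y[:, j]).coef_
                             for j in range(Y.shape[1])])
    return B, W
```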

Graph Guided Fused Lasso
Suppose badge s often co-occurs with badge t; then B_s and B_t are correlated. From the perspective of W, the columns of B are features, and Lasso tends to focus on just one of a group of correlated features. But many articles might be about Gig, Festival & Music simultaneously.
Fix: build a graph G of related badges from their co-occurrence rate on Twitter profiles, and penalize differences between the encodings of connected badges (see below).
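A reconstruction of the standard form of the graph-guided fused Lasso penalty; any per-edge weighting by co-occurrence strength is an assumption here:

```latex
\lambda \sum_{(s,t) \in G} \big| w_s - w_t \big|
```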

Encoding New Articles
The badge dictionary B is already learned. Given a new document j with word vector y_j, learn its badge encoding w_j by solving the same sparse regression with B held fixed: a single Lasso problem (sketch below).
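A minimal sketch, assuming the plain-L1 stand-in from the alternating-optimization sketch above (the alpha value is a placeholder):

```python
from sklearn.linear_model import Lasso

def encode(B, y_new, lam=0.1):
    # Solve min_w ||y_new - B w||^2 + lam * ||w||_1 with the dictionary B fixed.
    lasso = Lasso(alpha=lam, fit_intercept=False, positive=True)
    return lasso.fit(B, y_new).coef_  # sparse badge encoding of the new article
```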

Recap: Badge Dictionary Learning
1. Learn a badge dictionary from the training set.
2. Use the badge dictionary to encode new articles.

Examining B
[Figure: top dictionary words for example badges: music, soccer, Labour, Biden.]

Badges Over Time
[Figure: dictionary words for the music and Biden badges in September 2010 vs. a later period.]

A Spectrum of Pundits
Limit badges to progressive and TCOT ("top conservatives on Twitter"). Can we predict the political alignments of a columnist's likely readers? Took all articles by each columnist, looked at the encoding score for progressive vs. TCOT, and averaged.
[Figure: columnists arranged on a spectrum, more conservative to the right.]

User Study
Which representation best captures user preferences over time? Study on Amazon Mechanical Turk with 112 users:
1. Show users 20 random articles from the Guardian, from time period 1, and obtain ratings.
2. Pick a random representation (bag of words, high-level topics, badges).
3. Represent user preferences as the mean of liked articles.
4. GOTO the next time period: recommend according to the learned preferences, then GOTO Step 2.

User Study
[Figure: recommendation performance for high-level topics, bag of words, and badges; higher is better. Badges perform best.]

Recap: Personalization via Twitter
Sparse dictionary learning:
- Learn a new representation of articles; encode articles using the dictionary.
- Better than bag of words; better than high-level topics.
Based on social data:
- Badges on Twitter profiles & tweeting behavior.
- Captures semantics not directly evident from the text alone.

Next Week
- Sequence Prediction
- Hidden Markov Models
- Conditional Random Fields
Homework 1 due Tuesday, via Moodle.