
1 COMP9313: Big Data Management Lecturer: Xin Cao Course web site: http://www.cse.unsw.edu.au/~cs9313/

2 11.2 Chapter 11: Large-scale Machine Learning

3 11.3 What is Machine Learning? “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom Mitchell (1997) Example: a program for football games. T: win the game. P: goals. E: (x) players’ movements, (y) evaluation.

4 11.4 Machine Learning Traditional programming: Data + Program → Computer → Output. Machine learning: Data + Output → Computer → Program.

5 11.5 Magic? No, more like gardening Seeds = Algorithms Nutrients = Data Gardener = You Plants = Programs

6 11.6 Types of Learning Supervised (inductive) learning Training data includes desired outputs Unsupervised learning Training data does not include desired outputs Semi-supervised learning Training data includes a few desired outputs Reinforcement learning Rewards from sequence of actions

7 11.7 ML in Practice Understanding domain, prior knowledge, and goals; data integration, selection, cleaning, pre-processing, etc.; learning models; interpreting results; consolidating and deploying discovered knowledge; the whole process loops back as needed.

8 11.8 Supervised Learning Would like to do prediction: estimate a function f(x) so that y = f(x), where y can be: a real number (regression), categorical (classification), or a complex object (ranking of items, parse tree, etc.). Data is labeled: we have many pairs {(x, y)}, where x is a vector of binary, categorical, or real-valued features, and y is a class ({+1, -1}, or a real number). Training and test set: estimate y = f(x) on (X, Y), and hope that the same f(x) also works on the unseen (X', Y').

9 11.9 Supervised Learning Idea: pretend we do not know the data/labels we actually do know. Build the model f(x) on the training data; see how well f(x) does on the test data. If it does well, then apply it also to X'. Refinement: cross validation. A single split into training/validation sets is crude, so split the data (X, Y) into 10 folds (buckets): take out 1 fold for validation, train on the remaining 9, repeat this 10 times, and report average performance.
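To make the 10-fold procedure concrete, here is a minimal sketch in plain Python. The `train` and `evaluate` callables are hypothetical placeholders supplied by the caller, not anything from the slides:

```python
import random

def cross_validate(X, Y, train, evaluate, k=10, seed=0):
    """Split (X, Y) into k folds; train on k-1 folds, validate on the held-out
    fold; repeat k times and report the average score. `train(X, Y) -> model`
    and `evaluate(model, X, Y) -> score` are supplied by the caller."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal buckets
    scores = []
    for i in range(k):
        held = set(folds[i])
        tr = [j for j in idx if j not in held]     # train on the other 9 folds
        model = train([X[j] for j in tr], [Y[j] for j in tr])
        scores.append(evaluate(model,
                               [X[j] for j in folds[i]],
                               [Y[j] for j in folds[i]]))
    return sum(scores) / k                         # average performance
```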

10 11.10 Large Scale Machine Learning We will talk about the following two problems: Recommender Systems Support Vector Machines Main question: How to efficiently train (build a model/find model parameters)?

11 Part 1: Recommender Systems

12 11.12 Recommender Systems Application areas

13 11.13 Why use Recommender Systems? Value for the customer: find things that are interesting; narrow down the set of choices; help me explore the space of options; discover new things; entertainment; … Value for the provider: additional and probably unique personalized service for the customer; increase trust and customer loyalty; increase sales, click-through rates, conversion, etc.; opportunities for promotion, persuasion; obtain more knowledge about customers; …

14 11.14 Real-world Check Myths from industry Amazon.com generates X percent of their sales through the recommendation lists (30 < X < 70) Netflix (DVD rental and movie streaming) generates X percent of their sales through the recommendation lists (30 < X < 70)

15 11.15 Recommender systems RS seen as a function. Given: a user model (e.g., ratings, preferences, demographics, situational context) and items (with or without descriptions of item characteristics). Find: a relevance score, used for ranking. Finally: recommend items that are assumed to be relevant. But: remember that relevance might be context-dependent, and characteristics of the list itself might be important (diversity).

16 11.16 Formal Model X = set of customers, S = set of items. Utility function u: X × S → R, where R = set of ratings; R is a totally ordered set, e.g., 0-5 stars, or a real number in [0,1]. Utility matrix: rows are users (Alice, Bob, Carol, David), columns are items (Avatar, LOTR, Matrix, Pirates), entries are the known ratings; most entries are unknown.
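A minimal sketch of how such a sparse utility matrix might be stored in practice. The rating values below are invented placeholders (the actual entries on the slide were in an image and are not recoverable):

```python
# Sparse storage: only known ratings are kept, keyed by (user, item).
utility = {
    ("Alice", "Avatar"): 4, ("Alice", "Matrix"): 5,
    ("Bob", "LOTR"): 5, ("Bob", "Pirates"): 3,
    ("Carol", "Matrix"): 2,
    ("David", "Avatar"): 3, ("David", "LOTR"): 4,
}

def u(x, s):
    """Utility function u: X x S -> R; None marks an unknown rating."""
    return utility.get((x, s))

print(u("Alice", "Matrix"))   # 5
print(u("Alice", "Pirates"))  # None -- the kind of entry we want to extrapolate
```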

17 11.17 Key Problems Gathering “known” ratings for matrix How to collect the data in the utility matrix Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings  We are not interested in knowing what you don’t like but what you like Evaluating extrapolation methods How to measure success/performance of recommendation methods

18 11.18 Gathering Ratings Explicit Ask people to rate items Doesn’t work well in practice – people can’t be bothered Implicit Learn ratings from user actions  E.g., purchase implies high rating What about low ratings?

19 11.19 Paradigms of recommender systems Recommender systems reduce information overload by estimating relevance

20 11.20 Paradigms of recommender systems Personalized recommendations

21 11.21 Paradigms of recommender systems Collaborative: "Tell me what's popular among my peers"

22 11.22 Paradigms of recommender systems Content-based: "Show me more of what I've liked"

23 11.23 Paradigms of recommender systems Knowledge-based: "Tell me what fits based on my needs"

24 11.24 Paradigms of recommender systems Hybrid: combinations of various inputs and/or composition of different mechanism

25 11.25 Recommender systems: basic techniques
Content-based | Pros: no community required, comparison between items possible | Cons: content descriptions necessary, cold start for new users, no surprises
Collaborative | Pros: no knowledge-engineering effort, serendipity of results, learns market segments | Cons: requires some form of rating feedback, cold start for new users and new items
Knowledge-based | Pros: deterministic recommendations, assured quality, no cold start, can resemble a sales dialogue | Cons: knowledge-engineering effort to bootstrap, basically static, does not react to short-term trends

26 11.26 Content-based Recommendations Main idea: Recommend items to customer x similar to previous items rated highly by x What do we need: Some information about the available items such as the genre ("content") Some sort of user profile describing what the user likes (the preferences) Example: Movie recommendations:  Recommend movies with same actor(s), director, genre, … Websites, blogs, news:  Recommend other sites with “similar” content

27 11.27 Plan of Action: build item profiles from the items the user likes, build a user profile from those item profiles, match the user profile against candidate item profiles, and recommend the best matches. [diagram: likes → item profiles → build → user profile → match → recommend]

28 11.28 Item Profiles For each item, create an item profile Profile is a set (vector) of features Movies: author, title, actor, director,… Text: Set of “important” words in document How to pick important features? Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)  Term … Feature  Document … Item

29 11.29 Term Frequency - Inverse Document Frequency (TF-IDF) TF_ij = f_ij / max_k f_kj (frequency of term i in document j, normalised by the most frequent term in j); IDF_i = log(N / n_i) (N documents in total, n_i of which mention term i); TF-IDF score w_ij = TF_ij × IDF_i. The document profile is the set of words with the highest TF-IDF scores, together with their scores.
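A small sketch of that TF-IDF heuristic, using log base 2 and per-document normalisation by the most frequent term as defined above:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} per doc."""
    N = len(docs)
    # Document frequency: in how many docs does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_f = max(counts.values())               # most frequent term in doc
        scores.append({t: (f / max_f) * math.log2(N / df[t])
                       for t, f in counts.items()})
    return scores

docs = [["big", "data", "big"], ["data", "management"], ["svm", "margin"]]
print(round(tf_idf(docs)[0]["big"], 3))  # 1.0 * log2(3/1) = 1.585
```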

30 11.30 User Profiles and Prediction User profile: a (possibly rating-weighted) average of the profiles of the items the user has rated. Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||), and recommend the items with the highest scores.
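A sketch of profile building and cosine-based prediction under those definitions; `user_profile`, `cosine`, and the toy feature vectors are illustrative, not from the slides:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature->weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def user_profile(rated_item_profiles):
    """Average the profiles of the items the user rated."""
    profile = {}
    for item in rated_item_profiles:
        for feat, w in item.items():
            profile[feat] = profile.get(feat, 0.0) + w / len(rated_item_profiles)
    return profile

item_a = {"scifi": 1.0, "space": 0.8}
item_b = {"scifi": 0.9, "romance": 0.7}
profile = user_profile([item_a, item_b])
# Score an unseen candidate item against the user profile:
print(round(cosine(profile, {"scifi": 1.0, "space": 0.5}), 2))  # 0.94
```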

31 11.31 Pros: Content-based Approach +: No need for data on other users No cold-start or sparsity problems +: Able to recommend to users with unique tastes +: Able to recommend new & unpopular items No first-rater problem +: Able to provide explanations Can provide explanations of recommended items by listing content-features that caused an item to be recommended

32 11.32 Cons: Content-based Approach –: Finding the appropriate features is hard E.g., images, movies, music –: Recommendations for new users How to build a user profile? –: Overspecialization Never recommends items outside user’s content profile People might have multiple interests Unable to exploit quality judgments of other users

33 11.33 Collaborative Filtering Consider user x. Find a set N of other users whose ratings are “similar” to x’s ratings. Estimate x’s ratings based on the ratings of the users in N.

34 11.34 User-based Nearest-Neighbor Collaborative Filtering The basic technique: given an "active user" (Alice) and an item i not yet seen by Alice, the goal is to estimate Alice's rating for this item, e.g., by: find a set of users (peers) who liked the same items as Alice in the past and who have rated item i; use, e.g., the average of their ratings to predict whether Alice will like item i; do this for all items Alice has not seen and recommend the best-rated.
        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1

35 11.35 User-based Nearest-Neighbor Collaborative Filtering (Cont’) Some first questions: How do we measure similarity? How many neighbors should we consider? How do we generate a prediction from the neighbors' ratings? (Ratings table as on the previous slide.)

36 11.36 Finding “Similar” Users Consider users x and y with rating vectors r_x = [*, _, _, *, ***] and r_y = [*, _, **, **, _]. As sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}. As points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]. r̄_x, r̄_y … avg. ratings of x and y.

37 11.37 Similarity Metric Intuitively we want: sim(A, B) > sim(A, C). Jaccard similarity gives 1/5 < 2/4, which fails because it ignores the rating values. Cosine similarity, cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||), gives 0.380 > 0.322, which is better, but it treats missing ratings as “negative” (zero). Solution: subtract the (row) mean first; then sim(A, B) vs. sim(A, C) is 0.092 > -0.559. Notice that cosine similarity is correlation when the data is centered at 0.
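The following sketch reproduces every number quoted above. The three users' ratings are an assumption taken from the textbook version of this example (the slide's table was an image), but since the code reproduces 1/5 vs 2/4, 0.380 vs 0.322, and 0.092 vs -0.559, they appear to be the intended values:

```python
import math

# Assumed ratings as {item: stars}, following the MMDS textbook example.
A = {1: 4, 4: 5, 5: 1}
B = {1: 5, 2: 5, 3: 4}
C = {4: 2, 5: 4, 6: 5}

def jaccard(x, y):
    return len(x.keys() & y.keys()) / len(x.keys() | y.keys())

def cosine(x, y):
    dot = sum(x[i] * y.get(i, 0) for i in x)     # missing ratings act as 0
    return dot / (math.sqrt(sum(v * v for v in x.values())) *
                  math.sqrt(sum(v * v for v in y.values())))

def centered(x):
    mean = sum(x.values()) / len(x)
    return {i: v - mean for i, v in x.items()}

print(jaccard(A, B), jaccard(A, C))                    # 0.2 vs 0.5
print(round(cosine(A, B), 3), round(cosine(A, C), 3))  # 0.380 vs 0.322
print(round(cosine(centered(A), centered(B)), 3),      # 0.092
      round(cosine(centered(A), centered(C)), 3))      # -0.559
```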

38 11.38 Similarity Metric (Cont’) A popular similarity measure in user-based CF: Pearson correlation, sim(a, b) = Σ_p (r_a,p − r̄_a)(r_b,p − r̄_b) / ( √Σ_p (r_a,p − r̄_a)² · √Σ_p (r_b,p − r̄_b)² ), summing over the items p rated by both a and b. Possible similarity values are between -1 and 1. For the ratings table on slide 34: sim(Alice, User1) = 0.85, sim(Alice, User2) = 0.70, sim(Alice, User4) = -0.79.
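A sketch that recomputes these similarities from the slide-34 table, taking means over co-rated items. It reproduces 0.85 and -0.79 exactly; sim(Alice, User2) comes out 0.71 (the slide rounds to 0.70), and sim(Alice, User3), not shown on the slide, comes out 0.00:

```python
import math

ratings = {
    "Alice": {1: 5, 2: 3, 3: 4, 4: 4},
    "User1": {1: 3, 2: 1, 3: 2, 4: 3, 5: 3},
    "User2": {1: 4, 2: 3, 3: 4, 4: 3, 5: 5},
    "User3": {1: 3, 2: 3, 3: 1, 4: 5, 5: 4},
    "User4": {1: 1, 2: 5, 3: 5, 4: 2, 5: 1},
}

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = ratings[a].keys() & ratings[b].keys()
    ma = sum(ratings[a][i] for i in common) / len(common)
    mb = sum(ratings[b][i] for i in common) / len(common)
    num = sum((ratings[a][i] - ma) * (ratings[b][i] - mb) for i in common)
    den = (math.sqrt(sum((ratings[a][i] - ma) ** 2 for i in common)) *
           math.sqrt(sum((ratings[b][i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

for u in ["User1", "User2", "User3", "User4"]:
    print(u, round(pearson("Alice", u), 2))   # 0.85, 0.71, 0.0, -0.79
```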

39 11.39 Rating Predictions A common prediction function: pred(a, p) = r̄_a + Σ_{b∈N} sim(a, b) · (r_b,p − r̄_b) / Σ_{b∈N} |sim(a, b)|, i.e., start from the active user's average rating and add the similarity-weighted deviations of the neighbors' ratings for item p from their own averages.
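Continuing the previous snippet (it assumes the `ratings` dict and `pearson` defined there), a sketch of this prediction formula. With User1 and User2 as Alice's peers it predicts about 4.87 for Item5, matching the usual worked version of this example:

```python
def predict(a, item, neighbours):
    """Active user's mean plus similarity-weighted neighbour deviations."""
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num = den = 0.0
    for b in neighbours:
        sim = pearson(a, b)
        mean_b = sum(ratings[b].values()) / len(ratings[b])
        num += sim * (ratings[b][item] - mean_b)
        den += abs(sim)
    return mean_a + num / den

# Using the two most similar users as peers:
print(round(predict("Alice", 5, ["User1", "User2"]), 2))  # 4.87
```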

40 11.40 Memory-based and Model-based Approaches User-based CF is said to be "memory-based": the rating matrix is directly used to find neighbors / make predictions; this does not scale for most real-world scenarios (large e-commerce sites have tens of millions of customers and millions of items). Model-based approaches: based on an offline pre-processing or "model-learning" phase; at run-time, only the learned model is used to make predictions; models are updated / re-trained periodically; a large variety of techniques is used; model-building and updating can be computationally expensive.

41 11.41 Item-Item Collaborative Filtering So far: user-user collaborative filtering. Another view: item-item. Basic idea: use the similarity between items (and not users) to make predictions. For item i, find other similar items; estimate the rating for item i based on the ratings for similar items: r_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij. Can use the same similarity metrics and prediction functions as in the user-user model. s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x similar to i.

42 11.42 Item-Item Collaborative Filtering Example: look for items that are similar to Item5; take Alice's ratings for these items to predict her rating for Item5. (Ratings table as on slide 34.)

43 11.43 Item-Item CF (|N|=2) Ratings matrix: movies 1-6 (rows) × users 1-12 (columns); entries are ratings between 1 and 5, “.” = unknown rating.
          u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12
movie 1:   1  .  3  .  .  5  .  .  5   .   4   .
movie 2:   .  .  5  4  .  .  4  .  .   2   1   3
movie 3:   2  4  .  1  2  .  3  .  4   3   5   .
movie 4:   .  2  4  .  5  .  .  4  .   .   2   .
movie 5:   .  .  4  3  4  2  .  .  .   .   2   5
movie 6:   1  .  3  .  3  .  .  2  .   .   4   .

44 11.44 Item-Item CF (|N|=2) Estimate the rating of movie 1 by user 5 (the matrix above, with the movie 1 / user 5 entry marked ?).

45 11.45 Item-Item CF (|N|=2) Neighbor selection: identify movies similar to movie 1 that were rated by user 5. Here we use Pearson correlation as similarity: 1) subtract the mean rating m_i from each movie i, e.g. m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]; 2) compute cosine similarities between rows. Result: sim(1, m) for m = 1..6 is 1.00, -0.18, 0.41, -0.10, -0.31, 0.59.

46 11.46 Item-Item CF (|N|=2) Compute similarity weights for the two most similar movies rated by user 5: s_1,3 = 0.41, s_1,6 = 0.59.

47 11.47 Item-Item CF (|N|=2) Predict by taking the weighted average: r_1,5 = (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6.
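The whole worked example fits in a short script. The matrix below is the one reconstructed on slide 43 (0 marks an unknown rating), and the script reproduces both the similarity row and the 2.6 prediction:

```python
import math

M = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],  # movie 1 (user 5's rating unknown)
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],  # movie 2
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],  # movie 3
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],  # movie 4
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],  # movie 5
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],  # movie 6
]

def centered(row):
    """Subtract the movie's mean rating from known ratings; unknowns stay 0."""
    rated = [r for r in row if r]
    mean = sum(rated) / len(rated)
    return [r - mean if r else 0.0 for r in row]

def cos(u, v):
    norm = lambda w: math.sqrt(sum(a * a for a in w))
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

C = [centered(row) for row in M]
print([round(cos(C[0], c), 2) for c in C])  # [1.0, -0.18, 0.41, -0.1, -0.31, 0.59]

user, neigh = 4, [2, 5]   # user 5; neighbours: movies 3 and 6 (zero-indexed)
sims = {m: cos(C[0], C[m]) for m in neigh}
pred = sum(sims[m] * M[m][user] for m in neigh) / sum(sims.values())
print(round(pred, 1))     # 2.6
```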

48 11.48 CF: Common Practice Define similarity s_ij of items i and j. Select k nearest neighbors N(i; x): the items most similar to i that were rated by x. Estimate rating r_xi as the weighted average, in practice relative to a baseline: r_xi = b_xi + Σ_{j∈N(i;x)} s_ij · (r_xj − b_xj) / Σ_{j∈N(i;x)} s_ij, where the baseline estimate for r_xi is b_xi = μ + b_x + b_i: μ = overall mean movie rating; b_x = rating deviation of user x = (avg. rating of user x) − μ; b_i = rating deviation of movie i.
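A tiny sketch of the baseline term; the numbers in the usage example are made up:

```python
def baseline(mu, user_mean, item_mean):
    """Baseline estimate b_xi = mu + b_x + b_i, with deviations taken
    from the global mean rating mu."""
    b_x = user_mean - mu          # rating deviation of user x
    b_i = item_mean - mu          # rating deviation of movie i
    return mu + b_x + b_i

# Global mean 3.7; user rates 0.5 below average; movie rated 0.8 above:
print(baseline(3.7, 3.2, 4.5))    # 3.7 - 0.5 + 0.8 = 4.0
```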

49 11.49 Item-Item vs. User-User In practice, it has been observed that item-item often works better than user-user. Why? Items are simpler; users have multiple tastes.

50 11.50 Pros/Cons of Collaborative Filtering + Works for any kind of item No feature selection needed - Cold Start: Need enough users in the system to find a match - Sparsity: The user/ratings matrix is sparse Hard to find users that have rated the same items - First rater: Cannot recommend an item that has not been previously rated New items, Esoteric items - Popularity bias: Cannot recommend items to someone with unique taste Tends to recommend popular items

51 11.51 Hybrid Methods Implement two or more different recommenders and combine predictions Perhaps using a linear model Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

52 11.52 Recommender Systems in e-Commerce One recommender systems research question: what should be in that list?

53 11.53 Recommender Systems in e-Commerce Another question, both in research and practice: how do we know that these are good recommendations?

54 11.54 Recommender Systems in e-Commerce This might lead to: What is a good recommendation? What is a good recommendation strategy? What is a good recommendation strategy for my business? (“We hope you will also buy …” vs. “These have been in stock for quite a while now …”)

55 11.55 How Do We as Researchers Know? Test with real users A/B tests Example measures: sales increase, click through rates Laboratory studies Controlled experiments Example measures: satisfaction with the system (questionnaires) Offline experiments Based on historical data Example measures: prediction accuracy, coverage

56 11.56 Evaluation [utility matrix: users × movies with the known ratings]

57 11.57 Evaluation [the same matrix with some known ratings withheld (marked ?) as the test data set]

58 11.58 Evaluating Predictions Compare predictions with the known, withheld test ratings. A standard measure is root-mean-square error (RMSE): √( Σ_xi (r_xi − r*_xi)² / N ), where r_xi is the predicted and r*_xi the actual rating. Alternatives include precision at top k (the fraction of relevant items among the top-k recommendations) and rank correlation between predicted and actual orderings.
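A minimal RMSE sketch over a withheld test set; the two test ratings in the usage example are invented:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and withheld test ratings.
    Both arguments are {(user, item): rating} dicts over the test set."""
    errs = [(predicted[k] - actual[k]) ** 2 for k in actual]
    return math.sqrt(sum(errs) / len(errs))

# Example: two test ratings, predictions off by 1 and 0.5:
print(round(rmse({("u", "i"): 4.0, ("u", "j"): 2.5},
                 {("u", "i"): 5,   ("u", "j"): 2}), 3))   # 0.791
```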

59 11.59 Problems with Error Measures Narrow focus on accuracy sometimes misses the point Prediction Diversity Prediction Context Order of predictions In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others

60 11.60 Collaborative Filtering: Complexity The expensive step is finding the k most similar customers: O(|X|), where X is the set of customers. Too expensive to do at runtime; could pre-compute, but naïve pre-computation takes time O(k·|X|). Ways of doing this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction, … Supported by Hadoop: Apache Mahout, https://mahout.apache.org/users/basics/algorithms.html

61 Part 2: Support-Vector Machines

62 11.62 Support Vector Machines: History SVMs were introduced in COLT-92 by Boser, Guyon & Vapnik and have become rather popular since. Theoretically well-motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s. Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, …). Several textbooks, e.g., “An Introduction to Support Vector Machines” by Cristianini and Shawe-Taylor. A large and diverse community works on them: from machine learning, optimization, statistics, neural networks, functional analysis, etc.

63 11.63 Application Example Classification: suppose we have 50 photographs of elephants and 50 photos of tigers. We digitize them into 100 × 100 pixel images, so we have x ∈ R^n where n = 10,000. Now, given a new (different) photograph, we want to answer the question: is it an elephant or a tiger? [We assume it is one or the other.]

64 11.64 SVM Motivation We have 2 colors of balls on the table that we want to separate.

65 11.65 SVM Motivation (Cont’) We get a stick and put it on the table; this works pretty well, right?

66 11.66 SVM Motivation (Cont’) Some villain comes and places more balls on the table. The stick kind of still works, but one ball is now on the wrong side, and there is probably a better place to put the stick.

67 11.67 SVM Motivation (Cont’) SVMs try to put the stick in the best possible place by having as big a gap on either side of the stick as possible.

68 11.68 SVM Motivation (Cont’) Now when the villain returns the stick is still in a pretty good spot.

69 11.69 SVM Motivation (Cont’) There is another trick in the SVM toolbox that is even more important. Say the villain has seen how good you are with a stick so he gives you a new challenge.

70 11.70 SVM Motivation (Cont’) There’s no stick in the world that will let you split those balls well, so what do you do? You flip the table, throwing the balls into the air. Then, you grab a sheet of paper and slip it between the balls.

71 11.71 SVM Motivation (Cont’) Now, looking at the balls from where the villain is standing, the balls will look split by some curvy line.

72 11.72 SVM Motivation (Cont’) Data points: the balls. Classifier: the stick (or the curvy line). Optimization: the biggest-gap trick. Kernelling: flipping the table. Hyperplane: the piece of paper.

73 11.73 SVM Linear Classifiers Want to separate “+” points from “-” points using a line. Which is the best linear separator (defined by w)?

74 11.74 Classifier Margin Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.

75 11.75 Maximum Margin The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a Linear SVM (LSVM). Data points closest to the hyperplane are called support vectors.

76 11.76 Maximum Margin Distance from the separating hyperplane corresponds to the “confidence” of the prediction. Example: we are more sure about the classes of points A and B (far from the boundary) than of C (close to it).

77 11.77 Specifying a Line and Margin How do we represent this mathematically, in m input dimensions? [diagram: plus-plane, minus-plane, classifier boundary, “predict class = +1” zone, “predict class = -1” zone]

78 11.78 Specifying a Line and Margin Plus-plane: w·x + b = 1. Classifier boundary: w·x + b = 0. Minus-plane: w·x + b = -1. Classify as +1 if w·x + b ≥ 1; as -1 if w·x + b ≤ -1; “universe explodes” if -1 < w·x + b < 1.

79 11.79 SVM Classifier The classifier: f(x) = sign(w·x + b); estimate w and b from the training pairs (x, y).

80 11.80 What is the Margin? Distance from a point to a line. Let line L: w·x + b = w(1)x(1) + w(2)x(2) + b = 0, with unit normal vector w = (w(1), w(2)); point A = (x_A(1), x_A(2)); point M = (x_M(1), x_M(2)) on the line; H the projection of A onto L. Then d(A, L) = |AH| = |(A − M) · w| = |(x_A(1) − x_M(1))·w(1) + (x_A(2) − x_M(2))·w(2)| = |x_A(1)·w(1) + x_A(2)·w(2) + b| = |w·A + b|, remembering that x_M(1)·w(1) + x_M(2)·w(2) = -b since M belongs to line L.

81 11.81 Maximum Margin [diagram: separating hyperplane w·x + b = 0 with the margin between the two classes]

82 11.82 Support Vector Machine Maximizing the margin γ between the plus- and minus-planes. Since the margin is γ = 2 / ||w||, maximizing the margin is equivalent to minimizing ||w||² subject to y_i(w·x_i + b) ≥ 1 for all training points (x_i, y_i).
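Returning to the chapter's main question (how to train efficiently), here is a minimal sketch of training a linear SVM by stochastic sub-gradient descent on the regularised hinge loss. This is one standard approach, offered as an illustration rather than as the slides' own algorithm; the toy data is invented:

```python
import random

def train_linear_svm(data, lam=0.01, epochs=200, lr=0.05, seed=0):
    """Sketch: minimise lam*||w||^2 + sum_i max(0, 1 - y_i*(w.x_i + b))
    by stochastic sub-gradient descent. data: list of (x, y) pairs with
    x a tuple of floats and y in {-1, +1}."""
    data = list(data)
    rng = random.Random(seed)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi - lr * 2 * lam * wi for wi in w]      # regularisation pull
            if margin < 1:                                # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy usage: two linearly separable point clouds.
pts = [((1.0, 2.0), +1), ((2.0, 3.0), +1), ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]
w, b = train_linear_svm(pts)
print(all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in pts))  # True
```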

83 11.83 References Chapters 9 and 12 of Mining of Massive Datasets. Tutorial: Recommender Systems. IJCAI 2013. http://bytesizebio.net/2014/02/05/support-vector-machines-explained-well/

84 To be continued…

