
1 Introduction to Machine Learning for Information Retrieval Xiaolong Wang

2 What is Machine Learning
In short: tricks of maths.
Two major tasks:
– Supervised Learning: a.k.a. regression, classification, …
– Unsupervised Learning: a.k.a. data manipulation, clustering, …

3 Supervised Learning
– Label Y: usually manually annotated
– Data X: a representation of the data, usually as a vector
– Prediction function f: select, from a predefined family of functions, the one that predicts best
[Figure: toy examples of classification and regression]

4 Supervised Learning
Two formulations:
– F1: given a set of (X_i, Y_i), learn a function f such that f(X_i) ≈ Y_i
Y_i – binary: spam vs. non-spam
– numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1)
X_i – number of words, occurrence of each word, …
f – usually a linear function, f(X) = w^T X
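
A minimal sketch of formulation F1 (the vocabulary, weights, and toy text are illustrative, not from the slides): represent a document X_i as a word-count vector and predict with a linear function f(X) = w^T X.

```python
import numpy as np

vocab = ["cheap", "pills", "meeting", "report"]

def to_vector(text):
    """Bag-of-words representation: count each vocabulary word."""
    words = text.lower().split()
    return np.array([words.count(v) for v in vocab], dtype=float)

# Hypothetical weights: positive entries push toward the spam class (+1).
w = np.array([1.5, 2.0, -1.0, -1.0])

X = to_vector("cheap cheap pills")
print(np.sign(w @ X))  # +1 => predicted spam, -1 => predicted non-spam
```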

5 Supervised Learning
Two formulations:
– F2: given a set of (X_i, Y_i), learn a scoring function f(X, Y) such that Y_i = argmax_Y f(X_i, Y)
Y_i: a more complex label than binary or numeric
– Multiclass learning: entertainment vs. sports vs. politics, …
– Structural learning: syntactic parsing
F2 is more general than F1: X and Y are coupled through f(X, Y).
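
A minimal sketch of formulation F2 for the multiclass case (the labels and weight matrix are made up): score every candidate label Y with f(X, Y) = w_Y^T X and predict the argmax, which reduces to F1 when there are only two labels.

```python
import numpy as np

labels = ["entertainment", "sports", "politics"]
W = np.array([[0.2, 1.0, -0.5],    # w_entertainment
              [1.3, -0.2, 0.1],    # w_sports
              [-0.4, 0.3, 1.1]])   # w_politics

def predict(x):
    scores = W @ x                 # f(x, Y) for every label Y
    return labels[int(np.argmax(scores))]

print(predict(np.array([2.0, 0.0, 0.5])))  # -> "sports"
```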

6 Supervised Learning
Training as optimization: min_w Σ_i loss(Y_i, w^T X_i)
Loss: the discrepancy between the true label Y_i and the predicted label w^T X_i
– Squared loss (regression): (Y_i − w^T X_i)^2
– Hinge loss (classification): max(0, 1 − Y_i · w^T X_i)
– Logistic loss (classification): log(1 + exp(−Y_i · w^T X_i))
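
The three losses from the slide written out directly (a sketch; the variable names are mine): y is the true label, s = w^T x the predicted score.

```python
import numpy as np

def squared_loss(y, s):   # regression: (Y_i - w^T X_i)^2
    return (y - s) ** 2

def hinge_loss(y, s):     # classification, y in {-1, +1}
    return max(0.0, 1.0 - y * s)

def logistic_loss(y, s):  # classification, y in {-1, +1}
    return np.log1p(np.exp(-y * s))

for loss in (squared_loss, hinge_loss, logistic_loss):
    print(loss.__name__, loss(1.0, 0.3))
```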

7 Supervised Learning
Training as optimization with regularization: min_w Σ_i loss(Y_i, w^T X_i) + λ‖w‖²
Without regularization: overfitting.

8 Supervised Learning
Regularization intuition: a large margin corresponds to a small ‖w‖.

9 Supervised Learning
Optimization: the art of minimization (or maximization)
Unconstrained:
– First order: gradient descent
– Second order: Newton's method
– Stochastic: stochastic gradient descent (SGD)
Constrained:
– Active set methods
– Interior point methods
– Alternating Direction Method of Multipliers (ADMM)
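
A minimal SGD loop for the regularized logistic loss above, on synthetic data (an illustrative sketch under my own choices of learning rate and regularization strength, not the slides' code).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # synthetic labels in {-1, +1}

w, lr, lam = np.zeros(3), 0.1, 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):
        margin = y[i] * (w @ X[i])
        # gradient of log(1 + exp(-y w^T x)) + lam * ||w||^2 at one sample
        grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + 2 * lam * w
        w -= lr * grad
print(w)   # should roughly recover the direction of the true weights
```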

10 Unsupervised Learning
Examples:
– Dimensionality reduction: PCA
– Clustering: k-means
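
A tiny clustering sketch, assuming the slide's "kNN" refers to k-means clustering: alternate between assigning points to the nearest center and moving each center to the mean of its points (empty-cluster handling omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(10):
    # assign each point to its nearest center
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    # move each center to the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print(centers)   # approximately the two blob means
```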

11 Machine Learning for Information Retrieval
– Learning to Rank
– Topic Modeling

12 Learning to Rank
Reference: http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf

13 Learning to Rank
X = (q, d): a query–document pair
– Features: e.g., the match between query q and document d
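
Illustrative matching features for X = (q, d); the slides do not specify the feature set, so these are common examples of my own choosing (matched term count, query coverage, document length).

```python
def match_features(query, doc):
    q, d = set(query.lower().split()), doc.lower().split()
    overlap = sum(1 for w in d if w in q)    # matched term count
    coverage = len(q & set(d)) / len(q)      # fraction of query matched
    return [overlap, coverage, len(d)]

print(match_features("machine learning",
                     "learning to rank with machine learning"))
# -> [3, 1.0, 6]
```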

14 Learning to Rank

15 Learning to Rank
Labels:
– Pointwise: relevant vs. irrelevant; graded 5, 4, 3, 2, 1
– Pairwise: doc A > doc B, doc C > doc D
– Listwise: a permutation of the candidate documents
Acquisition:
– Expert annotation
– Clickthrough data: a clicked document is preferred over documents skipped above it
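
A sketch of that "click > skip above" heuristic for harvesting pairwise labels from a click log (the log format here is an assumption): each clicked document is preferred over every unclicked document ranked above it.

```python
def pairs_from_clicks(ranking, clicked):
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            # doc beats every skipped document shown above it
            prefs += [(doc, skipped) for skipped in ranking[:i]
                      if skipped not in clicked]
    return prefs

print(pairs_from_clicks(["d1", "d2", "d3", "d4"], {"d3"}))
# -> [('d3', 'd1'), ('d3', 'd2')]
```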

16 Learning to Rank

17 Learning to Rank
Prediction function:
– Extract features X_{q,d} from (q, d)
– Rank documents by sorting on w^T X_{q,d}
Loss function:
– Pointwise
– Pairwise
– Listwise
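
Prediction as described on the slide: score each candidate document by w^T X_{q,d} and sort in decreasing order. The feature vectors and weights below are made up for illustration.

```python
import numpy as np

X_qd = {"d1": np.array([2.0, 0.5, 8.0]),   # hypothetical features X_{q,d}
        "d2": np.array([0.0, 0.0, 3.0]),
        "d3": np.array([3.0, 1.0, 20.0])}
w = np.array([1.0, 2.0, -0.01])            # illustrative weights

ranking = sorted(X_qd, key=lambda d: float(w @ X_qd[d]), reverse=True)
print(ranking)   # documents ordered by decreasing score w^T X_{q,d}
```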

18 Learning to Rank
Pointwise:
– Regression with squared loss
Pairwise:
– Classification: (q, d1) > (q, d2) yields the positive example X_{q,d1} − X_{q,d2}
Listwise:
– Directly optimize a ranking metric such as NDCG@j:
NDCG@j = (1/Z) Σ_{i=1..j} (2^{r_i} − 1) / log₂(i + 1)
where r_i is the relevance (0/1) of the document at rank i, 1/log₂(i + 1) is the discount of rank i, the sum is the cumulative gain, and Z normalizes by the ideal (perfectly sorted) ranking.
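
NDCG@j computed from the formula reconstructed above (using the common 2^r − 1 gain and 1/log₂(i + 1) discount convention).

```python
import numpy as np

def dcg(relevances, j):
    rel = np.asarray(relevances[:j], dtype=float)
    ranks = np.arange(1, len(rel) + 1)
    return float(np.sum((2 ** rel - 1) / np.log2(ranks + 1)))

def ndcg(relevances, j):
    ideal = dcg(sorted(relevances, reverse=True), j)   # normalizer Z
    return dcg(relevances, j) / ideal if ideal > 0 else 0.0

print(ndcg([1, 0, 1, 0], j=3))   # ~0.92: one relevant doc slightly misplaced
```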

19 Topic Modeling
– Factorization of the words × documents matrix
– Clustering of documents: project documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics)
What is a topic?
– A linear combination of words with nonnegative weights that sum to 1, i.e. a probability distribution over words

20 Topic Modeling
Generative models: story-telling about how the documents arose
– Latent Semantic Analysis (LSA)
– Probabilistic Latent Semantic Analysis (PLSA)
– Latent Dirichlet Allocation (LDA)

21 Topic Modeling
Latent Semantic Analysis (LSA):
– Deerwester et al. (1990)
– Singular Value Decomposition (SVD) applied to the words × documents matrix
– Drawback: how to interpret negative values?
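
LSA in a few lines of linear algebra: SVD of a toy words × documents count matrix, keeping k topics (the matrix itself is made up for illustration).

```python
import numpy as np

A = np.array([[2, 1, 0],   # rows: words, columns: documents
              [1, 1, 0],
              [0, 1, 2],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T   # documents in k-dim topic space
print(doc_topics)   # note: entries can be negative -- hard to interpret
```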

22 Topic Modeling
Probabilistic Latent Semantic Analysis (PLSA):
– Thomas Hofmann (1999)
– A probabilistic story of how words/documents are generated: P(w, d) = P(d) Σ_z P(w | z) P(z | d), coupling documents and words through latent topics z
[Figure: sampled (document, word) pairs, e.g. (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip)]
– Parameters are fit by maximum likelihood: max Σ_{d,w} n(d, w) log P(w, d)
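
A compact EM sketch for PLSA on a toy count matrix (my own minimal implementation of the standard updates, not Hofmann's code): n holds the counts n(w, d), K is the number of topics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.array([[2, 1, 0],          # counts n(w, d): rows words, cols docs
              [1, 1, 0],
              [0, 1, 2],
              [0, 0, 1]], dtype=float)
W, D, K = n.shape[0], n.shape[1], 2

p_w_z = rng.dirichlet(np.ones(W), size=K).T   # P(w | z), shape (W, K)
p_z_d = rng.dirichlet(np.ones(K), size=D).T   # P(z | d), shape (K, D)

for _ in range(100):
    # E-step: posterior P(z | w, d) for every (w, d) pair
    joint = p_w_z[:, :, None] * p_z_d[None, :, :]      # shape (W, K, D)
    post = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate P(w | z) and P(z | d) from expected counts
    counts = n[:, None, :] * post                      # shape (W, K, D)
    p_w_z = counts.sum(axis=2)
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = counts.sum(axis=0)
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)

print(np.round(p_w_z, 2))   # each column is a topic: a distribution over words
```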

23 Topic Modeling
Latent Dirichlet Allocation (LDA):
– David Blei et al. (2003)
– PLSA with a Dirichlet prior
Background: Bayesian inference, conjugate priors, posteriors; frequentist vs. Bayesian, illustrated by tossing a coin whose head probability r is the parameter to be estimated:
– posterior(r | data) ∝ likelihood(data | r) × prior g(r)
– Canonical maximum likelihood (frequentist) is the special case of Bayesian maximum a posteriori (MAP) estimation in which the prior g(r) is uniform.
– Bayesian inference: estimate r by the posterior mean or the MAP estimate; the probability that a new toss is heads is the posterior expectation of r.
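
The coin example in code: with a Beta prior (the conjugate prior for a Bernoulli likelihood), the posterior after h heads and t tails is Beta(a + h, b + t). The counts below are made up; note how, under the uniform prior Beta(1, 1), the MAP estimate equals the maximum-likelihood estimate h / (h + t), exactly the special case the slide describes.

```python
a, b = 1.0, 1.0            # uniform prior Beta(1, 1)
h, t = 7, 3                # observed tosses (illustrative)

a_post, b_post = a + h, b + t                        # posterior Beta(8, 4)
posterior_mean = a_post / (a_post + b_post)          # P(next toss is heads)
map_estimate = (a_post - 1) / (a_post + b_post - 2)  # = MLE h/(h+t) here

print(posterior_mean, map_estimate)   # 0.666..., 0.7
```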

24 Topic Modeling
Latent Dirichlet Allocation (LDA):
– David Blei et al. (2003)
– PLSA with a Dirichlet prior
What additional information do we have about the parameters?
– Sparsity: each topic puts nonzero probability on few words; each document puts nonzero probability on few topics.
– The parameters of a multinomial are nonnegative and sum to 1, so they lie on a simplex; the Dirichlet distribution defines a probability over the simplex and can encourage sparsity.
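
Sampling from a Dirichlet shows the sparsity effect the slide mentions: a concentration parameter below 1 pushes most of the mass onto a few coordinates of the simplex, while a large one yields near-uniform vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
print(np.round(rng.dirichlet(np.full(10, 0.1)), 2))   # sparse-looking sample
print(np.round(rng.dirichlet(np.full(10, 10.0)), 2))  # near-uniform sample
```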

