Introduction to Machine Learning for Information Retrieval
Xiaolong Wang
What is Machine Learning
In short, tricks of maths. Two major tasks:
– Supervised learning: a.k.a. regression, classification, ...
– Unsupervised learning: e.g., clustering, dimensionality reduction, ...
Supervised Learning
– Label $Y$: usually manually annotated
– Data $X$: a representation of each example, usually as a feature vector
– Prediction function $f$: selected from a predefined family of functions as the one that predicts best; when $Y$ is discrete this is classification, when $Y$ is numeric it is regression
Supervised Learning
Two formulations:
– F1: Given a set of pairs $(X_i, Y_i)$, learn a function $f$ such that $f(X_i) \approx Y_i$
– $Y_i$: binary (spam vs. non-spam) or numeric (very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1))
– $X_i$: features such as number of words, occurrence of each word, ... (a minimal featurization is sketched below)
– $f$: usually a linear function $f(X) = w^T X$
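A minimal bag-of-words sketch of the data representation $X_i$ above; the vocabulary and document are hypothetical toy examples.

```python
# Count how often each vocabulary term occurs in a document.
vocab = ["free", "money", "meeting", "report"]

def bag_of_words(text: str) -> list[int]:
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

print(bag_of_words("free money free offer"))  # [2, 1, 0, 0]
```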
Supervised Learning
Two formulations:
– F2: Given a set of pairs $(X_i, Y_i)$, learn a scoring function $F(X, Y)$ such that $Y_i = \arg\max_Y F(X_i, Y)$, where $Y_i$ is a more complex label than binary or numeric:
– Multiclass learning: entertainment vs. sports vs. politics ...
– Structural learning: syntactic parsing
– This formulation allows more general label spaces $Y$ for input $X$.
Supervised Learning
Training as optimization: $\min_w \sum_i \ell(Y_i, w^T X_i)$
Loss $\ell$: the difference between the true label $Y_i$ and the predicted score $w^T X_i$:
– Squared loss (regression): $(Y_i - w^T X_i)^2$
– Hinge loss (classification): $\max(0,\, 1 - Y_i \cdot w^T X_i)$
– Logistic loss (classification): $\log(1 + \exp(-Y_i \cdot w^T X_i))$
(These losses are sketched in code below.)
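A minimal sketch of the three losses above, assuming labels $y \in \{-1, +1\}$ for the classification losses and real-valued $y$ for regression; the feature and weight vectors are hypothetical.

```python
import numpy as np

def squared_loss(y, score):
    return (y - score) ** 2

def hinge_loss(y, score):
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    return np.log1p(np.exp(-y * score))

x = np.array([1.0, 2.0, 0.5])   # hypothetical feature vector X_i
w = np.array([0.3, -0.1, 0.8])  # hypothetical weight vector
score = w @ x                   # predicted score w^T x
print(hinge_loss(+1, score), logistic_loss(+1, score))
```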
Supervised Learning
Training as optimization with regularization: $\min_w \sum_i \ell(Y_i, w^T X_i) + \lambda \|w\|^2$
– Without regularization: overfitting (the model fits the training data closely but generalizes poorly)
Supervised Learning
Training as optimization with regularization:
– Large margin corresponds to small $\|w\|$
Supervised Learning
Optimization: the art of minimization
Unconstrained:
– First order: gradient descent
– Second order: Newton's method
– Stochastic: stochastic gradient descent (SGD; a minimal sketch follows)
Constrained:
– Active set method
– Interior point method
– Alternating Direction Method of Multipliers (ADMM)
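A minimal SGD sketch for the L2-regularized logistic loss defined earlier, assuming labels $y_i \in \{-1, +1\}$; the learning rate, regularization weight, and toy data are illustrative.

```python
import numpy as np

def sgd_logistic(X, y, lam=0.01, lr=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (w @ X[i])
            # gradient of log(1 + exp(-margin)) + lam * ||w||^2 w.r.t. w
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + 2 * lam * w
            w -= lr * grad
    return w

# hypothetical toy data: two separable clusters
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = sgd_logistic(X, y)
print(np.sign(X @ w))  # should recover the labels
```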
Unsupervised Learning
– Dimensionality reduction: PCA (sketched below)
– Clustering: e.g., k-means (the slide's "kNN" is presumably k-means, since k-nearest-neighbors is a supervised method)
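A minimal PCA sketch via SVD, projecting data to $k$ dimensions; the random data is a stand-in for real examples.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # coordinates along top-k principal directions

X = np.random.default_rng(0).normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```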
Machine Learning for Information Retrieval
– Learning to Rank
– Topic Modeling
Learning to Rank
Reference: http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf
Learning to Rank
– $X = (q, d)$: each example is a query-document pair
– Features: e.g., matching between query $q$ and document $d$ (a hypothetical sketch follows)
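A minimal sketch of hypothetical query-document matching features; real systems use many more signals (BM25, link features, click features, ...).

```python
def matching_features(query: str, doc: str) -> list[float]:
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms)
    return [
        float(overlap),                      # raw term-match count
        overlap / max(len(d_terms), 1),      # match density in the document
        float(len(q_terms & set(d_terms))),  # distinct query terms covered
    ]

print(matching_features("machine learning", "machine learning for ranking"))
```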
Learning to Rank
Labels:
– Pointwise: relevant vs. irrelevant; graded 5, 4, 3, 2, 1
– Pairwise: doc A > doc B, doc C > doc D
– Listwise: a permutation of the documents
Acquisition:
– Expert annotation
– Clickthrough data: a clicked document is preferred over the skipped documents ranked above it
Learning to Rank
Prediction function:
– Extract features $X_{q,d}$ from each pair $(q, d)$
– Rank documents by sorting on the score $w^T X_{q,d}$
Loss function:
– Pointwise
– Pairwise
– Listwise
Learning to Rank
– Pointwise: regression with squared loss
– Pairwise: classification; the preference $(q, d_1) > (q, d_2)$ becomes a positive example with features $X_{q,d_1} - X_{q,d_2}$
– Listwise: optimize a ranking metric such as NDCG@$k$ directly:
$$\mathrm{NDCG@}k = \frac{1}{\mathrm{IDCG@}k} \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$$
where $rel_i$ is the relevance (0/1) of the document at rank $i$, $1/\log_2(i+1)$ is the discount of rank $i$, the sum is the cumulative gain, and dividing by the ideal value $\mathrm{IDCG@}k$ normalizes it (a sketch follows)
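A minimal NDCG@k sketch using binary relevance and the $\log_2$ discount defined above; `rels` lists relevance in ranked order.

```python
import numpy as np

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i+1) for i = 1..k
    return float(np.sum(rels / discounts))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 0, 0], k=5))
```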
Topic Modeling
– Factorization of the words × documents matrix
– Clustering of documents: project each document (a vector of length #vocabulary) into a lower dimension (a vector of length #topics)
What is a topic?
– A linear combination of words with nonnegative weights that sum to 1 => a probability distribution over words
Topic Modeling
Generative models: story-telling about how the data was generated
– Latent Semantic Analysis (LSA)
– Probabilistic Latent Semantic Analysis (PLSA)
– Latent Dirichlet Allocation (LDA)
Topic Modeling
Latent Semantic Analysis (LSA):
– Deerwester et al. (1990)
– Singular Value Decomposition (SVD) applied to the words × documents matrix (sketched below)
– Drawback: how to interpret negative values?
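A minimal LSA sketch: truncated SVD of a tiny words × documents count matrix; the matrix below is hypothetical toy data.

```python
import numpy as np

A = np.array([[2, 0, 1],    # rows: words, columns: documents
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_topics = (np.diag(S[:k]) @ Vt[:k]).T   # documents in k-dim latent space
print(doc_topics)  # entries can be negative, hence the interpretation issue
```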
Topic Modeling
Probabilistic Latent Semantic Analysis (PLSA):
– Thomas Hofmann (1999)
– A generative story of how words and documents co-occur: observed (document, word) pairs such as (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip) are modeled through latent topics, documents → topics → words: $P(w, d) = P(d) \sum_z P(w \mid z)\, P(z \mid d)$
– Maximum likelihood: $\max \sum_{d,w} n(d, w) \log P(w, d)$, typically fit with the EM algorithm (a sketch follows)
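A minimal PLSA EM sketch on a toy words × documents count matrix; the initialization, iteration count, and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)   # hypothetical counts n(w, d)
W, D, K = N.shape[0], N.shape[1], 2

p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)   # P(w|z)
p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)   # P(z|d)

for _ in range(50):
    # E-step: responsibilities P(z | d, w), shape (W, D, K)
    joint = p_w_z[:, None, :] * p_z_d.T[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(w|z) and P(z|d) from expected counts
    nz = N[:, :, None] * post
    p_w_z = nz.sum(axis=1); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = nz.sum(axis=0).T; p_z_d /= p_z_d.sum(axis=0)

print(np.round(p_w_z, 2))  # learned topic-word distributions
```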
Topic Modeling
Latent Dirichlet Allocation (LDA):
– David Blei et al. (2003)
– PLSA with a Dirichlet prior
Background: What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian, illustrated by tossing a coin with unknown head probability $r$ (the parameter to be estimated):
– Posterior probability $\propto$ likelihood × prior: $f(r \mid \text{data}) \propto L(\text{data} \mid r)\, g(r)$
– The canonical maximum likelihood estimate (frequentist) is a special case of the Bayesian maximum a posteriori (MAP) estimate when the prior $g(r)$ is uniform
– Bayesian inference: estimate $r$ by the posterior mean or the MAP; estimate the probability that a new toss is heads by $\int r\, f(r \mid \text{data})\, dr$ (a sketch follows)
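A minimal Beta-Binomial sketch for the coin example: Beta($a$, $b$) is the conjugate prior for the head probability $r$, so the posterior is again Beta; the prior parameters and counts are illustrative.

```python
a, b = 2.0, 2.0            # hypothetical Beta prior
heads, tails = 7, 3        # observed tosses

a_post, b_post = a + heads, b + tails           # conjugacy: posterior is Beta
posterior_mean = a_post / (a_post + b_post)     # also P(next toss is heads)
map_estimate = (a_post - 1) / (a_post + b_post - 2)
mle = heads / (heads + tails)                   # = MAP under a uniform Beta(1, 1) prior

print(posterior_mean, map_estimate, mle)
```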
Topic Modeling
Latent Dirichlet Allocation (LDA):
– David Blei et al. (2003)
– PLSA with a Dirichlet prior
What additional information do we know about the parameters $P(w \mid z)$ and $P(z \mid d)$?
– Sparsity: each topic has nonzero probability on few words; each document has nonzero probability on few topics
– These parameters are multinomial: nonnegative and summing to 1, i.e., points on a simplex; the Dirichlet distribution defines a probability over the simplex and can encourage sparsity (see the sketch below)
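A minimal sketch of how the Dirichlet concentration parameter controls sparsity: small alpha puts mass near the corners of the simplex; the dimensions and alpha values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dense  = rng.dirichlet(alpha=[10.0] * 5)   # alpha >> 1: near-uniform weights
sparse = rng.dirichlet(alpha=[0.1] * 5)    # alpha << 1: a few dominant entries

print(np.round(dense, 2))    # roughly even weights
print(np.round(sparse, 2))   # one or two entries close to 1
```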