Introduction to Machine Learning for Information Retrieval Xiaolong Wang

What is Machine Learning? In short, tricks of maths. Two major tasks:
– Supervised Learning: a.k.a. regression, classification, …
– Unsupervised Learning: a.k.a. data manipulation, clustering, …

Supervised Learning
– Label: usually manually annotated
– Data: a data representation, usually a vector
– Prediction function: selecting, from a predefined family of functions, the one that gives the best prediction (classification, regression)

Supervised Learning. Two formulations:
– F1: Given a set of pairs (X_i, Y_i), learn a function f such that f(X_i) ≈ Y_i
– Y_i, binary: spam vs. non-spam
– Y_i, numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1)
– X_i: number of words, occurrence of each word, …
– f: usually a linear function

Supervised Learning. Two formulations:
– F2: Given a set of pairs (X_i, Y_i), learn a function f such that f(X_i) ≈ Y_i, where Y_i is a more complex label than a binary or numeric value
– Multiclass learning: entertainment vs. sports vs. politics, …
– Structural learning: e.g. syntactic parsing; more generally, structured outputs Y predicted from inputs X

Supervised Learning. Training as optimization:
– Loss: the difference between the true label Y_i and the predicted label w^T X_i
– Squared loss (regression): (Y_i - w^T X_i)^2
– Hinge loss (classification): max(0, 1 - Y_i * w^T X_i)
– Logistic loss (classification): log(1 + exp(-Y_i * w^T X_i))
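The three losses above, written out as a minimal NumPy sketch; the weight vector, feature vector, and labels below are made-up illustration values, not from the slides:

```python
import numpy as np

def squared_loss(y, score):   # regression: (y - w^T x)^2
    return (y - score) ** 2

def hinge_loss(y, score):     # classification: max(0, 1 - y * w^T x), y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):  # classification: log(1 + exp(-y * w^T x)), y in {-1, +1}
    return np.log1p(np.exp(-y * score))

w = np.array([0.5, -1.0])     # illustrative weights
x = np.array([2.0, 1.0])      # illustrative feature vector
score = w @ x                 # w^T x
print(squared_loss(1.0, score), hinge_loss(1.0, score), logistic_loss(1.0, score))
```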

Supervised Learning. Training as optimization:
– Regularization: add a penalty on w (e.g. λ||w||^2) to the training loss
– Without regularization: overfitting

Supervised Learning. Training as optimization:
– Regularization: large margin corresponds to small ||w||
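A hedged sketch of the regularized objective the two slides above refer to, assuming the common form loss + λ||w||^2; the λ value and toy data are illustrative:

```python
import numpy as np

def objective(w, X, y, lam):
    scores = X @ w
    loss = np.log1p(np.exp(-y * scores)).sum()  # logistic loss over the training set
    return loss + lam * np.dot(w, w)            # L2 penalty discourages overfitting

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy features
y = np.array([1.0, -1.0, 1.0])                      # toy +/-1 labels
print(objective(np.zeros(2), X, y, lam=0.1))        # equals 3*log(2) at w = 0
```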

Supervised Learning. Optimization: the art of minimization/maximization.
Unconstrained:
– First order: gradient descent
– Second order: Newton's method
– Stochastic: stochastic gradient descent (SGD)
Constrained:
– Active set method
– Interior point method
– Alternating Direction Method of Multipliers (ADMM)
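As a concrete example of the stochastic first-order case, a minimal SGD loop for the logistic loss; the learning rate, epoch count, and data are arbitrary choices for illustration:

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):          # visit examples in random order
            margin = y[i] * (w @ X[i])
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # gradient of log(1 + exp(-y w^T x))
            w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(sgd_logistic(X, y))
```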

Unsupervised Learning
– Dimensionality reduction: PCA
– Clustering: e.g. k-means
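A small scikit-learn sketch of both unsupervised steps on random toy data: PCA projects to a low dimension, then k-means clusters the projected points.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 20))  # 100 "documents", 20 features (toy data)
X_low = PCA(n_components=2).fit_transform(X)          # project to 2 dimensions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(labels[:10])
```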

Machine Learning for Information Retrieval
– Learning to Rank
– Topic Modeling

Learning to Rank

Learning to Rank X = (q, d) – Features: e.g. Matching between Query and Document
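Purely illustrative feature extraction for X = (q, d); the two features below (term overlap and length-normalized overlap) are simple stand-ins, not the features used in the lecture:

```python
def qd_features(query, doc):
    q_terms, d_terms = set(query.lower().split()), doc.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms)   # how many document terms match the query
    return [overlap, overlap / max(len(d_terms), 1)]    # raw and length-normalized overlap

print(qd_features("machine learning", "machine learning for information retrieval"))
```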

Learning to Rank

Labels:
– Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1
– Pairwise: doc A > doc B, doc C > doc D
– Listwise: a permutation
Acquisition:
– Expert annotation
– Clickthrough: a clicked document is preferred over documents skipped above it ("click > skip above")
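A minimal sketch of the clickthrough heuristic: each clicked document generates a pairwise preference over every unclicked document ranked above it. The function name and data are made up:

```python
def click_skip_above(ranking, clicked):
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            # clicked doc is preferred over every skipped (unclicked) doc ranked above it
            prefs.extend((doc, skipped) for skipped in ranking[:i] if skipped not in clicked)
    return prefs

print(click_skip_above(["d1", "d2", "d3", "d4"], clicked={"d3"}))  # [('d3', 'd1'), ('d3', 'd2')]
```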

Learning to Rank

Prediction function:
– Extract features X_{q,d} from (q, d)
– Rank documents by sorting w^T X_{q,d}
Loss function:
– Pointwise
– Pairwise
– Listwise
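A minimal sketch of the prediction step: score each document's feature vector with w^T X_{q,d} and sort by score. The weights and features are toy values:

```python
import numpy as np

def rank(w, doc_features, doc_ids):
    scores = doc_features @ w          # w^T X_{q,d} for every document
    order = np.argsort(-scores)        # sort by descending score
    return [doc_ids[i] for i in order]

w = np.array([0.7, 0.3])
docs = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
print(rank(w, docs, ["d1", "d2", "d3"]))
```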

Learning to Rank
– Pointwise: regression with squared loss
– Pairwise: classification; (q, d1) > (q, d2) makes X_{q,d1} - X_{q,d2} a positive example
– Listwise: directly optimize a ranking measure such as NDCG, built from the relevance (0/1) of the document at rank i, a discount for rank i, the cumulative gain, and normalization by the ideal ranking: DCG = Σ_i rel_i / log2(i + 1), NDCG = DCG / IDCG
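A small NDCG sketch matching the ingredients listed above: 0/1 relevance at each rank, a log2(i + 1) discount, cumulative gain, and normalization by the ideal ordering:

```python
import numpy as np

def dcg(relevances):
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum(np.asarray(relevances) / np.log2(ranks + 1))  # discounted cumulative gain

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))                # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([1, 0, 1, 0]))  # 0/1 relevance of the documents in ranked order
```

With graded relevance a common variant uses 2^rel - 1 in the numerator; the 0/1 form matches the slide.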

Topic Modeling
– Factorization of the words × documents matrix
– Clustering of documents: project documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics)
What is a topic?
– A linear combination of words with nonnegative weights that sum to 1, i.e. a probability distribution over words
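To make the factorization view concrete, a sketch using nonnegative matrix factorization as a stand-in (the probabilistic models below are the ones actually covered); the count matrix is toy data, and each topic column is renormalized so its nonnegative weights sum to 1:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.array([[2, 1, 0],                     # words x documents count matrix (toy)
              [1, 3, 0],
              [0, 0, 2],
              [0, 1, 3]], dtype=float)
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(A)                   # words x topics factor
topics = W / W.sum(axis=0, keepdims=True)    # each topic: nonnegative weights summing to 1
print(np.round(topics, 2))
```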

Topic Modeling. Generative models: story-telling
– Latent Semantic Analysis (LSA)
– Probabilistic Latent Semantic Analysis (PLSA)
– Latent Dirichlet Allocation (LDA)

Topic Modeling. Latent Semantic Analysis (LSA):
– Deerwester et al. (1990)
– Singular Value Decomposition (SVD) applied to the words × documents matrix
– How to interpret negative values?
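A minimal LSA sketch: truncated SVD of a toy words × documents count matrix, keeping k = 2 latent dimensions. The printed document vectors can contain negative entries, which is exactly the interpretation problem raised above:

```python
import numpy as np

A = np.array([[2, 1, 0, 0],      # rows = words, columns = documents (toy counts)
              [1, 2, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 1, 2]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the k-dimensional latent space
print(doc_vectors)                          # entries may be negative
```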

Topic Modeling. Probabilistic Latent Semantic Analysis (PLSA):
– Thomas Hofmann (1999)
– A probabilistic story of how words/documents are generated: each document mixes topics, each topic emits words, so P(w, d) = P(d) Σ_z P(z|d) P(w|z)
– Observed data are (document, word) pairs, e.g. (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip)
– Parameters are fit by maximum likelihood
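A rough EM sketch for PLSA on a small count matrix; the E- and M-step updates are the standard ones for this model, while the initialization, iteration count, and counts are arbitrary:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)   # P(z|d)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(n_iter):
        # E-step: P(z | d, w) proportional to P(z | d) * P(w | z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (docs, topics, words)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w | z) and P(z | d) from expected counts
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

counts = np.array([[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 2, 3]], dtype=float)  # docs x words
p_z_d, p_w_z = plsa(counts, n_topics=2)
print(np.round(p_w_z, 2))
```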

Topic Modeling. Latent Dirichlet Allocation (LDA)
– David Blei et al. (2003)
– PLSA with a Dirichlet prior
What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian: tossing a coin with unknown head probability r (the parameter to be estimated).
– Posterior probability ∝ likelihood × prior: p(r | data) ∝ p(data | r) g(r)
– Canonical maximum likelihood (frequentist) is a special case of Bayesian maximum a posteriori (MAP) estimation when the prior g(r) is uniform
– Bayesian as an inference method: estimate r by the posterior mean or the MAP; estimate the probability that a new toss is heads by averaging over the posterior
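The coin-toss example as a sketch, assuming a conjugate Beta(a, b) prior on the head probability r: the posterior after h heads and t tails is Beta(a + h, b + t), so the posterior mean, the MAP estimate, and the predictive probability of heads are one-liners. The toss counts are made up:

```python
a, b = 1.0, 1.0   # uniform prior, so MAP coincides with maximum likelihood
h, t = 7, 3       # observed heads and tails (illustrative numbers)

posterior_mean = (a + h) / (a + b + h + t)           # Bayesian estimate of r
map_estimate = (a + h - 1) / (a + b + h + t - 2)     # MAP; equals h/(h+t) under the uniform prior
prob_next_head = posterior_mean                      # predictive probability that a new toss is heads
print(posterior_mean, map_estimate, prob_next_head)
```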

Topic Modeling. Latent Dirichlet Allocation (LDA)
– David Blei et al. (2003)
– PLSA with a Dirichlet prior
What additional information do we have about the parameters?
– Sparsity: each topic has nonzero probability on only a few words; each document has nonzero probability on only a few topics
– The parameters of a multinomial are nonnegative and sum to 1, i.e. they live on a simplex
– The Dirichlet distribution defines a probability distribution on the simplex and can encourage sparsity
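A small sketch of how the Dirichlet concentration controls sparsity: samples drawn with concentration below 1 put most of their mass on a few coordinates, while a large concentration gives near-uniform distributions. The dimensions and concentration values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sparse = rng.dirichlet(alpha=[0.1] * 10)   # concentration 0.1: mass piles on a few coordinates
dense = rng.dirichlet(alpha=[10.0] * 10)   # concentration 10: close to uniform
print(np.round(sparse, 2))
print(np.round(dense, 2))
```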