Machine Learning for Textual Information Access: Results from the SMART project. Nicola Cancedda, Xerox Research Centre Europe. First Forum for Information Retrieval Evaluation, Kolkata, India, December 12th-14th, 2008.

The SMART Project. Statistical Multilingual Analysis for Retrieval and Translation (SMART). Information Society Technologies Programme, Sixth Framework Programme, "Specific Targeted Research Project" (STReP). Start date: October 1, 2006. Duration: 3 years. Objective: bring Machine Learning researchers to work on Machine Translation and CLIR.

The SMART Consortium

Premise and Outline
Two classes of methods for CLIR investigated in SMART:
– Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR
– Latent semantic methods based on Canonical Correlation Analysis
Initial plan (reflected in the abstract): to present both... but it would take too long, so:
Outline:
– (Longish) introduction to the state of the art in Canonical Correlation Analysis
– A number of advances obtained by the SMART project
For lexicon adaptation methods: check out deliverable D 5.1 from the project website!

Background: Canonical Correlation Analysis

Canonical Correlation Analysis
Abstract view: word-vector representations of documents (or queries, or any other text span) are only superficial manifestations of a deeper vector representation based on concepts.
– Since they cannot be observed directly, these concepts are latent.
If two spans are translations of one another, their deep representation in terms of concepts is the same.
Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation?
– CCA: assume the mapping from the deep to the superficial representation is linear, and estimate it from empirical data.

Five documents in the world of concepts

The same five documents in two languages

Finding the first Canonical Variates [figure: the five documents plotted in the two language spaces, labelled 1' to 5' and 1'' to 5'']

Finding the first Canonical Variates
Find the two directions, one for each language, such that the projections of the documents onto them are maximally correlated. Assuming the data matrices X and Y are (row-wise) centered, the objective is the covariance of the projections, used to work back the rotation (the concept direction C1 expressed in the bases of X and Y respectively), normalized by the variances to adjust for "stretched" dimensions.
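As a reference, a standard way of writing this objective (a sketch, assuming the documents are the columns of X and Y and both matrices are centered) is
\[
(w_x, w_y) = \arg\max_{w_x, w_y}
\frac{w_x^\top X Y^\top w_y}
{\sqrt{(w_x^\top X X^\top w_x)\,(w_y^\top Y Y^\top w_y)}},
\]
with the empirical covariance of the two projections in the numerator and the two variances in the denominator.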

Finding the first Canonical Variate
Find the two directions, one for each language, such that the projections of the documents are maximally correlated. This turns out to be equivalent to finding the largest eigen-pair of a Generalized Eigenvalue Problem (GEP). Complexity: cubic in the total number of dimensions.
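A sketch of the GEP usually associated with this problem, writing C_xy = X Y^T and so on for the empirical covariance blocks:
\[
\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}
\begin{pmatrix} w_x \\ w_y \end{pmatrix}
= \lambda
\begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix}
\begin{pmatrix} w_x \\ w_y \end{pmatrix},
\]
whose solution costs on the order of O((n_x + n_y)^3) for n_x + n_y total dimensions.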

Finding further Canonical Variates
Assume we have already found the first i-1 pairs of canonical variates; requiring the new projections to be uncorrelated with the previous ones, finding the i-th pair turns out to be equivalent to finding the other eigen-pairs of the same GEP.
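The uncorrelatedness requirement can be sketched as the constraints
\[
w_x^{i\top} C_{xx}\, w_x^{j} = 0, \qquad w_y^{i\top} C_{yy}\, w_y^{j} = 0, \qquad j = 1,\dots,i-1,
\]
using the covariance notation introduced above.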

Examples from the Hansard Corpus

Kernel CCA
Cubic complexity in the number of dimensions soon becomes intractable, especially with text. Also, it could be better to use similarity measures other than the inner product of (possibly weighted) document vectors. Kernel CCA: move from the primal to the dual formulation, since it can be proved that w_x^i (resp. w_y^i) lies in the span of the columns of X (resp. Y).
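Concretely, a sketch under the convention that documents are the columns of X and Y: writing w_x = X alpha and w_y = Y beta, and letting K_x = X^T X and K_y = Y^T Y be the kernel matrices, the correlation objective becomes
\[
\rho = \max_{\alpha,\beta} \frac{\alpha^\top K_x K_y\, \beta}{\sqrt{(\alpha^\top K_x^2\, \alpha)\,(\beta^\top K_y^2\, \beta)}},
\]
so only inner products between documents are needed and any kernel can replace them.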

Kernel CCA
The computation is again done by solving a GEP, this time in the dual variables. Complexity: cubic in the number of training documents m.
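The dual GEP usually takes the following form (same notation as in the sketch above):
\[
\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
= \lambda
\begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix},
\]
whose solution costs on the order of O(m^3) for m training documents.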

Overfitting
Problem: if m ≤ n_x and m ≤ n_y, then there are infinitely many trivial solutions with perfect correlation: OVERFITTING. Given an arbitrary direction in the first space, we can find one with perfect correlation in the second. E.g. with two (centered) points in R^2: unit variances, unit covariance, perfect correlation, no matter what direction we pick!

Regularized Kernel CCA
We can regularize the objective function by trading correlation against a good account of the variance in the two spaces:
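One common way to write such a regularized objective (a sketch with regularization parameter kappa; the exact form used in the project may differ) is
\[
\rho_\kappa = \max_{\alpha,\beta} \frac{\alpha^\top K_x K_y\, \beta}
{\sqrt{\big(\alpha^\top K_x^2\, \alpha + \kappa\, \alpha^\top K_x\, \alpha\big)\,\big(\beta^\top K_y^2\, \beta + \kappa\, \beta^\top K_y\, \beta\big)}},
\]
where the extra kappa-terms penalize the norms of the primal weight vectors and so force the solution to also account for variance in each space.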

Multiview CCA
(K)CCA can take advantage of the "mutual information" between two languages, but what if we have more than two? Can we benefit from multiple views? Multiview CCA is also known as Generalised CCA.

Multiview CCA
There are many possible ways to combine the pairwise correlations between views (e.g. sum, product, min, ...). Chosen approach: SUMCOR [Horst-61]. With a slightly different regularization than above, this leads to a Multivariate Eigenvalue Problem (MEP).
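Sketched in formulas, with X_1, ..., X_k the per-language document matrices and w_1, ..., w_k the corresponding directions (the regularization used in SMART differs slightly, as noted above):
\[
\max_{w_1,\dots,w_k} \sum_{i<j} \operatorname{corr}\!\big(X_i^\top w_i,\; X_j^\top w_j\big) .
\]
Setting the gradient to zero couples all views, giving stationarity conditions of the form \(\sum_{j\neq i} C_{ij}\, w_j = \lambda_i\, C_{ii}\, w_i\) for each view i: a Multivariate Eigenvalue Problem rather than an ordinary GEP.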

Multiview CCA
Multivariate Eigenvalue Problems (MEPs) are much harder to solve than GEPs:
– [Horst-61] introduced an extension to MEPs of the standard power method for eigenvalue problems, for finding the first set of canonical variates only.
– Naïve implementations would be quadratic in the number of documents, and scale up to no more than a few thousand documents.

Innovations from SMART

Extensions of the Horst algorithm [Rupnik and Shawe-Taylor]:
– Efficient implementation, linear in the number of documents
– Version for finding many sets of canonical variates
New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor]
Sparse KCCA [Hussain and Shawe-Taylor]

Efficient Implementation of the Horst algorithm
The Horst algorithm starts with a random set of vectors, then iteratively multiplies by the MEP matrix and renormalizes until convergence. Inner loop: k^2 matrix-vector multiplications, each O(m^2).
Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an O(k) factor in the inner loop.
Extension (2): exploiting the sparseness of the document vectors, one can replace each multiplication of a vector by a kernel matrix (O(m^2)) with two multiplications by the document matrix (O(ms) each, where s is the maximum number of non-zero components in a document vector). Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical linear system resolutions.
The inner loop can thus be made O(kms) instead of O(k^2 m^2). A sketch of the key idea follows.
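A minimal sketch of these two ideas in Python, assuming the documents of each language are available as sparse feature-by-document matrices (e.g. scipy.sparse); the regularization terms, the per-view normalization solves and the handling of further canonical variates are omitted, and all names are illustrative only:

import numpy as np

def kernel_times(X, v):
    # Compute (X^T X) v without ever forming the m x m kernel matrix:
    # two sparse matrix-vector products, O(m*s) instead of O(m^2).
    return X.T @ (X @ v)

def horst_first_variates(Xs, n_iter=100, seed=0):
    """Simplified Horst-style power iteration for the first set of
    canonical variates over k aligned views, in dual form.

    Xs: list of k sparse matrices, each of shape (n_features_i, m),
        holding the m aligned documents of view i as columns.
    Returns the dual coefficient vectors beta_1, ..., beta_k.
    """
    rng = np.random.default_rng(seed)
    k, m = len(Xs), Xs[0].shape[1]
    betas = [rng.standard_normal(m) for _ in range(k)]
    for _ in range(n_iter):
        # Compute every K_j beta_j once per iteration (the Extension (1)
        # saving), each via the sparse two-step product (Extension (2)).
        kv = [kernel_times(X, b) for X, b in zip(Xs, betas)]
        total = sum(kv)
        # Update each view from the sum over the *other* views, renormalize.
        betas = [(total - kv[i]) / np.linalg.norm(total - kv[i])
                 for i in range(k)]
    return betas

Keeping X and X^T as separate sparse factors is what turns each O(m^2) kernel-vector product into two O(ms) products.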

Extended Horst algorithm for finding many sets of canonical variates
The Horst algorithm only finds the first set of k canonical variates. Extension (3): maintain projection matrices P_i^t that, at each iteration, project the current vectors onto the subspace orthogonal to all previous canonical variates for space i. Finding d sets of canonical variates can then be done in O(d^2 mks). This scales up!

MCCA: Experiments
Mate retrieval with Europarl: 10 languages, with multi-way aligned sentences for training and a separate set of aligned sentences for testing. Document vectors: uni-, bi- and tri-grams (~200k features for each language), TF*IDF weighting and length normalization. MCCA is used to extract d = 100-dimensional subspaces. Baseline alternatives for selecting the new basis:
– k-means clustering centroids on concatenated multilingual document vectors
– CL-LSI, i.e. LSI on concatenated vectors

Some example latent vectors

MCCA experiment results
Measure: recall in the top 10, averaged over the 9 other languages.
[Table: recall for each "query" language (EN, SP, GE, IT, DU, DA, SW, PT, FR, FI) under k-means, CL-LSI and MCCA.]

MCCA experiment results
More realistic experiment: pseudo-queries are now formed with the top 5 TF*IDF-scoring components of each sentence.
[Table: recall for each "query" language (EN, SP, GE, IT, DU, DA, SW, PT, FR, FI) under k-means, CL-LSI and MCCA.]

Extension (4): Regression-CCA
Given a query q in one language, find the target-language vector w which is maximally correlated with it. Given this "query translation" we can then find the closest target documents using the standard cosine measure. Promising initial results on the CLEF/GIRT dataset: better than standard CCA, but it cannot take the thesaurus into account, so MAP is still not competitive with the best systems.
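A sketch of the closed-form solution this kind of problem typically admits (reusing the covariance notation from above, with a regularization parameter kappa; the exact formulation used in the project may differ):
\[
w \;\propto\; (C_{yy} + \kappa I)^{-1}\, C_{yx}\, q ,
\]
i.e. a regularized least-squares regression of the target-language representation onto the query.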

Extension (5): Sparse KCCA
Seeking sparsity in the dual solution: the first canonical variates are expressed as linear combinations of only relatively few documents.
– Improved efficiency
– Alternative regularization
The same set of indices i is used for both views.
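In symbols, a sketch of what this sparsity means, with i the selected index set and x_j, y_j the training document vectors:
\[
w_x = \sum_{j \in i} \alpha_j\, x_j, \qquad w_y = \sum_{j \in i} \beta_j\, y_j .
\]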

Sparse KCCA
For a fixed set of indices i, the problem reduces to a GEP involving only the corresponding documents (see the algorithms below). But how do we select i?

Sparse KCCA: Algorithms
Algorithm 1:
1. Initialize
2. For i = 1 to d: deflate the kernel matrices
3. End for
4. Solve the GEP for the index set i
Algorithm 2:
1. Set i to the indices of the top d values
2. Solve the GEP for the index set i
Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space.

Sparse KCCA: Mate retrieval experiments
Europarl, English-Spanish.
KCCA: train ... sec., test ... sec.
SKCCA (1): train 5242 sec., test 698 sec.
SKCCA (2): train 1873 sec., test 695 sec.

SMART - Website: project presentation and deliverables, D 5.1 on lexicon-based methods and D 5.2 on CCA.

SMART - Dissemination and Exploitation Platforms for showcasing developed tools:

Thank you!

Shameless plug Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.

References
[Hardoon and Shawe-Taylor] David Hardoon and John Shawe-Taylor, "Sparse CCA for Bilingual Word Generation", 20th Mini-EURO Conference on Continuous Optimization and Knowledge-Based Technologies, Neringa, Lithuania.
[Hussain and Shawe-Taylor] Zakria Hussain and John Shawe-Taylor, "Theory of Matching Pursuit", Neural Information Processing Systems (NIPS), Vancouver, BC.
[Rupnik and Shawe-Taylor] Jan Rupnik and John Shawe-Taylor, contribution to SMART deliverable D 5.2, "Multilingual Latent Language-Independent Analysis Applied to CLTIA Tasks".

Self-introduction: Natural Language Generation, Grammar Learning, Text Categorization, Machine Learning (kernels for text), (Statistical) Machine Translation (ca. 2004).

Extension (6): Primal-Dual Sparse KCCA
For some applications it is better to have a primal view on one side and a dual view on the other, e.g. linking a few words from one language to documents in another. The optimization problem becomes:
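A sketch of the mixed primal-dual correlation objective this suggests, with a primal weight vector w over the words of the first language and dual coefficients beta over the documents of the second (K_y is the second language's kernel matrix; regularization and the sparsity constraints are omitted, and the exact formulation in the deliverable may differ):
\[
\max_{w,\beta} \frac{w^\top X\, K_y\, \beta}{\sqrt{\big(w^\top X X^\top w\big)\,\big(\beta^\top K_y^2\, \beta\big)}} .
\]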