Presentation on theme: "Machine Learning for Textual Information Access: Results from the SMART project. Nicola Cancedda, Xerox Research Centre Europe. First Forum for Information Retrieval Evaluation." — Presentation transcript:

1 Machine Learning for Textual Information Access: Results from the SMART project Nicola Cancedda, Xerox Research Centre Europe First Forum for Information Retrieval Evaluation Kolkata, India, December 12th-14th, 2008

2 The SMART Project Statistical Multilingual Analysis for Retrieval and Translation (SMART). Information Society Technologies Programme, Sixth Framework Programme, "Specific Targeted Research Project" (STReP). Start date: October 1, 2006. Duration: 3 years. Objective: bring Machine Learning researchers to work on Machine Translation and CLIR.

3 The SMART Consortium

4

5 Premise and Outline Two classes of methods for CLIR investigated in SMART – Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR – Latent semantic methods based on Canonical Correlation Analysis Initial plan (reflected in the abstract): to present both – ...but it would take too long, so: Outline: – (Longish) introduction to the state of the art in Canonical Correlation Analysis – A number of advances obtained by the SMART project For lexicon adaptation methods: check out deliverable D 5.1 from the project website!

6 Background: Canonical Correlation Analysis

7 Canonical Correlation Analysis Abstract view: word-vector representations of documents (or queries, or any other text span) are only superficial manifestations of a deeper vector representation based on concepts. – Since they cannot be observed directly, these concepts are latent. If two spans are translations of one another, their deep representation in terms of concepts is the same. Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation? – CCA: assume the mapping from the deep to the superficial representation is linear, and estimate it from empirical data.

8 Five documents in the world of concepts (figure: five points in the latent concept space)

9 The same five documents in two languages (figure: the same five points as seen in each of the two language spaces)

10 Finding the first Canonical Variates (figure: the five documents in both languages, with their projections onto one candidate direction per language)

11 Finding the first Canonical Variates Find the two directions, one for each language, such that the projections of the documents onto them are maximally correlated. Assuming the data matrices X and Y are (row-wise) centered, the first pair of canonical variates solves

    max_{w_x, w_y}  (w_x^T X^T Y w_y) / sqrt( (w_x^T X^T X w_x) (w_y^T Y^T Y w_y) )

The numerator is the covariance of the two projections (maximal covariance lets us work back the rotation), w_x and w_y are the first latent concept c_1 expressed in the bases of X and Y respectively, and the normalization by the variances in the denominator adjusts for "stretched" dimensions.

12 Finding the first Canonical Variates Find the two directions, one for each language, such that projections of documents are maximally correlated. This turns out to be equivalent to finding the largest eigen-pair in a Generalized Eigenvalue Problem (GEP):

    [ 0       X^T Y ] [w_x]        [ X^T X   0     ] [w_x]
    [ Y^T X   0     ] [w_y]  =  ρ  [ 0       Y^T Y ] [w_y]

Complexity: cubic in the total number of features, O((n_x + n_y)^3).
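To make the construction concrete, here is a minimal sketch of the primal formulation as a generalized eigenvalue problem in NumPy/SciPy; the function name and the small ridge term added for numerical stability are illustrative assumptions, not the SMART implementation:

    import numpy as np
    from scipy.linalg import eigh

    def cca_primal(X, Y, ridge=1e-8):
        """Canonical directions for row-wise centered data matrices X (m x n_x)
        and Y (m x n_y), sorted by decreasing canonical correlation."""
        m, n_x = X.shape
        n_y = Y.shape[1]
        Cxx = X.T @ X / m + ridge * np.eye(n_x)   # variance block, language x
        Cyy = Y.T @ Y / m + ridge * np.eye(n_y)   # variance block, language y
        Cxy = X.T @ Y / m                         # cross-covariance block
        # GEP: [0 Cxy; Cyx 0] w = rho [Cxx 0; 0 Cyy] w
        A = np.block([[np.zeros((n_x, n_x)), Cxy],
                      [Cxy.T, np.zeros((n_y, n_y))]])
        B = np.block([[Cxx, np.zeros((n_x, n_y))],
                      [np.zeros((n_y, n_x)), Cyy]])
        rho, W = eigh(A, B)                       # dense solve: O((n_x + n_y)^3)
        order = np.argsort(-rho)                  # largest correlations first
        return W[:n_x, order], W[n_x:, order], rho[order]

The dense solve is exactly the cubic-cost bottleneck that motivates the kernel (dual) formulation on the next slides.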

13 Finding further Canonical Variates Assume we have already found i-1 pairs of Canonical Variates; the i-th pair again maximizes the correlation of the projections, subject to being uncorrelated with all previous variates. This turns out to be equivalent to finding the remaining eigen-pairs of the same GEP.

14 Examples from the Hansard Corpus

15 Kernel CCA Cubic complexity in the number of dimensions soon becomes intractable, especially with text. Also, it could be better to use similarity measures other than the inner product of (possibly weighted) document vectors. ⇒ Kernel CCA: move from the primal to the dual formulation, since it can be proved that w_x^i (resp. w_y^i) lies in the span of the columns of X (resp. Y).

16 Kernel CCA The computation is again done by solving a GEP, now over the dual coefficients α and β (with the m x m kernel matrices K_x and K_y):

    [ 0         K_x K_y ] [α]        [ K_x K_x   0       ] [α]
    [ K_y K_x   0       ] [β]  =  ρ  [ 0         K_y K_y ] [β]

Complexity: cubic in the number of documents, O(m^3).

17 Overfitting E.g. two (centered) points in R^2 (figure). Problem: if m ≤ n_x and m ≤ n_y then there are infinitely many trivial solutions with perfect correlation: OVERFITTING. Given an arbitrary direction in the first space... ...we can find one with perfect correlation in the second: unit variances, unit covariance, hence perfect correlation, no matter what direction we start from!

18 Regularized Kernel CCA We can regularize the objective function by trading correlation against a good account of the variance in the two spaces:
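As a rough illustration of this trade-off, below is a minimal sketch of one common regularized dual formulation, solved as a GEP over the dual coefficients; the centered kernel matrices Kx, Ky, the parameter kappa and the exact placement of the regularizer are assumptions and may differ from the formulation used in SMART:

    import numpy as np
    from scipy.linalg import eigh

    def kcca_regularized(Kx, Ky, kappa=0.1):
        """Dual canonical variates for centered m x m kernel matrices Kx, Ky."""
        m = Kx.shape[0]
        # Off-diagonal blocks: covariance of the projections in the two spaces.
        A = np.block([[np.zeros((m, m)), Kx @ Ky],
                      [Ky @ Kx, np.zeros((m, m))]])
        # Diagonal blocks: variance terms plus a kappa-weighted regularizer that
        # rules out the trivial perfect-correlation solutions of the previous slide.
        B = np.block([[Kx @ Kx + kappa * Kx, np.zeros((m, m))],
                      [np.zeros((m, m)), Ky @ Ky + kappa * Ky]])
        rho, W = eigh(A, B + 1e-10 * np.eye(2 * m))  # jitter keeps B positive definite
        order = np.argsort(-rho)
        return W[:m, order], W[m:, order], rho[order]  # alpha, beta, correlations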

19 Multiview CCA (K)CCA can take advantage of the "mutual information" between two languages... ...but what if we have more than two? Can we benefit from multiple views? Also known as Generalised CCA. (figure: the same five documents in several language spaces)

20 Multiview CCA There are many possible ways to combine the pairwise correlations between views (e.g. sum, product, min, ...). Chosen approach: SUMCOR [Horst-61], i.e. maximize the sum of the pairwise correlations. With a slightly different regularization than above, this leads to a Multivariate Eigenvalue Problem.

21 Multiview CCA Multivariate Eigenvalue Problems (MEPs) are much harder to solve than GEPs: – [Horst-61] introduced an extension to MEPs of the standard power method for EPs, for finding the first set of canonical variates only – Naïve implementations would be quadratic in the number of documents, and scale up to no more than a few thousand documents

22 Innovations from SMART

23 Extensions of the Horst algorithm [Rupnik and Shawe-Taylor] – Efficient implementation linear in the number of documents – Version for finding many sets of canonical variates New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor] Sparse KCCA [Hussain and Shawe-Taylor]

24 Efficient Implementation of the Horst algorithm The Horst algorithm starts with a random set of vectors, then iteratively multiplies and renormalizes until convergence. Inner loop: k^2 matrix-vector multiplications, each O(m^2). Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an O(k) factor in the inner loop. Extension (2): exploiting the sparseness of the document vectors, one can replace each (vector) multiplication with a kernel matrix (O(m^2)) with two multiplications with the document matrix (O(ms) each, where s is the maximum number of non-zero components in the document vectors). Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical linear system resolutions. The inner loop can thus be made O(kms) instead of O(k^2 m^2).
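For reference, here is a minimal sketch of the basic (un-optimized) Horst power iteration for the first set of canonical variates; the k x k block matrix `blocks` stands in for the regularized MEP matrix built from the kernel matrices, and its exact construction, as well as the SMART refactoring and sparsity tricks described above, are not reproduced here:

    import numpy as np

    def horst_first_variates(blocks, n_iter=200, tol=1e-8, seed=0):
        """Basic Horst power iteration on a symmetric block matrix given as a
        nested list `blocks`, with blocks[i][j] an m x m array; returns one
        unit-norm vector per view (the first set of canonical variates)."""
        k = len(blocks)
        m = blocks[0][0].shape[0]
        rng = np.random.default_rng(seed)
        betas = [rng.standard_normal(m) for _ in range(k)]   # random start
        betas = [b / np.linalg.norm(b) for b in betas]
        for _ in range(n_iter):
            new = []
            for i in range(k):
                # k matrix-vector products per view -> k^2 per sweep, O(m^2) each.
                v = sum(blocks[i][j] @ betas[j] for j in range(k))
                new.append(v / np.linalg.norm(v))            # renormalize
            shift = max(np.linalg.norm(a - b) for a, b in zip(new, betas))
            betas = new
            if shift < tol:
                break
        return betas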

25 Extended Horst algorithm for finding many sets of canonical variates The Horst algorithm only finds the first set of k canonical variates. Extension (3): maintain projection matrices P_i^t that, at each iteration, project the current iterates onto the subspace orthogonal to all previous canonical variates for space i. Finding d sets of canonical variates can be done in O(d^2 mks). This scales up!
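The projection step itself is simple; here is a minimal sketch of removing the components along previously found variates before renormalizing, assuming the earlier variates for that view are kept orthonormal (the incremental bookkeeping of the P_i^t matrices is omitted):

    import numpy as np

    def project_out_previous(beta, prev):
        """Remove from beta its components along the previously found variates
        for the same view; `prev` is a list of orthonormal 1-D arrays."""
        for p in prev:
            beta = beta - (p @ beta) * p   # subtract the component along p
        return beta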

26 MCCA: Experiments Experiments: mate retrieval with Europarl. 10 languages, 100,000 10-way aligned sentences for training, 7873 10-way aligned sentences for testing. Document vectors: uni-, bi- and tri-grams (~200k features for each language), TF*IDF weighting and length normalization. MCCA is used to extract d = 100-dimensional subspaces. Baseline alternatives for selecting the new basis (a sketch of the evaluation follows below): – k-means clustering centroids on concatenated multilingual document vectors – CL-LSI, i.e. LSI on concatenated vectors
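A minimal sketch of the mate-retrieval measurement, assuming the test sentences have already been projected into the shared d-dimensional subspace and that row i of the target matrix is the mate of row i of the query matrix; the names and the cosine ranking are illustrative assumptions:

    import numpy as np

    def recall_at_10(Q, D):
        """Q: (n x d) projected query sentences; D: (n x d) projected target
        sentences, row i of D being the aligned mate of row i of Q."""
        Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
        Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
        sims = Qn @ Dn.T                         # cosine similarities
        hits = 0
        for i in range(sims.shape[0]):
            top10 = np.argsort(-sims[i])[:10]    # 10 nearest target sentences
            hits += int(i in top10)
        return hits / sims.shape[0]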

27 Some example latent vectors

28 MCCA experiment results Measure: recall in Top 10, averaged over 9 languages

    "Query" Language   K-means   CL-LSI   MCCA
    EN                 0.7486    0.9129   0.9883
    SP                 0.7450    0.9131   0.9855
    GE                 0.5927    0.8545   0.9778
    IT                 0.7448    0.9022   0.9836
    DU                 0.7136    0.9021   0.9835
    DA                 0.5357    0.8540   0.9874
    SW                 0.5312    0.8623   0.9880
    PT                 0.7511    0.9000   0.9874
    FR                 0.7334    0.9116   0.9888
    FI                 0.4402    0.7737   0.9830

29 MCCA experiment results More realistic experiment: pseudo-queries are now formed with the top 5 TF*IDF-scoring components in each sentence

    "Query" Language   K-means   CL-LSI   MCCA
    EN                 0.1319    0.2348   0.4413
    SP                 0.1258    0.2226   0.4109
    GE                 0.1333    0.2492   0.4158
    IT                 0.1330    0.2343   0.4373
    DU                 0.1339    0.2408   0.4369
    DA                 0.1376    0.2517   0.4232
    SW                 0.1376    0.2499   0.4038
    PT                 0.1274    0.2187   0.4075
    FR                 0.1300    0.2262   0.3931
    FI                 0.1340    0.2490   0.4179
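A minimal sketch of how such a pseudo-query could be formed, keeping only the top-5 TF*IDF-scoring components of a sentence vector; the dense-vector representation and the final renormalization are illustrative assumptions:

    import numpy as np

    def pseudo_query(tfidf_vec, top_k=5):
        """tfidf_vec: 1-D TF*IDF vector for one sentence; keep only the
        top_k largest weights and renormalize."""
        q = np.zeros_like(tfidf_vec)
        top = np.argsort(-tfidf_vec)[:top_k]     # indices of the 5 largest weights
        q[top] = tfidf_vec[top]
        norm = np.linalg.norm(q)
        return q / norm if norm > 0 else q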

30 Extension (4): Regression-CCA Given a query q in one language, find the target-language vector w which is maximally correlated to it. Solution: Given this "query translation" we can then find the closest target documents using the standard cosine measure. Promising initial results on the CLEF/GIRT dataset: better than standard CCA, but it cannot take the thesaurus into account, so MAP is still not competitive with the best systems.

31 Extension (5): Sparse KCCA Seeking sparsity in the dual solution: the first canonical variates are expressed as linear combinations of only relatively few documents – Improved efficiency – Alternative regularization The same set of indices i is used for both languages.

32 Sparse KCCA For a fixed set of indices i, the problem reduces to a GEP restricted to the selected documents. But how do we select i?
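For illustration, a minimal sketch of the "fixed index set" step: the dual expansion is restricted to the documents in idx and a correspondingly small regularized GEP is solved. The exact regularization and the way idx is chosen (the two algorithms on the next slide) are not reproduced, so the details below are assumptions:

    import numpy as np
    from scipy.linalg import eigh

    def sparse_kcca_fixed_indices(Kx, Ky, idx, kappa=0.1):
        """Sparse dual coefficients supported on the documents in idx, for
        centered m x m kernel matrices Kx, Ky."""
        d = len(idx)
        Kx_i, Ky_i = Kx[:, idx], Ky[:, idx]            # m x d kernel columns
        A = np.block([[np.zeros((d, d)), Kx_i.T @ Ky_i],
                      [Ky_i.T @ Kx_i, np.zeros((d, d))]])
        B = np.block([[Kx_i.T @ Kx_i + kappa * Kx[np.ix_(idx, idx)], np.zeros((d, d))],
                      [np.zeros((d, d)), Ky_i.T @ Ky_i + kappa * Ky[np.ix_(idx, idx)]]])
        rho, W = eigh(A, B + 1e-10 * np.eye(2 * d))    # small d x d solve
        order = np.argsort(-rho)
        return W[:d, order], W[d:, order], rho[order]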

33 Sparse KCCA: Algorithms
Algorithm 1:
1. initialize
2. for i = 1 to d do: deflate the kernel matrices
3. end for
4. solve the GEP for index set i
Algorithm 2:
1. set i to the indices of the top d values of ...
2. solve the GEP for index set i
Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space.

34 Sparse KCCA: Mate retrieval experiments Europarl, English-Spanish

    Method      Train (sec.)   Test (sec.)
    KCCA        24693          27733
    SKCCA (1)    5242            698
    SKCCA (2)    1873            695

35 SMART - Website Project presentation and deliverables: http://www.smart-project.eu D 5.1 on lexicon-based methods and D 5.2 on CCA

36 SMART - Dissemination and Exploitation Platforms for showcasing developed tools:

37 Thank you!

38 Shameless plug Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.

39 References
[Hardoon and Shawe-Taylor] David Hardoon and John Shawe-Taylor, Sparse CCA for Bilingual Word Generation, in 20th Mini-EURO Conference of the Continuous Optimization and Knowledge Based Technologies, Neringa, Lithuania, 2008.
[Hussain and Shawe-Taylor] Zakria Hussain and John Shawe-Taylor, Theory of Matching Pursuit, in Neural Information Processing Systems (NIPS), Vancouver, BC, 2008.
[Rupnik and Shawe-Taylor] Jan Rupnik and John Shawe-Taylor, contribution to SMART deliverable D 5.2, "Multilingual Latent Language-Independent Analysis Applied to CLTIA Tasks", http://www.smart-project.eu/files/D52.pdf

40 Self-introduction Natural Language Generation, Grammar Learning, Text Categorization, Machine Learning (kernels for text), (Statistical) Machine Translation since ca. 2004

41 Extension (6): Primal-Dual Sparse KCCA For some applications it is better to have a primal view on one side and a dual view on the other – e.g. linking a few words from one language to documents in another. The optimization problem becomes:

