Large scale multilingual and multimodal integration


Large scale multilingual and multimodal integration
IJS, UCL, Xerox, HIIT
Jan Rupnik

Project overview
Increase the scale of CCA:
- In number of documents
- In number of views
Different approaches:
- Extending the classical definition (IJS, UCL, Xerox): sum of correlations, max min
- Probabilistic approaches (Wray, HIIT)
- Sparsity in CCA (Zakria, UCL)
Outcome:
- Collaboration continues within the SMART project
- Results from the pump priming project will be leveraged as part of deliverables
- Preparing publications

Document representation
Bag of words: vocabulary {w_i | i = 1, …, N}; documents are represented as vectors in the word space.
Example:
- Document set: d1 = "Canonical Correlation Analysis", d2 = "Numerical Analysis", d3 = "Numerical Linear Algebra"
- Vocabulary: {"Canonical", "Correlation", "Analysis", "Numerical", "Linear", "Algebra"}
- Document vector representation: x1 = (1, 1, 1, 0, 0, 0), x2 = (0, 0, 1, 1, 0, 0), x3 = (0, 0, 0, 1, 1, 1)
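A minimal sketch of this bag-of-words construction (plain term counts over a fixed vocabulary; all names are illustrative, not from the slides):

import numpy as np

def bag_of_words(documents, vocabulary):
    """Represent each document as a term-count vector over the vocabulary (word space)."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = np.zeros((len(documents), len(vocabulary)))
    for row, text in enumerate(documents):
        for token in text.split():
            if token in index:
                vectors[row, index[token]] += 1
    return vectors

vocabulary = ["Canonical", "Correlation", "Analysis", "Numerical", "Linear", "Algebra"]
documents = ["Canonical Correlation Analysis",
             "Numerical Analysis",
             "Numerical Linear Algebra"]
X = bag_of_words(documents, vocabulary)
# Rows of X reproduce x1 = (1, 1, 1, 0, 0, 0), x2 = (0, 0, 1, 1, 0, 0), x3 = (0, 0, 0, 1, 1, 1).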

Document Similarity
similarity(d_i, d_j) = <x_i / ||x_i||, x_j / ||x_j||> = cos(∢(x_i, x_j))
For the example above (d1 = "Canonical Correlation Analysis", d2 = "Numerical Analysis", d3 = "Numerical Linear Algebra"; x1 = (1, 1, 1, 0, 0, 0), x2 = (0, 0, 1, 1, 0, 0), x3 = (0, 0, 0, 1, 1, 1)) the pairwise similarities are:

      x1    x2    x3
x1    1.0   0.4   0.0
x2    0.4   1.0   0.4
x3    0.0   0.4   1.0
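A small sketch of this cosine-similarity computation (numpy-based; names are illustrative):

import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two document vectors."""
    return float(np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj)))

x1 = np.array([1, 1, 1, 0, 0, 0])
x2 = np.array([0, 0, 1, 1, 0, 0])
x3 = np.array([0, 0, 0, 1, 1, 1])
print(round(cosine_similarity(x1, x2), 2))  # 0.41 (reported as 0.4 in the table)
print(round(cosine_similarity(x1, x3), 2))  # 0.0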

Canonical Correlation Analysis
Input: an aligned training set {(x_i, y_i) | x_i ∈ ℝ^n, y_i ∈ ℝ^m, i = 1, …, ℓ}.
CCA addresses the following problem: find directions w_x ∈ ℝ^n and w_y ∈ ℝ^m along which the pairs (x_i, y_i) are maximally correlated.
The formulation (before regularization), stated in terms of the covariance matrix, is sketched below.
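The formulation and the covariance matrix appear only as images on the slide; in standard notation (assuming centred data) they read:

\max_{w_x, w_y} \; \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}},
\qquad
C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}
  = \frac{1}{\ell} \sum_{i=1}^{\ell} \begin{pmatrix} x_i \\ y_i \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix}^{\top}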

Solving CCA
The optimisation can be transformed into a generalized eigenvalue problem:
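The slide gives this equation as an image; its standard form is:

\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}
\begin{pmatrix} w_x \\ w_y \end{pmatrix}
= \lambda
\begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix}
\begin{pmatrix} w_x \\ w_y \end{pmatrix}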

Visualization of CCA (a sequence of four figure-only slides showing the X and Y views and the projection directions wx and wy).

Scaling to more than 2 views
Input: an aligned training set of m views {(x_{1,i}, x_{2,i}, …, x_{m,i}) | i = 1, …, ℓ}.
Correlation needs to be generalized to m directions.
Goal of multi-view CCA: find directions w_1, w_2, ..., w_m that maximise the sum of pair-wise correlations Σ_{i≠j} corr(w_i^T X_i, w_j^T X_j). The formulation (before regularization) is sketched below.
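The formulation on the slide is again an image; a standard sum-of-correlations form, with each view's projection variance normalised, is:

\max_{w_1, \dots, w_m} \; \sum_{i \neq j} w_i^\top C_{ij} w_j
\qquad \text{subject to} \qquad w_i^\top C_{ii} w_i = 1, \quad i = 1, \dots, m,

where C_{ij} denotes the empirical cross-covariance between views i and j.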

Dual formulation and regularisation
- Regularisation is used to avoid overfitting.
- The dual formulation and the kernel trick are more suitable here, since the number of features usually exceeds the number of documents.
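As a brief sketch of the dual substitution (our notation, not from the slides): each direction is expressed as a linear combination of the training examples, so the problem depends only on kernel matrices:

w_i = X_i^\top \alpha_i \quad \Rightarrow \quad X_i w_i = X_i X_i^\top \alpha_i = K_i \alpha_i,

and the regulariser (the parameter κ used in the Horst iteration below) enters the normalisation constraints on the projections K_i α_i to avoid overfitting.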

Solving multi-view CCA
Lagrangian techniques yield a linear algebra problem called a multiparameter eigenvalue problem.

Solving multi-view CCA: the Horst algorithm
- Start with random vectors w_1, ..., w_m.
- Iterate the update (shown as an image on the slide; a code sketch follows), where A_{i,j} = L_i K_i K_j L_j if i ≠ j, A_{i,i} = 1/(1-κ)² I, and L_i = ((1-κ)K_i + κI)^{-1}.
- Local convergence is guaranteed when A is symmetric and positive-definite.
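A minimal sketch of a standard Horst-type iteration under these definitions (the blocks of A are formed explicitly here for clarity; in practice one would keep them implicit and use only matrix-vector products, as the complexity slide notes):

import numpy as np

def build_blocks(K, kappa):
    """Assemble the block matrix A from kernel matrices K[i] and regulariser kappa, per the definitions above."""
    m, n = len(K), K[0].shape[0]
    L = [np.linalg.inv((1 - kappa) * K[i] + kappa * np.eye(n)) for i in range(m)]
    return [[np.eye(n) / (1 - kappa) ** 2 if i == j else L[i] @ K[i] @ K[j] @ L[j]
             for j in range(m)] for i in range(m)]

def horst_iteration(A, iters=100, seed=0):
    """Power-method-style iteration: apply A block-wise and renormalise each view's vector."""
    m = len(A)
    rng = np.random.default_rng(seed)
    w = [rng.standard_normal(A[i][i].shape[0]) for i in range(m)]
    w = [v / np.linalg.norm(v) for v in w]
    for _ in range(iters):
        w = [sum(A[i][j] @ w[j] for j in range(m)) for i in range(m)]  # block row times current vectors
        w = [v / np.linalg.norm(v) for v in w]                          # renormalise each view's direction
    return w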

More than one dimension
After finding the best set of projection vectors w_1^1, ..., w_m^1, we would like to find the next best set w_1^2, ..., w_m^2. These need to be uncorrelated with the first set, i.e. Corr(K_i w_i^1, K_i w_i^2) = 0 for each view i. Using projection matrices, this constrained problem can be transformed back into a standard multiparameter eigenproblem.
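One standard way to enforce these zero-correlation constraints, given here as a sketch rather than the exact construction used in the project, is deflation by projection:

u_i = K_i w_i^1, \qquad P_i = I - \frac{u_i u_i^\top}{u_i^\top u_i}, \qquad K_i \leftarrow P_i K_i,

after which the same multiparameter eigenproblem is solved again in the deflated space.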

Complexity of multi-view CCA
Avoid explicit matrix-matrix multiplications (use only matrix-vector products) and explicit inverses (use an iterative method instead, e.g. conjugate gradients). The resulting cost is O(ℓ c m k²), where:
- ℓ: number of documents in each language
- c: average number of nonzero elements in a document vector
- m: number of languages
- k: dimensionality of the common space

Output example from two-view CCA
Aligned documents from English and Slovene. The directions wx and wy calculated with CCA are vectors in the respective word spaces; they identify a common subspace of the English and Slovene word spaces. (Figure: an example pair of directions wx and wy.)

Multilingual search
Task: given a query in one language, retrieve relevant documents from a multilingual collection.
Solution (see the code sketch below):
- Using CCA and an aligned training set, identify a common subspace for the languages in the collection.
- Map the documents from the collection into the common subspace.
- Map the query into the common subspace and identify relevant documents using cosine distance.
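A minimal sketch of this retrieval pipeline; W_query and W_doc stand for the per-language projection matrices produced by CCA, and all names are illustrative:

import numpy as np

def to_common_space(X, W):
    """Project row-wise document vectors X into the common subspace and L2-normalise."""
    Z = X @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def search(query_vec, W_query, docs, W_doc, top_n=10):
    """Rank documents of another language by cosine similarity to the query in the common subspace."""
    q = to_common_space(query_vec[None, :], W_query)[0]
    D = to_common_space(docs, W_doc)
    scores = D @ q                     # cosine similarity, since all vectors are unit norm
    idx = np.argsort(-scores)[:top_n]
    return idx, scores[idx]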

(Diagram: a query and the word spaces of Language A and Language B (X and Y) mapped into the common subspace.)

Experiments
Data:
- EuroParl parallel corpus
- Languages: English, Spanish, German, Italian, Dutch, Danish, Swedish, Portuguese, French, Finnish
- 100,000 training documents in each language, used as input to MCCA to identify a common 100-dimensional subspace
- 7,873 test documents in each language, used to evaluate the common subspace
Evaluation:
- Mate retrieval: given a document in language A as a query, can we find its mate (translation) in language B?
- Pseudo query mate retrieval: we keep the top 5 or top 10 words of the query document according to tf-idf weights and retrieve its mate document.
- Average precision: if the mate is the d-th closest document in the common subspace, the query scores 1/d, and scores are averaged over all queries. Higher is better; the maximum is 1. (A scoring sketch follows.)
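A small sketch of this scoring, assuming both sides have already been mapped into the common subspace (names are illustrative):

import numpy as np

def mate_retrieval_score(queries, candidates, mate_index):
    """Average of 1/d, where d is the rank of each query's true mate among the candidates.

    queries, candidates: L2-normalised rows in the common subspace;
    mate_index[q] is the row in `candidates` holding the mate of query q.
    """
    scores = []
    for q, vec in enumerate(queries):
        sims = candidates @ vec
        d = 1 + int(np.sum(sims > sims[mate_index[q]]))   # d = 1 when the mate is the closest document
        scores.append(1.0 / d)
    return float(np.mean(scores))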

Experiments: comparison with k-means clustering
We concatenate the aligned vectors and work in a single view Y. We compute 100 clusters and use the centroids, split back per view, to obtain 100 concept vectors for each view (a sketch follows).
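A sketch of this baseline, assuming scikit-learn's KMeans and aligned views stored as (documents × features) matrices (names are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def kmeans_concept_vectors(views, n_concepts=100):
    """Concatenate aligned views per document, cluster, then split the centroids back per view."""
    Y = np.hstack(views)                                   # one row per aligned document
    centroids = KMeans(n_clusters=n_concepts, n_init=10).fit(Y).cluster_centers_
    splits = np.cumsum([v.shape[1] for v in views])[:-1]
    return np.split(centroids, splits, axis=1)             # (n_concepts, dim_i) block per view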

Experiments: comparison with LSI
Concatenate the aligned vectors, compute the singular value decomposition Y = U S V^T, and keep the 100 columns of the left singular vector matrix U that correspond to the 100 largest singular values. Split these vectors to get 100 concept vectors for each view (a sketch follows).
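A sketch of this baseline, assuming the concatenated matrix Y stacks the per-language word spaces with one column per aligned document, so that the left singular vectors can be split per view (names are illustrative):

import numpy as np

def lsi_concept_vectors(views, n_concepts=100):
    """SVD of the stacked term-document matrix; split the leading left singular vectors per view."""
    Y = np.vstack(views)                       # views[i]: (dim_i, num_docs); rows stack the word spaces
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    U_k = U[:, :n_concepts]                    # columns for the n_concepts largest singular values
    splits = np.cumsum([v.shape[0] for v in views])[:-1]
    return np.split(U_k, splits, axis=0)       # (dim_i, n_concepts) block per view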

Mate retrieval, MCCA (languages: EN ES DE IT NL DA SV PT FR FI):
0,9723 0,9591 0,9679 0,9675 0,9732 0,9711 0,9726 0,9764 0,9531 0,9744 0,958 0,9687 0,9635 0,9658 0,9671 0,976 0,9781 0,9492 0,9607 0,9575 0,9552 0,9559 0,9606 0,9595 0,9622 0,9417 0,9696 0,968 0,9529 0,9598 0,9608 0,96 0,9705 0,9728 0,9434 0,9666 0,951 0,9585 0,9616 0,9624 0,9662 0,9403 0,9739 0,9677 0,9605 0,9651 0,9743 0,9691 0,9717 0,9546 0,9678 0,9596 0,9631 0,9642 0,9747 0,969 0,9571 0,9731 0,975 0,9576 0,9694 0,9627 0,9654 0,9665 0,977 0,9483 0,9774 0,9769 0,9612 0,9734 0,971 0,9735 0,9792 0,9545 0,9566 0,9494 0,9426 0,9454 0,9418 0,953 0,9524

Mate retrieval, LSI (languages: EN ES DE IT NL DA SV PT FR FI):
0,8524 0,7931 0,8468 0,8639 0,7955 0,8161 0,8212 0,8636 0,6849 0,8567 0,785 0,8875 0,822 0,745 0,7548 0,8917 0,8886 0,6303 0,7746 0,738 0,7203 0,7815 0,7831 0,7462 0,672 0,7346 0,6153 0,8415 0,8734 0,7441 0,819 0,7194 0,732 0,8659 0,8807 0,6156 0,861 0,8032 0,7973 0,8137 0,7838 0,7801 0,7754 0,8291 0,6664 0,7619 0,6705 0,7782 0,669 0,7555 0,8207 0,6205 0,695 0,6547 0,779 0,6766 0,726 0,6727 0,7505 0,8323 0,634 0,7121 0,6507 0,8271 0,8918 0,7362 0,8794 0,7879 0,713 0,7212 0,8561 0,6141 0,8724 0,8792 0,7736 0,8863 0,8497 0,7508 0,7659 0,8511 0,6455 0,6474 0,5307 0,6255 0,5481 0,6231 0,6644 0,6457 0,5082 0,5797

Mate retrieval, k-means clustering (languages: EN ES DE IT NL DA SV PT FR FI):
0,6649 0,6059 0,6819 0,721 0,5948 0,5829 0,6203 0,709 0,4358 0,73 0,5642 0,7616 0,6468 0,5074 0,5148 0,7703 0,7451 0,4008 0,4798 0,3734 0,3806 0,5482 0,6265 0,5504 0,3286 0,4423 0,417 0,7145 0,7452 0,5521 0,6509 0,5101 0,5134 0,7236 0,7541 0,3902 0,6906 0,542 0,6389 0,5733 0,6093 0,5593 0,5085 0,6204 0,4442 0,3966 0,2771 0,5538 0,2948 0,4448 0,6426 0,2502 0,34 0,408 0,4099 0,2893 0,4935 0,3127 0,4166 0,6655 0,2662 0,3551 0,4009 0,6995 0,7949 0,5182 0,7612 0,6319 0,4878 0,5011 0,7266 0,3851 0,7108 0,6827 0,557 0,7126 0,6488 0,515 0,5136 0,6418 0,3883 0,2963 0,2263 0,3856 0,2416 0,3277 0,4389 0,4244 0,2085 0,2762

Pseudo query mate retrieval, cut off all but 10 words in a query.

MCCA - PQ10 (languages: ES DE IT NL DA SV PT FR FI EN):
0,2861 0,2666 0,2921 0,2929 0,2713 0,2652 0,2949 0,3004 0,2447 0,2365 0,2656 0,2459 0,2345 0,2363 0,2743 0,2193 0,2385 0,2478 0,2444 0,247 0,2653 0,2497 0,2317 0,2725 0,2561 0,2546 0,2972 0,291 0,2342 0,2749 0,2717 0,2895 0,245 0,2544 0,2722 0,2551 0,2318 0,2481 0,2272 0,2077 0,2701 0,2089 0,1957

LSI - PQ10 (languages: ES DE IT NL DA SV PT FR FI EN):
0,1664 0,1341 0,167 0,1556 0,1298 0,1287 0,1678 0,1648 0,1216 0,125 0,155 0,1386 0,1202 0,1178 0,1644 0,1561 0,1115 0,1573 0,1396 0,1343 0,1765 0,1675 0,1253 0,1517 0,1284 0,1244 0,1771 0,1679 0,1356 0,1318 0,1689 0,1643 0,1249 0,1448 0,1738 0,166 0,1313 0,1715 0,1683 0,1288 0,1532 0,1085 0,1129

Clust - PQ10 (languages: ES DE IT NL DA SV PT FR FI EN):
0,0775 0,0696 0,0787 0,0753 0,0672 0,0662 0,0803 0,0758 0,0634 0,0655 0,0789 0,0718 0,0621 0,0636 0,0774 0,0767 0,06 0,0802 0,0788 0,0701 0,0668 0,0772 0,066 0,075 0,0639 0,0644 0,0833 0,0807 0,0643 0,0691 0,078 0,0777 0,0656 0,072 0,0806 0,0781 0,0653 0,0813 0,0784 0,067 0,0724 0,0593 0,0601

Pseudo query mate retrieval, cut off all but 5 words in a query.

MCCA - PQ5 (languages: ES DE IT NL DA SV PT FR FI EN):
0,1788 0,1644 0,1833 0,1829 0,1681 0,1638 0,1836 0,186 0,1535 0,1453 0,1635 0,1534 0,1443 0,1467 0,1684 0,1657 0,1371 0,1536 0,1503 0,1526 0,1636 0,1501 0,1429 0,1698 0,1588 0,1575 0,184 0,1791 0,1465 0,1713 0,1672 0,1834 0,1786 0,1544 0,1602 0,1724 0,1576 0,1466 0,1555 0,1377 0,1312 0,1645 0,1286 0,1212

LSI - PQ5 (languages: ES DE IT NL DA SV PT FR FI EN):
0,1215 0,101 0,122 0,1116 0,0992 0,1231 0,0946 0,0974 0,1171 0,1044 0,0959 0,09 0,1203 0,1152 0,0897 0,1252 0,1172 0,1061 0,1015 0,1301 0,1263 0,0991 0,1104 0,0977 0,0927 0,1276 0,1228 0,0928 0,1038 0,1014 0,1246 0,1212 0,0935 0,1048 0,1286 0,1239 0,0975 0,1248 0,1201 0,0983 0,1144 0,0853 0,0884

Clust - PQ5 (languages: ES DE IT NL DA SV PT FR FI EN):
0,0605 0,0549 0,061 0,0593 0,0537 0,0535 0,0612 0,052 0,053 0,0609 0,0563 0,0508 0,0512 0,0603 0,0597 0,0498 0,059 0,0596 0,0541 0,0531 0,0595 0,0509 0,057 0,0528 0,0517 0,0611 0,0607 0,0532 0,0562 0,0608 0,0624 0,0525 0,0592 0,0606 0,0533 0,0574 0,0497 0,0483

Thanks