1
Large scale multilingual and multimodal integration
Jan Rupnik
IJS, UCL, Xerox, HIIT
2
Project overview
Increase the scale of CCA:
- in the number of documents
- in the number of views
Different approaches:
- Extending the classical definition (IJS, UCL, Xerox): sum of correlations, max-min
- Probabilistic approaches (Wray, HIIT)
- Sparsity in CCA (Zakria, UCL)
Outcome:
- Collaboration continues within the SMART project
- Results from the pump-priming project will be leveraged as part of deliverables
- Preparing publications
3
Document representation
Bag of words:
Vocabulary: {wi | i = 1, ..., N}
Documents are represented with vectors in the word space.
Example:
Document set:
d1 = "Canonical Correlation Analysis"
d2 = "Numerical Analysis"
d3 = "Numerical Linear Algebra"
Vocabulary: {"Canonical", "Correlation", "Analysis", "Numerical", "Linear", "Algebra"}
Document vector representation:
x1 = (1, 1, 1, 0, 0, 0)
x2 = (0, 0, 1, 1, 0, 0)
x3 = (0, 0, 0, 1, 1, 1)
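As a concrete illustration of the bag-of-words construction above, here is a minimal Python sketch that rebuilds the example vocabulary and document vectors; the variable names are illustrative, not from the slides.

```python
# Minimal sketch of the bag-of-words representation from the example above.
documents = [
    "Canonical Correlation Analysis",   # d1
    "Numerical Analysis",               # d2
    "Numerical Linear Algebra",         # d3
]

# Build the vocabulary in order of first appearance.
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Each document becomes a word-count vector over the vocabulary.
vectors = [[doc.split().count(word) for word in vocabulary] for doc in documents]

print(vocabulary)  # ['Canonical', 'Correlation', 'Analysis', 'Numerical', 'Linear', 'Algebra']
print(vectors)     # [[1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 1, 1]]
```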
4
Document Similarity
similarity(di, dj) = <xi / ||xi||, xj / ||xj||> = cos(∢(xi, xj))
d1 = "Canonical Correlation Analysis", x1 = (1, 1, 1, 0, 0, 0)
d2 = "Numerical Analysis", x2 = (0, 0, 1, 1, 0, 0)
d3 = "Numerical Linear Algebra", x3 = (0, 0, 0, 1, 1, 1)
Pairwise similarities:
     x1   x2   x3
x1   1.0  0.4  0.0
x2   0.4  1.0  0.4
x3   0.0  0.4  1.0
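A short numpy sketch of the cosine-similarity computation on the example vectors (variable names are assumptions, not from the slides):

```python
import numpy as np

# Example document vectors x1, x2, x3 from the slide, one per row.
x = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

# Normalise rows to unit length; pairwise dot products are then cosines.
xn = x / np.linalg.norm(x, axis=1, keepdims=True)
print(np.round(xn @ xn.T, 1))
# [[1.  0.4 0. ]
#  [0.4 1.  0.4]
#  [0.  0.4 1. ]]
```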
5
Canonical Correlation Analysis
Input: aligned training set {(xi, yi) | xi ∈ ℝn, yi ∈ ℝm, i = 1, ..., ℓ}
CCA addresses the following problem: find directions wx ∈ ℝn and wy ∈ ℝm along which the pairs (xi, yi) are maximally correlated.
Formulation (before regularization): maximize over wx and wy
ρ(wx, wy) = (wxᵀ Cxy wy) / sqrt((wxᵀ Cxx wx)(wyᵀ Cyy wy))
The covariance matrix of the joint sample has the block structure
C = [ Cxx  Cxy ]
    [ Cyx  Cyy ]
6
Solving CCA
The problem can be transformed into a generalized eigenvalue problem:
[  0    Cxy ] [wx]       [ Cxx   0  ] [wx]
[ Cyx    0  ] [wy]  = ρ  [  0   Cyy ] [wy]
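A minimal numpy/scipy sketch of this reduction, assuming centred data matrices X and Y with one document per row; a small ridge term `reg` is added (anticipating the regularisation discussed later) so the right-hand-side matrix is positive definite. The function and variable names are illustrative, not the project's code.

```python
import numpy as np
from scipy.linalg import eigh

def cca_top_direction(X, Y, reg=1e-3):
    """Solve two-view CCA as a generalized eigenvalue problem A w = rho B w."""
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    # A = [[0, Cxy], [Cyx, 0]],  B = [[Cxx, 0], [0, Cyy]]
    A = np.block([[np.zeros_like(Cxx), Cxy],
                  [Cxy.T, np.zeros_like(Cyy)]])
    B = np.block([[Cxx, np.zeros_like(Cxy)],
                  [np.zeros_like(Cxy).T, Cyy]])
    rho, W = eigh(A, B)              # generalized symmetric eigensolver
    top = np.argmax(rho)             # largest canonical correlation
    wx, wy = W[:X.shape[1], top], W[X.shape[1]:, top]
    return rho[top], wx, wy
```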
7
Visualization of CCA
(Figure sequence, slides 7-10: the two views X and Y with the projection directions wx and wy.)
11
Scaling to more than 2 views
Input: aligned training set of m views {(x1i, x2i, ..., xmi) | i = 1, ..., ℓ}
We need to generalize correlation to m directions.
Goal of multi-view CCA: find directions w1, w2, ..., wm that maximise the sum of pairwise correlations Σi≠j corr(wiᵀ Xi, wjᵀ Xj).
Formulation (before regularization): maximize Σi≠j wiᵀ Cij wj over w1, ..., wm, subject to wiᵀ Cii wi = 1 for each view i, where Cij denotes the covariance between views i and j.
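For reference, a small Python sketch of evaluating this sum-of-pairwise-correlations objective for given directions (names are illustrative):

```python
import numpy as np

def sum_of_pairwise_correlations(views, ws):
    """views[i] is an (n_samples, n_features_i) data matrix for view i,
    ws[i] the candidate direction; returns the sum over i != j of
    corr(w_i^T X_i, w_j^T X_j)."""
    proj = [X @ w for X, w in zip(views, ws)]
    m = len(views)
    return sum(np.corrcoef(proj[i], proj[j])[0, 1]
               for i in range(m) for j in range(m) if i != j)
```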
12
Dual, regularisation
Regularisation is added to avoid overfitting.
Kernel trick: the dual representation is more suitable, since the number of features usually exceeds the number of documents.
13
Solving multi-view CCA
Lagrangian techniques yield a linear algebra problem called a multiparameter eigenvalue problem.
14
Solving multi-view CCA
Horst algorithm
Start with random vectors w1, ..., wm.
Iterate: wi ← Σj Ai,j wj, followed by normalisation wi ← wi / ||wi||,
where Ai,j = Li Ki Kj Lj if i ≠ j, Ai,i = 1/(1 - κ)² I, and Li = ((1 - κ)Ki + κI)⁻¹.
Local convergence is guaranteed when A is symmetric and positive definite.
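A dense numpy sketch of one possible implementation of this iteration, assuming precomputed kernel matrices K_i and a regularisation parameter κ in (0, 1); this simultaneous-update variant with an explicit convergence check is only a sketch of the scheme described above, not the project's actual code.

```python
import numpy as np

def horst_mcca(kernels, kappa=0.5, iters=200, tol=1e-8, seed=0):
    """Horst-style block power iteration for regularised multi-view kernel CCA.
    A[i][j] = L_i K_i K_j L_j for i != j, A[i][i] = I / (1 - kappa)^2,
    with L_i = ((1 - kappa) K_i + kappa I)^{-1}."""
    m, n = len(kernels), kernels[0].shape[0]
    L = [np.linalg.inv((1 - kappa) * K + kappa * np.eye(n)) for K in kernels]

    rng = np.random.default_rng(seed)
    w = [v / np.linalg.norm(v) for v in rng.standard_normal((m, n))]

    for _ in range(iters):
        w_new = []
        for i in range(m):
            acc = w[i] / (1 - kappa) ** 2            # diagonal block A[i][i] w_i
            for j in range(m):
                if j != i:                           # off-diagonal blocks, matrix-vector products only
                    acc = acc + L[i] @ (kernels[i] @ (kernels[j] @ (L[j] @ w[j])))
            w_new.append(acc / np.linalg.norm(acc))
        if max(np.linalg.norm(a - b) for a, b in zip(w_new, w)) < tol:
            return w_new
        w = w_new
    return w
```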
15
More than one dimension
After finding the first set of projection vectors w1(1), ..., wm(1), we would like to find the next best set of vectors w1(2), ..., wm(2). These need to be uncorrelated with the first set, i.e. corr(Ki wi(1), Ki wi(2)) = 0 for every view i. Using projection matrices we can transform this problem into a standard multiparameter eigenproblem.
16
Complexity of multi-view CCA
Avoid matrix-matrix multiplications (use only matrix-vector products) and explicit inverses (use an iterative method instead, e.g. CG).
Complexity: O(ℓ c m k²), where:
ℓ: number of documents in each language
c: average number of nonzero elements in a document vector
m: number of languages
k: dimensionality of the common space
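As an illustration of the "no explicit inverses" point, the multiplication by Li = ((1 - κ)Ki + κI)⁻¹ can be replaced by a conjugate-gradient solve that only needs matrix-vector products with Ki. A hedged scipy sketch, with illustrative names:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def apply_L(K_matvec, b, kappa, n):
    """Apply L = ((1 - kappa) K + kappa I)^{-1} to b without forming K or L
    explicitly: K_matvec(v) returns K @ v, so only matrix-vector products
    are needed (the operator is SPD for a PSD kernel and 0 < kappa < 1)."""
    op = LinearOperator((n, n), matvec=lambda v: (1 - kappa) * K_matvec(v) + kappa * v)
    x, info = cg(op, b)
    return x
```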
17
Output example from two-view CCA
Aligned documents from English and Slovene.
The directions wx and wy calculated with CCA are vectors in the word space; they identify a common subspace of the English and Slovene word spaces.
(Figure: wx and wy.)
18
Multilingual search
Task: given a query in one language, retrieve relevant documents from a multilingual collection.
Solution:
- Using CCA and an aligned training set, identify a common subspace for the languages in the collection
- Map the documents from the collection to the common subspace
- Map the query to the common subspace and identify relevant documents using the cosine distance
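A minimal sketch of these retrieval steps, assuming the query and the documents are bag-of-words (or tf-idf) vectors and that Wa and Wb hold the CCA directions for the query and document languages as columns; all names are illustrative:

```python
import numpy as np

def cross_lingual_retrieve(query_vec, Wa, doc_vecs, Wb, top_k=10):
    """Map the query (language A) and the documents (language B) into the
    common subspace and rank the documents by cosine similarity."""
    q = Wa.T @ query_vec                             # query in the common subspace
    D = doc_vecs @ Wb                                # documents in the common subspace
    q = q / np.linalg.norm(q)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    scores = D @ q                                   # cosine similarities
    return np.argsort(-scores)[:top_k]               # indices of the best matches
```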
19
(Figure: the word spaces of languages A and B (X and Y), a query, and the common subspace.)
20
Experiments
Data: EuroParl parallel corpus
Languages: English, Spanish, German, Italian, Dutch, Danish, Swedish, Portuguese, French, Finnish
Training documents in each language: used as input for MCCA to identify a common 100-dimensional subspace
7873 test documents in each language: used to evaluate the common subspace
Evaluation:
- Mate retrieval: given a document in language A as a query, can we find its mate (translation) in language B
- Pseudo query mate retrieval: we keep the top 5 or top 10 words of the query document according to tf-idf weights and retrieve its mate document
- Average precision: if the mate is the d-th closest document in the common subspace, the score is 1/d, averaged over all queries. Higher is better; the maximum is 1.
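A small sketch of the scoring described above (the average of 1/d over queries), assuming the queries and candidate documents are already mapped into the common subspace and that the i-th candidate is the mate of the i-th query; names are illustrative:

```python
import numpy as np

def average_mate_score(queries, candidates):
    """queries, candidates: (n, k) arrays of common-subspace representations,
    with candidates[i] being the mate (translation) of queries[i]."""
    Q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = Q @ C.T                                              # cosine similarity of every query to every candidate
    ranks = 1 + (sims > sims.diagonal()[:, None]).sum(axis=1)   # rank d of the true mate for each query
    return float(np.mean(1.0 / ranks))                          # 1/d averaged over queries; 1 is perfect
```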
21
Experiments
Comparison with k-means clustering:
We concatenate the aligned vectors and work in a single view Y. We compute 100 clusters and use the centroids to obtain 100 concept vectors for each view by splitting the centroids per view.
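A possible scikit-learn sketch of this clustering baseline (the 100-cluster setting follows the slide; everything else, including the names, is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_concept_vectors(views, n_concepts=100, seed=0):
    """Concatenate the aligned views into one matrix Y (documents as rows),
    cluster into n_concepts clusters, and split every centroid back into
    per-view parts."""
    Y = np.hstack(views)                             # (n_docs, sum of view dimensions)
    centroids = KMeans(n_clusters=n_concepts, n_init=10,
                       random_state=seed).fit(Y).cluster_centers_
    splits = np.cumsum([V.shape[1] for V in views])[:-1]
    return np.split(centroids, splits, axis=1)       # one (n_concepts, dim_i) block per view
```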
22
Experiments
Comparison with LSI:
Concatenate the aligned vectors into a single matrix Y. Compute the singular value decomposition Y = U S Vᵀ and keep the 100 columns of the left singular vector matrix U that correspond to the 100 largest singular values. Split these vectors to get 100 concept vectors for each view.
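A corresponding numpy sketch of the LSI baseline, assuming each view is a term-document matrix with aligned documents as columns so that the left singular vectors can be split per view; names are illustrative:

```python
import numpy as np

def lsi_concept_vectors(views, n_concepts=100):
    """Stack the views along the feature axis, compute Y = U S V^T, keep the
    columns of U for the n_concepts largest singular values, and split them
    back into per-view concept vectors."""
    Y = np.vstack(views)                              # (sum of feature dims, n_docs)
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)  # singular values come in decreasing order
    U_k = U[:, :n_concepts]
    splits = np.cumsum([V.shape[0] for V in views])[:-1]
    return np.split(U_k, splits, axis=0)              # one (dim_i, n_concepts) block per view
```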
23
Mate retrieval MCCA
EN ES DE IT NL DA SV PT FR FI
0,9723 0,9591 0,9679 0,9675 0,9732 0,9711 0,9726 0,9764 0,9531
0,9744 0,958 0,9687 0,9635 0,9658 0,9671 0,976 0,9781 0,9492
0,9607 0,9575 0,9552 0,9559 0,9606 0,9595 0,9622 0,9417 0,9696
0,968 0,9529 0,9598 0,9608 0,96 0,9705 0,9728 0,9434 0,9666
0,951 0,9585 0,9616 0,9624 0,9662 0,9403 0,9739 0,9677 0,9605
0,9651 0,9743 0,9691 0,9717 0,9546 0,9678 0,9596 0,9631 0,9642
0,9747 0,969 0,9571 0,9731 0,975 0,9576 0,9694 0,9627 0,9654
0,9665 0,977 0,9483 0,9774 0,9769 0,9612 0,9734 0,971 0,9735
0,9792 0,9545 0,9566 0,9494 0,9426 0,9454 0,9418 0,953 0,9524

Mate retrieval LSI
EN ES DE IT NL DA SV PT FR FI
0,8524 0,7931 0,8468 0,8639 0,7955 0,8161 0,8212 0,8636 0,6849
0,8567 0,785 0,8875 0,822 0,745 0,7548 0,8917 0,8886 0,6303
0,7746 0,738 0,7203 0,7815 0,7831 0,7462 0,672 0,7346 0,6153
0,8415 0,8734 0,7441 0,819 0,7194 0,732 0,8659 0,8807 0,6156
0,861 0,8032 0,7973 0,8137 0,7838 0,7801 0,7754 0,8291 0,6664
0,7619 0,6705 0,7782 0,669 0,7555 0,8207 0,6205 0,695 0,6547
0,779 0,6766 0,726 0,6727 0,7505 0,8323 0,634 0,7121 0,6507
0,8271 0,8918 0,7362 0,8794 0,7879 0,713 0,7212 0,8561 0,6141
0,8724 0,8792 0,7736 0,8863 0,8497 0,7508 0,7659 0,8511 0,6455
0,6474 0,5307 0,6255 0,5481 0,6231 0,6644 0,6457 0,5082 0,5797

Clust
EN ES DE IT NL DA SV PT FR FI
0,6649 0,6059 0,6819 0,721 0,5948 0,5829 0,6203 0,709 0,4358
0,73 0,5642 0,7616 0,6468 0,5074 0,5148 0,7703 0,7451 0,4008
0,4798 0,3734 0,3806 0,5482 0,6265 0,5504 0,3286 0,4423 0,417
0,7145 0,7452 0,5521 0,6509 0,5101 0,5134 0,7236 0,7541 0,3902
0,6906 0,542 0,6389 0,5733 0,6093 0,5593 0,5085 0,6204 0,4442
0,3966 0,2771 0,5538 0,2948 0,4448 0,6426 0,2502 0,34 0,408
0,4099 0,2893 0,4935 0,3127 0,4166 0,6655 0,2662 0,3551 0,4009
0,6995 0,7949 0,5182 0,7612 0,6319 0,4878 0,5011 0,7266 0,3851
0,7108 0,6827 0,557 0,7126 0,6488 0,515 0,5136 0,6418 0,3883
0,2963 0,2263 0,3856 0,2416 0,3277 0,4389 0,4244 0,2085 0,2762
24
Pseudo query mate retrieval, cut off all but 10 words in a query
MCCA - PQ10
ES DE IT NL DA SV PT FR FI EN
0,2861 0,2666 0,2921 0,2929 0,2713 0,2652 0,2949 0,3004 0,2447 0,2365 0,2656 0,2459 0,2345 0,2363 0,2743 0,2193 0,2385 0,2478 0,2444 0,247 0,2653 0,2497 0,2317 0,2725 0,2561 0,2546 0,2972 0,291 0,2342 0,2749 0,2717 0,2895 0,245 0,2544 0,2722 0,2551 0,2318 0,2481 0,2272 0,2077 0,2701 0,2089 0,1957

LSI - PQ10
ES DE IT NL DA SV PT FR FI EN
0,1664 0,1341 0,167 0,1556 0,1298 0,1287 0,1678 0,1648 0,1216 0,125 0,155 0,1386 0,1202 0,1178 0,1644 0,1561 0,1115 0,1573 0,1396 0,1343 0,1765 0,1675 0,1253 0,1517 0,1284 0,1244 0,1771 0,1679 0,1356 0,1318 0,1689 0,1643 0,1249 0,1448 0,1738 0,166 0,1313 0,1715 0,1683 0,1288 0,1532 0,1085 0,1129

Clust - PQ10
ES DE IT NL DA SV PT FR FI EN
0,0775 0,0696 0,0787 0,0753 0,0672 0,0662 0,0803 0,0758 0,0634 0,0655 0,0789 0,0718 0,0621 0,0636 0,0774 0,0767 0,06 0,0802 0,0788 0,0701 0,0668 0,0772 0,066 0,075 0,0639 0,0644 0,0833 0,0807 0,0643 0,0691 0,078 0,0777 0,0656 0,072 0,0806 0,0781 0,0653 0,0813 0,0784 0,067 0,0724 0,0593 0,0601
25
Pseudo query mate retrieval, cut off all but 5 words in a query
MCCA - PQ5
ES DE IT NL DA SV PT FR FI EN
0,1788 0,1644 0,1833 0,1829 0,1681 0,1638 0,1836 0,186 0,1535 0,1453 0,1635 0,1534 0,1443 0,1467 0,1684 0,1657 0,1371 0,1536 0,1503 0,1526 0,1636 0,1501 0,1429 0,1698 0,1588 0,1575 0,184 0,1791 0,1465 0,1713 0,1672 0,1834 0,1786 0,1544 0,1602 0,1724 0,1576 0,1466 0,1555 0,1377 0,1312 0,1645 0,1286 0,1212

LSI - PQ5
ES DE IT NL DA SV PT FR FI EN
0,1215 0,101 0,122 0,1116 0,0992 0,1231 0,0946 0,0974 0,1171 0,1044 0,0959 0,09 0,1203 0,1152 0,0897 0,1252 0,1172 0,1061 0,1015 0,1301 0,1263 0,0991 0,1104 0,0977 0,0927 0,1276 0,1228 0,0928 0,1038 0,1014 0,1246 0,1212 0,0935 0,1048 0,1286 0,1239 0,0975 0,1248 0,1201 0,0983 0,1144 0,0853 0,0884

Clust - PQ5
ES DE IT NL DA SV PT FR FI EN
0,0605 0,0549 0,061 0,0593 0,0537 0,0535 0,0612 0,052 0,053 0,0609 0,0563 0,0508 0,0512 0,0603 0,0597 0,0498 0,059 0,0596 0,0541 0,0531 0,0595 0,0509 0,057 0,0528 0,0517 0,0611 0,0607 0,0532 0,0562 0,0608 0,0624 0,0525 0,0592 0,0606 0,0533 0,0574 0,0497 0,0483
26
Thanks