Kernel Canonical Correlation Analysis (Language Independent Document Representation) Roland Pihlakas. Some of the slides are taken from the .PPT of the same title.

Kernel Canonical Correlation Analysis (Language Independent Document Representation) Roland Pihlakas. Some of the slides are taken from the .PPT of the same title by the authors: Blaz Fortuna, Marko Grobelnik, Dunja Mladenić, Jozef Stefan Institute, Ljubljana

General goal (1/2) Most information retrieval methods depend on exact matches between words in user queries and words in documents. Such methods, however, fail to retrieve relevant materials that do not share words with the users' queries. These methods treat words as if they were independent, although it is quite obvious that they are not.

General goal (2/2) KCCA enables representing documents in a "language neutral way". Intuition behind KCCA:
1. Given a parallel corpus (such as Acquis)...
2. ...first, we automatically identify language-independent semantic concepts from the text,
3. ...then, we re-represent documents with the identified concepts,
4. ...finally, we are able to perform cross-language statistical operations (such as retrieval, classification, clustering...).
It is not necessary to use any external dictionaries, thesauri, or knowledge bases to determine word associations. The method can be used in both monolingual and multilingual retrieval.

Language Independent Document Representation [Figure: documents in any of 15 languages (English, German, French, Spanish, Italian, Slovenian, Slovak, Czech, Hungarian, Greek, Finnish, Swedish, Dutch, Lithuanian, Danish) are mapped to a language-neutral representation. A new document, represented as text in any of the above languages, is re-represented in a language-neutral way, which enables cross-lingual retrieval, categorization, clustering, ...]

Input for KCCA On input we have a set of aligned documents: for each document we have a version in each language. Documents are represented as bag-of-words vectors. [Figure: a pair of aligned documents, one in the bag-of-words space for the English language and one in the bag-of-words space for the German language.]
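As a minimal sketch of this input representation (the vocabularies and the aligned pair below are toy examples, not from the talk), each side of an aligned document pair becomes a count vector over its language's vocabulary:

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Map a tokenized document to a count vector over a fixed vocabulary."""
    counts = Counter(doc)
    return [counts.get(term, 0) for term in vocab]

# A toy aligned pair: the same "document" in English and in German.
en_vocab = ["loss", "income", "company", "quarter"]
de_vocab = ["verlust", "einkommen", "firma", "viertel"]

en_doc = ["company", "loss", "loss", "quarter"]
de_doc = ["firma", "verlust", "verlust", "viertel"]

x = bag_of_words(en_doc, en_vocab)  # vector in the English bag-of-words space
y = bag_of_words(de_doc, de_vocab)  # vector in the German bag-of-words space
print(x, y)  # → [2, 0, 1, 1] [2, 0, 1, 1]
```

In practice the vectors are sparse and high-dimensional (one coordinate per vocabulary term), but the alignment is the same: the i-th English vector and the i-th German vector describe the same underlying document.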

The Output from KCCA The goal: find pairs of semantic dimensions that co-appear in documents and their translations with high correlation. A semantic dimension is a weighted set of words; a semantic dimensions pair consists of two vectors, one from e.g. the English bag-of-words space and one from the German bag-of-words space. Example pairs:
– loss, income, company, quarter ↔ verlust, einkommen, firma, viertel
– wage, payment, negotiations, union ↔ zahlung, volle, gewerkschaft, verhandlungsrunde

What is canonical correlation analysis: arg max_{a,b} ρ, where ρ = corr(a'X, b'Y); X, Y – vectors of random variables; a, b – the vectors we are seeking. A typical use of canonical correlation in a psychological context is to take two sets of variables and see what is common between them. For example, you could take two well-established multidimensional personality tests, such as the MMPI and the NEO. By seeing how the MMPI factors relate to the NEO factors, you could gain insight into which dimensions are common between the tests and how much variance is shared.
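The definition above can be made concrete on toy data. This is a sketch of classical (linear, unregularized apart from a small eps) CCA via whitening and SVD, not the talk's implementation; the synthetic two-view data and all names are illustrative:

```python
import numpy as np

def cca(X, Y, eps=1e-9):
    """Classical CCA: first pair of canonical directions (a, b) and the
    canonical correlation rho = corr(a'X, b'Y), via whitening + SVD."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):  # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    a = Wx @ U[:, 0]   # direction in X-space
    b = Wy @ Vt[0, :]  # direction in Y-space
    return a, b, s[0]  # s[0] = max corr(a'X, b'Y)

# Toy data: the two views share one latent signal z plus noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])
Y = np.hstack([rng.normal(size=(200, 1)), -z + 0.1 * rng.normal(size=(200, 1))])
a, b, rho = cca(X, Y)
print(rho)  # close to 1: CCA recovers the shared latent z
```

The recovered directions a and b single out the coordinates carrying z in each view, which is exactly the "what is common amongst the two sets" intuition from the slide.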

The Algorithm – Theory Formally, KCCA solves: max_{f_x, f_y} corr(⟨f_x, Φ(x)⟩, ⟨f_y, Φ(y)⟩), where f_x, f_y are the semantic directions for English and German, and (x, y) is a pair of aligned documents. ⟨f_x, Φ(x)⟩ is presumably the language-independent semantic representation, and f_x is presumably the transformation for a specific language that yields the language-independent representation. We search for f_x and f_y such that the resulting representations are highly correlated.

More details The "pair of documents" in this experiment is actually a pair of sentences; more specifically, vectors corresponding to the bags of the words in the sentences, mapped to a high-dimensional space. The mapping is done using a nonlinear function Φ(x), though Φ was not specified. The aim is to apply the kernel trick.

The Algorithm – Theory

K_x, K_y – kernels of the two nonlinear mappings Φ(x) and Φ(y). In the experiment they actually used the simple linear kernel k(x_i, x_j) = x_i^T x_j; they plan to use different kernels in the future. x, y – corpora in a specific language, each consisting of a set of documents, x = {x_i}, i = 1..N; (x_i, y_i) – a pair of translated texts.
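A sketch of what these kernel matrices feed into: the standard regularized KCCA eigenproblem in the dual, (K_x + κI)^{-1} K_y (K_y + κI)^{-1} K_x α = ρ² α. This is the textbook formulation, not the talk's code; the regularization constant kappa and the toy data are assumptions:

```python
import numpy as np

def linear_kernel(X):
    """K[i, j] = x_i . x_j – the linear kernel used in the experiment."""
    return X @ X.T

def kcca_first_pair(Kx, Ky, kappa=0.1):
    """Regularized KCCA in the dual: dual coefficient vectors (alpha, beta)
    and the top kernel canonical correlation rho. kappa > 0 prevents the
    degenerate rho = 1 solutions that unregularized KCCA always has."""
    n = Kx.shape[0]
    Rx = np.linalg.inv(Kx + kappa * np.eye(n))
    Ry = np.linalg.inv(Ky + kappa * np.eye(n))
    M = Rx @ Ky @ Ry @ Kx          # rho^2 alpha = M alpha
    w, V = np.linalg.eig(M)
    i = np.argmax(w.real)
    alpha = V[:, i].real
    beta = Ry @ Kx @ alpha         # matching (unnormalized) dual direction on the Y side
    return alpha, beta, np.sqrt(max(w[i].real, 0.0))

# Sanity check: two identical views should correlate almost perfectly.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = linear_kernel(X)
alpha, beta, rho = kcca_first_pair(K, K)
print(rho)  # close to 1
```

The primal directions of the previous slides are then f_x = Σ_i α_i Φ(x_i), so everything is expressed through kernel evaluations only, which is what makes non-linear Φ feasible.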

Examples of Semantic Dimensions from the Acquis corpus: English–French. Most important words from semantic dimensions automatically generated from 2000 documents:
– EN: DIRECTIVE, DECISION, VEHICLES, AGREEMENT, EC, VETERINARY, PRODUCTS, HEALTH, MEAT
  FR: DIRECTIVE, DECISION, VEHICULES, PRESENTE, RESIDUS, ACCORD, PRODUITS, ANIMAUX, SANITAIRE
– EN: NOMENCLATURE, COMBINED, COLUMN, GOODS, TARIFF, CLASSIFICATION, CUSTOMS
  FR: NOMENCLATURE, COMBINEE, COLONNE, MARCHANDISES, CLASSEMENT, TARIF, TARIFAIRES
– EN: EMBRYOS, ANIMALS, OVA, SEMEN, ANIMAL, CONVENTION, BOVINE, DECISION, FEEDINGSTUFFS
  FR: EMBRYONS, ANIMAUX, OVULES, CONVENTION, SPERME, EQUIDES, DECISION, BOVINE, ADDITIFS
– EN: SUGAR, CONVENTION, ADDITIVES, PIGMEAT, PRICE, PRICES, FEEDINGSTUFFS, SEED
  FR: SUCRE, CONVENTION, PORC, ADDITIFS, PRIX, ALIMENTATION, SEMENCES, DECISION
– EN: EXPORT, LICENCES, LICENCE, REFUND, VEHICLES, FISHERY, CONVENTION, CERTIFICATE, ISSUED
  FR: EXPORTATION, CERTIFICATS, CERTIFICAT, PECHE, VEHICULES, LAIT, CONVENTION
Topic labels from the slide: Veterinary; Customs; Agriculture; Export Licences; Veterinary, Transport.
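Word lists like the ones above are read off a learned semantic direction. A minimal sketch (the vocabulary and weights are invented for illustration): the "most important words" of a dimension are simply the vocabulary terms with the largest-magnitude weights in the direction vector.

```python
def top_words(direction, vocab, k=5):
    """Most important words of a semantic dimension: the k vocabulary terms
    with the largest absolute weight in the direction vector."""
    order = sorted(range(len(vocab)), key=lambda i: -abs(direction[i]))
    return [vocab[i] for i in order[:k]]

# Hypothetical weights over a tiny vocabulary.
vocab = ["sugar", "convention", "price", "tax"]
weights = [0.8, 0.5, 0.6, 0.1]
print(top_words(weights, vocab, 3))  # → ['sugar', 'price', 'convention']
```

Running this over the English direction and its paired French direction yields aligned word lists like the DIRECTIVE/DIRECTIVE or SUGAR/SUCRE pairs shown above.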

The experiment 1.3 million pairs of aligned text chunks (sentences or smaller fragments). The text was taken from the proceedings of the 36th Canadian Parliament, but some words and a few documents that appeared to be problematic when split into paragraphs were removed. The result was a term-by-"document" matrix with 5159 English terms and one with 5611 French terms. These matrices were still too large to perform matrix decompositions on, so the experiment was done on 14 chunks of the above data and the results were averaged. Computation in Matlab took ~2-8 minutes per chunk on a 1 GHz Pentium III, depending on the implementation.

Results Pseudo-query tests: 5 query words; the relevant documents were the test documents themselves in monolingual retrieval, or their mates in cross-lingual tests. K – unspecified; probably the number of terms / dimensions.

Example applications of KCCA:
– Cross-lingual document retrieval: the retrieved documents depend only on the meaning of the query, not on its language.
– Automatic document categorization (for example, using SVM): only one classifier is learned, rather than a separate classifier for each language.
– Document clustering: documents are grouped into clusters based on their content, not on the language they are written in.
– Cross-media information retrieval: in the same way we correlate two languages, we can correlate text to images, text to video, text to sound, ...
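The first application can be sketched in a few lines. Assuming the per-language projection matrices Fx, Fy (the stacked semantic directions from KCCA) are already given, cross-lingual retrieval is just cosine similarity in the shared space; the identity-matrix demo below stands in for real learned projections:

```python
import numpy as np

def to_semantic_space(X, F):
    """Project bag-of-words row vectors onto the semantic directions
    (columns of F) and length-normalize for cosine similarity."""
    Z = X @ F
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.clip(norms, 1e-12, None)

def cross_lingual_retrieve(query_en, docs_de, Fx, Fy):
    """Rank German documents for an English query by cosine similarity in
    the shared, language-neutral space. Fx, Fy are assumed to come from
    KCCA training on a parallel corpus."""
    q = to_semantic_space(query_en[None, :], Fx)
    D = to_semantic_space(docs_de, Fy)
    scores = (D @ q.T).ravel()
    return np.argsort(-scores)  # document indices, best first

# Tiny demo with identity projections standing in for learned Fx, Fy.
Fx = np.eye(2)
query_en = np.array([1.0, 0.0])
docs_de = np.array([[0.0, 1.0], [1.0, 0.0]])
ranking = cross_lingual_retrieve(query_en, docs_de, Fx, Fx)
print(ranking)  # → [1 0]: the second document matches the query best
```

The same ranking function also covers the cross-media case: replace docs_de and Fy by image features and the image-side projection, and nothing else changes.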

Related approaches The usual approach to modelling cross-language information retrieval is Latent Semantic Indexing (LSI/SVD) on parallel corpora. The measured performance of KCCA is consistently and significantly better than that of LSI on the same data [Vinokourov et al., 2002].

Links KCCA is available within the Text-Garden text-mining software environment – available at