Kernel Canonical Correlation Analysis Blaz Fortuna JSI, Slovenija Cross-language information retrieval.

Input
Two different views of the same data: text documents written in different languages, images with attached text, …

Goal
Find pairs of features from both views with the highest correlations. Example: words that co-occur in a document and its translation, such as "car, vehicle, …" with "Auto, Fahrzeug, …", or "meat, chicken, beef, pork, …" with "Fleisch, Hähnchen, Rindfleisch, Schweinerne, …".

Theory behind CCA
Documents are represented by pairs of vectors, one for each view. The result of CCA is a set of basis vectors for each view such that the correlation between the projections of the variables onto these basis vectors is mutually maximized.
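The idea on this slide can be sketched for the plain (linear) case as follows. This is a minimal illustration, not the implementation used in the presentation: the function name `linear_cca`, the small ridge term `reg` and the toy data are all assumptions. It whitens each view and takes an SVD of the whitened cross-covariance, which is one standard way to obtain the basis vectors and their correlations.

```python
import numpy as np

def linear_cca(X, Y, n_components=1, reg=1e-6):
    """Linear CCA on paired samples X (n x p) and Y (n x q).

    Returns basis vectors Wx (p x k), Wy (q x k) and the k canonical
    correlations. `reg` is a small ridge term (an assumption of this
    sketch) that keeps the covariance matrices invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]

# Toy paired data: both views contain a noisy copy of the same hidden signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.column_stack([rng.normal(size=500), z + 0.1 * rng.normal(size=500)])

Wx, Wy, rho = linear_cca(X, Y, n_components=1)
# Correlation actually observed between the two projected views.
empirical = np.corrcoef(X @ Wx[:, 0], Y @ Wy[:, 0])[0, 1]
```

On this toy data the first pair of basis vectors should pick out the shared signal in each view, so `rho[0]` and `empirical` both come out close to 1.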

Kernelisation of CCA
The method can be rewritten so that feature vectors appear only inside inner products, so a kernel can be used to calculate the inner products. Input documents do not need to be vectors (e.g. text documents together with a string kernel).
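Written out, the rewriting described on this slide looks roughly as follows. The notation is an assumption, not taken from the slides: X and Y hold the training documents of the two views as rows, K_x = X X^T and K_y = Y Y^T are the kernel (Gram) matrices, and the basis vectors are expressed through dual coefficients as w_x = X^T alpha and w_y = Y^T beta.

```latex
% Primal CCA objective, with C_{xy} = X^\top Y, C_{xx} = X^\top X, C_{yy} = Y^\top Y:
\rho = \max_{w_x, w_y}
  \frac{w_x^\top C_{xy}\, w_y}
       {\sqrt{w_x^\top C_{xx}\, w_x \;\cdot\; w_y^\top C_{yy}\, w_y}}

% Substituting w_x = X^\top \alpha and w_y = Y^\top \beta leaves only
% inner products between documents, i.e. kernel matrices:
\rho = \max_{\alpha, \beta}
  \frac{\alpha^\top K_x K_y\, \beta}
       {\sqrt{\alpha^\top K_x^{2}\, \alpha \;\cdot\; \beta^\top K_y^{2}\, \beta}}
```

In practice a regularisation term is usually added to the denominator, since with invertible kernel matrices the unregularised problem can reach perfect correlation on the training set without learning anything useful.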

Cross-Language Text Mining
KCCA constructs a language-independent representation for text documents. Good part: documents from different languages can be compared using this representation. Bad part: a paired dataset is needed for training (this can be avoided by using machine translation tools).

KCCA and LSI
LSI discovers the statistically most significant co-occurrences of terms in documents: when a word appears in a document, what other words usually also appear? KCCA matches terms from the first language with terms from the second based on co-occurrences: when a word appears in a document, does it also appear in its translation?

Text document retrieval
Query databases with multilingual documents. Documents from the database and the query are transformed into the language-independent representation, and the documents nearest to the query are retrieved (nearest neighbour).
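The retrieval step can be sketched as below. Everything here is illustrative: the helper names, the two tiny vocabularies and the hand-written basis vectors are assumptions standing in for directions that KCCA would learn from a paired corpus, and cosine similarity is one common choice of nearest-neighbour measure rather than something the slides specify.

```python
import math

def project(vec, basis):
    """Map a bag-of-words vector into the shared space (one dot product per basis vector)."""
    return [sum(v * b for v, b in zip(vec, direction)) for direction in basis]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def nearest_neighbours(query_vec, doc_vecs, k=1):
    """Indices of the k documents most similar to the query, best first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

# Hypothetical bases, hand-written for illustration only.
# English vocabulary: [car, vehicle, meat]; German vocabulary: [Auto, Fleisch].
english_basis = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
german_basis = [[1.0, 0.0], [0.0, 1.0]]

german_docs = [[0, 2], [3, 0], [1, 1]]   # term counts per document
english_query = [0, 1, 0]                # the single word "vehicle"

docs_shared = [project(d, german_basis) for d in german_docs]
query_shared = project(english_query, english_basis)
ranked = nearest_neighbours(query_shared, docs_shared, k=2)
```

Because the query and the documents live in the same space after projection, an English query can be compared directly with German documents.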

Experiments
Corpus: proceedings of the 36th Canadian Parliament. Part of the documents was used for training. For testing, the 5 most relevant keywords were extracted from a document and used as a query. English queries, French documents.
Retrieval accuracy (top-ranked/top-ten-ranked) [%]:
LSI   30/67  38/75  42/79  45/81  49/84
KCCA  68/94  75/96  78/97  79/98  81/98
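One plausible reading of the "top-ranked/top-ten-ranked" figures is the share of queries whose target document is ranked first, or within the first ten results. The sketch below computes that measure under this assumption; the function name and the toy rankings are made up for illustration and are not the experiment's data.

```python
def top_k_accuracy(rankings, targets, k):
    """Percentage of queries whose target document appears in the first k results.

    rankings: one ranked list of document ids per query, best match first.
    targets:  the document id each query is expected to retrieve.
    """
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)

# Toy example with four queries over three documents.
rankings = [[2, 0, 1], [1, 0, 2], [0, 2, 1], [2, 1, 0]]
targets = [2, 0, 1, 2]

top1 = top_k_accuracy(rankings, targets, k=1)
top2 = top_k_accuracy(rankings, targets, k=2)
```

Here two of the four queries rank their target first and three rank it within the first two results.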

Text categorization
Categorize multilingual documents. All documents are transformed into the language-independent representation, and a classifier is trained on the transformed labelled documents.
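As a rough sketch of this pipeline, the example below trains on vectors from one language and classifies a vector from another, assuming both have already been mapped into the shared representation. A nearest-centroid classifier is swapped in here purely to keep the example short and dependency-free; it is not the classifier from the presentation, and the labels and vectors are invented.

```python
def train_centroids(vectors, labels):
    """Average the training vectors of each class into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(sorted(centroids), key=lambda label: sq_dist(centroids[label], vec))

# Training documents from the first language, already in the shared space (made-up values).
train_vecs = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
train_labels = ["transport", "transport", "food", "food"]

# Test documents from the second language, projected into the same space.
test_vecs = [[0.8, 0.2], [0.2, 0.7]]

centroids = train_centroids(train_vecs, train_labels)
predicted = [predict(centroids, v) for v in test_vecs]
```

The point of the design is that the classifier never sees which language a document came from, only its coordinates in the shared space.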

Experiments
NTCIR-3 patent retrieval test collection, Japanese-English. An SVM was trained on English documents and tested on both the Japanese and the English documents.
Full Eng-train Eng-test Jp-train Jp-test Average precision [%]

Image-Text Retrieval
Retrieval of images based on a text query. No labels are associated with the images. Paired dataset: an image retrieved from the internet, and the text on the web page where the image appeared.

Experiments
Querying a database of images with text queries. Images were split into three clusters. The 10 or 30 images that best match the query are retrieved. In the first test, success is when the retrieved images are of the same label. In the second test, success is when the image that actually matched the query is retrieved.
dim      85%  91%  17%  60%
150 dim  83%  91%  32%  69%

Images retrieved for the text query: ”height: 6-11 weight: 235 lbs position: forward born: september 18, 1968, split, croatia college: none”

”at phoenix sky harbor on july 6, s7, n907wa phoenix suns taxis past n902aw teamwork america west america west s7, n907wa phoenix suns taxis past n901aw arizona at phoenix sky harbor on july 6, 1997.”

Future work
Use of machine translation for making the paired dataset. Experiments with the SVEZ-IJS English-Slovene ACQUIS Corpus. Sparse version of KCCA.