Multilingual Synchronization focusing on Wikipedia 2011-02-27.

Presentation transcript:

Multilingual Synchronization focusing on Wikipedia

Introduction
Wikipedia: a multilingual encyclopedia
– Supports over 270 languages
  » English, German, Spanish, French, Chinese, Arabic, …
– Allows cross-lingual navigation with inter-language links
  » Inter-language links: hyperlinks from any page in one Wikipedia language edition to one or more nearly or exactly equivalent pages in other Wikipedia language editions
– Different quantity of data in each language's Wikipedia
  » Other language editions often suffer from a lack of information compared to the English version
– Multilingual stat on Feb
  » English: 3.5 million articles (most dominant)
  » French: 1 million articles (3rd)
  » Korean: 156,290 articles (22nd)

Goal of M-Sync
Multilingual Synchronization
– Synchronizing the contents of Wikipedia across multiple languages
  » Linking content among multiple languages
  » Combining it into a synthesis
– The Wikipedia editions in different languages
  » can offer more precise and detailed information, based on different intentions, backgrounds, and cultures
  » can fill the gaps between languages and help acquire integrated knowledge

Sub-Task (NOW)
Goal:
– Finding correlated terms from hypertexts using multilingual topical synthesis

Comparison: sum(page length)

Languages   Domain = Disease   Domain = Settlement
English            7,726,724            43,555,917
French             3,761,923            21,270,331
Spanish            3,739,162            15,265,574
Chinese            1,472,109            10,496,202
Korean               842,463             5,813,650
Union             17,542,381            96,401,674

Sub-Task
Hypothesis
– If X is correlated with Y in L1, then X' should be correlated with Y' in L2
  » where X' is the translation of X in the other language
  » where Y' is the term corresponding to Y in the other language
– Assumption
  » Inter-language links are accurate links that connect two pages about the same entity or concept in different languages
– The strength of the correlation between X and Y is determined using topical synthesis

Outline of Method
Input
– Pages in multiple languages
Output
– Ordered (ranked) correlated-term sets from a page, or
– A weighted graph with titles and links as vertices and the co-relatedness between these vertices as edges

Outline of Method
Process
– Selecting target languages
  » Languages used as synchronization sources: English, Spanish, French, Chinese, Korean
  » Languages used as synchronization targets: Korean & English
– Selecting pages
  » 5-clique pages from the above languages
  » Domain: settlement (the largest), disease (neutral)
– Extracting correlated terms from hypertexts
  » Extracting links from pages
  » Extracting links from history versions in a temporal manner
– Translating correlated terms into the target languages
– Computing weights of co-relatedness using multilingual topic synthesis

Preprocessing: Selecting Target
Source languages (5)
– English, Spanish, French, Chinese, Korean
Extracting target pages that form a 5-clique by inter-language links
– Assumption: pages found in all 5 languages are key pages and the targets to sync
Enforcing consistency of a link path
– If a path from X(L1) to X'(L2) is found once, its inverse path (X', X) is automatically added to the output
[Figure: inter-language links among en:Badminton, es:Bádminton, fr:Badminton, zh:羽毛球, ko:배드민턴. Annotation: A subset of UN official languages]
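
The preprocessing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pages are represented as (language, title) tuples and inter-language links as a set of (source, target) pairs. The function names `symmetrize` and `five_clique_for` are hypothetical.

```python
from itertools import combinations

# The five source languages named on the slide.
LANGS = ("en", "es", "fr", "zh", "ko")


def symmetrize(links):
    """Enforce link-path consistency: add the inverse of every link."""
    closed = set(links)
    for src, dst in links:
        closed.add((dst, src))
    return closed


def five_clique_for(page, links, langs=LANGS):
    """Return the pages forming a clique with `page`, or None.

    `page` is a (lang, title) tuple; `links` is a set of
    ((lang, title), (lang, title)) inter-language links (assumed format).
    """
    links = symmetrize(links)
    group = {page[0]: page}
    # Collect one linked page per language (first in sorted order).
    for src, dst in sorted(links):
        if src == page and dst[0] in langs:
            group.setdefault(dst[0], dst)
    if set(group) != set(langs):
        return None  # the page is missing in at least one language
    members = [group[lang] for lang in langs]
    # Links are symmetric here, so one direction per pair is enough.
    if all((a, b) in links for a, b in combinations(members, 2)):
        return members
    return None
```

With the Badminton pages from the slide, a set of one-way links between every pair of editions is enough, because the inverse links are added automatically before the clique check.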

Extracting correlated-terms from hypertexts
Hypertext pages
– Contain links to other pages
Links
– navigate to a web page with more detailed information
– point to previously published web pages with similar or related content
– Connectivity between pages has often been shown to play an important role in determining relevance

Link types of Wikipedia
Internal links: to other pages in the wiki
– Syntax usage: [[Main Page]]
External links: to other websites
Interwiki links: to other websites registered to the wiki in advance
– Unlike internal links, interwiki links do not use page existence detection
– Syntax usage: [[wikipedia:Sunflower]]
Interlanguage links: to other websites registered as other language versions of the wiki
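
As a rough illustration of the four link types, the sketch below classifies links found in raw wikitext with regular expressions. It is a simplification under stated assumptions: the prefix sets `LANGUAGE_PREFIXES` and `INTERWIKI_PREFIXES` are small hypothetical samples (a real wiki defines its own interwiki table), and namespace links such as categories or files are simply treated as internal.

```python
import re

# [[target]] or [[target|label]]; group 1 is the target.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# [http://... optional label]; the lookbehind skips [[...]] links.
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)[^\]]*\]")

# Assumed prefix tables, for illustration only.
LANGUAGE_PREFIXES = {"en", "es", "fr", "zh", "ko"}
INTERWIKI_PREFIXES = {"wikipedia", "wiktionary", "commons"}


def classify_links(wikitext):
    """Split the links in `wikitext` into the four types on the slide."""
    result = {"internal": [], "interwiki": [],
              "interlanguage": [], "external": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, rest = target.partition(":")
        key = prefix.strip().lower()
        if sep and key in LANGUAGE_PREFIXES:
            result["interlanguage"].append((key, rest.strip()))
        elif sep and key in INTERWIKI_PREFIXES:
            result["interwiki"].append((key, rest.strip()))
        else:
            result["internal"].append(target)
    result["external"] = EXTERNAL.findall(wikitext)
    return result
```

For correlated-term extraction, only the "internal" entries would be used as out-going links, while the "interlanguage" entries supply the cross-lingual correspondences.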

Example of hyperlinks
Example links (out-going) of Seattle:
– “northwestern United States”
– “Washington”
– “Lake Washington”
– “Michael McGinn (mayor)”
– …

Extracting correlated-terms from hypertexts (out-going links)
[Figure: page 123 in each of language1 to language5 with out-going links AAA, BBB, CCC, DDD. Labels: Correlated TermSet, Translating, Unified TermSet]

Extracting correlated-terms from hypertexts (in-coming links)
[Figure: page 123 in each of language1 to language5 with in-coming links from AAA, BBB, CCC, DDD. Labels: Correlated TermSet, Translating, Unified TermSet]

Translating terms
Method
– Wikipedia dictionary-based
  » We collected cross-lingual term pairs to build bilingual word pairs
  » 4 dictionaries are available: EN-KO, ES-KO, FR-KO, ZH-KO
  » Weakness: lack of vocabulary
– Google translation API-based
  » Weakness: terms (keywords) are too short for MT to resolve word sense disambiguation
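
The dictionary-based method and its vocabulary weakness can be sketched as follows. This is an illustrative sketch only: the data layout (one term list per language, one source-to-target dictionary per language) and the function names are assumptions, and terms with no dictionary entry are reported rather than guessed.

```python
def translate_terms(terms, dictionary):
    """Translate terms with a bilingual dictionary.

    Returns (translated, missing); `missing` holds the terms that the
    dictionary does not cover (the "lack of vocabulary" weakness).
    """
    translated, missing = [], []
    for term in terms:
        if term in dictionary:
            translated.append(dictionary[term])
        else:
            missing.append(term)
    return translated, missing


def unified_term_set(term_sets, dictionaries, target="ko"):
    """Merge per-language term sets into one set in the target language.

    `term_sets` maps language code -> list of terms;
    `dictionaries` maps source language code -> {source: target} dict.
    """
    unified = set(term_sets.get(target, ()))
    untranslated = {}
    for lang, terms in term_sets.items():
        if lang == target:
            continue
        translated, missing = translate_terms(terms, dictionaries.get(lang, {}))
        unified.update(translated)
        if missing:
            untranslated[lang] = missing
    return unified, untranslated
```

The `untranslated` result is where a fallback such as machine translation could be applied, with the caveat noted above that short keywords give MT little context.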

Computing weights of co-relatedness using multilingual topic synthesis
Weight of correlations
– Baseline: frequency-based method
– However, different Wikipedia editions have different viewpoints and concerns
– We should weight the synthesized correlated term sets differently according to each language's usage and frequencies
Our proposed solution:
– Analyzing the topical distribution in each language
– Computing weights of correlated terms by each topic's interest
– Approach
  » LDA-based topic distribution using links
  » Aligning cross-lingual topic clusters
    Intersection: common topic
    Difference: unique topic
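
A minimal sketch of the frequency-based baseline and of the intersection/difference step follows. It makes several assumptions not stated on the slide: the baseline weight is taken to be the fraction of language editions whose term set contains the term, and topic clusters are assumed to be already aligned and identified by shared labels. The LDA step itself is not shown.

```python
from collections import Counter


def frequency_weights(term_sets):
    """Baseline weight: share of language editions that contain the term.

    `term_sets` maps language code -> iterable of terms that are already
    translated into one common language (assumed input format).
    """
    counts = Counter(term for terms in term_sets.values() for term in set(terms))
    n_languages = len(term_sets)
    return {term: count / n_languages for term, count in counts.items()}


def split_topics(topics_a, topics_b):
    """Separate common and unique topics of two editions.

    Assumes topic clusters were aligned beforehand, so equal labels
    denote the same topic.
    """
    a, b = set(topics_a), set(topics_b)
    return {"common": a & b, "only_a": a - b, "only_b": b - a}
```

Sorting the terms by descending weight gives the ranked correlated-term set. A topic-aware variant would replace the uniform per-language count with a weight derived from each language's topic distribution.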

Evaluation
What
– Comparison between
  » New correlated terms discovered without topical synthesis: simple union approach
  » New correlated terms discovered with topical synthesis: ranked union, calculated from the strength of relatedness
How
– Public measures
  » Co-occurrence
  » Mutual information
  » Normalized Google distance
– Wikipedia-oriented measures
  » Comparison with the featured articles
  » Comparison in a temporal manner
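
Two of the public measures can be computed from simple page counts, as sketched below. The slide does not say which variant of mutual information is used, so pointwise mutual information is assumed here. The inputs are hypothetical counts: `n_x` and `n_y` are the numbers of pages containing each term, `n_xy` the number containing both, and `n_total` the collection size.

```python
import math


def pmi(n_x, n_y, n_xy, n_total):
    """Pointwise mutual information (base 2) from page counts."""
    if n_xy == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2(n_xy * n_total / (n_x * n_y))


def ngd(n_x, n_y, n_xy, n_total):
    """Normalized Google distance from page counts (0 = always together)."""
    if n_xy == 0:
        return float("inf")  # the terms never co-occur
    log_x, log_y, log_xy = math.log(n_x), math.log(n_y), math.log(n_xy)
    denominator = math.log(n_total) - min(log_x, log_y)
    if denominator == 0:
        return 0.0  # both terms occur on every page
    return (max(log_x, log_y) - log_xy) / denominator
```

Higher PMI and lower NGD both indicate stronger relatedness, so a ranked union can be checked against either measure. Raw co-occurrence corresponds to `n_xy` itself.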

Comparison with featured articles
Featured articles:
– are considered to be the best articles in Wikipedia, as determined by Wikipedia's editors
[Figure: the Correlated TermSet of a featured article in languageX (AAA to GGG) compared with the Unified TermSet translated from the Correlated TermSets of page 123 in language1 to language5 (AAA to DDD)]

Comparison with temporal manner
[Figure: the Correlated TermSet of the article at t n in language1 (AAA to GGG) compared with the Unified TermSet translated from the Correlated TermSets of page 123 at t1 in language1 to language5 (AAA to DDD)]

Contributions
– To supply seed data (seed keywords) for completing articles in a multilingual manner, or for guiding users in generating new articles in Wikipedia
– To find unknown correlated words using various sources