Multilingual Synchronization focusing on Wikipedia 2011-03-17.

Slides:

Advertisements

Similar presentations

Multilinguality & Semantic Search Eelco Mossel (University of Hamburg) Review Meeting, January 2008, Zürich.

Advertisements

On-line Compilation of Comparable Corpora and Their Evaluation Radu ION, Dan TUFIŞ, Tiberiu BOROŞ, Alexandru CEAUŞU and Dan ŞTEFĂNESCU Research Institute.

Linked data: P redicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,

Linking Named Entity in Tweets with Knowledge Base via User Interest Modeling Date : 2014/01/22 Author : Wei Shen, Jianyong Wang, Ping Luo, Min Wang Source.

Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.

Encyclopaedic Annotation of Text.  Entity level difficulty  All the entities in a document may not be in reader’s knowledge space  Lexical difficulty.

1 Extending PRIX for Similarity-based XML Query Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao.

INFO 624 Week 3 Retrieval System Evaluation

Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.

Generating topic chains and topic views: Experiments using GermaNet Irene Cramer, Marc Finthammer, and Angelika Storrer Faculty.

WebMiningResearch ASurvey Web Mining Research: A Survey By Raymond Kosala & Hendrik Blockeel, Katholieke Universitat Leuven, July 2000 Presented 4/18/2002.

1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization.

Xiaomeng Su & Jon Atle Gulla Dept. of Computer and Information Science Norwegian University of Science and Technology Trondheim Norway June 2004 Semantic.

1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.

An Information Theoretic Approach to Bilingual Word Clustering Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU.

Multilingual Synchronization focusing on Wikipedia

Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.

The use of machine translation tools for cross-lingual text-mining Blaz Fortuna Jozef Stefan Institute, Ljubljana John Shawe-Taylor Southampton University.

Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.

Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.

Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.

1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.

When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.

The PATENTSCOPE search system: CLIR February 2013 Sandrine Ammann Marketing & Communications Officer.

A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:

Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.

Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.

Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.

GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)

Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Search Engine Architecture

1 01/10/09 1 INFILE CEA LIST ELDA Univ. Lille 3 - Geriico Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation.

Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,

Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.

Algorithmic Detection of Semantic Similarity WWW 2005.

Mining real world data Web data. World Wide Web Hypertext documents –Text –Links Web –billions of documents –authored by millions of diverse people –edited.

Web- and Multimedia-based Information Systems Lecture 2.

An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.

Evgeniy Gabrilovich and Shaul Markovitch

1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,

Harvesting Social Knowledge from Folksonomies Harris Wu, Mohammad Zubair, Kurt Maly, Harvesting social knowledge from folksonomies, Proceedings of the.

Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.

From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.

2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.

+ User-induced Links in Collaborative Tagging Systems Ching-man Au Yeung, Nicholas Gibbins, Nigel Shadbolt CIKM’09 Speaker: Nonhlanhla Shongwe 18 January.

CSC 594 Topics in AI – Text Mining and Analytics

Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.

Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.

Challenge Problem: Link Mining Lise Getoor University of Maryland, College Park.

11/23/00UNU/IAS/UNL Centre1 The Universal Networking Language United Nations University Institute of Advanced Studies United Networking Language ® UNU/IAS.

Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.

Semantic Grounding of Tag Relatedness in Social Bookmarking Systems Ciro Cattuto, Dominik Benz, Andreas Hotho, Gerd Stumme ISWC 2008 Hyewon Lim January.

A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.

MMM2005The Chinese University of Hong Kong MMM2005 The Chinese University of Hong Kong 1 Video Summarization Using Mutual Reinforcement Principle and Shot.

Statistics Explained your guide to European statistics Luxembourg, User Support Workshop, 25 January 2010.

1 Discovering Web Communities in the Blogspace Ying Zhou, Joseph Davis (HICSS 2007)

Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,

University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia 전자전기컴퓨터공학 부 USN 연구실 G

Link Distribution in Wikipedia [0324] KwangHee Park.

Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance Hello everyone,

Exploiting Wikipedia as External Knowledge for Document Clustering

Wikitology Wikipedia as an Ontology

Disambiguation Algorithm for People Search on the Web

ESS multilingual glossary in Statistics Explained - progress Marc Debusschere Agenda point 16 Dissemination Working Group October 2013.

Search Engine Architecture

Fig. 1 (a) The PageRank algorithm (b) The web link structure

Active AI Projects at WIPO

Presentation transcript:

Multilingual Synchronization focusing on Wikipedia

Introduction Wikipedia: Multilingual encyclopedia – Supports over 270 languages English, German, Spanish, French, Chinese, Arabic, … Allows cross-lingual navigation with inter-language link – Inter-language links: hyperlinks from any page in one Wikipedia language edition to one or more nearly equivalent or exactly equivalent pages in another Wikipedia language editions – Different quantity of data on each languages Wikipedia other language editions often suffer from lack of information compared to the English version – Multilingual stat on Feb » English: 3.5 million articles (Most dominant) » French: 1 million articles (3rd) » Korean: 156,290 articles (22nd)

Goal of M-Sync Multilingual Synchronization – Synchronizing contents of Wikipedia from multiple different languages Linking among multiple language contents Combining them to synthesis – The various Wikipedia editions from different languages can offer more precise and detailed information based on different intentions/backgrounds/cultures can fill the gap between different languages and to acquire the integrated knowledge

Multilingual Resource Synthesis Construction Association Network from multiple language hyperlink structures – based on occurrence data A Web page makes a direct reference to another Web page via a hyperlink Definitions – Association A relation between any two words or concepts with a strength – Association Network Nodes and edges – Node: entity (concept or a named entity) – Edge: link between entities with weight(the strength of association)

Motivating Example BaldnessHair

Motivating Example BaldnessHair 탈모증털

Motivating Example BaldnessHair 탈모증털 Increase Strength

Motivating Example BaldnessHair 탈모증털 Hamilton- Norwood_scale

Motivating Example BaldnessHair 탈모증털 Hamilton- Norwood_scale Link Prediction or Keyword Recommendation

Multilingual Resource Synthesis: Construction of Association Network Hypothesis – X is associated with Y in L 1  X’ should be associated with Y’ in L 2 Where Y’ is a corresponding term to Y in different language – Assumption » Inter-language links are accurate links to connect two pages about the same entity or concept in different languages Where X’ is a translating term to X in different language – X is associated with Y according to its strength synthesis based on respective occ-score

Proposed System Workflow Association set for each English word Link Extraction English Documents Association set for each Korean word Link Extraction Korean Documents Association set for each Chinese word Link Extraction Chinese Documents Calculation of Correlations Association for each pair of English and Korean words Selection of highly Associated pairs of words Calculation of Correlations Association for each pair of Korean and Chinese words Calculation of Correlations Bilingual Dictionary

Example of Association Set Association set of “Baldness” alopecia areata adult androgenic alopecia 원형 탈모증 남성형 탈모증 머리카락 營養頭髮荷爾蒙心理

Preprocessing: Selecting Target Source languages(5) – English, Spanish, French, Chinese, Korean Extracting target pages with a 5-clique by inter-language links – Assumption: Pages founded in all 5 languages are key pages and the target to sync Enforcing consistency of a link path – If a path from X(L 1 ) to X’(L 2 ) founded once, its inverse path (X’, X) is automatically added to the output 13 en:Badminton es:Bádmintonfr:Badminton zh: 羽毛球 ko: 배드민턴 A subset of UN official languages

Link types of Wikipedia internal links to other pages in the wiki – Syntax usage: [[Main Page]] external links to other websites interwiki links to other websites registered to the wiki in advance – Unlike internal links, interwiki links do not use page existence detection – Syntax usage: [[wikipedia:Sunflower]] Interlanguage links to other websites registered as other language versions of the wiki

Experiment Link Extraction – Input: 40,000 pages in each languages – Output 5,453,959 links en 3,965,279 links fr 3,368,949 links es 2,397,959 links zh 1,347,811 links ko Association Pairs

Experiment: respective mode Compute strength of Association: LF-IDF – is motivated by TF-IDF heuristic

Experiment: synthesis mode M: Simple average NM: Average with standard deviation Demo –

Program Running Environment Link extractor – X seconds / 1 page LFIDF – 1 hours 40,000 page * 5 language – 5,453,959 links en – 3,368,949 links es – 3,965,279 links fr – 1,347,811 links ko – 2,397,959 links zh Multilingual LFIDF – 2 hours Translating – X minutes Aggregating links from multiple sources – 1.5 hours Computing multi weight

Example of hyperlinks Example links(out-going) of Seattle: – “northwestern United States” – “Washington” – “Lake Washington” – “Michael McGinn (mayor)” – …

Extracting correlated-terms from hypertexts (out-going links) AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD Correlated TermSet 123 language1 language2 language3 language4language5 Correlated TermSet 123, AAA Unified TermSet Translating

Extracting correlated-terms from hypertexts (in-coming links) Correlated TermSet 123 language1 language2 language3 language4language5 Correlated TermSet 123, AAA Unified TermSet Translating 123 BBB AAA CCC DDD 123 BBB AAA CCC DDD 123 BBB AAA CCC DDD 123 BBB AAA CCC DDD 123 BBB AAA CCC DDD

Translating terms Method – Wikipedia dictionary-based We have collected the cross-lingual term pars to build bilingual word pairs – 4 dictionaries are available: EN-KO, ES-KO, FR-KO, ZH-KO Weakness – Lack of vocabularies – Google translation API-base Weakness – Terms(keywords) are too short to solve the word sense disambiguation using MT

Computing weights of co-relatedness using multilingual topic synthesis Weight of correlations – Baseline: frequency-based method – However, different Wikipedia has different viewpoints and concerns – We should give different weight of synthesized correlated term sets according to different lingual usages and frequencies Our proposed solution: – analyzing the topical distributions on each languages and – Computing weights of correlated terms by each topics’ interest – Approach » LDA-based topic distribution using links » Align cross-lingual topic clusters Intersection: common topic Difference: unique topic

Evaluation What – Comparison with Discovered new correlated terms without topical synthesis – Simple union approach Discovered new correlated terms with topical synthesis – Ranked union using calculating the strength of relatedness How – Public measures Co-occurrence Mutual information Normalized Google distance – Wikipedia oriented measures Comparison with the featured articles Comparison with the temporal manner

Comparison with featured articles Featured articles: – are considered to be the best articles in Wikipedia, as determined by Wikipedia's editors AAA BBB CCC DDD EEE FFF GGG Featured article languageX AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD Correlated TermSet 123 language1 language2 language3 language4language5 Correlated TermSet 123, AAA Unified TermSet Translating Correlated TermSet

Comparison with temporal manner AAA BBB CCC DDD EEE FFF GGG Article at t n language1 AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD AAA BBB CCC DDD Correlated TermSet 123 in t1 123 in t n language1 language2 language3 language4language5 Correlated TermSet 123, AAA Unified TermSet Translating Correlated TermSet

Contributions To support the seed data (seed keywords) to complete articles in a multilingual manner, or to guide users in generating new articles in Wikipedia To find unknown correlated words using various sources