

On-line Compilation of Comparable Corpora and Their Evaluation Radu ION, Dan TUFIŞ, Tiberiu BOROŞ, Alexandru CEAUŞU and Dan ŞTEFĂNESCU Research Institute for Artificial Intelligence (RACAI) FASSBL-7 Dubrovnik, Croatia October 4—6, 2010

Introduction
• Multilingual Comparable Corpora (MCC) are usually easier to find and gather than parallel corpora
• There are many types of MCC, distinguished by their degree of relatedness: strongly comparable, weakly comparable, very non-parallel, etc.
• Our working definition (Munteanu & Marcu, 2006): a set of paired documents that, even though they are not translations of one another, are related and convey overlapping information
• For instance, news about your favorite local football team suffering a defeat last night

Document pairing in MCC
• To be able to use large MCC, we need to pair documents from the source and target languages
• Suppose that we gather some type of news corpora (sports, for instance) in two languages, and that we do so by streaming news sites in those languages
• Suppose that we do not keep the individual documents but join them into one large document per language
• If the source and target documents have 1M words each (a very optimistic scenario), we will need at least 1M × 1M = 10^12 operations to word-align the documents!
• But if we had 1000 documents of 1000 words each (in each of the languages) and managed to align the documents first, we would need only 1000 × 10^6 = 10^9 operations
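The arithmetic on this slide can be checked with a short sketch. It assumes, as the slide implicitly does, that a naive word aligner considers every source-word/target-word combination within a document pair; the function name is illustrative only.

```python
def word_alignment_cost(src_words: int, tgt_words: int) -> int:
    """Word pairs a naive aligner must consider for one document pair."""
    return src_words * tgt_words

# Case 1: everything merged into one 1M-word document per language.
merged = word_alignment_cost(10**6, 10**6)

# Case 2: 1000 documents of 1000 words each, already paired 1:1,
# so only words inside each of the 1000 pairs are compared.
paired = 1000 * word_alignment_cost(1000, 1000)

print(merged)            # 1000000000000  (10^12)
print(paired)            # 1000000000     (10^9)
print(merged // paired)  # 1000
```

Under this assumption, pairing the documents first reduces the word-alignment work by a factor of 1000.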

Wikipedia as an MCC corpus
• Wikipedia is an extremely valuable resource in that it is a free collection of (generally) good quality articles that have versions in many languages
• Many Wikipedia articles are linked to their versions in other languages, a feature that makes Wikipedia an inherently large MCC corpus
• English Wikipedia has 3,431,874 articles; Romanian Wikipedia has 150,797 articles
• We have employed two different strategies for building MCC from Wikipedia:
  • using Romanian “quality articles” (very good quality articles that are complete, well written, and approved by senior Wikipedia administrators)
  • using Princeton English WordNet (to be explained…)

MCC from Wikipedia quality articles
• Having a list of Romanian quality articles …
• We have gathered 128 pairs of English-Romanian documents from Wikipedia (602K/502K words) using one of the following heuristics:
  • Following the English link from the Romanian article gave us the English pair of the Romanian document
  • English articles that had the exact same name as Romanian articles (“Alicia Keys”, “Evanescence”, etc.)
  • We automatically translate the title of the Romanian page into an English query by using translation lexicons (we consider the first 2 translations for every Romanian content word). We retrieve the first 10 results and manually find the pair of the Romanian document, but an automatic method is also available (to be described…)
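The third heuristic can be sketched as follows. The slide does not say how content words are identified, how untranslatable words are handled, or how the translations are combined. This sketch therefore assumes a small stop-word list, keeps unknown words (such as proper names) unchanged, and concatenates all translations into a single query. `LEXICON`, `STOPWORDS` and `title_to_query` are hypothetical names, and the lexicon entries are toy data.

```python
# Hypothetical toy lexicon: Romanian word -> English translations, best first.
LEXICON = {
    "războiul": ["war", "battle", "fight"],
    "independență": ["independence", "autonomy"],
}

# Assumed stand-in for content-word detection: drop a few function words.
STOPWORDS = {"de", "la", "și", "în", "din", "al", "a"}

def title_to_query(title, lexicon=LEXICON, max_translations=2):
    """Build an English query from a Romanian page title.

    Keeps at most `max_translations` translations per content word;
    words missing from the lexicon are passed through unchanged.
    """
    terms = []
    for word in title.lower().split():
        if word in STOPWORDS:
            continue
        terms.extend(lexicon.get(word, [word])[:max_translations])
    return " ".join(terms)

print(title_to_query("Războiul de Independență"))
# war battle independence autonomy
```

The resulting string would then be submitted to a search engine, and the first 10 results inspected for the matching English article.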

MCC from Wikipedia using WordNet
• Using Princeton WordNet (wordnet.princeton.edu), extract a list of named entities (literals that are capitalized and usually in the “instance_of” relation with their parents)
• Transform these literals into Wikipedia page names by replacing spaces with underscores (“_”) and adding the Wikipedia URL prefix en.wikipedia.org/wiki/
• Extract all the English pages we can find and, for each page, the Romanian and/or German versions, if they exist, by following the interlingual links
• We strip the HTML information from the documents, retaining only the UTF-8 text, and we also store the categories of each document in order to be able to select different domain corpora
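The literal-to-page-name step can be sketched as below. The `https://` scheme, the percent-encoding of non-ASCII characters, and the capitalization test are assumptions added for illustration; the slide only specifies the space-to-underscore replacement and the `en.wikipedia.org/wiki/` prefix. The literals in the usage lines are examples, not data extracted from WordNet.

```python
from urllib.parse import quote

# Prefix from the slide; the scheme is an assumption.
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

def looks_like_named_entity(literal: str) -> bool:
    """Rough stand-in for the slide's criterion: the literal is capitalized."""
    return literal[:1].isupper()

def literal_to_wikipedia_url(literal: str) -> str:
    """Turn a WordNet literal into a candidate English Wikipedia URL."""
    page_name = literal.strip().replace(" ", "_")
    # quote() leaves letters, digits and "_" untouched and escapes the rest.
    return WIKI_PREFIX + quote(page_name)

literals = ["Alicia Keys", "dog", "New York"]
urls = [literal_to_wikipedia_url(l) for l in literals if looks_like_named_entity(l)]
print(urls)
# ['https://en.wikipedia.org/wiki/Alicia_Keys', 'https://en.wikipedia.org/wiki/New_York']
```

Not every candidate URL will correspond to an existing page, which matches the slide's wording "all the English pages we can find".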

Sizes of Collected MCC corpora
• Using the WordNet named entities method we were able to gather the following data (in thousands of words):

Document pairing in MCC
• The problem is to automatically pair documents (1:1 mapping) from the source language set with those in the target language set
• To do this, we replaced each word in every document with its translation equivalents, keeping at most 3 translations per word and considering only those source words that have a low translation entropy score (at most 0.5)
• If two candidate documents are represented as binary vectors x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) in which a position is 1 if the corresponding term is found in the document …
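A minimal sketch of this representation follows. The slide does not give the entropy formula or its logarithm base, so Shannon entropy in bits over the translation probabilities is assumed here. The lexicon, its probabilities and all function names are hypothetical.

```python
import math

# Hypothetical lexicon: source word -> {translation: probability}.
LEXICON = {
    "echipă": {"team": 0.95, "crew": 0.05},
    "meci": {"match": 0.95, "game": 0.05},
    "face": {"do": 0.4, "make": 0.4, "perform": 0.2},
}

def translation_entropy(dist):
    """Shannon entropy (base 2, an assumption) of a translation distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def translate_document(words, lexicon=LEXICON, max_translations=3, max_entropy=0.5):
    """Map source words to a set of target-language terms.

    Words that are missing from the lexicon or whose translation entropy
    exceeds `max_entropy` are skipped; the rest contribute their
    `max_translations` most probable translations.
    """
    terms = set()
    for word in words:
        dist = lexicon.get(word)
        if dist is None or translation_entropy(dist) > max_entropy:
            continue
        best = sorted(dist, key=dist.get, reverse=True)[:max_translations]
        terms.update(best)
    return terms

def binary_vector(terms, vocabulary):
    """1 at position i if vocabulary[i] occurs in the document, else 0."""
    return [1 if term in terms else 0 for term in vocabulary]

vocabulary = ["crew", "game", "make", "match", "team"]
source_terms = translate_document(["echipă", "face", "meci"])
print(binary_vector(source_terms, vocabulary))  # [1, 1, 0, 1, 1]
```

In this toy data, "face" is dropped because its translations are spread too evenly (high entropy), so it contributes no terms to the vector.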

Percent disagreement d(x, y)
• The percent disagreement measure is the one that best differentiates between good pairs and bad ones (tested against the Euclidean, Squared Euclidean and Manhattan distances)
• We obtained 72% accuracy when aligning the 128-document test set (the quality articles) from Romanian Wikipedia
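The slide's formula did not survive in this transcript. The sketch below uses the usual definition of percent disagreement, the fraction of positions at which two vectors differ, and assumes this is what the authors mean. The greedy 1:1 pairing is likewise an illustrative assumption, not the authors' documented procedure, and the document identifiers are made up.

```python
def percent_disagreement(x, y):
    """Fraction of positions where the two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if not x:
        return 0.0
    return sum(a != b for a, b in zip(x, y)) / len(x)

def pair_documents(source_vectors, target_vectors):
    """Greedy 1:1 pairing: repeatedly take the closest still-unpaired pair."""
    candidates = sorted(
        (percent_disagreement(sv, tv), s, t)
        for s, sv in source_vectors.items()
        for t, tv in target_vectors.items()
    )
    pairs, used_targets = {}, set()
    for _, s, t in candidates:
        if s not in pairs and t not in used_targets:
            pairs[s] = t
            used_targets.add(t)
    return pairs

source = {"ro_1": [1, 1, 0, 0], "ro_2": [0, 0, 1, 1]}
target = {"en_a": [0, 0, 1, 0], "en_b": [1, 1, 0, 1]}
print(percent_disagreement(source["ro_1"], target["en_b"]))  # 0.25
print(pair_documents(source, target))  # {'ro_1': 'en_b', 'ro_2': 'en_a'}
```

For binary vectors this value equals the Manhattan distance divided by the vector length, so the measures the slide compares against differ mainly in scaling and in how they weight disagreements.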

Focused MCC crawling
• Usually the task of collecting corpora from the web is undertaken once and then all the related tools and resources are forgotten …
• Until a new corpus needs to be built, in which case the whole suite of scripts is usually rewritten in order to cope with the new requirements
• In order to avoid this unnecessary duplication of work, we developed a graphical web crawler that, based on an input list of URLs, crawls the web, stores the documents in text form and, optionally, runs them through a suite of NLP tools of the user’s choice
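The pipeline described here (URL list in, plain text out, optional NLP steps) can be sketched with the standard library alone. This is not the authors' crawler: it only processes the seed URLs and does not follow links, and the page fetcher is passed in as a function so that the sketch runs without network access. All names are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Strip markup and return the page text as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def crawl(urls, fetch, processors=()):
    """Fetch each URL, keep its text, and apply optional processing steps.

    `fetch` maps a URL to an HTML string; `processors` is a sequence of
    text -> text callables standing in for the user-selected NLP tools.
    """
    corpus = {}
    for url in urls:
        text = html_to_text(fetch(url))
        for process in processors:
            text = process(text)
        corpus[url] = text
    return corpus

# Usage with an in-memory "web" instead of real HTTP requests.
fake_web = {"page1": "<html><body><p>Salut <b>Lume</b></p><script>x=1</script></body></html>"}
print(crawl(["page1"], fake_web.__getitem__, processors=[str.lower]))
# {'page1': 'salut lume'}
```

Passing the fetcher and the processing steps as parameters is one way to avoid rewriting the whole script suite for every new corpus, which is the problem the slide describes.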

The script-based web crawler

Conclusions
• Comparable corpora are easier to obtain than parallel corpora, and in the ACCURAT project we intend to exploit comparable corpora in order to obtain parallel data that will complement and improve existing translation models
• We have collected around 46M words of English-Romanian comparable corpora and around 26M words of Romanian-German comparable corpora from Wikipedia
• We have also developed a generic graphical web crawler that will collect even more comparable corpora from the web