
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages
G. Rohit Bharadwaj, Niket Tandon, Vasudeva Varma
Search and Information Extraction Lab, Language Technologies Research Center, IIIT Hyderabad

Outline
- Introduction
- Model of our approach
  - An example
  - Different steps
  - Scoring
- Dataset and Evaluation
  - Dataset
  - Evaluation
- Results
  - Empirical Results
  - Coverage
- Discussion

Why CLIA?
Cross-lingual information access: opening up, for example, a Hindi-wide web and a Telugu-wide web. To bridge the gap between the information available and the languages a user knows, CLIA systems are vital.

Why dictionaries?
Dictionaries form our first step in building such a CLIA system; they are built exclusively to translate/transliterate user queries.

Why Wikipedia?
Rich multilingual content in 272 languages and growing. Well structured. Updated regularly. Not all languages have the privilege of rich linguistic resources; for those that don't, we can harness Wikipedia's structure instead.

How is it done?
- Exploit different structural aspects of Wikipedia.
- Build as many resources as possible from Wikipedia itself.
- Extract parallel/comparable text from each structure using the resources built.
- Build dictionaries using previously built dictionaries and resources.

Contd…
Extract the maximum information from the structure of article pairs that are connected by a cross-lingual link.

Model of our approach

Example…

Titles (dictionary 1)
Titles of the linked articles form the parallel text for the first dictionary. Both directions are considered when building the parallel corpus, i.e., English to Hindi and Hindi to English.

Infoboxes (dictionary 2)
Infoboxes contain vital information, especially nouns. They are processed in two steps: keys, then values. The values of the mapped keys from the two infoboxes are treated as parallel text. We reduce the number of words in each value pair by removing word pairs already scored highly in dictionary 1, as well as stop words.
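As an illustration of the titles step, here is a minimal sketch (our reading of the slide, not the authors' code) that turns cross-lingually linked title pairs into parallel text instances in both directions. The sample pairs are hypothetical; real pairs come from Wikipedia's interlanguage links.

```python
def title_pairs_to_parallel(pairs):
    """Tokenise (English title, Hindi title) pairs into parallel text instances."""
    parallel = []
    for en_title, hi_title in pairs:
        en_words = en_title.lower().split()
        hi_words = hi_title.split()
        # Both directions are used: English -> Hindi and Hindi -> English.
        parallel.append((en_words, hi_words))
        parallel.append((hi_words, en_words))
    return parallel

# Hypothetical sample pairs; in practice these come from interlanguage links.
pairs = [("Ganges", "गंगा"), ("Indian cuisine", "भारतीय व्यंजन")]
print(title_pairs_to_parallel(pairs))
```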

Categories (dictionary 3)
Categories of the articles linked by the ILL link form the parallel text. Both directions are again considered when building the parallel corpus, i.e., English to Hindi and Hindi to English.

Article text (dictionary 4)
The first paragraph of the text is generally the abstract of the article. Sentence pairs are filtered by a Jaccard similarity metric using the dictionaries built so far. Words already mapped in any of the dictionaries, and stop words, are removed from each sentence pair.
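A minimal sketch of this filtering step, under stated assumptions: the 0.3 threshold is an illustrative choice, and `seed_dict` is assumed to map each English word to a single Hindi translation drawn from the dictionaries built so far.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_sentence_pairs(pairs, seed_dict, stop_en, stop_hi, threshold=0.3):
    """Keep sentence pairs whose translated overlap clears the threshold,
    then drop already-mapped words and stop words from the kept pairs."""
    kept = []
    for en_sent, hi_sent in pairs:
        en = [w for w in en_sent if w not in stop_en]
        hi = [w for w in hi_sent if w not in stop_hi]
        # Map English words through the dictionaries built so far.
        translated = {seed_dict[w] for w in en if w in seed_dict}
        if jaccard(translated, hi) >= threshold:
            kept.append(([w for w in en if w not in seed_dict],
                         [w for w in hi if w not in translated]))
    return kept
```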

Scoring
Word pairs drawn from the parallel text are scored with a normalised co-occurrence formula of the form

$\mathrm{score}(w_e^i, w_h^j) = \frac{n(w_e^i, w_h^j)}{n(w_e^i)\, n(w_h^j)}$

where $w_e^i$ and $w_h^j$ are the $i$-th and $j$-th words in the English and Hindi wordlists respectively, $n(w_e^i)$ and $n(w_h^j)$ are the numbers of occurrences of $w_e^i$ and $w_h^j$ in the parallel text, and $n(w_e^i, w_h^j)$ is the number of occurrences of the pair in a single parallel text instance.
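A runnable sketch of this scoring, assuming the normalised co-occurrence form written above (the exact normalisation used in the paper may differ):

```python
from collections import Counter

def score_pairs(parallel):
    """Score candidate (English, Hindi) word pairs as
    n(e, h) / (n(e) * n(h)), counting at the instance level."""
    n_en, n_hi, n_co = Counter(), Counter(), Counter()
    for en_words, hi_words in parallel:
        en_set, hi_set = set(en_words), set(hi_words)
        n_en.update(en_set)
        n_hi.update(hi_set)
        # Every cross-product pair in one parallel instance co-occurs once.
        n_co.update((e, h) for e in en_set for h in hi_set)
    return {(e, h): c / (n_en[e] * n_hi[h]) for (e, h), c in n_co.items()}
```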

Dataset
Three sets of 300 English words each are generated from existing English-Hindi and English-Telugu dictionaries, provided by the Language Technologies Research Center (LTRC, IIIT-H). Each set mixes most frequently used and less frequently used words, with frequency determined from a news corpus. The words are POS tagged to enable tag-based analysis.

Evaluation
Precision and recall are calculated for the dictionaries built:
- precision = ExtractedCorrectMappings / AllExtractedMappings
- recall = ExtractedCorrectMappings / CorrectMappingsUsingAvailableDictionary
The correctness of a mapping is determined in two ways:
- Automatic: using an available dictionary.
- Manual: we manually evaluate the correctness of the word.
Both methods are needed because no language-dependent processing (parsing, chunking, etc.) is done, different word forms (plural, tense, etc.) are returned, and Wikipedia uses different spellings for the same word.
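Both measures follow directly from those two ratios; a small sketch, with `gold` standing in for the correct mappings obtainable from the available dictionary:

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted (english, hindi) mappings
    against the correct mappings from an available dictionary."""
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Toy usage with hypothetical mappings.
extracted = {("water", "पानी"), ("fire", "आग"), ("tree", "पत्ता")}
gold = {("water", "पानी"), ("fire", "आग"), ("tree", "पेड़")}
print(precision_recall(extracted, gold))  # (0.666..., 0.666...)
```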

Empirical Results
Two tables (English to Hindi and English to Telugu) report results for Set 1, Set 2, and Set 3: precision and recall under automated evaluation, precision and recall for titles under manual evaluation, and precision under full manual evaluation. (The numeric scores are not recoverable from this transcript.)

Comparison with Existing Systems
Precision, recall, and F1 score are compared for an existing high-precision approach, an existing high-recall approach, and our approach (Hindi); the scores are likewise not recoverable here. With equal precision and recall, our system performs comparably with the existing systems.

Coverage
Number of unique words added to the dictionary from each structure (chart: structure on the X-axis vs. unique word count on the Y-axis).

Discussion
Titles are taken as the baseline, since most existing CLIA systems over Wikipedia also treat titles as their baseline. Precision is higher under manual evaluation because: a single English word has various context-based translations; the word form returned differs from the form present in the dictionary; and the same word is spelled with different characters in different places. The precision of the dictionaries created falls in the order Titles > Infobox > Categories > Text.

Contd…
Query formation in wiki-CLIA does not depend completely on the dictionaries and their accuracy. The words returned by our dictionaries, even when not exact translations, are related words, since they occur in a related wiki article, and can still be used to form the query. The coverage of proper nouns, which are generally not present in dictionaries, is high; precision, recall, and F-measure (F1) values are reported for them as well.

On-going Work
Extract more parallel sentences from other structures (image meta tags, the body of the article, and anchor text) to increase the coverage of the dictionary. Query formation from these dictionaries.

Questions??? Thank you.