Development of NE Wordnet: An Integrated Wordnet for Languages of the North-East India Assamese & Bodo by Utpal Saikia Biswajit Brahma Dibyajyoti Sarmah.


Development of NE Wordnet: An Integrated Wordnet for Languages of the North-East India Assamese & Bodo by Utpal Saikia Biswajit Brahma Dibyajyoti Sarmah Dept. of Computer Science & Information Technology Gauhati University

INTRODUCTION
- NE Wordnet Project for Assamese & Bodo started in
- The Assamese and Bodo wordnets have been developed using the expansion approach: they follow the structure of the original Hindi Wordnet and are built against its synset IDs and concepts.
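The expansion approach described above can be pictured with a small sketch. It is illustrative only: the `Synset` class, the `expand` helper, and all IDs, words and glosses are hypothetical placeholders, not the project's actual schema or data. It shows the one idea the slide states, namely that a target-language synset reuses the ID and concept of its Hindi counterpart.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept in one language, keyed by the Hindi Wordnet synset ID."""
    synset_id: int                      # ID inherited from the Hindi Wordnet
    pos: str                            # part of speech, e.g. "noun"
    words: list[str] = field(default_factory=list)
    gloss: str = ""


def expand(source: Synset, words: list[str], gloss: str) -> Synset:
    """Create the target-language synset for an existing Hindi synset.

    The ID and part of speech are copied unchanged; only the words and
    the gloss are supplied by the lexicographer.
    """
    if not words:
        raise ValueError("a linked synset needs at least one word")
    return Synset(source.synset_id, source.pos, list(words), gloss)


# Placeholder data: the ID, words and glosses are invented for illustration.
hindi = Synset(101, "noun", ["hi_word"], "hi_gloss")
assamese = expand(hindi, ["as_word_a", "as_word_b"], "as_gloss")
```

Because the ID is shared, any synset for which no suitable target-language word exists simply stays unlinked until a lexicographer supplies one.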

NE Wordnet Development Project outcomes till now
Validation
- All the Assamese and Bodo Wordnet activities have been reviewed by the Professors of the Department of Assamese, Modern Indian Language & Bodo of Gauhati University, as well as by other invited resource persons.

Contd.
- The NE Wordnet, structured as a database and integrated with an interactive interface, is ready for NLP research and development. Several NLP applications and research projects have already started using it.
- Automatic Bilingual Dictionary Construction: Assamese-Bodo dictionary construction; prototype developed at Gauhati University.
- Web-based Automatic Multilingual Dictionary Construction: Assamese-Bodo-Nepali-Hindi-English dictionary construction; full web-based system ready, by the Gauhati University team.
- Intelligent Document Categorizing System: prototype developed and tested at Gauhati University; research paper already accepted for GWA-2010.
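The slides do not say how the Assamese-Bodo dictionary prototype works. One plausible route, given that both wordnets are built against the same Hindi synset IDs, is to pair words that share an ID. The sketch below assumes each wordnet is available as a mapping from synset ID to word list; the function name and all data are hypothetical placeholders.

```python
from itertools import product


def build_bilingual_dictionary(source_wn, target_wn):
    """Pair every source word with every target word sharing a synset ID.

    Both arguments map synset ID -> list of words in that language.
    Returns a dict: source word -> sorted list of target words.
    """
    dictionary = {}
    # Only IDs linked in both languages can yield dictionary entries.
    for synset_id in source_wn.keys() & target_wn.keys():
        for src, tgt in product(source_wn[synset_id], target_wn[synset_id]):
            dictionary.setdefault(src, set()).add(tgt)
    return {word: sorted(targets) for word, targets in dictionary.items()}


# Placeholder wordnets: IDs and words are invented for illustration.
assamese_wn = {101: ["as_a", "as_b"], 102: ["as_c"], 103: ["as_d"]}
bodo_wn = {101: ["bd_x"], 102: ["bd_y", "bd_z"]}

as_to_bodo = build_bilingual_dictionary(assamese_wn, bodo_wn)
```

In this toy data, synset 103 has no Bodo entry, so `as_d` gets no translation. The same join extends to more languages by repeating it for each language pair.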

NE Wordnet Development Project outcomes till now
The following glosses have been completed in Assamese so far:
- Common synsets completed =
- Pan-Indian synsets: all completed
- Universal synsets completed = 7147 (Total = 7168)
- Adjective synsets completed = 2376 (Total = 3605)
- Adverb synsets completed = 174 (Total = 209)
- Verb synsets completed = 1588 (Total = 1798)
- Language-specific synsets completed = 127 (Total = 1000)
Total linked number = 24,338

NE Wordnet Development Project outcomes till now
The following glosses have been completed in Bodo so far:
- Common synsets completed =
- Pan-Indian synsets: all completed
- Universal synsets completed = 7143 (Total = 7168)
- Adverb synsets completed = 192 (Total = 209)
- Adjective synsets completed = 2473 (Total = 3605)
- Verb synsets completed = 1752 (Total = 1798)
- Synset Ranker = (34378)
- Language-specific synsets = 74
Total linked number = 24,493
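The completed and total counts on the two status slides can be turned into completion percentages for easier comparison. The sketch below uses only the categories for which both numbers are given; the dictionary and function names are illustrative, and the Bodo language-specific figure is left out because no total is reported for it.

```python
# (completed, total) pairs copied from the Assamese and Bodo status slides.
ASSAMESE = {
    "universal": (7147, 7168),
    "adjective": (2376, 3605),
    "adverb": (174, 209),
    "verb": (1588, 1798),
    "language_specific": (127, 1000),
}
BODO = {
    "universal": (7143, 7168),
    "adjective": (2473, 3605),
    "adverb": (192, 209),
    "verb": (1752, 1798),
}


def coverage(counts):
    """Return the completion percentage per category, to one decimal place."""
    return {
        category: round(100 * done / total, 1)
        for category, (done, total) in counts.items()
    }


assamese_pct = coverage(ASSAMESE)
bodo_pct = coverage(BODO)

for language, pct in (("Assamese", assamese_pct), ("Bodo", bodo_pct)):
    for category, value in pct.items():
        print(f"{language:9s} {category:18s} {value:5.1f}%")
```

On these figures, universal synsets are nearly complete in both languages, while adjectives are the least complete of the part-of-speech categories.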

Problems Faced During Development for Assamese
- Synset related: Among the common synsets of Assamese and Bodo, a few have no suitable Assamese word to represent them, so they have not been entered yet. These remaining synsets have been sent to the expert committee for review.
- Expansion from Hindi/English: The main challenge in the expansion approach is achieving one-to-one mapping.

Problems Faced During Development for Bodo
- Challenges in expansion: Bodo is a developing language. It does not have strong linguistic resources, and its literary resources are very limited. The vocabulary is still growing, with new words continually being discovered, coined and added. As a result, the development of the Bodo Wordnet runs into frequent problems, and overcoming them to expand the Hindi Wordnet with one-to-one mapping has been a big challenge.

Workshops/conferences organized and attended by the member groups:
1. Global Wordnet Conference, IIT, Mumbai, 31st Jan.-4th Feb
2. Indo Wordnet Conference, Amrita University, Coimbatore, June
3. NE Wordnet Workshop, Guwahati, Assam
4. Indo Wordnet Workshop, IIT Kharagpur
5. Spell checker training (attended), C-DAC Pune
6. Indo Wordnet Workshop, Shillong
7. CLIA developers workshop, C-DAC Pune
8. Multiword Expression Workshop, University of Kashmir, Srinagar, 2011

Tools, Applications & Research
- During this period, language-specific tools have been developed.
Language-specific synset creation tool interface

Multi_lingual_dictionary [Online Bodo, Assamese and Hindi Language]:
Step 1: First, select the language.

Step 2: Type the word in that language.

Step 3: When matching words appear automatically, select the word.

Step 4: The result after searching for the word.
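The four steps above amount to a language selection, a prefix-based suggestion list, and a final lookup. The sketch below is one minimal way to model that flow, not the deployed system: the `DictionaryIndex` class, its method names and the sample entries are all hypothetical.

```python
from bisect import bisect_left


class DictionaryIndex:
    """Toy index: language -> {word: [translations]}."""

    def __init__(self, entries):
        self._entries = entries
        # A sorted word list per language makes prefix search a binary search.
        self._sorted = {lang: sorted(words) for lang, words in entries.items()}

    def suggest(self, language, prefix, limit=5):
        """Steps 1-3: return up to `limit` words starting with `prefix`."""
        words = self._sorted[language]
        start = bisect_left(words, prefix)
        matches = []
        for word in words[start:]:
            if not word.startswith(prefix) or len(matches) == limit:
                break
            matches.append(word)
        return matches

    def search(self, language, word):
        """Step 4: return the translations of the selected word."""
        return self._entries[language].get(word, [])


# Placeholder entries: words and translations are invented for illustration.
demo = DictionaryIndex({
    "assamese": {"as_a": ["bd_x"], "as_b": ["bd_y"], "zz_c": ["bd_z"]},
})
suggestions = demo.suggest("assamese", "as_")
result = demo.search("assamese", suggestions[0])
```

Keeping the word list sorted means the words sharing a prefix sit next to each other, so the suggestion step only scans the matching run rather than the whole vocabulary.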

Published papers in conferences/journals/workshops
1. A Novel Approach for Document Classification using Assamese WordNet, Jumi Sarmah, Navanath Saharia and Shikhar K. Sarma, Global Wordnet Conference (GWC), Japan
2. Assamese Vocabulary and Assamese Wordnet Building: An Analysis, Shikhar Kr. Sarma, Utpal Saikia, Mayashree Mahanta, Himadri Bharali, Global Wordnet Conference (GWC), Japan
3. Foundation and Structure of Developing an Assamese Wordnet, Shikhar Kr. Sarma, Moromi Gogoi, Rakesh Medhi, Utpal Saikia, Global Wordnet Conference, IIT Bombay
4. A Wordnet for Bodo Language: Structure and Development, Shikhar Kr. Sarma, Moromi Gogoi, Biswajit Brahma, Mane Bala Ramchiary, Global Wordnet Conference, IIT Bombay, 2010


Contd.
9. Kinship Terms in Assamese Language, Shikhar Kumar Sarma, Utpal Saikia, Mayashree Mahanta, Indo Wordnet Workshop, IIT Kharagpur
10. Formation of Kinship Terms in Bodo Language, Shikhar Kr. Sarma, Biswajit Brahma, Mane Bala Ramchiary, Indo Wordnet Workshop, IIT Kharagpur
11. Architecture of a Spell Checker for an Indo-Aryan Language: Assamese, Ambeswar Gogoi, Shikhar Kr. Sarma and Kishore Baishya, International Journal of Computational Linguistics, Volume (1): Issue (1)
12. A Case Study of Dictionary Annotation as a Pre-processing Task to Develop Assamese Spell Checker, Ambeswar Gogoi and Kishore Baishya, Making of Electronic Dictionary, Linguistic Data Consortium for Indian Languages, CIIL Mysore, 2009

Conclusion
- Integration and collaboration of manpower in the fields of linguistics and computing; development of trained manpower in NLP and local language technology.
- Through this project, a new breed of researchers in language technologies has been trained in the relevant skills and knowledge. In these local languages, formal education in linguistics and literature has minimal computational linkage and offers no training in or exposure to the interlinking of linguistics and computing, so the project helps to develop a team of interdisciplinary researchers. The project has contributed to expertise development and awareness creation in the latest developments in machine translation, lexical semantics and cross-lingual IR in particular.

THANK YOU