

CALTS, Univ. of Hyderabad. SAP, Language Technology. CALTS has been working in NLP for over a decade. It has participated in the following major projects: 1. NLP-TTP, DOE, Govt. of India. 2. IPDA, DOE, Govt. of India. 3. TRCT, TDIL, MCIT. 4. English-Telugu T2TMT, UPE, UGC, UOH.

1. Morphological Analyzer cum Spell Checker for Telugu. A robust morphological analyzer cum spell checker for Telugu, with a 97% recognition rate. Tested on a corpus of 5 million words. Available for users of Windows and Linux.

2. A Multilingual Encyclopedic Electronic Thesaurus for translators (MEET), a Web-based linguistic application. MEET enables quick access to synonyms. It provides equivalents in other Indian languages and English, along with grammatical and semantic information. A useful application for translators, it provides access to information in Indian languages on the web. It currently includes only Marathi, Hindi, Bangla, Konkani and English. The second phase proposes to include Telugu, Kannada and Oriya. WordNets for individual languages may be linked to the system.

3. Telugu Hyper Grammar. The Telugu Hyper Grammar is designed as a dynamically accessed and non-linearly organized grammar of Telugu. A user can access information in a particular module from any other module. It provides access to a morphological analyzer, a generator and a chunker. It can also access various bilingual and bidirectional digital lexica of Telugu and other Indian languages such as Hindi, Kannada, Tamil, Marathi, Oriya and Malayalam, as well as English.

4. English-Telugu Parallel Corpora. Parallel corpora are sets of thematically corresponding digital texts of selected works. Machine Translation has recently been revolutionized by the use of parallel corpora. Parallel corpora make it possible to discover similarities and differences between a pair of languages. A program for aligning parallel texts in English and Telugu has been developed and is being tested. Selected parallel texts in Telugu, Kannada, Tamil, Marathi and Malayalam have been digitized.
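The slides do not say how the CALTS alignment program works. As a purely illustrative sketch of what aligning parallel texts can involve, the Python below pairs sentences by character length with dynamic programming. The function name `align`, the penalty values and the set of allowed pairings are all assumptions made for this example, not details of the CALTS program.

```python
def align(src, tgt, skip_penalty=10.0, merge_penalty=2.0):
    """Align two sentence lists by character length (toy length-based aligner).

    Returns a list of (source_indices, target_indices) pairs. All penalties
    are illustrative assumptions, not tuned values.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # Allowed pairings: (source sentences consumed, target sentences consumed).
    penalties = {
        (1, 1): 0.0,
        (1, 0): skip_penalty,
        (0, 1): skip_penalty,
        (2, 1): merge_penalty,
        (1, 2): merge_penalty,
    }
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in penalties.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                # Cost of a pairing: length mismatch plus a shape penalty.
                step = abs(s_len - t_len) + penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the pairings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    source = ["aaaa", "bbbbbbbbbb", "cc"]
    target = ["AAAA", "BBBBB", "BBBBB", "CC"]
    for s_idx, t_idx in align(source, target):
        print(s_idx, "<->", t_idx)
```

A practical aligner for an English-Telugu pair would at least need to scale lengths by the expected length ratio between the two languages, and would usually add lexical cues; raw character counts are used here only to keep the sketch short.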

5. English-Telugu T2T Machine Translation System. An English-Telugu Machine Translation System is being built at CALTS in collaboration with IIIT, Hyderabad; Telugu University, Hyderabad; and Osmania University, Hyderabad. It uses an English-Telugu MAT lexicon of 42K entries. A wordform synthesizer for Telugu has been developed and incorporated. It incorporates an evolutionary semantic lexicon. It handles English sentences of varying complexity.

6. MAT Lexica. Bilingual and multidirectional. Machine-readable dictionaries of 10K entries for Telugu-Hindi, Telugu-Kannada, Telugu-Tamil, Telugu-Marathi, Telugu-Oriya, Telugu-Bangla and Telugu-Malayalam are being developed in collaboration with the Telugu Academy. The entries were selected on the basis of their frequency of occurrence in the Telugu corpus. The Telugu-Hindi, Telugu-Kannada and Telugu-Tamil dictionaries are already complete. A major part of these dictionaries was developed by realigning the lexical resources existing at CALTS.

7. Collocations in Indian Languages. Collocations, or specialized word sequences, play a crucial role in a language. They are extremely difficult to identify and to translate effectively, and they present one of the most challenging tasks in Natural Language Processing. In the first phase, Telugu data was collected and analyzed. A long list of collocations was collected and used to check whether the existing criteria are valid. These collocations are compared against other specialized word sequences in the language to understand their functional and distributional properties.

8. Machine Readable Dictionary of Idioms (Telugu-English). Idioms are extremely important, yet they are among the most ubiquitous and least understood categories of language. Machine-readable idioms in English and their equivalents in Telugu, together with the mechanics of their recognition and their transfer rules, are being developed. The machine-readable text will be implemented in XML so that access and retrieval become easier and faster.
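The slides do not give the XML layout of the idiom dictionary. As a minimal sketch of how an XML-encoded entry could be built and looked up, the Python below uses the standard library's `xml.etree.ElementTree`. The element names (`idioms`, `entry`, `idiom`, `equivalent`, `gloss`), the `lang` attribute and the placeholder Telugu value are all hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET


def build_idiom_entry(english, telugu, gloss):
    """Build one dictionary entry; the element names are hypothetical."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "idiom", {"lang": "en"}).text = english
    ET.SubElement(entry, "equivalent", {"lang": "te"}).text = telugu
    ET.SubElement(entry, "gloss").text = gloss
    return entry


def find_equivalent(root, english):
    """Return the stored equivalent of an English idiom, or None if absent."""
    for entry in root.iter("entry"):
        if entry.findtext("idiom") == english:
            return entry.findtext("equivalent")
    return None


if __name__ == "__main__":
    root = ET.Element("idioms")
    # The Telugu value is a placeholder string, not real dictionary data.
    root.append(build_idiom_entry("kick the bucket", "TELUGU-EQUIVALENT", "to die"))
    print(ET.tostring(root, encoding="unicode"))
    print(find_equivalent(root, "kick the bucket"))
```

The linear scan in `find_equivalent` is enough for a sketch; a dictionary of realistic size would more likely be indexed once after parsing.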

9. Electronic Adult Literacy Primer for Telugu. This was developed as part of CALTS's participation in Arohan (a literacy campaign adopted by the university). It is aimed at teaching the script, or the written form of the language, rather than the language itself. It is based on the frequency of characters in written texts: learning the few most frequent characters first ensures the greatest coverage in character recognition. Special features include characters with animation and speech. Special attention is given to the presentation of allographs.
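The frequency argument can be made concrete with a small calculation. The Python sketch below orders characters by frequency and reports the cumulative share of the text each prefix of that ordering covers; the function names `coverage_order` and `characters_needed` are invented for this example and say nothing about how the primer itself was built.

```python
from collections import Counter


def coverage_order(text):
    """Return (character, cumulative coverage) pairs, most frequent first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    ordered = []
    running = 0
    for ch, n in counts.most_common():
        running += n
        ordered.append((ch, running / total))
    return ordered


def characters_needed(text, target=0.8):
    """Count how many of the most frequent characters reach the target coverage."""
    for k, (_, covered) in enumerate(coverage_order(text), start=1):
        if covered >= target:
            return k
    return 0


if __name__ == "__main__":
    sample = "aaaa bbb cc d"
    for ch, covered in coverage_order(sample):
        print(ch, f"{covered:.0%}")
    print(characters_needed(sample, target=0.8))
```

This counts individual code points. For Telugu, a written character as a learner sees it often spans several code points (a consonant plus vowel signs, or a conjunct), so a realistic count would first segment the text into such units.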

10. A generic system for morphological generation for Indian languages. Morphological generators for various Indian languages, particularly Telugu, Kannada, Tamil, Malayalam, Bangla and Oriya, are in different stages of development. The system is a generic framework for wordform synthesis for Indian languages. It includes a testing module to measure the efficiency and coverage of the system.

11. Telugu-Tamil Machine Translation System. A Telugu-Tamil MT system is being developed using the resources available at CALTS. It uses the Telugu morphological analyzer, the Tamil generator developed at CALTS, the Telugu-Tamil dictionary developed as part of MAT Lexica, and a verb sense disambiguator based on verb argument structure.

12. Word Sense Disambiguation using Argument Structure. A system based on the argument structure of Telugu verbs. It uses a feature-based semantic lexicon. It efficiently disambiguates polysemous verbs in context. It is incorporated in the Telugu-Tamil MT system.
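The slides give no detail on the lexicon or the matching procedure. One common way to use argument structure for disambiguation is to attach semantic feature requirements to each verb sense and pick the sense whose requirements the actual arguments satisfy. The Python sketch below illustrates that idea only; the lexicon entries, feature names and sense labels are invented, and English toy words stand in for Telugu data.

```python
# Hypothetical feature-based lexicon: noun -> set of semantic features.
NOUN_FEATURES = {
    "boy": {"animate", "human"},
    "river": {"inanimate", "liquid"},
    "engine": {"inanimate", "machine"},
}

# Hypothetical verb entries: each sense lists the features it requires
# of the noun filling each argument role.
VERB_SENSES = {
    "run": [
        {"sense": "move_fast", "args": {"subject": {"animate"}}},
        {"sense": "flow", "args": {"subject": {"liquid"}}},
        {"sense": "operate", "args": {"subject": {"machine"}}},
    ],
}


def disambiguate(verb, arguments):
    """Return the first sense whose feature requirements all arguments meet.

    `arguments` maps a role name (e.g. "subject") to a noun. Returns None
    when the verb is unknown or no sense fits.
    """
    for sense in VERB_SENSES.get(verb, []):
        satisfied = True
        for role, required in sense["args"].items():
            features = NOUN_FEATURES.get(arguments.get(role), set())
            # Every required feature must be present on the argument.
            if not required <= features:
                satisfied = False
                break
        if satisfied:
            return sense["sense"]
    return None


if __name__ == "__main__":
    print(disambiguate("run", {"subject": "river"}))
    print(disambiguate("run", {"subject": "engine"}))
```

Returning the first matching sense keeps the sketch simple; a fuller system would need a way to rank senses when several fit or none fits exactly.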

13. A case-sensitive roman transliteration for Indian languages as an overall pattern. A roman transliteration scheme for the unwritten languages of India has been developed. A common transliteration scheme for scripts derived from Brahmi and scripts not derived from Brahmi has been developed. Suprasegmentals are mapped onto roman characters. There are no non-unique character mappings. It allows complete conversion between various languages.
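The requirement of unique character mappings is what makes conversion reversible. The Python sketch below shows this with a small case-sensitive table for a few Telugu letters, where upper and lower case distinguish letters such as dental and retroflex consonants. The table is a made-up illustration, not the CALTS scheme, and it maps letters symbol by symbol, ignoring vowel signs and conjuncts.

```python
# Hypothetical case-sensitive table for a handful of Telugu letters.
TO_ROMAN = {
    "అ": "a",  # short a
    "ఆ": "A",  # long a
    "క": "k",  # ka
    "ఖ": "K",  # kha
    "త": "t",  # dental ta
    "ట": "T",  # retroflex ta
    "న": "n",  # dental na
    "ణ": "N",  # retroflex na
}


def build_inverse(mapping):
    """Invert a mapping, refusing tables that map two letters to one symbol."""
    inverse = {}
    for source, roman in mapping.items():
        if roman in inverse:
            raise ValueError(f"non-unique mapping for {roman!r}")
        inverse[roman] = source
    return inverse


TO_SCRIPT = build_inverse(TO_ROMAN)


def to_roman(text):
    """Transliterate mapped letters; other characters pass through unchanged."""
    return "".join(TO_ROMAN.get(ch, ch) for ch in text)


def to_script(text):
    """Convert roman symbols back to the script using the inverse table."""
    return "".join(TO_SCRIPT.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "కటన"
    roman = to_roman(word)
    print(roman)
    print(to_script(roman) == word)
```

Because each letter has exactly one roman symbol and the reverse, a round trip restores the original text, provided the input contains no roman characters of its own that would collide with the table.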