Computational Linguistics: New Vistas
Dr. Narayan Choudhary, ezDI, LLC.
November 22, 2013
Tezpur University, Assam
What we have been doing so far
  Global perspective
    Machine Translation
    Information Retrieval/Extraction
    Spelling and Grammar Checking
    TTS & STT
    Language Learning and Teaching
    Sentiment Analysis
  Indian perspective
    Barely scratched all of the above at the academic/research level
    Real-world, industry-oriented applications yet to come for most of the languages
February 17, 2019 Narayan Choudhary, Tezpur University, Assam
New Vistas
  Application areas being explored at the global level
    Adding new domains for information extraction
      Legal services, market intelligence, clinical linguistics, bio-informatics
    Speech-to-speech translation (real time)
    Deep linguistic analysis
      Involves interfacing at all the levels of linguistic analysis
COLING: Indian Achievements
  Machine Translation Systems
    Quite a few: AnglaBharati, AnuBharati, Shakti, Anusaraka, Shiva, Mantra (domain restricted)
    Publicly available: none of the above (or you are not going to use them anyway!)
  Corporate Products
    Google Translate, Bing Translate, Systran
  Need to reach the level where the output can be trusted to some extent
  Evaluation Scores
    Not available (they won’t publish them!)
Other Tools and Resources for Indian Languages
  Spelling and Grammar Checker
    Quite a few spell checkers reported (CDAC Pune, Microsoft Proofing Tools)
    None works well (but you can always tweak them with your own resources, if you have them)
    Grammar checker not reported yet
  Dictionaries
    Many reported (quite a few online bilingual and monolingual dictionaries available for many languages)
    Problems of standardization remain
    Issues of copyright
Speech Processing in Indian Languages
  Text to Speech
    Quite a few projects
    They lack coherence and need improvement
    Many languages yet to get one
  Speech to Text
    Only preliminary academic efforts, limited in domain (shrutlekhan-rajbhasha)
    Many languages need to be added
Corpus Resources
  Global Scenario
    Quite a few good-sized corpora annotated at various levels of linguistic analysis
    Brown Corpus, LOB Corpus, Penn Treebank (for English); quite a few other corpora available in various other languages (Japanese, Chinese, Arabic and all the European Union languages)
  Indian Scenario
    ILCI is the first major initiative to cater to the needs of annotated corpora in all the major Indian languages
    Raw text (un-annotated corpora) available for research purposes (Gyan Nidhi)
Web as a corpus
  Increasingly, the text available on the web is being used as a corpus
    Market intelligence, sentiment analysis, trend prediction etc. are done on corpora collected from the web
  Increasing presence of Indian languages on the web
  Use of web-generated corpora in industry and academia
  Bottlenecks in using the web as a corpus
    Language identification
    Text-encoding standards
    Heavy pre-processing required
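The first two bottlenecks above (language identification and text encoding) can be illustrated with a minimal sketch. It is not a method from the talk: it only normalizes text to Unicode NFC and guesses the dominant script from Unicode block ranges. The names SCRIPT_RANGES, normalize and dominant_script are hypothetical, and script detection is only a first step, since one script can serve several languages (the Bengali block covers both Bengali and Assamese, Devanagari covers Hindi, Marathi, Nepali and others).

```python
import unicodedata
from collections import Counter

# Unicode block ranges for some major Indic scripts (illustrative subset).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),   # also used for Assamese
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def normalize(text):
    """Bring text to Unicode NFC so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def dominant_script(text):
    """Return the script with the most characters in text, or None."""
    counts = Counter()
    for ch in normalize(text):
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
        else:
            # Not in any listed Indic block: count plain ASCII letters as Latin.
            if ch.isascii() and ch.isalpha():
                counts["Latin"] += 1
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    for sample in ["नमस्ते दुनिया", "অসম", "தமிழ்", "web corpus"]:
        print(sample, "->", dominant_script(sample))
```

Telling apart languages that share a script would need further evidence, such as character n-gram statistics or word lists, and legacy font encodings would have to be converted to Unicode before a check like this is meaningful.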
Linguistic Analysis: Available Tools
  Global Perspective (English)
    Quite a few tools available for automated PoS annotation, syntactic and dependency parsing, semantic role labeling
    Quite good accuracy on the domains they are trained on
    Stanford NLP tools, GATE, OpenNLP, ClearTK
  Indian Perspective
    Quite a few PoS taggers and chunkers reported
    No syntactic parsers, dependency parsers or semantic role labelers yet
    Research work reported from IIIT Hyderabad, IIT Bombay, IIT Kharagpur
    No concentrated platform for linguistic analysis
Semantic Resources
  WordNets
    Major project running at IITB (Indo-WordNet)
  Domain-specific ontologies
    Medical, legal, business (one can build these for Indian English, where there is a need, if encouragement for Indian languages is lacking)
  Argument structure mapping (PropBank etc.) for major Indian languages
  Resources for sentiment analysis
  Discourse analysis
  Anaphora resolution
Boost needed from linguists
  Understand the needs
  Create good corpora
  Create tag-sets
    PoS, chunk tags, tree-banking guidelines for all the major languages
  Annotation at different levels
    Parts of speech
    Chunk labels
    Full syntactic parsing
  Resources for semantic role labeling (PropBank, verb bank, FrameNets)
  Help develop morphological analyzers, pre-processors, tokenizers
Need for Concentrated Effort
  Make a concentrated effort: sporadic effort will not yield results
  Connect researchers (done best when each language’s researchers collaborate)
  Pitch in for a standard
    Follow the global standards first and then go for the language-specific needs
  Stay connected and updated; you are living in a fast-changing world
Thank you for your attention!