Computational Linguistics: New Vistas

1 Computational Linguistics: New Vistas
Dr. Narayan Choudhary, ezDI, LLC. November 22, 2013 Tezpur University, Assam

2 What we have been doing so far
Global perspective: Machine Translation; Information Retrieval/Extraction; Spelling and Grammar Checking; TTS & STT; Language Learning and Teaching; Sentiment Analysis.
Indian perspective: We have barely scratched the surface of all of the above at the academic/research level. Real-world, industry-oriented applications are yet to come for most of the languages.
February 17, 2019 Narayan Choudhary, Tezpur University, Assam

3 New Vistas
Application areas being explored at the global level:
Adding new domains for information extraction: legal services, market intelligence, clinical linguistics, bio-informatics.
Speech-to-speech translation (real time).
Deep linguistic analysis: involves interfacing at all the levels of linguistic analysis.

4 COLING: Indian Achievements
Machine Translation systems: quite a few, including AnglaBharati, AnuBharati, Shakti, Anusaraka, Shiva, Mantra (domain restricted).
Publicly available: none of the above (or you are not going to use them anyway!).
Corporate products: Google Translate, Bing Translate, Systran.
Need to reach the level where the output can be trusted to some extent.
Evaluation scores: not available (they won't publish them!).

5 Other Tools and Resources for Indian Languages
Spelling and grammar checkers: quite a few spelling checkers reported (CDAC Pune, Microsoft Proofing Tools). None works well (but you can always tweak one with your own resources, if you have them). No grammar checker reported yet.
Dictionaries: many reported (quite a few online bilingual and monolingual dictionaries are available for many languages). Problems of standardization remain, as do copyright issues.

6 Speech Processing in Indian Languages
Text to speech: quite a few projects. They lack coherence and need improvement. Many languages are yet to get one.
Speech to text: academic scratches and limited-domain systems (shrutlekhan-rajbhasha). Many languages need to be added.

7 Corpus Resources
Global scenario: quite a few good-sized corpora annotated at various levels of linguistic analysis, e.g. Brown Corpus, LOB Corpus, Penn Treebank (for English). Quite a few other corpora are available in various other languages (Japanese, Chinese, Arabic and all the European Union languages).
Indian scenario: ILCI is the first major initiative to cater to the need for annotated corpora in all the major Indian languages. Raw text (un-annotated corpora) is available for research purposes (Gyan Nidhi).

8 Web as a corpus
Increasingly, the text available on the web is being used as a corpus. Market intelligence, sentiment analysis, trend prediction etc. are done on corpora collected from the web.
Increasing presence of Indian languages on the web.
Use of web-generated corpora in industry and academia.
Bottlenecks in using the web as a corpus: language identification; text-encoding standards; heavy pre-processing required.
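Two of the bottlenecks named on this slide, language identification and text-encoding standards, can be illustrated with a minimal Python sketch. It is not from the presentation: the function names `normalize` and `dominant_script` are hypothetical, and it only guesses the writing script from Unicode character names after NFC normalization. Script detection is a first filter, not full language identification; for example, Assamese and Bengali text both report as BENGALI and need a further statistical step to tell apart.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization so that canonically equivalent
    character sequences collected from different web pages compare equal."""
    return unicodedata.normalize("NFC", text)


def dominant_script(text: str) -> str:
    """Guess the dominant script of a text (hypothetical helper).

    The first word of a character's Unicode name (e.g. DEVANAGARI,
    BENGALI, LATIN) is used as a rough script label. Only letters are
    counted; digits, punctuation and combining marks are ignored.
    """
    counts = Counter()
    for ch in normalize(text):
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["भाषा", "ভাষা", "language", "2013"]:
        print(repr(sample), dominant_script(sample))
```

In a web-corpus pipeline, a filter of this kind would typically run before any heavier pre-processing, so that pages in the wrong script are discarded early.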

9 Linguistic Analysis: Available Tools
Global perspective (English): quite a few tools are available for automated PoS annotation, syntactic and dependency parsing, and semantic role labeling, with quite good accuracy on the domains they were trained on (Stanford NLP tools, GATE, OpenNLP, ClearTK).
Indian perspective: quite a few PoS taggers and chunkers reported. No syntactic parsers, dependency parsers or semantic role labelers built yet. Research work reported from IIIT Hyderabad, IIT Bombay, IIT Kharagpur. No concentrated platform for linguistic analysis.

10 Semantic Resources
WordNets: major project running at IITB (Indo-WordNet).
Domain-specific ontologies: medical, legal, business (one can build these for Indian English; the need is there for Indian languages too, even if the encouragement is not).
Argument structure mapping (PropBank etc.) for major Indian languages.
Resources for sentiment analysis.
Discourse analysis.
Anaphora resolution.

11 Boost-ups from linguists needed
Understand the needs.
Create good corpora.
Create tag-sets: PoS tags, chunk tags, tree-banking guidelines for all the major languages.
Annotation at different levels: parts of speech; chunk labels; full syntactic parsing.
Resources for semantic role labeling (PropBank, verb bank, FrameNets).
Help develop morphological analyzers, pre-processors, tokenizers.
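As an illustration of the kind of pre-processor and tokenizer this slide asks for, here is a minimal Python sketch. It is an assumption of this transcript, not a tool from the presentation: `tokenize`, `split_sentences` and `SENTENCE_END` are hypothetical names. It splits on whitespace, separates punctuation by Unicode category, and treats the danda (U+0964) and double danda (U+0965) as sentence-final marks alongside the Latin ones. Characters are classified by Unicode category rather than by a regular-expression word class so that vowel signs, which are combining marks, stay attached to their words.

```python
import unicodedata

# Sentence-final marks: danda, double danda, and Latin ., ?, !
SENTENCE_END = {"\u0964", "\u0965", ".", "?", "!"}


def tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper).

    Whitespace separates tokens; every punctuation character (Unicode
    category P*) becomes its own token. Combining vowel signs are not
    punctuation, so they remain part of the word they belong to.
    Known limitation: abbreviations such as "Dr." and hyphenated words
    are split apart.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each sentence-final mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    toks = tokenize("यह घर है। वह गया।")
    print(toks)
    print(split_sentences(toks))
```

A rule-based splitter like this is only a starting point; language-specific rules and annotated data from linguists are what turn it into a usable tool.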

12 Need for Concentrated Effort
Make a concentrated effort: sporadic effort will not yield results.
Connect researchers (done best when each language's researchers collaborate).
Pitch in for a standard (follow the global standards first and then address language-specific needs).
Stay connected and updated; you are living in a fast-changing world.

13 Thank You for your attention!