Multi-Lingual Wordnets: Coimbatore Workshop (11-14 June, 2009) at Amrita University Pushpak Bhattacharyya Computer Science and Engineering Department Indian.

Presentation transcript:

Multi-Lingual Wordnets: Coimbatore Workshop (11-14 June, 2009) at Amrita University Pushpak Bhattacharyya Computer Science and Engineering Department Indian Institute of Technology Bombay amritesharyai namah: Obeisance to Amma

Objective of the wordnet workshop PAN-Indian Wordnets Involving languages from the North East, the Western part, the Northern part and the Southern part of India –Sanskrit –Assamese, Bodo, Nepali, Manipuri –Hindi, Kashmiri –Marathi, Konkani –Tamil, Telugu, Kannada, Malayalam –English Meeting of minds of those who LOVE WORDS AND THEIR RELATIONSHIPS

Ambiguity The Crux of the problem

Stages of language processing Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

Phonetics Processing of speech Challenges –Homophones: bank (finance) vs. bank (river bank) –Near Homophones: maatraa vs. maatra (hin) –Word Boundary: aajaayenge (aa jaayenge (will come) or aaj aayenge (will come today)); I got [ua]plate –Phrase boundary: mtech1 students are especially exhorted to attend as such seminars are integral to one's post-graduate education –Disfluency: ah, um, ahem etc.

Morphology Word formation rules from root words Nouns: Plural (boy-boys); Gender marking (czar-czarina) Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa → khaaiie) Crucial first step in NLP Languages rich in morphology: e.g., Dravidian, Hungarian, Turkish Languages poor in morphology: Chinese, English Languages with rich morphology have the advantage of easier processing at higher stages of processing A task of interest to computer science: Finite State Machines for Word Morphology
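
The finite-state idea mentioned on this slide can be sketched for the noun-plural case (boy-boys). The following is only an illustration under simplifying assumptions: the state names and the `pluralize` function are hypothetical, and the rule covers just the regular "-s" / "-es" choice, ignoring irregular nouns and spelling changes such as y → ies.

```python
# States of a tiny deterministic finite-state machine. It scans a noun left
# to right and remembers only what matters about the ending seen so far.
OTHER, C, S, SIBILANT = "OTHER", "C", "S", "SIBILANT"


def step(state, ch):
    """Transition function: next state after reading one character."""
    if ch == "s":
        return S                      # ends in "s" so far
    if ch in "xz":
        return SIBILANT               # ends in "x" or "z"
    if ch == "c":
        return C                      # may become "ch"
    if ch == "h" and state in (C, S):
        return SIBILANT               # "ch" or "sh"
    return OTHER


def pluralize(noun):
    """Run the machine over the noun and pick the suffix from the final state."""
    state = OTHER
    for ch in noun.lower():
        state = step(state, ch)
    suffix = "es" if state in (S, SIBILANT) else "s"
    return noun + suffix


if __name__ == "__main__":
    for word in ["boy", "box", "church", "class"]:
        print(word, "->", pluralize(word))
```

A morphologically rich language would need far more states and an output tape (a transducer rather than an acceptor), but the control structure stays the same.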

Lexical Analysis Essentially refers to dictionary access and obtaining the properties of the word e.g. dog: noun (lexical property), take-’s’-in-plural (morph property), animate (semantic property), 4-legged (-do-), carnivore (-do-) Challenge: Lexical or word sense disambiguation

Lexical Disambiguation First step: part of Speech Disambiguation Dog as a noun (animal) Dog as a verb (to pursue) Sense Disambiguation Dog (as animal) Dog (as a very detestable person) Needs word relationships in a context The chair emphasised the need for adult education Very common in day to day communications Satellite Channel Ad: Watch what you want, when you want (two senses of watch) e.g., Ground breaking ceremony/research

Technological developments bring in new terms, additional meanings/nuances for existing terms –Justify as in justify the right margin (word processing context) –Xeroxed: a new verb –Digital Trace: a new expression –Communifaking: pretending to talk on mobile when you are actually not –Discomgooglation: anxiety/discomfort at not being able to access internet –Helicopter Parenting: over parenting

Syntax Processing Stage Structure Detection S NP VP V NP I like mangoes

Parsing Strategy Driven by grammar S-> NP VP NP-> N | PRON VP-> V NP | V PP N-> Mangoes PRON-> I V-> like
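
As a rough sketch of grammar-driven parsing, the rules on this slide can be fed to a small top-down parser with backtracking. The function names (`parse`, `expand`, `parse_sentence`) and the tuple-based tree format are assumptions made for illustration; the slide itself does not prescribe a parsing algorithm. Note that the grammar lists VP -> V PP but gives no rule for PP, so that alternative never succeeds here.

```python
# Phrase-structure rules exactly as listed on the slide.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["PRON"]],
    "VP": [["V", "NP"], ["V", "PP"]],
}
# Lexical rules (compared case-insensitively).
LEXICON = {"N": {"mangoes"}, "PRON": {"i"}, "V": {"like"}}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for each way `symbol` can start at `start`."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start].lower() in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in GRAMMAR.get(symbol, []):
        yield from expand(symbol, rhs, tokens, start, [])


def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, nxt, children + [child])


def parse_sentence(sentence):
    """Return every parse tree of category S that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


if __name__ == "__main__":
    for tree in parse_sentence("I like mangoes"):
        print(tree)
```

Because `parse_sentence` returns a list, a structurally ambiguous sentence under a larger grammar would simply come back with more than one tree.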

Challenges in Syntactic Processing: Structural Ambiguity Scope 1. The old men and women were taken to safe locations (old men and women) vs. ((old men) and women) 2. No smoking areas will allow Hookas inside Preposition Phrase Attachment I saw the boy with a telescope (who has the telescope?) I saw the mountain with a telescope (world knowledge: mountain cannot be an instrument of seeing) I saw the boy with the pony-tail (world knowledge: pony-tail cannot be an instrument of seeing) Very ubiquitous: newspaper headline “20 years later, BMC pays father 20 lakhs for causing son’s death”

Structural Ambiguity… Overheard –I did not know my PDA had a phone for 3 months An actual sentence in the newspaper –The camera man shot the man with the gun when he was near Tendulkar (P.G. Wodehouse, Ring in Jeeves) Jill had rubbed ointment on Mike the Irish Terrier, taken a look at the goldfish belonging to the cook, which had caused anxiety in the kitchen by refusing its ant’s eggs… (Times of India, 26/2/08) Aid for kins of cops killed in terrorist attacks

Headache for Parsing: Garden Path sentences Garden Pathing –The horse raced past the garden fell. –The old man the boat. –Twin Bomb Strike in Baghdad kill 25 (Times of India 05/09/07)

Semantic Analysis Representation in terms of Predicate calculus/Semantic Nets/Frames/Conceptual Dependencies and Scripts John gave a book to Mary Give action: Agent: John, Object: Book, Recipient: Mary Challenge: ambiguity in semantic role labeling –(Eng) Visiting aunts can be a nuisance –(Hin) aapko mujhe mithaai khilaanii padegii (ambiguous in Marathi and Bengali too; not in Dravidian languages)
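
The frame representation for "John gave a book to Mary" can be written down as a small data structure. This is a minimal sketch: the `Frame` class and `frame_for_give` function are hypothetical names, and the pattern matcher handles only the single sentence shape "<agent> gave a <object> to <recipient>", not semantic role labeling in general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Slots of the 'give' frame as named on the slide."""
    action: str
    agent: str
    obj: str
    recipient: str


def frame_for_give(sentence):
    """Fill a Frame from '<agent> gave a|an|the <object> to <recipient>'."""
    words = sentence.rstrip(".").split()
    well_formed = (
        len(words) == 6
        and words[1] == "gave"
        and words[2] in ("a", "an", "the")
        and words[4] == "to"
    )
    if not well_formed:
        raise ValueError("sentence does not match the 'give' pattern")
    return Frame(action="give", agent=words[0], obj=words[3], recipient=words[5])


if __name__ == "__main__":
    print(frame_for_give("John gave a book to Mary"))
```

A sentence such as "Visiting aunts can be a nuisance" shows why this is hard in practice: the same surface string supports two different role assignments, which a fixed pattern cannot distinguish.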

Pragmatics Very hard problem Model user intention –Tourist (in a hurry, checking out of the hotel, motioning to the service boy): Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train. –Boy (running upstairs and coming back panting): yes sir, they are there. World knowledge –WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07)

Discourse Processing of sequence of sentences Mother to John: John go to school. It is open today. Should you bunk? Father will be very angry. Ambiguity of open bunk what? Why will the father be angry? Complex chain of reasoning and application of world knowledge Ambiguity of father father as parent or father as headmaster

Complexity of Connected Text John was returning from school dejected – today was the math test He couldn’t control the class Teacher shouldn’t have made him responsible After all he is just a janitor

Lexical Knowledge Structures Indian Scenario

Hindi Wordnet Dravidian Language Wordnets North East Language Wordnet Marathi Wordnet Sanskrit Wordnet English Wordnet Bengali Wordnet Punjabi Wordnet Konkani Wordnet Linked Wordnets

Great Linguistic Diversity Major streams –Indo European –Dravidian –Sino Tibetan –Austro-Asiatic Some languages are ranked within 20 in the world in terms of the populations speaking them –Hindi and Urdu: 5th (~500 million) –Bangla: 7th (~300 million) –Marathi: 14th (~70 million)

Major Language Processing Initiatives Mostly from the Government: Ministry of IT, Ministry of Human Resource Development, Department of Science and Technology Recently great drive from the industry: NLP efforts with Indian languages in focus –Google –Microsoft –IBM Research Lab –Yahoo –TCS

Technology Development in Indian Languages (TDIL) Started by the Ministry of IT in resource centers across the country Each center has responsibility for two languages: one major and one minor For example, –IIT Bombay: Marathi and Konkani –IIT Kanpur: Hindi and Nepali –ISI Kolkata: Bangla and Santhaali –Anna University: Tamil

Achievements in TDIL: Lexical Resources Wordnets: Hindi and Marathi (IIT Bombay) Ontologies: Tamil concept hierarchy (Tanjavur University, AU-KBC) Semantically rich lexicons: IIT Kanpur, IIITH, IIT Bombay Corpora: Central Institute of Indian Languages (CIIL) Web Content: All 13 centers; Gujarati content is exhaustive and of good quality

Recent Initiatives NLP Association of India: 2 years old; recent efforts focus on making tools and resources freely available on the NLPAI website LDC-IL (like the Linguistic Data Consortium at UPenn) –Approved by the Planning Commission National Knowledge Commission: special drive on translation (human and machine)

Recent Initiatives contd. Consortia set up already for IL-IL MT, E-IL MT and CLIA SAALP: South Asian Association for Language Processing (formed with SAARC countries)

Industry Scenario: English How to use NLP to increase the search engine performance (precision, recall, speed) Google, Rediff, Yahoo, IRL, Microsoft: all have search engine, IR, IE R & D projects outsourced from USA and being carried out in India.

Industry Scenario: Indian Language English-Hindi MT is regarded as critical IBM Research Lab has massive English-Hindi Parallel Corpora (news domain) –Statistical Machine Translation Microsoft India at Bangalore has opened a Multilingual Computing Division Google and Yahoo India are actively pursuing IL search engines

Related work Eurowordnet (Vossen, 1999) and Balkanet (Christodoulakis, 2002) –where synsets of multiple languages are linked among themselves and to the Princeton Wordnet (Miller et al., 1990; Fellbaum, 1998) –through Inter-lingual Indices (ILI)

Our experience: Multilingual Wordnets for Indian Languages

Wordnet work at IIT Bombay Follow the design principle(s) of the Princeton Wordnet for English paying particular attention to language specific phenomena (such as complex predicates) Hindi Wordnet –Total Number of Synsets: >30,000 –Total Number of Unique Words: >65,000 Marathi Wordnet –Total Number of Synsets: >18,000 –Total Number of Unique Words: >30,000

HWN and MWN created using different principles (Tatsam, i.e., Sanskrit words borrowed as such: very often) HWN entry: {peR, vriksh, paadap, drum, taru, viTap, ruuksh, ruukh, adhrip, taruvar} ‘tree’ jaR,tanaa, shaakhaa, tathaa pattiyo se yukt bahuvarshiya vanaspati ‘perennial woody plant having root, stem, branches and leaves’ peR manushya ke lie bahut hi upayogii hai ‘trees are useful to men’ MWN entry: {jhaaR, vriksh, taruvar, drum, taruu, paadap} ‘tree’ mule, khoR, phaanghaa, pane ityaadiinii yokt asaa vanaspativishesh ‘perennial woody plant having root, stem, branches and leaves’ tii damuun jhaadacyaa saavlit baslii ‘Being tired/exhausted she sat under the shadow of the tree’
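
The two entries above share a shape: a set of synonymous words, a gloss, and an example sentence. A minimal sketch of that record follows, using the transliterated words from the slide. The `Synset` class and `shared_words` helper are hypothetical, and comparing transliterated strings for equality is only a crude stand-in for identifying tatsam words common to both languages.

```python
from dataclasses import dataclass


@dataclass
class Synset:
    """One wordnet entry: synonymous words plus gloss and example."""
    language: str
    words: list
    gloss: str
    example: str


HWN_TREE = Synset(
    language="Hindi",
    words=["peR", "vriksh", "paadap", "drum", "taru", "viTap",
           "ruuksh", "ruukh", "adhrip", "taruvar"],
    gloss="perennial woody plant having root, stem, branches and leaves",
    example="trees are useful to men",
)

MWN_TREE = Synset(
    language="Marathi",
    words=["jhaaR", "vriksh", "taruvar", "drum", "taruu", "paadap"],
    gloss="perennial woody plant having root, stem, branches and leaves",
    example="Being tired/exhausted she sat under the shadow of the tree",
)


def shared_words(a, b):
    """Words of synset `a` that also appear, spelled identically, in `b`."""
    return [word for word in a.words if word in b.words]


if __name__ == "__main__":
    print(shared_words(HWN_TREE, MWN_TREE))
```

Here "taru" (Hindi) and "taruu" (Marathi) are not counted as shared because the transliterations differ, which shows the limit of plain string matching.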

Hindi WN: recently made free

A glimpse of the wordnet [Figure: the Marathi synset {झाड, वृक्ष, तरू} with GLOSS मुळे, खोड, फांद्या, पाने इत्यादींनी युक्त असा वनस्पतिविशेष: "झाडे पर्यावरण शुद्ध करण्याचे काम करतात"; HYPERNYMY link to वनस्पती; HYPONYMY links to आंबा, लिंबू; MERONYMY links to मूळ, खोड; HOLONYMY links to रान, बाग]

Marathi WN created from Hindi: expansion approach: issues For a concept, words exist in both Hindi and Marathi: most common For a concept, words exist in Hindi but not in Marathi –{ दादा [daadaa, grandfather], बाबा [baabaa, grandfather], आजा [aajaa, grandfather], दद्दा [daddaa, grandfather], पितामह [pitaamaha, grandfather], प्रपिता [prapitaa, grandfather]} are words in Hindi for paternal grandfather. There are no equivalents in Marathi. For a concept, words exist in Marathi and not in Hindi –{ गुढीपाडवा [gudhipaadvaa, new year], वर्षप्रतिपदा [varshpratipadaa, new year]} are words in Marathi which do not have any equivalents in Hindi.

Analogy with English {mama}: uncle from mother’s side {chacha}: uncle from father’s side No natural words in English Introduce multiwords –{uncle, maternal uncle} and {chacha, paternal uncle} Makes the lexical resource look unnatural to a native speaker Pitfall of expansion approach? WN users tend to look upon and use the lexical resource as an ordinary dictionary.

Other concerns Identical word –Faux Amis: “false friends” or “false cognates” samaadhaan- solution (Hindi), satisfaction (Marathi) shikshaa- education (Hindi), punishment (Marathi) –Narrowing of meaning –Widening of meaning Identical Meaning –Richness of vocabulary in Hindi and not in Marathi and vice versa (like the words for snow in Eskimo language)

Narrowing and Widening of meaning [Figure: two panels, each labeled "Same Word", relating the Hindi and Marathi senses of the word by hypernymy/hyponymy]

Dictionary standardization

Large Scale Nation Wide Projects in Consortia Mode English to Indian Language Machine Translation Indian Language to Indian Language Machine Translation Cross Lingual Information Access –Each of about 800 Crores of Rupees, equivalent to about 200 million dollars –Each with participation by 10 different institutes across the length and breadth of the country

Adopted Standard
Columns: Senses, Hindi, Marathi, Bangali, Oriya, Tamil
Row template: (W1, W2, W3, W4, W5, W6) (W1, W2, W3) (W1, W2, W3, W4) (W1, W2, W3)
Sense (sun): Hindi ( सूर्य, सूरज, भानु, भास्कर, प्रभाकर, दिनकर, अंशुमान, अंशुमाली ); Marathi ( सूर्य, भानु, दिवाकर, भास्कर, रवि, दिनेश, दिनमणी ); ...
Sense (cub, lad, laddie, sonny, sonny boy): Hindi ( लड़का, बालक, बच्चा, छोकड़ा, छोरा, छोकरा, लौंडा ); Marathi ( मुलगा, पोरगा, पोर, पोरगे ); ...
Sense (son, boy): Hindi ( पुत्र, बेटा, लड़का, लाल, सुत, बच्चा, नंदन, पूत, चिरंजीव, चिरंजी ); Marathi ( मुलगा, पुत्र, लेक, चिरंजीव, तनय ); ...

Advantages of the concept based multilingual dictionary (1/2) Economy of labor and storage –Semantic features like [±Animate, ±Human, ±Masculine, etc.] assigned to a nominal concept and not to any individual lexical item of any language –Semantic features, such as [+Stative (e.g., know), +Activity (e.g., stroll), +Accomplishment (e.g., say), +Semelfactive (e.g., knock), +Achievement (e.g., win)] are assigned to a verbal concept.

Advantages of the concept based multilingual dictionary (2/2) Bilingual pairwise dictionaries can be generated automatically. The model admits of the possibility of extracting a domain specific dictionary for all or any specific language pair. The language group which lacks competence in the pivot language- which in our case is Hindi- can benefit from the already worked out languages. –E.g. Tamil and Malayalam
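
To make the claim about automatic generation concrete, here is a sketch of deriving a pairwise dictionary from concept-indexed rows. The `CONCEPTS` layout and the `bilingual_dictionary` function are assumptions for illustration, not the project's actual storage format, and the word lists are shortened subsets of the Hindi and Marathi synsets shown in the "Adopted Standard" table.

```python
from collections import defaultdict

# Concept-based dictionary: one row per sense, one synset per language.
# Word lists are truncated subsets of the synsets in the transcript.
CONCEPTS = {
    "sun": {
        "hindi": ["सूर्य", "सूरज", "भानु"],
        "marathi": ["सूर्य", "भानु", "दिवाकर"],
    },
    "son, boy": {
        "hindi": ["पुत्र", "बेटा"],
        "marathi": ["मुलगा", "पुत्र"],
    },
}


def bilingual_dictionary(concepts, source, target):
    """Map each source-language word to the target words sharing a concept."""
    entries = defaultdict(list)
    for synsets in concepts.values():
        source_words = synsets.get(source, [])
        target_words = synsets.get(target, [])
        if not source_words or not target_words:
            continue  # this concept is not yet covered in one of the languages
        for word in source_words:
            for candidate in target_words:
                if candidate not in entries[word]:
                    entries[word].append(candidate)
    return dict(entries)


if __name__ == "__main__":
    for word, translations in bilingual_dictionary(CONCEPTS, "hindi", "marathi").items():
        print(word, "->", ", ".join(translations))
```

Because every language hangs off the same concept rows, the same function yields a Marathi-Hindi dictionary by swapping the two arguments, and any other pair once its column is filled in.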

Word alignment in the dictionary model Even if we choose the right sense of a word in the source language (SW1), there is still the hurdle of choosing the appropriate target language word. Lexical choice is a function of complex parameters like situational aptness and native speaker acceptability.

Example Concept: ‘the state of having no doubt of something’ –Hindi: {nishshank, anaashankita, aashankahiin, befikr, bekhtak, sangshayhiin} –Marathi: {nihshanka, nirdhaasta, nirbhrot, shankaarahita} Third member in the Hindi synset aashankahiin is appropriately mapped to the fourth member in the Marathi synset shankaarahita and not to the first one.
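
Word-level alignment inside a linked synset pair can be sketched as a small index map. The `LINKS` table and `translate` function are hypothetical; only the one link stated in the example (third Hindi member to fourth Marathi member) is recorded, and the fallback of returning the whole target synset for unlinked words is an assumption made for this sketch.

```python
HINDI = ["nishshank", "anaashankita", "aashankahiin", "befikr", "bekhtak", "sangshayhiin"]
MARATHI = ["nihshanka", "nirdhaasta", "nirbhrot", "shankaarahita"]

# Word-level links as 0-based positions: Hindi index -> Marathi index.
# Only the link named in the example is known; the rest are left unlinked.
LINKS = {2: 3}


def translate(word, source, target, links):
    """Return the linked target word, or all synset members if no link exists."""
    position = source.index(word)  # raises ValueError for words outside the synset
    linked = links.get(position)
    if linked is None:
        return list(target)        # synset-level match only: every member is a candidate
    return [target[linked]]


if __name__ == "__main__":
    print(translate("aashankahiin", HINDI, MARATHI, LINKS))
    print(translate("befikr", HINDI, MARATHI, LINKS))
```

Keeping links separate from the synsets means the concept-level mapping stays valid even before lexicographers have aligned individual words.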

Links set up between words Hindi synset: लड़का /HW1, बालक /HW2, बच्चा /HW3, छोकड़ा /HW4, छोरा /HW5, छोकरा /HW6, लौंडा /HW7 Marathi synset: मुलगा /HW1, पोरगा /HW6, पोर /HW2, पोरगे /HW6 English synset: male-child /HW1, boy /HW2

Linguistic challenges (1/2) Using a synthetic expression –‘ornaments and other gifts given to the bride by the bridegroom on the day of wedding’ chadhaava (Hindi) – विवाहसमयी वराकडून वधुला दिले जाणारे दागिने ‘at-the-time-of-wedding–bridegroom–bride–given–ornament’ (Marathi) Using transliteration, if the synthetic expression is larger –seharaa (~garland: complicated cultural expression) –Seharaa (transliterated in Marathi) Reciprocally, maahervaashiin ‘a woman who has come to stay at her parents' place after her marriage’: no equivalent in Hindi

Linguistic challenges (2/2) Singleton Hindi pivot synset → expressed through more than one finer concept in Marathi fikaa in Hindi: ‘food prepared with less sugar, salt or spice’, Marathi equivalent: three distinct words expressing three distinct finer concepts –agodh ‘less sweet’ –aLanii ‘less salty’ –miLamiLat ‘less spicy’. These three words cannot be taken as the members of a single synset in Marathi

Computational Aspects

Dictionary development framework

Dictionary entry template ID:: CAT:: verb CONCEPT:: be in a state of movement or action EXAMPLE:: "The room abounded with screaming children" SYNSET-ENGLISH :: (abound, burst, bristle)
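
The entry template is a sequence of `FIELD:: value` pairs, so it can be read with a short parser. This is a sketch under assumptions: the regular expression, the `parse_entry` name, and the choice to turn the parenthesised synset into a list are illustrative and not taken from the project's tooling.

```python
import re

# A field name is a run of capitals and hyphens followed by "::",
# with optional spaces before the colons (as in "SYNSET-ENGLISH ::").
FIELD = re.compile(r"([A-Z][A-Z-]*)\s*::")


def parse_entry(text):
    """Split one template entry into a dict keyed by field name."""
    matches = list(FIELD.finditer(text))
    entry = {}
    for match, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        entry[match.group(1)] = text[match.end():end].strip()
    # Turn "(abound, burst, bristle)" into a list of words.
    raw_synset = entry.get("SYNSET-ENGLISH", "")
    entry["SYNSET-ENGLISH"] = [
        word.strip() for word in raw_synset.strip("()").split(",") if word.strip()
    ]
    return entry


SAMPLE = (
    'ID:: CAT:: verb CONCEPT:: be in a state of movement or action '
    'EXAMPLE:: "The room abounded with screaming children" '
    'SYNSET-ENGLISH :: (abound, burst, bristle)'
)

if __name__ == "__main__":
    for field, value in parse_entry(SAMPLE).items():
        print(field, "=", value)
```

The empty ID in the sample is preserved as an empty string, which leaves room for an identifier to be assigned later.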

Language and Task Configuration window

Synset entry and word-alignment interface

Conclusion (1/3) Linked wordnets: Immense Lexical Resource Great benefits to machine translation, cross lingual search Very useful for language teaching, pedagogy, comparative linguistics Akin to Eurowordnet, but critical differences due to typical Indian language characteristics Great Unifier of the country

Conclusion (2/3) Computational challenges: –Maintenance of multilingual data –their insertion, deletion and updating in a spatially and temporally distributed situation

Conclusion (3/3) Advantages of the framework –a linguistically sound basis of the dictionary framework –economy of representation and –avoidance of duplication of effort