
1 Machine Translation and Lexical Resources Activity at IIT Bombay Pushpak Bhattacharyya Computer Science and Engineering Department Indian Institute of Technology Bombay pb@cse.iitb.ac.in http://www.cse.iitb.ac.in/pb

2 Interlingua Methodology Directly obtain the meaning of the source sentence. Do target sentence generation from the meaning representation. Example: John gave the book to Mary. Meaning representation: give-action (agent: john; object: the book; receiver: mary)
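
The two steps on this slide can be sketched in a few lines of Python. This is only an illustration: the class `GiveAction` and both generator functions are hypothetical names, and the "analysis" step is skipped by writing the meaning frame by hand. It shows that, once the frame exists, different surface sentences can be generated without looking at the source sentence again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GiveAction:
    """Hand-built meaning frame for the slide's example (hypothetical type)."""
    agent: str
    obj: str
    receiver: str


def generate_active(m: GiveAction) -> str:
    # Active-voice English realisation of the frame.
    return f"{m.agent.capitalize()} gave {m.obj} to {m.receiver.capitalize()}."


def generate_passive(m: GiveAction) -> str:
    # A second realisation from the same frame; the source sentence is not consulted.
    return (f"{m.obj.capitalize()} was given to "
            f"{m.receiver.capitalize()} by {m.agent.capitalize()}.")


meaning = GiveAction(agent="john", obj="the book", receiver="mary")
print(generate_active(meaning))
print(generate_passive(meaning))
```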

3 Competing approaches: Direct; Transfer-based

4 MT Architectures: Vauquois' triangle

5 State of Affairs Systran reports 19 different language pairs. Of these, 8 are all right for the intended use. Even fewer are capable of quality written or spoken text translation.

6 ENGLISH-SPANISH-ENGLISH...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province... en ese imperio, el arte de la cartografía logró tal perfección que el mapa de una sola provincia ocupó la totalidad de una ciudad, y el mapa del imperio, la totalidad de una provincia... in that empire, the art of the cartography obtained such perfection that the map of a single province occupied the totality of a city, and the map of the empire, the totality of a province Provided by Systran on 19/11/02

7 ENGLISH-KOREAN-ENGLISH...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province 저 제국안에, 단순한 지방의 지도가 도시 의 완전을 점유했다 고 Cartography 의 예 술은 같은 얀벽, 및 제국, 지방의 완전의 지도 를 달성했다 Inside that empire, the map of the region where it is simple occupied the perfection of the city the art of the Cartography is same, yan it attained the map of of perfection of the wall and empire and region Provided by Systran on 19/11/02

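8 UNL Based MT: the scenario [Figure: UNL at the centre, connected to English, Hindi, French and Russian; enconversion goes from a language into UNL, deconversion from UNL into a language.]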
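
One common argument for the hub layout in this scenario is the number of components it needs. The sketch below is back-of-the-envelope arithmetic, not a figure from the presentation: it assumes one enconverter and one deconverter per language for the interlingua design, and one separate translator per ordered language pair for the direct design. The function names are illustrative.

```python
def direct_translators(n_languages: int) -> int:
    # One translator for every ordered (source, target) pair.
    return n_languages * (n_languages - 1)


def interlingua_components(n_languages: int) -> int:
    # One enconverter plus one deconverter per language.
    return 2 * n_languages


languages = ["English", "Hindi", "French", "Russian"]
n = len(languages)
print(f"{n} languages: {direct_translators(n)} direct translators "
      f"vs {interlingua_components(n)} UNL components")
```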

9 Universal Networking Language Common language for computers to express information written in natural language (Uchida et al. 2000). Application: electronic language to overcome the language barrier; information distribution system.

10 UNL Example [Figure: UNL graph with the node arrange linked by agt to John, by obj to meeting and by plc to residence.]
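
A UNL graph like this one can be held as a list of labelled arcs. The sketch below assumes the slide's labels read as three arcs leaving `arrange` (agt to John, obj to meeting, plc to residence); that pairing is inferred from the order of the labels. The `Relation` type and `to_unl` helper are hypothetical names for illustration.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    label: str    # relation label, e.g. "agt"
    source: str   # the node the arc leaves
    target: str   # the node the arc enters


# Assumed reading of the slide's graph.
graph: List[Relation] = [
    Relation("agt", "arrange", "John"),
    Relation("obj", "arrange", "meeting"),
    Relation("plc", "arrange", "residence"),
]


def to_unl(relations: List[Relation]) -> List[str]:
    # Render each arc in the label(source, target) notation used on later slides.
    return [f"{r.label}({r.source}, {r.target})" for r in relations]


for line in to_unl(graph):
    print(line)
```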

11 Components of the UNL System: Universal Words, Relation Labels, Attributes

12 Universal Word [ saayaa ] "shadow(icl>darkness)"; the place was now in shadow [ laoSamaa~ ] "shadow(icl>iota)"; not a shadow of doubt about his guilt [ saMkot ] "shadow(icl>hint)" ; the shadow of the things to come [ Cayaa ] "shadow(icl>deterrant)"; a shadow over his happiness

13 Universal Word (foreign concepts) [aput] "snow(icl>thing)"; [pukak] "snow(aoj<salt like)"; [mauja] "snow(aoj<soft, aoj<deep)"; [massak] "snow(aoj<soft)"; [mangokpok] "snow(aoj<watery)";

14 Relation: agt (agent). agt defines a thing which initiates an action: agt(do, thing). Syntax: agt[":"<Compound UW-ID>] "(" {<UW1>|":"<Compound UW-ID>} "," {<UW2>|":"<Compound UW-ID>} ")". Detailed definition: agent is defined as the relation between UW1, a do, and UW2, a thing, where UW2 initiates UW1, or UW2 is thought of as having a direct role in making UW1 happen. Examples and readings: agt(break(icl>do), John(icl>person)) reads "John breaks"; agt(translate(icl>do), computer(icl>machine)) reads "computer translates".
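
To make the notation concrete, here is a minimal sketch of a reader for the simple binary form used in the examples, `label(UW1, UW2)`. It is not a full UNL parser: compound UW-IDs are not handled, and `parse_relation` is a hypothetical helper. The only subtlety is that the separating comma must be found at parenthesis depth zero, since a Universal Word may contain its own parentheses and commas.

```python
from typing import Tuple


def parse_relation(text: str) -> Tuple[str, str, str]:
    """Split 'label(UW1, UW2)' into (label, UW1, UW2)."""
    label, _, rest = text.strip().partition("(")
    if not label or not rest.endswith(")"):
        raise ValueError(f"not a binary relation: {text!r}")
    body = rest[:-1]  # drop the closing parenthesis of the relation
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            # Top-level comma: separates the two Universal Words.
            return label.strip(), body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"no top-level comma in: {text!r}")


print(parse_relation("agt(break(icl>do), John(icl>person))"))
print(parse_relation("agt(translate(icl>do), computer(icl>machine))"))
```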

15 Attributes Used to describe what is said from the speaker's point of view. In particular, they capture number, tense, aspect and modality information.

16 Example Attributes
I see a flower. UNL: obj(see(icl>do), flower(icl>thing))
I saw flowers. UNL: obj(see(icl>do).@past, flower(icl>thing).@pl)
Did I see flowers? UNL: obj(see(icl>do).@past.@interrogative, flower(icl>thing).@pl)
Please see the flowers. UNL: obj(see(icl>do).@request, flower(icl>thing).@pl.@definite)
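
The attribute notation can be taken apart mechanically. The following sketch splits a single Universal Word as written in these examples into its headword, its restriction (the part in parentheses) and its list of `@` attributes. The regular expression and the name `parse_uw` are assumptions made for illustration; they cover only the simple shapes shown on this slide.

```python
import re
from typing import List, Optional, Tuple

# headword, optional "(restriction)", then zero or more ".@attribute" suffixes
UW_PATTERN = re.compile(
    r"^(?P<head>[^(.]+)"
    r"(?:\((?P<restriction>[^)]*)\))?"
    r"(?P<attrs>(?:\.@\w+)*)$"
)


def parse_uw(text: str) -> Tuple[str, Optional[str], List[str]]:
    match = UW_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised universal word: {text!r}")
    attributes = re.findall(r"\.@(\w+)", match.group("attrs"))
    return match.group("head"), match.group("restriction"), attributes


# "Did I see flowers?" from the slide
print(parse_uw("see(icl>do).@past.@interrogative"))
print(parse_uw("flower(icl>thing).@pl"))
```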

17 The Analyser Machine [Figure: the Enconverter, fed by Analysis Rules and a Dictionary; windows labelled C, C, C, A, A placed over a Node List (n i-1, n i, n i+1, n i+2, n i+3); a Node-net with nodes A, B, C, D, E.]

18 Strategy for Analysis: Morphological Analysis; Syntactico-Semantic Analysis

19 Analysis of a simple sentence > The article and noun are combined and the attribute @indef is added to the noun. > Right shift to put the preposition with the succeeding noun. > Ram’s being a possessing noun, shift right. > These two nouns are resolved into relation pos and the first noun is deleted.

20 Simple sentence (continued) > The preposition of is then combined with the noun and a dynamic attribute OFRES is added to the entry of genius. > Using the attribute OFRES these two nouns are resolved to relation mod and the second noun is deleted. > Shift right again and solve King’s ears; relation pof is generated. > Relation obj is generated here and then relation agt is generated between Report and ears.
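
The flavour of these steps (look at two adjacent nodes, then either combine them, emit a relation and delete a node, or shift right) can be imitated with a toy loop. This is a loose sketch, not the Enconverter's rule language: the tag set (`ART`, `N`, `N_POSS`), the `Node` class and the two rules are assumptions, and only the article rule and the King’s ears step are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    word: str
    pos: str                               # toy tags: "ART", "N", "N_POSS"
    attrs: List[str] = field(default_factory=list)


def analyse(nodes: List[Node]) -> Tuple[List[Node], List[Tuple[str, str, str]]]:
    nodes = list(nodes)                    # work on a copy of the node list
    relations: List[Tuple[str, str, str]] = []
    i = 0
    while i < len(nodes) - 1:
        left, right = nodes[i], nodes[i + 1]
        if left.pos == "ART" and left.word.lower() in ("a", "an") and right.pos == "N":
            right.attrs.append("@indef")   # combine article and noun
            del nodes[i]
        elif left.pos == "N_POSS" and right.pos == "N":
            # Resolve the two nouns into a relation and delete the first noun.
            relations.append(("pof", right.word, left.word))
            del nodes[i]
        else:
            i += 1                         # no rule fired: shift right
    return nodes, relations


remaining, rels = analyse([Node("King", "N_POSS"), Node("ears", "N")])
print([n.word for n in remaining], rels)
```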

21 UNL as Interlingua and Language Divergence (Dave, Parikh, Bhattacharyya, JMT, 2003) Language divergence stands for the discrepancy in representation due to the inherent characteristics of the languages. Types: Syntactic Divergence; Lexical Semantic Divergence.

22 Issue of free word order jaIma nao caaorI krnaovaalao laD,ko kao laazI sao maara. jaIma nao laazI sao caaorI krnaovaalao laD,ko kao maara. caaorI krnaovaalao laD,ko kao jaIma nao laazI sao maara. caaorI krnaovaalao laD,ko kao laazI sao jaIma nao maara. laazI sao jaIma nao caaorI krnaovaalao laD,ko kao maara. Use is made of the fact that in Hindi postpositions stay adjacent to nouns (as opposed to the preposition stranding divergence). Flexibility in parsing: hit and preserve the predicate till the end.
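
The point about postpositions can be shown on the slide's own five sentences. The sketch below closes a noun chunk whenever a postposition is met and treats whatever remains at the end as the predicate. The mapping from postposition to relation label (nao to agt, kao to obj, sao to ins) is an assumption made for this one example, not a general rule of Hindi, and `analyse` is a hypothetical helper. Every word order then produces the same set of relations.

```python
from typing import FrozenSet, List, Tuple

# Assumed mapping for this example only.
POSTPOSITION_TO_RELATION = {"nao": "agt", "kao": "obj", "sao": "ins"}


def analyse(sentence: str) -> FrozenSet[Tuple[str, str, str]]:
    tokens = sentence.rstrip(".").split()
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    for token in tokens:
        if token in POSTPOSITION_TO_RELATION and current:
            # The postposition closes the noun chunk that precedes it.
            chunks.append((POSTPOSITION_TO_RELATION[token], " ".join(current)))
            current = []
        else:
            current.append(token)
    predicate = " ".join(current)          # what is left at the end is the verb
    return frozenset((rel, predicate, arg) for rel, arg in chunks)


sentences = [
    "jaIma nao caaorI krnaovaalao laD,ko kao laazI sao maara.",
    "jaIma nao laazI sao caaorI krnaovaalao laD,ko kao maara.",
    "caaorI krnaovaalao laD,ko kao jaIma nao laazI sao maara.",
    "caaorI krnaovaalao laD,ko kao laazI sao jaIma nao maara.",
    "laazI sao jaIma nao caaorI krnaovaalao laD,ko kao maara.",
]
analyses = [analyse(s) for s in sentences]
print(sorted(analyses[0]))
print(all(a == analyses[0] for a in analyses))
```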

23 Conjunct and compound verbs Typical Indian language phenomenon. Conjunct for verb-verb, compound for other POS+verb. vah gaanao lagaI (She started singing.) calao jaaAao (Go away.) $k jaaAao (Stop there.) Jauk jaaAao (Bend down.) Possibility of combinatorial explosion in the lexicon. Possible solution: wordnet?

24 Use of Lexical Resources Automatic generation of the UW to language dictionary (Verma and Bhattacharyya, Global Wordnet Conference, Czech Republic, 2004): Universal Word generation; semantic attribute generation; heavy use of wordnets and ontologies.

25 Wordnet and Lexical Resources Approximately 12000 Hindi synsets corresponding to about 35000 root words of Hindi. Approximately 7000 Hindi synsets corresponding to about 16000 root words of Hindi. Verb Hierarchy of approximately 4000 unique words corresponding to 6000 senses.

26 WordNet Sub-Graph [Figure: sub-graph around the synset Gar, gaRh, with the gloss "manauYyaaoM ka Cayaa huAa vah sqaana jaao dIvaaraoM sao Gaor kr banaayaa jaata hO", connected by Hypernymy, Hyponymy and Meronymy links to the nodes Aavaasa, inavaasa; AQyana kxa; Sayana kxa; rsaao[-Gar; Aitiqa gaRh; baramada; Aa^Mgana; AaEama; JaaopD,I; saMrcanaa.]

27 Languages under Study
Language   Analysis Status       Generation Status
English    D- 60000  R- 5000     D- 60000  R- 400
Hindi      D- 75000  R- 5700     D- 75000  R- 6500
Marathi    D- 4000   R- 2200     D- 4000   R- 6000
Bengali    D- 500    R- 1800     D- 500    R- 2100

28 Conclusions Predicate preservation strategy used for English, Hindi, Marathi, Bengali (Spanish being added). Focus on morphology for Marathi. Focus on kaarak (case) system for Bengali. Extremely lexical knowledge hungry.

29 Conclusions Work is going on in the creation of Indian language wordnets (Hindi, Marathi at IIT Bombay; Dravidian at Anna University). Interlingua has the attractive possibility of being used as a knowledge representation and applied to interesting applications like summarization, text clustering, and meaning based multilingual search engines.

