A knowledge-rich morph analyzer for Marathi derived forms
  Ashwini Vaidya, IIIT Hyderabad
Need for morphological analysis
  Morph analysis provides basic information about a word's category, gender, number, etc.
  It is required for machine translation tasks.
  It is necessary for building part-of-speech taggers.
  Accurate tools are especially required for morphologically rich languages.
Inflectional and derivational forms
  To begin with, morph analysis concentrates on inflectional forms.
  Inflection is more regular and productive than derivation.
  E.g. a plural affix would attach to almost all nouns, but a derivational affix like -ness attaches only to a few bases.
  The criteria of attachment are more difficult to determine for a derivational affix.
Computational analysis of derived forms
  Previous approaches have used strategies such as:
    creation of a suffix table (Hoeppner, 1982)
    identifying morphologically 'active' bases (Byrd, 1986)
    using an extensive semantic ontology (Woods, 2000)
  Statistical approaches have focused on automatic acquisition of morphology (e.g. Sharma et al. for Assamese).
Productivity of derivational suffixes
  A survey of some noun-forming affixes in the CIIL Marathi corpus showed that some occur more frequently than others.
  Analysis of such suffixes would capture some linguistic knowledge.
  -pəɳa, -ɪkə, -t̪a, -iː attach more freely.
  Suffixes like -ɪkərəɳə, -gɪri, -əɳə are less frequent.
Marathi morph analysis
  Existing morph analyzer by Akshar Bharati
  114 paradigms for nouns, verbs, pronouns and adjectives
  Derivational and inflectional processes operate together, hence both kinds of knowledge are needed.
  The open-source tool Lttoolbox allows for easy conversion and creation of new paradigms.
Building a morphological dictionary
  The Lttoolbox tool requires the creation of a set of correspondences between surface forms and lexical forms.
  Surface forms (SF): forms that have undergone some morphological process
  Lexical forms (LF): base forms of the words, entered in the dictionary
  Regularities in these correspondences form paradigms.
  Morph analysis takes an SF as input and returns the LF as output.
  Generation, i.e. the reverse direction, is also possible.
Sample paradigm
  A yAlA A
  Dictionary entry: kacar
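The markup of the sample paradigm is not preserved here, so the following is only a minimal Python sketch of the idea, not Lttoolbox itself. It assumes the paradigm pairs the surface endings A and yAlA with the lexical ending A, and that the dictionary entry kacar is the shared stem. Grammatical tags are left out, and the names `PARADIGM_A`, `DICTIONARY`, `analyse` and `generate` are hypothetical.

```python
# Toy model of one paradigm: surface ending -> lexical ending.
# Assumed reading of the sample paradigm; grammatical tags are omitted.
PARADIGM_A = {"A": "A", "yAlA": "A"}

# Dictionary: stem -> the paradigm it is linked to (entry "kacar").
DICTIONARY = {"kacar": PARADIGM_A}


def analyse(surface):
    """Return every lexical form (LF) that a surface form (SF) maps to."""
    results = []
    for stem, paradigm in DICTIONARY.items():
        if surface.startswith(stem):
            ending = surface[len(stem):]
            if ending in paradigm:
                results.append(stem + paradigm[ending])
    return results


def generate(lexical):
    """Return every surface form (SF) that maps back to a lexical form (LF)."""
    results = []
    for stem, paradigm in DICTIONARY.items():
        for surface_ending, lexical_ending in paradigm.items():
            if stem + lexical_ending == lexical:
                results.append(stem + surface_ending)
    return results


print(analyse("kacaryAlA"))
print(generate("kacarA"))
```

Under these assumptions, analysis and generation are the same table read in opposite directions, which is why one dictionary can serve both tasks.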
Adding knowledge about derivational suffixes
  The sample paradigm for lahAna (ləhanə, 'small', adjective) is used to call another paradigm containing information about the derivational suffix.
Nested paradigm
  The paNA paradigm is 'called' from the previous one:
    surface paNA corresponds to lexical paNA
    surface paNAne corresponds to lexical paNA
Sample output
  lahAna/lahAna
  lahAnapaNA/lahAnapaNA
  lahAnapaNAne/lahAnapaNA
More features
  It is possible to call more than one paradigm at a time.
  For example, lahAna can take -paNA or -paNa.
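How one paradigm calls another can be sketched in plain Python. This is an illustration under stated assumptions, not the Lttoolbox dictionary format: the paradigm names (`lahAna__adj`, `paNA__n`, `paNa__n`) and the function names are hypothetical, and the content of the -paNa paradigm is assumed to be a single entry. Only the paNA and paNAne correspondences come from the sample output.

```python
# Each paradigm is a list of (surface_part, lexical_part, called_paradigms).
# A non-empty called_paradigms list means the entry continues into them.
PARADIGMS = {
    # Hypothetical adjective paradigm: the bare form, plus calls into
    # two derivational paradigms at once.
    "lahAna__adj": [
        ("", "", []),
        ("", "", ["paNA__n", "paNa__n"]),
    ],
    # Derivational paradigm for -paNA (pairs taken from the sample output).
    "paNA__n": [
        ("paNA", "paNA", []),
        ("paNAne", "paNA", []),
    ],
    # Assumed single-entry paradigm for -paNa.
    "paNa__n": [
        ("paNa", "paNa", []),
    ],
}


def expand(paradigm_name):
    """Yield (surface_suffix, lexical_suffix) pairs, following nested calls."""
    for surface, lexical, calls in PARADIGMS[paradigm_name]:
        if not calls:
            yield surface, lexical
        for called in calls:
            for sub_surface, sub_lexical in expand(called):
                yield surface + sub_surface, lexical + sub_lexical


def forms(stem, paradigm_name):
    """Return 'surface/lexical' strings for a stem linked to a paradigm."""
    return [f"{stem}{s}/{stem}{l}" for s, l in expand(paradigm_name)]


for line in forms("lahAna", "lahAna__adj"):
    print(line)
```

The first three forms printed match the sample output; the fourth, lahAnapaNa/lahAnapaNa, comes from the assumed -paNa paradigm. Because the suffix knowledge lives in its own paradigm, any other adjective entry could call it without repeating the entries.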
Present work
  The morphological dictionary consists of:
    10 derivational suffixes in Marathi
    38 derivational paradigms
  Total number of forms generated: 450,000
  Preliminary evaluation over a set of 200 derived forms taken from a corpus shows 32% coverage.
Problems
  Coverage can be improved if the following issues can be handled:
    Prefixes need further processing.
    Cases of 'Vriddhi' cannot be handled well using paradigms. Example: pəʋit̪rə + yə = paʋit̪ryə (pure + suffix = purity)
    Emphatic particles like -hI and -ca
    Some noun-forming suffixes like -Ne or -ArI are highly regular, hence better handled using an inflectional paradigm.
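Why Vriddhi resists a paradigm-based treatment can be seen in a small Python sketch. It assumes a simplified romanisation in which `a` stands for the short vowel ə and `A` for the long vowel a, so the example above becomes pavitra + ya = pAvitrya. The stem, the paradigm contents and the function name are hypothetical.

```python
# A paradigm can only rewrite material that follows the shared stem.
STEM = "pavitr"                      # stem shared by SF and LF (assumed)
PARADIGM = {"a": "a", "ya": "a"}     # surface ending -> lexical ending (assumed)


def analyse(surface):
    """Return the lexical form, or None if the stem or ending does not match."""
    if not surface.startswith(STEM):
        return None
    lexical_ending = PARADIGM.get(surface[len(STEM):])
    if lexical_ending is None:
        return None
    return STEM + lexical_ending


print(analyse("pavitra"))    # base form: analysed
print(analyse("pavitrya"))   # plain concatenation without Vriddhi: analysed
print(analyse("pAvitrya"))   # form with Vriddhi: stem no longer matches
```

The form with Vriddhi fails because the vowel change happens inside the stem, where a suffix paradigm has nothing to rewrite. Covering it would require either a separate stem entry per word or a mechanism that operates on the stem itself.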
Future work
  Aim at increasing coverage by adding more suffixes.
  Test the possibility of using 'Metadix' for handling cases of vowel lengthening.
Download and documentation for Lttoolbox: SourceForge