Semiautomatic domain model building from text-data
Petr Šaloun, Petr Klimánek, Zdenek Velart
SMAP 2011, Vigo, Spain, December 1-2, 2011
Introduction and goals
The basic tasks in creating a domain model:
‐ selection of the domain and scope
‐ consideration of reusability
‐ finding important terms
‐ defining classes and the class hierarchy
‐ defining properties of classes and constraints
‐ creation of instances of classes
Goals:
‐ designing a method for semiautomatic domain model creation
‐ different input documents
‐ different languages
‐ design and implementation of a tool
State of the art
‐ algorithms and tasks work with a domain model
‐ different document formats
‐ different languages
‐ domain model: concepts, relations
‐ domain model creation = time-consuming
  ‐ manual creation
  ‐ automatic creation
  ‐ semiautomatic creation
Tools and methods
‐ natural language processing – NLP
‐ Stanford NLP
  ‐ Stanford Parser
  ‐ Stanford POS tagger
  ‐ Stanford Named Entity Recognizer
‐ multi-language environment – Google Translate
‐ WordNet (synsets)
‐ Tool – Java, SWING, XML, jTidy, JAWS, SNLP, JUNG
Processing of text documents
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
An integer character constant has type int.
Processing of text documents - extraction, cleaning, translation
‐ input: TXT, HTML, PDF
‐ removal of occurrences of special characters using regular expressions
  ‐ numeric designation of chapters and references
  ‐ removal of single-letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
  ‐ parentheses, dashes, and others
‐ translation into English – the tools work only with English text
  ‐ Google Translate
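The single-letter removal step can be sketched with Python's `re` module. The pattern is the one shown on the slide, with the doubled backslashes (as needed in a Java string literal) reduced to single ones. The function name `strip_single_letters` and the choice to replace each match with one space are assumptions for illustration, not the tool's actual code.

```python
import re

# Pattern from the slide: one or more standalone single characters
# (anything except 'A', 'a', whitespace, or '.'), each preceded by
# whitespace, followed by trailing whitespace.
SINGLE_LETTER = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")


def strip_single_letters(text: str) -> str:
    """Replace each run of standalone single characters with one space.

    Replacing with a single space is an assumption; the slide only
    gives the pattern, not the replacement.
    """
    return SINGLE_LETTER.sub(" ", text)


cleaned = strip_single_letters("the value v of a constant b c is fixed")
print(cleaned)
```

Note that the article "a" is deliberately excluded by the character class, so it survives the cleaning step.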
Processing of text documents - annotation
‐ Stanford CoreNLP: Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
‐ machine learning over large data, maximum entropy statistical model
‐ learned models included
Activities:
‐ tokenization
‐ sentence splitting
‐ POS (part-of-speech) tagging
‐ lemmatization
‐ NER (Named Entity Recognition)
Example
An integer character constant has type int.
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
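To make the annotated format concrete, the following sketch reads a `token/TAG` line like the one in the example and separates tokens from their Penn Treebank style tags. It only parses the tagger's textual output; it does not run the Stanford tagger. The function name `parse_tagged` is hypothetical.

```python
def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split whitespace-separated 'token/TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash only, so tokens that themselves
        # contain '/' keep their full text.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
pairs = parse_tagged(tagged)

# Noun tags start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in pairs if tag.startswith("NN")]
print(nouns)
```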
Mining concepts
‐ tokens marked by the POS tagger as nouns are the first concept candidates
  ‐ one-word or multi-word nouns
‐ identifying a token as a concept by disambiguation from WordNet
‐ assigning a synset – automatic, manual
  ‐ using a domain term for searching
  ‐ possible selection of an incorrect synset – one with another meaning
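One way to obtain one-word and multi-word noun candidates from POS-tagged tokens is to collect maximal runs of consecutive noun tags. The slides do not say how multi-word nouns are delimited, so the grouping rule below is an assumption, and `noun_candidates` is a hypothetical name.

```python
def noun_candidates(pairs):
    """Return maximal runs of consecutive noun tokens as candidate terms.

    Assumption: a multi-word candidate is any unbroken sequence of
    tokens whose tag starts with "NN".
    """
    candidates, run = [], []
    for token, tag in pairs:
        if tag.startswith("NN"):
            run.append(token)
        elif run:
            candidates.append(" ".join(run))
            run = []
    if run:  # flush a run that reaches the end of the sentence
        candidates.append(" ".join(run))
    return candidates


pairs = [
    ("An", "DT"), ("integer", "NN"), ("character", "NN"), ("constant", "NN"),
    ("has", "VBZ"), ("type", "NN"), ("int", "NN"), (".", "."),
]
candidates = noun_candidates(pairs)
print(candidates)
```

On the example sentence this yields one long run before the verb and one after it, which shows why a later disambiguation step against WordNet is still needed.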
Mining relations
‐ unoriented / oriented
‐ unnamed / named
‐ WordNet – concept must have a synset
  ‐ hyperonyms and hyponyms – IsA relations
  ‐ holonyms and meronyms – partOf relations
  ‐ relation orientation based on concept order
  ‐ only direct relations
‐ from text
  ‐ lexical-syntactic patterns
  ‐ decomposition of multi-word terms – the right part of the term corresponds to an existing concept
    ‐ assignment expression: assignment expression IsA expression
  ‐ sentence syntax analysis – amod (adjectival modifier) from the parser, adjective followed by noun
    ‐ integral type: integral type IsA type
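The decomposition of multi-word terms can be sketched as a suffix lookup: if the right-hand part of a term is already a known concept, an IsA relation is proposed. Preferring the longest matching suffix is an assumption, and `isa_from_suffix` is a hypothetical name; the concept set is toy data.

```python
def isa_from_suffix(term, concepts):
    """Return the longest proper right-hand part of `term` that is a
    known concept, or None if no such part exists."""
    words = term.split()
    # start=1 drops the first word, so the longest suffix is tried first.
    for start in range(1, len(words)):
        suffix = " ".join(words[start:])
        if suffix in concepts:
            return suffix
    return None


concepts = {"expression", "type", "assignment expression"}
for term in ["assignment expression", "integral type", "pointer"]:
    parent = isa_from_suffix(term, concepts)
    if parent is not None:
        print(f"{term} IsA {parent}")
```

A single-word term such as "pointer" has no proper suffix and therefore produces no proposal.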
Tool
Experiment
‐ ANSI/ISO C language
‐ comparison with an existing manually created ontology
‐ 2 experiments
  ‐ all concept candidates
  ‐ only the first 200 candidates
‐ 3 variants of experiment
  ‐ only candidates
  ‐ candidates and IsA proposals
  ‐ candidates and IsA proposals and NER entities
First 30 candidates
type 645          argument 182      Behavior 149
Value 571         member 180        result 148
Character 529     String 180        Return 135
function 447      Stream 172        Macro 127
Pointer 329       Array 160         Declaration 119
Object 322        Sequence 160      Implementation 118
Expression 304    char 158          Conversion 111
Identifier 220    Operator 155      Integer 105
int 195           Number 155        File 102
operand 184       Description 155   Reference 100
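A ranked list like the one above can be produced by counting candidate occurrences and keeping the most frequent ones. This is a minimal sketch that assumes candidates are ordered by frequency. Case-folding before counting is a further assumption, and `top_candidates` is a hypothetical name. The input list is toy data, not the experiment's data.

```python
from collections import Counter


def top_candidates(candidates, n):
    """Count candidates case-insensitively and return the n most frequent
    as (term, count) pairs, most frequent first."""
    counts = Counter(c.lower() for c in candidates)
    return counts.most_common(n)


occurrences = ["type", "Type", "value", "type", "pointer", "value"]
ranked = top_candidates(occurrences, 2)
print(ranked)
```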
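Experiment
Variant | Added | Items in model | Found concepts | Found / Items | Found / total in ontology | Found / can be found
All |  | … | … | … % | 38 % | 73 %
 | IsA | … | … | … % | 43 % | 84 %
 | IsA + NER | … | … | … % | 45 % | 86 %
… |  | … | … | … % | 9 % | 18 %
 | IsA | … | … | … % | 15 % | 28 %
 | IsA + NER | … | … | … % | 31 % | 59 %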
Experiment
Variant of experiment without IsA relations, only with NER entities
Variant | Items | Found | Concepts / Items | Concepts / total | Concepts / can be found
All + NER | … | … | … % | 42.8 % | 82.4 %
NER | … | … | … % | 25.5 % | 49.2 %
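The three ratios reported in the experiment tables can be computed from sets of concept names. The sketch below assumes that "can be found" means reference-ontology concepts that actually occur in the source text; the slides do not define the term. The function name, the dictionary keys, and all data are hypothetical toy values, not the experiment's results.

```python
def evaluate(model_items, ontology, text_vocabulary):
    """Compare a generated model with a reference ontology.

    model_items     -- concepts proposed by the method
    ontology        -- concepts of the manually created ontology
    text_vocabulary -- terms occurring in the source text (assumed basis
                       for deciding what "can be found")
    """
    found = model_items & ontology
    findable = ontology & text_vocabulary
    return {
        "found_per_items": len(found) / len(model_items),
        "found_per_total": len(found) / len(ontology),
        "found_per_findable": len(found & findable) / len(findable),
    }


ontology = {"type", "pointer", "array", "macro"}
text_vocabulary = {"type", "pointer", "value", "result", "array"}
model_items = {"type", "value", "pointer"}

ratios = evaluate(model_items, ontology, text_vocabulary)
print(ratios)
```

Because the findable concepts are a subset of the whole ontology, the last ratio is never smaller than the second one, which matches the pattern in the reported percentages.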
Conclusions and further work concepts => lightweight ontology enables better automatic relations mining
Contacts
Petr Šaloun, FEECS, VSB–Technical University of Ostrava
Petr Klimánek (was: Faculty of Science, University of Ostrava)
Zdenek Velart, FEECS, VSB–Technical University of Ostrava