Published by Dominic Jenkins. Modified over 9 years ago.
1
Semiautomatic domain model building from text-data
Petr Šaloun, Petr Klimánek, Zdenek Velart
SMAP 2011, Vigo, Spain, December 1-2, 2011
2
Introduction and goals
The basic tasks in creating a domain model:
  selection of domain and scope
  consideration of reusability
  finding important terms
  defining classes and class hierarchy
  defining properties of classes and constraints
  creation of instances of classes
Goals:
  designing a method for semiautomatic domain model creation
  different input documents
  different languages
  design and implementation of a tool
3
State of the art
Algorithm and tasks
  work with domain model
  different document formats
  different languages
Domain model
  concepts, relations
Domain model creation = time consuming
  ‐ manual creation
  ‐ automatic creation
  ‐ semiautomatic creation
4
Tools and methods
natural language processing – NLP
Stanford NLP
  ‐ Stanford Parser
  ‐ Stanford POS tagger
  ‐ Stanford Named Entity Recognizer
multi-language environment – Google Translate
WordNet (synsets)
Tool – Java, SWING, XML, jTidy, JAWS, SNLP, JUNG
5
Processing of text documents
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
An integer character constant has type int.
6
Processing of text documents - extraction, cleaning, translation
input: TXT, HTML, PDF
removal of occurrences of special characters using regular expressions
  numeric designation of chapters and references
  removal of single-letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
  parentheses, dashes, and others
translation into English – the tools work only with English text
  Google Translate
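The cleaning step can be sketched in a few lines. This is only an illustration in Python, not the authors' Java tool. The single-letter pattern is the one shown on the slide, rewritten as a raw string. The chapter-number and reference patterns, the function name `clean_text`, and the choice to replace each match with one space are assumptions, since the slides do not show them.

```python
import re

# Pattern from the slide, written as a Python raw string (the slide shows
# it with Java-style doubled backslashes).
SINGLE_LETTER_WORDS = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")

# Hypothetical patterns, not from the slides: leading chapter numbers such
# as "6.4.4" and bracketed reference markers such as "[12]".
CHAPTER_NUMBER = re.compile(r"^\d+(?:\.\d+)*\.?\s+", re.MULTILINE)
REFERENCE_MARK = re.compile(r"\s*\[\d+\]")


def clean_text(text: str) -> str:
    """Strip chapter numbers, reference marks, and single-letter words."""
    text = CHAPTER_NUMBER.sub("", text)
    text = REFERENCE_MARK.sub("", text)
    # Assumption: each removed run of single-letter words becomes one space.
    text = SINGLE_LETTER_WORDS.sub(" ", text)
    return text.strip()


# Hypothetical Czech input: "v" is a single-letter preposition.
print(clean_text("1.2 hodnota v poli [3]"))
```

The character class excludes `A` and `a`, so a standalone "a" survives. As written, the pattern also drops any other single-character token surrounded by whitespace, for example a standalone `C`, which may matter for a text about the C language.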
7
Processing of text documents - annotation
Stanford CoreNLP
  Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
  machine learning over large data, maximum entropy statistical model
  learned models included
Activities
  tokenization
  sentence splitting
  POS tagging (part-of-speech tagging)
  lemmatization
  NER (Named Entity Recognition)
8
Example
An integer character constant has type int.
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
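The token/TAG notation in the example is easy to turn back into pairs for further processing. The sketch below does this in plain Python; the function name `parse_tagged` is hypothetical, and the tagger itself (Stanford POS tagger in the original tool) is not called here.

```python
def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split 'token/TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a token containing '/' stays whole.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = parse_tagged(
    "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
)
print(tagged)
```

In this output DT marks a determiner, NN a singular noun, and VBZ a third-person singular present verb, following the Penn Treebank tag set.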
9
Mining concepts
tokens marked by the POS tagger as nouns are the first concept candidates
  one-word or multi-word nouns
identifying a token as a concept by disambiguation from WordNet
  assigning a synset – automatic, manual
  using a domain term for searching
  possible selection of an incorrect synset – one with another meaning
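A minimal sketch of candidate selection from tagged tokens follows. The slides do not say how multi-word nouns are assembled, so the rule used here (every noun on its own, plus each maximal run of adjacent nouns) is an assumption, as is the function name `noun_candidates`. WordNet disambiguation is not covered.

```python
from collections import Counter


def noun_candidates(tagged):
    """Collect one-word nouns and maximal runs of adjacent nouns."""
    candidates = []
    run = []
    # The trailing sentinel flushes a noun run that ends the sentence.
    for token, tag in list(tagged) + [("", "")]:
        if tag.startswith("NN"):
            run.append(token.lower())
            continue
        if run:
            candidates.extend(run)                # one-word candidates
            if len(run) > 1:
                candidates.append(" ".join(run))  # multi-word candidate
            run = []
    return candidates


sentence = [
    ("An", "DT"), ("integer", "NN"), ("character", "NN"), ("constant", "NN"),
    ("has", "VBZ"), ("type", "NN"), ("int", "NN"), (".", "."),
]
# Counting candidates over a whole document gives a frequency ranking.
counts = Counter(noun_candidates(sentence))
print(counts)
```

Over a full document the counter yields a ranked candidate list, which is the form in which the experiment reports its first candidates.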
10
Mining relations
unoriented / oriented
unnamed / named
WordNet – concept must have a synset
  ‐ hyperonyms and hyponyms – IsA relations
  ‐ holonyms and meronyms – partOf relations
  ‐ relation orientation based on concept order
  only direct relations
from text
  lexical-syntactic patterns
  decomposition of multi-word terms – the right part of the term corresponds to an existing concept
    assignment expression
    assignment expression IsA expression
  sentence syntax analysis – parser amod relation (adjectival modifier), adjective followed by noun
    integral type IsA type
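The two text-based IsA heuristics can be sketched as follows. Both function names are hypothetical. The second function approximates the parser's amod relation with a simple adjective-then-noun check on POS tags (JJ followed by NN), which is an assumption made to keep the example free of external libraries; the original tool uses the Stanford Parser.

```python
def isa_from_suffix(term, concepts):
    """Propose 'term IsA head' when a right-hand part of term is a concept."""
    words = term.split()
    # Try the longest proper right-hand part first.
    for start in range(1, len(words)):
        head = " ".join(words[start:])
        if head in concepts:
            return (term, "IsA", head)
    return None


def isa_from_adjective_noun(tagged):
    """Propose 'adjective noun IsA noun' for each adjective-noun pair."""
    proposals = []
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            proposals.append((f"{word} {next_word}", "IsA", next_word))
    return proposals


concepts = {"expression", "type"}
print(isa_from_suffix("assignment expression", concepts))
print(isa_from_adjective_noun([("integral", "JJ"), ("type", "NN")]))
```

Both calls reproduce the examples on the slide: "assignment expression IsA expression" and "integral type IsA type". A single-word term never yields a proposal, because it has no proper right-hand part.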
11
Tool
12
Experiment
ANSI/ISO C language
comparison with an existing manually created ontology
2 experiments
  all concept candidates
  only the first 200 candidates
3 variants of each experiment
  ‐ only candidates
  ‐ candidates and IsA proposals
  ‐ candidates, IsA proposals, and NER entities
13
First 30 candidates
type        645    argument     182    Behavior        149
Value       571    member       180    result          148
Character   529    String       180    Return          135
function    447    Stream       172    Macro           127
Pointer     329    Array        160    Declaration     119
Object      322    Sequence     160    Implementation  118
Expression  304    char         158    Conversion      111
Identifier  220    Operator     155    Integer         105
int         195    Number       155    File            102
operand     184    Description  155    Reference       100
14
Experiment
Variant  Added      Items in model  Found concepts  Found / Items  Found / total in ontology  Found / can be found
All      -          3137            395             13 %           38 %                       73 %
All      IsA        4519            450             10 %           43 %                       84 %
All      IsA + NER  4558            465             10 %           45 %                       86 %
200      -          200             98              49 %           9 %                        18 %
200      IsA        1802            152             8 %            15 %                       28 %
200      IsA + NER  1962            318             16 %           31 %                       59 %
15
Experiment
Variant of the experiment without IsA relations, only with NER entities
Variant    Items  Found  Concepts / Items  Concepts / total  Concepts / can be found
All + NER  3204   444    13.9 %            42.8 %            82.4 %
200 + NER  360    265    73.6 %            25.5 %            49.2 %
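The Found / Items column can be rechecked from the item and found counts. The sketch below does so for several rows as read from the result slides; the row labels and the helper name `found_per_items` are chosen here for illustration. The ontology size and the number of concepts that can be found are not stated in the slides, so the other two ratio columns are not recomputed.

```python
# (variant, items in model, found concepts, reported Found / Items in %)
ROWS = [
    ("All", 3137, 395, 13.0),
    ("All, IsA", 4519, 450, 10.0),
    ("All, IsA + NER", 4558, 465, 10.0),
    ("200, IsA", 1802, 152, 8.0),
    ("200, IsA + NER", 1962, 318, 16.0),
    ("All + NER", 3204, 444, 13.9),
    ("200 + NER", 360, 265, 73.6),
]


def found_per_items(items: int, found: int) -> float:
    """Share of model items that match an ontology concept, in percent."""
    return 100.0 * found / items


for variant, items, found, reported in ROWS:
    computed = found_per_items(items, found)
    print(f"{variant:15s} computed {computed:5.1f} %  reported {reported:5.1f} %")
```

Each computed value agrees with the reported one after rounding, which also confirms how the fused digits in the table split into separate columns.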
16
Conclusions and further work concepts => lightweight ontology enables better automatic relations mining
17
Contacts
Petr Šaloun, FEECS, VSB–Technical University of Ostrava, petr.saloun@vsb.cz
Petr Klimánek (was: Faculty of Science, University of Ostrava), p.klimanek@gmail.com
Zdenek Velart, FEECS, VSB–Technical University of Ostrava, zdenek.velart@gmail.com