Semiautomatic domain model building from text-data
Petr Šaloun, Petr Klimánek, Zdenek Velart
SMAP 2011, Vigo, Spain, December 1-2, 2011

Introduction and goals
- The basic tasks in creating a domain model:
  - selection of domain and scope
  - consideration of reusability
  - finding important terms
  - defining classes and class hierarchy
  - defining properties of classes and constraints
  - creation of instances of classes
- Goals
  - designing a method for semiautomatic domain model creation
  - different input documents
  - different languages
  - design and implementation of a tool

State of the art
- algorithms and tasks that work with a domain model
- different document formats
- different languages
- domain model: concepts, relations
- domain model creation is time-consuming
  - manual creation
  - automatic creation
  - semiautomatic creation

Tools and methods
- natural language processing (NLP)
- Stanford NLP
  - Stanford Parser
  - Stanford POS tagger
  - Stanford Named Entity Recognizer
- multi-language environment: Google Translate
- WordNet (synsets)
- Tool: Java, SWING, XML, jTidy, JAWS, SNLP, JUNG

Processing of text documents
An integer character constant has type int.
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

Processing of text documents - extraction, cleaning, translation
- input: TXT, HTML, PDF
- removal of special characters using regular expressions
  - numeric designation of chapters and references
  - removal of single-letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
  - parentheses, dashes, and other
- translation into English: the tools work only with English text
  - Google Translate
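The cleaning step can be illustrated with a minimal sketch. It uses Python for brevity (the tool itself is written in Java), and it takes the single-letter pattern from the slide with the Java string escaping (`\\s`) reduced to plain regex escapes. The chapter-number pattern and the function name `clean` are assumptions made for illustration; they are not taken from the slides.

```python
import re

# Pattern from the slide: one or more single characters (anything except
# 'A', 'a', whitespace or '.') that stand alone between whitespace.
SINGLE_LETTER = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")

# Hypothetical pattern (an assumption, not from the slides) for a leading
# chapter number such as "6.4.4.1 " at the start of a line.
CHAPTER_NUMBER = re.compile(r"^\d+(\.\d+)*\s+", re.MULTILINE)


def clean(text: str) -> str:
    """Strip leading chapter numbers, then isolated single characters."""
    text = CHAPTER_NUMBER.sub("", text)
    # The whole match, including the surrounding whitespace, becomes one space.
    return SINGLE_LETTER.sub(" ", text)


print(clean("6.4.4.1 Integer constants"))
print(clean("kniha v knihovně a pes"))
```

Note that the character class matches any isolated non-space character, so a lone digit or dash between spaces is removed as well, while "a" and "A" are kept.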

Processing of text documents - annotation
- Stanford CoreNLP
  - Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
  - machine learning over large data, maximum entropy statistical model
  - learned models included
- Activities
  - tokenization
  - sentence splitting
  - POS (part-of-speech) tagging
  - lemmatization
  - NER (named entity recognition)

Example
An integer character constant has type int.
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
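The word/TAG notation of the example is easy to read back into data. The following is a small sketch, assuming every token has the form `word/TAG` and tokens are separated by whitespace; the helper name `parse_tagged` is hypothetical.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last '/', so a '/' inside the word stays with the word.
        # A token without any '/' raises ValueError (assumed not to occur).
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
)
# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```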

Mining concepts
- tokens marked by the POS tagger as nouns are the first concept candidates
- one-word or multi-word nouns
- identifying a token as a concept by disambiguation against WordNet
- assigning a synset: automatic or manual
- using a domain term for searching
- possible selection of an incorrect synset (one with a different meaning)
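Candidate collection can be sketched as follows. The slides do not say how multi-word nouns are formed, so treating every maximal run of adjacent noun tokens as one multi-word candidate is an assumption, as is the function name `noun_candidates`. WordNet disambiguation is not part of this sketch.

```python
from collections import Counter


def noun_candidates(tagged):
    """Count single nouns and maximal runs of adjacent nouns.

    `tagged` is a list of (word, tag) pairs with Penn Treebank tags.
    """
    counts = Counter()
    run = []
    # The empty sentinel pair flushes a run that ends the sentence.
    for word, tag in list(tagged) + [("", "")]:
        if tag.startswith("NN"):
            run.append(word.lower())
            continue
        if run:
            counts.update(run)                 # every noun on its own
            if len(run) > 1:
                counts[" ".join(run)] += 1     # the run as a multi-word term
        run = []
    return counts


sentence = [
    ("An", "DT"), ("integer", "NN"), ("character", "NN"), ("constant", "NN"),
    ("has", "VBZ"), ("type", "NN"), ("int", "NN"), (".", "."),
]
candidates = noun_candidates(sentence)
print(candidates.most_common())
```

With this heuristic the run "type int" also becomes a candidate, which shows why a later filtering step (synset assignment, manual review) is needed.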

Mining relations
- unoriented / oriented
- unnamed / named
- WordNet: the concept must have a synset
  - hypernyms and hyponyms: IsA relations
  - holonyms and meronyms: partOf relations
  - relation orientation based on concept order
- only direct relations
- from text
  - lexical-syntactic patterns
  - decomposition of multi-word terms: the right part of the term corresponds to an existing concept
    (assignment expression: assignment expression IsA expression)
  - sentence syntax analysis: amod relation from the parser (adjectival modifier), adjective followed by noun
    (integral type: integral type IsA type)
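The decomposition of multi-word terms can be sketched in a few lines. This is an illustration under assumptions: concepts are plain lowercase strings, the longest matching right part wins, and the function name `isa_from_multiword` is hypothetical. The amod-based variant needs a dependency parse and is not covered here.

```python
def isa_from_multiword(concepts):
    """Propose (term, 'IsA', head) when the right part of a term is a concept."""
    known = set(concepts)
    proposals = []
    for term in sorted(known):
        words = term.split()
        # Try right parts from longest to shortest: "a b c" -> "b c", then "c".
        for i in range(1, len(words)):
            head = " ".join(words[i:])
            if head in known:
                proposals.append((term, "IsA", head))
                break  # keep only the closest (direct) relation
    return proposals


relations = isa_from_multiword({"expression", "assignment expression", "type"})
print(relations)
```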

Experiment
- ANSI/ISO C language
- comparison with an existing manually created ontology
- 2 experiments
  - all concept candidates
  - only the first 200 candidates
- 3 variants of each experiment
  - only candidates
  - candidates and IsA proposals
  - candidates, IsA proposals and NER entities

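First 30 candidates
Candidate  | Count | Candidate   | Count | Candidate      | Count
type       | 645   | argument    | 182   | Behavior       | 149
Value      | 571   | member      | 180   | result         | 148
Character  | 529   | String      | 180   | Return         | 135
function   | 447   | Stream      | 172   | Macro          | 127
Pointer    | 329   | Array       | 160   | Declaration    | 119
Object     | 322   | Sequence    | 160   | Implementation | 118
Expression | 304   | char        | 158   | Conversion     | 111
Identifier | 220   | Operator    | 155   | Integer        | 105
int        | 195   | Number      | 155   | File           | 102
operand    | 184   | Description | 155   | Reference      | 100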

Experiment
Variant | Added     | Items in model | Found concepts | Found / Items | Found / total in ontology | Found / can be found
All     | -         | 3137           | 395            | 13 %          | 38 %                      | 73 %
All     | IsA       | 4519           | 450            | 10 %          | 43 %                      | 84 %
All     | IsA + NER | 4558           | 465            | 10 %          | 45 %                      | 86 %
200     | -         |                | 98             | 49 %          | 9 %                       | 18 %
200     | IsA       | 1802           | 152            | 8 %           | 15 %                      | 28 %
200     | IsA + NER | 1962           | 318            | 16 %          | 31 %                      | 59 %
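The Found / Items column is the share of model items that were found in the reference ontology. As a quick check, assuming the percentages on the slide were rounded to whole numbers, the column can be recomputed from the two count columns. The row labels below are shorthand, and the row that has no Items value is left out.

```python
# (items in model, found concepts, reported Found / Items in percent)
rows = {
    "All":             (3137, 395, 13),
    "All, IsA":        (4519, 450, 10),
    "All, IsA + NER":  (4558, 465, 10),
    "200, IsA":        (1802, 152, 8),
    "200, IsA + NER":  (1962, 318, 16),
}


def found_per_items(items, found):
    """Share of model items found in the ontology, in whole percent."""
    return round(100 * found / items)


for name, (items, found, reported) in rows.items():
    computed = found_per_items(items, found)
    print(f"{name}: computed {computed} %, reported {reported} %")
```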

Experiment
- Variant of the experiment without IsA relations, only with NER entities
Variant   | Items | Found | Concepts / Items | Concepts / total | Concepts / can be found
All + NER | 3204  | 444   | 13.9 %           | 42.8 %           | 82.4 %
200 + NER | 360   | 265   | 73.6 %           | 25.5 %           | 49.2 %

Conclusions and further work
- concepts => lightweight ontology
- enables better automatic relation mining

Contacts
Petr Šaloun, FEECS, VSB–Technical University of Ostrava, petr.saloun@vsb.cz
Petr Klimánek (was: Faculty of Science, University of Ostrava), p.klimanek@gmail.com
Zdenek Velart, FEECS, VSB–Technical University of Ostrava, zdenek.velart@gmail.com
