Semiautomatic domain model building from text-data. Petr Šaloun, Petr Klimánek, Zdenek Velart. SMAP 2011, Vigo, Spain

Semiautomatic domain model building from text-data
Petr Šaloun, Petr Klimánek, Zdenek Velart
SMAP 2011, Vigo, Spain, December 1-2, 2011

Introduction and goals
• The basic tasks in creating a domain model:
  - selection of domain and scope
  - consideration of reusability
  - finding important terms
  - defining classes and class hierarchy
  - defining properties of classes and constraints
  - creation of instances of classes
• Goals
  - designing a method for semiautomatic domain model creation
  - different input documents
  - different languages
  - design and implementation of a tool

State of the art
• Algorithms and tasks work with a domain model
• different document formats
• different languages
• domain model
• concepts, relations
• domain model creation = time consuming
  - manual creation
  - automatic creation
  - semiautomatic creation

Tools and methods
• natural language processing – NLP
• Stanford NLP
  - Stanford Parser
  - Stanford POS tagger
  - Stanford Named Entity Recognizer
• multi-language environment – Google Translate
• WordNet (synsets)
• Tool – Java, Swing, XML, JTidy, JAWS, SNLP, JUNG

Processing of text documents
Input: An integer character constant has type int.
Annotated: An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

Processing of text documents - extraction, cleaning, translation
• input: TXT, HTML, PDF
• removal of occurrences of special characters using regular expressions
  - numeric designation of chapters and references
  - removal of single-letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
  - parentheses, dashes, and others
• translation into English – the tools work only with English text
  - Google Translate
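The cleaning step can be sketched in a few lines. The slide shows only the single-letter pattern, escaped as a Java string literal; below it is written as a Python raw string. The `SECTION_NUMBER` pattern and the `clean_text` helper are illustrative assumptions, not the authors' code.

```python
import re

# Hypothetical pattern for a leading section number such as "6.4.4 " (an assumption,
# the slide does not show the expression used for chapter numbering).
SECTION_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+", re.MULTILINE)

# Pattern from the slide: runs of one-character tokens, except "A" and "a",
# that are surrounded by whitespace.
SINGLE_LETTER = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")


def clean_text(text: str) -> str:
    """Strip section numbers and one-character tokens, then normalise whitespace."""
    text = SECTION_NUMBER.sub("", text)
    text = SINGLE_LETTER.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("6.4.4 Constants\nthe value v of x is a number"))
```

Note that the slide's pattern removes every one-character token other than "A" and "a", so one-letter identifiers and single digits are dropped along with prepositions.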

Processing of text documents - annotation
• Stanford CoreNLP
  - Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
  - machine learning over large data, statistical maximum entropy model
  - learned models included
• Activities
  - tokenization
  - sentence splitting
  - POS tagging (part-of-speech)
  - lemmatization
  - NER (Named Entity Recognition)

Example
An integer character constant has type int.
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
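The annotated line uses the common `token/TAG` notation with Penn Treebank tags. As a minimal sketch, the following reads such a line back into (token, tag) pairs; the `parse_tagged` helper is a hypothetical name and is not part of Stanford CoreNLP.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a token containing "/" stays intact.
        token, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"missing tag in {item!r}")
        pairs.append((token, tag))
    return pairs


if __name__ == "__main__":
    tagged = parse_tagged(
        "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
    )
    nouns = [token for token, tag in tagged if tag.startswith("NN")]
    print(nouns)
```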

Mining concepts
• tokens marked by the POS tagger as nouns are the first concept candidates
• single-word or multi-word nouns
• identifying a token as a concept by disambiguation from WordNet
• assigning a synset – automatic, manual
• using a domain term for searching
• possible selection of an incorrect synset – with another meaning
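Candidate selection from POS tags can be illustrated as follows. This sketch counts every noun as a single-word candidate and every maximal run of adjacent nouns as a multi-word candidate; the run-based grouping, the lowercasing, and the name `concept_candidates` are assumptions, and the WordNet disambiguation step is left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags


def concept_candidates(tagged: Iterable[Tuple[str, str]]) -> Counter:
    """Count single nouns and maximal runs of adjacent nouns as candidates."""
    counts: Counter = Counter()
    run: List[str] = []

    def flush() -> None:
        # A run of two or more nouns is also counted as a multi-word candidate.
        if len(run) > 1:
            counts[" ".join(run)] += 1
        run.clear()

    for token, tag in tagged:
        if tag in NOUN_TAGS:
            word = token.lower()
            counts[word] += 1
            run.append(word)
        else:
            flush()
    flush()
    return counts


if __name__ == "__main__":
    sentence = [("An", "DT"), ("integer", "NN"), ("character", "NN"),
                ("constant", "NN"), ("has", "VBZ"), ("type", "NN"),
                ("int", "NN"), (".", ".")]
    for candidate, count in concept_candidates(sentence).most_common():
        print(candidate, count)
```

Sorting the resulting counter by frequency gives a ranked candidate list of the kind shown on the "First 30 candidates" slide.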

Mining relations
• unoriented / oriented
• unnamed / named
• WordNet – concept must have a synset
  - hyperonyms and hyponyms – IsA relations
  - holonyms and meronyms – partOf relations
  - relation orientation based on concept order
• only direct relations
• from text
• lexical-syntactic patterns
• decomposition of multi-word terms – the right part of the term corresponds to an existing concept
  - assignment expression: assignment expression IsA expression
• sentence syntax analysis – amod dependency from the parser (adjectival modifier), adjective followed by noun
  - integral type: integral type IsA type
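The decomposition rule for multi-word terms can be sketched as a suffix lookup. The function below proposes `term IsA parent` when the right part of a term is itself a known concept. Choosing the longest matching right part, so that only the direct relation is emitted, is an assumption, as is the name `isa_from_suffix`.

```python
from typing import List, Set, Tuple


def isa_from_suffix(concepts: Set[str]) -> List[Tuple[str, str, str]]:
    """Propose IsA relations from multi-word terms whose right part is a concept."""
    relations = []
    for term in sorted(concepts):
        words = term.split()
        # Try proper right parts from longest to shortest and stop at the first hit.
        for start in range(1, len(words)):
            parent = " ".join(words[start:])
            if parent in concepts:
                relations.append((term, "IsA", parent))
                break
    return relations


if __name__ == "__main__":
    known = {"expression", "assignment expression", "type", "integral type"}
    for relation in isa_from_suffix(known):
        print(*relation)
```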

Tool

Experiment
• ANSI/ISO C language
• comparison with an existing manually created ontology
• 2 experiments
  - all concept candidates
  - only first 200 candidates
• 3 variants of experiment
  - only candidates
  - candidates and IsA proposals
  - candidates and IsA proposals and NER entities

First 30 candidates

Candidate    Count    Candidate     Count    Candidate         Count
type         645      argument      182      Behavior          149
Value        571      member        180      result            148
Character    529      String        180      Return            135
function     447      Stream        172      Macro             127
Pointer      329      Array         160      Declaration       119
Object       322      Sequence      160      Implementation    118
Expression   304      char          158      Conversion        111
Identifier   220      Operator      155      Integer           105
int          195      Number        155      File              102
operand      184      Description   155      Reference         100

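Experiment

Variant | Added | Items in model | Found concepts | Found / Items | Found / total in ontology | Found / can be found
All | | | | | 38 % | 73 %
All | IsA | | | | 43 % | 84 %
All | IsA + NER | | | | 45 % | 86 %
First 200 | | | | | 9 % | 18 %
First 200 | IsA | | | | 15 % | 28 %
First 200 | IsA + NER | | | | 31 % | 59 %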

Experiment
• Variant of experiment without IsA relations, only with NER entities

Variant | Items | Found | Concepts / Items | Concepts / total | Concepts / can be found
All + NER | | | | 42.8 % | 82.4 %
First 200 + NER | | | | 25.5 % | 49.2 %
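The three ratio columns can be computed with plain set arithmetic. This is a sketch under stated assumptions: `items` is the set of concepts placed in the model, `ontology` is the manually created reference ontology, and `findable` is the subset of the ontology counted as "can be found". The slides do not define that last set, so it is passed in as given. The function name and the sample data are hypothetical.

```python
from typing import Dict, Set


def coverage(items: Set[str], ontology: Set[str], findable: Set[str]) -> Dict[str, float]:
    """Return the three coverage ratios used in the experiment tables."""
    found = items & ontology  # model concepts that also exist in the reference ontology

    def ratio(part: int, whole: int) -> float:
        return part / whole if whole else 0.0

    return {
        "found_per_items": ratio(len(found), len(items)),
        "found_per_ontology": ratio(len(found), len(ontology)),
        "found_per_findable": ratio(len(found), len(findable)),
    }


if __name__ == "__main__":
    # Toy data for illustration only, not the values from the experiment.
    model = {"type", "value", "foo", "bar"}
    reference = {"type", "value", "pointer", "macro", "stream"}
    can_be_found = {"type", "value", "pointer"}
    for name, value in coverage(model, reference, can_be_found).items():
        print(f"{name}: {value:.1%}")
```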

Conclusions and further work
• concepts => lightweight ontology
• enables better automatic relations mining

Contacts
• Petr Šaloun, FEECS, VSB–Technical University of Ostrava
• Petr Klimánek (was: Faculty of Science, University of Ostrava)
• Zdenek Velart, FEECS, VSB–Technical University of Ostrava