How do we Collect Data for the Ontology? AmphibiaTree 2006 Workshop Saturday 11:30–11:45 J. Leopold.

Outline
Different approaches to ontology design
Text mining
–Manual curation
–Document parsing
–Potential problems
–Methodology
–Evaluation

Approaches to Ontology Design: Inspiration
Start from a premise about why an ontology is needed
Then design an ontology (from personal expertise about the domain) that aims to meet the recognized need
Warning: may be impractical, may lack theoretical underpinning

Approaches to Ontology Design: Induction
Ontology developed by observing, examining, and analyzing specific case(s) in the domain
Then the resulting ontological characterization for the specific case is applied to other cases in the same domain
Warning: may fit a specific case, but not be generalizable

Approaches to Ontology Design: Deduction
Adopt general principles and adaptively apply them to specific cases
Filter/distill general notions so they are customized to a particular domain subset
Warning: presupposes existence + selection of appropriate general characteristics from which an ontology for specific cases can be devised

Approaches to Ontology Design: Synthesis
Identify a base set of ontologies, no one of which subsumes any other
Synthesize parts, creating a unified ontology
Warning: relies heavily on developers’ synthesis skills

Approaches to Ontology Design: Collaboration
Development is a joint effort reflecting experiences + viewpoints of persons who intentionally cooperate to produce it
Can start from a proposed ontology with iterative improvements
Advantages: diverse vantage points, builds commitment by iteratively reducing participants’ objections

Collaborative Approach
Preparation: define design criteria; determine boundary conditions; determine evaluation standards
Anchoring: specify initial seed ontology
Iterative Improvement: identify diverse participants; elicit critiques & comments; revise to address feedback; iterate until consensus
Application: demonstrate uses of the ontology

Text Mining
What about instantiation???
Experts can design the ontology (classes, hierarchy, etc.)
But need to systematically go through the literature to identify instances and their properties
Particularly important to accommodate diversity

Text Mining
Goals:
–Discover new instances and properties
–Increase strength of existing annotations by locating additional paper evidence

FlyBase Curation
Watch list of ~35 journals
Each curator inspects the latest issue of a journal to identify papers to curate
So curation takes place on a paper-by-paper basis (as opposed to topic-by-topic)

FlyBase Curation
Curator fills out a record for each paper
Some fields require rephrasing, paraphrasing, summarization
Other fields record very specific facts using terms from ontologies

FlyBase Curation
Software like PaperBrowser presents an enhanced display of the text with recognized terms highlighted (e.g., by Named Entity Recognition)
Parser identifies the boundaries of the noun phrase (NP) around each term name and its grammatical relations to other NPs in the text
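
The highlighting step can be pictured as a dictionary lookup over the text. The sketch below is only an illustration under that assumption; it is not how PaperBrowser works, and `highlight_terms`, the `[[...]]` markers, the term list, and the sentence are all hypothetical.

```python
import re

def highlight_terms(text, terms):
    """Wrap each known ontology term found in text with [[...]] markers."""
    if not terms:
        return text
    # Try longer terms first so a multi-word term wins over any shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: "[[" + m.group(0) + "]]", text)

# Hypothetical term list and sentence, for illustration only.
terms = ["frontoparietal", "sphenethmoid", "planum antorbitale"]
sentence = "The frontoparietal overlaps the sphenethmoid anteriorly."
print(highlight_terms(sentence, terms))
```

A real recognizer would also need to handle synonyms, abbreviations, and inflected forms, which plain string matching does not.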

Document Parsing
PDF is the only standard electronic format in which all relevant papers are available
PDF-to-text processors are not aware of the typesetting of each journal and have trouble with some formatting (e.g., 2-column text, footnotes, headers, figure captions, etc.)
Document parsing is best done with optical character recognition (OCR)
For images, can parse their captions

Potential Problems for Text Mining
Lexical ambiguity (i.e., words that denote more than one concept)
Polysemy (e.g., a term present in two papers denotes different concepts)
Abbreviation (e.g., same concept, but different abbreviations in different papers)

Potential Problems for Text Mining
Digit removal (e.g., 4-hydroxybutan… vs. 2-hydroxybutan…)
Stemming (e.g., removing prefixes, suffixes, etc.)
Stop word removal (e.g., “the”, “a”)
Need a domain-specific text miner!!!
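
To see why generic preprocessing is a problem, the sketch below compares a naive normalizer with a domain-aware one. Both functions, the stop word list, the crude suffix rule, and the two compound names are assumptions chosen for illustration; none of them come from the slides.

```python
import re

STOP_WORDS = {"the", "a", "of", "in", "and"}

def naive_normalize(text):
    """Generic pipeline: lowercase, drop digits, drop stop words, strip common suffixes."""
    text = re.sub(r"\d+", "", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

def domain_normalize(text):
    """Domain-aware variant: keeps digits and hyphens inside tokens and skips stemming."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Illustrative compound names (not taken from the slide).
a = "4-hydroxybutanoate"
b = "2-hydroxybutanoate"

# The naive pipeline makes the two compounds indistinguishable.
print(naive_normalize(a) == naive_normalize(b))
# The domain-aware pipeline keeps them apart.
print(domain_normalize(a) == domain_normalize(b))
```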

Methodology
Extract textual elements from papers identifying a term in the ontology
Construct patterns with reliability scores (confidence that pattern represents term)
Extend pattern set with longer pattern sets
Apply semantic pattern matching techniques (i.e., consider synonyms)
Annotate terms based on quality of matched pattern to concept occurring in the text

Training Phase
Objective: construct a set of patterns that characterize indicators for annotation
(1) Find terms in the “training set” papers
(2) Extract significant terms/phrases that appear in the papers
(3) Construct patterns based on significant terms/phrases and terms surrounding significant terms

Annotation Phase
(1) Look for possible matches to the patterns in the papers
(2) Compute a matching score which indicates the strength of the prediction
(3) Determine the term to be associated with the pattern match
(4) Order new annotation predictions by their scores, and present to user
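
The four annotation steps can be sketched as follows. The slides do not define the matching score, so the formula here (the pattern's reliability score scaled by the share of context words found) is an assumption made only to keep the example runnable. The `Pattern` class, `find_middle`, `match_score`, `annotate`, and the sample data are likewise hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    left: frozenset   # auxiliary words expected before the significant terms
    middle: tuple     # ordered significant terms
    right: frozenset  # auxiliary words expected after the significant terms
    term: str         # ontology term the pattern points to
    score: float      # reliability score assigned during training

def find_middle(tokens, middle):
    """Return the start index of the first contiguous occurrence of middle, or -1."""
    n = len(middle)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == middle:
            return i
    return -1

def match_score(pattern, tokens):
    """Assumed scoring: pattern score times the fraction of context words present."""
    start = find_middle(tokens, pattern.middle)
    if start < 0:
        return 0.0
    before = set(tokens[:start])
    after = set(tokens[start + len(pattern.middle):])
    expected = len(pattern.left) + len(pattern.right)
    found = len(pattern.left & before) + len(pattern.right & after)
    overlap = found / expected if expected else 1.0
    return pattern.score * overlap

def annotate(patterns, sentences):
    """Match patterns, score each match, attach the term, and rank the predictions."""
    predictions = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for p in patterns:
            s = match_score(p, tokens)
            if s > 0:
                predictions.append((s, p.term, sentence))
    return sorted(predictions, key=lambda item: item[0], reverse=True)

# Hypothetical patterns and sentences, for illustration only.
patterns = [
    Pattern(frozenset({"anterior"}), ("pterygoid",),
            frozenset({"articulates"}), "pterygoid", 0.9),
    Pattern(frozenset({"paired"}), ("frontoparietal",),
            frozenset({"fused", "medially"}), "frontoparietal", 0.8),
]
sentences = [
    "the anterior pterygoid articulates with the maxilla",
    "the paired frontoparietal bones are fused",
]
results = annotate(patterns, sentences)
for score, term, sentence in results:
    print(round(score, 2), term)
```

The ranked list is what would be presented to the user in step (4); a curator would still confirm or reject each prediction.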

Pattern Construction
A pattern has three parts: {LEFT}, an ordered sequence of significant terms (i.e., identifying elements), and {RIGHT}
{LEFT} and {RIGHT} are sets of words that appear around the significant terms (i.e., auxiliary descriptors)
The number of words in {LEFT} and {RIGHT} can be limited
Stop words are not included in patterns

Pattern Construction
Example: pattern template { LEFT } { RIGHT }
pattern1: { increase catalytic rate } { transcription suppressing transient }
pattern2: { proteins regulation transcription } { initiated search proteins }

Pattern Construction
Example: pattern template { LEFT } { RIGHT }
pattern1: { frontoparietal } { sphenethmoid }
pattern2: { anterior ramus pterygoid } { planum antorbitale }
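
A minimal sketch of how such a pattern might be built from one tokenized sentence, assuming the significant terms have already been located. The function `build_pattern`, its `window` parameter, the stop word list, and the example sentence are hypothetical and only illustrate the rules stated above: limited context size, no stop words.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "with"}

def build_pattern(tokens, start, length, window=3):
    """Build (LEFT, significant terms, RIGHT) around tokens[start:start + length].

    LEFT and RIGHT hold up to `window` non-stop words from each side.
    """
    middle = tuple(tokens[start:start + length])
    left_words = [t for t in tokens[:start] if t not in STOP_WORDS]
    right_words = [t for t in tokens[start + length:] if t not in STOP_WORDS]
    # Keep only the words closest to the significant terms.
    left = set(left_words[max(0, len(left_words) - window):])
    right = set(right_words[:window])
    return left, middle, right

# Hypothetical sentence; "pterygoid" (index 5) is taken as the significant term.
tokens = ("the anterior ramus of the pterygoid articulates "
          "with the planum antorbitale laterally").split()
left, middle, right = build_pattern(tokens, start=5, length=1)
print(left, middle, right)
```

Because {LEFT} and {RIGHT} are sets, word order inside them is ignored, while the significant terms keep their order.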

Pattern Scoring
Calculate score representing how confidently a pattern represents a term
MT = source of the pattern's significant terms
Patterns whose significant terms exactly match an ontology term get a higher score

Pattern Scoring
Calculate score representing how confidently a pattern represents a term
TT = type of the individual terms among the pattern's significant terms
Considers the occurrence frequency of each such word among all ontology terms, and the position of the word in an ontology term (gets more specific from right to left)

Pattern Scoring
Calculate score representing how confidently a pattern represents a term
PP = term-wise paper frequency of the pattern's significant terms
Patterns whose significant terms are highly frequent in the paper dataset get higher scores

Evaluation
Recall = (correct responses by software) / (all human responses)
Precision = (correct responses by software) / (all responses by software)
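
The two measures can be computed directly from sets of annotations. The sketch below assumes each response is a (paper, term) pair and that a software response is "correct" when a human curator made the same annotation; `precision_recall` and the sample data are hypothetical.

```python
def precision_recall(software, human):
    """Compare software annotations with human (curator) annotations.

    Both arguments are sets of (paper_id, term) pairs.
    """
    correct = len(software & human)
    precision = correct / len(software) if software else 0.0
    recall = correct / len(human) if human else 0.0
    return precision, recall

# Hypothetical annotations, for illustration only.
human = {("p1", "frontoparietal"), ("p1", "sphenethmoid"), ("p2", "pterygoid")}
software = {("p1", "frontoparietal"), ("p2", "pterygoid"),
            ("p2", "maxilla"), ("p3", "sphenethmoid")}

precision, recall = precision_recall(software, human)
print(precision, recall)
```

Here two of the four software responses agree with the curators, and two of the three curator annotations were found.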

Discussion