Published by Vincent Perkins. Modified over 9 years ago.
1
How do we Collect Data for the Ontology? AmphibiaTree 2006 Workshop, Saturday 11:30–11:45, J. Leopold
2
Outline: Different approaches to ontology design. Text mining: manual curation; document parsing; potential problems; methodology; evaluation.
3
Approaches to Ontology Design: Inspiration. Start from a premise about why an ontology is needed. Then design an ontology (from personal expertise about the domain) that aims to meet the recognized need. Warning: may be impractical, may lack theoretical underpinning.
4
Approaches to Ontology Design: Induction. Ontology developed by observing, examining, and analyzing specific case(s) in the domain. The resulting ontological characterization for the specific case is then applied to other cases in the same domain. Warning: may fit a specific case, but not be generalizable.
5
Approaches to Ontology Design: Deduction. Adopt general principles and adaptively apply them to specific cases. Filter/distill general notions so they are customized to a particular domain subset. Warning: presupposes the existence and selection of appropriate general characteristics from which an ontology for specific cases can be devised.
6
Approaches to Ontology Design: Synthesis. Identify a base set of ontologies, no one of which subsumes any other. Synthesize parts, creating a unified ontology. Warning: relies heavily on developers’ synthesis skills.
7
Approaches to Ontology Design: Collaboration. Development is a joint effort reflecting the experiences and viewpoints of persons who intentionally cooperate to produce it. Can start from a proposed ontology with iterative improvements. Advantages: diverse vantage points; builds commitment by iteratively reducing participants’ objections.
8
Collaborative Approach. Preparation: define design criteria; determine boundary conditions; determine evaluation standards. Anchoring: specify initial seed ontology. Iterative Improvement: identify diverse participants; elicit critiques & comments; revise to address feedback; iterate until consensus. Application: demonstrate uses of the ontology.
9
Text Mining: What about instantiation? Experts can design the ontology (classes, hierarchy, etc.), but we need to go through the literature systematically to identify instances and their properties. Particularly important to accommodate diversity.
10
Text Mining Goals: Discover new instances and properties. Increase the strength of existing annotations by locating additional paper evidence.
11
FlyBase Curation: Watch list of ~35 journals. Each curator inspects the latest issue of a journal to identify papers to curate. So curation takes place on a paper-by-paper basis (as opposed to topic-by-topic).
12
FlyBase Curation: Curator fills out a record for each paper. Some fields require rephrasing, paraphrasing, or summarization. Other fields record very specific facts using terms from ontologies.
13
FlyBase Curation: Software like PaperBrowser presents an enhanced display of the text with recognized terms highlighted (e.g., via Named Entity Recognition). A parser identifies the boundaries of the noun phrase (NP) around each term name and its grammatical relations to other NPs in the text.
14
Document Parsing: PDF is the only standard electronic format in which all relevant papers are available. PDF-to-text processors are not aware of each journal's typesetting and have trouble with some formatting (e.g., 2-column text, footnotes, headers, figure captions). Document parsing is best done with optical character recognition (OCR). For images, can parse their captions.
15
Potential Problems for Text Mining: Lexical ambiguity (e.g., words that denote > 1 concept). Polysemy (e.g., a term present in 2 papers denotes different concepts). Abbreviation (e.g., same concept, but different abbreviations in different papers).
16
Potential Problems for Text Mining: Digit removal (e.g., 4-hydroxybutan… vs. 2-hydroxybutan…). Stemming (e.g., removing prefixes and suffixes). Stop word removal (e.g., “the”, “a”). Need a domain-specific text miner!
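To make the digit-removal problem concrete, here is a minimal sketch, assuming a generic normalizer that strips digits and stop words. The function names `naive_normalize` and `domain_normalize`, the stop-word list, and the example names "1-propanol" and "2-propanol" are illustrative assumptions, not taken from the slides.

```python
import re

# Small illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "in"}

def naive_normalize(text):
    """Generic pipeline: lowercase, delete digits, drop stop words."""
    text = re.sub(r"\d", "", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

def domain_normalize(text):
    """Domain-aware variant: keep digits and hyphens inside a token."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Two distinct compound names collapse to the same token once digits are removed.
print(naive_normalize("the 1-propanol"), naive_normalize("the 2-propanol"))
print(domain_normalize("the 1-propanol"), domain_normalize("the 2-propanol"))
```

The naive version maps both names to the same token, which is the loss of information the slide warns about; the domain-aware version keeps them distinct.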
17
Methodology: Extract textual elements from papers that identify a term in the ontology. Construct patterns with reliability scores (confidence that a pattern represents a term). Extend the pattern set with longer patterns. Apply semantic pattern matching techniques (i.e., consider synonyms). Annotate terms based on the quality of the matched pattern to the concept occurring in the text.
18
Training Phase. Objective: construct a set of patterns that characterize indicators for annotation. (1) Find terms in the “training set” papers. (2) Extract significant terms/phrases that appear in the papers. (3) Construct patterns based on significant terms/phrases and the terms surrounding them.
19
Annotation Phase. (1) Look for possible matches to the patterns in the papers. (2) Compute a matching score that indicates the strength of the prediction. (3) Determine the term to be associated with the pattern match. (4) Order new annotation predictions by their scores, and present them to the user.
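The four annotation steps can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout of a pattern, the names `find_matches` and `annotate`, and the scoring rule (pattern reliability multiplied by the fraction of context words found near the match) are all hypothetical and are not the scoring used in the presentation.

```python
def find_matches(tokens, pattern, window=3):
    """Steps 1-2: yield a score for each place the pattern's term sequence occurs."""
    middle = pattern["middle"]
    n = len(middle)
    context = pattern["left"] | pattern["right"]
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) != middle:
            continue
        # Words within `window` tokens on either side of the match.
        nearby = set(tokens[max(0, i - window):i]) | set(tokens[i + n:i + n + window])
        overlap = len(context & nearby) / len(context) if context else 1.0
        yield pattern["reliability"] * overlap  # assumed scoring rule

def annotate(papers, patterns):
    """Steps 3-4: attach the pattern's term and rank predictions, best first."""
    predictions = []
    for paper_id, tokens in papers.items():
        for pattern in patterns:
            scores = list(find_matches(tokens, pattern))
            if scores:
                predictions.append((max(scores), paper_id, pattern["term"]))
    return sorted(predictions, reverse=True)

# Made-up token lists and a made-up pattern, for illustration only.
papers = {
    "p1": "ramus pterygoid contacts planum".split(),
    "p2": "pterygoid bone".split(),
}
patterns = [{
    "left": {"ramus"}, "middle": ("pterygoid",), "right": {"planum"},
    "term": "pterygoid", "reliability": 0.8,
}]
for score, paper_id, term in annotate(papers, patterns):
    print(paper_id, term, score)
```

Presenting ranked predictions rather than committing them automatically keeps a curator in the loop, which matches the "present to user" step.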
20
Pattern Construction: A pattern is structured as { LEFT } <MIDDLE> { RIGHT }. <MIDDLE> is an ordered sequence of significant terms (i.e., identifying elements). {LEFT} and {RIGHT} are sets of words that appear around the significant terms (i.e., auxiliary descriptors). The number of words in {LEFT} and {RIGHT} can be limited. Stop words are not included in patterns.
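A minimal sketch of building one such pattern from a tokenized sentence, assuming the significant term has already been located. The function name `build_pattern`, the `window` parameter, the stop-word list, and the name `middle` for the ordered sequence of significant terms are assumptions made for this example; the sentence is made up and is not an anatomical claim.

```python
# Small illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "by"}

def build_pattern(tokens, start, end, window=3):
    """Return (left, middle, right) for the significant term tokens[start:end].

    middle is an ordered tuple; left and right are unordered sets holding at
    most `window` non-stop words nearest to the term on each side.
    """
    middle = tuple(tokens[start:end])
    left_ctx = [t for t in tokens[:start] if t not in STOP_WORDS]
    right_ctx = [t for t in tokens[end:] if t not in STOP_WORDS]
    left = frozenset(left_ctx[max(0, len(left_ctx) - window):])
    right = frozenset(right_ctx[:window])
    return left, middle, right

tokens = "the anterior ramus of the pterygoid contacts the planum antorbitale".split()
left, middle, right = build_pattern(tokens, 5, 6)
print(sorted(left), middle, sorted(right))
```

Using sets for the surrounding words and a tuple for the term mirrors the slide's distinction between unordered auxiliary descriptors and the ordered identifying sequence.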
21
Pattern Construction Example: pattern template { LEFT } <MIDDLE> { RIGHT }; pattern1: { increase catalytic rate } { transcription suppressing transient }; pattern2: { proteins regulation transcription } { initiated search proteins }
22
Pattern Construction Example: pattern template { LEFT } <MIDDLE> { RIGHT }; pattern1: { frontoparietal } { sphenethmoid }; pattern2: { anterior ramus pterygoid } { planum antorbitale }
23
Pattern Scoring: Calculate a score representing how confidently a pattern represents a term. MT = source of <MIDDLE>. Patterns whose <MIDDLE> exactly matches an ontology term get a higher score.
24
Pattern Scoring: Calculate a score representing how confidently a pattern represents a term. TT = type of individual terms in the <MIDDLE>. Considers the occurrence frequency of a word in <MIDDLE> among all ontology terms, and the position of the word in an ontology term (terms get more specific from right to left).
25
Pattern Scoring: Calculate a score representing how confidently a pattern represents a term. PP = term-wise paper frequency of <MIDDLE>. Patterns with a <MIDDLE> that is highly frequent in the paper dataset get higher scores.
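The slides do not give a formula for PP. As one plausible reading, the sketch below computes the fraction of papers in which a pattern's ordered term sequence occurs as a contiguous run; the function name `paper_frequency` and this exact definition are assumptions, and the token lists are made up.

```python
def paper_frequency(middle, papers):
    """Fraction of papers whose token list contains `middle` as a contiguous run.

    This is an assumed reading of "paper frequency", not the slides' formula.
    """
    middle = tuple(middle)
    if not papers or not middle:
        return 0.0
    n = len(middle)
    hits = 0
    for tokens in papers:
        if any(tuple(tokens[i:i + n]) == middle for i in range(len(tokens) - n + 1)):
            hits += 1
    return hits / len(papers)

papers = [
    ["planum", "antorbitale", "present"],
    ["planum", "absent"],
    ["no", "match"],
    ["the", "planum", "antorbitale"],
]
print(paper_frequency(("planum", "antorbitale"), papers))
```

A frequency in [0, 1] like this could be combined with the other two components, but how MT, TT, and PP are weighted is not stated in the slides.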
26
Evaluation: Recall = (correct responses by software) / (all human responses). Precision = (correct responses by software) / (all responses by software).
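A small sketch of both measures, assuming each response is a (paper, term) pair and that a software response counts as correct when a human curator made the same annotation. The function name `precision_recall` and the sample annotations are illustrative assumptions.

```python
def precision_recall(software, human):
    """Return (precision, recall) for software annotations against human ones."""
    software, human = set(software), set(human)
    correct = software & human  # responses both the software and the humans made
    precision = len(correct) / len(software) if software else 0.0
    recall = len(correct) / len(human) if human else 0.0
    return precision, recall

# Made-up (paper, term) annotations for illustration.
software = {("p1", "a"), ("p1", "b"), ("p2", "c"), ("p3", "d")}
human = {("p1", "a"), ("p2", "c"), ("p2", "e")}
precision, recall = precision_recall(software, human)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four software responses are correct (precision 0.5) and two of the three human annotations are recovered (recall about 0.67).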
27
Discussion