1
Seed-based Generation of Personalized Bio-Ontologies for Information Extraction
Cui Tao & David W. Embley
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Supported by NSF
2
Personalized Information Harvesting
Biology domain huge (other domains too)
Data collection
  – Many (web) sources
  – Only a tiny subpart wanted
  – Personalized view
Personalized extraction ontology
  – Creation: Form specification
  – Application: Seed-based harvesting
3
Example
Harvest information about large proteins in humans and the functions of these proteins
  – Find proteins in humans that are >20 kDa
  – Find all the proteins in humans that serve as receptors
  – ...
Information sources: various online repositories
  – NCBI
  – Gene Cards
  – The Gene Ontology
  – GPM Proteomics Database
  – …
4
Extraction Ontology
Instance: ^\d{1,5}(\.\d{1,2})?
Context: weight|wght|wt\.
Unit: kilodaltons?|kdas?|kds?|das?|daltons?
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
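The instance, context, and unit expressions on this slide can be combined into one recognizer. The following is a minimal sketch, not the authors' implementation: the function name `find_weights`, the 20-character gap allowed between the context keyword and the value, and the removal of the leading `^` anchor (so that values can be found inside running text) are all assumptions made for illustration.

```python
import re

# Patterns taken from the slide. The leading ^ anchor is dropped so the value
# can be matched inside running text (an assumption of this sketch).
VALUE = r"\d{1,5}(?:\.\d{1,2})?"
CONTEXT = r"(?:weight|wght|wt\.)"
# "daltons?" is listed before "das?" so the longer unit name is tried first.
UNIT = r"(?:kilodaltons?|kdas?|kds?|daltons?|das?)"

# A context keyword, up to 20 non-digit characters, a value, then a unit.
WEIGHT_RE = re.compile(
    rf"{CONTEXT}\D{{0,20}}?(?P<value>{VALUE})\s*(?P<unit>{UNIT})\b",
    re.IGNORECASE,
)


def find_weights(text):
    """Return (value_in_kDa, matched_unit) pairs found near a context keyword."""
    results = []
    for m in WEIGHT_RE.finditer(text):
        value = float(m.group("value"))
        unit = m.group("unit").lower()
        # Units starting with "k" are already kilodaltons; others are daltons.
        kda = value if unit.startswith("k") else value / 1000.0
        results.append((kda, unit))
    return results


# Example query from the deck: keep only proteins heavier than 20 kDa.
large = [w for w in find_weights("Molecular weight: 59.62 kDa") if w[0] > 20]
```

Requiring the context keyword keeps the number pattern from firing on unrelated figures such as sequence lengths or identifiers.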
6
Can We Make Construction Easier?
Forms
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
      Transformable to ontological descriptions
      Capable of accepting source data
Instance recognizers
  – Some pre-existing instance recognizers
  – Lexicons
Need for a full extraction ontology?
7
Form Creation
User Interface
Basic form-construction facilities:
  – single-entry field
  – multiple-entry field
  – nested form
  – …
8
Created Sample Form
9
Generated Ontology View
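One way to picture the step from a user-built form to an ontology view is sketched below. The slides do not spell out the generation rules, so everything here is an assumption for illustration: the `Field` and `Form` classes, the `to_ontology` function, and the convention that a single-entry field yields a maximum cardinality of "1" while multiple-entry fields and nested forms yield "*".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    label: str
    multiple: bool = False  # False: single-entry field, True: multiple-entry field


@dataclass
class Form:
    title: str
    fields: List[Field] = field(default_factory=list)
    nested: List["Form"] = field(default_factory=list)


def to_ontology(form):
    """Return (object_sets, relationships) derived from a form.

    Each relationship is (parent, child, max_cardinality), where
    max_cardinality is "1" for single-entry fields and "*" otherwise.
    """
    object_sets = [form.title]
    relationships = []
    for f in form.fields:
        object_sets.append(f.label)
        relationships.append((form.title, f.label, "*" if f.multiple else "1"))
    for sub in form.nested:
        sub_sets, sub_rels = to_ontology(sub)
        object_sets.extend(sub_sets)
        # Assumption: a nested form may be filled in more than once.
        relationships.append((form.title, sub.title, "*"))
        relationships.extend(sub_rels)
    return object_sets, relationships


# Hypothetical form; the field labels are illustrative, not taken from the deck.
protein_form = Form(
    "Protein",
    fields=[Field("Name", multiple=True), Field("Molecular Weight")],
    nested=[Form("Function", fields=[Field("GO Term", multiple=True)])],
)
object_sets, relationships = to_ontology(protein_form)
```

The form title becomes the primary object set, and each field or nested form becomes an object set related to it.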
10
Source-to-Form Mapping Establishing a Seed
11
Source-to-Form Mapping Establishing a Seed
12
Source-to-Form Mapping Establishing a Seed
13
Source-to-Form Mapping Establishing a Seed
14
Almost Ready to Harvest …
Need reading path: DOM-tree structure
Need to resolve mapping problems
  – Split/Merge
  – Union/Selection
15
Almost Ready to Harvest …
Need reading path: DOM-tree structure
Need to resolve mapping problems
  – Split/Merge
  – Union/Selection
Name
  Voltage-dependent anion-selective channel protein 3
  VDAC-3
  hVDAC3
  Outer mitochondrial membrane protein porin 3
17
Almost Ready to Harvest …
Need reading path: DOM-tree structure
Need to resolve mapping problems
  – Split/Merge
  – Union/Selection
Name
  T-complex protein 1 subunit theta
  TCP-1-theta
  CCT-theta
  Renal carcinoma antigen NY-REN-15
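The reading path and the split problem named above can be illustrated together. This is a rough sketch under stated assumptions, not the system's actual algorithm: the pages are tiny well-formed snippets parsed with `xml.etree.ElementTree` (real web pages would need an HTML parser), the path is recorded as a list of child indexes, the helper names `find_path`, `follow_path`, and `split_values` are hypothetical, and the `;` separator is invented for the example.

```python
import xml.etree.ElementTree as ET


def find_path(root, seed_text):
    """Return the child-index path from root to the first element whose
    text contains seed_text, or None if no element matches."""
    def walk(node, path):
        if node.text and seed_text in node.text:
            return path
        for i, child in enumerate(node):
            found = walk(child, path + [i])
            if found is not None:
                return found
        return None
    return walk(root, [])


def follow_path(root, path):
    """Walk the same child-index path on another page with the same layout."""
    node = root
    for i in path:
        node = node[i]
    return node


def split_values(text, sep=";"):
    """Split one source cell into several form entries (the 'split' case)."""
    return [part.strip() for part in text.split(sep) if part.strip()]


SEED_PAGE = (
    "<html><body><h1>Protein</h1><table><tr><td>Name</td>"
    "<td>T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta</td>"
    "</tr></table></body></html>"
)
OTHER_PAGE = (
    "<html><body><h1>Protein</h1><table><tr><td>Name</td>"
    "<td>Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3</td>"
    "</tr></table></body></html>"
)

# The user marks one value on the seed page; the path is then reused elsewhere.
path = find_path(ET.fromstring(SEED_PAGE), "TCP-1-theta")
cell = follow_path(ET.fromstring(OTHER_PAGE), path)
names = split_values(cell.text)
```

A merge would be the reverse operation (joining several source nodes into one form field), and union/selection would choose among several candidate paths; neither is shown here.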
19
Can Now Harvest Name
20
Can Now Harvest
Name
  14-3-3 protein epsilon
  Mitochondrial import stimulation factor L subunit
  Protein kinase C inhibitor protein-1
  KCIP-1
  14-3-3E
21
Can Now Harvest
Name
  Voltage-dependent anion-selective channel protein 3
  VDAC-3
  hVDAC3
  Outer mitochondrial membrane protein porin 3
22
Can Now Harvest
Name
  Tryptophanyl-tRNA synthetase, mitochondrial precursor
  EC 6.1.1.2
  Tryptophan--tRNA ligase
  TrpRS
  (Mt)TrpRS
23
Harvesting Populates Ontology
24
Also helps adjust ontology constraints
25
Can Harvest from Additional Sites
Name
  T-complex protein 1 subunit theta
  TCP-1-theta
  CCT-theta
  Renal carcinoma antigen NY-REN-15
26
Larger Picture
Information Harvesting
  – Not only for biology, but for any application
  – Not only from one site, but from many sites
Opportunities
  – Extraction ontology creation
  – Automating site-to-site information harvesting
  – Automatic semantic annotation
  – Data/Ontology transformations
27
Extraction Ontology Creation
Lexicons
Name
  14-3-3 protein epsilon
  Mitochondrial import stimulation factor L subunit
  Protein kinase C inhibitor protein-1
  KCIP-1
  14-3-3E
Name
  T-complex protein 1 subunit theta
  TCP-1-theta
  CCT-theta
  Renal carcinoma antigen NY-REN-15
Name
  Tryptophanyl-tRNA synthetase, mitochondrial precursor
  EC 6.1.1.2
  Tryptophan--tRNA ligase
  TrpRS
  (Mt)TrpRS
…
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
…
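The idea of turning harvested Name values into a lexicon can be sketched as follows. This is an illustrative assumption about how such a lexicon might be used, not the authors' method: `build_lexicon_pattern` and `recognize` are hypothetical names, and matching is done with a single case-insensitive regular expression built from the harvested strings.

```python
import re


def build_lexicon_pattern(names):
    """Compile one pattern that matches any harvested name as a whole token."""
    # Longest names first, so a long name wins over a shorter name it contains.
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    # The lookarounds stop "KCIP-1" from matching inside "KCIP-10".
    return re.compile(rf"(?<![\w-])(?:{alternation})(?![\w-])", re.IGNORECASE)


def recognize(pattern, text):
    """Return every lexicon entry found in text, in order of appearance."""
    return [m.group(0) for m in pattern.finditer(text)]


# Names harvested on the preceding slides.
HARVESTED = [
    "14-3-3 protein epsilon",
    "KCIP-1",
    "14-3-3E",
    "T-complex protein 1 subunit theta",
    "TCP-1-theta",
    "CCT-theta",
    "TrpRS",
    "(Mt)TrpRS",
]

pattern = build_lexicon_pattern(HARVESTED)
found = recognize(pattern, "Both KCIP-1 and CCT-theta were detected, but KCIP-10 was not.")
```

Each additional harvested page adds entries, so the lexicon grows as a side effect of harvesting.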
28
Automatic Source-to-Form Mapping
29
Automatic Semantic Annotation
30
Extraction Ontology Creation
Instance Recognizers
  – Number Patterns
  – Context Keywords and Phrases
31
Automatic Source-to-Form Mapping
32
Automatic Semantic Annotation
Recognize and annotate with respect to an ontology
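Annotation with respect to an ontology can be pictured as tagging each recognized string with the concept whose recognizer matched it. The sketch below is only an illustration: the `RECOGNIZERS` table, the concept labels, the bracketed output format, and the `annotate` function are assumptions, and the two patterns are deliberately tiny stand-ins for a full lexicon and number recognizer.

```python
import re

# Hypothetical concept-to-recognizer table; a real one would come from the
# extraction ontology's lexicons and instance recognizers.
RECOGNIZERS = {
    "Name": re.compile(r"\b(?:TCP-1-theta|CCT-theta|VDAC-3)\b"),
    "MolecularWeight": re.compile(r"\b\d{1,5}(?:\.\d{1,2})?\s*(?:kDa|Da)\b"),
}


def annotate(text):
    """Wrap each recognized span as [Concept: matched text]."""
    spans = []
    for concept, pattern in RECOGNIZERS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), concept))
    spans.sort()

    out, pos = [], 0
    for start, end, concept in spans:
        if start < pos:
            continue  # skip a span that overlaps one already emitted
        out.append(text[pos:start])
        out.append(f"[{concept}: {text[start:end]}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


annotated = annotate("CCT-theta has a mass of 59.62 kDa.")
```

In practice the annotations would more likely be stored as links to ontology concepts than as inline brackets; the inline form is used here only to keep the example short.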
33
Ontology Transformation
OWL & RDF: standard ontology languages
XML & XML Schema (XMLS): data exchange
Forms: form filling to populate an ontology
34
Ontology Transformation
Transformations to and from all
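To make the transformation idea concrete, the sketch below serializes one filled-in form record both as XML and as RDF triples in N-Triples syntax. It is a simplified illustration under stated assumptions: the `http://example.org/bio#` namespace, the record layout, and the function names `to_xml` and `to_ntriples` are hypothetical, field labels are assumed to be valid XML element names, and no OWL class definitions are generated.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/bio#"  # hypothetical namespace for this sketch


def to_xml(record, root_tag="Protein"):
    """Serialize a record {label: [values]} as a flat XML element."""
    root = ET.Element(root_tag)
    for label, values in record.items():
        for value in values:
            ET.SubElement(root, label).text = value
    return ET.tostring(root, encoding="unicode")


def to_ntriples(subject_id, record):
    """Serialize the same record as N-Triples lines with plain literals."""
    lines = []
    for label, values in record.items():
        for value in values:
            # Escape backslashes and double quotes inside the literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'<{BASE}{subject_id}> <{BASE}{label}> "{escaped}" .')
    return lines


record = {"Name": ["VDAC-3", "hVDAC3"]}
xml_text = to_xml(record)
triples = to_ntriples("P1", record)
```

Going the other way (reading XML or RDF back into a form record) would complete the "to and from" picture and is not shown.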
35
Contributions
Personalized ontology creation
Mapping from sources
Information harvesting
Opportunities for further work
  – Extraction ontology creation
  – Semantic annotation
  – Data/Ontology transformations
www.deg.byu.edu