
1 Seed-based Generation of Personalized Bio-Ontologies for Information Extraction Cui Tao & David W. Embley Data Extraction Research Group Department of Computer Science Brigham Young University Supported by NSF

2 Personalized Information Harvesting Biology domain: huge (other domains too) Data collection – Many (web) sources – Only a tiny subpart wanted – Personalized view Personalized extraction ontology – Creation: Form specification – Application: Seed-based harvesting

3 Example Harvest information about large proteins in humans and the functions of these proteins – Find proteins in humans that are >20 kDa – Find all the proteins in humans that serve as receptors – … Information sources: various online repositories – NCBI – Gene Cards – The Gene Ontology – GPM Proteomics Database – …

4 Extraction Ontology Instance: ^\d{1,5}(\.\d{1,2})? Context: weight|wght|wt\. Unit: kilodaltons?|kdas?|kds?|das?|daltons? … T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 …

5 Extraction Ontology Instance: ^\d{1,5}(\.\d{1,2})? Context: weight|wght|wt\. Unit: kilodaltons?|kdas?|kds?|das?|daltons? … T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 …
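The three recognizer parts on the slide (instance, context, unit) can be combined into a small matcher. The following is a minimal sketch, not the authors' implementation: the regular expressions are taken from the slide, but the leading `^` anchor is dropped so that values can be found inside running text, and the function name `find_weights` and the 40-character context window are assumptions made for illustration.

```python
import re

# Patterns from the slide. The instance pattern is used without its leading
# "^" so it can match inside a longer string (an assumption of this sketch).
INSTANCE = r"\d{1,5}(?:\.\d{1,2})?"
CONTEXT = r"weight|wght|wt\."
UNIT = r"kilodaltons?|kdas?|kds?|das?|daltons?"

# A value counts only if it is directly followed by a unit.
VALUE_WITH_UNIT = re.compile(
    rf"(?<![\d.])({INSTANCE})\s*({UNIT})\b", re.IGNORECASE
)
CONTEXT_RE = re.compile(CONTEXT, re.IGNORECASE)


def find_weights(text, window=40):
    """Return (value, unit) pairs that have a context keyword shortly before them.

    `window` is a hypothetical tuning knob: how many characters before the
    value are searched for a context keyword.
    """
    hits = []
    for match in VALUE_WITH_UNIT.finditer(text):
        before = text[max(0, match.start() - window):match.start()]
        if CONTEXT_RE.search(before):
            hits.append((match.group(1), match.group(2)))
    return hits
```

Requiring both a unit and a nearby context keyword is one way to keep a bare number pattern from matching unrelated numbers such as sequence lengths.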

6 Can We Make Construction Easier? Forms – General familiarity – Reasonable conceptual framework – Appropriate correspondence Transformable to ontological descriptions Capable of accepting source data Instance recognizers – Some pre-existing instance recognizers – Lexicons Need for a full extraction ontology?

7 Form Creation User Interface Basic form-construction facilities: single-entry field multiple-entry field nested form …

8 Created Sample Form

9 Generated Ontology View
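As a rough illustration of how a form built from single-entry, multiple-entry, and nested fields might be turned into an ontology view, the sketch below maps each field to an object set and each containment to a relationship with a maximum cardinality. The class `FormField`, the function `to_ontology`, and the rule that nested forms are treated as many-valued are all assumptions for this example; the slides do not specify the actual transformation.

```python
from dataclasses import dataclass, field


@dataclass
class FormField:
    label: str
    kind: str  # "single", "multiple", or "nested"
    children: list["FormField"] = field(default_factory=list)


def to_ontology(form_name, fields):
    """Return (object_sets, relationships).

    Each relationship is (parent, child, max_cardinality), where the
    cardinality is "1" for single-entry fields and "*" otherwise
    (an assumption of this sketch).
    """
    object_sets = [form_name]
    relationships = []

    def visit(parent, current_fields):
        for f in current_fields:
            object_sets.append(f.label)
            cardinality = "1" if f.kind == "single" else "*"
            relationships.append((parent, f.label, cardinality))
            if f.kind == "nested":
                visit(f.label, f.children)

    visit(form_name, fields)
    return object_sets, relationships
```

A form titled "Protein" with a multiple-entry "Name" field and a single-entry "Molecular Weight" field would, under these assumptions, yield three object sets and two relationships.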

10 Source-to-Form Mapping Establishing a Seed

11 Source-to-Form Mapping Establishing a Seed

12 Source-to-Form Mapping Establishing a Seed

13 Source-to-Form Mapping Establishing a Seed

14 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection

15 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3 Name

16 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3 Name

17 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

18 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15
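The reading-path and split ideas above can be sketched as follows. A seed value on one page determines a path through the DOM tree, the same path is followed on a structurally similar page, and the cell found there is split into separate form values. This is an illustrative sketch only: it assumes well-formed XHTML that `xml.etree.ElementTree` can parse, a path made of child indexes, and a `;` separator between names, none of which is stated in the slides.

```python
import xml.etree.ElementTree as ET


def path_to_seed(root, seed):
    """Return the list of child indexes from root to the first element
    whose own text contains the seed value, or None if it is not found."""
    def walk(node, path):
        if node.text and seed in node.text:
            return path
        for index, child in enumerate(node):
            found = walk(child, path + [index])
            if found is not None:
                return found
        return None

    return walk(root, [])


def follow(root, path):
    """Follow a child-index path and return the text found there."""
    node = root
    for index in path:
        node = node[index]
    return (node.text or "").strip()


def split_values(text, separator=";"):
    """Split one source cell into several form values (the 'split' case)."""
    return [part.strip() for part in text.split(separator) if part.strip()]
```

A path learned from the page the user mapped by hand can then be reused on sibling pages of the same site, which is the sense in which the seed drives harvesting.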

19 Can Now Harvest Name

20 Can Now Harvest Name 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E

21 Can Now Harvest Name Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3

22 Can Now Harvest Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS

23 Harvesting Populates Ontology

24 Also helps adjust ontology constraints

25 Can Harvest from Additional Sites Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

26 Larger Picture Information Harvesting – Not only for biology, but for any application – Not only from one site, but from many sites Opportunities – Extraction ontology creation – Automating site-to-site information harvesting – Automatic semantic annotation – Data/Ontology transformations

27 Extraction Ontology Creation Lexicons Name 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS … 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E … T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 … Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS …
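One plausible way to use harvested names as a lexicon is to compile them into a single recognizer that can be applied to new text. The sketch below is an assumption about how that might look, not a description of the authors' system; the function name `build_lexicon_recognizer` is hypothetical.

```python
import re


def build_lexicon_recognizer(names):
    """Compile harvested names into one case-insensitive pattern.

    Longer names are tried first so that a long name is preferred over a
    shorter name that happens to be its prefix.
    """
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookarounds stop a name from matching inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
```

Escaping each entry matters here because protein names routinely contain characters such as `(`, `)`, and `.` that are special in regular expressions.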

28 Automatic Source-to-Form Mapping

29 Automatic Semantic Annotation

30 Extraction Ontology Creation Instance Recognizers Number Patterns Context Keywords and Phrases

31 Automatic Source-to-Form Mapping

32 Automatic Semantic Annotation Recognize and annotate with respect to an ontology

33 Ontology Transformation OWL & RDF: standard ontology languages XML & XMLS: data exchange Forms: form filling to populate an ontology

34 Ontology Transformation Transformations to and from all
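To make one direction of these transformations concrete, the sketch below serializes harvested form values as RDF in the N-Triples syntax. It is a hand-rolled illustration under stated assumptions: the namespace `http://example.org/bio#`, the function name `to_ntriples`, and the property names are hypothetical, and only plain string literals are handled.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_ntriples(subject_id, concept, values, base="http://example.org/bio#"):
    """Serialize one harvested record as a list of N-Triples lines.

    `values` maps a property name to the list of strings harvested for it.
    The base namespace is a placeholder, not one used by the authors.
    """
    subject = f"<{base}{subject_id}>"
    lines = [f"{subject} <{RDF_TYPE}> <{base}{concept}> ."]
    for prop, harvested in values.items():
        for value in harvested:
            # Escape backslashes and double quotes inside the literal.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{base}{prop}> "{literal}" .')
    return lines
```

Each multiple-entry form field simply becomes several triples with the same subject and property, so no special handling is needed for repeated values.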

35 Contributions Personalized ontology creation Mapping from sources Information harvesting Opportunities for further work – Extraction ontology creation – Semantic Annotation – Data/Ontology transformations www.deg.byu.edu

