1 Ontology Generation Based on a User-Specified Ontology Seed Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University Supported by NSF
2 Introduction Motivation: Traditional search engines: return documents Ontology-based data extraction: return information Problem: Build extraction ontologies that meet users’ needs Goal: Automatically build ontologies for users’ needs
3 Example Example: a biologist is interested in information about large proteins in humans and their functions Possible queries: Find proteins in humans that are >20 kDa Find all the proteins in humans that serve as receptors ... Information sources --- various online databases NCBI Gene Cards The Gene Ontology GPM Proteomics Database …
4 Extraction Ontology Molecular Weight Regular Expression: ^\d{1,5}(\.\d{1,2})? Unit: kilodaltons?|kdas?|kds?|das?|daltons?
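A minimal sketch of how the Molecular Weight data frame on this slide could be applied to a candidate string. The function name `parse_molecular_weight`, the end-of-string anchor, the optional whitespace between value and unit, and the case-insensitive matching are assumptions added for illustration; the unit alternation is also an assumption about the intended alternatives.

```python
import re

# Value pattern taken from the slide (1-5 digits, optional 1-2 decimals).
VALUE = r"\d{1,5}(?:\.\d{1,2})?"
# Unit alternatives; assumed reading of the slide's unit expression.
UNIT = r"(?:kilodaltons?|kdas?|kds?|das?|daltons?)"

# Assumption: value and unit are separated by optional whitespace and
# together make up the whole string.
MOLECULAR_WEIGHT = re.compile(rf"^({VALUE})\s*({UNIT})$", re.IGNORECASE)


def parse_molecular_weight(text):
    """Return (value, unit) if text looks like a molecular weight, else None."""
    match = MOLECULAR_WEIGHT.match(text.strip())
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()
```

For example, `parse_molecular_weight("29174 Daltons")` yields a numeric value plus a lower-cased unit, while a string without a recognized unit yields `None`.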
5 User Interface Select a title for the forms
6 User Interface Binary Relationship Name Protein Name
7 User Interface Binary Relationship Molecular Weight Protein Name Protein Molecular weight
8 User Interface N-ary Relationship Chromosome number Start End Orientation Chromosome location Chromosome number Start End Orientation
9 User Interface N-ary Relationship GO GO phrase GO ID Go ID Go term
10 Protein Molecular Weight Name Chromosome location GO Chromosome number Start End Orientation Overall Form Go ID Go term
11 Ontology View Name Chromosome location Protein Chromosome number Start End Orientation GO GO phrase GO ID Molecular weight
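The step from the seed form to an ontology view can be sketched as a small tree walk. This is one plausible reading of the slides, not the tool's actual algorithm: the `FormField` class and the `relationship_sets` function are hypothetical names, and the rule used here (a leaf field becomes a binary relationship with the root, a grouped field becomes an n-ary relationship over its sub-fields) is an assumption based on the binary and n-ary form examples.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FormField:
    """One field of the seed form; a field with children is an n-ary group."""
    name: str
    children: list[FormField] = field(default_factory=list)


def relationship_sets(root):
    """Derive one relationship set per field directly under the root.

    A leaf field gives a binary relationship (root, field). A field with
    children gives an n-ary relationship over the root and those children.
    """
    relations = []
    for form_field in root.children:
        if form_field.children:
            names = [child.name for child in form_field.children]
            relations.append((root.name, *names))
        else:
            relations.append((root.name, form_field.name))
    return relations


# The overall form from the slides, written out as a tree.
seed = FormField("Protein", [
    FormField("Name"),
    FormField("Molecular weight"),
    FormField("Chromosome location", [
        FormField("Chromosome number"), FormField("Start"),
        FormField("End"), FormField("Orientation"),
    ]),
    FormField("GO", [FormField("GO phrase"), FormField("GO ID")]),
])
```

Calling `relationship_sets(seed)` gives two binary relationships (for Name and Molecular weight) and two n-ary ones (for Chromosome location and GO).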
12 Protein Molecular Weight Name Chromosome location GO Chromosome number Start End Orientation Go ID Go term Fill in the Form
13 Protein Molecular Weight Daltons Name protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP E Chromosome location GO Chromosome number 17 Start End Orientation 1,250,267 1,194,558 minus Fill in the Form GO: GO: Go ID Go term enzyme binding protein domain specific binding
14 Mapping Name protein epsilon Mitochondrial import stimulation factor Lsubunit Protein kinase C inhibitor protein-1 KCIP E
16 Mapping Name
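One way the mapping step could work is to look for the values the user typed into the form inside each field of a source record, and map a concept to the field that contains the most of them. The sketch below assumes exactly that; the function `map_concepts`, the case-insensitive substring test, and the source record with its field labels are all hypothetical and only illustrate the idea.

```python
def map_concepts(filled_form, source_record):
    """Map each form concept to the source field containing most of its values."""
    mapping = {}
    for concept, values in filled_form.items():
        best_field, best_hits = None, 0
        for field_name, text in source_record.items():
            lowered = text.lower()
            hits = sum(1 for value in values if value.lower() in lowered)
            if hits > best_hits:
                best_field, best_hits = field_name, hits
        # Concepts whose values appear nowhere in the record stay unmapped.
        if best_field is not None:
            mapping[concept] = best_field
    return mapping


# Sample values as a user might enter them in the form.
filled_form = {
    "Name": [
        "Protein kinase C inhibitor protein-1",
        "Mitochondrial import stimulation factor",
    ],
    "Orientation": ["minus"],
}

# Hypothetical source record; the field labels are invented for illustration.
source_record = {
    "Aliases": "Protein kinase C inhibitor protein-1; "
               "Mitochondrial import stimulation factor",
    "Strand": "minus",
    "Summary": "This gene product belongs to a family of proteins.",
}
```

With these inputs, `map_concepts(filled_form, source_record)` associates Name with the hypothetical `Aliases` field and Orientation with `Strand`.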
17 Data Frame Generation Choose from data frame library Data frames for basic values Numbers within different ranges Integers, floats, etc s, phone numbers, addresses, etc Domain specific values (DNA sequences) Units Build lexicon files
18 Data Frame Generation Find the best matched data frame from the library Find the correct units
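Choosing the best matched data frame can be sketched as scoring each library pattern by how many sample values it matches in full. The miniature library, its entry names, and the coverage score below are assumptions for illustration; the real data frame library and its matching criteria are not described on the slides.

```python
import re

# Hypothetical miniature data frame library: name -> value pattern.
LIBRARY = {
    "integer": r"\d+",
    "decimal": r"\d+\.\d+",
    "comma-grouped integer": r"\d{1,3}(?:,\d{3})+",
    "DNA sequence": r"[ACGT]+",
}


def best_data_frame(samples, library=LIBRARY):
    """Return (name, coverage) for the pattern matching the most samples.

    Coverage is the fraction of samples the pattern matches in full.
    Ties go to the entry listed first; samples must be non-empty.
    """
    def coverage(pattern):
        matched = sum(1 for s in samples if re.fullmatch(pattern, s))
        return matched / len(samples)

    name = max(library, key=lambda n: coverage(library[n]))
    return name, coverage(library[name])
```

Given the start and end positions from the filled-in form, `best_data_frame(["1,250,267", "1,194,558"])` selects the comma-grouped integer pattern.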
19 Build Lexicon Files Name
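A lexicon file for a concept such as Name could be built by collecting the values found in mapped source fields, normalizing them, and writing one entry per line. The helper names `build_lexicon` and `write_lexicon`, the case-insensitive deduplication, and the longest-first ordering are assumptions made for this sketch.

```python
def build_lexicon(entries):
    """Return unique, whitespace-normalized entries, longest first."""
    seen = {}
    for entry in entries:
        cleaned = " ".join(entry.split())
        if cleaned:
            # Keep the first spelling seen for each case-insensitive key.
            seen.setdefault(cleaned.lower(), cleaned)
    return sorted(seen.values(), key=lambda e: (-len(e), e.lower()))


def write_lexicon(path, entries):
    """Write the lexicon to path, one entry per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_lexicon(entries)) + "\n")
```

Ordering longer entries first is a design choice: a matcher that tries entries in file order would then prefer a full protein name over a shorter name contained in it.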
20 Contribution Automatically generates ontologies depending on users’ requests Provides a tool for users to easily provide ontology seeds Automatically generates ontology views from ontology seeds Automatically maps ontology concepts to source databases