Biological Data Extraction and Integration
A Research Area Background Study
Cui Tao
Department of Computer Science
Brigham Young University

Research Field Overview
My research: Semantic Web, Data Integration, Schema Matching, Information Extraction, Bioinformatics

Information Extraction
“Information extraction systems process text documents and locate a specific set of relevant items.” [Califf99]
“Because the WWW consists primarily of text, information extraction is central to any effort that would use the web as a resource for knowledge discovery.” [Freitag98]

Information Extraction
Traditional information extraction
Hidden web crawling
Biological data extraction

Traditional Information Extraction
Groups of IE tools [Laender02]:
–Wrapper generation tools
–NLP-based and learning-based tools
–Ontology-based tools

Traditional Information Extraction
Wrapper generation tools
–Lixto [Baumgartner01]
    Supervised wrapper generation
    Semi-automatic
    Not robust; does not work well with unstructured data
–ROADRUNNER [Crescenzi01]
    Fully automatic wrapper generation
    Does not generate robust and general wrappers
    Only works for highly regular web pages

Traditional Information Extraction
NLP-based and learning-based tools
–SRV [Freitag98]
    Top-down learner
    Learns based on simple and relational features
    Single slot filling
–RAPIER [Califf99]
    Bottom-up learner
    Learns pre-filler, slot filler, and post-filler patterns
    Only works for free text
    Single slot filling
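
The three-part pattern idea listed for RAPIER can be pictured with a small sketch. This is only an illustration built on regular expressions, not RAPIER's actual rule representation or learning procedure; the class name `SlotRule`, the patterns, and the sample sentence are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SlotRule:
    """Hypothetical three-part rule: left context, the filler, right context."""
    pre_filler: str   # regex that must match immediately before the filler
    filler: str       # regex for the text to extract
    post_filler: str  # regex that must match immediately after the filler

    def extract(self, text: str) -> List[str]:
        # The pre-filler is matched but not captured; the post-filler is
        # checked with a lookahead so it is not consumed.
        pattern = re.compile(
            f"(?:{self.pre_filler})({self.filler})(?={self.post_filler})"
        )
        return [m.group(1) for m in pattern.finditer(text)]


# Single-slot example: fill a "city" slot from free text.
city_rule = SlotRule(
    pre_filler=r"located in ",
    filler=r"[A-Z][a-z]+",
    post_filler=r", [A-Z]{2}",
)
cities = city_rule.extract("The office is located in Provo, UT and hires engineers.")
```

Here `cities` holds the one extracted filler, and a sentence lacking either context yields an empty list, which is what "single slot filling" amounts to in this toy setting.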

Traditional Information Extraction
Ontology-based tools
–BYU Ontos [Embley99]
    Based on domain-specific extraction ontologies
    Robust to changes
    Multiple slot filling
    Ontologies have to be built manually

Hidden Web Crawling
Traditional IE tools: cover only publicly indexable web pages
Hidden web crawling
–Crawls the hidden web according to a user’s query
–HiWE (Hidden Web Exposer) [Raghavan01]
    Matches the source form representation against task-specific DB concepts
    Fills out and submits forms
    Retrieves information hidden behind the form

Biological Data Extraction
Mainly from plain text
Extract biological terms
–Dictionary-based
–Rule-based
Extract relationships between biological terms/elements
Example systems
–BLAST-based name identifier [Krauthammer00]
–PASTA (Protein Active Site Template Acquisition) [Gaizauskas03]
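
The difference between dictionary-based and rule-based term extraction can be sketched as follows. This is a minimal illustration and does not reproduce either example system; the dictionary contents, the "ends in -ase" rule, and the sample sentence are assumptions chosen for the example.

```python
import re
from typing import List

# Hypothetical mini-dictionary; a real system would load a curated lexicon.
TERM_DICTIONARY = {"insulin", "hemoglobin", "p53"}

# Hypothetical surface rule: any word ending in "ase" is treated as an enzyme name.
ENZYME_RULE = re.compile(r"\b[A-Za-z]+ase\b")


def dictionary_terms(text: str) -> List[str]:
    """Return tokens found in the dictionary (case-insensitive lookup)."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if t.lower() in TERM_DICTIONARY]


def rule_terms(text: str) -> List[str]:
    """Return tokens that satisfy the surface rule, known to the dictionary or not."""
    return ENZYME_RULE.findall(text)


sentence = "Insulin activates the kinase that phosphorylates p53."
found_by_dictionary = dictionary_terms(sentence)
found_by_rule = rule_terms(sentence)
```

The dictionary finds only terms it already lists, while the rule also picks up unseen names at the cost of false positives (ordinary words such as "phase" would match this rule too).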

The Semantic Web
Machine-understandable web
Gives information a well-defined meaning
Allows automation of tasks
Provides biologists with:
–Intelligent information services
–Personalized web resources
–Semantically empowered search engines

The Semantic Web
Semantic web languages
–XOL (XML-based Ontology Exchange Language)
–SHOE (Simple HTML Ontology Extension)
–OML (Ontology Markup Language)
–RDF(S) (Resource Description Framework (Schema))
–OIL (Ontology Interchange Language)
–DAML+OIL (DARPA Agent Markup Language + OIL)
–OWL (Web Ontology Language)
Semantic annotation
–Old: indexing of publications in libraries
–New: information extraction

Schema Matching
Previous methods [Rahm01]:
–Individual matchers vs. combining matchers
–Schema-based matchers vs. instance-based matchers
–Learning-based matchers vs. rule-based matchers
–Element-level matchers vs. structure-level matchers

Schema Matching
LSD (Learning Source Descriptions) [Doan01]
–Semi-automatic
–Learning-based
–Both schema-level and instance-level
–Only 1-1 mappings
GLUE & CGLUE [DMD+03]
–Ontology alignment
–CGLUE: complex (non-1-1) mappings

Schema Matching
Cupid [Madhavan01]
–Rule-based matcher
–Both element-level and structure-level
–Schema-based
–Works on hierarchical schemas with schema tree
–Linguistic similarity & structure similarity
–Matches tree elements by weighted similarities
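
The idea of matching tree elements by a weighted blend of linguistic and structure similarity can be sketched as below. The two component measures are deliberately simple stand-ins (character-level string similarity from `difflib`, and best name match among child elements), not Cupid's actual measures; the function names, the `w_struct` parameter, and the sample schema elements are assumptions.

```python
from difflib import SequenceMatcher
from typing import Sequence


def linguistic_sim(a: str, b: str) -> float:
    """Stand-in for name similarity: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def structural_sim(children_a: Sequence[str], children_b: Sequence[str]) -> float:
    """Stand-in for structure similarity: average best name match among children."""
    if not children_a or not children_b:
        return 0.0
    best = [max(linguistic_sim(a, b) for b in children_b) for a in children_a]
    return sum(best) / len(best)


def weighted_sim(name_a, children_a, name_b, children_b, w_struct=0.5) -> float:
    """Blend the two scores; w_struct is the weight given to structure."""
    lsim = linguistic_sim(name_a, name_b)
    ssim = structural_sim(children_a, children_b)
    return w_struct * ssim + (1.0 - w_struct) * lsim


# Two candidate target elements for the source element "Customer".
score_client = weighted_sim("Customer", ["name", "address"], "Client", ["name", "addr"])
score_invoice = weighted_sim("Customer", ["name", "address"], "Invoice", ["total", "date"])
```

In this toy case the names "Customer" and "Client" share few characters, so it is the similar child elements that lift `score_client` above `score_invoice`, which is the motivation for combining both kinds of evidence.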

Schema Matching
COMA (COmbining MAtch) [Do02]
–Combines different matchers
–Interactive with users
–Also an evaluation platform for different matchers
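
Combining matchers can be sketched as running several independent matchers on each candidate pair, aggregating their scores, and then selecting the best candidate. The sketch below is only an illustration of that pattern and does not reproduce COMA's matcher library or its interfaces; both matchers, the averaging default, and the 0.5 threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Sequence


def name_matcher(a: str, b: str) -> float:
    """Character-level string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefix_matcher(a: str, b: str) -> float:
    """1.0 if one name is a prefix of the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0


MATCHERS: List[Callable[[str, str], float]] = [name_matcher, prefix_matcher]


def average(scores: Sequence[float]) -> float:
    return sum(scores) / len(scores)


def combined_sim(a: str, b: str, aggregate=average) -> float:
    """Run every matcher on the pair and aggregate the individual scores."""
    return aggregate([m(a, b) for m in MATCHERS])


def best_match(source: str, targets: Sequence[str], threshold: float = 0.5) -> Optional[str]:
    """Pick the highest-scoring target, or None if nothing reaches the threshold."""
    if not targets:
        return None
    score, target = max((combined_sim(source, t), t) for t in targets)
    return target if score >= threshold else None


chosen = best_match("addr", ["address", "name", "phone"])
```

Passing a different `aggregate` function (for example the built-in `max`) changes the combination strategy without touching the individual matchers, which is also what makes such a setup usable for comparing matchers against each other.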

Biological Data Integration
Challenges:
–Huge volume of data, growing rapidly
–Highly diverse in granularity and variety
–Different terminologies, ID systems, units
–Unstable and unpredictable
–Different interfaces and querying capabilities

Biological Data Integration
SRS (Sequence Retrieval System) [Etzold96]
–Keyword-based retrieval system
–Returns simple aggregation of matched records
–Only works for relational databases
BioKleisli [Davidson97]
–Integrated digital library in biomedical domain
–No global schema or ontology
–A mediator works on top of source-specific wrappers
–Horizontal integration

Biological Data Integration
DiscoveryLink [Haas01]
–Mediator-based, wrapper-oriented
–Provides virtual DB access from different sources
–Cannot deal with complex source data
–Hard to add new sources
–Requires knowledge of specific query language
TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [Stevens00]
–Mediator-based
–Uses global ontology and schema
–Maps source and target concepts manually
–Not robust to changes
–Hard to add new sources
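
The mediator-and-wrapper arrangement mentioned for these systems can be pictured with a small in-memory sketch. It does not reproduce the design of BioKleisli, DiscoveryLink, or TAMBIS; the class names, field names, and records are all hypothetical, and each "source" is just a list of dictionaries.

```python
from typing import Dict, Iterator, List


class Wrapper:
    """Hypothetical source-specific wrapper: hides one source's field names."""

    def __init__(self, records: List[Dict[str, str]], field_map: Dict[str, str]):
        self.records = records      # raw records as the source stores them
        self.field_map = field_map  # source field name -> mediator field name

    def query(self, **criteria: str) -> Iterator[Dict[str, str]]:
        for raw in self.records:
            # Translate the record into the mediator's vocabulary.
            rec = {self.field_map[k]: v for k, v in raw.items() if k in self.field_map}
            if all(rec.get(k) == v for k, v in criteria.items()):
                yield rec


class Mediator:
    """Sends one query to every wrapper and concatenates the answers."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def query(self, **criteria: str) -> List[Dict[str, str]]:
        return [rec for w in self.wrappers for rec in w.query(**criteria)]


# Two made-up sources that describe the same kind of data with different field names.
source_a = Wrapper(
    [{"gene_symbol": "G1", "org": "human"}],
    {"gene_symbol": "gene", "org": "organism"},
)
source_b = Wrapper(
    [{"GeneName": "G1", "Species": "mouse"}, {"GeneName": "G2", "Species": "human"}],
    {"GeneName": "gene", "Species": "organism"},
)
mediator = Mediator([source_a, source_b])
results = mediator.query(gene="G1")
```

Each new source needs its own hand-written field map in this sketch, which mirrors the "hard to add new sources" and manual-mapping drawbacks listed above.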

Bioinformatics
Biological ontology
Bioinformatics data source discovery
Trustworthiness and provenance

Bioinformatics
Biological ontology
–GO (Gene Ontology) [Ashburner00]
    Controlled vocabulary
    –Molecular Function (7278 terms)
    –Biological Process (8151 terms)
    –Cellular Component (1379 terms)
    Represents knowledge hierarchically
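
Hierarchical representation means that a term implicitly carries all of its more general ancestors. A minimal sketch of that lookup follows; the is-a edges are a simplified toy hierarchy loosely modeled on molecular-function terms and have not been checked against the Gene Ontology itself, and the function name `ancestors` is an assumption.

```python
from typing import Dict, List, Set

# Toy is-a edges (child -> parents); simplified for illustration only.
IS_A: Dict[str, List[str]] = {
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term: str, is_a: Dict[str, List[str]] = IS_A) -> Set[str]:
    """Collect every more general term reachable through is-a edges."""
    seen: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


general_terms = ancestors("kinase activity")
```

Because parents are stored as lists, the same traversal also works when a term has more than one parent.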

Bioinformatics
Biological ontology
–LinKBase [Verschelde03]
    Originally a biomedical ontology
    –Over 2,000,000 medical concepts
    –Over 5,300,000 instantiations
    –543 relations
    Expanded using GO
    Only describes simple binary relationships

Bioinformatics
Bioinformatics data source discovery
–First step in integrating or answering queries
–Example system [Rocco03]:
    Pre-defined classes with class descriptions
    Tries to map each source to a class
Trustworthiness and provenance
–Trustworthiness:
    Consistency
    Reliability
    Competence
    Honesty
–Provenance
    Record: history, transformations, annotations, updates

Summary and Future Work
Overcome drawbacks of existing systems
Develop new algorithms for locating and extracting data from heterogeneous biological sources
My research: Semantic Web, Schema Matching, Information Extraction, Bioinformatics