1 A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System
Alan Wessman
Brigham Young University
MS Thesis Defense
Based in part on research funded by the National Science Foundation.
2 Presentation Overview
- Background of legacy Ontos
- Assumptions, challenges, concerns
- Framework as solution
- Explain framework
- Explain reference implementation
- Evaluation of system
- Future work and conclusion
3 Data Extraction
- Goals of data extraction
  - Find relevant data in unstructured or semi-structured documents
  - Map extracted data to a formal structure
- Approaches
  - Wrappers (ROADRUNNER, TSIMMIS)
  - NLP and machine learning (RAPIER, WHISK)
  - Ontologies (Ontos)
4 Ontos
- Developed by Data Extraction Group (DEG) at BYU
- Based on OSM ontologies and data frames
- Focuses on multiple-record extraction
- Good precision/recall
- Resilient to document changes
5 How Ontos Works
6 Ontos Assumptions
- OSML ontologies
- Single- or multiple-record text documents
- Each document/record relevant to domain
- Heuristics produce accurate mappings
- Output to relational database
7 Some Current Challenges
Challenge                        Example
New/evolving ontology features   Enhanced data frames
Variety of documents             PDF, plaintext, XML
Content filtering                Extract from certain HTML attributes (ALT, SRC, HREF)
Locating values                  On-the-fly lexicon
Optimizing mappings              Better heuristics; HMM-based mapping
8 Architectural Concerns
- Variety of technologies
- Different OSM representations
- Highly coupled code
- Difficult to install elsewhere
- Difficult to upgrade or extend
9 Thesis Statement
A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research. We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.
10 Frameworks
- Abstract architecture
- Decouple independent functions
- Define interfaces
- Use abstract classes, interfaces, declarative configuration files
- Allow quick adjustment of system settings without re-coding
- Make a system customizable
Image from http://www.mcoe.org
11 Creating an Extraction Framework
- Analyze systems
- Generalize functionality
- Define interfaces
- Create supporting code
- Document framework
12 Managing the Process
- DataExtractionEngine
  - Main class
  - Initialize, perform extraction, finalize
- ExtractionPlan
  - Defines order of steps in the extraction process
  - Can be imperative, declarative, or dynamic (like SQL execution plan)
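The split between engine and plan can be sketched roughly as follows. This is only an illustration: the slides name DataExtractionEngine and ExtractionPlan but do not show their methods, so `execute`, `run`, `SequentialPlan`, and the step labels are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class EngineSketch {
    /** Hypothetical signature: the slides name ExtractionPlan but not its methods. */
    public interface ExtractionPlan {
        void execute(List<String> log);
    }

    /** An imperative plan: the order of steps is hard-coded (illustrative step names). */
    public static class SequentialPlan implements ExtractionPlan {
        @Override
        public void execute(List<String> log) {
            log.add("retrieve");
            log.add("parse");
            log.add("filter");
            log.add("extract");
        }
    }

    /** Mirrors the three phases on the slide: initialize, perform extraction, finalize. */
    public static class DataExtractionEngine {
        private final ExtractionPlan plan;

        public DataExtractionEngine(ExtractionPlan plan) {
            this.plan = plan;
        }

        public List<String> run() {
            List<String> log = new ArrayList<>();
            log.add("initialize");
            plan.execute(log);   // the plan, not the engine, decides the step order
            log.add("finalize");
            return log;
        }
    }

    public static void main(String[] args) {
        System.out.println(new DataExtractionEngine(new SequentialPlan()).run());
    }
}
```

Because the engine only calls the plan interface, a declarative or dynamic plan could be swapped in without touching the engine.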
13 Handling Documents
- DocumentRetriever
  - Responsible for locating relevant documents
  - Search engine, local filesystem, CMS
- DocumentStructureRecognizer
  - Decides which DocumentStructureParser to use
- DocumentStructureParser
  - Breaks document into individual records or sub-documents
  - Record separator, table analyzer
- ContentFilter
  - Normalizes document text
  - Strips out unwanted markup, stopwords, etc.
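As a rough illustration of the DocumentStructureParser role, the sketch below splits a document into records on a delimiter. The `parse` signature and the `DelimiterRecordSeparator` class are assumptions made for this example; they are not the framework's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DocumentHandlingSketch {
    /** Hypothetical signature: breaks a document into records or sub-documents. */
    public interface DocumentStructureParser {
        List<String> parse(String document);
    }

    /** Illustrative parser that treats a regex delimiter as the record boundary. */
    public static class DelimiterRecordSeparator implements DocumentStructureParser {
        private final Pattern delimiter;

        public DelimiterRecordSeparator(String delimiterRegex) {
            this.delimiter = Pattern.compile(delimiterRegex);
        }

        @Override
        public List<String> parse(String document) {
            List<String> records = new ArrayList<>();
            for (String piece : delimiter.split(document)) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {   // skip empty pieces around delimiters
                    records.add(trimmed);
                }
            }
            return records;
        }
    }

    public static void main(String[] args) {
        DocumentStructureParser parser = new DelimiterRecordSeparator("(?i)<hr\\s*/?>");
        System.out.println(parser.parse("<hr>First record<hr> Second record <hr>"));
    }
}
```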
14 Extracting Values
- ValueRecognizer
  - Uses matching rules defined in ontology
  - Produces set of candidate matches (like data record table)
- ValueMapper
  - Accepts or rejects candidate matches
  - Assigns accepted matches to elements of the ontology (e.g., object sets)
- OntologyWriter
  - Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
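The recognizer step can be pictured as running each object set's matching rule over the text and recording every hit as a candidate. The sketch below assumes the rules are plain regular expressions keyed by object-set name; the `Match` class, `recognize` method, and the sample rule are hypothetical and much simpler than real data frames.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ValueRecognizerSketch {
    /** One candidate match: which object set it is for, and where it was found. */
    public static final class Match {
        public final String objectSet;
        public final String text;
        public final int start;
        public final int end;

        public Match(String objectSet, String text, int start, int end) {
            this.objectSet = objectSet;
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Applies every rule to the text and collects all candidate matches. */
    public static List<Match> recognize(Map<String, Pattern> rules, String text) {
        List<Match> candidates = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                candidates.add(new Match(rule.getKey(), m.group(), m.start(), m.end()));
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("Age", Pattern.compile("\\b\\d{1,3}\\b"));   // illustrative rule only
        for (Match m : recognize(rules, "John Doe, age 87, died Monday.")) {
            System.out.println(m.objectSet + ": " + m.text);
        }
    }
}
```

A ValueMapper would then accept or reject these candidates; that decision is deliberately left out of the sketch.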
15 Implementing the Framework
16 OSMX
- Legacy Ontos: OSML
- OntologyEditor: OSM.dtd
- New standard is OSMX
  - XML Schema (better constraints; validation)
  - JAXB generates corresponding Java classes
  - Common language for DEG tools
  - Allows data to be stored inline with model
17 Managing the Process
- OntosEngine
  - Main class for Ontos system
  - Takes parameters from command line or configuration file
- OntosExtractionPlan
  - Sequentially retrieves, parses, filters, and extracts from individual documents
  - Imperative (hard-coded) algorithm
18 Handling Documents
- LocalDocumentRetriever
  - Retrieves documents from local filesystem
  - Filename filter excludes irrelevant files
- FanoutRecordSeparator
  - Implements DocumentStructureParser
  - Locates record boundaries and creates sub-documents
- HTMLFilter
  - Removes all HTML markup from documents
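A content filter that strips markup might look something like the sketch below. It is a deliberately naive regex-based stand-in, not the thesis's HTMLFilter: it ignores comments, scripts, and character entities, and the `ContentFilter.filter` signature is an assumption.

```java
public class HtmlFilterSketch {
    /** Hypothetical signature for the framework's ContentFilter role. */
    public interface ContentFilter {
        String filter(String text);
    }

    /** Naive tag stripper: replaces anything that looks like a tag with a space. */
    public static class NaiveHtmlFilter implements ContentFilter {
        @Override
        public String filter(String text) {
            String withoutTags = text.replaceAll("<[^>]*>", " ");
            // Collapse the whitespace left behind by removed tags.
            return withoutTags.replaceAll("\\s+", " ").trim();
        }
    }

    public static void main(String[] args) {
        System.out.println(new NaiveHtmlFilter().filter("<p>Hello <b>world</b></p>"));
    }
}
```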
19 Recognizing Values: DataFrameMatcher
- Uses data frame enhancements:
  - Keyword affinity (left and right)
  - Require context for left, right, or both
  - Value phrase-specific keywords
  - Link matches back to specific patterns
- Other improvements:
  - Consistent regular expression handling
  - Unlimited recursive macro definition
20 Mapping Values: HeuristicBasedMapper
- New algorithm
- Fully recursive with respect to ontology structure
- ContextualHeuristic generates objects
- Connection-based heuristics (singleton, nested-group, etc.) generate relationships
- See paper for additional details
21 Output
- Human-readable HTML format
- Easier to count correct, partial, incorrect mappings
22 Using the Framework and Reference Implementation
- Adding new features
  - Create new implementation classes
  - Extend (subclass) existing implementations
- Switching feature set
  - Change class name in config file
  - Override class on command line
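Switching implementations by class name is commonly done with reflection. The sketch below shows one way this could work, assuming a properties-style configuration; the `contentFilter` key, the two filter classes, and `loadFilter` are all hypothetical and are not taken from the Ontos code.

```java
import java.util.Properties;

public class ConfigSketch {
    /** Hypothetical signature for a pluggable component. */
    public interface ContentFilter {
        String filter(String text);
    }

    /** Default implementation: leaves the text unchanged. */
    public static class PassThroughFilter implements ContentFilter {
        @Override
        public String filter(String text) {
            return text;
        }
    }

    /** Alternative implementation, used here only to show the swap. */
    public static class UpperCaseFilter implements ContentFilter {
        @Override
        public String filter(String text) {
            return text.toUpperCase();
        }
    }

    /** Instantiates whichever class the configuration names (key name is an assumption). */
    public static ContentFilter loadFilter(Properties config) {
        String className = config.getProperty("contentFilter", PassThroughFilter.class.getName());
        try {
            return Class.forName(className)
                    .asSubclass(ContentFilter.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load filter class " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        // A command-line argument, if present, overrides the configured class name.
        if (args.length > 0) {
            config.setProperty("contentFilter", args[0]);
        }
        System.out.println(loadFilter(config).filter("sample text"));
    }
}
```

With this arrangement a new feature is a new class on the classpath, and selecting it requires no recompilation of the engine.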
23 Evaluating the Framework
Input: Obituaries ontology; 25 obituaries from two newspapers
Data from Salt Lake Tribune and Arizona Daily Star

               Age                FuneralDate        Viewing            Relationship/RelativeName
               Recall  Precision  Recall  Precision  Recall  Precision  Recall  Precision
New Ontos      60%     50%        68%     76%        80%     63%        74%     43%
Legacy Ontos   57%     38%        63%     75%        93%     18%        73%     41%

Four of eighteen object sets shown above.
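For reference, recall and precision follow their standard definitions: recall is correct extractions divided by the values that should have been found, and precision is correct extractions divided by the values actually extracted. The sketch below uses made-up counts, not the thesis data, and does not model how partial matches were scored, which the slides do not specify.

```java
public class ExtractionMetrics {
    /** Recall: fraction of expected values that were correctly extracted. */
    public static double recall(int correct, int expected) {
        return expected == 0 ? 0.0 : (double) correct / expected;
    }

    /** Precision: fraction of extracted values that were correct. */
    public static double precision(int correct, int extracted) {
        return extracted == 0 ? 0.0 : (double) correct / extracted;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one object set; not taken from the evaluation.
        int correct = 3;
        int expected = 4;
        int extracted = 6;
        System.out.printf("recall=%.2f precision=%.2f%n",
                recall(correct, expected), precision(correct, extracted));
    }
}
```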
24 Statistics about the System
                     Files   Lines of code*
Framework            38      2,868
OntologyEditor       141     22,249
OSMX (XML Schema)    1       1,918
OSMX (Java)**        60      6,912
Ontos                29      6,295

* Includes comments and whitespace.
** JAXB-generated classes add 197 files and 62,888 lines of code.
25 Future Work
- Algorithm improvements
  - On-the-fly lexicons
  - Machine learning techniques
  - Confidence values
  - Canonicalization
  - Expected participation cardinality
  - Negative-indicator keywords
- Integration
  - Online search engines
  - Semantic Web annotator and query engine
  - Web interface to extraction engine
26 Contributions
- Design and construction of a data-extraction framework
- Reference implementation
  - Ontos upgrade
  - Pattern for future use of framework
- OSMX
  - Standardized storage format
  - http://www.deg.byu.edu/xml/osmx.xsd
27 Contributions
- Uniform codebase and language
- OntologyEditor migration
- New graphics classes
- Extended data frame support
- Modular heuristic-based mapper
- Concept of extraction plans
- Flexible research platform
28 Conclusion
- Framework gives us the flexibility we need for further data-extraction research
- Framework is capable of supporting Ontos functionality
- OSMX and reference implementation provide solid base for future research applications