1
Data Frames Version 3 Proposal
2
Data Frames Version 2
Year matches [2]
  constant
    { extract "\d{2}"; context "([^\$\d]|^)\d{2}[^,\dkK]"; } 0.5,
    { extract "\d{2}"; context "([^\$\d]|^)\d{2},[^\d]"; } 0.6,
    { extract "\d{2}"; context "\b'\d{2}\b"; } 0.8;
end;
Mileage matches [8]
  constant
    { extract "\b[1-9]\d{1,2}k"; } 0.6,
    { extract "[1-9]\d?,\d{3}"; } 0.3;
  keyword "\bmiles\b", "\bmi\.", "\bmi\b";
end;
Also: except, substitute, filter phrases; lexicons
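One way to read the Version 2 value phrases is sketched below in Python. The semantics are an assumption, not something the slide states: the context pattern is matched against the text first, the extract pattern is then applied inside that match, and the trailing number is treated as a confidence. The names `ValuePhrase`, `YEAR_PHRASES`, and `find_values` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ValuePhrase:
    extract: str       # regex for the value itself
    context: str       # regex for the value plus its surrounding text
    confidence: float  # assumed meaning of the trailing number in the slide


# The three Year phrases, copied from the slide.
YEAR_PHRASES = [
    ValuePhrase(r"\d{2}", r"([^\$\d]|^)\d{2}[^,\dkK]", 0.5),
    ValuePhrase(r"\d{2}", r"([^\$\d]|^)\d{2},[^\d]", 0.6),
    ValuePhrase(r"\d{2}", r"\b'\d{2}\b", 0.8),
]


def find_values(text, phrases):
    """Return (value, confidence) pairs for every context match (assumed semantics)."""
    hits = []
    for phrase in phrases:
        for ctx in re.finditer(phrase.context, text):
            value = re.search(phrase.extract, ctx.group(0))
            if value:
                hits.append((value.group(0), phrase.confidence))
    return hits


print(find_values("97 Civic", YEAR_PHRASES))   # [('97', 0.5)]
print(find_values("85k miles", YEAR_PHRASES))  # [] because the 'k' suffix is excluded
```

Under Python's `re` semantics, the third context pattern matches only when the apostrophe directly follows a word character, because `\b` requires a transition between a word and a non-word character.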
3
Kimball’s Ontology Editor
Strong separation of value and keyword phrases
Each phrase may be labeled
Still allow negation
Introduce idea of “required context”
Allow keyword to be specific to a subset of the value phrases for this data frame
Expressions are richer than regular expressions. Supports Boolean and proximity operators; also lexicons and macros.
4
Internal Representation
Replace SQL field length with arbitrary type field
This is the “internal representation”
Type is either lexical or nonlexical
Type could be the name of an object set in the ontology
Or it could be the name of a type in whatever language will be used to implement methods (more on this later), together with a units name (e.g. “miles”, “meters”, “grams”, “pounds”)
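The type field described above could be modeled roughly as follows. This is only a sketch: the class name `InternalRepresentation` and its field names are hypothetical, and the slide does not say how the two kinds of type name would be told apart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InternalRepresentation:
    lexical: bool                # lexical vs. nonlexical
    type_name: str               # an object-set name, or a host-language type name
    units: Optional[str] = None  # e.g. "miles", "meters", "grams", "pounds"

    def describe(self):
        kind = "lexical" if self.lexical else "nonlexical"
        suffix = f" in {self.units}" if self.units else ""
        return f"{kind} {self.type_name}{suffix}"


# A host-language type paired with a units name.
mileage = InternalRepresentation(lexical=True, type_name="int", units="miles")
# A type that simply names an object set in the ontology (hypothetical name).
owner = InternalRepresentation(lexical=False, type_name="Person")

print(mileage.describe())  # lexical int in miles
print(owner.describe())    # nonlexical Person
```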
5
Methods
Add a method phrase to data frames
Conceptually they are restricted derived object sets and relationship sets
We only declare method signatures in data frames
Another language (e.g. Java) is used to define the method body
Our tool will generate a template in which the programmer can write method bodies
The template will have OO structures that allow read-only access to the seamless model/data instance
Keyword phrases may also apply to methods
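The split between a declared signature and a separately written body might look like the sketch below. It uses Python rather than Java only for brevity, and every name in it (`DistanceMethods`, `DistanceMethodsImpl`, `to_miles`, the dictionary keys) is hypothetical. The read-only access is imitated with `types.MappingProxyType`, which is one possible mechanism, not the one the proposal specifies.

```python
from abc import ABC, abstractmethod
from types import MappingProxyType


class DistanceMethods(ABC):
    """Stand-in for a generated template: signatures only, no bodies."""

    def __init__(self, instance):
        # Expose the data instance as a read-only view.
        self.instance = MappingProxyType(instance)

    @abstractmethod
    def to_miles(self):
        """Signature declared in the data frame; body supplied by the programmer."""


class DistanceMethodsImpl(DistanceMethods):
    """The part a programmer would fill in."""

    MILES_PER_UNIT = {"miles": 1.0, "meters": 0.000621371}

    def to_miles(self):
        return self.instance["value"] * self.MILES_PER_UNIT[self.instance["units"]]


methods = DistanceMethodsImpl({"value": 1000, "units": "meters"})
print(methods.to_miles())
```

Attempting `methods.instance["value"] = 5` raises `TypeError`, so method bodies can read the instance but cannot modify it through this view.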
6
Canonicalization Methods
Each value phrase may have an associated canonicalization method
The purpose is to convert the extracted value string into a common form
The data frame may have a default canonicalization method that applies if there is no individual method for a value phrase
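The fallback rule can be sketched as a lookup with a default. The example reuses the two Mileage value phrases from the Version 2 slide; the phrase labels (`k_suffix`, `comma`) and all function names are hypothetical.

```python
def canon_k_suffix(raw):
    """'85k' -> 85000: method attached to one specific value phrase."""
    return int(raw[:-1]) * 1000


def canon_default(raw):
    """'85,000' -> 85000: data-frame default, used when a phrase has no method."""
    return int(raw.replace(",", ""))


# Only the labeled "k_suffix" phrase has its own method.
PER_PHRASE = {"k_suffix": canon_k_suffix}


def canonicalize(raw, phrase_label, per_phrase=PER_PHRASE, default=canon_default):
    method = per_phrase.get(phrase_label, default)
    return method(raw)


print(canonicalize("85k", "k_suffix"))  # 85000
print(canonicalize("85,000", "comma"))  # 85000, via the default method
```

Both extracted strings end up in the same common form, an integer number of miles, regardless of which value phrase matched.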
7
Inheritance
Inheritance is defined more cleanly
Generalization/specialization will indicate inheritance hierarchy
The internal representation cannot be overridden in specializations
Multiple parents must have the same internal representation
Individual inherited phrases can be deleted or overridden
New phrases can be added
In the case of name conflict, we require fully qualified names to be used (no automatic disambiguation)
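The inheritance rules above can be sketched as a single function that builds a specialization from its parents. All names are hypothetical, phrases are reduced to label/pattern pairs, and the `Parent.label` form for fully qualified names is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    name: str
    internal_repr: str
    phrases: dict = field(default_factory=dict)  # label -> pattern


def specialize(name, parents, overrides=None, deleted=(), added=None):
    # Multiple parents must share one internal representation, which the
    # specialization inherits unchanged (there is no way to override it here).
    reprs = {p.internal_repr for p in parents}
    if len(reprs) != 1:
        raise ValueError("all parents must share one internal representation")

    # Labels defined by more than one parent are kept only under
    # fully qualified names; nothing is disambiguated automatically.
    counts = {}
    for p in parents:
        for label in p.phrases:
            counts[label] = counts.get(label, 0) + 1
    phrases = {}
    for p in parents:
        for label, pattern in p.phrases.items():
            key = f"{p.name}.{label}" if counts[label] > 1 else label
            phrases[key] = pattern

    for label in deleted:                          # delete inherited phrases
        del phrases[label]
    for label, pattern in (overrides or {}).items():
        if label not in phrases:
            raise KeyError(f"cannot override unknown phrase {label!r}")
        phrases[label] = pattern                   # override inherited phrases
    for label, pattern in (added or {}).items():
        if label in phrases:
            raise KeyError(f"phrase {label!r} already exists")
        phrases[label] = pattern                   # add new phrases
    return DataFrame(name, reprs.pop(), phrases)


date = DataFrame("Date", "date", {"numeric": r"\d{1,2}/\d{1,2}/\d{4}", "long": r"\w+ \d{1,2}, \d{4}"})
birth = specialize("BirthDate", [date], overrides={"numeric": r"\d{1,2}/\d{1,2}/\d{2,4}"},
                   deleted=["long"], added={"born": r"born on .+"})
print(sorted(birth.phrases))  # ['born', 'numeric']
```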
8
General Constraints
We may decide to implement a limited form of general constraint in the ontology
E.g. “Birth Date <= Death Date”
Or “Event Distance.toMiles() <= 26”
If so, we may want to implement operator overloading (something like C++)
The general constraint issue is not core to the current data frame discussion, but it has interesting ramifications
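A constraint such as “Event Distance.toMiles() <= 26” could be checked with an overloaded comparison operator. The sketch below uses Python's `__le__` in place of C++-style overloading; the `Distance` class, its conversion table, and the choice to treat a bare number as miles are all assumptions made for illustration.

```python
from dataclasses import dataclass

MILES_PER_UNIT = {"miles": 1.0, "meters": 0.000621371}


@dataclass(frozen=True)
class Distance:
    value: float
    units: str

    def to_miles(self):
        return self.value * MILES_PER_UNIT[self.units]

    def __le__(self, other):
        # Overloaded <=: compare in miles; a bare number is assumed to be miles.
        other_miles = other.to_miles() if isinstance(other, Distance) else other
        return self.to_miles() <= other_miles


ten_k = Distance(10_000, "meters")
marathon = Distance(42_195, "meters")

print(ten_k.to_miles() <= 26)  # True: the constraint as written on the slide
print(marathon <= 26)          # False: about 26.2 miles
print(ten_k <= marathon)       # True: both sides converted before comparing
```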
9
Other Issues
How to integrate methods and confidence values into record-assembly heuristics
Ontos system will have to be rewritten
Extract into model instance, not SQL tables
We can always generate database tables later if we’d like
Ontologies created graphically and stored as XML