Data Design Implementation and Support for Build 2b
November 30, 2011
Steve Hughes
Topics
- Overview
- Key Requirements and Drivers
- Build 2b Deliverables
- Build 2b Deployment
- Issues
- Next Steps
PDS4 Architecture
Data Architecture Concepts
[Diagram elements: Tagged Data Object (Information Object); Label; Schema; Information Model; Data Object; Data Element; Class; Planetary Science Data Dictionary; Product]
[Relationship labels: Used to Create; Describes; Extracted/Specialized; has; Expressed As; Validates]
DRIVERS FOR PDS4 Build 2a

Recommendation to MC (2009): Replace PDS3 ad hoc information model with a PDS4 information model that is managed using modern tools.
Implementation: The PDS4 Information Model has been designed and managed using the Protégé Ontology Modeling Tool.

Recommendation to MC (2009): Replace ad hoc PDS3 product definitions with PDS4 products that are defined in the model.
Implementation: The PDS4 Products and their components are defined using the modeling tool. The modeling tool provides rigorous definitions. The Product definition is based on the Open Archival Information System (OAIS) Reference Model, an ISO standard.

Recommendation to MC (2009): Require data product formats to be derivations from a core set. Support transformation from the core set.
Implementation: Four fundamental data structures have been defined. Additional data structures are subclasses of the four fundamental structures. Software written for the fundamental structures is inherited by the subclasses.
DRIVERS FOR PDS4 Build 2a

Recommendation to MC (2009): Replace “homegrown” PDS data dictionary structure with an international standard.
Implementation: The PDS4 Data Dictionary structure is based on the ISO/IEC specification.

Recommendation to MC (2009): Adopt a modern data language/grammar (XML) where possible for all tool implementations.
Implementation: The PDS4 Information Model is implemented in XML.
DRIVERS FOR PDS4 Build 2a

Requirement: 1.3.X – Provide Data Dictionary
Implementation: The PDS4 data dictionary database was developed and is compliant with the ISO/IEC specification. It is used to produce both data dictionary documents and data dictionary products for the registry and data dictionary service.

Requirement: PDS will define a standard for organizing, formatting, and documenting planetary science data.
Implementation: The PDS4 Information Model defines the archive organization, data formats, and product labeling standards. The PDS4 Standards Reference documents additional requirements.

Requirement: PDS will maintain a dictionary of terms, values, and relationships for standardized description of planetary science data.
Implementation: The PDS4 Data Dictionary defines the attributes, classes, and relationships for describing planetary science data.

Requirement: PDS will define a standard grammar for describing planetary science data.
Implementation: XML and XML Schema 1.1 have been adopted for the PDS4 implementation.
DRIVERS FOR PDS4 Build 2a

Requirement: PDS will establish minimum content requirements for a data set (primary and ancillary data).
Implementation: The PDS4 Information Model defines observational and ancillary product types. These products are collected into PDS4 Collections and Archive Bundles.

Requirement: PDS will, for each mission or other major data provider, produce a list of the minimum components required for archival data.
Implementation: The PDS4 Information Model defines the archive bundle and its product collections. The archive bundle and its collections are customized for each mission.

Requirement: PDS will develop and maintain online interfaces for discipline-specific searching.
Implementation: The PDS4 Information Model and Data Dictionary define the information that is needed for search.

Requirement: PDS will develop and publish procedures for determining syntactic and semantic compliance with its standards.
Implementation: The adoption of XML and XML Schema 1.1 provides syntactic and semantic standards, together with utilities and tools for validation.
Build 2a Scope
- Begin supporting PDS4 label design for LADEE and MAVEN; begin planning/testing migration
- Support the Policy on Acceptable PDS4 Data Formats
- Support transition of the central catalog to the registry infrastructure
- Deploy early PDS4 software tools and services
Build 2a Deliverables

Document/Artifact:
  1 Introduction
  2 Concepts Document
  3 Glossary
  4 Jumpstart Guide
  5 Data Provider’s Handbook
  6 Standards Reference
  7 Data Dictionary
  8 Example Products
  10 Generic Schemas
  11 Information Model

Processes: Data Provider; Standards Development
PDS4 Documents in Context
[Diagram elements: Concepts Document; Big Picture; Standards Reference; Requirements; User Friendly; XML Schemas; Blueprints; PDS4 Product Labels; Deliverables; Data Dictionary; Definitions; PDS4 Information Model Specification; Requirements; Engineering Specification; Informative; Data Provider’s Handbook; Cookbook; Registry Configuration File; Object Descriptions; Registry; Product Tracking and Cataloging; Introduction to PDS4 Documentation; Jumpstart; Glossary; Data Dictionary Tutorial]
[Relationship labels: derive; generates; references; creates / validates; instruct; configures]
[Legend: Complete; Some TBD]
Data Format Deliverables vis-à-vis Policy

PDS shall accept the following PDS4 data formats:

Policy: Fixed-width binary and ASCII tables that are composed of identically structured records.
Deliverable: Table_Base - The Table Base class defines a heterogeneous repeating record of scalars. Table_Character and Table_Binary are defined as types of Table_Base.

Policy: N-dimensional arrays of homogeneous binary elements (N<=16).
Deliverable: Array_Base - The Array Base class defines a homogeneous N-dimensional array of scalars.
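As an illustration of the "identically structured records" idea behind Table_Base, the following is a minimal sketch of reading a fixed-width character table. The field names, offsets, widths, and sample data are hypothetical and are not taken from any PDS4 label or tool.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str    # hypothetical field name
    offset: int  # 0-based byte offset of the field within a record
    length: int  # field width in bytes
    cast: type   # Python type used to convert the field text


def parse_fixed_width(data, record_length, fields):
    """Split data into records of record_length bytes and extract each field."""
    if len(data) % record_length != 0:
        raise ValueError("data length is not a multiple of the record length")
    rows = []
    for start in range(0, len(data), record_length):
        record = data[start:start + record_length]
        row = {}
        for field in fields:
            raw = record[field.offset:field.offset + field.length]
            row[field.name] = field.cast(raw.decode("ascii").strip())
        rows.append(row)
    return rows


# Hypothetical two-field layout; each record is 13 bytes including CR LF.
FIELDS = [Field("sol", 0, 4, int), Field("temp_k", 5, 6, float)]
SAMPLE = b"   1  210.5\r\n   2  212.0\r\n"

if __name__ == "__main__":
    for row in parse_fixed_width(SAMPLE, 13, FIELDS):
        print(row)
```

Because every record has the same layout, one field description is enough to read the whole table; a binary table would differ only in how each field's bytes are decoded.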
Data Format Deliverables vis-à-vis Policy

Policy: Variable-width character 'spreadsheets' that are composed of repeating, M-field, stream-delimited records where the fields themselves are (separately) delimited and may have variable widths (M>0).
Deliverable: Delimited_Table - The Delimited_Table class defines a simple table (spreadsheet) with delimited fields and records. It is defined as a type of Parsable_Byte_Stream.

Policy: NAIF/SPICE files.
Deliverable: The SPICE_Kernel_Binary and SPICE_Kernel_Text classes describe SPICE files.

Policy: PDS shall accept ASCII text and PDF/A formats for PDS4 documentation. PDS shall accept JPEG, GIF, and TIFF images for figures accompanying documents. PDS shall accept any of the approved structures and formats for browse products.
Deliverable: Product_Document - A Product Document is a product consisting of a single logical document comprised of one or more document formats. ASCII Text and PDF/A are currently allowed as document formats. JPEG, GIF, TIFF, and PNG are allowed as non-science image formats.
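The delimited "spreadsheet" structure can be sketched with Python's standard csv module. This is only an illustration under assumptions: the comma delimiter, the three-field layout, and the sample values are hypothetical, and no PDS4 software is implied.

```python
import csv
import io


def parse_delimited(text, field_delimiter=",", expected_fields=None):
    """Read delimited records; optionally check that each has M fields."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=field_delimiter)
    rows = []
    for record_number, row in enumerate(reader, start=1):
        if expected_fields is not None and len(row) != expected_fields:
            raise ValueError(
                f"record {record_number} has {len(row)} fields, "
                f"expected {expected_fields}"
            )
        rows.append(row)
    return rows


# Hypothetical sample: fields vary in width, records end with CR LF.
SAMPLE = "1,Gale Crater,210.5\r\n22,Meridiani Planum,198\r\n"

if __name__ == "__main__":
    for row in parse_delimited(SAMPLE, expected_fields=3):
        print(row)
```

Unlike the fixed-width case, field positions are found by scanning for delimiters, which is why the structure is treated as a parsable byte stream rather than as a table with fixed offsets.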
The Deliverables from 10K
PDS4 Model
PDS4 Products
PDS4 Data Formats
[Diagram labels: Base; Extensions/Restrictions]
PDS4 Observational Product
[Diagram classes: Identification_Area; Cross_Reference_Area; Observation_Area; File_Area; Digital_Object; Subject_Area; Bibliographic_Reference; Mission_Area; Node_Area; Observing_System; Reference_Entry; Data_Standards]
[Cardinality labels, in the order extracted: [0..1], [1], [1..*], [0..*], [0..*], [1..*], [0..*], [1], [1]]
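The cardinality labels on the diagram use the usual bracket notation, where `[1]` means exactly one, `[0..1]` optional, and `*` unbounded. A minimal sketch of interpreting that notation follows; the function names are hypothetical, and nothing here assigns a cardinality to a particular class.

```python
import re

_CARDINALITY = re.compile(r"\[(\d+)(?:\.\.(\d+|\*))?\]")


def parse_cardinality(text):
    """Return (minimum, maximum) for a label; maximum None means unbounded."""
    match = _CARDINALITY.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a cardinality label: {text!r}")
    minimum = int(match.group(1))
    upper = match.group(2)
    if upper is None:
        maximum = minimum      # "[1]" means exactly one
    elif upper == "*":
        maximum = None         # "[1..*]" means one or more
    else:
        maximum = int(upper)   # "[0..1]" means at most one
    return minimum, maximum


def satisfies(count, label):
    """Check whether an occurrence count is allowed by a cardinality label."""
    minimum, maximum = parse_cardinality(label)
    return count >= minimum and (maximum is None or count <= maximum)


if __name__ == "__main__":
    for label in ("[1]", "[0..1]", "[1..*]", "[0..*]"):
        print(label, parse_cardinality(label))
```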
Data Standards Development Process
- Domain expertise was captured in the PDS4 Information Model as an ontology.
- The model represents a consensus of the domain experts.
- The model is the single source for the PDS4 Data Standards, for example the generated XML Schemas.
[Diagram elements: Domain Knowledge; Information Modeling Tool; PDS4 Information Model; Filter and Translator; XML Schema (Generic), shown four times]
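The "single source" idea, where schemas are generated from the model rather than written by hand, can be sketched as follows. This is a toy illustration only: the dictionary-based model, the attribute names and occurrence limits, and the generate_schema function are assumptions, and the actual PDS4 filter and translator is not shown in the slides.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical, heavily simplified model:
# class name -> list of (attribute name, XSD type, minOccurs, maxOccurs)
MODEL = {
    "Identification_Area": [
        ("logical_identifier", "xs:string", "1", "1"),
        ("title", "xs:string", "1", "1"),
        ("alternate_title", "xs:string", "0", "unbounded"),
    ],
}


def generate_schema(model):
    """Translate the toy model into XML Schema text, one complexType per class."""
    schema = ET.Element(f"{{{XS}}}schema")
    for class_name, attributes in model.items():
        complex_type = ET.SubElement(schema, f"{{{XS}}}complexType", name=class_name)
        sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
        for name, xsd_type, min_occurs, max_occurs in attributes:
            ET.SubElement(
                sequence,
                f"{{{XS}}}element",
                name=name,
                type=xsd_type,
                minOccurs=min_occurs,
                maxOccurs=max_occurs,
            )
    return ET.tostring(schema, encoding="unicode")


if __name__ == "__main__":
    print(generate_schema(MODEL))
```

Because the schema text is derived mechanically, a change to the model propagates to every generated artifact on the next run, which is the consistency property the slide describes.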
Build 2b Deployment
- Resolve build 2a liens (to be discussed) and generate a build 2b deployment
- Generate a release of the information model, companion documents and supporting tutorial material
- Generate new schemas
- Generate registry configuration information
- Post key documents to PDS website
Chart of Review Comments Total: 1173
Total: 1935
Build 2a Identified Liens

Lien: Need to finalize and freeze the information model for Build 2b incorporating high priority changes identified in Build 2a.
Brief Explanation: Address issues found with the information model focusing primarily on the core components of the product labels and the aggregate products, collections and bundles.

Lien: Need capabilities to support local data dictionary validation and the creation of schema and human-readable definition lists.
Brief Explanation: There is a lack of instructions for creating, validating, and using local keywords and classes (this includes lack of support for generating human-readable definition lists for peer review).
Build 2a Identified Liens

Lien: Need to baseline the current documentation; need to provide additional information/changes.
Brief Explanation: Documents are still overlapping, not up to date, inconsistent in areas, and have gaps.

Lien: Need to finalize and freeze the XML Schema for Build 2b incorporating the extension schemas currently under testing by the DDWG.
Brief Explanation: Newer “extension” style schemas are not yet mature enough to be used by an external data provider. They seem to be preferred over the older but stable “flat” schemas that were available for the node exercises. Both are currently produced and produce similar labels.
Build 2b Actions – Jan ‘12
- Finalize and freeze the information model for Build 2b incorporating high priority changes identified in Build 2a.
- Use existing capabilities to support local data dictionary validation and the creation of schema and human-readable definition lists.
- Baseline the current documentation.
- Add any additional information/changes to an online resource (e.g., wiki).
- Finalize and freeze the XML Schema for Build 2b incorporating the extension schemas currently under testing by the DDWG.
Conclusion
- The PDS4 Information Model represents the DDWG consensus.
- A large number of decisions resulting from much discussion were captured in the model.
- All had a say; not everyone always got their way.
- On the scheduled date the model will be frozen and the PDS4 Data Standards will be generated and deployed.
- The schemas, the dictionary, and all other generated artifacts will be consistent with the model.
- The current consensus, as reflected in the model, will be operational.
Acknowledgements*
Ed Bell, Richard Chen, Dan Crichton, Amy Culver, Patty Garcia, Ed Grayzeck, Ed Guinness, Mitch Gordon, Sean Hardman, Lyle Huber, Steve Hughes, Chris Isbell, Steve Joy, Ronald Joyner, Debra Kazden, Todd King, Joe Mafi, Mike Martin, Thomas Morgan, Lynn Neakrase, Paul Ramirez, Anne Raugh, Mark Rose, Elizabeth Rye, Boris Semenov, Dick Simpson, Susie Slavney, Peter Allan, David Heather, Michel Gangloff, Santa Martinez, Thomas Roatsch, Alain Sarkissian
* Anyone who sat through a DDWG 2-hour telecon or provided useful input.
Thank You
Questions and Answers
Backup
Too Many {objects, classes, schemas, …}
- Abstract (vacuous) classes are used for organizational purposes. These are not included in the schemas and many are being deleted.
- Subclasses of the four fundamental structures are used to partition the set of allowed structures, for example the Array_2D_Image subclass of Array_Base.
- Question to be answered: does the PDS want to provide software specific to Array_2D_Image? All Array_Base software works for any Array_2D_Image.
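The point that Array_Base software works for any Array_2D_Image is ordinary subclass substitution. The sketch below shows it with hypothetical Python classes; the class layouts, the row-major storage, and the names lines and samples are assumptions made for illustration, not PDS4 definitions.

```python
class ArrayBase:
    """Homogeneous N-dimensional array of scalars, stored flat in row-major order."""

    def __init__(self, shape, values):
        expected = 1
        for axis_length in shape:
            expected *= axis_length
        if len(values) != expected:
            raise ValueError("number of values does not match the shape")
        self.shape = tuple(shape)
        self.values = list(values)

    def element(self, *index):
        """Return the element at the given N-dimensional index."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        flat = 0
        for i, axis_length in zip(index, self.shape):
            if not 0 <= i < axis_length:
                raise IndexError("index out of range")
            flat = flat * axis_length + i
        return self.values[flat]


class Array2DImage(ArrayBase):
    """Restriction of ArrayBase to exactly two axes (hypothetical lines x samples)."""

    def __init__(self, lines, samples, values):
        super().__init__((lines, samples), values)


def total(array):
    """Generic routine written only against ArrayBase."""
    return sum(array.values)


if __name__ == "__main__":
    image = Array2DImage(2, 3, [1, 2, 3, 4, 5, 6])
    print(image.element(1, 2), total(image))
```

Here total and element are written once for the base class and are usable on the image subclass unchanged, so subclass-specific software is only needed where the subclass adds behavior.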
Too Many {objects, classes, schemas, …}
- Subclasses of a product component are used to provide specificity, for example, the subclass Bundle_Member_Entry. There are three methods: change the name, change the namespace (new file), or use optional attributes.
- Some specific subclasses are used for special purposes, for example Table_Field_Checksum in an Inventory. Consider using Schematron Assert statements to validate.
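The idea of replacing a special-purpose subclass with an assertion can be sketched in Python. This imitates the shape of a Schematron rule (a context plus a test and a message) using the standard library; it is not Schematron itself, and the field_name child element, the rule text, and the sample label are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_asserts(root, rules):
    """Apply (context path, test, message) rules; return messages for failures."""
    failures = []
    for context_path, test, message in rules:
        for node in root.iterfind(context_path):
            if not test(node):
                failures.append(message)
    return failures


# Hypothetical rule: every Table_Field_Checksum must carry a field_name child.
RULES = [
    (
        ".//Table_Field_Checksum",
        lambda node: node.find("field_name") is not None,
        "Table_Field_Checksum must contain field_name",
    ),
]

# Hypothetical label fragment with one valid and one invalid entry.
LABEL = (
    "<Inventory>"
    "<Table_Field_Checksum><field_name>checksum</field_name></Table_Field_Checksum>"
    "<Table_Field_Checksum/>"
    "</Inventory>"
)

if __name__ == "__main__":
    for failure in check_asserts(ET.fromstring(LABEL), RULES):
        print(failure)
```

With rules of this kind the constraint lives alongside the schema instead of requiring a new class, which is one way to reduce the class count the slide is concerned with.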
Too Many {objects, classes, schemas, …}
- Some classes result from the process of normalization, for example array_axis and array_element.

Emperor Joseph II: …And there are simply too many notes, that's all. Just cut a few and it will be perfect.
Mozart: Which few did you have in mind, Majesty?
Action Item Flowchart
By the numbers
- Fundamental Data Structures: 4
- Lines of Schema Code: Flat 18K; Master 4K-6K
- Classes dropped (Master): nn
- SimpleTypes dropped (Master): 200
- Actionable items closed: 1.5K
- Actionable items open: < 50
- Issues from reviews: 1K+
Totals
[Table column headers: Internal; IPDA; External; Readiness; Total]
[Table row labels: Narrative; Documentation; Actionable; Discussion; Research; Kudo; System/Tools; Discipline; Process; Total]
Post Build 2b – Summer ‘12
- Develop discipline level classes for the next phase of data set migration.
- Refine the document suite and its organization.
- Support development of tools scheduled for the next build.
- Support development of data dictionary and local data dictionary services.
Capability Matrix