Download presentation
Presentation is loading. Please wait.
Published byEmily Dobson Modified over 11 years ago
1
© 2006 Open Grid Forum Data Format Description Language (DFDL) Progress Update Mike Beckerle, Oco, Inc.
2
© 2006 Open Grid Forum 2 Admin DFDL WG co-chairs: Mike Beckerle, Oco Martin Westhead, Avaya Two note takers please Sign the attendance sheet Note: OGF Intellectual Property Rules apply
3
© 2006 Open Grid Forum 3 OGF IPR Policies Apply I acknowledge that participation in this meeting is subject to the OGF Intellectual Property Policy. Intellectual Property Notices Note Well: All statements related to the activities of the OGF and addressed to the OGF are subject to all provisions of Appendix B of GFD-C.1, which grants to the OGF and its participants certain licenses and rights in such statements. Such statements include verbal statements in OGF meetings, as well as written and electronic communications made at any time or place, which are addressed to: the OGF plenary session, any OGF working group or portion thereof, the OGF Board of Directors, the GFSG, or any member thereof on behalf of the OGF, the ADCOM, or any member thereof on behalf of the ADCOM, any OGF mailing list, including any group list, or any other list functioning under OGF auspices, the OGF Editor or the document authoring and review process Statements made outside of a OGF meeting, mailing list or other function, that are clearly not intended to be input to an OGF activity, group or function, are not subject to these provisions. Excerpt from Appendix B of GFD-C.1: Where the OGF knows of rights, or claimed rights, the OGF secretariat shall attempt to obtain from the claimant of such rights, a written assurance that upon approval by the GFSG of the relevant OGF document(s), any party will be able to obtain the right to implement, use and distribute the technology or works when implementing, using or distributing technology based upon the specific specification(s) under openly specified, reasonable, non-discriminatory terms. The working group or research group proposing the use of the technology with respect to which the proprietary rights are claimed may assist the OGF secretariat in this effort. The results of this procedure shall not affect advancement of document, except that the GFSG may defer approval where a delay may facilitate the obtaining of such assurances. The results will, however, be recorded by the OGF Secretariat, and made available. The GFSG may also direct that a summary of the results be included in any GFD published containing the specification. OGF Intellectual Property Policies are adapted from the IETF Intellectual Property Policies that support the Internet Standards Process.
4
© 2006 Open Grid Forum 4 Agenda What is DFDL? Provide enough context for those interested in getting involved but who haven't been following along Progress review Latest on the draft specification Next steps Participation needed to complete the specification, initial implementations
5
© 2006 Open Grid Forum 5 Agenda What is DFDL? Provide enough context for those interested in getting involved but who haven't been following along Progress review Latest on the draft specification, current implementations of DFDL processors Next steps Participation needed to complete the specification, initial implementations
6
© 2006 Open Grid Forum 6 Data Interchange Formats There are two kinds: Prescriptive standards: Put your data in this format! Textual - XML Binary – ASN.1, XDR, EBML, … You use the defined encodings, syntax, … Descriptions are shareable Others: Choose your own format Textual, binary, mixture You use your own encodings, syntax, … No common description DFDL: a universal description for any format
7
© 2006 Open Grid Forum 7 What is DFDL? 1.A way of describing data… It is NOT a data format itself! 2.That can describe any data format… Textual and binary Commercial record-oriented Scientific and numeric Modern and legacy 3.While allowing high performance… Choose the right data format for the job High density, optimized I/O, random access No need to use DFDL libraries
8
© 2006 Open Grid Forum 8 Benefits of DFDL 1.Data format independent 2.Prescriptive standards not always best for job 3.Descriptions can be shared 4.Interoperability 5.Allows after-the-event description 6.Supports move to SOA 7.Spans commercial and scientific worlds 8.Supports grid computing
9
© 2006 Open Grid Forum 9 Why Grids and DFDL? Grids are about big-data and big-computation problems Simplistic solutions like use XML wont cut it! Performance and space usage Grids are about universal data interchange XML – use XML Schema Relational – use RDBMS schema Other – use DFDL
10
© 2006 Open Grid Forum 10 DFDL Goals Build on existing standards Leverage XML technology and concepts Support very efficient parsers/formatters Support round-tripping Read and write data in described format from same description Keep simple cases simple Simple descriptions should be "human readable" To the same degree as XML Schema Generality Can describe any format Allow extensions for new data formats
11
© 2006 Open Grid Forum 11 XML Synergy Use XML Schema subset & type system to describe the logical format of the data Use annotations within the XSD to describe the physical representation of the data Use XPath when referencing fields within the data This approach actively used today: IBM WebSphere Message Broker Microsoft BizTalk flat file Others
12
© 2006 Open Grid Forum 12 DFDL Features Release 1: Subset of XML Schema Rich textual & binary data capabilities Scoping rules that govern how the annotations apply Validated input and output - from XML Schema Defaults - for missing values Null capability - for out-of-band values Reference – use of a previously read value in subsequent expressions Basic math – in expressions Uncertainty – stratagems to resolve choices, optionality Arrays – one dimension Very general parsing/writing capability Future: Multi-layer – description of an intermediate rep not exposed in final result Extensibility – new type/transform specification Arrays - multi-dimension
13
© 2006 Open Grid Forum 13 DFDL Schema Model Simple Types Represent data values Complex Types Represent structures Elements Represent named fields Can be data fields or structural fields Can repeat (arrays) Sequence groups Structures where the children occur in order Choice groups Structures where just one of the children occurs (eg: C union, COBOL redefine) Wildcards Points where any element can occur Missing – multi-dimensional arrays. Deferred to a future version.
14
© 2006 Open Grid Forum 14 DFDL Schema Model typedimensionelementsimpleType simple type hierarchy on later slide sequencechoice group * wildcard * * complexType
15
© 2006 Open Grid Forum 15 XML Schema Subset Namespace management xs:import/xs:include file management Local and global xs:element declarations Optional dimensionality via maxOccurs and minOccurs Optional default value Optional nillable attribute (null value support) Local and global xs:complexType definitions Most built-in xs:simpleTypes – see later slide Local and global user-defined xs:simpleTypes by derivation Local and global xs:sequence groups (no dimensionality) Local and global xs:choice groups (no dimensionality) xs:any element wildcards xs:appinfo annotations to carry the DFDL information
16
© 2006 Open Grid Forum 16 XML Schema – Not Required Local and global xs:attribute declarations xs:attribute groups xs:complexType derivations xs:union and xs:list simple types Built-in xs:simpleTypes specific to XML or otherwise superfluous Dimensionality on non-elements (that is, model groups) Identity Constraints Substitution Groups xs:redefine file management Local and global xs:all groups
17
© 2006 Open Grid Forum 17 Supported Simple Types anySimpleType stringQNameNOTATIONfloatdoubledecimalbooleanbase64BinaryhexBinaryanyURI normalizedString token languageNameNMTOKEN NMTOKENSNCName ID IDREFENTITY IDREFSENTITIES integer long nonPositiveIntegernonNegativeInteger negativeInteger positiveIntegerunsignedLong unsignedInt unsignedShort unsignedByte int short byte datetimedateTimegYeargYearMonthgMonthgMonthDaygDayduration
18
© 2006 Open Grid Forum 18 What DFDL is Not: FAQ I have a pre-defined XML Schema. Q: Can I use DFDL to populate it from a non- XML data file? A: Only partly: DFDL is focused on data format DFDL does not provide general data transformation Populating a pre-defined XML Schema involves two separate problems: 1.Using DFDL to describe the data file format 2.Using a transformation system to transform that to conform to the pre-defined schema (not DFDL's job)
19
© 2006 Open Grid Forum 19 DFDL is only about Format ! The structure of the DFDL schema is dictated by the logical structure of the data You must work bottom up. Start from the data format, not from what you want to turn it into.
20
© 2006 Open Grid Forum 20 Example 1: XML 5 7839372 8.6E-200 -7.1E8
21
© 2006 Open Grid Forum 21 Example 1: XSD
22
© 2006 Open Grid Forum 22 Example 1: Binary 0000 0005 0077 9e8c 169a 54dd 0a1b 4a3f ce29 46f6
23
© 2006 Open Grid Forum 23 Example 1: Binary DFDL <dfdl:format represtation="binaryInteger" byteOrder="bigEndian lengthKind=fixed length=4/> <xs:element name="y" type="double dfdl:length=8 /> Long form Short form
24
© 2006 Open Grid Forum 24 Example 1: Textual 5,7839372,baseQ=8.6E-200,irl1=-7.1E8 Separators, initiators, terminators are all examples of markup
25
© 2006 Open Grid Forum 25 Example 1: Textual DFDL <dfdl:format repType= text encoding= UTF-8 numberScheme= myDecimals lengthKind= delimited /> <xs:element name="y" type="double" dfdl:initiator="baseQ="/> <xs:element name="z" type="float" dfdl:initiator="irl1="/>
26
© 2006 Open Grid Forum 26 Long Form DFDL Annotations … …
27
© 2006 Open Grid Forum 27 Short Form DFDL Annotations … <xs:element name="y" type="double" dfdl:initiator="baseQ="/> … Non-native attribute syntax, easy for users to write
28
© 2006 Open Grid Forum 28 Element Form DFDL Annotations Used for properties that themselves contain the quotation mark characters and escape characters, … baseQ= …
29
© 2006 Open Grid Forum 29 Long Form – Construct Specific … … Annotation on xs:element can be dfdl:element. Restricts properties to those sensible for elements. Annotation on xs:sequence can be dfdl:sequence. Restricts properties to those sensible for sequences
30
© 2006 Open Grid Forum 30 Annotation Elements formatDefines the data format properties that apply to part of the logical data models when using long form. (Also the construct specfic variations). defineFormatDefines a reusable data format by collecting together other annotations and associating them with a name that can be referenced from elsewhere. assertDefines truths about a DFDL model when parsing the data. discriminatorDefines an expressions to use for predicate testing to resolve uncertainties such as choice branches numberFormatDefines the scheme used for expressing numbers in the logical data model. This includes aspects such as radix, field separators, digit grouping separators, leading or trailing signs, etc. escapeSchemeDefines the scheme by which quotation marks and escape characters can be specified. This is for use with delimited text formats. propertyUsed in the syntax of dfdl:format annotations when the property value is incompatible with XML attribute content. defineVariableDefines a variable that can be referenced elsewhere. This can be used to communicate a parameter from an earlier part of a schema to a later part. setVariableSets the value of a variable that has been defined earlier in the schema. hiddenDefines a hidden element that appears in the schema for use by the DFDL processor, but is not part of the logical data model described by the schema. typeSubstitutionDefines overload of a non-annotated types with annotated meanings
31
© 2006 Open Grid Forum 31 Agenda What is DFDL? Provide enough context for those interested in getting involved but who haven't been following along Progress review Latest on the draft specification, current implementations of DFDL processors Next steps Participation needed to complete the specification, further implementations
32
© 2006 Open Grid Forum 32 Current Specification Core is currently at version 031 Supplements for Advanced decimals Datetimes Advanced delimited formats (may go away) Bi-directional text May be found at https://forge.gridforum.org/sf/go/doc15032?nav=1
33
© 2006 Open Grid Forum Core Specification 031 DFDL Information Set (Infoset) DFDL Schema Model Syntax Basics Syntax and Usage of DFDL Annotation Elements Expression language Scoping Rules DFDL Properties Introduction: The DFDL Parser and Unparser DFDL Data Syntax Grammar Framing Simple Types Properties for Nullable Elements Sequence Groups Unordered Sequence Groups Assertion and Discriminator Evaluation Choices Arrays and Optional Elements: Repeating and Variable-Occurrence Data Items Calculated Value Properties Any Element Wildcard 33
34
© 2006 Open Grid Forum Highlights Infoset Expression Language Calculated Values 34
35
© 2006 Open Grid Forum DFDL Infoset Abstract information model Simplifies explanation of Parsing of an invalid instance Unparsing of data That data might be constructed programatically, not necessarily from parsing data Clarifies data model relationship to XML data DFDL Schema is implied Unlike XML Infoset, theres no DFDL Infoset concept without a DFDL Schema 35
36
© 2006 Open Grid Forum DFDL Infoset Simpler than XML Infoset Information Items Document Element 36
37
© 2006 Open Grid Forum DFDL Infoset 37
38
© 2006 Open Grid Forum 38 DFDL Expression Language Small Subset of XPath 2.0. Constrained usage XPATH 2.0 large language Data model is DFDL Infoset Syntax { expression }
39
© 2006 Open Grid Forum 39 Expression Language Usage Set properties dynamically from instance initiator, terminator, length, separator, occursSeparator, null dfdl:assert, dfdl:discriminator inputValueCalc, outputValueCalc dfdl:setVariable
40
© 2006 Open Grid Forum 40 XPath 2.0 Subset XPath 2.0 types Limited path expressions Must return single value Limited axis XPath 1.0 functions Variables
41
© 2006 Open Grid Forum 41 Path Expressions / axis :: nodetest [ predicate ] / Axis child, parent (..), self (.) Nodetest Nametest - Name or wildcard [ predicate ] Must return array index
42
© 2006 Open Grid Forum Calculated Values DFDL supports limited layering Properties: inputValueCalc outputValueCalc outputLengthCalc Note: inputLengthCalc is just length={ … } 42
43
© 2006 Open Grid Forum Calculated Values Example Logically, the data is a date. Physically, it is stored as 6 packed- decimal digits, unsigned Like so: 43 YY D D MM
44
© 2006 Open Grid Forum Physical Layers Shape 44 YY D D MM
45
© 2006 Open Grid Forum Physical Layer Representation <sequence dfdl:representation="binary" dfdl:numberRepresentation="packedDecimal" dfdl:decimalSigned="false > <element name="mm" type="int" dfdl:numberPattern="## /> <element name="dd" type="int" dfdl:numberPattern="## /> <element name="yy" type="int" dfdl:numberPattern="## /> 45 YY D D MM
46
© 2006 Open Grid Forum Hide Physical Layer <sequence dfdl:representation="binary" dfdl:numberRepresentation="packedDecimal" dfdl:decimalSigned="false > <element name="mm" type="int" dfdl:numberPattern="## /> <element name="dd" type="int" dfdl:numberPattern="## /> <element name="yy" type="int" dfdl:numberPattern="## /> … 46 YY D D MM Logical layer Physical layer (hidden)
47
© 2006 Open Grid Forum Parsing: Calculate Logical from Physical … hidden pdate here … { fn:date(fn:concat(if (../pdate/yy > 50 ) then "19" else "20", fn:string(../pdate/yy), "-", fn:string(../pdate/mm), "-", fn:string(../pdate/dd))) } 47 YY D D MM dfdl:property annotations to avoid quoting hell
48
© 2006 Open Grid Forum Unparsing: Calculate Physical from Logical <sequence dfdl:representation="binary" dfdl:numberRepresentation="packedDecimal" dfdl:decimalSigned="false" <element name="mm" type="int" dfdl:numberPattern="##" dfdl:outputValueCalc="{ fn:month-from-date(../d) }" /> <element name="dd" type="int" dfdl:numberPattern="##" dfdl:outputValueCalc="{ fn:day-from-date(../d) }" /> <element name="yy" type="int" dfdl:numberPattern="##" dfdl:outputValueCalc="{ fn:year-from-date(../d) mod 100 }" /> 48 YY D D MM
49
© 2006 Open Grid Forum 49 Agenda What is DFDL? Provide enough context for those interested in getting involved but who haven't been following along Progress review Latest on the draft specification, current implementations of DFDL processors Next steps Participation needed Readers & editors
50
© 2006 Open Grid Forum Where are we on this spec? schema components (UML – 2 layer) regular expressions for lengths (Alan) valueCalc examples (Mike) property precedence (Steve) bring supplements up-to-date (Steve) assertions/discriminators and choice (Suman) how speculative parsing works (combining choice and variable-occurrence - currently these are separate) (IBM WTX person) reordering the properties discussion (move representation earlier, improve flow of topics) (Alan) 50 We're Done Start Over escapeSchemes (Ian P) string xml type (Ian P) variables (Mike) selectors (Suman) improvements on property descriptions (All - split tbd) envelopes and payloads (Steve) Official OGF Draft
51
© 2006 Open Grid Forum Where are we on this spec? v032 v034 51 We're Done Start Over v033 Official OGF Draft 30 to 60 days between versions Official draft by roughly August
52
© 2006 Open Grid Forum 52 Completion of Specification Anybody willing to join the DFDL WG in an active capacity ? Anybody interested in reviewing the spec ? To validate that R1 satisfies sufficient requirements
53
© 2006 Open Grid Forum 53 New Implementations The more the merrier ! Anybody interested in creating or collaborating on a reference implementation?
54
© 2006 Open Grid Forum 54 Requirements Discussion Q: How many people were planning to use DFDL to convert data into XML ? Q: How many people were not ?
55
© 2006 Open Grid Forum 55 END
56
© 2006 Open Grid Forum 56 Full Copyright Notice Copyright (C) Open Grid Forum (2004, 2007). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. The limited permissions granted above are perpetual and will not be revoked by the OGF or its successors or assignees.
57
© 2006 Open Grid Forum Data Format Description Language (DFDL) Extra Information
58
© 2006 Open Grid Forum 58 Standards Efforts in This Area ESML (Earth Science Markup Language) Users can write external files using ESML schema to describe the structure of various earth science data formats. Applications can utilize the ESML Library to parse this description file and decode the data format. ESML (Earth Science Markup Language) XSIL (Extensible Scientific Interchange Language ) The Extensible Scientific Interchange Language (XSIL) is a flexible, hierarchical, extensible, transport language for scientific data objects. XSIL (Extensible Scientific Interchange Language ) BFD (Binary Format Description) language The Binary Format Description (BFD) language is an XML dialect based on the eXtensible Scientific Interchange Language(XSIL) that supports the executable documentation of 'arbitrary' binary and ascii data sets. BFD (Binary Format Description) language BinX (Binary XML Description Language) Another language used to describe the content, structure and physical layout (endian-ness, blocksize) of binary files. BinX is designed to enable transparent transfer of data between diverse platforms. BinX (Binary XML Description Language)
59
© 2006 Open Grid Forum 59 DFDL Data Syntax Grammar Data in a format describable via a DFDL schema obeys the syntax grammar Data divided into Content – used to compute logical value Framing – padding, delimiters, separators, etc
60
© 2006 Open Grid Forum 60 Elements Document = Element Element = Prefix ElementContent Suffix ElementContent = SimpleContent | ComplexContent ComplexContent = SequenceGroup | ChoiceGroup SimpleContent = NumberContent | StringContent |CalendarContent | BinaryContent | BooleanContent
61
© 2006 Open Grid Forum 61 SequenceGroup SequenceGroup = Prefix SequenceGroupContent Suffix SequenceGroupContent = [ SequenceGroupPrefix [ GroupItem [ SequenceGroupSeparator GrouptItem]*] SequenceGroupSuffix ] GroupItem = Element | Array | ComplexContent
62
© 2006 Open Grid Forum 62 Choice Group ChoiceGroup = Prefix ChoiceGroupContent Suffix ChoiceGroupContent = ChoiceGroupResolvableItem | ChoiceGroupUnresolvableItem ChoiceGroupResolvableItem = GroupItem ChoiceGroupUnresolvableItem = Opaque
63
© 2006 Open Grid Forum 63 Array Array = ArrayContent ArrayContent = [ OccursPrefix [ Element [ OccursSeparator Element ]* ] [ OccursSeparator StopValue ] OccursSuffix ] StopValue = SimpleContent
64
© 2006 Open Grid Forum 64 Nulls, Optional, Defaults Unparsing does not use defaults to make invalid logical tree valid. (system may provide validation) Required parse elements Scalar ( minOccurs=maxOccurs=1 or neither is specified ) Array of fixed occurances Everything else is optional ( minOccurs, maxOccurs only used for validation) If a required element is not found by speculative parsing then use default value. If an optional element is not found by speculative parsing then were at the end of the variable occurrences. No defaulting occurs NULLS Decoupled from defaults useNullValueForDefault nullValueInitiatorPolicy
65
© 2006 Open Grid Forum 65 Scoping <xs:sequence dfdl:separator="," dfdl:encoding="ebcdic-cp-us"> … Are the elements of the sequence in EBCDIC, or just the separators? (lexical scoping) Are the contents of type myType' affected by the sequences properties or not? (referential transparency)
66
© 2006 Open Grid Forum 66 Scoping - Rules Scope of a dfdl:format annotation is controlled by a special applies attribute hereOnly - applies to the annotated construct only toScope - applies to the annotated construct AND to any contained or referenced constructs A construct can have two dfdl:format annotations, one with hereOnly and one with toScope Most local toScope annotation wins Referential transparency maintained One exception – global elements and groups Can override at point of reference
67
© 2006 Open Grid Forum 67 Scoping - Example 1 <dfdl:format applies=toScope representation="binaryInteger" byteOrder="bigEndian lengthKind=fixed length=4/> <xs:element name="y" type="double dfdl:applies=hereOnly dfdl:length=8 />
68
© 2006 Open Grid Forum 68 Scoping - Example 2 <dfdl:format applies=toScope representation="binaryInteger" byteOrder="bigEndian lengthKind=fixed length=4/> <xs:element name="y" type="double dfdl:applies=hereOnly dfdl:length=8 /> Referential transparency ensures that both examples have the same semantic
69
© 2006 Open Grid Forum 69 Scoping – CSV Example 1 <xs:complexType name= csv dfdl:applies= toScope dfdl:representation= text dfdl:encoding= UTF-8 dfdl:lengthKind= delimited dfdl:separator= %0A%0D dfdl:occursSeparator= %0A%0D dfdl:occursKind= delimited > <xs:sequence dfdl:applies= hereOnly dfdl:separator=, > <xs:element name="y" type="double dfdl:applies= hereOnly dfdl:initiator="baseQ="/> <xs:element name="z" type="float" dfdl:applies= hereOnly dfdl:initiator="irl1="/>
70
© 2006 Open Grid Forum 70 Scoping – CSV Example 2 <xs:complexType name= csv_type dfdl:applies= toScope dfdl:representation= text dfdl:encoding= UTF-8 dfdl:lengthKind= delimited dfdl:separator= %0A%0D dfdl:occursSeparator= %0A%0D dfdl:occursKind= delimited > <xs:element name= row maxOccurs= unbounded type= row_type /> <xs:element name="y" type="double dfdl:applies= hereOnly dfdl:initiator="baseQ="/> <xs:element name="z" type="float" dfdl:applies= hereOnly dfdl:initiator="irl1="/> Referential transparency ensures that both examples have the same semantic
71
© 2006 Open Grid Forum 71 Selectors Selectors are a method by which multiple sets of annotation elements can be specified within a single DFDL Schema. For instance you may have a binary and text representation of the same logical data format that you wish to model within a single DFDL Schema. A selector is a special pseudo-property that can be specified on any default and construct specific annotation element. For instance: The value of the selector property is chosen by the user. If there is no selector property on an annotation element it is said to have no selector (there is no default). Selectors cannot be used with the short form of properties that are specified directly on XML Schema components. When the DFDL Schema is used to parse a document the user must specify the selector they wish to use. Only annotation elements that have a matching selector value will be used to parse the document.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.