Published by Lesley Phillips. Modified over 9 years ago.
1
Mapping Physical Formats to Logical Models to Extract Data and Metadata
Tara Talbott
IPAW '06
2
The problem & solutions
- Wide range of files and formats
  - Standard formats
  - Prescriptive parsers
  - Arbitrary formats
- Machines need to merge, parse, and generally comprehend these various formats
- Potential solutions:
  - Data must adhere to a pre-specified format
  - Customized programs are written for each format and version
  - Users describe the format of their data and use tools to convert the data to a widely used, machine-understandable format (e.g. XML)
3
Descriptive parser solution: DFDL
- DFDL: Data Format Description Language
- Uses XML Schema with DFDL-specific annotations to describe the underlying data and how to transform it into a logical model
- Example: the text “5, 9.35091E+02, 2.63227E+02, -6.20633E+07” maps to the logical values 935.091, 263.227, -6.20633E7
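The mapping in the example above can be sketched in a few lines of Python. This is only an illustration of the physical-to-logical idea, not DFDL or Defuddle itself: the element names `step`, `value1`, and `value2` are hypothetical placeholders, and treating the leading integer as a record index is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names: "value1" and "value2" are placeholders,
# because the slide does not preserve the real names.
FIELD_NAMES = ("value1", "value2", "pressure")


def record_to_logical(line: str) -> ET.Element:
    """Turn one comma-separated text record into a small XML element."""
    parts = [p.strip() for p in line.split(",")]
    # Assumption: the leading integer is an index or count for the record.
    step = ET.Element("step", {"index": str(int(parts[0]))})
    for name, raw in zip(FIELD_NAMES, parts[1:]):
        # float() accepts the E+02 style exponent notation used in the text.
        ET.SubElement(step, name).text = repr(float(raw))
    return step


step = record_to_logical("5, 9.35091E+02, 2.63227E+02, -6.20633E+07")
print(ET.tostring(step, encoding="unicode"))
```

A descriptive parser differs from this sketch in that the field layout lives in a schema supplied by the user rather than in the program.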
4
Example DFDL Schema
[Schema listing not preserved; surviving annotation values: text, UTF-8]
5
Defuddle Parser Design
- An implementation of the DFDL specification
6
Capabilities
- Basic
  - Binary/text parsing of simple types
  - Basic math operations
  - Looping
  - Conditional logic
  - Use of regular expressions for separators and terminators
  - Input from multiple data sources
- Advanced
  - External translators
  - Specify intermediate layers in the data which can be used for processing, but are not reflected in the output
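As a rough illustration of what regular-expression separators and terminators buy, the Python sketch below splits text into records and fields using two patterns. It is not Defuddle code; the function name `split_records` and the sample patterns are assumptions made for the example.

```python
import re


def split_records(text: str, separator: str, terminator: str) -> list[list[str]]:
    """Split text into records (by terminator) and fields (by separator).

    Both arguments are regular expressions, so one description can cover
    variations such as Unix and Windows line endings or ragged whitespace.
    """
    records = [r for r in re.split(terminator, text) if r]
    return [re.split(separator, r.strip()) for r in records]


# Fields separated by a comma with optional surrounding whitespace,
# records terminated by either "\n" or "\r\n".
rows = split_records("1, 2 ,3\r\n4,5,6\n", r"\s*,\s*", r"\r?\n")
print(rows)
```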
7
Parsing Complex Formats
- Scientific formats on which Defuddle has been demonstrated:
  - CHEMKIN solution file
  - NWChem molecular dynamics property file
  - NWChem electronic structure output file
  - Microarray and protein-protein interaction spreadsheets
- Transformations within scientific workflows, to avoid custom programming
- Other formats that we would like to see handled in the future: HDF, JPEG, etc.
8
What problems does Defuddle address?
- Integrating different data formats, so that collaborators can share data generated before or without standardization
- Naming/identification of arbitrary file sub/super-structures
- Long-term preservation and reading of data when the applications used to create it are no longer available
- Efficient, general data access capabilities
- Random access
- Data virtualization
- Multiple descriptions of the same data
- Using DFDL and DFDL⁻¹ as a general subsetting/transformation mechanism
- Metadata extraction
9
Extracting metadata
- SAM
- DFDL + XSLT
- Benefits of automatic provenance/annotation capture
- Example use: extracting header information from microarray data
- Application to provenance
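The header-extraction use case can be pictured with a small Python stand-in. The slides use DFDL plus XSLT for this; the sketch below only shows the idea of pulling key/value metadata out of a file header, and the header layout it assumes (lines starting with `#` and containing `key=value`) is hypothetical, not an actual microarray format.

```python
def extract_header_metadata(lines, comment_prefix="#", delimiter="="):
    """Collect key/value pairs from leading header lines.

    Assumed (hypothetical) layout: header lines start with comment_prefix
    and hold "key=value"; the header ends at the first non-header line.
    """
    metadata = {}
    for line in lines:
        if not line.startswith(comment_prefix):
            break  # first data row reached, stop scanning
        key, sep, value = line[len(comment_prefix):].partition(delimiter)
        if sep:  # skip header lines that carry no key/value pair
            metadata[key.strip()] = value.strip()
    return metadata


# Made-up sample input, for illustration only.
sample = [
    "# instrument = scanner-A",
    "# channels = 2",
    "gene1\t0.5\t0.7",
    "# late = ignored",
]
meta = extract_header_metadata(sample)
print(meta)
```

Captured this way, the metadata can be attached to the data as annotation without anyone entering it by hand.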
10
Discussion
- Challenges
  - Efficient and generic: is it possible?
  - Size
  - Variable-length text
- Data virtualization: providing an abstract view of the data, independent of the underlying storage system
- Naming of data subsets: map a name to a reference into the logical model, not the physical one
  - E.g.: //step[5]/pressure … -6.20633E7
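Naming a subset by its place in the logical model can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. In this sketch the root element name `run` and the pressures for steps 1 to 4 are placeholders; only the fifth value comes from the slides. ElementTree does not accept an absolute `//` path on an element, so the query is written as `.//step[5]/pressure`.

```python
import xml.etree.ElementTree as ET

# Placeholder pressures for steps 1-4; only the fifth value is from the slides.
pressures = ["0.0", "0.0", "0.0", "0.0", "-6.20633E7"]

run = ET.Element("run")  # hypothetical root element name
for p in pressures:
    step = ET.SubElement(run, "step")
    ET.SubElement(step, "pressure").text = p


def lookup(root: ET.Element, path: str):
    """Resolve a logical name to its value, or None if nothing matches."""
    node = root.find(path)
    return None if node is None else node.text


value = lookup(run, ".//step[5]/pressure")
print(value)
```

The caller never mentions byte offsets or line numbers, so the same name keeps working if the physical layout of the file changes.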
11
Questions?
http://sdg.pnl.gov
http://defuddle.pnl.gov
http://forge.gridforum.org/projects/dfdl-wg
Tara.Talbott@pnl.gov