Grammar-based Specification and Parsing for Binary File Formats

Slides:



Advertisements
Similar presentations
Don’t Type it! OCR it! How to use an online OCR..
Advertisements

Advanced Decision Support for Archival Processing of Presidential E-Records: Results and Demonstration William Underwood, P.I. Georgia Tech Research Institute.
Digital Preservation - Its all about the metadata right? “Metadata and Digital Preservation: How Much Do We Really Need?” SAA 2014 Panel Saturday, August.
Chapter 3 Program translation1 Chapt. 3 Language Translation Syntax and Semantics Translation phases Formal translation models.
Information Extraction from Documents for Automating Softwre Testing by Patricia Lutsky Presented by Ramiro Lopez.
ANTLR.
Guide to Using Message Maker Robert Snelick National Institute of Standards & Technology (NIST) December 2005
Evolution of a Prototype Archival System for Preserving & Reviewing Electronic Records 2008 SAA Annual Meeting August 30, 2008.
ICS611 Introduction to Compilers Set 1. What is a Compiler? A compiler is software (a program) that translates a high-level programming language to machine.
Getting Started with ANTLR Chapter 1. Domain Specific Languages DSLs are high-level languages designed for specific tasks DSLs include data formats, configuration.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
Boardr The Racing Board Game Creation Language. Project Manager: Eric Leung Language and Tools Guru: Shensi Ding System Architect: Seong Jin Park System.
Chapter 10: Compilers and Language Translation Invitation to Computer Science, Java Version, Third Edition.
1 Programming Languages Tevfik Koşar Lecture - II January 19 th, 2006.
AN IMPLEMENTATION OF A REGULAR EXPRESSION PARSER
Intro to Lexing & Parsing CS 153. Two pieces conceptually: – Recognizing syntactically valid phrases. – Extracting semantic content from the syntax. E.g.,
Lexical Analysis Hira Waseem Lecture
Unit-1 Introduction Prepared by: Prof. Harish I Rathod
CS 152: Programming Language Paradigms April 2 Class Meeting Department of Computer Science San Jose State University Spring 2014 Instructor: Ron Mak
1 Chapter 1 Introduction. 2 Outlines 1.1 Overview and History 1.2 What Do Compilers Do? 1.3 The Structure of a Compiler 1.4 The Syntax and Semantics of.
Weaving a Debugging Aspect into Domain-Specific Language Grammars SAC ’05 PSC Track Santa Fe, New Mexico USA March 17, 2005 Hui Wu, Jeff Gray, Marjan Mernik,
. n COMPILERS n n AND n n INTERPRETERS. -Compilers nA compiler is a program thatt reads a program written in one language - the source language- and translates.
5-Dec-15 Sequential Files and Streams. 2 File Handling. File Concept.
Compiler Introduction 1 Kavita Patel. Outlines 2  1.1 What Do Compilers Do?  1.2 The Structure of a Compiler  1.3 Compilation Process  1.4 Phases.
Overview of Previous Lesson(s) Over View  Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. 
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
Compiler Construction CPCS302 Dr. Manal Abdulaziz.
1 Compiler Construction Vana Doufexi office CS dept.
Presented by : A best website designer company. Chapter 1 Introduction Prof Chung. 1.
Fangfang Cui Mar. 29, Overview 1. LL(k) 1. LL(k) Definition 2. LL(1) 3. Using a Parsing Table 4. First Sets 5. Follow Sets 6. Building a Parsing.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture Ahmed Ezzat.
Comp 411 Principles of Programming Languages Lecture 3 Parsing
Lecture 9 Symbol Table and Attributed Grammars
Basic Concepts: computer, program, programming …
Intro to compilers Based on end of Ch. 1 and start of Ch. 2 of textbook, plus a few additional references.
Compiler Design (40-414) Main Text Book:
PRINCIPLES OF COMPILER DESIGN
Chapter 1 Introduction.
Introduction to Compiler Construction
Computer Science 210 Computer Organization
CS 3304 Comparative Languages
A Simple Syntax-Directed Translator
CS 326 Programming Languages, Concepts and Implementation
Lexical Analysis (Sections )
Textbook:Modern Compiler Design
String Concepts In general, a string is a series of characters treated as a unit. Computer science has long recognized the importance of strings, but it.
SOFTWARE DESIGN AND ARCHITECTURE
Chapter 1 Introduction.
Introduction to XHTML.
CHAPTER 5 JAVA FILE INPUT/OUTPUT
Computer Science 210 Computer Organization
Bison: Parser Generator
Course supervisor: Lubna Siddiqui
Front End vs Back End of a Compilers
Syntax-Directed Definition
Programming Language Syntax 2
Computer Science 210 Computer Organization
Compilers B V Sai Aravind (11CS10008).
Representation, Syntax, Paradigms, Types
CS 3304 Comparative Languages
Fundamentals of Data Structures
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
William Lewis CSU Fresno
CS 3304 Comparative Languages
Problem Solving Skill Area 305.1
ABNF in ACL2 Alessandro Coglio Kestrel Institute Workshop 2017.
DATABASES WHAT IS A DATABASE?
Chapter 10: Compilers and Language Translation
Compiler Construction
Faculty of Computer Science and Information System
Presentation transcript:

Grammar-based Specification and Parsing for Binary File Formats William Underwood Georgia Tech Research Institute Atlanta, Georgia, USA PASIG Lightning Talk, May 23, 2013 Thank reviewers for their suggestions for not only improving the paper, but extending the research. Thank organizers for this venue to report and discuss my digital curation research. This presentation reports on an investigation of potentiallly sustainable methods for the long-term preservation of digital data and records. Specifically, it reports progress in addressing the research question: Is it possible to extend the context-free grammars used to specify the syntax of programming languages to the specification of binary file formats and to use these grammars with parsers for validating the file formats of binary files?   Filename - 1 1

Motivation: Digital Curation Tools Digital Curators need automated tools for: Validation of file formats, with pertinent error messages Extraction of metadata Viewing/playing/reading file formats Conversion of legacy formats to current/standard formats 1 Identification is required, to know which validator, metadata extractor, renderer, or converter to use. 2. Validation is required because If a file has been damaged, it may be possible to obtain an undamaged copy, or repair the file. File might need to comply with a standard format, e.g., PDF/A. Good error messages are needed in order to correct problems with records. 3. Metadata is needed for identification or characterization of records 4. Can’t render or read data without viewers/ players and readers 5. Conversion to current or standard formats is necessitated by obsolescence

Motivation: Digital Curation Tools Most all of these tools, especially those for binary file formats, are manually programmed from file format specifications. The tools become obsolete when the hardware/software platform on which the programs run becomes obsolete. The tools either need to be reprogrammed, or become unavailable. Need for a sustainable, more cost effective, digital preservation strategy for binary file formats. With sustainable methods for validation, metadata extraction, viewing/playing /reading, might not need to convert legacy to current or standard file formats. [Don’t say this here]

Research Questions Is it possible to extend the concept of context-free grammars from textual languages to binary file formats? Is it possible to specify binary file formats using these extended context-free binary array attribute grammars? Is it possible to develop a parser generator that takes a binary file grammar for a binary file format and generates a parser that can validate the file format?

Generating Parsers for Binary Formats Goal is a parser generator for binary file formats. ANTLR (Another Tool for language Recognition) is a parser generator Input: Attribute LL(k) grammar for a string language Output: Source code (e.g. Java for a recognizer of that language) Wrote functions for each data type that converts character (byte) tokens to binary data types. Uses ANTLR to demonstrate the capability of generating parsers from file format grammars

5x7plate.tif

Pretty Printed (Partial) Parse Tree

Results It is possible to extend context-free grammars for textual languages to the specification of some chunk-based and directory-based binary file formats. It is possible to create parsers from these grammars for these binary file formats. ANTLR, a parser generator for LL(k) grammars, has been successfully used to generate parsers for two chunk-based file formats and two directory-based file formats. Another family of binary file formats is directory-based. A directory-based file format consists of one or more directory tables which contain one or more directory entries. A directory entry specifies where the actual data for a type of information is located. This is the scheme used in TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument and Microsoft Open Office files. The significance of success in this research task is that if binary file formats can be specified with binary file grammars, then only one parsing generator is needed to generate the parsers for verifying that a file conforms to a particular format. Similarly, the same parsing generator can be used for conversion of legacy file formats to current or standard formats. Finally, the same parser generator can be used for generating viewer/players for most file formats. This would increase the likelihood of preserving and making available into the indefinite future that digital data and those digital records encoded in binary file formats.

Further Research Questions Can semantics be incorporated into the grammars for binary file formats to enable the generation of Validators? viewers/players? file format converters?

Further Information http://perpos.gtri.gatech.edu W. Underwood. Grammar-based Recognition of Documentary Forms and Extraction of Metadata. The International Journal of Digital Curation, Vol 5, Issue 1, 2010. www.ijdc.net/index.php/ijdc/article/view/152 W. Underwood. Grammar-based Specification and Parsing of Binary File formats. International Digital Curation Journal, 2012.