CS 3304 Comparative Languages Lecture 3: Scanning 24 January 2012
Introduction Syntax: the form or structure of the expressions, statements, and program units. Semantics: the meaning of the expressions, statements, and program units. Syntax and semantics provide a language’s definition. Users of a language definition: Other language designers. Implementers. Programmers (the users of the language). Basic terminology: A sentence is a string of characters over some alphabet. A language is a set of sentences. A lexeme is the lowest level syntactic unit of a language. A token is a category of lexemes (e.g., identifier).
Defining Languages Recognizers: A recognition device reads input strings over the alphabet of the language and decides whether each input string belongs to the language. Example: the syntax analysis part of a compiler (e.g., the scanner). Generators: A device that generates sentences of a language. One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator.
Regular Expressions A regular expression is one of the following: A character. The empty string, denoted by ε. Two regular expressions concatenated. Two regular expressions separated by | (i.e., or). A regular expression followed by the Kleene star (concatenation of zero or more strings generated by that expression). Numerical literals in Pascal may be generated by the following:
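The definition that followed on the original slide is not present in this text. As a minimal sketch, assuming the common textbook form of unsigned Pascal numbers (an integer part, an optional fraction, an optional exponent), the same construction can be written with Python's re module. The names DIGIT, UNSIGNED_INTEGER, UNSIGNED_NUMBER and is_numeric_literal are illustrative, not taken from the lecture.

```python
import re

# Assumed definition (the slide's own figure is missing from this text):
#   digit            -> 0 | 1 | ... | 9
#   unsigned_integer -> digit digit*
#   unsigned_number  -> unsigned_integer ((. unsigned_integer) | ε)
#                       ((e (+ | - | ε) unsigned_integer) | ε)
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"          # concatenation + Kleene star
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(?:\.{UNSIGNED_INTEGER})?"               # "x | ε" written as "(x)?"
    rf"(?:[eE][+-]?{UNSIGNED_INTEGER})?"        # optional exponent
)

NUMBER_RE = re.compile(UNSIGNED_NUMBER)


def is_numeric_literal(text: str) -> bool:
    """Return True if the whole string is generated by UNSIGNED_NUMBER."""
    return NUMBER_RE.fullmatch(text) is not None
```

Under this assumed definition, 42, 3.14 and 6.02e23 are accepted, while 3. and .5 are rejected because a digit is required on both sides of the decimal point.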
Context-Free Grammars Developed by Noam Chomsky in the mid-1950s. Language generators, meant to describe the syntax of natural languages. Define a class of languages called context-free languages. Backus-Naur Form (1959): Invented by John Backus to describe Algol 58. BNF is equivalent to context-free grammars (CFGs). A CFG consists of: A set of terminals T. A set of non-terminals N. A start symbol S (a non-terminal). A set of productions.
BNF Fundamentals In BNF, abstractions are used to represent classes of syntactic structures: they act like syntactic variables (also called nonterminal symbols, or just nonterminals). Terminals are lexemes or tokens. A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals. Nonterminals are often italic or enclosed in angle brackets. Examples of BNF rules:
<ident_list> → identifier | identifier, <ident_list>
<if_stmt> → if <logic_expr> then <stmt>
Grammar: a finite non-empty set of rules. A start symbol is a special element of the nonterminals of a grammar.
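To make the idea of a grammar as a generator concrete, here is a small sketch that stores the <ident_list> rule as Python data and enumerates the sentences it derives. The dictionary layout and the function name sentences are assumptions made for illustration; the step limit exists only to keep the enumeration finite.

```python
# Each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "<ident_list>": [
        ["identifier"],
        ["identifier", ",", "<ident_list>"],
    ],
}


def sentences(symbols, max_steps):
    """Yield sentences derivable from `symbols` in at most `max_steps`
    rule applications, always expanding the leftmost nonterminal."""
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:
            if max_steps == 0:
                return  # out of budget: abandon this derivation
            for rhs in GRAMMAR[sym]:
                expanded = symbols[:i] + rhs + symbols[i + 1:]
                yield from sentences(expanded, max_steps - 1)
            return
    # No nonterminals left: the string consists of terminals only.
    yield " ".join(symbols)


derived = list(sentences(["<ident_list>"], 3))
```

With a budget of three rule applications, the start symbol <ident_list> derives lists of one, two and three identifiers.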
Context-Free Grammar Example Expression grammar with precedence and associativity:
Parse Tree Example 1 Parse tree for expression grammar (with precedence) for 3 + 4 * 5
Parse Tree Example 2 Parse tree for expression grammar (with left associativity) for 10 - 4 - 3
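The grammar and trees from these slides are not reproduced in this text. The sketch below assumes a conventional expression grammar (expr → term | expr add_op term, term → factor | term mult_op factor, factor → number | ( expr )) and builds nested tuples instead of full parse trees. The left-recursive rules are implemented as loops, which keeps operators left-associative; the function names tokenize, parse and evaluate are illustrative.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split an arithmetic expression into number and operator strings."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(text):
    """Return a tree of nested (operator, left, right) tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():                      # expr -> term { add_op term }
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())   # earlier result becomes left child
        return node

    def term():                      # term -> factor { mult_op factor }
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                    # factor -> number | ( expr )
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


def evaluate(node):
    """Evaluate a tree produced by parse()."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b
```

Because term sits below expr in the grammar, 3 + 4 * 5 groups the multiplication first; because each loop folds results to the left, 10 - 4 - 3 is read as (10 - 4) - 3.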
Scanner Recall that the scanner is responsible for: Tokenizing source. Removing comments. (Often) dealing with pragmas (i.e., significant comments). Saving text of identifiers, numbers, strings. Saving source locations (file, line, column) for error messages.
Scanning Example I Suppose we are building an ad-hoc (hand-written) scanner for Pascal: We read the characters one at a time with look-ahead. If it is one of the one-character tokens { ( ) [ ] , ; = + - etc. } we announce that token. If it is a ., we look at the next character: If that is a dot, we announce .. Otherwise, we announce . and reuse the look-ahead.
Scanning Example II If it is a <, we look at the next character: if that is a =, we announce <=; otherwise, we announce < and reuse the look-ahead, etc. If it is a letter, we keep reading letters and digits and maybe underscores until we can't anymore: Then we check to see if it is a reserved word. If it is a digit, we keep reading until we find a non-digit: If that is not a ., we announce an integer. Otherwise, we keep looking for a real number. If the character after the . is not a digit, we announce an integer and reuse the . and the look-ahead.
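The steps in the two scanning examples can be sketched as a hand-written scanner. This is only an illustration under stated assumptions: the reserved-word set is a tiny made-up subset, only < and <= are handled among the multi-character operators, and look-ahead is done by indexing into the string instead of through a one-character buffer. The name scan and the (kind, lexeme) token shape are hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
ONE_CHAR = set("()[],;=+-*/")
RESERVED = {"begin", "end", "if", "then", "else", "while", "do"}  # subset


def scan(src):
    """Return a list of (kind, lexeme) pairs for a Pascal-like fragment."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c in ONE_CHAR:
            tokens.append((c, c))
            i += 1
        elif c == ".":
            # One character of look-ahead decides between . and ..
            if i + 1 < n and src[i + 1] == ".":
                tokens.append(("..", ".."))
                i += 2
            else:
                tokens.append((".", "."))
                i += 1
        elif c == "<":
            if i + 1 < n and src[i + 1] == "=":
                tokens.append(("<=", "<="))
                i += 2
            else:
                tokens.append(("<", "<"))
                i += 1
        elif c in LETTERS:
            start = i
            while i < n and (src[i] in LETTERS or src[i] in DIGITS or src[i] == "_"):
                i += 1
            word = src[start:i]
            # Pascal is case-insensitive, so compare in lower case.
            kind = "reserved" if word.lower() in RESERVED else "identifier"
            tokens.append((kind, word))
        elif c in DIGITS:
            start = i
            while i < n and src[i] in DIGITS:
                i += 1
            # A real needs "." followed by a digit; otherwise the "."
            # is left in the input for the next token.
            if i + 1 < n and src[i] == "." and src[i + 1] in DIGITS:
                i += 1
                while i < n and src[i] in DIGITS:
                    i += 1
                tokens.append(("real", src[start:i]))
            else:
                tokens.append(("integer", src[start:i]))
        else:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")
    return tokens
```

For the input 3..5 this sketch announces an integer, then .., then another integer, which is the case the digit rule is designed to protect.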
Deterministic Finite Automaton Pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. This is a deterministic finite automaton (DFA): Lex, scangen, ANTLR, etc. build these things automatically from a set of regular expressions. Specifically, they construct a machine that accepts the language.
The Longest Possible Token Rule We run the scanner over and over to get one token after another. Nearly universal rule: always take the longest possible token from the input, thus: foobar is foobar and never f or foo or foob. The rule means you return only when the next character can't be used to continue the current token: The next character will generally be saved for the next token. In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed: In Pascal, for example, when you have a 3 and you see a dot: Do you proceed (in hopes of getting 3.14)? Or do you stop (in fear of getting 3..5)? Regular expressions "generate" a regular language. DFAs "recognize" a regular language.
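One way to illustrate the longest-possible-token rule is to try every token pattern at the current position and keep the longest match. This is a sketch of the rule itself, not of how any particular scanner generator implements it; the pattern list and the names PATTERNS, longest_match and tokenize_longest are assumptions for the example.

```python
import re

# Illustrative token patterns. On a tie in length, the earlier entry wins.
PATTERNS = [
    ("real", re.compile(r"[0-9]+\.[0-9]+")),
    ("integer", re.compile(r"[0-9]+")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("dotdot", re.compile(r"\.\.")),
    ("dot", re.compile(r"\.")),
    ("le", re.compile(r"<=")),
    ("lt", re.compile(r"<")),
]


def longest_match(text, pos):
    """Return (kind, lexeme, end) for the longest token starting at pos,
    or None if no pattern matches there."""
    best = None
    for kind, pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[2]):
            best = (kind, m.group(), m.end())
    return best


def tokenize_longest(text):
    """Repeatedly take the longest possible token, skipping whitespace."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = longest_match(text, pos)
        if best is None:
            raise SyntaxError(f"no token matches at offset {pos}")
        kind, lexeme, pos = best
        out.append((kind, lexeme))
    return out
```

On 3.14 the real pattern wins because it is longer than the integer match; on 3..5 the real pattern fails (no digit follows the first dot), so the scanner stops at the integer 3 and the next token is the two-character dot-dot.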
Building Scanners Scanners tend to be built three ways: Ad-hoc. Semi-mechanical pure DFA (usually as nested case statements). Table-driven DFA. Ad-hoc generally yields the fastest, most compact code by doing lots of special-purpose things, though good automatically-generated scanners come very close. Writing a pure DFA as a set of nested case statements is a surprisingly useful programming technique (Figure 12.1), though it is often easier to use perl, awk, sed, or similar tools. Table-driven DFA is what lex and scangen produce: lex (flex): C code. scangen: numeric tables and a separate driver (Figure 2.12). ANTLR: Java code.
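A table-driven scanner separates the transition table from a generic driver. The sketch below is a hand-made Python illustration of that split for identifiers, integers and reals; it is not the output format of lex, scangen or ANTLR, and the state names, TABLE, ACCEPTING and next_token are all hypothetical. The driver remembers the last accepting state so it can back up when it has read too far.

```python
import string


def char_class(c):
    """Map a character to the column label used in the transition table."""
    if c in string.ascii_letters:
        return "letter"
    if c in string.digits:
        return "digit"
    if c == ".":
        return "dot"
    if c == "_":
        return "underscore"
    return "other"


# state -> {character class -> next state}; a missing entry means "stuck".
TABLE = {
    "start": {"letter": "in_id", "digit": "in_int"},
    "in_id": {"letter": "in_id", "digit": "in_id", "underscore": "in_id"},
    "in_int": {"digit": "in_int", "dot": "int_dot"},
    "int_dot": {"digit": "in_real"},   # non-accepting: seen "3." so far
    "in_real": {"digit": "in_real"},
}

# Accepting states and the token kind each one announces.
ACCEPTING = {"in_id": "identifier", "in_int": "integer", "in_real": "real"}


def next_token(text, pos):
    """Run the DFA from pos; return (kind, lexeme, end) for the longest
    accepted prefix, backing up to the last accepting state if needed."""
    state = "start"
    last_accept = None
    i = pos
    while i < len(text):
        nxt = TABLE[state].get(char_class(text[i]))
        if nxt is None:
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        raise SyntaxError(f"no token at offset {pos}")
    kind, end = last_accept
    return kind, text[pos:end], end
```

Changing the token set then means editing TABLE and ACCEPTING only; the driver stays the same, which is the main appeal of the table-driven style.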
Summary BNF and context-free grammars are equivalent metalanguages that are well suited for describing the syntax of programming languages. Syntax analysis is a common part of language implementation. Scanners (lexical analyzers) use pattern matching to isolate small-scale parts of a program. ANTLR provides support for scanners (lexers), parsers, and tree parsers.