Compilers - 2
CSCI/CMPE 3334
David Egle
The Structure of a Compiler
What's a token?
A token is a syntactic category.
- In English: noun, verb, adjective, ...
- In a programming language: Identifier, Integer, Keyword, Whitespace, ...
- The parser relies on the token distinctions: identifiers are treated differently than keywords, but all identifiers are treated the same, regardless of which lexeme created them.
What are lexemes?
- Webster: "items in the vocabulary of a language".
- Compilers: much the same: items in the vocabulary of the programming language (numbers, keywords, identifiers, operators, etc.); concretely, the strings into which the input string is partitioned.
Lexical Analysis
The input is just a sequence of characters. Example:

    if (i == j)
        z = 0;
    else
        z = 1;

More accurately, the input is the string:
    \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Goal: find lexemes and map them to tokens:
1. partition the input string into substrings (called lexemes), and
2. classify each one according to its role (the role is the token).
Continued
Scanner input:
    \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
is partitioned into these lexemes:
    \t  if  (  i  ==  j  )  \n\t\t  z  =  0  ;  \n\t  else  \n\t\t  z  =  1  ;
which are mapped to a sequence of tokens:
    IF, LPAR, ID("i"), EQUALS, ID("j"), ...
Notes:
- whitespace lexemes are dropped, not mapped to tokens
- some tokens have attributes: the lexeme and/or the line number (why do we need them? see the sketch below)
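To make the idea of token attributes concrete, here is a minimal sketch in C++ of what the scanner could hand to the parser; the names (TokenKind, Token, the specific categories) are illustrative choices made here, not taken from the slides:

    #include <string>

    // Hypothetical token categories for the fragment above.
    enum class TokenKind { IF, ELSE, LPAR, RPAR, ID, INT_LIT, EQUALS, ASSIGN, SEMI, END_OF_INPUT };

    // A token carries its category plus the attributes mentioned on the slide:
    // the lexeme it came from and the line number, used later for symbol-table
    // entries and error messages.
    struct Token {
        TokenKind kind;
        std::string lexeme;   // e.g. "i", "0", "=="
        int line;             // where the lexeme appeared, for error reporting
    };

The attributes answer the "why do we need them?" question: without the lexeme the parser could not tell ID("i") from ID("j"), and without the line number errors could not be reported usefully.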
Output of the scanner
A stream of tokens.
- Tokens are usually given numeric values for efficiency (see Figure 5.5, p. 234 in the text).
- For identifiers and constants, use a specific code together with the name/value.
- NOTE that the scanner and parser work together: the parser calls the scanner whenever another token is required, so there are not usually separate scan and parse phases (a minimal sketch of this interface follows).
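A minimal sketch of that interaction, assuming the Token type from the previous sketch; getNextToken() is an illustrative name, not the text's:

    // The parser pulls tokens on demand instead of reading a pre-scanned list,
    // so scanning and parsing are interleaved rather than separate phases.
    Token getNextToken() {
        // ... read characters, recognize the next lexeme, build a Token ...
        return Token{TokenKind::END_OF_INPUT, "", 0};   // placeholder body
    }

    void parse() {
        for (Token t = getNextToken(); t.kind != TokenKind::END_OF_INPUT;
             t = getNextToken()) {
            // ... parser logic driven by t.kind ...
        }
    }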
Other duties of the lexical analyzer
- Read source lines.
- Skip over comments (think about the difficulty here; a sketch follows).
- Print a listing file if required: the source lines together with error messages; it may also include a symbol table.
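The difficulty with comments is mostly bookkeeping: they can span lines, the line count must stay correct, and an unterminated comment should be reported. A rough sketch, assuming C-style /* ... */ comments (the slides do not name a particular comment syntax):

    #include <istream>

    // Skip a /* ... */ comment whose opening /* has already been consumed.
    void skipBlockComment(std::istream& in, int& line) {
        char prev = '\0', c;
        while (in.get(c)) {
            if (c == '\n') ++line;                 // comments may span several lines
            if (prev == '*' && c == '/') return;   // found the closing */
            prev = c;
        }
        // Falling out of the loop means end of input inside a comment;
        // a real scanner would report an error at the recorded line number.
    }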
Language dependencies
- Special formatting: COBOL used indents; FORTRAN had special columns; most modern languages are free format.
- How are blanks handled? How are lines continued?
- Special structures, e.g. Fortran: DO I = 1,23 (a loop) vs. DO I = 123 (because blanks are not significant, the second is an assignment to the variable DOI).
- Keyword significance.
Finite Automata
An automaton is a good "visual" aid, but it is not suitable as a specification (its textual description is too clumsy). Regular languages (the formal specification) can be recognized by finite-state machines. Note: CS majors will cover this topic in greater detail in CSCI 4325.
Finite Automata State Graphs
[State-graph notation from the slide: an arrow into a node marks the start state, a double circle marks a final state, and an edge labeled with a symbol such as "a" marks a transition on that input.]
Finite Automata
A transition s1 -> s2 on input "a" is read: "in state s1, on input a, go to state s2".
- At end of input: if in an accepting state, accept; otherwise, reject.
- If no transition is possible (the automaton got stuck), reject.
Scanners as finite automata
- Begin in the starting state.
- As characters are read, move between states.
- If the end of input is reached in a final state, accept; otherwise reject.
[The slide's example automaton has states 1-4 with transitions on a, b, c: the strings "abc" and "abccabc" are accepted, "ac" is rejected. A table-driven simulation of this idea is sketched below.]
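A minimal table-driven simulation of these rules. The particular transition table is only a guess at the slide's figure (it accepts (abcc)*abc, which is consistent with the three examples); the simulator itself just implements the accept/reject rules stated above:

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>

    // Run a DFA: start in the start state, follow one transition per input
    // character, and accept only if the input ends in a final state.
    // A missing transition means the automaton is stuck, so we reject.
    bool runDFA(const std::map<std::pair<int, char>, int>& delta, int start,
                const std::set<int>& finals, const std::string& input) {
        int state = start;
        for (char c : input) {
            auto it = delta.find({state, c});
            if (it == delta.end()) return false;   // no transition: reject
            state = it->second;
        }
        return finals.count(state) > 0;            // accept iff in a final state
    }

    int main() {
        // States 1-4; transitions on a, b, c; state 4 is final (a guess at the figure).
        std::map<std::pair<int, char>, int> delta = {
            {{1, 'a'}, 2}, {{2, 'b'}, 3}, {{3, 'c'}, 4}, {{4, 'c'}, 1}};
        std::set<int> finals = {4};
        std::cout << runDFA(delta, 1, finals, "abc")     << "\n";   // 1 (accept)
        std::cout << runDFA(delta, 1, finals, "abccabc") << "\n";   // 1 (accept)
        std::cout << runDFA(delta, 1, finals, "ac")      << "\n";   // 0 (reject)
    }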
Recognizing identifiers
Finding the correct strings; note that the automaton shown on the slide also allows the underscore character.
Code to recognize an identifier
[The hand-written recognition code from the slide is not reproduced; a rough equivalent is sketched below.]
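This is a rough stand-in for that code, not the text's own: it accepts a letter followed by any mix of letters, digits, and underscores, matching the "allows underscore" note above.

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Return true if s is a valid identifier: a letter, then any number of
    // letters, digits, or underscores. (Whether an identifier may start with
    // an underscore varies by language; here it may not.)
    bool isIdentifier(const std::string& s) {
        if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
            return false;                       // must start with a letter
        for (std::size_t i = 1; i < s.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            if (!std::isalnum(c) && c != '_')
                return false;                   // any other character is rejected
        }
        return true;
    }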
An easier way
Use the finite automaton (b) represented in tabular form.
- The program traverses the table: given the current state and the incoming symbol, the new state (or an error) is determined.
- Much cleaner and less error-prone than hand-written code.
- Easier to add to the table than to the code.
(A table-driven version of the identifier recognizer is sketched below.)
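A sketch of the same identifier recognizer in table-driven form. The character classes, states, and table layout here are illustrative, not copied from the text's figure:

    #include <cctype>
    #include <string>

    enum CharClass { LETTER, DIGIT, UNDERSCORE, OTHER };   // columns of the table

    CharClass classify(unsigned char c) {
        if (std::isalpha(c)) return LETTER;
        if (std::isdigit(c)) return DIGIT;
        if (c == '_') return UNDERSCORE;
        return OTHER;
    }

    // States: 0 = start, 1 = inside an identifier, -1 = error.
    // nextState[state][class] gives the new state; the recognition logic now
    // lives in data, so changing the automaton means editing the table, not code.
    const int nextState[2][4] = {
        //            LETTER  DIGIT  UNDERSCORE  OTHER
        /* start */ {    1,    -1,      -1,       -1 },
        /* ident */ {    1,     1,       1,       -1 },
    };

    bool isIdentifierTable(const std::string& s) {
        int state = 0;
        for (char ch : s) {
            state = nextState[state][classify(static_cast<unsigned char>(ch))];
            if (state < 0) return false;         // the table says: error
        }
        return state == 1;                       // accepted only after at least one letter
    }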
Small problem
How would you check limits on the lengths of the strings being recognized? This is not easily done with finite automata; it would need further checking (e.g. counting characters as the lexeme is collected).
Automata to recognize integers
- Automaton (c) accepts integers with leading 0's.
- Automaton (d) does not allow leading zeros, other than for the value 0 itself.
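The same distinction written as code rather than as state diagrams; the function names are illustrative only:

    #include <cctype>
    #include <string>

    // Variant (c): any non-empty string of digits; leading zeros are allowed.
    bool isIntegerAllowLeadingZeros(const std::string& s) {
        if (s.empty()) return false;
        for (char c : s)
            if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        return true;
    }

    // Variant (d): leading zeros are rejected, except for the single value "0".
    bool isIntegerNoLeadingZeros(const std::string& s) {
        if (!isIntegerAllowLeadingZeros(s)) return false;
        return s.size() == 1 || s[0] != '0';     // "0" is fine; "007" is not
    }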
Recognizing tokens for our grammar
[The slide shows the automaton that recognizes the grammar's tokens and, below it, its tabular representation; neither is reproduced here.]
Syntactic Analysis
Recognize source statements as language constructs, or build the parse tree for the statements. Two basic families of techniques:
- Bottom-up: operator-precedence parsing, shift-reduce parsing, LR(k) parsing
- Top-down: recursive-descent parsing
Syntactic Analysis – 2
- Input: the sequence of tokens from the scanner.
- Output: an abstract syntax tree (AST).
- Actually, the parser first builds a parse tree; the AST is then built by translating the parse tree. The parse tree is rarely built explicitly; it is determined only implicitly, e.g. by how the parser pushes things onto its stack.
Parse tree vs. abstract syntax tree
- Parse tree: contains all tokens, including those the parser needs "only" to discover the intended structure, such as nesting (parentheses, curly braces) and statement termination (semicolons). Technically, the parse tree shows concrete syntax.
- Abstract syntax tree (AST): abstracts away the artifacts of parsing by flattening tree hierarchies, dropping tokens, etc. Technically, the AST shows abstract syntax.
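As a sketch of what survives in an AST once those artifacts are dropped; the node layout and names are illustrative, not from the slides:

    #include <memory>
    #include <string>
    #include <vector>

    // A deliberately small AST node: no parentheses, semicolons, or other
    // concrete-syntax tokens remain; only operators, operands, and nesting.
    struct AstNode {
        std::string op;                               // e.g. "+", "*", "ID", "INT"
        std::string value;                            // lexeme for leaves, empty otherwise
        std::vector<std::unique_ptr<AstNode>> children;
    };

    // Example: the source text  b * (c + d)  becomes the tree
    //   *( ID(b), +( ID(c), ID(d) ) )
    // and the parentheses leave no trace.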
Tasks of the parser
The parser performs two tasks:
- Syntax checking: a program with a syntax error may produce an AST that is different from what the programmer intended.
- Parse tree construction: usually implicit; used to build the AST.
Operator-Precedence Parsing
- Operator: any terminal symbol (i.e., any token).
- Precedence: for example, * » + means that * takes precedence over +, and + « * means that + yields precedence to *.
- Operator-precedence parsing: based on such precedence relations between operators.
Precedence matrix for our simple Pascal grammar
Operator-Precedence parsing
- Examines pairs of consecutive operators.
- Decides the precedence of operators using the precedence table.
- Analyzes a statement by scanning for subexpressions whose operators have higher precedence than the surrounding operators; those are reduced first.
Operator-precedence algorithm
[The algorithm figure from the slide is not reproduced; a rough sketch of the idea follows.]
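Since the algorithm figure is not reproduced, here is a rough sketch of the idea, not the text's algorithm: compare the operator on top of the stack with the incoming one, shift when the incoming operator takes precedence, and reduce the stacked subexpression when it does not. To stay short, the sketch computes values instead of building a parse tree.

    #include <cctype>
    #include <stack>
    #include <string>

    int prec(char op) { return op == '*' ? 2 : 1; }       // '*' binds tighter than '+'

    // Pop one operator and two operands and push the combined result.
    void reduce(std::stack<int>& vals, std::stack<char>& ops) {
        int b = vals.top(); vals.pop();
        int a = vals.top(); vals.pop();
        char op = ops.top(); ops.pop();
        vals.push(op == '*' ? a * b : a + b);
    }

    // Evaluate an expression of single-digit numbers, '+' and '*'.
    int evaluate(const std::string& expr) {
        std::stack<int> vals;
        std::stack<char> ops;
        for (char c : expr) {
            if (std::isdigit(static_cast<unsigned char>(c))) {
                vals.push(c - '0');
            } else if (c == '+' || c == '*') {
                // While the stacked operator has precedence over the incoming
                // one, its subexpression is complete: reduce it first.
                while (!ops.empty() && prec(ops.top()) >= prec(c))
                    reduce(vals, ops);
                ops.push(c);                               // then shift the new operator
            }
        }
        while (!ops.empty()) reduce(vals, ops);            // reduce whatever remains
        return vals.top();
    }
    // evaluate("2+3*4") == 14 because * takes precedence over +.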
Example 1: the statement BEGIN READ (VALUE); and its parse tree (shown on the slide, not reproduced here).
Example 2: VARIANCE := SUMSQ DIV 100 – MEAN * MEAN;
[The slides work through this statement step by step across several continuation slides; the intermediate and final parse diagrams are not reproduced here.]
Shift-reduce parsing
Operator-precedence was one of the earliest bottom-up techniques, and one of the least powerful; it is simple to construct and is frequently used to parse expressions. The ideas behind it were later developed into a more general method known as shift-reduce parsing. Shift-reduce parsers use a stack to store tokens that have not yet been recognized in terms of the grammar.
Shift-reduce parser - 2
Two main actions:
- shift: push the current token onto the stack
- reduce: recognize the symbols on top of the stack according to a rule of the grammar
Note that shift roughly corresponds to the action taken by operator-precedence parsing in the first part of its IF, and reduce corresponds to the ELSE part.
Notes on shift-reduce parsers
- Can be applied to the class of grammars known as LR: L for left-to-right scan of the input, R for rightmost derivation (in reverse).
- It can be shown that for this class of grammars, the symbols to be recognized always appear on top of the stack (not inside it); this justifies the use of a stack in shift-reduce parsing.
LR(k) parsers
The most powerful shift-reduce parsing technique.
- k indicates the number of lookahead tokens; usually k = 1
- can handle a wide variety of grammars
- works fast and detects errors as soon as the first incorrect token is encountered
- LR(1) is usually referred to simply as LR
- uses a stack and a state table
LR state table
Two parts:
- action part: what is to be performed next
- go-to part: what state to go to after determining a nonterminal
Generating this table is the hardest part of building this kind of parser.
[The slide shows the table layout: one row per state, with the action entries under the terminal columns and the go-to entries under the nonterminal columns.]
Sample table
- Sn means shift and go to state n
- Rn means reduce according to rule n
Rules of the grammar:
1) E = E + T
2) E = T
3) T = T * F
4) T = F
5) F = ( E )
6) F = i
where E is an expression, T a term, F a factor, and i an identifier or integer.
[The table entries themselves are not reproduced here.]
Algorithm
Push state 0 on the stack.
Repeat:
    let q = current state and t = incoming token
    let x = table[q, t]
    case x of
        SHIFT q': push t and the new state q' on the stack
        REDUCE n: reduce by rule n; push the result and a state on the stack
        ACCEPT: parse complete
        ERROR: input error
until the input is accepted or an error occurs.
Reducing is complicated
- If the right-hand side of the rule contains k symbols, pop the top k "things" (each a symbol plus a state) off the stack; this is the handle.
- Note the state remaining on top of the stack; call it q.
- If the left side of the rule is X, look in the go-to part of the table at [q, X] and get a state, qk.
- Push X and qk on the stack.
(A small driver implementing shift, reduce, and go-to is sketched below.)
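A runnable sketch of such a driver. Because building a real table is the hard part, the example uses a tiny made-up grammar, 1) S -> ( S ) and 2) S -> x, rather than the expression grammar of the sample table; the shift/reduce/go-to mechanics are the same. Only the states are stacked here; a real parser would also stack the symbols or tree nodes.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    enum class Act { SHIFT, REDUCE, ACCEPT, ERROR };
    struct Entry { Act act; int arg; };                      // arg: target state or rule number

    // action[state][terminal]: the action part of the table.
    static const std::map<int, std::map<char, Entry>> action = {
        {0, {{'(', {Act::SHIFT, 2}}, {'x', {Act::SHIFT, 3}}}},
        {1, {{'$', {Act::ACCEPT, 0}}}},
        {2, {{'(', {Act::SHIFT, 2}}, {'x', {Act::SHIFT, 3}}}},
        {3, {{')', {Act::REDUCE, 2}}, {'$', {Act::REDUCE, 2}}}},
        {4, {{')', {Act::SHIFT, 5}}}},
        {5, {{')', {Act::REDUCE, 1}}, {'$', {Act::REDUCE, 1}}}},
    };
    static const std::map<int, int> goToS = {{0, 1}, {2, 4}};    // go-to part (only nonterminal S)
    static const int rhsLen[] = {0, 3, 1};                       // right-hand-side lengths of rules 1, 2

    bool parse(const std::string& input) {                   // input must end with '$'
        std::vector<int> states = {0};                        // push state 0 first
        std::size_t pos = 0;
        while (true) {
            const auto& row = action.at(states.back());
            auto it = row.find(input[pos]);
            Entry e = (it == row.end()) ? Entry{Act::ERROR, 0} : it->second;
            switch (e.act) {
            case Act::SHIFT:                                  // consume the token, enter the new state
                states.push_back(e.arg); ++pos; break;
            case Act::REDUCE:                                 // pop the handle, then consult go-to
                for (int i = 0; i < rhsLen[e.arg]; ++i) states.pop_back();
                states.push_back(goToS.at(states.back()));
                break;
            case Act::ACCEPT: return true;                    // parse complete
            case Act::ERROR:  return false;                   // input error
            }
        }
    }

    int main() {
        std::cout << parse("((x))$") << "\n";   // 1: valid
        std::cout << parse("(x$")    << "\n";   // 0: error detected at the '$'
    }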
Example
Using the sample state table:
- Is b * (c + d) a valid expression?
- What about b * (c d +)? How soon is the error detected?
Recursive-Descent Parsing
A recursive-descent parser tries productions, deriving strings, until a derived string matches the input string or until it is sure there is a syntax error. It is a top-down technique.
Example (recursive descent)
Consider the grammar
    <E> ::= <T> + <E> | <T>
    <T> ::= int | int * <T> | ( <E> )
The token stream is: int5 * int2
- Start with the top-level non-terminal <E>.
- Try the rules for <E> in order.
Example – cont'd
- Try E0 → T1 + E2. Then try a rule for T1: T1 → ( E3 ).
- But ( does not match the input token int5.
- Try T1 → int. The token matches, but the + after T1 does not match the input token *.
- Try T1 → int * T2. This will match, but the + after T1 will be unmatched.
- The choices for T1 are exhausted; backtrack to the choice for E0.
Example – cont'd
- Try E0 → T1 and follow the same steps as before for T1.
- This time succeed, with T1 → int * T2 and T2 → int.
- The resulting parse tree has E0 over T1, whose children are int5, *, and T2, with T2 over int2.
A Recursive Descent Parser
Define boolean functions that check the token string for a match of:
- a given token terminal:
    bool term(TOKEN tok) { return in[next++] == tok; }
- a given production of S (the nth):
    bool Sn() { … }
- any production of S:
    bool S() { … }
These functions advance the pointer next.
A Recursive Descent Parser
For the production <E> ::= <T> + <E>:
    bool E1() { return T() && term(PLUS) && E(); }
For the production <E> ::= <T>:
    bool E2() { return T(); }
For all productions of E (with backtracking):
    bool E() {
        int save = next;
        return (next = save, E1())    // try <T> + <E> first
            || (next = save, E2());   // on failure, restore next and try <T>
    }
A Recursive Descent Parser
Functions for the non-terminal T:
    bool T1() { return term(OPEN) && E() && term(CLOSE); }   // <T> ::= ( <E> )
    bool T2() { return term(INT) && term(TIMES) && T(); }    // <T> ::= int * <T>
    bool T3() { return term(INT); }                          // <T> ::= int
    bool T() {
        int save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }
A Recursive Descent Parser
To start the parser:
- initialize next to point to the first token
- invoke E()
Easy to implement by hand, but it does not always work … (A self-contained version of the whole parser is sketched below.)
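Putting the pieces together as one runnable sketch. The token encoding (an int-valued enum in a vector ending with an END marker) and the global names are assumptions made here; the slides do not specify them.

    #include <iostream>
    #include <vector>

    // Token codes for the toy grammar  <E> ::= <T> + <E> | <T>
    //                                  <T> ::= ( <E> ) | int * <T> | int
    enum TOKEN { INT, PLUS, TIMES, OPEN, CLOSE, END };

    std::vector<TOKEN> in;   // the token stream
    int next_;               // index of the next unmatched token ("next" on the slides)

    bool term(TOKEN tok) { return in[next_++] == tok; }

    bool E();                // forward declarations: E and T are mutually recursive
    bool T();

    bool E1() { return T() && term(PLUS) && E(); }             // <E> ::= <T> + <E>
    bool E2() { return T(); }                                  // <E> ::= <T>
    bool E()  { int save = next_;
                return (next_ = save, E1()) || (next_ = save, E2()); }

    bool T1() { return term(OPEN) && E() && term(CLOSE); }     // <T> ::= ( <E> )
    bool T2() { return term(INT) && term(TIMES) && T(); }      // <T> ::= int * <T>
    bool T3() { return term(INT); }                            // <T> ::= int
    bool T()  { int save = next_;
                return (next_ = save, T1()) || (next_ = save, T2())
                    || (next_ = save, T3()); }

    int main() {
        in = {INT, TIMES, INT, END};        // int5 * int2, as in the example
        next_ = 0;                          // initialize next to the first token
        bool ok = E() && in[next_] == END;  // the whole input must be consumed
        std::cout << (ok ? "accepted" : "rejected") << "\n";
    }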
When Recursive Descent Does Not Work
A left-recursive grammar has a non-terminal S with S →+ Sα for some α. Recursive descent does not work in such cases: it goes into an infinite loop.
Notes:
- α is shorthand for any string of terminals and nonterminals.
- The symbol →+ is shorthand for "can be derived in one or more steps": S →+ Sα is the same as S → … → Sα.
When Recursive Descent Does Not Work
Consider a production S → S a:
- In the process of parsing S we try the above rule, and immediately call S again without consuming any input; that infinite recursion is what goes wrong.
- A fix? S must have a non-recursive production, say S → b; expand this production before you expand S → S a.
- Problems remain: performance (count the steps needed to parse "baaaaa") and termination (try to parse the erroneous input "c").
Solutions
- First, restrict backtracking: backtrack just enough to produce a sufficiently powerful recursive-descent parser.
- Second, eliminate left recursion: a transformation that produces a different grammar. The new grammar generates the same strings, but does it give us the same parse tree as the old grammar? (The standard transformation is shown below.)
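For reference, here is the standard transformation applied to a left-recursive expression rule; this worked example is added here and is not on the slide:

    <E> ::= <E> + <T> | <T>

becomes

    <E>  ::= <T> <E'>
    <E'> ::= + <T> <E'> | ε

The new rules generate the same strings, but <E> no longer appears as the first symbol of its own right-hand side, so a recursive-descent parser for them does not loop forever.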
Summary of Recursive Descent
- A simple parsing strategy.
- Left recursion must be eliminated first … but that can be done automatically.
- Unpopular because of backtracking, which is thought to be too inefficient; in practice, backtracking is (sufficiently) eliminated by restricting the grammar, so it is good enough for small languages.
- Careful, though: the order of productions matters even after left recursion is eliminated. Try reversing the order of E → T + E | T; what goes wrong?
Predictive Parsers
- Like recursive descent, but the parser can "predict" which production to use by looking at the next few tokens, so there is no backtracking.
- Predictive parsers accept LL(k) grammars: the first L means "left-to-right" scan of the input, the second L means "leftmost derivation", and k means "predict based on k tokens of lookahead".
- In practice, LL(1) is used.