Compilers - 2
CSCI/CMPE 3334
David Egle
The Structure of a Compiler
What's a token?
A token is a syntactic category.
- In English: noun, verb, adjective, ...
- In a programming language: Identifier, Integer, Keyword, Whitespace, ...
- The parser relies on the token distinctions: identifiers are treated differently than keywords, but all identifiers are treated the same, regardless of which lexeme created them.
What are lexemes?
- Webster: "items in the vocabulary of a language".
- Compilers: much the same: items in the vocabulary of the programming language (numbers, keywords, identifiers, operators, etc.); concretely, the strings into which the input string is partitioned.
Lexical Analysis
The input is just a sequence of characters. Example:

    if (i == j)
        z = 0;
    else
        z = 1;

More accurately, the input is the string:
    \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Goal: find lexemes and map them to tokens:
1. partition the input string into substrings (called lexemes), and
2. classify each one according to its role (the role is the token).
Continued
Scanner input:
    \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
is partitioned into these lexemes:
    \t  if  (  i  ==  j  )  \n\t\t  z  =  0  ;  \n\t  else  \n\t\t  z  =  1  ;
which are mapped to a sequence of tokens:
    IF, LPAR, ID("i"), EQUALS, ID("j"), ...
Notes:
- whitespace lexemes are dropped, not mapped to tokens
- some tokens have attributes: the lexeme and/or the line number (why do we need them? see the sketch below)
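To make the idea of token attributes concrete, here is a minimal sketch in C++ of what the scanner could hand to the parser; the names (TokenKind, Token, the specific categories) are illustrative choices made here, not taken from the slides:

    #include <string>

    // Hypothetical token categories for the fragment above.
    enum class TokenKind { IF, ELSE, LPAR, RPAR, ID, INT_LIT, EQUALS, ASSIGN, SEMI, END_OF_INPUT };

    // A token carries its category plus the attributes mentioned on the slide:
    // the lexeme it came from and the line number, used later for symbol-table
    // entries and error messages.
    struct Token {
        TokenKind kind;
        std::string lexeme;   // e.g. "i", "0", "=="
        int line;             // where the lexeme appeared, for error reporting
    };

The attributes answer the "why do we need them?" question: without the lexeme the parser could not tell ID("i") from ID("j"), and without the line number errors could not be reported usefully.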
Output of the scanner
A stream of tokens.
- Tokens are usually given numeric values for efficiency (see Figure 5.5, p. 234 in the text).
- For identifiers and constants, use a specific code together with the name/value.
- NOTE that the scanner and parser work together: the parser calls the scanner whenever another token is required, so there are not usually separate scan and parse phases (a minimal sketch of this interface follows).
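A minimal sketch of that interaction, assuming the Token type from the previous sketch; getNextToken() is an illustrative name, not the text's:

    // The parser pulls tokens on demand instead of reading a pre-scanned list,
    // so scanning and parsing are interleaved rather than separate phases.
    Token getNextToken() {
        // ... read characters, recognize the next lexeme, build a Token ...
        return Token{TokenKind::END_OF_INPUT, "", 0};   // placeholder body
    }

    void parse() {
        for (Token t = getNextToken(); t.kind != TokenKind::END_OF_INPUT;
             t = getNextToken()) {
            // ... parser logic driven by t.kind ...
        }
    }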
Other duties of the lexical analyzer
- Read source lines.
- Skip over comments (think about the difficulty here; a sketch follows).
- Print a listing file if required: the source lines together with error messages; it may also include a symbol table.
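The difficulty with comments is mostly bookkeeping: they can span lines, the line count must stay correct, and an unterminated comment should be reported. A rough sketch, assuming C-style /* ... */ comments (the slides do not name a particular comment syntax):

    #include <istream>

    // Skip a /* ... */ comment whose opening /* has already been consumed.
    void skipBlockComment(std::istream& in, int& line) {
        char prev = '\0', c;
        while (in.get(c)) {
            if (c == '\n') ++line;                 // comments may span several lines
            if (prev == '*' && c == '/') return;   // found the closing */
            prev = c;
        }
        // Falling out of the loop means end of input inside a comment;
        // a real scanner would report an error at the recorded line number.
    }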
Language dependencies
- Special formatting: COBOL used indents; FORTRAN had special columns; most modern languages are free format.
- How are blanks handled? How are lines continued?
- Special structures, e.g. Fortran: DO I = 1,23 (a loop) vs. DO I = 123 (because blanks are not significant, the second is an assignment to the variable DOI).
- Keyword significance.
Finite Automata
An automaton is a good "visual" aid, but it is not suitable as a specification (its textual description is too clumsy). Regular languages (the formal specification) can be recognized by finite-state machines. Note: CS majors will cover this topic in greater detail in CSCI 4325.
Finite Automata State Graphs
[State-graph notation from the slide: an arrow into a node marks the start state, a double circle marks a final state, and an edge labeled with a symbol such as "a" marks a transition on that input.]
Finite Automata
A transition s1 -> s2 on input "a" is read: "in state s1, on input a, go to state s2".
- At end of input: if in an accepting state, accept; otherwise, reject.
- If no transition is possible (the automaton got stuck), reject.
Scanners as finite automata
- Begin in the starting state.
- As characters are read, move between states.
- If the end of input is reached in a final state, accept; otherwise reject.
[The slide's example automaton has states 1-4 with transitions on a, b, c: the strings "abc" and "abccabc" are accepted, "ac" is rejected. A table-driven simulation of this idea is sketched below.]
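A minimal table-driven simulation of these rules. The particular transition table is only a guess at the slide's figure (it accepts (abcc)*abc, which is consistent with the three examples); the simulator itself just implements the accept/reject rules stated above:

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>

    // Run a DFA: start in the start state, follow one transition per input
    // character, and accept only if the input ends in a final state.
    // A missing transition means the automaton is stuck, so we reject.
    bool runDFA(const std::map<std::pair<int, char>, int>& delta, int start,
                const std::set<int>& finals, const std::string& input) {
        int state = start;
        for (char c : input) {
            auto it = delta.find({state, c});
            if (it == delta.end()) return false;   // no transition: reject
            state = it->second;
        }
        return finals.count(state) > 0;            // accept iff in a final state
    }

    int main() {
        // States 1-4; transitions on a, b, c; state 4 is final (a guess at the figure).
        std::map<std::pair<int, char>, int> delta = {
            {{1, 'a'}, 2}, {{2, 'b'}, 3}, {{3, 'c'}, 4}, {{4, 'c'}, 1}};
        std::set<int> finals = {4};
        std::cout << runDFA(delta, 1, finals, "abc")     << "\n";   // 1 (accept)
        std::cout << runDFA(delta, 1, finals, "abccabc") << "\n";   // 1 (accept)
        std::cout << runDFA(delta, 1, finals, "ac")      << "\n";   // 0 (reject)
    }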
Recognizing identifiers
Finding the correct strings; note that the automaton shown on the slide also allows the underscore character.
Code to recognize an identifier
[The hand-written recognition code from the slide is not reproduced; a rough equivalent is sketched below.]
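This is a rough stand-in for that code, not the text's own: it accepts a letter followed by any mix of letters, digits, and underscores, matching the "allows underscore" note above.

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Return true if s is a valid identifier: a letter, then any number of
    // letters, digits, or underscores. (Whether an identifier may start with
    // an underscore varies by language; here it may not.)
    bool isIdentifier(const std::string& s) {
        if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
            return false;                       // must start with a letter
        for (std::size_t i = 1; i < s.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            if (!std::isalnum(c) && c != '_')
                return false;                   // any other character is rejected
        }
        return true;
    }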
An easier way
Use the finite automaton (b) represented in tabular form.
- The program traverses the table: given the current state and the incoming symbol, the new state (or an error) is determined.
- Much cleaner and less error-prone than hand-written code.
- Easier to add to the table than to the code.
(A table-driven version of the identifier recognizer is sketched below.)
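A sketch of the same identifier recognizer in table-driven form. The character classes, states, and table layout here are illustrative, not copied from the text's figure:

    #include <cctype>
    #include <string>

    enum CharClass { LETTER, DIGIT, UNDERSCORE, OTHER };   // columns of the table

    CharClass classify(unsigned char c) {
        if (std::isalpha(c)) return LETTER;
        if (std::isdigit(c)) return DIGIT;
        if (c == '_') return UNDERSCORE;
        return OTHER;
    }

    // States: 0 = start, 1 = inside an identifier, -1 = error.
    // nextState[state][class] gives the new state; the recognition logic now
    // lives in data, so changing the automaton means editing the table, not code.
    const int nextState[2][4] = {
        //            LETTER  DIGIT  UNDERSCORE  OTHER
        /* start */ {    1,    -1,      -1,       -1 },
        /* ident */ {    1,     1,       1,       -1 },
    };

    bool isIdentifierTable(const std::string& s) {
        int state = 0;
        for (char ch : s) {
            state = nextState[state][classify(static_cast<unsigned char>(ch))];
            if (state < 0) return false;         // the table says: error
        }
        return state == 1;                       // accepted only after at least one letter
    }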
Small problem
How would you check limits on the lengths of the strings being recognized? This is not easily done with finite automata; it would need further checking (e.g. counting characters as the lexeme is collected).
Automata to recognize integers
- Automaton (c) accepts integers with leading 0's.
- Automaton (d) does not allow leading zeros, other than for the value 0 itself.
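The same distinction written as code rather than as state diagrams; the function names are illustrative only:

    #include <cctype>
    #include <string>

    // Variant (c): any non-empty string of digits; leading zeros are allowed.
    bool isIntegerAllowLeadingZeros(const std::string& s) {
        if (s.empty()) return false;
        for (char c : s)
            if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        return true;
    }

    // Variant (d): leading zeros are rejected, except for the single value "0".
    bool isIntegerNoLeadingZeros(const std::string& s) {
        if (!isIntegerAllowLeadingZeros(s)) return false;
        return s.size() == 1 || s[0] != '0';     // "0" is fine; "007" is not
    }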
Recognizing tokens for our grammar
[The slide shows the automaton that recognizes the grammar's tokens and, below it, its tabular representation; neither is reproduced here.]
Syntactic Analysis
Recognize source statements as language constructs, or build the parse tree for the statements. Two basic families of techniques:
- Bottom-up: operator-precedence parsing, shift-reduce parsing, LR(k) parsing
- Top-down: recursive-descent parsing
Syntactic Analysis – 2
- Input: the sequence of tokens from the scanner.
- Output: an abstract syntax tree (AST).
- Actually, the parser first builds a parse tree; the AST is then built by translating the parse tree. The parse tree is rarely built explicitly; it is determined only implicitly, e.g. by how the parser pushes things onto its stack.
Parse tree vs. abstract syntax tree
- Parse tree: contains all tokens, including those the parser needs "only" to discover the intended structure, such as nesting (parentheses, curly braces) and statement termination (semicolons). Technically, the parse tree shows concrete syntax.
- Abstract syntax tree (AST): abstracts away the artifacts of parsing by flattening tree hierarchies, dropping tokens, etc. Technically, the AST shows abstract syntax.
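As a sketch of what survives in an AST once those artifacts are dropped; the node layout and names are illustrative, not from the slides:

    #include <memory>
    #include <string>
    #include <vector>

    // A deliberately small AST node: no parentheses, semicolons, or other
    // concrete-syntax tokens remain; only operators, operands, and nesting.
    struct AstNode {
        std::string op;                               // e.g. "+", "*", "ID", "INT"
        std::string value;                            // lexeme for leaves, empty otherwise
        std::vector<std::unique_ptr<AstNode>> children;
    };

    // Example: the source text  b * (c + d)  becomes the tree
    //   *( ID(b), +( ID(c), ID(d) ) )
    // and the parentheses leave no trace.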
Tasks of the parser
The parser performs two tasks:
- Syntax checking: a program with a syntax error may produce an AST that is different from what the programmer intended.
- Parse tree construction: usually implicit; used to build the AST.
Operator-Precedence Parsing
- Operator: any terminal symbol (i.e., any token).
- Precedence: for example, * » + means that * takes precedence over +, and + « * means that + yields precedence to *.
- Operator-precedence parsing: based on such precedence relations between operators.
Precedence matrix for our simple Pascal grammar
Operator-Precedence parsing
- Examines pairs of consecutive operators.
- Decides the precedence of operators using the precedence table.
- Analyzes a statement by scanning for subexpressions whose operators have higher precedence than the surrounding operators; those are reduced first.
Operator-precedence algorithm
[The algorithm figure from the slide is not reproduced; a rough sketch of the idea follows.]
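Since the algorithm figure is not reproduced, here is a rough sketch of the idea, not the text's algorithm: compare the operator on top of the stack with the incoming one, shift when the incoming operator takes precedence, and reduce the stacked subexpression when it does not. To stay short, the sketch computes values instead of building a parse tree.

    #include <cctype>
    #include <stack>
    #include <string>

    int prec(char op) { return op == '*' ? 2 : 1; }       // '*' binds tighter than '+'

    // Pop one operator and two operands and push the combined result.
    void reduce(std::stack<int>& vals, std::stack<char>& ops) {
        int b = vals.top(); vals.pop();
        int a = vals.top(); vals.pop();
        char op = ops.top(); ops.pop();
        vals.push(op == '*' ? a * b : a + b);
    }

    // Evaluate an expression of single-digit numbers, '+' and '*'.
    int evaluate(const std::string& expr) {
        std::stack<int> vals;
        std::stack<char> ops;
        for (char c : expr) {
            if (std::isdigit(static_cast<unsigned char>(c))) {
                vals.push(c - '0');
            } else if (c == '+' || c == '*') {
                // While the stacked operator has precedence over the incoming
                // one, its subexpression is complete: reduce it first.
                while (!ops.empty() && prec(ops.top()) >= prec(c))
                    reduce(vals, ops);
                ops.push(c);                               // then shift the new operator
            }
        }
        while (!ops.empty()) reduce(vals, ops);            // reduce whatever remains
        return vals.top();
    }
    // evaluate("2+3*4") == 14 because * takes precedence over +.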
Example 1: the statement BEGIN READ (VALUE); and its parse tree (shown on the slide, not reproduced here).
Example 2: VARIANCE := SUMSQ DIV 100 – MEAN * MEAN;
[The slides work through this statement step by step across several continuation slides; the intermediate and final parse diagrams are not reproduced here.]
Shift-reduce parsing
Operator-precedence was one of the earliest bottom-up techniques, and one of the least powerful; it is simple to construct and is frequently used to parse expressions. The ideas behind it were later developed into a more general method known as shift-reduce parsing. Shift-reduce parsers use a stack to store tokens that have not yet been recognized in terms of the grammar.
Shift-reduce parser - 2
Two main actions:
- shift: push the current token onto the stack
- reduce: recognize the symbols on top of the stack according to a rule of the grammar
Note that shift roughly corresponds to the action taken by operator-precedence parsing in the first part of its IF, and reduce corresponds to the ELSE part.
Notes on shift-reduce parsers
- Can be applied to the class of grammars known as LR: L for left-to-right scan of the input, R for rightmost derivation (in reverse).
- It can be shown that for this class of grammars, the symbols to be recognized always appear on top of the stack (not inside it); this justifies the use of a stack in shift-reduce parsing.
LR(k) parsers
The most powerful shift-reduce parsing technique.
- k indicates the number of lookahead tokens; usually k = 1
- can handle a wide variety of grammars
- works fast and detects errors as soon as the first incorrect token is encountered
- LR(1) is usually referred to simply as LR
- uses a stack and a state table
LR state table
Two parts:
- action part: what is to be performed next
- go-to part: what state to go to after determining a nonterminal
Generating this table is the hardest part of building this kind of parser.
[The slide shows the table layout: one row per state, with the action entries under the terminal columns and the go-to entries under the nonterminal columns.]
Sample table
- Sn means shift and go to state n
- Rn means reduce according to rule n
Rules of the grammar:
1) E = E + T
2) E = T
3) T = T * F
4) T = F
5) F = ( E )
6) F = i
where E is an expression, T a term, F a factor, and i an identifier or integer.
[The table entries themselves are not reproduced here.]
Algorithm
Push state 0 on the stack.
Repeat:
    let q = current state and t = incoming token
    let x = table[q, t]
    case x of
        SHIFT q': push t and the new state q' on the stack
        REDUCE n: reduce by rule n; push the result and a state on the stack
        ACCEPT: parse complete
        ERROR: input error
until the input is accepted or an error occurs.
Reducing is complicated
- If the right-hand side of the rule contains k symbols, pop the top k "things" (each a symbol plus a state) off the stack; this is the handle.
- Note the state remaining on top of the stack; call it q.
- If the left side of the rule is X, look in the go-to part of the table at [q, X] and get a state, qk.
- Push X and qk on the stack.
(A small driver implementing shift, reduce, and go-to is sketched below.)
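A runnable sketch of such a driver. Because building a real table is the hard part, the example uses a tiny made-up grammar, 1) S -> ( S ) and 2) S -> x, rather than the expression grammar of the sample table; the shift/reduce/go-to mechanics are the same. Only the states are stacked here; a real parser would also stack the symbols or tree nodes.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    enum class Act { SHIFT, REDUCE, ACCEPT, ERROR };
    struct Entry { Act act; int arg; };                      // arg: target state or rule number

    // action[state][terminal]: the action part of the table.
    static const std::map<int, std::map<char, Entry>> action = {
        {0, {{'(', {Act::SHIFT, 2}}, {'x', {Act::SHIFT, 3}}}},
        {1, {{'$', {Act::ACCEPT, 0}}}},
        {2, {{'(', {Act::SHIFT, 2}}, {'x', {Act::SHIFT, 3}}}},
        {3, {{')', {Act::REDUCE, 2}}, {'$', {Act::REDUCE, 2}}}},
        {4, {{')', {Act::SHIFT, 5}}}},
        {5, {{')', {Act::REDUCE, 1}}, {'$', {Act::REDUCE, 1}}}},
    };
    static const std::map<int, int> goToS = {{0, 1}, {2, 4}};    // go-to part (only nonterminal S)
    static const int rhsLen[] = {0, 3, 1};                       // right-hand-side lengths of rules 1, 2

    bool parse(const std::string& input) {                   // input must end with '$'
        std::vector<int> states = {0};                        // push state 0 first
        std::size_t pos = 0;
        while (true) {
            const auto& row = action.at(states.back());
            auto it = row.find(input[pos]);
            Entry e = (it == row.end()) ? Entry{Act::ERROR, 0} : it->second;
            switch (e.act) {
            case Act::SHIFT:                                  // consume the token, enter the new state
                states.push_back(e.arg); ++pos; break;
            case Act::REDUCE:                                 // pop the handle, then consult go-to
                for (int i = 0; i < rhsLen[e.arg]; ++i) states.pop_back();
                states.push_back(goToS.at(states.back()));
                break;
            case Act::ACCEPT: return true;                    // parse complete
            case Act::ERROR:  return false;                   // input error
            }
        }
    }

    int main() {
        std::cout << parse("((x))$") << "\n";   // 1: valid
        std::cout << parse("(x$")    << "\n";   // 0: error detected at the '$'
    }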
Example
Using the sample state table:
- Is b * (c + d) a valid expression?
- What about b * (c d +)? How soon is the error detected?
Recursive-Descent Parsing
A recursive-descent parser tries productions, deriving strings, until a derived string matches the input string or until it is sure there is a syntax error. It is a top-down technique.
Example (recursive descent)
Consider the grammar
    <E> ::= <T> + <E> | <T>
    <T> ::= int | int * <T> | ( <E> )
The token stream is: int5 * int2
- Start with the top-level non-terminal <E>.
- Try the rules for <E> in order.
Example – cont'd
- Try E0 → T1 + E2. Then try a rule for T1: T1 → ( E3 ).
- But ( does not match the input token int5.
- Try T1 → int. The token matches, but the + after T1 does not match the input token *.
- Try T1 → int * T2. This will match, but the + after T1 will be unmatched.
- The choices for T1 are exhausted; backtrack to the choice for E0.
Example – cont'd
- Try E0 → T1 and follow the same steps as before for T1.
- This time succeed, with T1 → int * T2 and T2 → int.
- The resulting parse tree has E0 over T1, whose children are int5, *, and T2, with T2 over int2.
A Recursive Descent Parser
Define boolean functions that check the token string for a match of:
- a given token terminal:
    bool term(TOKEN tok) { return in[next++] == tok; }
- a given production of S (the nth):
    bool Sn() { … }
- any production of S:
    bool S() { … }
These functions advance the pointer next.
A Recursive Descent Parser
For the production <E> ::= <T> + <E>:
    bool E1() { return T() && term(PLUS) && E(); }
For the production <E> ::= <T>:
    bool E2() { return T(); }
For all productions of E (with backtracking):
    bool E() {
        int save = next;
        return (next = save, E1())    // try <T> + <E> first
            || (next = save, E2());   // on failure, restore next and try <T>
    }
A Recursive Descent Parser
Functions for the non-terminal T:
    bool T1() { return term(OPEN) && E() && term(CLOSE); }   // <T> ::= ( <E> )
    bool T2() { return term(INT) && term(TIMES) && T(); }    // <T> ::= int * <T>
    bool T3() { return term(INT); }                          // <T> ::= int
    bool T() {
        int save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }
A Recursive Descent Parser
To start the parser:
- initialize next to point to the first token
- invoke E()
Easy to implement by hand, but it does not always work … (A self-contained version of the whole parser is sketched below.)
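Putting the pieces together as one runnable sketch. The token encoding (an int-valued enum in a vector ending with an END marker) and the global names are assumptions made here; the slides do not specify them.

    #include <iostream>
    #include <vector>

    // Token codes for the toy grammar  <E> ::= <T> + <E> | <T>
    //                                  <T> ::= ( <E> ) | int * <T> | int
    enum TOKEN { INT, PLUS, TIMES, OPEN, CLOSE, END };

    std::vector<TOKEN> in;   // the token stream
    int next_;               // index of the next unmatched token ("next" on the slides)

    bool term(TOKEN tok) { return in[next_++] == tok; }

    bool E();                // forward declarations: E and T are mutually recursive
    bool T();

    bool E1() { return T() && term(PLUS) && E(); }             // <E> ::= <T> + <E>
    bool E2() { return T(); }                                  // <E> ::= <T>
    bool E()  { int save = next_;
                return (next_ = save, E1()) || (next_ = save, E2()); }

    bool T1() { return term(OPEN) && E() && term(CLOSE); }     // <T> ::= ( <E> )
    bool T2() { return term(INT) && term(TIMES) && T(); }      // <T> ::= int * <T>
    bool T3() { return term(INT); }                            // <T> ::= int
    bool T()  { int save = next_;
                return (next_ = save, T1()) || (next_ = save, T2())
                    || (next_ = save, T3()); }

    int main() {
        in = {INT, TIMES, INT, END};        // int5 * int2, as in the example
        next_ = 0;                          // initialize next to the first token
        bool ok = E() && in[next_] == END;  // the whole input must be consumed
        std::cout << (ok ? "accepted" : "rejected") << "\n";
    }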
When Recursive Descent Does Not Work
A left-recursive grammar has a non-terminal S with S →+ Sα for some α. Recursive descent does not work in such cases: it goes into an infinite loop.
Notes:
- α is shorthand for any string of terminals and nonterminals.
- The symbol →+ is shorthand for "can be derived in one or more steps": S →+ Sα is the same as S → … → Sα.
When Recursive Descent Does Not Work
Consider a production S → S a:
- In the process of parsing S we try the above rule, and immediately call S again without consuming any input; that infinite recursion is what goes wrong.
- A fix? S must have a non-recursive production, say S → b; expand this production before you expand S → S a.
- Problems remain: performance (count the steps needed to parse "baaaaa") and termination (try to parse the erroneous input "c").
Solutions
- First, restrict backtracking: backtrack just enough to produce a sufficiently powerful recursive-descent parser.
- Second, eliminate left recursion: a transformation that produces a different grammar. The new grammar generates the same strings, but does it give us the same parse tree as the old grammar? (The standard transformation is shown below.)
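For reference, here is the standard transformation applied to a left-recursive expression rule; this worked example is added here and is not on the slide:

    <E> ::= <E> + <T> | <T>

becomes

    <E>  ::= <T> <E'>
    <E'> ::= + <T> <E'> | ε

The new rules generate the same strings, but <E> no longer appears as the first symbol of its own right-hand side, so a recursive-descent parser for them does not loop forever.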
Summary of Recursive Descent
- A simple parsing strategy.
- Left recursion must be eliminated first … but that can be done automatically.
- Unpopular because of backtracking, which is thought to be too inefficient; in practice, backtracking is (sufficiently) eliminated by restricting the grammar, so it is good enough for small languages.
- Careful, though: the order of productions matters even after left recursion is eliminated. Try reversing the order of E → T + E | T; what goes wrong?
Predictive Parsers
- Like recursive descent, but the parser can "predict" which production to use by looking at the next few tokens, so there is no backtracking.
- Predictive parsers accept LL(k) grammars: the first L means "left-to-right" scan of the input, the second L means "leftmost derivation", and k means "predict based on k tokens of lookahead".
- In practice, LL(1) is used.