
1 Chapter 4 Syntax

2 Why split lexical analysis from parsing?
Lexical analysis: converting a sequence of characters into a sequence of tokens. Why split lexical analysis from parsing?
- Simplifies design: parsers that must also handle whitespace and comments are more awkward.
- Efficiency: use only the most powerful technique that works, and nothing more -- no parsing sledgehammers for lexical nuts.
- Portability: more modular code, more code re-use.

3 Source Code Characteristics
- Identifiers: Count, max, get_num
- Language keywords, reserved or predefined (reserved ones cannot be re-defined by the user): switch, if..then..else, printf, return, void
- Mathematical operators: +, *, >>, ..., <=, =, !=, ...
- Literals: "Hello World", e65
- Comments -- ignored by the compiler
- Whitespace

4 Reserved words versus predefined identifiers
Reserved words cannot be used as the name of anything in a definition (i.e., as an identifier). Predefined identifiers have special meanings, but can be redefined (although they probably shouldn't be).
Examples of predefined identifiers in Ruby: ARGV, STDERR, TRUE, FALSE, NIL.
Examples of reserved words in Ruby: alias and BEGIN begin break case class def defined? do else elsif END end ensure false for if in module next nil not or redo rescue retry return self super then true undef unless until when while yield
So TRUE has the value true but can be redefined; true cannot be.

5 Terminology of Lexical Analysis
- Token: the category -- often an enumerated type, e.g. IDENTIFIER (value 20)
- Pattern: the regular expression which defines the token
- Lexeme: the actual string matched (captured)

6 Knowing Tokens are not enough…
Clearly, if we replaced every occurrence of a variable with a token, we would lose other valuable information (value, name). These other data items are attributes of the tokens.

7 Token delimiters When does a token/lexeme end? e.g., xtemp=-ytemp

8 Ambiguity in identifying tokens
A programming language definition will state how to resolve uncertain token assignment. <> -- is it one token or two? Reserved keywords (e.g. if) take precedence over identifiers (the matching rules are the same for both). Disambiguating rules state what to do. 'Principle of longest substring': greedy match, e.g. (xtemp == 64.23).

9 Lexical Analysis What is hard? How do I describe the rules?
How do I find the tokens? What do I do with ambiguities: is ++ one symbol or two? (We said two, but that depends on my language, right?) Does it take lots of lookahead/backtracking? Example with no spaces: whileamerica>china versus whileamerica=china -- when do I know how to interpret the statement?

10 Wouldn’t it be nice? Wouldn’t it be nice if we could specify how to do lexical analysis by saying: here are the regular expressions associated with these token names -- please tell me the lexeme and token. And by the way: always match the longest string; among rules that match the same number of characters, use the one listed first. This is exactly what a program like lex/flex does.
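The two conventions above -- longest match, with ties broken by rule order -- can be sketched in a few lines of Python. This is a toy illustration, not the real lex/flex machinery, and the rule names and patterns are invented for the example:

```python
import re

# Toy rule set. At each input position every rule is tried; the longest match
# wins, and ties go to the earlier rule -- which is why IF precedes IDENTIFIER.
RULES = [
    ("IF", r"if"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"\d+(\.\d+)?"),
    ("ASSIGN", r"="),
    ("EQ", r"=="),      # still beats ASSIGN on '==': it matches more characters
    ("LE", r"<="),
    ("LT", r"<"),
]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        candidates = []
        for idx, (name, pattern) in enumerate(RULES):
            m = re.match(pattern, text[i:])
            if m and m.group():
                # sort key: longest lexeme first, then earliest rule
                candidates.append((len(m.group()), -idx, name, m.group()))
        if not candidates:
            raise SyntaxError(f"illegal character {text[i]!r}")
        length, _, name, lexeme = max(candidates)
        tokens.append((name, lexeme))
        i += length
    return tokens
```

Note how iffy lexes as one IDENTIFIER (longest match beats the IF rule), while a bare if lexes as the keyword (equal length, earlier rule wins).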

11 In JavaCC (a parser generator for Java), you specify
In JavaCC you specify your tokens using regular expressions -- give a name and the regular expression, and the system gives you a list of tokens/lexemes:

SKIP : { " " | "\r" | "\t" }
TOKEN : < EOL: "\n" >
TOKEN : /* OPERATORS */
  < PLUS: "+" >
| < MINUS: "-" >
| < MULTIPLY: "*" >
| < DIVIDE: "/" >
TOKEN :
  < FLOAT: ( (<DIGIT>)+ "." ( <DIGIT> )* ) | ( "." ( <DIGIT> )+ ) >
| < INTEGER: (<DIGIT>)+ >
| < #DIGIT: ["0" - "9"] >
TOKEN :
  < TYPE: ("float"|"int") >
| < IDENTIFIER: ( (["a"-"z"]) | (["A"-"Z"]) ) ( (["a"-"z"]) | (["A"-"Z"]) | (["0"-"9"]) | "_" )* >

12 Attendance is required.
Programs are required. Make a choice about what you intend to do.

13 Matching Regular Expressions
Important for lexical analysis: we want immediate results and we want to avoid backtracking.
Match .*baba!
Suppose the input is: animals in bandanas are barely babaababa!

14 Try it – write a recognizer for. baba
Try it -- write a recognizer for .*baba! or a quoted string (as you don't know which comes next). We want a "graph" of nodes and edges:
- Nodes represent "states" such as "I have read in ba".
- Arcs represent the input which causes you to go from one state to the next.
- Nodes denoted as "final states" or "accept states" mean the desired token is recognized.
Our states don't really need to have names, but we might assign a name to them.
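One way to build this recognizer by hand is to number the states by how many characters of baba! have just been matched. A Python sketch (the transition table is computed by brute force, chosen for clarity over speed):

```python
PATTERN = "baba!"

def build_dfa(pattern):
    # State k means "the input read so far ends with the first k characters
    # of the pattern". The accept state is k == len(pattern).
    alphabet = set(pattern)
    dfa = [dict() for _ in range(len(pattern) + 1)]
    for state in range(len(pattern) + 1):
        for ch in alphabet:
            s = (pattern[:state] + ch)[-len(pattern):]
            k = len(s)
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1                     # longest pattern prefix ending here
            dfa[state][ch] = k
    return dfa

def matches(text, pattern=PATTERN):
    # Full-match semantics for .*baba! : accept iff the input ends with baba!.
    # No backtracking: one table lookup per input character.
    dfa = build_dfa(pattern)
    state = 0
    for ch in text:
        state = dfa[state].get(ch, 0)      # characters outside the alphabet reset
    return state == len(pattern)
```

The point is the one the slide makes: each character is consumed exactly once, with no backtracking, because the states remember exactly as much history as the pattern needs.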

15 Deterministic Finite Automata (DFA)
A recognizer determines if an input string is a sentence in a language. It uses a regular expression: turn the regular expression into a finite automaton, which could be deterministic or non-deterministic.

16 Transition diagram for identifiers Notice the accept state means you have matched
RE: Identifier → letter (letter | digit)*
(Diagram: from the start state, a letter leads to state 1; state 1 loops on letter and digit; any other character leads to state 2, the accept state.)

17 Visit with your neighbor
Write a finite state machine which reads in zeroes and ones and accepts the string if there is an odd number of zeroes. If you can find such a machine, we say it "recognizes" the language of strings of 0/1 with an odd number of zeroes.

18 Note it does the minimal work
Note it does the minimal work. It doesn't count; it just keeps track of odd and even.
(Diagram: two states, even and odd; each 0 toggles between them, each 1 stays in the same state.)
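The parity machine translates directly into code; a minimal sketch:

```python
def odd_zeroes(s):
    # Two states suffice: the parity of zeroes seen so far.
    # The machine never counts; each '0' just toggles the state.
    state = "even"
    for ch in s:
        if ch == "0":
            state = "odd" if state == "even" else "even"
        # on '1' the machine stays in the same state
    return state == "odd"
```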

19 visit with your neighbor
Write a finite state machine which reads in zeroes and ones and accepts the string if there is an odd number of zeroes.
Write a finite state machine which accepts a string with the same number of zeroes and ones.
Note: the FS part of "FSA" means a finite number of states.

20 What does this match? How does this differ from the other kinds of FSA we have seen?

21 Thought question Does a non-deterministic FSA have more power than a deterministic FSA? In other words, are there languages (sets of strings) that an NFSA can "recognize" (say yes or no to) that a deterministic FSA can't make a decision about? No. In fact, we can always find a deterministic automaton that recognizes exactly the same language as a non-deterministic automaton.

22 An NFA (nondeterministic finite automaton) is similar to a DFA, but it also permits multiple transitions over the same character and transitions over ε (the empty string). The ε-transition doesn't consume any input characters, so you may jump to another state for free. When we are at a state and read a character for which there is more than one choice, the NFA succeeds if at least one of these choices succeeds. Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power.

23 Automating Scanner (lexical analyzer) Construction
To convert a specification into code:
1. Write down the RE for the input language.
2. Build a big NFA.
3. Build the DFA that simulates the NFA.
4. Systematically shrink the DFA.
5. Turn it into code.
Scanner generators: Lex and Flex work along these lines. The algorithms are well-known and well-understood.

24 From a Regular Expression to an NFA Thompson’s Construction
(a | b)* abb
(Diagram: the Thompson's-construction NFA for (a|b)*abb -- states 1 through 10 joined by ε-moves around parallel a and b branches, followed by a, b, b edges into the accept state. An ε (empty-string) match consumes no input.)

25 RE NFA using Thompson’s Construction
Key idea: an NFA pattern for each symbol and each operator; join them with ε-moves in precedence order.
(Diagrams: NFA for a; NFA for ab -- the machines for a and b in sequence; NFA for a | b -- the two machines in parallel, joined by ε-moves; NFA for a* -- ε-moves that allow skipping or repeating the machine for a.)
Ken Thompson, CACM, 1968

26 Example of Thompson’s Construction
Let’s try a ( b | c )*:
1. NFAs for a, b, and c
2. NFA for b | c
3. NFA for ( b | c )*
(Diagrams: each step combines the machines from the previous step with fresh states and ε-moves, ending with states S0 through S7.)

27 Example of Thompson’s Construction (con’t)
a ( b | c )*
Of course, a human would design something simpler: a two-state machine that reads a and then loops on b | c. But we can automate production of the more complex one ...

28 How would you use an NFA?
(Diagram: a non-deterministic finite state automaton (NFA) for (a|b)*abb -- the start state loops on a and b and also moves on a, b, b through states 1, 2, 3 to accept. Below it, the equivalent deterministic finite state automaton (DFA) with states 0, 01, 02, 03.)

29 NFA -> DFA (subset construction)
We can convert from an NFA to a DFA using subset construction. To perform this operation, let us define two functions:
- The ε-closure function takes a state and returns the set of states reachable from it on (one or more) ε-transitions. Note that this always includes the state itself: we should be able to get from a state to any state in its ε-closure without consuming any input.
- The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
We can generalize both functions to apply to sets of states by taking the union of the application to the individual states. E.g., if A, B and C are states, move({A,B,C},'a') = move(A,'a') ∪ move(B,'a') ∪ move(C,'a').

30 NFA -> DFA (cont) The Subset Construction Algorithm
1. Create the start state of the DFA by taking the ε-closure of the start state of the NFA.
2. Perform the following for the new DFA state -- for each possible input symbol:
   a. Apply move to the newly-created state and the input symbol; this will return a set of states.
   b. Apply the ε-closure to this set of states, possibly resulting in a new set. This set of NFA states will be a single state in the DFA.
3. Each time we generate a new DFA state, we must apply step 2 to it. The process is complete when applying step 2 does not yield any new states.
4. The finish states of the DFA are those which contain any of the finish states of the NFA.
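The algorithm above can be sketched in Python. The NFA encoding here -- dicts keyed by state and symbol -- is just one convenient choice, and the example NFA for (a|b)*abb mirrors the machine on the surrounding slides:

```python
from collections import deque

def epsilon_closure(states, eps):
    """All states reachable from `states` via zero or more ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, accept, delta, eps, alphabet):
    """delta maps (state, symbol) -> set of NFA states; eps maps state -> ε-successors."""
    dfa_start = epsilon_closure({start}, eps)
    dfa_delta, seen, work = {}, {dfa_start}, deque([dfa_start])
    while work:
        S = work.popleft()
        for a in alphabet:
            moved = set()
            for s in S:                       # move(S, a), generalized to a set
                moved |= delta.get((s, a), set())
            T = epsilon_closure(moved, eps)   # ε-closure of the moved-to set
            dfa_delta[(S, a)] = T
            if T not in seen:                 # a brand-new DFA state: process it too
                seen.add(T)
                work.append(T)
    accepting = {S for S in seen if accept in S}
    return dfa_start, accepting, dfa_delta

# NFA for (a|b)*abb: state 0 loops on a/b, and the path a, b, b reaches accept state 3.
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}

def accepts_abb(text):
    s0, accepting, d = subset_construction(0, 3, DELTA, {}, "ab")
    state = s0
    for ch in text:
        state = d[(state, ch)]
    return state in accepting
```

Each DFA state is a frozenset of NFA states, exactly as in the hand-drawn DFA whose states are labeled 0, 01, 02, 03.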

31 Use algorithm to transform
(Diagram: the NFA for (a|b)*abb -- start state looping on a and b, then a, b, b into accept state 3 -- and the equivalent DFA with states 0, 01, 02, 03 produced by the algorithm.)

32 Transition Table (DFA)
State | a  | b
0     | 01 | 0
01    | 01 | 02
02    | 01 | 03
03    | 01 | 0
(03 is the accept state.)

33 Writing a lexical analyzer
The DFA helps us to write the scanner so that we recognize a symbol as early as possible without much lookahead. Figure 4.1 in your text gives a good example of what a scanner might look like.

34 LEX (FLEX) A tool for generating programs that recognize lexical patterns in text: it takes regular expressions and turns them into a program.

35 Lexical Errors Only a small percentage of errors can be recognized during Lexical Analysis Consider if (good == !“bad”)

36 Which of the following errors could be found during lexical analysis?
- Line ends inside a literal string
- Illegal character in the input file
- Calling an undefined method
- Missing operator
- Missing paren
- Unquoted string
- Using an unopened file

37 Limitations of REs REs can describe many languages but not all
In regular-expression lingo, a language is a set of strings which is acceptable; the alphabet is the set of tokens of the language. For example: alphabet = {a,b}, and my language is the set of strings consisting of a single a surrounded by an equal number of b's: S = {a, bab, bbabb, bbbabbb, ...}. Can you represent the set of legal strings using an RE?

38 <div id="menu"> HI </div>
With your neighbor, try: with REs, can you match nested paired tags in HTML? Can you write the regular expression to match the language consisting of all nested tags (including user-defined tags)?
<div id="menu"> HI </div>
<div id="menu"><li><a href="s.php">Officers</a></li></div>
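No regular expression can match arbitrarily nested tags, because nesting needs unbounded memory -- a stack, which is exactly what a finite automaton lacks. A sketch of a stack-based checker (the simplified tag syntax here is illustrative, not full HTML):

```python
import re

def balanced_tags(text):
    # The stack remembers which tags are currently open -- unbounded memory
    # that no finite automaton (and hence no RE) has.
    stack = []
    for closing, name in re.findall(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>", text):
        if closing == "":
            stack.append(name)                 # opening tag
        elif not stack or stack.pop() != name:
            return False                       # mismatched or stray close
    return not stack                           # everything opened was closed
```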

39 In general What does a lexical error mean?
Strategies for dealing with it:
- "Panic mode": delete chars from the input until something matches
- Inserting characters
- Re-ordering characters
- Replacing characters
For an error like "illegal character", we should report it sensibly.

40 Syntax Analysis also known as Parsing
Grouping together tokens into larger structures -- analogous to lexical analysis.
Input: tokens (the output of the lexical analyzer)
Output: a structured representation of the original program

41 A Context Free Grammar A grammar is a four-tuple (Σ, N, P, S) where
Σ is the terminal alphabet
N is the non-terminal alphabet
P is the set of productions
S is a designated start symbol in N

42 Parsing Need to express series of added operands
Basic need: a way to communicate what is allowed, in a way that is easy to translate.
Expression → number plus Expression | number
Similar to regular definitions: concatenation; choice; but no * -- repetition is done by recursion.

43 BNF Grammar -- the way Algol was defined (and other languages since)
Expression → number Operator number
Operator → + | - | * | /
The structure on the left is defined to consist of the choices on the right-hand side.
Meta-symbols: → and |
Different conventions exist for writing BNF grammars:
<expression> ::= number <operator> number
Expression → number Operator number

44 Derivations Derivation:
A derivation is a sequence of replacements of structure names by choices on the RHS of grammar rules. Begin with the start symbol; end with a string of token symbols; at each step, one replacement is made.
Exp → Exp Op Exp | number
Op → + | - | * | /

45 Example Derivation How do we get: 4*5*21 Note the different arrows:
⇒ applies a grammar rule in a derivation; → is used to define grammar rules.
Non-terminals: Exp, Op. Terminals: number, * -- terminals, because they terminate the derivation.

46 E → ( E ) | a What sentences does this grammar generate? An example derivation: E ⇒ ( E ) ⇒ ((E)) ⇒ ((a)). Note that this is what we couldn't achieve with regular definitions.

47 Recursive Grammars At your seats, try using a grammar to generate a^n b^n and ab^n.
At your seats, try using a grammar to generate palindromes over Σ = {a,b}: abbaaaabba, aba, aabaa

48 What does E → E α | β derive?
E → E α | β derives β, βα, βαα, βααα, ... -- all strings beginning with β followed by zero or more repetitions of α: βα*

49 Write a context free grammar for
L = {a^m b^n c^k}, where k = m + n
S → a S c | D
D → b D c | ε
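A quick way to convince yourself the grammar is right is to check membership directly. This sketch mirrors the grammar's structure: S → aSc pairs each leading a with a trailing c, then D → bDc pairs each b with a c:

```python
def in_language(s):
    # Accepts exactly L = { a^m b^n c^k : k = m + n }.
    m = 0
    while m < len(s) and s[m] == "a":
        m += 1                       # count leading a's
    n = 0
    while m + n < len(s) and s[m + n] == "b":
        n += 1                       # count the b's that follow
    # what remains must be exactly m + n c's
    return s[m + n:] == "c" * (m + n)
```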

50 Write a CFG for (a(ab)*b)*
S → a C b S | ε
C → a b C | ε

51 Abstract Syntax Trees Parse trees contain surplus information
(Diagram: the parse tree for 3 + 4 -- exp over exp op exp over the token sequence number + number -- next to the abstract syntax tree: just + with children 3 and 4, which is all the information we actually need.)

52 An exercise Consider the grammar S → (L) | a, L → L,S | S
What are the terminals, non-terminals and start symbol?
Find leftmost and rightmost derivations and parse trees for the following sentences:
(a,a)
(a, (a,a))
(a, ((a,a), (a,a)))

53 Parsing token sequence: id + id * id
E → E + E | E * E | ( E ) | - E | id
How many ways can you find a tree which matches the expression id + id * id?

54 We want structure of parse tree to help us in evaluating
We build the parse tree, and as we traverse it, we generate the corresponding code.

55 Ambiguity If the same sentence has two distinct parse trees, the grammar is ambiguous. In English, the phrase "small dogs and cats" is ambiguous, as we aren't sure whether the cats are small or not. "I have a picture of Joe at my house" is also ambiguous. A language is said to be ambiguous if no unambiguous grammar exists for it.

56 Ambiguous Grammars Problem – no unique structure is expressed
A grammar that generates a string with two distinct parse trees is called an ambiguous grammar.
2+3*4 = 2 + (3*4) = 14, or 2+3*4 = (2+3) * 4 = 20?
How does the grammar relate to meaning? Our experience of math says the first interpretation is correct, but the grammar does not express this:
E → E + E | E * E | ( E ) | - E | id

57 Ambiguity is bad!!!! Goal – rewrite the grammar so you can never get two trees for the same input. If you can do that, PROBLEM SOLVED!!! We won’t spend time studying that, but realize that is usually the goal.

58 Show that this grammar is ambiguous.
Consider the grammar S → aS | aSbS | ε. How do you do that? Think of ONE sentence that you think is problematic, and show TWO different parse trees for it:
S ⇒ aS ⇒ aaSbS ⇒ aabS ⇒ aab
OR
S ⇒ aSbS ⇒ aaSbS ⇒ aabS ⇒ aab

59 Is this grammar ambiguous?
S → A1B
A → 0A | ε
B → 0B | 1B | ε
No: consider the symbols of any string one at a time. In every case, each symbol could only have come from one production.

60 Precedence
E → E + Term | E - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → ( exp ) | number | id
id → a | b | c | d
Operators of equal precedence are grouped together at the same 'level' of the grammar -- a 'precedence cascade'. The lowest-level operators have the highest precedence. (The first shall be last and the last shall be first.)
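The precedence cascade translates directly into a recursive-descent evaluator: one function per grammar level, with the left-recursive rules turned into loops. A sketch (function and token handling are invented for the example):

```python
import re

def lex(src):
    return re.findall(r"\d+|[()+\-*/]", src)

def parse_expr(toks):
    # E -> E + Term | E - Term | Term, with left recursion turned into a loop,
    # so the operators group left-to-right (left associative)
    val = parse_term(toks)
    while toks and toks[0] in "+-":
        op = toks.pop(0)
        rhs = parse_term(toks)
        val = val + rhs if op == "+" else val - rhs
    return val

def parse_term(toks):
    # Term -> Term * Factor | Term / Factor | Factor: lower in the cascade,
    # so * and / bind tighter than + and -
    val = parse_factor(toks)
    while toks and toks[0] in "*/":
        op = toks.pop(0)
        rhs = parse_factor(toks)
        val = val * rhs if op == "*" else val / rhs
    return val

def parse_factor(toks):
    # Factor -> ( Expr ) | number
    tok = toks.pop(0)
    if tok == "(":
        val = parse_expr(toks)
        toks.pop(0)          # consume the closing ')'
        return val
    return int(tok)

def evaluate(src):
    return parse_expr(lex(src))
```

Because * lives one level lower than +, 2+3*4 groups as 2+(3*4); because the + and - loop consumes operands left to right, 8-3-2 groups as (8-3)-2.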

61 Precedence Using the previous grammar, show a parse tree for
a+b*c (notice how precedence matters)
a*b+c (notice how precedence matters)
a-b-c (notice how associativity matters)

62 Associativity ? 30 or 40 Subtraction is left associative: evaluated left to right (= 30).
E → E + E | E - E | Term -- does not tell us how to split up (ambiguous)
E → E + Term | E - Term | Term -- forces left associativity via left recursion
E → Term + E | Term - E | Term -- forces right associativity via right recursion
Precedence and associativity remove the ambiguity of arithmetic expressions -- which is what our math teachers took years telling us!

63 Review: Precedence and Associativity In the grammar below, where would you put unary negation and exponentiation? With what precedence and associativity?
E → E + Term | E - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → ( exp ) | number | id

64 Try parsing -2^2^3 * -(a+b)
Is this right?
E → E + Term | E - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → - Factor | Item
Item → Part ^ Item | Part
Part → ( exp ) | number | id

65 Extended BNF Notation Notation for repetition and optional features.
{item} expresses repetition:
expr → expr + term | term    becomes    expr → term { + term }
[item] expresses optional features:
if-stmt → if( expr ) stmt | if( expr ) stmt else stmt    becomes    if-stmt → if( expr ) stmt [ else stmt ]

66 Syntax (railroad) Diagrams
An alternative to EBNF. Rarely seen any more: EBNF is much more compact. Example (if-statement, p. 101):


68 Formal Methods of Describing Syntax
In the 1950s, Noam Chomsky (a linguist) described generative devices for four classes of languages (in order of decreasing power):
- Recursively enumerable: x → y, where x and y can be any strings of nonterminals and terminals.
- Context-sensitive: x → y, where x and y can be strings of terminals and non-terminals, but y is at least as long as x. Can recognize a^n b^n c^n.
- Context-free: nonterminals appear singly on the left side of productions, so any nonterminal can be replaced by its right-hand side regardless of the context it appears in. E.g., a condition in a while statement is treated the same as a condition in an assignment or in an if -- this is context-free.
- Regular: only allows productions of the form N → aT or N → a. Can recognize a^n b^m.
Chomsky was interested in the theoretic nature of natural languages.

69 Regular Grammar (like regular expression)
A CFG is called regular if every production has one of the forms below:
A → aB
A → ε

70 Convert the following regular expressions into regular grammars:
a* b
a+b*

71 Context Sensitive Allows for left hand side to be more than just a single non-terminal. It allows for “context” Context - sensitive : context dependent Rules have the form xYz->xuz with Y being a non-terminal and x,u,z being terminals or non-terminals.

72 Consider the following context-sensitive grammar
G = ({S,B,C}, {x,y,z}, P, S), where P is:
S → xSBC
S → xyC
CB → BC
yB → yy
yC → yz
zC → zz
At your seats: what strings does this generate?

73 How is Parsing done? Recursive descent (top down).
Bottom-up -- tries to match the input with the right-hand side of a rule. Sometimes called shift-reduce parsers.

74 Predictive Parsing Which rule to use?
I need to expand a non-terminal -- which rule do I use? Top-down parsing; LL(1) parsing. Table-driven predictive parsing (no recursion) versus recursive-descent parsing, where each non-terminal is associated with a procedure call. No backtracking.
E → E + T | T
T → T * F | F
F → (E) | id

75 Two grammar problems
- Eliminating left recursion (without changing associativity): with E → E+a | a, I can't tell which rule to use, E → E+a or E → a, as both generate the same first symbol.
- Removing two rules with the same prefix: with E → aB | aC, I can't tell which rule to use, as both generate the same first symbol.

76 Removing Left Recursion
Before:
A → A x
A → y
After:
A → y B
B → x B
B → ε
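The transformation is mechanical enough to automate. A sketch for immediate left recursion (the tuple encoding of productions is an illustrative choice, not a standard API):

```python
def remove_left_recursion(A, productions):
    """Productions are tuples of symbols: A -> A x | y is [("A", "x"), ("y",)].
    Standard transformation: A -> A x | y becomes A -> y B, B -> x B | ε."""
    recursive = [p[1:] for p in productions if p and p[0] == A]   # the x parts
    base = [p for p in productions if not p or p[0] != A]         # the y parts
    B = A + "'"                                                   # fresh helper name
    return {
        A: [y + (B,) for y in base],                 # A -> y B
        B: [x + (B,) for x in recursive] + [()],     # B -> x B | ε  (() is ε)
    }
```

Applied to the earlier problem case E → E+a | a, it produces E → a E', E' → + a E' | ε, which a predictive parser can handle.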

77 Removing common prefix (left factoring)
Stmt → if Expr then Stmt else Stmt | if Expr then Stmt
Change the grammar so you don't have to pick a rule until later.
Before: A → αβ1 | αβ2
After: A → αA'
       A' → β1 | β2

78 Exercises Eliminate left recursion from the following grammars.
1. S → (L) | a
   L → L,S | S
2. E → E or T | T    (or is part of a boolean expression)
   T → T and G | G
   G → not G | (E) | true | false

79 Table Driven Predictive Parsing
(Diagram: the input, e.g. a + b $, feeds a predictive parsing program that consults a parsing table and a stack X Y Z $; the output is the derivation. The stack holds the part of the partial derivation not yet matched.)

80 The idea behind Parsing
You start out with your derivation and "someone smart" tells you which production to use. Say you have a non-terminal E and you want to match an "e": if only one rule from E produces something that starts with an "e", you know which one to use.

81 Table Driven Predictive Parsing
Parse id + id * id: give the leftmost derivation and parse tree using the grammar
E → TP
P → +TP | ε
T → FU
U → *FU | ε
F → (E) | id

82 Table Driven Predictive Parsing
Nonterminal | id     | +       | (       | )     | *       | $
E           | E → TP |         | E → TP  |       |         |
P           |        | P → +TP |         | P → ε |         | P → ε
T           | T → FU |         | T → FU  |       |         |
U           |        | U → ε   |         | U → ε | U → *FU | U → ε
F           | F → id |         | F → (E) |       |         |
Use the parse table to derive: a*(a+a)*a
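The table-driven parser itself is a short loop over an explicit stack. A Python sketch of the table for this grammar, with ε entries encoded as empty lists:

```python
# LL(1) parse table for E -> TP, P -> +TP | ε, T -> FU, U -> *FU | ε, F -> (E) | id.
TABLE = {
    ("E", "id"): ["T", "P"],      ("E", "("): ["T", "P"],
    ("P", "+"): ["+", "T", "P"],  ("P", ")"): [], ("P", "$"): [],
    ("T", "id"): ["F", "U"],      ("T", "("): ["F", "U"],
    ("U", "+"): [], ("U", "*"): ["*", "F", "U"], ("U", ")"): [], ("U", "$"): [],
    ("F", "id"): ["id"],          ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "P", "T", "U", "F"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]          # end-of-input marker
    stack = ["$", "E"]               # start symbol on top
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top in NONTERMINALS:
            if (top, look) not in TABLE:
                return False         # no table entry: syntax error
            stack.extend(reversed(TABLE[(top, look)]))   # expand the production
        elif top == look:
            i += 1                   # terminal on the stack matches the input
        else:
            return False
    return i == len(tokens)
```

There is no recursion and no backtracking: every step is one table lookup driven by the stack top and the lookahead token.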

