CS 404 Introduction to Compiler Design

Lecture 2: Finite Automata, Context Free Grammar (CFG)
Ahmed Ezzat, CS 404

Finite Automata
- Evaluate regular expressions
- Recognize certain languages and reject others
- Two kinds of FA:
  - Non-deterministic FA (NFA)
  - Deterministic FA (DFA)

A Finite Automaton Consists of
- An input alphabet, e.g., Σ = {a, b, …}
- A set of states, e.g., S = {s0, s1, s2, …}
- A set of transitions from states to states, labeled by elements of Σ or by ε
- A start state, e.g., s0
- A set of final states, e.g., F = {s1, s2}

FA and Language
- An FA accepts a string x if and only if there is some path in the transition graph from the start state to a final state such that the edge labels along this path spell x.
- The set of strings an FA accepts is said to be the language defined by this FA.
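The acceptance condition can be sketched in Python by tracking every state the automaton could be in after each input symbol. The NFA below is a hypothetical example, not one from the slides; the names `TRANSITIONS`, `epsilon_closure`, and `accepts` are illustrative, and the empty string is assumed to stand for an ε edge.

```python
# Hypothetical NFA for illustration (not taken from the slides).
# TRANSITIONS maps (state, label) -> set of next states.
# The empty string "" is assumed to stand for an epsilon edge.
EPS = ""
TRANSITIONS = {
    ("s0", "a"): {"s0", "s1"},
    ("s1", "b"): {"s2"},
    ("s2", EPS): {"s0"},
}
START = "s0"
FINALS = {"s2"}


def epsilon_closure(states, transitions):
    """Return all states reachable from `states` using only epsilon edges."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in transitions.get((state, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def accepts(x, transitions=TRANSITIONS, start=START, finals=FINALS):
    """True if some path from the start state to a final state spells x."""
    current = epsilon_closure({start}, transitions)
    for symbol in x:
        moved = set()
        for state in current:
            moved |= transitions.get((state, symbol), set())
        current = epsilon_closure(moved, transitions)
    return bool(current & finals)
```

With this example, `accepts("ab")` and `accepts("aab")` hold, while `accepts("a")` and `accepts("b")` do not.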

NFA
[Figure not recoverable from the transcript]

DFA and NFA
- What we have defined so far is called an NFA
- A DFA is a special case of an NFA:
  - No state has an ε-transition.
  - For each state S and input symbol a, there is at most one edge labeled a leaving S.
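The two DFA conditions translate directly into a check over a transition table. This is a minimal sketch that assumes the table maps (state, label) pairs to sets of target states and uses the empty string for ε; the function name `is_dfa` and both example tables are hypothetical.

```python
EPS = ""  # assumed marker for an epsilon edge


def is_dfa(transitions):
    """Check the two DFA conditions on a (state, label) -> set-of-states table."""
    for (state, label), targets in transitions.items():
        if label == EPS and targets:
            return False  # condition 1: no epsilon-transitions
        if len(targets) > 1:
            return False  # condition 2: at most one edge per (state, symbol)
    return True


# Hypothetical examples: the first has an epsilon edge and a choice on "a".
NFA_EXAMPLE = {
    ("s0", "a"): {"s0", "s1"},
    ("s1", "b"): {"s2"},
    ("s2", EPS): {"s0"},
}
DFA_EXAMPLE = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q0"},
}
```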

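NFA to DFA
[Figure: an NFA over the alphabet {a, b, c} with states 1, 2, 3, 4 and ε-transitions, a chart representing its graph, and the equivalent DFA whose states are the subsets {1}, {2,1}, {4,3}, and {3}. The slide also tests sample strings (ab, b, abab, a, cc, cb, c, caa, ccab, ccacc, ccac, abacab) for acceptance. The individual transitions and the Yes/No results are not recoverable from the transcript.]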

NFA, DFA and Regular Expressions
- A DFA is an NFA (without ε)
- Each NFA can be converted into a DFA
- One can construct an NFA from a regular expression
- FAs are used by a lexical analyzer to recognize tokens
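One common way to convert an NFA into a DFA is the subset construction, where each DFA state is a set of NFA states. The sketch below assumes the same table layout as a (state, label) -> set-of-states dictionary with the empty string for ε; the example NFA and the names `nfa_to_dfa` and `EXAMPLE_NFA` are hypothetical and are not the automaton drawn on the slides.

```python
from collections import deque

EPS = ""  # assumed marker for an epsilon edge


def epsilon_closure(states, nfa):
    """All NFA states reachable from `states` through epsilon edges only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in nfa.get((state, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction: every DFA state is a frozenset of NFA states."""
    start_set = frozenset(epsilon_closure({start}, nfa))
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= nfa.get((state, symbol), set())
            if not moved:
                continue  # no edge: the DFA simply has no transition here
            target = frozenset(epsilon_closure(moved, nfa))
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {subset for subset in seen if subset & finals}
    return dfa, start_set, dfa_finals


# Hypothetical NFA used only to exercise the sketch.
EXAMPLE_NFA = {
    ("s0", "a"): {"s0", "s1"},
    ("s1", "b"): {"s2"},
    ("s2", EPS): {"s0"},
}
DFA, DFA_START, DFA_FINALS = nfa_to_dfa(EXAMPLE_NFA, "s0", {"s2"}, "ab")
```

For this example the construction reaches three subsets: {s0}, {s0, s1}, and {s0, s2}, with {s0, s2} as the only final state.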

Syntax Analysis
- Syntax analysis is also called parsing
- Create hierarchical structures (parse trees)
- Use "grammars" to define the structures
- Compared with a lexer, a parser only accepts syntactically correct sentences

Grammars
- A grammar is a formal way to specify the set of valid sentences in a language L
  - Just like a regular expression is a formal way to define a token in a language L
- A syntax analyzer (or parser) is a software tool that recognizes all valid sentences in L
  - Just like a lexical analyzer is a software tool that recognizes all valid lexemes in a language L

Context Free Grammars (CFG)
A context free grammar has four components:
- A set of terminal symbols, e.g., T = {a, b, …}
- A set of non-terminal symbols, e.g., N = {S, A, B, …}
- A set of productions, where each consists of a non-terminal on the left-hand side and a string of terminals and non-terminals on the right-hand side, e.g., A → aB
- A start symbol, which is a non-terminal, e.g., S

Formal Definition of a CFG
- There is a finite set of symbols that form the strings, i.e., there is a finite alphabet. The alphabet symbols are called terminals (think of a parse tree: the terminals are the leaves).
- There is a finite set of variables, sometimes called non-terminals or syntactic categories. Each variable represents a language (i.e., a set of strings).
- One of the variables is the start symbol. Other variables may exist to help define the language.
- There is a finite set of productions, or production rules, that represent the recursive definition of the language. Each production rule:
  - Has a single variable that is being defined, to the left of the production symbol
  - Has the production symbol →
  - Has a string of zero or more terminals or variables, called the body of the production
- To form strings, we can replace a variable with the body of one of its productions wherever the variable appears.

CFG Notations
A CFG G may then be represented by these four components, denoted G = (V, T, R, S):
- V is the set of variables
- T is the set of terminals
- R is the set of production rules
- S is the start symbol

Sample CFG
1. E → I           // Expression is an identifier
2. E → E+E         // Add two expressions
3. E → E*E         // Multiply two expressions
4. E → (E)         // Add parentheses
5. I → L           // Identifier is a Letter
6. I → ID          // Identifier + Digit
7. I → IL          // Identifier + Letter
8. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9    // Digits
9. L → a | b | c | … A | B | … | Z              // Letters
Note: Identifiers are regular; we could describe them as (letter)(letter + digit)*
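The closing note says identifiers form a regular language, described by (letter)(letter + digit)*. A minimal sketch of that regular expression with Python's `re` module follows. It assumes the letters are the ASCII ranges a-z and A-Z, which the slide only abbreviates with "…"; the name `is_identifier` is illustrative.

```python
import re

# (letter)(letter + digit)*, assuming ASCII letters a-z and A-Z.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(text):
    """True if the whole string is one letter followed by letters or digits."""
    return IDENTIFIER.fullmatch(text) is not None
```

For instance, `is_identifier("r5")` and `is_identifier("b1")` hold, while `is_identifier("5r")` does not because an identifier must start with a letter.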

Recursive Inference
- Recursive inference is the process of finding strings that satisfy individual productions and then concatenating them according to more general rules.
- This is a bottom-up process.
- For example, parsing the identifier "r5":
  - Rule 8 tells us that D → 5
  - Rule 9 tells us that L → r
  - Rule 5 tells us that I → L, so I derives r
  - Apply recursive inference using rule 6, I → ID, to get that I derives rD. Use D → 5 to get that I derives r5.
  - Finally, we know from rule 1 that E → I, so r5 is also an expression.

Derivations
- A derivation is a sequence of applications of rules from P (the set of productions), resulting in a string of terminals (i.e., a sentence)
- Basically, we treat a production as a rewriting rule and replace the non-terminal on the LHS with the RHS
- There can be more than one derivation for a sentence

More on derivations
- ⇒ derives in one step
- ⇒* derives in zero or more steps
- ⇒+ derives in one or more steps
- α ⇒* α for any string α
- If α ⇒* β and β ⇒ γ, then α ⇒* γ

Derivation
- Similar to recursive inference, but top-down instead of bottom-up
- Expand the start symbol first and work down in such a way that the result matches the input string
- For example, given a*(a+b1) we can derive it by:
  E ⇒ E*E ⇒ I*E ⇒ L*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(L+E) ⇒ a*(a+E) ⇒ a*(a+I) ⇒ a*(a+ID) ⇒ a*(a+LD) ⇒ a*(a+bD) ⇒ a*(a+b1)
- Note that at each step we could have chosen any one of the variables to replace with one of its production bodies.

Multiple Derivation
- We saw an example of ⇒ in deriving a*(a+b1)
- We could have used ⇒* to condense the derivation. E.g., we could go straight to E ⇒* E*(E+E), or even straight to the final step E ⇒* a*(a+b1)
- Going straight to the end is not recommended on a homework or exam problem if you are supposed to show the derivation

Leftmost Derivation
- In the previous example we used a derivation called a leftmost derivation. We can specifically denote a leftmost derivation using the subscript "lm", as in ⇒lm or ⇒*lm
- A leftmost derivation is simply one in which, at every step, we replace the leftmost variable by one of its production bodies, working our way from left to right.
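A leftmost derivation can be replayed mechanically: at each step, find the leftmost variable and replace it with the chosen body. The sketch below replays the derivation of a*(a+b1) from the earlier slide. It assumes sentential forms are plain strings and that the only variables are E, I, L, and D (so uppercase terminals are not handled); `leftmost_step`, `derive`, and `STEPS` are illustrative names.

```python
# Assumed representation: a sentential form is a string, and these
# single uppercase characters are the only variables.
VARIABLES = {"E", "I", "L", "D"}


def leftmost_step(form, head, body):
    """Replace the leftmost variable in `form`, which must be `head`, by `body`."""
    for i, symbol in enumerate(form):
        if symbol in VARIABLES:
            if symbol != head:
                raise ValueError(f"leftmost variable is {symbol}, not {head}")
            return form[:i] + body + form[i + 1:]
    raise ValueError("no variable left to replace")


def derive(start, steps):
    """Apply (head, body) productions in order; return every sentential form."""
    forms = [start]
    for head, body in steps:
        forms.append(leftmost_step(forms[-1], head, body))
    return forms


# Productions used, in order, for the leftmost derivation of a*(a+b1).
STEPS = [
    ("E", "E*E"), ("E", "I"), ("I", "L"), ("L", "a"),
    ("E", "(E)"), ("E", "E+E"), ("E", "I"), ("I", "L"), ("L", "a"),
    ("E", "I"), ("I", "ID"), ("I", "L"), ("L", "b"), ("D", "1"),
]
FORMS = derive("E", STEPS)
```

Joining `FORMS` with " ⇒ " reproduces the chain shown on the Derivation slide.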

Rightmost Derivation
- Not surprisingly, we also have a rightmost derivation, which we can specifically denote via ⇒rm or ⇒*rm
- A rightmost derivation is one in which, at every step, we replace the rightmost variable by one of its production bodies, working our way from right to left.

Rightmost Derivation Example
- a*(a+b1) was already shown previously using a leftmost derivation.
- We can also come up with a rightmost derivation, but we must make the replacements in a different order:
  E ⇒rm E*E ⇒rm E*(E) ⇒rm E*(E+E) ⇒rm E*(E+I) ⇒rm E*(E+ID) ⇒rm E*(E+I1) ⇒rm E*(E+L1) ⇒rm E*(E+b1) ⇒rm E*(I+b1) ⇒rm E*(L+b1) ⇒rm E*(a+b1) ⇒rm I*(a+b1) ⇒rm L*(a+b1) ⇒rm a*(a+b1)
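The rightmost derivation can be checked the same way by scanning each sentential form from the right. This sketch assumes forms are plain strings and that E, I, L, and D are the only variables; `rightmost_step`, `derive_rightmost`, and `RM_STEPS` are illustrative names.

```python
# Assumed representation: a sentential form is a string, and these
# single uppercase characters are the only variables.
VARIABLES = {"E", "I", "L", "D"}


def rightmost_step(form, head, body):
    """Replace the rightmost variable in `form`, which must be `head`, by `body`."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in VARIABLES:
            if form[i] != head:
                raise ValueError(f"rightmost variable is {form[i]}, not {head}")
            return form[:i] + body + form[i + 1:]
    raise ValueError("no variable left to replace")


def derive_rightmost(start, steps):
    """Apply (head, body) productions in order; return every sentential form."""
    forms = [start]
    for head, body in steps:
        forms.append(rightmost_step(forms[-1], head, body))
    return forms


# Productions used, in order, for the rightmost derivation of a*(a+b1).
RM_STEPS = [
    ("E", "E*E"), ("E", "(E)"), ("E", "E+E"), ("E", "I"), ("I", "ID"),
    ("D", "1"), ("I", "L"), ("L", "b"), ("E", "I"), ("I", "L"), ("L", "a"),
    ("E", "I"), ("I", "L"), ("L", "a"),
]
RM_FORMS = derive_rightmost("E", RM_STEPS)
```

Both derivations take 14 steps and end in the same sentence; only the order of replacements differs.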

Left or Right?
- Does it matter which method you use? Answer: No
- Any derivation of a terminal string has an equivalent leftmost and an equivalent rightmost derivation. That is, for a terminal string w, A ⇒* w iff A ⇒*lm w and A ⇒*rm w.

Language of Context Free Grammar
- The language represented by a CFG G = (V, T, P, S) is denoted L(G). It is a Context Free Language (CFL) and contains all terminal strings X such that S ⇒* X
- In other words, L(G) consists of the terminal strings that have derivations from the start symbol: L(G) = { w in T* | S ⇒*G w }
- Note that every string in the CFL L(G) consists solely of terminals from G.
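For a small grammar, the strings of L(G) up to a length bound can be listed by a breadth-first search over leftmost derivations. The sketch below uses a hypothetical grammar S → aSb | ε, which is not from the slides, and assumes that every production body containing a variable also contains a terminal; otherwise the search might not terminate. The name `language_up_to` is illustrative.

```python
from collections import deque


def language_up_to(productions, start, max_len):
    """Terminal strings of length <= max_len derivable from `start`.

    Assumes any body that contains a variable also contains a terminal,
    so forms that are too long can be pruned and the search terminates.
    """
    variables = set(productions)
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if the form is a sentence.
        idx = next((i for i, s in enumerate(form) if s in variables), None)
        if idx is None:
            results.add(form)
            continue
        for body in productions[form[idx]]:
            new_form = form[:idx] + body + form[idx + 1:]
            terminal_count = sum(1 for s in new_form if s not in variables)
            if terminal_count <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results


# Hypothetical grammar: S -> aSb | epsilon (the empty string is the epsilon body).
EXAMPLE = {"S": ["aSb", ""]}
SHORT_STRINGS = language_up_to(EXAMPLE, "S", 4)
```

Here `SHORT_STRINGS` contains the empty string, "ab", and "aabb", all made only of terminals.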

Parse Tree
A parse tree is a graphical (top-down) representation of a derivation:
- The root is the start symbol
- Each leaf is a terminal symbol or ε
- Each internal node is a non-terminal
- If A is an internal node and X1 X2 … Xn are A's children, then A → X1 X2 … Xn is a production used in the derivation

Sample Parse Tree
Sample parse tree for 1110111 using the CFG:
  P → ε | 0 | 1 | 0P0 | 1P1
[Figure: parse tree not recoverable from the transcript]
The sample expression grammar, repeated for reference:
  E → I
  E → E+E
  E → E*E
  E → (E)
  I → L
  I → ID
  I → IL
  D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  L → a | b | c | … A | B | … Z
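Since the tree figure did not survive, here is a sketch that rebuilds a parse tree for 1110111 under P → ε | 0 | 1 | 0P0 | 1P1. It assumes trees are nested tuples whose first element is the variable; the function name `parse_palindrome` is illustrative.

```python
def parse_palindrome(w):
    """Return a parse tree for w under P -> ε | 0 | 1 | 0P0 | 1P1, or None.

    Trees are nested tuples ("P", child, ...); this encoding is an assumption.
    """
    if any(ch not in "01" for ch in w):
        return None
    if w == "":
        return ("P", "ε")
    if len(w) == 1:
        return ("P", w)
    if w[0] == w[-1]:
        # Only 0P0 or 1P1 can apply; the first symbol decides which one.
        inner = parse_palindrome(w[1:-1])
        if inner is not None:
            return ("P", w[0], inner, w[-1])
    return None


TREE_1110111 = parse_palindrome("1110111")
```

The result nests three applications of P → 1P1 around the leaf production P → 0.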

Ambiguity
- A grammar that produces more than one parse tree for some sentence is said to be ambiguous
- Sometimes we can rewrite the rules in P to make a grammar unambiguous
- Example: write rules to reflect the precedence of the operators
  S → AS | ε
  A → A1 | 0A1 | 01
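Ambiguity can be observed by counting parse trees. The sketch below uses a simplified expression grammar E → E+E | E*E | a, which is an assumption made to keep the example short (it stands in for the sample grammar with identifiers collapsed to a). It counts the trees for a token sequence by trying every operator as the root; `count_parse_trees` is an illustrative name.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Number of parse trees for `tokens` under E -> E+E | E*E | a."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Trees deriving tokens[lo:hi] from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "a" else 0
        total = 0
        # Try each operator strictly inside the span as the root production.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))
```

`count_parse_trees("a+a*a")` is 2: one tree groups the sentence as a+(a*a) and the other as (a+a)*a, so this grammar is ambiguous.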

Other Types of Grammars
- Regular Grammars (RG)
- Context Free Grammars (CFG)
- Context Sensitive Grammars (CSG)
- Unrestricted Grammars (UG)
L(RG) ⊆ L(CFG) ⊆ L(CSG) ⊆ L(UG)

Use CFG and Parsing
- A CFG is used to define the structure of a program (a language)
- Parsing is used to test whether a sentence belongs to the language
- Parsing can be done by hand
- Parsing algorithms (next lecture)

END
