Lexical Analysis & Syntactic Analysis


Lexical Analysis & Syntactic Analysis CS 671 January 24, 2008

Last Time: Lexical Analysis
- Groups sequences of characters into lexemes, the smallest meaningful entities in a language (keywords, identifiers, constants)
- Characters read from a file are buffered, which helps decrease latency due to I/O; the lexical analyzer manages the buffer
- Makes use of the theory of regular languages and finite state machines
- Lex and Flex are tools that construct lexical analyzers from regular-expression specifications
(Compiler pipeline: source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program)

Finite Automata
- Takes an input string and determines whether it's a valid sentence of a language
- A finite automaton has a finite set of states
- Edges lead from one state to another; each edge is labeled with a symbol
- One state is the start state; one or more states are final states
(Diagrams: an automaton for IF with edges labeled i and f, and an automaton for ID with edges labeled a-z and 0-9)

Finite Automata Representations
An automaton (DFA) for the string-literal pattern \" [^"]* \" can be represented as:
- A transition table
- A graph (states 0, 1, 2, with edges labeled " and non-")

Implementation
A DFA can be driven directly from its tables (diagram: the string-literal automaton from the previous slide):

  boolean accept_state[NSTATES] = { … };
  int trans_table[NSTATES][NCHARS] = { … };

  int state = 0;
  while (state != ERROR_STATE) {
    c = input.read();
    if (c < 0) break;
    state = trans_table[state][c];
  }
  return accept_state[state];
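The table-driven loop on this slide can be sketched as runnable code. The following Python version hard-codes a transition function for the string-literal DFA; the state names and the `matches` helper are illustrative, not from the slides:

```python
# Table-driven DFA recognizer for string literals: " [^"]* "
# States: 0 = start, 1 = inside the literal, 2 = accept; -1 = error.
START, INSIDE, ACCEPT, ERROR = 0, 1, 2, -1
accept_state = {START: False, INSIDE: False, ACCEPT: True}

def step(state, ch):
    # Transition function, playing the role of trans_table[state][ch].
    if state == START:
        return INSIDE if ch == '"' else ERROR
    if state == INSIDE:
        return ACCEPT if ch == '"' else INSIDE
    return ERROR  # no moves out of the accept or error states

def matches(text):
    state = START
    for ch in text:
        state = step(state, ch)
        if state == ERROR:
            return False
    return accept_state.get(state, False)

print(matches('"hello"'))    # True
print(matches('"unclosed'))  # False
```

The same loop works for any DFA; only the table (here, `step`) changes.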

RegExp  Finite Automaton Can we build a finite automaton for every regular expression? Strategy: consider every possible kind of regular expression (define by induction) a a 1 R1R2 R1|R2 ?

Deterministic vs. Nondeterministic
- Deterministic finite automata (DFA): no two edges from the same state are labeled with the same symbol
- Nondeterministic finite automata (NFA): may have arrows labeled with ε (which does not consume input)

DFA vs. NFA
- DFA: the action of the automaton on each input symbol is fully determined; obvious table-driven implementation
- NFA: the automaton may have a choice on each step; it accepts a string if there is any way to make choices that arrives at an accepting state (equivalently, every path from the start state to an accept state spells a string accepted by the automaton); not obvious how to implement efficiently!

RegExp  NFA -? [0-9]+ (-|) [0-9][0-9]* 0,1,2… - 0,1,2…  

Inductive Construction
- Single symbol a: an edge labeled a from the start state to an accept state
- Concatenation R1R2: connect the accept state of R1 to the start state of R2 with an ε-edge
- Alternation R1|R2: a new start state with ε-edges into R1 and R2, and ε-edges from their accept states to a new accept state
- Closure R*: ε-edges that allow R to be traversed zero or more times

Executing NFA
Problem: how do we execute an NFA efficiently?
"Strings accepted are those for which there is some corresponding path from start state to an accept state"
Naive conclusion: search all paths in the graph consistent with the string
Better idea: search all paths in parallel
- Keep track of the subset of NFA states the search could be in after seeing a string prefix
- "Multiple fingers" pointing into the graph
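The "multiple fingers" idea can be sketched directly. The following Python sketch simulates an illustrative NFA for -?[0-9]+; the transition encoding, the EPS marker, and the function names are assumptions for illustration:

```python
# Parallel NFA simulation: track the set of states reachable after each
# prefix of the input. EPS marks epsilon edges (consume no input).
EPS = None
# transitions: state -> list of (label, next_state)
nfa = {
    0: [(EPS, 1), ('-', 1)],            # optional minus sign
    1: [(d, 2) for d in '0123456789'],  # first digit
    2: [(d, 2) for d in '0123456789'],  # remaining digits
}
start, accepting = 0, {2}

def eps_closure(states):
    # All states reachable via epsilon edges, without consuming input.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa.get(s, []):
            if label is EPS and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def nfa_accepts(text):
    current = eps_closure({start})
    for ch in text:
        moved = {t for s in current for label, t in nfa.get(s, []) if label == ch}
        current = eps_closure(moved)
    return bool(current & accepting)

print(nfa_accepts('-23'))  # True
print(nfa_accepts('-'))    # False
```

After reading "-", the "fingers" are on {1}; after "-2" they are on {2}, an accepting state.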

Example
Input string: -23
NFA states after each prefix: ____
Terminology: the ε-closure of a state is the set of all states reachable without consuming any input
The ε-closure of state 0 is {0,1}
(Diagram: the (-|ε)[0-9][0-9]* NFA with states 0-3)

NFADFA Conversion Can convert NFA directly to DFA by same approach Create one DFA for each distinct subset of NFA states that could arise States: {0,1}, {1}, {2, 3} 0,1,2… {0,1} - {1} -  1 2 3 0,1,2… 0,1,2… 0,1,2…  {2,3} 0,1,2…

DFA Minimization
- DFA construction can produce a large DFA with many states
- Lexer generators perform an additional phase of DFA minimization to reduce the DFA to the minimum possible size
(Diagram: a DFA with edges labeled 1. What does this DFA do? Can it be simplified?)

Automatic Scanner Construction
To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Scanner generators: Lex and Flex work along these lines; the algorithms are well known and understood; the key issue is the interface to the parser

Building a Lexer
Specification: "if", "while", [a-zA-Z][a-zA-Z0-9]*, [0-9][0-9]*, ( , )
Pipeline: an NFA for each RE → one giant NFA → one giant DFA → table-driven code

Lexical Analysis Summary
- Regular expressions: an efficient way to represent languages; used by lexer generators
- Finite automata: describe the actual implementation of a lexer
- Process: regular expressions (+ priority) are converted to an NFA; the NFA is converted to a DFA

Where Are We?
Source code: if (b==0) a = "Hi";
- Lexical Analysis → token stream: if ( b == 0 ) a = "Hi" ;
- Syntactic Analysis → abstract syntax tree (AST): an if node whose condition is (== b 0) and whose body is the assignment (= a "Hi")
- Semantic Analysis comes next
Question at this stage: do the tokens conform to the language syntax?

Phases of a Compiler: the Parser
- Converts a linear structure (a sequence of tokens) into a hierarchical, tree-like structure: an AST
- The parser imposes the syntax rules of the language
- Work should be linear in the size of the input (else unusable) ⇒ type consistency cannot be checked in this phase
- Deterministic context-free languages and pushdown automata form the theoretical basis
- Bison and yacc allow a user to construct parsers from CFG specifications
(Compiler pipeline: source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program)

What is Parsing?
Parsing: recognizing whether a sentence (or program) is grammatically well formed, and identifying the function of each component
Example: "I gave him the book"
- subject: I
- verb: gave
- indirect object: him
- object: noun phrase with article "the" and noun "book"

Tree Representations
Program: a = 5+3; b = (print(a, a-1), 10*a); print(b)
(Diagram: the corresponding tree built from CompoundStm, AssignStm, PrintStm, OpExp, EseqExp, PairExpList/LastExpList, NumExp, and IdExp nodes)

Overview of Syntactic Analysis
- Input: stream of tokens
- Output: abstract syntax tree
- Implementation: parse the token stream, traversing the concrete syntax (parse tree); during the traversal, build the abstract syntax tree
- The abstract syntax tree removes extra syntax: a + b, (a) + (b), and ((a)+((b))) all yield the same AST, bin_op + with children a and b

What Parsing Doesn't Do
Doesn't check: type agreement, variable declaration, variable initialization, etc.

  int x = true;
  int y;
  z = f(y);

These checks are deferred until semantic analysis

Specifying Language Syntax
- First problem: how to describe language syntax precisely and conveniently
- Last time: we described tokens using regular expressions, which are easy to implement and efficient (by converting to a DFA)
- Why not use regular expressions (on tokens) to specify programming language syntax?

Need a More Powerful Representation
- Programming languages are not regular: they cannot be described by regular expressions
- Consider the language of all strings that contain balanced parentheses, e.g. ((((()))))
- A DFA has only a finite number of states, so it cannot perform the unbounded counting this requires

Context-Free Grammars
A specification of the balanced-parenthesis language:
  S → ( S ) S
  S → ε
- The definition is recursive: this is a context-free grammar
- More expressive than regular expressions
Example derivation: S ⇒ (S) ε ⇒ ((S) S) ε ⇒ ((ε) ε) ε = (())
If a grammar generates a string, there is a derivation of that string using the productions of the grammar
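A grammar like this maps directly onto a recursive recognizer. The following is a minimal Python sketch of a recursive-descent recognizer for S → ( S ) S | ε; the function names are illustrative:

```python
# Recursive-descent recognizer for the grammar S -> ( S ) S | epsilon.
def parse_S(s, i=0):
    # Returns the index just past one expansion of S starting at position i.
    if i < len(s) and s[i] == '(':        # production S -> ( S ) S
        i = parse_S(s, i + 1)             # inner S
        if i >= len(s) or s[i] != ')':
            raise SyntaxError('expected )')
        return parse_S(s, i + 1)          # trailing S
    return i                              # production S -> epsilon

def balanced(s):
    try:
        return parse_S(s) == len(s)       # S must consume the whole string
    except SyntaxError:
        return False

print(balanced('(())'))  # True
print(balanced('(()'))   # False
```

Each production becomes a branch of the function, and the recursion in the grammar becomes recursion in the code; this is exactly what a DFA cannot do.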

Context-Free Grammar Terminology
- Terminals: tokens or ε
- Non-terminals: syntactic variables
- Start symbol: one designated non-terminal (S)
- Productions: specify how non-terminals may be expanded to form strings; the LHS is a single non-terminal, the RHS a string of terminals and/or non-terminals
- A vertical bar is shorthand for multiple productions:
  S → ( S ) S
  S → ε

Sum Grammar
  S → E + S | E
  E → number | ( S )
e.g. (1 + 2 + (3+4)) + 5
- 4 productions
- non-terminals: S, E
- terminals: number, +, (, )
- start symbol: S

Develop a Context-Free Grammar for …
1. a^n b^n c^n
2. a^m b^n c^(m+n)

Constructing a Derivation
- Start from the start symbol (S)
- Productions are used to derive a sequence of tokens from the start symbol
- For arbitrary strings α, β, γ and a production A → β, a single step of derivation is αAγ ⇒ αβγ, i.e., substitute β for an occurrence of A
- Example: (S + E) + E ⇒ (E + S + E) + E   (here A = S and β = E + S)

Derivation Example
Grammar:
  S → E + S | E
  E → number | ( S )
Derive (1+2+(3+4))+5:
  S ⇒ E + S ⇒ …

Derivation  Parse Tree • Tree representation of the derivation • Leaves of tree are terminals; in-order traversal yields string • Internal nodes: non-terminals • No information about order of derivation steps E + S ( S ) E E + S 5 1 E + S 2 E S→ E + S | E E →number | ( S ) (1+2+ (3+4))+5 ( S ) E + S E 3 4

Parse Tree vs. AST
- The parse tree (aka concrete syntax) for (1+2+(3+4))+5 contains every grammar symbol used in the derivation
- The abstract syntax tree keeps only the + operators and numbers: + at the root, with the subtrees for 1+2+(3+4) and 5 as children
- The AST discards/abstracts unneeded information

Derivation Order
- We can choose to apply productions in any order: select any non-terminal A and rewrite αAγ ⇒ αβγ
- Two standard orders, left-most and right-most, are useful for different kinds of automatic parsing
- Leftmost derivation: find the left-most non-terminal in the string and apply a production to it, e.g. E + S ⇒ 1 + S
- Rightmost derivation: always choose the rightmost non-terminal, e.g. E + S ⇒ E + E + S

Example
Grammar: S → E + S | E;  E → number | ( S )

Left-most derivation:
S ⇒ E+S ⇒ (S)+S ⇒ (E+S)+S ⇒ (1+S)+S ⇒ (1+E+S)+S ⇒ (1+2+S)+S ⇒ (1+2+E)+S ⇒ (1+2+(S))+S ⇒ (1+2+(E+S))+S ⇒ (1+2+(3+S))+S ⇒ (1+2+(3+E))+S ⇒ (1+2+(3+4))+S ⇒ (1+2+(3+4))+E ⇒ (1+2+(3+4))+5

Right-most derivation:
S ⇒ E+S ⇒ E+E ⇒ E+5 ⇒ (S)+5 ⇒ (E+S)+5 ⇒ (E+E+S)+5 ⇒ (E+E+E)+5 ⇒ (E+E+(S))+5 ⇒ (E+E+(E+S))+5 ⇒ (E+E+(E+E))+5 ⇒ (E+E+(E+4))+5 ⇒ (E+E+(3+4))+5 ⇒ (E+2+(3+4))+5 ⇒ (1+2+(3+4))+5

Same parse tree: same productions chosen, different order

Associativity
- In the example grammar, the left-most and right-most derivations produced identical parse trees
- The + operator associates to the right in the parse tree regardless of derivation order
(Diagram: the parse tree of (1+2+(3+4))+5, with the + nodes nested to the right)

Another Example
Let's derive the string x - 2 * y

#  Production rule
1  expr → expr op expr
2       | number
3       | identifier
4  op   → +
5       | -
6       | *
7       | /

Rule  Sentential form
 -    expr
 1    expr op expr
 3    <id,x> op expr
 5    <id,x> - expr
 1    <id,x> - expr op expr
 2    <id,x> - <num,2> op expr
 6    <id,x> - <num,2> * expr
 3    <id,x> - <num,2> * <id,y>

Left vs. Right Derivations
Two derivations of x - 2 * y

Left-most derivation:
Rule  Sentential form
 -    expr
 1    expr op expr
 3    <id,x> op expr
 5    <id,x> - expr
 1    <id,x> - expr op expr
 2    <id,x> - <num,2> op expr
 6    <id,x> - <num,2> * expr
 3    <id,x> - <num,2> * <id,y>

Right-most derivation:
Rule  Sentential form
 -    expr
 1    expr op expr
 3    expr op <id,y>
 6    expr * <id,y>
 1    expr op expr * <id,y>
 2    expr op <num,2> * <id,y>
 5    expr - <num,2> * <id,y>
 3    <id,x> - <num,2> * <id,y>

Right-Most Derivation
Problem: evaluates as (x - 2) * y
(Parse tree: * at the root, with the subtree for x - 2 on the left and y on the right)

Rule  Sentential form
 -    expr
 1    expr op expr
 3    expr op <id,y>
 6    expr * <id,y>
 1    expr op expr * <id,y>
 2    expr op <num,2> * <id,y>
 5    expr - <num,2> * <id,y>
 3    <id,x> - <num,2> * <id,y>

Left-Most Derivation
Solution: evaluates as x - (2 * y)
(Parse tree: - at the root, with x on the left and the subtree for 2 * y on the right)

Rule  Sentential form
 -    expr
 1    expr op expr
 3    <id,x> op expr
 5    <id,x> - expr
 1    <id,x> - expr op expr
 2    <id,x> - <num,2> op expr
 6    <id,x> - <num,2> * expr
 3    <id,x> - <num,2> * <id,y>

Impact of Ambiguity
- Different parse trees correspond to different evaluations!
- The meaning of the program is not defined
Example: for 1 + 2 * 3, the tree 1 + (2 * 3) evaluates to 7, while the tree (1 + 2) * 3 evaluates to 9

Derivations and Precedence
- Problem: two different valid derivations, and the shape of the tree implies its meaning
- One tree captures the semantics we want: precedence
- Can we express precedence in the grammar?
- Notice: operations deeper in the tree are evaluated first
- Idea: add an intermediate production that isolates the different levels of precedence, forcing higher precedence "deeper" in the grammar

Eliminating Ambiguity
- We can often eliminate ambiguity by adding non-terminals and allowing recursion only on the right or only on the left
  Exp  → Exp + Term | Term
  Term → Term * num | num
- The new non-terminal Term enforces precedence
- Left recursion yields left associativity
(Parse tree for 1 + 2 * 3: + at the root, with 1 on the left and the Term subtree for 2 * 3 on the right)
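A two-level grammar like this maps onto a simple evaluator. Since recursive descent cannot use the left-recursive rules directly, the Python sketch below replaces each left-recursive level with a loop, which preserves left associativity; the tokenizer and all names are illustrative:

```python
# Evaluator for: Exp -> Exp + Term | Term ;  Term -> Term * num | num
# Each left-recursive level becomes a loop over its operator.
def tokenize(s):
    return s.replace('+', ' + ').replace('*', ' * ').split()

def evaluate(s):
    toks, pos = tokenize(s), 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def factor():                 # num
        nonlocal pos
        v = int(toks[pos])
        pos += 1
        return v

    def term():                   # Term -> Term * num | num
        nonlocal pos
        v = factor()
        while peek() == '*':
            pos += 1
            v = v * factor()
        return v

    def exp():                    # Exp -> Exp + Term | Term
        nonlocal pos
        v = term()
        while peek() == '+':
            pos += 1
            v = v + term()
        return v

    return exp()

print(evaluate('1 + 2 * 3'))  # 7: * binds tighter because term() is called deeper
```

Because `exp` calls `term` and `term` calls `factor`, multiplications are grouped before additions, exactly as the grammar's levels dictate.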

Adding Precedence
A complete view:

#  Production rule
1  expr   → expr + term
2         | expr - term
3         | term
4  term   → term * factor
5         | term / factor
6         | factor
7  factor → number
8         | identifier

Observations:
- The larger grammar requires more rewriting to reach terminals
- It produces the same parse tree under both left and right derivations
- Level 1 (expr): lower precedence, higher in the tree
- Level 2 (term): higher precedence, deeper in the tree

Expression Example
Now the right-most derivation yields x - (2 * y)
(Parse tree: - at the root, with x on the left and the term subtree for 2 * y on the right)

Rule  Sentential form
 -    expr
 2    expr - term
 4    expr - term * factor
 8    expr - term * <id,y>
 6    expr - factor * <id,y>
 7    expr - <num,2> * <id,y>
 3    term - <num,2> * <id,y>
 6    factor - <num,2> * <id,y>
 8    <id,x> - <num,2> * <id,y>

Ambiguous Grammars
A grammar is ambiguous iff there are multiple leftmost or multiple rightmost derivations for a single sentential form
- Note: a leftmost and a rightmost derivation may differ, even in an unambiguous grammar
- Intuitively: we can choose different non-terminals to expand, but each non-terminal should lead to a unique set of terminal symbols
- Classic example: the if-then-else ambiguity

If-Then-Else
Grammar:

#  Production rule
1  stmt → if expr then stmt
2       | if expr then stmt else stmt
3       | …other statements…

Problem: nested if-then-else statements
- Each one may or may not have an else
- How do we match each else with its if?

If-Then-Else Ambiguity
The statement
  if expr1 then if expr2 then stmt1 else stmt2
has two parse trees:
- Production 1, then production 2: the else attaches to the inner if, i.e. if expr1 then (if expr2 then stmt1 else stmt2)
- Production 2, then production 1: the else attaches to the outer if, i.e. if expr1 then (if expr2 then stmt1) else stmt2

Removing Ambiguity
Restrict the grammar: choose the rule that an "else" matches the innermost "if", and codify it with new productions

#  Production rule
1  stmt     → if expr then stmt
2           | if expr then withelse else stmt
3           | …other statements…
4  withelse → if expr then withelse else withelse
5           | …other statements…

Intuition: when we have an "else", all preceding nested conditions must have an "else"

Limits of CFGs
Syntactic analysis can't catch all "syntactic" errors
- Example (C++): HashTable<Key,Value> x; parses as a declaration only if HashTable is known to name a template
- Example (Fortran): in x = f(y), f(y) may be either a function call or an array element reference

Big Picture
Scanners:
- Based on regular expressions
- Efficient for recognizing token types
- Remove comments and white space
- Cannot handle complex structure
Parsers:
- Based on context-free grammars
- More powerful than REs, but still have limitations
- Less efficient
Type and semantic analysis:
- Based on attribute grammars and type systems
- Handles "context-sensitive" constructs

Roadmap
So far: context-free grammars, precedence, ambiguity, and the derivation of strings
Parsing: start with the string and discover a derivation
Two major approaches:
- Top-down: start at the top (the start symbol) and work towards the terminals
- Bottom-up: start at the terminals and assemble them into a tree