CS252: Systems Programming
Ninghui Li
Topic 4: Regular Expressions and Lexical Analysis


slide 2 Compiler Frontend Steps
Lexical analyzer/scanner: converts a sequence of characters into a sequence of tokens. For example, (inc 13) becomes 4 tokens: (, inc, 13, and ).
Parser/syntactic analysis: analyzes a sequence of tokens to determine the grammatical structure.
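
The scanning step can be sketched by hand in C. This is only an illustration, not the lab's scanner (which Lex/Flex generates); the name tokenize_sketch and the MAX_TOKEN_LEN limit are assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_TOKEN_LEN 32   /* assumed limit, chosen for the sketch */

/* Splits `input` into tokens: "(" and ")" are single-character tokens;
 * any other run of characters that are neither whitespace nor
 * parentheses is one token. Writes at most `max_tokens` tokens into
 * `tokens` and returns how many were written. */
size_t tokenize_sketch(const char *input, char tokens[][MAX_TOKEN_LEN],
                       size_t max_tokens)
{
    size_t count = 0;
    const char *p = input;

    while (*p != '\0' && count < max_tokens) {
        if (isspace((unsigned char)*p)) {
            p++;                          /* whitespace only separates tokens */
        } else if (*p == '(' || *p == ')') {
            tokens[count][0] = *p++;
            tokens[count][1] = '\0';
            count++;
        } else {
            size_t len = 0;
            while (*p != '\0' && !isspace((unsigned char)*p) &&
                   *p != '(' && *p != ')') {
                if (len < MAX_TOKEN_LEN - 1) {
                    tokens[count][len++] = *p;  /* over-long tokens are truncated */
                }
                p++;
            }
            tokens[count][len] = '\0';
            count++;
        }
    }
    return count;
}
```

Calling tokenize_sketch on "(inc 13)" yields the four tokens listed on the slide. Unlike the Lex rules in fiz.l, this sketch does not classify tokens (INC, NUMBER, and so on); it only finds their boundaries.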

slide 3 Brief Description of the Lab
Part 1: Implement FIZ without user-defined functions (50%), due Feb 9.
Part 2: Implement user-defined functions (50%), due Feb 16.
Part 2 is significantly harder than Part 1. Do not wait until the last week.

slide 4 Using Lex/Flex with YACC/Bison

slide 5 Files Provided: fiz.l
"inc"           { return INC; }
"("             { return OPENPAR; }
")"             { return CLOSEPAR; }
0|[1-9][0-9]*   { yylval.number_val = atoi(yytext); return NUMBER; }
[ \t\n]         { /* Discard spaces, tabs, and newlines */ }
.               { printf("Syntax error. Did not recognize %s\n", yytext); }

slide 6 Files Provided: fiz.y
/*******************************************************
 * Section 1: Definition of tokens and non-terminals   *
 *******************************************************/
%token <number_val> NUMBER
%token INC OPENPAR CLOSEPAR
%type <node_val> expr
%union {
    char *string_val;
    int number_val;
    struct TREE_NODE *node_val;
}
The NUMBER token has a number_val value.
The three tokens INC, OPENPAR, and CLOSEPAR have no value.
A parsed expr has an associated pointer to a node in an abstract syntax tree.
The %union declaration defines the union of value types associated with each token or non-terminal when parsing.

slide 7 Files Provided: fiz.y
/**************************************************
 * Section 3: Grammar production rules            *
 **************************************************/
goal: statements;
statements: statement | statement statements;
statement: expr {
    err_value = 0;
    resolve($1, NULL);
    if (err_value == 0) {
        printf("%d\n", eval($1, NULL));
    }
    prompt();
};
The code shown in red on the slide is currently unnecessary; it is needed once user-defined functions are implemented. The code shown in green evaluates the expression.
$1 refers to the AST node associated with the 1st element in the grammar rule, namely expr.

slide 8 Abstract Syntax Tree
An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in that it does not represent every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by a single node with three branches.

slide 9 Abstract Syntax Tree: An Example
IFZ_NODE
    ARG_NAME strValue = "y"
    ARG_NAME strValue = "x"
    FUNC_CALL name = "add"
        INC_NODE
            ARG_NAME strValue = "x"
        DEC_NODE
            ARG_NAME strValue = "y"
The above is an AST for (ifz y x (add (inc x) (dec y))), the body of the function (add x y). Consider how evaluating (add 4 1) would work.
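
One way to trace (add 4 1) is to rewrite the function body as a recursive C function. The sketch below assumes semantics the slides do not state: that (ifz c a b) evaluates to a when c is zero and to b otherwise, and that inc and dec add and subtract one. The name add_sketch is made up for the example.

```c
/* Hypothetical C rendering of the FIZ function body
 * (ifz y x (add (inc x) (dec y))), under the assumed semantics
 * described above. */
int add_sketch(int x, int y)
{
    if (y == 0) {
        return x;                       /* ifz: y is zero, so the result is x */
    }
    return add_sketch(x + 1, y - 1);    /* (add (inc x) (dec y)) */
}
```

Under these assumptions, (add 4 1) first finds y = 1, which is nonzero, so it evaluates (add 5 0); there y is zero and the result is x = 5.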

slide 10 Grammar for expr
expr: OPENPAR INC expr CLOSEPAR {
    struct TREE_NODE *node = (struct TREE_NODE *) malloc(sizeof(struct TREE_NODE));
    node->type = INC_NODE;
    node->first_arg = $3;
    $$ = node;
}
The above production rule (grammar rule) parses (inc expr). It creates a node in the abstract syntax tree, sets its type to INC_NODE, and stores the tree node for the inner expr in first_arg, since this is the first (and only) argument of inc.
$3 refers to the value associated with the 3rd element in the rule body, i.e., expr.
$$ refers to the value associated with expr on the left-hand side.

slide 11 Continuing grammar for expr
| NUMBER {
    struct TREE_NODE *node = (struct TREE_NODE *) malloc(sizeof(struct TREE_NODE));
    node->type = NUMBER_NODE;
    node->intValue = $1;
    $$ = node;
};
The above production rule (grammar rule) parses a number into an expr. It creates a node in the abstract syntax tree, sets its type to NUMBER_NODE, and stores the integer value in the intValue field.
$1 refers to the value associated with the 1st element in the rule body, i.e., NUMBER.
$$ refers to the value associated with expr on the left-hand side.

slide 12 What Happens from Parsing?
Input: (inc (inc 1))
Becomes tokens: OPENPAR INC OPENPAR INC NUMBER CLOSEPAR CLOSEPAR
This is parsed into a statement in the following steps:
    statement: expr
    expr: OPENPAR INC expr CLOSEPAR
    expr: OPENPAR INC NUMBER CLOSEPAR
Resulting tree:
INC_NODE
    INC_NODE
        NUMBER_NODE intValue = 1
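
The tree for (inc (inc 1)) can also be built and evaluated by hand, which may help when reading the grammar actions. The following is a minimal sketch, not the lab's code: the field names type, intValue, and first_arg come from the slides, but the enum, new_node, eval_sketch, build_example, and free_tree are hypothetical, and it assumes inc adds one to its argument. The lab's eval takes a second argument that this sketch leaves out.

```c
#include <stdlib.h>

/* Assumed node kinds; the lab may define these differently. */
enum node_type { NUMBER_NODE, INC_NODE };

struct TREE_NODE {
    enum node_type type;
    int intValue;                  /* used by NUMBER_NODE */
    struct TREE_NODE *first_arg;   /* used by INC_NODE */
};

/* Allocates and fills a node; returns NULL if malloc fails. */
static struct TREE_NODE *new_node(enum node_type type, int value,
                                  struct TREE_NODE *arg)
{
    struct TREE_NODE *node = malloc(sizeof *node);
    if (node != NULL) {
        node->type = type;
        node->intValue = value;
        node->first_arg = arg;
    }
    return node;
}

/* Recursively evaluates a non-NULL tree, assuming inc adds one. */
int eval_sketch(const struct TREE_NODE *node)
{
    switch (node->type) {
    case NUMBER_NODE:
        return node->intValue;
    case INC_NODE:
        return eval_sketch(node->first_arg) + 1;
    }
    return 0;   /* not reached for the two node types above */
}

/* Builds the AST for (inc (inc 1)) bottom-up, innermost node first,
 * which is the order in which the parser actions would run.
 * Allocation failures are not handled in this sketch. */
struct TREE_NODE *build_example(void)
{
    struct TREE_NODE *one = new_node(NUMBER_NODE, 1, NULL);
    struct TREE_NODE *inner = new_node(INC_NODE, 0, one);
    return new_node(INC_NODE, 0, inner);
}

/* Frees a tree built by build_example. */
void free_tree(struct TREE_NODE *node)
{
    if (node != NULL) {
        free_tree(node->first_arg);
        free(node);
    }
}
```

Under the stated assumption, evaluating the tree returned by build_example gives 3: the NUMBER_NODE yields 1 and each INC_NODE adds one.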

slide 13 Regular Expressions: Tool for Lexical Analyzer
Regular expression: a notation to specify a pattern that matches a set of strings.
A regular expression can be:
    a          a single character
    R1|R2      matches anything that matches either R1 or R2
    (R)        matches the same thing as R
    [abcde]    any of the five letters listed there, i.e., a|b|c|d|e
    [0-9]      any digit

slide 14 Regular Expressions
    R1R2       matches a string s if s is the concatenation s1s2, where s1 matches R1 and s2 matches R2
               E.g., [abcde][0-9] matches one of the letters a through e followed by a digit, such as a0 or e9
    R*         repeats the regular expression R zero or more times
               E.g., [0-9]* matches the empty string and any digit sequence
    R+         repeats R one or more times
               Equivalent to the regular expression RR*
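
These operators can be tried out from C with the POSIX regex functions (regcomp, regexec, regfree). This is a sketch under two assumptions: a POSIX system is available, and POSIX extended syntax stands in for the notation on the slides, which it matches for the operators used here. The helper name re_matches is made up. Because regexec searches for a match anywhere in the text, the patterns are anchored with ^ and $ so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` contains a match for `pattern` (POSIX extended
 * syntax), 0 if it does not, and -1 if the pattern fails to compile. */
int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

For example, re_matches("^[abcde][0-9]$", "a0") reports a match while the same pattern rejects "f0", and "^[0-9]*$" accepts the empty string where "^[0-9]+$" does not.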

slide 15 RE Syntax in Lex/Flex
    ‘x’          matches the character 'x'
    ‘.’          any character (byte) except newline
    ‘[xyz]’      a character class; in this case, the pattern matches either an 'x', a 'y', or a 'z'
    ‘[abj-oZ]’   a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
    ‘[^A-Z]’     a "negated character class", i.e., any character but those in the class; in this case, any character EXCEPT an uppercase letter

slide 16 RE Syntax in Lex/Flex
    ‘[^A-Z\n]’          any character EXCEPT an uppercase letter or a newline
    ‘[a-z]{-}[aeiou]’   the lowercase consonants
    ‘r*’                zero or more r's, where r is any regular expression
    ‘r+’                one or more r's
    ‘r?’                zero or one r's (that is, "an optional r")
    ‘r{2,5}’            anywhere from two to five r's
    ‘r{2,}’             two or more r's
    ‘r{4}’              exactly 4 r's

slide 17 RE Syntax in Lex/Flex
    ‘{name}’          the expansion of the ‘name’ definition
    ‘"[xyz]\"foo"’    the literal string: ‘[xyz]"foo’
    ‘(r)’             match an ‘r’; parentheses are used to override precedence
    ‘rs’              the regular expression ‘r’ followed by the regular expression ‘s’; called concatenation
    ‘r|s’             either an ‘r’ or an ‘s’
    ‘^r’              an ‘r’, but only at the beginning of a line
    ‘r$’              an ‘r’, but only at the end of a line

slide 18 Examples
Regular expression for a non-negative integer:
    Is [0-9]* correct? Only if leading zeros such as 007 are acceptable; note that [0-9]* also matches the empty string. 0|[1-9][0-9]* is better.
Regular expression for an identifier:
    Rule 1: An identifier consists of letters and digits.
    Rule 2: The first character of any identifier must be a letter.
    How to write the regular expression? [a-zA-Z][a-zA-Z0-9]*
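
The difference between the two integer patterns can be checked from C with the POSIX regex functions. This sketch assumes a POSIX system and uses POSIX extended syntax as a stand-in for Lex notation; the helper full_match and its 256-byte buffer are choices made for the example. It wraps each pattern as ^(pattern)$ so that the entire string has to match, which is what matters when deciding whether a string is a valid integer or identifier.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the whole of `text` matches `pattern`, 0 if it does not,
 * and -1 if the pattern is too long or fails to compile. */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, status;

    /* Parenthesize before anchoring so that ^ and $ apply to every
     * alternative of a pattern such as 0|[1-9][0-9]*. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) {
        return -1;
    }
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

With this helper, [0-9]* accepts both "007" and the empty string, 0|[1-9][0-9]* accepts "0" and "13" but rejects both of those, and [a-zA-Z][a-zA-Z0-9]* accepts "x1" but rejects "1x".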

slide 19 More Questions on RE
How to write a regular expression that matches comments, assuming that comments are defined as anything between ; and the end of the line?
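
One candidate answer in Lex/Flex notation is ;.* since ‘.’ matches any character except newline, so the match stops at the end of the line. The same scan can be written by hand in C; the sketch below is only an illustration, and the name comment_length is made up.

```c
#include <stddef.h>

/* If `p` points at ';', returns the number of characters in the comment:
 * the ';' plus everything up to, but not including, the newline or the
 * end of the string. Returns 0 if `p` does not start a comment. */
size_t comment_length(const char *p)
{
    size_t len = 0;

    if (p[0] != ';') {
        return 0;
    }
    while (p[len] != '\0' && p[len] != '\n') {
        len++;
    }
    return len;
}
```

A scanner could call this at each position and skip the returned number of characters, in the same way that the whitespace rule in fiz.l discards its match.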

slide 20 Review
Able to write simple regular expressions to match strings.
Given a regular expression, able to tell which strings it matches and which it does not.