SCANNING I.E. LEXICAL ANALYSIS I.E. LINEAR ANALYSIS

Interaction of Parser and Scanner
The source program in an HLL is a stream of characters read from left to right. The scanner reads characters from that stream and groups them into tokens, which are returned to the parser. Whitespace (blanks, tabs, returns) and comments are eliminated.

Benefits of Modularity (Scanning is a separate module)
Simplicity: separate input from the rest of the program, which deals only with tokens and not characters.
Efficiency: scanning is I/O intensive. Use buffering to improve efficiency.
Portability: (historical) different machines might have different keyboards: e.g. and up-arrow
Internationalisation, e.g.:
  ALGOL 68 in many languages: Russian, German, French, Bulgarian, Japanese
  LOGO (educational): English, French, Italian
  Mama (educational): English, Hebrew, Yiddish, Chinese
  M4 macro processor: English, German
  Scratch (educational): 40+ languages
  Perl: Klingon (different grammar as well)

What are tokens?
The most basic meaningful objects in a computer program: sequences of characters which form the low-level constructs of the HLL (e.g. variable names, keywords, labels, operators).
Examples:
A[12,index2] = getstuff(23);
B[66,88] = getmorestuff(11);

Token Structure
Early on (in the parser) we need to recognize the validity of the structure, so the value of identifiers does not matter.
Tokens have 2 fields: type (compulsory) and value (optional).
Vocabulary: the type is often called “token”; the string being scanned (the instance) is called a “lexeme”.
Example:
Token    Type         Value
25       constant
hello    identifier
if       IF
+        PLUS
-        MINUS
< >      Comparison
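The two-field token structure described above can be sketched as a small record. This is a minimal illustration, not code from the slides: the names `Token`, `type`, and `value`, and the values attached to the sample tokens, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    """A token with a compulsory type and an optional value."""
    type: str                                 # e.g. "constant", "identifier", "IF"
    value: Optional[Union[int, str]] = None   # only some token types carry a value


# Hypothetical tokens for some of the lexemes listed above (values are assumed).
tokens = [
    Token("constant", 25),         # lexeme "25"
    Token("identifier", "hello"),  # lexeme "hello"
    Token("IF"),                   # keyword: the type alone is enough
    Token("PLUS"),                 # operator: no value needed
]
```

The parser can check structure by looking only at `type`; later phases use `value` where one exists.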

How to write a scanner
Brute-force scanning (e.g. the section that handles identifiers starting with ‘c’)
Problems?
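The code from this slide did not survive in the transcript, so the following is only a sketch of what brute-force scanning might look like. It assumes the keywords are "class" and "case", that identifiers contain only letters, and that the function name and token type names are hypothetical.

```python
def scan_word_starting_with_c(text, pos=0):
    """Brute-force recognition of a word starting with 'c': one nested test per character.

    Returns (token_type, lexeme).
    """
    def peek(i):
        # Character at index i, or "" past the end of the input.
        return text[i] if i < len(text) else ""

    if peek(pos) != "c":
        raise ValueError("expected a word starting with 'c'")

    i = pos + 1
    if peek(i) == "l":
        if peek(i + 1) == "a":
            if peek(i + 2) == "s":
                if peek(i + 3) == "s":
                    if not peek(i + 4).isalpha():
                        return ("CLASS", text[pos:i + 4])
    elif peek(i) == "a":
        if peek(i + 1) == "s":
            if peek(i + 2) == "e":
                if not peek(i + 3).isalpha():
                    return ("CASE", text[pos:i + 3])

    # Not a keyword: consume the remaining letters as an ordinary identifier.
    end = i
    while peek(end).isalpha():
        end += 1
    return ("identifier", text[pos:end])
```

Every new keyword adds another ladder of nested tests, and a similar section is needed for every other starting letter, which is one way to see why this approach does not scale.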

How to write a scanner
Hand-coded Finite State Automaton (FSA) or Transition Diagram
EXERCISE: Draw the FSA for the example above. Assume the FSA recognizes "class", "case" and other identifiers which contain only letters.

Finite State Automata
A Finite State Automaton (FSA) A consists of 5 objects:
  A set I, called the input alphabet, of input symbols;
  A set S of states the automaton can be in;
  A designated state s0, called the initial state;
  A designated set of states, called the set of accepting states, or final states;
  A next-state function N: S×I → S that associates a “next state” to each ordered pair consisting of a “current state” and a “current input”.
For each state s in S and input symbol m in I, N(s,m) is the state to which A goes if m is input to A when A is in state s.

FSA: Transition Diagrams
The operation of an FSA is commonly described by a diagram called a (state-)transition diagram. In a transition diagram, states are represented by circles, and accepting states by double circles. There is one arrow that points to the initial state and other arrows between states as follows: There is an arrow from state s to state t labeled m (∈I) iff N(s,m)=t.

FSA: Next State and Eventual-State
The next-state table is a tabular representation of the next-state function. In the annotated next-state table, the initial state is indicated by an arrow and the accepting states by double circles. The eventual-state function of A is the function N*: S×I* → S defined as: for any state s of S and any input string w in I*, N*(s,w) = the state to which A goes if the symbols of w are input into A in sequence starting when A is in state s.
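As an illustration of the next-state table and the eventual-state function, here is a small sketch. The automaton (two states over I = {"0", "1"}, accepting strings that end in "1") and all identifiers are hypothetical and chosen only for this example.

```python
from typing import Dict, Set, Tuple

# A small hypothetical FSA over I = {"0", "1"} that accepts strings ending in "1".
STATES: Set[str] = {"s0", "s1"}
INITIAL = "s0"
ACCEPTING: Set[str] = {"s1"}

# Next-state table: N[(current state, input symbol)] -> next state.
N: Dict[Tuple[str, str], str] = {
    ("s0", "0"): "s0", ("s0", "1"): "s1",
    ("s1", "0"): "s0", ("s1", "1"): "s1",
}


def eventual_state(state: str, w: str) -> str:
    """N*(state, w): feed the symbols of w to the automaton one at a time."""
    for symbol in w:
        state = N[(state, symbol)]
    return state
```

Note that N*(s, ε) = s: with no input, the automaton stays where it is.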

How to write a scanner
Hand-coded Finite State Automaton (FSA) or Transition Diagram
How to code an FSA?
EXERCISE: CODE FSA

How to write a scanner
Need:
  isFinal(state) function
  NextState[state,c] table
  TokenState[state]
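One possible way to put these three pieces together is sketched below. It is an assumption-laden illustration, not the solution from the lecture: it recognizes "class", "case" and other lowercase-letter identifiers, assumes tokens are delimited by whitespace, and the state numbering and Python names (`next_state`, `token_state`, `is_final`, `scan`) are invented for the example.

```python
import string

START, ID = 0, 9
LETTERS = string.ascii_lowercase

# NextState[state, c]: by default every letter leads to the generic identifier state...
next_state = {(s, c): ID for s in range(10) for c in LETTERS}
# ...except along the keyword paths c-l-a-s-s (states 1..5) and c-a-s-e (states 1, 6, 7, 8).
keyword_paths = {
    (0, "c"): 1, (1, "l"): 2, (2, "a"): 3, (3, "s"): 4, (4, "s"): 5,
    (1, "a"): 6, (6, "s"): 7, (7, "e"): 8,
}
next_state.update(keyword_paths)

# TokenState[state]: the token type reported for each accepting state.
token_state = {s: "identifier" for s in range(1, 10)}
token_state[5] = "CLASS"
token_state[8] = "CASE"


def is_final(state):
    """isFinal(state): accepting states are exactly those with a token type."""
    return state in token_state


def scan(text):
    """Run the FSA on each whitespace-delimited lexeme and return (type, lexeme) pairs."""
    tokens = []
    for lexeme in text.split():
        state = START
        for c in lexeme:
            if (state, c) not in next_state:
                raise ValueError(f"lexical error at {c!r} in {lexeme!r}")
            state = next_state[(state, c)]
        if not is_final(state):
            raise ValueError(f"lexical error: incomplete token {lexeme!r}")
        tokens.append((token_state[state], lexeme))
    return tokens
```

The driver loop never mentions a specific keyword: changing the language means changing only the tables.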

How to write a scanner
Question 1: What about lexical errors?

START HERE

How to write a scanner
Question 2: What if tokens are not delimited by whitespace?
EXERCISE: add the “<”, “<=”, and “=” tokens to the language.

Returning character to stream
Need: buffering to read and put back characters.
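The need to put characters back can be illustrated with a short sketch. Everything here is hypothetical (the `PushbackReader` class, the `scan_operators` function, and the token type names LT, LE, EQ); it only shows why telling "<" from "<=" requires reading one character too many and returning it to the stream.

```python
class PushbackReader:
    """Character stream with a small buffer for characters that are put back."""

    def __init__(self, text):
        self._chars = iter(text)
        self._pushed = []

    def read(self):
        """Return the next character, or "" at end of input."""
        if self._pushed:
            return self._pushed.pop()
        return next(self._chars, "")

    def unread(self, c):
        """Put a character back so the next read() returns it again."""
        if c:  # nothing to put back at end of input
            self._pushed.append(c)


def scan_operators(text):
    """Scan letter-only identifiers and the tokens "<", "<=", "=" without needing whitespace."""
    reader = PushbackReader(text)
    tokens = []
    while True:
        c = reader.read()
        if c == "":
            return tokens
        if c.isspace():
            continue
        if c == "<":
            nxt = reader.read()
            if nxt == "=":
                tokens.append(("LE", "<="))
            else:
                reader.unread(nxt)  # the lookahead belongs to the next token
                tokens.append(("LT", "<"))
        elif c == "=":
            tokens.append(("EQ", "="))
        elif c.isalpha():
            lexeme = c
            nxt = reader.read()
            while nxt.isalpha():
                lexeme += nxt
                nxt = reader.read()
            reader.unread(nxt)  # first non-letter ends the identifier
            tokens.append(("identifier", lexeme))
        else:
            raise ValueError(f"lexical error at {c!r}")
```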

How to write a scanner
Question 3: How to indicate whether to consume the last char?

How to write a scanner
Question 4: How to make NextState efficient?

How to write a scanner
Question 5: How to optimize for groups of characters?
E.g. groups of letters for identifiers, groups of digits for numeric constants, groups of characters for string constants.
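One common answer to Questions 4 and 5, sketched here under assumptions of my own, is to index the next-state table by character class instead of by individual character. The class names, state names, and functions below are hypothetical.

```python
def char_class(c):
    """Map a character to its class; characters outside any group stand for themselves."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return c


# Next-state table indexed by (state, character class) rather than (state, character):
# five entries replace what would otherwise be one entry per letter and per digit.
NEXT_STATE = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
    ("start", "digit"): "in_num",
    ("in_num", "digit"): "in_num",
}


def run(lexeme):
    """Return the state reached on lexeme, or None if some transition is missing."""
    state = "start"
    for c in lexeme:
        state = NEXT_STATE.get((state, char_class(c)))
        if state is None:
            return None
    return state
```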

How to write a scanner
Question 6: Who is going to create such FSAs and associated tables?

Kleene’s Theorem
A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language.

Formal Languages – Alphabets and Strings
An alphabet Σ is a finite set of characters (or symbols). A word, or sequence, or string over Σ is any group of 0 or more consecutive characters of Σ. The length of a word is the number of characters in the word. The null string is the string of length 0. It is denoted ε or λ. A string of length n is really an ordered n-tuple of characters written without parentheses or commas. Given two strings x and y over Σ, the concatenation of x and y is the string xy obtained by putting all the characters of y right after x.

Formal Languages – Languages over alphabet
Let Σ be an alphabet. A formal language over Σ is a set of strings over Σ.
∅ is the empty language (over Σ)
Σ^n = {all strings over Σ that have length n}, where n∈N
Σ+ = the positive closure of Σ = {all strings over Σ that have length ≥ 1}
Σ* = the Kleene closure of Σ = {all strings over Σ}
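For a concrete feel for these sets, the sketch below enumerates Σ^n and a finite slice of Σ* for a small alphabet. The function names are hypothetical; Σ* itself is infinite, so only strings up to a chosen length are generated.

```python
from itertools import product


def strings_of_length(sigma, n):
    """Σ^n: all strings over the alphabet sigma that have length exactly n."""
    return {"".join(chars) for chars in product(sorted(sigma), repeat=n)}


def strings_up_to(sigma, max_len):
    """A finite slice of Σ*: the union of Σ^0, Σ^1, ..., Σ^max_len."""
    result = set()
    for n in range(max_len + 1):
        result |= strings_of_length(sigma, n)
    return result
```

With Σ = {a, b}, Σ^0 contains only the null string and Σ^2 = {aa, ab, ba, bb}; dropping the null string from the slice gives the corresponding slice of Σ+.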

Formal Languages – Operation on Languages
Let Σ be an alphabet. Let L and L′ be two languages defined over Σ. The following operations define new languages over Σ:
The concatenation of L and L′, denoted LL′, is LL′ = {xy | x∈L ∧ y∈L′}
The union of L and L′, denoted L∪L′, is L∪L′ = {x | x∈L ∨ x∈L′}
The Kleene closure of L, denoted L*, is L* = {x | x is a concatenation of any finite number of strings in L}. Note that ε∈L*.
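These three operations can be tried out on finite languages represented as Python sets of strings. This is only a sketch with hypothetical function names; because L* is usually infinite, the closure is truncated at a maximum string length.

```python
def concat(L1, L2):
    """LL': every string of L1 followed by every string of L2."""
    return {x + y for x in L1 for y in L2}


def union(L1, L2):
    """L ∪ L': strings that belong to L1 or to L2."""
    return L1 | L2


def kleene_closure(L, max_len):
    """The strings of L* whose length is at most max_len."""
    result = {""}          # ε is always in L* (concatenation of zero strings)
    frontier = {""}
    while frontier:
        # Extend the most recently found strings by one more string of L.
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result
```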

Regular Expressions - Definition
Let Σ be an alphabet. The following are regular expressions (r.e.) over Σ:
I. BASE: ε and each individual symbol of Σ are regular expressions.
II. RECURSION: if r and s are regular expressions over Σ, then the following are also regular expressions over Σ:
  (rs)     the concatenation of r and s
  (r | s)  r or s
  (r*)     the Kleene closure of r
III. RESTRICTION: The only regular expressions over Σ are the ones defined by I and II above.

Regular Expressions – Operator Precedence
The order of precedence of r.e. operators is, from highest to lowest:
  Highest: ()
           *
           concatenation
  Lowest:  |
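Python's `re` module follows the same precedence for these operators, so it can be used to check how an unparenthesized expression is grouped. In this small sketch (the helper name `matches` is hypothetical), `ab*|c` behaves like `(a(b*))|c`, not like `(ab)*|c` or `a(b*|c)`.

```python
import re


def matches(pattern, w):
    """True iff the whole string w is described by the regular expression pattern."""
    return re.fullmatch(pattern, w) is not None


samples = ["", "a", "abbb", "c", "abab", "ac"]

# The implicit grouping agrees with the fully parenthesized form on every sample.
same_as_explicit = all(
    matches(r"ab*|c", w) == matches(r"(a(b*))|c", w) for w in samples
)
```

The strings "abab" and "ac" are the telling cases: the first would be accepted if concatenation bound tighter than *, the second if | bound tighter than concatenation.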

REs – Languages defined by REs
Let Σ be an alphabet. Define a function L as follows:
L: {all r.e.'s over Σ} → {all languages over Σ}
L(r) = the language defined by r
I. BASE: L(ε) = {ε}, and ∀a∈Σ, L(a) = {a}
II. RECURSION: If L(r) and L(s) are the languages defined by the regular expressions r and s over Σ, then
  L(rs) = L(r)L(s)
  L(r|s) = L(r) ∪ L(s)
  L(r*) = (L(r))*
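The recursive definition of L translates almost line for line into a recursive function. The sketch below assumes a hypothetical encoding of regular expressions as nested tuples, ("eps",), ("sym", a), ("cat", r, s), ("alt", r, s), ("star", r), and returns only the strings of L(r) up to a given length, since L(r) may be infinite.

```python
def language(r, max_len):
    """The strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "eps":                      # L(ε) = {ε}
        return {""}
    if kind == "sym":                      # L(a) = {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                      # L(r|s) = L(r) ∪ L(s)
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":                      # L(rs) = L(r)L(s)
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                     # L(r*) = (L(r))*
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression node: {kind!r}")


# (a|b)*a : all strings over {a, b} that end in a
a, b = ("sym", "a"), ("sym", "b")
ends_in_a = ("cat", ("star", ("alt", a, b)), a)
```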

REs – Languages defined by REs
Variations: some definitions of regular expressions and regular languages define ∅ to be an r.e. with L(∅) = ∅.

Properties of REs
Regular expressions can be simplified by applying the following properties. For any regular expressions r, s, t:
Axiom                                   Description
r | s = s | r                           | is commutative
r | (s | t) = (r | s) | t = r | s | t   | is associative
(rs)t = r(st) = rst                     Concatenation is associative
r(s|t) = rs | rt and (s|t)r = sr | tr   Concatenation is distributive over |
rε = εr = r                             ε is the identity element for concatenation
r** = r*                                * is idempotent
r* = (r|ε)*
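Some of these properties can be spot-checked mechanically by comparing two expressions on every string up to a small length. This sketch uses Python's `re` module with concrete symbols a, b, c standing in for r, s, t; the function name and default parameters are assumptions, and agreement on a bounded set of strings is evidence rather than a proof.

```python
import re
from itertools import product


def equivalent(r1, r2, alphabet="abc", max_len=4):
    """Compare two regular expressions on every string over alphabet up to max_len."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            if (p1.fullmatch(w) is None) != (p2.fullmatch(w) is None):
                return False   # w is in one language but not the other
    return True


commutative = equivalent("a|b", "b|a")
associative = equivalent("(ab)c", "a(bc)")
distributive = equivalent("a(b|c)", "ab|ac")
not_commutative_concat = not equivalent("ab", "ba")
```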

REs – Notational Shorthands
Here are some frequent constructs which have their own notation:
(r)+ means one or more instances of r. L((r)+) = (L(r))+
(r)? means 0 or 1 instances of r, i.e. (r)? = r|ε. L((r)?) = L(r|ε) = L(r) ∪ L(ε) = L(r) ∪ {ε}
Character classes: [abc] = a|b|c and [a-z] = a|b|…|z

REs – Regular Definitions
Regular expressions can be broken down into regular definitions: sequences of expressions of the form
d_1 → r_1
…
d_n → r_n
where each d_i is a distinct name and each r_i is a regular expression over symbols in Σ ∪ {d_1, d_2, …, d_{i-1}}

REs – Examples
The regular expression
Identifier = [A-Za-z][A-Za-z0-9]*
can be broken down into the regular definitions
letter → [A-Za-z]
digit → [0-9]
identifier → letter (letter | digit)*
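The same regular definitions can be mimicked with Python's `re` module by substituting earlier definitions into later ones. This is a sketch only; the dictionary `definitions` and the helper `is_identifier` are hypothetical names.

```python
import re

# Each definition may refer only to names defined before it.
definitions = {}
definitions["letter"] = "[A-Za-z]"
definitions["digit"] = "[0-9]"
definitions["identifier"] = "{letter}({letter}|{digit})*".format(**definitions)

IDENTIFIER = re.compile(definitions["identifier"])


def is_identifier(s):
    """True iff the whole string s is described by the identifier definition."""
    return IDENTIFIER.fullmatch(s) is not None
```

After substitution, `definitions["identifier"]` is `[A-Za-z]([A-Za-z]|[0-9])*`, which describes the same language as `[A-Za-z][A-Za-z0-9]*`.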

REs and Scanning
Why regular expressions for scanning?

Regular Languages and FSA
Let A be an FSA with set of input symbols I. Let w be a string of I*. Then w is accepted by A iff N*(s0,w) is an accepting state.
The language accepted by A, denoted L(A), is the set of all strings that are accepted by A:
L(A) = {w∈I* | N*(s0,w) is an accepting state of A}
Kleene’s Theorem: A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language.
Theorem 1: Some languages are not regular.
Theorem 2: The set of regular languages over an alphabet I is closed under the complement, union and intersection operators.
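Acceptance, and the complement part of Theorem 2, can be illustrated with a short sketch. The dictionary layout, the function names, and the example automaton (strings over {0, 1} ending in "1") are assumptions made for this example. Swapping accepting and non-accepting states yields the complement here because the next-state function is total, as in the definition of an FSA used in these slides.

```python
def accepts(fsa, w):
    """True iff N*(s0, w) is an accepting state of the automaton."""
    state = fsa["initial"]
    for symbol in w:
        state = fsa["next"][(state, symbol)]
    return state in fsa["accepting"]


def complement(fsa):
    """An FSA for the complement language: same transitions, accepting states swapped."""
    return {**fsa, "accepting": fsa["states"] - fsa["accepting"]}


# Hypothetical FSA over I = {"0", "1"} accepting the strings that end in "1".
ENDS_IN_1 = {
    "states": {"s0", "s1"},
    "initial": "s0",
    "accepting": {"s1"},
    "next": {
        ("s0", "0"): "s0", ("s0", "1"): "s1",
        ("s1", "0"): "s0", ("s1", "1"): "s1",
    },
}
```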

How to write a scanner
Question 7: Practically speaking, how to translate r.e.'s into FSAs?

RE → Transition Diagram
EXAMPLES

STOP HERE

JavaCC
Web tab 1: Assignment 1. Look at the assignment description.
OV2-5 JavaCC Scanning
Web tab 2: the handout
ssh: cd ~cps710/public_html/term/A1/Lecture
Structure: directory structure
ssh: ls HL.jj
Page 2 of handout: look at the makefile
ssh: run javacc (1st command in makefile), look at the .java files
ssh + page 1 of handout: finish compilation (2nd command in makefile)
Use the run program to run interactively
Use the run program to redirect input from testfile
Look at the .jj structure: pages 3-4 of handout
Special states: 2 scanners, one for Java and one for Javadoc, + 2 parsers
Sometimes regular expressions are not enough: need a PDA in addition to an FSA (e.g. comments), so you will need to write extra code.
More: explain it with strings
Conflict resolution rules
