Com Functional Programming
Lexical Analysis
Marian Gheorghe
Lecture 15
Module homepage Mole &
©University of Sheffield, com2010
Recognisers and Translators

17.1 Finite State Machine (FSM)
17.2 Translator
17.3 Parser
Words

For a given set S we may define sequences of symbols over S. More precisely, for S = {s_1, …, s_n}, x = x_1…x_p is a sequence of symbols over S if x_k is from S for every k = 1..p. We denote by Seq(S) the set of all sequences over S.

If Letters is the alphabet {'a'..'z', 'A'..'Z'}, then the following are sequences of symbols over Letters (they belong to Seq(Letters)): “John”, “home”, “word”.
The sequences “word1” and “long_sentence” do not belong to Seq(Letters).
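The definition of Seq(S) can be checked mechanically. The following is a minimal sketch, not part of the lecture code: the names `Alphabet`, `letters` and `inSeq` are hypothetical, and the empty sequence is assumed to belong to Seq(S), which the definition above does not state explicitly.

```haskell
-- Hypothetical names, introduced only for this illustration.
type Alphabet = [Char]

-- The alphabet Letters = {'a'..'z', 'A'..'Z'}.
letters :: Alphabet
letters = ['a'..'z'] ++ ['A'..'Z']

-- x belongs to Seq(S) when every symbol of x is drawn from S.
-- Assumption: the empty sequence is accepted (all returns True on []).
inSeq :: Alphabet -> String -> Bool
inSeq s = all (`elem` s)
```

For instance, `inSeq letters "John"` holds, while `inSeq letters "word1"` fails because '1' is not in Letters.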
Languages

In general, not all sequences are of interest: within Seq(S) we identify a (proper) subset, called a language, whose members share some specific properties.

Example: the language of words starting with a capital letter. “John” and “Identifier” belong to it; “word” does not.

How do we specify, recognise and process a language? There are specific mechanisms for recognising words or sentences (regular expressions, automata, syntax diagrams, formal grammars) and for translating them into other things (extensions of the former).
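One way to specify the example language is as a membership predicate over strings. This is only a sketch: `startsWithCapital` is a hypothetical name, and it assumes that a word of the language consists of ASCII letters only, the first of which is a capital.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper)

-- Hypothetical predicate: the word is non-empty, starts with a capital
-- letter, and all remaining symbols are letters from Letters.
startsWithCapital :: String -> Bool
startsWithCapital []       = False
startsWithCapital (c : cs) = isAsciiUpper c && all isLetterSym cs
  where
    isLetterSym x = isAsciiUpper x || isAsciiLower x
```

A predicate says which words belong to the language, but not how to process them symbol by symbol; that is what the automata in the rest of the lecture provide.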
Finite State Machine (FSM). Example

[Figure: transition diagram of the example FSM]

0: initial state
3, 7: final states
a, b, c: transition labels
FSM - recogniser

Sequences accepted: aba, abcbab, ababbbbb, bac, bca, …
Sequences rejected: ab, abc, ba, …

An FSM recognises a language: all the sequences accepted by it (paths from the initial state to a final state).

We will use only deterministic FSMs.
FSM in Haskell (1)

    type SetOf a = [a]

    data Automaton = FSM (SetOf State)       -- set of states
                         (SetOf Label)       -- set of labels
                         (SetOf Transition)  -- set of transitions
                         InitialState        -- initial state
                         (SetOf State)       -- set of final states

For the example above: States = 0, 1, …, 7; Labels = a, b, c; Transitions as given by the transition diagram; Initial state = 0; Final states = 3, 7.
FSM in Haskell (2)

where the components are:

    type State = Int
    type Label = Char
    data Transition = Move State Label State
    type InitialState = State

    automatonEx :: Automaton
    automatonEx = FSM [0..7] ['a','b','c']
                      [Move 0 'a' 1, Move 1 'b' 2, Move 2 'c' 1, Move 2 'a' 3,
                       Move 3 'b' 3, Move 0 'b' 4, Move 4 'a' 5, Move 5 'c' 7,
                       Move 4 'c' 6, Move 6 'a' 7]
                      0 [3,7]
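Matching an input against an FSM

[Figure: transition diagram of the example FSM, with the path followed by the input highlighted]

“abcbab” is recognised by automatonEx: starting from the initial state 0, the input follows the states 1, 2, 1, 2, 3, 3 and stops in the final state 3.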
Selecting components of an FSM

The various components of an FSM are obtained through select functions:

    tr :: Automaton -> SetOf Transition   -- all transitions of an automaton
    tr (FSM _ _ t _ _) = t

    inState :: Transition -> State        -- input transition state
    inState (Move s _ _) = s

    outState :: Transition -> State       -- output transition state
    outState (Move _ _ s) = s

    label :: Transition -> Label          -- transition label
    label (Move _ x _) = x
Extracting transitions

All the transitions emerging from a state s and labelled with the same given symbol x:

    oneMove :: Automaton -> State -> Label -> SetOf Transition
    oneMove a s x = [t | t <- tr a, inState t == s, label t == x]

If the FSM is deterministic, how many elements are in such a list?
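The question about determinism can be explored with a small check. The sketch below restates the slide's types so that it compiles on its own; `isDeterministic` is a hypothetical helper, and it takes "deterministic" to mean that every (state, label) pair has at most one outgoing transition, so the list built by oneMove has zero or one element.

```haskell
type SetOf a = [a]
type State = Int
type Label = Char
data Transition = Move State Label State
data Automaton = FSM (SetOf State) (SetOf Label) (SetOf Transition) State (SetOf State)

-- Same result as the slide's oneMove, written with pattern matching.
oneMove :: Automaton -> State -> Label -> SetOf Transition
oneMove (FSM _ _ ts _ _) s x = [t | t@(Move p y _) <- ts, p == s, y == x]

-- Hypothetical helper: at most one transition per (state, label) pair.
isDeterministic :: Automaton -> Bool
isDeterministic a@(FSM states labels _ _ _) =
  and [length (oneMove a s x) <= 1 | s <- states, x <- labels]

-- The example automaton from the slides.
automatonEx :: Automaton
automatonEx = FSM [0..7] "abc"
  [ Move 0 'a' 1, Move 1 'b' 2, Move 2 'c' 1, Move 2 'a' 3, Move 3 'b' 3
  , Move 0 'b' 4, Move 4 'a' 5, Move 5 'c' 7, Move 4 'c' 6, Move 6 'a' 7 ]
  0 [3, 7]
```

Under this reading, `isDeterministic automatonEx` holds, whereas an automaton with both `Move 0 'a' 0` and `Move 0 'a' 1` would fail the check.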
FSM recogniser in Haskell

A recogniser that matches an input string against an FSM, starting from a state s, is recursively defined thus:

    recogniser :: Automaton -> State -> String -> State
    -- empty input: stay in the current state (added so that head is never applied to [])
    recogniser _ s [] = s
    recogniser a s xs
      -- 0 or > 1 transitions; returns a dummy state (-1)
      | length ts /= 1 = -1
      -- no further inputs; returns next state
      | tail_xs == [] = os
      -- still inputs to be processed
      | otherwise = recogniser a os tail_xs
      where ts = oneMove a s (head xs)
            tail_xs = tail xs
            os = outState (head ts)
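The recogniser returns a state, so deciding acceptance still requires comparing that state with the final states. A possible sketch follows; `run` and `accepts` are hypothetical names, `run` is a pattern-matching variant of the recogniser rather than the slide's exact code, and -1 is the dummy state mentioned on the slide.

```haskell
type State = Int
type Label = Char
data Transition = Move State Label State
data Automaton = FSM [State] [Label] [Transition] State [State]

-- Follow the unique transition for each symbol; return the dummy
-- state -1 when there are zero or several matching transitions.
run :: Automaton -> State -> String -> State
run _ s [] = s
run a@(FSM _ _ ts _ _) s (x : xs) =
  case [q | Move p y q <- ts, p == s, y == x] of
    [q] -> run a q xs
    _   -> -1

-- Hypothetical wrapper: accept when the run from the initial state
-- ends in one of the final states.
accepts :: Automaton -> String -> Bool
accepts a@(FSM _ _ _ q0 finals) xs = run a q0 xs `elem` finals

-- The example automaton from the slides.
automatonEx :: Automaton
automatonEx = FSM [0..7] "abc"
  [ Move 0 'a' 1, Move 1 'b' 2, Move 2 'c' 1, Move 2 'a' 3, Move 3 'b' 3
  , Move 0 'b' 4, Move 4 'a' 5, Move 5 'c' 7, Move 4 'c' 6, Move 6 'a' 7 ]
  0 [3, 7]
```

With these definitions, `accepts automatonEx "abcbab"` holds, while `accepts automatonEx "abc"` does not, because the run stops in state 1, which is not final.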
FSM Translator = Automaton with outputs

[Figure: transition diagram of the example FSM, with each transition labelled input/output: a/x, b/y, c/z]

Inputs: a, b, c; Outputs: x, y, z

“abcbab” is translated into “xyzyxy”
Automaton with outputs in Haskell

    data AutomatonO = FSMO (SetOf State)
                           (SetOf InputLabel)
                           (SetOf OutputLabel)
                           (SetOf Transition)
                           InitialState
                           (SetOf State)      -- set of final states

where

    type InputLabel = Char
    type OutputLabel = Char
    data Transition = Move State InputLabel OutputLabel State
FSM Translator in Haskell

A translator may be defined thus:

    translator :: AutomatonO -> (State, OutString) -> InString -> (State, OutString)

where InString and OutString are defined as String and denote the input and output strings, respectively.

In this case every equation defining translator contains tuples instead of states. The tuples are of the form (state, outSymbols), where outSymbols is a string collecting the output labels of the transitions taken so far.

Exercise. Define translator and the associated select functions.
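One possible answer to the exercise is sketched below; it is not the lecture's reference solution. The select functions `inLabel` and `outLabel`, the example value `automatonOEx` and the use of -1 as a dummy state on failure are assumptions. The example automaton also assumes that every transition labelled a outputs x, every b outputs y and every c outputs z.

```haskell
type SetOf a = [a]
type State = Int
type InputLabel = Char
type OutputLabel = Char
type InString = String
type OutString = String
type InitialState = State

data Transition = Move State InputLabel OutputLabel State

data AutomatonO = FSMO (SetOf State) (SetOf InputLabel) (SetOf OutputLabel)
                       (SetOf Transition) InitialState (SetOf State)

-- Select functions for an automaton with outputs.
tr :: AutomatonO -> SetOf Transition
tr (FSMO _ _ _ t _ _) = t

inState, outState :: Transition -> State
inState  (Move s _ _ _) = s
outState (Move _ _ _ s) = s

inLabel :: Transition -> InputLabel
inLabel (Move _ x _ _) = x

outLabel :: Transition -> OutputLabel
outLabel (Move _ _ y _) = y

oneMove :: AutomatonO -> State -> InputLabel -> SetOf Transition
oneMove a s x = [t | t <- tr a, inState t == s, inLabel t == x]

-- Consume one input symbol at a time, appending the output label of
-- the transition taken. Assumption: stop with dummy state -1 when
-- there is no unique transition.
translator :: AutomatonO -> (State, OutString) -> InString -> (State, OutString)
translator _ acc [] = acc
translator a (s, out) (x : xs) =
  case oneMove a s x of
    [t] -> translator a (outState t, out ++ [outLabel t]) xs
    _   -> (-1, out)

-- Assumed example: the slides' FSM with outputs a/x, b/y, c/z.
automatonOEx :: AutomatonO
automatonOEx = FSMO [0..7] "abc" "xyz" moves 0 [3, 7]
  where
    moves = [Move p i (out i) q | (p, i, q) <- arcs]
    arcs  = [ (0,'a',1), (1,'b',2), (2,'c',1), (2,'a',3), (3,'b',3)
            , (0,'b',4), (4,'a',5), (5,'c',7), (4,'c',6), (6,'a',7) ]
    out 'a' = 'x'
    out 'b' = 'y'
    out _   = 'z'
```

Under these assumptions, `translator automatonOEx (0, "") "abcbab"` evaluates to `(3, "xyzyxy")`, matching the translation shown for the example.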
From Translator to Scanners

The most basic level of a programming language definition is the lexical level. It consists of lexical units, or tokens. Examples:
- identifiers (oneMove, automatonEx)
- constants or literals: numeric (1, 235), alphanumeric (“string”)
- operators (+, -, *, /), delimiters (;), etc.

Another type of translator is defined by aggregating some inputs and sending them out in certain states. These translators are widely used to recognise lexical units and are called lexical analysers or scanners.
Lexical analyser as an FSM Translator

[Figure: transition diagram of the lexical analyser, with transitions labelled letter, digit, letterDigit, character, '{' and '-']

letter is any of 'a'..'z' or 'A'..'Z'; digit is any of '0'..'9'; letterDigit is either letter or digit; character is any acceptable character.
Recognising lexical units

For a string like “ident 453 Id7t”, the above automaton produces the lexical units ident and Id7t, which are recognised in the final state 1, and 453, which is recognised in the final state 2.

A comment is a sequence starting with '{-', ending with '-}' and containing any characters in between. It is recognised in the final state 6 and then discarded. For example, the string “34 {-comment-}” produces only one token, 34.

Important! To ease the process of recognising lexical units, we assume that tokens are always separated by spaces (' '). Consequently, every final state must have a transition to the initial state, 0, labelled by ' '.
Extended automaton

The following definition is an extension of that given for an FSM:

    data ExtAutomaton = EFSM (SetOf State)
                             (SetOf Label)
                             (SetOf Transition)
                             InitialState
                             (SetOf State)
                             (SetOf FinalStateType)  -- new!!

where

    type FinalStateType = (State, TokenUnit)
    type TokenUnit = (Int, String)

In our example only states 1 and 2 occur in the list of FinalStateType. The final state 6 is not in this list.
Translator

The translator, which is called a lexical analyser, uses a translation function defined thus:

    translation :: ExtAutomaton -> (State, SetOf TokenUnit) -> InputSequence -> String
                   -> (State, SetOf TokenUnit)

translation takes
- an extended automaton
- a tuple whose first component is a state (in general the initial state) and whose second component is a list of token units (in general empty)
- an input sequence of characters
- a string where the current lexical unit is collected; initially it is empty

It produces the last state, where the translation process stops, and the sequence of token units recognised.
Example

If the input string is "ident {-comment-} 346 lastIdent" then

    translation extAutomaton (0,[]) "ident {-comment-} 346 lastIdent" []
      ⇒ (1,[(1,"ident"),(2,"346"),(1,"lastIdent")])

So the translation stops in state 1, a final state in which lastIdent has been recognised, and produces the following token units:

    (1,"ident") (2,"346") (1,"lastIdent")
Translation function

The translation function is defined by the following algorithm:
- when the input string is empty, it stops by producing the current state and the list of token units
- otherwise (the input string is not empty):
  - if the character at the head of the input string is not ' ', it is added to the string collecting the current lexical unit, and translation resumes from the next state with the string collected so far and the rest of the input string
  - if the current character is ' ' and the previous state is in the list of FinalStateType, a token unit is recognised and added to the list of token units, and translation resumes from the next state with an empty string, where the next lexical unit will be collected, and the rest of the input string
  - otherwise (the current character is ' ' and the previous state is not in that list), the collected token is discarded, and translation resumes from the next state with an empty string, where the next lexical unit will be collected, and the rest of the input string
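The algorithm above can be sketched in Haskell as follows. Several points are assumptions rather than lecture material: the helpers `next` and `emit`, the concrete transitions of `extAutomaton` (in particular states 3 to 5 for comments and the ' ' loop on state 0), the token codes stored in the FinalStateType list, the use of -1 as a dummy state, and the choice to emit a pending token when the input ends in a token state. The last assumption is made so that an input without a trailing space still yields its final token.

```haskell
type SetOf a = [a]
type State = Int
type Label = Char
type InitialState = State
type InputSequence = String
type TokenUnit = (Int, String)
type FinalStateType = (State, TokenUnit)
data Transition = Move State Label State
data ExtAutomaton = EFSM (SetOf State) (SetOf Label) (SetOf Transition)
                         InitialState (SetOf State) (SetOf FinalStateType)

-- Next state on symbol x, or the dummy state -1 if no unique transition exists.
next :: ExtAutomaton -> State -> Label -> State
next (EFSM _ _ ts _ _ _) s x =
  case [q | Move p y q <- ts, p == s, y == x] of
    [q] -> q
    _   -> -1

-- Add the collected string w as a token if state s is in the
-- FinalStateType list; otherwise the collected string is discarded.
emit :: ExtAutomaton -> State -> String -> SetOf TokenUnit -> SetOf TokenUnit
emit (EFSM _ _ _ _ _ fts) s w toks =
  case lookup s fts of
    Just (code, _) | not (null w) -> toks ++ [(code, w)]
    _                             -> toks

translation :: ExtAutomaton -> (State, SetOf TokenUnit) -> InputSequence -> String
            -> (State, SetOf TokenUnit)
-- Assumption: a token still being collected at end of input is emitted.
translation a (s, toks) [] w = (s, emit a s w toks)
translation a (s, toks) (x : xs) w
  | x /= ' '  = translation a (next a s x, toks) xs (w ++ [x])
  | otherwise = translation a (next a s x, emit a s w toks) xs []

lettersL, digitsL, symbols :: [Char]
lettersL = ['a'..'z'] ++ ['A'..'Z']
digitsL  = ['0'..'9']
symbols  = lettersL ++ digitsL ++ "{}- "   -- assumed set of acceptable characters

-- Assumed reconstruction of the lexical analyser's transitions.
-- Limitation: a '-' inside a comment that is not followed by '}' is not handled.
extAutomaton :: ExtAutomaton
extAutomaton = EFSM [0..6] symbols ts 0 [1, 2, 6] [(1, (1, "")), (2, (2, ""))]
  where
    ts = [Move 0 c 1 | c <- lettersL]               -- identifier: first letter
      ++ [Move 1 c 1 | c <- lettersL ++ digitsL]    -- identifier: letterDigit loop
      ++ [Move 0 c 2 | c <- digitsL]                -- number: first digit
      ++ [Move 2 c 2 | c <- digitsL]                -- number: further digits
      ++ [Move 0 '{' 3, Move 3 '-' 4]               -- comment opener {-
      ++ [Move 4 c 4 | c <- symbols, c /= '-']      -- comment body
      ++ [Move 4 '-' 5, Move 5 '}' 6]               -- comment closer -}
      ++ [Move s ' ' 0 | s <- [0, 1, 2, 6]]         -- separators back to state 0
```

Under these assumptions, `translation extAutomaton (0,[]) "ident {-comment-} 346 lastIdent" []` evaluates to `(1,[(1,"ident"),(2,"346"),(1,"lastIdent")])`, and the comment, recognised in state 6, contributes no token because 6 is absent from the FinalStateType list.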