CS30003: Compilers Lexical Analysis Lecture Date: 05/08/13 Submission By: DHANJIT DAS, 11CS10012
What are Lexemes?
Before looking at lexical analysis, let's briefly define a lexeme.
■ A lexeme is a sequence of characters that can be grouped together because it matches a specific pattern.
■ A pattern is a description of the form that lexemes can take.
■ Example: if var < tmp*6
What are the lexemes here?
Find lexemes: if var < tmp*6
if ← keyword
var ← identifier
< ← operator (relational)
tmp ← identifier
* ← operator (arithmetic)
6 ← constant
● Note: The spaces are discarded. Most compilers strip whitespace during lexical analysis.
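The breakdown above can be reproduced with a small regular-expression scanner. This is a minimal sketch, not a production lexer: the category names, the `KEYWORDS` set, and the `find_lexemes` helper are assumptions made for illustration only.

```python
import re

# Hypothetical categories for this example; a real language defines many more.
KEYWORDS = {"if"}
LEXEME_RE = re.compile(r"""
    (?P<word>[A-Za-z_][A-Za-z0-9_]*)   # keywords and identifiers
  | (?P<constant>\d+)                  # integer constants
  | (?P<operator>[<>*+\-/=])           # single-character operators
  | (?P<space>\s+)                     # whitespace, discarded below
""", re.VERBOSE)

def find_lexemes(source):
    """Return (lexeme, category) pairs, dropping whitespace."""
    result = []
    pos = 0
    while pos < len(source):
        m = LEXEME_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "space":
            continue  # spaces separate lexemes but are not lexemes themselves
        if kind == "word":
            kind = "keyword" if m.group() in KEYWORDS else "identifier"
        result.append((m.group(), kind))
    return result

print(find_lexemes("if var < tmp*6"))
```

Note that `tmp*6` is split into three lexemes even though it contains no spaces: whitespace is only one of several ways lexemes are delimited.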
Tokens, Patterns and Lexemes
● Generally, there is a set of strings in the input for which the same token is produced as output.
● A pattern is a rule that matches each string in this set.
● A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
● So, 'if' ← lexeme ; 'keyword' ← token ; the character 'i' followed by 'f' ← pattern
Tokens                Sample Lexemes          Patterns (informal description)
enum for identifier   count, flag, var        letter followed by letters and digits
num                   3.1416, 2, 0            a numeric constant
literal               “segmentation fault”    any characters between two quotation marks

Source code is a collection of lexemes. The set of valid lexemes, and the patterns they follow, is defined by the programming language.
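The informal patterns in the table can be written down as regular expressions. The sketch below is one possible reading of them; the short token names `id`, `num`, and `literal` and the `token_for` helper are assumptions for this example, and a real language would refine each pattern (exponents, escape sequences, and so on).

```python
import re

# One possible formalisation of the informal patterns (assumed, not definitive).
PATTERNS = {
    "id": re.compile(r"[A-Za-z][A-Za-z0-9]*"),  # letter followed by letters and digits
    "num": re.compile(r"\d+(\.\d+)?"),          # a numeric constant
    "literal": re.compile(r'"[^"]*"'),          # any characters between two quotation marks
}

def token_for(lexeme):
    """Return the name of the first pattern that matches the whole lexeme, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(lexeme):
            return name
    return None

for sample in ["count", "3.1416", '"segmentation fault"']:
    print(sample, "->", token_for(sample))
```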
Token Tuple
● From lexemes we construct tokens.
● A token is a tuple of two elements, although it may have only one: {token_name, attribute}
   ■ token_name: symbolic representation of a specific lexeme
   ■ attribute: optional
● Example: when 'if' is identified, 'token_name' is set to 'if'; keywords carry no attribute.
● When the lexical analyser encounters a lexeme, it generates the token_name and fills in the attribute, which refers to details such as the name and type held in the symbol table.
● The attribute points to the entry in the symbol table, or to a location in memory.
● Numeric constants: the token can be represented in three ways → ■ ■ ← where “ptr” is a pointer to the number stored in memory
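The token tuple and its link to the symbol table can be sketched as follows. The `Token` and `SymbolTable` classes, the `make_token` helper, and the use of a list index as the "pointer" are all assumptions for illustration; the lecture does not prescribe a particular data structure.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    token_name: str
    attribute: Optional[int] = None  # index of the symbol-table entry, if any

class SymbolTable:
    """Hypothetical symbol table: a list of entries plus a name -> index map."""
    def __init__(self):
        self.entries = []
        self.index = {}

    def lookup_or_insert(self, name):
        if name not in self.index:
            self.index[name] = len(self.entries)
            self.entries.append({"name": name, "type": None})  # type filled in later
        return self.index[name]

KEYWORDS = {"if"}

def make_token(lexeme, table):
    if lexeme in KEYWORDS:
        return Token(lexeme)  # keywords carry no attribute
    return Token("id", table.lookup_or_insert(lexeme))

table = SymbolTable()
print([make_token(lexeme, table) for lexeme in ["if", "var", "tmp", "var"]])
```

The second occurrence of `var` receives the same attribute as the first, because both refer to the same symbol-table entry.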
Lexical Analyser and Parser Relationship
● The lexical analyser does not read the entire source code in one go.
● Produced tokens are held in a buffer until they are consumed by the parser.
● The lexical analyser cannot proceed when the buffer is full, and the parser cannot proceed when the buffer is empty.
[Figure: Source Code → Lexical Analyser → Parser]
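The full/empty behaviour of the token buffer is the classic bounded-buffer arrangement. The sketch below models it with Python's standard `queue.Queue` and two threads; the function names, the `END` sentinel, and the buffer size of 2 are assumptions chosen for the example, and pre-split lexemes stand in for real tokens.

```python
import queue
import threading

END = None  # sentinel marking end of input (an assumption of this sketch)

def lexical_analyser(lexemes, buffer):
    for lexeme in lexemes:
        buffer.put(lexeme)    # blocks while the buffer is full
    buffer.put(END)

def parser(buffer, consumed):
    while True:
        token = buffer.get()  # blocks while the buffer is empty
        if token is END:
            break
        consumed.append(token)

def run(lexemes, buffer_size=2):
    buffer = queue.Queue(maxsize=buffer_size)
    consumed = []
    producer = threading.Thread(target=lexical_analyser, args=(lexemes, buffer))
    consumer = threading.Thread(target=parser, args=(buffer, consumed))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return consumed

print(run(["if", "var", "<", "tmp", "*", "6"]))
```

Even with a buffer much smaller than the input, the parser receives every token in order, because each side simply waits for the other.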
[Figure: Lexical Analyser, Symbol Table, Parser; the parser sends “get next token” and the lexical analyser returns a token]
● This scheme is commonly implemented by making the lexical analyser a subroutine of the parser.
● Upon receiving a “get next token” command from the parser, the lexical analyser reads input characters until it can identify the next token.
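The subroutine arrangement can be sketched as a class whose `get_next_token` method is called by the parser on demand. The class name, the method name, and the simplified character rules (letters and digits form words, digits form numbers, anything else is a one-character token) are assumptions for illustration; the "parser" here only collects tokens.

```python
class LexicalAnalyser:
    def __init__(self, source):
        self.source = source
        self.pos = 0

    def get_next_token(self):
        """Read characters until the next token is identified; None at end of input."""
        src = self.source
        while self.pos < len(src) and src[self.pos].isspace():
            self.pos += 1                       # skip whitespace
        if self.pos >= len(src):
            return None
        start = self.pos
        if src[start].isalpha():
            while self.pos < len(src) and src[self.pos].isalnum():
                self.pos += 1                   # word: letter, then letters/digits
        elif src[start].isdigit():
            while self.pos < len(src) and src[self.pos].isdigit():
                self.pos += 1                   # number: run of digits
        else:
            self.pos += 1                       # any other single character
        return src[start:self.pos]

def parse(source):
    """Stand-in for a parser: it only asks for tokens, one at a time."""
    lexer = LexicalAnalyser(source)
    tokens = []
    token = lexer.get_next_token()
    while token is not None:
        tokens.append(token)
        token = lexer.get_next_token()
    return tokens

print(parse("if var < temp*6"))
```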
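if var < temp*6
The lexical analyser first reads “if”, matches it as a keyword, and generates the token.
● NOTE: The next character must be read as well. Example: in ifex = 5, ifex is not a keyword, and treating its first two characters as the keyword “if” just because no space follows would be an error. The analyser should therefore scan the next character before deciding.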
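One common way to handle the ifex case is to keep reading while the next character can still extend the word, and only then decide between keyword and identifier. The sketch below assumes a hypothetical `scan_word` helper and a two-entry keyword set.

```python
KEYWORDS = {"if", "while"}  # assumed keyword set for this example

def scan_word(source, begin):
    """Scan the longest word starting at `begin`.

    Returns (token_name, lexeme, index of the first character after the lexeme).
    """
    forward = begin
    # Keep looking at the next character: the word ends only when it cannot be extended.
    while forward < len(source) and (source[forward].isalnum() or source[forward] == "_"):
        forward += 1
    lexeme = source[begin:forward]
    token_name = "keyword" if lexeme in KEYWORDS else "identifier"
    return token_name, lexeme, forward

print(scan_word("if var < temp*6", 0))
print(scan_word("ifex = 5", 0))
```

Deciding only after the whole word has been read means "if" yields a keyword token, while "ifex" yields a single identifier rather than a keyword followed by "ex".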
● The lexical analyser reads one data block at a time from the source code.
● What is a data block? A block is a sequence of bytes or bits with a nominal length (the block size). Data structured this way are said to be blocked.
● Blocking makes the data stream easier to handle for the program receiving it, in this case the lexical analyser.
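Block-wise reading can be sketched with a generator that pulls a fixed number of characters at a time. The `read_blocks` name and the block sizes are assumptions for this example; an in-memory `io.StringIO` stands in for a source file, and the tiny block size of 4 is used only to make the blocks visible.

```python
import io

def read_blocks(stream, block_size=4096):
    """Yield successive blocks of at most `block_size` characters from `stream`."""
    while True:
        block = stream.read(block_size)
        if not block:   # empty string signals end of input
            break
        yield block

source = io.StringIO("if var < tmp*6")
for block in read_blocks(source, block_size=4):
    print(repr(block))
```

A block boundary can fall in the middle of a lexeme (here "var" is split across the first two blocks), which is one reason the analyser tracks positions in a buffer rather than treating each block independently.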
Forward and Begin Pointers
● Two pointers into the input buffer are maintained.
● The string of characters between the two pointers is the current lexeme.
● Forward pointer: scans ahead until a match for a pattern is found. Once the lexeme is found, the forward pointer is set to the character immediately to its right.
● Begin pointer: marks the beginning of the current lexeme being matched.
[Figure: input buffer containing “while”, with the begin pointer and the forward pointer marked]
“while” is the string between the begin and forward pointers. Once “while” is matched against the symbol table, the token can be generated. The next character also needs to be scanned.
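The two-pointer scheme can be sketched as a loop over an in-memory buffer. The `scan` function and its simplified rule (a run of letters and digits is one lexeme, any other non-space character is a lexeme on its own) are assumptions for illustration; a real analyser would apply the language's actual patterns.

```python
def scan(buffer):
    """Return (begin, forward, lexeme) triples; the lexeme is buffer[begin:forward]."""
    lexemes = []
    begin = 0
    while begin < len(buffer):
        if buffer[begin].isspace():
            begin += 1                # whitespace is not part of any lexeme
            continue
        forward = begin + 1
        if buffer[begin].isalnum():
            # Advance forward while the next character still extends the lexeme.
            while forward < len(buffer) and buffer[forward].isalnum():
                forward += 1
        lexemes.append((begin, forward, buffer[begin:forward]))
        begin = forward               # the next lexeme starts where this one ended
    return lexemes

for begin, forward, lexeme in scan("while (x)"):
    print(begin, forward, lexeme)
```

After each match, the forward pointer rests on the character just to the right of the lexeme, and the begin pointer is moved there to start the next one.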
END OF THIS LECTURE Date: 05/08/13