Compiler construction in4020 - course 2001/2002, week 1. Koen Langendoen, Delft University of Technology, The Netherlands
Goals: understand the structure of a compiler; understand how the components operate; understand the tools involved (scanner generator, parser generator, etc.). Understanding means: [theory] being able to read source code; [practice] being able to adapt/write source code.
Format: "werkcollege" (interactive lectures) + "practicum" (lab assignment). Lectures, 1 sp (study point): 14 x 2 hours of interactive lectures; book "Modern Compiler Design"; schedule: see blackboard; handouts: see blackboard. Assignment, 2 sp: groups of 2 students; modify reference compiler. Oral exam, 1 sp.
Homework: find a partner for the "practicum"; register your group (send e-mail to koen@pds.twi.tudelft.nl).
What is a compiler? program in some source language → compiler → executable code for target machine. [Speaker note: ask audience first.]
What is a compiler? program in some source language → compiler [front-end (analysis) → semantic representation → back-end (synthesis)] → executable code for target machine.
Why study compiler construction? curiosity; better understanding of programming language concepts; wide applicability (transforming "data" is very common); many useful data structures and algorithms; practical application of "theory". [Speaker note: ask audience first.]
Overview of lecture 1 [introduction]: compiler structure; exercise; 15 min. break; lexical analysis; exercise.
Compiler structure: program in some source language → front-end (analysis) → semantic representation → back-end (synthesis) → executable code for target machine. Several front-ends and several back-ends can share one semantic representation: with L front-ends and M back-ends, L+M modules give LxM compilers. [Speaker note: ask audience about disadvantages BEFORE next slide.]
Limitations of modular approach: performance (generic vs specific; loss of information); variations must be small (same programming paradigm; similar processor architecture).
Semantic representation: sits between front-end (analysis) and back-end (synthesis). heart of the compiler; intermediate code; linked lists of pseudo instructions; abstract syntax tree (AST).
AST example. Expression grammar:
    expression → expression '+' term | expression '-' term | term
    term       → term '*' factor | term '/' factor | factor
    factor     → identifier | constant | '(' expression ')'
Example expression: b*b - 4*a*c
parse tree: b*b - 4*a*c
    expression
        expression
            term
                term
                    factor
                        identifier 'b'
                '*'
                factor
                    identifier 'b'
        '-'
        term
            term
                term
                    factor
                        constant '4'
                '*'
                factor
                    identifier 'a'
            '*'
            factor
                identifier 'c'
[Speaker note: ask if they notice anything redundant in the parse tree. (What stands out?)]
AST: b*b - 4*a*c
    '-'
        '*'
            'b'
            'b'
        '*'
            '*'
                '4'
                'a'
            'c'
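The AST above can be sketched as a small C data structure. This is only an illustration: the `Node` layout and the helper names (`make_leaf`, `make_op`, `build_example`, `to_prefix`, `free_tree`) are assumptions made for this example and do not come from the course's reference compiler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout: a leaf holds the text of an identifier or
 * constant; an inner node holds an operator and two children. */
typedef struct Node {
    char op;                    /* '+', '-', '*' or '/'; 0 for a leaf */
    const char *leaf;           /* leaf text; NULL for an inner node */
    struct Node *left, *right;
} Node;

Node *make_leaf(const char *text) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) { exit(EXIT_FAILURE); }
    n->op = 0; n->leaf = text; n->left = NULL; n->right = NULL;
    return n;
}

Node *make_op(char op, Node *left, Node *right) {
    assert(op != 0 && strchr("+-*/", op) != NULL);
    Node *n = malloc(sizeof *n);
    if (n == NULL) { exit(EXIT_FAILURE); }
    n->op = op; n->leaf = NULL; n->left = left; n->right = right;
    return n;
}

/* Build the AST for b*b - 4*a*c by hand, mirroring the tree above. */
Node *build_example(void) {
    Node *b_times_b = make_op('*', make_leaf("b"), make_leaf("b"));
    Node *four_a    = make_op('*', make_leaf("4"), make_leaf("a"));
    Node *four_a_c  = make_op('*', four_a, make_leaf("c"));
    return make_op('-', b_times_b, four_a_c);
}

/* Append the tree in prefix notation to buf, which must already hold a
 * C string and be large enough for the result. */
void to_prefix(const Node *n, char *buf) {
    if (n->leaf != NULL) { strcat(buf, n->leaf); return; }
    size_t len = strlen(buf);
    buf[len] = '('; buf[len + 1] = n->op; buf[len + 2] = ' '; buf[len + 3] = '\0';
    to_prefix(n->left, buf);
    strcat(buf, " ");
    to_prefix(n->right, buf);
    strcat(buf, ")");
}

void free_tree(Node *n) {
    if (n == NULL) { return; }
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

Calling `to_prefix(build_example(), buf)` on an empty buffer yields `(- (* b b) (* (* 4 a) c))`, which makes the left-associative nesting of `4*a*c` visible without any of the term/factor chain nodes of the parse tree.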
annotated AST: b*b - 4*a*c
    '-'  (type: real, loc: reg1)
        '*'  (type: real, loc: reg1)
            'b'  (type: real, loc: sp+16)
            'b'  (type: real, loc: sp+16)
        '*'  (type: real, loc: reg2)
            '*'  (type: real, loc: reg2)
                '4'  (type: real, loc: const)
                'a'  (type: real, loc: sp+8)
            'c'  (type: real, loc: sp+24)
[Slide note: colors denote types of the nodes: identifier, constant, term, expression.]
AST exercise (5 min.). Expression grammar:
    expression → expression '+' term | expression '-' term | term
    term       → term '*' factor | term '/' factor | factor
    factor     → identifier | constant | '(' expression ')'
Example expression: b*b - (4*a*c)
Draw the parse tree and the AST.
Answers
answer parse tree: b*b - 4*a*c
    expression
        expression
            term
                term
                    factor
                        identifier 'b'
                '*'
                factor
                    identifier 'b'
        '-'
        term
            term
                term
                    factor
                        constant '4'
                '*'
                factor
                    identifier 'a'
            '*'
            factor
                identifier 'c'
answer parse tree: b*b - (4*a*c)
    expression
        expression
            term
                term
                    factor
                        identifier 'b'
                '*'
                factor
                    identifier 'b'
        '-'
        term
            factor
                '('
                expression  (subtree for 4*a*c, abbreviated on the slide)
                ')'
Break
front-end: from program text to AST. program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST.
front-end: from program text to AST (with generators). program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST. token description → scanner generator → lexical analysis; language grammar → parser generator → syntax analysis.
Lexical analysis: convert a stream of characters to a stream of tokens. What is a token? A sequence of characters with a semantic notion; see the language definition. Rule of thumb: two characters belong to the same token if inserting white space between them changes the meaning:
    digit = *ptr++ - '0';
    digit = *ptr+ + - '0';
lex-i-cal: of or relating to words or the vocabulary of a language as distinguished from its grammar and construction (Webster's Dictionary)
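The rule of thumb can be checked directly in C. The sketch below wraps the two statements from the slide in two functions; the function names and the pointer-to-pointer parameter are assumptions made for this illustration. With `++` as one token the pointer advances; with `+ +` the scanner sees a binary plus followed by a unary plus, so nothing is incremented.

```c
#include <assert.h>
#include <stddef.h>

/* "++" is a single token: read the digit, then advance the pointer. */
int read_digit_and_advance(const char **pp) {
    assert(pp != NULL && *pp != NULL);
    const char *ptr = *pp;
    int digit = *ptr++ - '0';
    *pp = ptr;                 /* report where ptr ended up */
    return digit;
}

/* "+ +" is two tokens: binary plus, then unary plus applied to -'0'.
 * The value is the same, but ptr is not incremented. */
int read_digit_no_advance(const char **pp) {
    assert(pp != NULL && *pp != NULL);
    const char *ptr = *pp;
    int digit = *ptr+ + - '0';
    *pp = ptr;                 /* unchanged */
    return digit;
}
```

Both functions return the same digit value for the same input; only the first moves the caller's pointer to the next character.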
Tokens. Attributes: type; lexeme; value; file position.
    typedef struct {
        int class;
        char *repr;
        file_pos position;
    } Token_Type;
Examples:
    type        lexeme
    IDENTIFIER  foo, t3, ptr
    NUMBER      15, 082, 666
    REAL        1.2, .002, 1e6
    IF          if
Non-tokens. White space: spaces, tabs, newlines. Comments:
    /* a C-style comment */
    // a C++ comment
Preprocessor directives:
    #include "lex.h"
    #define is_digit(d) ('0' <= (d) && (d) <= '9')
Q: what is special about the newline character? A: its representation depends on the operating system!
Regular expressions.
    Basic patterns          Matching
    x                       the character x
    .                       any character, usually except a newline
    [abcA-Z]                any of the characters a, b, c and the range A-Z
    Repetition operators
    R?                      an R or nothing (= optionally an R)
    R*                      zero or more occurrences of R
    R+                      one or more occurrences of R
    Composition operators
    R1 R2                   an R1 followed by an R2
    R1 | R2                 either an R1 or an R2
    Grouping
    ( R )                   R itself
Examples of regular expressions. An integer is a sequence of digits: [0-9]+ An identifier is a sequence of letters and digits; the first character must be a letter: [a-z][a-z0-9]*
Regular descriptions: structuring regular expressions by introducing named sub-expressions.
    letter           → [a-zA-Z]
    digit            → [0-9]
    letter_or_digit  → letter | digit
    identifier       → letter letter_or_digit*
Rule: define before use.
Exercise (5 min.). Write down regular descriptions for the following:
    An integral number is a non-empty sequence of digits, optionally followed by a letter denoting the base (b for binary and o for octal).
    A fixed-point number is an (optional) sequence of digits followed by a dot ('.') followed by a sequence of digits.
    An identifier is a sequence of letters and digits; the first character must be a letter. The underscore _ counts as a letter, but may not be used as the first or last character.
Answers (basic names first, so that every name is defined before use):
    letter                  → [a-zA-Z]
    digit                   → [0-9]
    underscore              → _
    base                    → [bo]
    dot                     → \.
    letter_or_digit         → letter | digit
    letter_or_digit_or_und  → letter_or_digit | underscore
    integral_number         → digit+ base?
    fixed_point_number      → digit* dot digit+
    identifier              → letter (letter_or_digit_or_und* letter_or_digit+)?
Gotcha: the dot character must be escaped by a backslash.
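One way to try out these answers is to expand the named sub-expressions by hand and feed the result to a regular-expression library. The sketch below assumes a POSIX system and uses `regcomp`/`regexec` from `<regex.h>` with extended syntax; the macro names and the helper `matches` are made up for this example, and the `^...$` anchors are added so that the whole string must match.

```c
#include <assert.h>
#include <regex.h>    /* POSIX, not part of ISO C */
#include <stddef.h>

/* The regular descriptions from the answers, expanded by hand. */
#define INTEGRAL_NUMBER    "^[0-9]+[bo]?$"
#define FIXED_POINT_NUMBER "^[0-9]*\\.[0-9]+$"
#define IDENTIFIER         "^[a-zA-Z]([a-zA-Z0-9_]*[a-zA-Z0-9]+)?$"

/* Return 1 if the whole text matches the extended regular expression,
 * 0 if it does not match or the pattern fails to compile. */
int matches(const char *pattern, const char *text) {
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return 0;
    }
    int status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

For instance, `matches(IDENTIFIER, "a_1")` succeeds while `matches(IDENTIFIER, "a_")` fails, because the optional group must end in a letter or digit. Note the doubled backslash in `FIXED_POINT_NUMBER`: the dot has to be escaped for the regular expression, and the backslash has to be escaped again for the C string literal.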
Lexical analysis: convert a stream of characters to a stream of tokens. Tokens are defined by a regular description. Tokens are demanded one-by-one by the syntax analyzer through get_next_token(). program text → lexical analyzer → tokens → syntax analyzer → AST.
interface
    extern Token_Type Token;
        /* Global variable that holds the current token. */

    void start_lex(void);
        /* Must be called before the first call to
         * get_next_token().
         */

    void get_next_token(void);
        /* Load the next token into the global
         * variable Token.
         */
Q: why a global variable? A: the syntax analyzer tries multiple alternatives + backwards compatibility (early C could not return structs).
lexical analysis by hand: read the complete program text into memory, for simplicity; this avoids buffering and arbitrary limits on variable-length tokens. get_next_token() dispatches on the next character, the one at position dot in the input. Example input: main() { printf("hello world\n");}
    void get_next_token(void) {
        int start_dot;

        skip_layout_and_comment();
        /* now we are at the start of a token or at end-of-file, so: */
        note_token_position();

        /* split on first character of the token */
        start_dot = dot;
        if (is_end_of_input(input_char)) {
            Token.class = EoF; Token.repr = "<EoF>"; return;
        }
        if (is_letter(input_char)) {recognize_identifier();}
        else if (is_digit(input_char)) {recognize_integer();}
        else if (is_operator(input_char) || is_separator(input_char)) {
            Token.class = input_char; next_char();
        }
        else {Token.class = ERRONEOUS; next_char();}

        Token.repr = input_to_zstring(start_dot, dot-start_dot);
    }
Q: why must next_char() be invoked on an erroneous token? A: to avoid an endless loop.
Character classification & token recognition
    #define is_end_of_input(ch)     ((ch) == '\0')
    #define is_layout(ch)           (!is_end_of_input(ch) && (ch) <= ' ')
    #define is_uc_letter(ch)        ('A' <= (ch) && (ch) <= 'Z')
    #define is_lc_letter(ch)        ('a' <= (ch) && (ch) <= 'z')
    #define is_letter(ch)           (is_uc_letter(ch) || is_lc_letter(ch))
    #define is_digit(ch)            ('0' <= (ch) && (ch) <= '9')
    #define is_letter_or_digit(ch)  (is_letter(ch) || is_digit(ch))
    #define is_underscore(ch)       ((ch) == '_')
    #define is_operator(ch)         (strchr("+-*/", (ch)) != NULL)
    #define is_separator(ch)        (strchr(";,(){}", (ch)) != NULL)

    void recognize_integer(void) {
        Token.class = INTEGER; next_char();
        while (is_digit(input_char)) {next_char();}
    }
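To see the pieces work together, here is a compact, self-contained sketch that combines get_next_token(), the classification macros and recognize_integer() into one compilable unit. Several parts are assumptions filled in for this illustration and are not taken from the slides: the enum values, an `int` offset in place of `file_pos`, `start_lex()` taking the program text as a parameter, the helper `copy_text()` standing in for input_to_zstring(), the body of `recognize_identifier()`, and the fact that only layout is skipped (comments are not handled).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Token classes; single-character tokens use their character code. */
enum { EoF = 256, IDENTIFIER, INTEGER, ERRONEOUS };

typedef struct {
    int class;
    char *repr;      /* heap-allocated lexeme; the caller may free() it */
    int position;    /* offset into the input (stand-in for file_pos) */
} Token_Type;

Token_Type Token;

static const char *input = "";   /* the complete program text */
static int dot = 0;              /* index of the current character */

#define input_char              (input[dot])
#define next_char()             (dot++)
#define is_end_of_input(ch)     ((ch) == '\0')
#define is_layout(ch)           (!is_end_of_input(ch) && (ch) <= ' ')
#define is_letter(ch)           (('A' <= (ch) && (ch) <= 'Z') || ('a' <= (ch) && (ch) <= 'z'))
#define is_digit(ch)            ('0' <= (ch) && (ch) <= '9')
#define is_letter_or_digit(ch)  (is_letter(ch) || is_digit(ch))
#define is_underscore(ch)       ((ch) == '_')
#define is_operator(ch)         (strchr("+-*/", (ch)) != NULL)
#define is_separator(ch)        (strchr(";,(){}", (ch)) != NULL)

/* Assumed helper: copy length characters into a fresh C string. */
static char *copy_text(const char *start, size_t length) {
    char *copy = malloc(length + 1);
    assert(copy != NULL);
    memcpy(copy, start, length);
    copy[length] = '\0';
    return copy;
}

void start_lex(const char *program_text) {
    input = program_text;
    dot = 0;
}

/* Assumed rule: letter, then letters/digits/underscores, no trailing '_'. */
static void recognize_identifier(void) {
    Token.class = IDENTIFIER; next_char();
    while (is_letter_or_digit(input_char) || is_underscore(input_char)) { next_char(); }
    while (is_underscore(input[dot - 1])) { dot--; }   /* give back trailing '_' */
}

static void recognize_integer(void) {
    Token.class = INTEGER; next_char();
    while (is_digit(input_char)) { next_char(); }
}

void get_next_token(void) {
    while (is_layout(input_char)) { next_char(); }     /* layout only, no comments */
    int start_dot = dot;
    Token.position = start_dot;

    if (is_end_of_input(input_char)) {
        Token.class = EoF;
        Token.repr = copy_text("<EoF>", 5);
        return;
    }
    if (is_letter(input_char)) { recognize_identifier(); }
    else if (is_digit(input_char)) { recognize_integer(); }
    else if (is_operator(input_char) || is_separator(input_char)) {
        Token.class = input_char; next_char();
    }
    else { Token.class = ERRONEOUS; next_char(); }     /* always make progress */

    Token.repr = copy_text(&input[start_dot], (size_t)(dot - start_dot));
}
```

A caller would invoke `start_lex("b*b - 42;")` once and then call `get_next_token()` repeatedly until `Token.class == EoF`, inspecting `Token.class` and `Token.repr` after each call. Because every branch consumes at least one character, the loop always terminates, which is the point of the question about erroneous tokens.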
Summary. A compiler is a structured toolbox: front-end: program text → annotated AST; back-end: annotated AST → executable code. Lexical analysis: program text → tokens; token specifications; implementation by hand. Exercises: AST; regular descriptions.
Next week: generating a lexical analyzer. Generic methods; specific tool: lex. program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST; token description → scanner generator → lexical analysis.
Homework: find a partner for the "practicum"; register your group (send e-mail to koen@pds.twi.tudelft.nl); print the handout for lecture 2 [blackboard].