410/510 1 of 31 Week 1 – Lecture 1 Introduction The Textbook Assessment Overview Compiler Construction
The Big Picture
In this course we will be constructing a compiler!
Moving from a High Level Language (HLL) to a Low Level Language (LLL)
Compilers are complex programs
– more than 10,000 lines of code
Integrate aspects from many different areas of CS
– Formal language theory, algorithms, data structures, HLL & LLL (obviously), user interaction (error reporting)

What is a compiler?
A specialization of a language translator
[Diagram: L1 → L2; Source → Target; e.g. C → x86 processor]
Usually in CS:
– the Source is a high level programming language
– the Target is machine code for a microprocessor

Applications of Compiler Techniques
Potential Source languages include:
– Natural languages (English, French, …)
– Circuit layout languages
– Mark-up languages (HTML, XML, …)
– Command line languages (SQL interface)
Potential Target languages include:
– Natural languages
– Printer drivers
– Markup languages
e.g. an HTML to RTF converter
– Could involve many of the aspects we will cover in compiler construction

Compilers for Programming Languages
Source Languages: C, Prolog, Java, Lisp, Haskell, C++, C#, Fortran, Pascal, Sather
Target Languages: x86 (MMX), JVM, PowerPC 750 (G3), ARM, SPARC, AMD K6
If we had 1 compiler for each {Source, Target} pair then we would have a lot of compilers: the 10 sources and 6 targets listed here alone would need 10 × 6 = 60.

Modularity for Code Generation
[Diagram: one Source is compiled to an Intermediate Representation, from which separate Compilers generate code for x86, ARM and G4]
Compiler portability (man gcc lists the different target machines)

Modularity for Source Languages?
[Diagram: Sources (C, Java, Prolog) → Compilers → Intermediate Representation → Targets]
Typically compilers only compile one source language, but the techniques used are very similar and are shared across different compilers

Typical Compiler
[Diagram: Source → Front-end (Analysis) → Intermediate Representation → Back-end (Synthesis) → Target]
[Course timeline: front-end from now, back-end from week 6]
Ideally, the Intermediate Representation is independent of the Source and Target languages
For a new Source language we can add a new front-end to an existing back-end
For a new Target language we can add a new back-end to an existing front-end

Front End
[Diagram: Source program → Lexical analyser → Syntax analyser → Semantic analyser, with the Symbol table and Error Handler alongside]
Knowledge about the source language
– Lexical structure (tokens)
– Syntax
  Programming constructs: conditionals, iteration etc.
– Semantics
  Type checking
Error-reporting
– UI component
  Often basic (and unhelpful!)
  May vary if part of an IDE or standalone

Lexical Analysis
Example program:
int max = 20, x;
read(x);
if ( x <= max ) print('ok'); else print('too big');
Lexical tasks the compiler has to perform:
– group together the 3 characters ‘max’ to form the single variable identifier max
– group together the 2 characters ‘<=’ to form the single relational operator <= (less than or equal to)

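The grouping step described on this slide can be sketched in C. This is only an illustration under stated assumptions: the names TokenKind, Token and next_token are hypothetical, keywords such as if are not distinguished from identifiers, and names longer than 31 characters are split. In the course itself the scanner would normally be generated by lex / flex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, for illustration only. */
typedef enum { TOK_IDENT, TOK_NUMBER, TOK_LE, TOK_SYMBOL, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* the characters grouped into this token */
} Token;

/* Read one token starting at *src and advance *src past it. */
Token next_token(const char **src)
{
    assert(src != NULL && *src != NULL);
    const char *p = *src;
    Token t = { TOK_EOF, "" };
    size_t n = 0;

    while (isspace((unsigned char)*p))   /* skip white space */
        p++;

    if (*p == '\0') {
        /* end of input: t is already TOK_EOF */
    } else if (isalpha((unsigned char)*p)) {
        /* group letters and digits into one identifier, e.g. max */
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) && n < sizeof t.text - 1)
            t.text[n++] = *p++;
    } else if (isdigit((unsigned char)*p)) {
        /* group digits into one number, e.g. 20 */
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p) && n < sizeof t.text - 1)
            t.text[n++] = *p++;
    } else if (p[0] == '<' && p[1] == '=') {
        /* group the 2 characters < and = into one operator */
        t.kind = TOK_LE;
        t.text[n++] = *p++;
        t.text[n++] = *p++;
    } else {
        /* anything else is a single-character symbol */
        t.kind = TOK_SYMBOL;
        t.text[n++] = *p++;
    }
    t.text[n] = '\0';
    *src = p;
    return t;
}
```

Calling next_token repeatedly on the text if ( x <= max ) yields the tokens if, (, x, <=, max and ), followed by TOK_EOF.
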
Syntactic Analysis
Recognise the if … then … else structure
Group the x <= max into a single expression with a relational operator
Recognise the format of the variable declaration list
– Such that x is correctly declared to be an int
Loops, program blocks (begin … end)
Arithmetic expressions, etc.

Semantic analysis
Check that x <= max is a sensible thing to do
– If x was a boolean and max a string then we would have a type error
Check that the ‘20’ is in fact an integer and so can be assigned to an int
And also (can be split over several phases)
– Keep a note of all the variables used, so that every use of a variable refers to the same value (in memory)

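A minimal sketch, in C, of the two checks on this slide: keeping a note of declared variables and checking that x <= max is sensible. The names Type, Symbol, declare, lookup and check_le are hypothetical, and the fixed-size table is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TYPE_INT, TYPE_BOOL, TYPE_STRING, TYPE_ERROR } Type;

/* One entry per declared variable. The name is not copied,
   so the string must outlive the table. */
typedef struct {
    const char *name;
    Type type;
} Symbol;

#define MAX_SYMBOLS 16
static Symbol table[MAX_SYMBOLS];
static size_t table_size = 0;

/* Record a declared variable; returns 0 if the table is full. */
int declare(const char *name, Type type)
{
    assert(name != NULL);
    if (table_size == MAX_SYMBOLS)
        return 0;
    table[table_size].name = name;
    table[table_size].type = type;
    table_size++;
    return 1;
}

/* Type of a declared variable, or TYPE_ERROR if it was never declared. */
Type lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < table_size; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].type;
    return TYPE_ERROR;
}

/* Result type of `left <= right`: a boolean when both operands are
   ints, otherwise a type error. */
Type check_le(const char *left, const char *right)
{
    if (lookup(left) == TYPE_INT && lookup(right) == TYPE_INT)
        return TYPE_BOOL;
    return TYPE_ERROR;
}
```

After declare("x", TYPE_INT) and declare("max", TYPE_INT), check_le("x", "max") gives TYPE_BOOL; comparing against a string variable or an undeclared name gives TYPE_ERROR.
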
Data Structures
Stream of text as the source file
Group together text into larger units from a limited set
Nearly all programming constructs can be represented as tree structures
[Diagram: tree for an If statement with children if, Boolean expression, statement, else, statement; below the Boolean expression: Relational operator, expression]

Data Structures
Lexical Analyzer: stream of tokens (enumerated type)
– e.g. NUMBER OPERATOR NUMBER
Syntax Analyzer / Parser: tree of program structure
[Diagram: program node with children if_statement, assignment, while_loop, output_statement]

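The tree of program structure can be sketched as a C struct. The node kinds and the helper names make_node, count_nodes, free_tree and example_if are hypothetical; a real parser would build such a tree while it recognises the input rather than by hand as done here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node kinds for a few constructs. */
typedef enum { NODE_IF, NODE_RELOP, NODE_IDENT, NODE_PRINT } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *text;        /* operator, name or message; may be NULL */
    struct Node *child[3];   /* unused slots are NULL */
} Node;

Node *make_node(NodeKind kind, const char *text, Node *a, Node *b, Node *c)
{
    Node *n = malloc(sizeof *n);
    assert(n != NULL);       /* sketch only: no recovery from allocation failure */
    n->kind = kind;
    n->text = text;
    n->child[0] = a;
    n->child[1] = b;
    n->child[2] = c;
    return n;
}

/* Number of nodes in the tree rooted at n. */
int count_nodes(const Node *n)
{
    if (n == NULL)
        return 0;
    int total = 1;
    for (int i = 0; i < 3; i++)
        total += count_nodes(n->child[i]);
    return total;
}

void free_tree(Node *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < 3; i++)
        free_tree(n->child[i]);
    free(n);
}

/* Tree for: if ( x <= max ) print('ok'); else print('too big'); */
Node *example_if(void)
{
    Node *cond = make_node(NODE_RELOP, "<=",
                           make_node(NODE_IDENT, "x", NULL, NULL, NULL),
                           make_node(NODE_IDENT, "max", NULL, NULL, NULL),
                           NULL);
    return make_node(NODE_IF, NULL, cond,
                     make_node(NODE_PRINT, "ok", NULL, NULL, NULL),
                     make_node(NODE_PRINT, "too big", NULL, NULL, NULL));
}
```

The if node has three children (condition, then-branch, else-branch), which is why each node carries up to three child pointers in this sketch.
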
Back-end
[Diagram: Semantic analyser → Intermediate code generator → Code optimiser → Code generator, with the Symbol table manager and Error handler alongside]
Knowledge about target processor / virtual machine
– Instruction set
‘costs’ of different:
– op-codes
– instructions
– Registers
– Memory

Putting it together
[Diagram, Compiler: Source program → Lexical analyser → Syntax analyser → Semantic analyser → Intermediate code generator → Code optimiser → Code generator, with the Symbol table and Error Handler alongside all phases]
[Diagram, A language-processing system: Skeletal source program → preprocessor → Source program → compiler → Target assembly program → assembler → Relocatable machine code → Loader / link-editor → Absolute machine code]

Grammars
We define/describe HL languages with grammars
A Grammar consists of:
– T, set of Terminals
– N, set of Non-terminals
  N ∩ T = ∅
– P, set of Productions α → β
  where α and β are members of (T ∪ N)*
– S, special member of N, the Start symbol
G = {T, N, P, S}

Chomsky’s Grammar Hierarchy
Type 3: Regular Grammar
Type 2: Context Free Grammar
Type 1: Context-Sensitive Grammar
Type 0: Unrestricted Grammar

Grammars
Type 0 (unrestricted)
– α → β
– α and β are unrestricted sequences, α is not null
– languages formed from Type 0 grammars can be recognised by non-deterministic Turing machines
Type 1 (context sensitive)
– αAβ → αBβ
– A becomes B in the context of α … β
– Complex for computer analysis

Grammars
Type 2 (context free)
– A → α
  A is a Non-terminal
  α is a member of (T ∪ N)* (can be empty)
– Equivalent to a push-down automaton
Type 3 (regular)
– A → wB, A → w (right linear)
  w is a string of Terminals
  A and B are Non-Terminals
– Finite state automata

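To make the difference between Type 3 and Type 2 concrete, here is a small C sketch with two recognisers. Both grammars are made-up examples, not taken from the slides: A → aA | b (right linear) and S → ( S ) S | empty (context free). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Type 3 example: A -> a A | b, i.e. the strings a*b.
   A finite number of states is enough; no stack is needed. */
int accepts_regular(const char *s)
{
    assert(s != NULL);
    enum { IN_A, DONE } state = IN_A;
    for (; *s != '\0'; s++) {
        if (state == IN_A && *s == 'a')
            state = IN_A;            /* A -> a A */
        else if (state == IN_A && *s == 'b')
            state = DONE;            /* A -> b */
        else
            return 0;                /* no production applies */
    }
    return state == DONE;
}

/* Type 2 example: S -> ( S ) S | empty, i.e. balanced brackets.
   Parses one S starting at s and returns the position just after it,
   or NULL on a syntax error. The recursion plays the role of the
   stack in a push-down automaton. */
static const char *parse_s(const char *s)
{
    if (*s != '(')
        return s;                    /* S -> empty */
    s = parse_s(s + 1);              /* S -> ( S ... */
    if (s == NULL || *s != ')')
        return NULL;
    return parse_s(s + 1);           /* ... ) S */
}

int accepts_balanced(const char *s)
{
    assert(s != NULL);
    const char *end = parse_s(s);
    return end != NULL && *end == '\0';
}
```

The regular recogniser only remembers which state it is in, while the bracket recogniser has to remember how many brackets are still open, which no fixed number of states can do for arbitrarily deep nesting.
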
In a compiler
Use the minimum complexity grammars that let us successfully cope with HL programming languages (and process them efficiently)
Regular grammars (= regular expressions) in the Lexical Analysis phase
– ‘recognise the words’
Context-free grammars in the Syntax Analysis phase
– ‘recognise the phrases’
– define our HLL as a grammar based on the output of the Lexical Analysis
Deal with context sensitivity in the Semantic Analysis phase

Overall Front-End View
[Diagram: Source program (Text file) → Lexical Analyser (Regular grammar, Flex) → tokens → Syntax Analyser (Context-free grammar, Bison) → Tree structure → Semantic Analyser → Type-safe Tree structure → Intermediate Representation (Tree / Linearized tree) → Back-end]

The Textbook
Compilers: Principles, Techniques & Tools
Aho, Sethi & Ullman
Addison-Wesley
(‘The Dragon Book’)

Assessment
Building a compiler for a new language
Front-end
– Lexical analysis
– Parsing
Back-end
– Generating assembler code
Some formal and some practical
– Formal more at the front-end

Programming & Tools
Lexical analysis generator: lex / flex
Parser generator: yacc / bison
C / C++
– To implement the remainder of the compiler
Unix environment
– make files will be useful for coordinating lex and yacc

Instant Compilation
Consider the program:
main() { int a = 3; a = a + 1; }
Given a reasonably sensible assembly language, a hand-compilation might be:
LDA #3
STA 1
LDA 1
ADD a, #1
STA 1

& an Instant Compiler could look like …
switch (source_code_construct) {
case INT_DEC:
    print("LDA #", INT.value);
    print("STA 1");
    break;
case INT_ADD:
    print("LDA 1");
    print("ADD a,#", ADD.value);
    print("STA 1");
    break;
} /* end switch */

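The fragment on this slide is pseudocode. A compilable C sketch of the same idea follows. The Construct struct and the emit function are hypothetical names, and writing into a caller-supplied buffer instead of printing is a choice made here so the output can be inspected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The only two constructs the instant compiler knows about. */
typedef enum { INT_DEC, INT_ADD } ConstructKind;

typedef struct {
    ConstructKind kind;
    int value;               /* the 3 in `int a = 3`, the 1 in `a = a + 1` */
} Construct;

/* Append the assembly for one construct to the string in out, which
   has room for size bytes. Returns the number of characters added,
   or -1 if the buffer is too small. As on the slide, the single
   variable always lives at memory location 1. */
int emit(Construct c, char *out, size_t size)
{
    assert(out != NULL);
    size_t used = strlen(out);
    assert(used < size);
    int n = -1;

    switch (c.kind) {
    case INT_DEC:
        n = snprintf(out + used, size - used,
                     "LDA #%d\nSTA 1\n", c.value);
        break;
    case INT_ADD:
        n = snprintf(out + used, size - used,
                     "LDA 1\nADD a,#%d\nSTA 1\n", c.value);
        break;
    }
    return (n >= 0 && (size_t)n < size - used) ? n : -1;
}
```

Emitting an INT_DEC with value 3 followed by an INT_ADD with value 1 into an empty buffer reproduces the five instructions of the hand-compilation, one per line.
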
The Problems …
Not efficient: the whole program could be compiled to LDA #4; STA 1
Only works for 1 variable
Only works at one location in memory
– (usually let the assembler deal with symbolic addresses)
Only has 2 programming constructs!
Not even slightly portable:
– 1 instruction set & 1 source language

More problems…
No error reporting
– type checking?
Assumes:
– Program is correct
– Recognition of programming language constructs
  int a = 3 → INT_DEC
– Access to values
  INT.value, ADD.value
– 1:1 relationship between integers and memory locations

Solutions
We can view compilers as a solution to all of these problems
E.g.
– Only compile correct programs to object code
– Recognise all constructs in the language
– Improve the efficiency of code
  Execution speed
  Memory usage
– Meaningful error messages to the user
– Cope with different target architectures

Why are compilers called compilers?
In early compilers one of the main tasks was connecting the object program to
– standard library functions, I/O devices
collecting information from different sources (e.g. libraries)
– OS and processor dependent
This is now performed by ‘linkers’
Compile: ‘construct by collecting from different sources’