
1 Compilers, Week 1 — 2005/09/01, 권혁철

2 Making Languages Usable
“It was our belief that if FORTRAN, during its first months, were to translate any reasonable “scientific” source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger... I believe that had we failed to produce efficient programs, the widespread use of languages like FORTRAN would have been seriously delayed.” — John Backus
18 person-years to complete!!!

3 Compiler construction
Compiler writing is perhaps the most pervasive topic in computer science, involving many fields:
- Programming languages
- Architecture
- Theory of computation
- Algorithms
- Software engineering
In this course, you will put everything you have learned together. Exciting, right??

4 Assignments
- It might be the biggest program you’ve ever written.
- It cannot be done the day it’s due!
- Follow the schedule given in the syllabus.
- The TA will implement the compiler in advance and provide the key modules.
- You will also be able to run the TA’s program.
- The goal is for everyone to complete a working compiler.
- If you are not confident in your programming, meet with the TA directly for help.

5 A quick question
Consider the grammar shown below (the first nonterminal is your start symbol). Circle the strings shown below that are in the language described by the grammar; there may be zero or more correct answers.
Grammar: ::= a b ::= b | b ::= a | a
Strings: A) baab B) bbbabb C) bbaaaa D) baaabb E) bbbabab
Compose a grammar for the language consisting of sentences of some number of a’s followed by an equal number of b’s. For example, aaabbb is in the language; aabbb is not, and the empty string is not in the language.
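For the second question, one grammar that generates { a^n b^n : n >= 1 } is S ::= a b | a S b. A minimal recognizer sketch for checking candidate strings (the function name is illustrative):

```python
# Hypothetical recognizer for { a^n b^n : n >= 1 },
# matching the grammar  S ::= a b | a S b.
def in_language(s):
    n = len(s)
    if n == 0 or n % 2 != 0:        # empty and odd-length strings are out
        return False
    half = n // 2
    # first half must be all a's, second half all b's
    return s[:half] == "a" * half and s[half:] == "b" * half
```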

6 What is a compiler?
(Diagram: Source Program → Compiler → Target Program, with Error Messages as a side output.)
The source language might be:
- General purpose, e.g. C or Pascal
- A “little language” for a specific domain, e.g. SIML
The target language might be:
- Some other programming language
- The machine language of a specific machine

7 Related terms
- What is an interpreter? A program that reads an executable program and produces the results of executing that program.
- Target machine: the machine on which the compiled program is to be run.
- Cross-compiler: a compiler that runs on a different type of machine than its target.
- Compiler-compiler: a tool to simplify the construction of compilers (YACC/JCUP).

8 Is it hard??
In the 1950s, compiler writing took an enormous amount of effort: the first FORTRAN compiler took 18 person-years. Today, though, we have very good software tools. You will write your own compiler in a team of 3 in one semester!

9 Intrinsic interest
Compiler construction involves ideas from many different parts of computer science:
- Artificial intelligence: greedy algorithms, heuristic search techniques
- Algorithms: graph algorithms, union-find, dynamic programming
- Theory: DFAs & PDAs, pattern matching, fixed-point algorithms
- Systems: allocation & naming, synchronization, locality
- Architecture: pipeline & hierarchy management, instruction set use

10 Intrinsic merit
Compiler construction poses challenging and interesting problems:
- Compilers must do a lot but also run fast.
- Compilers have primary responsibility for run-time performance.
- Compilers are responsible for making it acceptable to use the full power of the programming language.
- Computer architects perpetually create new challenges for the compiler by building more complex machines.
- Compilers must hide that complexity from the programmer.
- Success requires mastery of complex interactions.

11 Implications
- Must recognize legal (and illegal) programs
- Must generate correct code
- Must manage storage of all variables (and code)
- Must agree with OS & linker on format for object code
High-level view of a compiler: Source code → Compiler → Machine code, with Errors as a side output.

12 Two-Pass Compiler
We break compilation into two phases:
- ANALYSIS breaks the program into pieces and creates an intermediate representation of the source program.
- SYNTHESIS constructs the target program from the intermediate representation.
Sometimes we call the analysis part the FRONT END and the synthesis part the BACK END of the compiler. They can be written independently.

13 Traditional Two-Pass Compiler
(Diagram: Source code → Front End → IR → Back End → Machine code, with Errors reported along the way.)
Implications:
- Use an intermediate representation (IR)
- Front end maps legal source code into IR
- Back end maps IR into target machine code
- Admits multiple front ends & multiple passes (better code)
Typically, the front end is O(n) or O(n log n), while the back end is NP-Complete.

14 Can we build n x m compilers with n + m components?
A common fallacy. (Diagram: front ends for Fortran, Scheme, Java, and Smalltalk feeding a shared IR, with back ends for i86, SPARC, and PowerPC.)
- Must encode all language-specific knowledge in each front end
- Must encode all features in a single IR
- Must encode all target-specific knowledge in each back end
There has been limited success, in systems with very low-level IRs.

15 Source code analysis
Analysis is important for many applications besides compilers:
- STRUCTURE EDITORS try to fill out syntax units as you type.
- PRETTY PRINTERS highlight comments, indent your code for you, and so on.
- STATIC CHECKERS try to find programming bugs without actually running the program.
- INTERPRETERS don’t bother to produce target code, but just perform the requested operations (e.g. Matlab).

16 Source code analysis
Analysis comes in three phases:
- LINEAR ANALYSIS processes characters left-to-right and groups them into TOKENS.
- HIERARCHICAL ANALYSIS groups tokens hierarchically into nested collections of tokens.
- SEMANTIC ANALYSIS makes sure the program components fit together, e.g. variables should be declared before they are used.

17 Linear (lexical) analysis
The linear analysis stage is called LEXICAL ANALYSIS or SCANNING. Example: position = initial + rate * 60 gets translated as:
1. The IDENTIFIER “position”
2. The ASSIGNMENT SYMBOL “=”
3. The IDENTIFIER “initial”
4. The PLUS OPERATOR “+”
5. The IDENTIFIER “rate”
6. The MULTIPLICATION OPERATOR “*”
7. The NUMERIC LITERAL 60
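A scanner like this can be sketched with regular expressions; the token names and patterns below are illustrative, not the course's actual token set:

```python
import re

# Minimal lexical-analysis sketch: one named group per token class.
TOKEN_SPEC = [
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("ASSIGN",     r"="),
    ("PLUS",       r"\+"),
    ("TIMES",      r"\*"),
    ("SKIP",       r"\s+"),          # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Group the characters of `source` left-to-right into tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Running scan("position = initial + rate * 60") yields exactly the seven tokens listed above.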

18 Hierarchical (syntax) analysis
The hierarchical stage is called SYNTAX ANALYSIS or PARSING. The hierarchical structure of the source program can be represented by a PARSE TREE, for example:

19 The parse tree for position = initial + rate * 60 (indentation shows nesting):
assignment statement
  identifier: position
  =
  expression
    expression
      identifier: initial
    +
    expression
      expression
        identifier: rate
      *
      expression: 60

20 Syntax analysis
The hierarchical structure of the syntactic units in a programming language is normally represented by a set of recursive rules. Example for expressions:
1. Any identifier is an expression
2. Any number is an expression
3. If expression1 and expression2 are expressions, so are expression1 + expression2, expression1 * expression2, and ( expression1 )
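The recursive rules above map naturally onto a recursive-descent recognizer. This sketch assumes the usual precedence (* binds tighter than +) and uses illustrative function names:

```python
import re

def parse_expression(tokens):
    """Return True iff `tokens` forms one expression under rules 1-3."""
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat(tok):
        if peek() == tok:
            pos[0] += 1
            return True
        return False

    def factor():                    # rules 1 and 2, plus ( expression )
        t = peek()
        if t is not None and re.fullmatch(r"[A-Za-z_]\w*|\d+", t):
            pos[0] += 1
            return True
        if eat("("):
            return expression() and eat(")")
        return False

    def term():                      # expression1 * expression2
        if not factor():
            return False
        while eat("*"):
            if not factor():
                return False
        return True

    def expression():                # expression1 + expression2
        if not term():
            return False
        while eat("+"):
            if not term():
                return False
        return True

    return expression() and pos[0] == len(tokens)
```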

21 Syntax analysis
Example for statements:
1. If identifier1 is an identifier and expression2 is an expression, then identifier1 = expression2 is a statement.
2. If expression1 is an expression and statement2 is a statement, then the following are statements: while ( expression1 ) statement2 and if ( expression1 ) statement2

22 Lexical vs. syntactic analysis
Generally, if a syntactic unit can be recognized in a linear scan, we convert it into a token during lexical analysis. More complex syntactic units, especially recursive structures, are normally processed during syntactic analysis (parsing). Identifiers, for example, can be recognized easily in a linear scan, so identifiers are tokenized during lexical analysis.

23 Source code analysis
It is common to convert complex parse trees to simpler SYNTAX TREES, with a node for each operator and children for the operands of each operator.
(Diagram: analysis turns position = initial + rate * 60 into a syntax tree with = at the root, position on the left, and on the right a + node whose children are initial and a * node over rate and 60.)

24 Semantic analysis
The semantic analysis stage:
- Checks for semantic errors, e.g. undeclared variables
- Gathers type information
- Determines the operators and operands of expressions
Example: if rate is a float, the integer literal 60 should be converted to a float before multiplying. In the syntax tree, an inttoreal node is inserted above the 60.
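A type-checking pass like this can be sketched as a tree walk that inserts conversion nodes. The tuple-based tree shape and the inttoreal node name follow the slide's example; everything else is illustrative:

```python
def annotate(node, types):
    """Return (possibly rewritten node, its type), inserting inttoreal
    wherever an int operand meets a float operand."""
    if isinstance(node, tuple):          # interior node: ("op", left, right)
        op, left, right = node
        left, lt = annotate(left, types)
        right, rt = annotate(right, types)
        if lt == "int" and rt == "float":
            left, lt = ("inttoreal", left), "float"
        if rt == "int" and lt == "float":
            right, rt = ("inttoreal", right), "float"
        return (op, left, right), lt
    if isinstance(node, int):            # integer literal
        return node, "int"
    return node, types[node]             # identifier: look up its type

tree = ("+", "initial", ("*", "rate", 60))
typed, _ = annotate(tree, {"initial": "float", "rate": "float"})
# typed now carries an inttoreal node above the literal 60
```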

25 The rest of the process
(Diagram: source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program, with the symbol-table manager and the error handler serving every phase.)

26 Symbol-table management
During analysis, we record the identifiers used in the program. The symbol table stores each identifier with its ATTRIBUTES. Example attributes:
- How much STORAGE is allocated for the id
- The id’s TYPE
- The id’s SCOPE
- For functions, the PARAMETER PROTOCOL
Some attributes can be determined immediately; some are delayed.
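A symbol table with scopes can be sketched as a chain of dictionaries; the attribute names shown (type, storage) are illustrative examples of the attributes listed above:

```python
class SymbolTable:
    """One scope's identifiers, chained to the enclosing scope."""

    def __init__(self, parent=None):
        self.symbols = {}
        self.parent = parent

    def declare(self, name, **attributes):
        self.symbols[name] = attributes

    def lookup(self, name):
        if name in self.symbols:
            return self.symbols[name]
        if self.parent is not None:
            return self.parent.lookup(name)   # search enclosing scope
        return None                           # undeclared identifier

globals_ = SymbolTable()
globals_.declare("rate", type="float", storage=4)
inner = SymbolTable(parent=globals_)          # e.g. a function body
inner.declare("i", type="int", storage=4)
```

Looking up rate from the inner scope walks the chain out to the global scope; looking up i from the global scope fails, which is how undeclared-variable errors surface.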

27 Error detection
Each compilation phase can have errors. Normally, we want to keep processing after an error, in order to find more errors. Each stage has its own characteristic errors, e.g.:
- Lexical analysis: a string of characters that does not form a legal token
- Syntax analysis: unmatched { } or missing ;
- Semantic analysis: trying to add a float and a pointer

28 Internal Representations
Each stage of processing transforms a representation of the source program into a new representation:
- Source text: position = initial + rate * 60
- After lexical analysis: the token stream id1 = id2 + id3 * 60, with position, initial, and rate entered into the symbol table as id1, id2, and id3
- After syntax analysis: a tree with = at the root, id1 on the left, and id2 + (id3 * 60) on the right
- After semantic analysis: the same tree with an inttoreal node inserted above 60

29 Intermediate code generation
Some compilers explicitly create an intermediate representation of the source program after semantic analysis. The representation is a program for an abstract machine. The most common representation is “three-address code,” in which all memory locations are treated as registers, and most instructions apply an operator to two operand registers and store the result to a destination register.

30 Intermediate code generation
The semantic analyzer’s tree for position = initial + rate * 60 becomes:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
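Generating three-address code from a syntax tree can be sketched as a post-order walk that allocates a fresh temporary per operator. The tree below reproduces the slide's example; the helper names are illustrative:

```python
def gen_tac(node, code, counter):
    """Emit three-address code for `node` into `code`; return the name
    holding its value. Leaves are names/literals; tuples are operators."""
    if not isinstance(node, tuple):
        return str(node)
    if len(node) == 2:                       # unary node, e.g. inttoreal
        op, arg = node
        a = gen_tac(arg, code, counter)
        counter[0] += 1
        temp = f"temp{counter[0]}"
        code.append(f"{temp} := {op}({a})")
        return temp
    op, left, right = node                   # binary node
    l = gen_tac(left, code, counter)
    r = gen_tac(right, code, counter)
    counter[0] += 1
    temp = f"temp{counter[0]}"
    code.append(f"{temp} := {l} {op} {r}")
    return temp

code = []
result = gen_tac(("+", "id2", ("*", "id3", ("inttoreal", 60))), code, [0])
code.append(f"id1 := {result}")
# code now holds the four-line sequence shown above
```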

31 The Optimizer (or Middle End)
Modern optimizers are structured as a series of passes (IR in, IR out, with errors reported along the way).
Typical transformations:
- Discover & propagate some constant value
- Move a computation to a less frequently executed place
- Specialize some computation based on context
- Discover a redundant computation & remove it
- Remove useless or unreachable code
- Encode an idiom in some particularly efficient form

32 Code optimization
At this stage, we improve the code to make it run faster. The code optimizer transforms:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
into:
temp1 := id3 * 60.0
id1 := id2 + temp1
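The first step of this transformation, folding the inttoreal of a constant, can be sketched as a single pass over the three-address lines. The string parsing here is deliberately naive, and a real optimizer would also copy-propagate the remaining temporaries to reach the slide's final form:

```python
def fold_constants(lines):
    """Fold inttoreal(constant) at compile time and substitute the
    folded value into later uses of that temporary."""
    out = []
    values = {}                              # temp name -> folded constant
    for line in lines:
        dest, expr = [s.strip() for s in line.split(":=")]
        if expr.startswith("inttoreal(") and expr[10:-1].isdigit():
            values[dest] = f"{expr[10:-1]}.0"   # do the conversion now
            continue                            # the instruction disappears
        for temp, const in values.items():
            expr = expr.replace(temp, const)
        out.append(f"{dest} := {expr}")
    return out
```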

33 Code generation
In the final stage, we take the three-address code (3AC) or other intermediate representation and convert it to the target language. We must pick memory locations for variables and allocate registers. The code generator turns:
temp1 := id3 * 60.0
id1 := id2 + temp1
into:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

34 The Back End
(Diagram: IR → Instruction Selection → Register Allocation → Instruction Scheduling → Machine code, with errors reported along the way.)
Responsibilities:
- Translate IR into target machine code
- Choose instructions to implement each IR operation
- Decide which values to keep in registers
- Ensure conformance with system interfaces
Automation has been less successful in the back end.

35 The Back End: Instruction Selection
- Produce fast, compact code
- Take advantage of target features such as addressing modes
- Usually viewed as a pattern-matching problem: ad hoc methods, pattern matching, dynamic programming

36 The Back End: Register Allocation
- Have each value in a register when it is used
- Manage a limited set of resources
- Can change instruction choices & insert LOADs & STOREs
- Optimal allocation is NP-Complete (with 1 or k registers)
Compilers approximate solutions to NP-Complete problems.
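Because optimal allocation is NP-Complete, practical compilers approximate, for example with greedy linear scan over live intervals. In this sketch, the interval data and the spill convention are illustrative:

```python
def linear_scan(intervals, k):
    """Greedy linear-scan register allocation.
    intervals: (name, start, end) triples sorted by start, where
    [start, end] is the range of instructions in which `name` is live.
    Returns name -> register number, or "spill" when none is free."""
    allocation, active = {}, []              # active: (end, name) pairs
    for name, start, end in intervals:
        # expire intervals that ended before this one starts
        active = [(e, n) for e, n in active if e >= start]
        used = {allocation[n] for _, n in active}
        free = [r for r in range(k) if r not in used]
        if free:
            allocation[name] = free[0]
            active.append((end, name))
        else:
            allocation[name] = "spill"       # keep it in memory instead
    return allocation
```

With two registers and three overlapping intervals, the third value is spilled to memory, which is exactly the "insert LOADs & STOREs" case above.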

37 The Back End: Instruction Scheduling
- Avoid hardware stalls and interlocks
- Use all functional units productively
- Can increase the lifetime of variables (changing the allocation)
Optimal scheduling is NP-Complete in nearly all cases, but heuristic techniques are well developed.

38 Cousins of the compiler
PREPROCESSORS take raw source code and produce the input actually read by the compiler.
- MACRO PROCESSING: macro calls need to be replaced by the correct text. Macros can be used to define a constant used in many places, e.g. #define BUFSIZE 100 in C. Also useful as shorthand for often-repeated expressions:
#define DEG_TO_RADIANS(x) ((x)/180.0*M_PI)
#define ARRAY(a,i,j,ncols) ((a)[(i)*(ncols)+(j)])
- FILE INCLUSION: included files (e.g. using #include in C) need to be expanded.

39 Cousins of the compiler
ASSEMBLERS take assembly code and convert it to machine code. Some compilers go directly to machine code; others produce assembly code and then call a separate assembler. Either way, the output machine code is usually RELOCATABLE, with memory addresses starting at location 0.

40 Cousins of the compiler
LOADERS take relocatable machine code and alter the addresses, putting the instructions and data in a particular location in memory. The LINK EDITOR (part of the loader) pieces together a complete program from several independently compiled parts.

41 Compiler writing tools
We’ve come a long way since the 1950s.
- SCANNER GENERATORS produce lexical analyzers automatically. Input: a specification of the tokens of a language (usually written as regular expressions). Output: C code to break the source language into tokens.
- PARSER GENERATORS produce syntactic analyzers automatically. Input: a specification of the language syntax (usually written as a context-free grammar). Output: C code to build the syntax tree from the token sequence.
There are also automated systems for code synthesis.

