Published by Rosa Cole. Modified over 9 years ago.
1
Chapter 3. Lexical Analysis (1)
2
2 Interaction of lexical analyzer with parser.
3
3 Lexical Analysis
Issues (reasons for separating lexical analysis from parsing)
– Simpler design is preferred
– Compiler efficiency is improved
– Compiler portability is improved
Terms
– Tokens: terminal symbols in a grammar
– Patterns: rules describing the strings that belong to a token
– Lexemes: the character strings in the source program that are matched by a token's pattern
4
4 Examples of tokens.
TOKEN      SAMPLE LEXEMES            INFORMAL DESCRIPTION OF PATTERN
const      const                     const
if         if                        if
relation   <, <=, =, <>, >, >=       < or <= or = or <> or > or >=
id         pi, count, D2             letter followed by letters and digits
num        3.1416, 0, 6.02E23        any numeric constant
literal    "core dumped"             any characters between " and " except "
5
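5 Difficulties in implementing lexical analyzers
FORTRAN
– Blanks are not significant, so they cannot serve as delimiters
– DO 5 I=1.25 is an assignment to the variable DO5I
– DO 5 I=1,25 begins a DO loop
– DO 5 I= 1 25
PL/I
– Keywords are not reserved
– IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;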
6
6 Attributes for tokens
A lexical analyzer collects information about tokens into their associated attributes
Example
– E = M * C ** 2
– A numeric constant such as 2 is generally stored in a constant table
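The example above can be made concrete with a small sketch. It assumes hypothetical token names (id, num, assign_op, mult_op, exp_op) and uses list indices as stand-ins for pointers into a symbol table and a constant table; none of these names come from the slides.

```python
import re

# Hypothetical token names; alternatives are tried in order, so "**" precedes "*".
TOKEN_SPEC = [
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op", r"\*\*"),
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table, constant_table) for `source`."""
    symbols, constants, tokens = [], [], []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue  # whitespace produces no token
        if kind == "id":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append((kind, symbols.index(lexeme)))    # index stands in for a pointer
        elif kind == "num":
            if lexeme not in constants:
                constants.append(lexeme)
            tokens.append((kind, constants.index(lexeme)))  # entry in the constant table
        else:
            tokens.append((kind, None))  # operators carry no attribute here
    return tokens, symbols, constants
```

Calling tokenize("E = M * C ** 2") yields seven tokens; the three identifiers point into the symbol table and the constant 2 points into the constant table.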
7
7 Lexical Errors
Rules for error recovery
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a correct character
– Transposing two adjacent characters
Minimum-distance error correction
Example
– Detectable: 2as3, 2#31, …
– Undetectable: fi(a == f(x)) … (fi is a valid identifier, so the lexical analyzer cannot tell that it is a misspelled if)
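The four repair operations above define an edit distance between strings. The following is a minimal sketch of one common formulation (the optimal-string-alignment variant of Damerau-Levenshtein distance); the slides do not say which algorithm is intended, and the helper nearest_keyword and its keyword list are illustrative assumptions.

```python
def edit_distance(a, b):
    """Fewest insertions, deletions, replacements, and adjacent transpositions
    needed to turn a into b (optimal-string-alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete an extraneous character
                d[i][j - 1] + 1,         # insert a missing character
                d[i - 1][j - 1] + cost,  # replace an incorrect character
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose two adjacent characters
    return d[rows - 1][cols - 1]


def nearest_keyword(lexeme, keywords=("if", "then", "else")):
    """Hypothetical helper: the keyword closest to `lexeme`."""
    return min(keywords, key=lambda k: edit_distance(lexeme, k))
```

Under this measure fi is one transposition away from if, which is why a later phase with more context, rather than the lexical analyzer, is usually the one that can propose the repair.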
8
8 Input Buffering
A single buffer can cause serious difficulty
– A word may straddle the boundary between two buffer loads
– Declare (arg1, …, argn): array or function
Buffer pairs
– A good solution
– With sentinels, there is no need to test for both end of buffer and end of file on every character
9
9 Sentinels at end of each buffer half.
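The buffer-pair scheme with sentinels can be sketched as follows. This is an illustrative model only: it assumes the sentinel is the NUL character and that NUL never occurs in the source text, the class and method names are hypothetical, and the lexeme-start pointer of a real scanner is omitted.

```python
import io

EOF = "\0"  # sentinel; assumed never to occur in the source text


class BufferPair:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        chunk = self.stream.read(self.n)
        self.buf[start:start + len(chunk)] = chunk
        self.buf[start + len(chunk)] = EOF  # marks end of half or end of input

    def next_char(self):
        ch = self.buf[self.forward]
        if ch != EOF:                       # common case: one comparison per character
            self.forward += 1
            return ch
        if self.forward == self.n:          # sentinel closing the first half
            self._load(self.n + 1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:  # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                         # sentinel inside a half: end of input


def read_all(text, n=4):
    """Drain a BufferPair over `text` and return everything it delivered."""
    pair, out = BufferPair(io.StringIO(text), n), []
    while (ch := pair.next_char()) is not None:
        out.append(ch)
    return "".join(out)
```

Only when the sentinel is seen does the code decide between reloading a half and stopping, so the common path performs a single test per character.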
10
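10 Specification of Tokens
Strings and languages
– Alphabet or character class: a finite set of symbols
– String (also called sentence or word): a finite sequence of symbols drawn from the alphabet
– |s|: length of a string s
– ε: the empty string; ∅: the empty set; {ε}: the set containing only the empty string
– If x and y are strings, xy is their concatenation; εx = xε = x
Operations on languages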
11
11 Terms for parts of a string.
TERM: DEFINITION
prefix of s: A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
suffix of s: A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
substring of s: A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.
proper prefix, suffix, or substring of s: Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.
subsequence of s: Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.
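The definitions above translate almost directly into set comprehensions. The sketch below uses hypothetical function names and represents ε as the empty Python string.

```python
def prefixes(s):
    """All strings obtained by removing zero or more trailing symbols of s."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All strings obtained by removing zero or more leading symbols of s."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All strings obtained by deleting a prefix and a suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts, s):
    """Keep only the nonempty members of `parts` that differ from s."""
    return {x for x in parts if x and x != s}


def is_subsequence(x, s):
    """True if x can be formed by deleting symbols of s, not necessarily contiguous."""
    remaining = iter(s)
    return all(ch in remaining for ch in x)  # each `in` consumes s up to the match
```

For banana, the empty string and banana itself are in all three sets, and the proper variants exclude exactly those two.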
12
12 Definitions of operations on languages.
OPERATION: DEFINITION
union of L and M, written L ∪ M: L ∪ M = { s | s is in L or s is in M }
concatenation of L and M, written LM: LM = { st | s is in L and t is in M }
Kleene closure of L, written L*: L* = L^0 ∪ L^1 ∪ L^2 ∪ …; L* denotes “zero or more concatenations of” L.
positive closure of L, written L+: L+ = L^1 ∪ L^2 ∪ L^3 ∪ …; L+ denotes “one or more concatenations of” L.
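For finite languages these operations can be tried out on Python sets of strings. The closures are infinite in general, so this sketch truncates them at a caller-chosen maximum power; the function names and the max_power parameter are assumptions made for illustration.

```python
def union(L, M):
    return L | M


def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i, with L^0 = {ε} (ε is the empty Python string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """L^0 ∪ L^1 ∪ … ∪ L^max_power, a finite approximation of L*."""
    closure = set()
    for i in range(max_power + 1):
        closure |= power(L, i)
    return closure


def positive_closure(L, max_power):
    """L^1 ∪ … ∪ L^max_power, a finite approximation of L+."""
    closure = set()
    for i in range(1, max_power + 1):
        closure |= power(L, i)
    return closure
```

With L = {a} and max_power = 3, the truncated Kleene closure is {ε, a, aa, aaa} and the positive closure is the same set without ε.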
13
13 Regular Expressions
1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
14
14 Examples on operations in regular expressions
Alphabet Σ = {a, b}
– a | b denotes {a, b}
– (a|b)(c|d) denotes {ac, ad, bc, bd}, for an alphabet that also contains c and d
– a* denotes {ε, a, aa, aaa, …}
– (a|b)* = (a*|b*)*
– aa* = a+, ε|a+ = a*
– (a|b) = (b|a)
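The equalities above can be spot-checked by enumerating every string over {a, b} up to a small length and comparing which ones each expression matches. This is a bounded check, not a proof; the function name language and its parameters are hypothetical, and ε is written as an empty alternative because Python's re module has no ε symbol.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=5):
    """Strings over `alphabet` of length <= max_len that `pattern` matches in full."""
    compiled = re.compile(pattern)
    matched = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            candidate = "".join(symbols)
            if compiled.fullmatch(candidate):
                matched.add(candidate)
    return matched
```

For example, language("(a|b)*") and language("(a*|b*)*") both contain all 63 strings of length at most 5, and language("aa*") equals language("a+").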
15
15 Algebraic properties of regular expressions.
AXIOM                              DESCRIPTION
r|s = s|r                          | is commutative
r|(s|t) = (r|s)|t                  | is associative
(rs)t = r(st)                      concatenation is associative
r(s|t) = rs|rt, (s|t)r = sr|tr     concatenation distributes over |
εr = r, rε = r                     ε is the identity element for concatenation
r* = (r|ε)*                        relation between * and ε
r** = r*                           * is idempotent
16
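16 Regular Definitions
Regular definition: a sequence of definitions in which each di is a distinct name and each ri is a regular expression over the alphabet and the previously defined names
– d1 → r1
– d2 → r2
– …
– dn → rn
Example
– letter → A|B|…|Z|a|b|…|z
– digit → 0|1|…|9
– id → letter (letter|digit)*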
17
17 Unsigned numbers in Pascal
digit → 0|1|…|9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
18
18 Notational Shorthands (1/2)
1. One or more instances. The unary postfix operator + means “one or more instances of.” If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a’s. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators.
2. Zero or one instance. The unary postfix operator ? means “zero or one instance of.” The notation r? is a shorthand for r|ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5; the rewritten definition appears on the next slide.
19
19 Notational Shorthands (2/2)
3. Character classes. The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*
Regular definition for num using the + and ? shorthands:
digit → 0 | 1 | ··· | 9
digits → digit+
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent
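The shorthand forms map closely onto the syntax of most regular-expression libraries. As a sketch, the definitions of id and num can be written with Python's re module; the constant and function names below are illustrative.

```python
import re

DIGITS = r"[0-9]+"                            # digits -> digit+
OPTIONAL_FRACTION = rf"(\.{DIGITS})?"         # optional_fraction -> ( . digits )?
OPTIONAL_EXPONENT = rf"(E[+-]?{DIGITS})?"     # optional_exponent -> ( E ( + | - )? digits )?
NUM = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")      # letter ( letter | digit )*


def is_num(text):
    """True if the whole of `text` is generated by num."""
    return NUM.fullmatch(text) is not None


def is_id(text):
    """True if the whole of `text` is generated by id."""
    return ID.fullmatch(text) is not None
```

Note that num requires at least one digit on each side of the decimal point, so strings such as 1. and .5 are rejected.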
20
20 Nonregular set
{ w c w^-1 | w is a string of a’s and b’s }, where w^-1 denotes w reversed
A context-free grammar is required to describe this set
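Recognizing this set needs unbounded memory of what came before the c, which a finite automaton does not have. The sketch below uses an explicit stack, in the manner of a pushdown automaton; it assumes that w^-1 means w reversed, and the function name is hypothetical.

```python
def is_w_c_w_reversed(s):
    """True if s has the form w c w^-1 with w a string of a's and b's."""
    stack = []
    i = 0
    while i < len(s) and s[i] in "ab":  # push w symbol by symbol
        stack.append(s[i])
        i += 1
    if i == len(s) or s[i] != "c":      # the separator c must follow w
        return False
    i += 1
    while i < len(s):                   # the rest must pop w in reverse order
        if not stack or stack.pop() != s[i]:
            return False
        i += 1
    return not stack                    # every pushed symbol must be matched
```

The stack can grow as long as w, which is the part no fixed number of states can imitate.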
21
21 Regular-expression patterns for tokens.
REGULAR EXPRESSION    TOKEN    ATTRIBUTE-VALUE
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
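A table like this can drive a scanner directly. The sketch below tries every pattern at the current position, keeps the longest match, and breaks ties in favor of the earlier rule, which is the disambiguation rule Lex uses. It is an illustration only: the lexeme itself stands in for the "pointer to table entry" attribute, and the names RULES and scan are assumptions.

```python
import re

# (pattern, token, attribute); a token of None means "skip", and the attribute
# "lexeme" means "use the matched text" (a stand-in for a table pointer).
RULES = [
    (r"[ \t\n]+", None, None),
    (r"if", "if", None),
    (r"then", "then", None),
    (r"else", "else", None),
    (r"[A-Za-z][A-Za-z0-9]*", "id", "lexeme"),
    (r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?", "num", "lexeme"),
    (r"<", "relop", "LT"),
    (r"<=", "relop", "LE"),
    (r"=", "relop", "EQ"),
    (r"<>", "relop", "NE"),
    (r">", "relop", "GT"),
    (r">=", "relop", "GE"),
]
COMPILED = [(re.compile(p), tok, attr) for p, tok, attr in RULES]


def scan(source):
    """Return the list of (token, attribute-value) pairs for `source`."""
    tokens, pos = [], 0
    while pos < len(source):
        best = None
        for regex, tok, attr in COMPILED:
            m = regex.match(source, pos)
            # Strict ">" keeps the earlier rule when two matches have equal length.
            if m and (best is None or m.end() > best[0]):
                best = (m.end(), tok, attr, m.group())
        if best is None:
            raise ValueError(f"no pattern matches at position {pos}")
        end, tok, attr, lexeme = best
        pos = end
        if tok is not None:
            tokens.append((tok, lexeme if attr == "lexeme" else attr))
    return tokens
```

Because of the longest-match rule, ifx is scanned as one id rather than the keyword if followed by x, and <= is one relop with attribute LE rather than < followed by =.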
22
22 Transition diagram
Finite-state automata
– states and edges
A few examples are shown …. On the next page, Fig. 3.14 is drawn based on the preceding example
23
23 Transition diagram for identifiers and keywords.
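A transition diagram of this kind can be simulated with an explicit state variable. The diagram itself is not reproduced in the text, so the state numbers below (0 for start, 1 for "inside an identifier", 2 for accept) and the keyword set are assumptions chosen for illustration.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
KEYWORDS = {"if", "then", "else"}  # assumed reserved words


def scan_identifier(source, start=0):
    """Return (token, lexeme, next_pos), or None if no identifier starts at `start`."""
    state, pos = 0, start
    while True:
        ch = source[pos] if pos < len(source) else ""  # "" stands for end of input
        if state == 0:                 # start state: need a letter
            if ch in LETTERS:
                state, pos = 1, pos + 1
            else:
                return None
        elif state == 1:               # loop on letters and digits
            if ch in LETTERS or ch in DIGITS:
                pos += 1
            else:
                state = 2              # any other character: go to accept, do not consume it
        else:                          # accepting state: classify the lexeme
            lexeme = source[start:pos]
            token = lexeme if lexeme in KEYWORDS else "id"
            return token, lexeme, pos
```

Keywords are handled by looking up the finished lexeme rather than by giving each keyword its own path through the diagram, which keeps the automaton small.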
24
24 Implementation with Lex
– Regular definition
– Finite automata, transition diagram
– Output as a C program
– Lexical analysis, pattern matching, …
25
25 Creating a lexical analyzer with Lex.
26
26 Lex program for the tokens of Fig. 3.10. (1/2)
%{
    /* definitions of manifest constants
       LT, LE, EQ, NE, GT, GE,
       IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
27
27 Lex program for the tokens of Fig. 3.10. (2/2)
%%

{ws}      { /* no action and no return */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = install_id(); return(ID); }
{number}  { yylval = install_num(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }
"<="      { yylval = LE; return(RELOP); }
"="       { yylval = EQ; return(RELOP); }
"<>"      { yylval = NE; return(RELOP); }
">"       { yylval = GT; return(RELOP); }
">="      { yylval = GE; return(RELOP); }

%%

install_id() {
    /* procedure to install the lexeme, whose first character is
       pointed to by yytext and whose length is yyleng, into the
       symbol table and return a pointer thereto */
}

install_num() {
    /* similar procedure to install a lexeme that is a number */
}
28
28 Lookahead operator
FORTRAN: DO 5 I = 1.25 versus DO 5 I = 1,25
– DO/({letter}|{digit})*=({letter}|{digit})*,
– DO/{id}*={digit}*,
FORTRAN: IF(I,J)=3 versus IF(condition) statement
– IF/\(.*\){letter}
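The effect of the lookahead operator can be approximated with a lookahead assertion in Python's re module: the text after the slash must be present, but only the part before it is consumed as the lexeme. This is a sketch rather than a model of Lex itself; it assumes blanks are stripped first, as FORTRAN ignores them, and the function names are hypothetical.

```python
import re

ALNUM = r"[A-Za-z0-9]*"
# DO is a keyword only if followed by letters/digits, "=", letters/digits, and ",".
DO_KEYWORD = re.compile(rf"DO(?={ALNUM}={ALNUM},)")
# IF is a keyword only if followed by a parenthesized part and then a letter.
IF_KEYWORD = re.compile(r"IF(?=\(.*\)[A-Za-z])")


def keyword_lexeme(pattern, statement):
    """Return the keyword lexeme at the start of `statement`, or None."""
    compact = statement.replace(" ", "")  # blanks are not significant in FORTRAN
    m = pattern.match(compact)
    return m.group() if m else None
```

For DO 5 I = 1,25 the match consumes only DO, leaving 5I=1,25 for the following tokens, while DO 5 I = 1.25 produces no keyword match and is scanned as an assignment instead.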