Formal Languages and Applications

We know that the Pascal programming language is defined in terms of a CFG. The syntax of most other programming languages is also context-free (except for a few special cases, like the input statement in FORTRAN). In this section we will briefly see that the popular hypertext markup language (HTML) and the newly emerging extensible markup language (XML), used for e-commerce, are also context-free. These languages are usually presented in a descriptive form. We will see how such a description can be transformed into a formal grammar. This is a brief summary of the hard copy handed out in class. See that handout for more details.
HyperText Markup Language (HTML)

HTML consists of text and tags. Matching tags are of the form <x> and </x> for a string x. Unmatched tags are of the form <x> with no matching part. The following specification is for the item list, which can easily be transformed into a set of production rules as shown in the box.

1. Char is a single character.
2. Text is any string of characters with no tags.
3. Doc represents documents, which are sequences of Elements.
4. Element is either a Text, a pair of matching tags with a Doc between them, or an unmatched tag followed by a Doc.
5. ListItem is the <LI> tag followed by a Doc, which is a single list item.
6. List is a sequence of zero or more ListItems.

Char → a | A | ... | z | Z | ...
Text → ε | Char Text
Doc → ε | Element Doc
Element → Text | <x> Doc </x> | <x> Doc
ListItem → <LI> Doc
List → ε | ListItem List
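The description of Doc and Element above can be checked mechanically. The following is a minimal sketch, not a grammar-driven parser: it uses a stack to decide whether a string could be a Doc, treating an opening tag that is never closed as an unmatched tag. The function name is_doc, the restriction of tag names to letters and digits, and the rule that plain text may not contain "<" are assumptions made for this illustration.

```python
import re

# A tag is <name> or </name>; names are assumed to be alphanumeric.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def is_doc(s):
    """Return True if s fits the Doc description: text, matched tag
    pairs around a Doc, and unmatched opening tags followed by a Doc."""
    open_tags = []          # opening tags seen so far and not yet closed
    pos = 0
    for m in TAG.finditer(s):
        # Assumption: Text between tags contains no stray '<'.
        if "<" in s[pos:m.start()]:
            return False
        pos = m.end()
        is_closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not is_closing:
            open_tags.append(name)
            continue
        # A closing tag needs an earlier opening tag with the same name.
        if name not in open_tags:
            return False
        # Opening tags skipped over here are the "unmatched" tags.
        while open_tags.pop() != name:
            pass
    # Opening tags still on the stack are unmatched, which is allowed.
    return "<" not in s[pos:]
```

For instance, is_doc("<OL><LI>one<LI>two</OL>") accepts the two unmatched <LI> tags inside the matched <OL> pair, while is_doc("<EM>x</B>") rejects the closing tag that has no opening partner.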
XML and Document-Type Definitions

The main purpose of XML is to describe the meaning (i.e., semantics) of a document by accompanying it with a form called a DTD (Document Type Definition). The general form of a DTD is

<!DOCTYPE name-of-DTD [ list of element definitions ]>

where each element definition has the following form.

<!ELEMENT element-name (description of the element)>

Element descriptions are essentially regular expressions, defined as follows. (Items 1 and 2 form the basis of the definition. Recall that (E1)+ = E1(E1)*.)

1. An element-name.
2. The special term #PCDATA, standing for any text that does not involve XML tags.
3. If E1 and E2 are element descriptions, then E1*, E1+, E1?, E1.E2, and E1 | E2 are element descriptions which, respectively, denote the following:
   E1* : zero or more occurrences of E1.
   E1+ : one or more occurrences of E1.
   E1? : zero or one occurrence of E1.
   E1.E2 : E1 followed by E2 (actual DTD syntax writes this as E1, E2).
   E1 | E2 : the union of E1 and E2, i.e., either E1 or E2.
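Because element descriptions are essentially regular expressions, one way to see them at work is to translate a description into an ordinary regular expression over the sequence of child element names. The sketch below does this with Python's re module. The helper names description_to_regex and conforms, and the element names TITLE, AUTHOR, NOTE and ITEM used in the usage line, are hypothetical and chosen only for illustration. Concatenation may be written with either "." or ",".

```python
import re

# Tokens of an element description: #PCDATA, element names, and operators.
# Concatenation marks ('.' or ',') and blanks are simply skipped.
DESCRIPTION_TOKEN = re.compile(r"#PCDATA|[A-Za-z_][A-Za-z0-9_]*|[()*+?|]")

def description_to_regex(description):
    """Translate an element description into a Python regex that matches
    a string of child names, each name followed by one space."""
    parts = []
    for token in DESCRIPTION_TOKEN.findall(description):
        if token in "()*+?|":
            parts.append(token)              # operators carry over unchanged
        else:
            # A name becomes a non-capturing group "name + space".
            parts.append("(?:" + re.escape(token) + " )")
    return "".join(parts)

def conforms(children, description):
    """Check whether a list of child element names fits the description."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(description_to_regex(description), sequence) is not None
```

With these helpers, conforms(["TITLE", "AUTHOR", "AUTHOR"], "TITLE.AUTHOR+.NOTE?") holds, whereas a TITLE with no AUTHOR does not conform, because AUTHOR+ demands at least one occurrence.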
Example. A DTD for personal computers: <!DOCTYPE PcSpecs [ ]>
The above DTD can easily be transformed into a CFG. For example, an element definition is transformed into a production rule for that element, with the alternatives of a union separated by |, and so on.
A part of an XML document conforming to the above DTD is shown below. Notice that each element is delimited by a pair of matching tags bearing the name of the element. 1234 $3000 <RAM>512</RAM> Superdc xx1000 62Gb 32x ...
Lex and YACC

Most compilers have two main functional components: a lexical analyzer and a parser. The lexical analyzer reads the input source program and identifies tokens; the parser, working from those tokens, parses the program (i.e., identifies the relationships between the tokens in terms of a sequence of production rules).

Lex

Since tokens can be expressed as regular expressions, the lexical analyzer can be built on the model of finite state automata. (Recall the automaton that we designed for recognizing Pascal numbers.) Lex builds a lexical analyzer from "token forms" given as (actually, a variation of) regular expressions, and carries out the action given for each token. The input to Lex consists of three parts, separated by lines containing %%:
- Definition
- Token description and actions
- User-written codes
Definition

In the definition section, each regular expression is defined with a name. In regular expressions, the operator * is used for the closure and + for the positive closure, and the vertical bar | is used for the union operator. Thus (a | b)* denotes any combination of a's and b's including the null string, and (a | b)+ denotes (a | b)* with the null string excluded. Alternation can also be written using brackets. For example, [ab] means (a | b), and [a-z] means any symbol from the lower case alphabet. A question mark indicates that the preceding expression is optional. Thus abc? is equivalent to ab | abc, and (abc)? is abc | ε. The period is used as a wild card symbol that matches any character except newline. For example, integers and reals can be defined as follows.

digits  [0-9]
int     {digits}+
real    {int}"."{int}([Ee][+-]?{int})?
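The operators described above behave the same way in most regular expression libraries, so the definitions can be tried out before writing a Lex file. The following sketch mirrors the digits, int and real definitions using Python's re module rather than Lex itself. The names DIGITS, INT, REAL and the helper matches are chosen for this illustration only.

```python
import re

# Python counterparts of the named Lex definitions.
DIGITS = r"[0-9]"
INT = rf"{DIGITS}+"
REAL = rf"{INT}\.{INT}(?:[Ee][+-]?{INT})?"

def matches(pattern, text):
    """True if the whole of text is described by pattern."""
    return re.fullmatch(pattern, text) is not None

# Closure versus positive closure on the null string.
star_accepts_empty = matches(r"(a|b)*", "")     # True
plus_accepts_empty = matches(r"(a|b)+", "")     # False

# abc? makes only the final c optional; (abc)? makes the whole group optional.
optional_c = matches(r"abc?", "ab") and matches(r"abc?", "abc")
optional_group = matches(r"(abc)?", "") and matches(r"(abc)?", "abc")
```

Note that real requires digits on both sides of the decimal point, so "3.14" and "6.02e+23" match while "42" and ".5" do not.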
Token Descriptions and Actions

On recognizing a token, Lex returns to the parser an indication of what the token is. This section specifies such responses in terms of actions. For example,

{real}  return FLOAT;
{int}   return INTEGER;

User-Written Code

When the action for a token is complex, it is written as a function and included in this section, and the action section then invokes it with a function call.
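The pairing of a token pattern with an action can be imitated in a few lines. The sketch below is an illustration in Python, not what Lex generates: each rule pairs a token name with a pattern, and the "action" is simply to report that name together with the matched text. The names TOKEN_RULES, MASTER and tokenize, and the extra SKIP rule for white space, are assumptions of this example. Lex prefers the longest match; here the FLOAT rule is listed before INTEGER to get the same result for these two patterns.

```python
import re

# (token name, pattern) pairs, tried in this order at each position.
TOKEN_RULES = [
    ("FLOAT", r"[0-9]+\.[0-9]+(?:[Ee][+-]?[0-9]+)?"),
    ("INTEGER", r"[0-9]+"),
    ("SKIP", r"[ \t\n]+"),      # white space is matched but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))

def tokenize(text):
    """Return a list of (token name, matched text) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            # The "action": hand the token kind and its text to the caller.
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling tokenize("12 3.5 1.0E-3") yields one INTEGER token followed by two FLOAT tokens, which is the kind of stream a parser would consume.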
YACC

YACC takes a grammar as its input and generates the parsing table and a program (in C) that implements a look-ahead LR (LALR) parser. The input also provides semantic actions for each production rule, and YACC generates code for carrying out these actions. The input to YACC also consists of three sections, as for Lex.
- Declarations and Definitions
- Grammar and Actions
- User-written Codes
We will briefly describe each section. (For more details, see a reference manual for YACC.)

Declarations and Definitions

In this section, all tokens except single-character operators are defined. To help the parser, we can also specify operator precedence and associativity (left or right) in this section. To establish proper links to other parts of the C program, it also includes facilities for declaring variables and defining types. Here are some examples.
%token ID      /* token for identifier */
%token NUMBER  /* token for numbers */
%token BEGIN END ARRAY FILE
%{
/* C declarations */
#include "yylex.c"   /* include lexical scanner */
extern int yylval;   /* token values from yylex */
int tcount = 0;      /* a temporary integer variable */
%}
%start S       /* grammar's start symbol */

Grammar and Actions

The grammar is written in a form similar to BNF, as follows.
- Single characters used as terminals are put in single quotes, and non-terminals are written as names with no delimiters.
- Instead of →, a colon is used, and the right end of a production rule is marked by a semicolon.
- A blank is used to represent an ε-production.
Here is an example.

E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | i

expr : expr '+' term | expr '-' term | term ;
term : term '*' fact | term '/' fact | fact ;
fact : '(' expr ')' | ID ;

User-written codes

The user-written code section contains the main program that invokes the parser, named yyparse(), and other code if needed. So it should contain at least the following code.

int main(void) { return yyparse(); }
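To see what structure the expr/term/fact grammar assigns to an input, the following sketch parses expressions by hand. It is a recursive-descent parser written in Python, which is a different technique from the LALR parser that YACC generates; the left-recursive rules are handled with loops, which still gives left-associative trees. The names tokenize, Parser and parse, and the choice of nested tuples for the parse tree, are assumptions of this illustration.

```python
import re

# An identifier (the ID token) or a single-character operator.
TOKEN = re.compile(r"\s*(?:([A-Za-z_][A-Za-z0-9_]*)|([-+*/()]))")

def tokenize(text):
    """Split text into (kind, value) pairs; kind is 'ID' or the operator."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"bad input at position {pos}")
        tokens.append(("ID", m.group(1)) if m.group(1) else (m.group(2), m.group(2)))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expr(self):
        # expr : expr '+' term | expr '-' term | term ;
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()[0]
            node = (op, node, self.term())
        return node

    def term(self):
        # term : term '*' fact | term '/' fact | fact ;
        node = self.fact()
        while self.peek() in ("*", "/"):
            op = self.advance()[0]
            node = (op, node, self.fact())
        return node

    def fact(self):
        # fact : '(' expr ')' | ID ;
        if self.peek() == "(":
            self.advance()
            node = self.expr()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return node
        if self.peek() == "ID":
            return self.advance()[1]
        raise SyntaxError("expected identifier or '('")

def parse(text):
    """Parse text as an expr and return its tree as nested tuples."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return tree
```

For example, parse("a + b * c") groups the multiplication under the addition, and parse("a - b - c") nests the first subtraction inside the second, reflecting the precedence and left associativity built into the grammar.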