Scanners/Parsers in a Nutshell


1 Scanners/Parsers in a Nutshell

2 Basic Definitions
A scanner takes a stream of input characters (possibly discarding some) and produces a stream of tokens (each with a label and the associated characters). Scanners are said to tokenize input characters or to “scan” characters.
Input: a = 123
Output: Token (Identifier, “a”), Token (Assignment, “=”), Token (Integer, “123”), Token (EndOfFile, “”)
As compiler implementers, we decide on the labels to use. In a parser, we might refer to these tokens as Identifier, Assignment, Integer, EndOfFile, but we know they are really two-component objects.
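The definition above can be sketched in a few lines of Python. This is only a hand-written illustration, not the table-driven scanner the later slides describe; the names Token and scan are assumptions made for the example, and only identifiers, integers and = are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    label: str       # e.g. "Identifier"; the label is our choice
    characters: str  # the characters associated with the token


def scan(text):
    """Turn a string of characters into a list of tokens (whitespace is discarded)."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                                   # discard the character
        elif ch.isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(Token("Identifier", text[i:j]))
            i = j
        elif ch.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(Token("Integer", text[i:j]))
            i = j
        elif ch == "=":
            tokens.append(Token("Assignment", "="))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    tokens.append(Token("EndOfFile", ""))
    return tokens
```

Calling scan("a = 123") yields the four tokens listed on the slide, ending with the EndOfFile token.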

3 Basic Definitions
A parser takes a stream of input tokens and produces one tree representing the program (not all tokens are in the tree). Parsers are said to generate trees or to “parse” input; i.e., structure the input.
Input: Token (Identifier, “a”), Token (Assignment, “=”), Token (Integer, “123”), Token (EndOfFile, “”)
Output: a tree with label <- and children Token (Identifier, “a”) and Token (Integer, “123”)
The tree has a label (anything we choose) and an arbitrary number of children, each of which is either a token (a leaf) or another tree (a non-leaf).
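A minimal sketch of the tree structure just described, assuming hypothetical Python classes Token and Tree and a toy function parse_assignment that accepts only the four-token input shown on the slide. It is not the table-driven parser of the later slides.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Token:
    label: str
    characters: str


@dataclass
class Tree:
    label: str  # anything we choose
    children: List[Union["Tree", Token]] = field(default_factory=list)


def parse_assignment(tokens):
    """Parse exactly: Identifier Assignment Integer EndOfFile."""
    labels = [token.label for token in tokens]
    if labels != ["Identifier", "Assignment", "Integer", "EndOfFile"]:
        raise ValueError(f"unexpected token sequence: {labels}")
    identifier, _assignment, integer, _end = tokens
    # The Assignment and EndOfFile tokens do not appear in the tree.
    return Tree("<-", [identifier, integer])
```

Only two of the four input tokens end up as leaves, which illustrates the remark that not all tokens are in the tree.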

4 Terminology
Source code like a = 123 is the concrete syntax.
The corresponding tree representation is called an abstract syntax tree: a tree with label <- and children Token (Identifier, “a”) and Token (Integer, “123”). This is the abstract syntax.

5 Caution
We often say “string” when we mean “sequence”; we are thinking of a generalization of strings in which the elements can be any type of object. A string of tokens is a sequence of tokens.

6 Terminology
A grammar describes a language by means of a set of “potentially” recursively defined variables, some of which are termed goals if they are followed by braces to indicate what comes after the variable. Each variable describes itself in terms of other variables and ultimately non-variables (each description is called a production). Technical names are also provided on the next slide (in gaudy colors).
A parser grammar (one goal per grammar; e.g., LispList):
LispList {EndOfFile}
-> Identifier [node]
-> OpenBracket LispList * CloseBracket => "List".
(* means 0 or more)
A scanner grammar (many goals per grammar; e.g., Identifier, Integer):
all = alphabetic = "abc…zABC…Z".
Identifier {all - alphabetic} -> alphabetic+ => "Identifier" .
Integer {all - digit} -> digit+ => "Integer" .
(- means subtraction, + means 1 or more)

7 Color Coded Terminology (Technical Names Also Provided)
red: variables defined with -> (nonterminals)
pink: non-variables (terminals)
orange: variables defined with = for shorthand (macros)
green: tree building information
A parser grammar (one goal per grammar; terminals are called tokens):
LispList {EndOfFile}
-> Identifier [node]
-> OpenBracket LispList * CloseBracket => "List".
There is 1 production.
A scanner grammar (many goals per grammar; terminals are characters). Each character (or integer that can represent an unprintable character) is a terminal (256 for end of file).
all = alphabetic = "abc…zABC…Z".
digits = “ ".
Identifier {all - alphabetic} -> alphabetic+ => "Identifier" .
Integer {all - digit} -> digit+ => "Integer" .
There are 5 productions, 3 of which are macros.

8 Notation and Terminology
If we start with a string x containing any nonterminal A and replace it by a string described by its A-production, and repeat 0 or more times (possibly with different nonterminals), the final string y that we get is said to be derived from x.
Notationally, we write
x ⇒ y to indicate that it took 1 step
x ⇒* y to indicate that it took 0 or more steps
x ⇒+ y to indicate that it took 1 or more steps
Mathematicians call ⇒ a relation.
To be more specific about where A is inside the string, we write uAv ⇒ uwv. Since u and v are unchanged, we must be replacing A by w.

9 Notation and Terminology
If y is derived from x, the inverse is reducing y to x, and one step is a reduction.
Notationally, if x ⇒* y, then y ⇐* x, or equivalently y ⇒⁻¹* x, where the superscript -1 indicates an inverse (as with matrices).
In a reduction uwv ⇐ uAv, w is called a handle.

10 An Example Parser Derivation
A parser grammar:
LispList {EndOfFile}
-> Identifier [node]
-> OpenBracket LispList * CloseBracket => "List".
A parser derivation (there are NO metasymbols like * in a derivation):
LispList
⇒ OpenBracket LispList LispList LispList CloseBracket
⇒ OpenBracket Identifier LispList LispList CloseBracket
⇒ OpenBracket Identifier LispList OpenBracket CloseBracket CloseBracket
⇒ OpenBracket Identifier Identifier OpenBracket CloseBracket CloseBracket
An example list: (a b ())
I have also underlined the handle if you wish to work backwards.
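The single derivation step uAv ⇒ uwv can be sketched as a small Python function and used to replay the derivation of (a b ()). The function names derive_step and is_valid_rhs are assumptions for this example; the * metasymbol is expanded by hand into 0 or more copies of LispList before each step.

```python
NONTERMINAL = "LispList"


def is_valid_rhs(rhs):
    """True if rhs is one of the right-hand sides allowed for LispList."""
    if rhs == ["Identifier"]:
        return True
    return (
        len(rhs) >= 2
        and rhs[0] == "OpenBracket"
        and rhs[-1] == "CloseBracket"
        and all(symbol == NONTERMINAL for symbol in rhs[1:-1])  # LispList * expanded
    )


def derive_step(form, position, rhs):
    """Replace the nonterminal at `position` (the A in uAv) by rhs (the w)."""
    if form[position] != NONTERMINAL:
        raise ValueError("can only replace a nonterminal")
    if not is_valid_rhs(rhs):
        raise ValueError("rhs is not described by a LispList production")
    return form[:position] + rhs + form[position + 1:]


# Replaying the four steps of the derivation above.
form = ["LispList"]
form = derive_step(form, 0, ["OpenBracket", "LispList", "LispList", "LispList", "CloseBracket"])
form = derive_step(form, 1, ["Identifier"])
form = derive_step(form, 3, ["OpenBracket", "CloseBracket"])
form = derive_step(form, 2, ["Identifier"])
```

Reading the four calls from last to first gives the corresponding reductions, with the replaced rhs playing the role of the handle.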

11 An Example Scanner Derivation
A scanner grammar:
all = alphabetic = "abc…zABC…Z".
digits = “ ".
Identifier {all - alphabetic} -> alphabetic+ => "Identifier" .
Integer {all - digit} -> digit+ => "Integer" .
A scanner derivation (there are NO metasymbols like * in a derivation):
Identifier ⇒ testing
The derivation is very shallow since there is no recursion and there are no replacements that end up with new nonterminals.
Another scanner derivation:
Integer ⇒ 2019

12 How a Scanner Works
A scanner performs the following until the parser stops asking for a token.
It performs scanner readahead (via scanner readahead tables); i.e., it reads inputs with attribute read while accumulating the characters that have attribute keep. By contrast, attribute look leaves the character in the input and noKeep does NOT record the character.
It performs a semantic action (via a semantic action table), typically called buildToken with a parameter (say, X). This creates a token with label X and the kept characters recorded above, and makes it available to the parser. It then clears the kept characters.
Once it generates an EndOfFile token, the parser will stop asking for more.

13 Scanning: An Example
The scanner performs the following on input test = “123”;
1: “scanner readahead” picks up and records t, e, s, t and “semantic action” creates Token (“Identifier”, “test”)
2: “scanner readahead” discards “space” and records = and “semantic action” creates Token (“Assign”, “=”)
3: “scanner readahead” discards “space” and the double quote, records 1, 2, 3, discards the double quote and “semantic action” creates Token (“String”, “123”)
4: “scanner readahead” picks up ; and “semantic action” creates Token (“Semicolon”, “;”)
5: “scanner readahead” picks up EndOfFile and “semantic action” creates Token (“EndOfFile”, “”)

14 How Does It Do This?
What to do is encoded in data structures (compiler writers call them tables since it is shorter to say). First, let’s see what tables look like diagrammatically. There are 2 types (of course). Second, let’s see what tables look like as an actual data structure.

15 Scanner Readahead Tables
A scanner readahead table deals with input characters, viewed as a stream: peek gets a character (it stays in the input) and next removes it from the input.
read (no squiggly bracket) => perform next
look (squiggly bracket) => do not perform next
Attributes are written in square brackets: keep means record the character in keptCharacters, noKeep means don’t.
Example transitions: a [keep], " [noKeep], {b} [noKeep]
Each transition has 2 attributes, read/look and keep/noKeep, and points at another table (the goto, not shown here).
The process starts by peeking at the input, then using the transitions to decide what to do.
What’s in input | keptCharacters | What happens
a | “a” | a is kept; a is no longer in the input
" | “” | " is discarded; " is no longer in the input
b | “” | b is discarded; b is still in the input
At the end, switch to goto.

16 Semantic Tables
A semantic table deals with special processing. It doesn’t care about the input.
Example: buildToken (name), with 1 unlabelled transition pointing at the next table (the goto).
The process is to build a new token with the name provided and with the characters in “keptCharacters”. This token is an output of the scanner, viewed as an output stream. It also resets “keptCharacters” to an empty string.
What token does it create? Token (name, “test”), assuming keptCharacters contains “test”.
At the end, keptCharacters is cleared and we switch to goto.

17 With Multiple Tables
Input: test;
The diagram chains Scanner Readahead tables, with transitions such as a [keep], b [keep], 0 [keep] and {;} [noKeep], to a SemanticTable with buildToken (Identifier).
As an object oriented programmer, think of the tables as objects that perform the processing.
This produces the output Token (Identifier, “test”).
Input is now: ;
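Thinking of the tables as objects, the chain for the input test; might look like the following Python sketch. The class and attribute names, the table numbers 1 to 3, and the encoding of attributes as the letters R (read), L (look) and K (keep) are assumptions for this example; it also assumes keep is only honoured on a read transition.

```python
import string

EOF = 256  # the slides use 256 for end of file


class ScannerReadaheadTable:
    def __init__(self, transitions):
        # Each transition: (set of characters, attributes, goto table number).
        self.transitions = transitions

    def run(self, scanner):
        character = scanner.peek()
        for characters, attributes, goto in self.transitions:
            if character in characters:
                if "R" in attributes:          # read: remove it from the input
                    scanner.next()
                    if "K" in attributes:      # keep: record the character
                        scanner.kept_characters += character
                return goto                    # look: the input is untouched
        raise ValueError(f"no transition for {character!r}")


class SemanticTable:
    def __init__(self, label, goto):
        self.label, self.goto = label, goto

    def run(self, scanner):
        # buildToken: emit (label, kept characters), then clear the kept characters.
        scanner.tokens.append((self.label, scanner.kept_characters))
        scanner.kept_characters = ""
        return self.goto


class Scanner:
    def __init__(self, text, tables, start=1):
        self.text, self.position = text, 0
        self.tables, self.table_number = tables, start
        self.kept_characters, self.tokens = "", []

    def peek(self):
        return self.text[self.position] if self.position < len(self.text) else EOF

    def next(self):
        self.position += 1

    def next_token(self):
        produced = len(self.tokens)
        while len(self.tokens) == produced:    # run tables until a token is built
            self.table_number = self.tables[self.table_number].run(self)
        return self.tokens[-1]


LETTERS = set(string.ascii_letters)
TABLES = {
    1: ScannerReadaheadTable([(LETTERS, "RK", 2)]),
    2: ScannerReadaheadTable([(LETTERS, "RK", 2), ({";", EOF}, "L", 3)]),
    3: SemanticTable("Identifier", 1),
}
```

With Scanner("test;", TABLES), one call to next_token() returns the Identifier token for test and leaves ; in the input, because the transition on ; is a look.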

18 The Tables as Data Structures
(ScannerReadaheadTable 1 (“ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz” “RK” 2) (( ) 'L' 11) (' ” “L” 3))
(ScannerReadaheadTable 2 (“ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz” “RK” 2) (“+*=[]{}()^;#:.$ ' 'L' 11) (' ” “RK”5))
(SemanticTable 5 buildToken: Identifier 1)
A ScannerReadaheadTable is an array holding the table number followed by successive triples (3 element subarrays). Each triple holds a string of characters OR an array of integers (unprintables), the attributes, and the goto state.
A SemanticTable is an array holding the table number, the name of the semantic action, the parameters for the semantic action (0 or more), and the goto state.
Input: test;
This produces the output Token (Identifier, “test”)

19 Parsers
Parsers are more complex than scanners. They parse bottom up according to the grammar by replacing sequences of tokens by “grammar” tokens; i.e., they perform reductions while at the same time constructing a tree.
List -> List Integer [node] => "Sequence"
-> Integer [node]
Here Integer is really a token; e.g., Token (Integer, “123”).
To be able to parse, a parser makes use of 3 parallel stacks (token stack, table number stack, tree stack) and two variables, left and right (positions in the stack). The top of the stack is on the right.

20 How a Parser Works
A parser performs the following until it decides to perform accept.
It performs readahead (via readahead tables); i.e., reading inputs while optionally stacking (in parallel) information associated with this input, with variable right being made to track the top of the stack (left is dragged past right).
It performs readback (via readback tables); i.e., looking at stack information “to the left of variable left” in order to decide whether or not to “decrement left”.
It performs a semantic action (via a semantic action table), typically called buildTree with a parameter (say, X). This takes the tree stack entries between left and right, creates a new tree with root X and those entries as children, and stores it in a variable called newTree.
It performs a reduce to A operation (via a reduce table), knowing that everything in the stack between left and right (inclusive) corresponds to the right side of a grammar “rule” and the left side corresponds to “A”. It picks up the new tree (if there was one) or the tree in the stack (otherwise), and then pops everything between left and right, pushing in its place the token for A and its tree. Information on top of the stack after the pop is used to figure out which table to use next.

21 There are 6 types of Parser Tables
Readahead, Readback, Semantic action, Reduce to A, Shiftback n, Accept. Let’s consider each one in turn.

22 Readahead Tables
A readahead table deals with input tokens, viewed as a stream: peek gets a token (it stays in the input) and next removes it from the input.
read (no squiggly bracket) => perform next
look (squiggly bracket) => do not perform next
Example transitions: Identifier [node, stack], = [noNode, stack], {;} [noNode, noStack]
node => record in the tree stack (it becomes a node); noNode means don’t.
stack => transfer input information to the stack; noStack means don’t.
Each transition has 3 attributes, read/look, node/noNode, and stack/noStack, and points at another table (the goto, not shown).
The process starts by peeking at the input token, then using the transitions to decide what to do.
What’s in input | How 3 Stacks Change | What happens to the input
Identifier | Push (Identifier, goto number, Identifier) | Identifier is no longer in the input
= | Push (=, goto number, nil) | = is no longer in the input
; | Do nothing | ; is still in the input
At the end, set right to the top index and left to “1 more”, then switch to goto.
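One readahead step on the three parallel stacks could be sketched as follows. The names ParserState and readahead, the letters R/L (read/look), S (stack) and N (node), the 0-based indices, and the goto numbers 4, 5 and 6 are all assumptions made for this illustration.

```python
class ParserState:
    def __init__(self, tokens):
        self.input = list(tokens)        # remaining input tokens: (label, characters)
        self.token_stack = []
        self.table_number_stack = []
        self.tree_stack = []
        self.left = 0
        self.right = -1                  # index of the top of the (empty) stack


def readahead(state, transitions):
    """Run one readahead table; transitions maps a token label to (attributes, goto)."""
    token = state.input[0]                           # peek: the token stays in the input
    attributes, goto = transitions[token[0]]
    if "R" in attributes:
        state.input.pop(0)                           # next: remove it from the input
        if "S" in attributes:
            state.token_stack.append(token)
            state.table_number_stack.append(goto)
            # node keeps the token as a tree; noNode stores None (nil)
            state.tree_stack.append(token if "N" in attributes else None)
    state.right = len(state.token_stack) - 1         # right tracks the top of the stack
    state.left = state.right + 1                     # left is "1 more"
    return goto


# The three example transitions: Identifier [node, stack], = [noNode, stack], {;} [noNode, noStack]
TRANSITIONS = {"Identifier": ("RSN", 4), "=": ("RS", 5), ";": ("L", 6)}
```

In a real parser each goto would select a different table; the sketch reuses one transition dictionary only to exercise the three rows of the table above.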

23 Readback Tables
A readback table deals with moving “left” to the left.
read (no squiggly bracket) => decrement “left”
look (squiggly bracket) => don’t do anything
Example transitions: (Identifier 10), {(Number 20)}
Each transition has a pair (symbol and state number) and 1 attribute, read/look, and points at another table (the goto).
The process starts by peeking at the token and table number stacks at index position “left – 1”, then using the transition that matches to decide what to do. It does not look in the tree stack.
What’s in the stack | What we do
(Identifier 10) | Decrement “left”
(Number 20) | Do nothing
At the end, switch to goto.
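A readback step can be sketched as a lookup keyed by the pair found just to the left of left. The function name readback, the R/L attribute letters, the 0-based indices, and the goto numbers 30 and 40 are assumptions for the example.

```python
def readback(token_stack, table_number_stack, left, transitions):
    """Run one readback table.

    transitions maps (symbol, table number) to (attribute, goto).
    Returns the possibly decremented left and the goto table number.
    """
    if left <= 0:
        raise ValueError("nothing to the left of left")
    symbol = token_stack[left - 1][0]              # label of the token at left - 1
    table_number = table_number_stack[left - 1]
    attribute, goto = transitions[(symbol, table_number)]
    if attribute == "R":                           # read: decrement left
        left -= 1
    return left, goto                              # look: left is unchanged


# The two example transitions: (Identifier 10) is read, {(Number 20)} is look.
TRANSITIONS = {("Identifier", 10): ("R", 30), ("Number", 20): ("L", 40)}
```

The tree stack is never consulted, matching the note above.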

24 Semantic Action Tables
A semantic table deals with special processing. It doesn’t care about the input.
Example: buildTree (rootName), with 1 unlabelled transition pointing at another table.
The process is to build a new tree (stored in “newTree”) with the root provided in the table and the children extracted from the tree stack between left and right (inclusive).
At the end, switch to goto.
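The buildTree action might be sketched like this. The Tree class and the function name build_tree are assumptions, as is the choice to skip nil (None) entries left on the tree stack by noNode transitions.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Tree:
    label: str
    children: List[Any] = field(default_factory=list)


def build_tree(root_name, tree_stack, left, right):
    """Build newTree from the tree stack entries between left and right (inclusive)."""
    children = [entry for entry in tree_stack[left:right + 1] if entry is not None]
    return Tree(root_name, children)
```

For example, with a tree stack of [None, ("Integer", "1"), None, ("Integer", "2")], left = 1 and right = 3, build_tree("+", ...) produces a + tree with the two Integer tokens as children.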

25 Reduce to A Tables
A reduce to A table deals with popping and simulating the reading of nonterminal A.
As before, no squiggly bracket means read and a squiggly bracket means look.
Example transitions: (10 [node, stack] 20), {(30 [node, stack] 50)}
Each transition is a triple (“from state”, attributes, “to state”) with 3 attributes, read/look, node/noNode, and stack/noStack; the “to state” is the goto.
The process starts by picking up newTree (if it’s not nil); otherwise, the tree in the stack between left and right (it’s an error if there is more than one). Hold it in a variable called tree. If there isn’t any, use nil.
Then pop everything in the 3 stacks between left and right and create an “A token”.
Next, find the transition in which the table number on top of the stack matches “from state”.
Finally, simulate what the readahead table would do reading the A token with the attributes and to state in the transition you found.
What’s the attribute | What we do
look or noStack | Do nothing
node | Push (A token, goto, tree); i.e., you want the tree
noNode | Push (A token, goto, nil); i.e., you don’t want the tree
At the end, switch to goto.
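The reduce steps above can be sketched as one Python function. The name reduce_to, the R/L, S and N attribute letters, the representation of the A token as (nonterminal, ""), and the presence of a bottom stack entry whose table number serves as the “from state” are assumptions of this sketch, not details confirmed by the slides.

```python
def reduce_to(nonterminal, stacks, left, right, new_tree, transitions):
    """Perform "reduce to A" on the 3 parallel stacks.

    stacks is (token_stack, table_number_stack, tree_stack);
    transitions maps a "from state" to (attributes, goto).
    """
    tokens, table_numbers, trees = stacks
    if new_tree is not None:
        tree = new_tree
    else:
        candidates = [t for t in trees[left:right + 1] if t is not None]
        if len(candidates) > 1:
            raise ValueError("more than one tree between left and right")
        tree = candidates[0] if candidates else None
    for stack in stacks:                       # pop everything between left and right
        del stack[left:right + 1]
    if not table_numbers:
        raise ValueError("no table number left on the stack")
    attributes, goto = transitions[table_numbers[-1]]
    if "R" in attributes and "S" in attributes:
        tokens.append((nonterminal, ""))       # the "A token"
        table_numbers.append(goto)
        trees.append(tree if "N" in attributes else None)
    return goto                                # look or noStack: nothing is pushed
```

After the call, the table number now on top of the stack is the goto, which is what the parser uses to choose the next table.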

26 “Shiftback n” Tables
A shiftback n table decrements “left” by n. This is an optimized variation of a readback table where we don’t need to consider what’s in the stack.
Decrement left by n and switch to the goto.

27 Accept Table (only one)
There is nothing to do except return nil to indicate that parsing is done…

28 What Do Tables Look Like As Data Structures
(ReadaheadTable 1 (Integer 'RSN' 27) (Identifier 'RSN' 4) ('(' 'RS' 5))
(ReadbackTable 21 ((Term 12) 'RSN' 40) ((Term 3) 'RSN' 40))
(ShiftbackTable )
(ReduceTable 35 Expression (1 'RSN' 2)(5 'L' 10)(8 'RN' 13)(9 'RSN' 14)(15 'RSN' 17))
(SemanticTable 39 buildTree: '+' 35)
(AcceptTable 43)
Each table starts with its table number.
ReadaheadTable: triples of symbol × attributes × goto
ReadbackTable: triples of pair × attributes × goto, where pair is symbol × table number
ShiftbackTable: shift amount and goto
ReduceTable: the nonterminal to reduce to, then triples of “table number at top of table number stack” × attributes × goto
SemanticTable: “action followed by 0 or more parameters” × goto
An identifier inside an array literal ends up being a symbol (a unique string).

29 Making the Scanner/Parsers More Efficient
Note all 4 types of tables consist of triples. Use dictionaries since keys are looked up using hashing.
Scanners:
(ScannerReadaheadTable 1 (“ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz” “RK” 2) (( ) 'L' 11) (' ” “RK”3))
All keys could be integers. All values could be pairs, even in other tables like those below; e.g., (“RK” 2).
Parsers:
(ReadaheadTable 1 (Integer 'RSN' 27) (Identifier 'RSN' 4) ('(' 'RS' 5)): keys could be symbols.
(ReadbackTable 21 ((Term 12) 'RSN' 40) ((Term 3) 'RSN' 40)): more complicated keys such as Term 12; use a symbol keyed dictionary of “integer keyed dictionary of pairs”.
(ReduceTable 35 Expression (1 'RSN' 2)(5 'RSN' 10)(15 'RSN' 17)): keys “could be from state”.
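The dictionary idea can be sketched in Python, where dict lookups are hashed. The variable and function names are assumptions; the entries are copied from the example tables on this slide, with only the letters-and-underscore triple used for the scanner table, since the other scanner triples are not legible in the transcript.

```python
import string

# Scanner readahead table 1: integer (character code) keys, (attributes, goto) values.
SCANNER_READAHEAD_1 = {
    ord(character): ("RK", 2)
    for character in string.ascii_uppercase + "_" + string.ascii_lowercase
}

# Parser readahead table 1: symbol keys.
READAHEAD_1 = {"Integer": ("RSN", 27), "Identifier": ("RSN", 4), "(": ("RS", 5)}

# Readback table 21: a symbol keyed dictionary of integer keyed dictionaries of pairs.
READBACK_21 = {"Term": {12: ("RSN", 40), 3: ("RSN", 40)}}

# Reduce table 35 (reduce to Expression): keyed by "from state".
REDUCE_35 = {1: ("RSN", 2), 5: ("RSN", 10), 15: ("RSN", 17)}


def lookup_readback(table, symbol, table_number):
    """Two hashed lookups replace a linear search through the triples."""
    return table[symbol][table_number]
```

Each lookup, such as READAHEAD_1["Identifier"] or lookup_readback(READBACK_21, "Term", 12), returns the (attributes, goto) pair directly.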

30 Let’s Look At Code It’s a small hierarchy of classes. It’s object-oriented; doesn’t have to be. It can be efficient.

31 Now What?
You are now ready to implement your own scanner/compiler. It’s not impossible to implement the whole thing from scratch in, say, 3 weeks. To make it easy, I give you all the code with a little over a dozen methods stripped of their code. Of course, you have no clue how to create the tables it needs.

