Forth lecture 18: General purpose parsing

INTRODUCTION

We will try to understand about parsing in general -- not just Forth -- by looking at Pascal.

Lexical parsing of Pascal.

One can of course change the rules of the game. We forthers tend to do that all the time. Then it is extremely easy to lexically parse Pascal. Just demand from the user to put spaces around every keyword and identifier.
What remains are Pascal strings and Pascal comment. Both can easily be handled by recognizers, or even without recognizers. This time I'll try to play by the Pascal-rules.

In ciforth I rewrote the parsing internals of Forth.
: NAME
    _ BEGIN DROP PP@@ ?BLANK NOT OVER EOP = OR UNTIL  \ ( -- start )
    _ BEGIN DROP PP@@ ?BLANK UNTIL                   \ ( -- start end )
OVER - ;
It is easy enough to understand, parse to the first non-blank, then to the first blank. Subtract pointers to get the length. The end of the parse area results in a token with length zero, how convenient.
A small trick. We really want to jump the first time to PP@@ . All the other time we have a pointer to get rid off. So we put a dummy value on the stack by _ 2].

This design is easily adapted to other parsers.
NAME can be revectored to accommodate Pascal and is better called TOKEN .
The first thing is to introduce delimiters. A word ?START (char -- flag) replaces ?BLANK . This gives TRUE for all the weird characters that Pascal may use to end a name like + : ; etc. in the following example:
   a:=b+c;
With
: pascal-delimiters "-+*/=<>:;.{}()[]," ;
the second line becomes:
      _ BEGIN DROP PP@@ DUP ?BLANK SWAP ?START OR UNTIL   ( end)

There is however a subtlety. If the character is a delimiter, it has been parsed. It is part of the next token so we have to add the line
      DUP C@ ?START IF PP-- THEN
in order to set the parse pointer back by one character, using PP-- .
The implementation of ?START and PP-- is left as an exercise for the reader.

This goes a long way. However if parsing
   a:=b+c;
has proceeded to the ':' the token found is ":=b" . Remember that '+' is a delimiter. Looking ":=b" up we want to find ":=" . In ciforth this is easily accomplished by declaring the := word in the Pascal namespace a PREFIX . The result is that the word ":=" is FOUND , and the parse pointer is set after the ":=" , so at the 'b' . Now := executes whatever it is supposed to do.

There is one more subtlety and an even more subtle one.
There are keywords that are a prefix to some other keyword in Pascal. An example of such pair is < and <= . This is solved by carefully defining longer keywords later, so earlier in the search order.

Then I discovered that didn't help in this case. Suppose we are at the '<' in "a<=b". Now we cannot find the word <= . The reason is that '=' is a delimiter so the string "<" is searched for in the dictionary and <= cannot be found at all. This also has a simple solution. All those irregular tokens are prefixes anyway, so instead of searching for a blank or delimiter, we just pass the whole remainder of the parse area to the find-engine. In this case that is "<=b", but it may be the better part of a file. Both prefixes "<" and "<=" match, but the latter is defined later. Note how impossible this would be with a `BL WORD FIND' mechanism.

So the new TOKEN becomes:
\ A name now starts with the next non-blank, but ends on a blank
\ or delimiter.
: TOKEN
      _ _ BEGIN 2DROP PP@@ OVER EOP = OVER ?BLANK NOT OR UNTIL
      ?START IF EOP OVER -  EXIT THEN
      _ _ BEGIN 2DROP PP@@ DUP ?BLANK OVER ?START OR UNTIL
      ?START IF PP-- THEN
      OVER - ;
The next step is to revector NAME to use this new definition. All high level word's in ciforth can be revectored in the following way.
    'TOKEN 'NAME  2 CELLS MOVE
Note that by making the delimiters empty we have the old Forth parsing back, so this revectoring never needs to be undone.

Special treatment is needed for comment, strings and numbers, as always. These are handled the same way as ciforth strings and numbers, with prefixes. Because they are defined in the pascal NAMESPACE , 1]
they are just separate from the corresponding Forth concepts. (namespaces are wordlists with a name.) pascal contains a catch-all at the end. This is an empty prefix that matches everything and make them identifiers. This also seals the namespace, so no Forth word is accessable, nor are the Forth strings and numbers. This is the way to parse a string (or file) with pascal code:
: parse-pascal   pascal-delimiters delimiters $!
    pascal EVALUATE PREVIOUS   "" delimiters $! ;
There you have it:
    OK
   "a:=b*12; {Bob is your uncle}"   parse-pascal

   Identifier: a
   operator: :=
   Identifier: b
   operator: *
   Number: 12
   Interpunction: ;
   Comment: Bob is your uncle
    OK
1] The pascal namespace contains words that print their own name like
: keyword        CREATE DOES>           $ego "Keyword" .cl+id ;
: operator       CREATE PREFIX DOES>    $ego "operator" .cl+id ;

operator =
keyword  elsif
: {   &} PARSE "Comment:" .cl+id ;  PREFIX
2] Some Forthers are familiar with AHEAD . It can be used to replace
      _ _ BEGIN 2DROP PP@@ OVER EOP = OVER ?BLANK NOT OR UNTIL
by
      AHEAD BEGIN 2DROP THEN PP@@ OVER EOP = OVER ?BLANK NOT OR UNTIL
or so they say.

  • Other Forth lectures
  • Go to the home page of Albert van der Horst