Forth lecture 18: General purpose parsing
INTRODUCTION
We will try to understand about parsing in general -- not just
Forth -- by looking at Pascal.
Lexical parsing of Pascal.
One can of course change the rules of the game.
We forthers tend to do that all the time.
Then it is extremely easy to lexically parse Pascal.
Just demand from the user to put spaces around every keyword
and identifier.
What remains are Pascal strings and Pascal comment.
Both can easily be handled by recognizers, or even without recognizers.
This time I'll try to play by the Pascal-rules.
In ciforth I rewrote the parsing internals of Forth.
-
before parsing starts, the whole source is put into a
contiguous memory area. There are three relevant pointers,
to the start and after the end of the parse area and at the
next character to be parsed, the parse pointer.
-
the elementary action is to advance the parse pointer PP
using PP@@ .
PP@@ returns a pointer to the character parsed and the character
itself.
PP@@ advances PP past the character.
If the parse area is exhausted it points past the parse
area and returns a zero, which is counted as blank space.
-
the interpreter does a NAME which results in (addr n) which is
passed to FOUND.
-
any denotations (literals) like numbers, strings xt's etc. are handled by
prefixes.
: NAME
_ BEGIN DROP PP@@ ?BLANK NOT OVER EOP = OR UNTIL \ ( -- start )
_ BEGIN DROP PP@@ ?BLANK UNTIL \ ( -- start end )
OVER - ;
It is easy enough to understand, parse to the first non-blank,
then to the first blank. Subtract pointers to get the length.
The end of the parse area results in a token with length zero,
how convenient.
A small trick. We really want to jump the first time to PP@@ .
All the other time we have a pointer to get rid off.
So we put a dummy value on the stack by _ 2].
This design is easily adapted to other parsers.
NAME can be revectored to accommodate Pascal and
is better called TOKEN .
The first thing is to introduce delimiters. A word ?START
(char -- flag) replaces ?BLANK .
This gives TRUE for all the weird characters that
Pascal may use to end a name like + : ; etc. in the following example:
a:=b+c;
With
: pascal-delimiters "-+*/=<>:;.{}()[]," ;
the second line becomes:
_ BEGIN DROP PP@@ DUP ?BLANK SWAP ?START OR UNTIL ( end)
There is however a subtlety.
If the character is a delimiter, it has been parsed.
It is part of the next token so we have to
add the line
DUP C@ ?START IF PP-- THEN
in order to set the parse pointer back by one character,
using PP-- .
The implementation of ?START and PP-- is left as an
exercise for the reader.
This goes a long way. However if parsing
a:=b+c;
has proceeded to the ':' the token found is ":=b" .
Remember that '+' is a delimiter.
Looking ":=b" up we want to find ":=" .
In ciforth this is easily accomplished by declaring
the := word in the Pascal namespace a PREFIX .
The result is that the word ":=" is FOUND ,
and the parse pointer is set after the ":=" ,
so at the 'b' .
Now := executes whatever it is supposed to do.
There is one more subtlety and an even more subtle one.
There are keywords that are a prefix to some other keyword in Pascal.
An example of such pair is < and <= .
This is solved by carefully defining longer keywords later,
so earlier in the search order.
Then I discovered that didn't help in this case.
Suppose we are at the '<' in "a<=b".
Now we cannot find the word <= .
The reason is that '=' is a
delimiter so the string "<" is searched for in the dictionary
and <= cannot be found at all.
This also has a simple solution.
All those irregular tokens are prefixes anyway,
so instead of searching for a blank or delimiter,
we just pass the whole remainder of the parse area to the find-engine.
In this case that is "<=b",
but it may be the better part of a file.
Both prefixes "<" and "<=" match,
but the latter is defined later.
Note how impossible this would be with a `BL WORD FIND'
mechanism.
So the new TOKEN becomes:
\ A name now starts with the next non-blank, but ends on a blank
\ or delimiter.
: TOKEN
_ _ BEGIN 2DROP PP@@ OVER EOP = OVER ?BLANK NOT OR UNTIL
?START IF EOP OVER - EXIT THEN
_ _ BEGIN 2DROP PP@@ DUP ?BLANK OVER ?START OR UNTIL
?START IF PP-- THEN
OVER - ;
The next step is to revector NAME to use this new definition.
All high level word's in ciforth can be revectored in the following way.
'TOKEN 'NAME 2 CELLS MOVE
Note that by making the delimiters empty we have the old
Forth parsing back,
so this revectoring never needs to be undone.
Special treatment is needed for comment, strings and numbers, as
always.
These are handled the same way as ciforth strings and numbers, with
prefixes. Because they are defined in the pascal
NAMESPACE , 1]
they are just separate from the corresponding
Forth concepts. (namespaces are wordlists with a name.)
pascal contains a catch-all at the end. This is an empty
prefix that matches everything and make them identifiers.
This also seals the namespace, so no Forth word is accessable,
nor are the Forth strings and numbers.
This is the way to parse a string (or file) with pascal code:
: parse-pascal pascal-delimiters delimiters $!
pascal EVALUATE PREVIOUS "" delimiters $! ;
There you have it:
OK
"a:=b*12; {Bob is your uncle}" parse-pascal
Identifier: a
operator: :=
Identifier: b
operator: *
Number: 12
Interpunction: ;
Comment: Bob is your uncle
OK
1]
The pascal namespace contains words that print their own name like
: keyword CREATE DOES> $ego "Keyword" .cl+id ;
: operator CREATE PREFIX DOES> $ego "operator" .cl+id ;
operator =
keyword elsif
: { &} PARSE "Comment:" .cl+id ; PREFIX
2]
Some Forthers are familiar with AHEAD .
It can be used to replace
_ _ BEGIN 2DROP PP@@ OVER EOP = OVER ?BLANK NOT OR UNTIL
by
AHEAD BEGIN 2DROP THEN PP@@ OVER EOP = OVER ?BLANK NOT OR UNTIL
or so they say.
Other Forth lectures
Go to the home page of Albert van der Horst