Forth lecture 10.

Forth as a scripting language.

What do we need for scripting? A lot of things, but let us order them by subject.

No interactive message
Looping while interpreting
Looping over strings
How to handle regular expressions in Forth

No messages during startup, or later.

For scripting we must get rid of messages during startup. Normally a sign-on message is presented at startup, showing which Forth and which version you are talking to.

It helps if we have all this information gathered in a single word called .SIGNON. Typically it prints the contents of the environment queries NAME SUPPLIER VERSION CPU. Of course CPU is a double number printed in base 36.

The FIG tradition printed this sign-on message with each ABORT . Coos Haak insists that ABORT should be silent, because QUIT is supposed to be silent. I am not quite convinced this is a correct interpretation of the ISO Forth standard. I see systems like GForth printing a lot of information on ABORT (or any THROW) and I think that is a Good Thing. It is a good idea to have diagnostic information printed at the place where the final and fatal exception is caught, but I also think it is good to separate this from code that is supposed to do a reinitialisation. Of course error detection and post mortem analysis is an area where there should be much room for customization, and a possibility to insert sophisticated tools. That comes later on, because it definitely doesn't belong in the kernel of a Forth system.

So ideally we have about this situation.
ABORT executes 2 THROW. The exception is caught and all possible help is given to find out about the error. Then execute a silent reinitialisation that for a lack of a better name we could call (ABORT). (Sane people would call it INIT.) This word has the effect of QUIT plus cleaning of stacks.

Bottom line is that COLD calls (ABORT) and this doesn't result in any messages.
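
A minimal sketch of that division of labour in standard Forth, assuming a system that has CATCH and THROW. The names LAST-ERROR, DIAGNOSE, SILENT-INIT, TOP-LEVEL and FAILING are made up for the illustration, and SILENT-INIT is only a placeholder for what (ABORT) would really do.

```forth
VARIABLE LAST-ERROR                 \ hypothetical: keeps the last throw code

\ All reporting is concentrated where the exception is finally caught.
: DIAGNOSE ( code -- )  DUP LAST-ERROR !  ." uncaught exception " . CR ;

\ Placeholder for the silent reinitialisation. A real one would reset
\ the stacks and re-enter the interpreter without printing anything.
: SILENT-INIT ( -- ) ;

\ Run xt; report only if it threw, then always reinitialise quietly.
: TOP-LEVEL ( xt -- )  CATCH ?DUP IF DIAGNOSE THEN  SILENT-INIT ;

: FAILING ( -- )  2 THROW ;         \ stands in for a word that aborts
' FAILING TOP-LEVEL
```

The point of the factoring is that the noisy part and the quiet part can be replaced independently: a script can plug in a terser DIAGNOSE without touching the reinitialisation.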

We are left with two sources of messages:

The word OK that is responsible for printing "OK" after each line of code is executed.
The sign on message.

So in scripting during cold startup we just don't print the sign on message. If we are not talking to a terminal, we just suppress .SIGNON . And we make OK shut up. Now that is easy.

'NOOP 'OK 3 CELLS MOVE

At least in ciforth that is easy. The above code copies the behavior of NOOP (a no operation word) into OK .

If you didn't factor out the printing of "OK" to a separate word, now is the time. It is a great place to insert a stack print if you are debugging too.
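
On a system where copying code fields is not an option, the same factoring could be sketched with a deferred word. This assumes DEFER , IS and .S are available; SHOW-OK, LOUD-OK, DEBUG-OK, SILENT-OK and the three switching words are invented names, not part of ciforth.

```forth
DEFER SHOW-OK                          \ hypothetical hook called after each line

: LOUD-OK   ( -- )  ."  OK" ;          \ the usual interactive behaviour
: DEBUG-OK  ( -- )  ."  OK " .S ;      \ debugging variant: also print the stack
: SILENT-OK ( -- ) ;                   \ scripting variant: say nothing

: VERBOSE   ( -- )  ['] LOUD-OK   IS SHOW-OK ;
: DEBUGGING ( -- )  ['] DEBUG-OK  IS SHOW-OK ;
: QUIET     ( -- )  ['] SILENT-OK IS SHOW-OK ;

VERBOSE                                \ default until we know better
```

A cold start would then call QUIET instead of patching OK , and DEBUGGING gives the stack print for free.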

Now the last trick. How do we find out whether we are talking to a terminal? This is a bit system-dependent.
In Linux that goes like this:

CREATE TERMIO 60 ALLOT
HEX 0 5401 TERMIO 36 LINOS 0<

This asks Linux, using operating system call 36 (hexadecimal, the ioctl call on Intel 386), to fill in TERMIO with the properties of the terminal. If it gives a negative result, that means it failed, and we are not connected to a terminal at all, but to a stream.

The constants 60, 5401 and 36 are looted from C after a long and bloody battle. On a typical Red Hat system, there are 8 files called termios.h , and one of them includes a file that defines TCGETS as 5401 hexadecimal. (Or includes a file that includes ... ).
So at last, this is the code to be present in COLD (with HEX still in effect):

0 5401 TERMIO 36 LINOS 0< IF
    'NOOP 'OK 3 CELLS MOVE
ELSE
    .SIGNON
THEN

And if you don't want to do it yourself.

ALSO ENVIRONMENT
: .SIGNON
CR SUPPLIER TYPE " is proud to present " TYPE
CR BASE @ 36 BASE ! CPU D. BASE ! NAME TYPE SPACE VERSION TYPE CR
;
PREVIOUS

This assumes that environment queries are Forth words present in an ENVIRONMENT word list. This is not ISO, but this approach is taken by GForth, iForth, tForth, ciforth and probably others.


Simple scripting.

Let's say we have a Forth that shuts up if it senses that we
are talking to it through a channel, so not an interactive
terminal. Then in a Unix system we already have a practical
scripting system, in combination with the powers of the Unix
command interpreters (called "shells").
For example a script to add 1 to 2 and print the results:

    forth << 'THEEND'
    1 2 + .
    BYE
    THEEND

This uses a feature called a here document. The remainder up till
"THEEND" is passed to the forth program.

Of course it is more useful to have a script called add, and pass
it the parameters 2 and 3:

    add 2 3
    5


The script would now look like:


    forth << THEEND
    $1 $2 + . CR
    BYE
    THEEND

The quotes are missing from THEEND. To the shell this means that it must interpret the lines before passing them on. In this case $1 gets replaced by 2 and $2 by 3. The shell will also make the Unix environment available, a set of strings with information about the environment a program is running in. An environment variable is a name, not a number. It is likewise preceded by $ , for example "$HOME", and expanded by the shell to what it was set to. Environment variables contain such things as the current directory, the user's name, and all sorts of information you care to pass to programs, such as library names, or the preferred place for video editing and cd writer programs to write huge scratch files. The most famous is undoubtedly PATH. It is a row of directories where the shell looks for programs.

Of course passing Forth code through a shell is dangerous. Unix shells are the kind of tool shown in that picture of Brodie's. (On my page I will show the hammer-screwdriver-whatnot if I can get permission.) A shell will do so many things that at least one is unexpected, causing problems. (Careful people can put all lines between single quotes by default, but that is ugly.)

As an aside, the command interpreters on MSDOS systems are plain bad in comparison. The default ones are all called COMMAND , they change without notice, they are not powerful and they are not sufficiently documented. There seems to be an official Korn-shell for Windows, but it is not according to the specification (says a man named Korn [1]). However, that being said, the above techniques apply to MSDOS mutatis mutandis and can achieve useful results.
[1] I hope that is no urban legend. Even if it is, it is the kind of anecdote that is true, even if it isn't.

The environment

A Forth running on a host operating system needs access to the information available to all programs running there, called the environment. This is especially true for scripts, because they are mostly parts of a large body of cooperating small programs. We have seen that even a simple Forth can do such scripting, because a shell can give us the content of parameters and environment variables. But to get serious, we must be able to access them directly.

The Unix system, the Bourne shell and the Kernighan&Ritchie c-compiler were all designed together. No wonder that they cooperate well. A shell passes the command line arguments and the environment variables to C as you can see in the declaration of main :

int main(int argc, char *argv[], char *env[]);

A c-program has nothing to translate, the parameters are just there because the shell is expecting a c-program. On operating systems oriented towards other languages, such as MSDOS where the systems programming language is BASIC, a c-program needs a preamble to analyse data areas. It is in that respect no better off than Forth.

You see that a program also passes an int back. A zero indicates a successful completion, any other number identifies an error condition, comparable with a throw code. It is a pity that Forth has no provision in BYE to pass information back. However it is of course possible to have a variable EXIT-CODE or some such and pass its value to the OS during BYE . This cannot break any existing code. It is implemented in ciforth.

What hook do we need in a Forth system to get at the argument and environment information? Under a Unix system this is typically extremely simple. On a Forth that relies on C for the connection with the operating system, such as GForth, it is both simple and portable. On a Forth defined in assembler it is still quite simple, but system dependent.

A c-function gets its arguments via the stack. The function main is no exception to this. It is sufficient to remember the stack pointer.

The following example is from ciforth for GNU-Linux on Intel 386:

        MOV      LONG[USINI+(CW*(31))],ESP ;Remember ARGS.

ARGS is defined as a user variable with an offset of 31 cells in the user area.

This is the dictionary entry:

ARGS    "arguments"    --- addr

Return the addr of ARGS, a user variable that contains a system dependent pointer to any arguments that are passed from the operating system to ciforth during startup. In this ciforth it points to an area with the argument count, followed by a null ended array of argument strings, then by a null ended array of environment strings.

This leads to the following code. The comment uses the Stallman convention, see lecture 3 (forthcoming).

\ Return the NUMBER of arguments passed by Linux
: ARGC   ARGS @ @ ;

\ Return the argument VECTOR passed by Linux
: ARGV   ARGS @ CELL+ ;

\ Return the environment POINTER passed by Linux
: ENV   ARGS @ @+ 1+ CELLS + ;

An indispensable word to deal with c-strings is also

\ For a CSTRING (pointer to zero ended chars) return a STRING.
: Z$@   DUP BEGIN COUNT 0= UNTIL 1- OVER - ;

For example if forth is started with

lina HELLO_WORLD

the code

ARGV CELL+ @ Z$@ TYPE

would print the second argument, i.e. the first argument passed to forth. (The @ fetches the pointer stored in the second slot of the vector; Z$@ wants the c-string itself.)

Looking up an environment string

C-data structures are territory alien to Forth. Looking up an environment string is not totally trivial. Let's first define what we want:

GET-ENV    "get environment string"    sc1 -- sc2

So a string constant SC1 is passed in, and another one is passed out. A string constant is an address length pair where you are not supposed to reach through to change at the character level. See forth lecture 13 (forthcoming).

For the possibility that an environment string is not found, the following convention is used. The address of sc2 is zero. This is called a NULL-string. Of course an environment string can have zero characters. Then sc2 has a length of zero, but a non-zero address. This convention is c-ish, and born from the impossibility to pass more than one parameter back. In Forth you could define the stack diagram as (sc1 -- sc2 false/true), but I don't like that.
If you prefer that you can always do

: GET-ENV GET-ENV OVER ;

In programming the word GET-ENV I learned something. If you test a word, and it fails, it may be too complicated. If a word contains more than say 7 words or it contains a nested control structure, you may conclude it is too complicated from the very fact that it fails a test. What did Jeff Fox say about Chuck Moore? "He doesn't spend time debugging." The reason is that he makes the words so simple that they work the first time. I may never become as good a programmer as Chuck, but I can try to do the same trick. As can you. (And maybe Chuck doesn't get regular expressions right the first time as often as I do.)

Back to looking up strings in the environment, we see that one of three possibilities can occur in comparing with a particular environment string. That environment string can be a NULL-string, meaning we have reached the end of the environment. Otherwise it can compare equal, or unequal. This is sufficiently complicated to warrant generating a new word for it. Note that in addition we need a flag whether we must go on searching. For some reason I cannot recall, I have named this word (MENV) . Its implementation is rather straightforward now.

\ For SC and ENVSTRING leave SC / CONTENT and GOON flag.
: (MENV) DUP 0= IF   DROP 2DROP 0. 0   ELSE
    Z$@ &= $S 2SWAP >R >R 2OVER COMPARE
    IF   RDROP RDROP 1   ELSE   2DROP R> R> 0   THEN THEN ;

(&= is a denotation, see forth lecture 1 on denotations, forthcoming. Read CHAR = or [CHAR] = for it in the mean time.)

If I didn't get that one right the first time, I would have factored out the second line. That is the tricky part. After $S ("string split", see forth lecture 12, forthcoming) we have three strings, the one to look up, the environment name and the environment content. The environment content is put on the return stack. Then we compare, keeping the string to look up. Depending on the outcome the content or the original string is dropped.

GET-ENV itself is now easy and needs no further comment.

( Find a STRING in the environment, leave its VALUE or a NULL string. )
: GET-ENV   ENV BEGIN @+ SWAP >R (MENV) WHILE R> REPEAT RDROP ;

And at last an example:

"HOME" GET-ENV TYPE
/home/albert OK

(" starts a denotation, it leaves a string constant. See lecture 1, forthcoming.)
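
For readers without ciforth at hand, the same lookup can be tried out in standard Forth against a hand-built table. This is only a sketch under assumptions: Z, , FAKE-ENV, VALUE? and LOOKUP are invented names, the table merely stands in for the real environment, and the '=' is located with plain stack words rather than with $S .

```forth
\ For a CSTRING (pointer to zero ended chars) return a STRING.
: Z$@ ( zaddr -- addr len )  DUP BEGIN COUNT 0= UNTIL 1- OVER - ;

\ Lay down a string in the dictionary as a zero ended c-string.
: Z,  ( addr len -- )  HERE OVER ALLOT SWAP MOVE  0 C, ;

CREATE E0  S" HOME=/home/albert" Z,
CREATE E1  S" SHELL=/bin/sh" Z,
CREATE FAKE-ENV  E0 , E1 , 0 ,        \ null ended vector of c-strings

\ If the entry is NAME followed by '=' leave the value and true.
: VALUE? ( na nl ea el -- va vl true | false )
    2 PICK OVER U< 0= IF 2DROP 2DROP FALSE EXIT THEN        \ too short
    OVER 3 PICK + C@ [CHAR] = <> IF 2DROP 2DROP FALSE EXIT THEN
    2OVER 3 PICK OVER COMPARE IF 2DROP 2DROP FALSE EXIT THEN
    ROT 1+ /STRING ROT DROP TRUE ;                          \ skip "NAME="

\ Walk the vector; leave the value, or a null string (address 0).
: LOOKUP ( na nl -- va vl )
    FAKE-ENV BEGIN  DUP @ WHILE
        >R 2DUP R@ @ Z$@ VALUE? IF  2SWAP 2DROP R> DROP EXIT  THEN
        R> CELL+
    REPEAT  DROP 2DROP 0 0 ;

S" HOME" LOOKUP TYPE                  \ prints /home/albert
```

Note that a name that is only a prefix of an entry, such as HOM , is rejected because the character after it must be '='.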

Options

ARGS @ @ 1 - IF

tests whether there are any arguments passed to lina.

Shell variables.

The word GET-ENV can be used to look up a string in the environment. With $ we can make a denotation of it. It remains to be seen whether we want binary search. If they are not ordered it may be no use.

Loops in interpret mode

Automatic conditional code and loops.

Using T] and T[. Just do

WANT -scripting-

You can now just loop outside of a definition:

10 0 DO I . LOOP

This works, but I am not happy with the way conditionals and loops are done in Forth; I want the Algol way.

REGULAR EXPRESSIONS

Regular expressions in C or other languages are handled by creating a compiled string that is interpreted. In Forth it would result in compiling to a temporary definition.

: EMATCH ECOMPILE EXECUTE ;

EMATCH gives -1 if not matched and otherwise the number of bracketed expressions. Under the number of bracketed expressions are as many strings.

EREPLACE returns a string where \1 \2 etc are replaced by the expressions returned from EMATCH.

Strings

In combination with the conditional stuff that generates very volatile strings we need

: =$ $, CREATE , , DOES> 2@ ;

Used as in ... if ".bin.edu" else ".bin.org" then =$ wwwtail$

Notes

EREPLACE is also handy for

.if www$ 2DUP
    domain$ 1  "\1$" EREPLACE EMATCH 0=
    ipadd$ 1  "^\1" EREPLACE EMATCH 0=  AND
.then
  ..
.else
 ..
.fi

I am tempted to add the following syntactic joke to " , which is non-standard anyway. It parses another character, which must be a blank, a ; or a . . If it is a ; another TYPE is compiled. If it is a . another TYPE and CR are compiled, such that we get

"Your site had "; hits . " hits today!".

The extreme terseness beats perl, but the terseness is probably not in line with the less terseness in other areas.

STRING LOOPS

String loops are loops over substrings in a string. Lines in a file can be considered as substrings too, as a file after fetching into memory is a string with embedded new lines. In fact this is the default.

First of all we need the double precision return stack words 2>R 2R> 2R@ . If needed they can be defined by:

: 2>R POSTPONE SWAP POSTPONE >R POSTPONE >R ; IMMEDIATE
: 2R> POSTPONE R>  POSTPONE R> POSTPONE SWAP  ;  IMMEDIATE
: 2R@ POSTPONE 2R> POSTPONE 2DUP POSTPONE 2>R ; IMMEDIATE

Secondly the word $S is indispensable once again. Its Stallman stack comment is:
Split a STRING on a DELIMITER, leaving the PART before and the PART after the delimiter.
For example:

"ABCDEF" &C $S TYPE &| EMIT TYPE
AB|DEF OK
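
Since the lecture on $S is still forthcoming, here is a sketch of a word with the same stack effect in standard Forth. The stack effect is inferred from the example above (the part before the delimiter ends up on top), and the null remainder when the delimiter is absent is an assumption borrowed from the NULL-string convention. The name $SPLIT is invented so as not to clash with the real $S .

```forth
\ Split a STRING on a DELIMITER: leave the PART after and the PART before.
: $SPLIT ( addr len char -- rem-addr rem-len first-addr first-len )
    >R 2DUP                            ( a l a l )   ( R: char )
    BEGIN  DUP WHILE                   \ characters left to inspect?
        OVER C@ R@ <> WHILE            \ and this one is not the delimiter?
        1 /STRING                      \ then step over it
    REPEAT THEN                        ( a l a' l' )
    R> DROP
    DUP 0= IF  2DROP 0 0 2SWAP EXIT  THEN   \ absent: null remainder
    1 /STRING 2SWAP                    ( ra rl a l ) \ remainder starts past it
    DROP 2 PICK OVER - 1- ;            ( ra rl a a'-a )

: DEMO ( -- )  S" ABCDEF" [CHAR] C $SPLIT TYPE [CHAR] | EMIT TYPE ;
DEMO                                   \ prints AB|DEF
```

When the delimiter is the last character, the remainder is an empty string with a non-zero address, so it can still be told apart from "no delimiter found".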

In order to find out how to implement string loops, we imagine how we would print a file:

: .FILE GET-FILE $DO I$ TYPE CR $LOOP ;

This is equivalent to

: .FILE
GET-FILE
BEGIN                    \ 1
    OVER WHILE           \ 1
    ^J $S 2SWAP 2>R 2>R  \ 1
    2R@ TYPE CR
    2R> ( current line) 2DROP 2R>  \ 2
    REPEAT                         \ 2
2DROP                              \ 2
;

The results of $S are swapped in order to access the string using the standard word 2R@ .
The test OVER WHILE ends the loop as soon as the remainder is a NULL-string (address zero), assuming that is what $S leaves once the delimiter is no longer found. It must come before the split, because after 2>R 2>R nothing is left on the data stack to test.
The words $DO and $LOOP apparently must compile the lines marked with a 1 and a 2 respectively.
The delimiter is used repetitively. If we want to be able to use a user-specified delimiter we must place it on the return stack.
This leads to the following code. (The alias for POSTPONE is only there to make the postponing more readable.) As you see it is removed from the gene pool immediately after use.

'POSTPONE ALIAS %

: I$   % 2R@ ; IMMEDIATE

: $|DO   % >R % BEGIN % OVER % WHILE % R@ % $S % 2SWAP % 2>R % 2>R ; IMMEDIATE
: $DO   ^J % LITERAL % $|DO ; IMMEDIATE

: $LOOP   % 2R> % 2DROP % 2R> % REPEAT % 2DROP % RDROP ; IMMEDIATE
'% HIDDEN

Now if we assume that the T] T[ are present, we can add words that do the looping even in interpret mode. These words compile to a temporary area.

WANT T[
: $do           T] POSTPONE $DO                    ; IMMEDIATE
: $|do          T] POSTPONE $|DO                   ; IMMEDIATE
: $loop            POSTPONE $LOOP      POSTPONE T[ ; IMMEDIATE

Examples of usage

The following will just print the string "AAP":

: TEST $DO I$ TYPE $LOOP ;

"AAP" TEST
AAP OK

This is an example of splitting a string on a delimiter '|':

: TEST2 &| $|DO I$ TYPE $LOOP ;

"A|B|C|D|E|F" TEST2
ABCDEF OK

or shorter

"A|B|C|D|E|F" &| $|do I$ TYPE $loop
ABCDEF OK

Prints out all lines of the file "aap" that are not empty:

"aap" GET-FILE $do
        I$ -TRAILING DUP IF TYPE CR ELSE 2DROP THEN
$loop

Print all lines that do not start with "\ " :

"x.frt" GET-FILE $do
    I$ OVER "\ " CORA IF TYPE CR ELSE 2DROP THEN
$loop
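
On a Forth that has neither ALIAS nor T] and T[ , a string loop can also be sketched without compiling words at all, by handing the loop body over as an execution token. This is a different technique from $DO and $LOOP , not a reimplementation of them. The names EACH-DELIM , EACH-XT , PART and $EACH are invented, and the two variables make the word non-reentrant.

```forth
VARIABLE EACH-DELIM                    \ hypothetical: delimiter in use
VARIABLE EACH-XT                       \ hypothetical: body of the loop

\ Leave the leading part of the string, up to the delimiter or the end.
: PART ( addr len -- addr n )
    2DUP BEGIN  DUP WHILE  OVER C@ EACH-DELIM @ <> WHILE  1 /STRING
    REPEAT THEN  NIP - ;

\ Apply xt ( addr len -- ) to every part of the string between delimiters.
: $EACH ( addr len char xt -- )
    EACH-XT ! EACH-DELIM !
    BEGIN
        2DUP PART  DUP >R              ( a l a n )   ( R: n )
        EACH-XT @ EXECUTE              ( a l )
        R> 2DUP U> WHILE               \ n < l means a delimiter was found
        1+ /STRING                     \ skip the part and the delimiter
    REPEAT  DROP 2DROP ;

S" A|B|C" CHAR | ' TYPE $EACH          \ prints ABC
```

As with the loops above, a string that ends in the delimiter yields one last empty part.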

REGULAR EXPRESSIONS

How to handle regular expressions in Forth.

As you all know the classic way to implement regular expressions is ( Kernighan & Pike FORTRAN techniques) to compile the regular expression string into an intermediate code that is interpreted.
"ab*[ab]" becomes in c:

int imp[] = {
MATCH_ONE, 'a',
MATCH_ONE_MULTIPLE, 'b',
MATCH_SET, 2, 'a', 'b' };

But if you imagine that each MATCH_xxx is a Forth word MATCH-xxx that handles in-line arguments, it becomes clear that you want to compile the string to Forth code immediately. It becomes

POSTPONE MATCH-ONE [ CHAR a COMPILE, ] ...

However this somehow doesn't work out..

REGULAR EXPRESSIONS SECOND ATTEMPT

A successful attempt for regular expression is based on a straightforward port of my c++-code for regular expressions. It is based on the stack diagram

( CP EP -- CP' EP' FLAG )

Where CP points to the characters and EP to the regular expression. If there is a match, CP is advanced to CP' , EP is advanced to EP' and true is returned as the FLAG.
Otherwise the pointers are left as is, and false is returned.
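
To make the convention concrete, here is a toy matcher with exactly this stack diagram. It is not taken from the package: the expression layout (one character per cell), and the names MATCH-CHAR , MATCH-ALL and RE-AB are assumptions for the illustration, and nothing checks for the end of the input.

```forth
\ Match one char: advance both pointers on success, leave them on failure.
: MATCH-CHAR ( cp ep -- cp' ep' flag )
    OVER C@ OVER @ = IF  SWAP CHAR+ SWAP CELL+ TRUE  ELSE  FALSE  THEN ;

\ Apply n simple matchers in a row; succeed only if all of them do.
: MATCH-ALL ( cp ep n -- flag )
    0 ?DO  MATCH-CHAR 0= IF  2DROP FALSE UNLOOP EXIT  THEN  LOOP
    2DROP TRUE ;

CREATE RE-AB  CHAR a , CHAR b ,        \ the "expression" ab

: DEMO ( -- )  S" abc" DROP RE-AB 2 MATCH-ALL . ;
DEMO                                   \ prints -1 : "abc" starts with ab
```

Because a failing matcher leaves CP and EP untouched, the caller can retry from the same position, which is what backtracking needs.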

This package handles only simple regular expressions and replacements. Because there is no grouping, or nesting these are in fact not regular expressions in the Computer Science definition at all. But they are geared towards practical usability. Alternatives could be added using curly brackets, but grouping is a problem because the round brackets are taken. There is nothing in the design of the package to disallow adding grouping or alternatives, though.

See the words RE-MATCH and RE-REPLACE for usage.

The following aspects are handled:

Compiling ^ (begin only) $ (end only) and special characters + ? * [ ] < >
Grouping using ( ) , only for replacement.
Ranges and inversion of char set (between [ ] ).
Above characters must be escaped if used as is by \ , making \ a special char.
Some sets are escaped by \ (\w) , some non-printables are denoted by an escape sequence.
It is an error to escape characters that do not denote blank space, are not special, and do not denote a set. However ^ - $ etc. may be escaped where they are not special.

Usage notes.
This differs from all regular expression definitions that exist. But all regular expression definitions differ among themselves anyway, unless you go for POSIX which is horrendous.

Specific for Forth is that < and > and \w all observe Forth white space. For this the Forth system must supply a word ?BLANK that returns for a character whether it is considered blank in this Forth ( ch -- flag).

Implementation notes:
Usually regular expressions are compiled into a buffer consisting of tokens followed by strings or characters in some format. We follow the same here, except that tokens are execution tokens.
No attempt is done at reentrant code.
\d \s \w etc. can be handled by just adding sets.
There are effectively two basic execution tokens, STRING-EXACT that matches a string literally, followed by an in line (counted) string, and STRING-CHAR that matches one char against a character set, followed by that character set.
A character set is a simple bit map and the string match does multiple characters at the same time. Furthermore if the regular expression is known at compile time, it is parsed at compile time. This results in an efficient implementation.
The quantifiers + * ? can handle only sets of exactly one character. They are represented by an execution token, that is followed by the execution token of STRING-CHAR .
All matchers are based on the stack diagram

( CP EP -- CP' EP' FLAG ) as explained above.

Simple matchers advance EP one item.

Quantified matchers match against the whole remaining expression and handle backtracking.

You can get the source here. It presupposes some other small wordsets, so you may prefer to get the archive. Even if you don't want these regular expressions you may want the test set.

IMPLEMENTATION DEPENDENCIES

There are no implementation dependencies except for RE-MATCH" that compiles inline in a system dependent way. Using SLITERAL to remove this dependency is left to be done.
A few words outside of the CORE wordset are used.

STACK DIAGRAMS OF THE FINAL WORDS

These stack diagrams use the Stallman convention

RE-MATCH ( sc1 sc2 -- flag )
For STRING and regular expression STRING: "there IS a match". \0 ..\9 have been filled in.

RE-MATCH" ( sc "expression" -- flag)
Only to be used while compiling. For STRING and "inline regular expression": "there IS a match". \0 ..\9 are filled in.

RE-REPLACE ( sc -- sc' )
Use the replacement STRING to replace the matched part for a recent call of ``RE-MATCH''. Leave the replaced string. This is a static buffer, and must be copied before passing to words in this package.

Other Forth lectures Go to the home page of Albert van der Horst