Forth lecture 5.
Optimisation in Forth.



Introduction
============

   You may wonder why an optimizer for a computer language would be
considered an AI application.  This optimizer is not so much for a
particular language as well related to a Computer Intelligence that has
insight in her own code.

   Different types of optimisations interfere and finding ones way
through this certainly requires some heuristics. The bottom line is
that an optimiser qualifies as an AI application.

Properties
----------

   A Forth system is a database of small programs. It is worthwhile to
investigate what properties these small programs (words) might have.
The flag field of a word allow to add this information to the header.
A certain combination of flags allow a particular optimisation.

Definitions
-----------

   An annihilator is a word that only deletes an item from the stack.
Examples are DROP 2DROP NIP RDROP.

   A juggler reorders the stack without adding or removing items.
Examples are SWAP 2SWAP ROT.

   A duplicator copies an item from the stack.  Examples are DUP OVER
2DUP.

Optimisations
-------------

   Optimisations are manipulations on a program source or intermediate
code to improve the speed of the resulting program.  In other respect
the result is inferior. Symbolic debugging - one of Forth's strong
points - goes through the drain.  (The name "optimisation" is a
misnomer.)

    * Constant folding is a well known technique in optimisation.  It
     means that if an operator works on constants the result may be
     replaced by a constant that is calculated at compile time.  In
     Forth we generalise this to folding.  Folding refers to all words
     that can be replaced by simpler words in case they receive
     constant data on the stack.

   * Reordering is not so much an optimisation per se, but it allows
     other optimisations to kick in. As a rule of thumb constants are
     moved to the top of the stack, where they fall prey to folding.
     Reordering might also eliminate a juggler.

   * Annihilation is the elimination of e a whole sequence of
     operations.  In Forth sometimes the result of a calculation is
     dropped.  Depending on the properties of the calculation, the
     calculation itself can be removed.  This type of annihilation is
     related to an annihilator.  Another type is related to conditional
     branching where the condition is known at compile time. Code known
     to be skipped is removed.

   * Inlining means replacing a Forth word with its constituents.  This
     technique is very important in Forth, more so than in other
     languages, due to the small size of Forth words.  Inlining is
     always a winner in speed, and mostly even also a winner with
     regard to space.

     Even more important is the fact that inlining allows folding to be
     applied across constituent words. This applies to high level and
     low level code alike.

     Inlining high level code is trivial.  A further inlining stage
     replaces a high level definition that only calls code words, by a
     code definition which concatenates the code words.

Data collecting
---------------

   In order to add introspective information to a Forth, in the first
place the machine code words must be analysed, because ultimately
everything is defined in terms of code words.  For this purpose the
code words are disassembled using a disassembler that allows to readily
inspect the parts of a disassembled instruction.  A Postit-Fixup
assembler and disassembler is well suited.

   * By inspecting register words in the disassembly, registers usage
     can be accumulated.  This information is then added to the header.

   * At the same time information about whether memory or I/O ports are
     accessed for read or write can be collected.  It turns out to be
     useful make a difference between input and output side effects.
     Here the words to look out for are the MOVE and IN/OUT
     instructions, operations that access memory (in fact all
     operations that not target registers) and special instructions
     like the string operations on Pentia.

   * Finally the stack effect can be deduced from the number of POP's
     and PUSH'es.  And the use of the return stack can be marked, which
     mostly warrants a special treatment.

   After all code words have been analysed, the stack effects and
register usage can be concluded for all high level words.  The stack
effect of a high level words is the concatenation of the stack effect
of its constituents.  The register usage of a high level word is the
logical or of the register usage of the constituents, as are its side
effects.

   There will no doubt be exceptions.  It is wrong to cater for too
many exceptional situation in such a heuristic tool.  Instead, the
exception are filled in by hand before the automated collection is
started, it fills in only as yet unknown items.  Of course it helps to
have a simple and straightforward Forth to begin with.

Purpose
-------

   A \ci in general will use optimisation to generate a temporary
definition that is much faster, and retain all the valuable partial
information about words.

   In normal non-AI applications, words are recursively replaced by
faster words, and those eventually by code definitions. Meanwhile words
that are no longer directly used in the final application are
eliminated.  For space conservation headers may be removed as well,
provided in the application no dictionary lookup is needed.

Implementation
==============

   The following implementation notes apply to a 32 bits Pentium Forth
where a full cell (4 bytes, 0..3) is reserved for the flags.  They must
be considered as an example.  The information about a word,
optimisation opportunities and stack effect, sits in the flag field.
Whenever nothing is filled in in the flag field, it means unknown.
This applies equally to the stack effect as to the optimisation flags.

Stack effects
-------------

   The information about the stack effects sits in the byte 3 of the
flag field.  The highest nibble of this third byte applies to input.
It is the number of stack items popped plus one.  The lowest nibble
thusly indicates the number of pushed items.  0 means an unknown (not
yet analysed) stack effect.  0FH indicates a variable number of items.

   The stack effect is added in three steps.  For all low level words
the stack effect is found by counting pops and pushes.  Irregular stack
effects are corrected as well as filled in for high level words.  All
high level stack effects are derived from the stack effect of their
constituents.

   Code words are analysed by disassembling the code that is pointed to
by the code field to the first "next" code encountered. For each
instruction the opcode, which is the first part of its disassembly, is
looked up in a set of possible pop and push instructions.

   Irregularities are just tabulated in the source code of the analyser.

   Words are recognized by their code field.  High level words are
either created by ":" or by a "CREATE .. DOES>" construction .  They
are recognised by the code field containing DOCOL or DODOES
respectively.  For both the data field points to a chain of high level
calls, i.e. a number of such calls possibly with inlined data and
ending in a "(;)", the word compiled by ";".  (The result of this "high
level next" is to return control is returned to the word that called
this one.)  For a linear chain of calls the stack effect is calculated
as follows:

   * Start with a effect of ( 0 - 0 ).

   * For each constituent
     Subtract the pops from the left (output) nibble.      If the
     output nibble is negative, add its (absolute) value to inputs, and
     make it zero.      Add the pushes to the left (output) nibble.
     (Correction by 11H is not yet done).

     The following exceptions to a linear chain have special treatment:

   * LIT BRANCH'es and SKIP are followed by inline data that must be
     taken care off

   * A BRANCH or 0BRANCH forward is always taken, analysing just one
     path through a definition: the shortest one. A more sophisticated
     way is to analyse all paths and conclude a variable outcome if it
     is not consistent or any of the paths contains a variable
     constituent.

   * If the stack effect of a constituent is variable, the result is
     variable, overruling any other outcome

   * If the stack effect of a constituent is unknown, the result is
     unknown, overruling any other outcome except variable.

   *  For a CREATE .. DOES> word the linear chain pointed to by the
     DOES> pointer is analysed. However the stack effect is initialised
     to ( 0 - 1) to reflect the passing of the data pointer to the
     DOES> part.

   * ' EXECUTE has the stack effect of  .  Other
     occurrences of EXECUTE lead to a variable stack effect.  Lateron
     we will leave this to the optimiser, but at the stage of analysing
     the kernel this is useful, especially because all usage of EXECUTE
     in the kernel is of this type.

   * ' CATCH has the stack effect of  plus an
     extra output.  Other occurrences of CATCH lead to a variable stack
     effect.  So a word is treated as if exceptions do not occur.  This
     is okay because the stack effect is not relevant in case of
     exceptions.

   A high level word is recognised by its code field address containing
DOCOL , i.e. the nesting routine for the interpreter.  A CREATE ..
DOES> word is detected by is code field address containing DODOES ,
i.e. the common code that starts up words defined by compiler extension.
All other words are treated as code.

   The whole of Forth is treated as follows:
   * Fill in the exception

   * Fill in the code words

   * Sweep repeatedly through the dictionary, from early to latest:
     For each unknown stack effect, try to find it by discriminating
     between     DODOES DOCOL and other words,   Stop if no progress is
     made any more.

   Hopefully everything is known now, but maybe we must add to the
exceptions.  And repeat the above process.

   The notion of a simple sequence is one that doesn't reach to the
stack outside what is defined within the sequence.

Optimisation classes
--------------------

   As has been pointed out, the optimisation class of a word is
indicated by a bit set in the flags field. Each bit set to true opens
up a particular opportunity for optimisation.  Further a sequence has a
certain class if each constituent has that class.  For example, if one
of the words called does a store, the sequence is found to do a store
and the optimisations that would be allowed by "no stores" are blocked.
So the optimisation class of a sequence is the logical or of the oc's
of the constituents.  This can be done efficiently by bit-wise or
operations.

The no store bit.
.................

   The "no store" bit would better be named "no output side effect" bit.
It indicates that the outside world doesn't change by executing this
word.  Again not that the stacks and internal registers are inside.
Note that fetching from an input port has an output side effect, (as
well as an input side effect.)

   The following optimisation are opened up:
   * In combination with an annihilator. If the output of a "no store"
     sequence is annihilated, the whole sequence and the annihilator
     may be left out.  Example: BASE CELL+  XX  NIP becomes XX

   * In combination with a juggler. If the outputs of "no store"
     sequence are juggled, the sequences itself may be juggled,
     eliminating the juggler.  Example: XX CELL+  BASE  SWAP becomes
     BASE  XX CELL+

   * In combination with a duplicator. Again a sequence may be
     duplicated and the duplicator eliminated. This is not an
     optimisation, except for the duplication of constants.  Going the
     other direction can be an optimisation.  Two identical sequences
     with no output side effect can be replaced by one and a duplicator.
     Example: (for a recursive definition with stack effect ( 1 - 1 )
     and no side effects) 12 RECURSE OVER RECURSE becomes 12 RECURSE 12
     RECURSE  (elimination duplicator)
     12 RECURSE 12 RECURSE becomes 12 RECURSE DUP. (introducing
     duplicator)

The no fetch property.
......................

   The "no fetch" bit would better be named "no input side effect" bit.
It indicates that the outside world affects the outcome of this word.
Input side effects are weaker than output side effects and the
knowledge that they are absent allows less optimisation.

The no stack effect property.
.............................

   The "no stack effect fetch" bit refers to absolute accesses of the
stacks, i.e. where the data or return stack are not used as stacks.
Typical examples are DEPTH and RSP.  These words are rare but prevent
considerable optimisation.

The no side effect property.
............................

   The combination of the "no store ", "no fetch " and "no stack effect
" properties is quite common. Such a word is said to have the "no side
effect" property.  The combination allows substantially more
optimisation than each alone. We will use the abbreviation NS for this
important concept.  Examples are CONSTANT's, VARIABLE's, operators like
+ or NEGATE, and all stack manipulations: jugglers, annihilator,
duplicators.

   NS-words are amenable to folding:
   * If a NS-sequence has only constant inputs, it may be run at
     compile time.  Its inputs and the code sequence may be replaced by
     the resulting constant outputs.  Example: After "12 4 3 SWAP * +"
     is replaced by 24.

   * If a NS-sequence has no inputs, it may be run at compile time and
     replaced by the resulting constant outputs.  The difference with
     the preceeding example is that the sequence starts with 12 instead
     of *. Any literals are of course NS.

   On closer inspection the second condition is equivalent to the first.
It is the more easy one to implement.

Associativity.
..............

   An operator with two inputs and one output, so called "binary
operators" can have, in addition to NS, the property of associativity.
This refers to a situation where three operands are involved. Examples
are OR and + . However not F+ .  In the following we will denote an
associative operator by %.  Associativity allows to replace the
sequence "x % y %" with "x y % %" where it may be that "x y %" can be
folded into a constant.  Example: (assuming a 64-bit Forth) "CELL+
CELL+" is first inlined to "8 + 8 +" then associated to "8 8 + +" then
folded to "16 +".
Note that it is not necessary to look for other patterns, in view of
other transformation that are done.

Short circuit evaluation.
.........................

   Another optimisation aplicable to binary NS-operators is short
circuit evaluation. This is the situation where the result is known,
while only one of the operands is known, such as "FFFF AND" "FFFF OR"
"0 +" "0 XOR" and "0 *". Some of these operations can be just dropped,
while in other cases the result is known and the other operand
(possibly non-constant) can be dropped.

Optimisation by recursive inlining
----------------------------------

   A word is optimized by first optimizing its constituents, then
inlining the constituents and apply any optimisation opportunities like
folding that open up.

In more detail we have the following steps:
   * Check First of all check whether the item has been optimised
     already.  We do in fact a "depth first" optimisation, so the words
     lowest in the call hierarchy are optimised first. It is important
     to only attempt optimisation once. This cuts the recursion short.

   * Recurse For all constituent words of this definition, do an
     optimisation, such as defined in these steps.

   * Inline Build up an executable sequence in the dictionary.  Inline
     a constituents word, keeping track of all opportunities to
     optimise. Try to build up a sequence  of NS-words that starts with
     constants and where each word following doesn't consume more
     inputs than are available. Consequently the outputs are available
     as constants.

   * Breakdown When a sequence of NS-words breaks down, we have
     identified a sequence that can be run at compile time. This
     sequence is run, and removed from the output sequence. Then the
     output of the run is compiled, as a sequence of constants.

   * Special opportunities.  After each inlining the constituent is
     checked whether it allows special optimisations. Here the likes of
     the associativity optimisation is done, by looking for a "operand
     % operand %" pattern.  In a fashion similar to the breakdown of a
     NS-sequence the tail of the sequence that is being built up is, is
     replaced.

   * Replace After inlining is finished, the sequence is now attached
     to the word we are optimizing to replace the original sequence.
     Maybe the original code is kept if no folding took place and the
     sequence is longer that a certain limit.

   * Mark properties The current word is marked as optimised. Its stack
     effect and its optimization classes are derived from its
     constituents and added to the flags header.
Other Forth lectures
Go to the home page of Albert van der Horst