Forth lecture 5.
Introduction
============
You may wonder why an optimiser for a computer language would be
considered an AI application. This optimiser is not so much for a
particular language as for a Computer Intelligence that has insight
into her own code.
Different types of optimisation interfere, and finding one's way
through this certainly requires some heuristics. The bottom line is
that an optimiser qualifies as an AI application.
Properties
----------
A Forth system is a database of small programs. It is worthwhile to
investigate what properties these small programs (words) might have.
The flag field of a word allows this information to be added to the
header. A certain combination of flags allows a particular optimisation.
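To make this concrete, here is a minimal sketch of how such flags
might be defined and tested. The bit assignments and the word >FFA ,
which yields the address of a word's flag field, are assumptions for
illustration, not the layout of any particular Forth.
    \ Illustrative property bits; real assignments are system specific.
    1 CONSTANT NO-STORE-BIT         \ no output side effect
    2 CONSTANT NO-FETCH-BIT         \ no input side effect
    4 CONSTANT NO-STACK-BIT         \ no absolute stack access
    \ >FFA ( xt -- addr ) is a hypothetical flag field accessor.
    : FLAG?  ( xt mask -- flag )  SWAP >FFA @ AND 0<> ;
    : FLAG!  ( xt mask -- )  SWAP >FFA DUP @ ROT OR SWAP ! ;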
Definitions
-----------
An annihilator is a word that only deletes an item from the stack.
Examples are DROP 2DROP NIP RDROP.
A juggler reorders the stack without adding or removing items.
Examples are SWAP 2SWAP ROT.
A duplicator copies an item from the stack. Examples are DUP OVER
2DUP.
Optimisations
-------------
Optimisations are manipulations on a program source or intermediate
code to improve the speed of the resulting program. In other respects
the result is inferior. Symbolic debugging - one of Forth's strong
points - goes down the drain. (The name "optimisation" is a
misnomer.)
* Constant folding is a well-known technique in optimisation. It
means that if an operator works on constants, the operator and its
operands may be replaced by a constant that is calculated at
compile time. In Forth we generalise this to folding. Folding
refers to all words that can be replaced by simpler words in case
they receive constant data on the stack. (A minimal illustration
follows this list.)
* Reordering is not so much an optimisation per se, but it allows
other optimisations to kick in. As a rule of thumb constants are
moved to the top of the stack, where they fall prey to folding.
Reordering might also eliminate a juggler.
* Annihilation is the elimination of a whole sequence of
operations. In Forth sometimes the result of a calculation is
dropped. Depending on the properties of the calculation, the
calculation itself can be removed. This type of annihilation is
related to an annihilator. Another type is related to conditional
branching where the condition is known at compile time. Code known
to be skipped is removed.
* Inlining means replacing a Forth word with its constituents. This
technique is very important in Forth, more so than in other
languages, due to the small size of Forth words. Inlining is
always a winner in speed, and mostly also a winner with regard to
space.
Even more important is the fact that inlining allows folding to be
applied across constituent words. This applies to high level and
low level code alike.
Inlining high level code is trivial. A further inlining stage
replaces a high level definition that only calls code words by a
code definition that concatenates the code words.
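The effect of folding can be imitated by hand in standard Forth with
bracketed compile time evaluation; an automatic folder does the same
without the programmer's help. A minimal illustration:
    \ Manual folding: the addition runs once, at compile time, and
    \ only the literal 7 is compiled into FOLDED.
    : FOLDED    [ 3 4 + ] LITERAL ;   \ behaves like : FOLDED 7 ;
    : UNFOLDED  3 4 + ;               \ the addition runs on every call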
Data collecting
---------------
In order to add introspective information to a Forth, the machine
code words must be analysed first, because ultimately everything is
defined in terms of code words. For this purpose the code words are
disassembled, using a disassembler that allows the parts of a
disassembled instruction to be readily inspected. A Postit-Fixup
assembler and disassembler is well suited.
* By inspecting the register operands in the disassembly, register
usage can be accumulated. This information is then added to the
header.
* At the same time, information about whether memory or I/O ports are
accessed for read or write can be collected. It turns out to be
useful to distinguish between input and output side effects.
Here the words to look out for are the MOVE and IN/OUT
instructions, operations that access memory (in fact all
operations that do not target registers) and special instructions
like the string operations on Pentia.
* Finally the stack effect can be deduced from the number of POP's
and PUSH'es. The use of the return stack can also be marked, which
mostly warrants special treatment.
After all code words have been analysed, the stack effects and
register usage can be derived for all high level words. The stack
effect of a high level word is the concatenation of the stack effects
of its constituents. The register usage of a high level word is the
logical or of the register usage of its constituents, as are its side
effects.
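A minimal sketch of this combining step, assuming a hypothetical
accessor REGS@ ( xt -- mask ) for the per-word register mask, and
with (;) ending the chain of calls in the body of a high level word:
    \ OR together the register masks of the constituents.
    \ Inline data after LIT etc. is ignored in this minimal version.
    : COLLECT-REGS  ( xt -- mask )
       0 SWAP >BODY                    \ accumulated mask, first call
       BEGIN  DUP @ DUP ['] (;) <> WHILE
          REGS@ ROT OR SWAP CELL+      \ fold in one constituent
       REPEAT  2DROP ;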
There will no doubt be exceptions. It is wrong to cater for too
many exceptional situations in such a heuristic tool. Instead, the
exceptions are filled in by hand before the automated collection is
started; the collection fills in only as yet unknown items. Of course
it helps to have a simple and straightforward Forth to begin with.
Purpose
-------
A Computer Intelligence will in general use optimisation to
generate a temporary definition that is much faster, and retain all
the valuable partial information about words.
In normal non-AI applications, words are recursively replaced by
faster words, and those eventually by code definitions. Meanwhile
words that are no longer directly used in the final application are
eliminated. For space conservation headers may be removed as well,
provided no dictionary lookup is needed in the application.
Implementation
==============
The following implementation notes apply to a 32-bit Pentium Forth
where a full cell (4 bytes, 0..3) is reserved for the flags. They
should be considered an example. The information about a word, its
optimisation opportunities and stack effect, sits in the flag field.
A flag field with nothing filled in means unknown. This applies
to the stack effect and the optimisation flags alike.
Stack effects
-------------
The information about the stack effect sits in byte 3 of the flag
field. The highest nibble of this byte applies to input: it is the
number of stack items popped plus one. The lowest nibble likewise
indicates the number of pushed items plus one. 0 means an unknown
(not yet analysed) stack effect. 0FH indicates a variable number of
items.
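A minimal sketch of packing and unpacking this byte (decimal 15 is
the 0FH nibble mask):
    \ Both nibbles carry an offset of one, so that a byte of 0 can
    \ mean unknown; a nibble of 0FH means variable.
    : SE-PACK    ( pops pushes -- byte )  1+ SWAP 1+ 4 LSHIFT OR ;
    : SE-POPS    ( byte -- pops )    4 RSHIFT 15 AND 1- ;
    : SE-PUSHES  ( byte -- pushes )  15 AND 1- ;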
The stack effect is added in three steps. For all low level words
the stack effect is found by counting pops and pushes. Irregular
stack effects are then corrected by hand, as are those of exceptional
high level words. Finally, all high level stack effects are derived
from the stack effects of their constituents.
Code words are analysed by disassembling the code pointed to by
the code field, up to the first "next" code encountered. For each
instruction the opcode, which is the first part of its disassembly,
is looked up in a set of possible pop and push instructions.
Irregularities are just tabulated in the source code of the analyser.
Words are recognised by their code field. High level words are
either created by ":" or by a "CREATE .. DOES>" construction. They
are recognised by the code field containing DOCOL or DODOES
respectively. For both, the data field points to a chain of high
level calls, i.e. a number of such calls, possibly with inlined data,
ending in "(;)", the word compiled by ";". (The effect of this "high
level next" is that control is returned to the word that called this
one.) For a linear chain of calls the stack effect is calculated as
follows:
* Start with an effect of ( 0 - 0 ).
* For each constituent:
Subtract its pops from the output count. If the output count
becomes negative, add its absolute value to the inputs and make
it zero. Then add its pushes to the output count. (The correction
by 11H is applied only afterwards.)
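A sketch of this combining step on unpacked counts, where in and
out are the running input and output counts and p and q are the pops
and pushes of the next constituent:
    \ Concatenate one constituent's effect onto a running effect.
    : SE-ADD  ( in out p q -- in' out' )
       >R -                 \ subtract the pops from the output count
       DUP 0< IF - 0 THEN   \ if negative, borrow deficit from inputs
       R> + ;               \ add the pushes to the output count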
The following exceptions to a linear chain have special treatment:
* LIT , BRANCH'es and SKIP are followed by inline data that must be
taken care of.
* A BRANCH or 0BRANCH forward is always taken, analysing just one
path through a definition: the shortest one. A more sophisticated
way is to analyse all paths and conclude a variable outcome if they
are not consistent or any of the paths contains a variable
constituent.
* If the stack effect of a constituent is variable, the result is
variable, overruling any other outcome.
* If the stack effect of a constituent is unknown, the result is
unknown, overruling any other outcome except variable.
* For a CREATE .. DOES> word the linear chain pointed to by the
DOES> pointer is analysed. However the stack effect is initialised
to ( 0 - 1 ) to reflect the passing of the data pointer to the
DOES> part.
* ' xxx EXECUTE has the stack effect of xxx . Other
occurrences of EXECUTE lead to a variable stack effect. Later on
we will leave this to the optimiser, but at the stage of analysing
the kernel this is useful, especially because all usage of EXECUTE
in the kernel is of this type.
* ' xxx CATCH has the stack effect of xxx plus an
extra output. Other occurrences of CATCH lead to a variable stack
effect. So a word is treated as if exceptions do not occur. This
is okay because the stack effect is not relevant in case of
exceptions.
A high level word is recognised by its code field address
containing DOCOL , i.e. the nesting routine for the interpreter. A
CREATE .. DOES> word is detected by its code field address containing
DODOES , i.e. the common code that starts up words defined by a
compiler extension. All other words are treated as code.
The whole of Forth is treated as follows:
* Fill in the exceptions
* Fill in the code words
* Sweep repeatedly through the dictionary, from earliest to latest:
for each unknown stack effect, try to find it by discriminating
between DODOES , DOCOL and other words. Stop if no progress is
made any more.
Hopefully everything is known now, but maybe we must add to the
exceptions and repeat the above process.
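A sketch of the sweep as a fixed point iteration. ANALYSE ( xt --
flag ), which tries to fill in a stack effect and returns true on
success, and NEXT-WORD ( xt1 -- xt2|0 ), which walks the dictionary
from earliest to latest, are hypothetical helpers:
    VARIABLE PROGRESS
    : SWEEP  ( first-xt -- )      \ repeat until a pass gains nothing
       BEGIN  FALSE PROGRESS !  DUP
          BEGIN  DUP WHILE
             DUP ANALYSE IF TRUE PROGRESS ! THEN
             NEXT-WORD
          REPEAT  DROP
       PROGRESS @ 0= UNTIL  DROP ;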
The notion of a simple sequence is one that does not reach into
the stack beyond the items defined within the sequence itself.
Optimisation classes
--------------------
As has been pointed out, the optimisation class of a word is
indicated by a bit set in the flags field. Each bit set to true opens
up a particular opportunity for optimisation. Further, a sequence has
a certain class only if each constituent has that class. For example,
if one of the words called does a store, the sequence is found to do
a store and the optimisations that would be allowed by "no stores"
are blocked. So the optimisation class of a sequence is the logical
or of the optimisation classes of its constituents, which can be
computed efficiently with bit-wise or operations.
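A sketch of this combination, assuming a hypothetical accessor OC@
( xt -- oc ) and bits that mark the side effects actually present, so
that or-ing lets any constituent that stores taint the whole sequence:
    \ The class of a sequence of n execution tokens.
    : SEQUENCE-OC  ( xt1 .. xtn n -- oc )
       0 SWAP 0 ?DO  SWAP OC@ OR  LOOP ;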
The no store bit.
.................
The "no store" bit would better be named "no output side effect" bit.
It indicates that the outside world doesn't change by executing this
word. Again not that the stacks and internal registers are inside.
Note that fetching from an input port has an output side effect, (as
well as an input side effect.)
The following optimisations are opened up:
* In combination with an annihilator. If the output of a "no store"
sequence is annihilated, the whole sequence and the annihilator
may be left out. Example: BASE CELL+ XX NIP becomes XX
* In combination with a juggler. If the outputs of "no store"
sequences are juggled, the sequences themselves may be reordered,
eliminating the juggler. Example: XX CELL+ BASE SWAP becomes
BASE XX CELL+
* In combination with a duplicator. Again a sequence may be
duplicated and the duplicator eliminated. This is not an
optimisation, except for the duplication of constants. Going the
other direction can be an optimisation: two identical sequences
with no output side effect can be replaced by one and a duplicator.
Example (for a recursive definition with stack effect ( 1 - 1 )
and no side effects): 12 RECURSE OVER RECURSE becomes 12 RECURSE 12
RECURSE (eliminating the duplicator); 12 RECURSE 12 RECURSE becomes
12 RECURSE DUP (introducing the duplicator).
The no fetch property.
......................
The "no fetch" bit would better be named "no input side effect" bit.
It indicates that the outside world affects the outcome of this word.
Input side effects are weaker than output side effects and the
knowledge that they are absent allows less optimisation.
The no stack effect property.
.............................
The "no stack effect fetch" bit refers to absolute accesses of the
stacks, i.e. where the data or return stack are not used as stacks.
Typical examples are DEPTH and RSP. These words are rare but prevent
considerable optimisation.
The no side effect property.
............................
The combination of the "no store", "no fetch" and "no stack
effect" properties is quite common. Such a word is said to have the
"no side effect" property. The combination allows substantially more
optimisation than each alone. We will use the abbreviation NS for
this important concept. Examples are CONSTANT's, VARIABLE's,
operators like + or NEGATE, and all stack manipulations: jugglers,
annihilators, duplicators.
NS-words are amenable to folding:
* If a NS-sequence has only constant inputs, it may be run at
compile time. Its inputs and the code sequence may be replaced by
the resulting constant outputs. Example: "12 4 3 SWAP * +"
is replaced by 24.
* If a NS-sequence has no inputs, it may be run at compile time and
replaced by the resulting constant outputs. The difference with
the preceding example is that the sequence starts with 12 instead
of *. Any literals are of course NS.
On closer inspection the second condition is equivalent to the
first; it is the easier one to implement.
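A sketch of the compile-back half of such a fold: once the
identified sequence has been executed at compile time, its outputs
are compiled as literals, deepest item first:
    \ Compile the top n stack items as literals, preserving order.
    : LITERALS,  ( x1 .. xn n -- )
       ?DUP IF  1- SWAP >R  RECURSE  R> POSTPONE LITERAL  THEN ;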
Associativity.
..............
An operator with two inputs and one output, a so-called "binary
operator", can have, in addition to NS, the property of associativity.
This refers to a situation where three operands are involved. Examples
are OR and + , but not F+ . In the following we will denote an
associative operator by %. Associativity allows the sequence
"x % y %" to be replaced with "x y % %", where it may be that "x y %"
can be folded into a constant. Example (assuming a 64-bit Forth):
"CELL+ CELL+" is first inlined to "8 + 8 +", then associated to
"8 8 + +", then folded to "16 +".
Note that it is not necessary to look for other patterns, in view
of the other transformations that are done.
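The example written out by hand on a 64-bit Forth; 2CELLS+ and its
variants are hypothetical names for the successive stages:
    : 2CELLS+          CELL+ CELL+ ;           \ the original word
    : 2CELLS+INLINED   8 + 8 + ;               \ after inlining
    : 2CELLS+FOLDED    [ 8 8 + ] LITERAL + ;   \ associated and
                                               \ folded: 16 +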
Short circuit evaluation.
.........................
Another optimisation applicable to binary NS-operators is short
circuit evaluation. This is the situation where the result is known
while only one of the operands is known, such as "FFFF AND" "FFFF OR"
"0 +" "0 XOR" and "0 *". Some of these operations can just be
dropped, while in other cases the result is known and the other
operand (possibly non-constant) can be dropped.
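A sketch of the identity-operand half: a recogniser telling a
peephole pass when the operation may simply be dropped. The operator
list is illustrative, and -1, the all-ones cell, plays the role of
FFFF here:
    \ True if n is the identity of binary operator xt, so that the
    \ phrase "n xt" can be removed from the sequence.
    : IDENTITY?  ( n xt -- flag )
       CASE
          ['] +    OF 0 =   ENDOF
          ['] OR   OF 0 =   ENDOF
          ['] XOR  OF 0 =   ENDOF
          ['] AND  OF -1 =  ENDOF
          ['] *    OF 1 =   ENDOF
          NIP FALSE SWAP             \ unknown operator
       ENDCASE ;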
Optimisation by recursive inlining
----------------------------------
A word is optimised by first optimising its constituents, then
inlining the constituents and applying any optimisation opportunities,
such as folding, that open up.
In more detail we have the following steps:
* Check First of all check whether the item has been optimised
already. We in fact do a "depth first" optimisation, so the words
lowest in the call hierarchy are optimised first. It is important
to attempt optimisation only once. This cuts the recursion short.
* Recurse For all constituent words of this definition, do an
optimisation, as defined in these steps.
* Inline Build up an executable sequence in the dictionary. Inline
each constituent word, keeping track of all opportunities to
optimise. Try to build up a sequence of NS-words that starts with
constants and where each following word doesn't consume more
inputs than are available. Consequently the outputs are available
as constants.
* Breakdown When a sequence of NS-words breaks down, we have
identified a sequence that can be run at compile time. This
sequence is run, and removed from the output sequence. Then the
output of the run is compiled as a sequence of constants.
* Special opportunities After each inlining, the constituent is
checked for whether it allows special optimisations. Here the likes
of the associativity optimisation are done, by looking for an
"operand % operand %" pattern. In a fashion similar to the breakdown
of a NS-sequence, the tail of the sequence that is being built up is
replaced.
* Replace After inlining is finished, the sequence is attached
to the word we are optimising, to replace the original sequence.
Maybe the original code is kept if no folding took place and the
sequence is longer than a certain limit.
* Mark properties The current word is marked as optimised. Its stack
effect and its optimisation classes are derived from its
constituents and added to the header flags.
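A sketch of the driver for these steps. OPTIMISED? ( xt -- flag ),
EACH-CONSTITUENT ( xt1 xt2 -- ), which applies xt2 to every
constituent of xt1, INLINE&FOLD ( xt -- ) and MARK-OPTIMISED ( xt -- )
are hypothetical names for the mechanisms described above; DEFER
allows the recursion to be passed as an execution token.
    DEFER OPTIMISE
    : (OPTIMISE)  ( xt -- )
       DUP OPTIMISED? IF DROP EXIT THEN   \ Check: cut recursion short
       DUP ['] OPTIMISE EACH-CONSTITUENT  \ Recurse: depth first
       DUP INLINE&FOLD                    \ Inline, Breakdown, Replace
       MARK-OPTIMISED ;                   \ Mark properties
    ' (OPTIMISE) IS OPTIMISE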