Forth lecture 5.
Introduction ============ You may wonder why an optimizer for a computer language would be considered an AI application. This optimizer is not so much for a particular language as well related to a Computer Intelligence that has insight in her own code. Different types of optimisations interfere and finding ones way through this certainly requires some heuristics. The bottom line is that an optimiser qualifies as an AI application. Properties ---------- A Forth system is a database of small programs. It is worthwhile to investigate what properties these small programs (words) might have. The flag field of a word allow to add this information to the header. A certain combination of flags allow a particular optimisation. Definitions ----------- An annihilator is a word that only deletes an item from the stack. Examples are DROP 2DROP NIP RDROP. A juggler reorders the stack without adding or removing items. Examples are SWAP 2SWAP ROT. A duplicator copies an item from the stack. Examples are DUP OVER 2DUP. Optimisations ------------- Optimisations are manipulations on a program source or intermediate code to improve the speed of the resulting program. In other respect the result is inferior. Symbolic debugging - one of Forth's strong points - goes through the drain. (The name "optimisation" is a misnomer.) * Constant folding is a well known technique in optimisation. It means that if an operator works on constants the result may be replaced by a constant that is calculated at compile time. In Forth we generalise this to folding. Folding refers to all words that can be replaced by simpler words in case they receive constant data on the stack. * Reordering is not so much an optimisation per se, but it allows other optimisations to kick in. As a rule of thumb constants are moved to the top of the stack, where they fall prey to folding. Reordering might also eliminate a juggler. * Annihilation is the elimination of e a whole sequence of operations. In Forth sometimes the result of a calculation is dropped. Depending on the properties of the calculation, the calculation itself can be removed. This type of annihilation is related to an annihilator. Another type is related to conditional branching where the condition is known at compile time. Code known to be skipped is removed. * Inlining means replacing a Forth word with its constituents. This technique is very important in Forth, more so than in other languages, due to the small size of Forth words. Inlining is always a winner in speed, and mostly even also a winner with regard to space. Even more important is the fact that inlining allows folding to be applied across constituent words. This applies to high level and low level code alike. Inlining high level code is trivial. A further inlining stage replaces a high level definition that only calls code words, by a code definition which concatenates the code words. Data collecting --------------- In order to add introspective information to a Forth, in the first place the machine code words must be analysed, because ultimately everything is defined in terms of code words. For this purpose the code words are disassembled using a disassembler that allows to readily inspect the parts of a disassembled instruction. A Postit-Fixup assembler and disassembler is well suited. * By inspecting register words in the disassembly, registers usage can be accumulated. This information is then added to the header. * At the same time information about whether memory or I/O ports are accessed for read or write can be collected. It turns out to be useful make a difference between input and output side effects. Here the words to look out for are the MOVE and IN/OUT instructions, operations that access memory (in fact all operations that not target registers) and special instructions like the string operations on Pentia. * Finally the stack effect can be deduced from the number of POP's and PUSH'es. And the use of the return stack can be marked, which mostly warrants a special treatment. After all code words have been analysed, the stack effects and register usage can be concluded for all high level words. The stack effect of a high level words is the concatenation of the stack effect of its constituents. The register usage of a high level word is the logical or of the register usage of the constituents, as are its side effects. There will no doubt be exceptions. It is wrong to cater for too many exceptional situation in such a heuristic tool. Instead, the exception are filled in by hand before the automated collection is started, it fills in only as yet unknown items. Of course it helps to have a simple and straightforward Forth to begin with. Purpose ------- A \ci in general will use optimisation to generate a temporary definition that is much faster, and retain all the valuable partial information about words. In normal non-AI applications, words are recursively replaced by faster words, and those eventually by code definitions. Meanwhile words that are no longer directly used in the final application are eliminated. For space conservation headers may be removed as well, provided in the application no dictionary lookup is needed. Implementation ============== The following implementation notes apply to a 32 bits Pentium Forth where a full cell (4 bytes, 0..3) is reserved for the flags. They must be considered as an example. The information about a word, optimisation opportunities and stack effect, sits in the flag field. Whenever nothing is filled in in the flag field, it means unknown. This applies equally to the stack effect as to the optimisation flags. Stack effects ------------- The information about the stack effects sits in the byte 3 of the flag field. The highest nibble of this third byte applies to input. It is the number of stack items popped plus one. The lowest nibble thusly indicates the number of pushed items. 0 means an unknown (not yet analysed) stack effect. 0FH indicates a variable number of items. The stack effect is added in three steps. For all low level words the stack effect is found by counting pops and pushes. Irregular stack effects are corrected as well as filled in for high level words. All high level stack effects are derived from the stack effect of their constituents. Code words are analysed by disassembling the code that is pointed to by the code field to the first "next" code encountered. For each instruction the opcode, which is the first part of its disassembly, is looked up in a set of possible pop and push instructions. Irregularities are just tabulated in the source code of the analyser. Words are recognized by their code field. High level words are either created by ":" or by a "CREATE .. DOES>" construction . They are recognised by the code field containing DOCOL or DODOES respectively. For both the data field points to a chain of high level calls, i.e. a number of such calls possibly with inlined data and ending in a "(;)", the word compiled by ";". (The result of this "high level next" is to return control is returned to the word that called this one.) For a linear chain of calls the stack effect is calculated as follows: * Start with a effect of ( 0 - 0 ). * For each constituent Subtract the pops from the left (output) nibble. If the output nibble is negative, add its (absolute) value to inputs, and make it zero. Add the pushes to the left (output) nibble. (Correction by 11H is not yet done). The following exceptions to a linear chain have special treatment: * LIT BRANCH'es and SKIP are followed by inline data that must be taken care off * A BRANCH or 0BRANCH forward is always taken, analysing just one path through a definition: the shortest one. A more sophisticated way is to analyse all paths and conclude a variable outcome if it is not consistent or any of the paths contains a variable constituent. * If the stack effect of a constituent is variable, the result is variable, overruling any other outcome * If the stack effect of a constituent is unknown, the result is unknown, overruling any other outcome except variable. * For a CREATE .. DOES> word the linear chain pointed to by the DOES> pointer is analysed. However the stack effect is initialised to ( 0 - 1) to reflect the passing of the data pointer to the DOES> part. * 'EXECUTE has the stack effect of . Other occurrences of EXECUTE lead to a variable stack effect. Lateron we will leave this to the optimiser, but at the stage of analysing the kernel this is useful, especially because all usage of EXECUTE in the kernel is of this type. * ' CATCH has the stack effect of plus an extra output. Other occurrences of CATCH lead to a variable stack effect. So a word is treated as if exceptions do not occur. This is okay because the stack effect is not relevant in case of exceptions. A high level word is recognised by its code field address containing DOCOL , i.e. the nesting routine for the interpreter. A CREATE .. DOES> word is detected by is code field address containing DODOES , i.e. the common code that starts up words defined by compiler extension. All other words are treated as code. The whole of Forth is treated as follows: * Fill in the exception * Fill in the code words * Sweep repeatedly through the dictionary, from early to latest: For each unknown stack effect, try to find it by discriminating between DODOES DOCOL and other words, Stop if no progress is made any more. Hopefully everything is known now, but maybe we must add to the exceptions. And repeat the above process. The notion of a simple sequence is one that doesn't reach to the stack outside what is defined within the sequence. Optimisation classes -------------------- As has been pointed out, the optimisation class of a word is indicated by a bit set in the flags field. Each bit set to true opens up a particular opportunity for optimisation. Further a sequence has a certain class if each constituent has that class. For example, if one of the words called does a store, the sequence is found to do a store and the optimisations that would be allowed by "no stores" are blocked. So the optimisation class of a sequence is the logical or of the oc's of the constituents. This can be done efficiently by bit-wise or operations. The no store bit. ................. The "no store" bit would better be named "no output side effect" bit. It indicates that the outside world doesn't change by executing this word. Again not that the stacks and internal registers are inside. Note that fetching from an input port has an output side effect, (as well as an input side effect.) The following optimisation are opened up: * In combination with an annihilator. If the output of a "no store" sequence is annihilated, the whole sequence and the annihilator may be left out. Example: BASE CELL+ XX NIP becomes XX * In combination with a juggler. If the outputs of "no store" sequence are juggled, the sequences itself may be juggled, eliminating the juggler. Example: XX CELL+ BASE SWAP becomes BASE XX CELL+ * In combination with a duplicator. Again a sequence may be duplicated and the duplicator eliminated. This is not an optimisation, except for the duplication of constants. Going the other direction can be an optimisation. Two identical sequences with no output side effect can be replaced by one and a duplicator. Example: (for a recursive definition with stack effect ( 1 - 1 ) and no side effects) 12 RECURSE OVER RECURSE becomes 12 RECURSE 12 RECURSE (elimination duplicator) 12 RECURSE 12 RECURSE becomes 12 RECURSE DUP. (introducing duplicator) The no fetch property. ...................... The "no fetch" bit would better be named "no input side effect" bit. It indicates that the outside world affects the outcome of this word. Input side effects are weaker than output side effects and the knowledge that they are absent allows less optimisation. The no stack effect property. ............................. The "no stack effect fetch" bit refers to absolute accesses of the stacks, i.e. where the data or return stack are not used as stacks. Typical examples are DEPTH and RSP. These words are rare but prevent considerable optimisation. The no side effect property. ............................ The combination of the "no store ", "no fetch " and "no stack effect " properties is quite common. Such a word is said to have the "no side effect" property. The combination allows substantially more optimisation than each alone. We will use the abbreviation NS for this important concept. Examples are CONSTANT's, VARIABLE's, operators like + or NEGATE, and all stack manipulations: jugglers, annihilator, duplicators. NS-words are amenable to folding: * If a NS-sequence has only constant inputs, it may be run at compile time. Its inputs and the code sequence may be replaced by the resulting constant outputs. Example: After "12 4 3 SWAP * +" is replaced by 24. * If a NS-sequence has no inputs, it may be run at compile time and replaced by the resulting constant outputs. The difference with the preceeding example is that the sequence starts with 12 instead of *. Any literals are of course NS. On closer inspection the second condition is equivalent to the first. It is the more easy one to implement. Associativity. .............. An operator with two inputs and one output, so called "binary operators" can have, in addition to NS, the property of associativity. This refers to a situation where three operands are involved. Examples are OR and + . However not F+ . In the following we will denote an associative operator by %. Associativity allows to replace the sequence "x % y %" with "x y % %" where it may be that "x y %" can be folded into a constant. Example: (assuming a 64-bit Forth) "CELL+ CELL+" is first inlined to "8 + 8 +" then associated to "8 8 + +" then folded to "16 +". Note that it is not necessary to look for other patterns, in view of other transformation that are done. Short circuit evaluation. ......................... Another optimisation aplicable to binary NS-operators is short circuit evaluation. This is the situation where the result is known, while only one of the operands is known, such as "FFFF AND" "FFFF OR" "0 +" "0 XOR" and "0 *". Some of these operations can be just dropped, while in other cases the result is known and the other operand (possibly non-constant) can be dropped. Optimisation by recursive inlining ---------------------------------- A word is optimized by first optimizing its constituents, then inlining the constituents and apply any optimisation opportunities like folding that open up. In more detail we have the following steps: * Check First of all check whether the item has been optimised already. We do in fact a "depth first" optimisation, so the words lowest in the call hierarchy are optimised first. It is important to only attempt optimisation once. This cuts the recursion short. * Recurse For all constituent words of this definition, do an optimisation, such as defined in these steps. * Inline Build up an executable sequence in the dictionary. Inline a constituents word, keeping track of all opportunities to optimise. Try to build up a sequence of NS-words that starts with constants and where each word following doesn't consume more inputs than are available. Consequently the outputs are available as constants. * Breakdown When a sequence of NS-words breaks down, we have identified a sequence that can be run at compile time. This sequence is run, and removed from the output sequence. Then the output of the run is compiled, as a sequence of constants. * Special opportunities. After each inlining the constituent is checked whether it allows special optimisations. Here the likes of the associativity optimisation is done, by looking for a "operand % operand %" pattern. In a fashion similar to the breakdown of a NS-sequence the tail of the sequence that is being built up is, is replaced. * Replace After inlining is finished, the sequence is now attached to the word we are optimizing to replace the original sequence. Maybe the original code is kept if no folding took place and the sequence is longer that a certain limit. * Mark properties The current word is marked as optimised. Its stack effect and its optimization classes are derived from its constituents and added to the flags header.