Albert's home page

Post-It Fix-Up

an assembler according to a new principle

For the impatient:
jump to the downloads of the Forth assembler
jump to the downloads of the Perl assembler

Requirements for an Artificial Intelligence Assembler

For my AI work I need a Forth that is self-aware, in the sense that the dictionary contains all the tools to analyse itself. I had the same requirement for the assembler. Reverse engineering has similar requirements, and in fact these assemblers are well suited for that purpose. Unfortunately the Pentium (which I chose for my system) is rather complex. After rereading Thinking Forth I came up with the post-it fix-up philosophy described here. A design goal is the "reverse engineering" principle:

Disassembling code and assembling the resulting source gives the exact same code.
Assembling code, then disassembling it, gives the source code back.

This sounds fair enough, until you realise that MOV AX,BX has three different opcodes, not counting the LEA instruction. No assembler existed for the Intel 386 that obeyed the reverse engineering principle, let alone one written in Forth.
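To make the problem concrete, here is a small Python sketch (not part of the assemblers described on this page) that builds two of the register-to-register encodings of MOV in 16-bit code. The function name mov_encodings and the table REG16 are made up for this illustration. Both byte sequences mean MOV AX,BX, so a disassembler that prints the same text for both cannot give the original bytes back.

```python
# Register numbers as used in the ModRM byte of 16-bit x86 code.
REG16 = {"AX": 0, "CX": 1, "DX": 2, "BX": 3,
         "SP": 4, "BP": 5, "SI": 6, "DI": 7}


def mov_encodings(dst, src):
    """Return two encodings of MOV dst,src for 16-bit registers.

    Hypothetical helper for illustration only.
    """
    d, s = REG16[dst], REG16[src]
    # Opcode 89: MOV r/m16, r16 (destination sits in the r/m field).
    store_form = bytes([0x89, 0xC0 | (s << 3) | d])
    # Opcode 8B: MOV r16, r/m16 (destination sits in the reg field).
    load_form = bytes([0x8B, 0xC0 | (d << 3) | s])
    return [store_form, load_form]


for encoding in mov_encodings("AX", "BX"):
    print(encoding.hex(" "))
```

An assembler that obeys the principle needs a distinct source notation for each of these forms.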

Post-It Fix-Up philosophy

For a Pentium assembler it turns out that the intermediate step of creating defining words for each type of instruction gets in the way. (That is the traditional way of Forth assemblers.) There are just too many of them. Instead I separate all bits of instructions into words that are interpreted separately, where each word "knows" how to work together with the others. In particular, a word knows how to handle a preceding number. So we can have a relative jump preceded by an offset, and still allow an absolute address to be placed there. It is just a matter of a different word that interprets the absolute address.
The idea is related to the blackboard design pattern, and lexical parsers that push rather than pull. Reportedly the latter is supported by flex and bison to an extent.

Details of Post-It Fix-Up

You then get instructions of this kind:
ES: MOV, X| T| DI'| [MEM +8* AX] FFFFF800 X,
Ugly? Yes, this reflects the ugliness of the instruction set of the Pentium. You can hide the ugliness, but then you defeat the purpose of an assembler: absolute control.

The following is copied from an early version of the asgen.frt source.

( Most instruction set follow this basic idea that it contains of three )
( distinct parts:                                                       )
(   1. the opcode that identifies the operation                         )
(   2. modifiers such as the register working on                        )
(   3. data, as a bit field in the instruction.                         )
(   4. data, including addresses or offsets.                            )
( This assembler goes through three stages for each instruction:        )
(   1. postit: assemblers the opcode with holes for the modifiers.      )
(      This has a fixed length. Also posts requirements for commaers.   )
(   2. fixup: fill up the holes, either from the beginning or the       )
(     end of the post. These can also post required commaers            )
(   3. fixup's with data. It has user supplied data in addition to      )
(      opcode bits. Both together fill up bits left by a postit.        )
(   4. The commaers. Any user supplied data in addition to              )
(      opcode, that can be added as separate bytes. Each has a          )
(      separate command, where checks are built in.                     )
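The stages in this comment can be modelled in a few lines. The following is a Python sketch, not the Forth code from asgen.frt: the class PostItAssembler and its method names are invented here, it handles one-byte postits only, and it leaves out fixups that post their own commaers. The example instruction is the 8080 MVI (opcode 06 with a three-bit register field, followed by one data byte).

```python
class PostItAssembler:
    """Toy model of postit / fixup / commaer for one-byte opcodes."""

    def __init__(self):
        self.code = bytearray()
        self.start = 0        # offset of the instruction being built
        self.holes = 0        # bits the postit left open for fixups
        self.wanted = set()   # commaers posted as still required

    def check_complete(self):
        if self.holes or self.wanted:
            raise ValueError("previous instruction is incomplete")

    def postit(self, opcode, holes, wants=()):
        """Lay down the opcode and record its holes and commaers."""
        self.check_complete()
        if opcode & holes:
            raise ValueError("opcode bits overlap the holes")
        self.start = len(self.code)
        self.code.append(opcode)
        self.holes = holes
        self.wanted = set(wants)

    def fixup(self, bits, mask):
        """Fill a hole; refuse bits that no postit left open."""
        if mask & ~self.holes or bits & ~mask:
            raise ValueError("fixup does not fit an open hole")
        self.code[self.start] |= bits
        self.holes &= ~mask

    def comma(self, name, value, size=1):
        """Append user data, but only if a postit asked for it."""
        if name not in self.wanted:
            raise ValueError("commaer %s was not requested" % name)
        self.wanted.remove(name)
        self.code += value.to_bytes(size, "little")


asm = PostItAssembler()
asm.postit(0x06, holes=0x38, wants=["IB"])  # MVI with a register hole
asm.fixup(7 << 3, mask=0x38)                # register A is number 7
asm.comma("IB", 0x12)                       # the immediate byte
print(asm.code.hex(" "))
```

Because every piece checks what the postit left open, a second register fixup or an unrequested data byte is rejected instead of silently corrupting the instruction.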

Instead of having a defining word for each "type" of opcode, I now have defining words for postits (sizes 1, 2, 3 and 4), fixups from the front and from behind, data fixups, and commaers. The rest is data and tables.
Not all of those defining words are relevant for all assemblers. Fixup from the front can be dispensed with in Intel assemblers, as can data fixups, while DEC Alphas have only 4-byte instructions, etc. So from these few words the 8080 assembler uses only 3, the 8086 assembler uses 4, and the DEC Alpha uses 3.
The above is from a Forth perspective. From a Perl perspective there is a small interpreter that loads tables, which are in fact lookup tables, so-called hashes in Perl. During assembly, as a second stage, the mnemonics are looked up.

A small trick -- FAMILY -- saves a lot of errors in tricky magic constants. This means that similar words are defined in a loop e.g.
0100 0 8 xFAMILY|R AX| CX| DX| BX| SP| BP| SI| DI|
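The same idea in a Python sketch, for readers who do not read Forth. This is an assumption-laden illustration: the function family is invented here, and it assumes that the three numbers in the line above mean increment, first value and count, and that 0100 is read as hexadecimal. The point is only that one loop produces all eight constants, so none of them is typed by hand.

```python
def family(increment, first, names):
    """Map each name to first, first+increment, first+2*increment, ...

    Hypothetical stand-in for a FAMILY-style defining word.
    """
    return {name: first + i * increment for i, name in enumerate(names)}


# Assumed reading of the example: increment 0x100, first value 0, 8 names.
REGISTER_FIXUPS = family(
    0x100, 0, "AX| CX| DX| BX| SP| BP| SI| DI|".split())

for name, bits in REGISTER_FIXUPS.items():
    print(name, format(bits, "04X"))
```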

I started by implementing an 8086 assembler (for fig-Forth!). You can look at an equivalent ISO Forth version here. In this vein I went on to make a 386 assembler, which became part of the generic i86 figforth and later of the generic i86 ciforth. If you run in 16-bit protected mode it automatically switches to 16 bits. But testing this beast was a bit of a nightmare. (It has now, as of ciforth 4.2.0, been superseded by a lightweight version compatible with the great assembler.)

So I went back to the drawing board and separated out the generic part, i.e. the part that has no reference to any processor in particular. Then I used it to implement an 8080 assembler, and I added the self-awareness by making a word that lists all possible opcodes. Then I added a disassembler. All illegal combinations of instruction pieces are detected and give a comprehensible error. The assembler is tested by assembling all possible opcodes, disassembling the result, and comparing it with the original source.
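The shape of such a test can be sketched in Python for a tiny slice of the 8080, the register-to-register MOV (binary 01 ddd sss). This is an illustration under stated assumptions, not the actual test: the functions assemble, disassemble, all_instructions and round_trip_ok are invented here. The combination MOV M,M is rejected, because its bit pattern, hex 76, is the HLT instruction.

```python
REGS = ["B", "C", "D", "E", "H", "L", "M", "A"]  # 8080 register numbers 0..7


def assemble(dst, src):
    """Encode MOV dst,src, rejecting the one illegal combination."""
    if dst == "M" and src == "M":
        raise ValueError("MOV M,M does not exist (0x76 is HLT)")
    return bytes([0x40 | (REGS.index(dst) << 3) | REGS.index(src)])


def disassemble(code):
    """Decode one MOV byte back into its (dst, src) pair."""
    op = code[0]
    if op == 0x76 or (op & 0xC0) != 0x40:
        raise ValueError("not a MOV instruction")
    return REGS[(op >> 3) & 7], REGS[op & 7]


def all_instructions():
    """List every legal MOV, playing the role of the opcode lister."""
    for dst in REGS:
        for src in REGS:
            if not (dst == "M" and src == "M"):
                yield dst, src


def round_trip_ok():
    """Check both directions of the reverse engineering principle."""
    for source in all_instructions():
        code = assemble(*source)
        if disassemble(code) != source:
            return False
        if assemble(*disassemble(code)) != code:
            return False
    return True


print(round_trip_ok())
```

The test is only as strong as the lister: it proves the round trip for every instruction the lister generates, which is why the generator and the error checks have to share the same tables.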

Go to the home page of Albert van der Horst