You then get instructions of this kind:
ES: MOV, X| T| DI'| [MEM +8* AX] FFFFF800 X,
Ugly? Yes, this reflects the ugliness of the instruction set of the Pentium.
You can hide the ugliness,
but then you defeat the purpose of an assembler: absolute control.
The following is copied from an early version of the asgen.frt source.
( Most instruction set follow this basic idea that it contains of three )
( distinct parts: )
(   1. the opcode that identifies the operation )
(   2. modifiers such as the register working on )
(   3. data, as a bit field in the instruction. )
(   4. data, including addresses or offsets. )
( This assembler goes through three stages for each instruction: )
(   1. postit: assemblers the opcode with holes for the modifiers. )
(      This has a fixed length. Also posts requirements for commaers. )
(   2. fixup: fill up the holes, either from the beginning or the )
(      end of the post. These can also post required commaers )
(   3. fixup's with data. It has user supplied data in addition to )
(      opcode bits. Both together fill up bits left by a postit. )
(   4. The commaers. Any user supplied data in addition to )
(      opcode, that can be added as separate bytes. Each has a )
(      separate command, where checks are built in. )
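The stages described in these comments can be modelled in a few lines. The following is only a Python sketch, not the Forth of asgen.frt: the class `Assembler`, its method names and the commaer name `"IB"` are hypothetical, only one-byte postits are handled, and fixups with data are left out. The example bytes are the 8080 `MVI` encoding (opcode 06 with the destination register in bits 3..5, followed by one immediate byte).

```python
class Assembler:
    """Toy model of postit / fixup / commaer. All names are hypothetical."""

    def __init__(self):
        self.code = bytearray()  # bytes assembled so far
        self.start = 0           # offset of the current opcode byte
        self.holes = 0           # opcode bits that still have to be filled
        self.pending = set()     # commaers that are still required

    def postit(self, opcode, holes=0, commaers=()):
        """Lay down a one-byte opcode and record its holes and commaers."""
        if self.holes or self.pending:
            raise ValueError("previous instruction is incomplete")
        self.start = len(self.code)
        self.code.append(opcode)
        self.holes = holes
        self.pending = set(commaers)

    def fixup(self, bits, mask, commaers=()):
        """Fill the hole `mask` with `bits`; may require more commaers."""
        if mask & ~self.holes:
            raise ValueError("fixup does not fit a free hole of this opcode")
        self.code[self.start] |= bits
        self.holes &= ~mask
        self.pending |= set(commaers)

    def comma(self, name, value, size=1):
        """Append user data as separate bytes, if a postit or fixup asked for it."""
        if name not in self.pending:
            raise ValueError("commaer " + name + " is not required here")
        self.pending.remove(name)
        self.code += value.to_bytes(size, "little")


asm = Assembler()
asm.postit(0x06, holes=0x38, commaers=["IB"])  # MVI, register still open
asm.fixup(1 << 3, mask=0x38)                   # register C fills bits 3..5
asm.comma("IB", 0x12)                          # the immediate byte
print(asm.code.hex())                          # 0e12
```

The point of the design is that each piece only knows which bits it owns, so a fixup that is used twice or that belongs to another opcode is rejected by the mask check rather than by a per-instruction rule.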
Instead of having a defining word for each "type" of opcode, I now have
defining words for postits (sizes 1, 2, 3 and 4), for fixups from the front and from behind,
for data fixups and for commaers.
The rest is data and tables.
Not all of those defining words are relevant for all assemblers.
Fixup from front can be dispensed with in Intel assemblers,
as can data fixups, while the DEC Alpha has only 4-byte instructions, and so on.
So from these few words
the 8080 assembler uses only 3, the 8086 assembler uses 4,
the DEC Alpha uses 3.
The above is from a Forth perspective.
From a Perl perspective there is a small interpreter that loads
tables, which are in fact lookup tables (so-called hashes in Perl).
During assembly, as a second stage, the mnemonics are looked up in them.
A small trick -- FAMILY -- prevents a lot of errors in tricky magic constants.
It means that similar words are defined in a loop, e.g.
0100 0 8 xFAMILY|R
AX| CX| DX| BX| SP| BP| SI| DI|
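The idea of defining a family in a loop can be sketched as follows. This is Python, not the Forth of the assembler, and the function `family` with its arguments (an increment, a first value and a list of names) is an assumption made for illustration; it does not claim to reproduce the meaning of the parameters of xFAMILY|R. The register order is the one from the example, which matches the 8086 register numbers 0 to 7.

```python
def family(increment, first, names):
    """Return {name: bits}: the first name gets `first`, each next one
    gets `increment` more. Hypothetical helper, not the Forth FAMILY."""
    return {name: first + i * increment for i, name in enumerate(names)}


NAMES = "AX| CX| DX| BX| SP| BP| SI| DI|".split()

# Register number in the lowest three bits.
low_field = family(1, 0, NAMES)

# The same registers, one octal digit higher (bits 3..5).
mid_field = family(0o10, 0, NAMES)

print(low_field["DI|"], oct(mid_field["CX|"]))  # 7 0o10
```

Only the increment and the first value are typed by hand, so the eight constants cannot get out of step with each other.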
I started by implementing an 8086 assembler (for fig-Forth!). You can look at an equivalent ISO Forth version here. In this vein I went on to make a 386 assembler, which became part of the generic i86 figforth and later of the generic i86 ciforth. If you ran in 16-bit protected mode it automatically switched to 16 bits. But testing this beast was a bit of a nightmare. (It has now, as of ciforth 4.2.0, been superseded by a lightweight version compatible with the great assembler.)
So I went back to the drawing board and separated out the generic part, i.e. the part that has no reference to any processor in particular. Then I used it to implement an 8080 assembler, and I added self-awareness by making a word that lists all possible opcodes. Then I added a disassembler. All illegal combinations of instruction pieces are detected and give a comprehensible error. The assembler is tested by assembling all possible opcodes, disassembling the result, and comparing it with the original.
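The test strategy can be illustrated on a very small scale. The sketch below is Python, not the actual test of the assembler, and all function names are hypothetical. It covers only the 8080 `MOV` instructions (opcode 40 hex, destination in bits 3..5, source in bits 0..2), where the combination `MOV M,M` is illegal because its bit pattern, 76 hex, is `HLT`.

```python
REGS = ["B", "C", "D", "E", "H", "L", "M", "A"]  # 8080 register numbers 0..7


def assemble(dst, src):
    """Assemble MOV dst,src to its single opcode byte."""
    if dst == "M" and src == "M":
        raise ValueError("MOV M,M does not exist (76 hex is HLT)")
    return 0x40 | (REGS.index(dst) << 3) | REGS.index(src)


def disassemble(byte):
    """Turn a MOV opcode byte back into (dst, src)."""
    if byte == 0x76 or not 0x40 <= byte <= 0x7F:
        raise ValueError("not a MOV opcode")
    return REGS[(byte >> 3) & 7], REGS[byte & 7]


def all_movs():
    """List every legal MOV, the way a self-aware assembler could."""
    return [(d, s) for d in REGS for s in REGS if (d, s) != ("M", "M")]


def round_trip_ok():
    """Assemble everything, disassemble it, and compare with the original."""
    return all(disassemble(assemble(d, s)) == (d, s) for d, s in all_movs())


print(len(all_movs()), round_trip_ok())  # 63 True
```

Because the list of instructions is generated rather than typed in, the test covers every legal combination, and the illegal one shows up as an error instead of as a wrong byte.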