Forth lecture 16.

Doubly linked vocabularies

Scope

It may be argued that limited vocabularies are sufficient for normal Forth coding. During interpretation speed is limited by the typing speed of the user anyway. During compilation speed of searching is not essential, because it is done just once. (This was not true of old 8 bit Forth compilers, and is not true even today for small microcomputers as are used in embedded applications.)

However, some applications in Forth use the Forth vocabulary as a database of some sort. There is no limit to the number of items that might be required to store in such circumstance. A particular example of this is my reverse engineering assembler and disassembler .

Alternatives for speeding up searching

Traditionally Forth uses searches through vocabularies in a straightforward way via linked list, or like in colorforth looking up in an array. This is an O(n) process where n is the number of entries, which is slow unless for small n.

This can be speeded up in several way. Most often used is some kind of hash mechanism. A hash code can be derived from the word to look up, and then one of several link chains can be used, instead of one. This can only achieve a linear speedup in the number of chains, i.e. for 16 chains one may expect at most a 16-fold speed up.

The advantage are

it requires relatively little memory for a substantial speed up.
the chains have about the same structure as the original vocabularies
it has no problems with part of the vocabulary in rom

An other method is using a hash table on top of the link structures. This achieves a constant lookup time.

The advantage are

it achieves an optimal search time.
it can be added to any structure
it has no problems with part of the vocabulary in rom

Both have a disadvantage of scaling. A chains method will not cope with truly huge vocabularies, a graceful degradation. A hash method will eventually have its table full, and then a nasty rehash and some unpleasant code is required, an ungraceful degradation.

Few have tried to do away with links altogether. colorforth is an example. Here a word is looked up in a table. This has the same disadvantage as a hash table, that it becomes full in the end. It is easy to speed up, by sorting this table. Chuck Moore nor any of the users of colorforth has felt the need for that.

I do want a method that uses O(log(n)) operations for a look-up. But for ciforth I must rule out hash tables, because they cannot be reconciled with denotations. This is my main motivation looking for another approach.

Using double link fields

The idea of double link fields is rather simple. The links are arranged such that the words are in ascending order. Now a second link is put in place, called the far link. It points to some other entry, possibly even a random one. If the name to be searched is outside of the alphabetical range represented by the current word and its far link, the searching can proceed with the far link. So you can think of it as a proposal to skip a part of the dictionary. This is most efficient if the proposals encountered span a descending range of a half, a quarter etc. of the dictionary. New words must be inserted at a proper place to keep the dictionary in ascending order.

This method has its advantages:

Over the hashing method that it scales well, there is no breakdown point.
Over the chains method, that a reordering of the far links can be done, when performance degrades after a great many additions.
A vocabulary can be optimised in two parts.
If need be a part of a vocabulary can be excluded from ordering, which then will still remain a pure linked list. (This is important for the Post-It Fix-Up assembler.)
it can cope well with prefixes (unlike the hashing method)

Splitting a vocabulary in two sorted parts is needed in case a vocabulary has been split over a ROM and a RAM part. The effect -- without changes to the search engine -- is that first a binary search is done in the first part, then in the second.

Denotations are prefixes that are used to implement number and string constants. With alphabetic sorting they come before names that contain them as a prefix, i.e. $ will come before $@. This, of course, is never wanted. Fortunately, here again we can split a vocabulary in two sorted parts: the non-denotations and the denotations.
A simpler solution may seem to take denotations into account during comparison. A prefix is then sorted after any word that contains it as a prefix. This doesn't work, because the prefix is then sometimes not found, e.g. after $@ a search for "$1" quits, and $ is not considered.

Primitives and design

Searching a wordlist becomes approximately as follows.
A useful primitive is one that follows the far links, until an entry is found that is alphabetically higher than or equal to the name being looked up. Then it backups to the previous entry, if any. It follows one near link, than again far links.
- Follow far links as long as too high Then we backup to the entry whose far link is too far: if it is found or too high we are ready, either found or fail if it is too low we can follow its far link again In short:

REPEAT
    SKIP AS LONG AS NAME IS HIGH ENOUGH
    BACKUP ONE
UNTIL FOUND OR NOT PRESENT

This assumes that in the end we have a short linear chain, where normal links and far links are the same, except for the last one that contains the end-sentinel, probably a zero. Because we follow those far links, searching ends there naturally.

An nice feature of this algorithm is that it can be used now to find the place where to insert a new name.

A modification works in the case of a split wordlist, as with ROM and RAM parts. Instead of the end-sentinel, we place the start of the second part, that is in itself organized as the first part.
If the wordlist has been split in parts we must keep following far links after we known it is not found. This will lead you as fast as possible out of a first order part into a second order part, or the end
This becomes

REPEAT
    REPEAT
        SKIP AS LONG AS NAME IS HIGH ENOUGH
        BACKUP ONE
    UNTIL FOUND OR NOT PRESENT
    SKIP AS LONG AS NAME IS LOW ENOUGH
AGAIN

Surprisingly this is the same than

REPEAT
    FOLLOW HIGH LINK
    BACKUP ONE IF NOT HIGH ENOUGH
UNTIL FOUND OR END SENTINEL

As a last surprise, this algorithm works correctly in case of a linear chain where the far links are the same as the normal links. It always probes the far link first -- always in vain --, then goes by the normal links. And it doesn't stop once past the place where the name is expected, but goes on until the end sentinel. Because a random list can be considered as a bunch of ordered chains, it even almost works for just any wordlist. The exception is the first word that cannot be found if it is larger than the second word.
A disadvantage is that the same algorithm can not be used now to find the place where to insert a new name. But we can use the old algorithm, and use as an end-sentinel the address of the first entry in the second part of the table.

One more complication

There is a problem when more of the same names (homonyms) are present. The danger exists that the second one is found.

A solution is:

while sorting the dictionary, keep homonyms in order
fill in the far links immediately, the same as the normal links, for all but the last of as set of homonyms
in setting up the tree, make sure that far links always point to the first one of a set of homonyms

All this can be accomplished by considering pre-existing linear chains as one entry, which leads to some trivial changes at the lowest level words.

FORGET-ting

Normally a linked wordlist can accomplish FORGET by cutting out entries. The entry that points to a to-be-forgotten entry, is made to point to the link of that entry. This works for sorted vocabularies too, and they remain sorted.
The far links can be short circuited in the same way.

Other Forth lectures Go to the home page of Albert van der Horst