Forth lecture 16.
Doubly linked vocabularies
Scope
It may be argued that limited vocabularies are sufficient for
normal Forth coding.
During interpretation, speed is limited by the typing
speed of the user anyway.
During compilation, the speed of searching is not essential,
because it is done just once.
(This was not true of old 8-bit Forth compilers,
and is not true even today for the small microcomputers
used in embedded applications.)
However,
some applications in Forth
use the Forth vocabulary as a database of some sort.
There is no limit to the number of items that might need
to be stored in such circumstances.
A particular example of this is my
reverse engineering assembler and disassembler.
Alternatives for speeding up searching
Traditionally Forth searches vocabularies in
a straightforward way via a linked list,
or, as in colorforth, by looking up in an array.
This is an O(n) process where n is the number of entries,
which is slow except for small n.
This can be sped up in several ways.
Most often used is some kind of hash mechanism.
A hash code can be derived from the word to look up, and then
one of several link chains can be used,
instead of one.
This can only achieve a linear speedup in the number of chains,
i.e. for 16 chains one may expect at most a 16-fold speed up.
The advantages are:
-
it requires relatively little memory for a substantial speed up.
-
the chains have about the same structure as the original vocabularies.
-
it has no problems with part of the vocabulary in ROM.
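As an illustration (a sketch under my own naming, not any particular Forth's implementation), a wordlist hashed into 16 chains might look like this; the hash function is an arbitrary cheap choice:

```python
# Sketch of a hashed wordlist, assuming 16 chains; the hash function
# is illustrative, not that of any particular Forth.
NCHAINS = 16

wordlist = [[] for _ in range(NCHAINS)]   # one link chain per hash value

def chain_of(name):
    # cheap hash: sum of character codes modulo the number of chains
    return sum(name.encode()) % NCHAINS

def define(name, body):
    # newest definitions are found first, as in a Forth vocabulary
    wordlist[chain_of(name)].insert(0, (name, body))

def find(name):
    # only one of the 16 chains is searched: at best a 16-fold speedup
    for n, body in wordlist[chain_of(name)]:
        if n == name:
            return body
    return None
```

Each chain keeps the ordinary newest-first search order, which is why the chains have about the same structure as the original vocabulary.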
Another method is using a hash table on top of the link structures.
This achieves a constant lookup time.
The advantages are:
-
it achieves an optimal search time.
-
it can be added to any structure.
-
it has no problems with part of the vocabulary in ROM.
Both methods have a scaling disadvantage.
A chains method will not cope with truly huge vocabularies,
but it degrades gracefully.
A hash method will eventually have its table full, and then
a nasty rehash and some unpleasant code are required:
an ungraceful degradation.
Few have tried to do away with links altogether.
colorforth is an example.
Here a word is looked up in a table.
This has the same disadvantage as a hash table, that it becomes
full in the end.
It would be easy to speed up, by sorting this table.
Neither Chuck Moore nor any of the users of colorforth has felt the need
for that.
I do want a method that uses O(log(n)) operations for a look-up.
But for ciforth I must rule out hash tables,
because they cannot be reconciled with denotations.
This is my main motivation for looking for another approach.
Using double link fields
The idea of double link fields is rather simple.
The links are arranged such that the words are in ascending
order.
Now a second link is put in place, called the far link.
It points to some other entry, possibly even a random one.
If the name to be searched is outside of the alphabetical
range represented by the current word and its far link,
the searching can proceed with the far link.
So you can think of it as a proposal to skip a part of the
dictionary.
This is most efficient if the proposals encountered span
descending ranges of a half, a quarter, etc. of the dictionary.
New words must be inserted at a proper place to keep the
dictionary in ascending order.
This method has its advantages:
-
Over the hashing method: it scales well;
there is no breakdown point.
-
Over the chains method: the far links can be reordered
when performance degrades after a great many additions.
-
A vocabulary can be optimised in two parts.
-
If need be, a part of a vocabulary can be excluded from ordering,
and will then still remain a pure linked list.
(This is important for the Post-It Fix-Up assembler.)
-
It can cope well with prefixes (unlike the hashing method).
Splitting a vocabulary in two sorted parts is needed
in case a vocabulary has been split over
a ROM and a RAM part.
The effect -- without changes to the search engine -- is that first
a binary search is done in the first part, then in the second.
Denotations are
prefixes that are used to implement number and string constants.
With alphabetic sorting they come before names that contain
them as a prefix,
i.e. $ will come before $@.
This, of course, is never wanted.
Fortunately, here again we can
split a vocabulary in two sorted parts:
the non-denotations and the denotations.
A simpler solution might seem to be taking denotations into account during
comparison.
A prefix is then sorted after any word that contains it as
a prefix.
This doesn't work, because the prefix is then sometimes not found,
e.g. after $@ a search for "$1" quits, and $ is not considered.
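Both failure modes can be shown with a small sketch; `prefix_find` is my illustrative stand-in for a chain search with prefix matching (the first entry in search order that is a prefix of the input wins), not ciforth's actual FIND:

```python
# Sketch of a sorted chain search with prefix matching: the first entry
# in search order that is a prefix of the input wins, and the scan quits
# once it is past the place where a match could still occur.
def prefix_find(entries, text):
    for e in entries:
        if text.startswith(e):
            return e
        if e > text:          # sorted past the input: nothing later matches
            return None
    return None

# Alphabetic order puts "$" before "$@", so the word $@ is shadowed:
assert prefix_find(["$", "$@"], "$@") == "$"
# Sorting a prefix after its extensions seems to fix that...
assert prefix_find(["$@", "$"], "$@") == "$@"
# ...but then a search for "$1" quits at "$@" and "$" is not considered:
assert prefix_find(["$@", "$"], "$1") is None
```

Splitting the vocabulary into a non-denotation part and a denotation part avoids both problems.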
Primitives and design
Searching a wordlist becomes approximately as follows.
A useful primitive is one that follows the far links,
until an entry is found that is alphabetically higher than or equal to
the name being looked up.
Then it backs up to the previous entry, if any.
It follows one near link, then again far links.
- Follow far links as long as the name is high enough.
Then we back up to the entry whose far link went too far:
if it matches, or is too high, we are ready: either found or failed;
if it is too low we can follow its far link again.
In short:
REPEAT
SKIP AS LONG AS NAME IS HIGH ENOUGH
BACKUP ONE
UNTIL FOUND OR NOT PRESENT
This assumes that in the end we have a short linear chain,
where normal links and far links are the same,
except for the last one that contains the end-sentinel,
probably a zero. Because we follow those far links,
searching ends there naturally.
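The loop above can be sketched in Python; the field names `near` (ordinary link, next entry in ascending order) and `far` are my own, and None stands in for the end sentinel. This is an illustration, not ciforth's implementation:

```python
# Sketch of the search loop, assuming each entry carries a `near` link
# (next entry in ascending order) and a `far` link (the skip proposal);
# the end sentinel is None.
class Entry:
    def __init__(self, name):
        self.name = name
        self.near = None          # ordinary link: next entry, ascending
        self.far = None           # far link: proposal to skip ahead

def find(head, name):
    cur = head
    while cur is not None:
        prev = None
        # SKIP AS LONG AS NAME IS HIGH ENOUGH
        while cur is not None and cur.name < name:
            prev, cur = cur, cur.far
        if cur is not None and cur.name == name:
            return cur            # FOUND
        if prev is None:
            return None           # NOT PRESENT
        cur = prev.near           # BACKUP ONE: one near link, then far links again
    return None
```

On a linear tail where far and near links coincide the inner loop simply walks the chain, so searching ends naturally at the sentinel; with a little bookkeeping the same loop also yields the place where a new name belongs.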
A nice feature of this
algorithm is that it
can also be used
to find the place where a new name must be inserted.
A modification works in the case of a split wordlist,
as with ROM and RAM parts.
Instead of the end-sentinel,
we place the start of the second part, which is in
itself organized like the first part.
If the wordlist has been split in parts, we must keep
following far links after we know the name is not found in the current part.
This will lead us as fast as possible out of the first ordered
part into the second ordered part, or to the end.
This becomes
REPEAT
REPEAT
SKIP AS LONG AS NAME IS HIGH ENOUGH
BACKUP ONE
UNTIL FOUND OR NOT PRESENT
SKIP AS LONG AS NAME IS LOW ENOUGH
AGAIN
Surprisingly this is the same as
REPEAT
FOLLOW HIGH LINK
BACKUP ONE IF NOT HIGH ENOUGH
UNTIL FOUND OR END SENTINEL
As a last surprise, this algorithm works correctly
in case of a linear chain
where the far links are the same as the normal links.
It always probes the far link first -- always in vain --,
then goes by the normal links.
And it doesn't stop once past the place where the name is
expected,
but goes on until the end
sentinel.
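This last form can be sketched the same way (again with hypothetical `near`/`far` fields and None as the end sentinel; for a split wordlist the sentinel would be replaced by the head of the second part):

```python
# Sketch of the simplified loop, assuming entries with a `near` link
# (next entry in ascending order) and a `far` link; end sentinel is None.
class Entry:
    def __init__(self, name, near=None, far=None):
        self.name = name
        self.near = near
        self.far = far

def find_simple(head, name):
    cur = head
    while cur is not None:               # UNTIL FOUND OR END SENTINEL
        if cur.name == name:
            return cur                   # FOUND
        step = cur.far                   # FOLLOW HIGH LINK (probe, perhaps in vain)
        if step is None or step.name > name:
            step = cur.near              # BACKUP ONE IF NOT HIGH ENOUGH
        cur = step
    return None
```

On a linear chain where the far links equal the normal links, the probe of the far link gains nothing and the loop degenerates into an ordinary linear search that runs on to the end sentinel, exactly as described above.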
Because a random list can be considered as a bunch of ordered
chains, it almost works for just any wordlist.
The exception is the
first word, which cannot be found if it is larger than the second word.
A disadvantage is that this same algorithm cannot be used
to find the place where a new name must be inserted.
But we can use the old algorithm, and use as an end-sentinel the
address of the first entry in the second part of the table.
One more complication
There is a problem when several entries with the same name (homonyms) are present.
The danger exists that the second one is found instead of the first.
A solution is:
- while sorting the dictionary, keep homonyms in order
- fill in the far links immediately, the same as the normal links,
for all but the last of a set of homonyms
- in setting up the tree, make sure that far links always point to the
first one of a set of homonyms
All this can be accomplished
by considering pre-existing linear chains as one
entry, which leads to some trivial changes in the lowest-level words.
FORGET-ting
Normally
a linked wordlist can accomplish FORGET by cutting out entries.
The entry that points to a to-be-forgotten entry is made to point to
the target of that entry's link.
This works for sorted vocabularies too, and they remain sorted.
The far links can be short-circuited in the same way.
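The cut can be sketched as follows (hypothetical `near`/`far` fields, not ciforth code): every link pointing at the forgotten entry is redirected to that entry's own link of the same kind, so both the ascending order and the far-link structure survive.

```python
# Sketch of cutting one entry out, assuming entries with a `near` link
# (next entry in ascending order) and a `far` link.
class Entry:
    def __init__(self, name, near=None, far=None):
        self.name = name
        self.near = near
        self.far = far

def forget(head, victim):
    # short-circuit every link that points to the victim, redirecting it
    # to the victim's own link of the same kind; order is preserved
    cur = head
    while cur is not None:
        if cur.near is victim:
            cur.near = victim.near
        if cur.far is victim:
            cur.far = victim.far
        cur = cur.near
```

After the cut, any search that would have passed through the victim follows the redirected links instead, and the wordlist remains sorted.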