**[author]**   Marcel Hendrix   email: mhx@iaehv.iaehv.nl
-- ----------------------------------------------------------------------------


**[title]**    tForth: Forth for the Transputer


**[abstract]**

This paper discusses a parallel Forth implementation for the INMOS Transputer;
tForth version 1.0 by the Dutch Forth Workshop.

Besides being a parallel Forth, another important feature of tForth is that it
tracks the documents prepared by X3J14, an American National Standards
Technical Committee (ANS TC) with the mission of standardizing the Forth
language. ANS Forth will materialize sometime in 1993. Several important
consequences for tForth's design are discussed.

The (occam-like) parallel constructs of tForth, its optimizing compiler
and server / client concept are described in detail.

The tForth kernel, sans the parallel extensions, proves easy to port to
other 32-bit CPUs. This process uses the same base set of source files plus a
small hardware dependent kernel file. As a first experiment, a 32-bit
protected mode Forth for the Intel 386+387 and 486 chips was produced:
iForth. Helped by the ANS Forth word ENVIRONMENT? , both these Forths run
our standard set of examples, benchmarks and utilities -- currently a
collection of over two megabytes of text files. Some details of iForth's
implementation are disclosed.


**[introduction]**

tForth was developed by five people over a period of two years; originally
we called ourselves ``the Dutch Transputer Forth Workshop''. The project 
was started and partly sponsored by the Dutch FIG branch, at that time headed
by Hans Nieuwenhuijsen. After two years of implementing, testing and writing
documentation, the project is now finally nearing completion. The original DTFW
crew will stay together to support the transputer product, but a subgroup
is planning to extend the project to other very high performance hardware 
(32-bit RISC and DSP) under the banner of ``The Dutch Forth Workshop''.


**[The tForth system]** 

---The Transputer hardware architecture

The INMOS transputer was first introduced in 1985 [1,2]. The first processor in
the series was the T414, a 32-bit 15 MIPS processor (at 30 MHz) with the
ability to perform multitasking in hardware. New members have been added to the
family regularly, but we'll limit ourselves to the T800, a T414 with an
ANSI-IEEE 754-1985 standard 64-bit floating-point coprocessor integrated onto
the chip (2.25 MFLOPs at 30 MHz), 4 Kbyte onboard RAM and 4 high-speed
communication links. The chip has a smart onboard memory interface that
minimizes external circuitry but still gives 40 Mbytes/s throughput with 100 ns
DRAM.  [PICTURE]

The best-known feature of a transputer is probably its link interface. The
links allow networking of transputers by direct point-to-point connections
without any external logic. The four T800 links support bit rates of 5, 10 and
20 Mbit/s, bidirectionally, and with full automatic DMA control. To a
programmer, there is no handshaking or protocol to take care of, and sending
messages over a link can be done with a single instruction (plus a few to
set up buffer pointers and counts).

Peeking inside the chip reveals that the transputer is a stack machine. It has
two three-level stacks for integer and floating-point expression evaluation.
Apart from these stacks, the transputer has a workspace pointer. This is a
register that points to the base of a workspace where local variables are kept.
When calling a subroutine, room is allocated at the bottom (lowest address) of
the workspace and the integer stack plus the old workspace pointer are copied
there. [PICTURE] 

Transputer multitasking is implemented with on-chip timer and microcoded 
scheduling hardware. The basic item to schedule is called a process.
A process needs a (potentially very small) workspace to save its state and
queueing information, and for its local variables.

At any time, a process is active: being executed or on a list waiting to be
executed, or it is inactive: ready to input, ready to output, or waiting until
a specified time. Inactive processes do not consume any processor time. Input
and output are not necessarily done on a hardware link; there is also a
``memory channel'' that behaves in exactly the same way as the hardware and is
used for on-chip interprocess communication.

Active processes are held in two linked lists of process workspaces, one of
high priority processes and one of low priority processes. A high priority
process runs until completion, or until it becomes inactive (although one may 
assume that to be a programming error!). A low priority process is only
permitted to run for two timeslices (about 2 ms), after that it is forcibly
descheduled at the next available descheduling point.

---Mapping tForth on the hardware; tForth internals

We have decided to use subroutine threading with in-line code expansion for all
those Forth primitives that run significantly slower if they have to execute an
additional call/return pair (like 1+ or DUP) [11,12]. Because we also use the
transputer on-chip stacks to hold the top three items of the corresponding
Forth stacks (data and floating-point), a simple peephole optimizer is able to
eliminate the need to push or pop the software stacks if any two primitives are
compiled directly after each other (without an intervening high-level call) 
[13].  Combined with the fact that a transputer call instruction using a 32-bit
offset is eight (8) bytes long, we find that macro expansion not only gives an
enormous speedup but also leads to somewhat smaller code size.

Actually, our present peephole optimizer is not simple anymore. The first
approach used a state-machine that combined the current stack depth with the
stack requirements of the macro to be compiled, resulting in an optimized
sequence of external d(f)push, d(f)pop instructions or equivalent internal
stack shuffling. We found however that almost every primitive benefited from
some kind of special optimization that was very hard to generalize (e.g.
addition is commutative, subtraction is not, ROT is faster using a local, etc.).
We ended up putting a full-scale stack optimizer in every primitive. The
speedup is about 50% over the state machine approach. However, the transputer
is extremely hostile to the stack concept (a dpush or dpop costs about 8
bytes!) and we suspect that for other CPUs these extreme measures will not be
necessary. For iForth we do not keep the top of the stack in registers and
(therefore) perform much less optimization; nevertheless iForth (on a 33 MHz
'386) is about 3 times faster than tForth (on a 25 MHz T800).

A subroutine-threaded Forth without macro expansion is really not much faster
than for instance a direct threaded system. But if words like ``dup + 2+'' are
implemented as macros, what do you do when the user accesses them in interpret
mode: ``13 dup . .''? And what happens when a macro is ticked: `` ' DUP routine
!  routine @ EXECUTE''?  Although the first problem could be tackled by
compiling input into a buffer and executing it there (watch out for nested
interpreters!), the second problem cannot. Thus, we have decided to implement
a shadow definition for every macro. The shadow is directly executable and can
be ticked and thus indirectly compiled, although that will result in suboptimal
code.  Macros are identified by a special bit in their header, which is used
by ' and FIND to pick the right definition, depending on STATE .  Words like
COMPILE [COMPILE] POSTPONE and COMPILE, need to be really smart when a user
specifies a macro word as their argument. The easy way out is to always compile
the slow non-macro definition, but tForth does extra processing, again relying
on STATE , to optimize these cases.
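
The interplay between a macro and its shadow can be imitated in standard Forth
with a state-smart immediate word. What follows is only a minimal sketch of
the idea, not tForth's actual mechanism (tForth selects the definition through
a header bit consulted by ' and FIND ). The names 1+SHADOW , 1+M and
BUMP-TWICE are invented for this illustration.

```forth
\ Shadow: an ordinary, directly executable definition that can be ticked.
: 1+SHADOW  ( n -- n+1 )  1 + ;

\ Macro: lays down its body in-line while compiling, and falls back
\ on the shadow while interpreting.
: 1+M  ( n -- n+1 )
   STATE @ IF
      1 POSTPONE LITERAL  POSTPONE +   \ compile "1 +" into the caller
   ELSE
      1+SHADOW                         \ interpret mode: run the shadow
   THEN ; IMMEDIATE

: BUMP-TWICE  ( n -- n+2 )  1+M 1+M ;  \ body is  1 + 1 +  with no calls
```

Typing 5 1+M at the keyboard runs the shadow, while BUMP-TWICE contains only
the expanded code. A real optimizer would go further and merge the two
additions at the macro edge.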

Considering the fact that a transputer context switch is in effect a change
of the workspace, we decided to put all of the Forth administration in the
workspace. For tForth this means all of its five stacks (data, return,
system, floating-point and locals) plus the user table. The stack pointers
are implemented as user variables and are thus also located in the workspace.
This means that a switch between Forth processes has no additional overhead as 
it involves a simple swap of workspace pointers, which is accomplished by the
scheduling hardware automatically.

A transputer call instruction changes the workspace pointer.  When all Forth
stacks and the user area are in the workspace, it is clearly unacceptable (and
unnecessary) that the workspace moves around when Forth words call each other.
We solved this by starting every single piece of code with an instruction that
``undoes'' the workspace adjustment caused by a call. This creates a small
problem for the occasional Forth word that in a subroutine-threaded model
effectively ``jumps'' into a piece of code (for instance EXECUTE). All affected
words must simulate a workspace adjustment before they jump.

When discussing transputer hardware scheduling we mentioned that a low
priority process is only permitted to run for two timeslices (about 2 ms)
before it is forcibly descheduled at the next available descheduling point.
This ``descheduling point'' is an almost Forth-like concept resembling PAUSE .
A process can prevent descheduling if it does not use two special instructions,
lend (loop end) and j (an unconditional branch). Of course it must also avoid
becoming inactive. We kept this ``optionally non-cooperating'' feature in
tForth by not using the special loop instruction (it is incompatible with +LOOP
anyway) and by clearly documenting which tForth control flow words (AGAIN
REPEAT , but not ELSE) use the special unconditional branch instruction. Also,
we provide alternative words that perform the same action but cannot be
descheduled. This gives the programmer full dynamic control over the priorities
of Forth processes, using PAUSE-like words of his own design.

---hardware requirements

To be able to use tForth, the hardware must provide for a T4xx or T8xx
processor with at least 256 Kbyte of RAM.  Keyboard, screen and file I/O are
handled by a PC-compatible machine running MS-DOS, MS-Windows, or some flavor
of UNIX.  This machine needs a small plug-in board with a standard INMOS C012
link adapter chip.  Alternatively a UNIX workstation can be used, provided
it is able to handle a C012 chip.  The interface problem is trivial if the
transputer itself is seated on a PC plug-in board, as these cards almost always
provide a C012 compatible hardware interface [4].


**[The consequences of tracking ANS Forth]**

tForth conforms as closely to the future ANS Forth as it is possible to
guarantee at this moment.  All of the words in the dpANS-3 document are
available [6,7,8]. 

A transputer Forth conforming to the Forth 83 standard is a bad idea, as that
explicitly requires the use of a 16-bit model [9]. The T4 and T8-type
transputers cannot access this data type directly; they only handle bytes and
words efficiently (it is no problem for the 16-bit T2). The new ANS Forth does
not specify a virtual machine, and only requires that the size of a cell is an
integral multiple of the size of a character, which should have a minimum of 8
bits (data-stack elements, return-stack elements, addresses and
single-precision numbers are all one cell wide).  A natural choice for the
transputer is thus a byte-sized character and a 32-bit cell.

Of course, there is no such thing as a free lunch. The extra flexibility and
speed are paid for with alignment problems. These have now become the
responsibility of the programmer (if he cares about transportability, that is).
To a transputer a cell is not equivalent to four bytes in a row, as a cell
address should be a multiple of four. ANS Forth specifies the words CELL+ CELLS
CHAR+ CHARS ALIGN and ALIGNED for use with , C, ALLOT etcetera.  Furthermore
CREATE must guarantee that aligned addresses are passed into DOES> code.  Stack
comments specifying addresses now mention if these addresses should be aligned.

To a user of tForth alignment problems are hardly visible.  The only problem
we frequently encountered arises when using CREATE to build records of mixed
datatypes and forgetting to use strategically placed ALIGN words. This happened
often when transporting quick hacks from iForth to tForth: iForth uses 32-bit
cells too, but the hardware is able to fetch a word from an unaligned address,
so a programming error goes unnoticed.  Therefore special attention must be
paid to the mixed use of C, and , and to the compilation of strings.
Floating-point numbers must also be aligned (single and double-precision IEEE
numbers plus the processor-specific internal format must be supported).
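
As an illustration of the record problem, here is a small sketch in standard
ANS Forth. The record layout and the names POINT-REC , REC-TAG and REC-VALUE
are made up for the example; only the alignment words come from the standard.

```forth
\ A record holding a one-character tag followed by a cell-sized value.
CREATE POINT-REC
   CHAR P C,      \ one character: the data space pointer may now be unaligned
   ALIGN          \ pad to the next cell boundary before using ,
   1234 ,         \ cell-sized field, stored at an aligned address

: REC-TAG    ( rec -- char )  C@ ;
: REC-VALUE  ( rec -- n )     CHAR+ ALIGNED @ ;   \ mirrors the ALIGN above
```

Leaving out ALIGN (and the matching ALIGNED ) would still appear to work on
hardware that tolerates unaligned fetches, which is exactly how such errors
travel unnoticed from one system to another.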

The problems become really interesting when the hardware does not support bytes
and thus CHAR+ is not equivalent to 1+ anymore. We expect to find a lot of
invalid assumptions in our metacompiler code once we get to build a TMS320C30
target (this DSP chip treats everything as a 32-bit item, so CHAR+ and CELL+
must be equivalent).

---LOOPing
ANS Forth still specifies the +LOOP algorithm introduced with Forth 83. This 
means the loop index must be incremented or decremented across the boundary
between the loop limit minus one and the loop limit in order for looping to
stop.  This specification allows a loop index to address all of memory (64 
Kbyte for a 16-bit Forth) but it rules out generating efficient machine code 
for a transputer. Essentially the Forth-83 LOOP wants to treat the loop index
as an unsigned number. For RISC-like reasons, the transputer treats addresses
as signed numbers. This saves a flag register that is difficult to handle on a
stack computer and presents extra context to save when a task switch occurs.
The transputer can test for overflow after an addition with the testerr
instruction. The problem with (+)LOOP deliberately generating an overflow to
signal loop end is that operating system software may consider this to be a
serious system error and is able to halt the complete transputer if it occurs.
All it takes is to set the transputer's halt-on-error flag. tForth assumes this
flag is never set, and our bootloader explicitly resets this flag at startup.
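
The boundary rule is easiest to see with a small step size. The sketch below
is plain ANS Forth; the word names are ours. Each word adds up the indices it
visits so that the behavior can be checked without reading printed output.

```forth
\ Counting up: the loop stops when the index crosses from 9 to 10 or beyond.
: UP-SUM    ( -- n )  0   10 0 DO  I +   3 +LOOP ;   \ visits 0 3 6 9

\ Counting down: the loop stops when the index crosses from 0 to -1 or below,
\ so the limit value itself would still be visited if the index landed on it.
: DOWN-SUM  ( -- n )  0   0 10 DO  I +  -3 +LOOP ;   \ visits 10 7 4 1
```

UP-SUM returns 18 and DOWN-SUM returns 22. The asymmetry between the two
directions is a direct consequence of testing for a crossing between the limit
minus one and the limit.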

---division
We like the way ANS Forth ``solves'' the question of floored versus symmetric
division. The implementor is free to choose the method used in the kernel, as
long as the documents clearly specify which method is used. Standard ANS Forth
programs must include code to generate the desired form of division and may not
rely on the implementation-defined result. tForth uses symmetric division
because the hardware directly supports it. Besides, we have not seen much code,
apart from contrived examples, that really needs the floored variant. ANS Forth
requires that SM/REM and FM/MOD are available in CORE , so a simple
redefinition at the top of a source file is all it takes to switch between the
two variants:

 : /    S" SWAP S>D ROT FM/MOD NIP "  EVALUATE ; IMMEDIATE  \ n1 -> d first
 : MOD  S" SWAP S>D ROT FM/MOD DROP " EVALUATE ; IMMEDIATE  \ FM/MOD needs d n

The code shown will not obstruct the workings of the optimizer.
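
For readers who do not have the two conventions at their fingertips, the
following sketch compares them on a negative dividend. It uses only standard
words; the wrappers SYM and FLR are invented names that widen a single-cell
dividend to the double-cell form that SM/REM and FM/MOD expect.

```forth
: SYM  ( n1 n2 -- rem quot )  >R S>D R> SM/REM ;   \ symmetric division
: FLR  ( n1 n2 -- rem quot )  >R S>D R> FM/MOD ;   \ floored division

\ -7 2 SYM  leaves  -1 -3   (quotient truncated toward zero)
\ -7 2 FLR  leaves   1 -4   (quotient rounded toward minus infinity)
```

For operands of equal sign the two words give identical results, which is why
so little code is sensitive to the choice.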

---restricted access to code and headerfields 
The dpANS documents describe the Dictionary as "an extensible structure
that contains definitions and associated data space. The form of this structure
is not specified by the standard, but it may be described as consisting of
three logical parts: ordered word list(s) that may be searched for word names,
a code space where the actions of the definitions are stored, and a data space.
Of these, only data space may be directly addressed by a standard program."

By this definition, words like PFA , NFA , CFA etcetera are no longer required,
and an implementation can use any algorithm it sees fit to organize the word
headers and the information stored therein. Furthermore, it means special 
threading techniques and in-line code generation are now legal. For a user,
it means something like : FOO [ ' BAR , ] ; is no longer considered portable
code (if it ever was). Likewise, 3 CONSTANT FUBAR  5 FUBAR 2+ ! must be frowned
upon. Regrettably, forbidding access to word headers and vocabulary internals
makes illegal some elegant OOP techniques [10].

tForth uses the relaxed rules of ANS Forth to implement a subroutine-threaded
Forth where some primitives are expanded in-line and optimized at macro edges
for optimal stack access. The headers are separated from the code (this makes
implementing locals much easier) and use a binned hashing technique with
256 threads for every vocabulary (or wordlist). The header table contains 
extra information to aid the optimizer (more flags) and contains an additional
pointer to optional forget code that is executed when FORGET walks the complete
set of wordlists and removes user definitions one by one.

[PICTURE]
 Header: flags(word) forget(ptr)  hash(ptr)  code(ptr)  link(ptr)  name(token)

Flags holds the bits that signal IMMEDIATE, COMPILE-ONLY, SMUDGED, MACRO, <not
in ANS Forth> and PRIVATE .
The hash field is used to look up a word in a given wordlist. 
The code field holds a pointer to the executable code.
The link links all words in a wordlist. This is essential for a safe and 
bullet-proof FORGET . For instance, it allows the following:

 VOCABULARY one   ALSO one DEFINITIONS  : hello  ." hello!"  ; 1000 ALLOT
 PREVIOUS
 VOCABULARY two   ALSO two DEFINITIONS  : hello2 ." hello2!" ; 1000 ALLOT
 PREVIOUS  ALSO one DEFINITIONS  : hello3  ." hello3!"  ; 

 FORGET hello  ( or even FORGET one . Try it on your system... )

Name is a standard Forth counted string.

---POSTPONE COMPILE [COMPILE] COMPILE, 
We already saw that access to code and headerfields has become restricted.
This affects COMPILE and [COMPILE] . On systems that generate machine code, but
do not have tForth's luxurious shadow words, things like COMPILE DUP do not
work, or have to be written [COMPILE] DUP . POSTPONE has been invented to
circumvent the IMMEDIATE problem, although there are still some pathological
cases where it cannot do what you want. The famous example is COMPILE
[COMPILE] IF .

COMPILE, provides a portable alternative for the common coding practice:  [ '
DUP ] LITERAL , . It should now be written  ['] DUP COMPILE, but again, it will
only work if the programmer knows the ticked word is not IMMEDIATE .
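
A short sketch of both words in use, in standard ANS Forth. The helper names
MY-THEN , DUP, , ABS2 and SQUARE are invented for the example.

```forth
\ POSTPONE copes with an immediate word: MY-THEN behaves like THEN.
: MY-THEN  POSTPONE THEN ; IMMEDIATE

\ COMPILE, appends the execution semantics of a ticked word.
\ This is only safe because DUP is known not to be immediate.
: DUP,  ['] DUP COMPILE, ; IMMEDIATE

: ABS2    ( n -- u )    DUP 0< IF NEGATE MY-THEN ;
: SQUARE  ( n -- n*n )  DUP, * ;
```

On a system that expands primitives in-line, the DUP compiled through
COMPILE, may well be the slower, non-macro version, in line with the remark
about ticked macros earlier in this paper.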

---ENVIRONMENT?
This CORE word enables one to check for strings in the ``present environment''.
It takes a string argument from the stack and returns false or <value> true.
We use this word a lot in our Forths, but almost always to look for strings we
defined ourselves, never for the ones ANS Forth thinks are needed. We use it to
check for server version number, floating-point size (we have 3 now, maybe 4
later), and if we are running tForth or iForth. Our systems implement it by
using a special hidden wordlist, but ANS Forth allows this to be done by other
means (presumably checking a disk-based database).
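
The two typical uses can be sketched as follows. The query string
/COUNTED-STRING is taken from the ANS Forth list of environmental queries;
the string TFORTH , the assumption that it returns a single cell, and the
word names are hypothetical and stand in for the private strings mentioned
above.

```forth
\ A standard query, with a conservative default if it is not known.
: MAX-COUNTED  ( -- n )
   S" /COUNTED-STRING" ENVIRONMENT? 0= IF 255 THEN ;

\ A private query: true only if the system knows the (hypothetical)
\ string TFORTH; the single-cell value that comes with it is discarded.
: ON-TFORTH?  ( -- flag )
   S" TFORTH" ENVIRONMENT? DUP IF NIP THEN ;
```

A source file can then select system-specific code with ON-TFORTH? and the
usual conditional compilation words.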

---memory allocation 
The tForth parallel extensions need dynamic memory allocation to implement
the PAR structure. Luckily ANS Forth provides it in a usable form (direct
addressing) although a handle-based scheme like that proposed in [10] is more 
like the state of the art, and could have been implemented painlessly on 8-bit 
and 16-bit systems.
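
A minimal sketch of the direct-addressing style, assuming the ALLOCATE and
FREE words with an ior result as they appear in the ANS Forth
Memory-Allocation word set, plus THROW from the Exception word set. The word
ALLOC-DEMO is invented for the example.

```forth
: ALLOC-DEMO  ( -- n )
   4 CELLS ALLOCATE THROW   \ ( a-addr )  a nonzero ior is thrown
   42 OVER !                \ the block is addressed directly, no handle
   DUP @ SWAP               \ ( n a-addr )
   FREE THROW ;             \ ( n )  give the block back
```

With a handle-based scheme the address would have to be looked up, and
possibly locked, before each access; here it is simply an address.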

---Floating-point
Again, the presence of the Floating-point wordset in ANS Forth is exactly what
we needed for both iForth and tForth. Both Forths use the hardware coprocessor
and the hardware floating-point stacks. The standard sorely lacks flexible
tools for floating-point number formatting, like <# does for integers (there
are words to print in common formats, and recently REPRESENT was added). Also,
there is no way to use floating-point locals, which is a serious omission as
there are at least 3 different floating-point formats.

---Multitasking and BLOCKs.
Multi-tasking is not part of ANS Forth. Nevertheless, in the Informative Annex
generic multitasking is described using the PAUSE concept (implicit in a set of
I/O words). Fortunately, we can easily implement a PAUSE-like concept on a
micro-scale in tForth (that is, within user-defined words). However, once in the
system an I/O word can be descheduled because its time slice runs out, not only
because it has finished accessing the disk block or buffer.  What we do is
acquire a special semaphore in the block buffer words. This semaphore is
released in every I/O word documented in the ANS Document (they give a list).
When a process has the semaphore, it can freely access the buffer. This 
solution does not extend easily to more than one block buffer, so we would
rather have seen ANS Forth words like LOCK and UNLOCK .

Sequential files do not have this problem because most operating systems give
out unique fids when a file is opened more than once (what happens is
therefore environment-dependent).

**[Parallel Forth]**

---Transputer links and channels
The parallel processes enabled by the transputer hardware communicate 
by the use of ``channels'' [2,3,5].  Communication regulates data transfer
and also decides which process is to run.  A communication proceeds along a
channel that connects two processes, the output and the input process.  The
output process stops until it is able to hand its data to the input process.
The input process stops until the process on the other side of the channel is
ready to deliver its data.  In principle, there is no limit to the number of
channels connected to a process, but there is always just one input and one
output process that relates to a channel.

A ``channel'' in the above description can be a hardware bi-directional link, of
which there are a limited number (between 2 and 4), or a memory channel, of
which there can be an almost unlimited number (they only need one word of
memory).  Of course, processes running on different transputers can only
communicate using hardware links. Therefore the ideal, writing code to run on
n processes without needing knowledge of the physical location of these n
processes, is unattainable in practice. The new generation of transputers
features an unlimited number of ``virtual channels'', that is, the link
messages are multiplexed and routed automatically [14]. This eliminates the
differences between a memory channel and a hardware link, and thus the
difference between local and remote processes.

The CHANNEL and LINK definitions
--------------------------------
A memory channel is created by the defining word CHANNEL , followed by a
name.  The channel created is initialized automatically, i.e. it is ready to
be used.  It is the user's responsibility to handle it properly; this means
observing the rule of attaching exactly one input and one output process to
this channel.

The channels LINK0 LINK1 LINK2 LINK3 and EVENT are pre-defined. The first four
refer to the hardware channels that connect the transputers to each other,
as opposed to the memory channels defined using CHANNEL . These hardware
channels behave identically to those defined using CHANNEL down to the
machine code level, as far as a single input or output action is concerned.
The EVENT channel is special in that it is read only. Nothing but external 
circuitry is able to restart the process thus descheduled. The data read is
garbage and should be thrown away.
Note that, although a link is bi-directional and thus is equivalent to two
separate memory channels, it is possible to refer to both with the same name.
tForth's high-level channel i/o words sort out the hardware details for
themselves. As an example, the following syntax is correct:

 LINK1 CHANNEL-C@  3 +  LINK1 CHANNEL-C!
 
This reads a byte from LINK1, adds three to it and sends it back over the
same link.

Channel Input and Output operations
-----------------------------------
A channel can be seen as a transputer address representing a hardware link
or a memory channel.

The word CHANNEL-@ fetches a word from a channel, CHANNEL-! stores a word to 
a channel. They are similar to @ ! . 
As already used above, CHANNEL-C@ CHANNEL-C! fetch and store bytes,
similar to C@ C! .
Finally we have CHANNEL-SEND CHANNEL-RECEIVE . Both accept a string address 
plus count and a channel. CHANNEL-SEND sends count bytes located at string 
address over the channel. CHANNEL-RECEIVE receives count bytes from the 
channel and stores them at string address. Note that the byte counts on both
sides have to agree. Normally the count is communicated by a separate CHANNEL-@ CHANNEL-!
exchange.
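
The count exchange could be packaged as in the sketch below. It uses only the
tForth channel words named above and therefore runs on tForth only. The stack
effects ( x chan -- ) for CHANNEL-! , ( chan -- x ) for CHANNEL-@ and
( c-addr u chan -- ) for CHANNEL-SEND and CHANNEL-RECEIVE are assumptions
inferred from the descriptions in this section, and the names SEND-MSG and
RECEIVE-MSG are invented.

```forth
\ Sender: transmit the count as one word, then the bytes themselves.
: SEND-MSG  ( c-addr u chan -- )
   2DUP CHANNEL-!        \ ( c-addr u chan )  count goes first
   CHANNEL-SEND ;

\ Receiver: read the count, then exactly that many bytes into the buffer.
\ The caller must supply a buffer that is large enough.
: RECEIVE-MSG  ( c-addr chan -- u )
   DUP CHANNEL-@         \ ( c-addr chan u )
   DUP >R SWAP           \ ( c-addr u chan )  R: u
   CHANNEL-RECEIVE  R> ;
```

Because both sides take the count from the same word on the channel, the two
transfers cannot disagree about the number of bytes.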

In order to make a hardware link equivalent to a memory channel, it is
sometimes useful to hide the fact that the name of a link corresponds to two
memory addresses, one for input, the other for output. Note that use of the
CHANNEL-xx words already guarantees this automatically, but in some
situations it is better to have explicit conversion words. >INPUT-CHANNEL
and >OUTPUT-CHANNEL convert a channel address to the address needed for the
transputer hardware in and out instructions.

Language constructs using channels
----------------------------------
The SELECT statement in tForth allows a process to make a choice over its
future behavior dependent on the readiness of other concurrent processes to
communicate with it over a channel. Several input sources are possible for this
construct. We will concentrate here on those components directly using
channels. An informal description follows:

 SELECT
	<channel> GUARD <code> ENDGUARD
	...
	<channel> <boolean> ?GUARD <code> ENDGUARD
	...
	<end> <start> REPLICATE <channel> GUARD <code> ENDGUARD
	...
	<end> <start> REPLICATE <channel> <boolean> ?GUARD <code> ENDGUARD
	...
 ENDSELECT

Here ``<channel>'' is used to describe an arbitrary series of tForth words that
should result in a channel address put on the data stack, for use by GUARD .

The word ``<code>'' describes any tForth code. This <code> is executed on the
condition that something can be read from the channel GUARDed.

Again, the ``<boolean>'' can be generated by any sequence of tForth words. 
?GUARD needs a boolean true and a ready channel to activate its code.

The REPLICATE statement should be seen as DO LOOP in disguise. Between 
REPLICATE and GUARD , the loop index I can be used to compute channel addresses
for GUARD . If one of these computed channels becomes ready, the corresponding
index is pushed on the data stack and the <code> after GUARD is executed.

Time
----
There are two clock registers, one for each priority level. The high priority
clock increments every microsecond. The low priority clock increments every
64 microseconds. 

Time is cyclic. Whenever a clock register reaches the maximum positive number
fitting in a CELL , it ``increments'' to MININT . Time values must be
considered unsigned values and manipulated with unsigned operators like U<
and U> . As tForth pays no attention whatsoever to the error flag, it
is perfectly okay to use + to add time values together.

A disadvantage of the internal clocks is that their intervals depend on the
process priority.  Therefore tForth provides the words >MS and >TICKS that
work independent of this priority. The words accept or output data in
millisecond format, like the ANSI standard word MS .
With >MS a count in ``timer ticks'' is translated to real elapsed time in
milliseconds. With >TICKS , a real time in milliseconds is translated to
timer ticks. Both words determine the priority of their callers to decide on
the conversion factor. 
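
The arithmetic behind this can be sketched in standard Forth. The words below
are illustrations under stated assumptions, not tForth's own definitions:
ELAPSED relies on cell arithmetic wrapping around silently, and LO-TICKS>MS
hard-codes the 64 microsecond low priority tick instead of inspecting the
caller's priority as >MS does. Both names are invented.

```forth
\ Interval between two readings of the same clock register.
\ Modular subtraction gives the right answer even if the register
\ wrapped around once between the two readings.
: ELAPSED  ( start now -- u )  SWAP - ;

\ Low priority ticks of 64 microseconds to milliseconds, using a
\ double-cell intermediate so that large unsigned counts do not overflow.
: LO-TICKS>MS  ( u -- ms )  64 UM*  1000 UM/MOD NIP ;
```

For example, a start reading of -5 and a later reading of 3 give an interval
of 8 ticks, and 15625 low priority ticks correspond to 1000 milliseconds.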

Reading the time
----------------
The present value of the relevant timer register can be read with ?MS . The
time since switch-on is returned in milliseconds. (tForth does not attempt to
provide a real time-of-day clock. On a 32-bit transputer the low priority timer
register will cycle every 76 hours, but on a 16-bit transputer this will only
be 4.2 seconds.)

Waiting 
-------
It is possible to halt a process for a specified number of milliseconds. The
word to use is MS . Maybe we should stress that this word only delays the
process calling it, not the transputer itself; other processes and link i/o
will keep on running. For a 32-bit transputer the limits for high and
low-priority processes are 1.2 and 76 hours respectively.

Language constructs using TIME
------------------------------
The SELECT statement in tForth allows a process to make a choice over its
future behavior dependent on signals from the system timer. We will concentrate
here on those components directly using the timer. An informal description
follows:

 SELECT
	<timeout> TIMEOUT <code> ENDGUARD
	...
	<timeout> <boolean> ?TIMEOUT <code> ENDGUARD
	...
	<end> <start> REPLICATE <timeout> TIMEOUT <code> ENDGUARD
	...
	<end> <start> REPLICATE <timeout> <boolean> ?TIMEOUT <code> ENDGUARD
	...
 ENDSELECT

Here ``<timeout>'' is used to describe an arbitrary series of tForth words that
should result in an unsigned number being put on the data stack, for use by 
TIMEOUT or ?TIMEOUT . This number describes a time interval in timer ticks.

The word ``<code>'' describes any tForth code. This <code> is executed on the
condition that the specified time interval has elapsed.

Again, the ``<boolean>'' can be generated by any sequence of tForth words. 
?TIMEOUT needs a boolean true and a counted down time interval to
activate its code.

The REPLICATE statement was described with GUARD , above. 

An example of the above follows:

\ Needed now that tForth REQUESTS keys to be sent.
: KEY-REQUEST 	{{ ITERM! 0 BOOTLINK @ CHANNEL-C! }} ;

: X6
	CR ." Type any key to continue (you have a few seconds...)"
	KEY-REQUEST
	SELECT
		BOOTLINK @ 
		GUARD 
			BOOTLINK @ CHANNEL-C@ 
			CR ." You typed '" EMIT ." '" 
		ENDGUARD
		
		100000 
		TIMEOUT CR ." Time out, press a key " 
		        BOOTLINK @ CHANNEL-C@ DROP 
		ENDGUARD
	ENDSELECT ;

An example of REPLICATE:

: X5-CALCULATE-TIMEOUT 7 = IF 10000 ELSE 100000 THEN ;

: X5
	SELECT
		10 0 REPLICATE
		I X5-CALCULATE-TIMEOUT TIMEOUT 
			7 <> ABORT" Failure in X5" 
		ENDGUARD
	ENDSELECT ;
	

Concurrent Language Features
----------------------------
tForth supports true and apparent concurrency. In true concurrent systems
each parallel activity is executed by a different physical processor. If
parallel execution is simulated by a single-processor system it is said to
use apparent concurrency. In tForth the distinction between both forms of
concurrency is explicitly programmed by the user. Both forms have advantages.
The same language constructs can be used for both. However, as we have
already seen with the transputer link hardware and channel concept, important
differences exist that reflect the physical implementation. Here we will
discuss the apparent concurrency features only, as the true concurrent
features of tForth are planned for version 2.0 and cannot be discussed yet.
This does not mean true concurrent programming is impossible, it only states
that powerful high-level constructs like automatic placement of processes on
a transputer network are not yet available. However, the basic tools to help
the user build these utilities are already present in tForth version 1.0.

To simulate parallel tasks on a single-processor system, executing the
instructions of one task is interleaved with running all other tasks. The
transputer hardware handles this automatically. This ``slicing'' is
completely transparent to the user: there are no special requirements for the
use of data structures or program control statements; a task or process may
be structured just like any other definition. However, special syntax exists
to create, start and stop processes.

The transputer supports running a task at two levels of priority [3]. The
hardware maintains four queues, an active process queue and a timer queue for
each of the two priority levels. A process can be in one of four possible states:
executing; waiting to execute, which implies that it is in one of the two
active process queues; waiting for a timer event, which implies it is in one
of the two timer queues, or waiting for a communication event, in which case
it is in no queue.  A high priority process will execute without interruption
until it terminates, or waits for a timer or communication event to take
place. In this case, if there are any further high priority processes waiting
to proceed, then the process at the head of the high priority active queue
will be scheduled. If there are no high priority processes waiting to
execute, then the next waiting low priority process will be scheduled. Low
priority processes may be pre-empted at any time by a high-priority process
that becomes ready to run. Low priority processes are time-sliced; if a low
priority process executes a \f{j,} or \f{lend,} instruction, and has been
executing for more than its time-slice period, it is descheduled and placed
at the back of the low priority active queue, with the process at the head of
the queue commencing execution.

We already mentioned that tForth uses a combination of time-slicing
and a method involving the word PAUSE . A process can immediately suspend its
execution by calling PAUSE .
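
As a minimal sketch of this cooperative style (not taken from the tForth
sources), a process can poll a variable and give up the processor between
polls. The names ready and await-ready are invented for this illustration, and
PAUSE is assumed to have the stack effect ( -- ).

 VARIABLE ready   0 ready !       \ set to TRUE by some other process

 : await-ready ( -- )
    BEGIN  ready @ 0=  WHILE      \ nothing to do yet?
       PAUSE                      \ then let the other tasks run
    REPEAT ;

A process that calls await-ready does not burn a whole time-slice on polling;
it yields after every unsuccessful test.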

Asynchronous processes
----------------------
A colon definition is a routine that will execute in the sequence of other
colon definitions called by the main program. A task or process is a routine
that will execute in parallel with the calling program, immediately after it
is started. It has been a design decision of tForth not to have special
compiler syntax for routines that are to be run as a task. The difference
with normal colon definitions rests solely in how they are started, and what
happens after they run to completion. Typing the name of a colon definition
runs its code as a single task, just as in a normal Forth system.
Actually, this means the colon definition runs \em{instead} of the default
interpreter loop. If it finishes, the interpreter automatically restarts. The
same colon definition can be run \em{concurrent with all other active tasks}
if special syntax is used to submit it. Carefully note that if it runs to
completion (by executing the final \f{;}), there is no interpreter to regain
control. This means a task must explicitly execute STOP or \f{stopp,} when it
finishes. Having the word STOP is advantageous, because it completely saves
the state of the task in its private workspace. Using RERUN , these tasks can
be restarted at the exact point where they left off.

The necessary steps to be able to run a task concurrently are (1) create a
workspace and (2) pass the wanted priority plus the address of the newly
allocated workspace plus the machine address of the code to be run to the
transputer hardware. As already mentioned, the code must contain an infinite
loop or execute the STOP statement when it finishes.  Because the machine
executable address of a colon definition is not equivalent to its Forth
execution token, the word RUN is available as a conversion operator.

We'll describe one of the two kinds of asynchronous concurrent processes
supported by tForth. Methods are discussed to create, destroy, run and stop
processes. Some simple examples will be given.

Process priorities
------------------
In ``tForth and Concurrency'' we mentioned that the transputer
supports two task priority levels; high priority and low priority. tForth
allows Forth tasks to run at both these machine priority levels.
Consequently, the words LO-PRIO and HI-PRIO are available to be used together
with the RUN command (to be explained below). 

The word GETPRIORITY returns the priority level of the calling task. Its dual,
SETPRIORITY , changes the priority level of the caller, at the same time 
leaving the current priority level on the data stack.  A guaranteed feature of
this word is that it will \em{not} cause descheduling.
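
The following sketch shows how these words might be combined to update a
shared variable without being pre-empted on a single transputer. It assumes
that SETPRIORITY has the stack effect ( new -- old ) and that it accepts the
value left by HI-PRIO ; neither detail is spelled out above. The names shared
and bump are hypothetical.

 VARIABLE shared   0 shared !

 : bump ( -- )
    HI-PRIO SETPRIORITY     \ ( -- old ) raise priority, no descheduling
    1 shared +!             \ a high priority process is not interrupted
    SETPRIORITY DROP ;      \ restore the old level, discard the HI-PRIO value

Because SETPRIORITY is guaranteed not to deschedule the caller, no other
process can slip in between the priority switch and the update.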

We needed SETPRIORITY for the tForth SEMAPHORE concept.  It was very difficult
to find a way to do priority switching. The transputer instruction set does not
support it directly, probably because there is no use for it in an occam
context.

Concurrent Forth processes 
--------------------------
The workspace of a Forth process is created using the defining word 
FORTH-PROCESS and contains all five Forth stacks and a partially initialized
user area (that is: copied from the starting process). Code to be run in this
workspace can be programmed using all features of the regular tForth
compiler. There is one exception: compilation by sub-processes is not
supported, because these processes may not modify the shared, but
unprotected, header space. However, it is acceptable to compile executable
code fit for other processes to use (note the subtle difference). In general,
you should not try to abuse tForth as a multi-user system; it is not designed
for this.

There is no compile-time parameter to FORTH-PROCESS as the size of stacks and
user area is fixed. A Forth task needs about 8 KByte of memory to run.

A feature of the word created with FORTH-PROCESS is that it not merely returns
the address of the allocated workspace, but that it will actually start a
concurrent Forth process when passed a priority and a machine executable
address.  When the process executes \f{STOP} the transputer hardware
automatically writes data at negative offsets of the workspace pointer in
order to allow a restart of the process. This information is used by RERUN to
let the process continue just after the point where it left off.

An example:

 FORTH-PROCESS process
 VARIABLE count
 
 : counter  
	BEGIN	 	\ begin ... again allows RERUN
 	     1000 0 DO  
		       I count +!  
 	          LOOP  
	     STOP 	\ do one iteration only 
	AGAIN ;

	LO-PRIO  ' counter RUN process

	count ?		\ result of the first iteration
 
	' process RERUN
 
	count ?		\ result of the second iteration
 
 
The process identifier
----------------------
We now have seen that each high-level process in tForth is created by the
defining word FORTH-PROCESS (except for the interpreter process). One of the
important parameters that tForth keeps in the parameter field of a process is
the address of the workspace that is allocated for it.  As this is a unique
address, it is ideally suited as a Process IDentifier or PID . Various
important parameters are stored at negative offsets from the PID.  You can use
'PID to find the Process IDentifier of a named process.  Example:

 FORTH-PROCESS process
 ' process  ( cfa) 'PID ( PID) H.  ( might print $80031B00)

Do not confuse 'PID and PID. The tForth word PID returns the process
identifier of the interpreter or main process. The server will try to put the
workspace of the main process in transputer on-chip RAM, so PID will in most
cases be very nearly equal to MININT .

Running, stopping and killing processes
---------------------------------------
We now know how to run one of the two kinds of processes tForth supports.
It comes down to executing the word defined by FORTH-PROCESS after having put
the priority and the address of the executable code on the data stack. Remember
that the word RUN is only a conversion operator that serves to convert
execution tokens to machine addresses suitable for these defined words.

Stopping a Forth process is accomplished by having it execute STOP or
\f{stopp,}. A machine process can only stop by executing \f{stopp,}.

Killing a process is the operation of forcing a process to execute STOP or
\f{stopp,}. It probably should be avoided, but it has its uses, especially
when FORGETting code containing process definitions \footnote{tForth already
uses KILL for this purpose internally.}.

KILL needs an execution token on the stack. It will try to stop the process
that is identified by this token, removing the process from all the process
queues. It works by redirecting the stored instruction pointer of the
process to code that executes \f{stopp,}, and making sure the process runs
at least one more time to execute that instruction. There is a check whether
the word corresponding to the execution token was created by FORTH-PROCESS ,
but it cannot be 100% foolproof. If it fails, a random word is stored at a
random memory address.

There are difficulties with killing processes:
 
 - When a process is in a timer queue, you cannot KILL it.  
 - If a process controls one or more semaphores, you'd better not KILL it, 
   as that will lock the resources connected to the semaphores.
 - If a process is doing channel i/o, it is not in any queue and thus the
   workspace can be removed without doing anything special. However, i/o
   implies \em{two} processes are involved. If one of these processes is 
   performing the KILL , there is still no problem. If an outsider process
   (most likely the root process) does the KILL , it is possible that the 
   still living process tries to communicate with the process now KILLed,
   with potentially disastrous results.

These three conditions cannot be detected by tForth itself. The programmer
is expected to override the FORGET fields of processes with such behavior. A
crude solution would be to let the process poll a flag that is set by the
forget part. The forget code then executes PAUSE until the process has read
the flag. A transputer-like solution for this is to use a CHANNEL .

      Example usage:   ' name KILL
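
The flag-polling alternative to KILL could be sketched as follows. This is an
illustration only: the names worker, quit?, ticks, work and retire are
invented, and how retire would be hooked into the FORGET field of a process is
not shown because it is not described here.

 FORTH-PROCESS worker
 VARIABLE quit?   0 quit? !       \ set by whoever wants worker to stop
 VARIABLE ticks   0 ticks !

 : work ( -- )
    BEGIN
       BEGIN  quit? @ 0=  WHILE   \ keep working until asked to quit
          1 ticks +!  PAUSE
       REPEAT
       0 quit? !                  \ acknowledge: the flag has been read
       STOP                       \ park here; RERUN would resume after this
    AGAIN ;

 : retire ( -- )                  \ what a custom forget action might do
    TRUE quit? !
    BEGIN  quit? @  WHILE  PAUSE  REPEAT ;

 LO-PRIO  ' work RUN worker
 retire

The worker stops itself at a point of its own choosing, so it is never in a
timer queue or in the middle of channel i/o when it stops. A production
version would also have to make sure that the worker has really reached STOP
before its code is forgotten.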

Synchronized concurrent processing
----------------------------------
We will now describe one of the two tForth constructs that allow synchronized
concurrent processing. Occam would call this a ``PAR construct''. The syntax
is discussed and methods are shown to pass parameters to the separate
processes.  Simple examples are given throughout.

In tForth's PAR construct all started concurrent processes are forced to wait
until each and every process has finished and the main line of sequential
processing can continue. The ``waiting to be finished'' bit, the
synchronization, is the only feature distinguishing them from a set of
asynchronous concurrent processes as described in the previous paragraphs.  Of
course there are superficial differences too; for instance, the user need not
allocate named workspaces for each and every process needed in a PAR .
Allocation and deallocation code is generated by the compiler automatically.

As a quick preview, inside a colon definition a standard PAR will look as 
follows:

 VARIABLE gorilla	  1 gorilla !
 VARIABLE bananas	111 bananas !
 
  : ZOO
  	PAR
  	   STARTP  -24 bananas +!  ENDP
	   STARTP    1 gorilla +!  ENDP
	ENDPAR
	CR bananas ?  gorilla ? ;

Between the PAR and ENDPAR two processes are started in parallel. Their
routines are specified between STARTP and ENDP . This example carefully avoids
the problems that arise when in the course of their execution the processes
need to modify the same variable.
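
Because the two processes touch different variables, the result of ZOO must be
the same as that of running both bodies one after the other. The sequential
equivalent below uses only standard Forth words, so it can be checked on any
ANS Forth system; the name zoo-seq is made up for this sketch.

 VARIABLE gorilla     1 gorilla !
 VARIABLE bananas   111 bananas !

 : zoo-seq ( -- )
    -24 bananas +!                \ body of the first process
      1 gorilla +!                \ body of the second process
    CR bananas ?  gorilla ? ;

 zoo-seq                          \ prints 87 2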

Preserving priority 
-------------------
It is not necessary for parallel processes to have the same priority.
Thus one might suspect that the priority of the main sequential process 
depends on the priority of the parallel process finishing last. In fact, the
priority of the main process is explicitly restored by all of tForth's PAR
constructs.

How to define a high-level PAR
------------------------------
tForth's high-level PAR construct is designed to be used with dynamically 
allocated workspaces and routines written in high-level tForth. The workspace
of a sub-process contains all 5 tForth stacks and a copy of the user area of
the main sequential process.

Occam splits off n-1 tasks if asked to run n processes. The root process takes
over the one remaining parallel task and runs it in its own workspace.
This would be wrong if one of the parallel tasks wants to start its own
set of parallel jobs: then two or more processes (at different ``task depths'')
will end up with the same workspace and overwrite each other's data!

tForth therefore starts up exactly n NEW processes for a PAR containing n
sub-processes. The root process executes \f{endp,} after it has allocated and
started the n tasks.

To define a high-level parallel construct four words are available: PAR ,
STARTP , ENDP and ENDPAR .

PAR marks the start of a list of sub-processes. The routine definition for
each sub-process is enclosed by STARTP and ENDP . These two words generate
code that takes care of workspace allocation and deallocation.  ENDPAR marks
the end of a list of processes.  It compiles code that waits until all of the
processes that are in the list of processes have terminated and makes sure
sequential processing proceeds with the same priority level in effect as 
before the PAR was started.

Example usage, a parallel multiplier: 
     
\ First create a parameter area. Each parallel process can access this area.
\              item#: 0   1    2    3     4      5
	CREATE data   0 , 11 , 12 , 123 , 3311 , 0 ,

: TH	data []CELL ;	( index -- address )

: ZOTZ
	PAR 	
	   STARTP 	\ multiply items 1 and 2 and store at 0
		1 TH @	\ get item 1
		2 TH @	\ get item 2
		  *	\ multiply
		0 TH !	\ store at item 0
	   ENDP
           STARTP 	\ multiply items 3 and 4 and store at 5
	 	3 TH @	\ get item 3
		4 TH @	\ get item 4
		  * 	\ multiply 
		5 TH !	\ store at item 5
	   ENDP
	ENDPAR ;
	
	PREVIOUS

	ZOTZ  CR 0 TH ?  5 TH ?
	
Parameter passing
-----------------
Two words are available to pass parameters to the high-level processes inside
a PAR ENDPAR construct. 
The word :I signals the following STARTP that it should generate code to
transfer n words from the data stack of the main sequential process onto the
data stack of this particular process. The word :F does the same thing, but
now for the floating-point items. After the execution of the code generated
by STARTP , the relevant stack of the root process has dropped the specified
number of items.

Example:

 : WOOF ( -- ) 
 	PAR
  	   12 21  2 :I STARTP  + {{ . SPACE }} ENDP
	   3e PI  2 :F STARTP  F+ E.           ENDP
	ENDPAR 
	.S ;

\ After WOOF , tForth prints:
\ FORTH> WOOF 33  6.141593E0 
\   Data: ---
\ System: ---
\  Float: --- ok
\ FORTH>
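
A further sketch, written only from the description above and not verified on
a tForth system: each process receives one integer through :I and stores its
square in a variable of its own, so that no output is produced until both
processes have finished. The names sq7 , sq9 and squares are hypothetical.

 VARIABLE sq7   VARIABLE sq9

 : squares ( -- )
    PAR
       7  1 :I STARTP  DUP * sq7 !  ENDP   \ this process starts with 7
       9  1 :I STARTP  DUP * sq9 !  ENDP   \ this process starts with 9
    ENDPAR
    CR sq7 ?  sq9 ? ;

Printing after ENDPAR instead of inside the processes avoids any dependence on
the order in which the two processes happen to run.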

Process replication
-------------------
It is possible to start concurrent processes using a DO LOOP related 
construction. The word to be used is TIMES .

<limit> <start> TIMES makes sure the next code fragment that is inside a PAR
... ENDPAR construct is executed <limit> - <start> times, once for each index
from <start> up to <limit> - 1. TIMES has the same functionality as DO . The
next STARTP command will act as the
corresponding LOOP statement. The loop index is available using I as with any
other DO ...  LOOP command. You can pass items to the I-th sub-process using
:I .

Example:

: FOO
	PAR  3 0 TIMES  
		   I 2*  I 1+  I DUP *   3 :I 
		STARTP  
		    + +  . 
		ENDP
	ENDPAR ;

\ Execution of FOO generates the following output:
\ FORTH> FOO 1  5  11   ok
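
The values printed by FOO can be cross-checked with an ordinary DO ... LOOP in
standard Forth. The words foo-term and foo-seq are invented for this sketch;
foo-term computes what one replicated process prints for a given index.

 : foo-term ( i -- n )
    DUP 2*  OVER 1+  ROT DUP *    \ the three items FOO passes with 3 :I
    + + ;                         \ the body of the sub-process, minus the .

 : foo-seq ( -- )
    3 0 DO  I foo-term .  LOOP ;  \ I takes the values 0, 1 and 2

 foo-seq                          \ prints 1 5 11

The sequential version always prints the terms in index order. For the
parallel FOO the text above gives no such guarantee, only the listing of the
output actually obtained.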


**[The tForth server/client scheme]**

The key to tForth being platform independent lies in its server/client
concept. All I/O and memory management is routed through stub words that
connect, via one hardware link, to a server on the host computer. This
server is written in C, using the host-supplied compiler and I/O libraries.
If a user has access to the source code, he can link in any library available
using the hooks in our protocol. In that way we have added graphics facilities
to tForth. The server can be made to work in parallel with tForth, which makes
a big difference when there are only one or two transputers available.

**[Porting tForth]**

As tForth left the bootstrap stage, it became able to metacompile itself.
The resulting source-code cleanup revealed that it is very easy to generate
Forths for other CPU's, using the same base set of source files plus a small
hardware dependent kernel.  Right now, we have built a 32-bit protected mode
Forth for the Intel 386+387 and 486 chips in this way: iForth.  Both tForth
and iForth can run the same metacompiler source code, enabling us to generate
a new tForth with tForth, a tForth with iForth, an iForth with tForth and an
iForth using iForth. Helped by the ANS Forth word ENVIRONMENT? , both Forths
run our standard set of examples and benchmarks -- currently a collection of
over two megabytes of text files.  iForth is not a parallel Forth, so
obviously some examples must be excluded.
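
As a hedged illustration of how ENVIRONMENT? helps one source file serve both
systems, the sketch below asks whether floating-point support is present
before loading code that needs it. The query string FLOATING is the one from
the final ANS Forth standard; the dpANS-3 draft that tForth tracks may spell
it differently, and the name has-float? is invented here.

 : has-float? ( -- flag )
    S" FLOATING" ENVIRONMENT?     \ ( -- false | flag true )
    0= IF  FALSE  THEN ;          \ an unknown query counts as absent

 has-float? [IF]
    CR .( floating-point examples enabled)
 [THEN]

An example file could guard its system-dependent sections in the same way.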

**[Where do we go from here?]**

Version 1.0 of tForth does not run on a network of transputers, but the
necessary hooks are there.  At this moment we are experimenting with tForth on
a 64-node transputer system, which we hope will result in a Helios-compliant
tForth version 2.00. This version will provide the high-level tools to load and
distribute tForth code over large networks. Any transputer on the net can be
selected to have full and exclusive access to keyboard, screen and other host
facilities.

In a parallel effort, we are working on kernel ports to the Intel '386+387 and
'486 chips, the Motorola 68000 series and the Texas Instruments TMS320C30 and
'40 chips. Porting problems with byte-order and character size will result in
incremental touch-ups of our present metacompiler, trying to make it truly
universal.


**[literature]**

 [1] IMS T800 Architecture; Technical Note 6. INMOS Ltd, Bristol, 1987.
 [2] Transputer Instruction Set: a compiler writer's guide.
     INMOS Limited, Prentice Hall International (UK) Ltd 1988.
 [3] Inside The Transputer.
     D.A.P. Mitchell, J.A. Thompson, G.A. Manson, G.R. Brooks,
     Blackwell Scientific Publications, 1990
 [4] User Guide Transputer Education Kit; Theory of Operation, Installation,
     Schematics. Computer System Architects.
 [5] A tutorial introduction to OCCAM programming. Dick Pountain, INMOS Ltd. 
     1987
 [6] draft proposed American National Standard for Information Systems -
     Programming Languages - Forth  (X3J14 dpANS-2 - Aug 3, 1991)
     Secretariat Computer and Business Equipment Manufacturers Association,
     311 First Street, N.W., Suite 500, Washington, DC 20001-2178
 [7] Comments received during the first public review of BSR X3.215.199x, 
     dpANS Forth. Secretariat CBEMA.
 [8] dpANS-3, Changed sections only, by John Rible. Secretariat CBEMA.
 [9] Forth-83 Standard. A publication of the Forth Standards Team, 
     P.O. box 4545 Mountain View, CA 94040, USA, August 1983
[10] Object-Oriented Forth; Implementation of Data Structures. 
     Dick Pountain, Academic Press.
[11] Faster Forth; Reducing overhead in threaded interpretive languages.
     Ronald L. Greene, BYTE, June 1984.
[12] A Fast Forth for the 68000, Lori Chavez. Dr. Dobb's Journal, October 1987.
[13] Implementing Forth on the 80386. John E. Lecky, JFAR Vol 5, No 1, 1987
[14] The T9000 Transputer; Product Overview, INMOS Ltd, Bristol, 1991


**[APPENDIX]**  **[list of tForth features]**

- All of the transputer hardware is supported: hardware task-switching,
hardware floating-point, the on-chip timer, the links. For the T4 a simple
floating-point emulator is provided.

- Tracking ANS Forth (presently conforming to dpANS-3). All ANS word sets
and their extensions are supported: CORE BLOCK DOUBLE ERROR FACILITY FILE
FLOAT LOCALS MEMORY TOOLKIT SEARCH and STRING .

- Full 32-bit system, incrementally compiling to machine code.  Speed of
high-level words is about two times slower than hand-optimized assembly
language. (The generated code beats most transputer ``C'' implementations with
regard to speed and size for anything but trivial programs. It is faster than
all of them by an order of magnitude with regard to compile and link speeds:
ready to execute a 4700 line program consisting of 11 source files within 30
seconds, where the INMOS ICC will take 84 seconds to compile and link the Sieve
of Eratosthenes, a single-file program of 39 lines).  High-level words can be
mixed with machine code primitives, in-line generation of machine code is
possible.

- tForth can be loaded at any available transputer address, but it will
always claim the on-chip RAM for its workspace (this can be changed if
unacceptable).  tForth can use all the memory that is available (dynamically as
well as statically), or it can be deliberately shrunk to a workable minimum
of 192 KBytes (73 Kbytes free for user programs).

- All OCCAM constructs like PAR, ALT and PRI PAR have tForth counterparts, but
tForth also allows dynamic memory management and recursion, which are not
available in OCCAM.  The on-chip timer and the transputer links can be
accessed with high-level Forth words. Semaphores and queues are implemented
and are used in our I/O and memory management system to enable multi-process
I/O.

- tForth has an integrated optimizer and runs only about 2 times slower than
hand-optimized assembly language.  There is an ASSEMBLER vocabulary to get
the extra speed if needed. The assembler supports all documented (and some 
undocumented) instructions for every transputer model known to us. It is
possible to cross assemble for other models; the assembler emits warnings if
the code is not executable on the host hardware.

- Symbolic disassembler loads on demand.

- Several flavors of debuggers and profilers.  

- String, Complex number, 2D-Matrix, Array and Graphics support (fully
documented source code in library format).

- Online help for every tForth kernel word. 

- A flexible command line editor is included. All of the user's own programming
tools (for instance editor, grep, browser) can be seamlessly integrated into
the tForth environment using the words SYSTEM and OS-IMPORT (not available
under MS-Windows).

- Function key support, command and file history. 

- Compile errors start a user-specified editor with the cursor pointing at the
problem.

- Two megabytes of examples are available for study or inclusion in your own
programs. Examples encompass benchmarks, parallel demos, graphics programming,
big number arithmetic, text processing tools, neural networks, text compression
and even a few games.

- High level support for module management, localized definitions and 
use of forward references. Smart forgetting automatically handles process
descheduling, memory deallocation and deinstallation of vectors.

- All I/O is vectored. A line of code or a function key press can direct all
output to a disk file or a free transputer link.  

- Source code for the C server is available. User functions enable the
calling of any C-library on the host, and thus the host platform may process
in parallel with the transputer.