The counting twinprimes project of the hcc 68000 gg and the
Forth gg.

A prime is a number with no (precisely: only trivial) dividers, such as
17 or 10003. Of course these numbers are divisable by themselves or 1,
so mathematicians prefer to say having "no non-trivial" dividers.
Twinprimes are primes with a difference of 2 e.g. 17 and 19.

Apparently nobody has come up with a better way for counting twinprimes
than using a sieve for finding primes, and inspecting the sieve.
This means that we need not bother to study sophisticated math papers,
and can concentrate on programming a sieve that is optimal for the
computers we use.

There is some controversy about the question how much can be gained by
using assembly language versus a high level programming language. Some
claim that there will allways remain a gap of a factor five in speed. I
agree with that. The reason is that in assembler one will try to addapt
the data structure to the algorithm and the algorithm to the available
assembler instructions and architectural circumstances. . Of course one
can do that in C too (to a certain extent, one will never be able to
take advantage of the SETB setbit instruction of the Pentium to deliver
the previous value in the Carry bit) but I would say that one is using C
for an assembler, a crippled assembler, properly speaking. An assembler
programmer could base his data structures on the existance of a single
instruction. Nobody is talking portability here, or objects.

Patterns with twin primes.
I have introduced a Forth program that implements a sieve where a lot of
numbers are not even considered. The sieve represents only the numbers
 1, 7,11,13,17,19,23,29,
31,37,41,43,47,49,53,59,
61,67,71,73,77,79,83,89
91,97,101,103,107,109,113,119
etc.
So no multiples of 2, 3 or 5 are present in the table. After 30 (=2*3*5)
the pattern repeats itself.
Look how efficient this is! In this part of the table only 49, 77 and
91 are not prime.
Each row fits comfortably in a byte. This means that in a table of 1
Mbyte we can sieve all numbers up till 30,000,000.

Counting the primes goes very fast too, using a lookup table:
instead of counting the bits that are up, we index into a table.
The table looks as follows:
byte value      value in binary         table contains number of bits up
                                        (decimal)
00H             00000000B               0
01H             00000001B               1
02H             00000010B               1
03H             00000011B               2
04H             00000100B               1
..
FDH             11111101B               7
FEH             11111110B               7
FFH             11111111B               8

On the transputer we would never do that. Because there is an assembler
intruction for counting bits.

Lets have another look at the table, with twinprimes in mind.

 1, 7,11,13,17,19,23,29,
31,37,41,43,47,49,53,59,
61,67,71,73,77,79,83,89,
91,97,101,103,107,109,113,119
Apparently we can have twinprimes that cross a row boundary, such as 29
and 31 (not 89 and 91! Not all numbers in the table are primes).
By regrouping we get
1,  7,
11, 13, 17, 19, 23, 29, 31, 37,
 ****    ****        ****
41, 43, 47, 49, 53, 59, 61, 67,
 ****    ****        ****
71, 73, 77, 79, 83, 89  91, 97,
 ****    ****        ****
101,103,107,109,113,119
 ****    ****        ****

Now twin primes cannot cross a boundary and we can do the same trick a
as before:
( 01H represents the 30-folds+11, i.e. the first column, 02H the second
column, etc.)

byte value      valu in binary          table= number of twin primes up
                                        (decimal)
00H             00000000B               0
01H             00000001B               0
02H             00000010B               0
03H             00000011B               1
04H             00000100B               0
..
CFH             01101111B               3
..
FDH             11111101B               2
FEH             11111110B               2
FFH             11111111B               3

Other observations that may help.
The column under 23 and 37 can never be part of a twin prime.
So we need not store these numbers in the table. Leaving them out would
make the program too complicated. What we can do is, regard these
numbers as non-primes from the beginning and just not bother to mark
them as non-primes, this saves us one quarter of the marking effort.

Suppose we treat the first (ten) million as special, maybe with a
special small program. Then we need not bother about the small primes
like 2,3 and 5 that lead to exceptions (3 and 5 form a twin prime, as do
5 and 7). Why not use the well known fact that there are 8169 twins
under 10^6 (or 58980 under 10^7) [Paulo Ribenboim, The new book of prime
number records]. If we keep the length of the table at multiples of 1
million bytes, *starting at that point*, we need not bother about
crossing boundaries. We will allways be able to take the lenght of the
last sieve such that we end at 10^7 10^8 10^14 etc.

Suppose we are to remove the multiples of 17 in the table representing
10,000,000 to 40,000,000 , so our table is 1 Megabyte. It is best to
start at the end of the table counting down, because 40000000/17 hits an
entry in the table while 10000000/17 is outside the table, requiring
more manipulations. 1)

Pitfalls
In the Forth programs the table lenght (40Kbyte) was of the same order
of magnitude as the numbers to be deleted from the table. So we must be
very aware that a single step can take us out of the table to the other
side of the addressing range. Even in a 32-bit machine we encounter this
problem for 2G/30 say 66 Mbytes. My Pentium has 96 mBytes. Using the
BSET instruction with positive offset, the Pentium can handle table of
not more than 2G/16 or 128 Mbytes, an amount of memory that will be
common in a few years or less.

About the primes
If implemented on a transputer the primes could be passed using a bit
table with 30 primes pro byte or one with 16 primes pro byte. Probably
keeping down the bandwidth is the most important here so we should go
for 30 primes pro byte. The main transputer has 8 M, so it can hold a
6.666 table of 200Mbytes allowing us to sieve up to 4*10^16. As far as I
know that is in the world record range. Even if the transputers are
making clever use of the bandwidth, we must expect to have 3 seconds of
cummunication time for each sieve. With 64 transputers and an avarage of
2 Mbytes this means 3 seconds for each 3.6 G sieved, in the end. That
means 10^7 seconds for arriving at 4*10^16. A year is pi*10^6 seconds.
So that is 10/pi or ca. 3 years. The record mentionned in [1] is
1.37*10^14. That is one 60th. So we could beat it if we run the program
for 1/20 of a year. That is not out of the question.... as far as the
bandwidth of the transputers is concerned.....
The messages for new primes are to be arranged as follows:
Each transputer has two partial table, with one active. As soon as the
first transputer has finished its active table, it starts using the
second one, and sends out a request for refill. The second transputer
recieves this request and sends it through, but not until it has finished
its first table.
If there are more values of internal ram present, e.g. 1Mbytes and
2Mbyte, transputers of the same RAM-size are arranged in logical chains.
This means transparantly sending through messages that belong to the
other chain.
A transputer that has finished its range, sends out the count of
twinprimes found, that is at the same time a request for more work.
The main transputer adds the numbers and deals out the work.
At present we have two size of memory. So we could use the links in two
directions. If we manage to arange the transputers like this
->1->1->1->1->1<->2<-2<-2<-2
where the arrows indicate the direction of the main data flow, we would
actually use very little bandwidth.

Memory
We are aiming for a 1,000,000 byte table on transputers of 1Mbyte. This
seems odd but do not forget that 1M is approximately 1,050,000. The
program, the scratch variables and the lookup table should fit in the
8kbytes fast on board RAM, but the unrolling of loops could make the
program explode. I have no indication for the trade off between running
in fast memory and loop unrolling, but INMOS says it seems that fast
memory for the program is not as important as for tables. My first guess
is that two tables of 20 kbytes are to be used for the prime tables,
leaving us 10 kbyte for the program. This means messages at 10 mS
intervals, which may seem quite short, but fall within the capabilities
of the transputer.

1)
  Experience (the Forth program mentionned before) has shown that this
  leads to some possibly expensive manipulations, including a jump.
  (Okay the branch prediction of the Pentium will do a good job on this
  one, but it also simplifies the program.).