The counting twinprimes project of the hcc 68000 gg and the Forth gg. A prime is a number with no (precisely: only trivial) dividers, such as 17 or 10003. Of course these numbers are divisable by themselves or 1, so mathematicians prefer to say having "no non-trivial" dividers. Twinprimes are primes with a difference of 2 e.g. 17 and 19. Apparently nobody has come up with a better way for counting twinprimes than using a sieve for finding primes, and inspecting the sieve. This means that we need not bother to study sophisticated math papers, and can concentrate on programming a sieve that is optimal for the computers we use. There is some controversy about the question how much can be gained by using assembly language versus a high level programming language. Some claim that there will allways remain a gap of a factor five in speed. I agree with that. The reason is that in assembler one will try to addapt the data structure to the algorithm and the algorithm to the available assembler instructions and architectural circumstances. . Of course one can do that in C too (to a certain extent, one will never be able to take advantage of the SETB setbit instruction of the Pentium to deliver the previous value in the Carry bit) but I would say that one is using C for an assembler, a crippled assembler, properly speaking. An assembler programmer could base his data structures on the existance of a single instruction. Nobody is talking portability here, or objects. Patterns with twin primes. I have introduced a Forth program that implements a sieve where a lot of numbers are not even considered. The sieve represents only the numbers 1, 7,11,13,17,19,23,29, 31,37,41,43,47,49,53,59, 61,67,71,73,77,79,83,89 91,97,101,103,107,109,113,119 etc. So no multiples of 2, 3 or 5 are present in the table. After 30 (=2*3*5) the pattern repeats itself. Look how efficient this is! In this part of the table only 49, 77 and 91 are not prime. Each row fits comfortably in a byte. This means that in a table of 1 Mbyte we can sieve all numbers up till 30,000,000. Counting the primes goes very fast too, using a lookup table: instead of counting the bits that are up, we index into a table. The table looks as follows: byte value value in binary table contains number of bits up (decimal) 00H 00000000B 0 01H 00000001B 1 02H 00000010B 1 03H 00000011B 2 04H 00000100B 1 .. FDH 11111101B 7 FEH 11111110B 7 FFH 11111111B 8 On the transputer we would never do that. Because there is an assembler intruction for counting bits. Lets have another look at the table, with twinprimes in mind. 1, 7,11,13,17,19,23,29, 31,37,41,43,47,49,53,59, 61,67,71,73,77,79,83,89, 91,97,101,103,107,109,113,119 Apparently we can have twinprimes that cross a row boundary, such as 29 and 31 (not 89 and 91! Not all numbers in the table are primes). By regrouping we get 1, 7, 11, 13, 17, 19, 23, 29, 31, 37, **** **** **** 41, 43, 47, 49, 53, 59, 61, 67, **** **** **** 71, 73, 77, 79, 83, 89 91, 97, **** **** **** 101,103,107,109,113,119 **** **** **** Now twin primes cannot cross a boundary and we can do the same trick a as before: ( 01H represents the 30-folds+11, i.e. the first column, 02H the second column, etc.) byte value valu in binary table= number of twin primes up (decimal) 00H 00000000B 0 01H 00000001B 0 02H 00000010B 0 03H 00000011B 1 04H 00000100B 0 .. CFH 01101111B 3 .. FDH 11111101B 2 FEH 11111110B 2 FFH 11111111B 3 Other observations that may help. The column under 23 and 37 can never be part of a twin prime. So we need not store these numbers in the table. Leaving them out would make the program too complicated. What we can do is, regard these numbers as non-primes from the beginning and just not bother to mark them as non-primes, this saves us one quarter of the marking effort. Suppose we treat the first (ten) million as special, maybe with a special small program. Then we need not bother about the small primes like 2,3 and 5 that lead to exceptions (3 and 5 form a twin prime, as do 5 and 7). Why not use the well known fact that there are 8169 twins under 10^6 (or 58980 under 10^7) [Paulo Ribenboim, The new book of prime number records]. If we keep the length of the table at multiples of 1 million bytes, *starting at that point*, we need not bother about crossing boundaries. We will allways be able to take the lenght of the last sieve such that we end at 10^7 10^8 10^14 etc. Suppose we are to remove the multiples of 17 in the table representing 10,000,000 to 40,000,000 , so our table is 1 Megabyte. It is best to start at the end of the table counting down, because 40000000/17 hits an entry in the table while 10000000/17 is outside the table, requiring more manipulations. 1) Pitfalls In the Forth programs the table lenght (40Kbyte) was of the same order of magnitude as the numbers to be deleted from the table. So we must be very aware that a single step can take us out of the table to the other side of the addressing range. Even in a 32-bit machine we encounter this problem for 2G/30 say 66 Mbytes. My Pentium has 96 mBytes. Using the BSET instruction with positive offset, the Pentium can handle table of not more than 2G/16 or 128 Mbytes, an amount of memory that will be common in a few years or less. About the primes If implemented on a transputer the primes could be passed using a bit table with 30 primes pro byte or one with 16 primes pro byte. Probably keeping down the bandwidth is the most important here so we should go for 30 primes pro byte. The main transputer has 8 M, so it can hold a 6.666 table of 200Mbytes allowing us to sieve up to 4*10^16. As far as I know that is in the world record range. Even if the transputers are making clever use of the bandwidth, we must expect to have 3 seconds of cummunication time for each sieve. With 64 transputers and an avarage of 2 Mbytes this means 3 seconds for each 3.6 G sieved, in the end. That means 10^7 seconds for arriving at 4*10^16. A year is pi*10^6 seconds. So that is 10/pi or ca. 3 years. The record mentionned in [1] is 1.37*10^14. That is one 60th. So we could beat it if we run the program for 1/20 of a year. That is not out of the question.... as far as the bandwidth of the transputers is concerned..... The messages for new primes are to be arranged as follows: Each transputer has two partial table, with one active. As soon as the first transputer has finished its active table, it starts using the second one, and sends out a request for refill. The second transputer recieves this request and sends it through, but not until it has finished its first table. If there are more values of internal ram present, e.g. 1Mbytes and 2Mbyte, transputers of the same RAM-size are arranged in logical chains. This means transparantly sending through messages that belong to the other chain. A transputer that has finished its range, sends out the count of twinprimes found, that is at the same time a request for more work. The main transputer adds the numbers and deals out the work. At present we have two size of memory. So we could use the links in two directions. If we manage to arange the transputers like this ->1->1->1->1->1<->2<-2<-2<-2 where the arrows indicate the direction of the main data flow, we would actually use very little bandwidth. Memory We are aiming for a 1,000,000 byte table on transputers of 1Mbyte. This seems odd but do not forget that 1M is approximately 1,050,000. The program, the scratch variables and the lookup table should fit in the 8kbytes fast on board RAM, but the unrolling of loops could make the program explode. I have no indication for the trade off between running in fast memory and loop unrolling, but INMOS says it seems that fast memory for the program is not as important as for tables. My first guess is that two tables of 20 kbytes are to be used for the prime tables, leaving us 10 kbyte for the program. This means messages at 10 mS intervals, which may seem quite short, but fall within the capabilities of the transputer. 1) Experience (the Forth program mentionned before) has shown that this leads to some possibly expensive manipulations, including a jump. (Okay the branch prediction of the Pentium will do a good job on this one, but it also simplifies the program.).