The white pass will need to gather in cache a slice large enough to make at least one white piece dynamic. The black pieces do not move during this pass, so in principle the required slices would only measure 64 positions, with all black pieces fixed. In practice this cannot be achieved, as the storage order is such that positions differing only in the location of the two low-order black pieces share a cache line. So we are forced to include those pieces as dynamic in the slice as well.
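A minimal sketch of such an index layout may make this concrete (the field order, names and 6-bits-per-man packing are assumptions for illustration, not the generator's actual code):

    #include <stdint.h>

    /* Hypothetical position index: the remaining men are packed into
     * higherPieces; the two low-order black pieces occupy the lowest bits. */
    static inline uint32_t pos_index(uint32_t higherPieces,
                                     unsigned bPiece1, unsigned bPiece2)
    {
        return (higherPieces << 12) | (bPiece1 << 6) | bPiece2;
    }

With one bit per position, a 64-byte cache line covers 512 consecutive indices, i.e. all 64 locations of bPiece2 plus 8 locations of bPiece1, so a sub-slice that keeps these black pieces "fixed" would still pull in cache lines holding positions for many of their locations.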
In fact there is a problem with gathering sub-slices that keep a black piece fixed: as the different locations of the dynamic white piece in general come from chunks with a different symmetry, 'fixed' does not really mean fixed. The black pieces will be mirrored around, and if we were to work with slices that keep the black piece in one location after the symmetry operation, the order of such slices within the chunks they were gathered from would differ from chunk to chunk. And the lost(N) lists cannot be accessed randomly; they have to be read sequentially, in canonical order.
So the white pass will gather 64 3-men chunks of 32KB each to create a 4-men slice, and treat the white moves within this slice. Reading entire 3-men chunks means that at least a large part of the lost(N) lists can be read sequentially. The lost(N) info has to be read from 64 different packed lost(N) lists for this, as each white-piece location has its own lost(N) list. Only a small fraction of those lists is needed, namely the part with the high-order pieces (always the black King) of the disk chunks fixed.
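A sketch of the gather step, under the assumption that the chunks lie consecutively in a black-King-major, white-piece-minor order (the names and the in-memory diskImage stand-in are illustrative only):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK_BYTES (64*64*64/8)        /* 3-men bitmap chunk: 32KB */
    #define SLICE_BYTES (64*CHUNK_BYTES)    /* 4-men slice:        2MB  */

    /* diskImage stands in for the file holding the bitmap chunks. */
    static void gather_slice(const uint8_t *diskImage, unsigned blackKing,
                             uint8_t slice[SLICE_BYTES])
    {
        /* one 32KB chunk per location of the dynamic white piece */
        for (unsigned whiteSq = 0; whiteSq < 64; whiteSq++) {
            size_t src = ((size_t)blackKing*64 + whiteSq) * CHUNK_BYTES;
            memcpy(slice + (size_t)whiteSq*CHUNK_BYTES,
                   diskImage + src, CHUNK_BYTES);
        }
    }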
We thus still have to fix one black piece, defining a sub-slice, as a bitmap for a 4-men slice already measures 2MB, which is the maximum the CPU cache can handle. Fixing the black King forces us to read the lost(N) list out of order, per black-King sub-slice of the chunk. To make this possible, each lost(N) list is structured as 64 separate lists, one for each black King location. The lost(N) positions read from the list will be (differentially) set up on a 0x88 board, from which the moves of the (single) dynamic white piece will be generated.
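A minimal sketch of the 0x88 move generation for the dynamic piece, assuming for illustration that it is a rook and that only moves to empty squares are needed (the board setup and the differential decoding of the list are omitted):

    #include <stdint.h>

    #define ON_BOARD(sq)  (((sq) & 0x88) == 0)

    typedef void (*MoveSink)(int from, int to);

    static const int rookStep[4] = { 1, -1, 16, -16 };  /* E, W, N, S on 0x88 */

    /* Generate the non-capture moves of the dynamic white piece from a
     * lost(N) position that was set up on board[] (0 = empty square).   */
    static void gen_white_moves(const int8_t board[128], int from, MoveSink sink)
    {
        for (int d = 0; d < 4; d++)
            for (int to = from + rookStep[d]; ON_BOARD(to); to += rookStep[d]) {
                if (board[to])
                    break;          /* ray blocked: no further destinations */
                sink(from, to);     /* empty square: a valid move target    */
            }
    }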
Each move will be mapped into a buffer in the spacer area of the 128MB buffer designated to hold the newlyWon info. This is done by clearing the index bits corresponding to the higher-order disk-chunk piece, so that only the bits indicating the three lower-order men and the white dynamic piece are left. This reduced index is used to access the newly-won bitmap in the spacer area, which is cleared every time we start on a new location of the high-order black piece. The target positions of the moves are first tested against the won bitmap (using the full index) to establish whether they are newly won or not.
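A sketch of the index reduction and the test against the won bitmap; the 6-bits-per-man layout, the field position of the high-order piece, and which bitmap takes which index are assumptions based on the description above:

    #include <stdint.h>

    #define HI_SHIFT  24                    /* field of the high-order disk-chunk
                                               piece, assuming 6 bits per man    */
    #define HI_MASK   (~(63u << HI_SHIFT))  /* clearing it leaves the reduced index */

    static inline int  test_bit(const uint8_t *bm, uint32_t i)
    { return (bm[i >> 3] >> (i & 7)) & 1; }

    static inline void set_bit(uint8_t *bm, uint32_t i)
    { bm[i >> 3] |= (uint8_t)(1u << (i & 7)); }

    /* Mark the target of a white move: test the won bitmap with the full
     * index first; only positions not won before get their bit set in the
     * newlyWon scratch bitmap, addressed with the reduced index.          */
    static void mark_target(const uint8_t *won, uint8_t *newlyWon,
                            uint32_t fullIndex)
    {
        if (!test_bit(won, fullIndex))
            set_bit(newlyWon, fullIndex & HI_MASK);
    }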
After all positions of a black high-order constellation have been treated, we pack the newly-won bitmaps (in the 64 scratch areas) and add each of them to one of the 64 wonLists, which grow in the main buffer area.
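A sketch of the packing step for one scratch area, with the simplifying assumption that each set bit is appended as a plain position index rather than in the differentially packed list format (capacity checks omitted):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint32_t *entries;
        uint32_t  count;
    } WonList;

    static void pack_newly_won(uint8_t *bitmap, size_t bytes, WonList *list)
    {
        for (size_t i = 0; i < bytes; i++) {
            uint8_t b = bitmap[i];
            while (b) {                         /* visit each set bit        */
                int bit = __builtin_ctz(b);
                list->entries[list->count++] = (uint32_t)(i*8 + bit);
                b &= b - 1;
            }
        }
        memset(bitmap, 0, bytes);   /* clear scratch area for the next constellation */
    }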
When we are building a 7-men EGT, we have to take into account that there are actually two dynamic white pieces, as the white King is made dynamic as well (in a streaming way). This means we have to feed the newlyWon bitmap not only from 64 different lost(N) lists for the fully dynamic piece, but also from lost(N) lists with different King positions. Due to the way the streaming is implemented, this is a one-way process: we generate the newlyWon info for a single King location, feeding it with lost(N) positions from several King locations. So we only need to update 64 bitmaps at a time, but we read from 64x5 lost(N) lists (or 64x4, if we defer one white King move), after which we pack only 64 wonLists.
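A skeleton of this streaming loop, structure only; the helper names and the exact set of contributing King locations per step are assumptions based on the description above:

    /* placeholders for the steps described earlier */
    extern void     clear_scratch_bitmaps(void);             /* the 64 newlyWon bitmaps */
    extern void     feed_from_lost_list(unsigned kingFrom,
                                        unsigned kingTo, unsigned whiteSq);
    extern void     pack_won_lists(unsigned kingTo);         /* the 64 wonLists         */
    extern unsigned usable_king_sources(unsigned kingTo, unsigned src[8]);

    static void stream_white_king_pass(void)
    {
        for (unsigned kingTo = 0; kingTo < 64; kingTo++) {   /* one King location at a time */
            clear_scratch_bitmaps();
            unsigned src[8];
            /* sources: kingTo itself (moves of the fully dynamic piece) plus the
             * King locations available in the stream, roughly 4-5 in total       */
            unsigned n = usable_king_sources(kingTo, src);
            for (unsigned s = 0; s < n; s++)
                for (unsigned whiteSq = 0; whiteSq < 64; whiteSq++)
                    feed_from_lost_list(src[s], kingTo, whiteSq); /* 64 x n lists read */
            pack_won_lists(kingTo);                               /* only 64 wonLists  */
        }
    }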
This white pass is the most taxing for the CPU cache. The won and newlyWon bitmaps each occupy 2MB, which together would completely fill a 4MB L2 cache. This is a possible source of inefficiency. Using a Penryn-family CPU, which has 6MB of L2 cache, would definitely solve it; on Conroe chips, with 4MB of L2 cache, a solution has yet to be found. The other data (next to the two bitmaps) consists of lists, which are sparse most of the time and only need one-time access. So one can hope that not too much of the won and newlyWon bitmaps is flushed (and would have to be reloaded from memory later) by accessing these lists. A bit unsatisfactory, perhaps. But at this stage it is not clear whether the in-memory process takes up a significant part of the time anyway, nor whether cache thrashing slows down the white pass significantly.