During the black pass we work per disk chunk. Each such chunk has a single lost(N) list and wonList associated with it, and treating them should produce a single lost(N+1) list. The new lost(N+1) positions will initially be marked in a bitmap. If there were already lost(N+1) positions from a previous leapfrog cycle (which treated other white moves), we have to unpack that list first. Otherwise we initialize the scratch bitmap to all zeros. We will need 2MB of scratch storage to hold the lost bitmap. This storage will be re-used for every chunk. Note that symmetry plays no role in the black pass, as all positions in one chunk belong to the same symmetry.
After filling the lost bitmap with zeros, or unpacking the partially complete lost(N+1) into it, we run through the wonList for the chunk. Each position from the wonList will be set up (differentially) on a 0x88 board, from which we generate black retrograde moves. The moves will be performed on the 0x88 board as well as on the index, and from the target (potentially lost) positions we will generate forward black moves to probe the won bitmap of the chunk (verification). There is no need to perform the probing moves on the 0x88 board; they are only applied to the index. If all verification moves probe as 'won', we can set the bit of the primary target in the lost bitmap. As soon as one verification move probes as not-won, we can abort the verification process and leave the lost bitmap untouched. In both cases we undo the retrograde move on the 0x88 board before proceeding to read the wonList.
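This inner loop can be sketched on a toy position set where black has only a bare King, so retrograde and forward King moves coincide and the 'index' collapses to a 64-bit bitmap. Everything here (the 0x88 step table, the names `treat_won_position` and `all_moves_won`, the bitmap encoding) is a hypothetical illustration, not the actual generator code:

```c
#include <stdint.h>

/* 0x88 King steps; off-board squares satisfy (sq & 0x88) != 0. */
static const int king_step[8] = { 1, -1, 16, -16, 17, -17, 15, -15 };

static int idx64(int sq88) { return (sq88 >> 4) * 8 + (sq88 & 7); }

static int is_won(uint64_t won, int sq88)
{
    return (int)((won >> idx64(sq88)) & 1);
}

/* Verification: does every black King move from sq88 probe as 'won'? */
static int all_moves_won(uint64_t won, int sq88)
{
    for (int d = 0; d < 8; d++) {
        int to = sq88 + king_step[d];
        if (to & 0x88) continue;          /* off board */
        if (!is_won(won, to)) return 0;   /* escape found: abort early */
    }
    return 1;
}

/* One wonList entry: generate retrograde King moves from wonSq,
   verify each target, and set its bit in *lost on success. */
static void treat_won_position(uint64_t won, int wonSq, uint64_t *lost)
{
    for (int d = 0; d < 8; d++) {
        int from = wonSq + king_step[d];  /* retrograde move */
        if (from & 0x88) continue;
        if (all_moves_won(won, from))
            *lost |= 1ULL << idx64(from);
    }
}
```

The early `return 0` in `all_moves_won` is the abort-on-first-escape described above.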
After the lost bitmap is completed, we run-length code and pack it, possibly overwriting a previous (partial) lost(N+1) list, or allocating new storage directly after the lost(N) list for this chunk. The 2MB storage area then becomes available to create the lost bitmap for the next chunk.
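A minimal packer in this spirit might look as follows; the actual run-length code used for the lost lists is not specified here, so the (count, value) byte pairs below are purely illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise run-length coder for the lost bitmap.
   Emits (count, value) pairs; runs longer than 255 are split.
   Returns the packed length in bytes. */
static size_t rle_pack(const uint8_t *bm, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint8_t v = bm[i];
        size_t run = 1;
        while (i + run < n && bm[i + run] == v && run < 255) run++;
        out[o++] = (uint8_t)run;
        out[o++] = v;
        i += run;
    }
    return o;
}

/* Inverse: unpack a partial lost(N+1) list back into the scratch bitmap
   at the start of the next leapfrog cycle. Returns bytes written. */
static size_t rle_unpack(const uint8_t *in, size_t plen, uint8_t *bm)
{
    size_t o = 0;
    for (size_t i = 0; i + 1 < plen; i += 2)
        for (uint8_t k = 0; k < in[i]; k++) bm[o++] = in[i + 1];
    return o;
}
```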
The process as described would put a (too?) heavy demand on L2 cache space: 2MB won bitmap, 2MB lost bitmap, wonList and lost(N+1) list all compete for cache space, making it unlikely (with a 4MB cache) that the entire created lost bitmap will still be cached by the time we pack it. When the black King is the highest-order disk-chunk piece, however, there is a natural streaming: a lost bitmap for all positions with the black King on a1 will be complete as soon as we have read and treated the wonList up to and including the positions with the black King on b2. The packing and releasing of buffer space for the lost bitmap can thus be moved forward: rather than doing it for the complete chunk once it is finished, we do it (black) King location by King location, with a delay of 10 squares relative to the black King position currently being treated.
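The delayed packing can be sketched as a skeleton loop. `stream_black_pass` and its recording of the pack order are hypothetical stand-ins for the real per-King-square treatment:

```c
/* Hypothetical skeleton of the streamed black pass: NSQ King locations
   are treated in index order, and the lost-bitmap slab for square s is
   complete (and can be packed, releasing its buffer) once square
   s + DELAY has been treated. The pack order is recorded in order[]. */
enum { NSQ = 64, DELAY = 10 };

static int stream_black_pass(int *order)
{
    int cnt = 0;
    for (int s = 0; s < NSQ; s++) {
        /* ... treat all wonList entries with the black King on s ... */
        if (s >= DELAY)
            order[cnt++] = s - DELAY;       /* slab s-DELAY is final */
    }
    for (int s = NSQ - DELAY; s < NSQ; s++) /* flush the last slabs */
        order[cnt++] = s;
    return cnt;
}
```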
Similarly, the won bitmap for a1 will no longer be probed after we have treated the wonList for c3, as for all later locations the black King will not be able to reach a1 in 2 moves (one retrograde, and one verification move). To exploit this does not require any explicit programming; it just means that most of the won bitmap will be flushed out of the cache early to make room for other data, without any adverse effects on performance. The true working set will at most involve 37 3-men chunks for the won bitmap (e.g. a2-e6 when we are doing c4), and 19 3-men chunks for the lost bitmap (b3-d5 when we are doing c4), i.e. only 56 x 32K < 2MB for the bitmaps. This leaves plenty of room for the wonList and lost(N+1). The former is accessed sequentially, a single time, and does not really have to stay in cache for a second access. (But you cannot tell that to the cache logic, so it needs space to linger on until it is no longer least-recently used, and becomes eligible for replacement.) The storage holding the partially complete lost(N+1) list that was read is probably still cached 10 King locations later, when the updated (completed) lost(N+1) will be written to it.
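The working-set figures follow from the index distance a King move can bridge: at most 9 chunks (one rank plus one file, e.g. c4 to b3), hence ±18 chunks of won bitmap for the two-move reach (retrograde plus verification) and ±9 chunks of lost bitmap. A small check of this arithmetic (the names are ours):

```c
enum { CHUNK = 32 * 1024 };   /* one 3-men chunk of bitmap */

static long working_set(void)
{
    long won_chunks  = 2 * 18 + 1;   /* 37: a2..e6 when doing c4 */
    long lost_chunks = 2 * 9  + 1;   /* 19: b3..d5 when doing c4 */
    return (won_chunks + lost_chunks) * CHUNK;
}
```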
In practice we can reduce the working set for the won bitmap by probing the black King moves last; this makes it much less likely that those probes actually have to be done, as other probes, into a much smaller working set of only a single 3-men chunk, are likely to have provided a black escape already. This means that only very few cache lines of the more distant parts of the won bitmap working set will actually have to remain loaded (or will be loaded in advance due to probing with King moves, before they become part of the smaller 3-men working set). The lost bitmap does not have to remain stored at all after it is packed, so the buffer space for it can actually be reduced to 1MB, clearing the high-order bit of the black-King rank in the index together with the bits for the white pieces before accessing it. This makes it less likely that the lost bitmap scratch memory will ever need to be flushed from the cache, something we can then ensure by doing some dummy accesses to parts of it not currently reachable by black King moves.
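The effect of probing the King moves last can be sketched with a stand-in verification routine; the probe lists and the escape counter are hypothetical, and the 'won' set is again abstracted to a 64-bit bitmap:

```c
#include <stdint.h>

/* Hypothetical probe ordering: verify the moves of the other black
   pieces first (they probe only the current, cache-local 3-men chunk)
   and the black King moves last (they probe distant chunks). An early
   escape then skips the distant probes entirely; *distant_probes
   records how many of the costly probes were actually issued. */
static int all_moves_won_ordered(uint64_t won,
                                 const int *local,   int nl,
                                 const int *distant, int nd,
                                 int *distant_probes)
{
    *distant_probes = 0;
    for (int i = 0; i < nl; i++)             /* local chunk first */
        if (!((won >> local[i]) & 1)) return 0;
    for (int i = 0; i < nd; i++) {           /* distant chunks last */
        (*distant_probes)++;
        if (!((won >> distant[i]) & 1)) return 0;
    }
    return 1;
}
```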
When building a 7-men tablebase, we will (for reasons of global memory-requirement optimization) postpone treating the white King moves from one of the King locations feeding into the chunk, as we want to do the space-consuming white pass before this (also space-consuming) data is read in. This means that we are doing the black pass from an incomplete wonList, lacking the input from a single white King move. Thus, in addition to treating the positions indicated in the wonList as starting positions for black retro-moves, we will also have to treat the positions in the lost(N) list for this as-yet untreated white King location (the 'deferred' lost(N)). We will thus have two lists open for reading at the same time, the wonList for the current chunk, and the lost(N) list for the chunk with the white King in a single different location. As we work black King location by King location, we have to alternate reading from those lists in 3-men chunks. This does not cause any real complications.
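The alternating reads can be sketched as a two-stream interleave keyed on the black King square; `interleave` and its source-tagging output are illustrative only, with both lists reduced to arrays of King squares in index order:

```c
/* Hypothetical interleaving of the two open input lists (the wonList
   and the deferred lost(N)): both are sorted by black King square, and
   for each King square we drain first one list, then the other, before
   advancing. src_out[] records 0 for list a, 1 for list b. */
static int interleave(const int *a_sq, int na,
                      const int *b_sq, int nb, int *src_out)
{
    int i = 0, j = 0, n = 0;
    for (int sq = 0; sq < 64; sq++) {
        while (i < na && a_sq[i] == sq) { src_out[n++] = 0; i++; }
        while (j < nb && b_sq[j] == sq) { src_out[n++] = 1; j++; }
    }
    return n;
}
```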
When we are building with fewer than 4 black men, we will still use the black King as the high-order man of the disk chunk. The extra white man will be the one below that. This means the same streaming based on black-King locality can be used in that case, although using a white piece as highest-order piece would lead to even more perfect streaming, reducing both won and lost bitmap working sets to a single 3-men chunk. But this would give no additional advantage, as the black-King streaming already makes the working set fit in L2.
There is the complication with the white moves for this piece, though. Although they logically belong to the white pass, we will defer them, like we deferred the treatment of the single King move, to the black pass. This is because such a white piece will automatically be dynamic in the 3-men black-K-sub-slices in which this black pass proceeds. So together with the wonList and the lost(N) of the other white-King slice (for 7-men), one of the two partial cycles of a leapfrog cycle will also do a grandparent cycle for the white moves within the 3-men slice. We always choose for this the partial cycle that completes the lost(N+1) info (rather than the one that creates a partial list from scratch), in order to keep the partial lost(N) lists (which have to be written to disk, and read back in the other slicing order) as small as possible.
In leapfrog, it might be possible to resort to streaming treatment of two cycles worth of black passes (on different white-King slices) simultaneously: as soon as we get a complete lost(N+1) packed list for the first cycle, we could use it as the deferred lost(N+1) list to generate (partial) lost(N+2) for the next cycle. This assumes that the white pass for that chunk has already been done. This would save us the trouble of reading the just-packed lost(N+1) back into cache, but would again start to heavily tax the cache. As the lost(N+1) lists are quite small during most of the generation process, we did not bother to try this.