16 byte per line hblank copy routine

by nitro2k01 on 2018-04-05 (#216405)

This was intended to be a reply to tepples' thread about OAM allocation but I figured it would make a good thread on its own.

For inspiration, I've written a stack copy routine which can copy 16 consecutive bytes to VRAM in one HBlank plus the trailing mode 2, if the line is free of sprites. When there are only a "few" sprites on the line, it's still able to safely copy 14 bytes. If there are "many" sprites, the timing gets even tighter. In my particular case, I made it copy only 14 bytes and implemented logic to skip lines with "many" sprites on them, which was easier than varying the number of bytes being copied. I used it in my Flappy Bird clone to produce a parallax scrolling background for the scenery behind the pipes.

The setup is as follows:

Code:

   ld   A,$08         ; HBlank as LCD interrupt source
   ldh   [STAT],A

   ld   A,2         ; LCD interrupt
   ldh   [IE],A

Nothing too weird there. The code uses the HALT opcode to synchronize the copy, so IME is assumed to be 0 throughout. (I.e. interrupt dispatch is disabled using DI.) With IME=0, HALT still wakes up when an enabled interrupt becomes pending, but no handler is called.

Here's a slightly redacted version of the routine with some game specific logic removed:

Code:

; Copy 16 bytes in one HBlank (mode 0+mode 2)
STACKCOPY_LCD::
   ld   [RAMCODE-RAMCODE_S+ldspopcode16+1],SP   ; Save SP at the load SP opcode at the end.
   ld   SP,HL               ; Load source address from HL into SP.
   ld   H,D               ; \ Load destination address into HL.
   ld   L,E               ; /
.fastcopyloop
   ; 0
   pop   DE               ; Prefetch.
   xor   A               ; \ Clear pending interrupt flags.
   ldh   [IFLAG],A            ; /

   ld   A,E               ; Prefetch.

   halt                  ; Wait for HBlank to happen.
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A

   
rept   6
   pop   DE               ; Main unrolled loop body. 
   ld   A,E
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A
endr

   ; 7
   pop   DE
   ld   A,E
   ld   [HL+],A
   ld   [HL],D               ; Save some time on the last byte for good measure.
   inc   HL
   
   ldh   A,[skipline]

   ld   E,A

   ldh   A,[LY]
   cp   E
   jr   z,.skiplines
   
.afterskiplines
   ldh   A,[linesctr]
   dec   A
   ldh   [linesctr],A
   jr   nz,.fastcopyloop

   ld   E,L
   ld   D,H
   ld   HL,[SP+0]            ; Restore source pointer for later use.

   jp   RAMCODE-RAMCODE_S+ldspopcode16

Explanation:

First, SP is saved so it can be restored later. This part may need some explanation. I have copied code to RAM and I'm using a bit of pointer arithmetic to point to the argument part of an LD SP, $xxxx opcode. This is done so that, when the copy is finished, the code can jump to the restoration routine, which executes ld SP, $xxxx; ret.

Code:

; The RAM code source. Somewhere in ROM...
RAMCODE_S::
   ; Maybe some other code here...
ldspopcode16::
   ld   SP,0000               ; This is overwritten at the start of the code.
   ret
RAMCODE_S_End::

; The RAM code destination. Somewhere in RAM...
SECTION "RAMCODE",BSS
RAMCODE::
   ds   RAMCODE_S_End-RAMCODE_S         ; Buffer for the RAM code.
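
The RAM code has to be copied from ROM to RAM once before the routine is first used. A minimal sketch of such a copy, assuming the whole block is shorter than 256 bytes (COPY_RAMCODE is a made-up name, not part of the original code):

Code:

; Hypothetical helper: copy the RAM code from ROM to RAM at startup.
; Assumes RAMCODE_S_End-RAMCODE_S fits in an 8 bit counter.
COPY_RAMCODE::
   ld   HL,RAMCODE_S                 ; Source in ROM.
   ld   DE,RAMCODE                   ; Destination in RAM.
   ld   B,RAMCODE_S_End-RAMCODE_S    ; Number of bytes to copy.
.copy
   ld   A,[HL+]
   ld   [DE],A
   inc   DE
   dec   B
   jr   nz,.copy
   ret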

Next SP and HL are prepared from the input parameters.
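
A call site could look roughly like the sketch below. COPY_LINES and SourceBuffer are made-up names, the destination address is just an example, and writing $FF to skipline is my assumption for "skip nothing", since the skip logic is not shown here. The STAT and IE setup from earlier is assumed to be done already.

Code:

   di                        ; IME has to stay 0 while SP points at data.
   ld   A,$FF                ; Assumption: $FF never matches LY (0-153), so no line is skipped.
   ldh   [skipline],A
   ld   A,COPY_LINES         ; Hypothetical constant: number of lines (16 byte chunks) to copy.
   ldh   [linesctr],A
   ld   HL,SourceBuffer      ; Hypothetical label: start of the source data.
   ld   DE,$9000             ; Example destination in VRAM tile data.
   call   STACKCOPY_LCD      ; Comes back through the ld SP,$xxxx / ret stub in RAM.

On return, HL and DE point just past the copied source and destination data, so the routine can be called again to continue where it left off.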

The main routine consists of an unrolled loop of 8 copies of the following code, which copies two bytes:

Code:

   pop   DE               ; Main unrolled loop body. 
   ld   A,E
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A

However, the first and last iterations are slightly different so only 6 of the iterations look exactly like that.

The first iteration prepares as much data as possible before the accessible period starts to prevent wasting precious cycles. It clears IF and runs HALT in order to synchronize to HBlank. When the CPU wakes up, it writes the first byte.

The last iteration also has a small difference. It writes D to [HL] directly instead of going through A, which would cost one extra instruction (ld A,D, i.e. 4 cycles). It means HL has to be incremented afterwards, but this is ok since the increment is not timing sensitive, unlike the write.

After that, it checks whether we need to skip any lines because they have too many sprites. This logic is omitted from this example. Then it counts down linesctr and returns when all requested data has been copied. Lastly, it sets HL and DE to the source and destination addresses as they are after the copy is done.

The example code copies 16 bytes per HBlank, which requires that no sprites are shown on any line where the routine is executed. If sprites are in use, you could change rept 6 to a lower value. In my Flappy Bird clone I use rept 5, which copies 14 bytes, as mentioned.
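
One way to make that adjustable is to drive the rept count from a constant. This is only a sketch, and INNER_ITER and BYTES_PER_LINE are made-up names that are not in my actual code:

Code:

INNER_ITER      EQU   5                  ; Hypothetical: 6 = 16 bytes/line, 5 = 14 bytes/line.
BYTES_PER_LINE  EQU   (INNER_ITER+2)*2   ; The first and last iterations add 2 bytes each.

rept   INNER_ITER
   pop   DE               ; Main unrolled loop body.
   ld   A,E
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A
endr

The source data then has to be laid out in chunks of BYTES_PER_LINE bytes per line instead of 16.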

As per tepples' requirements, the routine could be adapted for use with 1 bpp tiles or OAM at a lower data rate.

Here's the clock calculation for the routine:

Code:

   halt   ; (including nop repeated due to double execution glitch.)
   ; = 8 cycles
   
   ld   [HL+],A  ;  8
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   ; = 20 cycles

   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   ; = 36 cycles (*6)

   ; Last
   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   ld   [HL],D   ;  8
   ; = 32 cycles

Not counting the halt, this gives a total of 20+6*36+32=268 cycles, 16 cycles less than the 284 cycles an HBlank+mode 2 would last without sprites. 4-8 of those cycles are used by the nop that's needed after the halt, I'm pretty sure. So this code can copy one tile per line.
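
If you change the loop, the same sum can be checked at build time. The following is a sketch with made-up constant names. It counts the full 8 cycles for the halt, which is more pessimistic than my estimate, and 284 is the no-sprite figure:

Code:

; Hypothetical constants, values taken from the cycle listing.
CYC_HALT         EQU   8      ; halt + repeated nop.
CYC_FIRST        EQU   20     ; First iteration after the halt.
CYC_INNER        EQU   36     ; One inner iteration.
CYC_LAST         EQU   32     ; Last iteration.
CYC_INNER_COUNT  EQU   6      ; Must match the rept count.
CYC_WINDOW       EQU   284    ; HBlank+mode 2 without sprites.

IF CYC_HALT+CYC_FIRST+CYC_INNER*CYC_INNER_COUNT+CYC_LAST > CYC_WINDOW
   FAIL   "Copy does not fit in the accessible window"
ENDC

With the values above the sum is 276, so the check passes. With a count of 7 it would be 312 and the build would fail.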

For the case of OAM, we should go by the most pessimistic value of HBlank, 201 cycles. (OAM is not accessible during mode 2, so only HBlank itself counts.) 201-32-20-8=141 cycles left for the inner loop part. 141/36=3 (remainder 33), so this routine could run 5 iterations and thus copy 10 bytes, or 2.5 whole entries into OAM.

For the case of 1 BPP graphics, the routine would look a bit different. Here we make a few assumptions:

The palette is set such that you only need to update one of the bytes per pixel row.
Additionally, this byte is the one at the odd address. This means we can safely use inc L to increment the destination address, because inc L will only ever be used to increment an even value, which cannot possibly carry over into the high byte. The odd addresses are instead stepped by the ld [HL+],A instruction, which does a full 16 bit increment internally.

Code:

   halt   ; (including nop repeated due to double execution glitch.)
   ; = 8 cycles

   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ; = 28 cycles

   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ; = 44 cycles

   ; Last
   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   [HL],D   ;  8
   ; = 36 cycles
   inc  L        ;  4 (outside the cycle count)

Doing the cycle calculation for 284 available cycles we get: 284-28-36-8=212 left for inner part. 212/44=4 (remainder 36 cycles). So this code could run 6 iterations, and copy 12 bytes, which corresponds to 1.5 tiles since tiles are 8 bytes big in 1 bpp format.

All these figures could be nudged ever so slightly upward, maybe 1 extra byte per loop cycle, with more controlled timings. But at that point you get diminishing returns.

So in summary:
VRAM (full copy): 1 tile/line
VRAM (1bpp expand): 1.5 tiles/line
OAM: 2.5 entries/line