16-bit software delay routine

by Bisqwit on 2012-03-28 (#91902)

This routine delays for a number of cycles specified at run time, plus a fixed overhead of 33 cycles. The overhead includes the cycles taken by the JSR and RTS.
Pass the number of cycles to delay in A:X, with the high 8 bits in A and the low 8 bits in X.
Requires no absolute jumps / relocations. Preserves X,Y. The routine must be placed so that none of its branches cross a page boundary, because a taken branch that crosses a page costs one extra cycle. Written for CA65.

Code:

; Delays A:X clocks+overhead
; Time: 256*A+X+33 clocks (including JSR)
; Clobbers A. Preserves X,Y.
delay_256a_x_33_clocks:
        cmp #1                  ; +2; 2 cycles overhead
        bcs @do256              ; +2; 4 cycles overhead
        ; 0-255 cycles remain, overhead = 4
        txa                     ; +2; 6
        ;;;;;;;;;;;;;;;;
        ; 15 + JSR + RTS overhead for the code below. JSR=6, RTS=6. Total: 27. 6+27=33
        ;          ;    Cycles        Accumulator     Carry flag
        ;          ; 0  1  2  3  4       (hex)        0 1 2 3 4 
        sec        ; 0  0  0  0  0   00 01 02 03 04   1 1 1 1 1 
:       sbc #5     ; 2  2  2  2  2   FB FC FD FE FF   0 0 0 0 0
        bcs :-     ; 4  4  4  4  4   FB FC FD FE FF   0 0 0 0 0
        lsr a      ; 6  6  6  6  6   7D 7E 7E 7F 7F   1 0 1 0 1
        bcc :+     ; 8  8  8  8  8   7D 7E 7E 7F 7F   1 0 1 0 1
:       sbc #$7E   ;10 11 10 11 10   FF FF 00 00 01   0 0 1 1 1
        bcc :+     ;12 13 12 13 12   FF FF 00 00 01   0 0 1 1 1
        beq :+     ;      14 15 14         00 00 01       1 1 1
        bne :+     ;            16               01           1
:       rts        ;15 16 17 18 19   This loop from http://6502org.wikidot.com/software-delay
@do256: ; do 256 cycles.        ; 5 cycles done so far. C is set from CMP
        sbc #1                  ; 2 cycles
        pha                     ; 3 cycles
         lda #(34*2-1)          ; 2 cycles
         ;                      ;12 cycles done so far
:        sec                    ; 2 cycles  (sec is only needed
         sbc #2                 ; 2 cycles   to make loop 7 cycles)
         bcs :-                 ; 3 cycles for taken branch
         ;                      ; -1 cycles for untaken branch
         ;12 + 34*7 - 1 = 249 done so far, 7 missing
        pla                        ; 4 cycles
        bcc delay_256a_x_33_clocks ; 3 cycles ; C is unset from SBC
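
As a usage sketch (not part of the original post): when the total delay is known at assembly time, the caller can subtract the 33-cycle overhead and split the remainder with ca65's `>` (high byte) and `<` (low byte) operators. The constant name DELAY_TOTAL and its value are made up for illustration. The two immediate loads cost 2 cycles each on top of the delay, since the 33 only covers JSR through RTS.

```asm
; Hypothetical caller: spend DELAY_TOTAL cycles from the JSR through the RTS
; of delay_256a_x_33_clocks. Assumes no branch in the routine crosses a page.
DELAY_TOTAL = 1000                      ; assumed value; valid range is 33..65568

        lda #>(DELAY_TOTAL - 33)        ; high 8 bits in A (+2 cycles, not part of the delay)
        ldx #<(DELAY_TOTAL - 33)        ; low 8 bits in X  (+2 cycles, not part of the delay)
        jsr delay_256a_x_33_clocks      ; 256*A + X + 33 = DELAY_TOTAL cycles
        ; A is clobbered on return; X and Y are unchanged.
```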

I could not find a 16-bit routine like this online or in Blargg's library, so I wrote my own. The sub-256 cycle part is copied from the 6502org wiki. Blargg's library has a sub-256 cycle routine too, but entering it would require a jump, so this version ends up having a smaller total overhead.

Here is a version with the roles of A and X reversed: X holds the high 8 bits and A holds the low 8 bits. X is zero on return, Y is preserved. It reuses the sub-256 cycle delay routine from Blargg's library (which can be entered separately). The overhead is 30 cycles.

Code:

; Delays X:A clocks+overhead
; Time: 256*X+A+30 clocks (including JSR)
; Clobbers A,X. Preserves Y.
delay_256x_a_30_clocks:
        cpx #0                  ; +2
        beq delay_a_25_clocks   ; +3  (25+5 = 30 cycles overhead)
@do256: ; do 256 cycles. 4 cycles so far. Loop is 1+2+4+4+2+2 = 15 bytes.
        pha             ; +3
         lda #(256-42)  ; +2
         ;              ; 9 cycles done so far. Carry is set from CPX
:        adc #1         ; +2
         bne :-         ; +3 for taken branch
                        ; -1 for untaken branch
:        adc #(256/6)   ; 2 cycles
         bcc :-         ; +3 for taken branch
                        ; -1 for untaken branch
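         ; First loop runs 41 times (the carry from CPX makes the first ADC add 2),
         ; second loop runs 7 times (it starts from A=0 with carry set).
         ; 9 + 41*5-1 + 7*5-1 = 247 done so far; 9 missing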
        pla             ; +4
        dex             ; +2
        bcs delay_256x_a_30_clocks ; +3. Carry is set from ADC
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Preserved: X, Y
; Time: A+25 clocks (including JSR)
:       sbc #7          ; carry set by CMP
delay_a_25_clocks:
        cmp #7
        bcs :-          ; do multiples of 7
        lsr a           ; bit 0
        bcs :+
:                       ; A=clocks/2, either 0,1,2,3
        beq @zero       ; 0: 5
        lsr a
        beq :+          ; 1: 7
        bcc :+          ; 2: 9
@zero:  bne :+          ; 3: 11
:       rts             ; (thanks to dclxvi for the algorithm)
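
A possible way to call this variant with a delay computed at run time is sketched below. The variable names delay_lo and delay_hi are hypothetical, and the segment names "ZEROPAGE" and "CODE" are assumptions about the project's linker configuration. The caller is expected to have stored (desired cycles - 30) in the two bytes beforehand.

```asm
; Hypothetical zero-page variables holding (desired cycles - 30).
.segment "ZEROPAGE"
delay_lo:       .res 1                  ; low 8 bits of the cycle count
delay_hi:       .res 1                  ; high 8 bits of the cycle count

.segment "CODE"
        ldx delay_hi                    ; high 8 bits in X (+3 cycles, zero-page load)
        lda delay_lo                    ; low 8 bits in A  (+3 cycles, zero-page load)
        jsr delay_256x_a_30_clocks      ; 256*X + A + 30 cycles
        ; A is clobbered and X is zero on return; Y is unchanged.
```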

If relocations are not a problem, then the routines can be replaced with these, respectively:

Code:

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A:X clocks+overhead
; Clobbers A. Preserves X,Y. Has relocations.
; Time: 256*A+X+33 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
:       ; do 256 cycles.        ; 5 cycles done so far. Loop is 2+1+ 2+3+ 1 = 9 bytes.
        sbc #1                  ; 2 cycles - Carry was set from cmp
        pha                     ; 3 cycles
         lda #(256-25-10-2-4)   ; +2
         jsr delay_a_25_clocks
        pla                     ; 4 cycles
delay_256a_x_33_clocks:
        cmp #1                  ; +2; 2 cycles overhead
        bcs :-                  ; +2; 4 cycles overhead
        ; 0-255 cycles remain, overhead = 4
        txa                     ; +2; 6; +27 = 33
        ; 15 + JSR + RTS overhead for the code below. JSR=6, RTS=6. 15+12=27
        ;          ;    Cycles        Accumulator     Carry flag
        ;          ; 0  1  2  3  4       (hex)        0 1 2 3 4 
        sec        ; 0  0  0  0  0   00 01 02 03 04   1 1 1 1 1 
:       sbc #5     ; 2  2  2  2  2   FB FC FD FE FF   0 0 0 0 0
        bcs :-     ; 4  4  4  4  4   FB FC FD FE FF   0 0 0 0 0
        lsr a      ; 6  6  6  6  6   7D 7E 7E 7F 7F   1 0 1 0 1
        bcc :+     ; 8  8  8  8  8   7D 7E 7E 7F 7F   1 0 1 0 1
:       sbc #$7E   ;10 11 10 11 10   FF FF 00 00 01   0 0 1 1 1
        bcc :+     ;12 13 12 13 12   FF FF 00 00 01   0 0 1 1 1
        beq :+     ;      14 15 14         00 00 01       1 1 1
        bne :+     ;            16               01           1
:       rts        ;15 16 17 18 19   (thanks to dclxvi for the algorithm)
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays X:A clocks+overhead
; Clobbers A,X. Preserves Y. Has relocations.
; Time: 256*X+A+30 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256x_a_30_clocks:
        cpx #0                  ; +2
        beq delay_a_25_clocks   ; +3  (25+5 = 30 cycles overhead)
        ; do 256 cycles.        ;  4 cycles so far. Loop is 1+1+ 2+3+ 1+3 = 11 bytes.
        dex                     ;  2 cycles
        pha                     ;  3 cycles
         lda #(256-25-9-2-7)    ; +2
         jsr delay_a_25_clocks
        pla                        ; 4
        jmp delay_256x_a_30_clocks ; 3.
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Preserved: X, Y
; Time: A+25 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
:       sbc #7          ; carry set by CMP
delay_a_25_clocks:
        cmp #7
        bcs :-          ; do multiples of 7
        ;               ; Cycles          Accumulator            Carry           Zero
        lsr a           ; 0 0 0 0 0 0 0   00 01 02 03 04 05 06   0 0 0 0 0 0 0   ? ? ? ? ? ? ? 
        bcs :+          ; 2 2 2 2 2 2 2   00 00 01 01 02 02 03   0 1 0 1 0 1 0   1 1 0 0 0 0 0 
:       beq @zero       ; 4 5 4 5 4 5 4   00 00 01 01 02 02 03   0 1 0 1 0 1 0   1 1 0 0 0 0 0
        lsr a           ; : : 6 7 6 7 6   :: :: 01 01 02 02 03   : : 0 1 0 1 0   : : 0 0 0 0 0 
        beq :+          ; : : 8 9 8 9 8   :: :: 00 00 01 01 01   : : 1 1 0 0 1   : : 1 1 0 0 0 
        bcc :+          ; : : : : A B A   :: :: :: :: 01 01 01   : : : : 0 0 1   : : : : 0 0 0
@zero:  bne :+          ; 7 8 : : : : C   00 00 :: :: :: :: 01   0 1 : : : : 1   1 1 : : : : 0
:       rts             ; 9 A B C D E F   (thanks to dclxvi for the algorithm)
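
All cycle counts above assume that no branch crosses a page boundary. One way to have the toolchain check this is sketched below, using ca65's `.assert` directive. The labels delay_begin and delay_end are hypothetical and would be placed around whichever routine is used, with delay_begin before its first instruction, including any leading unnamed label. The check is conservative: it requires the whole routine, plus the address just past it, to lie in one page.

```asm
; Sketch: fail the build if the routine does not fit inside one 256-byte page.
; delay_begin / delay_end are hypothetical labels, not from the original post.
delay_begin:
        ; ... place one of the delay routines above here ...
delay_end:
        ; Compare the high bytes (page numbers) of the two addresses.
        ; If the addresses are not known at assembly time, ca65 is expected
        ; to hand the check over to the linker.
        .assert (>delay_begin) = (>delay_end), error, "delay routine crosses a page boundary"
```

If the assertion fails, moving the routine or aligning it (for example with `.align`, which also needs matching alignment on the segment in the linker configuration) should fix the timing.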

Re: 16-bit software delay routine
by Bisqwit on 2016-03-26 (#166874)

I have added this delay code to the wiki: http://wiki.nesdev.com/w/index.php/Delay_code

I also created a page that contains the shortest possible code sequences for delaying a constant number of cycles under various constraints: http://wiki.nesdev.com/w/index.php/Fixed_cycle_delay

Re: 16-bit software delay routine
by thefox on 2016-03-26 (#166880)

Nice.

Re: 16-bit software delay routine
by thefox on 2016-04-02 (#167483)

What is the license for the code snippets? We don't (apparently) have a forced license for wiki additions, so it might be a good idea to state it explicitly on the corresponding wiki page.

Re: 16-bit software delay routine
by Bisqwit on 2016-04-02 (#167486)

I suppose the license is MIT.

I am working on an updated version of the archive for all variants of 0-20000 cycles delay. It should be finished in a few days.
It is considered to be under the same license.

Re: 16-bit software delay routine
by Bisqwit on 2016-04-13 (#168329)

It will be available on this page:
http://bisqwit.iki.fi/utils/nesdelay.php