niiiiice...
Truth be told, after cranking this out in a week and change, my brain doesn't want to jump right back into 65xx ASM just yet. It would much rather finish up Star Ocean 3 and possibly start roughing out a new C++ game.
The block preview mode in your TG codebase is a nice idea; I might rig something similar in a later version of mine. On the whole, though, I'm certain my code is slower. Case in point: my 16-bit multiply.
Code:
mul16:
        stx mul16_xcache
        lda mul16Flag1
        ora mul16Flag2
        and #%00000010          ; bit 1 = overflow flag
        beq mul16_no_overflow
        sta mul16Flag2          ; what? You want I should preserve the negative flag on a junk call?
mul16_no_overflow:
        ; basic shift-and-add method:
        ; keep halving mul*1 and popping bits off mul*2;
        ; if the bit popped off mul*2 is a 1, add the remaining mul*1 to the result
        lda #0
        sta lsh16Flag
        sta rsh16Flag
        sta add16Flag1
        sta add16Flag2
        sta add16Hi2
        sta add16Lo2
        lda mul16Hi1
        sta rsh16Hi
        lda mul16Lo1
        sta rsh16Lo
        lda mul16Hi2
        sta lsh16Hi
        lda mul16Lo2
        sta lsh16Lo
        clc
        jsr rsh16               ; since the highest power place in mul*2 is 1/2
        ldx #0
mul16_loop:
        jsr lsh16               ; which pops the shifted-out bit into carry
        bcc mul16_loop_no_add   ; so we can act on it right away
        lda rsh16Hi
        sta add16Hi1
        lda rsh16Lo
        sta add16Lo1
        jsr add16
mul16_loop_no_add:
        jsr rsh16
        inx
        cpx #15                 ; after 15 in-loop rshs (16 shifts total), the rsh input is guaranteed to be 0
        bne mul16_loop
        ; visual break to bookend the loop
        lda add16Hi2
        sta mul16Hi2
        lda add16Lo2
        sta mul16Lo2
        lda mul16Flag1
        eor mul16Flag2
        sta mul16Flag2          ; safe: the product of two fractions can't overflow, so only the sign flags might be unequal, producing a negative
        ldx mul16_xcache
        rts
        ; for reference: the above math is done on non-two's-complement 16-bit values,
        ; highest place 1/2, lowest 1/64K, plus a flag byte of six unused bits
        ; followed by an overflow flag (bit 1) and a negative flag (bit 0)
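If it helps to see the idea outside of asm: here's roughly what the routine computes, sketched in C. The function name and types are mine, not actual symbols from the code.

```c
#include <stdint.h>

/* Shift-and-add multiply of two 16-bit fractions (value = raw / 65536),
 * mirroring the routine above: keep halving one operand while popping
 * bits off the top of the other, adding whenever a popped bit is 1. */
static uint16_t mul16_frac(uint16_t a, uint16_t b)
{
    uint16_t result = 0;
    a >>= 1;                       /* highest place in b is worth 1/2 */
    for (int i = 0; i < 15; i++) { /* a 16th pass would only add 0 */
        if (b & 0x8000)            /* pop b's top bit ("into carry") */
            result += a;
        b <<= 1;
        a >>= 1;
    }
    return result;
}
```

So mul16_frac(0x8000, 0x8000) is 0.5 × 0.5 and comes back as 0x4000, i.e. 0.25. The low bits truncated off a each shift are simply lost, which is the usual price of shift-and-add at fixed width.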
Kinda important for Mandelbrot, and yet every call probably spends more cycles shuffling values between the inputs of my various other routines than doing actual computation. That, and I use straight-up shift-and-add. I'd have to spend an hour parsing your innermost multiply code before it made total sense, but it looks like you take some shortcuts at the higher levels. I almost used 32-bit precision, but by the time I'd wrapped my head back around the math enough to see how easy it was, I was too lazy to go change all my zero-page allocation for more subroutine input bytes.
I also have a nice little restraining order in there called itersPerNMI, which I've set quite low indeed for the sake of the music. Come to think of it, I should reset my counter in the NMI routine rather than in the Mandelbrot loop, since that's not the only place I ever waitNMI... *changes code* ...great. Now it chugs even more.
I could just dec the counter's address rather than dey and reload Y, which would catch what are probably NMIs landing just before I'd wait for one, but that would cost 5 cycles in my inner loop as opposed to 2... bleh. Clearly more work is needed.
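For anyone not following the asm-speak, the throttle is roughly this shape once the reset lives in the NMI routine. The names and the budget value here are illustrative, not my actual symbols.

```c
#include <stdint.h>

#define ITERS_PER_NMI 8          /* illustrative; the real value is set quite low */

static volatile uint8_t itersLeft = ITERS_PER_NMI;

/* NMI handler: service the music, then refill the budget here, so an NMI
 * that lands just before a waitNMI can't leave a stale count behind. */
static void on_nmi(void)
{
    /* music_update() would go here */
    itersLeft = ITERS_PER_NMI;
}

/* Called once per Mandelbrot iteration: spend budget, and tell the
 * caller whether it's time to waitNMI. */
static int iteration_budget_spent(void)
{
    if (itersLeft == 0)
        return 1;                /* out of budget: go waitNMI */
    itersLeft--;
    return 0;
}
```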
When I allow as much frameskip as is needed to crunch out an entire tile before actively waiting, iirc it runs a good deal faster. But the music hiccoughs something fierce.
Is that an iso I see with FractalEngine? Meaning I could run it on my actual TurboDuo? Crazy talk!
I'll have to get by on Nestopia and good sense for mine unless I can scrounge a dev cart and/or EEPROM burner.
edit: new version uploading as I type. My iterations-per-frame counting was way off, so my wait calls were eating a lot of time. I decided to nix the whole iteration-counting deal and instead just let frameskips happen, updating the music as needed. The result cuts runtime to 75% of what it used to be.
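The new scheme, in the same C-ish shorthand (again, every name here is made up for illustration): count NMIs as they happen and catch the music up by that many frames after each tile, instead of burning wait calls mid-tile.

```c
#include <stdint.h>

static volatile uint8_t nmisPending = 0; /* bumped by the NMI handler */
static unsigned musicFrames = 0;         /* how many music ticks have run */

static void on_nmi(void)
{
    nmisPending++;                       /* just count; no waiting mid-tile */
}

static void music_update(void)
{
    musicFrames++;                       /* stand-in for one driver tick */
}

/* After crunching a whole tile flat out, run the music driver once per
 * frame that slipped by, so frameskip no longer starves it. */
static void catch_up_music(void)
{
    while (nmisPending) {
        nmisPending--;
        music_update();
    }
}
```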