Fast 2D blitting on Super FX

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Fast 2D blitting on Super FX
by on (#184043)
I've been messing with some conceptual code for high-speed drawing of untransformed 2D graphics. Each routine requires its own specialized graphics format, and the formats are not interoperable; I'm thinking I'd test for speed beforehand and use whichever routine was fastest for a given object - as long as all the code can fit in the instruction cache at the same time...

Of course, this code is probably not final. I'm still not very good at Super FX.

I had also considered using a sort of data-as-code format for large quantities of small identical objects. The entire graphic would be hardcoded and require no ROM access, metadata handling, or branching. But I haven't written anything like that yet.

I've got to say, I find the 8-bit busing to be far more aggravating in the case of the Super FX than in the case of the S-CPU. What they were trying to do with the Super FX really needed more bits per word than they had...

Code:
; SINGLE-PIXEL BLITTING (slowest and most general):

   to R12       ; pixel count goes in the LOOP index register
   getb         ; get pixel count for first line
   inc R14      ; increment ROM address, triggering a buffer load

Start:
   getc         ; get pixel data (one byte per pixel) from ROM buffer
   inc R14      ; and increment ROM address
   loop         ; decrement pixel count; if not zero, go to address in R13, ie: "Start"
   plot         ; plot pixel and increment X-counter in R1 (since the GSU is pipelined, this byte gets executed regardless)

   getb         ; get carriage return X-component (goes in R0)
   inc R14      ; increment ROM address
   with R1      ; update X-coordinate
   sub R0       ; by subtracting carriage return X-component
   inc R2       ; increment Y-coordinate
   to R12       ; update LOOP index register
   getb         ; with pixel count for next line
   inc R14      ; increment ROM address
   loop         ; decrement pixel count and branch to Start if not zero
   nop          ; dummy fill pipeline (nothing else to do before GETC, and the ROM buffer isn't ready anyway)

; The main loop has only two bytes between INC R14 and GETC, so in high-speed mode it's probably 6 cycles rather than 4.
; Blitting a sliver in 4bpp is probably at least 40 cycles, but that's still only 5 cycles per pixel, so this method is
; bottlenecked by code unless you're drawing in 8bpp.

Code:
; DUAL-PIXEL BLITTING (faster for long solid runs, slower for short runs, doesn't support gaps):

   to R12       ; pixel count goes in the LOOP index register
   getb         ; get pixel count for first line, plus two if odd
   inc R14      ; increment ROM address, triggering a buffer load
   with R12     ; operate on pixel count
SStart:
   lsr          ; turn pixel count into pixel pair count
   bcc DStart   ; if the pixel count was even, go to dual-pixel blitting
   nop          ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
   getc         ; fetch the first pixel from the ROM buffer
   inc R14      ; increment the ROM address
   loop         ; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
   plot         ; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)

   bra EndL     ; go to end of line (at this point it's been determined that the line was only one pixel long)
   getb         ; get carriage return X-component in R0 (happens after branch)
DStart:
   getc         ; get pixel pair
   inc R14      ; increment ROM address
   plot         ; plot pixel to buffer and increment X-coordinate
   loop         ; decrement pixel pair count and go to DStart if nonzero
   plot         ; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate

   getb         ; get carriage return X-component in R0
EndL:
   inc R14      ; increment ROM address
   with R1      ; update X-coordinate
   sub R0       ; with carriage return value
   inc R2       ; increment Y-coordinate
   to R12       ; refresh pixel counter
   getb         ; with next line's pixel count, plus three if odd and one if even
   inc R14      ; increment ROM address
   dec R12      ; decrement pixel count (hence the +1 for lines other than the first)
   bne SStart   ; branch to SStart if pixel count is nonzero
   with R12     ; set up for right shift of pixel count

; This one uses the dither functionality to plot two pixels per byte fetched from ROM.  Naturally this means all the
; graphics have to be duplicated in ROM so there's a version for each value of the dither bit (XOR of the X and Y
; bottom bits).  Also, since dither can't plot transparent with non-transparent (it always checks the bottom of the
; colour register for colour #0, because it's checking the dither bit at the same time and doesn't yet know which half
; to use), this method does not support gaps in a line.

Code:
; DUAL-PIXEL WITH GAPS (a bit slower than basic dual-pixel blitting, but more flexible):

   to R12       ; pixel count goes in the LOOP index register
   getb         ; get pixel count for first line, plus two if odd
   inc R14      ; increment ROM address, triggering a buffer load
   with R12     ; operate on pixel count
SStart:
   lsr          ; turn pixel count into pixel pair count
   bcc DStart   ; if the pixel count was even, go to dual-pixel blitting
   nop          ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
   getc         ; fetch the first pixel from the ROM buffer
   inc R14      ; increment the ROM address
   loop         ; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
   plot         ; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)

   bra EndL     ; go to end of line (at this point it's been determined that the line was only one pixel long)
   getb         ; get X increment in R0, shifted left and added to the Y increment bit
DStart:
   getc         ; get pixel pair
   inc R14      ; increment ROM address
   plot         ; plot pixel to buffer and increment X-coordinate
   loop         ; decrement pixel pair count and go to DStart if nonzero
   plot         ; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate

   getb         ; get X increment in R0, shifted left and added to the Y increment bit
EndL:
   inc R14      ; increment ROM address
   sex          ; ensure that negative X increments remain negative when shifted
   lsr          ; shift X increment into position, pushing the Y increment out into the carry flag
   bcs NewLine  ; if the Y increment was one, go to NewLine (duplicated code for speed)
   with R1      ; update X-coordinate
   sub R0       ; with X increment
   to R12       ; refresh pixel counter
   getb         ; with next run's pixel count, plus three if odd and one if even
   inc R14      ; increment ROM address
   dec R12      ; decrement pixel count (hence the +1 for runs other than the first)
   bne SStart   ; branch to SStart if pixel count is nonzero
   with R12     ; set up for right shift of pixel count
   bra EndBlit  ; branch past duplicated code
NewLine:
   sub R0       ; update X-coordinate with X increment
   to R12       ; refresh pixel counter
   getb         ; with next line's pixel count, plus three if odd and one if even
   inc R14      ; increment ROM address
   inc R2       ; increment Y-coordinate
   dec R12      ; decrement pixel count
   bne SStart   ; branch to SStart if pixel count is nonzero
   with R12     ; set up for right shift of pixel count
EndBlit:

; This one encodes the X-coordinate carriage return value shifted left with a Y-increment bit shoved in on the right, so
; as to allow the algorithm to jump across gaps in a line without jumping down.  This limits the size of the object
; somewhat, since there are now only 7 bits for the X-increment value, but I'm not too worried about that. I could
; encode TWO Y-increment bits this way, so as to allow vertical gaps in the object, but with what most of the graphics
; in my game look like, I doubt plotting a transparent pixel now and then is less efficient than doing a bunch of extra
; maneuvering at the end of every single run of solid pixels...
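
As a rough sketch of the end-of-run byte described above, the following Python models the GETB, SEX, LSR sequence and a matching packer. The function names are hypothetical, and it assumes that only the low 8 bits of R1 matter as the X-coordinate. Under that assumption, SEX followed by a logical LSR gives the right low byte even though LSR itself does not preserve the sign.

```python
def encode_run_end(x_step, y_bit):
    """Hypothetical packer: 7-bit signed X step in bits 7-1, Y-increment bit in bit 0."""
    assert -64 <= x_step <= 63 and y_bit in (0, 1)
    return ((x_step << 1) & 0xFE) | y_bit


def decode_run_end(byte):
    """Model of GETB, SEX, LSR on the end-of-run byte; returns (low byte of X step, Y bit)."""
    r0 = byte & 0xFF          # GETB zero-extends the ROM byte into R0
    if r0 & 0x80:             # SEX copies bit 7 into bits 8-15
        r0 |= 0xFF00
    y_bit = r0 & 1            # LSR pushes bit 0 out into the carry flag
    r0 >>= 1                  # logical shift: bit 15 becomes 0, bit 7 receives old bit 8
    return r0 & 0xFF, y_bit   # assumption: only the low byte is used for X


if __name__ == "__main__":
    print(decode_run_end(encode_run_end(-3, 1)))
```

The value returned is the two's-complement low byte, so a step of -3 comes back as 0xFD. The routine in the post then subtracts this value from R1.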

Thoughts? Have I made any obvious mistakes like misunderstanding how to use an instruction?

I suppose dumps of untested code aren't especially useful or interesting, since there's no indication of what might or might not be wrong...



EDIT: Just had an idea:

Code:
   getb
   inc R14
   color
   plot
   mult R3   ; where R3 contains 0010h
   swap
   color
   loop
   plot


Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.

On the other hand, my single-pixel blit routine is even slower, and the extra pixel this method tacks onto odd-sized lines is transparent and can't cause a sliver overflow, so it might actually be better...
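
A small Python model of the MULT + SWAP trick from the EDIT, as a sketch only. It assumes MULT is a signed 8-bit by 8-bit multiply producing a 16-bit result, that SWAP exchanges the two bytes of the register, and that COLOR only cares about the low nibble in 4bpp. The helper names are made up for illustration.

```python
def signed8(value):
    """Interpret the low byte of value as a signed 8-bit number."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def mult_swap(byte, r3=0x0010):
    """Model of 'mult R3' followed by 'swap' on a byte fetched with GETB."""
    r0 = (signed8(byte) * signed8(r3)) & 0xFFFF   # MULT: assumed signed 8x8 -> 16
    r0 = ((r0 << 8) | (r0 >> 8)) & 0xFFFF         # SWAP: exchange high and low bytes
    return r0


if __name__ == "__main__":
    for pixel_pair in (0x21, 0xAB, 0xF0):
        print(hex(pixel_pair), hex(mult_swap(pixel_pair) & 0x0F))
```

Under those assumptions, the low nibble after the swap is always the high nibble of the original byte, even for bytes of 0x80 and above where the signed multiply fills the top bits with ones.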
Re: Fast 2D blitting on Super FX
by on (#184267)
Here's a question. Does the Super FX chip really read the instruction "to" before it reads "get", or do the two get assembled into a single instruction?
Re: Fast 2D blitting on Super FX
by on (#184268)
It's two instructions. The RISC is strong with this one.
Re: Fast 2D blitting on Super FX
by on (#184269)
Z80 and HuC6280 also have prefix instructions that modify the following instruction.
Re: Fast 2D blitting on Super FX
by on (#184273)
The problem is the 8-bit ROM bus. If you want single-cycle execution, you need single-cycle load, and if you want single-cycle load you need single-word instructions. With 16 general-purpose registers and 8 bits per instruction, there's only so much you can do without prefixes, and even some unary operations aren't necessarily important enough to burn 1/16 of the opcode matrix on.

At least it defaults to using R0 if you don't specify. It's a bit like having an accumulator. (But then, I had never programmed anything in assembly that wasn't accumulator-based before this, so for all I know this sort of 'preferred' register thing is common...)

...

It's an interesting chip. It seems to have specific support for texture mapping, via the MERGE opcode - the idea is apparently that R7 and R8 are 8.8 subtexel representations of the texture indices, which can be updated easily by the rasterization code, and MERGE takes the high byte of each one and sticks them together into a single 16-bit index, which can be then used to pick a texel out of the ROM buffer.

Yes, this means that the addressable texture memory on the Super FX is 16 times as large as on the Nintendo 64...
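
To illustrate the MERGE idea, here is a minimal Python sketch. Which register supplies the high byte of the result is an assumption here (R7 is taken as the high byte), and the function name and the step values are invented for the example.

```python
def merge(r7, r8):
    """Combine the integer parts of two 8.8 fixed-point coordinates into one 16-bit index."""
    return (r7 & 0xFF00) | ((r8 >> 8) & 0x00FF)   # assumption: R7 gives the high byte


if __name__ == "__main__":
    u, v = 0x1280, 0x34C0        # 8.8 texture coordinates: 0x12.80 and 0x34.C0
    du, dv = 0x0040, 0x0080      # hypothetical per-pixel steps from the rasterizer
    for _ in range(4):
        print(hex(merge(u, v)))  # texel index for this pixel
        u = (u + du) & 0xFFFF
        v = (v + dv) & 0xFFFF
```

A 16-bit index spans 65,536 texels, which is where the factor of 16 comes from if the Nintendo 64 figure is taken as 4 KiB of texture memory.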
Re: Fast 2D blitting on Super FX
by on (#186418)
93143 wrote:
EDIT: Just had an idea:

Code:
   getb
   inc R14
   color
   plot
   mult R3   ; where R3 contains 0010h
   swap
   color
   loop
   plot


Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.


I'm not sure what your idea was, but it looks like you're trying to do what the dither flag already does in hardware (which doesn't work in 8bpp screen mode).
Code:
ibt r0, #%00010 ; \
cmode           ; / set flag 1 (enable dither) in color mode register
ibt r0, #$21    ; \
color           ; / alternate between color 1 and 2
ibt r12, #16    ; draw 16 pixels
move r13, r15   ; set loop point to next instruction
loop
plot            ; color plotted = (r1^r2)&1 ? high 4 bits : low 4 bits
Re: Fast 2D blitting on Super FX
by on (#186435)
Yeah, but according to the manual, dither doesn't handle transparent pixels properly, because it determines which nibble to use in parallel with checking whether the bottom nibble is zero.

I want to be able to blit objects with gaps and/or odd numbers of pixels in a run without accidentally erasing part of what's underneath them. At the same time, I want to keep the RAM buffer more or less saturated, which should be easier if I can pull two pixels with a single ROM buffer cycle. Two of the three methods I posted (not counting the snippet you quoted) use dither for this, but with safeguards to prevent zero overwrites and missed pixels.

(There's admittedly not a lot of context for those methods; just assume that dither has been turned on for the "dual-pixel" ones...)
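
The dither behaviour being discussed can be modelled roughly like this in Python. This is a sketch of the behaviour as described in this thread, not a verified hardware model: it assumes 4bpp, the transparency flag left at its default, and a transparency test that always looks at the low nibble of the colour register. The function name is hypothetical.

```python
def plot_colour(colr, x, y, dither=True):
    """Return the colour index written at (x, y), or None if the plot is skipped."""
    if (colr & 0x0F) == 0:            # transparency test only sees the low nibble
        return None
    if dither and ((x ^ y) & 1):      # dither bit: XOR of the bottom bits of X and Y
        return (colr >> 4) & 0x0F     # high nibble selected
    return colr & 0x0F                # low nibble selected


if __name__ == "__main__":
    print(plot_colour(0x21, 0, 0), plot_colour(0x21, 1, 0))   # both nibbles solid
    print(plot_colour(0x20, 1, 0))                            # skipped although high nibble is solid
    print(plot_colour(0x02, 1, 0))                            # writes colour 0 over the background
```

The last two cases are the ones the safeguards are meant to avoid: a solid pixel that is never drawn, and a transparent pixel that overwrites what is underneath.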

...

I've just noticed something, and tried it out. Bit 2 of the plot mode register is supposed to switch between lower and upper nibbles for COLOR or GETC, but...

Code:
   getb
   inc R14
   from R3
   cmode   ; set bit 2 to 0
   color
   plot
   from R4
   cmode   ; set bit 2 to 1
   color
   loop
   plot

That's way slower than the one where I used multiplication to do the bit shift. Heck, it's slower than just using LSR four times. What have I missed?

...I guess it was intended for accessing compressed source graphics for transforms or texture mapping, not speeding up 1:1 pixel copying. In that context, you can just use GETC instead of GETB followed by COLOR. But it still seems hardly worth it compared to MULT+SWAP, considering it needs to be done every time instead of just on even-numbered texels...

It would have been nice to have an instruction to directly flip the nibbles in the color register.

(Also, it turns out there is no ASL instruction. They decided to burn an opcode on a sign-preserving ASR instead...)
Re: Fast 2D blitting on Super FX
by on (#186779)
93143 wrote:
Yeah, but according to the manual, dither doesn't handle transparent pixels properly, because it determines which nibble to use in parallel with checking whether the bottom nibble is zero.

That's right. It reminds me of a test ROM that seems to indicate emulators get this wrong, which is understandable assuming no licensed game uses this incorrectly.
http://imgur.com/a/FZCgX (higan 101 has the old behaviour)
The white blocks behind the background are sprites, black is cgram entry 0.
I don't have anything to test on so I can't confirm that the patched version is the correct behaviour.

Quote:
(Also, it turns out there is no ASL instruction. They decided to burn an opcode on a sign-preserving ASR instead...)

Due to the lack of a barrel shifter, asl/lsl is basically just add, except it affects the overflow flag.
Code:
add r0; add r0; add r0 // r0 << 3
with r2; add r2 // r2 << 1
to r1; from r3; add r3 // r1 = r3 << 1
Re: Fast 2D blitting on Super FX
by on (#189780)
Okay, now I'm annoyed.

I should have paid more attention to the calculation method for plot addresses. Apparently you can't plot off the edges of the screen, even though the screen is less than 256 pixels high. Nor does it wrap intelligently - if you plot a pixel at Y=192 in 192-line mode, it ends up on line 0 in the next column over. If you plot a pixel at Y = -1 (ie: 255), it ends up on line 63.
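
The wrapping described above can be reproduced with a small Python model. It assumes the 192-line mode lays characters out column by column, 24 characters per column, with the character number computed from the 8-bit X and Y as shown. The function name is hypothetical.

```python
def plot_lands_at(x, y, chars_per_column=24):
    """Return the on-screen (x, y) where a plot at (x, y) ends up in 192-line mode."""
    x &= 0xFF
    y &= 0xFF
    char = (x >> 3) * chars_per_column + (y >> 3)      # assumed character number formula
    column, row = divmod(char, chars_per_column)       # where that character is displayed
    return column * 8 + (x & 7), row * 8 + (y & 7)


if __name__ == "__main__":
    print(plot_lands_at(16, 100))   # on-screen pixel stays put
    print(plot_lands_at(16, 192))   # one line past the bottom
    print(plot_lands_at(16, 255))   # Y = -1
```

With this formula, Y=192 lands on line 0 of the next column and Y=255 lands on line 63 of the next column, matching what was observed.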

Now I have to choose between drawing to sprite tables all the time (and having to rearrange the data before downloading to VRAM) and putting checks before and/or in the main drawing loop to handle partially offscreen bullets. At least the X coordinate works fine, since I'm not using the full width...

ARM9 wrote:
emulators get this wrong

Well, it's a good thing I checked, then...

Quote:
Due to the lack of a barrel shifter, asl/lsl is basically just add

That is an excellent point. I'm not used to being able to add a register to itself. (At least, I wasn't before I marathoned a nontrivial Super FX program this past weekend - writing 65816 code felt weird after that...)

Speaking of shifting, I found that using a table of random numbers was (in my application, which required simultaneous access to a sine table in a different bank) not dramatically faster than just running xorshift16:

Code:
   move R0, R1   ; copy random number into accumulator
   add R0        ; shift left 4 bits
   add R0
   add R0
   add R0
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   lsr           ; shift right 3 bits
   lsr
   lsr
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   add R0        ; shift left 7 bits
   add R0
   add R0
   add R0
   add R0
   add R0
   add R0
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1

Actually, my initial attempt at a table of rands was slower than the above algorithm, but I'm getting better at this quickly...

Fun fact: in high-speed mode, this PRNG executes in the same amount of time it takes an S-CPU in FastROM to load a 16-bit number from direct page...
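
For reference, the routine above corresponds to the following Python model, written as a sketch with 16-bit masking added by hand (the function name is made up). The shift amounts 4, 3 and 7 are read directly off the ADD and LSR counts in the listing.

```python
def xorshift16_4_3_7(x):
    """One step of the PRNG above: x ^= x << 4; x ^= x >> 3; x ^= x << 7 (16-bit)."""
    x ^= (x << 4) & 0xFFFF    # four ADD R0, then XOR R1
    x ^= x >> 3               # three LSR, then XOR R1
    x ^= (x << 7) & 0xFFFF    # seven ADD R0, then XOR R1
    return x


if __name__ == "__main__":
    state = 1
    for _ in range(4):
        state = xorshift16_4_3_7(state)
        print(hex(state))
```

As with any xorshift generator, a state of zero stays zero, so the seed has to be nonzero.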
Re: Fast 2D blitting on Super FX
by on (#189869)
Okay, yeah, I think I can do better than that:

Code:
   move R0, R1   ; copy random number into accumulator
   add R0        ; shift left 5 bits
   add R0
   add R0
   add R0
   add R0
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   hib           ; shift right 9 bits
   lsr
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   lob           ; shift left 8 bits
   swap
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1

23 cycles instead of 28. Nipping at the heels of a reasonable table load algorithm (repeated bankswitching is surprisingly expensive if you don't have registers to burn, and of course memory access itself is painful at five cycles per byte)...
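
A Python sketch of why the byte instructions can stand in for the long shift chains here. The helpers model HIB, LOB and SWAP as commonly described (high byte, low byte, byte exchange); the function names are hypothetical.

```python
def hib(value):
    """HIB: high byte of the register, zero-extended."""
    return (value >> 8) & 0xFF


def lob(value):
    """LOB: low byte of the register, zero-extended."""
    return value & 0xFF


def swap(value):
    """SWAP: exchange the high and low bytes."""
    return ((value << 8) | (value >> 8)) & 0xFFFF


def xorshift16_5_9_8(x):
    """One step of the revised PRNG: shifts of 5 left, 9 right, 8 left."""
    x ^= (x << 5) & 0xFFFF    # five ADD R0, then XOR R1
    x ^= hib(x) >> 1          # HIB + LSR is a right shift by 9
    x ^= swap(lob(x))         # LOB + SWAP is a left shift by 8
    return x


if __name__ == "__main__":
    print(hex(xorshift16_5_9_8(1)))
```

Note that the shift amounts differ from the first version (5, 9, 8 instead of 4, 3, 7), so the two routines produce different sequences.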

...

If my math is correct, which it may very well not be, my bullet drawing loop seems to be maximally inefficient. I added up the cycles it would take to run if it wasn't held up by RAM buffer wait states, and the difference between that and the number of cycles it's actually taking seems to be roughly equal to the average number of cycles it should take to flush the pixel caches for all of the necessary sliver blit operations. In other words, there seems to be no parallel processing advantage showing up at all.

I'm hoping I did something stupid somewhere that's eating a ton of cycles to no purpose...
Re: Fast 2D blitting on Super FX
by on (#190917)
Dodge this.

Attachment:
FXtest1.sfc [128 KiB]

640 bullets at 60 fps in 224x192. The screen flashes red if it drops a frame, but you shouldn't ever see that happen. With 656 bullets I get the occasional flash, but 640 runs perfectly in higan for over two minutes, which is longer than the period of the PRNG.
Re: Fast 2D blitting on Super FX
by on (#190938)
Is this using 2bpp 8x8 bullets? Damn, 640 is a lot.
Re: Fast 2D blitting on Super FX
by on (#191009)
Yes, it's 2bpp. The bullets are actually 6x7, which is closer to the size (and shape, with the SNES PAR) of this type of bullet in the original game, and is moreover slightly quicker to draw.

I couldn't do 640 bullets under these constraints with a dither-based general-purpose drawing routine; there was too much overhead, and it topped out near 500. To get to 640 I had to forget about loading from the ROM buffer and just unroll the whole bullet in code. All the parts that need to be fast still fit in the cache, and my bullet list format allows multiple lists with dedicated handling loops to exist under a common bullet cap, so I think it's a legitimate approach for a number of pattern types.

I also didn't bother checking for collision with the player, but I imagine it wouldn't be all that onerous as I can simply leave the player's position in a pair of registers without adding more than a couple of cycles to the drawing code. Actually checking for collision shouldn't take as long as pulling the position from RAM would...

The actual game will have a 144-pixel-wide playfield, which gains me about 50,000 cycles, or roughly 25% extra compute time per frame, since I won't need to spend so much time waiting for DMA and clearing the framebuffer. And a lot of the bullet patterns need to be 4bpp and hence 30 fps, which should allow me to get much closer to the theoretical pixel buffer flush time, particularly with larger bullets. For really big ones, if the background isn't Mode 7 I can reserve part of OAM for the GSU and just use real sprites...

...

Just noting here that I was wrong about "maximally inefficient". I just didn't count the cycles carefully enough. There is a significant parallel processing bonus; it's just not as large as I was hoping, probably because the lines I'm blitting are so short...
Re: Fast 2D blitting on Super FX
by on (#191069)
Wow, that is really impressive :)
I think it's a kind of bullet benchmark... I guess you are using the SFX to draw the bullets; I wonder how much you could do without the SFX :) Something I wonder: do you have any kind of double buffering with the SFX memory, so you can use one bank to work with while you are DMAing the other bank to VRAM? If that is the case, you can really maximize usage of the SFX chip.
Re: Fast 2D blitting on Super FX
by on (#191113)
I did a demo like this a while ago and I got 256 bullets at 30fps, though I had to make the bullets really small like 5x5.
Re: Fast 2D blitting on Super FX
by on (#191135)
Stef wrote:
Wow, that is really impressive :)

Thanks!

Quote:
Something I wonder: do you have any kind of double buffering with the SFX memory?

Yes. The SNES can change the screen base register after the frame has been drawn, which allows the Super FX to start work on the next frame before the current one has been fully transferred. This can actually happen before the first available VBlank, as the Super FX's stop instruction issues an interrupt to the SNES by default (you can mask it if you want). The SNES can also temporarily suspend the Super FX's access to Game Pak RAM by changing a flag, which puts the Super FX in a wait state the next time it needs RAM access; this allows the SNES to proceed with the transfer during VBlank without having to forcibly kill the Super FX program.

I'm hoping to be able to keep lag to a minimum when doing 4bpp by drawing one half of the playfield first (possibly the bottom half, because it's likely to be faster) and copying it to VRAM before the other half is finished. Hopefully this won't result in glaring priority issues near the seam... actually, now that I think of it, I could use the anti-wrapping method in FXtest1 to prevent that...

psycopathicteen wrote:
I did a demo like this a while ago and I got 256 bullets at 30fps, though I had to make the bullets really small like 5x5.

For the convenience of the reader: viewtopic.php?f=12&t=13834&start=45#p164693

Impressive work. (Also gave me an idea for transferring 2bpp graphics into a sprite table - rendering into the source format is going to be interesting, but the transfer itself works beautifully...)
Re: Fast 2D blitting on Super FX
by on (#191204)
Oh yeah, I remember that 256-bullet demo at 30 FPS on a stock SNES, very impressive as well ;)
You are both obsessed with bullet hell shooters X'D

Quote:
Yes. The SNES can change the screen base register after the frame has been drawn, which allows the Super FX to start work on the next frame before the current one has been fully transferred. This can actually happen before the first available VBlank, as the Super FX's stop instruction issues an interrupt to the SNES by default (you can mask it if you want). The SNES can also temporarily suspend the Super FX's access to Game Pak RAM by changing a flag, which puts the Super FX in a wait state the next time it needs RAM access; this allows the SNES to proceed with the transfer during VBlank without having to forcibly kill the Super FX program


That is cool, so the SFX can be used at almost 100% if you use the double buffering cleverly :)
I wanted to do that sort of software sprite rendering (mostly for bullets) on the Megadrive, as sprite multiplexing unfortunately doesn't work the way I expected. You cannot use 2bpp rendering on the Megadrive (or, to be more precise, you can do it but it won't bring any speed improvement in that case), so I have to use classic 4bpp rendering. To be honest, given the code done by psycopathicteen, I believe it will be quite difficult to match the same performance level on the MD using 4bpp rendering. That kind of code is perfectly adapted to the 65816: it uses the fast disp+indexed addressing mode and fast immediate ops, and it also takes advantage of the 16-bit memory operations allowed by the 65816.
Re: Fast 2D blitting on Super FX
by on (#191235)
Very cool tech demo, great to see more people programming the superfx!

If you were to target PAL you could do 4bpp at 50fps.

One problem with double buffering is that no cart has more than 64K ram, so if someone were to do double buffering at 8bpp they'd either have to gut the resolution or put 128K on donor carts. I suppose you could do the latter regardless as some sort of weak, makeshift copy protection.
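
The arithmetic behind the 64K concern, as a quick Python sketch. The function name is made up, and the calculation only counts the framebuffers themselves; anything else the program keeps in Game Pak RAM would come on top.

```python
def framebuffer_bytes(width, height, bpp):
    """Size of one bitmap framebuffer in bytes."""
    return width * height * bpp // 8


if __name__ == "__main__":
    ram = 64 * 1024
    for bpp in (2, 4, 8):
        double = 2 * framebuffer_bytes(224, 192, bpp)
        print(bpp, "bpp:", double, "bytes double-buffered, fits:", double <= ram)
```

At the demo's 224x192 resolution, two 4bpp buffers come to 43,008 bytes, while two 8bpp buffers come to 86,016 bytes and no longer fit in 64 KiB.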
Re: Fast 2D blitting on Super FX
by on (#191322)
ARM9 wrote:
Very cool tech demo, great to see more people programming the superfx!

Thanks! Hopefully this is just the beginning...

Quote:
If you were to target PAL you could do 4bpp at 50fps.

That's true. The demo resolution of 224x192 doesn't seem to quite fit, but a slightly smaller screen would. In fact, at 2bpp there seems to be enough room for the whole screen (256x224), as long as you don't use overscan...

In my case, though, this demo is just an algorithm test/training exercise. I'm attempting a faithful port of an existing game, and while most of it is indeed too colourful for 2bpp, large chunks of it are too busy for the Super FX to maintain 50 fps at 4bpp. I have serious doubts about holding 30. Unless I've grossly overestimated the rendering load, the extra DMA bandwidth would be an embarrassment of riches for the most part.

Besides, I'm Canadian and the game is Japanese...

Quote:
One problem with double buffering is that no cart has more than 64K ram, so if someone were to do double buffering at 8bpp they'd either have to gut the resolution or put 128K on donor carts. I suppose you could do the latter regardless as some sort of weak, makeshift copy protection.

I may not end up needing 128 KB of GPRAM (the only 8bpp rendering I've encountered so far is a fairly small chunk of the title screen), but I do need CPU ROM, which no existing Super FX cart has any of... I don't think the usual emulators can even load CPU ROM - not that it matters, as due to mid-scanline shenanigans nothing below higan v095 accuracy can so much as run my display engine, so I suppose I can just use a manifest...