I've been messing with some conceptual code for high-speed drawing of untransformed 2D graphics on the Super FX. Each routine requires its own specialized graphics format, and the formats are not interoperable; my plan is to test for speed beforehand and use whichever routine is fastest for a given object, as long as all the code can fit in the instruction cache at the same time...
Of course, this code is probably not final. I'm still not very good at Super FX.
I had also considered using a sort of data-as-code format for large quantities of small identical objects. The entire graphic would be hardcoded and require no ROM access, metadata handling, or branching. But I haven't written anything like that yet.
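To make the data-as-code idea a bit more concrete, here is a rough Python sketch of a host-side generator that turns a small graphic into straight-line GSU code. This is only one guess at what such a format could look like, not something from the routines in this post: the function name, the use of R4 as a "left edge" register, and the exact instruction sequence are all assumptions for illustration.

```python
def emit_hardcoded_sprite(rows, left_x_reg="R4"):
    """Return GSU assembly text that draws `rows` with no loops or ROM reads.

    rows: list of rows, each a list of colour indices (0 = transparent).
    left_x_reg: hypothetical register assumed to hold the sprite's left X.
    """
    lines = []
    current = None  # colour assumed to be in the colour register right now
    for y, row in enumerate(rows):
        if y > 0:
            lines.append(f"move R1,{left_x_reg} ; back to the left edge")
            lines.append("inc R2 ; next row")
        for colour in row:
            if colour == 0:
                lines.append("inc R1 ; skip transparent pixel")
                continue
            if colour != current:
                # Only reload the colour register when the colour changes.
                lines.append(f"ibt R0,#${colour:02X} ; load colour")
                lines.append("color ; copy R0 to colour register")
                current = colour
            lines.append("plot ; draw and advance X")
    return lines


if __name__ == "__main__":
    for line in emit_hardcoded_sprite([[1, 1, 0], [2, 1, 1]]):
        print(line)
```

The output is just text to paste into an assembler source file; whether the resulting code beats a ROM-driven loop would depend on how often the colour changes and on whether it all fits in the cache.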
I've got to say, I find the 8-bit busing to be far more aggravating in the case of the Super FX than in the case of the S-CPU. What they were trying to do with the Super FX really needed more bits per word than they had...
Code:
; SINGLE-PIXEL BLITTING (slowest and most general):
to R12 ; pixel count goes in the LOOP index register
getb ; get pixel count for first line
inc R14 ; increment ROM address, triggering a buffer load
Start:
getc ; get pixel data (one byte per pixel) from ROM buffer
inc R14 ; and increment ROM address
loop ; decrement pixel count; if not zero, go to address in R13, ie: "Start" (R13 is assumed to be set up beforehand; this listing doesn't show it)
plot ; plot pixel and increment X-counter in R1 (since the GSU is pipelined, this byte gets executed regardless)
getb ; get carriage return X-component (goes in R0)
inc R14 ; increment ROM address
with R1 ; update X-coordinate
sub R0 ; by subtracting carriage return X-component
inc R2 ; increment Y-coordinate
to R12 ; update LOOP index register
getb ; with pixel count for next line, plus one (the LOOP below decrements it before any pixel is drawn, so a stored 1 ends the blit)
inc R14 ; increment ROM address
loop ; decrement pixel count and branch to Start if not zero
nop ; dummy fill pipeline (nothing else to do before GETC, and the ROM buffer isn't ready anyway)
; The main loop has only two bytes between INC R14 and GETC, so in high-speed mode it's probably 6 cycles rather than 4.
; Blitting a sliver in 4bpp is probably at least 40 cycles, but that's still only 5 cycles per pixel, so this method is
; bottlenecked by code unless you're drawing in 8bpp.
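For reference, here is a rough Python sketch of the data layout the single-pixel routine appears to expect, plus a tiny model of the routine to check the layout against. The function names are made up, and the layout is my reading of the listing: the first count is stored as-is, later counts are stored plus one (because of the extra LOOP), and a final carriage-return byte followed by a count of 1 ends the blit. The model ignores PLOT's transparency handling.

```python
def encode_single_pixel(rows):
    """rows: list of (start_x, [pixel bytes]), one solid run per line."""
    out = []
    x = rows[0][0]
    for i, (start_x, pixels) in enumerate(rows):
        if not 1 <= len(pixels) <= 254:
            raise ValueError("run length must be 1..254")
        if i == 0:
            out.append(len(pixels))          # first count is stored as-is
        else:
            back = x - start_x               # how far left to step
            if not 0 <= back <= 255:
                raise ValueError("carriage return must fit an unsigned byte")
            out.append(back)
            out.append(len(pixels) + 1)      # later counts are stored +1
        out.extend(pixels)
        x = start_x + len(pixels)            # where X ends up after the run
    out.extend([0, 1])                       # dummy return, then terminator
    return bytes(out)


def simulate_single_pixel(data, x, y):
    """Walk `data` the way the routine does; return {(x, y): pixel}."""
    plotted = {}
    pos = 0
    count = data[pos]                        # to R12 / getb / inc R14
    pos += 1
    while True:
        while True:                          # Start:
            colour = data[pos]               # getc / inc R14
            pos += 1
            count = (count - 1) & 0xFFFF     # loop
            plotted[(x, y)] = colour         # plot runs in the delay slot
            x += 1
            if count == 0:
                break
        x -= data[pos]                       # getb / with R1 / sub R0
        pos += 1
        y += 1                               # inc R2
        count = (data[pos] - 1) & 0xFFFF     # to R12 / getb, then loop
        pos += 1
        if count == 0:
            return plotted


if __name__ == "__main__":
    blob = encode_single_pixel([(10, [1, 2, 3]), (9, [4, 5])])
    print(blob.hex(" "))
    print(simulate_single_pixel(blob, 10, 0))
```

Since the carriage return is fetched with GETB and subtracted, this sketch only allows a line to start at or to the left of where the previous line ended.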
Code:
; DUAL-PIXEL BLITTING (faster for long solid runs, slower for short runs, doesn't support gaps):
to R12 ; pixel count goes in the LOOP index register
getb ; get pixel count for first line, plus two if odd
inc R14 ; increment ROM address, triggering a buffer load
with R12 ; operate on pixel count
SStart:
lsr ; turn pixel count into pixel pair count
bcc DStart ; if the pixel count was even, go to dual-pixel blitting
nop ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
getc ; fetch the first pixel from the ROM buffer
inc R14 ; increment the ROM address
loop ; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
plot ; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)
bra EndL ; go to end of line (at this point it's been determined that the line was only one pixel long)
getb ; get carriage return X-component in R0 (happens after branch)
DStart:
getc ; get pixel pair
inc R14 ; increment ROM address
plot ; plot pixel to buffer and increment X-coordinate
loop ; decrement pixel pair count and go to DStart if nonzero
plot ; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate
getb ; get carriage return X-component in R0
EndL:
inc R14 ; increment ROM address
with R1 ; update X-coordinate
sub R0 ; with carriage return value
inc R2 ; increment Y-coordinate
to R12 ; refresh pixel counter
getb ; with next line's pixel count, plus three if odd and one if even
inc R14 ; increment ROM address
dec R12 ; decrement pixel count (hence the +1 for lines other than the first)
bne SStart ; branch to SStart if pixel count is nonzero
with R12 ; set up for right shift of pixel count
; This one uses the dither functionality to plot two pixels per byte fetched from ROM. Naturally this means all the
; graphics have to be duplicated in ROM so there's a version for each value of the dither bit (XOR of the X and Y
; bottom bits). Also, since dither can't plot transparent with non-transparent (it always checks the bottom of the
; colour register for colour #0, because it's checking the dither bit at the same time and doesn't yet know which half
; to use), this method does not support gaps in a line.
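Here is a small Python sketch of how the two ROM versions could be packed for this method. It assumes the dither rule is "use the high nibble of the colour register when (X XOR Y) is odd, otherwise the low nibble", and that the transparency test looks at the low nibble only, as described above. The function names are invented for illustration.

```python
def dither_pick(colour_byte, x, y):
    """Model of the assumed dither rule for 4bpp plotting."""
    if (x ^ y) & 1:
        return (colour_byte >> 4) & 0x0F
    return colour_byte & 0x0F


def pack_pair(first, second, parity):
    """Pack two 4-bit pixels; `parity` is (x ^ y) & 1 at the first pixel."""
    if parity == 0:
        return (second << 4) | first   # first pixel is read from the low nibble
    return (first << 4) | second       # first pixel is read from the high nibble


def pack_row(pixels, x, y):
    """Pack one solid run starting at (x, y) into the bytes GETC will fetch."""
    if any(not 1 <= p <= 15 for p in pixels):
        raise ValueError("runs must be solid 4-bit colours (no colour 0)")
    out = []
    pixels = list(pixels)
    if len(pixels) % 2:
        # The routine plots a lone leading pixel first. Both nibbles carry
        # the colour so the low-nibble transparency test cannot reject it.
        p = pixels.pop(0)
        out.append((p << 4) | p)
        x += 1
    parity = (x ^ y) & 1               # constant for the rest of the run
    for i in range(0, len(pixels), 2):
        out.append(pack_pair(pixels[i], pixels[i + 1], parity))
    return out


if __name__ == "__main__":
    # Same run, packed once for each value of the dither bit.
    print([hex(b) for b in pack_row([1, 2, 3, 4], 0, 0)])
    print([hex(b) for b in pack_row([1, 2, 3, 4], 1, 0)])
```

Calling pack_row twice with starting positions of opposite parity gives the two copies that have to sit in ROM.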
Code:
; DUAL-PIXEL WITH GAPS (a bit slower than basic dual-pixel blitting, but more flexible):
to R12 ; pixel count goes in the LOOP index register
getb ; get pixel count for first line, plus two if odd
inc R14 ; increment ROM address, triggering a buffer load
with R12 ; operate on pixel count
SStart:
lsr ; turn pixel count into pixel pair count
bcc DStart ; if the pixel count was even, go to dual-pixel blitting
nop ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
getc ; fetch the first pixel from the ROM buffer
inc R14 ; increment the ROM address
loop ; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
plot ; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)
bra EndL ; go to end of line (at this point it's been determined that the line was only one pixel long)
getb ; get X increment in R0, shifted left and added to the Y increment bit
DStart:
getc ; get pixel pair
inc R14 ; increment ROM address
plot ; plot pixel to buffer and increment X-coordinate
loop ; decrement pixel pair count and go to DStart if nonzero
plot ; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate
getb ; get X increment in R0, shifted left and added to the Y increment bit
EndL:
inc R14 ; increment ROM address
sex ; ensure that negative X increments remain negative when shifted
asr ; shift X increment into position with the sign from SEX kept in all 16 bits (LSR would only leave the low byte correct), pushing the Y increment out into the carry flag
bcs NewLine ; if the Y increment was one, go to NewLine (duplicated code for speed)
with R1 ; update X-coordinate
sub R0 ; with X increment
to R12 ; refresh pixel counter
getb ; with next run's pixel count, plus three if odd and one if even
inc R14 ; increment ROM address
dec R12 ; decrement pixel count (hence the +1 for runs other than the first)
bne SStart ; branch to SStart if pixel count is nonzero
with R12 ; set up for right shift of pixel count
bra EndBlit ; branch past duplicated code
NewLine:
sub R0 ; update X-coordinate with X increment
to R12 ; refresh pixel counter
getb ; with next line's pixel count, plus three if odd and one if even
inc R14 ; increment ROM address
inc R2 ; increment Y-coordinate
dec R12 ; decrement pixel count
bne SStart ; branch to SStart if pixel count is nonzero
with R12 ; set up for right shift of pixel count
EndBlit:
; This one encodes the X-coordinate carriage return value shifted left with a Y-increment bit shoved in on the right, so
; as to allow the algorithm to jump across gaps in a line without jumping down. This limits the size of the object
; somewhat, since there are now only 7 bits for the X-increment value, but I'm not too worried about that. I could
; encode TWO Y-increment bits this way, so as to allow vertical gaps in the object, but with what most of the graphics
; in my game look like, I doubt plotting a transparent pixel now and then is less efficient than doing a bunch of extra
; maneuvering at the end of every single run of solid pixels...
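The control byte described here can be sketched in Python like this. The names are made up, and the decode models the sign-extend step followed by an arithmetic right shift; the 7-bit range of -64 to 63 follows from the description above.

```python
def encode_control(x_step, new_line):
    """Pack a 7-bit signed X step and a 1-bit Y increment into one byte."""
    if not -64 <= x_step <= 63:
        raise ValueError("X step must fit in 7 signed bits")
    return ((x_step << 1) | (1 if new_line else 0)) & 0xFF


def decode_control(byte):
    """Mirror the routine: sign-extend, shift right, flag ends up in carry."""
    value = byte - 0x100 if byte & 0x80 else byte   # SEX
    new_line = bool(value & 1)                      # bit shifted out to carry
    x_step = value >> 1                             # arithmetic shift right
    return x_step, new_line


if __name__ == "__main__":
    for step, flag in [(5, False), (-3, True), (63, True), (-64, False)]:
        packed = encode_control(step, flag)
        print(f"{step:4d} {flag!s:5} -> ${packed:02X} -> {decode_control(packed)}")
```

A second Y-increment bit would work the same way with two low bits and one more shift, at the cost of another bit of X range.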
Thoughts? Have I made any obvious mistakes like misunderstanding how to use an instruction?
I suppose dumps of untested code aren't especially useful or interesting, since there's no indication of what might or might not be wrong...
EDIT: Just had an idea:
Code:
getb
inc R14
color
plot
mult R3 ; where R3 contains 0010h
swap
color
loop
plot
Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.
On the other hand, my single-pixel blit routine is even slower, and the extra pixel this method tacks onto odd-sized lines is transparent and can't cause a sliver overflow, so it might actually be better...
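To show the SWAP/XCN difference and why the MULT-by-0010h step is in there, here is a small Python model. It assumes SWAP exchanges the two bytes of a 16-bit register, XCN exchanges the two nibbles of a byte, MULT is a signed 8x8 multiply with a 16-bit result, and COLOR in 4bpp only cares about the low nibble of the low byte. The function names are just for this sketch.

```python
def swap16(value):
    """SWAP / XBA style: exchange the high and low bytes of a 16-bit word."""
    return ((value << 8) | (value >> 8)) & 0xFFFF


def xcn8(value):
    """XCN style: exchange the high and low nibbles of a byte."""
    return ((value << 4) | (value >> 4)) & 0xFF


def mult_signed8(a, b):
    """Signed 8-bit by 8-bit multiply, result truncated to 16 bits."""
    a = a - 0x100 if a & 0x80 else a
    b = b - 0x100 if b & 0x80 else b
    return (a * b) & 0xFFFF


def high_nibble_via_mult_swap(byte):
    """Multiply by 0010h, swap bytes, keep the nibble COLOR would use in 4bpp."""
    product = mult_signed8(byte, 0x10)   # shifts the byte left by four bits
    return swap16(product) & 0x0F        # old high nibble is now at the bottom


if __name__ == "__main__":
    print(hex(swap16(0x12AB)))           # bytes exchanged
    print(hex(xcn8(0xAB)))               # nibbles exchanged
    print(hex(high_nibble_via_mult_swap(0xAB)))
```

So SWAP alone can't flip the two pixels in a byte, but shifting left by four first leaves the old high nibble at the bottom of the high byte, which SWAP then brings down to where COLOR reads it.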