(If this technique is already known, then excuse my tomfoolery.)
So I was looking into popslide by tepples, and while reading through the code another technique occurred to me. It doesn't work with the NES Stripe Image format, but it should be quite a lot faster (and thus be able to push more VRAM data per vblank).
While popslide and similar techniques work like a sort of mini-interpreter, using an unused part of the stack as a blob of instructions and data, my technique drops the "interpreter" part in favor of the "Reverse RTS trick": the stack is filled with the addresses of various video updater methods, each followed by its data.
So, it might look something like this:
Code:
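SetPalette1:
  ; Set VRAM address to palette 1 ($3F01)
  LDA #$3F
  STA $2006
  LDA #$01
  STA $2006
  ; Pull the three palette values and write them to VRAM
  PLA
  STA $2007
  PLA
  STA $2007
  PLA
  STA $2007
  RTS
FillArbitraryData:
  ; Pull the target VRAM address (high byte first) off the stack
  PLA
  STA $2006
  PLA
  STA $2006
  ; Pull the byte count, then loop filling $2007
  ; (DataLength must be 1-255; a count of 0 would wrap and copy 256 bytes)
  PLA
  TAX
FillLoop:
  PLA
  STA $2007
  DEX
  BNE FillLoop
  RTS
Terminator:
  ; Restore stack to its normal self here
  RTS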
Stack:
SetPalette1_Low
SetPalette1_High
Color1
Color2
Color3
FillArbitraryData_Low
FillArbitraryData_High
VramAddress_High
VramAddress_Low
DataLength
Data1
Data2
Data3
Data4
Terminator_Low
Terminator_High
Each RTS at the end of one of those methods pulls the next address off the stack and chains straight into the next method (costing only 6 cycles!), until it hits the terminator method, which takes us out of the chain. Note that RTS adds 1 to the address it pulls, so each address stored on the stack has to be the method's address minus one. Unlike a mini-interpreter, things don't become slower the more alternatives we create, so we can write very fast, highly specialized methods for specific kinds of VRAM updates (like the palette).
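As a rough sketch of the producer side, here is how the main thread might queue a SetPalette1 entry and the final terminator into the stack page before vblank. This assumes ca65 syntax; BufferIndex and NewColors are hypothetical RAM locations invented for the example, and SetPalette1 and Terminator are the labels from the code above. There is no overflow check, which a real version would need.

```asm
; Hypothetical names: BufferIndex and NewColors are illustrative RAM
; locations, not part of the original post.
BufferIndex = $10          ; next free offset within the stack page
NewColors   = $11          ; three color bytes to upload

QueueSetPalette1:
  LDX BufferIndex
  LDA #<(SetPalette1 - 1)  ; RTS pulls the low byte first...
  STA $0100,X
  LDA #>(SetPalette1 - 1)  ; ...then the high byte, then adds 1
  STA $0101,X
  LDA NewColors+0          ; the three bytes SetPalette1 will PLA
  STA $0102,X
  LDA NewColors+1
  STA $0103,X
  LDA NewColors+2
  STA $0104,X
  TXA
  CLC
  ADC #5                   ; 2 address bytes + 3 data bytes
  STA BufferIndex
  RTS

QueueTerminator:           ; call once, after the last entry is queued
  LDX BufferIndex
  LDA #<(Terminator - 1)
  STA $0100,X
  LDA #>(Terminator - 1)
  STA $0101,X
  RTS
```

The bytes sit at ascending addresses because pulling from the stack moves the stack pointer upward, so the entry written first is also the one consumed first.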
We can even use a macro to generate an unrolled "LDA # -> STA" sequence, which gives the fastest possible bandwidth of 1 byte per 6 cycles, for static ROM data like text.
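A minimal sketch of such a macro, assuming ca65 (its .repeat, .strlen and .strat directives do the unrolling). The WriteText and ShowGameOver names and the $21CB nametable address are made up for illustration, and the sketch assumes tile indices match the character codes and that the $2006 write latch is already in its reset state.

```asm
PPUADDR = $2006
PPUDATA = $2007

; Expands to one "LDA #imm / STA PPUDATA" pair per character:
; 2 + 4 = 6 cycles and 5 bytes of ROM for each byte written.
.macro WriteText str
  .repeat .strlen(str), I
    LDA #.strat(str, I)
    STA PPUDATA
  .endrepeat
.endmacro

; A specialized updater method: no data bytes follow its address
; on the stack, because everything it writes is baked into the code.
ShowGameOver:
  LDA #$21                 ; hypothetical target address $21CB
  STA PPUADDR
  LDA #$CB
  STA PPUADDR
  WriteText "GAME OVER"
  RTS                      ; chain into the next queued method
```

The trade-off is ROM size: every byte of text costs 5 bytes of code instead of 1 byte of data.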
Furthermore, since we can jump to arbitrary addresses, we can create a huge unrolled run of "PLA -> STA" pairs and jump into the middle of it, picking the entry point so that exactly the number of bytes we want gets pushed, at no additional cost!
The possibilities are limitless. You can write methods that push data exactly to the dot, not even needing to push a length byte to the stack.
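The unrolled run could be sketched like this, again assuming ca65. SetVramAddress, PullSlide, SLIDE_MAX and the example address and data are hypothetical. Each "PLA / STA $2007" pair is 4 bytes of code and takes 8 cycles, so the entry point for N bytes is PullSlideEnd - 4*N (minus one more for the RTS adjustment).

```asm
PPUADDR   = $2006
PPUDATA   = $2007
SLIDE_MAX = 32             ; hypothetical upper limit on bytes per entry

SetVramAddress:
  PLA                      ; address high byte
  STA PPUADDR
  PLA                      ; address low byte
  STA PPUADDR
  RTS                      ; chains into the slide entry queued next

PullSlide:
  .repeat SLIDE_MAX
    PLA                    ; 1 byte of code, 4 cycles
    STA PPUDATA            ; 3 bytes of code, 4 cycles
  .endrepeat
PullSlideEnd:
  RTS

; What the stack buffer would hold to write 5 bytes at $2042.
; Shown as .byte data for illustration only; at runtime these
; values are stored into the stack page.
ExampleEntry:
  .byte <(SetVramAddress - 1), >(SetVramAddress - 1)
  .byte $20, $42
  .byte <(PullSlideEnd - 4*5 - 1), >(PullSlideEnd - 4*5 - 1)
  .byte $0A, $0B, $0C, $0D, $0E
```

No length byte is stored anywhere: the length is encoded entirely in which entry address gets queued.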
The cost of this technique:
Adjusting the stack and starting the process: 16 cycles (including initial JSR)
Static cost to start a segment: 6 cycles
Cost per segment: variable (but should always be lower than popslide's, thanks to specialization)
Cost to end the process: 15 cycles (including final RTS)
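For reference, one way the start and end of the process might look, as a sketch under assumptions rather than the routine these numbers were measured on. SavedSP and BufferSP are hypothetical zero-page variables; BufferSP holds the offset of the first queued byte in the stack page, minus one.

```asm
SavedSP  = $12             ; hypothetical: holds the real stack pointer
BufferSP = $13             ; hypothetical: (offset of first queued byte) - 1

RunVideoUpdates:           ; reached with JSR from the NMI handler
  TSX
  STX SavedSP              ; remember where the real stack is
  LDX BufferSP
  TXS                      ; S now sits just below the first queued byte
  RTS                      ; "returns" into the first queued method

Terminator:
  LDX SavedSP
  TXS                      ; back on the real stack
  RTS                      ; returns to whoever did JSR RunVideoUpdates
```

With zero-page variables the setup adds up to JSR (6) + TSX (2) + STX (3) + LDX (3) + TXS (2) = 16 cycles, which is consistent with the figure above if the launching RTS is counted as the first segment's 6 cycles; that reading is a guess.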