New technique for pushing video data faster

New technique for pushing video data faster
by Drakim on 2018-01-22 (#212128)

(If this technique is already known, then excuse my tomfoolery :lol:

)

So I was looking into popslide by tepples, and while reading though the code another technique occurred to me. It doesn't work with the NES Stripe Image format, but it should be quite a lot faster (and thus be able to push more VRAM data per vblank).

While popslide and similar techniques works like a sort of mini-interpreter, using an unused part of the stack as a blob of instructions and data, my technique drops the "interpreter" part in favor of using the "Reverse RTS trick", filling the stack with addresses to various video updater methods followed by their data.

So, it might look something like this:

Code:

SetPalette1:
  ; Set vram address to palette 1
  LDA #$3F
  STA $2006
  PLA #$01
  STA $2006
  ; Pull out the palette values and give them to VRAM
  PLA
  STA $2007
  PLA
  STA $2007
  PLA
  STA $2007
  RTS

FillArbitraryData:
  ; Set arbitrary address
  PHA
  STA $2006
  PHA
  STA $2006
  ; Loop here filling $2007
  RTS

Terminator:
  ; Restore stack to it's normal self here
  RTS

Stack:
SetPalette1_Low
SetPalette1_High
Color1
Color2
Color3
FillArbitraryData_Low
FillArbitraryData_High
DataLength
Data1
Data2
Data3
Data4
Terminator_Low
Terminator_High

Each "RTS" call on those methods will magically chain into the next one (costing only 6 cycles!), until it hits the terminator method which takes us out of this loop. Unlike a mini-interpreter, things don't become slower the more alternatives we create, so we can create very fast highly specialized methods for setting specific kinds of VRAM (like the palette).

We can even use a macro to create an unrolled "LDA # -> STA" loop for the absolutely fastest bandwidth 1 byte per 6 cycles for static ROM data like text.

Furthermore, since we can jump to arbitrary addresses, we can create huge unrolled loops of "PLA -> STA", and then jump an arbitrary distance into that unrolled loop and use that as our starting point to push a certain amount of bytes, at no additional cost!

The possibilities are limitless. You can write methods that push data exactly to the dot, not even needing to push a length byte to the stack.

The cost of this technique:
Adjusting the stack and starting the process: 16 cycles (including initial JSR)
Static cost to start a segment: 6 cycles
Cost per segment: variable (but always lower than popslide due to specialization)
Cost to end the process: 15 cycles (including final RTS)

Re: New technique for pushing video data faster
by pubby on 2018-01-22 (#212143)

This has been used before :wink:

but it's still a really cool use of CPS and I don't think most people know about it.

I still prefer zeropage buffers for most updates though.

Re: New technique for pushing video data faster
by Dwedit on 2018-01-22 (#212144)

Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

Re: New technique for pushing video data faster
by tepples on 2018-01-22 (#212153)

Dwedit wrote:

Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

Technically correct, in that you can empty the buffer one-third faster (6 cycles per byte compared to 8 cycles per byte). But with the stride of five bytes per immediate value and the update buffer straddling more than one 256-byte page, it might not be possible to fill the buffer quite as fast. In addition, devoting more of the 2048 bytes of unexpanded CPU RAM to such a buffer is likely to cause other parts of the program, such as game logic, to become slower as algorithmic space/time tradeoffs elsewhere in the program get tilted toward using more time. This is why the 30 fps version of "Bad Apple" requires WRAM expansion (the Kirby's Adventure board), whereas the 15 fps version does not (the Mega Man 4 and Mega Man 6 board).

Re: New technique for pushing video data faster
by slembcke on 2018-01-22 (#212158)

That's pretty clever. I've been musing about similar stack abuses for tail call hacking. In hindsight now it seems perfectly clear, push the destination to the stack and return. Thanks.

Re: New technique for pushing video data faster
by Bregalad on 2018-01-23 (#212164)

Why do you use those strange names instead of $2006 and $2007? Makes the code completely unreadable to me.

pubby wrote:

This has been used before :wink:

PLA/STA chains, definitely. The RTS I'm not so sure whether it had been used before in this context.

Quote:

I still prefer zeropage buffers for most updates though.

Me too, but in some (rare) cases it's not fast enough.

Quote:

Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

LDA #imm / STA $2006 is about the fasest you can go, the only faster possibility is to include clever use of X and Y register for 2 most common values and avoid a LDA here; also some clever usage of INX/INY/DEX/DEY/LSR/ASL/ROL/ROR will be the same speed but with less bytes. I think Tokumaru already made a thread about that. Since generating such code in RAM would be difficult, it was mostly intended at staying in ROM, even if this eats a lot of it.

Re: New technique for pushing video data faster
by Drakim on 2018-01-23 (#212168)

Bregalad wrote:

Why do you use those strange names instead of $2006 and $2007? Makes the code completely unreadable to me.

Fixed!

Bregalad wrote:

pubby wrote:

Popslides are still slower than in-ram LDA #xx \ STA $2007 slides.

With my RTS technique, you can point to RAM and execute it as a routine just like pubby prefers, or you can point to ROM to a series of tightly placed opcodes like Tokumaru's tricks, or you can put the values you want on the stack itself and point to a routine that pulls them out.

I got a lot of ROM space so I've been creating various specialized routines to do certain tasks as fast as physically possible. One of my routines is just straight up 1024 "STA $2007" in a row, I jump to the routine start plus an offset so that if I only want to write 50 bytes, it jumps to start+(1024-50)*3, and bam, 50 unrolled STA statements, all at the same low cost of an RTS statement like everything else.

Compared to having to do LDA $,X -> CMP $ -> BEQ or a jumptable for "custom" VRAM vblank routines it's both easier to implement, faster AND gives more options. Although RTS is more expensive than a relative branch, you'd have to make your video stripe uselessly primitive to not go above 6 cycles for the selection process (and then you'd still lose out at length due to lack of specialization routines).

Bregalad wrote:

pubby wrote:

This has been used before :wink:

PLA/STA chains, definitely. The RTS I'm not so sure whether it had been used before in this context.

Does that mean I get to give the technique a fancy name?

Re: New technique for pushing video data faster
by FrankenGraphics on 2018-01-23 (#212169)

Quote:

Does that mean I get to give the technique a fancy name?

By all means, yes! Also makes future convos easier to have when you can name drop techniques.
On the other hand if you don't people are going to refer to it as "drakim slide" and thus you'd be immortalized, like how i'd refer to a certain controller/dpcm fix as "rahsennors' controller/dpcm fix". :wink:

Re: New technique for pushing video data faster
by Rahsennor on 2018-01-23 (#212201)

I used exactly this technique for an unreleased game called Hull Breach!. I got it straight off the wiki, and thought it was a well-known method. I called it "RTS chaining".

Then I decided assembler wasn't for me, and my compiler can't cope with this trick, so I haven't used it since.

Re: New technique for pushing video data faster
by FrankenGraphics on 2018-01-23 (#212203)

Quote:

an unreleased game called Hull Breach!.

You rang?

Re: New technique for pushing video data faster
by pubby on 2018-01-23 (#212206)

I've used it before too.

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.

Re: New technique for pushing video data faster
by Drakim on 2018-01-24 (#212210)

Rahsennor wrote:

Are you sure you got it off the wiki? I can't find a single instance of the phrase "RTS chaining" or even "chaining" on the wiki. Heck, I can't even find "RTS chaining" related to programming anywhere on the web. Are you sure that's the proper name for the technique, and that you aren't just mixing it up with the "RTS Trick"? (or is my google-fu too weak?)

Edit: Oh wait, I read that wrong, you called it RTS chaining. Any idea what it was called in the wiki?

pubby wrote:

I'm not sure I understand, how can you use JSR to push addresses onto the stack without actually jumping there at the same time? If this worked, shouldn't the vanilla "RTS trick" be using this as well?

Edit: Ah, I think I may have found something similar being discussed before:

https://forums.nesdev.com/viewtopic.php?f=2&t=13285

I'm unsure why this technique hasn't caught on more, it seems like the strictly superior way, unless you really need a full 256 sized stack for gameplay.

Re: New technique for pushing video data faster
by Bregalad on 2018-01-24 (#212215)

Drakim wrote:

I'm unsure why this technique hasn't caught on more, it seems like the strictly superior way, unless you really need a full 256 sized stack for gameplay.

Well this technique is very great but it's neither the easiest (using "normal" buffer with "normal" code is easier to handle, but slower) nor the fastest (creating raw chains of lda #imm - sta $2007 is faster), so people coding games where standard VRAM bandwith is enough will do it the standard way, and people doing fancy stuff pushing the system to its extreme limits will use the fastest way. For this technique to be useful, you need significantly extra VRAM bandwidth, but still not enough so that extreme technique is not worth it.

Re: New technique for pushing video data faster
by Oziphantom on 2018-01-24 (#212216)

Drakim wrote:

Compared to having to do LDA $,X -> CMP $ -> BEQ or a jumptable for "custom" VRAM vblank routines it's both easier to implement, faster AND gives more options. Although RTS is more expensive than a relative branch, you'd have to make your video stripe uselessly primitive to not go above 6 cycles for the selection process (and then you'd still lose out at length due to lack of specialization routines).

Nah

Put your On X table at the bottom of the Stack so $100 + then you do a 1 byte stack dispatch

get X to have value x2
txs
rts

If all your code runs in 1 frame then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI. If you go over a frame and interrupt yourself X_X

The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it not very practical, and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.

Re: New technique for pushing video data faster
by Drakim on 2018-01-24 (#212220)

Bregalad wrote:

Well this technique is very great but it's neither the easiest (using "normal" buffer with "normal" code is easier to handle, but slower)

Is it? Hmmm, I mean, I concede the point that just setting up a "normal dumb" buffer is way easier, and it's probably what you should do when you are learning what's going on.

I was more thinking about popslide's way of using the bottom-stack as a faster buffer (which seems like the de facto way of doing it, to the point where people are comparing alternative implementations), which involves the exact same stack switching techniques as I use. But where popslide (and any similar interpreter) has to write a tiny mini-parser for the stack data, consuming bytes and acting on it, my code is (after the stack switch) just an "RTS" statement.

It jumps you right into a subroutine such as Set_BackgroundColor or Set_Palette, it's way easier and more straightforward at that point, in my opinion. Adding another subroutine is super trivial, you don't have to dig into popslide's source code to carefully figure out where to insert your new functionality. So I'd personally grade it easier and faster.

Quote:

nor the fastest (creating raw chains of lda #imm - sta $2007 is faster), so people coding games where standard VRAM bandwith is enough will do it the standard way, and people doing fancy stuff pushing the system to its extreme limits will use the fastest way.

Which what you are saying is true in the strict technical sense, I'm not sure I agree with your overall point.

While I do think that if you are writing some psedo-3d tech demo that needs to upload to vram really fast in a specific way, then nothing can beat your handcrafted code in speed.

But if you are making an actual real game, which sometimes has a screen full of text, sometimes scrolls horizontally, sometimes vertically, sometimes cycles colors for shiny treasures, sometimes prints one and one letter in dialog, I'm not so certain you could beat the RTS technique. Simply because the RTS technique works like a reverse jumptable, and the overhead setup for stack switching is only 16 cycles. And no matter how crafty you are, your game will need some means of selectively pushing vram since you can't do everything every frame. And whatever you pick, I think the RTS technique would beat it, but maybe I'm being overly optimistic!

Re: New technique for pushing video data faster
by slembcke on 2018-01-24 (#212223)

RTS Chaining, I think they meant this:
http://wiki.nesdev.com/w/index.php/RTS_Trick

I've was experimenting with this stuff yesterday and had pretty good luck. Other than it being non-obvious, it's pretty easy to do. Basically just writing a 2 byte command token instead of just one. You can easily avoid interfering with the stack if you write to the very bottom. My terminator function restores the stack, so to execute the buffer I just write the terminator func and the current stack pointer, then reset the stack pointer to FF and return. Voila! It like it.

pubby wrote:

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.

Say what? How does that work? Won't you end up jumping to the address you push? Not sure I understand.

Re: New technique for pushing video data faster
by Drakim on 2018-01-24 (#212227)

slembcke wrote:

I've was experimenting with this stuff yesterday and had pretty good luck. Other than it being non-obvious, it's pretty easy to do. Basically just writing a 2 byte command token instead of just one. You can easily avoid interfering with the stack if you write to the very bottom. My terminator function restores the stack, so to execute the buffer I just write the terminator func and the current stack pointer, then reset the stack pointer to FF and return. Voila! It like it.

Yup, that's the beauty of it. If you add more stuff later in the frame you simply overwrite the terminator function and write a new terminator function at the new offset in the stack.

I've decided to use 127 bytes for my video stack so that I can use the BMI instruction to quickly check if the video stack pointer has become too large each time I try to push in more.

I've also been experimenting with making the terminator function's address be symmetrical to speed it up further (something like $C0C0), since then you can just mindlessly fill the bottom of the stack in advance (with an unrolled loop), and not have to bother appending it after have written in your "2 byte command token". It speeds up high-video update situations but slows down low-video update situations, which is generally a good tradeoff.

Another cool trick I've mentioned earlier is that you can jump an arbitrary length into any of your subroutines as a starting point, which can be used in very neat ways.

slembcke wrote:

pubby wrote:

One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.

Say what? How does that work? Won't you end up jumping to the address you push? Not sure I understand.

Yeah, I've been twisting my head trying to figure out how it would work, but I just can't see it. Does he somehow exit out of the routines with a JMP? But even if he did that it would be the wrong part of the stack that's filled up.

Re: New technique for pushing video data faster
by Drakim on 2018-01-24 (#212228)

Oziphantom wrote:

You are going to have to explain that again, I'm not able to follow. :oops:

Quote:

If you go over a frame and interrupt yourself X_X

Now that's a pretty good point. I guess I should either make really sure that no interrupt can happen while my video routine does it's work, or that interrupts are disabled while it works.

Quote:

The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it not very practical

That's not completely true, if you leave yourself some room you can still use the faux video stack as a regular stack as long as you are careful. JSR, RTS, PHA and PLA will work just as expected as long as you don't push the stack past the $0/$FF point.

Quote:

and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.

It increases the static cost of each video "segment" from 6 to 12 cycles, and you lose the ability to cram in extra values for each video subroutine to pull out with PLA, and each "jump entry" in your queue takes 4 bytes of RAM intead of 2.

Not so sure I agree it's almost as good, it seems like an inferior technique (but with the added bonus of being able to place it anywhere in RAM, or even have ROM versions for cutscenes and the like where things are exactly the same)

Re: New technique for pushing video data faster
by Bregalad on 2018-01-24 (#212232)

Drakim wrote:

Which what you are saying is true in the strict technical sense, I'm not sure I agree with your overall point.

Well there's nothing to agree or disagree upon. I was just trying to explain why using stack as VRAM-buffer is not so common even though we all agree it's a cool trick. My point was that while this trick is indeed very cool, it's neither the fasest nor the easiest. To which you respond, "but this trick is so cool" - I never said it wasn't !

Quote:

But if you are making an actual real game, which sometimes has a screen full of text, sometimes scrolls horizontally, sometimes vertically, sometimes cycles colors for shiny treasures, sometimes prints one and one letter in dialog, I'm not so certain you could beat the RTS technique.

Doing it with normal buffers without any optimisation allows a bandwith of aprox 80 bytes + sprite DMA, which is enough to update one line of nametable plus one line of attribute table plus palettes, or two lines of nametables plus attribute table but without palettes. With some partially unrolled loops or zero-page usage, it's possible easily to update two lines + attribute table + palettes + sprite DMA, with only light optimization. So the only application where further optimization is needed is for CHR-RAM updates, or maybe if updating a really large NT area in a single frame is needed.

Re: New technique for pushing video data faster
by tepples on 2018-01-24 (#212239)

Drakim wrote:

Adding another subroutine is super trivial, you don't have to dig into popslide's source code to carefully figure out where to insert your new functionality. So I'd personally grade it easier and faster.

Popslide uses the same packet format as Super Mario Bros., which is described on both Data Crystal and NESdev Wiki. There are four packet types (horizontal literal, horizontal run, vertical literal, and vertical run). It's designed to be generic, where you can make your own subroutine that adds packets in a particular shape to the list without having to touch the updater code. The stuff in nstripe.s is mostly convenience for forming packets based on shapes that arose during the development of The Curse of Possum Hollow.

Having only one interpreter also simplifies the NMI handler, which is important to reduce the size of code in the fixed bank or the pseudo-fixed repeating area in a game using 32K switching.

Re: New technique for pushing video data faster
by pubby on 2018-01-24 (#212266)

Drakim wrote:

It's just a jmp to a jsr which jumps back, and is only useful in a few very specific situations. I did use it in my music engine though.

Code:

    jmp foo
continue:

; ...

foo:
    jsr continue
    ; some code
    rts

Re: New technique for pushing video data faster
by Rahsennor on 2018-01-24 (#212299)

FrankenGraphics wrote:

You rang?

Sorry...

Drakim wrote:

I can't find it on the wiki now either. My devlog notes the technique on 2016-05-18 as "RTS-display-list-on-stack" and then never mentions it again, but it's definitely right there in the source:

Code:

.org $C000

bank .byte 0,1,2,3,4,5,6,7 ; UNROM conflict-avoidance table

irq:
reset:
   lda #0
   sta $2000  ; disable NMI
   sta $2001  ; disable rendering
   sta bank+0 ; switch in bank 0
   jmp init


.dsb $FF00 - *

; copy 0-32 bytes to the PPU
nmi_copy:
   nop ; offset for RTS (I'm lazy)
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20 ; PLA/STA x32
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20 ; 4 bytes each
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
   pla
   sta $2006
   pla
   sta $2006
   rts

; write 32 zeroes to the PPU
nmi_zero:
   nop
   ldy #8
   lda #0
 nmi_zero1:
   sta $2007
   sta $2007
   sta $2007
   sta $2007
   dey
   bne nmi_zero1
   pla
   sta $2006
   pla
   sta $2006
   rts

; close the RTS chain
nmi_end:
   nop

   lda scroll_x
   sta $2005
   lda scroll_y
   sta $2005
   lda #$88     ; TODO: nametable select
   sta $2000

   ldx nmi_s
   txs
   ldy nmi_y
   ldx nmi_x
   lda nmi_a
nmi_abort:
   rti

nmi:
   lsr nmi_flag
   bcc nmi_abort
   sta nmi_a
   stx nmi_x
   sty nmi_y
   tsx
   stx nmi_s
   lda #>oam
   sta $4014
   ldx #$FF
   txs
   pla
   sta $2006
   pla
   sta $2006
   rts

.dsb $FFFA - *       ; pad to interrupt vectors

.word nmi
.word reset
.word irq

I also used this method on PC a fair bit, back when I was doing assembler coding for DOS and so on, so maybe I mentally inserted it into the "RTS Trick" page as a matter of course?

EDIT: I also saw Tokumaru's thread the day it was posted, so I may have got it from there. My treatment of $2006 is identical so it's likely that was it.

Re: New technique for pushing video data faster
by Drakim on 2018-01-25 (#212304)

Bregalad wrote:

Ah, my bad, sorry!

Quote:

Thanks, that's a pretty good point. It didn't occur to me that a game's scope might allow it to push everything to vram on every frame without hitting the ceiling, my mental idea of how much you could push was much much lower.

pubby wrote:

It's just a jmp to a jsr which jumps back, and is only useful in a few very specific situations. I did use it in my music engine though.

Ah, I see. That does indeed shave off 1 cycle (10 vs 9).

If you REALLY wanna go crazy on the savings you can do LDA, PHA, PHA if the subroutine is placed on an address where it's high and low byte are identical, then you have it down to 8 cycles.

Re: New technique for pushing video data faster
by tokumaru on 2018-01-25 (#212313)

I definitely posted a method exactly like this a while back.

Re: New technique for pushing video data faster
by Oziphantom on 2018-01-25 (#212341)

Drakim wrote:

Oziphantom wrote:

You are going to have to explain that again, I'm not able to follow. :oops:

Its a One Byte dispatch method. I.e your $100 looks like
$100 <MyFunc-1
$101 >MyFunc-1
$102 <OtherFunc-1
$103 >OtherFunc-1

then if you want to dispatch OtherFunc you do
LDX #2
TXS
RTS
however normally you would not immed load X and put it on some "on X" variable.

Quote:

The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it not very practical

Quote:

and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.

I was imagining that you would have a mostly fixed "display list" as things move, but don't get added or removed from the screen that often. Thus you save time by not regenerating the whole list each frame. If you are regenerating it each frame then sure any other stack use, will trash stuff, but you don't care as you are going to rebuilt it before next NMI. However if you FRAME out and don't get all the stack rebuilt before the NMI and some of your other systems have "trashed" the display list X_X

Re: New technique for pushing video data faster
by Drakim on 2018-01-26 (#212352)

tokumaru wrote:

I definitely posted a method exactly like this a while back.

Indeed you did, so I'm calling it tokumaru's video stack from now on, if you don't mind

In your thread you asked for improvements, and there are a couple:

1. Your stack format is 2 bytes for the desired vram address, and then 2 bytes for the desired subroutine, chaining that for as much as you need, and then ending with 2 bytes for a fake vram address $0000 and then 2 bytes for the terminator function that restores the stack.

You can improve this by putting the vram address AFTER the desired subroutine address in your stack, and letting the subroutine itself PLA out the vram address and set it. Why? Because we can create hundreds of different subroutines if we desire, to do specialized tasks. For instance, we can make one to change the background fill color in the PPU, which is always $3F00, so it's better to write LDA #$3F, LDA #$00 than to use PLA, PLA. Not only is it faster cyclewise during the vblank, but it uses up less of the faux stack.

But even more important than that....

2. To loop back and continue to the next subroutine, you use "JMP UpdateVRAM -> PLAx2 vram address -> RTS", but now that the vram address isn't in the way you can drop all that and just use a normal "RTS" at the end of each subroutine, and it will magically jump to the next subroutine! This shaves off a LOT of cycles and simplifies the code as well.

Re: New technique for pushing video data faster
by tokumaru on 2018-01-26 (#212355)

Drakim wrote:

Indeed you did, so I'm calling it tokumaru's video stack from now on, if you don't mind

If this is supposed to be immortalized in the wiki or something I don't mind being credited somewhere, but having it named after me is a bit too much IMO. :mrgreen:

Quote:

1. Your stack format is 2 bytes for the desired vram address, and then 2 bytes for the desired subroutine, chaining that for as much as you need, and then ending with 2 bytes for a fake vram address $0000 and then 2 bytes for the terminator function that restores the stack.

You can improve this by putting the vram address AFTER the desired subroutine address in your stack, and letting the subroutine itself PLA out the vram address and set it.

The reason I did it like that was so I could reuse the same 32x unrolled loop of PLA+STA to transfer any number of bytes from 1 to 32, by simply JSRing to the middle of the sequence. If I were to set the address inside the routine, I'd need 32 separate routines for the same functionality without any loss of performance.

Quote:

Why? Because we can create hundreds of different subroutines if we desire, to do specialized tasks.

I did end up ditching the generic approach in favor of completely specialized routines myself, but aside from the rare instances where you'd use hardcoded addresses, my old method could do specialized routines too, which could set other VRAM addresses if needed.

Quote:

2. To loop back and continue to the next subroutine, you use "JMP UpdateVRAM -> PLAx2 vram address -> RTS", but now that the vram address isn't in the way you can drop all that and just use a normal "RTS" at the end of each subroutine, and it will magically jump to the next subroutine! This shaves off a LOT of cycles and simplifies the code as well.

Sure, that's much cleaner and straightforward, but like I said, setting the VRAM address before calling the next routine is merely a ROM saving feature so that the same generic transfer routine can deal with data lengths from 1 to 32. The superfluous address setting at the end of the chain will waste a few cycles, but that's all.

I wasn't 100% satisfied with that solution either, but someone looking for a completely generic VRAM update system of "copy X bytes from the stack to PPU address $XXXX" that doesn't have much ROM space might want to go with that.

I ended up switching to completely specialized update routines, so I have complete freedom to store the data in whatever format is more convenient for each task (many times outside of the stack, even), but I still have the RTS directly call the next update routine.

Re: New technique for pushing video data faster
by Drakim on 2018-01-26 (#212359)

tokumaru wrote:

You can solve this by first queuing up a specialized "set address" subroutine, and then queuing up the unrolled loop subroutine right after (plus the offset to get the desired amount of loops). It adds 6 cycles (the extra RTS) to do this, but just the unnecessary JMP statement at the end of the old solution costs 3 cycles, so this technique's unrolled loop only really costs 6 - 3 = 3 cycles more. As long as you do a few more video operations queued up afterwards you end up using less cycles, by avoiding the extra JMP each time.

Re: New technique for pushing video data faster
by tokumaru on 2018-01-26 (#212361)

But the extra JMP and the bogus VRAM address are used only once, and that's during a not so precious time (i.e. they can safely take place during the pre-render scanline) but an extra 6 cycles per update routine call when CPU time is at a premium (i.e. vblank), that's much worse IMO.

Note that I'm not arguing that this is the best possible solution, just that there might be demand out there for all of these solutions. The VRAM address presetting is for those who're looking for a fast generic vblank handler but don't have much ROM to spare (e.g. NROM projects with more VRAM updates than typical).

Re: New technique for pushing video data faster
by Drakim on 2018-01-26 (#212363)

I think I'm misunderstanding you, or you are misunderstanding me. :wink:

Here is the gist what I understood you to be saying:

Quote:

We have to arrange the stack so that it's [videoaddress] first and [subroutineaddress-1] afterwards. When it's vblank time, we pull out the [videoaddress], use it to set $2006 correctly, and then we use RTS to jump to our [subroutineaddress-1], which now begins pushing data into $2007. When it's done, it calls JMP UpdateVRAM to begin the process again.

The reason we cannot do the opposite order, beginning by using RTS to jump to [subroutineaddress-1], and then have the subroutine itself pull the [videoaddress] and set $2006 correctly, and then begin pushing data into $2007, is because the subroutine might be an unrolled loop, and we might want to jump to [subroutineaddress-1+60] to skip a bunch of the unrolled loop.

If we did that, we'd be jumping past the part of the subroutine that pulls out the [videoaddress] and set $2006 correctly. Therefore that part cannot be inside the subroutine.

And here is the gist what I'm saying:

Quote:

To solve this issue, we simply have to avoid setting the address in the unrolled loop's subroutine. We do this by setting the stack like this: [subroutineSetAddress-1][videoaddress][subroutineUnrolledLoop-1+60]

What happens now is that our first RTS jumps us to subroutineSetAddress, which uses two PLA to set $2006 correctly, and then calls RTS, which jumps us to subroutineUnrolledLoop (plus the offset!) with the address already set.

This means we are using an additional RTS instruction for unrolled loops, which costs 6 more cycles, but we save 3 cycles for every single video subroutine by avoiding the "JMP UpdateVRAM", which means it more than makes up for it, and we end up using less cycles in the long run.

Edit: Actually, re-reading a couple of times, what do you mean "But the extra JMP and the bogus VRAM address are used only once", how does that work? What exactly do you do to end your video subroutines if you are only using the JMP one time during the entire sequence? Don't you have to do it one time per subroutine?

Re: New technique for pushing video data faster
by tepples on 2018-01-26 (#212371)

In this post about the Spectre attacks on branch prediction, Dwedit mentioned ROP, or return-oriented programming. Before ROP became famous for use in stack-smashing attacks, this technique of storing a list of subroutines to be called was known as threaded code.

The name Popslide was derived from NOP slide and clockslide, both of which involve a jump into an unrolled loop, like a computed Duff's device.

The technique discussed here combines storing a subroutine sequence in the return stack (the "ROP") with jumping into an unrolled copy loop (like the "slides"). So could we call it ROPslide?

Re: New technique for pushing video data faster
by Drakim on 2018-01-26 (#212387)

Sounds good

I've come up with another neat little trick. Just like with unrolled loops, it can sometimes be beneficial to jump a ways into a subroutine to skip some unnecessary code. If you set the PPU "increment mode" at the beginning of every subroutine, you can opt to keep track of it manually as you build your video stack outside of vblank, and simply plus $5 to the subroutine's address to skip over the initial LDA #, STA $2000 when you know the increment mode is already correct.

Unlike having "set increment mode to 1" and "set increment mode to 32" subroutines, this trick doesn't carry any extra overhead of jumping around additional times in the video stack.

Re: New technique for pushing video data faster
by tokumaru on 2018-01-28 (#212575)

Drakim wrote:

Actually, re-reading a couple of times, what do you mean "But the extra JMP and the bogus VRAM address are used only once", how does that work? What exactly do you do to end your video subroutines if you are only using the JMP one time during the entire sequence? Don't you have to do it one time per subroutine?

I went looking for the posts I made about this and apparently this is the last piece of code I posted about this. The basic idea was to have 2 unrolled loops, both copying up to 32 bytes, but one sets the VRAM address before RTS'ing to the next update, while the other is meant to be used for the final update, so it doesn't set the VRAM address at the end. With this setup, there's no overhead at all, even the JMP at the end can be eliminated if the 2nd unrolled loop flows directly into the rest of the NMI handler.

But then I decided that having two unrolled loops was too much trouble, and that it'd be better to just let the last update set a bogus VRAM address and RTS to the remainder of the NMI handler. Here's what the code could look like:

As soon as possible in the NMI handler:

Code:

  ;(swap the stack pointer)

  ;execute the first update
  pla
  sta $2006
  pla
  sta $2006
  rts

RestoreStackPointer:

  ;(restore the stack pointer)

The unrolled data transfer loop, which has 32 possible entry points:

Code:

Copy32Bytes:

  pla
  sta $2007

Copy31Bytes:

  pla
  sta $2007

  ;(...)

Copy1Byte:

  pla
  sta $2007

  ;execute the next update
  pla
  sta $2006
  pla
  sta $2006
  rts

In this case, the only overhead is setting up a bogus VRAM address that won't be used for anything, something that happens only once per vblank, after the final data transfer (can be during the pre-render scanline, so it doesn't waste any actual vblank time).

Like I said before, this is mostly for those looking to use generic VRAM update code without giving up on speed or lots of ROM space. I personally prefer to use specialized update routines so I can take advantage of certain characteristics of the NES architecture (e.g. palette mirroring, changing only the low byte of the VRAM address, etc.) to make updates even faster.

Re: New technique for pushing video data faster
by Drakim on 2018-01-29 (#212585)

Oh, I understand finally!

Your technique is cool in that you shave off the RTS in the subroutine that sets the address before the mass copy:

Quote:

Video_SetAddress:
pla
sta $2006
pla
sta $2006
rts

By baking the address setting into the previous mass copy. Very clever!

While this is probably the fastest way of pushing raw bytes from the faux stack to the VRAM directly in a "video command" manner (probably only beaten by doing it all static every vblank and not having any "command" system at all), I think you were right in switching to specialized routines, there are just too many situations on the NES that doesn't involve copying enormous amounts of bytes mindlessly. Plus, the specialized routines allows for micro optimizations here and there, like using "LDA #" instead of "PLA" when it's a static VRAM address, or skipping over the "set increment" code if the increment mode is already correct, or having specialized unrolled loops that doesn't PLA if the same value appears multiple times in a row. The sum of optimizations might very well end up being faster.

Still, I'm going to have a think about the possibility of using this in my own specialized system in the case of several mass-copying routines being queued up in a row. :beer:

Re: New technique for pushing video data faster
by tokumaru on 2018-01-29 (#212606)

Yeah, there are several opportunities for optimizations we can do with specialized update routines. You already mentioned how you can use a hardcoded VRAM address for palette updates, but there are many other optimizations you can do in this particular case:

- start writing at $3F01 rather than $3F00, and set the background color later, when you reach $3F10 (a mirror of $3F00);

- don't update any of the "non-displayable" entries ($3F04, $3F08, $3F0C) or their mirrors ($3F14, $3F18, $3F1C), just bit $2007 instead, to advance the VRAM address and skip these;

- read the palette data straight from the place where it's normally stored, not from the stack, saving the time it'd take to copy it there.

Let's see how much time we could save:

- set VRAM address to $3F01 (12 cycles);
- update 3 colors (24 cycles);
- skip $3F04 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F08 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F0C (4 cycles);
- update 3 colors (24 cycles); 108
- update the background color via $3F10 (8 cycles);
- update 3 colors (24 cycles);
- skip $3F14 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F18 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F1C (4 cycles);
- update 3 colors (24 cycles);
- call the next update (6 cycles);

That's a total of 242 cycles, compared to the 278 cycles that a "raw transfer" of 32 bytes would take, and don't forget the time you save outside of vblank, by not copying the palette data to the stack.