Maximal arbitrary beamchasing? Code golf

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Maximal arbitrary beamchasing? Code golf
by on (#172604)
Code:
<Bushytail>   beam chasing refers more to the beam that makes scanlines on a CRT, and trying to make code for graphical effects fast enough to fit
<Bushytail>   it comes up more on the 2600 for obvious reasons

This got me wondering just how many HBlank writes one can stuff into the NES. I chose to do this without WRAM, because that makes it actually possible to run continuously: rewriting an unrolled loop in WRAM takes a lot longer, and picture time is greater than VBlank, so there is nowhere near enough time in VBlank to rewrite all the HBlank code.

So, the most general way to do it is to have a 240-byte table for the address and for the data of each write. Initially I was thinking only of PPU registers, not mapper registers…which require more than one byte of rewritable address.

Code:
; Maximum beamchasing... in RAM.
; Hblank is 28⅓ CPU cycles; the visible part of the line is 85⅓.
; Numbers in parentheses are the running cycle count.
    ldx #16             ; end of vblank; X counts 16..255, one step per scanline
looptop:                ; (8)
    lda write1addr,x    ; 256-byte tables of where we want each write
    sta wr1+1           ; (16) low byte of the store's address
    lda write2addr,x
    sta wr2+1           ; (24)
    lda write3addr,x
    sta wr3+1           ; (32)
    lda write4addr,x
    sta wr4+1           ; (40)
    lda write5addr,x
    sta wr5+1           ; (48)
    lda write6addr,x
    sta wr6+1           ; (56)
    lda write4,x        ; and what we want written
    sta re4+1           ; (64)
    lda write5,x
    sta re5+1           ; (72)
    lda write6,x
    sta re6+1           ; (80)

    lda write1,x        ; (84)
    ldy write2,x        ; (88)
    sty re2+1           ; (92)
    ldy write3,x        ; (96)
    stx $FF             ; (100)
re2: ldx #$00           ; value overwritten
wr1: sta $2000          ; $20ZZ, ZZ overwritten; write cycle is when hblank begins (103+1)
wr2: stx $2000          ; $20ZZ, ZZ overwritten (5)
wr3: sty $2000          ; $20ZZ, ZZ overwritten (9)
re4: lda #$00           ; value overwritten (11)
wr4: sta $2000          ; $20ZZ, ZZ overwritten (15)
re5: lda #$00           ; value overwritten (17)
wr5: sta $2000          ; $20ZZ, ZZ overwritten (21)
re6: lda #$00           ; value overwritten (23)
wr6: sta $2000          ; $20ZZ, ZZ overwritten (27)
    ; ...and we're out of hblank time. One cycle (and one-third) of leeway.
    ldx $FF             ; (3) non-blank time
    inx                 ; (5)
    bne looptop         ; (8)
; So, 102+28 if we're perfect: we have a cycle to spare in hblank, but not outside it.
; 92+28 if we're executing out of ZP.
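
As a sanity check on the hblank annotations in the listing, here is a minimal Python sketch that replays the nine hblank instructions and reports the cycle on which each one finishes. It assumes the standard 6502 timings (4 cycles for an absolute store, 2 for an immediate load) and that the first store is aligned so only its final write cycle falls inside hblank. The names `CYCLES`, `SEQUENCE` and `hblank_offsets` are made up for this illustration.

```python
from fractions import Fraction

# CPU cycles per instruction (standard 6502 timings for these addressing modes).
CYCLES = {"sta abs": 4, "stx abs": 4, "sty abs": 4, "lda #imm": 2}

# The hblank part of the loop, in program order (wr1 .. wr6).
SEQUENCE = [
    "sta abs", "stx abs", "sty abs",
    "lda #imm", "sta abs",
    "lda #imm", "sta abs",
    "lda #imm", "sta abs",
]

# 85 PPU dots of hblank at 3 dots per CPU cycle (NTSC).
HBLANK = Fraction(85, 3)


def hblank_offsets(sequence):
    """Return the hblank cycle on which each instruction finishes.

    Assumption: the first store starts early enough that only its last
    (write) cycle lands in hblank, so it finishes on hblank cycle 1.
    """
    offsets = []
    t = 1 - CYCLES[sequence[0]]
    for op in sequence:
        t += CYCLES[op]
        offsets.append(t)
    return offsets


offsets = hblank_offsets(SEQUENCE)
leeway = HBLANK - offsets[-1]
print(offsets)  # [1, 5, 9, 11, 15, 17, 21, 23, 27]
print(leeway)   # 4/3
```

The result reproduces the (5), (9), ... (27) annotations and the "one cycle and one-third" of leeway. It says nothing about whether the non-hblank part of the loop fits.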

Should probably unroll ×3 at least, just to make it easy to deal with the third-cycles.
Obviously, if one is running out of WRAM one could unroll it all the way and just use ld# immediates to fit easily, but that requires WRAM. I want to see if it can fit in ZP in such a way, because that makes it easy to "bankswitch" our arbitrary tables (rewrite the 12 values, relatively easy to fit in VBlank).

You can get two more writes if you're doing the 2006/5/5/6 thing, but obviously you have to find some cycles to put them in.
Presently 91 bytes, and those 92+28 cycles (if ZP)... unrolling 3 times will drop some cycles, and make it easier to deal with the ⅔ cycle per line accruing.
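
To spell out why unrolling by three helps with the third-cycles: a scanline is 341 PPU dots, or 113⅔ CPU cycles, so three scanlines are exactly 341 CPU cycles. A rough Python sketch, assuming NTSC timing (3 dots per CPU cycle) and ignoring anything that happens outside the visible lines; `line_lengths` is a hypothetical helper, not anything from the code above:

```python
import math
from fractions import Fraction

DOTS_PER_LINE = 341
DOTS_PER_CPU_CYCLE = 3  # NTSC assumption

cycles_per_line = Fraction(DOTS_PER_LINE, DOTS_PER_CPU_CYCLE)  # 113 2/3


def line_lengths(n_lines):
    """Whole-cycle length for each of n_lines consecutive scanlines,
    chosen so the CPU never drifts a full cycle away from the PPU."""
    lengths = []
    elapsed = 0
    for line in range(1, n_lines + 1):
        target = cycles_per_line * line      # exact end of this scanline
        length = math.floor(target) - elapsed
        lengths.append(length)
        elapsed += length
    return lengths


print(line_lengths(3))          # [113, 114, 114]
print(sum(line_lengths(240)))   # 27280
```

So a loop body unrolled three times can use iterations of 113, 114 and 114 cycles and land back on the same dot every third line.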

(pre-post edit: save a cycle by changing the "save x" store to point at the load-x and making that an ldx #imm; it also means not having to reserve a ZP slot for that.)

Of course, if we fix two of the writes to scroll registers, that will save the rewriting which-register-bytes…which is enough to drop it to fit a 3-unrolled into ZP, and also get it actually fitting under the cycle count, though sync cycles still need to be considered…

edit: or fix two to "disable render enable render", which makes for THREE ditched tables (2xaddr, 1xdata for the disable-render value)...but at cost of true-arbitrary writes.

edit2: added leading explanation. It occurs to me that a CHR bankswitch might be a desired write as well, which would require making one of the writes have its hi-address rewritable. Also fixed the ldx, as there are only 240 scanlines to write. :oops:

edit3,4: In sum: "[How] Can we fit six arbitrary Hblank (PPU register/CHR bank/VRAM) writes in every scanline every frame? If not, how much freedom needs sacrificing to fit them in?"
Re: Maximal arbitrary beamchasing? Code golf
by on (#172606)
I have no idea what you're trying to do. Care to write an introduction to your post to give it some context?
Re: Maximal arbitrary beamchasing? Code golf
by on (#172612)
I feel like this post was written by some kind of automated nesdev post bot. It contains words that you would find in a typical post, but not in any order that makes any sense to me.

Quote:
Can we fit six arbitrary Hblank (PPU register/CHR bank/VRAM) writes in every scanline


Edit...my rough math says Hblank is about 30 cycles...
I suppose, sta stx sty is about 12 cycles, lda sta 8 cycles, lda sta 8 cycles...5 writes, per Hblank with timed code.

Even if you can time code for the entire screen, that gives you no rendering time for game logic, so is this for some kind of tech demo that changes a BG color every scanline?

(Edited) Disch in another post says Hblank is "28 cycles" long.
Re: Maximal arbitrary beamchasing? Code golf
by on (#172613)
dougeff wrote:
I feel like this post was written by some kind of automated nesdev post bot. It contains words that you would find in a typical post, but not in any order that makes any sense to me.

Quote:
Can we fit six arbitrary Hblank (PPU register/CHR bank/VRAM) writes in every scanline
LOL this is pretty funny. I want to try...


Quote:
Does MMC5 allow horizontal VRAM updates on consecutive odd cycle sprite NMIs?


I'm sorry, no offense intended and please split me if this is memeworthy but I found dougeff's post very humorous.
Re: Maximal arbitrary beamchasing? Code golf
by on (#172614)
What Myask said is true; there are 341 pixels, 256 during rendering and 85 during hblank, so there are 28⅓ instruction cycles during hblank.

Except that whether we can use all 85 pixels of hblank depends on the nature of the raster effect. We might have as few as 62 pixels (the light blue area on Ulfalizer's timing diagram, while the PPU is fetching patterns for sprites). Afterwards, we might collide with the tile fetches for the left-most two columns.

Also, the relative alignment of the CPU and PPU means that we'll rarely get all 28 (or 20) cycles; we probably only actually have 27 (or 19), even given precision to single cycles.

Cycle-perfect timing means the first write can finish on the first cycle of hblank. That leaves 26/18 cycles for any subsequent loads and stores.
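
The numbers in this post can be reproduced with a small Python sketch. It assumes 3 PPU pixels per CPU cycle, and takes the two one-cycle deductions (alignment, then the first write) exactly as described above; `usable_cycles` is a name invented for the illustration.

```python
from fractions import Fraction

PIXELS_PER_CPU_CYCLE = 3  # NTSC assumption


def usable_cycles(hblank_pixels):
    """Return (whole cycles, after CPU/PPU alignment loss, after first write)."""
    exact = Fraction(hblank_pixels, PIXELS_PER_CPU_CYCLE)
    whole = exact // 1                 # drop the fractional cycle
    after_alignment = whole - 1        # alignment rarely grants the last cycle
    after_first_write = after_alignment - 1  # first write lands on hblank cycle 1
    return whole, after_alignment, after_first_write


print(Fraction(85, 3))      # 85/3, i.e. 28 1/3 cycles
print(usable_cycles(85))    # (28, 27, 26)
print(usable_cycles(62))    # (20, 19, 18)
```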
Re: Maximal arbitrary beamchasing? Code golf
by on (#172615)
Awww you edited it. I suppose I'll have to edit mine too. Anyway, yeah I see you figured out the context of the sentence.

Here's a thread from when I first started playing with them: viewtopic.php?f=2&t=13360 The discussion convinced me to limit myself to techniques that don't require disabling rendering during a scanline, for gameplay at least. There are specific situations where it could be useful, like a scroll bar, but for the most part it doesn't seem feasible in gameplay. A generic solution that pushes the technique to the limit of the hardware therefore isn't particularly useful, as it would have to be tailored to when and where something like a palette change could happen.

If you could somehow figure out a way to fit one palette color change in hBlank, that could be big, but that alone would take some magic. People keep figuring out new things all of the time though.
Re: Maximal arbitrary beamchasing? Code golf
by on (#172616)
Sorry, you guys are too quick for me. I restored the original, so the replies make sense.

I have a bad habit of posting before I've fully thought through the comment, and then editing my comment after it's been posted.

Another edit...If the first write to PPU is the first half of a PPU address, then "yes" you can fit 6 in a Hblank, the first occurring just prior to it.
Re: Maximal arbitrary beamchasing? Code golf
by on (#172617)
Myask wrote:
"[How] Can we fit six arbitrary Hblank (PPU register/CHR bank/VRAM) writes in every scanline every frame? If not, how much freedom needs sacrificing to fit them in?"
Look at my annotated Indiana Jones title screen. 7 writes on three scanlines in a row, five clusters each vsync. (Yes, the first write takes advantage of the first write to $2006 being buffered)

... Oh, man, I don't know/keep forgetting the offset between PPU cycles and when pixel N shows up on-screen. That makes this annoying to say anything useful.

The last write (enable) is timed such that it has just a little clearance (pixel 311 at latest) before the first real nametable fetch restarts (cycle 320) ... and they've also got the "conceal leftmost 8 columns" bit on, whatever that means.
Re: Maximal arbitrary beamchasing? Code golf
by on (#172621)
Check out this attachment if you want to see some old and ugly code where I tried to pull something like this off. It does no less than 8 PPU register writes every scanline. It tries to shut the screen off early on every line, so the usable horizontal size is actually smaller than the NES screen. It was partly successful, it worked some of the time. I didn't understand why at the time, but I think it must have been from the differing CPU/PPU alignments on power-up. IIRC, moving it back 1 cycle or forward 1 cycle hosed everything up, so it was a fun little experiment that almost worked. I'm sure it could have been done better.
Re: Maximal arbitrary beamchasing? Code golf
by on (#172832)
Just a random thought on this matter.

If you really want to cram as many operations as possible in a scanline, I think you're going to have to write machine code into RAM, and load absolute values.
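
A rough illustration of what "machine code in RAM" would look like, as a Python sketch that emits the bytes for a run of lda #imm / sta abs pairs. The opcodes ($A9, $8D, $60) are the standard 6502 ones; the function name `emit_writes` and the example register values are made up, and the sketch leaves out the padding that would be needed to hold each group of writes to a scanline.

```python
LDA_IMM = 0xA9  # lda #imm
STA_ABS = 0x8D  # sta abs
RTS = 0x60      # rts


def emit_writes(writes):
    """Turn (address, value) pairs into 6502 machine code ending in rts.

    Each pair becomes `lda #value` / `sta address`: 5 bytes, 6 cycles.
    """
    code = bytearray()
    for address, value in writes:
        code += bytes([
            LDA_IMM, value & 0xFF,
            STA_ABS, address & 0xFF, (address >> 8) & 0xFF,  # little-endian
        ])
    code.append(RTS)
    return bytes(code)


# Hypothetical example: one scroll write and one PPUMASK write.
code = emit_writes([(0x2005, 0x10), (0x2001, 0x1E)])
print(code.hex())  # a9108d0520a91e8d012060

# Size of a fully unrolled frame: 6 writes on each of 240 lines.
full_frame_bytes = 5 * 6 * 240
print(full_frame_bytes)  # 7200
```

At 7200 bytes, a fully unrolled frame is well over the 2 KiB of internal RAM, which is why this approach needs cartridge WRAM.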

I'd be interested in seeing some developments in the raster arena that fit into game design. Doing fancy things in this department often requires some significant trade-off of cycles, inflexibility, or difficulty in implementation.

Perhaps a technique to safely cram updates after an already timed scroll split could be useful. Doing something like a status bar or some minimal parallax isn't often too hard. I started thinking about this last night and considered doing the same as I described, with absolute values in RAM, so that I could assign a unique palette to my top status bar.