Battletoads' $2007 writing system

by Bregalad on 2005-03-24 (#1692)

Has anyone ever traced some pieces of the code in Battletoads?
The player's animation is done just by changing the contents of the same tiles (as in some SNES games like Chrono Trigger), and the game manages to write a lot of data to $2007 during VBlank by doing:

Code:

pla              ;4 cycles
sta $2007    ;4 (8 cycles per write)
pla              ;4
sta $2007    ;4

instead of:
lda $xxxx,X  ;4 cycles (5 if a page boundary is crossed)
sta $2007     ;4
inx               ;2 (10 cycles per write)
etc...

... and I believed that all games with 32 KB ROM bankswitching were technically garbage...
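For comparison, the per-write cost of the two copy styles can be tallied from the documented NMOS 6502 instruction timings. The Python sketch below is only an illustration: the `CYCLES` table and the `cost` helper are names made up here, and it assumes the indexed load never crosses a page boundary (which would add one cycle).

```python
# Documented NMOS 6502 cycle counts for the instructions used above.
CYCLES = {
    "pla": 4,        # pull accumulator from the stack
    "sta abs": 4,    # store to an absolute address such as $2007
    "lda abs,x": 4,  # indexed load, +1 if a page boundary is crossed
    "inx": 2,        # increment X
}

def cost(*instructions):
    """Sum the cycles of one unrolled copy step."""
    return sum(CYCLES[name] for name in instructions)

stack_copy = cost("pla", "sta abs")                  # stack-based copy
indexed_copy = cost("lda abs,x", "sta abs", "inx")   # conventional copy

print(stack_copy, indexed_copy)
```

By these timings the stack version still wins, at 8 cycles per byte against 10.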

by blargg on 2005-03-24 (#1693)

I posted a thread about this technique to 6502.org a while back.

Using S as fast index register

The stack register (S) can be used as an extra index register for going through a small buffer more rapidly than is possible with X and Y. It might be useful where a buffer of data needs to be quickly read from or written to some device. The data is simply pushed on the stack, then popped off the stack. Both operations are faster than using an index register, and leave both index registers free for other use.

This example quickly outputs a buffer of 0-terminated data to a memory-mapped device outside of zero-page. Each byte takes 11 cycles to read from the buffer and output:
Code:
lda #0 ; 0 terminator
pha
... ; push data on stack

jmp next
read sta port ; write to device
next pla
bne read

This example quickly reads data from a device and stops when it receives 0. Each byte takes 10 cycles to input and write to the buffer:

Code:
tsx ; save current stack pointer
stx end

write lda port ; read from device
pha
bne write

read pla
... ; use data
tsx
cpx end
bne read

By putting the buffer at the bottom of page 1, S can be used as both a counter and index for a write buffer. Each byte takes 12 cycles to input and write to the buffer:

Code:
tsx ; save stack
stx stack

ldx #size
txs

loop lda port
pha
tsx
bne loop

... ; use data

ldx stack ; restore stack
txs

By putting the buffer at the top of page 1, S can be used as both a counter and index for a read buffer. The normal stack would need to be placed lower in page 1 to coexist with this scheme. Each byte takes 13 cycles to read from the buffer and output:
Code:
ldx #0 ; init stack
txs

... ; push data on stack

loop pla
sta port
tsx
bne loop
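The per-byte figures quoted for the four loops can be checked by adding up the instructions inside each loop body. This is a rough Python tally, assuming `port` is an absolute (non-zero-page) address and that no branch crosses a page boundary; the loop names are labels invented here, not anything from the original post.

```python
# NMOS 6502 cycle counts; "bne taken" assumes no page crossing.
CYCLES = {"lda abs": 4, "sta abs": 4, "pha": 3, "pla": 4, "tsx": 2, "bne taken": 3}

# Instructions executed once per byte in each of the four loops above.
LOOPS = {
    "zero-terminated output": ["sta abs", "pla", "bne taken"],
    "zero-terminated input": ["lda abs", "pha", "bne taken"],
    "counted input": ["lda abs", "pha", "tsx", "bne taken"],
    "counted output": ["pla", "sta abs", "tsx", "bne taken"],
}

per_byte = {name: sum(CYCLES[i] for i in body) for name, body in LOOPS.items()}

for name, cycles in per_byte.items():
    print(f"{name}: {cycles} cycles per byte")
```

The totals come out to 11, 10, 12 and 13 cycles, matching the figures given with each example.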

by tepples on 2005-03-24 (#1694)

Theoretical best case scenario on NTSC
There are 20 full scanlines (341*20 PPU cycles) plus roughly 256 PPU cycles between when $2002.D7 is set to 1 and when loopy_T is copied to loopy_V. This equals (341*20+256)/3=2358 CPU cycles. With this technique demonstrated in Battletoads, what all can we fit into vblank?
1792 cycles: Copy all 256 bytes from stack to VRAM using this method in a completely unrolled loop.
28 cycles: Set up scroll registers.
525 cycles: Copy sprite table to PPU (done last for safety).
That makes 2345 cycles, which is just barely under the limit without any sort of turn-the-screen-off-early trickery.

More practical scenario for NTSC
Because most games will use some of the stack for at least something "normal", we'll see how long a 192 byte buffer takes:
1344 cycles: Copy 192 bytes from stack to VRAM using this method in a completely unrolled loop.
28 cycles: Set up scroll registers.
525 cycles: Copy sprite table to PPU (done last for safety).
Total: 1897 cycles. This seems to indicate that a complete unrolling may not be necessary, that it may be possible to reset the VRAM destination pointer ($2006) a few times during the copy.
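Both budgets can be reproduced with a few lines of arithmetic. The Python sketch below takes the 28-cycle and 525-cycle figures from the post as given. The breakdown above works out to 7 cycles per copied byte; since PLA takes 4 cycles on an NMOS 6502 (making PLA + STA 8 cycles), the 8-cycle case is included for comparison. The helper name `total` is made up for this example.

```python
# CPU cycles available in vblank on NTSC, as computed in the post.
VBLANK_CPU_CYCLES = (341 * 20 + 256) // 3

SCROLL_SETUP = 28   # figure from the post
SPRITE_DMA = 525    # figure from the post

def total(buffer_bytes, cycles_per_byte):
    """Cycles used by the copy plus scroll setup and sprite DMA."""
    return buffer_bytes * cycles_per_byte + SCROLL_SETUP + SPRITE_DMA

for size in (256, 192):
    for per_byte in (7, 8):
        used = total(size, per_byte)
        slack = VBLANK_CPU_CYCLES - used
        print(f"{size} bytes at {per_byte} cycles: {used} used, {slack} to spare")
```

At 8 cycles per byte the full 256-byte copy would overrun the 2358-cycle window, while the 192-byte buffer still fits with room left over.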

EDIT: correctness

by Bregalad on 2005-03-24 (#1695)

Well, Battletoads also seems to upload the palette every VBlank.
For uploading a series of tiles, $2006 only needs to be set once, but when name table and attribute table writes are coupled, there will typically be 3 strings: 2 for two name table rows (or columns) and one for the attribute row. Hmm, writing the attribute table in column mode can't be done just by setting the increment bit in $2000 (bit 2, value $04); we have to set $2006 for every byte while scrolling horizontally (this could be avoided with vertical mirroring hardware, but Battletoads uses 1-screen mirroring so it has to do it this way).
I think the better way would be to separate the NameTbl/AttributeTbl buffer, with the standard $2006, $2006, $2007 scheme, from the PatternTbl buffer, which could live in the stack page, with $2006 set only once followed by lots of $2007 writes.
So in VBlank, only if the NameTbl/AttributeTbl buffer is empty do we go on to upload any pending Pattern Table data with this superfast method. If the buffer is 192 bytes, it would allow us to overwrite 12 tiles per frame. Cool! The only problem is that the effect would be slower while the screen is scrolling.
Also, this would be useless on a game with VROM.
Last thing: I think filling the buffer the usual way (sta $100,x etc.) would be better, since it allows you to call various subroutines while filling the buffer, which would of course be impossible with only PHAs.
And the goal of buffers like this is to take all the time you need to fill them and then upload the data quickly to the PPU.

Ah, yes, I suspect Battletoads forces a longer NMI with $2001, and then turns the screen on a bit later, because there is unused space on the screen above the status bar. That way, it can fit in more writes to $2007, and mix the usual and the superfast ways of doing them.

by blargg on 2005-03-24 (#1697)

Nice breakdown, tepples. If you needed the whole stack, you could save/restore it to/from a temporary area at 15 clocks per byte (8 to save it and 7 to restore it). If a normal 64-byte stack area were being used and the whole 256 bytes were needed for a quick buffer copy, swapping these 64 bytes out and back in would take 960 cycles total. This of course would be done in non-cycle-critical code.
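The 960-cycle figure follows directly from the per-byte costs. The Python sketch below assumes one plausible reading of the 8/7 split, namely an unrolled PLA + STA absolute to save each byte and LDA absolute + PHA to restore it; that instruction pairing is a guess, not something stated in the post.

```python
# Assumed instruction pairs (a guess at how the 8 and 7 arise):
SAVE_CYCLES = 4 + 4      # pla (4) + sta abs (4)
RESTORE_CYCLES = 4 + 3   # lda abs (4) + pha (3)

def swap_cost(stack_bytes):
    """Cycles to save a stack area and later restore it."""
    return stack_bytes * (SAVE_CYCLES + RESTORE_CYCLES)

print(swap_cost(64))
```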

Heh, a mapper which allowed changing the stack and/or zero page would be neat (like on the 65816 in the SNES).

by tepples on 2005-03-24 (#1699)

Bregalad wrote:
Also, this would be useless on a game with VROM.

You don't know how many bytes a puzzle game has to push to update its playfield during complicated gravity effects, do you? If you don't want to see wipe effects, or you want to update both players' playfields at the same time, you need to be able to write 200 or more nametable bytes at once.

by Bregalad on 2005-03-25 (#1702)

tepples wrote:
You don't know how many bytes a puzzle game has to push to update its playfield during complicated gravity effects, do you? If you don't want to see wipe effects, or you want to update both players' playfields at the same time, you need to be able to write 200 or more nametable bytes at once.

By "puzzle game", do you mean "tetris game" ?
The stack thing could be used for more name table data, but usually a write to the name table involves more than one $2006 PPU address, unless you want to change the contents of 6 rows quickly (192 bytes -> 6 rows) without touching the attribute tables. Otherwise a buffer with $2006, $2006, then x times $2007 is possible with the stack, but it would allow fewer bytes to be pushed because of the addresses, and you have to be very careful, because if you do something wrong, the program will freeze.
If you want it just for a particular NameTbl effect, where the number of $2007 writes after each $2006 address is always the same, this could be much easier to do.
Bankswitching the stack would be a nice idea; it would need a cart with 512 bytes of SRAM on it that could be swapped in instead of the zero page and the stack in the NES's RAM, but I don't think it's possible without modifying the NES itself (I lack knowledge about that). At this point, redirecting the $4014 DMA to $2007 would be better (but then updating sprites would be very slow, and this would probably be worse than before).

by Memblers on 2005-03-25 (#1704)

The stack trick is really cool. But if there's lots of RAM to spare, the speed of self-modifying code can't be beat. Definitely demands WRAM though.

I've used a huge unrolled code array loaded into RAM with just immediate loads and stores to $2007. It costs 5 bytes of RAM and 6 cycles for every byte to be copied. The loading code can change the values used with the immediate loads, and change a store's target register to $2006 when needed. It can also insert an RTS where you want it to end (or maybe even faster, a JMP, heheh).

Of course, it also would need to keep track of where it made the $2006/$2007 and RTS changes so it can revert quickly to 'blank' when needed. But no one ever said self-modifying code is simple.

It may be slow to load up the buffer, but it's very, very quick to unload it this way.
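To make the layout of such a code array concrete, here is a small Python sketch that assembles the bytes by hand. It only illustrates the byte pattern; on the NES the loader would be 6502 code patching these bytes in RAM. The opcodes are the standard ones (LDA immediate = $A9, STA absolute = $8D, RTS = $60), while the helper names `store` and `build_upload` and the sample address and data are made up for this example.

```python
LDA_IMM = 0xA9   # lda #imm  (2 bytes, 2 cycles)
STA_ABS = 0x8D   # sta abs   (3 bytes, 4 cycles)
RTS = 0x60
PPUADDR = 0x2006
PPUDATA = 0x2007

def store(value, register):
    """Encode `lda #value` followed by `sta register` (5 bytes)."""
    return bytes([LDA_IMM, value & 0xFF, STA_ABS, register & 0xFF, register >> 8])

def build_upload(vram_addr, data):
    """Unrolled routine: set the VRAM address, write each byte, return."""
    code = store(vram_addr >> 8, PPUADDR) + store(vram_addr & 0xFF, PPUADDR)
    for value in data:
        code += store(value, PPUDATA)
    return code + bytes([RTS])

# Hypothetical sample: three bytes written starting at VRAM $20C0.
routine = build_upload(0x20C0, [0x27, 0x65, 0x3A])
print(len(routine), "bytes:", routine.hex(" "))
```

Each write, whether to $2006 or $2007, costs the same 5 bytes, so switching a slot between address and data only means patching the low byte of the store's operand.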

by tepples on 2005-03-25 (#1705)

Bregalad wrote:
tepples wrote:
You don't know how many bytes a puzzle game has to push to update its playfield during complicated gravity effects, do you? If you don't want to see wipe effects, or you want to update both players' playfields at the same time, you need to be able to write 200 or more nametable bytes at once.

By "puzzle game", do you mean "tetris game" ?

Correct, I'm thinking of Tetris and its progeny: Columns, Puyo, Zoop, Puzzle Fighter, and lots more.

by Bregalad on 2005-03-25 (#1706)

I've never done anything like that, but if I were to make something like Tetris or Puzzle Fighter myself, I would just do the falling puzzle piece in sprites and the rest in BG, which should work just fine, and the only time you'd need to update the name table would be when a falling piece lands.

Self-modifying code, you said?
So it would call a subroutine at $6000 or so that does, for example:
lda #$20
sta $2006
lda #$c0
sta $2006
lda #$27
sta $2007
lda #$65
sta $2007
.....
lda #$3a
sta $2007
rts

You mean something like that? It would take 2+4=6 cycles per write, which beats the Battletoads technique, and additionally it isn't limited to ~192 bytes. But, damn, it would use A LOT of RAM!!! 5 bytes per write!
If you want to use the whole NMI time, 2345 cycles, minus 525 cycles for sprite DMA and 28 cycles to set up the scroll registers, this leaves 1792 cycles; dividing that by 6, we could in theory write 298 bytes (in practice this would be less because you also need to push A, X and Y, etc.), so it would be able to fully overwrite up to 18 tiles in the pattern table or up to 9 rows in the name table, or 8 name table rows with the remaining time for the attributes.
Damn, it's really great, but now we have to worry about RAM space. (298*5)+1 = 1491 = $5d3, so if the buffer starts at $6000, it would end at $65d2, taking up about one sixth of the WRAM space. Well, it's a lot, but it's a good, efficient use after all. But it's absolutely impossible without WRAM.
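The sizing above can be double-checked with a short calculation. This Python sketch reuses the cycle figures from the post and assumes an 8 KB WRAM window at $6000-$7FFF; the variable names are invented for the example.

```python
CYCLES_PER_WRITE = 2 + 4   # lda #imm + sta $2007
BYTES_PER_WRITE = 2 + 3    # code size of the same pair
WRAM_SIZE = 0x2000         # assumption: 8 KB at $6000-$7FFF

budget = 2345 - 525 - 28                   # cycles left for $2007 writes
writes = budget // CYCLES_PER_WRITE        # bytes that can be uploaded
code_size = writes * BYTES_PER_WRITE + 1   # +1 for the final rts

start = 0x6000
end = start + code_size - 1

tiles = writes // 16   # 16 bytes per pattern table tile
rows = writes // 32    # 32 bytes per name table row

print(f"{writes} writes, {code_size} bytes, ${start:04X}-${end:04X}")
print(f"{tiles} tiles or {rows} rows, {code_size / WRAM_SIZE:.1%} of WRAM")
```

The routine comes to roughly 18% of an 8 KB WRAM, slightly more than one sixth.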

Look at Final Fantasy II: once you get the "ring" at the beginning of the game, you can press the B and Select buttons to view the overworld map.
The map is circular and you can move the cursor; the program does calculations in order to render it a bit like a 3D engine and changes the tiles when you move around (this can be seen in a pattern table viewer like the one in VirtuaNES or..... Nesticle). But moving is damn SLOW! You move just one millimeter and it takes about a second (it was something like 65 frames); actually it takes a lot of time to calculate and overwrite the map's tiles, about 4 tiles per frame. Imagine, with the stack system we could do 12 tiles per frame and with the WRAM system 18 tiles! So it would be respectively 3 and 4.5 times faster! Hmm, I may hack my ROM to have it do that. That way, I could explore FF2's world faster.

by tepples on 2005-03-25 (#1707)

Bregalad wrote:
I've never done anything like that, but if I were to make something like Tetris or Puzzle Fighter myself, I would just do the falling puzzle piece in sprites and the rest in BG

Until you get more than eight falling pieces on one scanline or more than sixty-four falling pieces on one screen. You must be thinking of what would be done more in a Super NES or GBA situation.

Quote:
Self-modifying code, you said?
So it would call a subroutine at $6000 or so that does, for example:

And run up extra replication costs for having more banks of RAM at $6000, especially when the choice is RAM vs. open bus at $6000.

by Bregalad on 2005-03-25 (#1708)

How do you get "more than 8 falling objects on a scanline" or "more than 64 falling objects on the screen"? You just have several sprites forming one object for each player, don't you?
And the SNES also has a sprite limit per line (32 sprites I think), and this sometimes shows in Secret of Mana. I don't know about the GBA, but as far as I know its sprite capabilities are near-unlimited (there are 128 individually scalable and rotatable sprites with variable size, and they can all be on the same scanline, as far as I know).

For FF2, I was wrong: the calculation is slow, but the way it updates the tiles is pretty fast (up to 8 tiles per frame). It does it the usual way in a largely unrolled loop like:
Code:
ldx #$00
loop:
lda $500,X
sta $2007
lda $501,X
sta $2007
lda $502,X
sta $2007
....
lda $50f,X
sta $2007
txa
clc
adc #$10
tax
dey
bne loop

The stack thing wouldn't improve it, since it calculates 16 tiles (taking about 4 frames of calculation!!) and updates 2x8 tiles, so updating 8/8 or 12/4 would be the same. The only way to improve this would be the way Memblers described: it would be able to fill the whole 16-tile row in one frame, but it would need a lot of WRAM, and I think that's already used for other things in FF2; and even if it weren't, it would only speed up the program by one frame out of six, so this wouldn't speed it up enough.

PS: The way Memblers described could also be used for the intro of a game without WRAM, so the whole NES RAM excluding the reserved parts (i.e. $300-$7ff) would be used for a buffer like that *before* the game puts this RAM to normal use during gameplay. This would allow, for example, a cool graphical effect when the title screen pops up.

by tepples on 2005-03-25 (#1711)

Bregalad wrote:
How do you get "more than 8 falling objects on a scanline" or "more than 64 falling objects on the screen"? You just have several sprites forming one object for each player, don't you?

If you're not handling the falling pieces as background, then you have to handle the falling pieces as sprites. In Tetris, the playfield is 10 blocks wide by 20 blocks tall; if you make a line at the bottom of the screen, you'll get a lot more than 64 falling objects. In Puyo or Wario's Woods, as the objects are 16 pixels by 16 pixels, you would have to use two 8x16 pixel sprites for each falling object. Would you rather use flickery OAM cycling?

And puzzle games aren't the only games that would benefit from nametable animation. Look at some of the bigger sub-bosses from the Mega Man series, the ones that fade out the rest of the screen when they're drawn.

Quote:
And the SNES also has a sprite limit per line (32 sprites I think), and this sometimes shows in Secret of Mana.

It's actually 256 sprite pixels per line (which makes a difference if you're using larger sprites). I've noticed it showing up in Super Mario RPG.

Quote:
I don't know about the GBA, but as far as I know its sprite capabilities are near-unlimited (there are 128 individually scalable and rotatable sprites with variable size, and they can all be on the same scanline, as far as I know).

GBA has 1210 rendering cycles per scanline, or 954 rendering cycles per scanline if you turn on the ability to write to OAM during horizontal blank, which a few games do to get more than 128 simultaneous sprites or more than 32 distinct rot/scale matrices. Each pixel of a non-rot/scale sprite takes 1 rendering cycle; each pixel of a rot/scale sprite takes 3.25 rendering cycles (26 cycles per 8 pixels of width).

by Bregalad on 2005-03-26 (#1712)

I have no knowledge of puzzle games, but I didn't say to draw every piece as a sprite, only the ones that are falling. When they land on something, they would suddenly become BG, and I guess a "usual" buffer can handle this perfectly without using any "superfast" buffer.
About Mega Man, I think the BG sub-bosses fade just with the palette. And if you don't want a palette fade-out, you can do a cool effect by clearing the outer tiles first and the inner tiles last, for example, if you see what I mean. I think it's better to use "superfast" buffers only when really needed, for example to do animation with CHR RAM; in a game with CHR ROM it becomes less necessary, but it would be useful too.

by Memblers on 2005-03-26 (#1713)

Bregalad wrote:
I have no knowledge of puzzle games, but I didn't say to draw every piece as a sprite, only the ones that are falling.

But it's possible for every piece on the screen to be falling, if you remove the ones that were on the bottom. So puzzle games have to be ready to handle the worst cases.

by Bregalad on 2005-03-26 (#1714)

Aha, I got it now.
Yes, so this would need a fast buffer, I understand. Technical challenge, heh.