Mapper 7 32kb Switching kernel

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#41852)
I was bored this afternoon, so I decided to experiment with how to build a mapper 7 setup (this also applies to any mapper that allows switching the full 32 KB at once, such as MMC1 or MMC5).

The problem comes from the 32 KB bankswitching itself. Because code normally executes from ROM, writing to the mapper register swaps the whole bank immediately, including the code under the program counter, so there is no time to jump somewhere safe first. This can be quite hard to handle.
Fortunately, if you put a string of bytes that is the same in all banks, at the same address, you can jsr to a bankswitching routine that is exactly the same in every bank. Or you could have a routine in RAM that switches the ROM bank while no code is executing from ROM, but you'd still have to come up with one RESET, NMI and IRQ routine per bank.
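To illustrate the "same bytes in every bank" idea, here is a minimal Python model (not NES code). The stub address, the stub bytes and the helper names are assumptions made up for this sketch; it only checks that whatever the CPU fetches next after a switch is identical in all banks.

```python
# Toy model: eight 32 KB banks mapped at $8000-$FFFF.
BANK_SIZE = 0x8000
STUB_ADDR = 0xFFE0                       # hypothetical location of the shared stub
STUB = bytes([0x8D, 0x00, 0x80, 0x60])   # sta $8000 / rts (illustrative bytes only)

def make_bank(fill):
    """Build one bank filled with `fill`, with the shared stub copied in."""
    bank = bytearray([fill]) * BANK_SIZE
    off = STUB_ADDR - 0x8000
    bank[off:off + len(STUB)] = STUB
    return bytes(bank)

def stub_is_safe(banks, addr, length):
    """True if every bank holds the same bytes at addr..addr+length-1."""
    off = addr - 0x8000
    first = banks[0][off:off + length]
    return all(bank[off:off + length] == first for bank in banks)

banks = [make_bank(i) for i in range(8)]
print(stub_is_safe(banks, STUB_ADDR, len(STUB)))  # shared stub region
print(stub_is_safe(banks, 0x8000, 4))             # ordinary, bank-specific region
```

If the check fails for the region where the mapper write happens, the instruction fetched right after the write comes from unrelated code in the new bank.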

I have no idea how commercial games handled this (I haven't even looked in them at all), but I guess it would be really great if you could jsr to a routine in another bank just as easily as if it were in the same bank (from the point of view of the main code). By tricking the stack a little this is possible to do, and I made it as ROM-efficient as possible.

So I have a tiny kernel of about 100 bytes which you can put in every PRG-ROM bank, and that kernel handles interrupts and bankswitching.

Code:
_Table
   .db $00, $01, $02, $03
   .db $04, $05, $06, $07

_Reset
   sei
   cld
__   lda #:Start
   sta _b+1.w      ;Bankswitch start bank in
   sta LastBank.w
   jmp Start      ;Go to actual reset code

_NMI
   pha               ;NMI interrupt
   txa
   pha
   tya
   pha
   lda LastBank.w
   pha
   lda #:NMI         ;Save state and get bank for actual code
   tay
   sta _Table.w,Y
   jmp NMI          ;Jump to actual interrupt code

_IRQ                ;Same as above
   pha
   txa
   pha
   tya
   pha
   lda LastBank.w
   pha
   lda #:IRQ
   tay
   sta _Table.w,Y
   jmp IRQ

_endInt
   pla                 ;Restore state
   jsr _MapperWrite
   pla
   tay
   pla
   tax
   pla
   rti

_LongJSR                  ;This lets you jsr to any routine in any bank easily
   pla
   sta PointerL
   clc
   adc #$03
   tay
   pla
   sta PointerH   ;Get pointer to where the return address is
   adc #$00      ;and add 3 to it
   pha
   tya
   pha

   lda LastBank.w
   pha         ;Push old bank number
   lda #>(_MapperReturn-1)
   pha
   lda #<(_MapperReturn-1)
   pha         ;Return address used to restore the old bank

/*   ldy #$02          ;VERSION N°1
   lda (Pointer),Y
   pha
   dey
   lda (Pointer),Y
   pha
   ldy #$03
   lda (Pointer),Y
   tay
   sta _Table.w,Y      ;Bankswitch the bank we want
   sta LastBank
   rts      */

   ldy #$01             ;VERSION N°2
   lda (Pointer),Y
   sta JumpPtrL.w
   iny
   lda (Pointer),Y
   sta JumpPtrH.w
   iny
   lda (Pointer),Y
   tay
   sta _Table.w,Y      ;Could be sta (Pointer),Y but this will still
   sta LastBank.w      ;cause potential conflicts on real hardware !
   jmp (JumpPtr)

_MapperReturn
   pla                      ;The called routine's RTS automatically returns here
_MapperWrite
   sta LastBank.w
   tay
   sta _Table.w,Y
   rts
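The reason for _Table (each entry holds its own index) can be shown with a small Python model. This is a sketch under assumptions: the table address is made up, and a bus conflict is modelled as a bitwise AND of the written value and the ROM byte, which is a common simplification and not a statement about any particular board.

```python
TABLE_ADDR = 0xFFD0                           # hypothetical address of _Table
rom = {TABLE_ADDR + i: i for i in range(8)}   # .db $00..$07, as in the kernel

def bus_value(rom, addr, written):
    """Value seen by the mapper when CPU and ROM both drive the data bus
    (simplified: modelled as an AND of the two bytes)."""
    return written & rom[addr]

def select_bank(rom, bank):
    """Model of 'tay / sta _Table,Y' with A = Y = bank."""
    return bus_value(rom, TABLE_ADDR + bank, bank)

print([select_bank(rom, b) for b in range(8)])   # each write lands on a matching byte
print(bus_value(rom, TABLE_ADDR + 2, 5))         # mismatched write for comparison
```

Because the byte written always equals the byte already stored at that address, both sides agree and the mapper receives the intended bank number.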

To call a routine normally, you would do this:
Code:
jsr Routine


And with this kernel, all you have to change if the routine is not in the same bank is:
Code:
jsr LongJSR
.dw Routine
.db :Routine   ;the : gets the bank number in WLA; I don't know how other assemblers handle this

In fact there are two versions of it: the normal version (not commented out), and a slightly faster version (commented out) that uses rts instead of jmp (indirect), but with that one you have to add a -1 after the .dw (i.e. .dw Routine-1).
I now use parentheses for indirection, as dish suggested, so that it is compatible with 65816 syntax.

The kernel automatically saves the old bank, switches in the new bank, and restores the old one before returning to just after the .db statement, so the code continues normally. That way you can very easily make subroutine calls from any bank to any bank. The only drawback is that it takes more stack space and more CPU time, but you get nothing for nothing. Also, it's impossible to pass arguments in the A or Y registers; only X can be used, and that applies to both input and output arguments, which is a shame.
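The stack arithmetic can be followed in a small Python model of version 2 of _LongJSR. This is a sketch, not an emulator: the address chosen for _MapperReturn, the call-site address and the function names are assumptions for illustration.

```python
MAPPER_RETURN = 0xFFC0   # hypothetical address of _MapperReturn

def long_jsr(stack, mem, last_bank):
    """Model of _LongJSR. `stack` is a list (last item = top of stack),
    `mem` maps addresses to bytes in the caller's bank."""
    lo = stack.pop()                       # jsr pushed high then low, so low pops first
    hi = stack.pop()
    ptr = (hi << 8) | lo                   # last byte of the 'jsr LongJSR' instruction
    ret = (ptr + 3) & 0xFFFF               # skip the 3 inline bytes (.dw + .db)
    stack.append(ret >> 8)                 # adjusted return address (still "minus 1")
    stack.append(ret & 0xFF)
    stack.append(last_bank)                # bank to restore on the way back
    stack.append((MAPPER_RETURN - 1) >> 8)
    stack.append((MAPPER_RETURN - 1) & 0xFF)
    target = mem[ptr + 1] | (mem[ptr + 2] << 8)   # .dw Routine (little-endian)
    bank = mem[ptr + 3]                           # .db :Routine
    return target, bank

def rts(stack):
    """6502 rts: pull low, pull high, continue at address + 1."""
    lo = stack.pop()
    hi = stack.pop()
    return (((hi << 8) | lo) + 1) & 0xFFFF

# 'jsr LongJSR' at $8000 occupies $8000-$8002, so jsr pushes $8002.
stack = [0x80, 0x02]
mem = {0x8003: 0x34, 0x8004: 0x92, 0x8005: 0x05}   # .dw $9234 / .db 5
target, bank = long_jsr(stack, mem, last_bank=2)
after_callee = rts(stack)      # the callee's rts lands on _MapperReturn
old_bank = stack.pop()         # the 'pla' at _MapperReturn
resume = rts(stack)            # the final rts resumes after the .db
```

In this model the callee's rts lands on _MapperReturn, which pulls the saved bank, and the last rts resumes at $8006, the first byte after the inline data. A and Y are overwritten along the way, which is why only X survives as an argument register.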

According to the WLA documentation, it supports multiple labels with the same name if and only if they are at the same address. While writing this I found that this is not true: no matter what I try, duplicate labels at the same address are not supported. However, I eventually found a trick so that I don't have to get rid of the labels and work everything out manually.
You could just copy/paste the code and add a number after each label for each bank, but this is ugly, and if you want to change the kernel it will be troublesome to go and change it in every bank.
So I just use local labels, and after that I use the .export directive so that they are available to the rest of the program. I get warnings for exporting the same label many times, but no errors, so this compiles fine 8)

To make LongJSR as transparent as possible, I'd like to have a WLA macro that automatically calls LongJSR if the caller is not in the same bank as the callee, and does a normal jsr otherwise.
Unfortunately I can't get it to work; I keep getting errors. I have the following macro. If anyone knows what is wrong, please tell me:
Quote:
.macro jsl args label
__ ;This is a dummy label
.ifneq :label :_b ;Check if the argument has the same bank as the dummy label
jsr LongJSR
.dw label ;If not, use the LongJSR routine from the kernel
.db :label
.else
jsr Label ;If so, just jsr normally
.endif
.endm


Let me know if you like the kernel and what could be improved.

by on (#41854)
What I generally do is have my bankswitching code in RAM. Though it's just fine to have a small copy in every bank, since you may need the vectors in every bank anyways. 256 or 512 bytes wasted out of each 32kB bank doesn't amount to a lot, considering how much easier it makes it.

Squeedo uses 32kB banks, and by necessity it has to start in the last bank, so I put pull-up resistors on it. You probably wouldn't need to do that.

For the reset at least you could easily just use 16 bytes at $FFF0 to config it and jump to the real startup bank.

Quote:
I have no idea how commercial games handled this (I haven't even looked in them at all), but I guess it would be really great if you could jsr to a routine in another bank just as easily as if it were in the same bank (from the point of view of the main code). By tricking the stack a little this is possible to do, and I made it as ROM-efficient as possible.

I believe some Atari 2600 games did something like that, JSR to switch banks. But there were a lot fewer control signals on the cart edge of that system, kind of odd for making add-ons (very unlike the NES).

by on (#41857)
Bankswitching from RAM sounds fine, but after the bankswitch, where do you jump? You could hold the destination address in XY, but eventually your code will end up doing this billions of times:
Code:
lda #:SomeRoutine
ldx #<SomeRoutine
ldy #>SomeRoutine
jmp RAMBankswitch

Instead of doing this billions of times:
Code:
jsr LongJSR
.dw SomeRoutine
.db :SomeRoutine

Not only does it save 3 bytes each time, it also lets the routine in question just rts so that the flow of the program continues normally. In the above case, the routine would have to jump back into the main program by itself, so it is not callable from another point and loses the advantage of subroutines!
I like the fact that a subroutine can do an "rts" and that any bankswitching needed to return to the main program is automatic.

by on (#41862)
I can confirm that the FF7 Chinese pirate game puts a bunch of code into RAM. I'd assume it probably uses 32K bankswitching.

by on (#41886)
You could always use BRK instead of JSR to invoke these "remote" functions. I mean presumably you're not going to be using DMC IRQs in a 32k switching environment anyway, so why not just hijack the interrupt vector for something a little more useful? Plus if you go through a jump table and use single byte parameters instead of full three-byte addresses then there's no need to adjust the stack on return.

Then again precisely what bank-switching scheme you end up using probably doesn't make much difference in the long run. I mean I appreciate nice generic systems and clever hacks as much as the next guy, but if you're switching frequently enough or have enough entry points for it to be an issue then you've probably got bigger problems to worry about.
Not that I've ever attempted a large project on a bank-switched machine, but in the long run I suspect you'll pretty much always end up painstakingly laying out the subsystems and their associated data in such a way as to minimize transitions and reduce bank spill.

by on (#41887)
The enhanced Apple IIe had to fit a 20 KB BIOS+BASIC into the same space as the Apple II Plus's 12 KB BIOS+BASIC. So it was bankswitched, and a lot of routines would load a syscall number and jump to a dispatcher. The dispatcher would switch in the alternate bank, set up the address of the routine on the stack based on the syscall number, and RTS to the routine. Then the routine would "far return" by jumping to the other side of the dispatcher.

The 65C816-based Apple IIGS did something similar, with a routine at $00/F89C that would switch from emulation mode to native mode, look up an address in the jump table, call it, and switch back to emulation mode. If you have a IIGS or emulator handy, start looking for "9C F8" in the ROM to learn more.

by on (#41916)
Quote:
You could always use BRK instead of JSR to invoke these "remote" functions. I mean presumably you're not going to be using DMC IRQs in a 32k switching environment anyway, so why not just hijack the interrupt vector for something a little more useful? Plus if you go through a jump table and use single byte parameters instead of full three-byte addresses then there's no need to adjust the stack on return.

This is a very interesting idea, and that would take 4 bytes per call instead of the normal 3 which is better than the 6 I needed in my version.
However, BRK pushes the actual return address instead of the address minus 1 as JSR does. I would still need to add 1 to it before returning, unless I can make do with a 2-byte argument, but how could I? I need the full address of the target routine and the bank number. I could have it pre-loaded in A, but I would need to waste one byte for the lda immediate. Although that's not so bad, because you can switch faster that way if no adjustment to the return address is needed.

I'm going to try this out.

by on (#41925)
Bregalad wrote:
Quote:
Plus if you go through a jump table and use single byte parameters instead of full three-byte addresses then there's no need adjust the stack on return.

This is a very interesting idea, and that would take 4 bytes per call instead of the normal 3 which is better than the 6 I needed in my version.
However, BRK pushes the actual return address instead of the address minus 1 as JSR does.

The disassembler in the Apple IIGS BIOS always disassembled BRK as a two-byte opcode: BRK $23, BRK $AB, etc. You could use this byte to look up the address and bank of the target routine in a table.
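As a rough illustration of the BRK-plus-signature-byte idea, here is a Python model of the stack behaviour. The table contents, addresses and function names are hypothetical; the model relies only on BRK pushing the address of the BRK opcode plus 2 followed by the status byte, and RTI returning to exactly that address.

```python
# Hypothetical far-call table: signature byte -> (bank, routine address)
FAR_TABLE = {0x00: (3, 0x8123), 0x01: (5, 0x9000)}

def brk(stack, brk_addr, status=0x00):
    """Model of BRK: push return address (opcode address + 2), then status."""
    ret = (brk_addr + 2) & 0xFFFF
    stack.append(ret >> 8)
    stack.append(ret & 0xFF)
    stack.append(status | 0x30)        # pushed status has bits 4 and 5 set

def brk_dispatch(stack, mem):
    """Peek at the stacked return address and read the byte just before it,
    which is the signature byte that follows the BRK opcode."""
    lo, hi = stack[-2], stack[-3]
    ret = (hi << 8) | lo
    signature = mem[(ret - 1) & 0xFFFF]
    return FAR_TABLE[signature]

def rti(stack):
    """Model of RTI: pull status, then the return address, with no +1."""
    stack.pop()
    lo = stack.pop()
    hi = stack.pop()
    return (hi << 8) | lo

stack = []
mem = {0xC001: 0x01}                   # 'BRK $01' assembled at $C000
brk(stack, 0xC000)
bank, target = brk_dispatch(stack, mem)
resume = rti(stack)                    # execution resumes after the signature byte
```

Since the stacked address already points past the signature byte, the handler does not need to patch the return address, at the cost of one table lookup per call.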

by on (#41929)
I don't quite like this idea, as it limits the target routines to 256 pre-defined routines, which would be annoying to deal with (although workable).
And since you'd have to fetch data from the lookup table, I bet it wouldn't even be faster.

by on (#41935)
Bregalad wrote:
I don't quite like this idea as it limits the target routines to 256 pre-defined routines

You could treat each bank as a module, where banks present a well-defined interface to other banks. For example, the sound engine might present about three entry points: startMusic, startSoundEffect, and runSound.

by on (#41936)
Or just do linking the stupid TASM way:
Use an assembler which can export labels
First delete all export files
Assemble each source file separately to get its exports
Then assemble each source file for real, now that all exports are available.
This of course assumes the assembler will generate exports in the event of an error.