To clear up a couple things in the past few posts:
1. On the 65816 in native mode, before entering an interrupt (IRQ or NMI, doesn't matter), the following things are pushed onto the stack automatically by the CPU in this order: K (active program bank), PC (high), PC (low), and P. When exiting an interrupt (i.e. via RTI), those values are pulled/popped off the stack in reverse order (i.e. normally).
So yes, you do not need PHP/PLP. I suppose this was a "precautionary habit" of mine from my IIGS days where I didn't have books describing the processors' behaviour (I got those near the end of my IIGS stint). However, see point #2 before removing this.
If you need reference material for my statements, refer to the Programming the 65816 (including the 6502, 65C02, and 65802) PDF by Western Design Center (WDC used to have this on their site; it has since gone missing, but we keep a copy). See the section on Interrupts and System Control Instructions for details.
2. Regarding PHA/PHX/PHY -- "the accumulator always has sixteen bits in it regardless of whether or not the processor says an 8-bit accumulator or a 16-bit accumulator" is a true statement (for those reading it: read it VERY VERY SLOWLY), but has no relevance to those stack operations. My point written more simply: if you do SEP #$30 / PHA, you're going to push 1 byte onto the stack because the accumulator is 8-bit.
I think we all know that REP #$20 / LDA #$1234 / SEP #$20 / LDA #$FF will result in an accumulator (internally in the CPU) that now contains the value $12FF (but since a=8 you can only manipulate the lower byte). But that fact has no relevance to concerns over the stack operations unless you're doing something like REP #$20 / LDA #$1234 / PHA / SEP #$20 / PLA (at this point you'd still have a byte left on the stack from the previous 16-bit push).
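To make the mismatch concrete, here's that last sequence written out as a rough sketch. It's generic 65816 syntax; any register-size directives your assembler needs for the 16-bit immediate are left out, so treat it as an illustration and not something to paste in verbatim.

```
        rep #$20        ; A = 16-bit
        lda #$1234      ; (assumes the assembler knows A is 16-bit here)
        pha             ; pushes 2 bytes: $12 first, then $34
        sep #$20        ; A = 8-bit
        pla             ; pulls 1 byte: low byte of A = $34
                        ; $12 is still sitting on the stack
```

The push moved S down by 2 and the pull only moved it back up by 1, so every pass through code like this leaks one byte of stack.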
Thus, the advantage of doing PHA/PHX/PHY/PHP at the start of the NMI routine, followed by PLP/PLY/PLX/PLA at the end -- particularly the use of PHP/PLP here -- is that if you screw around with the accumulator or X/Y sizes (using REP/SEP) in the NMI routine, when exiting NMI and restoring the contents of A/X/Y, you're not going to end up with a stack that eventually overflows or underflows (due to register size differences). Yes, the CPU will effectively do the PHP/PLP for you, but there's no way for you to "run some code after the CPU internally does PLP" to ensure your previous PHA/PHX/PHY statements get popped off the stack with the same sizes they were when you pushed them on at the start of your NMI routine.
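A minimal sketch of that PHP/PLP layout, assuming native mode. The label name NMI is just a placeholder and the handler body is left out.

```
NMI:                    ; placeholder label
        pha             ; pushed at whatever size A had when the NMI fired
        phx             ; same for X
        phy             ; same for Y
        php             ; remember the m/x flags as they were on entry

        ; ... handler body, free to use REP/SEP however it likes ...

        plp             ; put the entry sizes back FIRST
        ply             ; so these pulls match the sizes of the pushes above
        plx
        pla
        rti             ; CPU pulls P, PC, and K
```

The important bit is that PLP comes before the other pulls; it's the only thing that guarantees PLY/PLX/PLA pull the same number of bytes that PHA/PHX/PHY pushed.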
The other solution is to do what KungFuKirby did -- explicitly set the accumulator and X/Y index sizes to 16-bit using REP #$30 and then do your pushes, and at the end of your routine again do REP #$30 and do your pulls. Which method is better? REP = 3 cycles, PHP = 3 cycles, PLP = 4 cycles. Two REPs cost 6 cycles versus 7 for PHP + PLP, so by using the "explicitly use REP" method, you save 1 cycle. Whoop de doo.
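Roughly, the REP #$30 variant looks like the sketch below. This is my own outline of the idea, not KungFuKirby's actual code; the label is again a placeholder and native mode is assumed.

```
NMI:                    ; placeholder label
        rep #$30        ; A and X/Y = 16-bit, no matter what they were on entry
        pha             ; always 2 bytes each
        phx
        phy

        ; ... handler body, free to use REP/SEP however it likes ...

        rep #$30        ; back to 16-bit before pulling
        ply             ; always 2 bytes each, matching the pushes
        plx
        pla
        rti             ; RTI pulls P, restoring the interrupted code's m/x flags
```

One side effect of this version: because A is pushed as 16 bits, the hidden high byte of the accumulator gets saved and restored too, even if the interrupted code was running with an 8-bit accumulator.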
3. In KungFuKirby's routine, the reason he does PHB/PLB is because he tinkers with B later (the LDA #$0000 ... PHK / PLB to set B = $00). The reason he does PHD/PLD is because of the LDA #$0000 / TCD which sets D = $0000.
You shouldn't have any concerns about K after exiting NMI because the CPU takes care of that for you. However, there IS a problem if you intermix emulation mode (ex. SEC / XCE) and native mode (to the OP: you can ignore this, this is just me pointing out an esoteric case). Quoting the aforementioned reference material:
Quote:
In native mode, the program bank (K) is pushed onto the stack first, before the program counter and the status register; but in emulation mode it is lost. This means that if a 65816 program is running in emulation mode in a bank other than 0 when an interrupt occurs, there will be no way of knowing where to return to after the interrupt is processed because the original bank will have been lost.
This unavoidable but fairly esoteric problem can be dealt with in two ways: the first is simply never to run in emulation mode outside bank 0. The second solution is to store the value of K in a known location before entering emulation mode with a non-zero K register (and is described later in this chapter).
Anyway -- in short, I don't see how any of the stack manipulation code the OP is using would be causing any kind of problem "on the PowerPak" (more specifically, on hardware -- or an emulator for that matter). Analysis of that is a red herring, IMO.
My opinion is that the problem is elsewhere. I have not looked at the latest code version.