tepples wrote:
Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.
If the destination is the multiplication registers, you can just point D at them, with no need to switch banks. If you do need to switch banks, though, as I mentioned before, you can do it while waiting for the multiplication result.
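Roughly what I have in mind, as an untested sketch. It assumes native mode with 8-bit A and 16-bit X/Y; src_a, src_b, dest and DEST_BANK are made-up names for illustration. The instructions between the write to $4203 and the read of $4216 add up to about 12 CPU cycles, which covers the 8 the multiplier needs.

```
; Sketch only. src_a, src_b, dest and DEST_BANK are hypothetical.
        pea $4200
        pld                 ; D = $4200: dp $02 = WRMPYA, dp $03 = WRMPYB
        lda src_a,x         ; read through DBR (source bank)
        sta $02             ; hits $00:4202, direct page always targets bank 0
        lda src_b,x
        sta $03             ; hits $00:4203, this write starts the multiply
        ; Spend the wait on the bank switch instead of NOPs.
        lda #DEST_BANK
        pha
        plb                 ; DBR = destination bank
        rep #$20            ; 16-bit A
        lda $16             ; RDMPYL/RDMPYH at $4216-$4217, still via D
        sta dest,y          ; written through the new DBR
        sep #$20            ; back to 8-bit A
```

The register writes and the result read never touch DBR, so the only bank switch left is the one for the final store.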
tepples wrote:
For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.
If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.
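Something like this, as a sketch. It assumes D = $4200, 8-bit A, WRMPYA already loaded, and that zero never occurs as real data; table is a hypothetical zero-terminated list.

```
; Sketch only. table is hypothetical and ends with a 0 byte.
loop:
        lda table,x         ; the load itself sets Z on the terminator
        beq done            ; so no stored counter or CPX is needed
        sta $03             ; WRMPYB ($4203), starts the multiply
        ; ... wait for and consume the product here ...
        inx
        bra loop
done:
```

The index can count upward from any base, since the exit test comes from the data rather than from X reaching zero.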
Edit: Oops, I missed a reply.
lidnariq wrote:
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.
You can do a single 16-bit load into C. XBA, to my knowledge, does not require any slow cycles under FastROM (unless you're executing the code out of RAM).
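A sketch of the idea, not a claim that it wins in every case. It assumes D = $4200 and a hypothetical table named pairs that holds the two operands as adjacent bytes. The REP/SEP pair costs cycles of its own, so whether this comes out ahead depends on the surrounding code.

```
; Sketch only. pairs is a hypothetical table of interleaved operand bytes.
        rep #$20            ; 16-bit A
        lda pairs,x         ; one load fills both halves of C
        sep #$20            ; 8-bit A; the high byte stays in B
        sta $02             ; low byte  -> WRMPYA ($4202)
        xba                 ; swap B into A
        sta $03             ; high byte -> WRMPYB ($4203), starts the multiply
```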
Quote:
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.
You were originally referring to using D to write to the multiplication registers, as though not having D available for that would be the bottleneck. The answer, simply, is to set D to point at the multiplication registers. (It is true that you could be using D for something else. Thinking about this some more, much of my own programming style keeps the state of D local to each function, so changing D is a non-issue in this case.)
Quote:
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.
The question was what "worse" referred to in your original post. Worse in comparison to what? (Looking at it now, I guess you meant that not using two multipliers would exacerbate the losses from not using D.)