Avoiding using signed multiplication while in Mode 7

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Avoiding using signed multiplication while in Mode 7
by on (#211924)
If I'm planning on making a Mode 7 level that uses some of the same enemies as non-Mode 7 levels, is there any reason to use $211b, $211c and $2134 in the first place?
Re: Avoiding using signed multiplication while in Mode 7
by on (#211925)
To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...

Comparing apples to apples restricts you to u7·u7→u14 anyway.

The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy

In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Where the PPU multiplier wins big is just in requiring fewer total multiplications.
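
As a quick check of the cycle arithmetic above, here is a small Python sketch. It assumes, as the figures in this post do, that every CPU cycle costs 6 master cycles (the best case); the function name is made up for illustration.

```python
# Rough model of the best-case timings quoted above.
# Assumption: every CPU cycle costs 6 master cycles (MtCy), as in the post.
MASTER_PER_CPU_CYCLE = 6

def master_cycles(cpu_cycles_per_step):
    """Sum the CPU cycles of each step and convert to master cycles."""
    return sum(cpu_cycles_per_step) * MASTER_PER_CPU_CYCLE

# CPU multiplier: 16-bit store, 8-cycle wait, 16-bit load.
cpu_path = master_cycles([4, 8, 4])
# PPU multiplier: 8-bit STZ, 16-bit store, 16-bit load.
ppu_path = master_cycles([3, 4, 4])

print(cpu_path, ppu_path)  # 96 66
```

On these numbers the PPU path saves 30 master cycles per multiplication, and only if the operands are already in registers and D already points at the right page.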
Re: Avoiding using signed multiplication while in Mode 7
by on (#211928)
You don't necessarily need to wait 8 and have the CPU do nothing. Why not spend that time doing additional processing to mask that latency?

lidnariq wrote:
In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers

If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.

Quote:
and D is unlikely to point to the multiplication registers.

Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.

Quote:
Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211929)
lidnariq wrote:
Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Holding one factor constant is the case for applications that (ab)use the multiplier as a faux barrel shifter. In VWF rendering, for example, each bitplane in a tile is "multiplied" by a particular power of two in order to shift it left by so many bits. This works with the CPU multiplier but not the PPU one because of the signedness constraint.
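
The signedness problem can be seen in a small Python model. This is only a sketch: the function names are hypothetical, and the signed case is a simplification that assumes the bitplane byte lands in a signed 8-bit operand and that only the low 16 bits of the result are kept.

```python
def shift_via_unsigned_mul(plane_byte, shift):
    """Model a u8 x u8 -> u16 multiplier used as a left shifter."""
    assert 0 <= plane_byte <= 0xFF and 0 <= shift <= 7
    return plane_byte * (1 << shift)  # same bits as plane_byte << shift

def as_signed8(value):
    """Reinterpret an 8-bit pattern as a two's-complement number."""
    return value - 0x100 if value & 0x80 else value

def shift_via_signed_mul(plane_byte, shift):
    """Same shift, but the 8-bit operand is treated as signed.

    Assumption: the result is truncated to 16 bits for comparison.
    """
    return (as_signed8(plane_byte) * (1 << shift)) & 0xFFFF

# A byte with bit 7 clear behaves identically in both models.
print(hex(shift_via_unsigned_mul(0x41, 3)), hex(shift_via_signed_mul(0x41, 3)))
# 0x208 0x208

# A byte with bit 7 set is sign-extended, corrupting the high byte.
print(hex(shift_via_unsigned_mul(0x81, 3)), hex(shift_via_signed_mul(0x81, 3)))
# 0x408 0xfc08
```

The low byte still matches, but the bits shifted out into the next tile column (the high byte) are wrong whenever bit 7 of the bitplane byte is set.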

HihiDanni wrote:
Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once.

But if you have set D to $4200 to use multiplication, you have no place to store pointers to the data that you're processing using (dd),Y or [dd],Y addressing, unless you take the cycle and bank flexibility hit of using (dd,S),Y.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211931)
HihiDanni wrote:
If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.

Quote:
Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.

Quote:
Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211932)
What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211936)
HihiDanni wrote:
What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.

For one thing, a base address used with aaaaaa,X or [dd],Y is 24-bit, whereas an offset or base plus offset is only 16-bit. Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank. For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211940)
tepples wrote:
Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.

If the destination is the multiplication register then you can just use D, with no need to bank switch. Should you need to bank switch though, as I had mentioned before, you can do it while waiting for the multiplication result.

tepples wrote:
For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.

If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.

Edit: Oops, I missed a reply.

lidnariq wrote:
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.

You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).

Quote:
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.

You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)

Quote:
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.

The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
Re: Avoiding using signed multiplication while in Mode 7
by on (#211941)
HihiDanni wrote:
tepples wrote:
Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.

If the destination is the multiplication register then you can just use D, with no need to bank switch.

The loop I had envisioned was read source, write to multiplier, read multiplier, write to destination.

HihiDanni wrote:
If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.

Good for reading characters out of "<Arnold> I still love Vista, baby\0", not so much for shifting the glyphs that represent each letter.

HihiDanni wrote:
You can do a single 16-bit load into C.

With a 6-cycle (36mc) penalty for REP plus SEP, if needed.

I guess I might need to give an example of what this sort of compositing code might look like.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211944)
tepples wrote:
With a 6-cycle (36mc) penalty for REP plus SEP, if needed.

As it turns out, REP and SEP indeed take three cycles each (for some reason I was imagining two cycles) so that might not be the most optimal way to do it after all.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211946)
HihiDanni wrote:
You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).
But that requires that your in-memory structure already be compatible with that (multiplier and multiplicand in adjacent bytes). There could well be reasons that's not feasible.

Quote:
You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)
That was not the argument I was trying to make.

I was attempting to state that the performance benefit of using direct page access to $42xx is so small in comparison to all the other overhead that it doesn't matter much. And especially in the case of $4200, you get literally no other benefit from doing so; there's nothing else there that is useful to have faster access to (not even two bytes of RAM!). Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A), so you have to do a bunch of multiplications in a row in order for it to be worthwhile.
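
To put a rough number on "a bunch", here is a break-even sketch in Python. It takes the 68 and 30 MtCy setup costs from this post and the 12-18 MtCy per-multiplication saving quoted earlier in the thread. The assumption that D has to be restored afterwards at the same cost is mine, as is the function name.

```python
import math

PEA_PLD_COST = 6 * 6 + 4 * 8  # 68 MtCy, as computed in the post
LDA_TAD_COST = 30             # MtCy, as quoted in the post (clobbers A)

def break_even(setup_cost, saving_per_mul, restore=True):
    """Multiplications needed to at least recover the cost of changing D.

    Assumption: when restore is True, D is put back afterwards
    at the same cost as setting it.
    """
    total = setup_cost * (2 if restore else 1)
    return math.ceil(total / saving_per_mul)

for saving in (12, 18):  # per-multiplication savings quoted in the thread
    print(saving,
          break_even(PEA_PLD_COST, saving),
          break_even(LDA_TAD_COST, saving))
# 12 12 5
# 18 8 4
```

Under these assumptions the PEA / PLD route needs roughly 8 to 12 back-to-back multiplications before it breaks even, and the LDA / TAD route about 4 or 5.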

Quote:
The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
What I was really trying to aim at was something to the effect of:

The way to get the most speed out of either multiplier is to minimize the time spent on I/O, because I/O is the bottleneck. Setting D to point at the multiplication registers can help ... but only if there isn't somewhere else it would be more useful to have it stay at. Setting the multiplier once and just updating the multiplicand can help, but only if you're doing a bunch of multiplications in a row all by the same multiplier.


Specifically regarding the topic starter, the question was "Should I write two versions of the code, one that uses the faster PPU multiplier and one that uses the slower CPU multiplier? Or just use the slower CPU multiplier for everything?". Everything in my reply was my reasoning toward the conclusion: "IF you can do your math in the u7·u8→u15 least common denominator of both multipliers, there is no significant benefit to using one over the other (so go ahead and use the CPU multiplier exclusively). The overwhelmingly biggest benefit of the PPU multiplier comes when it lets you do fewer total multiplications (and spend fewer cycles on I/O)"
Re: Avoiding using signed multiplication while in Mode 7
by on (#211948)
Your original post makes sense now; thanks for the clarification.

lidnariq wrote:
Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A) so you have to do a bunch of multiplications in a row in order for it to be worthwhile.

Yeah, I had envisioned the multiplications being done in a tight loop. Probably useful for Mode 7 (though it'd depend on how exactly Mode 7 is being used here, whether there will be any perspective effects or not). For object thinkers I'd leave D alone.
Re: Avoiding using signed multiplication while in Mode 7
by on (#211972)
I think I should test everything out with the $42xx registers to see if I need to make 2 different multiplication routines.
Re: Avoiding using signed multiplication while in Mode 7
by on (#212056)
It appears that the performance loss is negligible. :D

Now I need to make something with mode 7.
Re: Avoiding using signed multiplication while in Mode 7
by on (#212568)
lidnariq wrote:
To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...

Comparing apples to apples restricts you to u7·u7→u14 anyway.

The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy

In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Where the PPU multiplier wins big is just in requiring fewer total multiplications.


So issuing the load and store instructions is only about 1/3 faster using the PPU, but its advantage in doing the multiplication itself is much more noticeable once the operands are loaded... is that right?
Re: Avoiding using signed multiplication while in Mode 7
by on (#212615)
Unfortunately my sprite rotation routine happens outside of the main loop, and can be interrupted by NMI at any time, and uses multiplication registers so it needs to disable interrupts. Doing so might cause it to go into NMI pretty late, causing black bars if the DMA is also being used.
Re: Avoiding using signed multiplication while in Mode 7
by on (#212627)
The multiplication registers do their work in the background, and software only has to read and write them, right? I'd think that the state of those registers would survive the NMI interrupt, as long as you don't touch those registers in your interrupt routine. Let me know if I'm wrong though.
Re: Avoiding using signed multiplication while in Mode 7
by on (#212630)
psycopathicteen wrote:
uses multiplication registers so it needs to disable interrupts.

I mean, that's only a problem if the interrupt is using the multiplier too, right? I suppose one might wish to leave the option open...

Alternately, you could dedicate a couple of banks to an 8x8 multiplication table. I think the ALU should be a bit faster, especially if you use the waiting time intelligently, but you could use tabulated multiplication in interrupts without worrying about what the main code is doing, and if you wanted to waste four banks instead of two you could do 7x7 signed multiply just as easily.

Let's see:

Code:
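sep #$20              ; 8-bit accumulator
lda A                 ; first factor
xba                   ; park it in the accumulator's high byte
lda B                 ; second factor in the low byte
rep #$20              ; 16-bit accumulator now holds A*256 + B
asl                   ; double for 16-bit entries; bit 15 goes to carry
tax                   ; assumes 16-bit index registers; carry is unchanged
bcs +                 ; carry set (A >= $80): use the second table
lda.l multable_low,x
bra ++
+ lda.l multable_high,x
++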

Is there a better way to do that?

Hold on. If you split the results into low-byte and high-byte tables:
Code:
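sep #$20              ; 8-bit accumulator
lda A                 ; first factor
xba                   ; park it in the accumulator's high byte
lda B                 ; second factor in the low byte
tax                   ; X = A*256 + B (assumes 16-bit index registers)
lda.l multable_hib,x  ; high byte of the product
xba                   ; move it to the accumulator's high byte
lda.l multable_lob,x  ; low byte of the product
rep #$20              ; 16-bit accumulator now holds the full product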

...exactly the same length (but more predictable). Huh.
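
For anyone wanting to try the split-table version, here is a hedged Python sketch of how the two tables could be generated at build time. The table names match the labels in the code above; the function names and the idea of generating the data with a script are assumptions for illustration. Each table comes out at exactly 64 KiB, i.e. one bank apiece.

```python
def build_split_tables():
    """Build low-byte and high-byte tables for every u8 x u8 product.

    The entry index is (a << 8) | b, matching the XBA-packed index
    that the lookup code above transfers into X.
    """
    lob = bytearray(0x10000)
    hib = bytearray(0x10000)
    for a in range(256):
        for b in range(256):
            product = a * b
            index = (a << 8) | b
            lob[index] = product & 0xFF
            hib[index] = product >> 8
    return bytes(lob), bytes(hib)

def lookup(lob, hib, a, b):
    """Mirror the two table reads done by the assembly routine."""
    index = (a << 8) | b
    return (hib[index] << 8) | lob[index]

multable_lob, multable_hib = build_split_tables()
print(len(multable_lob), len(multable_hib))          # 65536 65536
print(lookup(multable_lob, multable_hib, 200, 123))  # 24600
```

The resulting byte strings could then be written out and included as binary data in two separate banks.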
Re: Avoiding using signed multiplication while in Mode 7
by on (#212633)
The entire game runs in the "NMI" code. All that's being done outside of NMI is rotating sprites, and that takes many frames to finish.

I decided to check which scanline the code is on before doing the multiplication; if it is on the last 2 lines of a frame, it waits for the next frame.
Re: Avoiding using signed multiplication while in Mode 7
by on (#212716)
Now I just need to figure out how to seamlessly switch to Mode 7 mid-level. I'd have to make an alternate version of the tile-map scrolling code. I can probably limit the first area of the level to use the second 16kB chunk of VRAM, and only use the first half of the Mode 7 layer's memory.