Yet another DMA optimization thread

by psycopathicteen on 2016-06-09 (#173174)

So my current DMA routine is an unrolled loop. The repeating part looks like this:

Code:

lda.w {dma_address}+{n},x                             //5        2
sta $02                                               //4 9      0
lda.w {dma_bank}+{n},x                                //5 14     2
sta $04                                               //4 18     0
lda.w {dma_destination}+{n},x                         //5 23     2
sta $2116                                             //5 28     0
sty $420b                                             //4 32     0

//32 + 6/3 = 34 fast cycles

Where's the length word? Well, it's hidden inside the top byte of the "bank" word. Since I'm only updating little individual sprites, it works because I don't have to deal with chunks bigger than 256 bytes. Unfortunately, it's no longer a "general purpose" DMA routine. If we want to make it more "general purpose" while keeping the speed, we have to optimize it even more.
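
To make the packing concrete, here is a small Python sketch of the table word described above. It assumes the low byte is the source bank and the high byte is the byte count, and that the direct page sits at $4300 so that `sta $04` lands on $4304/$4305; the post does not state either. `pack_bank_length` and `unpack_bank_length` are hypothetical helpers for illustration, not part of the routine.

```python
def pack_bank_length(bank, length):
    """Pack a source bank and a byte count into one 16-bit table word.

    Assumed layout: low byte = bank, high byte = length in bytes.
    """
    if not 0 <= bank <= 0xFF:
        raise ValueError("bank must fit in one byte")
    if not 1 <= length <= 0xFF:
        # A zero length is excluded here; only 1..255 fits this sketch.
        raise ValueError("length must be 1..255 to fit in the top byte")
    return bank | (length << 8)


def unpack_bank_length(word):
    """Split a packed table word back into (bank, length)."""
    return word & 0xFF, (word >> 8) & 0xFF


# Example: a 128-byte transfer from bank $7E
word = pack_bank_length(0x7E, 128)
print(hex(word))  # 0x807e
```

Anything longer than one byte's worth of length has no room in this word, which is why the routine stops being general purpose.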

Now let's PEI all over those DMA registers!

Code:

txs                                             //2        0
pei ({dma_length}+{n}+1)                        //6 8      2
pei ({dma_bank}+{n})                            //6 14     2
pei ({dma_address}+{n})                         //6 20     2
ldy.b {dma_destination}+{n}                     //4 24     2
sty $2116                                       //5 29     0
sta $420b                                       //4 33     0

33 + 8/3 = 35.667 fast cycles

With the extra length word, it is slightly slower (35.667 vs. 34 fast cycles) than the first method, which doesn't have one.
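
The totals in the two listings can be checked with a few lines of Python. This sketch reads the third column as the number of slow-memory accesses per instruction and assumes each one costs an extra third of a fast cycle (8 master clocks instead of 6). That reading is inferred from the "6/3" and "8/3" terms; the post does not spell it out.

```python
from fractions import Fraction

# (fast cycles, slow accesses) per instruction, copied from the listings above
METHOD_1 = [(5, 2), (4, 0), (5, 2), (4, 0), (5, 2), (5, 0), (4, 0)]
METHOD_2 = [(2, 0), (6, 2), (6, 2), (6, 2), (4, 2), (5, 0), (4, 0)]


def fast_cycle_total(rows):
    """Total cost in fast cycles; each slow access adds 1/3 of a fast cycle."""
    cycles = sum(c for c, _ in rows)
    slow = sum(s for _, s in rows)
    return cycles + Fraction(slow, 3)


print(float(fast_cycle_total(METHOD_1)))            # 34.0
print(round(float(fast_cycle_total(METHOD_2)), 3))  # 35.667
```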

Re: Yet another DMA optimization thread
by tepples on 2016-06-10 (#173195)

psycopathicteen wrote:

Now let's PEI all over those DMA registers!

This is during SEI, correct? Because PEA/PEI on top of registers that aren't readable can be dangerous if you get an IRQ at the wrong time, and the gaps before and between DMA channels' register sets aren't readable.

How many distinct DMAs do you have per vblank? This helps determine how much you save if you prepare all 8 DMA channels (or the 6 you aren't using for HDMA) during draw time and then activate them at the start of vblank. It's probably a lot though, given that each is on the order of 64 bytes (2 tiles as top half of a 16x16) or 128 bytes (4 tiles as top half of two 16x16s). So I see how the overhead can be substantial, as DMA copies 3 bytes per 4 fast cycles, or 128 bytes in 171 cycles.
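
The "3 bytes per 4 fast cycles" figure can be turned into a quick estimate. This is a sketch in Python that assumes only that ratio; `dma_fast_cycles` is a hypothetical helper, and it ignores any fixed per-transfer overhead.

```python
def dma_fast_cycles(num_bytes):
    """Fast cycles to copy num_bytes at 3 bytes per 4 fast cycles, rounded up."""
    return (num_bytes * 4 + 2) // 3  # integer ceiling of num_bytes * 4 / 3


for size in (64, 128, 1024):
    print(size, dma_fast_cycles(size))
# 64 86
# 128 171
# 1024 1366
```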

Is sprite cel VRAM so jam-packed that it would hurt to always allocate 8 16x16 sprites at a time, so you can do an entire strip as one 1024-byte chunk? If not, can sprites with adjacent destinations be coalesced to reduce $2116 writes?
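
One way to picture the coalescing idea is a draw-time pass that marks which queue entries still need a $2116 write. The Python sketch below assumes VRAM word addresses, an address that advances one word per two bytes transferred, and a queue of `(dest_word, length_bytes)` pairs. The function name and the queue layout are hypothetical, not taken from anyone's engine.

```python
def mark_address_writes(queue):
    """Return (dest, length, needs_write) tuples, sorted by destination.

    needs_write is False when an entry starts exactly where the previous
    transfer left the VRAM address, so its $2116 write could be skipped.
    """
    result = []
    next_addr = None  # VRAM word address after the previous transfer
    for dest, length in sorted(queue):
        result.append((dest, length, dest != next_addr))
        next_addr = dest + length // 2  # two bytes per VRAM word
    return result


# Example: the first two entries are adjacent in VRAM, the third is not.
for entry in mark_address_writes([(0x6040, 64), (0x6000, 128), (0x6100, 64)]):
    print(entry)
# (24576, 128, True)
# (24640, 64, False)
# (24832, 64, True)
```

Sorting by destination changes the transfer order, which is only acceptable if the transfers are independent of each other.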

Re: Yet another DMA optimization thread
by psycopathicteen on 2016-06-12 (#173345)

If I SEI before and CLI afterwards, would it work exactly the same?

Re: Yet another DMA optimization thread
by tepples on 2016-06-12 (#173347)

Yes, as long as you restore the stack pointer before you reenable interrupts.

Re: Yet another DMA optimization thread
by psycopathicteen on 2016-06-12 (#173358)

I hope this will be the final optimization for my DMA code. Keeping everything within vblank has always been a pain in the butt.

Re: Yet another DMA optimization thread
by psycopathicteen on 2016-06-17 (#173729)

You know what? I think I'll just use a separate list for longer DMAs instead of going through the hassle.