So my current DMA routine is an unrolled loop. The repeating part looks like this:
Code:
//columns: cycles, running total, slow accesses
lda.w {dma_address}+{n},x //5 5 2
sta $02 //4 9 0
lda.w {dma_bank}+{n},x //5 14 2
sta $04 //4 18 0
lda.w {dma_destination}+{n},x //5 23 2
sta $2116 //5 28 0
sty $420b //4 32 0
//32 + 6/3 = 34 fast cycles (each slow access costs an extra 1/3 of a fast cycle)
Where's the length word? Well, it's hidden inside the top byte of the "bank" word. Since I'm only updating small individual sprites, that works, because I never have to deal with chunks of 256 bytes or more. Unfortunately, it's no longer a "general purpose" DMA routine. To make it more "general purpose" while keeping the speed, we have to optimize it even more.
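To make the packing concrete, here is a minimal sketch of what one packed table entry could look like. It assumes the direct page points at $4300 while the loop runs, so the 16-bit store to $04 lands on $4304 (bank) and $4305 (low byte of the byte count). The bank $7E, the 128-byte length, and writing the entry through the same {dma_bank}+{n},x slot are made up for illustration, not taken from the actual queueing code.
Code:
//hypothetical entry: 128 bytes ($80) from bank $7E
//high byte = length, low byte = bank (A must be 16-bit here)
lda.w #$807E
sta.w {dma_bank}+{n},x
When the loop later stores that word to $04, $4304 gets $7E and $4305 gets $80. $4306 is never written, so the trick relies on the high byte of the byte count already being zero.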
Now let's PEI all over those DMA registers!
Code:
txs //2 2 0
pei ({dma_legnth}+{n}+1) //6 8 2
pei ({dma_bank}+{n}) //6 14 2
pei ({dma_address}+{n}) //6 20 2
ldy.b {dma_destination}+{n} //4 24 2
sty $2116 //5 29 0
sta $420b //4 33 0
//33 + 8/3 = 35.667 fast cycles
With the extra length word, it is slightly slower than the first method without it.