I'll ask again - how long is a $4014 sprite DMA in CPU cycles? I suspect my code isn't right. The idea should be simple, but I guess it's more complicated.
Currently, the irq_and_dma test ROM fails, even in Nintendulator. Also, the DMC and SPR DMA test ROM looks OK, but... I'm not sure.
EDIT: OK, my emulator currently takes 513 cycles (+1 when starting on an odd cycle).
Or, to put it another way: how does the DMA work?
My current emulation approach (untested) works like this:
- On the cycle when $4014 is written to, the value on the bus is read and captured, but nothing else happens. (cycle 0)
- On the next cycle after that, RDY is signalled and one wait cycle is performed. (cycle 1)
- Then, for the next 512 cycles, the address line alternates between R:value*0x100+n and W:0x2004, for n = 0 to 255. The DMA neither writes nor reads the values on the bus; it only controls the addresses and the read/write mode. The RAM puts values on the bus and the PPU reads values from the bus when their respective addresses are signalled. (cycles 2-513)
- On the final cycle, RDY is reset, and the DMA controller goes back to its default state. (cycle 514)
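A minimal Python sketch of the sprite DMA sequence described in the list above, emitting one bus operation per cycle. The function name `oam_dma_schedule` and the tuple layout are hypothetical; it models only this description (no odd-cycle alignment, no DMC interaction), and it assumes the read address steps through $XX00-$XXFF in order.

```python
from typing import Iterator, Optional, Tuple

# One entry per CPU cycle: (cycle number, "R"/"W" or None if idle, address or None)
BusOp = Tuple[int, Optional[str], Optional[int]]


def oam_dma_schedule(page: int) -> Iterator[BusOp]:
    """Yield the bus activity of a sprite DMA triggered by writing `page` to $4014.

    Hypothetical model of the sequence above: no odd/even alignment cycle,
    no DMC interaction.
    """
    base = (page & 0xFF) << 8
    yield (0, None, None)  # $4014 write cycle: page value captured
    yield (1, None, None)  # RDY asserted, one wait cycle
    cycle = 2
    for n in range(256):
        yield (cycle, "R", base + n)    # memory drives the data bus
        yield (cycle + 1, "W", 0x2004)  # PPU picks the byte up as an OAM write
        cycle += 2
    yield (cycle, None, None)  # cycle 514: RDY released, DMA returns to idle


ops = list(oam_dma_schedule(0x02))
reads = [addr for _, rw, addr in ops if rw == "R"]
writes = [addr for _, rw, addr in ops if rw == "W"]
```

Numbered this way, the sequence spans cycles 0 through 514, so the 513/514 figure depends on which of those cycles are counted as part of the DMA.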
This approach does not cooperate in any way with the DMC DMA. Currently, the DMC has its own DMA, which works as follows:
- On the cycle when the wavelength counter demands the next sample and the sample buffer is empty, RDY is signalled, and nothing else is done for the DMC. (cycle 1)
- The next two cycles are idle, for a total of 3 idle cycles. (cycles 2-3)
- On the fourth cycle, the address line is signalled with R:address. (cycle 4)
- On the fifth cycle, the value is read from the bus (one cycle of wait was necessary for the value to become available on the bus), RDY is reset, and the DMC unit continues its business. (cycle 5)
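The DMC sequence above could be stepped one CPU cycle at a time with a small state machine like this Python sketch. The class `DmcDma` and its `start`/`tick` methods are hypothetical; `read` stands in for the emulator's bus, and the cycle numbering follows the list above rather than verified hardware timing.

```python
from typing import Callable, Optional


class DmcDma:
    """Hypothetical per-cycle model of the five-cycle DMC fetch described above."""

    def __init__(self, read: Callable[[int], int]) -> None:
        self.read = read                       # stands in for the CPU bus
        self.cycle = 0                         # 0 = inactive, otherwise 1..5
        self.address = 0
        self.bus_address: Optional[int] = None
        self.rdy_low = False
        self.sample_buffer: Optional[int] = None

    def start(self, address: int) -> None:
        """Cycle 1: a sample is needed and the buffer is empty, so pull RDY low."""
        if self.cycle == 0 and self.sample_buffer is None:
            self.address = address
            self.rdy_low = True
            self.cycle = 1

    def tick(self) -> None:
        """Advance one CPU cycle; does nothing while inactive."""
        if self.cycle == 0:
            return
        self.cycle += 1                        # cycles 2 and 3 are idle
        if self.cycle == 4:
            self.bus_address = self.address    # R:address signalled
        elif self.cycle == 5:
            # the value has had one cycle to appear on the bus; latch it
            self.sample_buffer = self.read(self.address)
            self.rdy_low = False
            self.cycle = 0


memory = {0xC000: 0xAA}
dma = DmcDma(lambda addr: memory.get(addr, 0))
dma.start(0xC000)
stalled_ticks = 0
while dma.rdy_low:
    dma.tick()
    stalled_ticks += 1
```

The DMC output unit would clear `sample_buffer` when it consumes the byte; that part is left out of the sketch.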
If the sprite DMA and the DMC DMA happen at the same time, they will interfere with each other.
The CPU works as follows:
- At the beginning of each cycle, if the address line is in read mode, RDY is checked.
- If RDY was not signalled, the current cycle is processed. Otherwise, nothing is done for this cycle.
- Processing the cycle (i.e. when RDY is not a problem) involves reading the bus (if the previous cycle signalled a read), performing the cycle's actions, possibly putting a value on the bus (if the current cycle is going to begin a write), and programming the address line with an address and either a read or a write.
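A Python sketch of the CPU-side RDY check listed above, where only read cycles can be held. `CpuFrontEnd` and its queue of `(R/W, address)` cycles are hypothetical stand-ins for an emulator's per-cycle core; the sketch shows only the RDY decision, not the data values on the bus.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Cycle = Tuple[str, int]  # ("R" or "W", address)


@dataclass
class CpuFrontEnd:
    """Hypothetical per-cycle CPU driver that honors RDY on read cycles only."""

    pending: List[Cycle]                          # remaining bus cycles to perform
    done: List[Cycle] = field(default_factory=list)
    halted_cycles: int = 0

    def cycle(self, rdy_low: bool) -> None:
        if not self.pending:
            return
        kind, _address = self.pending[0]
        if kind == "R" and rdy_low:
            # Read cycle while RDY is low: do nothing; retry the same read next cycle.
            self.halted_cycles += 1
            return
        # Write cycles, and reads with RDY high, are processed normally.
        self.done.append(self.pending.pop(0))


cpu = CpuFrontEnd(pending=[("W", 0x0100), ("R", 0x8000)])
for rdy in (True, True, False):
    cpu.cycle(rdy_low=rdy)
```

Here the write goes through even though RDY is already low, and the following read waits one cycle.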
Even without testing, I already know that this does not work properly when the DMA or DMC has interrupted processing and put a different address on the bus than the CPU expects, between the end of the previous cycle and the beginning of the current one.
I would appreciate any and all corrections to these workflows.
Personally, I'd prefer the following format for such descriptions...
Code:
Write instructions (STA, STX, STY, SAX)
  #  address R/W description
 --- ------- --- ------------------------------------------
  1    PC     R  fetch opcode, increment PC
  2    PC     R  fetch low byte of address, increment PC
  3    PC     R  fetch high byte of address, increment PC
  4  address  W  write register to effective address
Anyway...
Quote:
- On the cycle when $4014 is written to, the value on the bus is read and captured, but nothing else happens. (cycle 0)
AFAIK, the value on the bus should be the one from the last fetch (the high byte of the address); that part is quite vague.
Quote:
- On the next cycle after that, RDY is signalled and one wait cycle is performed. (cycle 1)
Well, yes, I was able to confirm this a few months ago, while debugging my sprite DMA code: the extra cycle is at the beginning, not at the end.
For clarity, Zepper, my post was not an attempt to answer your questions. I was asking more questions from an alternative perspective. Sorry about the confusion.
Bisqwit wrote:
For clarity, Zepper, my post was not an attempt to answer your questions. I was asking more questions from an alternative perspective. Sorry about the confusion.
Yes, that's ok.
SPR DMA should take 513 cycles if it starts on an even cycle, 514 if it starts on an odd cycle.
Remember that SPR DMA only takes place once the instruction has finished executing, not at the write cycle of the instruction! You can check this behavior by using an RMW instruction on $4014: it will do only one sprite DMA, at the end of the instruction.
I'm able to pass irq_and_dma, as well as the SPR and DMC DMA tests, using this implementation.
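The timing rule stated above (DMA deferred to the end of the instruction, 513 or 514 cycles depending on parity) can be written as a small Python helper. `sprite_dma_after` is a hypothetical name; it assumes the DMA begins on the cycle after the writing instruction's last cycle and uses the parity of that start cycle for the 513/514 choice, which is one reading of the description rather than a confirmed hardware detail.

```python
from typing import List, Tuple


def sprite_dma_after(write_cycles: List[int], instruction_end: int) -> List[Tuple[int, int]]:
    """Return (start_cycle, length) for each sprite DMA caused by one instruction.

    write_cycles: CPU cycles on which the instruction wrote to $4014.
    instruction_end: the instruction's last cycle.
    However many writes there were, at most one DMA is scheduled, and it
    starts only after the instruction has finished.
    """
    if not write_cycles:
        return []
    start = instruction_end + 1
    length = 513 + (start & 1)  # assumed: one extra alignment cycle on odd starts
    return [(start, length)]


# STA $4014: a single write on the last cycle of the instruction
sta = sprite_dma_after([1003], instruction_end=1003)
# An RMW instruction on $4014: two writes, still only one DMA
rmw = sprite_dma_after([1004, 1005], instruction_end=1005)
```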
But which page will get copied out? For example, if the program does ROL $4014 ($2E $14 $40), the CPU reads $40 from open bus, writes back $40, and then writes back $80 or $81 depending on carry. Does it end up copying $4000-$40FF or $8000-$80FF/$8100-$81FF? The difference can still be discerned with sprite 0.
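To make the question concrete, here is a Python sketch of the bus cycles of ROL $4014, following the usual 6502 read-modify-write pattern (read, write back the unmodified value, write the result). The helper name `rol_abs_trace` is hypothetical, and it assumes, as stated above, that reading the write-only $4014 returns open bus, i.e. the high address byte just fetched.

```python
from typing import List, Tuple


def rol_abs_trace(pc: int, lo: int, hi: int, carry_in: int) -> List[Tuple[str, int, int]]:
    """(R/W, address, data) for each cycle of ROL absolute ($2E lo hi).

    Assumes the target is write-only, so the read returns open bus, which
    still holds the high address byte fetched on cycle 3.
    """
    addr = (hi << 8) | lo
    old = hi                                      # open bus value
    new = ((old << 1) | (carry_in & 1)) & 0xFF    # rotate left through carry
    return [
        ("R", pc, 0x2E),        # 1: fetch opcode
        ("R", pc + 1, lo),      # 2: fetch low byte of address
        ("R", pc + 2, hi),      # 3: fetch high byte of address
        ("R", addr, old),       # 4: read effective address (open bus)
        ("W", addr, old),       # 5: write the unmodified value back
        ("W", addr, new),       # 6: write the rotated value
    ]


t0 = rol_abs_trace(0x8000, 0x14, 0x40, carry_in=0)
t1 = rol_abs_trace(0x8000, 0x14, 0x40, carry_in=1)
```

Both write cycles hit $4014, which is why the answer depends on which of the two writes (if not both) the DMA logic acts on.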
tepples wrote:
But which page will get copied out? For example, if the program does ROL $4014 ($2E $14 $40), the CPU reads $40 from open bus, writes back $40, and then writes back $80 or $81 depending on carry.
Probably only "STA $4014" is treated as the "standard" way to trigger sprite DMA. Your example is perfectly possible, but non-standard... and yes, good point.
Quote:
Does it end up copying $4000-$40FF or $8000-$80FF/$8100-$81FF? The difference can still be discerned with sprite 0.
Could it be tested on a PowerPak?
I actually came to the forums to ask this very question! I was wondering what would happen if you executed a RMW instruction on the $4014 register.
I would think that there are only 2 possible cases:
case 1 - The PPU ignores the second of two back-to-back writes, so the page at $4000-$40FF is copied (garbage is displayed on the screen?). 513 + n cycles are consumed.
case 2 - The PPU doesn't ignore two writes in a row, so 2 DMA transfers take place, the first from $4000, the second from $8000 or $8100 depending on carry. 513 + 513 + n cycles are consumed.
This is obviously a very specific quirk of the NES hardware; no commercial game I've encountered does this (though I haven't searched the disassembly of every commercial ROM). But it would be very interesting to know the answer.
Just when I thought I was being clever, you guys are already discussing it!
STA $4014 seems to be just a bunch of STA $2004 writes stacked one after another, without the overhead of fetching opcodes.
It seems logical to me (but again, this is just speculation) that the CPU stores the value written to $4014 in a temporary register. So when you use ROL $4014, you write to that same register twice, and only the last written value is used at the end of the ROL instruction, i.e. at the beginning of the sprite DMA transfer.
Because of that, I believe that when using ROL $4014, the CPU will copy either $80 or $81, depending on the C flag.
However, it might be that the temporary register cannot be written on two consecutive cycles (this happens, for example, with the MMC1 mapper). In that case, $40 would be used.
A test would be nice, though.
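The two hypotheses in this post differ only in which write to $4014 is kept. A Python sketch, where `dma_page` is a hypothetical helper and `ignore_consecutive=True` imitates the MMC1-style rule of dropping a write that lands on the cycle right after another write; neither behavior is confirmed for the NES CPU here.

```python
from typing import List, Optional, Tuple


def dma_page(writes: List[Tuple[int, int]], ignore_consecutive: bool) -> Optional[int]:
    """Page latched for the DMA under the two hypotheses above.

    writes: (cycle, value) pairs for $4014 within one instruction, in order.
    ignore_consecutive=False: a temporary register simply keeps the last write.
    ignore_consecutive=True: a write on the cycle right after another write is
    dropped (hypothetical MMC1-like behavior).
    """
    latched = None
    previous = None
    for cycle, value in writes:
        back_to_back = previous is not None and cycle == previous + 1
        previous = cycle
        if ignore_consecutive and back_to_back:
            continue  # dropped, like MMC1 ignoring consecutive-cycle writes
        latched = value
    return latched


# ROL $4014 with carry set: $40 on cycle 5, then $81 on cycle 6
rol_writes = [(5, 0x40), (6, 0x81)]
```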
Edit: This has been discussed before:
http://nesdev.com/bbs/viewtopic.php?p=16685
crudelios wrote:
Because of that, I believe that when using ROL $4014, the CPU will copy either $80 or $81, depending on the C flag.
I thought about this more. The ROL would write $40 to $4014, and the PPU would then stall the CPU by stealing the bus, either immediately or soon afterwards, depending on how quickly it initiates the transfer. That's why I thought 2 transfers seemed logical, but I guess according to blargg only one transfer happens.
Not to question blargg, but that doesn't seem right unless Ricoh was aware of this and designed the PPU to acknowledge only the first of multiple sequential writes. It would be nice to have a thorough test of this on a real NES, though.
Until then, blargg is definitely a reliable source.
Well, the DMA can't start until the previous instruction finishes, because the 6502 does everything on instruction boundaries. So I think it follows that this would preclude multiple transfers...
As I understand it, RDY can be pulled low to pause any read, but not writes. That's probably why the DMC DMA waits 4 cycles instead of 2: to make sure that the three consecutive writes of a BRK, /IRQ, or /NMI sequence have finished.
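Under the assumption stated above (RDY halts reads but not writes), the delay between RDY going low and the CPU actually stopping depends on how many writes remain in a row. A Python sketch; `cycles_until_halt` is hypothetical, and BRK's cycle pattern (two reads, three stack writes, two vector reads) is used as the example.

```python
def cycles_until_halt(pattern: str, rdy_index: int) -> int:
    """Cycles the CPU keeps running after RDY goes low.

    pattern: one "R" or "W" per cycle of the instruction, e.g. "RRWWWRR" for BRK.
    rdy_index: 0-based index of the first cycle with RDY low.
    Assumes only read cycles can be halted.
    """
    running = 0
    for kind in pattern[rdy_index:]:
        if kind == "R":
            return running  # first read: the CPU stops here
        running += 1        # writes go through regardless of RDY
    return running          # the next instruction begins with a read (opcode fetch)


BRK = "RRWWWRR"
worst_case = max(cycles_until_halt(BRK, i) for i in range(len(BRK)))
```

This only shows that, for BRK, at most three cycles can pass before the halt; whether that is the actual reason for the DMC's timing remains the conjecture made above.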