CPU timing precision

Re: FineX, FineY and zero hit
by Fumarumota on 2015-12-11 (#160514)

zeroone wrote:

aLaix wrote:

Now the thing is that if I use 2 ticks of delay, Bart vs. the Space Mutants shakes the status bar, but Battletoads passes the second level.
If I use 3 ticks of delay, Bart won't shake, but Battletoads will hang.
Both values pass all of blargg's tests.

Welcome to timing hell. Describe your CPU design and how the PPU is kept in sync with the CPU.

Hi guys, sorry to jump in...

Our CPU is designed to be cycle accurate. It operates the following way:

- It has a "Run" method that is called from main. It runs the amount of cycles worth 1 "frame" at NTSC clock speed (No PAL support yet), any cycles remaining are returned and run in the next frame.
- It emulates page crosses and dummy reads/writes. Matches Nintendulator logs for nestest.nes and passes all Blargg instruction tests.
- Every CPU cycle is either a read or a write cycle, calling its appropriate handler every time, as described in this doc: http://nesdev.com/6502_cpu.txt.
- Inside the CPU read / write handlers, the PPU runs for 3 cycles (3 dots). PPU runs first, then CPU read / write is handled. This is: 3 PPU cycles, then 1 CPU cycle.
- Interrupts are polled after the second-last cycle of every instruction, except for branches, which poll interrupts as described in the wiki page.
- It passes the "CLI latency" test.
- During sprite DMA, the CPU is suspended for 513 or 514 cycles, but the PPU continues running 3 cycles for each CPU cycle.
- We fail "NMI on timing" test by one cycle. Every other PPU timing test passes.
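
The access pattern in the list above (3 PPU dots first, then the CPU read or write) can be sketched roughly as follows. This is only an illustration: `ram`, `ppu_step`, `bus_tick`, `cpu_read` and `cpu_write` are hypothetical names, the memory map is a flat array with no mirroring or registers, and the PPU is reduced to a dot counter.

```c
#include <stdint.h>

/* Hypothetical names throughout; not taken from any emulator in this thread. */
static uint8_t  ram[0x10000];   /* flat 64 KiB space, no mirroring, no registers */
static uint64_t cpu_cycles;     /* CPU cycles elapsed */
static uint64_t ppu_dots;       /* PPU dots elapsed */

/* Stand-in for the real PPU: here one dot only advances a counter. */
void ppu_step(void) { ppu_dots++; }

/* NTSC: the PPU runs 3 dots first, then the CPU cycle is accounted for. */
void bus_tick(void)
{
    for (int i = 0; i < 3; i++)
        ppu_step();
    cpu_cycles++;
}

/* Every CPU cycle is exactly one of these two calls. */
uint8_t cpu_read(uint16_t addr)
{
    bus_tick();
    return ram[addr];
}

void cpu_write(uint16_t addr, uint8_t value)
{
    bus_tick();
    ram[addr] = value;
}
```

With this shape, the CPU core never has to count PPU time itself; calling `cpu_read` or `cpu_write` once per cycle is enough to keep both counters in step.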

Thanks in advance for your valuable help!

Re: FineX, FineY and zero hit
by Disch on 2015-12-11 (#160515)

Are you skipping the pre-render PPU cycle on odd frames?

Re: FineX, FineY and zero hit
by aLaix on 2015-12-12 (#160557)

Yes, and the ppu passes the odd/even tests.

Re: FineX, FineY and zero hit
by Zepper on 2015-12-13 (#160582)

Quote:

- It has a "Run" method that is called from main. It runs the amount of cycles worth 1 "frame" at NTSC clock speed (No PAL support yet), any cycles remaining are returned and run in the next frame.

My emu (RockNES) runs for 1 full instruction instead of a "frame".

Quote:

- It emulates page crosses and dummy reads/writes. Matches Nintendulator logs for nestest.nes and passes all Blargg instruction tests.
- Every CPU cycle is either a read or a write cycle, calling its appropriate handler every time, as described in this doc: http://nesdev.com/6502_cpu.txt.
- Inside the CPU read / write handlers, the PPU runs for 3 cycles (3 dots). PPU runs first, then CPU read / write is handled. This is: 3 PPU cycles, then 1 CPU cycle.
- Interrupts are polled after the second-last cycle of every instruction, except for branches, which poll interrupts as described in the wiki page.
- It passes the "CLI latency" test.

Pretty much like I do! Polling interrupts after the second-last cycle doesn't matter AFAIK.

Quote:

- During sprite DMA, the CPU is suspended for 513 or 514 cycles, but the PPU continues running 3 cycles for each CPU cycle.

It's not that easy. You have the DMC triggering IRQs/fetching a sample during a sprite DMA, plus IRQ/NMI triggering on sprite DMA too.

Quote:

- We fail "NMI on timing" test by one cycle. Every other PPU timing test passes.

I believe this can be fixed by ignoring NMIs ($2000 write) at PPU cycle 341 (or zero, depending on how you count).

Re: FineX, FineY and zero hit
by aLaix on 2015-12-14 (#160674)

Hello guys, thank you for all the support so far!
We were checking Battletoads and it does hang sporadically, whatever delay value we use. So we will spend some time debugging it (in fact, we will program the save state module, make the debugger capable of debugging this, and check everything, CPU and PPU). :wink:

We will let you know the discoveries from this.

@Zepper

Quote:

My emu (RockNES) runs for 1 full instruction instead of a "frame".

We run a "frame" because we run all the instructions necessary for a frame (including the corresponding ppu cycles) and then we check how much time it spent, then we sleep the remaining time in order to wait for the next frame.
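
A rough sketch of the per-frame cycle budget described above, with the leftover carried into the next frame. The names `run_frame`, `step_fn` and `seven_cycle_step` are made up for illustration, the frame length is a rounded assumption (341 * 262 dots / 3 is about 29780.67 CPU cycles), and the "sleep until the next frame" part is left out.

```c
#include <stdint.h>

/* Assumption: NTSC frame length in CPU cycles, rounded up. */
#define CYCLES_PER_FRAME 29781

/* Hypothetical hook: executes one whole instruction and returns
   how many CPU cycles it took. */
typedef int (*step_fn)(void);

/* Example stub standing in for a real CPU core: every instruction takes 7 cycles. */
int seven_cycle_step(void) { return 7; }

/* Runs whole instructions until the frame budget is used up.
   `carry` is the overshoot returned by the previous call.
   Returns this frame's overshoot (0 or more cycles). */
int run_frame(step_fn step, int carry)
{
    int budget = CYCLES_PER_FRAME - carry;
    while (budget > 0)
        budget -= step();
    return -budget;
}
```

The caller would time each `run_frame` call and sleep for whatever is left of the frame period, as described in the post.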

Quote:

It's not that easy. You have the DMC triggering IRQs/fetching a sample during a sprite DMA, plus IRQ/NMI triggering on sprite DMA too.

When we run a CPU cycle or a PPU cycle, we take into consideration their corresponding IRQ/NMI polling/triggering.

Quote:

I believe this can be fixed by ignoring NMIs ($2000 write) at PPU cycle 341 (or zero, depending on how you count).

Is this a hack? or a NES behaviour? :shock:

Re: FineX, FineY and zero hit
by Zepper on 2015-12-14 (#160678)

No hack. It's been a long time since I fixed it. See test ROM "supression.s", if I'm not mistaken.

Code:

; Tests behavior when $2002 is read near time
; VBL flag is set.
;
; Reads $2002 one PPU clock later each time.
; Prints whether VBL flag read back as set, and
; whether NMI occurred.
;
; 00 - N
; 01 - N
; 02 - N
; 03 - N        ; normal behavior
; 04 - -        ; flag never set, no NMI
; 05 V -        ; flag read back as set, but no NMI
; 06 V -
; 07 V N        ; normal behavior
; 08 V N
; 09 V N
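
The table above can be restated as a small lookup, sketched here under a few assumptions: NMI is enabled in $2000, row 05 is the read that lands on the same PPU clock as the flag being set (so `offset = row - 5`), and the type and function names are invented for this example.

```c
#include <stdbool.h>

typedef struct {
    bool flag_read_as_set;  /* the "V" column */
    bool nmi_occurs;        /* the "N" column */
} vbl_read_result;

/* offset: PPU clocks from the clock that sets the VBL flag to the $2002 read.
   Negative means the read happens before the flag would be set. */
vbl_read_result read_2002_near_vbl(int offset)
{
    vbl_read_result r;
    if (offset < -1) {            /* rows 00-03: early read, NMI still fires */
        r.flag_read_as_set = false;
        r.nmi_occurs = true;
    } else if (offset == -1) {    /* row 04: flag never set, no NMI */
        r.flag_read_as_set = false;
        r.nmi_occurs = false;
    } else if (offset <= 1) {     /* rows 05-06: flag seen, NMI suppressed */
        r.flag_read_as_set = true;
        r.nmi_occurs = false;
    } else {                      /* rows 07 and later: normal behavior */
        r.flag_read_as_set = true;
        r.nmi_occurs = true;
    }
    return r;
}
```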

Re: FineX, FineY and zero hit
by zeroone on 2015-12-14 (#160682)

@Zepper

Does RockNES ever hang on Battletoads stage 2?

Re: FineX, FineY and zero hit
by Zepper on 2015-12-15 (#160695)

Yes.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-15 (#160705)

See the bottom of this thread.

The most recent Nintendulator beta (as of the time of that thread) has the same issue with Bart vs the Space Mutants and Battletoads stage 2. However, the latest official release of Nintendulator works properly. The PPU timing in the beta version closely mimics the timing tables in the wikis, yet it has the aforementioned issues. On the other hand, the official release fudges the timing slightly and the games work as a consequence.

Since we do not have access to CPU timing tables more granular than microcodes, the CPU must be advanced by an entire CPU cycle at a time. This means that the CPU and PPU will always be out of sync by 3 PPU cycles for NTSC and up to 4 PPU cycles for PAL. The PPU timings in the official release of Nintendulator appear to be shifted by 4 PPU cycles to compensate. It's a hack, but it may be a necessary one.
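
For reference, the "3 for NTSC, up to 4 for PAL" figure comes from the dot-to-cycle ratio: 3 PPU dots per CPU cycle on NTSC and 16/5 = 3.2 on PAL. One possible way to hand out whole dots per CPU cycle is a fractional accumulator; the struct and function names below are assumptions for this sketch, not anything from Nintendulator.

```c
#include <stdint.h>

/* PPU dots per CPU cycle as a fraction: NTSC is 3/1, PAL is 16/5. */
typedef struct {
    uint32_t num;
    uint32_t den;
    uint32_t acc;   /* remainder carried between CPU cycles, always < den */
} dot_ratio;

/* Returns how many whole PPU dots to run for the next CPU cycle. */
uint32_t dots_for_next_cpu_cycle(dot_ratio *r)
{
    r->acc += r->num;
    uint32_t dots = r->acc / r->den;
    r->acc %= r->den;
    return dots;
}
```

Starting from `{16, 5, 0}` this yields 3, 3, 3, 3, 4 and then repeats, so the PPU runs 16 dots for every 5 CPU cycles; with `{3, 1, 0}` it always yields 3.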

Re: FineX, FineY and zero hit
by lidnariq on 2015-12-15 (#160708)

zeroone wrote:

Since we do not have access to CPU timing tables more granular than microcodes, the CPU must be advanced by an entire CPU cycle at a time.

... We don't?

I thought that the Visual{6502/2A03/2C02} tools let us break everything down into what's happening on a master clock by master clock basis... Or at least 6502 φ2 high/low half clocks.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-15 (#160727)

lidnariq wrote:

... We don't?

I thought that the Visual{6502/2A03/2C02} tools let us break everything down into what's happening on a master clock by master clock basis... Or at least 6502 φ2 high/low half clocks.

Sub-microcode steps could be obtained from a transistor level simulation of the 2A03. But, no one has done the legwork yet, producing a document analogous to this one. Lists with 3x the number of steps would work for NTSC. For PAL, it could still occasionally execute 2 PPU cycles for one sub-microcode step of the CPU to maintain the proper timing ratio.

Until this is demonstrated, who knows if it would even solve the problems. The CPU and PPU would still be out of sync by 1 PPU cycle for NTSC and up to 2 PPU cycles for PAL. Can a transistor level simulation of the 2A03 run in real-time yet? Maybe that's the ultimate way to go.

Re: FineX, FineY and zero hit
by Zepper on 2015-12-16 (#160754)

I don't know what this means in software terms, but put simply, are you talking about "fine-tuning" the CPU/PPU time sync?

Re: FineX, FineY and zero hit
by zeroone on 2015-12-16 (#160760)

Zepper wrote:

I don't know what this means in software terms, but put simply, are you talking about "fine-tuning" the CPU/PPU time sync?

Does RockNES advance the CPU by 1 microcode at a time or by 1 full instruction at a time?

Re: FineX, FineY and zero hit
by aLaix on 2015-12-16 (#160788)

@Zepper
Sorry for the misunderstanding, we pass suppression test, what we are not passing is 07-NMI_on_timing test.

Wow! I wasn't expecting this! :shock:

Very interesting...
What about Nestopia? I tested both Bart and Battletoads and both work as expected.

Re: FineX, FineY and zero hit
by Zepper on 2015-12-17 (#160827)

aLaix wrote:

@Zepper
Sorry for the misunderstanding, we pass suppression test, what we are not passing is 07-NMI_on_timing test.

Oh, okay. Do what I described and tell me the result.

zeroone wrote:

Zepper wrote:

I don't know what this means in software terms, but put simply, are you talking about "fine-tuning" the CPU/PPU time sync?

Does RockNES advance the CPU by 1 microcode at a time or by 1 full instruction at a time?

Runs 1 instruction per "loop". For each CPU cycle, the PPU runs for 3 cycles before the current CPU cycle.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-17 (#160831)

Zepper wrote:

Runs 1 instruction per "loop". For each CPU cycle, the PPU runs for 3 cycles before the current CPU cycle.

For such a loop, the PPU can be out of sync with the CPU by up to 21 PPU cycles (some instructions take up to 7 CPU cycles, and for NTSC the ratio is 3 PPU cycles per CPU cycle). Since it's a loop, neither really runs before the other; they just alternate. Any apparent order has no effect on the size of that gap.

In the real hardware, the PPU and CPU run in parallel. Meaning, there is no gap at all. And, a zero gap is achievable with a transistor-level simulation of the 2A03. But, it's not practical for emulation.

For emulation, the size of the gap can be shrunk considerably by executing microcodes in the loop instead of full instructions. Each microcode takes 1 CPU cycle, resulting in a gap no larger than 3 PPU cycles.

However, that gap is apparently still too large for some games to work properly. Luckily, that can be compensated by a hack. In the PPU timing diagram, all the dot times can be shifted over by 3 PPU cycles (or by 4 PPU cycles to handle PAL as well).

Re: FineX, FineY and zero hit
by Zepper on 2015-12-18 (#160842)

zeroone wrote:

Since it's a loop, neither really runs before the other; they just alternate.

Yeah.

Quote:

Any apparent order has no effect on the size of that gap.

In emulation terms, yes, it makes a big difference! Just think about running the PPU clocking before reading $200x or after reading $200x.

Quote:

In the ~~real~~ hardware, the PPU and CPU run in parallel. Meaning, there is no gap at all. And, a zero gap is achievable with a transistor-level simulation of the 2A03. But, it's not practical for emulation.

I got what you mean... but still... Unable to find a way of doing such thing.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-18 (#160860)

Zepper wrote:

In emulation terms, yes, it makes a big difference! Just think about running the PPU clocking before reading $200x or after reading $200x.

Think about it from the point of view of the program being executed by the CPU. For instance, a sprite 0 hit does not interrupt the processor. Instead, the program has to poll the PPU Status Register ($2002) in a loop. If the PPU always runs ahead of the CPU, then the program loop will occasionally break out earlier in emulation than it would in the actual hardware. And, some games are very sensitive to timing variations like that, such as The Simpsons: Bart vs. the Space Mutants. When the timing is slightly off, the status bar shakes. Running the PPU ahead or behind the CPU doesn't solve the issue because the discrepancy still causes variations in program loop iteration counts.

Zepper wrote:

Unable to find a way of doing such thing.

Unfortunately, the first step is to rewrite your CPU. But, it's actually not that difficult to transform the implementation if it's already working on the instruction level.

This link contains lists of microcodes executed by each instruction. Note that every single microcode reads from memory or writes to memory (the R/W column). Consequently, the code that emulates each microcode will have to call some common read() and write() functions. Within those functions, advance the PPU by 3 PPU cycles (for NTSC). The PPU will still be running ahead of the CPU, but only by a maximum of 3 PPU cycles, as opposed to the up to 21 PPU cycles in your current model.

This will actually solve other timing issues as well. For instance, interrupts essentially happen in between executing instructions. But, DMA can occur in the middle of an executing instruction. And, its behavior is modified based on the type of microcode (i.e. whether it is a read or write cycle).
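
As a rough illustration of what "one read() or write() call per microcode" looks like, here is INC with absolute addressing (opcode $EE, 6 cycles) written out step by step. The helper names `rd`, `wr` and `exec_inc_absolute` are hypothetical, flag updates are left out, and memory is a flat array.

```c
#include <stdint.h>

static uint8_t  mem[0x10000];
static uint64_t cpu_cycles, ppu_dots;
static uint16_t pc;

/* Common bus helpers: each call is one CPU cycle and 3 PPU dots (NTSC). */
static uint8_t rd(uint16_t a)            { ppu_dots += 3; cpu_cycles++; return mem[a]; }
static void    wr(uint16_t a, uint8_t v) { ppu_dots += 3; cpu_cycles++; mem[a] = v; }

/* INC abs: six cycles, one bus access per cycle. N/Z flags omitted. */
void exec_inc_absolute(void)
{
    (void)rd(pc++);                         /* 1 R: fetch opcode */
    uint16_t addr = rd(pc++);               /* 2 R: fetch low byte of address */
    addr |= (uint16_t)(rd(pc++) << 8);      /* 3 R: fetch high byte of address */
    uint8_t value = rd(addr);               /* 4 R: read from effective address */
    wr(addr, value);                        /* 5 W: write the old value back */
    wr(addr, (uint8_t)(value + 1));         /* 6 W: write the new value */
}
```

Because the PPU advances inside `rd` and `wr`, it is never more than 3 dots away from the CPU at any bus access, even though the function still executes a whole instruction per call.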

Re: FineX, FineY and zero hit
by Fumarumota on 2015-12-18 (#160862)

Zepper wrote:

aLaix wrote:

@Zepper
Sorry for the misunderstanding, we pass suppression test, what we are not passing is 07-NMI_on_timing test.

Oh, okay. Do what I described and tell me the result.

@Zepper

We were able to pass the test by suppressing the NMI when rendering is turned on close to the end of vblank. (i.e. first two cycles of scanline 261).

zeroone wrote:

This will actually solve other timing issues as well. For instance, interrupts essentially happen in between executing instructions. But, DMA can occur in the middle of an executing instruction. And, its behavior is modified based on the type of microcode (i.e. whether it is a read or write cycle).

@zeroone
- I remember reading somewhere in this forum that sprite DMA can only take place between instructions just like interrupts.
- By "microcode" you mean each cycle of an instruction, as in the http://nesdev.com/6502_cpu.txt document? If so, our CPU already executes instructions that way, doing every single memory access like that, it's just that the loop allows the current instruction to finish (i.e. we don't return from the loop in the middle of an instruction). On a write to $4014, we turn a "dmaPending" flag on, and let the DMA happen after the instruction finishes. Could that be causing sync problems between the CPU and the PPU?

Thanks for all the insight in this topic guys.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-18 (#160867)

Fumarumota wrote:

I remember reading somewhere in this forum that sprite DMA can only take place between instructions just like interrupts.

See Likely internal implementation of the read.

Per that link, the RDY pin "causes the CPU to pause during the next read cycle". My interpretation of this is that the processor can be suspended mid-instruction. And, it keeps RDY low for 4 CPU cycles because the longest contiguous sequence of write cycles for any instruction (or interrupt) is length 3. So, the processor is suspended for 1 to 4 cycles.
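
The "pause on the next read cycle" interpretation above can be expressed as a small search over an instruction's R/W pattern. This is a sketch of that interpretation only; `halt_cycle` and `brk_cycles` are names invented here. The BRK pattern (two reads, three pushes, two vector reads) is used as the example because it contains the longest run of writes.

```c
#include <stdbool.h>
#include <stddef.h>

/* R/W pattern of BRK, one entry per CPU cycle (true = write cycle):
   fetch opcode, read next byte, push PCH, push PCL, push P,
   fetch vector low, fetch vector high. */
const bool brk_cycles[7] = { false, false, true, true, true, false, false };

/* Returns the index of the first read cycle at or after `request`,
   i.e. the cycle on which the CPU would actually stop,
   or n if no read cycle is left in the pattern. */
size_t halt_cycle(const bool *is_write, size_t n, size_t request)
{
    size_t i = request;
    while (i < n && is_write[i])
        i++;
    return i;
}
```

A request arriving on cycle index 2 (the first push) is only honoured on index 5, the fourth cycle counted from the request, which matches the "1 to 4 cycles" range given above.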

Fumarumota wrote:

By "microcode" you mean each cycle of an instruction, as in the http://nesdev.com/6502_cpu.txt document?

Yes. I am referring to each of those steps as a microcode instruction.

But, if you want to nitpick, the 6502 does not technically use microcode. It uses a state machine in combination with a programmable logic array. It's the poor man's version of microcode. If you know a better alternative name, I'll adopt it for this discussion.

Fumarumota wrote:

If so, our CPU already executes instructions that way, doing every single memory access like that, it's just that the loop allows the current instruction to finish (i.e. we don't return from the loop in the middle of an instruction).

That sounds perfect. There is no need to return mid-instruction.

Fumarumota wrote:

On a write to $4014, we turn a "dmaPending" flag on, and let the DMA happen after the instruction finishes. Could that be causing sync problems between the CPU and the PPU?

You do not need to return mid-instruction, but from my understanding, you need to handle DMA mid-instruction. And, you can handle this in the common read() function that I mentioned earlier in this thread. Meaning, if there is a DMA request, as soon as a read cycle is encountered, the processor will be suspended. I.e. the read() function will do 3 things: 1) handle DMA if need be, 2) update the PPU by at least 3 PPU cycles and 3) return a value from memory.
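
A minimal sketch of a read() that does those three things, assuming a single pending-DMA flag and a stall length decided elsewhere. All names here (`dma_pending`, `dma_stall_cycles`, `tick`) are hypothetical, and the DMA's own memory fetch is not modelled, only the time it takes.

```c
#include <stdbool.h>
#include <stdint.h>

static uint8_t  mem[0x10000];
static uint64_t cpu_cycles, ppu_dots;
static bool     dma_pending;        /* set elsewhere, e.g. by the sound unit */
static int      dma_stall_cycles;   /* how long this DMA holds the CPU */

/* One CPU cycle of elapsed time: 3 PPU dots on NTSC. */
static void tick(void) { ppu_dots += 3; cpu_cycles++; }

uint8_t cpu_read(uint16_t addr)
{
    /* 1) A pending DMA suspends the CPU as soon as a read cycle comes up. */
    if (dma_pending) {
        dma_pending = false;
        for (int i = 0; i < dma_stall_cycles; i++)
            tick();                 /* the PPU keeps running while the CPU is held */
    }
    /* 2) Advance the PPU for this CPU cycle. */
    tick();
    /* 3) Perform the access and return the value. */
    return mem[addr];
}

void cpu_write(uint16_t addr, uint8_t value)
{
    tick();                         /* writes go through; the DMA stays pending */
    mem[addr] = value;
}
```

Since the stall is handled inside `cpu_read`, the instruction code itself does not need to know that it was paused halfway through.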

Re: FineX, FineY and zero hit
by tepples on 2015-12-18 (#160879)

zeroone wrote:

But, if you want to nitpick, the 6502 does not technically use microcode. It uses a state machine in combination with a programmable logic array. It's the poor man's version of microcode. If you know a better alternative name, I'll adopt it for this discussion.

It's microcode. Visual 6502 refers to it as a decode ROM. It's just incompletely decoded to improve compression (130 words vs. 256).

Re: FineX, FineY and zero hit
by zeroone on 2015-12-18 (#160884)

tepples wrote:

It's microcode. Visual 6502 refers to it as a decode ROM. It's just incompletely decoded to improve compression (130 words vs. 256).

It's almost microcode. Here are some further details about what's on that PLA.

For discussions on this forum, I'm fine with the term "microcode", even though technically, maybe it's not officially microcode.

Re: FineX, FineY and zero hit
by Fumarumota on 2015-12-18 (#160888)

zeroone wrote:

Quite an interesting read that was. I will definitely take it into account when we implement DMC.

zeroone wrote:

Didn't mean to nitpick or anything, microcode or anything else is just fine

zeroone wrote:

As I understand it, this is how an accurate DMC DMA would work, right? What about Sprite (OAM) DMA?

Here's how I do it currently:

- After $4014 write cycle:
* Run 3 PPU cycles, Account 1 CPU idle cycle.
* If write was on odd CPU cycle: Run PPU 3 cycles, account another CPU cycle.
* Perform OAM initialization by reading ram[valueWrittenIn4014 * 0x100] and writing it in PPU OAM_DATA via $2004 register (These account 512 CPU cycles, running 3 PPU cycles before each CPU access cycle).

All this happens after the instruction causing the write finishes.

Please forgive me if this is a bit off topic on the Zero Hit stuff, but we want to find out everything that could be causing a zero hit miss (hanging some games like Battletoads or shaking Bart's status bar).

Re: FineX, FineY and zero hit
by Zepper on 2015-12-19 (#160903)

@zeroone
I've spent a lot of time debugging and tracing all those test ROMs provided by blargg... trying to understand some deep mechanics, cycle by cycle, huge logs, analyzing precisions of 1 PPU cycle + or - while reading/writing. You can't say "running up to 7 PPU cycles" of PPU/CPU out of sync because they simply are NOT. Most of these test ROMs perform specific reads that require precision of 1 cycle (as I said before), so I really dunno about it. My emu passes all those tests, with NO hacks. Sorry. :cry:

While the idea of breaking the CPU time (or cycle/clock/whatever?) into a smaller piece of time is somewhat interesting, I have to disagree. Otherwise, _Q would have written something.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-19 (#160911)

Zepper wrote:

While the idea of breaking the CPU time (or cycle/clock/whatever?) into a smaller piece of time is somewhat interesting, I have to disagree. Otherwise, _Q would have written something.

Treat the Simpsons ROM as an additional test. Use logging and figure out why the status bar shakes. RockNES may be perfectly tuned to beat all of Blargg's tests, but its accuracy can still be increased.

Breaking down each instruction into smaller executable pieces is certainly not a hack. And, the techniques that I mentioned above are used in Nintendulator.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-19 (#160922)

Fumarumota wrote:

What about Sprite (OAM) DMA?

Here's how I do it currently:

- After $4014 write cycle:
* Run 3 PPU cycles, Account 1 CPU idle cycle.
* If write was on odd CPU cycle: Run PPU 3 cycles, account another CPU cycle.
* Perform OAM initialization by reading ram[valueWrittenIn4014 * 0x100] and writing it in PPU OAM_DATA via $2004 register (These account 512 CPU cycles, running 3 PPU cycles before each CPU access cycle).

All this happens after the instruction causing the write finishes.

That sounds about right.

The processor will be suspended immediately after the $4014 write cycle (as opposed to waiting until the full instruction ends). Consequently, within the write() function discussed earlier, the code can do the memory transfer:

Code:

if (odd CPU cycle)
  read(PC)  
  
read(PC)

for (i = 0 to 255) 
  write($2004, read((value * 256) + i))    

Above, the read() and write() functions have the side-effect of running 3 PPU cycles for NTSC and 3 or 4 PPU cycles for PAL. And, those functions each count as a CPU cycle (i.e. they will increment a CPU cycle counter required for frame timing). Since the for-loop calls both write() and read(), it takes 512 CPU cycles. The prior 2 read() calls extend the length of the transfer to 513 or 514 CPU cycles.
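
For anyone who wants to check the cycle arithmetic, the pseudo-code above might be fleshed out along these lines. It is only a sketch: the names are made up, memory is a flat array, $2004 is reduced to a plain OAM write, and "odd CPU cycle" is taken to mean an odd cycle count right after the $4014 write, which is an assumed convention.

```c
#include <stdint.h>

static uint8_t  mem[0x10000];
static uint8_t  oam[256];
static uint8_t  oam_addr;           /* wraps naturally at 256 */
static uint16_t pc;
static uint64_t cpu_cycles, ppu_dots;

static void tick(void) { ppu_dots += 3; cpu_cycles++; }   /* NTSC ratio */
static void oam_dma(uint8_t page);

uint8_t cpu_read(uint16_t addr) { tick(); return mem[addr]; }

void cpu_write(uint16_t addr, uint8_t value)
{
    tick();
    if (addr == 0x2004)
        oam[oam_addr++] = value;    /* OAMDATA */
    else if (addr == 0x4014)
        oam_dma(value);             /* OAMDMA: transfer starts right after this write */
    else
        mem[addr] = value;
}

static void oam_dma(uint8_t page)
{
    if (cpu_cycles & 1)             /* assumed parity convention */
        (void)cpu_read(pc);         /* extra alignment cycle */
    (void)cpu_read(pc);             /* dummy cycle */
    for (int i = 0; i < 256; i++)   /* 256 read/write pairs = 512 cycles */
        cpu_write(0x2004, cpu_read((uint16_t)(page << 8 | i)));
}
```

Counting `cpu_cycles` before and after a `cpu_write(0x4014, page)` call gives 1 cycle for the write itself plus 513 or 514 for the transfer, depending on the parity at the time of the write.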

Further details can be found here.

Re: FineX, FineY and zero hit
by Disch on 2015-12-20 (#160974)

is read(pc) right? The wiki says those cycles are idle.

Re: FineX, FineY and zero hit
by Zepper on 2015-12-20 (#160975)

Disch wrote:

is read(pc) right? The wiki says those cycles are idle.

That code (the for() loop) shouldn't be taken as "correct".

Re: FineX, FineY and zero hit
by Disch on 2015-12-20 (#160976)

Zepper wrote:

That code (the for() loop) shouldn't be taken as "correct".

Why not? It looks correct to me.

Re: FineX, FineY and zero hit
by lidnariq on 2015-12-20 (#160994)

The 2A03 can't have an idle cycle. M2 always goes high, because it's derived from the master clock, and ignores whatever internal processing (including /RDY) is going on... every cycle is always a read or write cycle.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-20 (#161000)

lidnariq wrote:

Do you have any idea why there is an additional read if the transfer starts on an odd CPU cycle? Is there some sort of latch that toggles every CPU cycle, determining if there is an extra read?

Also, when I wrote that pseudo-code above, I wasn't sure what address it would be reading. I assumed it was like the start of an instruction fetch, reading from PC.

Re: FineX, FineY and zero hit
by lidnariq on 2015-12-20 (#161007)

The DMA units are kinda part of the sound hardware in the 2A03, and everything (except the triangle wave) in the sound unit runs at CPU clock ÷2, the DMA units reading on even cycles and writing on odd cycles.

So... "Yes".

Re: FineX, FineY and zero hit
by zeroone on 2015-12-20 (#161010)

lidnariq wrote:

Cool.

What about the address read during the idle cycles? Is it the PC address like I speculated above?

Re: FineX, FineY and zero hit
by Zepper on 2015-12-20 (#161024)

Disch wrote:

Zepper wrote:

That code (the for() loop) shouldn't be taken as "correct".

Why not? It looks correct to me.

Easier to fire up against my code? Go ahead.

Code:

//Sprite DMA
//----------
void cpu_spritedma(C_DWORD addr, C_BYTE flag)
{
   register DWORD offset = addr << 8;
   register const DWORD offset_lim = offset + 0xF8;
   register BYTE value;
   /* 3 if it lands on a CPU write,
      2 if it lands on the $4014 write or during OAM DMA, 
      1 if on the next-to-next-to-last DMA cycle, 
      3 if on the last DMA cycle.
     DMA transfer takes 513 cycles on even cycles
     and 514 on odd cycles. */
   // 513+1 cycles
   if(flag)
   {
      _readvalue(cpu->PC);
   }

   do {
      dmc_runfor(2);
      value = _readvalue(offset);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|1);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|2);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|3);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|4);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|5);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|6);
      _writevalue(0x2004,value);
      dmc_runfor(2);
      value = _readvalue(offset|7);
      _writevalue(0x2004,value);
      offset += 8;
   } while(offset < offset_lim);
    
   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(1);
   value = _readvalue(offset);
   _writevalue(0x2004,value);

   dmc_runfor(3);
   _readvalue(cpu->PC);
}

I've been working on this emulator for 17 years. A few things have slipped my mind, but I confirm there are NO hacks. It passes all test ROMs (now that someone says such stuff is obsolete).

Re: FineX, FineY and zero hit
by Disch on 2015-12-20 (#161026)

lidnariq wrote:

The wiki should be updated then, as 'idle' is clearly the wrong term there and is misleading:

http://wiki.nesdev.com/w/index.php/PPU_registers#OAMDMA

The wiki wrote:

The CPU is suspended during the transfer, which will take 513 or 514 cycles after the $4014 write tick. (1 idle cycle, +1 if on an odd CPU cycle, then 256 alternating read/write cycles.)

zeroone wrote:

What about the address read during the idle cycles? Is it the PC address like I speculated above?

This was my next question -- though I doubt it's the PC, as there'd be no reason to put the PC on the bus for the DMA. My completely blind guess would be that it's reading from $4014, since that'd be what is sitting on the address lines when DMA is triggered.

Quote:

Easier to fire up against my code? Go ahead.

I'm sure your code is fine and works great -- but why is zeroone's code wrong? I don't see any problems with it.

The only real difference I see is that you're embedding the DMC DMA logic inside the OAM DMA logic -- but I would probably want to keep those separate... which is what I assume zeroone is doing.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-20 (#161027)

Zepper also reads from PC for the idle cycles:

Code:

if(flag)
{
   _readvalue(cpu->PC);
}

...

_readvalue(cpu->PC);

Though, it's unclear why there is one at the top and another at the bottom of the function.

@Zepper I reviewed my code a bit. I put the DMC update logic inside the read() function itself. And, I did not unroll any loops like you did, which explains why my code appears more compact.

By the way, in your unrolled loop at the bottom, why are the final calls to dmc_runfor() made with parameters 1 and 3, as opposed to both being 2 like the rest of them (see below)? Is that a timing hack?

Code:

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(2);
   value = _readvalue(offset++);
   _writevalue(0x2004,value);

   dmc_runfor(1);
   value = _readvalue(offset);
   _writevalue(0x2004,value);

   dmc_runfor(3);
   _readvalue(cpu->PC);

Re: FineX, FineY and zero hit
by Disch on 2015-12-20 (#161029)

It's not a hack. DMC DMA takes a different length depending on where they land during the OAM DMA.

Though I thought the last few cycles were 1,2,3 ... not 2,1,3. Lemme find that page.

EDIT: Here's the post:

viewtopic.php?p=62690#p62690

blargg wrote:

DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle.

So:

last cycle = 3 cycles
next-to last cycle = 2 cycles
next-to-next-to-last cycle = 1 cycle

so the last 3 DMC runs in Zepper's code should be 1,2,3 -- not 2,1,3

Though the wiki makes no mention of this and just says it's 2 cycles throughout the entirety of OAM DMA. Which is correct?
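
To keep the cases straight, blargg's sentence can be written down as a lookup. This only restates the numbers from the quote; it does not settle which of them is right, and the enum and function names are invented for this sketch.

```c
/* Where the DMC DMA request lands, following the cases in blargg's quote. */
typedef enum {
    DMC_ON_CPU_READ,           /* the "normally" case */
    DMC_ON_CPU_WRITE,
    DMC_ON_4014_WRITE,
    DMC_DURING_OAM_DMA,        /* any OAM DMA cycle not listed below */
    DMC_OAM_THIRD_FROM_LAST,   /* next-to-next-to-last DMA cycle */
    DMC_OAM_LAST               /* last DMA cycle */
} dmc_dma_context;

/* Number of CPU cycles the DMC DMA adds in each case. */
int dmc_dma_cycles(dmc_dma_context where)
{
    switch (where) {
    case DMC_ON_CPU_WRITE:        return 3;
    case DMC_ON_4014_WRITE:       return 2;
    case DMC_DURING_OAM_DMA:      return 2;
    case DMC_OAM_THIRD_FROM_LAST: return 1;
    case DMC_OAM_LAST:            return 3;
    case DMC_ON_CPU_READ:
    default:                      return 4;
    }
}
```

Read this way, the next-to-last cycle falls under the generic "during OAM DMA" case, which is where the 1, 2, 3 ordering comes from.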

EDIT 2: I'm still not convinced reading from PC is correct for the dummy cycles. I guess it could be, but $4014 still makes more sense to me.

blargg's post suggests that the OAM DMA process starts on the 4014 write cycle (since that's in the '2 cycle' zone), and if that's true it would mean 4014 would still be on the address bus, and the DMA unit would have no reason to put the PC back on the bus.

So either:

1)
- CPU puts PC on address bus before DMA starts (possibly as part of pipelining for the next instruction)
- DMA cuts in, reads dummy value(s) from PC (since that's what's on the bus)
- DMA does its thing
- DMA restores PC on address bus so that normal CPU execution is uninterrupted
- normal CPU execution resumes with the next opcode read

or

2)
- DMA cuts in while 4014 is still on the bus, before PC is put back on the bus
- dummy values read from whatever is on the bus (which would be 4014)
- DMA does its thing
- DMA does not need to restore the PC because the PC was never put on the bus in the first place
- normal CPU execution resumes... and NOW the PC is placed on the bus for the next opcode read

or

3)
- something else

I just find scenario 2 much more likely than scenario 1.... but that's a total guess and I have zero evidence to back it up.

Re: FineX, FineY and zero hit
by Zepper on 2015-12-21 (#161038)

sprdma_and_dmc_dma_512.nes
Fails with 1,2,3 and passes with 2,1,3.

Re: FineX, FineY and zero hit
by zeroone on 2015-12-21 (#161039)

Interesting. Nintendulator 0.970 and 0.975-beta fail test sprdma_and_dmc_dma_512.nes.

edit: Nintendulator does not pass 7-dmc_basics.nes either.

Re: CPU timing precision
by Zepper on 2015-12-21 (#161041)

@Disch
I'm not familiar with the 6502 simulator, but you (or anyone else) can give it a try, regarding $4014.

Re: CPU timing precision
by zeroone on 2015-12-21 (#161042)

From Blargg's post:

blargg wrote:

Based on the following tests, DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle.

Does "if it lands on a CPU write" mean if it coincides with a CPU write cycle?

Re: CPU timing precision
by Disch on 2015-12-21 (#161046)

@ Zepper:

Could the reason the test is failing be due to something else? Blargg's post clearly indicates it should be 1,2,3.

zeroone wrote:

Does "if it lands on a CPU write" mean if it coincides with a CPU write cycle?

It probably means if the 'rw' signal is low at the time the DMA cuts in -- which happens during write cycles.

Since the DMA transfer can't happen "on" a cycle (since the CPU can either be performing its normal execution or DMA fetch -- it can't do both at the same time) I assume this means if the DMA happens "immediately after" a write cycle.

Example, "STA $10" is a 3 cycle instruction, two reads and one write:

Code:

Cycle 0:   read(PC) for opcode
  ** DMA here takes 4 cycles **
Cycle 1:   read(PC) for address
  ** DMA here takes 4 cycles **
Cycle 2:   write(address,A) to perform the STA
  ** DMA here takes 2 cycles **
...

EDIT:

Looks like I'm wrong about 4014. According to visual2A03 the dummy reads are from the PC.

Re: CPU timing precision
by zeroone on 2015-12-21 (#161048)

Disch wrote:

It probably means if the 'rw' signal is low at the time the DMA cuts in -- which happens during write cycles.

Since the DMA transfer can't happen "on" a cycle (since the CPU can either be performing its normal execution or DMA -- it can't do both at the same time) I assume this means if the DMA happens "immediately after" a write cycle.

It could also mean, in place of what would normally be a write cycle (effectively, immediately before a write cycle).

Either way, if RockNES does not process instructions at the microcode level, how does it handle such a case?

Re: CPU timing precision
by Zepper on 2015-12-21 (#161051)

...

Re: CPU timing precision
by Zepper on 2015-12-21 (#161052)

Disch wrote:

Could the reason the test is failing be due to something else? Blargg's post clearly indicates it should be 1,2,3.

It's been a LONG time since I debugged it like crazy, so one possible reason is...

Code:

CPUOP(STA1) /* sta $xxxx */
  dmc_runfor(3);
  writevalue(offset, cpu->A);
  /* You can see this in the STA $100 after OAM DMA,
  where DMC DMA takes three cycles for two different times.
  This is because both times it's landing on the fourth cycle of STA $100.
  */
OPEND

Re: CPU timing precision
by zeroone on 2015-12-21 (#161055)

Zepper wrote:

@zeroone
You're being more boring than constructive.
#define microcode with an example, then we'll continue.

Sure. I just mean the incremental steps that occur as an instruction is executed. For instance, steps 1-7 below are separate microcode instructions. Each microcode instruction takes 1 CPU cycle, and every single one performs a memory access: it's either a read or a write (see the R/W column). If RockNES does not run at the microcode level, how does it handle Blargg's "3 if it lands on a CPU write" case?

Code:

     Read-Modify-Write instructions (ASL, LSR, ROL, ROR, INC, DEC,
                                     SLO, SRE, RLA, RRA, ISB, DCP)

        #   address  R/W description
       --- --------- --- ------------------------------------------
        1    PC       R  fetch opcode, increment PC
        2    PC       R  fetch low byte of address, increment PC
        3    PC       R  fetch high byte of address,
                         add index register X to low address byte,
                         increment PC
        4  address+X* R  read from effective address,
                         fix the high byte of effective address
        5  address+X  R  re-read from effective address
        6  address+X  W  write the value back to effective address,
                         and do the operation on it
        7  address+X  W  write the new value to effective address

       Notes: * The high byte of the effective address may be invalid
                at this time, i.e. it may be smaller by $100.
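
As an illustration of the table above, here is a sketch of INC abs,X issued as one bus access per cycle. The `bus_t` recorder and all function names are hypothetical; flags and interrupt polling are left out, and this is not taken from any emulator mentioned in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flat 64 KiB bus that records every access it sees. */
typedef struct {
    uint8_t  mem[0x10000];
    char     rw[8];     /* 'R' or 'W' for each cycle of the instruction */
    uint16_t addr[8];   /* address placed on the bus each cycle */
    int      cycles;
} bus_t;

static uint8_t bus_read(bus_t *b, uint16_t a)
{
    assert(b->cycles < 8);
    b->rw[b->cycles] = 'R';
    b->addr[b->cycles++] = a;
    return b->mem[a];
}

static void bus_write(bus_t *b, uint16_t a, uint8_t v)
{
    assert(b->cycles < 8);
    b->rw[b->cycles] = 'W';
    b->addr[b->cycles++] = a;
    b->mem[a] = v;
}

/* INC abs,X, one bus access per cycle. Returns the updated PC. */
static uint16_t inc_abs_x(bus_t *b, uint16_t pc, uint8_t x)
{
    b->cycles = 0;
    (void)bus_read(b, pc++);                  /* 1: fetch opcode */
    uint8_t lo = bus_read(b, pc++);           /* 2: low address byte */
    uint8_t hi = bus_read(b, pc++);           /* 3: high address byte */
    uint16_t base    = (uint16_t)((hi << 8) | lo);
    uint16_t unfixed = (uint16_t)((hi << 8) | (uint8_t)(lo + x));
    uint16_t fixed   = (uint16_t)(base + x);
    (void)bus_read(b, unfixed);               /* 4: read, high byte may be $100 low */
    uint8_t v = bus_read(b, fixed);           /* 5: re-read from the real address */
    bus_write(b, fixed, v);                   /* 6: write the old value back */
    bus_write(b, fixed, (uint8_t)(v + 1));    /* 7: write the new value */
    return pc;
}
```

Because every cycle goes through `bus_read` or `bus_write`, a DMA check or a PPU step can be hooked into those two functions and will see cycles 6 and 7 as writes.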

Re: CPU timing precision
by Zepper on 2015-12-21 (#161056)

...

Re: CPU timing precision
by zeroone on 2015-12-21 (#161057)

Zepper wrote:

My CPU emulator was written using this exact idea ages ago. You're misjudging my emulator.
Peace. ^_^;;

I must have misinterpreted your description earlier in the thread. Thanks for the clarification.

If possible, could you please provide the code that you used to implement the PHA instruction?

Re: CPU timing precision
by Zepper on 2015-12-21 (#161061)

zeroone wrote:

Zepper wrote:

My CPU emulator was written using this exact idea ages ago. You're misjudging my emulator.
Peace. ^_^;;

I must have misinterpreted your description earlier in the thread. Thanks for the clarification.

If possible, could you please provide the code that you used to implement the PHA instruction?

Not much.

Code:

/*      #  address R/W description
       --- ------- --- -----------------------------------------------
        1    PC     R  fetch opcode, increment PC
        2    PC     R  read next instruction byte (and throw it away)
        3  $0100,S  W  push register on stack, decrement S
*/
CPUOP(PHA3)
   _readvalue(cpu->PC);     //2nd cycle
   PUSH(cpu->A); cpu->S--;  //3rd cycle
   _INT_ACK(); /* acknowledge IRQ/NMI */
OPEND

Re: CPU timing precision
by zeroone on 2015-12-21 (#161063)

Thanks Zepper. In addition to returning a value from memory, does _readvalue() update the PPU by 3 PPU cycles?

Re: CPU timing precision
by Disch on 2015-12-21 (#161064)

What happened to the good old catch-up approach? What's with all this running stuff in parallel nonsense. =P

Re: CPU timing precision
by Zepper on 2015-12-21 (#161066)

zeroone wrote:

Thanks Zepper. In addition to returning a value from memory, does _readvalue() update the PPU by 3 PPU cycles?

Yes. Functions with a leading underscore do not poll IRQ/NMI, since that must be done only near the last cycle.

Code:

#define PULL()           readvalue(0x100|cpu->S)
static inline BYTE _readvalue(C_DWORD offset)
{
   ppu_new_clock();
   cpu_address_bus = offset;
   return cpu->read_mem[offset >> 13](offset);
}
static inline void _writevalue(C_DWORD offset, C_BYTE value)
{
   ppu_new_clock();
   dmc_runfor(3);
   cpu->writemem[offset >> 13](offset, value);
}

Disch wrote:

What happened to the good old catch-up approach? What's with all this running stuff in parallel nonsense. =P

That scheme of queue events, then dequeue? No.

Re: CPU timing precision
by zeroone on 2015-12-21 (#161069)

@Zepper

That looks great. That is exactly what I meant by an emulator running at the microcode level. I think the confusion in this thread is partially due to a language barrier.

However, even at the microcode level, the PPU and CPU are not perfectly synchronized. The PPU will advance by at least 3 PPU cycles and then the CPU catches up by running 1 CPU cycle. An engineering trade-off will likely have to be introduced to compensate for that gap.
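
A minimal sketch of that scheme, assuming an NTSC ratio of 3 PPU dots per CPU cycle and the "PPU first, then the CPU access" ordering described in the thread. The `nes_t` struct and the function names are hypothetical, and only internal RAM is modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t cpu_cycles;  /* CPU cycles completed */
    uint64_t ppu_dots;    /* PPU dots completed */
    uint8_t  ram[0x800];  /* 2 KiB internal RAM only, for illustration */
} nes_t;

/* Advance the PPU by one dot. A real PPU would render and update flags here. */
static void ppu_tick(nes_t *n) { n->ppu_dots++; }

/* One CPU cycle: 3 PPU dots first, then the bus access itself. */
static void clock_cycle(nes_t *n)
{
    for (int i = 0; i < 3; i++)
        ppu_tick(n);
    n->cpu_cycles++;
}

static uint8_t cpu_read(nes_t *n, uint16_t addr)
{
    clock_cycle(n);
    return n->ram[addr & 0x7FF];
}

static void cpu_write(nes_t *n, uint16_t addr, uint8_t value)
{
    clock_cycle(n);
    n->ram[addr & 0x7FF] = value;
}
```

With this ordering, every access sees PPU state as of the end of its third dot, so an event on the first or second dot of the group is observed as if it happened up to two dots earlier than the access. That is the gap a per-game or per-register adjustment would have to cover.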

Re: CPU timing precision
by Disch on 2015-12-21 (#161070)

Zepper wrote:

That scheme of queue events, then dequeue? No.

No, the scheme of predicting when the next interesting event is, and running the CPU up to that point, then catching everything else up.
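
Roughly, the catch-up idea looks like the sketch below. Everything here is hypothetical: the `sys_t` struct, the function names, and the fixed 2-cycle stand-in for executing an instruction. A real core would also have to predict the event time, which is the hard part.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t cpu_cycle;      /* current CPU timestamp, in CPU cycles */
    uint64_t ppu_synced_to;  /* CPU timestamp the PPU has been run up to */
    uint64_t ppu_dots;       /* total PPU dots emulated so far */
} sys_t;

/* Run the PPU until it has caught up with the CPU (3 dots per CPU cycle). */
static void ppu_catch_up(sys_t *s)
{
    assert(s->ppu_synced_to <= s->cpu_cycle);
    s->ppu_dots += 3 * (s->cpu_cycle - s->ppu_synced_to);
    s->ppu_synced_to = s->cpu_cycle;
}

/* Run the CPU alone up to the predicted time of the next interesting
   event, then bring the PPU up to date in one go. */
static void run_until(sys_t *s, uint64_t next_event_cycle)
{
    while (s->cpu_cycle < next_event_cycle)
        s->cpu_cycle += 2;   /* placeholder for executing one instruction */
    ppu_catch_up(s);
}

/* A CPU access to a PPU register must force a catch-up before it is handled. */
static void on_ppu_register_access(sys_t *s)
{
    ppu_catch_up(s);
}
```

The CPU may overshoot the event by the length of the last instruction, so the event handler has to work from timestamps rather than assume it runs exactly on time.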

Re: CPU timing precision
by Zepper on 2015-12-22 (#161087)

Anyway, about the game "The Simpsons - Bart vs Space Mutants", any clues regarding the flickering scorebar??? I tried messing up the sprite zero hit timing, but the game doesn't seem to be sensitive.

UPDATE: trapped secondary writes to $2006. Format is scanline, PPU cycle, loopy_v, loopy_t, screen enabled.
UPDATE2: can $2006 secondary writes occur before PPU cycle 256? Shouldn't they come after 260?

Code:

-- frame --
000,172 * $0800 <- $3F00 ON
005,213 * $3F20 <- $0000 ON
196,254 * $72A6 <- $02C0 ON
-- frame --
000,176 * $1800 <- $3F00 ON
005,217 * $3F20 <- $0000 ON
196,265 * $02C0 <- $02C0 ON
-- frame --
000,181 * $0800 <- $3F00 ON
005,222 * $3F20 <- $0000 ON
196,263 * $02C0 <- $02C0 ON
-- frame --
000,173 * $0800 <- $3F00 ON
005,214 * $3F20 <- $0000 ON
196,262 * $02C0 <- $02C0 ON
-- frame --
000,175 * $0800 <- $3F00 ON
005,216 * $3F20 <- $0000 ON
196,260 * $02C0 <- $02C0 ON
-- frame --
000,173 * $0800 <- $3F00 ON
005,214 * $3F20 <- $0000 ON
196,268 * $02C0 <- $02C0 ON
-- frame --
000,178 * $0800 <- $3F00 ON
005,219 * $3F20 <- $0000 ON
196,263 * $02C0 <- $02C0 ON
-- frame --
000,173 * $0800 <- $3F00 ON
005,214 * $3F20 <- $0000 ON
196,262 * $02C0 <- $02C0 ON

Re: CPU timing precision
by Dwedit on 2015-12-22 (#161090)

What notation are you using for scanline number? I always thought the notation was to use 0-239 as the visible scanlines, -1 as the prerender, 240 as the postrender, and 241-260 as vblank time.

Re: CPU timing precision
by Disch on 2015-12-22 (#161092)

That was always my notation -- but I don't think it's standardized anywhere =P

And now the PPU cycles have been shifted by 1 because of Visual2A03... and dots 1-256 are the ones that render pixels instead of the easier-to-use 0-255

Re: CPU timing precision
by Zepper on 2015-12-22 (#161099)

Ah, sorry. :oops:

I'm using...
0-19 VBlank,
20 scanline -1,
21-260 visible lines,
261 dummy scanline.
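
To read the log above in the -1..260 notation, the two numberings can be mapped onto each other. This sketch assumes that the "dummy scanline" 261 corresponds to the post-render line 240 and that VBlank line 0 corresponds to 241; the function name is hypothetical.

```c
#include <assert.h>

/* Convert the 0-261 numbering (0-19 VBlank, 20 pre-render, 21-260 visible,
   261 dummy) to the -1..260 numbering (-1 pre-render, 0-239 visible,
   240 post-render, 241-260 VBlank). */
static int zepper_to_conventional(int line)
{
    assert(line >= 0 && line <= 261);
    if (line <= 19)  return line + 241;  /* VBlank lines */
    if (line == 20)  return -1;          /* pre-render line */
    if (line <= 260) return line - 21;   /* visible lines */
    return 240;                          /* dummy line, assumed post-render */
}
```

Under that mapping, the writes logged on line 196 land on visible scanline 175, and the ones on lines 000 and 005 fall inside VBlank.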

Re: FineX, FineY and zero hit
by Hyde on 2015-12-23 (#161132)

Disch wrote:

blargg wrote:

DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle.

So:

last cycle = 3 cycles
next-to last cycle = 2 cycles
next-to-next-to-last cycle = 1 cycle

so the last 3 DMC runs in Zepper's code should be 1,2,3 -- not 2,1,3

Good stuff. Thanks for this

Re: CPU timing precision
by Zepper on 2015-12-27 (#161348)

Well... if the PPU runs before the current CPU read, there's no more shaking/flickering in a few games. I wonder why...
It's irrelevant for CPU writes. About the sprite hit flag $2002:$40, is there any latency or is it set immediately?

Re: CPU timing precision
by zeroone on 2015-12-28 (#161442)

Zepper wrote:

Which games did it fix and which are still having issues?

The wiki does not mention any sprite 0 hit flag latency. Assuming that the sprite 0 hit is recorded in a latch (as opposed to somewhere in memory), it should be available by the next PPU cycle if not within the current PPU cycle.

Re: CPU timing precision
by tepples on 2015-12-28 (#161447)

I seem to remember there being a latency of about one or two dots (less than one CPU cycle) for sprite 0 hit. It's probably an artifact of the pixel pipeline.
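
For experimenting with such a latency, a small delay counter is enough. This is a sketch only: the struct, the function name, and the delay value are hypothetical, and the actual latency is not established in this thread.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool status_hit;  /* what a $2002 read would report in bit 6 */
    int  countdown;   /* dots left until a detected hit becomes visible */
    int  delay_dots;  /* assumption: tunable latency, not a measured value */
} sprite0_t;

/* Call once per PPU dot. hit_detected is true on the dot where an opaque
   sprite 0 pixel overlaps an opaque background pixel. */
static void sprite0_tick(sprite0_t *s, bool hit_detected)
{
    /* A pending hit becomes visible when its countdown expires. */
    if (s->countdown > 0 && --s->countdown == 0)
        s->status_hit = true;

    /* Start a countdown for a new hit, or set the flag at once if no delay. */
    if (hit_detected && !s->status_hit && s->countdown == 0) {
        if (s->delay_dots <= 0)
            s->status_hit = true;
        else
            s->countdown = s->delay_dots;
    }
}
```

Setting `delay_dots` to 0, 1 or 2 makes it easy to compare game behaviour against the test ROM results for each value.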

Re: CPU timing precision
by Zepper on 2015-12-28 (#161453)

zeroone wrote:

Which games did it fix and which are still having issues?

Simpsons is one. ^_^;;

Re: CPU timing precision
by zeroone on 2015-12-28 (#161457)

tepples wrote:

I seem to remember there being a latency of about one or two dots (less than one CPU cycle) for sprite 0 hit. It's probably an artifact of the pixel pipeline.

Someone really needs to document the PPU pixel pipeline and everything that is affected in detail. The wiki lacks this information.

Re: CPU timing precision
by Zepper on 2015-12-29 (#161476)

zeroone wrote:

tepples wrote:

I seem to remember there being a latency of about one or two dots (less than one CPU cycle) for sprite 0 hit. It's probably an artifact of the pixel pipeline.

Someone really needs to document the PPU pixel pipeline and everything that is affected in detail. The wiki lacks this information.

Visual2C02 in JavaScript
http://www.qmtpro.com/~nes/chipimages/visual2c02/

Re: CPU timing precision
by zeroone on 2015-12-29 (#161478)

Zepper wrote:

Visual2C02 in JavaScript
http://www.qmtpro.com/~nes/chipimages/visual2c02/

Agreed. But, I have yet to figure out how to use it.

In a thread from a few months ago, we actually discussed this same topic. Delaying the sprite 0 hit by a tiny amount does fix The Simpsons. But, it seems to break some of Blargg's tests. Experiment and let us know what happens. If you post a new version of RockNES, I'd be happy to help you test some games.