CPU<>PPU order of operations

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
CPU<>PPU order of operations
by on (#85156)
I can find no documentation on this anywhere, so I'll ask here.

So, the CPU and PPU run in parallel. Of course that is impossible to do with our emulation, which runs each processor a little at a time.

So say that the CPU and PPU are at exactly the same position in time ... how do they interleave?

A CPU cycle will consume 12 clock cycles, whether it is a read or a write. So let's say the CPU reads from $2002. Does it read the PPU state, and then have the PPU tick three times? What about a write to $2000? Does it write the PPU state, and then tick the PPU three times? (Note: ticking the PPU would be when NMIs were tested.)
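
To put that in emulator pseudo-code, the straightforward ordering I'm describing would look roughly like this (just a sketch; ppu_register_read / ppu_register_write / ppu_tick are placeholder names, not my actual implementation):

Code:
#include <cstdint>

// Hypothetical handlers standing in for the real bus and PPU step logic.
uint8_t ppu_register_read(uint16_t addr);
void    ppu_register_write(uint16_t addr, uint8_t data);
void    ppu_tick();  // one PPU dot = 4 master clocks (NTSC)

// "Access first, then let the PPU catch up by three dots."
uint8_t cpu_bus_read(uint16_t addr) {
  uint8_t data = ppu_register_read(addr);   // e.g. $2002 sampled here
  for(int n = 0; n < 3; n++) ppu_tick();    // then 12 master clocks of PPU
  return data;
}

void cpu_bus_write(uint16_t addr, uint8_t data) {
  ppu_register_write(addr, data);           // e.g. $2000 applied here
  for(int n = 0; n < 3; n++) ppu_tick();
}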

Under this model, I can see no possible way to pass ppu_vbl_nmi tests 03 and 07 at the same time. I'm not saying there is not a way! I just can't find it.

For 07, I get the same 00-05=N, 06-08=- that others here have gotten. Unfortunately, neither of the two people who solved it ever explained how they did it for the benefit of others.

Also under this model, the point where the F=1,Y=261 BG&&SPR-disable state is cached for the PPU cycle skip is X=337, which is completely nonsensical, as the rest of the PPU works in two-cycle pairs, meaning it should be 338.

03 reads from $2002, and 07 writes to $2000. 03 can only pass if we clear the NMI line at 260,340. 07 can only pass if we clear it at 261,0.

The only way I can see to stagger this is if CPU writes affect the PPU four clocks later than CPU reads. This would match SNES bus hold behavior, which is documented in the W65C816S reference manual, yet I can find no documentation on this anywhere for the NES.

The idea is that reads are requests from other chips, so they see them and start acting on them faster. But writes need to stay on the bus for a given amount of time after the other chip sees it to be acknowledged / copied.

So in this instance: CPU reads would add 8 cycles, then run the PPU for two ticks (8 cycles), then perform the read, then run the PPU for one more tick. CPU writes would add 12 cycles, then run the PPU for three ticks, then perform the write.
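
As a rough sketch, using the same placeholder handlers as above, the staggered model would be:

Code:
// Reads land 4 master clocks earlier within the CPU cycle than writes do.
uint8_t cpu_bus_read(uint16_t addr) {
  ppu_tick(); ppu_tick();                    // 8 master clocks elapse first
  uint8_t data = ppu_register_read(addr);    // the read happens here
  ppu_tick();                                // remaining 4 master clocks
  return data;
}

void cpu_bus_write(uint16_t addr, uint8_t data) {
  ppu_tick(); ppu_tick(); ppu_tick();        // full 12 master clocks elapse
  ppu_register_write(addr, data);            // the write happens at the end
}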

With this, it is now possible to pass 03 and 07 at the same time. It also allows us to place the extra cycle skip test at X=338. And yet, now 05 has "4444433333" instead of "4443333322", so it fails. And absolutely nothing I try can change that pattern.

I can try to debug problems, and I can special-case behavior when the PPU and CPU are at the exact same time (e.g. a 'conflict'), but I need to know the proper order of operations first.

-----

If it helps any, this is my current setup, which uses the former interleave pattern and fails only 07 (and requires 10 to be at 337):

Timing: the PPU executes one cycle (4 clock cycles), and then performs the following:
Code:
  if(status.ly == 240 && status.lx == 340) status.nmi_hold = 1;
  if(status.ly == 241 && status.lx ==   0) status.nmi_flag = status.nmi_hold;
  if(status.ly == 241 && status.lx ==   2) cpu.set_nmi_line(status.nmi_enable && status.nmi_flag);
  if(status.ly == 261 && status.lx ==   0) cpu.set_nmi_line(status.nmi_flag = 0);  //260,340 will pass 03, but fail 07
  status.lx++;


$2002 read:
Code:
    result |= status.nmi_flag << 7;
    result |= status.sprite_zero_hit << 6;
    result |= status.sprite_overflow << 5;
    result |= status.mdr & 0x1f;
    status.address_latch = 0;
    status.nmi_hold = 0;
    cpu.set_nmi_line(status.nmi_flag = 0);


$2000 write:
Code:
    status.nmi_enable = data & 0x80;
    status.master_select = data & 0x40;
    status.sprite_size = data & 0x20;
    status.bg_addr = (data & 0x10) ? 0x1000 : 0x0000;
    status.sprite_addr = (data & 0x08) ? 0x1000 : 0x0000;
    status.vram_increment = (data & 0x04) ? 32 : 1;
    status.taddr = (status.taddr & 0x73ff) | ((data & 0x03) << 10);
    cpu.set_nmi_line(status.nmi_enable && status.nmi_flag);


CPU triggers an actual NMI whenever the line transitions from 0->1.
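
Roughly, that edge detection amounts to something like this (illustrative sketch only):

Code:
// Sketch of the 0->1 edge detection on the NMI line.
struct NmiLine {
  bool level   = false;  // what the PPU currently drives
  bool pending = false;  // consumed by the CPU core between instructions

  void set(bool new_level) {
    if(!level && new_level) pending = true;  // rising edge requests an NMI
    level = new_level;
  }
};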

by on (#85157)
It might sound empirical, but I assume PPU clocks happen before a memory access. For NTSC, that's 3 PPU clocks before each memory access, subject to change if a DMC DMA occurs.

by on (#85160)
About to head out for a job interview, I'll fill in more detail when I get back.

Running the CPU, then running the PPU 3 times isn't how things happen on the hardware.

Hardware wise, the CPU takes the input clock, and generates two non-overlapping clocks phi1 and phi2 from it. Each one is a pulse wave with a phase offset and duty cycle such that they're high for a little less than a half period of the input clock, and about half an input clock phase apart.

phi2 gets sent out a pin to the cart edge, and is the clock devices should generally use to know when the CPU's providing a valid address and data. Write data and the address go out a bit before the rising edge of phi2, and the CPU latches read data on the falling edge.

I imagine the PPU's bus is similar, other than having separate rd/wr lines for vram.

Depending on exactly how the internals are wired up, the PPU probably latches the internal reg to the bus drivers somewhere around the rising edge of phi2, and latches the external bus to its own regs around the falling edge. For regs it only reads, this is pretty trivial. For regs it can write back to, there are some questions regarding how the clocking works.

I don't know how square the post-divider input clock is. It could be very lopsided, and I suspect on the SNES the bulk of the extra cycles for the longer access periods go into phi2.

The two-cycle pairs thing on the PPU is entirely on the VRAM side of it, as it needs to burn a cycle to latch the address (multiplexed A/D bus).

by on (#85163)
Zepper wrote:
It might sound empiric, but I assume a PPU clock before a memory access. For NTSC, 3 PPU clocks before a memory access, subject to change if a DMC DMA occurs.


That's pretty much how I assumed existing emulators pulled this off. If it's wrong, it will create a lot of weird "workaround" code that will pass tests, but not be 100% correct for all registers.

Whether it's { PPU, PPU, PPU, CPU } or { CPU, PPU, PPU, PPU } is a fairly minor (yet still important) question. Either way it ends up behaving the same after the first cycle, but it may "offset" all timing events by one cycle to compensate.

But let's say reads are 8:read:4, and writes are 12:write. Then the pattern becomes irregular from the PPU's side alone. For RWWR: { PPCP PPPC PPPC PPCP }

PPU cycles between CPU cycles can vary from 2-4 in this case.

Zepper, I realize it was a while ago, but may I ask how you fixed test 07, if you still recall?

Quote:
phi2 gets sent out a pin to the cart edge, and is the clock devices should generally use to know when the CPU's providing a valid address and data. Write data and the address go out a bit before the rising edge of phi2, and the CPU latches read data on the falling edge.


Awesome! Sounds like you really know your stuff in this regard :D
I look forward to hearing more. But if possible, can you please dumb it down for someone who isn't well versed in digital circuits? E.g., I'm not sure about the phi1/phi2 clocks/phases, so something more like the PPCP example above would be fantastic.

I suspect it's possible, with lots of hacks, to make an NES emulator pass all tests regardless of how you treat the interleaving, it's just that the proper way will result in a minimum of "special case" checks.

by on (#85164)
Maybe this goes without saying, but there is more than one possible power-up PPU-CPU alignment, which in turn affects some things like how OAM data readback (through $2003 and $2004) (mis)behaves. I doubt it affects any of blargg's test ROMs though, unless he explicitly states so. Hopefully we'll someday see an emulator that actually emulates all the different alignment possibilities...

by on (#85167)
Are there any solid, easy to code tests for which alignment you're on? It might be useful as part of the seed for a PRNG.

by on (#85175)
If it's dependent on startup state no matter what, which you can't easily test for, then most games should not rely on the PPU/CPU sync behavior for correct operation. Therefore, it should be pretty irrelevant, no? Hack in correct behavior for the one or two games that require it, and you make your life easier in the process.

by on (#85186)
Let's see if some diagrams can help a bit. Head to the bottom if you want the too long, didn't read summary. This is gonna be a bit heavy on clock details.

First case, generic 6502. You have an input clock, nominally square, that looks something like the following. Call it mclk for lack of a better term.

Code:
mclk: 111111000000111111000000....


Internally, the CPU needs a few more events than that can provide to be able to process things, and also needs something suitable to use as a signal to external hardware when the address and data are valid. These are phi1 and phi2 (phi2 tends to be listed as M2 in pinouts on the NES). Skipping the hardware that generates it (it's a mess I can barely follow) the outputs look like so:

Code:
mclk: 111111000000111111000000....
phi1: 011110000000011110000000....
phi2: 000000011110000000011110....


phi1 updates the CPU internals, phi2 signals external hardware that it should provide data for reads, and expect data for writes. In reality, they're closer to the full width of the pulses than the above diagram, so for the rest of this, we'll treat it as if phi2 is 1 whenever mclk is 0.

On the NES and SNES, the CPU is getting fed a 21MHz clock signal, that is divided internally to form the mclk signal for the actual 6502 or 65816 core. The unknown bit is whether that divider generates a 50/50 square wave, or some lopsided one. The SNES in particular could be quite likely to generate a lopsided one, as that would give the external hardware the most slack at the lower clock rates. Below, NES50 is a 50/50 NES freq wave, NES33 is a 33/66, and the SNS5 F/S/X and SNES F/S/X waves are the 3 SNES freqs at 50/50, and at a non-square rate I was planning on using in my fpga. F = /6, S = /8, X = /12

Code:
21MHz: 101010101010101010101010101010101010101010101010
NES50: 111111000000111111000000111111000000111111000000
NES33: 111100000000111100000000111100000000111100000000
SNS5F: 111000111000111000111000111000111000111000111000
SNS5S: 111100001111000011110000111100001111000011110000
SNS5X: 111111000000111111000000111111000000111111000000
SNESF: 111000111000111000111000111000111000111000111000
SNESS: 111000001110000011100000111000001110000011100000
SNESX: 111000000000111000000000111000000000111000000000


Note how the 0 periods on the non-square ones are longer. This gives external devices (rom, etc) more time to come up with data and get it stabilized.

As for what exactly phi2 represents, it being high signifies the following:
The address bus will stay stable while it is high
If it's a write, the data bus will stay stable while it is high
If it's a read, the external device should drive the data, and hold it stable while m2 stays high

Now let's take a closer look at the PPU's place in all this. Here are the 21MHz clock, NES33 clock, NES50 clock, and a 50/50 guess at the PPU's post-divider clock

Code:
21MHz: 101010101010101010101010101010101010101010101010
NES33: 111100000000111100000000111100000000111100000000
NES50: 111111000000111111000000111111000000111111000000
PPU:   110011001100110011001100110011001100110011001100


You can see the x3 relationship there; this is for the more trivial case of powerup synchronization. Noting that the M2 phase covers two full PPU cycles, we can surmise that the CPU-facing read hardware does not run directly at the full PPU clock rate, where the PPU would just treat M2 being high as meaning to do a read or write from the CPU that cycle. If it did, most accesses would be seen as two back-to-back accesses rather than the single access they actually are. Also, it would mean the value could jitter on the data bus while M2 is high.

There is a much more likely situation for the PPU, and that's where the CPU-facing logic is controlled almost entirely by M2 and R/W.

At the rising edge of M2, for a read, the PPU latches the appropriate data (status flags, $2007 buf, palette data, or OAM[$2003]) onto its external data lines. This value will stay steady, regardless of what happens underneath. For $2002 and $2007, it also sets a flag indicating that the NMI flag should be cleared, or that the buffer should be refilled. Presumably, the PPU would act on this flag either at one of the following PPU clock edges (for the clear, could be either rising or falling), or at the start of the next memory cycle (2 PPU cycles long). Alternatively, it could treat a $2002 read as being an asynchronous clear of the NMI latch, though that might have some race issues with latching it for returning to the CPU. For the cases that could hit either edge, the difference would generally just be whether things get delayed by a PPU cycle or not.

For a write, the PPU could either look for the rising edge, and apply the write at one of the next PPU edges, or could wait until M2 falls before applying. Writes to $2005 would fill the latch used to reset the VRAM address, writes to $2006 would fill the temp latch that gets copied into the aforementioned scroll regs, and set a flag to flush it to the VRAM address. Writes to $2007 probably set the buffer and a flag indicating a pending memory request. Writes to $2000 or $2001 probably go straight to the register, since from the PPU's perspective, they're read-only. Writes to $2004 probably write the value, and probably set an increment flag that the PPU will act on at its convenience. One possibility for $2007 writes that comes to mind is that since they're suppressed during rendering anyways, they might bypass the buffer, and drive the next two PPU cycles directly, since the data will stay steady across the entire access period.
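
As a very rough sketch of that kind of register interface (all of the names here are made up for illustration, and the edge choices are only the speculation above, not verified behavior):

Code:
#include <cstdint>

// Speculative model: reads are latched onto the CPU data bus at the rising
// edge of M2, writes are applied only when M2 falls.
struct PpuCpuPort {
  uint8_t cpu_data = 0;  // value currently driven onto the CPU data bus

  uint8_t latch_read(uint16_t addr);                 // stub: $2002/$2004/$2007/...
  void    apply_write(uint16_t addr, uint8_t data);  // stub: $2000/$2005/$2006/...

  void m2_rising(uint16_t addr, bool is_write, uint8_t data) {
    if(!is_write) cpu_data = latch_read(addr);  // snapshot the regs for the read
    // a $2002 read could also raise a "clear NMI flag" request here,
    // to be acted on at a later PPU clock edge
  }

  void m2_falling(uint16_t addr, bool is_write, uint8_t data) {
    if(is_write) apply_write(addr, data);       // commit the write only now
  }
};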

For the SNES, the waves look something like:

Code:
21MHz: 101010101010101010101010101010101010101010101010
SNS5F: 111100001111000011110000111100001111000011110000
SNESF: 111000001110000011100000111000001110000011100000
PPU:   110011001100110011001100110011001100110011001100


Either with a 50/50 CPU clock or a lopsided one, there's still only one PPU cycle covered by M2.

Determining what case actually covers reality could be done trivially with an oscilloscope that can show you M2 against the 21MHz input. Determining it programmatically is a little trickier, given that whether something happens on the positive or negative edge generally just delays it a cycle or not, and the CPU can't read from the PPU fast enough to see the difference. Inspection of the die shots that are floating around might be able to shed some light on the particulars as well, but I haven't gotten the hang of reading them yet, and I think the 2C02 one is still missing a number of layers.

What this means for emulation:

Your approach of doing CPU reads 4 cycles earlier than writes working suggests the PPU latches the output on reads at the rising edge of M2, and latches incoming data on writes at the falling edge. It may not be 100% correct to the hardware, but it's probably within what can be determined without an oscilloscope and/or a logic analyzer with a rather high oversampling rate.

As you can see above, it's not as simple as just interleaving the CPU and PPU, though for a lot of things, there's no visible difference.

by on (#85187)
So then the consensus is that nobody knows perhaps the most fundamental aspect of a parallel-operation system. Wonderful =(

In that case, short of adding a hack to force-block an NMI, I can see no way to pass blargg's test 07 without subsequently failing test 03. And yet others seem to solve it. So is everyone else special casing the last point that NMI can be enabled and blocking it, or is there some other trick? If the latter, please explain; it would be most helpful :/

by on (#85191)
It's hardly the most fundamental aspect.

If I'm reading the expected output correctly for the tests, assuming their cycle-0 reference is the same for each, what we want is something that will produce this:

Code:
cyc  0 1 2 3 4 5 6 7 8
stat V V V V V V - - -
ctrl N N N N N - - - -


Where stat is whether $2002.d7 read back as true or not, ctrl is whether an NMI occurs with a write to $2000 on that dot.

This looks perfectly reasonable, given the assumption that the PPU regs update at or near the end of its cycle. The internal vblank flag (nmi_hold in your code, I believe) gets cleared during cycle 5. A read that hits on that cycle will return the state of that flag early in the cycle, whereas a write to $2000 won't hit the register (and the logic that does VBLANK & ENABLE) until the end of the cycle, after the vblank flag drops.

Here's a rough timeline:

Code:
4R 4F 5R 5F 6R
      R
         C
            W


R is the read, C is when the vblank flag is cleared, W is when the written data hits the reg for $2000.
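
In emulator terms, one way to express that ordering within a single PPU dot would be something like the following (a sketch with made-up field names, not code from any particular emulator):

Code:
#include <cstdint>

// One PPU dot under that model: a $2002 read samples the flag early in the
// dot, the flag clear happens mid-dot, and a $2000 write lands at the end.
struct VblSketch {
  bool vblank_flag = false, nmi_enable = false, nmi_line = false;
  bool clear_flag_this_dot = false;     // true on the 261,0 dot

  // R: a $2002 read samples the flag before the end-of-dot work
  uint8_t read_2002() { return vblank_flag ? 0x80 : 0x00; }

  // C then W: the clear happens before a same-dot $2000 write is applied,
  // so enabling NMI on that dot no longer sees the flag set
  void end_of_dot(bool has_2000_write, uint8_t value) {
    if(clear_flag_this_dot) vblank_flag = false;   // C
    if(has_2000_write) {                           // W
      nmi_enable = (value & 0x80) != 0;
      nmi_line   = nmi_enable && vblank_flag;
    }
  }
};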

One other thing that comes to mind, though I don't think it applies here, is that if they decided to synchronize the CPU signals to the PPU clock for whatever reason, that would add a 1-2 cycle latency. It's exceedingly unlikely, since it would mean stuffing another bit or two worth of flipflops on all the data lines, M2, and R/W, and would be somewhat unnecessary.

One last thing is that IIRC, NMI coming into the CPU might be synchronized, though with the divider, there's some question as to which clock it's synchronized to. If so, NMI pulses shorter than a CPU cycle might be invisible.

by on (#85192)
byuu wrote:
Zepper, I realize it was a while ago, but may I ask how you fixed test 07, if you still recall?


What test are you talking about?

by on (#85193)
He means test 7 out of ppu_vbl_nmi:

Code:

07-nmi_on_timing
----------------
Tests NMI occurrence when enabled near time
VBL flag is cleared.

Enables NMI one PPU clock later on each line.
Prints whether NMI occurred.

00 N
01 N
02 N
03 N
04 N
05 -
06 -
07 -
08 -

by on (#85194)
ReaperSMS wrote:
It's hardly the most fundamental aspect.


The reason I disagree is because changing this alignment will require changing all timing, everywhere. And without having absolute values, you can't give definitive "this behavior happens at this cycle." Take the part where it caches the BG&&OAM enable for skipping the extra PPU dot. Nobody says "it happens at X=338", because where it happens depends on how you choose to guess how the interleaving works.

To me, this is major. This is the core foundation of your emulator. You don't want to build your castle on top of a poor foundation. And nobody can document the completely accurate process of how the hardware works.

Instead, everyone plays games of fiddling with numbers and hold latches and stuff until they can pass a given set of test ROMs that cover maybe 5% of the total possible combinations. You may nail blargg's $2000/$2002 timing tests, but are there also timing tests for $2003, $2004, $2006? Or what about with the more complex 64 PPU registers on the SNES?

Minor flaws in the lowest level tend to multiply as you get to higher and higher levels of operation, requiring increasingly complicated hacks to match behaviors. If you want to match hardware in all your tests, you need to get everything exactly right, always.

Quote:
A read that hits on that cycle will return the state of that flag early in the cycle, whereas a write to $2000 won't hit the register (and the logic that does VBLANK & ENABLE) until the end of the cycle, after the vblank flag drops.


Well, the issue is that three entire PPU cycles complete between each CPU cycle. And with our stock "don't know/care" approach of CPPP CPPP CPPP for both reads and writes, that means the CPU $2002 read or $2000 write will always occur on the very edge of every third PPU cycle.

by on (#85195)
What I meant was that at some point or another, you will hit a level where you technically don't (or can't) know the specific timing, and that the current knowledge is as good as the set of tests we have so far. Somewhere in there, emulation has to adopt some sort of compromise between simulating every last detail, and abstracting slightly in a way that maintains software compatibility. Otherwise you end up with scope creep, and suddenly you're trying to simulate process variations, or speed vs temperature variations, and have to find a way to characterize some particular chip. If you aren't trying to hit that point (and that point is rather extreme), the usual compromise is to emulate a system that, to every observing program known, is indistinguishable from the Real Thing within reason. Power-up synchronization is something that is fairly reasonable for an emulator to force to a case that cleans up some edge issues.

Even at the HDL level, compromises have to be made. Partly because we can't know certain things without doing a full RE of a die shot (and as awesome as visual 6502 is, their simulator is simulating ideal transistors in a lot of spots, because reality is messy) or because certain constructs don't work with modern logic. The NMOS 6502 is full of pass gates doubling as dynamic latches, which is something you just can't really do in an FPGA. In particular, dynamic level sensitive latches are not (reasonably) synthesizable, so they have to be 'emulated' as some form of edge sensitive flip flop. That's going to be a somewhat lossy process, but most of the time it doesn't matter in practice, due to guarantees made by other parts of the hardware (such as ensuring none of the input signals will change across the latch enable)

Other compromises come in, to deal with hardware details such as having to share one ram chip among everything, and ways to deal with that.

As for the specifics...

What we care about is observable behavior. In this case, we have test roms, that run on the actual NES, and show repeatable behavior. As anomie's timing doc says, when you give a proclamation of "X happens on dot Y", there's at least half a dot of uncertainty as to exactly when that happens. Most of the time, that doesn't interact with some other behavior, but when it does, those uncertainties come into play.

Most of the PPU regs are not readable, so if you have some event you can get a PPU cycle accurate test for, they should have repeatable behavior.

Also, cycles are not atomic, monolithic things. Nothing happens instantly, and the chips we're dealing with are more complex than those that can be handled by a single level of logic. Ergo, to "properly" emulate the CPU, sans abstractions and "hacks", you'd need to drop down to at least a half-cycle, and emulate the sub-cycle signalling.

All of these details fall out of what the CPU expects and provides, regarding M2, R/W, and the A & D buses. It expects peripherals to start driving the data for a particular address around the time M2 goes high, expects that data to stabilize before M2 goes low, and latches the value on the pins when it does. Likewise, when it writes, peripherals know that M2 means the CPU's trying to write, but they don't know how long it will take the value to settle. When M2 drops, they know the CPU thinks it's had enough time.
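
A half-cycle-level sketch of that M2 handshake might look like this (bus_drive/bus_latch/bus_commit are hypothetical helpers, only there to show how the work splits across the two phases):

Code:
#include <cstdint>

// Hypothetical peripheral-side hooks; only the split across M2 matters here.
void    bus_drive(uint16_t addr, bool is_write, uint8_t data);  // while M2 is high
uint8_t bus_latch(uint16_t addr);                               // read data at M2 fall
void    bus_commit(uint16_t addr, uint8_t data);                // write data at M2 fall

void cpu_half_cycle(bool m2_high, uint16_t addr, bool is_write, uint8_t &data) {
  if(m2_high) {
    // M2 high: address (and write data) are valid; peripherals start driving
    // read data and are expected to have it stable before M2 falls
    bus_drive(addr, is_write, data);
  } else {
    // falling edge of M2: the CPU latches read data, peripherals commit writes
    if(is_write) bus_commit(addr, data);
    else         data = bus_latch(addr);
  }
}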

The suppression behavior (test 06) could be reproduced either with some trickery inside the PPU, delaying some signals more than others, etc, by having the CPU ignore NMI pulses shorter than a cycle, or by having the NMI output of the PPU be registered in a fashion that lets the clear and set logic settle such that there are no externally visible glitches on the line. (glitch in this case would be the hardware definition, where you see a spurious edge during switching) Any of these three could be what the Real Thing does, but they're otherwise indistinguishable in that test, so you usually implement whichever one fits your language best.
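
For example, the "ignore pulses shorter than a CPU cycle" option could be approximated by sampling the NMI level only once per CPU cycle (a sketch under that assumption, not a claim about the real silicon):

Code:
// Sampling the NMI level once per CPU cycle: an asserted pulse narrower than
// a cycle can fall entirely between two sample points and is never seen.
struct NmiSampler {
  bool prev = false, pending = false;

  void sample(bool level_now) {            // called once per CPU cycle
    if(!prev && level_now) pending = true; // edge detected only at sample points
    prev = level_now;
  }
};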

by on (#85196)
Zelex wrote:
If no matter what its startup state dependent, which you can't easily test for, then most games should not rely on the PPU/CPU sync behavior for correct operation. Therefore, it should be pretty irrelevant, no? Hack in correct behavior for the one or two games that require it, and you make your life easier in the process.

I'm pretty sure you could detect the PPU syncs by seeing how the OAM readback behaves (which bytes get corrupted), but of course no games rely on that. But some emulators have goals besides just getting games to run, which in fact doesn't require that accurate low level emulation at all (just look at FCEUX, it's scanline based).

by on (#85227)
Even lacking proper hardware knowledge, I don't often see logic that reproduces a CPU "half-cycle" as you describe. From my viewpoint, we deal with 3 modules: CPU, APU and PPU. Well, the mother-thing is the CPU - we deal with instructions being executed as "steps": read a byte from a given memory address, add it to the accumulator register, etc., and then the PPU is "clocked". Actually, I must call the APU before the PPU because of the DMC DMA thing, but that's another kind of abstraction.

ReaperSMS wrote:
The suppression behavior (test 06) could be reproduced either with some trickery inside the PPU


Back to the subject, that's what I fear while running most of blargg's tests: an easy hack (or workaround) would get a pass. That's "unfair", a penalty in your emulation scheme.

Remember that even Nintendo and its Virtual Console don't emulate things perfectly. I played Star Fox 64 recently and... wow, it has lots of emulation quirks.

by on (#85236)
With the APU, it isn't much of an issue, since all of its features are clocked at either the CPU clock, or one half that.

As for the fears of what the Right Way is, there's only three ways to determine that:

Get a rather fast logic analyzer, and oversample all the pins between the CPU and PPU, and see what the waveforms look like. Depending on how fast the LA is, there'll be some quantization error, but above 20MHz or so that shouldn't matter.

Watch the clock, R/W, NMI, address, and data lines on a Fast Enough oscilloscope (probably take several, given the number of signals) and see what things look like in analog.

Get a Visual 2A03 and a Visual 2C02 set up and connected with each other, and run a simulation trace to give you the equivalent of the waves above. This will probably be subject to certain idealistic simplifications regarding transistor speeds, trace delays, etc.

by on (#85245)
Well, guess I'm on my own then. May as well leave 07 failing by one cycle. I don't see much point in hacking an inaccurate workaround just to get points on some arbitrary checklist, when the result would likely be wrong for every other register on the system anyway.

Quote:
Back to the subject, that's what I fear while running most of the blargg's tests. An easy hack (or workaround) would get a pass. That's "unfair", a penalty in your emulation scheme.


Agree completely. I actually have about 1,000 SNES tests, but I don't release them because I don't want people trying to pass them explicitly to the detriment of doing things right.

Fundamentally, there is a core minimum level necessary to express all hardware effects cleanly and without unnatural checks and hacks/workarounds. Whether that level is at the cycle-level, or if we have to simulate phi1/phi2 phases, or we have to simulate the warming up of transistors as the system runs ... there is a point where we can perfectly simulate all observable behaviors.

And that point is, to me, quintessential to preservation. No, emulators do not have to work at this ultimate lower-level. We only need ONE emulator that does it, and documents all of the lowest-level timing information. And from here, we can abstract general-purpose playable emulators that take appropriate shortcuts where applicable. The biggest and most obvious being the elimination of the "pseudo-random" effects of things like power-on states for RAM, bizarre hardware glitches, etc ... stuff that no official or sane homebrew software would ever rely on.

It's very disappointing to me, because I believed the NES scene was at this point already. What I've discovered is that cycle-level PPU timing for the NES is "easy" (once you know how it works; it was still a reverse-engineering marvel that it was figured out in the first place), and can be done in 10-15KB of code. I mistook it as being more amazing because the SNES PPU is ungodly complex by comparison. Aside from that, NES CPU timing actually seems, in some ways, to be more primitively understood than SNES CPU timing :/

I just really feel like, across the board for every system ... we start emulation as a top-down approach, instead of the bottom-up approach that it should be. And it just takes so much longer to reach perfection this way, it's not even funny. You don't build your house by starting with the roofing.

by on (#85248)
ReaperSMS wrote:
He means test 7 out of ppu_vbl_nmi,


I've found a WIP in my forum, dated 2008. See if it makes sense to you (with some English fixes).

Quote:
Fixed the nmi_on_timing test. VBlank_flag_clear is set to true right on the 2000h write, but this makes the NMI trigger at the wrong time. The fix was to verify whether VBlank_flag_clear is true before requesting a CPU NMI.

- Such flag is set back to false on the next instruction, once the required number of PPU cycles has expired.

by on (#85255)
I would challenge the claim that SNES CPU/PPU timing is better known than the NES's. I see nothing in any doc regarding which master clock read data comes from, or which one write data hits.

Unfortunately, the same testcase you're running into on the NES wouldn't translate directly to the SNES, since the vblank flag and NMI there are generated on the CPU, and I don't see any mention of NMI suppression or a readback glitch with $4210 or $4200. I'd suggest that SNES timing isn't that well understood either, but the current state of the art for CPU/PPU interleaving there happens to match, since I don't see any particular register setups where the subtleties would be visible. On the NES, the VBL/NMI interaction is about the only one I can think of, as it requires a register that is readable, that can also have the value changing underneath it across the read cycle, and that also has some other functionality dependent on writes elsewhere, and where the update portion is running at a higher frequency than the CPU.

Emulators tend to start top down, because that's usually how the documentation approaches things.

Incidentally, if Q's visual 2A03 is to be believed, the divided CPU clock is 50/50 in the NES.