Trying to tackle the PPU and timing

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Trying to tackle the PPU and timing
by on (#146202)
I'm writing an NES emulator and I've completed the CPU with cycle timing.

I'm now working on the PPU but having an issue trying to grasp how to actually do timing/drawing.

The CPU and the PPU have fixed frequencies. The frequency of the CPU makes sense to me... The CPU executes instructions in memory, each having its own cycle count.
The PPU, on the other hand, doesn't actually execute any instructions, yet it still has a clock rate. From the emulator's perspective, I'm having difficulty figuring out what to do with the clock rate of the PPU, and how to tie that into drawing pixels on the screen.

My code is like this:

Code:
while (1)
{
    int cpu_cycles = execute_next_cpu_instruction();
    wait_until_cpu_cycles_elapse(cpu_cycles);
}


But I don't understand where the PPU clock should tie into this. Am I approaching this wrong?

If anyone has any advice or articles that they can point me to, I'd greatly appreciate it. Thanks.
Re: Trying to tackle the PPU and timing
by on (#146203)
1 PPU cycle is 1 pixel. The PPU doesn't read instructions from memory, but it certainly follows an internal sequence of pre-programmed tasks. This page describes what the PPU does on each cycle.
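
As a rough illustration of "1 PPU cycle is 1 pixel", the PPU can be modeled as a counter that moves one dot per cycle through a fixed frame layout. The sketch below assumes NTSC geometry (341 dots per scanline, 262 scanlines per frame) and numbers the pre-render line as 261 rather than -1. The `PpuPos` type and `ppu_step_dot` function are hypothetical names, and the odd-frame skipped dot is deliberately ignored.

```c
#include <stdbool.h>

/* NTSC frame geometry: 341 dots per scanline, 262 scanlines per frame. */
enum { DOTS_PER_SCANLINE = 341, SCANLINES_PER_FRAME = 262 };

/* Hypothetical PPU position state; the names are illustrative only. */
typedef struct {
    int dot;       /* 0..340 within the current scanline */
    int scanline;  /* 0..261; 261 is treated as the pre-render line */
} PpuPos;

/* Advance the PPU by one cycle (one dot).
 * Returns true on the tick that wraps around to the start of a new frame. */
static bool ppu_step_dot(PpuPos *p)
{
    if (++p->dot < DOTS_PER_SCANLINE)
        return false;
    p->dot = 0;
    if (++p->scanline < SCANLINES_PER_FRAME)
        return false;
    p->scanline = 0;
    return true;
}
```

A real PPU would decide what to fetch or output on each call based on `dot` and `scanline`; this only shows the bookkeeping.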
Re: Trying to tackle the PPU and timing
by on (#146210)
tokumaru wrote:
1 PPU cycle is 1 pixel. The PPU doesn't read instructions from memory, but it certainly follows an internal sequence of pre-programmed tasks. This page describes what the PPU does on each cycle.



Okay -- I read that section on the wiki. I'm a little confused at this (my interpretation from the wiki link):

Each pixel is drawn with 1 PPU cycle.

However, during each pixel it draws, it also performs memory accesses (NT, AT, TBL, TBH), each taking 2 cycles, for a total of 8 PPU cycles.

The memory accesses it's performing are for future pixel plotting, so that when it's time to draw the pixel associated with those accesses, it can do it in 1 cycle.

So for example:

On the prerender scanline (-1), when drawing pixel 0, it does its memory accesses (NT, AT, TBL, TBH) for 8 PPU cycles to get the pixel data for visible scanline 0, pixel 0?

Because it's 8 PPU cycles to get the pixel data (even though it starts a scanline early), I would imagine that it would creep up on the pixel that needs to be drawn and eventually stall, since it really requires 8 PPU cycles of memory access to get the pixel data. By the time it starts pulling the pixel data for scanline 0, pixel 1, it's already on prerender scanline -1, pixel 8.
Re: Trying to tackle the PPU and timing
by on (#146212)
It actually starts a little earlier than pixel 0, it's cycles 321-336 of the previous scanline that get the first 16 pixels ready. Then cycles 337-340 are garbage fetches that don't matter. Then it's at cycle 0, and it starts outputting pixels it has stored (fine scroll determines which pixels it displays), and starts fetching the next tiles.

Prerender line is mostly junk that doesn't matter, except for the very end (cycles 321-336), which fetches the first 16 pixels that get drawn on the first visible scanline. Also at dots 280-304 of the prerender line, the scrolling event V=T happens every cycle.
First visible scanline doesn't have sprites either.
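
The cycle ranges above can be written down as a small lookup, which is often a convenient first step before implementing the fetches themselves. This is only a sketch: the enum and function names are made up for illustration, and the idle dot 0 and the sprite fetch range 257-320 are taken from the NESdev wiki's timing description rather than from this thread.

```c
#include <stdbool.h>

/* What the background/sprite fetch logic is doing at a given dot of a
 * rendering scanline. Names are hypothetical. */
typedef enum {
    PHASE_IDLE,           /* dot 0 */
    PHASE_VISIBLE_FETCH,  /* dots 1-256: output pixels, fetch upcoming tiles */
    PHASE_SPRITE_FETCH,   /* dots 257-320: sprite data for the next scanline */
    PHASE_PREFETCH,       /* dots 321-336: first 16 pixels of the next scanline */
    PHASE_GARBAGE_FETCH   /* dots 337-340: fetches whose results are unused */
} DotPhase;

static DotPhase dot_phase(int dot)
{
    if (dot == 0)   return PHASE_IDLE;
    if (dot <= 256) return PHASE_VISIBLE_FETCH;
    if (dot <= 320) return PHASE_SPRITE_FETCH;
    if (dot <= 336) return PHASE_PREFETCH;
    return PHASE_GARBAGE_FETCH;
}

/* The V=T vertical scroll copy happens on dots 280-304 of the
 * pre-render line, and only while rendering is enabled. */
static bool copies_vertical_scroll(bool is_prerender, bool rendering_enabled, int dot)
{
    return is_prerender && rendering_enabled && dot >= 280 && dot <= 304;
}
```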
Re: Trying to tackle the PPU and timing
by on (#146214)
Dwedit wrote:
It actually starts a little earlier than pixel 0, it's cycles 321-336 of the previous scanline that get the first 16 pixels ready. Then cycles 337-340 are garbage fetches that don't matter. Then it's at cycle 0, and it starts outputting pixels it has stored (fine scroll determines which pixels it displays), and starts fetching the next tiles.

Prerender line is mostly junk that doesn't matter, except for the very end (cycles 321-336), which fetches the first 16 pixels that get drawn on the first visible scanline. Also at dots 280-304 of the prerender line, the scrolling event V=T happens every cycle.
First visible scanline doesn't have sprites either.


The four memory accesses:

Nametable byte
Attribute table byte
Tile bitmap low
Tile bitmap high (+8 bytes from tile bitmap low)

Are these four fetches for an individual pixel or an individual tile? I had originally said pixel, but if it just does the fetches for the tile, then it has all the data it needs to plot one 8-pixel row of an 8x8 tile in 8 cycles.
Re: Trying to tackle the PPU and timing
by on (#146216)
They are for a tile.
It still needs to have 16 pixels ready before it fetches stuff at cycle 0 of the scanline, because of fine scrolling and all that.
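
To show why one set of fetches covers 8 pixels, here is a minimal sketch of turning the two bitplane bytes into one row of a tile, plus a simplified view of how fine X scroll picks from 16 buffered pixels. The function names are hypothetical, and the flat 16-entry buffer is a stand-in for the shift registers real hardware uses.

```c
#include <stdint.h>

/* Address of the low bitplane byte for one row of a tile. Each tile is
 * 16 bytes in the pattern table; the high bitplane byte is at +8. */
static uint16_t pattern_addr_low(uint16_t table_base, uint8_t tile_index, int fine_y)
{
    return (uint16_t)(table_base + tile_index * 16 + fine_y);
}

/* Decode one 8-pixel row from its two bitplane bytes.
 * Bit 7 is the leftmost pixel; each output value is a 2-bit color index. */
static void decode_tile_row(uint8_t low, uint8_t high, uint8_t out[8])
{
    for (int x = 0; x < 8; x++) {
        int bit = 7 - x;
        out[x] = (uint8_t)(((low >> bit) & 1) | (((high >> bit) & 1) << 1));
    }
}

/* With two tiles (16 pixels) buffered, fine X scroll (0..7) chooses where
 * output starts, so the pixel shown may come from either tile. */
static uint8_t pixel_at(const uint8_t two_tiles[16], int fine_x, int x_in_tile)
{
    return two_tiles[fine_x + x_in_tile];
}
```

With `fine_x` at 7, seven of the eight pixels output during a tile slot come from the second buffered tile, which is why a single tile of lookahead would not be enough.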
Re: Trying to tackle the PPU and timing
by on (#146232)
Okay -- Would this be a decent algorithm then:

Code:
void start()
{
    while (1)
    {
        execute_cpu();
    }
}

void execute_cpu()
{
    if (cpu_instruction_is_about_to_require_ppu_mem_access())
    {
        execute_ppu(cpu_cycles_executed_before_this_instruction);
    }
    ....
}

void execute_ppu(int cpu_cycles)
{
    catch_up_and_draw_based_on_where_the_cpu_is(cpu_cycles);
    ....
}


Essentially, the CPU drives. We keep executing CPU instructions (and counting CPU cycles) until we discover that the next CPU instruction is going to affect the PPU. When that happens, we catch up the PPU to the same point as the CPU, right before it was about to execute that instruction.

If that's a valid algorithm, could accuracy be an issue?
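
For reference, the catch-up idea described above might look roughly like this in C. Everything here is an assumption for illustration: the `Machine` struct, the stub `ppu_run_dot`, and the choice to sync on any access to $2000-$3FFF (the PPU registers and their mirrors). It assumes the NTSC ratio of 3 PPU dots per CPU cycle and does not cover $4014 (OAM DMA).

```c
#include <stdint.h>

/* NTSC: the PPU runs 3 dots for every CPU cycle. */
enum { PPU_DOTS_PER_CPU_CYCLE = 3 };

/* Hypothetical machine state for the sketch. */
typedef struct {
    uint64_t cpu_cycles;     /* total CPU cycles executed so far */
    uint64_t ppu_synced_to;  /* CPU cycle count the PPU has been run up to */
    uint64_t ppu_dots_run;   /* total PPU dots emulated */
} Machine;

/* Stand-in for the real per-dot PPU work (fetches, pixel output). */
static void ppu_run_dot(Machine *m) { m->ppu_dots_run++; }

/* Run the PPU forward until it matches the CPU's current cycle count. */
static void ppu_catch_up(Machine *m)
{
    uint64_t behind = m->cpu_cycles - m->ppu_synced_to;
    for (uint64_t i = 0; i < behind * PPU_DOTS_PER_CPU_CYCLE; i++)
        ppu_run_dot(m);
    m->ppu_synced_to = m->cpu_cycles;
}

/* PPU registers live at $2000-$2007, mirrored through $3FFF. */
static int is_ppu_register(uint16_t addr)
{
    return addr >= 0x2000 && addr <= 0x3FFF;
}

/* Hypothetical bus hook: sync the PPU before the CPU touches it. */
static void cpu_bus_access(Machine *m, uint16_t addr)
{
    if (is_ppu_register(addr))
        ppu_catch_up(m);
    /* ... perform the actual read or write here ... */
}
```

One accuracy consideration: the register access happens on a specific cycle inside the instruction, so syncing to the cycle count from before the instruction can leave the PPU a few CPU cycles behind at the moment of the access. Keeping `cpu_cycles` up to date per bus access, rather than per instruction, is one way to narrow that gap.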
Re: Trying to tackle the PPU and timing
by on (#146295)
Keep in mind that the PPU also affects the CPU: NMIs and scanline IRQs are triggered by the PPU. The PPU status (VBlank flag, sprite hit flag and sprite overflow flag) can affect the program flow, but at least you know when the CPU is reading these flags.

And don't forget about the APU, which can generate interrupts too. For these reasons, I don't think it's safe to let the CPU run the show.

The safest thing to do would be to emulate one cycle of each component (CPU, PPU, APU, mapper, etc.) at a time. On today's desktop computers, an emulator like this would probably still run at full speed, but there are lots of other devices in use today that are not as powerful (phones and tablets, mostly).
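
A minimal sketch of that lockstep approach, assuming NTSC timing where the PPU runs 3 dots per CPU cycle. The component functions are stubs that only count ticks, all names are hypothetical, and stepping the APU once per CPU cycle is a simplification made for the example.

```c
#include <stdint.h>

/* Tick counters standing in for real component state. */
typedef struct {
    uint64_t cpu_ticks;
    uint64_t ppu_ticks;
    uint64_t apu_ticks;
} System;

/* Stubs: a real emulator would do one cycle of work in each of these. */
static void cpu_tick(System *s) { s->cpu_ticks++; }
static void ppu_tick(System *s) { s->ppu_ticks++; }
static void apu_tick(System *s) { s->apu_ticks++; }

/* Run the whole system for n CPU cycles, interleaving the components so
 * that none of them ever gets more than one CPU cycle ahead of the rest. */
static void run_cpu_cycles(System *s, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++) {
        cpu_tick(s);
        apu_tick(s);
        for (int dot = 0; dot < 3; dot++)  /* 3 PPU dots per CPU cycle (NTSC) */
            ppu_tick(s);
    }
}
```

This requires the CPU core to be steppable one cycle at a time instead of one instruction at a time, which is the main structural cost of the approach.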

In order to implement a proper catch-up method you probably have to keep a list of events that could affect other components of the system, and predict when those events will occur, and run the individual components until the next event, updating the list as you go. For example, you can predict when a scanline IRQ will fire based on the last parameters written to the mapper, so you can forget about that until it's time to handle that event, but if the CPU changes something that could affect the counting of scanlines (new mapper writes, new PPU configurations, etc.), you have to predict again.
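
As a small, hedged example of the prediction step, the sketch below computes how many PPU dots remain until the VBlank flag is next set, and picks the earliest entry from a list of pending events. It uses VBlank rather than a mapper IRQ because its timing is fixed: on NTSC the flag is set at scanline 241, dot 1. The names and the `Event` layout are hypothetical, scanlines are numbered 0..261, and the odd-frame skipped dot is ignored.

```c
#include <stdint.h>

enum {
    DOTS_PER_SCANLINE   = 341,
    SCANLINES_PER_FRAME = 262,
    VBLANK_SCANLINE     = 241,
    VBLANK_DOT          = 1
};

/* PPU dots from position (scanline, dot) until VBlank is next set.
 * If the PPU is exactly at that position, returns one full frame. */
static uint32_t dots_until_vblank(int scanline, int dot)
{
    const int32_t frame  = DOTS_PER_SCANLINE * SCANLINES_PER_FRAME;
    const int32_t now    = scanline * DOTS_PER_SCANLINE + dot;
    const int32_t target = VBLANK_SCANLINE * DOTS_PER_SCANLINE + VBLANK_DOT;
    int32_t delta = target - now;
    if (delta <= 0)
        delta += frame;
    return (uint32_t)delta;
}

/* A pending event: when it fires (absolute PPU dot count) and what it is. */
typedef struct {
    uint64_t when;
    int kind;
} Event;

/* Index of the earliest pending event, or -1 if the list is empty. */
static int next_event(const Event *events, int count)
{
    int best = -1;
    for (int i = 0; i < count; i++)
        if (best < 0 || events[i].when < events[best].when)
            best = i;
    return best;
}
```

Components can then be run freely up to `events[next_event(...)].when`. Whenever the CPU writes something that changes a prediction, the affected entry has to be recomputed, as described above.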

Hopefully people who have actually written emulators will share their methods, I'm just here to tell you that there's more to consider. =)