timing... (attn: disch)

by Anes on 2005-08-19 (#3909)

I don't know what to do with timing in my emu. I know the way I emulate the PPU is crappy:

I do the following:

- I have a "cc" counter that counts PPU cycles; it lives inside the PPU emulation loop.
- When "cc" reaches roughly 340, I increase another counter, "cScanline".
- When "cScanline" reaches 262, it resets to 0.
- Everything else happens inside this loop, as Brad Taylor's 2C02 doc describes.

In the emulation main loop I do this:

Code:

EmulateCpu();
EmulatePPU(cCurrentCycle * 3);

(I'm not taking PAL into account yet.)

Disch told me about a method where you keep emulating the CPU until something happens that involves the PPU, then stop the CPU and run the PPU for as many cycles as the CPU executed. I read about it in an emulation doc too, but I have problems with it. In other words, I get it in theory, but I don't know how to put it into code.

Help plz!!

by Disch on 2005-08-19 (#3910)

I do the following:

1) Keep a CPU timestamp (obviously). This timestamp is in "master cycles" (see below)

2) Keep a PPU timestamp -- same idea as CPU timestamp. Again, in "master cycles"

3) Keep a Scanline Counter (-1 through 240).

4) Keep a scanline cycle counter (0-340)

5) Keep a 'VBlank Time' var (this will be more or less constant, but it changes between PAL/NTSC modes).

I do the 'main' timestamps in what I call Master Cycles. These are neither CPU nor PPU cycles -- rather they're a higher resolution so that the ratio between PAL CPU:PPU cycles can be maintained.

- For every 1 NTSC CPU cycle that passes, I increment the CPU timestamp by 15
- For every 1 PAL CPU cycle that passes, I increment the CPU timestamp by 16
- For every 1 PPU cycle that passes (NTSC or PAL), I increment the PPU timestamp by 5

I'd recommend you take PAL into account as soon as possible, as relying on the 3:1 NTSC ratio will make things a pain in the ass later when you finally do decide to add PAL support.
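As a rough illustration of the master-cycle bookkeeping above, here is a minimal sketch in C. Only the 15/16/5 increments come from the post; the names (Region, cpu_master_per_cycle, PPU_MASTER_PER_CYCLE, ppu_cycles_owed) are made up for the example and are not from Disch's emulator.

```c
#include <stdint.h>

/* Hypothetical names; only the 15/16/5 increments come from the post. */
typedef enum { REGION_NTSC, REGION_PAL } Region;

/* Master cycles added to the PPU timestamp per PPU cycle (NTSC or PAL). */
enum { PPU_MASTER_PER_CYCLE = 5 };

/* Master cycles added to the CPU timestamp per CPU cycle. */
static int cpu_master_per_cycle(Region r)
{
    return (r == REGION_PAL) ? 16 : 15;
}

/* Whole PPU cycles needed for the PPU timestamp to reach or pass
   the CPU timestamp (rounded up, since the PPU runs until it is
   at or beyond the target). */
static int64_t ppu_cycles_owed(int64_t cpu_timestamp, int64_t ppu_timestamp)
{
    if (ppu_timestamp >= cpu_timestamp)
        return 0;
    return (cpu_timestamp - ppu_timestamp + PPU_MASTER_PER_CYCLE - 1)
           / PPU_MASTER_PER_CYCLE;
}
```

With these numbers one NTSC CPU cycle (15 master cycles) owes exactly 3 PPU cycles, while five PAL CPU cycles (80 master cycles) owe 16, which is the 3.2:1 ratio that a hard-coded "* 3" cannot express.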

As for implementation -- the two big functions of my program are RunCPU(int runto) and RunPPU(int runto). RunCPU will emulate CPU instructions until the CPU timestamp reaches/passes the given 'runto' timestamp (typically, RunCPU is only called once per frame in my emu and is told to run the CPU for an entire frame's worth of time). RunPPU does the same thing, but runs the PPU (and renders pixels) until the given timestamp is reached (typically, RunPPU is called many times per frame).

Making these functions work together is simple. If you keep the CPU timestamp updated as you emulate 6502 instructions -- you simply pass the CPU timestamp to RunPPU when you want the PPU to 'catch up' to the CPU. You should have the PPU catch up every time something on the system that affects drawing changes, and also when the status of the PPU will alter CPU action (in the case of register reads). This includes (but is not necessarily limited to) PPU register writes/reads, Nametable mode changes, and CHR swapping.

For instance when your game is swapping CHR -- updating the PPU would be as simple as something like the following:

Code:
void SwapCHR(int where, int page)
{
    RunPPU( cpu_timestamp ); // let the PPU catch up before the banks change

    // swap CHR here
}

The tricky part now, is making a RunPPU function which can be entered and exited on ANY given PPU cycle. This is one reason why I keep those Scanline and Scanline Cycle counters I mentioned earlier. If you keep track of the scanline and scanline cycle that the PPU is in, it makes PPU emulation easier. But you also need to keep the main timestamp to keep it synced up with the CPU.

My RunPPU function looks kind of like this:

Code:
void RunPPU( int runto )
{
    if( ppu_timestamp < vblank_cycles ) /* vblank_cycles is the number of master cycles VBlank lasts. For example on NTSC this is (20 * 341 * 5) */
    {
        ppu_timestamp = vblank_cycles; // do nothing in vblank
        scanline = -1; // set scanline counter to pre-render scanline
        scanline_cycle = 0; // start of cycle 0 of that scanline
    }

    if( ppu_timestamp >= runto ) return; /* see if we're done -- this should be done every time ppu_timestamp is adjusted */

    if( scanline == -1 )
    {
        // do pre-render scanline stuff
    }

    while( scanline < 240 )
    {
        while( scanline_cycle < 256 )
        {
            /* render 1 pixel, load another tile if needed, adjust PPU address where needed, etc */

            scanline_cycle++;
            ppu_timestamp += 5;
            if( ppu_timestamp >= runto ) return;
        }

        while( scanline_cycle < 341 ) // cycles 256-340 (the scanline cycle counter runs 0-340)
        {
            // similar things here

            scanline_cycle++;
            ppu_timestamp += 5;
            if( ppu_timestamp >= runto ) return;
        }

        scanline_cycle = 0;
        scanline++;
    }
}

That gives a rough idea.

Anyway -- there is room for optimization. The two big things I can think of are:

- detecting $2002 read loops and running the PPU until $2002 status changes

- having a faster version of RunPPU which renders full scanlines which can be called when the PPU is to render a full scanline.

Anyway, at the end of the frame, you'd make sure the PPU is caught up to the CPU again, then subtract the number of cycles in that frame from both the CPU and PPU timestamps (do not reset the timestamps to 0! Otherwise cycles which "spilled" over into the next frame would be lost).
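The end-of-frame step could be sketched like this. It assumes an NTSC frame of 262 scanlines x 341 PPU cycles x 5 master cycles and ignores any per-frame variation in length; the Timestamps struct and end_frame are hypothetical names, and the caller is assumed to have already run the PPU up to the CPU timestamp.

```c
#include <stdint.h>

/* Assumption: fixed NTSC frame length in master cycles
   (262 scanlines * 341 PPU cycles * 5 master cycles per PPU cycle). */
enum { NTSC_FRAME_MASTER_CYCLES = 262 * 341 * 5 };

/* Hypothetical container for the two main timestamps. */
typedef struct {
    int32_t cpu_timestamp;
    int32_t ppu_timestamp;
} Timestamps;

/* Subtract one frame's worth of master cycles instead of zeroing,
   so cycles that spilled past the frame boundary are kept. */
static void end_frame(Timestamps *t, int32_t frame_cycles)
{
    t->cpu_timestamp -= frame_cycles;
    t->ppu_timestamp -= frame_cycles;
}
```

If the CPU overshot the frame by 30 master cycles (two NTSC CPU cycles), both timestamps start the next frame at 30 rather than 0, so the next frame is correspondingly shorter.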

by blargg on 2005-08-22 (#3987)

One thing I wanted to try with my NES emulator was seeing how efficient a PPU core could be if it rendered the whole screen at once. After thinking about the design Disch described, I realized that it does allow the optimization of the common case where dozens of scanlines are rendered without any relevant PPU writes between. It allows the standard approach to efficiency of first writing code that works in all cases and then optimizing the common operations.

The design simulates cooperative threading, where each thread explicitly yields to another. It would be interesting to implement it with a proper cooperative threading library. The code below shows the differences:

Code:
// no threading
void f()
{
    for ( int i = 0; i < 10; i++ )
        g();

    h();
}

// manual threading: each call to f() does one step, then returns (yields)
static int i;
static int phase;

void f()
{
    switch ( phase )
    {
    case 0:
        i = 0;
        phase = 1;
        break;

    case 1:
        // i must keep its value between calls, so it is not reset here
        if ( i < 10 ) {
            g();
            i++;
        }
        else {
            phase = 2;
        }
        break;

    case 2:
        h();
        phase = 3;
        break;
    }
}

// cooperative threading
void f()
{
    for ( int i = 0; i < 10; i++ ) {
        g();
        yield();
    }

    h();
}

by Anes on 2005-08-23 (#3999)

Thanks Disch, I took the "concept" and applied it to my emu. It's working better and with better performance, but I still have problems with Battletoads. Any help? Thanks.

by Disch on 2005-08-23 (#4001)

Battletoads relies on some pretty exact timing crap. To get it working properly, make sure:

1) You execute 1 instruction between the start of VBlank (when $2002.7 is raised) and when an NMI is actually triggered. There appears to be some latency between the two. This doesn't apply to Battletoads, but this latency also exists when you enable NMIs from a disabled state while $2002.7 is high (failure to handle this latency will make Lolo games crash and burn -- failure to trigger an NMI at all when NMIs are enabled while $2002.7 is high will cause problems with Captain Skyhawk)

2) PPU X address is incremented no earlier than every 4th cycle on the scanline (4, 12, 20, etc)

3) PPU Y address is incremented on cycle 252

4) PPU X address is reset on cycle 256

Doing those 4 things should get Battletoads running without problems.

by Quietust on 2005-08-23 (#4003)

Disch wrote:
2) PPU X address is incremented no earlier than every 4th cycle on the scanline (4, 12, 20, etc)

3) PPU Y address is incremented on cycle 252

4) PPU X address is reset on cycle 256

The actual values for these are 3/11/19/etc., 251, and 257 (all zero-based), verified by doing extremely precise PPU testing using Kevin Horton's "3-in-1 tester".
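For reference, the corrected cycle numbers can be written as small predicates. This is only a sketch: the function names are invented, and the upper bound on the X increments (stopping before cycle 256) is an assumption, since the thread only gives "3/11/19/etc." and does not say what happens later in the line.

```c
#include <stdbool.h>

/* Zero-based scanline cycles, using the values Quietust reported. */

/* X increment on cycles 3, 11, 19, ...
   Assumption: only the visible part of the line (cycles 0-255) is covered. */
static bool is_x_increment_cycle(int cycle)
{
    return cycle >= 3 && cycle < 256 && (cycle & 7) == 3;
}

/* Y increment on cycle 251. */
static bool is_y_increment_cycle(int cycle)
{
    return cycle == 251;
}

/* X reset (reload from the temp address) on cycle 257. */
static bool is_x_reset_cycle(int cycle)
{
    return cycle == 257;
}
```

Note that cycle 251 satisfies both the X-increment and the Y-increment test, so a PPU loop using these would do both on that cycle.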

by Disch on 2005-08-23 (#4005)

whoops -- I stand corrected.

by blargg on 2005-08-25 (#4058)

Fx3 wrote:
(from the thread "Reading opcodes directly without read function")
Code:
void cpu_run()
{
ppu_run(); apu_run();
data = cpu->bank[PC>>13][PC & 0x1fff];
//do stuff
}

Why do the PPU and APU need to be run every CPU instruction? Unless they can affect each other in some way, they can each be run separately and in any order.

What you need is a way to ask the PPU and APU for a timestamp of the earliest time they can affect the CPU, then run the CPU until this time. Along the way the CPU might write to the APU or PPU in a way that changes the timestamp of their earliest effect, in which case you might need to stop the current CPU emulation run loop.
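A minimal sketch of that idea, with made-up names (System, next_stop, run_cpu_until) and a dummy CPU that just burns a fixed number of cycles per instruction. It is not taken from any emulator in this thread; a real version would also recompute the stop time whenever a CPU write to the PPU or APU moves their earliest event.

```c
#include <stdint.h>

/* Hypothetical system state: each device reports the earliest timestamp
   at which it could affect the CPU (for example by raising an IRQ or NMI). */
typedef struct {
    int64_t cpu_time;
    int64_t ppu_next_event;
    int64_t apu_next_event;
} System;

/* Earliest time the CPU has to stop: the soonest device event,
   or the end of the frame if nothing happens before then. */
static int64_t next_stop(const System *s, int64_t frame_end)
{
    int64_t t = frame_end;
    if (s->ppu_next_event < t) t = s->ppu_next_event;
    if (s->apu_next_event < t) t = s->apu_next_event;
    return t;
}

/* Stand-in for the CPU core: executes dummy instructions of a fixed
   length until the stop time is reached or passed. Returns how many
   instructions were run. */
static int run_cpu_until(System *s, int64_t stop, int cycles_per_instr)
{
    int executed = 0;
    while (s->cpu_time < stop) {
        s->cpu_time += cycles_per_instr;
        executed++;
    }
    return executed;
}
```

The CPU can overshoot the stop time by part of an instruction, which is why the other devices are then run up to the CPU's actual timestamp rather than to the requested stop.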

by Zepper on 2005-08-28 (#4109)

Now you're mixing things up. Let me clear it up -- the PPU/APU is executed at every single CPU cycle. In the case above, that's 1 cycle to fetch the instruction. I'm not running the PPU/APU for every instruction, but for every cycle.

by Nessie on 2005-08-28 (#4112)

Of course, that makes much more sense.

by Kinopio on 2005-09-24 (#4854)

Disch wrote:

4) PPU X address is reset on cycle 256

By reset, do you mean the PPU X address is reloaded from the temp address (Loopy_t)?

by Disch on 2005-09-24 (#4855)

Yes

X Scroll reset logic:

Loopy_V = (Loopy_V & ~0x041F) | (Loopy_T & 0x041F);
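To make the mask concrete: 0x041F covers bits 0-4 (coarse X) and bit 10 (horizontal nametable select), so only those bits are copied from Loopy_T and everything else in Loopy_V is left alone. A small sketch, where the function name and the 16-bit types are my own choices:

```c
#include <stdint.h>

/* Bits 0-4 (coarse X) plus bit 10 (horizontal nametable select). */
enum { LOOPY_X_MASK = 0x041F };

/* Copy only the X-related bits from loopy_t into loopy_v. */
static uint16_t reset_x_scroll(uint16_t loopy_v, uint16_t loopy_t)
{
    return (uint16_t)((loopy_v & ~LOOPY_X_MASK) | (loopy_t & LOOPY_X_MASK));
}
```

For example, with loopy_v = 0x7FFF and loopy_t = 0x0000 the result is 0x7BE0: the X bits are cleared while the Y scroll and vertical nametable bits survive.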