cycle for cycle stuff

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
cycle for cycle stuff
by on (#4401)
Does anybody have any accurate documents on 6502 timing (i.e. cycle-for-cycle stuff)? I have used 6502_cpu.txt and the w65c02s.pdf document, but they don't make sense to me or have conflicting theories. I am trying to implement accurate emulation so that I can execute 1 CPU cycle and then 3 PPU cycles in my emulator. What did you use for your emulator?

For instance:

inline void TYA( int Cycle )
{
    switch( Cycle )
    {
        case 0: CPU.CC++; break; // opcode fetch
        case 1: A = Y; break;    // transfer on the second cycle
    }
}
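
The idea being a main loop that interleaves the two, something like this (names are just placeholders for whatever the cores end up exposing):

Code:
// run one CPU cycle, then three PPU cycles (NTSC ratio)
for ( ;; )
{
    RunCpuCycle(); // steps a per-cycle opcode handler like TYA() above
    RunPpuCycle();
    RunPpuCycle();
    RunPpuCycle();
}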

by on (#4402)
Executing partial CPU instructions is largely a waste, since it takes so much overhead and provides no benefit. If you want the precision, just emulate the CPU one instruction at a time, but have each instruction update the PPU/APU/etc. during each cycle of each instruction.

by on (#4403)
Have you written an emulator, Quietust? If so, does it have accurate CPU emulation?

It's not just for the NES; I also want an accurate 6502 for other console/PC emulators. So I really am desperate.

by on (#4404)
Yes, his emulator is the most accuratest in the world about CPU and PPU timing / syncronistation.

by on (#4405)
Thanks, but I still need some accurate documents on the 6502's timing.

by on (#4407)
I was under the impression that the following had accurate per-cycle operations of the CPU:

http://nesdev.com/6502_cpu.txt

Near the end it gets into which reads/writes are performed on every cycle of each instruction.


As for implementation, the easiest would probably be to pass an additional timestamp to your read/write proc indicating the number of cycles into the instruction. For example, the code for STA Absolute might look something like this:

Code:
byte opcode = Read( PC, 0 );
PC++;

unsigned adr;
switch(opcode)
{
 case 0x8D: // STA Absolute
  adr = Read( PC, 1 );
  adr |= Read( PC + 1, 2 ) << 8;
  Write( adr, A, 3 );
  PC += 2;
  cpu_timestamp += 4;
  break;
}


cpu_timestamp would contain the overall cycle count (which is incremented by the time taken by each instruction). The last parameter passed to Read()/Write() indicates the number of cycles to add to cpu_timestamp when "catching up" other systems. This value can usually be ignored for areas which don't need it (for example, when reading from ROM, or reading/writing RAM... the last param will be tossed). However, for register reads/writes the extra given cycle will allow you to have cycle-accurate precision when catching up the PPU and APU to the current CPU time -- at little processing cost.
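For concreteness, the register path of such a Read() might look something like this (just a sketch -- ppu_catch_up() and ppu_read() stand in for whatever your PPU core provides):

Code:
byte Read( unsigned adr, int offset )
{
    if ( adr >= 0x2000 && adr < 0x4000 )        // PPU registers
    {
        ppu_catch_up( cpu_timestamp + offset ); // run the PPU to this exact CPU cycle
        return ppu_read( adr & 0x2007 );        // registers mirror every 8 bytes
    }
    return memory[ adr ];                       // plain RAM/ROM: offset is ignored
}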

by on (#4408)
Anonymous wrote:
Have you written an emulator Quietust? If so do you have accurate CPU emulation?

A good example of how a perfectly legitimate question becomes hilarious given the right context.
For the benefit of the thread author, Q's emulator, called Nintendulator, is as close to hardware as you can get.

by on (#4409)
Thanks, I'll try that, but the 6502_cpu.txt cycle explanation differs from WDC's own w65c02s.pdf (basically the same CPU). This is what is concerning me.

http://www.westerndesigncenter.com/wdc/datasheets/w65c02s.pdf

Can anyone give a better explanation of the details of the PDF's cycle charts? I find them very hard to understand.

by on (#4412)
That PDF document you linked to seems to be for a more recent processor than the one used by the NES. It makes reference to various instructions unavailable on the 6502 (BRA, PLY, PLX) -- plus it does things 6502_cpu.txt says are CMOS processor behavior (such as doing a read/write for Absolute Read/Modify/Write instructions, rather than dual writes). The NES has an NMOS, not a CMOS, processor.


I'd stick with 6502_cpu.txt

by on (#4425)
OK, I'll do that; I don't think it'll make much difference anyway. I'll post again later today or tomorrow to see how it goes.

by on (#4441)
Quote:
Yes, his emulator is the most accuratest in the world about CPU and PPU timing / syncronistation


Is it the "accuratest" or the "most accurate" bregalad?
When words are large you should use "most", but not "more accurastest".

by on (#4442)
Anes wrote:
Quote:
Yes, his emulator is the most accuratest in the world about CPU and PPU timing / syncronistation


Is it the "accuratest" or the "most accurate" bregalad?
When words are large you should use "most", but not "more accurastest".


lmao. True. But sticking with my post...

by on (#4717)
Hi,
I am having a look at the Nintendulator sources (955 beta).
I would like to be sure that I understand the basis of your CPU emulation (I mean, I am asking :) ). Please could you confirm that the following is correct?

Instead of having an external clock system synching CPU, PPU and APU, you use CPU cycles to synch PPU and APU.
To do that, you call RunCycle (which internally calls PPU_Run and APU_Run) from CPU_MemGet and CPU_MemSet.
CPU_MemGet and CPU_MemSet are called from opcodes and interrupts.

I suppose that this reflects the following lines from the KIM programming manual:

Quote:
The major point to be noted is that every clock cycle in the MCS650X microprocessor is a memory cycle in which memory is either read or written. Simultaneously with the read or Write of memory, an internal operation of the microprocessor is also occurring.


(http://users.telenet.be/kim1-6502/6502/proman.html#51)

Thanks.
Best regards

by on (#4731)
jarodcanal wrote:
Instead of having an external clock system synching CPU, PPU and APU, you use CPU cycles to synch PPU and APU.
To do that, you call RunCycle (which internally calls PPU_Run and APU_Run) from CPU_MemGet and CPU_MemSet.
CPU_MemGet and CPU_MemSet are called from opcodes and interrupts.


That is correct.

by on (#9043)
(BTW: I started this post, under the (accidental) name of 6502 Timing)

Quietust wrote:
Executing partial CPU instructions is largely a waste, since it takes so much overhead and provides no benefit. If you want the precision, just emulate the CPU one instruction at a time, but have each instruction update the PPU/APU/etc. during each cycle of each instruction.

...and provides no benefit.

That is not true. Observe the following.

Code:
Clock Cycle #
1   2   3   4
|   |   |   |
|   |   |   +-   Load From $2002 (i.e. $00)
|   |   |   |
|   |   |   |
|   |   |   |
|   +-   VBlank Ends ($2002 = $00)
+-   ($2002 = $80)


An emulator without perfect timing would read that back as $80.

by on (#9046)
I don't get Quietust's original comment in the first place. What he described is executing partial instructions, and it will handle the case of reading $2002 just after the VBL flag is cleared (as will any method which communicates the time of the memory read on the fourth instruction clock, even just read_memory( addr, timestamp + 3 )).

by on (#9049)
When someone says "executing partial instructions", I think of the ability to halt the CPU in the middle of an instruction. My approach is not the same - though it does emulate the individual cycles, it still must execute one full instruction at a time.

by on (#9050)
According to my diagram, would Nintendulator return $00 or $80?

by on (#9057)
It would return $00.

As Q mentioned, his is cycle accurate, but execution is instruction granular, so if you tell it to execute N cycles, it will likely execute slightly fewer or slightly more, depending on where the instruction boundary falls. In the grand scheme of things, keeping the rest of the system in sync with this is trivial.
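
In other words, the outer loop is something like this (hypothetical names):

Code:
// run until at least `target` cycles have elapsed; the final
// instruction may overshoot the target by a few cycles
while ( cpu_timestamp < target )
    execute_one_instruction(); // cycle-accurate side effects happen inside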

I wrote a CPU core that could halt in the middle of an instruction. To say it was slow would be a vast understatement. It wasn't terribly useful aside from the novelty of the idea.

by on (#9060)
Quote:
When someone says "executing partial instructions", I think of the ability to halt the CPU in the middle of an instruction.


If you're running the PPU and APU each CPU clock, then the CPU emulator is halting in the middle of an instruction:

Code:
int opcode = read_mem( pc++ );
run_ppu_and_apu(); // halts here
switch ( opcode )
{
case 0xAD: // LDA abs
{
    int lo = read_mem( pc++ );
    run_ppu_and_apu(); // halts here
    int hi = read_mem( pc++ );
    run_ppu_and_apu(); // halts here
    a = read_mem( (hi << 8) | lo );
    run_ppu_and_apu(); // halts here
    set_nz( a );
    break;
}
...
}


What I take you to mean is that the CPU emulator function can return in the middle of an instruction, which would require something inefficient like

Code:
case 0xAD: // LDA abs
switch ( phase++ )
{
case 0: lo = read_mem( pc++ ); break;
case 1: hi = read_mem( pc++ ); break;
case 2: a = read_mem( (hi << 8) | lo ); break;
case 3: set_nz( a ); opcode = read_mem( pc++ ); phase = 0; break;
}
break;


where you have to call this four times to execute a single LDA instruction. These are equivalent if you're just turning around and calling it like this:

Code:
while ( clocks_remain-- )
{
    run_one_cpu_clock();
    run_ppu_and_apu();
}

by on (#9064)
I understand what you are saying, but I prefer my method. First of all, the 6502 emulator that I wrote will also be used in some of my other emulators (e.g. Atari 2600). Secondly, it is easier to handle interrupts this way, and I can also emulate the BRK bug (I am going for MAXIMUM accuracy here, baby!). Thirdly, there is the simplicity of it all, as I am emulating exactly what the NES does. I also don't have the function overheads/increased .exe size from updating the PPU/APU like Nintendulator does. I have inlined all of the opcodes for maximum speed.

Example:

Code:
inline void OpticCodeAD() // LDA Absolute
{
   switch(CPU.Cycle)
   {
      case 0: // opcode fetch
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 1: // fetch low byte of address
         CPU.TMP2 = CPU.Memory[CPU.PC];
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 2: // fetch high byte of address
         CPU.TMP2 += (CPU.Memory[CPU.PC] << 8);
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 3: // read operand into A, set N and Z
         CPU.A = CPU.Memory[CPU.TMP2];
         CPU.P &= 0x7D;
         if( !CPU.A )
            CPU.P += 0x02;
         CPU.P += (CPU.A & 0x80);
         CPU.Cycle = 0;
         break;
   }
   CPU.CC++;
}

by on (#9065)
Quote:
Thirdly is the simplicity of it all, as I am emulating what the NES does exactly.


The NES works via electrons moving in transistors (or even more basic, if you want to go to a subatomic level). An emulator doesn't emulate this. Most work at a higher level, emulating the behavior of the CPU instructions.

Quote:
I also don't have the function overheads/increased .exe size from updating the PPU/APU like Nintendulator does. I have inlined all of the opcodes for maximum speed.


What you show above is probably slower since it adds lots of branching and function calls. But programmer intuition has never been what determines the speed of code. What does your profiler say?

by on (#9066)
Yeah, yeah, I meant at a higher level anyway. I know that this method will make my emulator very slow due to the switch/case branching, but since I have inlined every opcode there are actually no function calls at all. As for my profiler, I don't have one, but my probation officer says that if I don't keep my nose clean, it'll be back to the state pen for me.

by on (#9134)
blargg: Technically, yes, it's stopping in the middle of the instruction. As far as performance goes, there are drastic differences between the two.

WedNESday: if you think about it, there is no logical difference between the two approaches. If the other parts are implemented properly, the CPU won't be able to tell the difference, and the only thing you get out of that approach is a slight cleanup in the outer loop running the CPU core. Going down that route for the purpose of personal curiosity is fine, but keep in mind that you get no technical benefit, and a slowdown of about 100x compared to the instruction-granular, cycle-accurate-side-effects approach.

by on (#9149)
Thanks for the advice. I know that it will make it slower, but I am implementing this because the core will also be used in other emulators for other consoles/computers. Also, I like the simplicity it involves.

For example, the (NTSC) VBlank time is 2273(.3) CPU cycles. If we are on cycle 2272 and STA Absolute is executed, then the first cycle wouldn't need any PPU drawing/fetching, but the others would. Observe:

Code:
for( CPU.CC = 0; CPU.CC < 2273; )   // VBlank period: no PPU work needed
{
    FetchOpcode();                  // increments CPU.CC once per cycle
}

for( CPU.CC = 0; CPU.CC < 29393; )  // rest of the frame
{
    FetchOpcode();
    Draw3Pixels();
}


This way, if we are in a VBlank period the PPU won't need any checking. My method ensures that there are no wasted calls to Draw3Pixels(). Also observe the following (let's say that we are on a different console/computer):

Code:
inline void OpticCodeAD() // LDA Absolute
{
   switch(CPU.Cycle)
   {
      case 0: // opcode fetch
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 1: // fetch low byte of address
         CPU.TMP2 = CPU.Memory[CPU.PC];
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 2: // fetch high byte of address
         CPU.TMP2 += (CPU.Memory[CPU.PC] << 8);
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 3: // read operand into A, set N and Z
         CPU.A = CPU.Memory[CPU.TMP2];
         CPU.P &= 0x7D;
         if( !CPU.A )
            CPU.P += 0x02;
         CPU.P += (CPU.A & 0x80);
         CPU.Cycle = 0;
         break;
   }
   CPU.CC++;
}


Let's pretend that after case # 1 was executed there was some kind of automatic bankswitching that meant that a different high byte was fetched. This would ensure that the correct byte is fetched.

by on (#9151)
-deleted-

by on (#9195)
There's nothing stopping instruction-granular from handling the situation you mention regarding a timed bankswitch, if implemented correctly.

If the switch is timed, then it should be updated per-cpu-cycle like the rest of the hardware, and the memory accesses should realize that side effects will possibly invalidate the direct fetches, so the memory fetch should go through code rather than direct access.
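
A rough sketch of what that might look like (mapper_catch_up() and prg_bank standing in for the per-cycle mapper update and its bank table):

Code:
byte Fetch( unsigned adr, int cycle_offset )
{
    mapper_catch_up( cpu_timestamp + cycle_offset ); // apply any pending bank switch first
    return prg_bank[ adr >> 12 ][ adr & 0x0FFF ];    // fetch now sees the new bank
}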

by on (#9201)
Any NES CPU emulator which includes the timestamp of memory accesses can be used as the basis for a "cycle-accurate" NES emulator. The general rule is, any number of hardware modules can be emulated on an as-needed ("catch-up") basis as long as the future effects of all but one module on others can easily be predicted in advance. This is the case for the NES, where the CPU is the only entity whose future effect can only be determined by doing the actual emulation.
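
The catch-up itself can be as simple as this sketch (assuming NTSC's 3:1 PPU-to-CPU clock ratio; ppu_time and ppu_run_one_cycle are placeholder names):

Code:
// bring the PPU up to the CPU's current time, on demand
void ppu_catch_up( long cpu_time )
{
    while ( ppu_time < cpu_time * 3 )
        ppu_run_one_cycle();
}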

by on (#9227)
OK, today I finally finished the new cycle-for-cycle accurate 6502 emulator. I immediately hooked it up to WedNESday to test it out. I didn't bother to include any PPU/APU accesses, memory mapping/trapping, or blitting -- 1x window -- as I just wanted a rough estimate of how slow the core was.

Boy, nothing could prepare me for it.

On my P4 2.2GHz I got 60FPS, a full 30 times slower than the previous core, which had 1800FPS in the same situation.

Please don't say I told you so. I did listen to you guys and I agreed with you all the way; it was just that I wanted to give it a try because no one had done it before.
a question
by on (#9228)
What is the name of Quietust's emulator? Did you release it?

by on (#9229)
I think it's good that people are trying different methods of emulation. It gets everyone scrutinizing the way they do it and finding new areas for optimization. If you find a method that's slow, be sure you figure out why it's slow.

I've been trying optimizations of my NES CPU emulator lately. I found it was easiest to use it in a standalone mode and just run a simple infinite loop for a few billion clocks. You can easily try different basic schemes without having to worry about getting everything right and having it work well in a full emulator. This can help filter the designs that are obviously slow.

by on (#11249)
Whoa, sorry to bump such an old topic. I'd have posted sooner if I knew about this thread.

I'm writing an SNES emulator, but the two systems are very similar, so I see no reason we can't share experiences.

Quote:
Ok, today i finally finished the new cycle-for-cycle accurate 6502 emulator ... Boy, nothing could prepare me for it. On my P4 2.2GHZ I had 60FPS, and full 30 times slower than the previous core, which had 1800FPS in the same situation.


Then you seriously botched something up. I went from an opcode-based core that synced the clocks once per opcode (merely adding to an opcode_cycle_count var for each read/write/IO cycle) to pretty much the switch/case cycle system and lost about 30% performance, e.g. 100fps to 70fps. The new one syncs and updates clocks after every CPU cycle.

Anyway, I realize that you can probably pull off perfect accuracy by making the CPU the master controller and enslaving all other processors (PPU, APU, etc), but this is definitely not a good way to write self-documenting code. PPU and APU synchronization should be nowhere near the CPU core.

Now, let me go into the reasons I feel I had to break the SNES CPU core down so that it could return after executing single cycles.

1) CPU<>APU communication. The SNES has a dedicated sound chip, unlike the NES, that can actually execute instructions. Since most people consider it a "black box", only accessible via four ports, it can be mostly emulated as a slave device. But what about when you want a debugger? Say you want one that lets you step opcode by opcode, and edit registers between steps. So what do you do when you run one CPU opcode and your emulator crosses over an APU opcode, then starts on another APU opcode before the CPU opcode returns so your debugger can update? Simple, you end up in the middle of a new APU opcode, and you can no longer safely edit the APU registers. Second, the APU would have to be able to break after single cycle steps to properly emulate things like when the CPU reads from the APU port, and in the middle of the opcode the APU writes to that port. Timestamps and such work, but again this makes for sloppy coding and is a hack at best.

2) DMA synchronization. The DMA runs at 1/8th the CPU clock, so in order to emulate DMA sync delays (time between enabling DMA and the transfer beginning, and the time from DMA ending to the CPU resuming), you have to be able to single-step instructions. If you forcefully execute the entire instruction, and a DMA happens in the middle of that transfer, you will be forced to complete the DMA transfer immediately. Quite a problem when a single DMA can take up to ten full frames (64kbytes * 8 channels * 8 cycles/byte transferred).

3) Interrupts. Interrupts test at the start of a new bus opcode cycle. Of course, the work cycle is one behind this (both the NES and SNES use a two-stage pipeline), so you need to test and possibly trigger interrupts one cycle before the end of each opcode. This can be done with Quietust's approach, but again is less elegant.

4) Code mixing. As stated before, it's definitely advantageous from a coding standpoint to keep each core as absolutely separated as possible. I have maybe 3-4 functions that need to be exposed for all of my core chips, CPU, APU, DSP, and PPU, and it works fantastically.

Now that we've established that there is merit to being able to cycle step and return, let's talk about how best to do it :)

First off, I personally feel that C++ is a bad language for parallelism. I don't have a "better" language, either. Essentially, I think something like a "thread" type would be needed. This would basically be a class where you call it directly, e.g. thread t1; t1(); and it runs until it hits pause() or exit().
Each thread would have its own stack, and calling the thread would restore the stack pointer and program counter, pausing it would save the stack and program counter and return to where the thread was called.
In essence, it's a fake thread that isn't truly run in parallel with other threads. But it's extremely lightweight, needing to only save and restore two registers and make an indirect jump instead of just a stack push and direct jump.

The benefit?

Code:
thread CPU {
//cycle 0 is always op fetch, no need to add that into each opcode
  void opa9() {
  //cycle 1
    regs.aa.l = op_read(); pause();
    if(!regs.p.m) { flags_lda_8bit(); pause(); return; }
  //cycle 2 (only executed when accumulator is in 16-bit mode)
    regs.aa.w = op_read(); flags_lda_16bit(); pause();
  }
};


Yeah. It would be amazingly useful. You could break out and re-enter things right in the middle of functions. You'd never have problems with the stack getting crushed (CPU calls PPU calls APU calls PPU calls CPU calls APU ... crash). Each thread would only need a tiny stack heap. May not be all that processor efficient, but neither is going from cpu -> run -> run_opcode_cycle -> switch(opcode) -> switch(cycle) -> regs.a.l = op_read(); break; break; return; return; return; for what is essentially a read and assign.

But since we can't throw out the language to do this, who has better ideas? The truth is, the switch(cycle) system is hideously inefficient; even though it only costs me 30% performance, that's still way too much.

Oh, and I didn't notice anyone mentioning this. What about bus hold times? Reads and writes don't happen at the start of the bus cycle, you know.
Take the SNES latch counters: if you read from $2137 or write to $4201, it copies the H/V counter positions to $213c/$213d. Now, the funny thing is that both lda $2137 and sta $4201 use the exact same cycles; the only difference is the read vs. write to the actual address. Both consume six clock cycles, and yet writing to $4201 results in the counter being four clock cycles ahead of reading from $2137. Why? Write hold times are longer than read hold times. I admit, my hold times may not be perfect, but they're made from highly logical guesses and the timing schematics (stored in µs) inside the w65c816s technical manual.
I'm betting you guys are just compensating by adjusting your numbers to match the NES, right?
e.g. if your emulator needs to set Vblank at V=225,HC=2, you do that rather than setting it to the true V=225,HC=6 because you ignore the read hold delay of 4 cycles? Sure, you get the same results, but which is more correct? :)
I don't have the luxury of cheating like this with two coprocessors talking to each other in realtime.

* Obviously my cycle comparisons are invalid for the NES, but hopefully you get the idea. With the SNES, one CPU cycle consumes 6-12 clock cycles against the 21MHz timing crystal.

Heh, sorry for posting so much. This is my favorite subject regarding emulation. I'm very curious to hear your ideas.

by on (#11251)
byuu, lots of good points in your post. The threading you described is known as "cooperative threading", where the threads run until they give up their time. I agree that emulating per cycle is more accurate, and also produces much easier-to-read code. You don't end up mixing PPU/CPU/APU code at all. However, I've avoided it for two reasons (which I'm guessing is why it's not generally suggested by members here):
1. It is thought of as slow. I've never implemented this style, partially because somebody here (maybe Brad Taylor in his emulation notes) did some tests and saw huge performance problems with that implementation. Something huge, like 100x slower, I think. I've always assumed this approach would be very slow, but have never run the tests myself. Maybe the performance loss would be worth the code readability?
2. The added accuracy is not necessary on the NES. The smallest unit of time which can affect an on-screen pixel is a single CPU cycle (3 PPU cycles in NTSC). Using the "catch up" style emulator you can be as accurate as needed to get the correct display.

Obviously a more accurate simulation has its value, especially as CPUs get faster and faster. In fact, a multi-threaded approach is looking more reasonable now that multi-core CPUs are getting popular.

by on (#11252)
Quote:
The threading you described is known as "cooperative threading"


Never heard of it. So is it doable in C++ without non-portable libraries? The big thing it needs, which is going to be hard, is the ability to leave and resume in the middle of functions. It also needs to be stack-safe: you should be able to adjust the stack unevenly between each "thread" call.
So, e.g., calling the CPU could then jump into the CPU opcode decoder routine, jump into the CPU opcode routine, and then return halfway through that without destroying the stack completely. The next call to the CPU would immediately return back from three functions in a row, then execute a bit of the next CPU code and return back to the thread caller.
I need to work it out on paper, but I think you pretty much would need to allocate custom stacks for each "thread". No getting around it, C++ is an inefficient language for this.

Quote:
1. It is thought of as slow. I've never implemented this style, partially because somebody here (maybe Brad Taylor in his emulation notes) did some tests and saw huge performance problems with that implementation. Something huge, like 100x slower I think.


That's ridiculous. I can tell you I wasn't getting 6000fps before I decided to go cycle-based. If you're counting just the raw CPU, maybe 50%, but you have to factor in that other things remain at the same speed as they were before, so the total speed loss is about 20-30% in my case, as stated above.

Quote:
2. The added accuracy is not necssary on the NES.


Funny, it was always my opinion that the older and slower the system, the more accuracy is needed. Not to mention it's easier to pull off since the emulator is less demanding. Hey, whatever works for you.

by on (#11257)
I kinda rushed the testing of the emulator, so it is very likely that there was a mistake. Here is an example of what I did:

Code:
inline void OpticCode69() // ADC Immediate
{
   switch(CPU.Cycle)
   {
      case 0: // opcode fetch
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 1: // fetch operand and add
         CPU.Byte = CPU.Memory[CPU.PC];
         if( CPU.A + CPU.Byte + CPU.CF > 0xFF )
            CPU.TMP = 1; else CPU.TMP = 0;
         if( (char)CPU.A + (char)CPU.Byte + CPU.CF < -128 || (char)CPU.A + (char)CPU.Byte + CPU.CF > 127 )
            CPU.VF = 0x40; else CPU.VF = 0x00;
         CPU.NF = CPU.ZF = CPU.A += CPU.Byte + CPU.CF;
         CPU.CF = CPU.TMP;
         CPU.PC++;
         CPU.Cycle = 0;
         break;
   }
   CPU.CC++;
}

inline void OpticCode()
{
   switch(CPU.OpticCode)
   {
      case 0x69: OpticCode69(); break;
   }
}

// Main Loop e.g. VBlank

for( CPU.CC = 0; CPU.CC < 2278; )
{
   OpticCode();
   Draw3Pixels();
}


Now when I tested it I left out everything I possibly could so the speed result was not affected by anything other than the CPU.

by on (#11258)
Quote:
Anyway, I realize that you can probably pull off perfect accuracy by making the CPU the master controller and enslaving all other processors (PPU, APU, etc), but this is definitely not a good way to write self-documenting code. PPU and APU synchronization should be nowhere near the CPU core.


And they don't need to be. The CPU runs merrily along and calls a memory read/write function for accesses. That function dispatches to the appropriate handler (RAM, PPU, APU, etc.). Then the emulator for the component itself does whatever catch-up is necessary to get to the present time, then applies the effect of the access. The CPU knows nothing of synchronization; all it must do is ensure that the memory read/write handler is given the current time.

I take strong issue with your claims that the catch-up design is a hack. Since I'm somewhat familiar with the SPC-700, I'd appreciate a concrete example where the catch-up is ugly or just plain won't work. Maybe you should start a new thread for it.

Quote:
In essence, it's a fake thread that isn't truly run in parallel with other threads. But it's extremely lightweight, needing to only save and restore two registers and make an indirect jump instead of just a stack push and direct jump.


Such a form of cooperative non-preemptive threads can be implemented without any OS support or even any special language support. The longjmp() facility uses a similar mechanism, and if you've ever taken a look at the source for it, it's quite simple. I wouldn't be surprised if gcc already has something like you describe. Whereas the manual technique with a switch statement basically has you simulating a program counter to keep track of where you were in the function, the efficient implementation uses the actual program counter. Your CPU emulator would yield control to another thread at each point in an instruction. Yielding would simply save the current context (program counter, stack pointer, and registers), then restore the context of the other thread which is becoming active. Seriously, look around, because something like this is probably available. On Mac OS Classic, the Thread Manager provided exactly this kind of thing with explicit yielding, separate stacks for each thread, etc., and it was quite efficient.

by on (#11259)
I share blargg's opinion on this. I tend to implement sound and video on a catch-up basis, and only update status registers when they're accessed. (I do update the cycle counter in the middle of memory opcodes, of course.) My event system is almost entirely dedicated to interrupts. Since most or all interaction between components happens through memory reads/writes, I think it's natural to handle it like this (you'll want to keep your opcode reads fast, though). I can't see why this approach needs to be hard to understand. It seems unnecessary to be "cycle-accurate" everywhere when it's so easy to predict when it's needed and when it's not. In fact, I think I could even use a dynarec most of the time and still be effectively cycle-accurate. Of course, some systems require more interaction between components than others.

by on (#11267)
byuu wrote:
1) CPU<>APU communication. The SNES has a dedicated sound chip, unlike the NES, that can actually execute instructions. Since most people consider it a "black box", only accessible via four ports, it can be mostly emulated as a slave device. But what about when you want a debugger? Say you want one that lets you step opcode by opcode, and edit registers between steps. So what do you do when you run one CPU opcode and your emulator crosses over an APU opcode, then starts on another APU opcode before the CPU opcode returns so your debugger can update? Simple, you end up in the middle of a new APU opcode, and you can no longer safely edit the APU registers. Second, the APU would have to be able to break after single cycle steps to properly emulate things like when the CPU reads from the APU port, and in the middle of the opcode the APU writes to that port. Timestamps and such work, but again this makes for sloppy coding and is a hack at best.

byuu,

First of all, your emulator is awesome. =) I understand your design goals and that readability/maintainability are much more important to you than performance.

I just wanted to point out that, in general, it is not strictly necessary to simulate the CPU and the SPC700 in lock-step even when debugging. For example, if I put breakpoints in the CPU's address space but not the SPC700's, then I could always run the CPU ahead of the SPC700 when they were not interacting, and if I hit a breakpoint in the CPU address space, I'd just catch the SPC700 up to the CPU before stopping in the debugger. And vice versa. Only if you put breakpoints in both address spaces is it strictly necessary to simulate both in lock-step.

I am very interested in this because I would like to see a high-performance SNES emulator emerge with accuracy as good as bsnes. I would expect such an emulator to not use switch statements per cycle, but instead to be more like zsnes: straight-line assembly code for CPU or SPC700, except with the capability to do a context-switch in the middle of executing a CPU instruction (for example, as part of completing a memory access, you might have to suspend and simulate the other task until the value you're trying to read is available).

I understand that lock-step is easier for you to manage in your emulator though, and I applaud your efforts towards accurate emulation of the SNES --- something which seems long overdue for a 15-year-old console!

--mozz

by on (#11270)
I just thought of something that I should probably make explicit. A CPU core is really just a state machine. Well, when you write some code to simulate a state machine, you can either represent the state explicitly (by keeping it in a variable), or implicitly (by the program counter, or which part of your code is currently executing).

In the explicit model, you typically simulate the progress of the state machine by calling a function, which uses a switch statement (or something) to dispatch to different bits of code based on the value of the state variable. You run the code, set the variable to a new value, and then return from the function. Now, this is very much how byuu's cycle-based emulator Bsnes works. The advantage is that you can call the function whenever you want, and do one unit of work. You can then decide if you want to go off and do something else, or if you want to call it again to do another unit of work.

The other model, the implicit model, involves representing the "current state" by your position in the code. You switch to a new state by transferring control to a different place in the code (e.g. with a goto statement). That is how I would expect my "efficient but accurate" emulator to be implemented---it would look very much like the instruction-based emulators, except that the code to handle the effects of each cycle would need to be in the correct order, and you'd need to account for the correct number of machine cycles between read or write calls. Because the "current state" is partly embodied in the host machine's program counter, in order to pause in the simulating of a CPU instruction and go off and simulate something else, we need to be able to preserve and restore the host machine's program counter. And that is why I talk about green threads and context-switching.

And the effects don't really need to be emulated in the right order, if they are not observable from outside the CPU (unless you care about that for debugging, for example). You can fudge a bit. As long as externally-visible effects (such as memory access cycles) occur at the correct time, and as long as you can suspend (i.e. context-switch) on demand at the beginning of any memory access cycle, you can have single-cycle accuracy and yet run almost as fast as an instruction-based emulator does.

Sorry if any of this is confusing or unclear. =)

P.S. Byuu, is that 30% overhead figure just for your CPU core, or your entire emulator? Even if it's just the CPU core, keep in mind that the 65816 is more complicated than a 6502, so the relative overhead for a NES core might be higher. :wink:

by on (#11272)
byuu wrote:
Quote:
The threading you described is known as "cooperative threading"


Never heard of it. So is it doable in c++ without non-portable libraries?
[...]
I need to work it out on paper, but I think you pretty much would need to allocate custom stacks for each "thread". No getting around it, c++ is an inefficient language for this.


EDIT: that part is correct at least. :wink:
What you need is SEPARATE STACK SPACE for each task. When you switch tasks, you also switch stacks. So whatever function call you were in the middle of when you last suspended that task, it is preserved perfectly on the stack for that task.

So that's the gist of co-operative multithreading. Multiple "threads" or "tasks" of execution, which explicitly yield to each other at the desired points--no preemption (these are sometimes called "green threads" to distinguish them from real OS threads). You basically allocate a small stack area and a small area for saving host machine registers (the "context", as it is called) for each green thread. Then you have a small routine (which you probably have to write in assembly) that knows how to switch between them. It saves the registers (including the stack pointer) of the old task, and loads the registers of the new task (including its stack pointer) from its context area. Each green thread points to its own stack. Note that because you are switching co-operatively, you may not need to save all the registers. This is different from the "preemptive multitasking" used by OS threads, which must save and restore ALL registers because the OS might preempt the thread at any time. For any CS students out there, the concept of green threads is very similar to co-routines.

WinNT-based OSes have something called a "fiber" which is basically a green thread. You could use those for context switching if you just want to try things out. However they might have bad performance (I have no idea).

Edit: How do you allocate the stacks for each task? Well, you could use malloc or new[] or something, but be careful about calling any OS functions without switching back to the "true" OS stack first. Another way is to reserve a bunch of stack space with a local array variable in an outer function call (such as main). Touch the pages of each array with a for-loop to make sure it's actually paged in (because demand-paging of stack space only works if you touch the stack pages linearly, which is unlikely to be the case here). Then stick the address of the top of the array into your context structure's stack pointer slot. I'm not sure which is better. I once wrote code that used the malloc method and we had to change it to the other method because Windows 95/98 would crash when we called OutputDebugString with our stack pointer pointing into the heap instead of the real thread stack. Or maybe our stacks were just too small...who knows. If you avoid calling any OS functions on them you can probably get by with pretty small stacks.
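
For what it's worth, POSIX systems package this mechanism up as <ucontext.h>. A minimal sketch (not tuned for speed -- some implementations also swap the signal mask on every switch, which costs a system call):

Code:
#include <ucontext.h>

static ucontext_t main_ctx, cpu_ctx;
static char cpu_stack[ 64 * 1024 ]; // this green thread's private stack

static void cpu_thread() // runs on cpu_stack
{
    for ( ;; )
    {
        /* ...execute one CPU cycle here... */
        swapcontext( &cpu_ctx, &main_ctx ); // yield: save self, resume caller
    }
}

int main()
{
    getcontext( &cpu_ctx );
    cpu_ctx.uc_stack.ss_sp   = cpu_stack;
    cpu_ctx.uc_stack.ss_size = sizeof cpu_stack;
    cpu_ctx.uc_link          = &main_ctx;
    makecontext( &cpu_ctx, cpu_thread, 0 );

    for ( int i = 0; i < 1000; i++ )
        swapcontext( &main_ctx, &cpu_ctx ); // run the CPU for one more cycle
    return 0;
}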

by on (#11273)
Quote:
I take strong issue with your claims that the catch-up design is a hack.


Sorry, it's just my personal opinion. If it works, it works. I try and code things as a reference implementation, which is obviously terrible for performance. There's no need to separate PPU synchronization from CPU memory accesses, but I think it's generally a good idea. I know I'm not exactly emulating logic gates and other insanely low level hardware things like that, but I believe in sticking to how hardware does things as much as possible without being utterly ridiculous. Of course, that's completely relative.

mozz: I do allow breakpoints to be set in both the CPU and APU, I also allow them both to be stepped cycle by cycle in real time. See bsnes v0.013's debugger for an example of this.

Stack size isn't a problem. Even a massive stack is only 256kb, and I only need four (including the main program). The DSP and APU can share, at least for now as I have no idea what happens between DSP cycles, and there's no way to find out without some serious hardware analysis tools that I simply don't have.

I have a few concerns, then. First off, how do you switch the stack in C++? setjmp/longjmp do not do this. makecontext/swapcontext are Unix-only. x86 assembler turns my application from completely portable to not portable at all. But I could at least create a wrapper so the asm parts could be ported.

Currently, I have two ideas. One is to just use that library approach and use cooperative multitasking. The other is to write a library to hide the state machine code from the emulation code as much as possible, something like:

Code:
void op_lda_addr_w() {
begin_state_machine();
  aa.l = op_read();
  yield();
  aa.h = op_read();
  yield();
  regs.a.l = op_read(MODE_ADDR, aa.w++);
  yield();
  regs.a.h = op_read(MODE_ADDR, aa.w);
  flags_lda_addr_w();
//yield(); before return?
end_state_machine();
}


Of course, there's no performance advantage there, it just makes the code LOOK like it's cooperative. But this is also insufficient for some cases as I can't break out from the level below, e.g. in the middle of the op_read() call to simulate the bus hold time.

As far as adding true cooperative multithreading, one problem I'm worried about there is what about exiting the thread and restarting it? Say the user tries to reset the SNES, but the CPU thread is in the middle of an opcode. With the state machine, I can just reset the states and jump back to the start. I'd have to add is_reset() right after every yield for the purposes of returning back to the start of the stack to account for that, I believe...

If I were to write my emu in assembler, I could easily do what I wanted. But I'm not interested in sharing the platform limitations that ZSNES suffers to this day.

by on (#11274)
Quote:
Sorry, it's just my personal opinion. If it works, it works.


Oh well, I was hoping for actual problem cases for the catch-up method. I agree that a straightforward design like you describe is great for ease-of-implementation, comprehension, and simplicity. If high efficiency wasn't a prime goal, I would not use catch-up. And like you say, all emulators are at a higher level than the console itself. All take liberty by merely emulating the behavior of the hardware at some level, rather than emulating the hardware itself. Putting aside the extremely low-level operation at the molecular and finer levels, practical things like bad cartridge connections and power glitches require a much lower level of emulation than all emulators currently achieve.

Quote:
First off, how do you switch the stack in c++?


C++ is capable of utilizing a cooperative threading library. Implementation of such a library, on the other hand, cannot be done in C or C++; it absolutely requires assembly. Cooperative multithreading is very architecture-neutral and should be easy to implement on any architecture. The only area where you might encounter issues is in stack manipulation and the operating system's interaction with this, but as mozz said, you could just divide the normal stack into sub-stacks. This is what I did in the version I wrote. As far as the OS is concerned, it's all one thread.

Quote:
As far as adding true cooperative multithreading, one problem I'm worried about there is what about exiting the thread and restarting it? Say the user tries to reset the SNES, but the CPU thread is in the middle of an opcode.


Just reinitialize the CPU thread's state and start it fresh. Read mozz's nice description of the equivalence of a state machine's current state with the program counter in the threaded version. The old state is irrelevant unless you're allocating memory and storing the pointer in a local variable; if you acquire any resources you should keep the reference in a non-local variable, like a member of the CPU object.

I'm going to implement a 6502 version of this to show how simple it is.

by on (#11275)
Quote:
Oh well, I was hoping for actual problem cases for the catch-up method.


I cited many in my first post. To repeat one example, how would you complete a DMA transfer that activates in the middle of an opcode if your CPU emulator can't break out until the opcode is completed? To get the timing right, you'd have to return from the opcode so that you can DMA bits at a time. If you make the CPU the absolute master function over the entire emulator, then you could get away with adding all the sync stuff into the DMA.

Then what do you do about the CPU<>APU cycle interleaving? You can't interleave each cycle for both unless at least one is a state machine capable of breaking after each cycle is executed.

What about single-opcode stepping both the CPU and APU? Much harder when you have to complete the CPU opcode, which could bump the APU opcode forward some.

These issues likely don't apply to the NES, but they do apply to most multiprocessor systems.

Anyway, there are ways to get around everything. But a lot of them are indeed hackish, e.g. using a timestamp buffer to store memory values for the CPU<>APU sync, or making the CPU the absolute master controller for the entire emulator.

by on (#11283)
It sounds like you aren't interested in giving the catch-up method a fair chance, but I'll try to discuss this anyway. That means focusing on one example rather than throwing them around randomly without scrutinizing any one thoroughly. The DMA seems like a good one, but I'll need a bit more info on what DMA can do. Can you give a concrete situation, i.e. DMA occurs here and does this while CPU is in middle of opcode X? That way you don't have to describe DMA in general, just what it's doing in that case, that precludes using catch-up.

by on (#11332)
Quote:
It sounds like you aren't interested in giving the catch-up method a fair chance, but I'll try to discuss this anyway.


I don't mind giving it a fair chance, it obviously works for you guys. But I'm not interested in using it myself, having tried for the better part of three months to get those issues I mentioned working. That includes even if every issue I have were solved. I was mainly posting because I wanted to know if anyone had successfully used cooperative multitasking in an emulator quickly and cleanly. Or if anyone had ideas on how to.

Ok, DMA. When you write to $420b, it performs one transfer for each bit set.
So you have up to eight channels, each channel can send up to 65536 bytes, and each byte transferred takes 8 cycles. There are 1324 available cycles per scanline, and 262 scanlines per frame.

When you write to $420b, the system executes one more CPU opcode cycle (this could be the opcode fetch of the next opcode cycle if the last cycle of the current opcode writes to $420b). It then waits a bit for syncing and then runs all of the DMA transfers before returning control to the CPU.

Now, you can sync things by calling APU and PPU sync commands for each DMA byte transfer. The DMA transfers can affect the APU and PPU, as it can write to its address lines. The problem is returning back to the main program. If you run ten frames worth of DMA transfers, then your GUI will be unresponsive (unless you call window update routines in your DMA byte transfer), your sound will skip (unless you call directsound buffer updating code), your debugger won't be able to freeze emulation (unless you jump into that from your DMA transfer routine and don't exit that routine until the user resumes with the debugger -- and now your debugger needs to sync the GUI messages up on its own), and you won't be able to take savestates in the middle of a DMA transfer because you won't be able to re-enter your transfer in the middle of it.

Yeah, all of these things can be overcome. But if there was only just some way to do fast and clean cooperative multitasking, you wouldn't have to worry about adding APU/PPU/GUI/Sound/Debugger syncing inside anything in the CPU that executes a memory read or write on a port.
Obviously, you can skip syncing the PPU when a PPU register is not accessed, same for APU, same for the rest.

by on (#11336)
byuu wrote:
mozz: I do allow breakpoints to be set in both the CPU and APU, I also allow them both to be stepped cycle by cycle in real time. See bsnes v0.013's debugger for an example of this.

Of course. I just wanted to emphasize that in an implementation that was geared towards performance, you wouldn't have to run in lock-step mode unless the user actually set breakpoints in both. Even with breakpoints set for one chip, you could still run it ahead of the other one to batch up work.

I fully realize that sort of thing makes no sense in the context of Bsnes. I think Bsnes is great, by the way. I've looked over some of its code and it's pretty readable (very readable by the standards of most emulator code! :wink:)

blargg wrote:
Oh well, I was hoping for actual problem cases for the catch-up method. I agree that a straightforward design like you describe is great for ease-of-implementation, comprehension, and simplicity. If high efficiency wasn't a prime goal, I would not use catch-up. And like you say, all emulators are at a higher level than the console itself. All take liberty by merely emulating the behavior of the hardware at some level, rather than emulating the hardware itself. Putting aside the extremely low-level operation at the molecular and finer levels, practical things like bad cartridge connections and power glitches require a much lower level of emulation than all emulators currently achieve.

Well, you could take the point of view that what you are emulating is an abstract machine--an artificial construct that can be described in its entirety with mathematical precision. The Nintendo Entertainment System hardware is one implementation of this abstract machine. An emulator is a different implementation. Of course there is some uncertainty when you have multiple chips on independent clocks (like the SNES does). Maybe that defies the "mathematical precision" and you really do have to think of it as emulating two different interacting abstract machines. Hmm...

by on (#11454)
Topic drift...

byuu wrote:
Quote:
Ok, today i finally finished the new cycle-for-cycle accurate 6502 emulator ... Boy, nothing could prepare me for it. On my P4 2.2GHZ I had 60FPS, and full 30 times slower than the previous core, which had 1800FPS in the same situation.


Then you seriously botched something up


Oh, you betcha. I have just done a test of the following:

Code:
inline void Operation69()
{
   switch(CPU.Cycle)
   {
      case 0:
         CPU.PC++;
         CPU.Cycle++;
         break;
      case 1:
         CPU.P &= 0x3D;
         if( CPU.A + CPU.Memory[CPU.PC] + (CPU.P & 0x01) > 0xFF )
            CPU.TMP = 1; else CPU.TMP = 0;
         if( (char)CPU.A + (char)CPU.Memory[CPU.PC] + (CPU.P & 0x01) < -128 || (char)CPU.A + (char)CPU.Memory[CPU.PC] + (CPU.P & 0x01) > 127 )
            CPU.P += 0x40;
         CPU.A += CPU.Memory[CPU.PC] + (CPU.P & 0x01);
         CPU.P &= 0xFE;
         if( !CPU.A )
            CPU.P += 0x02;
         CPU.P += CPU.TMP + (CPU.A & 0x80);
         CPU.PC++;
         CPU.Cycle = 0;
         break;
   }
   CPU.CC++;
}


vs.

Code:
inline void Operation69()
{
   CPU.PC++;
   CPU.CC++;
   UpdatePPU();
   UpdateAPU();
   UpdateMapping();

   CPU.Byte = CPU.Memory[CPU.PC];
   CPU.TMP = (char)CPU.A + (char)CPU.Byte + CPU.CF;
   if( CPU.TMP < -128 || CPU.TMP > 127 )
      CPU.VF = 0x40; else CPU.VF = 0x00;
   if( CPU.A + CPU.Byte + CPU.CF > 0xFF )
      CPU.TMP2 = 1; else CPU.TMP2 = 0;
   CPU.NF = CPU.ZF = CPU.A += CPU.Byte + CPU.CF;
   CPU.CF = CPU.TMP2;
   CPU.PC++;
   CPU.CC++;
   UpdatePPU();
   UpdateAPU();
   UpdateMapping();
}


The latter of the two was twice as fast.

by on (#11477)
The sticky part of a catch-up scheme is synching the visible environment changes between processors. Queueing up writes works to a point, but anything that can respond asynchronously takes extra effort.

Scanline interrupts and such are rather predictable, but that of course adds another layer of complexity to the matter.

Sound/gui updates should be handled as asynchronously as possible anyways, for the reasons mentioned.

Any coop multitasking system is going to be rather system specific, though the details can be abstracted out to minimize the unportable bits.

Coroutines might be another approach as well, though implementing them in C is somewhat tricky. Essentially, when the CPU talks to the APU, it jumps over to the APU code, which does its thing, then jumps back, keeping track of the state for both. It's relatively similar to cooperative multitasking, except you specify the thread to jump to explicitly.

As for emulating full-blown multiprocessor systems accurately, there are almost always going to be tradeoffs made, if only because emulating the clock skew and propagation delay between chips is a bit extreme for this stuff. It's better to sit down and think about the system, determine what exactly each processor can see and what events modify that view, and structure things around that.

by on (#12487)
I am gonna make WedNESday cycle accurate. But there are a few things that I would like to make sure of first.

1. Is it worth updating the Data Bus after each cycle? I can't see how this would be useful in a 6502 emulator.

2. What is the 'Address Bus' that some documents keep referring to? Example:

Code:
BRK

        #  address R/W description
       --- ------- --- -----------------------------------------------
        1    PC     R  fetch opcode, increment PC
        2    PC     R  read next instruction byte (and throw it away),
                       increment PC
        3  $0100,S  W  push PCH on stack (with B flag set), decrement S
        4  $0100,S  W  push PCL on stack, decrement S
        5  $0100,S  W  push P on stack, decrement S
        6   $FFFE   R  fetch PCL
        7   $FFFF   R  fetch PCH


3. I am quite confused by the 'pipelining' concept. Some documents say that the previous opcode is executed when the new one is fetched. Is this true? Or is the following correct:

Code:
inline void ADC()
{
   PC++;
   CC++;

   A += ...   // Executed ADC NOW not when next opcode is fetched
   CC++;
}


Thanks in advance.

by on (#12489)
WedNESday wrote:
2. What is the 'Address Bus' that some documents keep referring to?

It just refers to the signalling lines that are used by the CPU to tell other bus participants (e.g. a memory controller?) what address it wants to read from or write to.
It just refers to the signalling lines that are used by the CPU to tell other bus participants (e.g. a memory controller?) what address it wants to read or write to.

Think of a "bus transaction" as follows: one party (the CPU) supplies an address on the address lines, and indicates whether it wants to read or write. If writing, it also supplies a "value" on the data lines. if reading, the other party (the memory controller?) is responsible for setting the data lines. One party writes to them and the other party reads from them.

WedNESday wrote:
3. I am quite confused by the 'pipelining' concept. Some documents say that the previous opcode is executed when the new one is fetched. Is this true?

Yes. But for practical purposes, you can pretend it isn't true.

"Pipelining" means dividing your logic that implements an instruction into multiple "stages". If you had three instructions A, B, C and a three-stage pipeline, then in cycle 1 instruction A enters the first stage. In cycle 2 instruction A moves on to the second stage, and instruction B enters the first stage. In cycle 3 instruction A enters the third stage, B enters the second stage and C enters the first stage.

The "latency" of instruction A was 3, because it had to pass through 3 pipeline stages (taking one cycle each) before it was completely finished. However, the "throughput" of this pipeline (assuming all instructions use all 3 stages and have latency of 3 cycles) would be 1 instruction per cycle! In other words, by splitting up the task of executing an instruction into 3 stages, we manage to be doing 3 things at once instead of just one thing. So our throughput is 3 times as high.

Now, it's not always so rosy. For one thing, instructions take different numbers of cycles and sometimes need to participate in only some of the pipeline stages. Also, sometimes there are bottlenecks: for example, the 80386 had a pipeline, but the first stage had to share some resources with the second stage. So an instruction could only 'issue' into the pipeline every two cycles, and the minimum effective cost of each instruction was 2 cycles. The 486 fixed this, and many of its instructions could then be executed effectively in one cycle, making it perform noticeably better.

Imagine a branch instruction enters the first stage of your pipeline, but you won't figure out if the branch is 'taken' or 'not taken' until stage 3 of the pipeline. What do you do on the next cycle? The branch moves up to stage 2 of the pipeline, but what should go into stage 1? You don't know yet if it's the instruction right after the branch, or the instruction at the branch target address (even if you knew, you'd have to fetch that byte before you could decode it). Most processors have 'branch prediction' logic to try and figure out well in advance whether a branch is likely to be taken or not. They then assume that is the case and keep filling the pipeline with the predicted instructions. If it turns out the guess was wrong, the processor has to flush the parts of the pipeline that contain wrongly-predicted instructions, causing a "pipeline stall". On modern x86 processors, most branch instructions are predicted correctly 80% to 90% of the time. The rest of the time, you take a performance hit that could be anywhere from 10 to 25 cycles. [Edit: I think I forgot to mention that modern x86 processors have the equivalent of 15-25 stages in their pipelines. Modern GPUs in graphics cards have even more stages---dozens and dozens!]


Okay, so getting back to the 6502... think of it as having a "pipeline of depth two". The first stage performs a memory access (read or write), and the second stage performs the operation of the instruction. The first and second cycles of a 6502 instruction are consecutive reads. During the second cycle, it is also decoding the opcode and deciding what to do with it. Maybe the second read was wasted (e.g. an implied operand), or maybe it was a Direct Offset or an Absolute Address low byte, or something else we would have needed to read anyway. Take the example of an implied operand, though. The *third* cycle of the instruction is when it's actually executed! But the instruction is only two cycles long, you say! And you're sort of right---that *third* cycle is the same cycle as when the opcode fetch of the next instruction is occurring:

Code:
    A-1.  fetch A's opcode      (and, finish last instruction)
    A-2.  fetch next byte        (and decode A)
    B-1.  fetch B's opcode      (***and execute A)
    B-2.  fetch next byte        (and decode B, which turns out to be a Zero Page insn)
    B-3.  read from Zero Page   (and...idle)
    C-1.  fetch C's opcode     (and execute B)


But if you write these effects sequentially, does it really matter how you divide them up?

Even if it's really doing this:
Code:
    (finish last instruction) and
    A-1.  fetch A's opcode
----------------------------
    (decode A) and
    A-2.  fetch next byte
----------------------------
    (execute A) and
    B-1.  fetch B's opcode
----------------------------
    (decode B, which turns out to be a Zero Page insn), and
    B-2.  fetch next byte


you can for the most part think of it instead as doing this:
Code:
    ....
    (finish last instruction)
----------------------------
    A-1.  fetch A's opcode, and
    (decode A)
----------------------------
    A-2.  fetch next byte, and
    (execute A)
----------------------------
    B-1.  fetch B's opcode, and
    (decode B, which turns out to be a Zero Page insn)
----------------------------
    B-2.  fetch next byte, and
    ....


I'm no expert so there might be errors in this. But you get the general idea.

by on (#12594)
Very helpful, thanks. Does anyone know if resetting the NES clears the pipelines?

by on (#12598)
Reset is one of the three interrupts (the others being IRQ and NMI). Any interrupt will clear the pipeline.

by on (#12611)
What about this...

1. Trigger NMI/IRQ if pending.
2. Fetch next instruction.
3. Decode the addressing mode & process the proper bytes (1 or 2).
4. Jump to the instruction using the addressing mode byte/word as argument.

Note: I could make great use of jump tables. A lot of opcodes with different addressing modes behave as if they were a single one.

by on (#12616)
By the way, I don't think you should actually try to emulate the pipelines explicitly in your code! That would almost certainly be overkill.

Remember, external devices can ONLY interact with the CPU through memory accesses and the IRQ/NMI/RESET pins. So while a device (such as a memory mapper, or the PPU) is able to see reads and writes, it is NOT able to see exactly when the 6502 is doing the "work" of executing the instruction. That would be internal calculations such as adding a temporary value (perhaps read in the previous cycle) to A.

So as an emulator author, you just have to make sure the correct "internal operations" are performed sometime during the instruction after the inputs become available and before the output is needed. The 6502 actually performs them in parallel with bus operations and uses the 2-stage pipeline. But emulator authors should not need to do that.
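For illustration, here is roughly what that looks like for a zero page ADC (a minimal sketch: it assumes a Read() helper that runs the PPU/APU for one CPU cycle before returning, and the flag masks are only for illustration):

Code:
    // ADC zero page (a sketch, not anyone's actual core)
    case 0x65: {
        uint8_t zp  = Read(PC++);            // cycle 2: fetch zero page address
        uint8_t val = Read(zp);              // cycle 3: read the operand
        // the real 6502 does this arithmetic in parallel with the next opcode
        // fetch, but it is invisible on the bus, so doing it here is fine
        unsigned sum = A + val + (P & 0x01);           // carry is bit 0 of P
        P &= ~0xC3;                                    // clear N, V, Z, C
        if (sum > 0xFF)                    P |= 0x01;  // carry
        if (~(A ^ val) & (A ^ sum) & 0x80) P |= 0x40;  // overflow
        if (!(sum & 0xFF))                 P |= 0x02;  // zero
        if (sum & 0x80)                    P |= 0x80;  // negative
        A = (uint8_t)sum;
        break;
    }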

This sort of timing stuff can get very complicated. If it helps you think about it, you could draw out a sort of "timeline of events" for an instruction. Divide it up into cycles, and mark down when the different actions (reads, writes, internal computations) are occurring. For anything that's not externally visible, there is no fixed time you have to do it at --- just dependencies on earlier events, and maybe later events that depend on this one.

Quote:
Note: I could make great use of jump tables. A lot of opcodes with different addressing modes work as if they would be only 1.

Be aware that there is usually a performance penalty for jumping through a jump table. That is called a "dynamic branch" and modern processors will be very slow unless it is highly predictable, i.e. 95% of the time you are jumping to the same address. Opcode dispatch is almost the worst case for dynamic branch, being random enough that the CPU can almost never predict the correct address. For example, if you were to write your core so that each instruction did TWO table dispatches instead of just one, you would probably find that the whole core was now 10-20% slower! Of course it depends how fast the rest of your instruction emulation is... since the 6502 performs a read or write (which takes you many host cycles worth of work to emulate) on *every* cycle, maybe you're paying enough for that already that the dispatch overhead is kinda small in comparison.

I'll recommend this: try to do the table lookup into a register well in advance, and leave it there untouched for a few dozen cycles before you actually do the indirect jump. One of my coworkers once went for a seminar at Intel about their VTune product, and he asked about dynamic branches and was told that on Pentium 2's and Pentium 3's, a jump through a register is "essentially free" if the address sits in the register for at least 40 cycles before the jump. My takeaway from that story was that 40 cycles is long enough for the write to the register to retire from the ROB and be written back to the architecturally-visible register file. Anyway, if the value has been in the register long enough, the branch predictor will just use it as the predicted address. So try to read it as early as possible.
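In C terms, the idea might look something like this (a sketch only; the helper names are made up, and whether the pointer actually stays in a register is up to the compiler):

Code:
    // hoist the table lookup well before the indirect call, so the
    // predicted target is available long before the jump
    void (*handler)(void) = op_table[opcode];  // look it up as early as possible

    update_timestamps();                       // ...unrelated work in between...
    check_interrupts();

    handler();                                 // indirect jump, hopefully predicted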

[Edit: I realize now this part of my post was disingenuous, because I'm actually working on generating cores that do in fact divide most instructions into two halves and do two dynamic branches to execute them. The first routine handles most of the addressing mode, and the second routine includes the various ALU ops and so on. It should considerably reduce code size (which I think I'm about the only person on the planet who cares about these days! hehe) and to avoid it being too slow I'll be doing exactly what I recommended above---in fact my two dispatch tables are combined into one table and my dispatch routine will load BOTH addresses into registers as early as possible. It will then do a few other things before branching to the first register. The second register will be untouched throughout the first handler for the instruction (the address mode code) unless it needs to be spilled by the handler for a nontrivial memory access such as a PPU access. So hopefully, the jump-through-a-register which connects the "first half" of my instruction to its "second half" will be essentially free most of the time. I'm kind of more interested in how much I can optimize code size anyways...]

by on (#12622)
mozz wrote:
I'm kind of more interested in how much I can optimize code size anyways...


If you can get under around 4K of machine code + 1K for the jump table, you have my C++-based core beat. :)

by on (#12630)
As far as I know, there are 3 "options" to deal with opcode emulation...

a) a LARGE case statement. You know, around 256 entries of case 00h...FFh inside an infinite loop. Pros? Cons?

b) jump tables. You use the opcodes as small "pieces", and then jump to the proper "piece" carrying an argument (byte or word). It makes the code much smaller. Pros? Cons?

c) inlined subroutines. The standard thing, newbie style perhaps? You code each opcode into a separate function, maybe inlined &| static. Each opcode is accessed by a pointer, something like (*function_op[opcode_num])(argument); Pros? Cons?

Actually, I use b. No clue if a small core could be worse than a large core, like the case statement.
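For reference, option (c) might look roughly like this (a minimal sketch; all names invented):

Code:
    // option (c): one function per opcode, dispatched through a pointer table
    typedef void (*OpFunc)(int argument);

    static void op_69(int argument) { /* ADC #imm */ }
    static void op_a9(int argument) { /* LDA #imm */ }

    static OpFunc function_op[256] = { /* op_00, op_01, ..., op_ff */ };

    /* dispatch: */
    function_op[opcode_num](argument);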

by on (#12646)
Here's mine:
Code:
__forceinline void OperationCode69()
{
   // cycle 1: opcode fetch
   CPU.PC++;
   CPU.CC++;
   // PPU/APU catch-up here
   // cycle 2: fetch immediate operand and perform the ADC
   CPU.PC++;
   CPU.CC++;
   // PPU/APU catch-up here
}

__forceinline void OperationCode()
{
   switch(CPU.Memory[CPU.PC])
   {
      case 0x69: OperationCode69(); return;
   }
}

Note the return, not break, at the end of each case. As long as there is no code after the case table, you can exit the function immediately. I know that without macros the cycle-for-cycle memory accessing will make the code large, but never mind. Is there a faster method of doing this kind of thing?

by on (#12652)
WedNESday wrote:
Here's mine:
Code:
__forceinline void OperationCode()
{
   switch(CPU.Memory[CPU.PC])
   {
      case 0x69: OperationCode69(); return;
   }
}


Note that this generates the exact same code as:
Code:
__forceinline void OperationCode()
{
   switch(CPU.Memory[CPU.PC])
   {
      case 0x69: OperationCode69(); break;
      case 0x70: OperationCode70(); break;
   }
}

So you're not actually gaining anything.

by on (#12653)
The return would cause the function to exit straight away, whilst the break would cause the program to exit the switch statement and then return from the function. Surely without the break is faster?

by on (#12655)
Compile both versions and compare the binaries using a hex editor. ^_^;;..

by on (#12656)
When you write in C, you aren't specifying the machine code that comes out; you're specifying the external side-effects that must occur when executing the code. Since the two have the same side-effects, and patterns like this occur commonly and are easy for a compiler to optimize, compilers generate the same code.

by on (#12662)
WedNESday wrote:
The return would cause the function to exit straight away, whilst the break would cause the program to exit the switch statement and then return from the function. Surely without the break is faster?

No, it is not.

by on (#12664)
WedNESday wrote:
The return would cause the function to exit straight away, whilst the break would cause the program to exit the switch statement and then return from the function. Surely without the break is faster?

They should have exactly the same performance on any decent compiler, because of the optimizer. Those two programs are exactly equivalent if you convert them into a DAG representing their control flow (which the optimizer does, probably even at -O1 or -O0).

by on (#12667)
Fx3 wrote:
Compile both versions and compare the binaries using an hexa editor. ^_^;;..


Never!

I believe you guys. However, if you had a really crap compiler then I would be right :lol: .

by on (#12693)
Observe the following:

Code:
Read-Modify-Write instructions (ASL, LSR, ROL, ROR, INC, DEC,
                                     SLO, SRE, RLA, RRA, ISB, DCP)

        #   address  R/W description
       --- --------- --- ------------------------------------------
        1    PC       R  fetch opcode, increment PC
        2    PC       R  fetch low byte of address, increment PC
        3    PC       R  fetch high byte of address,
                         add index register X to low address byte,
                         increment PC
        4  address+X* R  read from effective address,
                         fix the high byte of effective address
        5  address+X  R  re-read from effective address
        6  address+X  W  write the value back to effective address,
                         and do the operation on it
        7  address+X  W  write the new value to effective address


When it says read from effective address, would that affect the VRAM register if the address were $2007? Also, is it necessary to read, then re-read, or can that be done just the once? (My concern being that some kind of mapper or external device may cause a bank switch or something.)

by on (#12694)
blargg tested this, and it does affect the PPU registers. So if there were 2 reads or whatever, the PPU would see both reads. I looked at one game and it used the ABS addressing mode for PPU registers. Perhaps that's why? Maybe check to see what addressing modes games use to read/write the PPU registers and write to the MMC registers.

matt

by on (#12701)
mattmatteh wrote:
blargg tested this, and it does affect the PPU registers. So if there were 2 reads or whatever, the PPU would see both reads. I looked at one game and it used the ABS addressing mode for PPU registers. Perhaps that's why? Maybe check to see what addressing modes games use to read/write the PPU registers and write to the MMC registers.

matt

The CPU is either reading or writing at some address in every cycle. That table lists the addresses, too. If you want the most accurate emulation, you should simulate every read and every write (even the "dummy" ones that the CPU doesn't use the value of). I don't know if there are any games that rely on that behaviour, but that's what the 6502 in the NES did.

by on (#12702)
WedNESday wrote:
Observe the following:

Code:
[...]
        4  address+X* R  read from effective address,
                         fix the high byte of effective address
        5  address+X  R  re-read from effective address
        6  address+X  W  write the value back to effective address,
                         and do the operation on it
        7  address+X  W  write the new value to effective address


When it says read from effective address, would that affect the VRAM register if the address were $2007? Also, is it necessary to read, then re-read, or can that be done just the once? (My concern being that some kind of mapper or external device may cause a bank switch or something.)

Yes. Every read or write of $2007 advances the VRAM pointer, so this instruction advances it by 3 or 4 depending on whether the index addition crosses a page boundary (when it does, the cycle-4 dummy read goes to the wrong address and misses $2007). And yes, if you want to emulate the bus accurately, you have to perform both reads (issuing the dummy read at the wrong address when a page boundary is crossed) and both writes.
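A sketch of issuing every one of those bus accesses, following the table quoted above (assuming Read()/Write() helpers that also catch the PPU/APU up by one cycle; names hypothetical):

Code:
    // INC abs,X: one bus access per cycle, dummy accesses included
    uint16_t base  = Read(PC) | (Read(PC + 1) << 8);           // cycles 2-3
    PC += 2;
    uint16_t wrong = (base & 0xFF00) | ((base + X) & 0x00FF);  // high byte not fixed yet
    uint16_t addr  = base + X;
    Read(wrong);              // cycle 4: dummy read; equals addr when no page cross
    uint8_t v = Read(addr);   // cycle 5: the real read
    Write(addr, v);           // cycle 6: write the unmodified value back...
    v++;                      // ...do the operation...
    Write(addr, v);           // cycle 7: ...write the new value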

by on (#12752)
Alright, figured I'd post about this here as well since we were discussing cooperative multithreading, etc previously in this thread.

Basically, what I've found out about cooperative multithreading is that there is an implementation in Win95/NT3.51+ called Windows Fibers. On my Athlon 3500+, it is generally 10-15x slower to call SwitchToFiber(my_proc) than it is to call my_proc() directly. And it's also Win32-specific.

However, I was able to write my own implementation of the library that works on any x86 target including Win32, Linux and FreeBSD (excepting maybe OSx86), and get that number down to 6-7x slower.

My cothreading library creates separate stacks for each cothread, supports direct jumping or infinite call/return recursion of threads, and saves/restores everything the C/C++ ABI for win+lin+bsd specifies for each context switch (basically ebx+ebp+esi+edi, ignoring eax+ecx+edx+ST(0)-ST(7)+mm0-mm15). And context switches occur in roughly 14 opcodes, requiring no user<>kernel transitions.

The source to the library is here : http://byuu.cinnamonpirate.com/temp/libco_x86.asm

And the entire package is here :
http://byuu.cinnamonpirate.com/temp/libco_v03.zip

This basically allows for exactly the kind of implementation I was talking about before, and should hopefully be very fast as well. Even though you have the overhead of switching contexts instead of just calling a subroutine, you should hopefully gain it back by not having to use a state machine (switch/case) every time you call your CPU core.

So for example, the following code :
main() -> cpu_run() -> switch(state) case OPCODE_EXEC: -> exec_opcode() -> switch(opcode) case 0xa9: -> op_a9() -> switch(opcode_cycle) case 2:
Will become :
main() -> co_call(cpu_context) -> /* we are now within case 2: */

You also no longer *have* to break out of the CPU core after every opcode cycle, and you can just test the memory address, breaking only if it affects another unit (eg it is a PPU / APU read/write operation), or if a significant amount of time has passed. It also allows for more accuracy as you can trivially simulate bus hold delays now.

The actual CPU core can be implemented exactly like an opcode-based core, putting the co_return() calls inside the op_read() and op_write() functions alone.

So you gain more speed, more accuracy, and much cleaner code. But there are a few drawbacks to this approach that will probably mean nobody ever uses this besides myself :

1) Platform-dependence. The library is trivial to implement on any platform, but it would take a good programmer with knowledge of assembler to do it right.
2) Context-switching requires flushing the pipeline/L1 cache or whatever, meaning performance will be worse on more excessively pipelined processors (such as the P4) compared to more moderate designs (Athlons, older processors ...). Basically, a lower clock multiplier will probably result in faster performance. Processors with more than 8 volatile registers could require significantly more time to save/restore the processor context.
3) And this is the big one... you can't save and restore the state of threads in a platform-independent way, meaning you can't have savestates... it may be possible to save the context-buffer + stack into a savestate and restore it, but the format would change with even a minimal code change, breaking older savestates in the process. Power/reset events are still possible, one need only destroy and recreate the CPU thread.

Always a tradeoff, right? I'm personally going to write a CPU+APU core, and then write the code to run them so that I have both an opcode-based, less accurate core supporting savestates, and a bus-accurate core that will not support savestates.

So, ideas, comments? Would anyone else consider using something like this?

-----

An example opcode using cothreading :

Code:
void op_lda_addr() {
  aa.l = op_read(regs.pc++);
  aa.h = op_read(regs.pc++);
  cpu_io();
  regs.a.l = op_read(aa.w);
  regs.p.n = bool(regs.a.l & 0x80);
  regs.p.z = regs.a.l == 0;
}

uint op_read(uint addr) {
  add_clocks(4); //bus-hold portion of the memory cycle
#ifdef USE_COTHREADS
  co_return(); //synchronization magic!
#endif
  uint r = mem_read(addr);
  add_clocks(4); //remainder of the memory cycle
  return r;
}

by on (#13135)
Can anybody enlighten me on this:

Code:
Relative addressing (BCC, BCS, BNE, BEQ, BPL, BMI, BVC, BVS)

        #   address  R/W description
       --- --------- --- ---------------------------------------------
        1     PC      R  fetch opcode, increment PC
        2     PC      R  fetch operand, increment PC
        3     PC      R  Fetch opcode of next instruction,
                         If branch is taken, add operand to PCL.
                         Otherwise increment PC.
        4+    PC*     R  Fetch opcode of next instruction.
                         Fix PCH. If it did not change, increment PC.
        5!    PC      R  Fetch opcode of next instruction,
                         increment PC.

       Notes: The opcode fetch of the next instruction is included to
              this diagram for illustration purposes. When determining
              real execution times, remember to subtract the last
              cycle.

              * The high byte of Program Counter (PCH) may be invalid
                at this time, i.e. it may be smaller or bigger by $100.

              + If branch is taken, this cycle will be executed.

              ! If branch occurs to different page, this cycle will be
                executed.


It's the only thing left that I have yet to do. It seems to me that there are way too many "increment PC"s in there. It must check the branch condition on cycle 2 and not 3, otherwise the minimum number of cycles required would be 3, which is not the case.

by on (#13138)
The description looks fine to me (the following is based on what you quoted). If you read the description, several of the PC increments are conditional. The first step, fetch opcode, is performed on the last cycle of the previous instruction. The second step fetches the branch offset. The third step fetches the next instruction regardless of whether the branch is taken. If it was not taken, PC is incremented and the branch is complete.

If the branch was taken, the fourth step fetches the new opcode. If no page-crossing occurred, the PC is incremented and the branch is complete. If a page-crossing did occur, then the high byte of the PC needs to be fixed and the new opcode fetched one final time in step 5, and the PC incremented.
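In emulator terms, that sequence might be sketched like this (assuming a Read() helper that accounts for one cycle; the final opcode fetch belongs to the next instruction's dispatch):

Code:
    // BNE: relative branch, one Read() per cycle
    case 0xD0: {
        int8_t offset = (int8_t)Read(PC++);              // cycle 2: fetch operand
        if (!flag_z) {                                   // branch taken?
            Read(PC);                                    // cycle 3: dummy opcode fetch
            uint16_t target = PC + offset;
            if ((target & 0xFF00) != (PC & 0xFF00))      // page crossing:
                Read((PC & 0xFF00) | (target & 0x00FF)); // cycle 4: read with stale PCH
            PC = target;
        }
        break;  // the next instruction's opcode fetch is the final cycle here
    }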

by on (#13212)
byuu wrote:
3) And this is the big one... you can't save and restore the state of threads in a platform-independent way, meaning you can't have savestates... it may be possible to save the context-buffer + stack into a savestate and restore it, but the format would change with even a minimal code change, breaking older savestates in the process. Power/reset events are still possible, one need only destroy and recreate the CPU thread.

Confucius says.. Platforms and emulators may come and go, but the NES will never change 8)

by on (#13224)
augnober wrote:
Confucius says.. Platforms and emulators may come and go, but the NES will never change 8)


Uhh... what?

by on (#13226)
byuu wrote:
augnober wrote:
Confucius says.. Platforms and emulators may come and go, but the NES will never change 8)


Uhh... what?

Sorry about that. You'd made it sound like it wouldn't be possible to avoid savestate versioning issues.. but since the NES is static, it's more a matter of how practical it is. My thought was that to avoid version-dependence you could be careful to only write out data which is a snapshot of the emulated NES internals (only the essentials). All emulators would need this and only this data. If you could find a way for that to be sufficient without making the export and import too unwieldy, there would be no platform or emulator dependency (the only platform it would depend on is the NES, which any future versions will emulate anyway).

Edit: To put this in perspective, it would have been possible to save the state off a real NES back in the 80's (well, in theory anyway) and load it up in an emulator today, provided the data saved were sufficient. Presumably this data would have been in bits and pieces with no obvious original organization, and the person who made the record would have needed to organize the data in a presentable fashion. This could just as easily (or difficultly) be done from an emulator today.

by on (#13235)
Quote:
You'd made it sound like it wouldn't be possible to avoid savestate versioning issues.. but since the NES is static, it's more a matter of how practical it is.


This isn't related to the NES. The reason you can't use savestates with cooperative multithreading is because you can't save the stack + context registers of each thread into a savefile with any kind of reliability that it will ever work again. Let alone on a different OS or processor.

You can save the current state of the NES, but you wouldn't be able to load it. You'd have no way to set program execution to the correct point in each of the threads. I'm sure it's possible to create a semi-hybrid design that would allow savestates + partial multithreading, but it would lose the main advantages of using threads in the first place: code simplicity and possibly accuracy.

by on (#13250)
augnober, what you describe is exactly what I've done with my NES emulator. The save state format is laid out much like the NES registers, along with any internal state necessary. Pretty soon I'm going to write some code to transfer a save state from an emulator to a NES.

The problem with byuu's system (ignoring that it's for the SNES) is a limitation of the emulator implementation. Some of the system state is represented in the host machine's internal state of several threads (program counter, stack contents, stack pointer). This can't be easily converted to an emulator-neutral format, and would probably be quite difficult to reconstruct back when restoring from a neutral save state. The only option left is an exact memory dump of the host process (or some subset), which results in a very volatile format that could change even when simply re-linking the executable.

by on (#13260)
Quote:
The problem with byuu's system (ignoring that it's for the SNES) is a limitation of the emulator implementation. Some of the system state is represented in the host machine's internal state of several threads (program counter, stack contents, stack pointer). This can't be easily converted to an emulator-neutral format, and would probably be quite difficult to reconstruct back when restoring from a neutral save state. The only option left is an exact memory dump of the host process (or some subset), which results in a very volatile format that could change even when simply re-linking the executable.


Exactly. By moving the parallelism from the software (eg a state machine as discussed in this thread before) to hardware (eg cooperative multithreading), you gain an enormous speedup and code simplification, and you can control synchronization points as finely as you desire -- eg you can run multiple opcodes in a row before returning from the CPU, and yet still stop right in the middle of a bus hold delay of any given cycle, when needed. However, software or hardware, a save state has to save the current "state" of each processor, and your state format is now dependent on the host machine as well as the emulated machine.

The closest thing to a savestate would be to dump the physical stack memory and context registers for each thread. However, like blargg said, offsets inside the EXE to procedures and variables can change even with the slightest modification to the code. Also, different platforms with different implementations of malloc/etc will cause allocated memory addresses to differ between systems anyway. Any variation in malloc calls between two separate runs of the program would also break the state captures. In the best case, you'd only reliably be able to run a savestate in this format on the system the savestate was captured on, and it would require a lot of trickery. Probably not even that.

Anyway, it's again a tradeoff. And it isn't SNES dependant. It'll work with any emulator. However, I don't think I'll ever find any emu author willing to give up savestates for it :(

by on (#13263)
That, or have all threads sync to a stable architectural state at every vblank.

by on (#13270)
blargg wrote:
The problem with byuu's system (ignoring that it's for the SNES) is a limitation of the emulator implementation.


I should have said that this is an inherent limitation of a design using multiple threads, not just a problem with byuu's implementation. Even though a solution isn't immediately obvious, I'm still unconvinced that one doesn't exist. I haven't even thought about this cooperatively threaded design much yet.

tepples wrote:
That, or have all threads sync to a stable architectural state at every vblank.


If we define a stable state as the main CPU and sound CPU both having just completed instructions, this might not even be possible. I imagine that the only stable states that could be reliably gotten to would constrain the implementation enough that it would come to resemble the previous design that didn't use threads. How do you restore a situation like the following where the main CPU emulator and sound CPU emulators are both mid-instruction in their respective threads?

Code:
// Main CPU
int addr = read_mem( pc++ );
yield();
int data = read_mem( addr );
yield(); // stopped here

// APU
int operand = read_mem( pc++ );
yield();
int result = a + operand;
yield(); // stopped here


In this example several pieces of state are in registers or on the stack, in a way determined by the particular compiler used. I do have the start of an idea of a solution. It involves running each thread manually from some well-defined starting point, feeding it enough dummy data to get into the desired state. This is similar to testing microchips through a narrow serial test pin where lots of actions are necessary to get the chip into the desired state.

by on (#13274)
byuu wrote:
Quote:
You'd made it sound like it wouldn't be possible to avoid savestate versioning issues.. but since the NES is static, it's more a matter of how practical it is.


This isn't related to the NES. The reason you can't use savestates with cooperative multithreading is because you can't save the stack + context registers of each thread into a savefile with any kind of reliability that it will ever work again. Let alone on a different OS or processor.

You can save the current state of the NES, but you wouldn't be able to load it. You'd have no way to set program execution to the correct point in each of the threads. I'm sure it's possible to create a semi-hybrid design that would allow savestates + partial multithreading, but it would lose the main advantages of using threads in the first place: code simplicity and possibly accuracy.

So to get savestate support, you'd likely have to compromise the design enough that it would no longer be the design you were talking about.. Which means the design really doesn't support savestates. Got it.

I was about to say I was still pretty confident that there's some way to get savestate support while still retaining the benefits of using cooperative multithreading.. but since: a) optimizing the elegance of this may be the kind of work better suited for a master's thesis, and b) without or even with that kind of work, people may have a rough time understanding how you'd made it work (which would risk defeating the original purpose of your design selection), it's probably better to let it go. On the other hand, it's a fun goal if someone wants to give it a try.. :)

by on (#13275)
Unless you have a really creative emulator, I don't see any problems with making save states. Basically you need cycle alignment and an image of all the states of the emulated device. At most you would need all the states of the registers and RAM images.

by on (#13277)
That would be nice if you could load a save state on the NES. Then maybe we could have a standard save state format, one that works on the NES. I wonder if the reverse could be done: take a state from the NES and load it into the emulator. Doesn't seem possible unless the data bus is being logged and internal states are determined. Oh well. But I think a common save state would be nice. I know we discussed it before among emulators, but this time we (not me though, as my emu is not mature enough) could agree on one if it's loadable on the NES.

matt

by on (#13279)
blargg wrote:
blargg wrote:
The problem with byuu's system (ignoring that it's for the SNES) is a limitation of the emulator implementation.


I should have said that this is an inherent limitation of a design using multiple threads, not just a problem with byuu's implementation. Even though a solution isn't immediately obvious, I'm still unconvinced that one doesn't exist. I haven't even thought about this cooperatively threaded design much yet.

This is my situation too. The more I think about it, the more I see byuu knew what he was talking about when he said saving the state is feasible, but restoring it in a platform independent way isn't. It's looking that way to me. I imagine the best chance at a solution is in a different programming language which has this in mind, but then I suppose that even if such a language exists, the compilers currently available are bound to be slow.

by on (#13301)
byuu wrote:
This isn't related to the NES. The reason you can't use savestates with cooperative multithreading is because you can't save the stack + context registers of each thread into a savefile with any kind of reliability that it will ever work again. Let alone on a different OS or processor.


If you simulate the parallel pieces of hardware in lock-step, then you're right that there is no convenient way to save and restore the state because at any given moment, at least some of the parallel tasks will be in an "intermediate" state where some of their temporary values are in machine registers, etc.

On the other hand, if you simulate parallel tasks using a "catch up" scheme, or some more general scheme where each task has a separate idea of the "current" time and you only synchronize them when they need to communicate, then there is a simple solution: advance each task to a point where it isn't in an "intermediate" state before attempting to save the state of that task, and save the current simulated time of each task as part of the state!

In other words, if you were simulating the CPU and APU and PPU as three separate parallel tasks, you might end up with a savestate containing CPU state as of time T1, APU state as of time T2 and PPU state as of time T3. But the timepoint T1 would be the beginning of a CPU cycle (not some intermediate point in the middle of a CPU cycle). And the timepoint T2 would be the beginning of an APU cycle, etc. Most importantly, the values T1, T2 and T3 would be part of the savestate, and after loading the savestate the emulator would resume emulating the tasks from those exact timepoints. It's not even necessary for T1, T2, T3 to be close together in time... however, it might be desirable to get them as close as you can before taking the snapshots, so that other emulators with less precision could still load the saved state of each chip and pretend that all 3 of them were saved at time T.


Of course, the main reason to simulate the separate tasks out-of-sync with each other (as opposed to simulating them in lockstep) is performance: you can run one task for a hundred cycles before you switch and run a different one for a hundred cycles. You only need to check whether it's time to switch infrequently (e.g. one test in the CPU instruction dispatch code), and switch only if it is. The memory read or write handlers for ports which could require synchronization with another task might also need to suspend the current task until another task has "caught up", but you can treat that differently. I am tired, I hope I made some sense in this post.
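For concreteness, the synchronize-on-communication part could be sketched like this (all names invented for illustration):

Code:
    // each chip keeps its own clock; sync only when they actually communicate
    struct Task { uint64_t now; };   // current simulated time, in master clocks
    Task cpu, apu;

    // CPU reads an APU->CPU port: the APU must catch up to the CPU's time first
    uint8_t read_apu_port(uint16_t addr) {
        while (apu.now < cpu.now)
            apu_step();              // run one APU cycle, advancing apu.now
        return apu_port_value(addr);
    }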

As usual, the stuff in this post would be difficult to apply to BSNES because it simulates tasks in lock-step (which is reasonable, because performance is not one of the main goals of BSNES---BSNES aims to be as accurate as possible while still being written in a clear and maintainable fashion).

blargg wrote:
Even though a solution isn't immediately obvious, ...

The most interesting problems usually require non-obvious solutions. I would say writing an emulator which is both high-performance and highly accurate is a very interesting problem! At least, I've been thinking about how to do it for many months now. I've had some good ideas that came easily to me, and some others that came only after a lot of thought and thus were non-obvious. At this point, it's hard for me to remember which were which.

Edit: by the way, the idea in this post could be used for other things besides savestates. For example, if you had the "Sands of Time" feature in your emulator, you'd do the same thing for taking the in-memory snapshots used to rewind time.

Another random thought: suppose your task / cothread (representing, for example, the CPU) could cause a context switch in multiple places but only ONE of them was the "safe place" to do a save. When this cothread was not running, it would be easy to check if it had stopped at the "safe place" by comparing the saved host PC to an address in your "safe place" code. A complete save operation would then involve running each task for a while until it stops at the "safe place", and then saving the state (and timepoint) of that task. Of course, if a task suspends somewhere other than the "safe place" it might mean you have to run a different task for a while before you can resume the suspended task, and if you have already saved that other task, you would have to discard its saved state and save it again later! So there are still some wrinkles. My gut feeling is that this problem could be avoided: perhaps you could do something like, keep simulating tasks for short durations and checking if ALL tasks are stopped at a safe point. As soon as they all are, save them all simultaneously with whatever current times they have.

by on (#13303)
Quote:
If we define a stable state as the main CPU and sound CPU both having just completed instructions, this might not even be possible. I imagine that the only stable states that could be reliably gotten to would constrain the implementation enough that it would come to resemble the previous design that didn't use threads. How do you restore a situation like the following where the main CPU emulator and sound CPU emulators are both mid-instruction in their respective threads?


Exactly. And even if it's not so difficult to align the CPU and APU for a snapshot, what about when you have more than two threads? How about eight, twelve? It becomes harder and harder to align all threads to resumable points. Of course, you can split things up to have more resumable points. The more you add, the more like your old slow state machine your code becomes. The less you add, and the more threads you use, the less likely you'll be able to take a snapshot at all this way.

Quote:
I was about to say I was still pretty confident that there's some way to get savestate support while still retaining the benefits of using cooperative multithreading.. but since: a) optimizing the elegance of this may be the kind of work better suited for a master's thesis, and b) without or even with that kind of work, people may have a rough time understanding how you'd made it work (which would risk defeating the original purpose of your design selection), it's probably better to let it go. On the other hand, it's a fun goal if someone wants to give it a try.. :)


Indeed. If we eliminated this bottleneck, and kept all of the coding and performance benefits, we'd have the potential to do for accuracy what dynarec and HLE did for speed.

Master's thesis, hm? Heh, college would be nice. I'd have to settle for it being a high school senior's thesis :)

Quote:
This is my situation too. The more I think about it, the more I see byuu knew what he was talking about when he said saving the state is feasible, but restoring it in a platform independent way isn't.


I spent the better part of a year researching this (ok, a few hours a week). So yeah, I probably know what I'm talking about ... but then again, maybe not ;)

Quote:
I imagine the best chance at a solution is in a different programming language which has this in mind, but then I suppose that even if such a language exists, the compilers currently available are bound to be slow.


That's exactly it. This could be done in C++ too, using save states. You either implement execution context in hardware, using CPU registers and the stack, and gain a huge speedup but lose the ability to save and restore the state cross-platform; or you implement the execution context in software, which becomes a state machine, where you have many small functions and hold execution states in memory variables, and lose a lot of speed. It's irrelevant what language you use. Hardware is always faster, but by its very nature makes the states platform-dependent. Of course, it'd be more convenient to use a language that had this sort of functionality built in, but I like C++ too much to switch.
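A tiny sketch of that "execution context in software" alternative (hypothetical names; every temporary lives in a struct, so the whole context can be serialized at any cycle boundary):

Code:
    enum { FETCH_OPCODE, FETCH_OPERAND, EXECUTE };
    struct Cpu { int state; uint16_t pc; uint8_t opcode, operand; };

    void cpu_step(Cpu &cpu) {
        switch (cpu.state) {
        case FETCH_OPCODE:  cpu.opcode  = read(cpu.pc++); cpu.state = FETCH_OPERAND; break;
        case FETCH_OPERAND: cpu.operand = read(cpu.pc++); cpu.state = EXECUTE;       break;
        case EXECUTE:       /* do the ALU work */         cpu.state = FETCH_OPCODE;  break;
        }
    }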

Quote:
In other words, if you were simulating the CPU and APU and PPU as three separate parallel tasks, you might end up with a savestate containing CPU state as of time T1, APU state as of time T2 and PPU state as of time T3.


Again, this becomes increasingly difficult as the number of threads increases. Let's say I need CPU, APU, PPU, DSP, and SA-1. Now if any of those five components are communicating with each other, I won't be able to take a snapshot because the waitforevent delay inside one of those will force that thread to be stuck in the middle of a function. It's a good kludge that'd probably work for CPU<>APU synchronization, though.

I'm also not interested in a complex timeshifting system with memory buffers, rewind execution and such to effectively allow out-of-order execution, where the CPU can execute an entire opcode even though a middle cycle writes to the APU, and when the APU runs it compensates by detecting that there was a write from the CPU in the middle of its opcode. That's too complex for my design goals. However, that's not to say it's a worse idea than using cothreads. It may even be a superior method.

Quote:
Of course, the main reason to simulate the separate tasks out-of-sync with each other (as opposed to simulating them in lockstep) is performance


Cothreads let you get away with this at an even finer granularity. Right now, you're probably executing one opcode at a time between sync calls, right? With cothreads, you can skip the out-of-order complexity and simply check what address is being written to.

Code:
uint read(uint addr) {
  add_cycles(bus_hold_read_delay);
  if(requires_sync[addr]) yield();
  uint result = memory_read(addr);
  add_cycles(read_delay - bus_hold_read_delay);
  return result;
}


The new CPU core will exit only when it absolutely needs to, even if that ends up being 2000 opcodes later! You can avoid overflowing your cycle counters with something like:

Code:
void add_cycles(int cycles) {
  cycle_counter += cycles;
  while(cycle_counter >= too_many_cycles) yield(); //allow something else to run to lower this thread's cycle counter
}


Quote:
perhaps you could do something like, keep simulating tasks for short durations and checking if ALL tasks are stopped at a safe point. As soon as they all are, save them all simultaneously with whatever current times they have.


Answered already.

by on (#13318)
byuu wrote:
Again, this becomes increasingly difficult as the number of threads increases. Let's say I need CPU, APU, PPU, DSP, and SA-1. Now if any of those five components are communicating with each other, I won't be able to take a snapshot because the waitforevent delay inside one of those will force that thread to be stuck in the middle of a function. It's a good kludge that'd probably work for CPU<>APU synchronization, though.

I don't think this would be a problem in practice. Consider the CPU cothread. Once per emulated instruction (in the dispatch code), it will be at a "safe point" to save. So as long as it isn't blocked waiting for something else to complete, you only have to advance it a relatively short amount of simulated time in order to get to a "safe point" to save.

In other words, you might context switch away from the CPU cothread for two reasons: either because you want to (which will always be at a "safe point", e.g. at the beginning of dispatching an instruction), or because you have to (because you need to interact with a different cothread and need it to "catch up" to you before you can complete your operation, e.g. a read from a APU->CPU communication port). In the case of "want to" points, you might want to switch to a different cothread to handle a timer interrupt, or to make sure other tasks aren't delayed for too much wall-clock time (i.e. must simulate APU often enough to keep sound buffer full of samples) or because you want to take a savestate.

In order to arrange a save, you would just make all tasks stop at a "want to" point as soon as they can. If one or more of them stopped at a "have to" point instead, you'd continue emulating normally, i.e. do whatever "catch up" was necessary to unblock it, until it reached a "want to" point. Remember, "want to" points occur pretty frequently--every simulated instruction, for example--so you're not likely to "have to" suspend more than once to get to one.
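Putting those pieces together, the save operation might be sketched as follows (speculative; it assumes each cothread can report whether it is parked at a "want to" point):

Code:
    // run the machine until every cothread is parked at a "want to" point,
    // then serialize each one along with its own current timestamp
    void save_state() {
        bool all_parked;
        do {
            all_parked = true;
            for (int i = 0; i < num_tasks; i++) {
                if (!task[i]->at_want_to_point()) {
                    run_awhile(task[i]);  // may run other tasks to unblock "have to" waits
                    all_parked = false;
                }
            }
        } while (!all_parked);
        for (int i = 0; i < num_tasks; i++)
            task[i]->serialize(out);      // includes the task's own simulated time
    }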

Another way to look at it: In the quote above you seem to be implying a scenario where many cothreads need to synchronize at once (i.e. "have to" stop for each other) in a way that prevents you from getting them all to a "want to" point. But even if it was possible for that to happen for short periods of time (I'm not at all sure that it is), it doesn't happen most of the time. If it did, "catch up" implementations would be synchronizing so often that they would hardly perform better than lock-step simulation. The reason they perform better in practice is that the parallel tasks spend most of their time NOT interacting with each other, so close synchronization is not usually needed.

I'm just speculating about all of this. At one time I was trying to come up with a formalized timing model that gave precise definitions for all of this stuff. But I got busy doing other things.

by on (#13339)
Quote:
Another way to look at it: In the quote above you seem to be implying a scenario where many cothreads need to synchronize at once (i.e. "have to" stop for each other) in a way that prevents you from getting them all to a "want to" point. But even if it was possible for that to happen for short periods of time (I'm not at all sure that it is), it doesn't happen most of the time. If it did, "catch up" implementations would be synchronizing so often that they would hardly perform better than lock-step simulation. The reason they perform better in practice is that the parallel tasks spend most of their time NOT interacting with each other, so close synchronization is not usually needed.


I'm really truly very much against the idea of playing to chance, hoping that the one rare instance where things get stuck in "have to" syncs repeatedly never occurs. In other words, if it isn't guaranteed to work, then to me it's junk.
I also need to do this for the PPU, and I won't be breaking it apart into a dot renderer. So it will have to run an entire scanline to be in a "want to" point to pause at.
However, if it's the only way to get savestates, perhaps I'll consider it :/

by on (#13353)
byuu wrote:
Quote:
I imagine the best chance at a solution is in a different programming language which has this in mind, but then I suppose that even if such a language exists, the compilers currently available are bound to be slow.


That's exactly it. This could be done in C++ too, using save states. You either implement execution context in hardware, using CPU registers and the stack, and gain a huge speedup but lose the ability to save and restore the state cross-platform; or you implement the execution context in software, which becomes a state machine, where you have many small functions and hold execution states in memory variables, and lose a lot of speed. It's irrelevant what language you use. Hardware is always faster, but by its very nature makes the states platform-dependent. Of course, it'd be more convenient to use a language that had this sort of functionality built in, but I like C++ too much to switch.

Thinking about it more, I think a different language could help a lot. Before, I was thinking about the ability of a state to be saved on one platform and loaded on another. For a strictly easier goal: how about having cross-platform emulator code that supports savestates, but only allows the states to be restored on the same platform? I think the proper language could be very helpful in abstracting what would be required while also not horribly maiming execution speed.

by on (#13363)
On what cycles are bit 7 of $2002 set and NMI executed? And in which order?

by on (#13589)
augnober wrote:
Thinking about it more, I think a different language could help a lot.

What language in particular do you have in mind?

by on (#13599)
-_pentium5.1_- wrote:
augnober wrote:
Thinking about it more, I think a different language could help a lot.

What language in particular do you have in mind?

I'm not thinking of any particular language, rather just being kind of optimistic for the sake of not ruling it out. I don't want to go too far off the topic of emulation, but I'll explain the sort of thing that I think could be of use since I didn't explain before and haven't seen mention of this subject before (although since I haven't seen it mentioned, I also unfortunately don't know anything about it and therefore don't feel right going on about it.. hmm..)

If a high-level language were designed to facilitate the saving and loading of execution state, then this would be abstracted across platforms (saving a developer from the impractically hacky/difficult platform-specific work). And if it didn't go so far as to require the state to be transferable between platforms, then in theory, with a good optimizing compiler, I don't see why it would necessarily be too slow. At its simplest, a facility to keep track of pointers that will be relocated on load, and to register heap memory which must be saved (not being thorough here), may be enough to put a language on its way toward the goal (while things such as keeping the stack organized are common to many compilers anyway, and may just need to be dealt with in more depth). So to avoid giving the language too much burden/overhead in an attempt to make it automatically work in every situation, perhaps the application programmer would be expected to keep restoration in mind and act accordingly. Code modification after a saved state could cause trouble, but not so much as to disallow all changes.

I bet some computer scientists have gotten obsessed with the idea of being able to stop and restore any execution state and have gone ahead and developed a high level language to support it across different platforms, even if only as a proof of concept (or to secure a grant, to graduate, whatever). I'm not aware of any languages advertised as such though. Strangely, I haven't felt any need for such a language since I'm generally satisfied with being able to save and load my own data.. but this project of byuu's is interesting :) In a way, interpreted languages make it easy enough to restore state by nature.. so it's surely been done before even if by accident, but the question is whether or not someone's gone for restorable execution state and also good performance.

If someone has made something like I'm describing with compilers which can generate code not much slower than the best C++ compilers, then I'd be interested in hearing about it :) Or if someone has tried and there's a reason why performance is necessarily poor, I'd be interested in finding out why.

by on (#14738)
blargg wrote:
mozz wrote:
I'm kind of more interested in how much I can optimize code size anyways...


If you can get under around 4K of machine code + 1K for the jump table, you have my C++-based core beat. :)


For interest's sake: Last weekend, I wrote by hand about 90-95% of a 6502 core (in x86 assembly code) which uses two handlers to implement each instruction. The dispatch table will be 1K (two 16-bit fields in each entry) and so far the code size is about 750 bytes, however I am missing some code:
- need more dispatch code, probably about 50 bytes worth
- the memory read and write routines are not present. They will probably weigh in around 200 bytes.
- BRK and a few other instructions (one of the complex undocumented insns--ARR?--is not implemented yet) will probably add another 50 bytes or maybe even more..

Anyway, it's currently looking like the entire core will weigh about 2K including the dispatch table. I still haven't tried to assemble it or run it, and inevitably it has bugs in it. I don't know how well it will perform (and frankly, having a core that small is so cool that even if it turns out to be slower I will probably keep it that way! hehe)

by on (#14748)
I don't think blargg uses any asm. Mine doesn't, as it's portable. I will have some asm, but with C code as an option. Using all asm limits portability.

matt

by on (#14778)
My 6502 emulator doesn't use any assembly, but it also doesn't emulate any unofficial instructions nor some of the more subtle aspects of read-modify-write instructions yet (dummy reads). So mozz's likely 2-3K core with unofficial instruction support will be quite an achievement.

by on (#14805)
Bah, pack that baby down to 256 bytes and release it as a demoscene .com app :P

Or more realistically, it'd be kind of cool to try and design the smallest possible NES emulator in pure assembly as a .com DOS file.

For those who use Macs, a .com file is basically an executable with no header. You can't have one bigger than 64k, and it runs in 16-bit mode but you can use 32-bit registers in it.

TNES had one that was ~30kb, but it went up in size and eventually became a .exe file anyway.

by on (#14819)
Under Windows, the .com suffix is also useful for fooling n00bs into executing your rootkit, as they think it's a shortcut to some web site.

by on (#14847)
byuu wrote:
Or more realistically, it'd be kind of cool to try and design the smallest possible NES emulator in pure assembly as a .com DOS file.

It's a cool idea, but I don't know realistically how small you can make some of the stuff. The PPU for example---perhaps you could make one that was small or fast but not both. And I have no idea how small, either--maybe not very. And then think of the mappers... yikes! I bet even a clever size-optimized implementation of 100+ mappers will be bigger than my asm core. :P

At the end of the day, if all the important code+tables fit in the L1 cache then that will be good enough.

One thing I hope to do with my code generator is experiment a bit. If my core was 10x bigger but only used one handler per instruction instead of two, would it be faster or slower? (I suspect it would be a tiny bit faster, but truly I am just guessing!) But I have lots to do before I get to that point (start compiling and debugging these cores, get generating a suite of automated instruction tests and get the cores running those tests, etc).

Edit: also, my code is meant for 32-bit code and a flat memory model, and DOS used 16-bit code and segmented memory. Insns using a 32-bit register would be one byte bigger; however, near call instructions would be 2 bytes smaller! =)

by on (#14852)
You don't need to worry about your entire code set fitting in the cache, just the often-used portion. For the CPU emulator, for example, a small set of instructions dominate. The same probably goes for what hardware is most accessed, like the PPU status register.

by on (#14916)
Going back a little ways here...

byuu - You're looking for a way to do savestates in a co-threaded execution model? I used co-threads once before, so I'm used to how they work. I might have a suggestion here, although I don't know how feasible it is, as I don't know how your code is set up.

Each of the three emulation threads (CPU, PPU, APU) should have one "safe yield" point where a savestate can be made. The PPU would most likely have this point during its V-Blank (or forced blank, possibly) handler, since that would be an easy time to resurrect the current state as the PPU isn't really doing anything other than counting cycles. The CPU and APU would have a "safe yield" between instructions, where execution would return to the main loop within their respective cores. Upon resuming from a "safe yield," all three threads should receive a message stating whether the thread should archive its state (to a section of memory, pending write to disk) or not. This way, each thread is responsible for archiving its state.

All threads should, upon creation, receive an "unarchive" message, telling it to load to an existing state and jump to the code immediately following the "safe yield" location. If a save state is loaded, currently executing threads would be destroyed and new threads would be created, so that a thread would not have to test for "unarchiving" after every yield.

You should have no trouble synchronizing the PPU and CPU to "safe yield" points, since the CPU should (in most cases) execute numerous instructions during V-Blank. The tricky part, however, is finding a "safe yield" point for the APU. My suggestion here is this: When the user hits the freeze state button/key, continue execution until the CPU and PPU are both at safe yield points, then tell those two threads to archive. Next, continue execution until the APU is on a safe yield point, then archive its state and write to disk (making sure to include the amount of time elapsed between the CPU/PPU states and the APU state). When resuming from a savestate, be sure to block APU execution until the appropriate amount of time passes.

The only time this will fail is if the CPU reads from an APU register (writes to an APU register won't be a problem, as their effects would be remembered in the APU state). The way to solve this would be to archive the state of these registers on every CPU read cycle between the CPU state and the APU state, then restore these values after each cycle when loading the savestate. Thus, if the CPU ever accesses these registers during this time, it would receive the value returned by the APU at that time.
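That register-logging idea might be sketched like this (hypothetical names throughout):

Code:
    // log CPU reads of APU registers made between the CPU/PPU snapshot time
    // and the APU snapshot time; replay the log when loading the state
    struct RegRead { uint64_t cycle; uint16_t addr; uint8_t value; };
    RegRead read_log[256]; int log_len = 0;

    uint8_t cpu_read_apu(uint16_t addr) {
        uint8_t v = apu_read(addr);
        if (capturing_state && log_len < 256) {
            RegRead r = { cpu_cycle, addr, v };
            read_log[log_len++] = r;
        }
        return v;
    }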

The only remaining problem I can think of would be if DMA takes up the entire V-Blank time, but I don't see how that could happen during gameplay (is it possible?).

by on (#14979)
Actually, a single DMA call can span over eleven fields (think of fields as frames, in non-interlaced mode).

8 channels * 65536 bytes/channel * 8 clocks/byte transferred = 4194304 clocks
1364 clocks/scanline * 262 lines/field = 357368 clocks/field
4194304 / 357368 = 11.7 fields

Anyway, I'm aware that if I can align the CPU and APU on opcode boundaries or "cheat" and execute instructions out of order (so long as they aren't communicating with each other it won't matter), I can save states at "safe" points. But I really wanted to avoid that sort of code trickery if possible. Especially a time buffer. If I have a time buffer between CPU and APU communications, I may as well not use cothreading to begin with. The goal was quite honestly to simplify the code.

We're really only talking a single opcode of desynchronization anyway, and even then it can't be detected unless the two are communicating with each other (eg CPU is accessing an APU port and APU is accessing a CPU port; one or the other alone would not cause a problem). It's not like other emulators (excepting probably SS) have more accurate savestates anyway ... and this allows conversion between the emulators since they cannot resume mid-opcode.

So then :

Code:
void save_state() {
  //run until the CPU is inside vblank (scanline 240-260)
  while(r_cpu->scanline() < 240 || r_cpu->scanline() > 260) snes->run();
  //then finish any partially-executed opcodes
  while(r_cpu->in_opcode()) r_cpu->run();
  while(r_apu->in_opcode()) r_apu->run();
  capture_savestate();
}


The only uncertain part left is the PPU. Right now, the PPU does not have its own thread, and runs one scanline at a time. Therefore, I have no re-entry problems with it.
However, with a dot-based renderer, things might get more tricky. Taking a snapshot every field may work, but isn't as point-specific as I'm sure some would like. Specifically, romhacking would suffer if the savestate leaped forward when they were testing a small block of code. I would need to make the PPU renderer re-entrant, or force it to "run ahead" to the next "safe point", though I have no idea where that would be.

Lastly, I'm still planning on creating interpretive opcode emulators that can be combined with the current scanline PPU renderer. It should hopefully be as accurate as SNES9x ever will be (but nowhere near as fast) at least, and won't require cothreading at all.

And as always, I'm still interested in a transparent way to create recursive re-entrant functions that get at least 80% of the speed of cothreading, but without actually requiring cothreading.