Whoa, sorry to bump such an old topic. I'd have posted sooner if I knew about this thread.
While I write an SNES emulator, the two systems are very similar so I see no reason we can't share experiences.
Quote:
Ok, today i finally finished the new cycle-for-cycle accurate 6502 emulator ... Boy, nothing could prepare me for it. On my P4 2.2GHZ I had 60FPS, and full 30 times slower than the previous core, which had 1800FPS in the same situation.
Then you seriously botched something up. I went from an opcode-based core that synced the clocks once per opcode (merely adding to an opcode_cycle_count var for each read/write/io cycle) to pretty much the switch/case cycle system, and lost about 30% performance, e.g. 100fps to 70fps. The new one syncs and updates clocks after every CPU cycle.
Anyway, I realize that you can probably pull off perfect accuracy by making the CPU the master controller and enslaving all other processors (PPU, APU, etc), but this is definitely not a good way to write self-documenting code. PPU and APU synchronization should be nowhere near the CPU core.
Now, let me go into the reasons I feel I had to break the SNES CPU core down so that it could return after executing single cycles.
1) CPU<>APU communication. The SNES has a dedicated sound chip, unlike the NES, that can actually execute instructions. Since most people consider it a "black box", only accessible via four ports, it can be mostly emulated as a slave device. But what about when you want a debugger? Say you want one that lets you step opcode by opcode, and edit registers between steps. So what do you do when you run one CPU opcode and your emulator crosses over an APU opcode, then starts on another APU opcode before the CPU opcode returns so your debugger can update? Simple, you end up in the middle of a new APU opcode, and you can no longer safely edit the APU registers. Second, the APU would have to be able to break after single cycle steps to properly emulate things like when the CPU reads from the APU port, and in the middle of the opcode the APU writes to that port. Timestamps and such work, but again this makes for sloppy coding and is a hack at best.
2) DMA synchronization. The DMA runs at 1/8th the CPU clock, so in order to emulate DMA sync delays (time between enabling DMA and the transfer beginning, and the time from DMA ending to the CPU resuming), you have to be able to single-step instructions. If you forcefully execute the entire instruction, and a DMA happens in the middle of that instruction, you will be forced to complete the DMA transfer immediately. Quite a problem when a single DMA can take up to ten full frames (64kbytes * 8 channels * 8 cycles/byte transferred).
3) Interrupts. Interrupts test at the start of a new bus opcode cycle. Of course, the work cycle is one behind this (both the NES and SNES are two-pipeline processors), so you need to test and possibly trigger interrupts one cycle before the end of each opcode. This can be done with Quietust's approach, but again is less elegant.
4) Code mixing. As stated before, it's definitely advantageous from a coding standpoint to keep each core as absolutely separated as possible. I have maybe 3-4 functions that need to be exposed for all of my core chips, CPU, APU, DSP, and PPU, and it works fantastically.
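To make point 4 a bit more concrete, here is a rough sketch of what such a narrow interface could look like, with a scheduler that always steps whichever chip is furthest behind. Everything in it is hypothetical: Chip, DummyCpu, DummyApu and run_until are names made up for illustration, and the per-cycle clock costs (6 and 8) are placeholders, not real SNES timings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal interface: the few functions each core exposes.
struct Chip {
    virtual ~Chip() = default;
    virtual void power() = 0;   // reset internal state
    virtual int step() = 0;     // execute exactly one cycle, return clocks consumed
    std::int64_t clock = 0;     // master-clock time this chip has reached
    std::int64_t cycles = 0;    // number of cycles executed so far
};

// Stand-in cores: every cycle costs a fixed, made-up number of clocks.
struct DummyCpu : Chip {
    void power() override { clock = 0; cycles = 0; }
    int step() override { ++cycles; return 6; }   // placeholder cost
};

struct DummyApu : Chip {
    void power() override { clock = 0; cycles = 0; }
    int step() override { ++cycles; return 8; }   // placeholder cost
};

// Step whichever chip is behind until both have reached `target`.
// Neither chip ever runs more than one of its own cycles past the other.
void run_until(Chip& a, Chip& b, std::int64_t target) {
    assert(target >= 0);
    while (a.clock < target || b.clock < target) {
        Chip& next = (a.clock <= b.clock) ? a : b;
        next.clock += next.step();
    }
}
```

The scheduler never needs to know what a CPU or an APU is, only that each one can advance by a single cycle and report how long that took. That only works if step() really can return after one cycle, which is the whole argument above.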
Now that we've established that there is merit to being able to cycle step and return, let's talk about how best to do it :)
First off, I personally feel that C++ is a bad language for parallelism. I don't have a "better" language, either. Essentially, I think something like a "thread" type would be needed. This would basically be a class where you call it directly, e.g. thread t1; t1(); and it runs until it hits pause() or exit().
Each thread would have its own stack. Calling the thread would restore its stack pointer and program counter; pausing it would save the stack pointer and program counter and return to where the thread was called.
In essence, it's a fake thread that isn't truly run in parallel with other threads. But it's extremely lightweight, needing to only save and restore two registers and make an indirect jump instead of just a stack push and direct jump.
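For what it's worth, the mechanics described above (private stack, resume on call, suspend on pause) can be roughly approximated outside the language with the POSIX ucontext functions. The sketch below assumes a system that provides <ucontext.h>; FakeThread, cpu_main and last_cycle are hypothetical names for illustration. It is also heavier than the two-register switch described above, since swapcontext typically saves the signal mask and the full register set.

```cpp
#include <ucontext.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "fake thread": owns a private stack, is resumed by calling it,
// and hands control back to the caller when it calls pause().
class FakeThread {
public:
    using Entry = void (*)();

    explicit FakeThread(Entry entry, std::size_t stack_size = 64 * 1024)
        : stack_(stack_size) {
        getcontext(&self_);
        self_.uc_stack.ss_sp = stack_.data();
        self_.uc_stack.ss_size = stack_.size();
        self_.uc_link = &caller_;        // return here if entry ever returns
        makecontext(&self_, entry, 0);
    }

    FakeThread(const FakeThread&) = delete;
    FakeThread& operator=(const FakeThread&) = delete;

    // Resume the thread; comes back when the thread pauses or finishes.
    void operator()() {
        FakeThread* previous = current_;
        current_ = this;
        swapcontext(&caller_, &self_);
        current_ = previous;
    }

    // Called from inside a running thread: save its state, return to caller.
    static void pause() {
        assert(current_ != nullptr && "pause() called outside a FakeThread");
        FakeThread* self = current_;
        swapcontext(&self->self_, &self->caller_);
    }

private:
    std::vector<char> stack_;
    ucontext_t self_;
    ucontext_t caller_;
    static FakeThread* current_;
};

FakeThread* FakeThread::current_ = nullptr;

// Toy "CPU": written straight through, pausing at every cycle boundary.
static int last_cycle = -1;

static void cpu_main() {
    for (;;) {
        last_cycle = 0; FakeThread::pause();   // cycle 0: opcode fetch
        last_cycle = 1; FakeThread::pause();   // cycle 1
        last_cycle = 2; FakeThread::pause();   // cycle 2
    }
}
```

With that in place, FakeThread cpu(cpu_main); cpu(); runs the toy CPU up to its first pause(), and each further cpu(); resumes it right where it left off.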
The benefit?
Code:
thread CPU {
  //cycle 0 is always op fetch, no need to add that into each opcode
  void opa9() {
    //cycle 1
    regs.aa.l = op_read(); pause();
    if(!regs.p.m) { flags_lda_8bit(); pause(); return; }
    //cycle 2 (only executed when accumulator is in 16-bit mode)
    regs.aa.w = op_read(); flags_lda_16bit(); pause();
  }
};
Yeah. It would be amazingly useful. You could break out and re-enter things right in the middle of functions. You'd never have problems with the stack getting crushed (CPU calls PPU calls APU calls PPU calls CPU calls APU ... crash). Each thread would only need a tiny stack heap. May not be all that processor efficient, but neither is going from cpu -> run -> run_opcode_cycle -> switch(opcode) -> switch(cycle) -> regs.a.l = op_read(); break; break; return; return; return; for what is essentially a read and assign.
But since we can't throw out the language to do this, who has better ideas? The truth is, the switch(cycle) system is hideously inefficient. It only costs me 30% performance, but that's still way too much.
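For anyone who hasn't written one, this is roughly the shape of the switch(cycle) approach, shown for a single opcode (A9, load accumulator immediate). It is a simplified sketch under assumptions, not my actual core: the Cpu struct, its 16-byte memory and the field names are hypothetical, flag updates are left out, and m == true is taken here to mean an 8-bit accumulator.

```cpp
#include <cassert>
#include <cstdint>

struct Cpu {
    std::uint8_t mem[16] = {};   // hypothetical tiny flat memory
    std::uint16_t pc = 0;
    std::uint16_t a = 0;         // accumulator
    bool m = true;               // assumed: true = 8-bit accumulator
    std::uint8_t opcode = 0;
    int cycle = 0;               // 0 = opcode fetch

    std::uint8_t op_read() { return mem[pc++ & 15]; }

    // Executes exactly one cycle of the current opcode, then returns.
    void run_opcode_cycle() {
        assert(cycle >= 0 && cycle <= 2);
        if (cycle == 0) {        // cycle 0 is always the opcode fetch
            opcode = op_read();
            cycle = 1;
            return;
        }
        switch (opcode) {
        case 0xa9:               // lda #imm
            switch (cycle) {
            case 1:              // low byte
                a = static_cast<std::uint16_t>((a & 0xff00) | op_read());
                cycle = m ? 0 : 2;   // 8-bit mode ends the opcode here
                break;
            case 2:              // high byte, 16-bit accumulator only
                a = static_cast<std::uint16_t>((a & 0x00ff) | (op_read() << 8));
                cycle = 0;
                break;
            }
            break;
        default:                 // not implemented in this sketch
            cycle = 0;
            break;
        }
    }
};
```

Every cycle pays for the call, both switches and the bookkeeping on cycle, all for what is essentially one read and one assignment.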
Oh, and I didn't notice anyone mentioning this. What about bus hold times? Reads and writes don't happen at the start of the bus cycle, you know.
Take the SNES latch counters: if you read from $2137 or write to $4201, the H/V counter positions are copied to $213c/$213d. Now, the funny thing is that both lda $2137 and sta $4201 use the exact same cycles; the only difference is the read vs write to the actual address. Both consume six clock cycles, and yet writing to $4201 results in the counter being four clock cycles ahead of reading from $2137. Why? Write hold times are longer than read hold times. I admit, my hold times may not be perfect, but they're made from highly logical guesses and the timing schematics (stored in uS) inside the W65C816S technical manual.
I'm betting you guys are just compensating by adjusting your numbers to match the NES, right?
e.g. if your emulator needs to set Vblank at V=225,HC=2, you do that rather than setting it to the true V=225,HC=6 because you ignore the read hold delay of 4 cycles? Sure, you get the same results, but which is more correct? :)
I don't have the luxury of cheating like this with two coprocessors talking to each other in realtime.
* Obviously my cycle comparisons are invalid for the NES, but hopefully you get the idea. With the SNES, one CPU cycle consumes 6-12 clock cycles against the 21MHz timing crystal.
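The hold-time point can be sketched in a few lines. The only number taken from the above is the four-clock difference between a write and a read; effect_clock, CounterLatch and the read_offset parameter are hypothetical, and read_offset in particular is a placeholder rather than a measured value.

```cpp
#include <cassert>
#include <cstdint>

enum class Access { Read, Write };

// From the latch counter example: a write takes effect four master clocks
// later than a read issued in the same bus cycle.
constexpr std::int64_t kWriteExtraClocks = 4;

// Master clock at which the access is actually seen by the target register.
// read_offset is a placeholder for the read hold delay.
std::int64_t effect_clock(std::int64_t cycle_start, Access kind,
                          std::int64_t read_offset) {
    assert(read_offset >= 0);
    return cycle_start + read_offset +
           (kind == Access::Write ? kWriteExtraClocks : 0);
}

// Hypothetical counter latch: remembers the clock at which it was triggered.
struct CounterLatch {
    std::int64_t latched_at = -1;
    void trigger(std::int64_t clock) { latched_at = clock; }
};
```

Two accesses that start their bus cycle on the same clock end up latching different counter values, which is exactly what falls out of the lda $2137 vs sta $4201 comparison.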
Heh, sorry for posting so much. This is my favorite subject regarding emulation. I'm very curious to hear your ideas.