(I also posted this to Spritesmind for more feedback.)
I know I'm a bit of an outlier here with using cooperative-threading to design emulators, and that most will use state machines that run each chip the minimum amount of time, but I ran into a rather interesting challenge with threads, and was curious if anyone here had some input.
The Sega Genesis + Sega CD turns out to be a bit of a nightmare of a design to emulate.
You have two 68Ks, a Z80, a VDP graphics chip, PSG audio, YM2612 audio. And tons of components in the Sega CD that I basically roll into its 68K: a graphics rotozoom ASIC, CD controller, CD driver, CD-DA playback, etc. Then there's controllers that may need threads of their own, cartridges as well, etc.
The way my scheduler works is, every time a thread is running, and it does something that takes time, I increment that thread's current clock cycle. There's some extra math in there to scale the value so that each thread can run at different clock rates.
Many chips share RAM and resources, and so you need to make sure the current thread doesn't perform an operation that could affect the other chips while it's ahead in time. So every time the CPU accesses a PSG register, it has to catch up the Z80 *and* the PSG, since the Z80 can also access the PSG. The same for the YM2612. In fact, the Z80 can access almost the entire bus of the 68K, and vice versa, so it ends up being a lot of synchronization.
The naive approach looks like this:
Once every emulated frame, I find the smallest clock value of all the threads, and then subtract that value from every thread's clock value, which prevents the clock counters from ever overflowing.
So notice how the YM2612 can't modify the PSG. As a result, it doesn't need to sync the PSG right now and can happily keep running even if it's ahead of the PSG, so long as it doesn't run so long that the PSG runs out of buffered audio samples.
The design works well, but with one glaring flaw: let's say the APU runs for a few clock cycles, and calls step. It is currently behind the CPU, but it is ahead of the YM2612. So it switches to it. The YM2612 runs for quite a while (my core is sample-based, so it eats 144 clock cycles per sample), and now it's quite a ways ahead of the CPU, so then it switches over to the CPU. BUT!! The CPU is still behind the APU, and now the CPU goes on to perform a write that affects the APU.
The naive solution to that is: every time any thread is about to do something that would influence another processor, we scan through every single thread, and find the thread with the smallest clock value, and if that's not the current thread, switch to it.
But that means we have to loop over a list of threads, compare their clock values, and keep track of which thread has the smallest value. And when we're the YM2612, that means synchronizing the PSG potentially, even though we usually don't care about that. And synchronizations are *painful*, so this approach ends up costing me about 20% more performance and I absolutely cannot spare a hit that big anymore.
My initial thought was something like a binary min-heap array for the threads, so I can find min(list) in O(1) time, but that doesn't solve the problem with synchronizing threads that aren't important.
I can also unroll the synchronization logic, for instance in the YM2612's case:
Which is an annoying increase of code that I can hide in a library function, but mostly works ... until you have three threads. Or four. Or in the case of the Genesis CPU, the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.
So right now, the naive approach seems to be the most likely to work. But as it's way too slow, I was curious if anyone had better ideas?
I know I'm a bit of an outlier here with using cooperative-threading to design emulators, and that most will use state machines that run each chip the minimum amount of time, but I ran into a rather interesting challenge with threads, and was curious if anyone here had some input.
The Sega Genesis + Sega CD turns out to be a bit of a nightmare of a design to emulate.
You have two 68Ks, a Z80, a VDP graphics chip, PSG audio, YM2612 audio. And tons of components in the Sega CD that I basically roll into its 68K: a graphics rotozoom ASIC, CD controller, CD driver, CD-DA playback, etc. Then there's controllers that may need threads of their own, cartridges as well, etc.
The way my scheduler works is, every time a thread is running, and it does something that takes time, I increment that thread's current clock cycle. There's some extra math in there to scale the value so that each thread can run at different clock rates.
Many chips share RAM and resources, and so you need to make sure the current thread doesn't perform an operation that could affect the other chips while it's ahead in time. So every time the CPU accesses a PSG register, it has to catch up the Z80 *and* the PSG, since the Z80 can also access the PSG. The same for the YM2612. In fact, the Z80 can access almost the entire bus of the 68K, and vice versa, so it ends up being a lot of synchronization.
The naive approach looks like this:
Code:
CPU::step(uint clocks) {
clock += clocks;
//resume(thread) is a thread switch:
//after resume(cpu) later, CPU code execution continues here:
if(clock >= apu.clock) resume(apu);
if(clock >= psg.clock) resume(psg);
if(clock >= ym2612.clock) resume(ym2612);
}
APU::step(uint clocks) {
clock += clocks;
if(clock >= cpu.clock) resume(cpu);
if(clock >= psg.clock) resume(psg);
if(clock >= ym2612.clock) resume(ym2612);
}
YM2612::step(uint clocks) {
clock += clocks;
if(clock >= cpu.clock) resume(cpu);
if(clock >= apu.clock) resume(apu);
}
clock += clocks;
//resume(thread) is a thread switch:
//after resume(cpu) later, CPU code execution continues here:
if(clock >= apu.clock) resume(apu);
if(clock >= psg.clock) resume(psg);
if(clock >= ym2612.clock) resume(ym2612);
}
APU::step(uint clocks) {
clock += clocks;
if(clock >= cpu.clock) resume(cpu);
if(clock >= psg.clock) resume(psg);
if(clock >= ym2612.clock) resume(ym2612);
}
YM2612::step(uint clocks) {
clock += clocks;
if(clock >= cpu.clock) resume(cpu);
if(clock >= apu.clock) resume(apu);
}
Once every emulated frame, I find the smallest clock value of all the threads, and then subtract that value from every thread's clock value, which prevents the clock counters from ever overflowing.
So notice how the YM2612 can't modify the PSG. As a result, it doesn't need to sync the PSG right now and can happily keep running even if it's ahead of the PSG, so long as it doesn't run so long that the PSG runs out of buffered audio samples.
The design works well, but with one glaring flaw: let's say the APU runs for a few clock cycles, and calls step. It is currently behind the CPU, but it is ahead of the YM2612. So it switches to it. The YM2612 runs for quite a while (my core is sample-based, so it eats 144 clock cycles per sample), and now it's quite a ways ahead of the CPU, so then it switches over to the CPU. BUT!! The CPU is still behind the APU, and now the CPU goes on to perform a write that affects the APU.
The naive solution to that is: every time any thread is about to do something that would influence another processor, we scan through every single thread, and find the thread with the smallest clock value, and if that's not the current thread, switch to it.
Code:
Scheduler::synchronize() {
uint minimum = ~0;
Thread* context = nullptr;
for(auto thread : allThreads) {
if(thread->clock < minimum) {
minimum = thread->clock;
context = minimum;
}
}
//should never happen, don't check in release builds
assert(context);
if(context != currentThread) resume(*context);
}
uint minimum = ~0;
Thread* context = nullptr;
for(auto thread : allThreads) {
if(thread->clock < minimum) {
minimum = thread->clock;
context = minimum;
}
}
//should never happen, don't check in release builds
assert(context);
if(context != currentThread) resume(*context);
}
But that means we have to loop over a list of threads, compare their clock values, and keep track of which thread has the smallest value. And when we're the YM2612, that means synchronizing the PSG potentially, even though we usually don't care about that. And synchronizations are *painful*, so this approach ends up costing me about 20% more performance and I absolutely cannot spare a hit that big anymore.
My initial thought was something like a binary min-heap array for the threads, so I can find min(list) in O(1) time, but that doesn't solve the problem with synchronizing threads that aren't important.
I can also unroll the synchronization logic, for instance in the YM2612's case:
Code:
YM2612::step(uint clocks) {
clock += clocks;
if(clock >= cpu.clock) {
if(clock >= apu.clock) {
resume(apu.clock >= cpu.clock ? apu : cpu);
} else [
resume(cpu);
}
} else if(clock >= apu.clock) {
resume(apu);
}
}
clock += clocks;
if(clock >= cpu.clock) {
if(clock >= apu.clock) {
resume(apu.clock >= cpu.clock ? apu : cpu);
} else [
resume(cpu);
}
} else if(clock >= apu.clock) {
resume(apu);
}
}
Which is an annoying increase of code that I can hide in a library function, but mostly works ... until you have three threads. Or four. Or in the case of the Genesis CPU, the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.
So right now, the naive approach seems to be the most likely to work. But as it's way too slow, I was curious if anyone had better ideas?