Multithreaded emu designs

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Multithreaded emu designs
by on (#95307)
Hey everyone. Long time no see. I've been busy with "real life" nonsense... and emudev (and really, all hobby programming) has sort of drifted out of my life.

However there's just that "something" about the NES that fascinates me and keeps bringing me back.

Anyway, I was kicking around ideas for a new emu. But, with today's multicore and multithreaded CPUs, making a single-threaded emulator seems rather antiquated. Especially since emulators have to run several different systems in parallel. So I figure if I start a new emu project I'm going to try to take advantage of a multithreaded setup.

Of course, multithreading is trickier, so I thought it'd be fun/useful to open a design discussion on the topic. Has anyone here made such an emu? I know byuu has. How did you go about it?

My idea is fundamentally the same as the "catch up" approach most people here are probably familiar with. The difference is, you don't catch up the PPU on register reads/writes... instead it's constantly catching up in a parallel thread. The time it's "catching up" to is the current CPU timestamp, which would constantly be increasing as the CPU executes instructions.

There's still the same sync issues. The PPU can't surpass the CPU and we need a way to sync them on register accesses.

I'm thinking that since the CPU is clocked slower and performs generally simpler tasks, if given the same CPU time, the CPU's timestamp will advance much more quickly than the PPU timestamp, which is beneficial. This means we can probably get away with having the PPU thread have a "dumb spin" while waiting for the CPU to advance. Something like:

Code:
while(ppu_time >= cpu_time);


... or if you want to be a little more intelligent... possibly this instead:

Code:
while(ppu_time >= cpu_time) std::this_thread::yield();


This would be done every time the PPU emulates a cycle, to ensure it doesn't surpass the CPU.
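As a rough sketch, that PPU-side wait might look like this with C++11 atomics (the timestamp names and memory orderings here are made up, not a finished design):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Shared timestamps in master-clock units (names are illustrative).
std::atomic<uint64_t> cpu_time{0};
std::atomic<uint64_t> ppu_time{0};

// Called before the PPU emulates each cycle: never surpass the CPU.
void ppu_wait_for_cpu() {
    while (ppu_time.load(std::memory_order_acquire) >=
           cpu_time.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // the "slightly more intelligent" spin
    }
}
```

Since the counters only ever increase, the worst a stale read can do is make the PPU wait one extra iteration.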

On the CPU side, however, I don't think we'd want to do this. The CPU will have to wait for the PPU to catch up on register accesses so that the two systems are synced. Since the PPU is likely going to take longer to catch up, a dumb spin on the CPU end probably wouldn't be very effective.

I'm thinking that something like C++11's condition_variable could be used. The CPU would effectively sleep until the PPU emits a signal that it has caught up.
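Roughly, the condition_variable version could look like this (a sketch with made-up names; the predicate handles spurious wakeups, but a real design would also need a shutdown path):

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Sketch of the CPU sleeping until the PPU catches up (names are made up).
std::mutex sync_mutex;
std::condition_variable ppu_caught_up;
uint64_t cpu_time = 0;  // both timestamps guarded by sync_mutex here
uint64_t ppu_time = 0;

// CPU side: on a register access, block until the PPU reaches our time.
void cpu_wait_for_ppu() {
    std::unique_lock<std::mutex> lock(sync_mutex);
    ppu_caught_up.wait(lock, [] { return ppu_time >= cpu_time; });
}

// PPU side: after emulating some cycles, wake the CPU if it's waiting.
void ppu_advance(uint64_t cycles) {
    {
        std::lock_guard<std::mutex> lock(sync_mutex);
        ppu_time += cycles;
    }
    ppu_caught_up.notify_one();
}
```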

The same thing could be done for the other subsystems, with each running in its own thread.

My main beef with the catch-up approach was that you'd have to write your PPU logic in a way that it would have to be able to enter and exit at any given cycle. With a separate thread, that's no longer the case. You can write the logic cleanly and straightforward without having to allow logic to be interrupted and restarted later. That's the hope anyway... this is still all theory. I'm not sure how well it'd work in practice.



Anyone have any thoughts?



PS. This system actually would have quite a bit of "thrashing" between threads on things like $2002 spin loops. Maybe it would be wiser to have $2002 status predicted so that the CPU can read it and resume without having to wait for the PPU to catch up. Although that gets tricky with the weird sprite overflow behavior. Maybe the thrashing wouldn't be so bad... I'd have to try it out and see.

by on (#95318)
> Has anyone here made such an emu? I know byuu has. How did you go about it?

I went with cooperative multithreading instead of preemptive multithreading. In English: only one thread actually runs at a time, and it decides when to switch to another thread. But they are actual unique threads, with their own stack frame and all.

For my model, every two processors that communicate share an int64_t clock value, starting at zero. When CPU A runs for X clocks, you add X*CPU B frequency. When CPU B runs for X clocks, you subtract X*CPU A frequency. If the two frequencies are the same (there's only one NES oscillator, so they always are there), remove the multiplication.

The goal is to minimize context switches between threads (they're very slow.)

Allow CPU A to run as long as possible. When it does something that would be visible to CPU B (say A = CPU, B = PPU; in this case, CPU writes to or reads from a PPU register), you check the clock. If it's >= 0, then A is ahead of B. Switch to B here before doing the write. B now runs and keeps subtracting from the clock. B runs until it does something visible to CPU A (let's say PPU sets a flag that the CPU might read), at which time you see if clock < 0. If so, switch back to A. And now A performs that write.
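In code, that clock bookkeeping looks something like this (a rough sketch; the struct and names are illustrative, not lifted from any real emulator):

```cpp
#include <cassert>
#include <cstdint>

// One shared clock per communicating pair. clock >= 0 means A is
// ahead of B; clock < 0 means B is ahead of A.
struct RelativeClock {
    int64_t clock = 0;
    uint64_t freq_a = 0;  // CPU A's frequency
    uint64_t freq_b = 0;  // CPU B's frequency

    // CPU A ran for `cycles` of its own clock.
    void step_a(uint64_t cycles) { clock += int64_t(cycles * freq_b); }
    // CPU B ran for `cycles` of its own clock.
    void step_b(uint64_t cycles) { clock -= int64_t(cycles * freq_a); }

    bool a_is_ahead() const { return clock >= 0; }
};
```

When the two frequencies are equal (as on the NES, with its single oscillator), both can just be 1 and the multiplications disappear.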

I don't do it due to complication, but you could take the idea further: when CPU A writes to CPU B, but is ahead of CPU B, you could store that in a 'read buffer' with a timestamp and keep going. You would then only sync when CPU A -reads- from CPU B while -ahead- of CPU B. Then CPU B would read from your buffer and compensate based on the timestamp. Done right, it could help speed a lot. Done wrong, it may be slower than not doing this at all.

That said, the cooperative run-out-of-order thing works amazingly well on narrow interfaces: CPU<>PPU is a great example, only eight registers that belong to the PPU. You know when the CPU is talking to the PPU. But take SNES SMP + DSP, or CPU + SA1, they share RAM. So even if they're both doing their own things with their own areas of RAM (most likely), you won't have any way to know for sure. Forces you to sync all the time. Threading can get -very- painful in these cases.

-----

Now if you're looking for preemptive instead ... you definitely can't yield threads waiting for events. That requires OS intervention (ring 3 -> 0 -> 3 transitions on x86) ... you'll be able to call yield, at most, ~100,000 times a second, with major latency just waiting for a thread to resume when it should be immediate. The same goes for mutexes, you can't use those.

You are going to have to use your own lock values that you read from and write to using atomic operations. It's the only way you'll get things fast enough for real emulation. So follow something like the cooperative model, but set "lock" flags that do things like: while(atomic_read(cpu_lock) == true);
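Sketched with C++11 atomics, that lock flag might look like this (illustrative only; the data the flag guards would need its own memory-ordering care):

```cpp
#include <atomic>
#include <cassert>

// Hand-rolled lock flag for the preemptive model.
std::atomic<bool> cpu_lock{false};

// Equivalent of: while(atomic_read(cpu_lock) == true);
void wait_while_cpu_locked() {
    while (cpu_lock.load(std::memory_order_acquire)) { /* spin */ }
}

void release_cpu_lock() {
    cpu_lock.store(false, std::memory_order_release);
}
```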

-----

Both approaches have major pros and cons.

Cooperative is great in that you get the totally clean code, you don't have to allow enter+exit at every cycle, you don't have to worry about race conditions or any of that stuff, and all of the operations are in user space so it's super fast. You do have to keep in mind that you will pay a performance penalty compared to switch tables (state machines), but it's not a massive penalty. It's up to what you value more: code speed or code clarity.

Preemptive is the only way to actually take advantage of multicore. So if you want to take advantage of more than one core, this is the only way to do it. The downside is complexity, and all of the cores you use will be pegged at 100%. It may run faster, but it'll drain a lot more watts of power.

My personal opinion: cooperative works great on systems with only one CPU. Preemptive works terribly there. If you are emulating a simple enough system (NES, SNES, GB(C), GBA, etc); why -require- a multi-core processor to use the emulator? Just use cooperative. Easier, quicker, better. If you are emulating a system that likely won't run at full speed on single core systems (be it because the host hardware is slow like an ARM quad core cell phone; or because the emulated system is a beast, like the PS2), you have no choice. Go preemptive.

> Maybe it would be wiser to have $2002 status predicted so that the CPU can read it and resume without having to wait for the PPU to catch up.

Nemesis wrote a Sega Genesis emulator that used something similar to prediction. He'd save that processor's state when a prediction occurred, and then would eventually switch to the other thread. If it turns out the prediction caused a problem, he'd load that processor's state back to the old value to 'unwind' running ahead to fix it.

It was a pretty amazing design, but I will note that it -required- a quad? core CPU. I don't know how much was due to the accuracy of his emulation, and how much was due to cost of his emulation model. I presume it was a little bit of both.

-----

Sorry for rambling, hopefully this helps.

by on (#95321)
I guess you could think of PocketNES as "multi-threaded", in that it uses the GBA's tilemapping hardware to render the screen images based on data collected during the previous frame.

Collected data includes scrolling locations for every scanline, CHR bankswitching changes, sprite table, and of course, map and character data. But not detailed enough (only scanline-level accuracy) to run MMC2/4 games.

PPU interactions are limited to sprite 0 hit. MMC3 is simulated with simple rules based on which pattern tables the sprites and BG use. Accurate enough to avoid shaky screens, not accurate enough to fail on Mario Adventure.

Even an inaccurate emulator can still pass many of the emulation tests. Sprite 0 hit works well, passes all the timing tests and everything, even though it is incorrect for any scanline where you had done unclean horizontal scrolling before the sprite.

by on (#95323)
Thread switching by yield() or similar mechanisms is considered harmful.

http://www.altdevblogaday.com/2012/06/05/in-praise-of-idleness/
http://www.technovelty.org/code/c/sched_yield.html
http://msmvps.com/blogs/peterritchie/archive/2007/04/26/thread-sleep-is-a-sign-of-a-poorly-designed-program.aspx
http://weblogs.asp.net/george_v_reilly/archive/2006/09/13/Never-Sleep_2800_0_2900_-in-an-Infinite-Loop.aspx
http://www.allegro.cc/manual/4/api/timer-routines/rest

by on (#95326)
I'd recommend using timestamping and prediction with some sort of lightweight thread-safe queue. Have functions that peek at OAM and the mapper state to find the earliest possible time that the PPU could have an effect on the CPU (e.g. $2002 change due to sprite 0, mapper IRQ, or Zapper light detection). Run the CPU until that time, storing PPU- or mapper-relevant write timestamps in that queue, and then have it peek again. Then run the PPU until the most recent timestamped write from the CPU. The APU can run similarly, up until the last timestamped write or the next length counter change or the next frame counter IRQ.
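One possible shape for that queue is a lock-free single-producer (CPU) / single-consumer (PPU) ring buffer; the following is a sketch with made-up names and sizes, not a reference implementation:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct TimestampedWrite {
    uint64_t time;     // master-clock timestamp of the write
    uint16_t address;  // e.g. $2000-$2007 or a mapper port
    uint8_t value;
};

template<size_t N>  // N must be a power of two
struct WriteQueue {
    TimestampedWrite buffer[N];
    std::atomic<size_t> head{0};  // consumer (PPU) position
    std::atomic<size_t> tail{0};  // producer (CPU) position

    bool push(const TimestampedWrite& w) {  // CPU thread only
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;  // full
        buffer[t % N] = w;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(TimestampedWrite& w) {  // PPU thread only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        w = buffer[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

The PPU peeks at the front entry's timestamp and only applies writes whose time it has reached.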

by on (#95338)
Thanks for all the responses everyone!

byuu wrote:
Switch to B here before doing the write. B now runs and keeps subtracting from the clock. B runs until it does something visible to CPU A (let's say PPU sets a flag that the CPU might read), at which time you see if clock < 0. If so, switch back to A. And now A performs that write.


I don't see running the PPU ahead of the CPU as an option. Too many things the CPU can do will need to change PPU execution. Any kind of midframe scroll change, bankswitch, etc. If you have the PPU ahead of the CPU and then the CPU makes such a state changing write... you'd have to "rewind" the PPU back to the CPU's timestamp. That doesn't really seem practical.

How do you solve this problem with your cooperative model?

byuu wrote:
My personal opinion: cooperative works great on systems with only one CPU. Preemptive works terribly there.


I guess I see your point. It seems to me, though, that it'd be relatively easy to make a design that works preemptively on a multicore system, but works more like the traditional "catch up" approach for a single core system. The only change you'd need to make is how/when the master thread (CPU) signals to the other threads that they can start running again.

byuu wrote:
why -require- a multi-core processor to use the emulator?


To me it's more about taking advantage of what's readily available. Even low-end single cores these days are at least hyper threaded. And the number of available cores is just going to go up with time.

Partially I'm using this as an exercise to practice writing things for a multithreaded environment, since that really is the way of the future (or rather, the present). But I think it's also practical in this situation.

And it's not like this wouldn't run on a single core system. It just wouldn't get as good of performance. A program getting worse performance on a lower performance machine is not really something I'll lose sleep over.

byuu wrote:
Just use cooperative. Easier, quicker, better


I am very intrigued by your cooperative approach. If you can explain your solution for the PPU running ahead of the CPU problem I brought up above, I'm very heavily considering it.

byuu wrote:
Sorry for rambling, hopefully this helps.


It does! This was exactly the kind of post I was hoping for. Discussions like this are super informative and fascinating. I appreciate you sharing your wisdom/experience.

Bisqwit wrote:
Thread switching by yield() or similar mechanisms is considered harmful.


Many of the articles you linked to seem to say the exact opposite... that yielding should be preferred over dumb looping when waiting in a multithreaded environment. I tend to agree, as well.

tepples wrote:
Run the CPU until that time, storing PPU- or mapper-relevant write timestamps in that queue, and then have it peek again. Then run the PPU until the most recent timestamped write from the CPU


I've considered this in the past, as well, but in a single threaded environment. For some reason it never really sat right with me.

But now that you mention it, it does seem like it would scale up to a preemptive multi threaded environment very well. This would greatly reduce the number of times the CPU would have to sync with the PPU.

Ordinarily the CPU would have to stop and wait for the PPU to catch up on such register writes (although granted, only under very specific circumstances -- like if rendering is enabled and we're not in VBlank). But by logging writes the CPU can just plot ahead at full steam and the PPU can process those writes as they come.

The downside to this is that it requires the PPU to do a lot of additional checking after every cycle. But I don't think that'd be significant.




Something else I just thought about. The PPU acts entirely different when rendering is enabled vs. disabled. If you're running it with a single code path where the rendering state can change at any time, each individual cycle will need to be wrapped in an if(rendering) statement. Furthermore, 'rendering' would have to be volatile.

That seems like a performance killer....

by on (#95342)
Changes to 'rendering' are stored in the queue as timestamped writes to $2001.

by on (#95351)
> I don't see running the PPU ahead of the CPU as an option. How do you solve this problem with your cooperative model?

Sorry, I had to pick CPU<>PPU because the NES doesn't have a separate processor for audio.

With my emulators, I typically don't run the PPU ahead of the CPU, because we'd have to check our clock value after every read of a non-cached register value.

But one of the fun things is that cooperative threading works great even if you can only run one thread ahead of another for a good while. Aside from spin loops (-; lda $200x; bpl -), and hell even then really, you are executing several cycles in the CPU at once. So when we switch back to the PPU, we get to run that many CPU cycles worth of operations before switching back to the CPU.

So basically, your add_clocks() or step() or whatever function for the PPU is inserted after every time-stepping event (like after your nametable read, or whatever); and that function will always switch to the CPU if the PPU is now caught up (clock >= 0)

When both can run ahead of each other (SNES CPU<>SMP): each can run ~100 opcodes ahead of each other. You'll probably end up with a context switch every ~200 executed instructions.
When only one can run ahead of the other (NES CPU<>PPU): one runs ~100 opcodes ahead, then the other catches up. Repeat. You end up with a context switch every ~100 executed instructions.
When neither can realistically run ahead of the other (SNES SMP<>DSP): disaster. You end up with a context switch every cycle.

There are no worst-case scenarios on the NES, at least. But I'll elaborate more on the SNES. Both the SMP (a processor that executes instructions) and the DSP (a sound processor that decodes BRR samples ... you know, you wrote an SNES emu ;) can access APURAM. Both to read it and to write it.

Nearly every cycle for the SMP contains a read or a write to APURAM. Lots of DSP cycles do as well. The odds both are touching the same memory at the same time are slim to none, but it's possible. So without a "rewind" (save state) mechanism like Nemesis had, we have to sync all the time. Since the SMP executes at 1MHz and the DSP at 2MHz, that means we are switching contexts ~2 million times per second (1 million each way.)

Cooperative threading really isn't a model you should use everywhere for consistency. What I do with my compatibility profile is turn the DSP into a state machine. It's simple enough that I only need a single switch() right inside the main loop. Some clever coding hides the fact that it's a state machine, and makes it 'run' like it were a thread.

The main difference is that a state machine -has- to be able to exit at every cycle. So in this case, since we do have to exit at virtually every cycle, we just make that always the case. So instead of co_switch(dsp), we just do: while(dsp.clock < 0) { dsp_step(); dsp.clock += cycle; }

But get a situation like the SNES CPU <> SMP, and cooperative threading really shines. When you have each one running hundreds, if not thousands, of instructions ahead of each other, you get massive speed gains over the traditional state machine even at the opcode precision level. If you consider a full SNES CPU state machine (breaking not only per opcode, not only per cycle, but in the middle of reads and writes for bus hold delays), the threading model turns 3-level nested state machines into linear code. It turns a forced stackless context switch of ~10 million times/second into a stack context switch ~50,000 times a second. So you get a major speed boost and way cleaner code.

It's also easy to get carried away. You may be tempted to make the math unit a separate thread, because they act in parallel on real hardware. This is all well and good, but you can easily bog down your performance this way. Threads are not cheap.

> I guess I see your point. It seems to me, though, that it'd be relatively easy to make a design that works preemptively on a multicore system, but works more like the traditional "catch up" approach for a single core system.

I dislike complexity. Detecting whether a system is single or dual core and acting differently is more complicated, and thus, more error prone.

Also note that when you're pre-emptive, every operation that CPU A can read, CPU B writes have to be atomic, and vice versa. Atomic operations are more costly than regular ones.

> To me it's more about taking advantage of what's readily available.

But is it wise to drive 100% utilization of a quad core to run an NES emulator, when you could easily do it with 25% on one core?

Even if it gets you a higher FPS on that quad core chip, it won't be 400% faster. It'll be more like 20-40% faster or something. And it won't continue to scale, because the NES doesn't have very many chips.

> And it's not like this wouldn't run on a single core system. It just wouldn't get as good of performance. A program getting worse performance on a lower performance machine is not really something I'll lose sleep over.

In the ideal multi-threaded model, two threads would never have to stuff all their non-volatile regs on the stack, invalidate the pipeline, effectively flush much of their data cache, and perform a ring 3 -> ring 0 -> ring 3 transition to swap threads. One would just wait for the other. On a single core system, you have to do all of that. It's not going to be a little bit slower. It's going to be substantially slower.

I feel you really need to understand the costs here. When people talk about multithreading as the way of the future, they're invariably giving you examples like web servers where a thread wakes up, what, ten times a second? A hundred? Maybe a website even gets a thousand hits per second? That is child's play. We are talking about processors that achieve MILLIONS of synchronizations/context switches per second.

Write yourself some simple test programs before you choose your model: just do "dummy CPU A" + "dummy CPU B", and have each one increment a counter and then sync to the other. Watch in horror as the traditional multithreaded model gets you something like 100,000 increments per second. On a 3GHz processor. Then try the same with my cooperative threaded model, and see it reach up to 10,000,000 increments per second. Then do a barebones switch(state) on each, and observe 100,000,000 increments per second. Then try three nested states like you'd need for a complex processor like the SNES, and see that drop to 1,000,000 increments per second.

Once you have your numbers, see how they would work with how many context switches you'd need in the real world for emulating a given system. Realize that those numbers are -without- any actual emulation. This is just the overhead of keeping things synchronized.
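A skeleton for that kind of test might look like this, using the preemptive model with an atomic "turn" flag (illustrative only; time run_benchmark() from the caller and compare against a cooperative or state-machine version):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>

std::atomic<int> turn{0};

// Dummy "CPU": increment a counter, then sync to the other CPU.
void dummy_cpu(int id, uint64_t iterations, uint64_t& count) {
    for (uint64_t i = 0; i < iterations; i++) {
        while (turn.load(std::memory_order_acquire) != id)
            std::this_thread::yield();
        count++;                                        // one "emulated" step
        turn.store(id ^ 1, std::memory_order_release);  // hand off control
    }
}

// Runs both dummy CPUs to completion; with zero actual emulation work,
// the elapsed time is pure synchronization overhead.
uint64_t run_benchmark(uint64_t iterations) {
    turn.store(0);
    uint64_t count_a = 0, count_b = 0;
    std::thread a(dummy_cpu, 0, iterations, std::ref(count_a));
    std::thread b(dummy_cpu, 1, iterations, std::ref(count_b));
    a.join();
    b.join();
    return count_a + count_b;
}
```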

Also, you still seem interested in yield()'ing a thread. Be sure to try that with your tests above. You will -not- be able to yield a thread ten million times a second. It will never happen.

> I am very intrigued by your cooperative approach. If you can explain your solution for the PPU running ahead of the CPU problem I brought up above, I'm very heavily considering it.

Please take a look at my emulator source. fc/cpu and fc/ppu.

Or for the lazy ;)
http://pastebin.com/t3id0NP7

That's my cycle-accurate PPU renderer (it doesn't handle the crazy sprite fetching stuff blargg found.) It looks exactly like a scanline-based renderer, and is written exactly the same way.

And be sure to look at the performance. I get ~300fps on my Core i7. That's very, very bad for an NES emulator. A lot of it is due to audio processing at 1.78MHz with a polyphase sinc resampler, though. But even at ~500-600fps, that'd still be very bad for an NES emu.

by on (#95354)
If anything, the following can run in a separate thread because the needed one-way synchronization is especially suited to a lock-free queue:
  • APU tone generation (because $4018-$401A is disabled in a stock NES)
  • Audio filtering, preferably using blargg's blip-buf or another implementation of band-limited step resampling
  • NTSC filtering or Scale2x filtering or both (yes, "both" is possible using something like Super NES hi-res mode)

by on (#95355)
@byuu:

Your code looks exactly like what I had envisioned when I wrote my first post. That's cooperative? I must be misunderstanding the difference between cooperative and preemptive.

You're running the PPU in a simple linear fashion without having to enter/exit at every cycle. Your 'tick' function I assume checks to make sure you aren't running ahead of the CPU, and waits for the CPU to get further ahead before resuming.

This is more or less exactly what I had planned. Adding onto that tepples' suggestion of queuing register writes -- and you would greatly decrease the number of times the threads would have to sync.


I'll go over this more when I get home from work tonight.



@tepples

Yes, but changing whether rendering is enabled or not will change the logic flow of the PPU, so the logic would have to be wrapped in a conditional.

Bah I'm not explaining it well. I'll post again later.

by on (#95356)
Cooperative threading seems to simplify the source code by an order of magnitude compared to a typical state machine.
But how do you handle save states? The thread context can't really be saved. Did you decide on some sort of "safe" point (e.g. start of vblank) where all the threads can be told to stop/resume simultaneously?
Also, how feasible is it to implement this on other platforms such as Android or Wii?

by on (#95387)
> Your code looks exactly like what I had envisioned when I wrote my first post. That's cooperative? I must be misunderstanding the difference between cooperative and preemptive.

Cooperative is just like preemptive, only it's much easier. You never have to worry about dead locks or race conditions. You don't have to perform atomic operations, and you have absolute control over the scheduler. Context switches are a hundred times faster. The literal only drawback is that you only use one real processor core this way. A lot of software has cooperative modes, HTTP servers and SQL databases often include this. You use that mode when the cost of switching contexts becomes more expensive than the work being done. Rare for web servers and databases, very very common for emulators.

But yeah, the idea is the same in both. You have:

Code:
void CPU::main() {
  while(true) {  //threads 'run forever', they switch out control on occasion though
    while(interrupt_pending) execute_interrupt();
    execute_opcode();
  }
}


Code:
void PPU::main() {
  while(true) raster_scanline();
}


Code:
void APU::main() {
  while(true) {
    run_square();
    run_triangle();
    run_dmc();
    run_sequencer();
    //.. made up example, it's a lot more complex, obviously
  }
}


Every thread has its own main(), and it's as if that main function -is- the entire program. It can call 30 levels deep into functions, and switch to another thread right then and there. No need for state machines. When you switch back, it resumes right where it left off.

> Your 'tick' function I assume checks to make sure you aren't running ahead of the CPU, and waits for the CPU to get further ahead before resuming.

Yep, this is your typical tick():

Code:
void PPU::tick() {
  clock += 4;  //4:1 PPU to CPU
  if(clock >= 0) co_switch(cpu.thread);  //we always sync no matter what when PPU is ahead, since it's hard to run the PPU ahead of the CPU
}


In my emulator, I set and clear IRQ flags based on X/Y position too, but that's kind of a lazy hack :P

> This is more or less exactly what I had planned. Adding onto that tepples' suggestion of queuing register writes -- and you would greatly decrease the number of times the threads would have to sync.

I believe I mentioned register write timestamps in my first post ... although perhaps tepples' method is easier than mine?

But yes, if you queue writes with a timestamp, you don't have to switch threads. It's only the reads that can affect the way eg the CPU executes.

> Cooperative threading seems to simplify the source code by an order of magnitude compared to a typical state machine.

It's a little better, but practically identical if you can put all your logic in a single function with a single switch/case. Most chips are not that simple, but NES APU + SNES DSP are. SNES SMP is close.

But when you get into cases like the SNES CPU that can execute an HDMA in the middle of a 10-frame DMA in the middle of a bus read delay inside the middle of a cycle right inside the middle of an instruction ... yeah, it can become several orders of magnitude more simple than nested state machines. Even if you think you can do nested state machines, you'll often end up with subtle bugs you don't realize. I went that route for 2-3 years and could tell you horror stories. Again though, there's nothing in the NES that complicated.

> But how do you handle save states? The thread context can't really be saved. Did you decide on some sort of "safe" point (e.g. start of vblank) where all the threads can be told to stop/resume simultaneously?

You lose just a tiny bit of accuracy when you save a state, unfortunately. But in the five years I've had save states, I've never had a problem reported because of it.

How I do it: you pick a master thread. This is the most complicated, and most delicate, thread of all. The one that tends to run the longest without stopping. Usually it's the CPU. But on the NES, I actually make it the PPU. All other threads are slaves.

So the user asks to save a state. We set a flag and everything proceeds to run as normal. As soon as we hit the entry point of eg CPU::main(), we can exit that thread right then. The magic is that as long as you save all of the CPU class variables and load them again, the code executes from the same point as before. You can even delete and recreate the CPU thread, it doesn't matter.

Here is where we lose a tiny bit of accuracy. Now we run each slave thread, one by one. With the rule that it cannot synchronize with other chips again. So even if they read something that the CPU might change ... too bad. Allow the read anyway, and allow us to reach eg APU::main()'s top. Serialize that thread too. Repeat until all threads are done. In practice, the -worst case- desync is the longest time a thread can run for. For the SNES SMP, that's one instruction. For the NES APU, it's one audio sample.

This sucks, I know. But I can guarantee you it's not a problem in practice. Not even for TAS'ers that save hundreds of times per frame, as bsnes is the #1 emulator at tasvideos for SNES. It basically means your emulator goes from cycle accuracy to opcode accuracy for -one instruction- at the very, very worst case. Usually it doesn't even matter. I've rigged bsnes to save a state after every single instruction, and was able to play any game I tried with no visible issues.

The longest it takes to save a state is the longest it takes every thread to reach the entry point. So for most systems, that ends up being one PPU scanline. So for -extreme- debugging, this is probably not acceptable. You should keep that in mind.

> Also, how feasible is it to to implement this on other platforms such as Android or Wii?

You have to have a cooperative threading library available for any target you want.

I use (and wrote) libco. Lots of people have helped me (Aaron Giles, Bisqwit, Nach, vladitx, blargg, etc.)
It's used by bsnes (NES+SNES+GB+GBC+GBA); twoMbit (SMS); daifukatt.su (arcade); and MAME/MESS (sparingly so far.)
It works on x86 and amd64 (Windows, OS X, Linux, BSD); PowerPC 32-bit and 64-bit (OS X, Linux, BSD, PS3, Wii); MIPS (Loongson); SPARC (although it only runs well if you disable register windows here); ARM (Raspberry Pi); etc. It has assembly implementations for many processors, a Windows Fibers implementation that runs on any version of Windows, and a setjmp/longjmp version that works on virtually any Unix-alike operating system.

But if you're targeting a low power device, you'll want to be careful. State machines, especially at one level deep, are typically quite a bit faster. If you want a highly accurate emulator plus clean code, you may have trouble getting full speed on low powered systems.

by on (#95389)
How is libco different from pushing a big char array on the stack and using setjmp/longjmp?

by on (#95393)
> How is libco different from pushing a big char array on the stack and using setjmp/longjmp?

longjmp only modifies the PC. You can push a big block of memory on the stack and go to a new "thread", but how do you return from it when you're done? You can't pull a big char array back off the stack arbitrarily, nor can you really keep track of how many bytes you're supposed to push and pull for any arbitrary point you'd longjmp at.

libco, however, is very very simple. It swaps all non-volatile registers, including the stack pointer and program counter.

Code:
; fastcall ABI:
; return = eax
; arguments = ecx,edx
; non-volatile registers = ebp,esi,edi,ebx

co_swap_fastcall(to = ecx, from = edx):
  mov [edx],esp
  mov esp,[ecx]
  pop eax ;get return address as soon as possible (so the CPU can start caching) [big boost over 'ret' at the end]

  mov [edx+ 4],ebp ;push / pop is slower on most archs
  mov [edx+ 8],esi ;with the notable exception being the Pentium 4
  mov [edx+12],edi
  mov [edx+16],ebx

  mov ebp,[ecx+ 4]
  mov esi,[ecx+ 8]
  mov edi,[ecx+12]
  mov ebx,[ecx+16]

  jmp eax


Twelve instructions. You can do the rest (thread creation and deletion) in ISO C.

by on (#95394)
Really? Longjmp is just a PC reassignment? Never seen that before. On Newlib for ARM, longjmp swaps all registers, including the stack pointer, but not r0-r3. I'm not as familiar with other implementations of setjmp/longjmp.

by on (#95397)
Well, technically yes, it appears to save and load SP.

If you want a more in-depth explanation than I can give:
http://en.wikipedia.org/wiki/Setjmp.h#C ... imitations
Quote:
If the function in which setjmp was called returns, it is no longer possible to safely use longjmp with the corresponding jmp_buf object. This is because the stack frame is invalidated when the function returns.


To be honest, the setjmp/longjmp version in libco was written by Nach. I'm not a true expert on that.

But ultimately, there's a reason why there are dozens, if not hundreds, of cooperative threading implementations, rather than people just using a simple alloca() + longjmp() pair.

These are true threads. You don't have setjmp+longjmp, you just have switch_to(target); and you can easily have any number of them, switching in any order from any source to any target, and they don't have conditions where they become 'invalidated.'

by on (#95400)
Okay, I think I finally get it @ byuu. I'm starting to agree with you that cooperative threading would be better for an NES emu. This is very interesting.

I might still try it both ways and see which I like better. I'm not entirely ready to rule out preemptive because I still think it might get better performance on multicore machines.

Detecting multicore machines (in theory) is simple with C++11 (std::thread::hardware_concurrency), but I don't think that has been widely implemented yet. I suppose I could fall back to boost for it.

Allowing for both in the same emu wouldn't be complicated at all. You'd just have to abstract the switching mechanism.

I'll have to play around with it. But this is definitely sounding like a very fun project.


@ tepples: The problem about rendering being enabled/disabled that I was trying to illustrate before is that you have to wrap each cycle in an if statement:

Code:
tick();
if( rendering_on )
{
  // fetch tiles, render pixel normally, adjust scroll, etc
}
else
{
  // render bg pixel, don't fetch anything, no scroll updates
}


Might not seem like a big deal, but if you have to do this for every cycle it will get ugly very fast.

Interestingly, it doesn't look like byuu is doing this in the source he posted. How are you handling mid-scanline rendering disables/enables @ byuu?

by on (#95401)
Disch wrote:
The problem about rendering being enabled/disabled that I was trying to illustrate before is that you have to wrap each cycle in an if statement:

Code:
tick();
if( rendering_on )
{
  // fetch tiles, render pixel normally, adjust scroll, etc
}
else
{
  // render bg pixel, don't fetch anything, no scroll updates
}

The PPU can be split into a chain of several stages. One generates indices into $3F00-$3F1F; the operation of this phase varies greatly based on whether or not rendering is turned on. The rest of the stages are fairly self-contained: palette lookup, the monochrome bit, the tint bits, NTSC filtering and/or other scaling, and Zapper light detection. None of these stages depends at all on whether rendering is turned on for a particular dot, and thus they're prime candidates to run in another preemptive thread.

by on (#95402)
I don't really see the benefit to moving everything after palette lookup/tint to yet another thread (other than *maybe* whole-image filtering, but that's probably better left to a pixel shader anyway).

Besides, that doesn't solve the issue of one thread (the one actually determining which pixel the PPU is outputting, before palette/tint/etc) having to run two very different flows of logic.

by on (#95403)
Disch wrote:
I don't really see the benefit to moving everything after palette lookup/tint to yet another thread (other than *maybe* whole-image filtering, but that's probably better left to a pixel shader anyway).

Which illustrates my point. A pixel shader itself is another thread, albeit one that runs on the GPU if your GPU is sufficiently powerful.

Quote:
Besides, that doesn't solve the issue of one thread (the one actually determining which pixel the PPU is outputting, before palette/tint/etc) having to run two very different flows of logic.

That's the same problem whether you use threads or not. The idea is that you run one flow of logic in a loop until the timestamped event that causes the switch from rendering logic to not-rendering logic or vice versa. If anything, this will be the switch to not-rendering logic at the start of line 240, the post-render scanline.

by on (#95404)
Quote:
That's the same problem whether you use threads or not.


The difference is, with a state machine you can write two entirely different functions, and only call the one which is appropriate.

With a separate thread, you can't make a "rendering disabled" thread and a "rendering enabled" thread, because the whole point is that you want to be able to write the code without having to enter/exit at any cycle.

by on (#95410)
> I might still try it both ways and see which I like better. I'm not entirely ready to rule out preemptive because I still think it might get better performance on multicore machines.

My preference for cooperative threading is based on dozens of hours of testing each model.

But I'm far from perfect, and I've never actually implemented a preemptive threaded emulator. So I say go for it! Perhaps you'll end up convincing me to implement a preemptive model, too :D

I'd love to see a model that allows cooperative and preemptive as a compile-time switch. It's definitely something I've thought about. Keep in mind again that although the underlying code can likely be identical, there will have to be some differences.

Preemptive means "threads can start and stop on their own", and may not exist on their own cores. Make sure when you use preemptive that you have a -real core- for each thread. Otherwise fall back to cooperative, which will be faster. You may also want to wrap your atomic accesses. Your preemptive code will have deadlocks and race conditions that the cooperative code won't. But once the cooperative model works with appropriate locks to maintain synchronization, making it preemptive should be easier.

[In a sense: preemptive was really made for the era where most people had one core and didn't want a hung cooperative program to crash their OS. We really need a new name for the model of one true core per thread. We really don't -want- preemption in this case. Although there's no way to 100% own a core yet, that I know of.]

Just be sure to test context switching heavily with a skeleton emulator first. Understand the speed of each approach before you write a full emulator, so that you know how much overhead is spent there.

> How are you handling mid-scanline rendering disables/enables @ byuu?

raster_pixel(), raster_sprite(), scroll(xy)_increment(), etc bail out early or blank the pixel when raster_enable() [bg+sprites on] is false.

by on (#95425)
Quote:
I'd love to see a model that allows cooperative and preemptive as a compile-time switch


I was even thinking as a runtime option, rather than a compile-time switch. If you abstract the thread switching behind a common interface, the only thing you'd have to change is which implementation you're going to use. That can be easily accomplished at runtime.

But of course you'd have to write the code under the assumption that it will be running as preemptive, which might hinder coop performance.

But this is all stuff I want to try out. This is kind of exciting.


Thanks for all the ideas and feedback everyone!

by on (#95533)
Okay I have a basic plan outlined, but haven't actually tried it out yet.

Context switching is abstracted behind a ContextManager class, with 2 implementations: PreemptiveContext and CooperativeContext.

Since I haven't really looked at byuu's coop lib yet, I kind of made some assumptions about how it worked. Maybe that will bite me in the ass, but I think I should be alright:

(note of course the classes are incomplete (missing ctors), but hopefully the idea is illustrated)
Code:

#ifndef SCHPUNE_CONTEXTMANAGER_H_INCLUDED
#define SCHPUNE_CONTEXTMANAGER_H_INCLUDED

#include <functional>   // std::function ('function' here is std:: with C++11, boost:: otherwise)

namespace schpune
{

class ContextManager
{
public:
    virtual ~ContextManager()       { }

    virtual void Wait() = 0;
    virtual void WaitUntil(ContextManager* switchto, function<bool()> pred) = 0;
    virtual void Suspend() = 0;
    virtual void Resume() = 0;
    // virtual void Join() = 0;

protected:
    ContextManager()                { }         // needed: declaring the copy ctor
                                                //  suppresses the implicit default ctor

private:
    ContextManager(const ContextManager&) { }   // no copying
    void operator = (const ContextManager&) { } // no assigning
};

}

#endif // SCHPUNE_CONTEXTMANAGER_H_INCLUDED

Code:

#ifndef SCHPUNE_PREEMPTIVECONTEXT_H_INCLUDED
#define SCHPUNE_PREEMPTIVECONTEXT_H_INCLUDED

#include "contextmanager.h"

namespace schpune
{

class PreemptiveContext : public ContextManager
{
public:
    virtual void Wait()
    {
        mParent->mCV.notify_all();
        if(mSuspended)
            WaitUntil( mParent, [this] () { return !mSuspended; } );
        else
            threadlib::this_thread::yield();
    }

    virtual void WaitUntil(ContextManager* switchto, function<bool()> pred)
    {
        static_cast<PreemptiveContext*>(switchto)->mCV.notify_all();
        threadlib::unique_lock<threadlib::mutex> lock(mMutex);
        mCV.wait(lock, pred);   // wait() takes the lock and the predicate as separate arguments
    }

    virtual void Suspend()
    {
        mSuspended = true;
    }

    virtual void Resume()
    {
        mSuspended = false;
    }

private:
    PreemptiveContext*              mParent;
    threadlib::mutex                mMutex;  // note, 'threadlib' is std if C++11 threads supported, or boost otherwise
    threadlib::condition_variable   mCV;
    volatile bool                   mSuspended;
};

}

#endif // SCHPUNE_PREEMPTIVECONTEXT_H_INCLUDED

Code:

#ifndef SCHPUNE_COOPERATIVECONTEXT_H_INCLUDED
#define SCHPUNE_COOPERATIVECONTEXT_H_INCLUDED

#include "contextmanager.h"

namespace schpune
{

class CooperativeContext : public ContextManager
{
public:
    virtual void Wait()
    {
        // TODO: switch to parent
    }
    virtual void WaitUntil(ContextManager* switchto, function<bool()>)
    {
        // TODO: while(!pred()) switch to 'switchto'
    }
    virtual void Suspend()
    {
        // do nothing - suspension is not required with cooperative threading
    }
    virtual void Resume()
    {
        // do nothing - suspension is not required with cooperative threading
    }

private:
    // TODO - look up byuu's cooperative threading lib and implement
};

}

#endif // SCHPUNE_COOPERATIVECONTEXT_H_INCLUDED




For purposes of this discussion, there are 3 threads: NES, CPU, and PPU. The NES is the one the user will be interfacing with, and the CPU/PPU will only be referenced inside of the NES.

PPU's context parent is CPU
CPU's context parent is NES
and NES's context parent is null.

The intended flow is like so:

When the NES is called to emulate a frame:
- Call CPU::FrameStart (not shown above)
- WaitUntil( CPU_context, cpu_is_finished_with_frame && ppu_is_finished )


CPU logic on FrameStart:
- Resume current context
- Resume PPU context

CPU logic (in CPU's thread):
- emulate up to a given timestamp
- on writes where it's necessary to sync with PPU, WaitUntil( PPU_Context, ppu_is_caught_up )
- when frame is complete, Suspend()


PPU logic:
- emulate up to CPU's current (realtime) timestamp
- if PPU catches up to CPU, Wait()
- when frame is complete, Suspend()




Additionally, the Suspend/Resume bits can be used for implementing a debugger with breakpoints (when a breakpoint is hit, Suspend() all contexts except for NES's)



But like I said it's so far untested. Against byuu's advice I'm kind of jumping right in with this rather than doing smaller tests first. It's mostly an educational/experimental project anyway, so if it doesn't turn out I won't feel so bad.

Anyway, thought you guys might be interested. If you spot any problems with the format let me know. Other feedback/critique also welcome.

by on (#95534)
Well, here's the entire libco API library:

Code:
void* co_active();
void* co_create(unsigned stack_size, void (*entrypoint)());
void co_switch(void *target);
void co_delete(void *target);


I know it's really huge and complicated ... but bear with me here :P

co_active() returns a handle of the active thread. This even happens for the main program thread (and is in fact the only way you can jump back to it later. Store it when your program starts.)

co_create() makes a new stack, with however many bytes you want. 1M or less is more than enough. C++ apps usually allocate 512K or 1M. Go too small (256 bytes or something absurd) and expect crashes. Do not return from the entrypoint, that won't destroy the thread. It can't, because that thread is active.

co_switch() will save the context in the active thread, then load the context from the target thread, and resume processing there. Don't switch to the already active context. It'll probably be a no-op, but don't be stupid.

co_delete() will free the memory of a created context. Do not try and delete the main context, do not try and delete the current context. If you do, things will blow up, and you will deserve it.

libco doesn't have any concept of parent<>child relationships. They are all siblings. It also doesn't pass around parameters to entry points. There is no scheduler built-in. You can do all of this with wrapper code that you create. (Note: parameters to entry points are easy, use a std::map<cothread_t, ParameterList>) The idea is for you to do things your way, and leave libco simple and easy to port to other targets. There's also a handy-dandy typedef void* cothread_t; ... you should NEVER stab at the raw memory, of course.

The way I do my scheduler:

cothread_t host; //this is the thread that the GUI runs in
cothread_t active; //this is the currently executing -emulation thread-
cothread_t cpu, ppu, apu;

On reset, co_delete() all threads; co_create() all threads [gets us to entry point on all threads]; active = cpu; reset all variables in all classes
When you call Emulator::runForOneFrame(), it does co_switch(active), and then the core keeps switching between threads as it needs to.
Once an entire frame has been rendered, it does active = co_active() [so it knows where to re-enter], co_switch(host), and the host thread calls videoRefresh(ppuVideoData); and returns.

You're free to do things however you want, of course.

by on (#95541)
I figured it would be simple like that. Very nice.

I'm going to give preemptive a run first and see how it goes. Then I'll work on getting your coop lib in there (should be VERY easy to drop in, I've already abstracted around it). Once both are in side-by-side comparisons will be easy. I gotta say I'm very curious as to how each approach is going to go.

by on (#95585)
Dwedit wrote:
Really? Longjmp is just a PC reassignment? Never seen that before. On Newlib for ARM, longjmp swaps all registers, including the stack pointer, but not r0-r3. I'm not as familiar with other implementations of setjmp/longjmp.

No, you're correct, it's not just PC reassignment. longjmp saves the current registers on the stack, then swaps PC with another thread's PC and restores its registers off the stack. This is what libco does as well. longjmp()-based threading can work on some platforms, since it's doing the same thing, though it's undefined behavior since you're using longjmp() in an illegal way. The main trick is getting each thread a separate stack. "Portably" you can allocate a large char array on the stack, as you mentioned, then divide it up with a recursive function with a smaller char array on its stack, noting where each stack starts.

One snag is that it doesn't save volatile registers, which may mean that local variables not declared volatile may lose changes to their values after setjmp(). byuu's library preserves all local variables even if not declared volatile.

by on (#95682)
Well, I got preemptive and cooperative multithreading in as a compile-time switch now. Can't speak to performance on either one of them yet, as my preemptive implementation still has a few kinks I need to work out.

libco was EXTREMELY easy to drop in, though. Took me all of 10 minutes.

Kudos, byuu.


In other news, basic CPU/PPU are working. No sprites or attributes currently. I'll probably do sprites next.

by on (#95801)
Awesome job! Can't wait to see cooperative vs preemptive numbers on the full emulator. Especially on quad cores and higher. If it's a huge improvement, I may want to ask for your help in a wrapper library to allow cooperative or preemptive as a switch, so I can use it too :D

by on (#95831)
blargg wrote:
Dwedit wrote:
Really? Longjmp is just a PC reassignment? Never seen that before. On Newlib for ARM, longjmp swaps all registers, including the stack pointer, but not r0-r3. I'm not as familiar with other implementations of setjmp/longjmp.

No, you're correct, it's not just PC reassignment. longjmp saves the current registers on the stack, then swaps PC with another thread's PC and restores its registers off the stack. This is what libco does as well. longjmp()-based threading can work on some platforms, since it's doing the same thing, though it's undefined behavior since you're using longjmp() in an illegal way. The main trick is getting each thread a separate stack. "Portably" you can allocate a large char array on the stack, as you mentioned, then divide it up with a recursive function with a smaller char array on its stack, noting where each stack starts.

One snag is that it doesn't save volatile registers, which may mean that local variables not declared volatile may lose changes to their values after setjmp(). byuu's library preserves all local variables even if not declared volatile.


Wow, its been a long time since I dropped by here and I'm so happy to see the same crowd still obsessed with this stuff. :lol:

An important note about the co-operative multithreading: If you allocate your own stack area (as libco's co_create does) then you MUST BE CAREFUL not to make any OS calls while running that co-thread! Some of them may work, but there is no guarantee. On at least some versions of Windows (95 through XP at least, I think) any Win32 API function that goes into kernel space will check that the user-mode stack pointer is inside the stack area that Windows allocated for the thread. If you have allocated your own stack space elsewhere and set the stack pointer to it, it will raise an SEH exception or something. (I encountered this ages ago while working on a Win32 emulator for a mobile device which ran a cooperatively-threaded OS).

I don't know if this still applies to newer versions of Windows (Vista+) or to other systems like OSX, but the safest thing to do is never make OS calls while your stack pointer has been swapped out to some other stack area besides the OS-allocated stack supplied for that OS thread. To be safe, all you have to do is use the original host cothread (from co_active) for running the GUI and use your other co_create-d cothreads just for emulator code. Any file access, etc. should also be done from the host co-thread.

Also: I am a big fan of cooperative multithreading for emulators, and I am highly skeptical of this preemptive multithreading idea you guys are discussing. Preemptive thread switches are thousands of times more expensive than a libco co-operative switch, and Windows thread switching has all sorts of gotchas that will surprise you.

Read this post, for example. High-performance multi-threaded programming is painful and complicated. Here's an index of that guy's posts on the topic.

Be EXTREMELY careful if you decide to try and implement your own lock-free data structures or primitives! This is a subtle art filled with unexpected gotchas, and you will waste weeks or months trying to get this right (or more likely, you won't get it right and there will be one-in-a-million crash bugs in every future version of your preemptive emulator).

And be even more careful if you decide to use somebody else's. DO NOT trust any lock-free implementations you find on the Internet or even in academic research papers--nearly all of them have bugs or race conditions, or make assumptions about memory effect ordering that are not always true on real hardware, etc. In my day job as a game engine programmer, I've watched an extremely smart programmer develop a set of working lock-free primitives for our game engine, and it was a six-month process: he would be positive he had a working implementation, it would apparently work for around a month, until finally a rare-and-weird crash would be traced back to the lock-free stuff and he'd find yet another extremely obscure and impossible-to-reproduce race condition that none of us had thought of, and re-design the thing yet again. This happened at least 5 times, and every single published paper we read about lock-free data structures contained errors of this type. Believe me, lock-free stuff can be a real minefield and sensible programmers should stay the hell out of it! ;)

by on (#95880)
Thanks byuu.

I'm switching periodically back and forth between the two just to try them out. I think I have most of the stability issues with preemptive resolved (it used to hang every once in a while, but since a recent fix in how ppu<->cpu are synced at the end of the frame, I haven't gotten it to hang again no matter what I tried).

No actual "numbers" yet, but so far it's looking like cooperative gives better performance on my machine, as I notice a significant speed difference in Excitebike when running the two modes. HOWEVER! I am doing incredibly dumb unoptimized syncing at this point (sync after every single reg read/write no matter the state), so there is a LOT of context switching going on, so this is pretty much what I expected. I'm not giving up on preemptive yet -- once I put in some smarter code to make it so I don't have to sync as often, I'll do some proper tests.

Plus... preemptive is way easier to debug since I can switch thread contexts in the debugger. So even if it turns out not to be practical from an end user standpoint, I'm glad it's there ;)



@ mozz:

I was planning to implement a simple lockless queue to track register writes. It would basically work like this:

Code:
// in ppu thread:

if( ppu_pos != cpu_pos )
{
  // read and act on queue[ppu_pos]
  ++ppu_pos;
}

// in cpu thread, on register write
queue[cpu_pos] = whatever;
++cpu_pos;


What kind of gotchas would bite me here? And is there a way around them?

by on (#95889)
Heya mozz, long time no see!

> you MUST BE CAREFUL not to make any OS calls while running that co-thread!

It is a good idea to be overly cautious, sure. You can keep the cothreads almost entirely based around only your own code, and back out to the real main thread to do OS interaction. But this really isn't an issue on x86/amd64.

I've had my PPU call interface->videoRefresh(), which invokes the GUI, which calls DirectX + window geometry querying stuff. Same for input that polls the keyboard, mouse and gamepads, and audio which writes to the sound card. I *definitely* use lots of C-level file I/O stuff in the core even now. And bsnes has been run on x86, amd64, PPC32, PPC64, MIPS, ARM, SPARC; Windows, Linux, OS X, Nintendo DS, PS3, Loongson, and FreeBSD.

I can see a secure environment doing wildly inefficient things like stack bounds checking, but desktop OSes don't really do that. Also, I have a Windows Fibers backing, and I'd be absolutely shocked if fibers didn't let you call other API functions.

> I am highly skeptical of this preemptive multithreading idea you guys are discussing

I concur, said as much at the beginning. I have very little hope of seeing good performance numbers from the preemptive version. But you never know unless you try! Having a real-world, full-fledged example is great, if someone is willing to put the work in.

That said, I do want to add that Disch's abstraction to allow both types is going to have a ton of overhead in and of itself. We're talking about millions of switches a second here. I wouldn't be surprised to see a 20-40% speed boost by using libco without a wrapper.

> Plus... preemptive is way easier to debug since I can switch thread contexts in the debugger.

Neato! I never use debuggers, but that does sound like it'd be really helpful. Valgrind goes crazy when it sees my userspace stack switching.

> once I put in some smarter code to make it so I don't have to sync as often, I'll do some proper tests.

That will indeed help a lot. Cooperative threading can actually outperform state machines by a large margin when you are able to run each thread far longer than a state machine could. But swapping every cycle, it's a very clear loser in terms of performance.

by on (#95897)
Quote:
We're talking about millions of switches a second here.


I'm not quite switching every cycle... just every register access and at end of frame.

I plan on eventually implementing the following to prevent unnecessary switching:

1) detect common $2002 polling loops and run the PPU ahead until 2002 status changes.

2) Queue register writes and have the PPU dequeue them and actually perform them as it runs, rather than force it to sync up on every register write.

3) Only sync up on 2004/2007 register reads if the read will affect (or be affected by) ppu behavior (ie: if rendering is disabled and the ppu address is outside the $3Fxx range, there is no need to sync)


So the only things I'll have to sync for are:
- once for each 2002 poll loop
- crazy stuff like 2004 or 2007 poll loops while the PPU is rendering (damn you, Micro Machines)
- mapper reg writes that affect ppu execution (mirroring/chr switching)
- end of frame

I'm estimating maybe 8-12 syncs per frame on average, with 2 context switches per sync. I don't think that will be too hard on performance. But I won't know until I actually get there!

Quote:
That said, I do want to add that Disch's abstraction to allow both types is going to have a ton of overhead in and of itself.


"Tons" is an overstatement. All I'm doing is putting libco function calls behind an abstract base class. The extra overhead is a pointer dereference and a virtual function call. Hardly performance killers. Especially after the amount of syncing is minimized.

The bigger hindrance to performance is probably due to making the timestamp counters volatile.


Quote:
Neato! I never use debuggers


Good god, man. How do you function? XD

by on (#95898)
> 2) Queue register writes and have the PPU dequeue them and actually perform them as it runs, rather than force it to sync up on every register write.

Please be sure to do benchmarks before you implement this and immediately after. I'm very curious about this one.

There's a high probability that the queue + time compensation will outweigh the benefit of fewer switches. I'm not really sure here.

> I'm estimating maybe 8-12 syncs per frame on average

That is realllllly ambitious, but we will see :D
I take it your APU is a state machine? Fine if it is, mine isn't solely for consistency.

My SNES core syncs between 200,000 and 6,000,000 times a second on average (tons of games do PPU / APU spin loops); highly dependent upon the game. So ~600 is something else.

Hell, maybe NES really only needs that many on average, I haven't really messed with it. I just did the "sync whenever ahead" thing there too, as I already get 400+fps on my system, and I don't expect anyone to use my NES emulation.

> Hardly performance killers. Especially after the amount of syncing is minimized.

Well yeah, at 8-12 I agree. In the hundreds of thousands, it adds up.

> The bigger hinderance to performance is probably due to making the timestamp counters volatile.

Definitely add:
#ifdef PREEMPTIVE
#define PVOLATILE volatile
#else
#define PVOLATILE
#endif

> Good god, man. How do you function? XD

Lots of well-placed printf() statements :D
Very rarely I'll use gdb bt, but that works with cothreads just fine, amazingly enough.

by on (#95901)
Quote:
Please be sure to do benchmarks before you implement this and immediately after. I'm very curious about this one.

There's a high probability that the queue + time compensation will outweigh the benefit of less switches. I'm not really sure here.


I will.

I suspect queuing will do more for preemptive than it will for cooperative, as switching is a lot more costly there.

Quote:
That is realllllly ambitious, but we will see :D


From what I've seen of typical NES games, most of them don't really do a lot of PPU/CPU sync. But I'm partly pulling that number out of my ass, so yeah maybe it is a bit optimistic. We'll see indeed.

Quote:
I take it your APU is a state machine? Fine if it is, mine isn't solely for consistency.


I currently have no APU, but eventually I'll make it another thread. I was just speaking about CPU<->PPU syncing previously. With the addition of an APU, there will of course be more switching.

I don't plan on adding any other threads, though. I thought about doing one for a mapper, but most mappers don't need it and those that do can operate very easily with a state machine, as their logic isn't very complex.

Quote:
Definitely add:
#ifdef PREEMPTIVE
#define PVOLATILE volatile


Well my ultimate goal was to make this a runtime option.

by on (#95902)
How do you plan to handle IRQs triggered by the PPU? MMC3 and MMC5 watch PA12 or PA13 to count scanlines. If you're running the CPU ahead, you'll need to estimate how many scanlines ahead the IRQ is going to occur.

by on (#95904)
IRQs will be precalculated and predicted. I've always done it this way so this isn't really new.

AFAIK there are 3 different PPU-triggered IRQs:

1) A12 rises (MMC3, mapper 90)
2) MMC5's scanline counter
3) mapper 90's "ppu read" counter (not used in any games?)


All of which are relatively easy to predict. Though state changes midframe would complicate them and probably require additional syncing.

For example, a game switching sprites from using $1000 for CHR to using $0000 normally wouldn't require a sync, because that write could be queued. But since it would change the A12 prediction, it probably would have to sync.

But honestly I'm not concerned about mapper details yet. I don't expect many complications in working that in.

by on (#95907)
I very much enjoyed implementing the mappers as true independent processors with their own cycles. It does start to go beyond your average emulator though: you have to keep track of values on the busses between cycles and such.

There's so much speculation out there, but almost nothing on how the boards definitively detect scanline edges. Especially on the more complex mappers.

Very interestingly, I found that the method I use to detect MMC5 scanlines (runs all MMC5 games) [I believe I watch for the two dummy nametable fetches ... but I haven't worked on NES code in a while now] frequently results in the counter being drastically desynced from where the actual PPU is, but only during scenes without any graphics onscreen. Kind of neat. A test ROM could probably easily detect most emulators' implementations.

by on (#95908)
byuu wrote:
the method I use to detect MMC5 scanlines (runs all MMC5 games) [I believe I watch for the two dummy nametable fetches ... but I haven't worked on NES code in a while now] results in the counter being drastically desynced from where the actual PPU is at frequently, but only during scenes without any graphics onscreen. Kind of neat. A test ROM could probably easily detect most emulators' implementations.

I've read speculation that the MMC5 watches writes to $2001 and disables the scanline counter while rendering is disabled. It has to watch $2000 because some of the CHR bank behavior differs for 8x8 and 8x16 pixel sprites.

by on (#95912)
byuu wrote:
I can see a secure environment doing wildly ineffective things like stack bounds checking, but desktop OSes don't really do that. Also, I have a Windows Fibers backing, and I'd be absolutely shocked if fibers didn't let you call other API functions.

I'm afraid it's been almost 15 years and I can't remember most of the details, but this would have been either on Windows 95/98 or NT4, and probably not all Win32 API functions but more likely just specific ones (things involving a window handle? I can't remember anymore). I'm not sure about Windows 2000/XP.

Also, I would expect Windows Fibers to work fine, because their stack space is handled by the OS. I remember that we were allocating our own "stack space" from the process heap for each cothread, and switching between them much like libco does. For one thing, I don't think Fibers were supported on 95/98 at the time, only on NT, so we didn't want to use them. The main thing I remember is that some Win32 API functions would freak out if we called them while our userspace stack pointer pointed into the heap. (It was a GUI app; maybe it was something involving WndFuncs or message queues.) Fortunately our emulator had a known, fixed set of threads it needed to create, so we worked around it the easy way: on startup, we allocated those stacks inside the main Windows thread's stack with _alloca. Then we could switch them around and still call Win32 API functions without triggering whatever it was.

Perhaps this old restriction is not enforced at all in newer versions of Windows. If people can run bsnes on Windows XP, then I guess everything is fine. Carry on ;)

by on (#95913)
Disch wrote:
@ mozz:

I was planning to implement a simple lockless queue to track register writes. It would basically work like this:

Code:
// in ppu thread:

if( ppu_pos != cpu_pos )
{
  // read and act on queue[ppu_pos]
  ++ppu_pos;
}

// in cpu thread, on register write
queue[cpu_pos] = whatever;
++cpu_pos;


What kind of gotchas would bite me here? And is there a way around them?

I'm no expert on lock-free stuff, but that's the kind of question that would take a textbook to answer.

In general, other threads might not see the results of reads and writes in the same order that your thread executes them, and there might be arbitrary delays before they see those results. Various types of barrier instruction can be used to "flush" writes to other processors. But even then, you have to be careful: out-of-order execution means reads won't necessarily occur in program order, and neither will writes (except I think they do on x86, but it's not safe to rely on that); reads can be reordered before/after writes; and writes can be combined in a buffer before they actually get flushed out to the cache.

Non-atomic operations can always be interrupted by a thread switch or hardware interrupt, allowing other threads to see the partial results of the operation. Cache line contention (even false sharing on different parts of the same cache line) can easily kill performance. So when you write your own lock-free primitives, you are trying to build a reliable, complex atomic op that overcomes all of these problems out of simpler ops (atomic and non-atomic, plus barriers). It's definitely not for the faint of heart.

The good news is that a circular single-producer, single-consumer FIFO is about the simplest lock-free data structure there is, so it's entirely possible to get it right after a few tries.

I suggest reading these blog posts by Charles Bloom; he's one of the guys who likes to figure this kind of stuff out, and he's smarter than most.
Intro to low level threading issues
Introducing CAS and cache line issues.
Intro to Lock-Free Programming
SPSC FIFO: the simple message queue
A look at some bounded queues
A look at some bounded queues - part 2

Edit: Just making variables volatile is NOT enough to make them safe across multiple threads/cores. All volatile does is keep the compiler from optimizing away your memory accesses (on MSVC on x86 it also adds some compiler-level reordering constraints, but none at the CPU level).
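To make that concrete, here's a minimal sketch of the SPSC ring buffer described above, written against C++11 std::atomic (which wasn't widely available when this thread was posted). The acquire/release pairs do what volatile alone cannot: they order the element write before the index publish.

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer sketch (C++11).
// Capacity must be a power of two so indices can wrap with a mask.
template <typename T, size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<size_t> head{0};  // advanced by the consumer
    std::atomic<size_t> tail{0};  // advanced by the producer
public:
    bool push(const T& v) {  // producer thread only
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;  // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);  // publish after the write
        return true;
    }
    bool pop(T& out) {  // consumer thread only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);  // free the slot
        return true;
    }
};
```

This is the shape the Bloom posts arrive at; getting the two memory orders right is essentially the entire difficulty.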

by on (#95916)
Wonderful, mozz. Thanks for the links. I'll check them out tonight.

by on (#96007)
Well, I've been reading those links and yutzing around with optimizing and stabilizing the preemptive approach. I have come to the conclusion that lock-free techniques are not worth the trouble, at least not until std::atomic is widely implemented.

Just about ready to start initial performance benchmarks before I do the "queue register writes" thing.

by on (#96042)
Grrrr... Preemptive is still deadlocking. I thought I got that resolved. Grrrrr.

by on (#96069)
Initial numbers are in. As expected, Cooperative seems to be the better performer so far:

[Image: benchmark results table]

Each game ran for 10000 frames, no user input (they just ran the demo).
Ticks = time in milliseconds it took to run all frames.
Context switches = applicable to Cooperative only: how many times thread contexts were switched (i.e., co_switch calls).
Waits = applicable to Preemptive only: the number of calls to condition_variable::wait (i.e., the thread sleeps waiting for the other thread to catch up and wake it).
Yields = applicable to Preemptive only: the number of times the PPU "caught up" to the CPU and had to spin in a yield loop while waiting for the CPU to gain some ground (the PPU does not sleep/wait for the CPU in my implementation).

As I expected, volatility is a considerable performance hit in and of itself. Just adding volatiles knocks 30 off the average FPS.

Preemptive performed poorly on SMB (and especially Excitebike) due to their $2002 spin loops. Those sync ups are really heavy on Preemptive.


I'm going to do two more optimizations. Profiling again after each one:

1) Put register writes in a FIFO queue and have the ppu pull them out and apply them as it runs, rather than having to switch/sync up on every write.

2) Detect common $2002 spin loops and run the PPU ahead until $2002 status changes.


#2 should do wonders for Preemptive.


Also... interesting note. Balloon Fight has ~110,000 context switches in 10,000 frames. That's ~11 context switches per frame. Not far from my estimate XD.


But unless preemptive pulls out some kind of miracle with these optimizations, it's looking like I'm going to gut it and stick to cooperative. Still, this is a very very fun experiment.

by on (#96076)
Very interesting work! I'm currently toying with preemptive threading on my own code (pthreads/osx/c++/opengl) so this thread has been particularly interesting to me. I would have figured preemptive would win but these results are hard to argue with.

A few questions:

1) What OS is this running under? Or more specifically, what preemptive threading model are you using?

2) How many processor cores were available to the program?

3) How many threads did you use? Would you expect the results to change as you add more threads for things like the APU?

4) What optimization level did you use when compiling the code? Have you tried optimizing for size vs speed?

by on (#96079)
Do you have numbers for a non-multithreaded version?

by on (#96111)
> 1) Put register writes in a FIFO queue and have the ppu pull them out and apply them as it runs, rather than having to switch/sync up on every write.

My prediction: it helps preemptive a little-to-fair bit, but hurts cooperative (computing timestamps and choosing between the FIFO and a direct read on every access will be painful).

I see that the frame rate is identical between Balloon Fight and SMB despite the thread switching increasing by over 40x. This says to me that, at the levels you are getting with NES emulation, cothreading's overhead is negligible.

> That's ~11 context switches per frame. Not far from my estimate XD.

That is really interesting that thread switching is so low on NES. I have to say, I never bothered to count the switch counts at all on anything but SNES. But I will say there, SNES is very very very much worse. I've seen 200,000 - 4,000,000 syncs a second there (you can have three CPUs, two PPUs and a DSP.) Your test data basically confirms that there's no point in a preemptive bsnes. So that's one more reason I really appreciate you putting this together.

Preemptive can definitely help emulators, but only at much less fine scales. It's a balance between the amount of time spent emulating to amount of time spent switching. NES is so trivial that your cache stalls and sync primitives are much more computationally expensive. PS2+ would obviously be a different story, plus later systems don't need as fine-grained syncing.

> 2) Detect common $2002 spin loops and run the PPU ahead until $2002 status changes.

Ooooh, that sounds wild. Snes9X tried that with SA-1 sync loops, and it ended up needing to be disabled in every other game. Very tricky to get something like that 100% right.

> Do you have numbers for a non-multithreaded version?

A state machine design would have to be a completely separate emulator, unfortunately.
It would also be quite difficult to do a precise measurement, because the syncing point style would be totally different, but you could get fairly close.

It's hard to say in this case. Given those amazingly low sync counts, I'd be tempted to say it'd be faster than a state machine, at least in Balloon Fight. But generally, for most emulators, cothreading will be slower. You would only use cothreading because it makes the code *substantially* cleaner and less error prone (not hyperbole; you basically remove the state machine entirely yet gain effectively infinite precision to break out anywhere transparently), as I'm sure anyone who's tried it will attest.

by on (#96117)
Hamburgler wrote:
1) What OS is this running under? Or more specifically, what preemptive threading model are you using?


Win7 Home Premium. The emu is written in C++ with Boost for threading.

Quote:
2) How many processor cores were available to the program?

Intel i3, dual core with Hyper-Threading, so 4 logical "cores" (technically 2 cores with 2 threads each), 3.2 GHz.

(I realize that ~230 fps in an NES emu is pretty poor performance for this processor, but I am emulating as many of the nitty-gritty details as possible.)

Quote:
How many threads did you use? Would you expect the results to change as you add more threads for things like the APU?


2. PPU and CPU. There is no APU implementation yet. I didn't want to get too far with development until I had an idea on performance.

Quote:
4) What optimization level did you use when compiling the code? Have you tried optimizing for size vs speed?


I didn't really stray from the default "Release" settings in Visual Studio.



James wrote:
Do you have numbers for a non-multithreaded version?


Nope. As byuu mentioned, that would require an entirely separate emulator.

byuu wrote:
I see that the frame rate is identical between Balloon Fight and SMB despite the thread switching increasing by over 40x. This says to me that at the levels you are getting with NES emulation, that cothreading's overhead is negligible.


I interpreted it the same way, which is why I'm pretty sure I'm going to gut the preemptive support and switch to cooperative exclusively.

Quote:
That is really interesting that thread switching is so low on NES.


I'm not surprised at all. There's very little communication between CPU and PPU usually.

It's not always the case, though. Games like Rad Racer do a lot of mid-frame scroll changes, so sync ups will be more frequent there. But I still suspect that $2002 polling is the biggest culprit.

Quote:
Your test data basically confirms that there's no point in a preemptive bsnes. So that's one more reason I really appreciate you putting this together.


Yeah after seeing this I wouldn't consider it for a SNES emu either.

I'm glad it's appreciated. Really this is more of an exercise and curiosity rather than a practicality, but I'm glad it has some value apart from that.

Quote:
Ooooh, that sounds wild. Snes9X tried that with SA-1 sync loops, and it ended up needing to be disabled in every other game. Very tricky to get something like that 100% right.


It's not very complicated at all on the NES. I'd probably just look for one of the following templates on $2002 reads:

Code:
loop:
  LDA $2002
  AND #xx
  Bxx loop

; and...

loop:
  BIT $2002
  Bxx loop

; and possibly...

loop:
  LDA $2002
  ASL A
  Bxx loop


If I don't find those exact templates, I'd just sync up the PPU normally.
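For what it's worth, matching the simplest of those templates is just a byte comparison at the loop's PC. A hypothetical sketch (the opcode constants are real 6502 encodings; the function and its surroundings are illustrative):

```cpp
#include <cstdint>

// Match the "loop: BIT $2002 / Bxx loop" template at the given PC.
bool isStatusSpinLoop(const uint8_t* mem, uint16_t pc)
{
    // BIT absolute = $2C, operand $2002 stored little-endian
    if (mem[pc] != 0x2C || mem[pc + 1] != 0x02 || mem[pc + 2] != 0x20)
        return false;
    // All eight conditional branches ($10 BPL, $30 BMI, $50 BVC, $70 BVS,
    // $90 BCC, $B0 BCS, $D0 BNE, $F0 BEQ) encode as xxx10000.
    if ((mem[pc + 3] & 0x1F) != 0x10)
        return false;
    // A relative offset of -5 branches back to the BIT instruction.
    return (int8_t)mem[pc + 4] == -5;
}
```

The LDA/AND and LDA/ASL variants would just be two more byte patterns on top of this.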

by on (#96118)
Another template with a fail-safe inside can be found in some of my own projects:
Code:
s0wait:
  BIT $2002
  BMI skip_rastereffect  ; we missed sprite 0 entirely this frame
  BVC s0wait

by on (#96119)
Actually, now that I've mulled over the details... I've decided I'm not going to bother with those optimization attempts. I doubt they would push preemptive past where cooperative already is, and they would severely complicate the code.

So I'm going to call an end to the experiment. Switch to cooperative exclusively, and continue on the emulator.


It was fun and informative though!

by on (#96141)
byuu wrote:
You would only use cothreading because it makes the code *substantially* cleaner and and less error prone (not hyperbole; you basically remove the state machine entirely yet gain effectively infinite precision to break out anywhere transparently), as I'm sure anyone who's tried it will attest.

It really is an intuitive method for writing an emulator. I don't know if yours is the first implementation, but I learned about this method from your posts a few years ago, so you get the credit; nice job!

by on (#96189)
Oh, before I forget ... I actually did have an interesting multithreading idea for an SNES PPU renderer (there, the PPU is an absolute beast.)

For a scanline renderer, instead of blitting one line at a time ... cache the MMIO registers to a "line" buffer, and do this for the entire frame, relying on the fact that VRAM isn't writable during active display, and OAM isn't supposed to be (only one game does it anyway). You could pack the state down to ~50 bytes per scanline, plus possibly extra for palette changes (store them as patches for each change, starting from the CGRAM copy at the start of the frame).

Now when you go to render the entire frame at once, use OpenMP (or better, roll your own lighter-weight version; OpenMP has crazy overhead) to split up the scanlines.
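A bare-bones sketch of that structure (the LineState layout, the sizes, and renderLine are all placeholders; a real SNES line renderer is vastly more involved):

```cpp
#include <cstdint>
#include <cstring>

// Per-scanline snapshot of the MMIO state captured during emulation
// (estimated at ~50 bytes packed above; this struct is just a stand-in).
struct LineState { uint8_t regs[50]; };

// Placeholder line renderer: a real one would run the full BG/OBJ/window
// pipeline for this scanline from the cached registers.
static void renderLine(const LineState& s, uint32_t* out)
{
    for (int x = 0; x < 256; ++x)
        out[x] = s.regs[0];  // trivially derive a "pixel" from the snapshot
}

void renderFrame(const LineState* lines, uint32_t* framebuffer, int height)
{
    // Each scanline is independent once its state is cached, so the loop
    // parallelizes trivially (compile with -fopenmp; harmless without it).
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y)
        renderLine(lines[y], framebuffer + y * 256);
}
```

Since there are no cross-line dependencies after the capture pass, there's nothing to synchronize inside the parallel loop.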

The thing with OpenMP is that it's almost always slower than non-OpenMP, except when things get really intense. Like for instance, it really helps with my HQ2x graphics filter, but it tends to hurt a simple 2X bilinear interpolation scale.

The idea probably won't work as well on the NES, since that tends to demand a cycle-based renderer for a lot more than just one game.

But even with one thread, having all that code running in a tight loop with no contextual changes should lead to a nice performance boost by way of caching.

by on (#96268)
Has anyone tried User-Mode Scheduling (http://msdn.microsoft.com/en-us/library ... p/dd627187) for an emulator project, or is anyone aware of someone else doing it?
It seems to be the way to have the advantages of cooperative threading with multicore backing.
Re: Multithreaded emu designs
by on (#241060)
*mother-of-all-bumps*

byuu wrote:
[...] Write yourself some simple test programs before you choose your model: just do "dummy CPU A" + "dummy CPU B", and have each one increment a counter and then sync to the other. Watch in horror as the traditional multithreaded model gets you something like 100,000 increments per second. On a 3GHz processor. Then try the same with my cooperative threaded model, and see it reach up to 10,000,000 increments per second. Then do a barebones switch(state) on each, and observe 100,000,000 increments per second. Then try three nested states like you'd need for a complex processor like the SNES, and see that drop to 1,000,000 increments per second.

Once you have your numbers, see how they would work with how many context switches you'd need in the real world for emulating a given system. Realize that those numbers are -without- any actual emulation. This is just the overhead of keeping things synchronized.

As a fun weekend project I thought I'd give this a try... except that
  • there aren't 2 counters but 5 (65c816, PPU, S-SMP, S-DSP, MSU1)
  • the mainboard clock syncs at 2 * 21.477 MHz
  • the audioboard clock syncs at 3,075,840 Hz instead of the full 24.6 MHz

http://www.mediafire.com/folder/1nr1soivjggkh/
Usage: open & compile the *.lpr file in Lazarus or use the binaries. The program logic details are in the big comment at the top. (I think byuu had a similar design once before switching to extremely small fractions of a second for time keeping?)

Results:
On an i7-4790K (May 2014 CPU) @ 4.4 GHz I get ~9.x * 60 = ~550 fps (without MSU1) and ~6.7 * 60 = ~400 fps (with MSU1).
On an i5-3570 (June 2012 CPU) @ 3.8 GHz I get ~5.8 * 60 = ~350 fps (without MSU1) and ~4.3 * 60 = ~255 fps (with MSU1).

The main advantage is of course the ability to stop anywhere, even between PHI1 and PHI2 (making it a phase-accurate emulator instead of just cycle-accurate :D ). The main problems are the number of components that must run in parallel (those 550 fps will drop faster than a brick) and getting into the relevant state handlers quickly. Using the CPU caches and modern branch prediction well is key here. Given how many cycles there are, a simple switch would probably exceed most CPU caches at current pointer sizes.
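For reference, the dummy-counter benchmark from byuu's quote is easy to reconstruct. Here's a sketch of the barebones switch(state) variant (all names are mine; the cothread and preemptive variants would swap out the scheduler while keeping the same counters):

```cpp
#include <cstdint>

// Two "dummy CPUs" that each increment a clock, with the scheduler always
// dispatching whichever one is behind -- the barebones switch(state) model.
struct DummyCpu { uint64_t clock = 0; int state = 0; };

// Run until the CPUs have performed `target` increments in total, and
// return the number of scheduler dispatches (one per emulated "cycle").
uint64_t runBenchmark(uint64_t target, DummyCpu& a, DummyCpu& b)
{
    uint64_t dispatches = 0;
    while (a.clock + b.clock < target) {
        DummyCpu& cur = (a.clock <= b.clock) ? a : b;  // run the one that's behind
        switch (cur.state) {
            case 0: ++cur.clock; break;  // one "cycle" of work, then reschedule
        }
        ++dispatches;
    }
    return dispatches;
}
```

Timing this loop (and its libco and mutex/condvar counterparts) gives the increments-per-second figures byuu describes, with zero actual emulation in the way.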