New save algorithm suggestion for co-threaded emus

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
View original topic
New save algorithm suggestion for co-threaded emus
by on (#18253)
I remember some discussion a while ago about how to safely and portably save and restore the "intermediate state" of various co-threads. The problem is that except at certain clearly-defined points (like at instruction boundaries), the co-thread will be in the middle of some computation and intermediate state will be stored in various places (e.g. host machine registers). Even if you could identify and save all of the places containing that intermediate state, it is in a very non-portable form (it depends on host machine type + compiler + particular build).

Here is an approach I thought of to work around that problem. I explain it here in terms of an emulated CPU, but analogous steps should work for any emulated device:

(1) arrange for your Read and Write functions to append the data value in a buffer.
(2) when you actually save the emu state, save the registers as of the last instruction boundary, but include the contents of this buffer in the saved state.
(3) when you restore the emu state, your Read and Write functions should use the buffer instead of performing real accesses, until the buffer runs out.

In other words, save the state as it was at the instruction boundary, but include enough info to re-emulate part of that instruction *without any side effects*. From the point of view of other co-threads that might be affected by your side effects, this is equivalent to saving and restoring the intermediate state.

For performance reasons you might want a way to enable and disable this "extra buffer stuff" in your Read and Write function(s). For an emu written in assembler, you could patch it into the functions only when it is needed, and remove it after.

So to save the entire emulator state at the very beginning of your "frame", you might do the following:
- Advance each co-thread until it is close to the desired "tick" (smallest time unit you can pause at, e.g. for an SNES CPU it would be a master cycle). This probably means stopping 1-2 "instructions" before the desired tick
- Capture the state of each co-thread at the "instruction boundary"
- Install the extra logging stuff in the Read/Write functions of each co-thread (it should also include the ability to suspend each co-thread at any given tick)
- Advance each co-thread one at a time until it reaches exactly the desired tick
- Restore the normal Read/Write behaviour and resume running

To load an entire emulator state, you would:
- Read all the data back from the file, and set up each co-thread at the "instruction boundary" it was saved at (restore timers, etc)
- Install the log playback version of Read/Write functions for each co-thread
- Advance each co-thread one at a time until it runs out of logged data, stopping it at THAT VERY TICK
- Restore the normal Read/Write behaviour and resume running


Notice that even if the logging versions record MORE than an entire instruction's worth of reads and writes, it will still work correctly.

Something like this might be useful for emulators which want to save exactly at a certain point (i.e. the start of a frame). You effectively capture some previous state of each task, plus all the side effects that occurred while advancing the task from that previous state til EXACTLY the point you're interested in.

by on (#18257)
Bravo for working out the details of the idea mentioned a while back. I'm exercising the idea in my mind and it seems to be pretty much the only way to portably save and restore state. You can't capture state at just any moment, and the moments when you can capture from each entity won't generally coincide reliably, so you have to be able to capture between these moments. This seems the only portable way, to record a movie of all input to the entity and replay that from the nearest moment that can be captured normally.

The key point of your idea is that you're capturing the state of more than one entity, and these captures each occur at different times. If you could capture everything at the same time (even if earlier than the desired timed), then nothing special would be needed. Also, this technique would only be necessary on all but one of the entities, since it can determine the point at which memory 'recording' ends. For example, you could use this with the APU but not the CPU, stopping it at an instruction boundary (I'm assuming that the PPU is not threaded). This algorithm is very much like movie recording, only much more low-level, and there are separate movies for each entity and when "fast-forwarding" to the final point each one is isolated from the others.

The only potential problem that comes to mind is the efficiency impact of having to allow reads and writes to be redirected. This could require indirection where there was none before. A possible solution is to have each entity compiled into a normal and synchronizing version, where only the latter allows read/write redirection. The synchronizing version would only be used when saving a state or restoring; once synchronized, the normal version would be switched to at the next execution boundary.

It's interesting that what you describe is exactly how I wrote my CPU core tests. It first runs the correct CPU for one instruction, logs reads and writes, then runs the CPU under test and merely verifies that writes match what the correct CPU wrote, and replays the data read by the correct CPU. Then it goes to the next instruction. This allows use in a full emulator without any effect on operation. Essentially it provides a realistic set of test data (instructions and data reads) and what a correct CPU would do.

by on (#18265)
blargg wrote:
Bravo for working out the details of the idea mentioned a while back.


It seems kind of obvious now, but it took me a while to think of it ;)

Basically we can't save at an "intermediate state" portably, but we CAN save a "known state" plus enough information to get from that known state back into the intermediate state, without causing any side effects or depending on the state of other co-threads in any way.

blargg wrote:
The key point of your idea is that you're capturing the state of more than one entity, and these captures each occur at different times. If you could capture everything at the same time (even if earlier than the desired timed), then nothing special would be needed.

That is an interesting way to think of it. I assume by "different times" you mean different real-world times... the old method called for capturing all the state at the same real-world time even if the simulated times did not match up. This method provides a way to ensure the simulated times line up as closely as possible, but it requires you to capture the states at different (real-world AND simulated) times, and then advance the simulated time of each co-thread one tick at a time until they exactly line up where you want.

Logging the values and playing them back allows you to "advance one tick at a time" on loading without repeating any of the side effects you already caused before saving. Any state that does not get saved as part of a particular co-thread (e.g. contents of CPU RAM) should be captured at the end after you've advanced each co-thread to the proper tick. Otherwise you'll be in the difficult situation of needing to apply writes to CPU RAM but not apply them to anything that could influence another co-thread (such as PPU registers).

by on (#18398)
Interesting idea, thank you for sharing.
This is the first idea I've heard that sounds feasible to me, and that includes all of my ideas from the past few months.

Quote:
It seems kind of obvious now, but it took me a while to think of it


Same thing for me and cothreads to begin with. It's trivial and works great, but being the first to tread into new waters is always very difficult, especially when you need help with it. Hence why I'm so appreciative of your input on this. Especially since I have the only known cothreaded emulator in existence :/
Perhaps if I get the speed up, demonstrate how much cleaner the code is, as well as have working savestates, I can convince some other emulator authors to give this a try :)

Quote:
I'm assuming that the PPU is not threaded


Not yet, but it will be. The PPU will definitely be a lot harder to replay, but I believe I can use a similar trick with it. Just make a generic read/write function, but instead of reading and writing memory, it will do this with externally controllable values such as the PPU registers mapped to $2100-$213f. So it will be like ppu->read(PPU::BG1VOFS).

However, the big problem I think I'll have with this technique is saving immediately when the user requests to save a state. In order to do that, I would have to start logging since the last safe resume point. And since I can't accurately predict what the user is going to do in the future, that means I would have to log all accesses to a buffer, and flush it after each safe return point. That would absolutely destroy my performance.

However still, I think this will be ok. If you save a state and it ends up being at the next safe point, we're talking at most one opcode or one scanline. And even if someone were using a debugger, they wouldn't expect the savestate to resume in the middle of the currently executing opcode.

So now, when the user hits "save state", the emulator puts a lock on save state functionality to prevent conflicts, then sets the emulator into savestate write mode until all states are accurately captured, then switches back to normal mode, captures an emulator savestate (the contents of the buffers plus a typical emulated console state) and removes the save state lock and dumps the file to disk.

When the user hits "load state", the emulator locks save state functionality, loads the typical emulator savestate info, sets the emulator into savestate read mode, and as soon as each "buffer" is exhausted for each thread, that specific thread switches back to normal mode. Once all threads are accounted for, load state has been completed and we resume emulation.

One big problem I may encounter is my CPU DMA implementation. For speed reasons and to really take advantage of cothreading, I run all DMA in a loop. That means up to 65536*8 bytes could be transferred between two safe CPU points. For a major speed hit, I could do DMA byte-by-byte, which is probably what I will have to end up doing. Otherwise, the restore state could take up to ten full video frames to complete, yow.
EDIT: oh yeah, this will be a big problem, actually. I cannot break out in the middle of a DMA transfer, because the DMA can trigger in the middle of an opcode... hmm.

The last problem I might face is that I'm planning on improving my cothreading performance with a slightly newish idea of mine.

Right now, whenever a read or write to an external emulation component occurs (eg CPU writes to an SMP register), I immediately swtich back to the main thread, and then determine which component is running slower and switch to that component. See the problem? It's funny I first realized the problem and came up a solution while being, erm ... really far from sober x.x

The problem is, what if SMP (APU) is currently way ahead of CPU? Why should CPU have to return with a costly context switch, check all of the processors just to determine CPU is still behind SMP, and then jump right back into CPU again and resume? Ideally, I should wrap co_return() to check if what it needs to sync against is currently ahead of it already, and block the co_return + co_call in that case.

I don't think it will make a difference, but lots of variables in this idea, I need to sit down and think this through, to make sure it works before I start trying to implement it, hehe.

Oh yeah, lastly... my idea for switching to logging mode, playback mode, and normal mode, would be to make read() and write() member function pointers. Yes, I lose inlining (big speed hit) and now read+write calls are indirect pointer references (small speed hit), but it's probably better than a 3-state switch inside inlined read+write functions. And it's certainly cleaner, IMO.

by on (#18406)
For commercial games, DMA shouldn't be too big a problem with regards to savestates since huge DMA transfers aren't likely to occur except when the screen is blank, at which point a use wouldn't mind too much if there were a little latency before the state could be taken.

If you allow pausing emulation, you may consider stopping at the next save point so the user can take a savestate immediately without having to resume emulation. (The exception would be if the user were debugging, obviously.)

by on (#18414)
Yeah, it theoretically shouldn't cause too much trouble, I'd just have to log up to 512kb worth of data for the absolute worst case scenario. Shouldn't be too bad. I don't know of any SNES emulators that save in the middle of DMA anyway.

As far as pausing... that's absolutely trivial. No need for any special code there, just stop running the main thread. The other threads will also stop (even in the middle of an instruction), since there's really only one true thread as far as the OS is concerned. The only thing is, obviously you have to wait for things to return. So yeah, a pause when the worst case DMA occurs, would take 1/6th of a second to react.

by on (#18455)
What I don't like about this whole method is that it seems too complicated. I think real implementations will have some tricky details which we've glossed over here.

I feel like there might be a simpler idea lurking behind this one, but I can't figure out what it is. :P


Byuu, regarding switching co-threads faster... since the number of co-threads is small, why not keep them in a list sorted by the simulated time of each co-thread. Do the sorting in some sort of inlined routine that is optimized for the most common case, and always switch directly to the co-thread that currently has the oldest simulated time.

That takes advantage of the fact that co-threads never need to wait for other co-threads to pass them in simulated time, at most they need to wait for them to catch up (otherwise you'd be simulating hardware that can see the future!) Under this scheme, when a suspended co-thread resumes, whatever it was waiting for is guaranteed to have occurred by now.

by on (#18458)
Quote:
What I don't like about this whole method is that it seems too complicated. I think real implementations will have some tricky details which we've glossed over here.


Undoubtedly. The 65816 wai/stp and spc700 sleep/stop opcodes pretty much deadlock the thread until you delete it (SNES reset/power does this).
I wish it could be simplified as well, but your idea seems feasible. I just need to adjust the core for it. The worst part I think will be supporting both cothreaded core components and non-cothreaded ones (I plan to have a fast opcode-based one for platforms without a libco port and for slower computers).

Quote:
Byuu, regarding switching co-threads faster... since the number of co-threads is small, why not keep them in a list sorted by the simulated time of each co-thread. Do the sorting in some sort of inlined routine that is optimized for the most common case, and always switch directly to the co-thread that currently has the oldest simulated time.


Good idea, thanks. I can use co_jump instead of co_call / co_return for that. That's one less thread switch. I can see the fine details getting tricky, but if I'm really going to show off I need to take advantage of this as much as possible.

by on (#18473)
Okay I thought of another way to describe the save algorithm, which seems clearer to me. And one important detail that I was totally overlooking before, which simplifies things:

(1 - Optional step) Advance co-threads so they are "near each other" in simulated time. You might do this by advancing them all to a time shortly before the beginning of a new frame, for example.
(2) Install the logging versions of the Read/Write functions. Each function should log values ONLY AFTER THE STATE OF THE CO-THREAD THAT USES THE FUNCTION HAS BEEN SAVED.
(3) For each co-thread, advance the co-thread to the nearest "safe point", and save its state.
(4) After saving the state of ALL co-threads, now you save the log buffers.

This simplifies things because each co-thread's log contains exactly the values that were read or written AFTER saving the co-thread's main state (which always happens at a "safe point"). In effect, those values represent side effects which occur AFTER the point at which the co-thread was saved, but BEFORE saving some other co-thread (which will have already observed the side effects if it cares about them).

Edit: the purpose of the optional step 1 is to minimize the distance you might have to "catch up" a co-thread after saving it. In other words, if advancing A to a safe point requires you to catch up B, where B has already been saved and is therefore logging all of its reads/writes, then B will only have to catch up a short distance (and thus the amount of data B will log while it is catching up is kept small).

The load steps are the same as before -- load all co-thread states at their "safe points", install the playback version of the functions, play each log back until it ends, restore the regular version of the functions, and resume running.