You mentioned that MAME uses a scheduler that's accurate to the attosecond, but for that you needed to use 128-bit numbers. May I ask why?
Code:
1,000,000,000,000,000,000 => the number of attoseconds in one second
18,446,744,073,709,551,615 => 2^64-1
As long as we subtract the smallest value from all counters every ~9 seconds, there's no chance of an overflow. Doing it once per frame on an emulated system should be far more than enough.
If you're going to use 128-bit counters, then why not up your accuracy to Planck time? :P
...
For what it's worth, I am using this currently:
Code:
auto create(... long double frequency) {
  /*uint64*/ scalar = 1.0 / frequency * 1'000'000'000'000'000'000.0 + 0.5;  //round to attoseconds
}

auto step(uint clocks) {
  clock += clocks * scalar;
}

auto synchronize(Thread& thread) {
  if(clock > thread.clock) co_switch(thread.handle);
}
Seems to be working just fine for me. Here's my before and after results:
Code:
[FC]
Mega Man 2 = 169fps (stage select) -> 186fps
Zelda = 172fps (in-game) -> 193fps
[SFC]
Zelda 3 = 131fps (save select) -> 140fps
Kirby 3 = 99fps (stage select) -> 106fps
[GB]
Mega Man II = 228fps (stage select) -> 246fps
Zelda DX = 211fps (save select) -> 214fps
[GBA]
Riviera = 180fps (opening) -> 180fps
Minish Cap = 154fps (save select) -> 154fps
[WSC]
Riviera = 177fps (title screen) -> 198fps
Final Fantasy = 158fps (name select) -> 163fps
Not bad for a few hours of work. It's a really great idea to use attoseconds instead of relative frequency counters.
MAME never rebases the counters; it uses them to log the exact emulated time that events occur (e.g. inputs in a movie file). Also, the counters aren't simply 128-bit integers; they consist of a 64-bit attoseconds part and a 64-bit seconds part. Great for friendly logging, not so great for efficiency (attotime calculations are a significant source of overhead in MAME drivers with many CPUs and other devices that use the scheduler).
Choosing a suitable base unit of time for an arbitrary set of clock domains is indeed the hard part. The problem MAME has is that many common crystal values have a repeating decimal when you take their reciprocal. Even using attoseconds, you end up with component A that's supposed to run at exactly 3 times the speed of component B, but gains an extra cycle once every 20 minutes due to truncation. I believe there are some demos that fail on MAME for this exact reason.
If you want perfect accuracy you have to do something like this:
Calculate the greatest common divisor of all the clock domains in the emulated system, and divide all the clocks by it (this step is only needed to reduce the likelihood of overflow in the next step)
Calculate the scaled reciprocal of each clock. The scaled reciprocal is the product of all the clocks except that one (or alternatively, the product of all the clocks divided by that clock)
Again calculate the greatest common divisor of all the clock reciprocals, and divide them all by it.
The result is an exact-integer cycle length for each clock, in some arbitrary unit. If you care what that unit is for some reason, multiply the cycle length of any one of the devices by its frequency in Hz. The result is the number of "some arbitrary units" in a second.
(Also, if you want to make sure you've done all the math right, assert that every device's calculated cycle length times its frequency is the same number)
If you need to add a new clock domain at runtime... Sorry, you're fucked. Hope you don't need to support hot plugging devices with their own crystal in them.
ETA: I said "something like" for a reason. I literally came up with the above algorithm at 5 a.m. right before falling asleep, and I'm sure someone can improve on my method or point out some awful flaw in it...
Posted timing results to the first post before I saw you replied. They're pretty interesting.
> MAME never rebases the counters; it uses them to log the exact emulated time that events occur
Oh, I see. Well, someone out there is going to be pissed when they run MAME for 31.71 billion years and then suddenly the program deadlocks because not all of the counters wrapped. So much for MAME caring about accuracy /s
> Great for friendly logging, not so great for efficiency (attotime calculations are a significant source of overhead in MAME drivers with many CPUs and other devices that use the scheduler)
Yeah, I can totally see that. Especially given the above timing results on my end.
You could speed that up by using a __uint128_t where available. And don't separate it into {top 64 bits = seconds; bottom 64 bits = attoseconds}; just divide by 1 quintillion when you need the number of seconds, and take the remainder when you need the fraction. For systems without 128-bit integers, simply add the clocks to the low 64-bit value, and if the new value is less than the old value, increment the upper 64-bit value by one. Or even better, do it with assembly instructions and use the processor's carry flag; then add zero with carry to the upper 64 bits. Extracting time would be a bit more painful on systems without 128-bit types, but you could write a software divide for that. I'd imagine extracting timestamps is far less common than time passing during emulation.
> Even using attoseconds, you end up with component A that's supposed to run at exactly 3 times the speed of component B, but gains an extra cycle once every 20 minutes due to truncation. I believe there are some demos that fail on MAME for this exact reason.
Well ... the way I'm doing it is: if there's a single oscillator, then I'll use that. The GBA, WS/C, MD all share this. In this case, my step() function just adds the clock cycle count to the clock counter and that's it.
But if there is more than one oscillator, then I add clocks * reciprocal_attoseconds instead. In the case of the SNES, even in the worst case of an extra clock every 20 minutes ... nothing is going to break because of that. The real oscillators on real consoles vary by larger amounts than that due to aging, temperature changes, etc.
My old system did handle this case perfectly, but being perfect beyond what's even possible for real hardware really isn't all that critical, and it didn't scale to the Genesis scenario anyway, so it had to go.
> Calculate the greatest common divisor of all the clock domains in the emulated system
Completely impractical. The SNES can have 6+ oscillators thanks to controllers, cartridges and expansion port devices. And they're mostly repeating fractions thanks to color burst timings being psychotic values for both NTSC and PAL.
> If you need to add a new clock domain at runtime... Sorry, you're fucked. Hope you don't need to support hot plugging devices with their own crystal in them.
Yep, there are several cases where I have to adjust the frequency dynamically. And this is another area where this attosecond-based approach works better than my old "weighted scale" approach: there's no longer any need to do any adjustments to the clock to represent the clock skew difference anymore. Just change the scalar and you're done.
This comes up with: the SNES SGB's speed settings (with the Horii controller), the GBC toggling double-speed CPU mode, and hot-plugging controller and expansion port devices (the latter is, of course, psychotic to attempt.) I'm sure more cases will come up in the future, too.
I thought the GBC only changed its clock divider. Can't you emulate it the same way as the SuperFX, by adding 1 clock per instruction cycle in double speed mode and 2 clocks in single speed mode?
Yeah, my suggested approach is complete overkill. You don't care about exact ratios between components on different clock domains, only ones that are on the same clock with different dividers.
Here's take 2, which is similar to yours but eschews floating point math (which is a whole kettle of worms...)
Code:
scaler = (HUGE_NUMBERull / frequency()) * divider();
Where frequency() and divider() are virtual functions that return integers (not data members, so as not to bloat every object layout with data that's only used once at startup). Chips that run off of the same xtal should return the same frequency as each other; this will ensure that their scalers are exact integer ratios. Chips that change dividers on the fly (like the GBC CPU) should report a divider of 1 and multiply their cycles themselves, at runtime. On the other hand, chips with a fixed divider can take advantage of precalculating it (integer multiplication is not that expensive on modern CPUs, but why do unneeded extra work every emulated cycle? And weren't you hoping to make your emulator more ARM-friendly in the future?)
> Can't you emulate it the same way as the SuperFX, by adding 1 clock per instruction cycle in double speed mode and 2 clocks in single speed mode?
I probably can. But that'd have to go on the to-do list. I don't want to keep getting side-tracked from the Genesis core while it's so young.
> Here's take 2, which is similar to yours but eschews floating point math (which is a whole kettle of worms...)
scaler is a uint64 type. I just do the initial computation via long double. That math only happens once at power-up, so I'm not too worried about its performance. All my audio resampling / anti-aliasing is done in floating point, which is probably a whole lot worse. Especially at the obscene 2-3MHz sampling ratios of some of these systems.
> And weren't you hoping to make your emulator more ARM-friendly in the future?
I certainly was before I killed the balanced and performance profiles.
But amazingly, this change seems to have boosted framerates from 29fps (v100) to 45fps on my 1.4GHz Celeron box. I can't explain how this would be such a large speedup. Hopefully it lasts.
[The assumption of one crystal is] trickier though with the SNES where there are multiple independent clock rates.
You can come close by running the audio unit at 1/7 of the master clock.
APU ceramic resonator: 24576000 Hz
Divided by 8 to form audio master clock: 3072000 Hz
Cost-reduced APU shares CPU/PPU crystal: 945/44 MHz = 21477272 Hz
Divided by 7 to form audio master clock: 3068181 Hz
Would being 1 part in 804.6 off improve frame rates? And would it be within tolerance of the APU's ceramic resonator? If not, you could instead divide by 945/44/3.072 = 39375/5632 = 6.9913, and everything would still be within a clock domain.
Based on the consensus resulting from a discussion two months ago about overuse of moderator powers, I have chosen to move this post from a previous derailed topic through cut and paste, as if I were not a moderator, rather than using my moderator powers. I apologize for this bump.
byuu wrote:
scaler is a uint64 type. I just do the initial computation via long double. That math only happens once at power-up, so I'm not too worried about its performance. All my audio resampling / anti-aliasing is done in floating point, which is probably a whole lot worse. Especially at the obscene 2-3MHz sampling ratios of some of these systems.
Again, it's precision I'm worried about, not performance. After all the work you've done on accuracy you don't want to start letting the S-PPU gain a cycle on the S-CPU every 20 minutes.
Here's another problem I've just spotted:
Code:
auto synchronize(Thread& thread) {
  if(clock > thread.clock) co_switch(thread.handle);
}
Before, you had one side testing for clock < 0 and the other side testing for clock >= 0: exact opposite conditions. Now you've got both sides testing for inequality. This introduces a wobble between two devices running at the same frequency: sometimes device A will run 1 cycle ahead of device B and sometimes device B will run 1 cycle ahead of device A, depending on which is currently executing when their clocks become equal.
Unfortunately I suspect that this wobble is responsible for a significant amount of the performance gain you're seeing. Devices that should be running in perfect lockstep (e.g. the SMP and the DSP) are now leapfrogging each other, each running 2 cycles at a time.
The solution is to initialize each device's clock with a unique bias, so that two devices' clocks never become exactly equal. If you initialize the SMP's clock to 0 and the DSP's to 1, the SMP will always lead the DSP instead of leapfrogging, or vice-versa (I'd look at which way the relationships went before and choose biases that preserve the same priority)
> Again, it's precision I'm worried about, not performance. After all the work you've done on accuracy you don't want to start letting the S-PPU gain a cycle on the S-CPU every 20 minutes.
Will that happen when the two chips are running at the same frequency? They would both exhibit rounding errors at the same time.
But, I don't see a solution, short of moving to Planck time units or something.
> Unfortunately I suspect that this wobble is responsible for a significant amount of the performance gain you're seeing.
... damn. It was such a nice performance boost, too.
What about "if(clock >= thread.clock) co_switch(thread.handle);" instead? This test will never be called until a thread adds to the clock value, obviously, so there's not going to be some pathological non-stop switching event going on. It's starting to get sloppy, introducing virtual offset values into each clock.
byuu wrote:
> Again, it's precision I'm worried about, not performance. After all the work you've done on accuracy you don't want to start letting the S-PPU gain a cycle on the S-CPU every 20 minutes.
Will that happen when the two chips are running at the same frequency? They would both exhibit rounding errors at the same time.
No, it's not a problem if both of the chips are created with exactly the same frequency. Just make sure that clock dividers are applied via integer multiplication (either at creation time or at runtime), not via division nor any floating-point operation. Example:
Code:
auto create(... long double frequency, uint divider) {
  /*uint64*/ scalar = 1.0 / frequency * 1'000'000'000'000'000'000.0 + 0.5;  //round to attoseconds
  scalar *= divider;
}
Don't divide frequency by divider, and don't multiply scalar by divider until after scalar has been cast to an integer.
For the MD, you'd create all the chips with the master oscillator as frequency and their respective dividers as divider (only for chips where it doesn't change at runtime, i.e. not the VDP).
Quote:
> Unfortunately I suspect that this wobble is responsible for a significant amount of the performance gain you're seeing.
... damn. It was such a nice performance boost, too.
What about "if(clock >= thread.clock) co_switch(thread.handle);" instead? This test will never be called until a thread adds to the clock value, obviously, so there's not going to be some pathological non-stop switching event going on. It's starting to get sloppy, introducing virtual offset values into each clock.
That's going to have exactly the same problem, except that the non-currently-executing device will win when the clocks become tied. You're still going to have the two devices leapfrogging each other. The bias values aren't arbitrary or "sloppy"; they indicate exactly what order you want lockstep devices to execute in (and in a much clearer way than the difference between >= and > does). The device that you want to execute first gets a bias of 0, the one you want to execute second gets a bias of 1, etc.
> That's going to have exactly the same problem, except that the non-currently-executing device will win when the clocks become tied.
But how is that ever going to be a problem in practice? The chip that runs first will be the one that gets entered first after a reset.
SMP = 0; DSP = 0
CPU runs first; synchronizes to SMP; so SMP is always first
SMP step 1; sync if 1 >= 0; switch to DSP
DSP step 1; sync if 1 >= 1; switch to SMP
SMP step 1; sync if 2 >= 1; switch to DSP
DSP step 1; sync if 2 >= 2; switch to SMP
The only way things would get nasty is if we did the sync if test before the step (would cause wobbling), or possibly a sync call without any step (infinite loop if both sides did it.) But I never do that anywhere in my code. It's always "step(n); sync();" everywhere.
> The bias values aren't arbitrary or "sloppy"
Well, I have one decent idea for them, at least. In Thread::create, I have this:
Code:
handle = co_create(entrypoint, stacksize);
scheduler.append(*this);
So inside scheduler.append, I can do:
Code:
thread.clock += scheduler.threads.size();
So that CPU=1, SMP=2, PPU=3, DSP=4, coprocessors=5-30, peripherals=31-33 ... it will get a bit screwy with hot-swapping, but it's bounded by threads.size(), so it won't go off the rails.
I definitely don't want to build a giant enum that lists the "priority" of every possible Thread source. There's so damned many in the SNES.
byuu wrote:
> That's going to have exactly the same problem, except that the non-currently-executing device will win when the clocks become tied.
But how is that ever going to be a problem in practice? The chip that runs first will be the one that gets entered first after a reset.
SMP = 0; DSP = 0
CPU runs first; synchronizes to SMP; so SMP is always first
SMP step 1; sync if 1 >= 0; switch to DSP
DSP step 1; sync if 1 >= 1; switch to SMP
SMP step 1; sync if 2 >= 1; switch to DSP
DSP step 1; sync if 2 >= 2; switch to SMP
The only way things would get nasty is if we did the sync if test before the step (would cause wobbling), or possibly a sync call without any step (infinite loop if both sides did it.) But I never do that anywhere in my code. It's always "step(n); sync();" everywhere.
That's fine for the SMP and DSP, but what about devices that have different dividers and/or that don't always sync after every cycle, like the CPU and PPU? Every time one leapfrogs the other due to adding more cycles than the other's divider or due to not syncing for a bit, their priority will flip until the next time they happen to leapfrog. With the CPU/PPU, you'll have the h/v counters sporadically becoming off by one for a while.
Quote:
> The bias values aren't arbitrary or "sloppy"
Well, I have one decent idea for them, at least. In Thread::create, I have this:
Code:
handle = co_create(entrypoint, stacksize);
scheduler.append(*this);
So inside scheduler.append, I can do:
Code:
thread.clock += scheduler.threads.size();
So that CPU=1, SMP=2, PPU=3, DSP=4, coprocessors=5-30, peripherals=31-33 ... it will get a bit screwy with hot-swapping, but it's bounded by threads.size(), so it won't go off the rails.
How are you ending up with 25 coprocessors on the SNES? Are you creating a thread for every possible coprocessor, and not just the one in the currently-loaded cartridge?
You don't actually need a unique bias for every chip on the SNES, because you never have coprocessors syncing to the PPU etc. You can just use 0 for the CPU and SMP and 1 for everything else. The only chips that need biases to prevent leapfrogging are those that both (1) run off of the same clock (possibly with dividers) and (2) directly sync to each other.
> but what about devices that have different dividers and/or that don't always sync after every cycle, like the CPU and PPU?
The CPU and PPU have the same exact scalar value 1/21MHz*Attosecond. I don't see why it matters how many clocks pass before syncing to another, as long as the value is greater than zero.
I'm likely going to need an actual demonstrable simulation showing a problem. I'll toy around with a tiny mock-up on my end to look for problems.
> How are you ending up with 25 coprocessors on the SNES?
Theoretical worst case with the manifest from hell :P
> You don't actually need a unique bias for every chip on the SNES
The goal, if I decide to use a bias, is to do it in one place rather than in ten places or more.
Does MAME actually have this bias system in place? Being honest, I wish I had known about this before I started on this redesign. But, I'd still have to solve the MD sync issue, so ... maybe I'd still have chosen this anyway.
byuu wrote:
> but what about devices that have different dividers and/or that don't always sync after every cycle, like the CPU and PPU?
The CPU and PPU have the same exact scalar value 1/21MHz*Attosecond. I don't see why it matters how many clocks pass before syncing to another, as long as the value is greater than zero.
I'm likely going to need an actual demonstrable simulation showing a problem. I'll toy around with a tiny mock-up on my end to look for problems.
Okay, imagine you have two devices that normally both step 1 clock at a time but sometimes step 2 clocks at a time (because of memory wait states or whatever):
Code:
dev1 step 1 (total 1); sync if 1 >= 0; switch to dev2
dev2 step 1 (total 1); sync if 1 >= 1; switch to dev1
dev1 step 1 (total 2); sync if 2 >= 1; switch to dev2
dev2 step 1 (total 2); sync if 2 >= 2; switch to dev1
; so far everything is okay... let's see what happens when dev2 steps 2 clocks at a time
dev1 step 1 (total 3); sync if 3 >= 2; switch to dev2
dev2 step 2 (total 4); sync if 4 >= 3; switch to dev1
dev1 step 1 (total 4); sync if 4 >= 4; switch to dev2
dev2 step 1 (total 5); sync if 5 >= 4; switch to dev1
dev1 step 1 (total 5); sync if 5 >= 5; switch to dev2
; this isn't good... dev2 is now running ahead of dev1 instead of behind it
; dev1 should've gotten an extra cycle when dev2 took 2 at a time, but didn't
dev2 step 2 (total 7); sync if 7 >= 5; switch to dev1
dev1 step 1 (total 6); sync if 6 >= 7; no switch
dev1 step 1 (total 7); sync if 7 >= 7; switch to dev2
dev2 step 1 (total 8); sync if 8 >= 7; switch to dev1
dev1 step 1 (total 8); sync if 8 >= 8; switch to dev2
; well, at least they can't get more than 1 cycle out of step. what happens when dev1 takes 2 cycles?
dev2 step 1 (total 9); sync if 9 >= 8; switch to dev1
dev1 step 2 (total 10); sync if 10 >= 9; switch to dev2
dev2 step 1 (total 10); sync if 10 >= 10; switch to dev1
dev1 step 1 (total 11); sync if 11 >= 10; switch to dev2
dev2 step 1 (total 11); sync if 11 >= 11; switch to dev1
; and now dev1 is running ahead of dev2 again
Each time the trailing device gets ahead of the leading device rather than merely catching up to it, their order of execution swaps. You might object that the SNES doesn't have any pair of devices with those exact clock characteristics, but the point is that the order of execution is fundamentally unstable, and your core shared emulation functionality shouldn't rely on accidental characteristics of the emulated hardware to avoid the consequences of its bugs.
Quote:
Does MAME actually have this bias system in place? Being honest, I wish I had known about this before I started on this redesign. But, I'd still have to solve the MD sync issue, so ... maybe I'd still have chosen this anyway.
MAME uses a very different type of scheduler; the only thing it's got in common with your new scheduler is how the time bases are handled. The MAME scheduler runs CPUs (and other "executable" devices) in a round-robin sequence in the order they were declared in. A CPU in MAME can't yield to another specific CPU; it can only give up its timeslice, causing the scheduler to pass it to the next CPU in the sequence. MAME can't handle things like context switching in the middle of a read handler to let the device being read from catch up. To achieve bsnes-level accuracy in MAME, you have to run at "perfect interleave", meaning each round-robin timeslice is a single cycle of the fastest chip in the emulated machine. The 8-bit Commodore drivers do this, and the C64 and C128 barely run at full speed on a top-end PC (and for other reasons those drivers still aren't as accurate as dedicated emulators).
Basically, MAME is amazing for being able to emulate everything from 1970s 6502 boards to Pentium-class PCs using a single infrastructure, but it's not something you should try to imitate in all its particulars in a project of much more limited scope like higan.
Testing reveals some nasty results. You'll need the latest nall/libco from Gitlab to use this:
http://hastebin.com/raw/ukoyikabez
This is a test framework for the old and new synchronization methods.
When I run the CPU and SMP at their specification rates, and step the CPU by 4, the SMP by 5; the attosecond-based method desyncs at 15,151 context switches. That number is horrifyingly low, and completely unacceptable.
I wasn't gaining any precision beyond femtoseconds, so I moved from long double reciprocal division to pre-scaled integer division and scaling with 128-bit types. Even yoctoseconds aren't enough; I had to go 1000x more precise than that. Since I had the headroom, I went ahead and increased this to 100000x the precision of a yoctosecond. Can't fit Planck time * 32-bit frequency into a 128-bit integer, so ... boo.
Still, this is really shitty. I trusted the "one clock desync every 20 minutes" thing, but it's not even close to that in practice with attoseconds.
The bias thing only helps when the two clocks are dividers of the same base frequency (which is also the only time that level of perfection should matter). When you have two completely unrelated frequencies, there's nothing you can do to stop the streams from crossing eventually. Except for keeping a separate "priority" field and testing "if((clock > other.clock) || (clock == other.clock && priority < other.priority))".
Does one of the threads actually permanently gain a cycle on the other, or do they just execute one cycle out of order and then fall back into synchronization? It's one thing if one of the chips is actually running 0.01% too fast (although I don't see how that large an error is remotely possible) but I don't think it's a big deal if one chip executes a cycle early or late every now and then compared to the previous method (relative to a chip on a different oscillator, I mean).
This is the best I can do:
http://hastebin.com/raw/ovoxadugef
At 79,228 times the precision of the yoctosecond (10^-24 seconds), I bring you ... the byuusecond, or 2^-96 of a second.
I reserved 2^32 headroom so that up to a 2GHz* clock can run for up to one emulated second before all the clocks need to be normalized. (* since the normalization can leave some values up to 50% capacity, we cut the 4GHz range in half.)
Basically, what I've observed from my testing is ...
Attoseconds just don't cut it. We're getting misses ~292 times a second against the CPU<>SMP clock rates stepping 4 and 5 clocks each, respectively. Adding clock=1 to the SMP, toggling between clock >= target.clock and clock > target.clock, etc does not help.
The errors are not extra cycles, they're just chips running in the wrong order. Instead of ABABABAB we get ABABBAAB periodically.
There's basically no way to stop this. Even if I move all the way up to the byuusecond, it's not enough. In my test, it runs in sync as long as I can bear to run the test, but ... if I drop the CPU counter value by 1, it swaps a sync order once every 10,000,000 iterations or so.
Since attoseconds are insufficient, we have to move to a 128-bit type. And since we're doing that, there's no point in stopping at zeptoseconds. Might as well use the full 2^96 range available to us.
Surprisingly, the uint128 type didn't result in any slowdown in the profiling tests. Although I'm sure this will break compatibility with non-x86 and non-amd64 systems; and to hell with writing Booth-algorithm software multiplication for that, so ARM and similar systems will be required to fall back to uint64 types and 2^32 precision.
> The bias thing only helps when the two clocks are dividers of the same base frequency (which is also the only time that level of perfection should matter).
Yeah ... I see that now in my tests. When I set the cpuFreq and smpFreq to the same value, I start seeing the bias errors you're talking about when they don't run in perfect lockstep (e.g. cpuStep and smpStep are different for this test suite). I still need to try and grok your explanation as to why; I've been busy working on the test suite.
> I don't think it's a big deal if one chip executes a cycle early or late every now and then compared to the previous method
Yeah, it's the latter as explained above and true ... it's probably not a big deal, especially given they're separate oscillators. Still, it's a shame to give up my old method, which was basically total perfection. But, at 2^96, the precision is quite great.
I guess if one wanted, they could reduce the timing to 2^64, and not worry about normalizing the counters ever. If someone runs the emulator for four billion seconds, then they will be dead anyway because humans do not live that long.
byuu wrote:
The errors are not extra cycles, they're just chips running in the wrong order. Instead of ABABABAB we get ABABBAAB periodically.
If you really are getting AABB (the slower chip getting two cycles in a row sometimes), that would be bad, but I'm not seeing that. I rewrote your test program in plain nall-free C++, using plain 64-bit integer math (no floating point and no exotic 128-bit integers), added more verbose output, and all the differences that I'm seeing look like the following, once every 30,000 cycles or so:
Code:
old: 121211212121
new: 121212112121
i.e. the faster chip's extra cycle when it "laps" the slower one occasionally happens 1 cycle later with the new scheduler than with the old one. That is really, really not something to worry about. Given the imperfections of real hardware oscillators, you can't even really claim that one is "right" and the other is "wrong". It's the difference between rounding exactly 0.5000000000 up and rounding it down. It's less than the difference between powering the same SNES on twice (and unlike hardware, it's completely deterministic).
Also, my logging shows that clock ties basically never happen between devices with unrelated frequencies (search for "!" and "@" in the output, which represent cycles when the clocks are exactly zero (for the old scheduler, on the left) or equal (for the new scheduler, on the right)). Another reason you only have to worry about them between devices that are running on different dividers of the same base frequency.
> I rewrote your test program in plain nall-free C++
That ... seems like a lot of effort for no benefit, but ... okay. I appreciate you trying out the test suite on your end regardless.
This was a hacked-up version, but I might implement something much cleaner in C (well, as clean as you can be in a language without classes) to distribute with libco itself, which would also serve as a better usage example than what's there now.
> i.e. the faster chip's extra cycle when it "laps" the slower one occasionally happens 1 cycle later with the new scheduler than with the old one.
Yes, that's what I was trying to express. My AB example had four As, four Bs, in both examples. It probably wasn't a long enough stream to show it off, but it's the same result as you get.
> Given the imperfections of real hardware oscillators, you can't even really claim that one is "right" and the other is "wrong".
True, I've always struggled with the justification in higan for one chip to use >= and the other to use < on my int64 clock values. There was no compelling reason why the latter shouldn't be <=, it was just chosen because it was the opposite of >=.
> It's the difference between rounding exactly 0.5000000000 up and rounding it down.
Mathematically speaking, that always rounds to 1.0. It's a pretty basic rule.
I think you're right as I've said ... real oscillators have some variance. The important thing is it stays deterministic. I was just surprised that it doesn't seem like there's any amount of precision that will be enough to match the simple behavior I had before, at least for a million iterations. Rounding errors of fractional bits are a real bitch. I kind of wonder just how many bits of precision I'd need to make it to a million matches ... 256? 1024? 2097152? >__<
Given that I agree, I'm still going to go with this timing method, but it still bugs me that we can't get an exact match, at least for a million iterations, no matter how precise we get. I might toy around with some other ideas (like making the scalar the product of all oscillator frequencies.)
> Also, my logging shows that clock ties basically never happen between devices with unrelated frequencies
Right, that's what I perceive as well ... it should be psychotically rare (and probably even impossible with irrational numbers.) So why are we getting the mismatched output to the old ThreadA style if they're not clock ties?
Even if we decide it's not important ... I think a mismatch every 30,000 cycles is a very high margin of error for time measured in yoctoseconds (my version) or attoseconds (your version.)
................................................................................
EDIT: wow ... yeah, that actually works really well. In the example test suite, we end up with a CPU scalar of SMP freq, and an SMP scalar of CPU freq.
If I add in two more oscillators ... 44100hz (MSU1) and 8000000hz (DSP1), I end up with:
8670412800000000000
7577181561600000000
Which I can divide by 100000000 and get:
86704128000
75771815616
And indeed, these run perfectly in sync forever ... I think?? I don't know if I can round down products like that. I feel like I can, but if I factor the numbers or try GCD calculations on them, I end up reducing it right back to the original frequencies, which is obviously wrong.
I think there's a significant opportunity here to have a smart frequency generation tool. Start by using clock dividers (multipliers in the clock step values) wherever possible ... the Genesis only has one, the base SNES only has two. Then build a list of all the unique frequency values. Then go through and try to reduce any relationships we can with GCDs. E.g. GCD(21477272, 24576000) = 8, so we can drop the multipliers to 2684659 * 3072000. Or for the bigger example, it's 4, so:
5369318*6144000*11025*2000000=727409429913600000000000, which is still below the headroom of 2^96.
So let's say we keep adding in more oscillators ... and eventually the number gets too big, then we can start sacrificing some precision through rounding things until we can fit nicely into the 2^96 headroom.
The downside to this approach is that adding and removing oscillators dynamically won't really be possible, because if we change the scalars, then we invalidate the current clock values of all threads. That's not the end of the world ... it will only affect hotplug peripherals, and currently I have no cases where a peripheral has an oscillator different from the ones used in the base system.
................................................................................
EDIT 2: okay, changes are in for the next WIP.
The SNES retains ~80% of its performance boost over v100 official. I'm betting the reason is the simplifications to the SNES CPU step() function. We've gone from adjusting the SMP, PPU, two controllers, expansion port, and all coprocessors (the peripherals and coprocessors being separate vector lists to iterate over) with 64-bit multiply-adds to just a single 128-bit multiply-add. I guess the SMP<>DSP wobbling wasn't as big a portion of the speed boost as I thought it was.
But now the NES and WSC cores are a bit slower than they were before. The psychotic ~40% speed boost for the SNES on my Intel Celeron box remains as well.
higan may no longer run on 32-bit CPUs, but at least I can claim that it now has ~80 billion times more accurate timing than MAME :P (joking aside, I'm still going to keep working on this. There's gotta be a better way to come up with approximations that have very good precision and yet fit into 64-bit types.)
byuu wrote:
True, I've always struggled with the justification in higan for one chip to use >= and the other to use < on my int64 clock values. There was no compelling reason why the latter shouldn't be <=, it was just chosen because it was the opposite of >=.
If you used <= and >=, you'd have the same wobbling issue I demonstrated in the previous post. You'd never have gotten all your PPU counter tests to match hardware.
byuu wrote:
Right, that's what I perceive as well ... it should be psychotically rare (and probably even impossible with irrational numbers.) So why are we getting the mismatched output to the old ThreadA style if they're not clock ties?
Rounding errors because 1e+18 / 21477272 and 1e+18 / 24576000 aren't exact integers. The CPU is running very, very, very slightly slower relative to the SMP (less than one part in 100 billion) than with the non-reciprocal algorithm. The error isn't large enough for the total cycles executed to diverge over 20 million or even 20 billion iterations; if you ran the test for hundreds of billions of iterations (omitting the per-cycle logging because you'd run out of RAM, and twiddling your thumbs because the test will take hours) you'd eventually see the SMP gain a whole cycle on the CPU relative to the old scheduler. As you got closer to that point, the single-cycle divergences (if you were capable of logging them) would get closer and closer together.
1e+18 / 21477272 is 46560848137.5. The error from truncating to 46560848137 is 0.000000001%, which is orders of magnitude smaller than the tolerance of a crystal oscillator, let alone a ceramic resonator. You're taking an error of 1 in 100 billion and exaggerating it out of all proportion by using inappropriate test criteria. That level of error is really not worth introducing complexity and killing portability over. What's important is that chips that are at an exact integer frequency ratio (because they're clocked off the same oscillator) are emulated at that exact ratio and don't wobble, which I've explained how to ensure; for chips on different oscillators an error of 1 part in 100 billion is vastly better than good enough.
Try this: Run the test twice using the same algorithm both times (either the old one or the new one), but the second time change the SMP frequency to 24576001. You'll see that the first divergence in which cycle a thread switch happens on will happen after only a few thousand iterations, even though the clock speed difference is 1 part in 24 million. "how long does it take before the first thread switch happens on a different cycle than with the old scheduler" is simply not a reasonable or meaningful measure of accuracy.
ETA: I checked when the first divergence happens in my test program (it scrolls out of the terminal window when I log them all) and it happens at 284019 cycles. It looks like you're actually *losing* precision by using 128-bit integers if you're really getting a divergence after only 15151 cycles (I would guess that converting from long double to uint128_t is where it's being lost)
Quote:
And indeed, these run perfectly in sync forever ... I think?? I don't know if I can round down products like that. I feel like I can, but if I factor the numbers or try GCD calculations on them, I end up reducing it right back the original frequencies which is obviously wrong.
Ugh.. listen to yourself, man. You're doing convoluted math you don't even pretend to understand, with only brute-force and highly artificial tests to convince yourself it's correct. I thought you liked to keep your software simple?
Is there a reason you can't have a mode that pretends everything in the SNES is on one crystal, at least if no Japan-only coprocessors are used? APU=8/7*CPU would be well within the 0.5% tolerance of a ceramic resonator. That is, unless you want to simulate change in frequency as the thing heats up.
> If you used <= and >=, you'd have the same wobbling issue I demonstrated in the previous post.
Oh, that's a duh moment for me >_<
> As you got closer to that point, the single-cycle divergences (if you were capable of logging them) would get closer and closer together.
Still, it's impressive to me that I'm seeing ten instances of rounding errors every one million context switches even at the yoctosecond level. It's fine since we aren't actually introducing a real extra cycle, just shuffling the order that the two chips run in. Just, I didn't expect it to be that common.
> 1e+18 / 21477272 is 46560848137.5. The error from truncating to 46560848137 is 0.000000001%
Which is why I would have expected the rounding error to happen only once every ~billion or so synchronizations, not every 16,000 - 100,000. Even if it doesn't matter, and it doesn't, fine. Just didn't expect to see that.
> That level of error is really not worth introducing complexity and killing portability over.
The main issue here though is that attoseconds are too precise for one 64-bit number.
If I were to have a frequency of 1hz (which I do for RTCs), then I'd eat up 1/18th of the total headroom for each tick. Yes, I know there would only be one tick per second, obviously. And we normalize once per second, sure. (Well, the framerate number of times per second.) Then if we increase the frequency higher toward 4GHz, the precision falls off a cliff as rounding errors become more pronounced.
MAME manages with attoseconds, but it also keeps a seconds counter as well and doesn't worry about wrapping.
In my case, I'd want to leave 2^32 headroom, so that I can easily handle a full second of any frequency from 1Hz to 4GHz (whatever fits into an unsigned integer) with strong precision. That would leave only 2^32 for sub-second precision, which is only 10^9.6 or so.
It obviously works anyway with uint64 and attoseconds and normalizing the counters after every frame. But the 128-bit version is more precise and runs at basically the same framerate. I'll probably write some code that picks the sizes and time base off the largest available integer size, so it compiles on 32-bit platforms to 64-bit, and on 64-bit platforms to 128-bit.
Like you said, I'm not that great with math at this scale. It's hard for me to calculate exact limitations where things would be "acceptable" or not, so I use the test code as a brute force (and inaccurate) crutch. If it's not any slower, and I have a 32-bit fallback without adding more than a single ifdef on two lines of code, then why not do it anyway, just to be 100000000000% sure?
> "how long does it take before the first thread switch happens on a different cycle than with the old scheduler" is simply not a reasonable or meaningful measure of accuracy.
I realize the first hit could be anywhere. It was the recurring hits that were worrisome. I was getting ten per 1m-iteration loop with yoctoseconds.
> It looks like you're actually *losing* precision by using 128-bit integers if you're really getting a divergence after only 15151 cycles (I would guess that converting from long double to uint128_t is where it's being lost)
Yeah, I believe that was the case, and I switched to pure integer math. Current results are:
//femtosecond (10^15) = 16306
//attosecond (10^18) = 688838
//zeptosecond (10^21) = 13712691
//yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble)
//byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond)
And yes, adding one to the SMP causes even the "byuusecond" to start desyncing ~10 times per 1m iterations. Nasty.
> Ugh.. listen to yourself, man. You're doing convoluted math you don't even pretend to understand
Right, but this is how we learn.
I didn't know anything about blackman windowed sinc FIR filters or butterworth biquad IIR filters, yet I was able to learn and implement both fairly effectively to eliminate audio aliasing during resampling.
> I thought you liked to keep your software simple?
I do. And honestly, I think my old method was better in terms of simplicity (not needing normalization nor introducing any errors.) It was just sloppier because it was more work to manage more than 1:1 timing relationships.
It's just a thought experiment. If we can get 128-bit levels of precision in 64-bit levels of space though, I think that'd be pretty neat and *may* justify the code necessary to compute it. Or maybe not... its Achilles' heel is adding/removing oscillator frequencies during run-time (from peripherals.)
> Is there a reason you can't have a mode that pretends everything in the SNES is on one crystal, at least if no Japan-only coprocessors are used? APU=8/7*CPU would be well within the 0.5% tolerance of a ceramic resonator. That is, unless you want to simulate change in frequency as the thing heats up.
higan allows the end-user to set the frequencies of every oscillator by editing manifest files. Along with per-byte mapping granularity, it's a conscious design choice, so I'm not going to cut corners on it.
byuu wrote:
Which is why I would have expected the rounding error to happen only once every ~billion or so synchronizations, not every 16,000 - 100,000. Even if it doesn't matter, and it doesn't, fine. Just didn't expect to see that.
See, you're not understanding the problem at all. With an error of one part in a billion, after one billion cycles one chip would be a full cycle ahead of where it would be with theoretical perfect timing. After half a billion cycles, the chip would be half a cycle ahead, meaning the context switches would be 180 degrees out of phase--every single one would be "wrong" the way you're measuring them. In between those points, the "errors" gradually increase and then decrease again.
"What percentage of context switches happen at exactly the same time" is a completely bogus, irrelevant statistic. To make a rough analogy, you're trying to change the amplitude of the beat between two sine waves by twiddling their frequencies.
Here's another analogy: Think about two kilometer-long rulers, one marked with lines dividing it into 100000 segments (each segment is exactly 1cm) and one marked with lines dividing it into 100001 segments (each segment is 0.001% shorter than a cm). If you put the rulers next to each other, the lines line up just about perfectly at one end. As you walk along, the lines diverge more and more, until in the middle of the two rulers the 50000th line on one ruler is exactly between two of the lines of the other ruler.
You're looking at the lines somewhere in the middle of the rulers and saying "don't try to tell me the difference between these rulers is only 0.001%; look how far apart the lines are!". And you're proceeding to demand that someone make you a pair of rulers long enough that the lines never get farther than 1mm apart.
Quote:
If I were to have a frequency of 1hz (which I do for RTCs), then I'd eat up 1/18th of the total headroom for each tick. Yes, I know there would only be one tick per second, obviously. And we normalize once per second, sure. (Well, the framerate number of times per second.) Then if we increase the frequency higher toward 4GHz, the precision falls off a cliff as rounding errors become more pronounced.
4 GHz is an exact divisor of 1e+18, so let's try 3 GHz.
1e+18 / 3000000000 = 333333333.333
0.333 / 333333333.333 = 1e-9, which is still four orders of magnitude smaller than the tolerance of a quartz oscillator.
Quote:
You're looking at the lines somewhere in the middle of the rulers and saying "don't try to tell me the difference between these rulers is only 0.001%; look how far apart the lines are!". And you're proceeding to demand that someone make you a pair of rulers long enough that the lines never get farther than 1mm apart.
Thanks, that analogy worked perfectly. I get what you're saying now.
I have to say, it's intimidating working in a field with people who are all 20-30 IQ points higher than me where this stuff is obvious to them. But I do appreciate you taking the time to help explain it to me.
I don't know how to do the math to work it out, but ... is 10^-18 the best time unit to use for a 64-bit clock/scalar? I don't really care if my unit of time has a snazzy name like "attoseconds."
I want something that works well for all frequencies 1Hz - 4GHz (I know this is overkill; but I don't want any hidden gotchas or certain frequencies we shouldn't use), won't overflow for one entire second (so I think we need 2s of overhead since the normalization could leave other counters near the limit still), and has the most possible precision we can get away with.
Ideally, it'd be great if there were a series of formulas we could compute to tell us these things. I'd like to have official documentation right there in the thread/scheduler that explains the rate of error for the worst case scenario, the amount of time before an overflow event would trigger, etc; instead of just saying, "well four orders of magnitude sounds like a lot and I don't *think* the counters could ever get close to the limit ..."
Given that I don't really know how to compute this stuff, I hope you at least understand why I feel more comfortable just brute-forcing it to batshit insane overkill of 2^-96 until I do, especially given that it comes at no discernible performance penalty (the penalty I received was due to fixing the wobbling issue that reduced my context switch overhead.)
The problem is that if two cores have their event at exactly the same time, which runs first is unpredictable. To fix this, disallow things happening at the same time. Clock the CPU and PPU at 43 MHz and run CPU on even clocks and PPU on odd.
I wrote another test program to demonstrate what I was talking about in the previous couple of posts:
http://hastebin.com/odijenugim.cpp

It uses the old signed/paired scheduler algorithm. It runs 100 million cycles with a 21477272 Hz CPU and a 24576000 Hz SMP, then repeats the run with the SMP bumped to 24576001 Hz, and compares the results. I strongly suggest redirecting the output to a text file--printing tens of millions of lines to a terminal window takes surprisingly long.
Results summary: After 100 million cycles (~50 million cycles per chip), the total cycles diverge by just 1 (i.e. the run with the 1 Hz-faster SMP runs one more SMP cycle and one fewer CPU cycle than the other run). However, the first desync of the thread switches is only 14381 cycles in, and they quickly climb in frequency. Around the 50000000-60000000 cycle neighbourhood, the two runs are maximally out of phase. By the end of the runs, they're fairly close to being in phase again--the desyncs are back down to one every hundred or so cycles.
Quote:
I have to say, it's intimidating working in a field with people who are all 20-30 IQ points higher than me where this stuff is obvious to them. But I do appreciate you taking the time to help explain it to me.
There are probably only 4 or 5 people on the MAME mailing list who really understand the MAME scheduler and one of them, the man who originally wrote it, is retired from hobbyist emulation. A major emulator-wide regression that occurred when the scheduler was converted from C to C++ took three years to get fixed.
Quote:
I don't know how to do the math to work it out, but ... is 10^-18 the best time unit to use for a 64-bit clock/scalar? I don't really care if my unit of time has a snazzy name like "attoseconds."
If you really wanted to maximize the number of frequencies that yielded exact integer reciprocals, you could open up an electronics supplier's catalogue of oscillators and start multiplying them all (dropping trailing zeroes, and skipping ones that are multiples of ones you've already included) until the product got close to 2^59. But since you're allowing users to provide arbitrary frequencies that aren't even necessarily integers, you're not going to gain much by doing that. Any arbitrary number in the 2^59~2^60 neighbourhood is as good as any other. I think Aaron Giles chose attoseconds because it's the smallest named unit of time whose reciprocal fits into a 64-bit integer.
It sounds like you might be confused on this point, but rebasing the clocks every frame is no problem even when you have clocks slower than 60 Hz. For every time that 1 Hz RTC clock gets incremented by 1 s, the rebaser will decrement it 60 times by on average 1/60 s each time. It doesn't matter whether the increments are less frequent but larger than the rebases (clocks < 60 Hz) or vice-versa (clocks > 60 Hz). None of the clocks will ever overflow, as long as the rebasing happens more frequently than your headroom (for attoseconds in a uint64, that's 18 seconds) and as long as one chip doesn't get drastically behind the others due to never being synced.
tepples wrote:
The problem is that if two cores have their event at exactly the same time, which runs first is unpredictable. To fix this, disallow things happening at the same time.
Yes, I've already explained this problem and how to solve it: initialize each device's time base with a small, unique bias (e.g. the index of its position in the scheduler's array) so that the time bases of two devices with integer-ratio clocks never become exactly equal.
Spoke to a PhD holding math expert (who frequently does vector calculus), yet we weren't able to come up with a good way to determine the maximum error rates for this type of scheduler through plain old computation.
So I had to fall back on a testing framework again. This is the best I can do:
http://hastebin.com/raw/isekeyuxec

Requirements:
* must allow frequencies between 1hz and 2^32-1hz
* must allow at least one full emulated second to run before normalizing all the time counters
I tried many variations. What seemed to maximize the rounding errors the most was this scenario:
* precision[attoseconds] / frequencyA[hz] = xxxx.5
* precision[attoseconds] / frequencyB[hz] = xxxx.0
Where frequencyA and frequencyB were as close to the maximum (4GHz-1) as possible.
I used a brute force search to find the best values for this: 4294967262hz and 2000000000hz.
With this, I found the following:
* the iteration count doesn't really affect the error rate; just the precision of reporting it
* with picosecond precision, 7742318 clocks or 0.244283% error rate
* with femtoseconds, 6009 clocks or 0.000189% error rate
* with attoseconds, we end up with 4 clocks of rounding errors in 10 billion iterations; or basically 0.00000025% out of spec
* with 2^96, there are zero errors. I don't think I could stand to run the test long enough to find an error
With a cesium fountain clock, that's 10^-15 accuracy. Rubidium is 10^-12. But those two are atomic clocks. A typical crystal is going to be accurate to between 10^-4 and 10^-5. With attoseconds, we are above 10^-6 accuracy rate (my math could definitely be wrong here when converting between 0.x and x%); and we can run up to 18 seconds with uint64 between normalizing all the clock values to prevent overflow (though since some counters can be as high as half the value they were before, I would halve that to 9 seconds. Which means I wouldn't trust going to 10^-19 for the precision here.)
The best estimate I can give from my results is that a precision of 10^-n means we are accurate to a minimum of 10^-(n-12) for frequencies of up to 4GHz. That would be off by one second every 85 years.
So given this, I concur through empirical evidence with AWJ: there's no need at all for 128-bit integers here. Unless of course you want to brag about being more accurate than the best atomic clock in the world ;)
If anyone sees any errors in my test harness, results or reasoning, please let me know. I am, after all, quite bad with math :/
...
EDIT: but actually, I was mistaken in that we don't need 2^32 headroom for uint128_t. We only need enough such that 2^128/precision >= 2. So in this case, we can easily do 10^-38; or 10^20 (one hundred quintillion) times more precise than attoseconds. Which gets us an error rate of 10^-26. Which is 10^11, or one hundred billion times more precise than the most precise cesium-based atomic clock mankind has ever invented :D That's an error rate of one second every 2 quintillion years. And we can push it to 2^-127 in that case, which is 70% more accurate than even that.
So we might as well use 2^-63 instead of 10^-18 for our uint64_t version. That gives us 922% more precision for free. Gets me to an error of 1 cycle on my test harness for 10 billion iterations.
byuu wrote:
So we might as well use 2^-63 instead of 10^-18 for our uint64_t version. That gives us 922% more precision for free. Gets me to an error of 1 cycle on my test harness for 10 billion iterations.
I think 2^63 might be cutting it a bit too close. A 1 Hz clock will go from zero to overflow in just two ticks. I've got a nagging feeling that if you have two or more 1 Hz clocks (e.g. a hypothetical SNES cart with both types of RTC in it) there might be some circumstance in which an overflow could occur even with regular rebasing. Even 2^63-1 would be substantially safer.
Also, since you've apparently convinced yourself that uint64_t is precise enough, I would strongly recommend against any kind of platform-dependent compile-time selection (e.g. defaulting to 128-bit on x86_64 and 64-bit on i686 and ARM). Problems include:
* Savestate incompatibility between platforms
* Potential input-movie incompatibility between platforms (there's a tiny but nonzero chance that a movie recorded on a more precise build will desync on a less precise build or vice-versa. Why risk it?)
* Bug reproducibility between platforms (you have enough trouble with compiler bugs given all the bleeding-edge C++ you use; why compound it by intentionally introducing more platform-dependent behaviour right in the core of the emulation?)
ETA: Also, you're using uintmax in nall/windows/detour.hpp, so now it behaves differently in 32-bit and 64-bit builds. I have no fucking idea what detour.hpp is for, but it looks very low-level to me, and I would guess it wasn't your intention to change it...
> I think 2^63 might be cutting it a bit too close. A 1 Hz clock will go from zero to overflow in just two ticks. I've got a nagging feeling that if you have two or more 1 Hz clocks (e.g. a hypothetical SNES cart with both types of RTC in it) there might be some circumstance in which an overflow could occur even with regular rebasing. Even 2^63-1 would be substantially safer.
Neat, I thought about that exact same thing. My fear was that 2^63-1 results in a number that's far more likely to round poorly and result in more fractional error accumulation. Especially with clocks that are powers of two already (I have 32*1024, 4*1024*1024, etc clocks.) However, at the margin of error we're dealing with, that's probably statistically insignificant. Still, I wish I was better at math to know for sure whether this even matters or not.
The reason why I believe it should be safe: I don't actually wait one full second before normalizing counters. I actually do it once per frame. It's just that framerates vary per system, so I wanted to leave the headroom for "up to one second", and "up to one second minus one cycle" is close enough.
If you're sure it won't have adverse effects on the precision, then I'll switch it to 2^63-1 just to be safe.
> ETA: Also, you're using uintmax in nall/windows/detour.hpp, so now it behaves differently in 32-bit and 64-bit builds. I have no fucking idea what detour.hpp is for, but it looks very low-level to me, and I would guess it wasn't your intention to change it...
detour is dead code. Just kept around in case I need it again.
I used it for the automation of a proprietary work application, and again against Lunar Magic. I can talk about the latter.
So, FuSoYa refused to add headerless ROM support to Lunar Magic, and it's closed source. So I wrote a program called Header Magic that opens Lunar Magic as a debuggee process, then used DLL injection to get code into its process. Then I had that code use detour.hpp to hijack all the Win32 file APIs to pretend that ".smc|.sfc" files had an extra 0x200-byte header that really didn't exist. And voila, Lunar Magic now supports headerless ROMs anyway :D (And yes, that really works. Perfectly, in fact. Have a few smwcentral.net people using it.)
...
Good catch though on intmax ... this is something that's been bothering me.
The idea of intmax is "hold the biggest possible number this platform supports", yet on 64-bit platforms the compiler supports 128-bit integers while intmax_t remains 64 bits. And so now, all of my code that uses intmax_t won't be able to work with 128-bit numbers without truncation.
detour is just dead code that happened to not be updated from back when I had non-_t versions of intmax_t/intptr_t. But my intention now is to go in and update all of my intmax_t functions to be compatible with 128-bits. Obviously to do that, I have to use my own non-_t versions, since I can't override the official ones from stdint.h now.
Now obviously, it's immediately apparent that relying on > 64-bits makes your code no longer portable to 32-bit systems. But that was always the idea behind the max variables in the first place. If you want portability, you use intN_t. This mess is why int is still 32-bits instead of 64-bits. That type is basically stuck as short-hand for int32_t now.
NOTE: I understand that technically uint128_t is the compiler working on two uint64_t fields. But the same can be said of 32-bit systems where intmax_t is 64-bits. If this were the reasoning, then intmax_t should be 32-bits on 32-bit systems, right?
byuu wrote:
Neat, I thought about that exact same thing. My fear was that 2^63-1 results in a number that's far more likely to round poorly and result in more fractional error accumulation.
2^63 (or any power of 2 for that matter) won't divide evenly by any divisor that isn't a power of 2 itself. Likewise, 10^18 or any power of 10 won't divide evenly by anything that has prime factors other than 2 and 5. This is basic number theory. On top of the fact that worrying about which numbers are "more likely to round poorly" is silly when you're dealing with maximum rounding errors of parts in 100 billion, you aren't even worrying about the right numbers: "nice round numbers" like 2^n or 10^n have the property that very few divisors will divide them evenly.