Emulation and streaming asynchronous audio resampling

2007-01-31

Hello, I'm sorry to ask this question here (I also asked at mameworld.info), but there's a (not so) surprising lack of people out there who might be able to help me with this ... I was hoping that perhaps someone here has encountered this issue in the past and worked out some possible solutions already.

Basically, emulation has two really important outputs: video and audio. Video typically runs at close to but not exactly a monitor refresh rate (in my case, ~60.09fps emulation vs ~60.00fps monitor), and the same for audio (~32,040hz emulation vs ~32,000hz sound card). So, in order to get smooth audio and video: you either have to resample the video (skip/duplicate frames), which is quite visible to a user; or you have to resample the audio. This theoretically should be less noticeable to the user if done right, as you can interpolate per sample rather than outright drop frames. I've been completely unable to think of how to do this, however.

My setup is that I use DirectSound (the API is unimportant) and a 3-ring audio buffer, plus one very very large temporary buffer. The ring size is adjustable, typically ~800 samples for 32khz (~25*3=75ms latency). I run emulation, and each time a sample is generated, I add it to the temporary buffer and then check the current playback position in the audio buffer. As soon as I see that the playback position has reached a new ring (of 3, it wraps), I then take the temporary buffer, resample it and stick it two rings ahead of the currently playing ring, and flush the temporary buffer. Why two ahead of the current ring? A: to give time for resampling, and B: I use 3 rings instead of 2 because I also want to allow syncing via dropping video frames and that requires 3 minimum if you think about it.
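To make the ring bookkeeping above concrete, here is a minimal sketch of the index arithmetic only. The function names are hypothetical, and it assumes the playback position is available as a sample offset into the whole 3-ring buffer; it does not call DirectSound.

```c
#include <assert.h>
#include <stddef.h>

enum { RING_COUNT = 3 };

/* Which ring the sound card is playing, given a playback position in samples. */
static size_t playing_ring(size_t play_pos, size_t ring_size)
{
    assert(ring_size > 0);
    return (play_pos / ring_size) % RING_COUNT;
}

/* Ring to fill next: two ahead of the one being played, wrapping at 3. */
static size_t target_ring(size_t play_pos, size_t ring_size)
{
    return (playing_ring(play_pos, ring_size) + 2) % RING_COUNT;
}

/* Total buffered latency in milliseconds for a given ring size and rate. */
static double latency_ms(size_t ring_size, double sample_rate)
{
    assert(sample_rate > 0.0);
    return 1000.0 * (double)(RING_COUNT * ring_size) / sample_rate;
}
```

With 800-sample rings at 32000hz, latency_ms gives the 75ms figure quoted above.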

Now, the problem is the resampling part. My output clock rate is consistent at 32khz, but I can only roughly detect the current playback position, and not every sample edge. My input clock rate is known (~32,040hz), however the number of samples I actually generate can vary wildly based on how much time the OS gives my emulator, and how complex emulation is during the next "ring". The number of input samples can be more or less than the number of output samples I need to fill the next audio ring buffer, so I'm using a godawful point resample ("drop-sample") filter for now (I know you can practically get into quantum physics and n-dimensional trig/calculus trying to resample audio, but I chose the simple approach to start) to convert x input samples to y output samples.
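For readers unfamiliar with the term, a point ("drop-sample") resample from x input samples to y output samples can be sketched as below. This is an illustration under my own assumptions (mono 16-bit samples, a hypothetical function name), not the code actually used in the emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert in_count input samples to out_count output samples by picking,
   for each output, the input sample at or just before its position.
   There is no filtering or interpolation, so the result aliases. */
static void point_resample(const int16_t *in, size_t in_count,
                           int16_t *out, size_t out_count)
{
    assert(in_count > 0 && out_count > 0);
    for (size_t i = 0; i < out_count; i++) {
        /* Map output index i onto the input block; integer division
           truncates, so src never reaches in_count. */
        size_t src = (size_t)(((uint64_t)i * in_count) / out_count);
        out[i] = in[src];
    }
}
```

Because the mapping is recomputed from scratch for every block, each block gets its own ratio, which is the per-block pitch step described in the next paragraphs.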

The problem is that audio is just waaay too sensitive to this, regardless of latency. A sample conversion log (with a massive latency of 8000 samples, ~250ms * 3) looks like this:
7950 samples -> 8000 samples
7980 samples -> 8000 samples
8020 samples -> 8000 samples
7990 samples -> 8000 samples

Every single ring buffer that gets played back has an extremely noticeable pitch shift from the previous sample block (even for very, very small sample differences, which surprised me), and the result is truly something horrible to listen to. It actually sounds better to leave it crackling by not resampling the audio at all and just ignoring buffer overruns/underruns. In the rare cases that I get several blocks with the same number of input and output samples, it sounds quite tolerable, again until the input sample rate changes and the pitch shifts again. I can get more blocks sounding the same by testing playback position every n samples rather than every single sample, but then eventually that rounds off and one ring buffer has an even more massive pitch difference from the others to compensate. I also want to be able to handle differing times with at least a ~5-10% tolerance (eg if I need 800 output samples, I want audio to sound good when the input sample count is within ~700-900 samples), as sometimes emulation will fall below 60fps, and I'd like it if the audio didn't go all to hell when that happened.

Obviously, I'd need to even out the pitch over multiple ring buffers. But my question is, how in the world is this possible when you are streaming asynchronous audio data? You can't possibly predict how many samples you'll get in the next ring/block until you emulate it, and the more you buffer to try and even things out, the more terrible your latency gets. At anything > 100ms, you can noticeably tell the delay between eg Link swinging his sword and the sound effect for it playing back. If you buffer the video back to account for this delay, then your input to onscreen response gets delayed even more.

Please note that manipulating the CPU/SMP frequency counters to control video frames/second to audio samples/second is not an option, as that affects emulation accuracy. I'm also not able to ask the emulated DSP to generate more samples, as that would result in changes to DSP processor registers visible to program code: again, also not good for accuracy.

Any help would be greatly appreciated, thanks in advance.

2007-01-31

Overall, your emulator is generating video frames and audio samples at rates that are of a precise relation to each other (say exactly 32040 stereo samples per 60 video frames, or 60.xxx/59.xxx or whatever). These are fed to the host system, which accepts them at rates very close to what they are generated at. If the host's rates were both off by the exact same amount, say 1% faster, then emulation would simply run a bit faster. The problem is that the host's rates are off by different amounts, say the video 1% faster and audio 2% slower. The solution is to a) buffer one or both, b) choose one to be the "master", c) continually adjust the slave's data rate so that the slave's buffer is kept filled to approximately the same level. Note that none of this is affecting emulation accuracy, only the data after the emulator has output it.

You have apparently chosen to use video as the master and audio as the slave, changing the pitch of the audio very slightly so as to keep the audio buffer from completely emptying or staying completely full. If the problem is that the pitch is changing too much, then you probably need to average out the adjustments so they are finer. This might require a larger audio buffer or finer granularity.

My main aim is to keep unnecessary implementation details out of the picture (but I may have misunderstood your problem).

2007-01-31

byuu wrote:

Basically, emulation has two really important outputs: video and audio. Video typically runs at close to but not exactly a monitor refresh rate (in my case, ~60.09fps emulation vs ~60.00fps monitor)

I haven't seen a monitor that can do anything close to 50 Hz. Have you chosen not to support Europe-only titles in your emulator? Or are you requiring a 100 Hz monitor for PAL mode?

Quote:

and the same for audio (~32,040hz emulation vs ~32,000hz sound card).

A lot of sound cards have hardware resampling, and Windows can do software resampling.

Quote:

The problem is that audio is just waaay too sensitive to this, regardless of latency. A sample conversion log (with a massive latency of 8000 samples, ~250ms * 3) looks like this:
7950 samples -> 8000 samples
7980 samples -> 8000 samples
8020 samples -> 8000 samples
7990 samples -> 8000 samples

What you are hearing is flutter. The human auditory system can perceive pitch changes of about 6 "cents", where an octave is 1200 cents. This 6 cents corresponds to a change in sample rate of about 0.35 percent.

Quote:

Obviously, I'd need to even out the pitch over multiple ring buffers. But my question is, how in the world is this possible when you are streaming asynchronous audio data? You can't possibly predict how many samples you'll get in the next ring/block until you emulate it

How does this vary? At least on $(SomeOtherSystem), every 2-frame unit should give exactly (341*262*2-1)/3 = 59,561 samples on NTSC or (341*312*2)/3.2=66,495 samples on PAL. There are similar figures for any set of PPU and DSP operated off the same clock divider.

(Edited for context)

2007-01-31

Quote:

The problem is that the host's rates are off by different amounts, say the video 1% faster and audio 2% slower. The solution is to a) buffer one or both, b) choose one to be the "master", c) continually adjust the slave's data rate so that the slave's buffer is kept filled to approximately the same level.

a) buffering doesn't work well.
If audio is always slower than video, then the audio buffer continues to drain until it is empty. If it's faster, then it will overflow. The SNES outputs the same video frames to samples ratio (which changes based on runtime toggleable interlace setting), but the PC is where the variance comes into play. The PC can make it so that one is always faster or slower than the other, and to compensate, eventually you have to resample or drop/skip data.

b) I want to have the master be selectable as an option. Audio master is "easier", as you just simply drop/skip video frames. I'm sure it's theoretically possible to "blend" video frames somehow, but I don't have the power or the high refresh rates needed to even joke about attempting something like that ... anyway, I'm now trying to get another optional mode that always outputs every video frame in sync with the video card refresh rate and resample the audio. Hence this post.

c) How can I adjust the slave's data rate without affecting accuracy? You mean I should constantly ask the host to change the playback frequency? I've had problems with cheap onboard sound cards refusing to work with non-standard pitch rates (32040, etc), even on Windows XP. The emulator would just freeze completely. Probably bad drivers, but it was a common chipset (AC'97), so this really isn't the "best" option, but it would probably sound better than any pitch shifting I tried.

Also, I can't make my audio buffer that big. The bigger the buffering, the longer the latency. It's already barely perceptible at 75ms. And I can't buffer the video more to compensate as that would lag out the input responsiveness even more.

So now, here's what I've come up with: the pitch change, fluttering, is detectable even with the lowest possible latency setting (~25ms ring buffers x 3). Any lower latency, and sound breaks up because audio is playing too fast. The more I raise the latency, the less responsive audio appears to video/input. I can try and smooth out each block by overbuffering past the exact start of a new ring buffer (eg if one frame has 790 samples and one has 810, I can get two 800's out of it by buffering them), but this rounding continues to stack up, one frame I'm 10 samples behind, next I'm 20, then 40, and eventually it rounds down and now I have a buffer that's resampled from 700 samples instead of 800 samples, and the pitch distortion is even worse.
The only way I can see getting around the fluttering is to resample smaller blocks at a time. I'd have to resample it at a latency rate lower than the best sound card can handle, so that won't work either.

Quote:

I haven't seen a monitor that can do anything close to 50 Hz. Have you chosen not to support Europe-only titles in your emulator? Or are you requiring a 100 Hz monitor for PAL mode?

Neither have I. I won't have a choice with PAL. Either the user will have to use the audio sync method and drop/skip frames, or turn off video buffering altogether and deal with shearing/tearing, or I could try and add a 100hz mode that just waits two vsyncs instead of one. Kind of annoying as I won't be able to test it (of my four CRTs and one LCD, none support 100hz in any resolution. They all go back down to 85hz. Two of the monitors are very expensive, too).
The last, most evil method might be to pull off some sort of pulldown method to perform 50->60hz transform. It won't look good, though.

Quote:

What you are hearing is flutter. The human auditory system can perceive pitch changes of about 6 "cents", where an octave is 1200 cents. This 6 cents corresponds to a change in sample rate of about 0.35 percent.

I don't understand "cents" (why do we need a million measurements for the same thing?), unfortunately. You mean if I'm playing at 32000hz, a change of +/- 32000*0.0035 = 112hz (to 31888hz or 32112hz) will be perceivable, but anything inside that range (31889hz - 32111hz) will not be detectable?

I was reading a forum post where someone was resampling 7999hz to 8000hz ( http://www.dsprelated.com/showmessage/33844/1.php ), and the one sample difference was audible for them as a clicking sound. The person was duplicating the last sample, as that was really his only choice.

2007-01-31

Quote:

a)If audio is always slower than video, then the audio buffer continues to drain until it is empty. If it's faster, then it will overflow.

And if you adjust the audio to be faster/slower in just the right proportion, the buffer never becomes completely empty or full. This would be a closed-loop system, where the adjustments are in response to how full the buffer is.

Quote:

The SNES outputs the same video frames to samples ratio (which changes based on runtime toggleable interlace setting), but the PC is where the variance comes into play. The PC can make it so that one is always faster or slower than the other, and to compensate, eventually you have to resample or drop/skip data.

Exactly. So another way of stating the solution is that you resample the audio or video so that the resulting audio/video ratio matches that of the PC (again, using feedback).

Quote:

How can I adjust the slave's data rate without affecting accuracy? You mean I should constantly ask the host to change the playback frequency?

Adjust the slave's data rate by speeding up/slowing down the data that the emulator produces. For video, you change the delay between frames. For audio, you resample it. In either case you ultimately go from X to Y data units per second (frames or samples). Since these transformations are done after emulation, accuracy is not affected.

Quote:

Also, I can't make my audio buffer that big. The bigger the buffering, the longer the latency. It's already barely perceptible at 75ms.

As long as the audio buffer is several times the resolution of the "how full is the buffer?" function, you can get enough feedback to adjust the audio rate. For example, if buffer is 1/2 full, resample output to 32000 Hz. If buffer is less than 1/3 full, resample output to 31950 Hz. If buffer is more than 2/3 full, resample output to 32050 Hz. You should probably look at the average of several buffer readings so you don't change the rate really often (in case the reading itself jumps around often).
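A minimal sketch of that feedback rule, using the thresholds and rates from the paragraph above. The struct, the function names and the history length of 8 readings are assumptions for illustration; how the returned rate is fed to the resampler is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

enum { FILL_HISTORY = 8 };  /* number of readings averaged (assumed value) */

typedef struct {
    double history[FILL_HISTORY]; /* recent fill fractions, 0.0 .. 1.0 */
    size_t next;                  /* slot the next reading overwrites  */
    size_t count;                 /* readings stored so far            */
} rate_control;

static void rate_control_init(rate_control *rc)
{
    rc->next = 0;
    rc->count = 0;
}

/* Record one "how full is the buffer?" reading and return the rate to
   use, based on the average of the stored readings. */
static double rate_control_update(rate_control *rc, double fill)
{
    assert(fill >= 0.0 && fill <= 1.0);
    rc->history[rc->next] = fill;
    rc->next = (rc->next + 1) % FILL_HISTORY;
    if (rc->count < FILL_HISTORY)
        rc->count++;

    double sum = 0.0;
    for (size_t i = 0; i < rc->count; i++)
        sum += rc->history[i];
    double avg = sum / (double)rc->count;

    if (avg < 1.0 / 3.0) return 31950.0;  /* buffer running low  */
    if (avg > 2.0 / 3.0) return 32050.0;  /* buffer running full */
    return 32000.0;                       /* near half full      */
}
```

Because the decision uses the average, one stray reading does not flip the rate; it takes several low readings in a row before the controller switches to 31950.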

Quote:

So now, here's what I've come up with: the pitch change, fluttering, is detectable even with the lowest possible latency setting (~25ms ring buffers x 3).

Pitch change (change in resample ratio) will be much more noticeable if you're not using a high-quality resampler, as the artifacts introduced by lower-quality methods can change a lot when even small changes are made to the ratio. So the ratio change problem might not be due to pitch change alone.

Quote:

I was reading a forum post where someone was resampling 7999hz to 8000hz, and the one sample difference was audible for them as a clicking sound. The person was duplicating the last sample, as that was really his only choice.

Duplicating the last sample is not merely changing the pitch by 1/8000; it's doing a lot more things (like adding noise/harmonics to the signal). It is possible to resample by that amount, it's just not as cheap as duplicating the last sample.

I think slaving the video to audio is a far superior solution that's a ton simpler, and necessary when the monitor's refresh rate isn't exactly 60.whatever Hz. To do this, you adjust the frame rate based on how full the audio buffer is, on average. To reduce consistent frame dropping/doubling, you should resample the audio to an unchanging rate based on the audio card's actual rate and the monitor's actual frame rate.

Here's an example of the above: say your emulated SNES generates 60.09 frames and 32040 stereo samples per second, and your host PC's monitor runs at 60.10 Hz and audio card at 31900 Hz (even though it claims 32000 Hz). For the fewest dropped/doubled video frames, you'd want to resample the audio to 31905.31 Hz. Played at 31900 Hz, this would use 32045.332 SNES stereo samples per second, resulting in 60.1 video frames per second, which matches the host rate. Invariably, the host rates will vary a tiny bit over time, so there would be an occasional dropped or doubled video frame.
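The numbers in this example can be reproduced as follows. This is only a sketch of the arithmetic; the struct and function names are hypothetical, and the last field (output samples per SNES sample) is my own addition to show the ratio a resampler would need.

```c
#include <assert.h>

typedef struct {
    double speed;            /* emulation speed relative to a real SNES */
    double snes_samples_sec; /* SNES samples consumed per real second   */
    double scaled_card_rate; /* card rate scaled by the same factor     */
    double out_per_in;       /* output samples produced per SNES sample */
} sync_plan;

static sync_plan plan_sync(double emu_fps, double emu_rate,
                           double host_fps, double host_rate)
{
    assert(emu_fps > 0.0 && emu_rate > 0.0);
    sync_plan p;
    p.speed = host_fps / emu_fps;               /* 60.10 / 60.09         */
    p.snes_samples_sec = emu_rate * p.speed;    /* about 32045.332       */
    p.scaled_card_rate = host_rate * p.speed;   /* about 31905.31        */
    p.out_per_in = host_rate / p.snes_samples_sec;
    return p;
}
```

Calling plan_sync(60.09, 32040, 60.10, 31900) yields roughly 32045.332 SNES samples per second and 31905.31 Hz for the scaled card rate, matching the figures above.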

2007-02-02

Sorry for the delayed response.

Quote:

And if you adjust the audio to be faster/slower in just the right proportion, the buffer never becomes completely empty or full. This would be a closed-loop system, where the adjustments are in response to how full the buffer is.

That's what I was hoping to do, but it seems that according to tepples, I cannot adjust by more than 100 samples/second, or the audio pitch will noticeably change. While I should hopefully be able to stay within that range when emulation stays at 100%, I would like to cover cases when speed drops below 100%. The differences between audio blocks are far too great for just resampling block by block. Example: with 750ms latency (eg a massive latency buffer): format = req:resampledfrom@32khz
8000:7850
8000:8200
8000:7400
... etc.

I would like to avoid locking the emulation speed to the lowest known frequency; in the above case, slowing emulation down to 7400/8000 (92.5%) of full speed permanently.

Quote:

Adjust the slave's data rate by speeding up/slowing down the data that the emulator produces. For video, you change the delay between frames. For audio, you resample it. In either case you ultimately go from X to Y data units per second (frames or samples). Since these transformations are done after emulation, accuracy is not affected.

I guess I'm just worried that resampling audio to any inconsistent rate will be audibly noticeable. I suppose I can write a test application too to test the 0.35% thing. I do know that when I was getting variances of ~1% due to triple buffering being blocking (haven't rewritten it to poll the video card vblank or added high resolution timers to test audio position yet), it was unbelievably noticeable. You could clearly hear when each buffer started playing at a different pitch rate.

Quote:

You should probably look at the average of several buffer readings so you don't change the rate really often (in case the reading itself jumps around often).

Doing this causes more severe rounding, and pitch rate changes even more drastically, but less frequently:
eg instead of 31119, 32007, 32002, 32003; now you have:
32000, 32000, 32000, 32011

Quote:

Pitch change (change in resample ratio) will be much more noticeable if you're not using a high-quality resampler, as the artifacts introduced by lower-quality methods can change a lot when even small changes are made to the ratio. So the ratio change problem might not be due to pitch change alone.

Ok, I'm using these filters now:
http://byuu.cinnamonpirate.com/temp/audio.txt

Aside from the small issue of no clamping in the cubic filter, they seem to sound ok. I can fix the cubic one, I was just running out of time when I wrote the above code.

Should cubic or cosine be sufficient for resampling? It still sounds just as noticeable to me when pitch change >= 1%.

Quote:

Here's an example of the above: say your emulated SNES generates 60.09 frames and 32040 stereo samples per second, and your host PC's monitor runs at 60.10 Hz and audio card at 31900 Hz (even though it claims 32000 Hz). For the fewest dropped/doubled video frames, you'd want to resample the audio to 31905.31 Hz. Played at 31900 Hz, this would use 32045.332 SNES stereo samples per second, resulting in 60.1 video frames per second, which matches the host rate. Invariably, the host rates will vary a tiny bit over time, so there would be an occasional dropped or doubled video frame.

Ah, ok. So what you want me to do is get the best resampling rate possible, and then always resample to that constant rate, and occasionally add/drop a video frame. This will at least make dropped frames far, far less frequent than once every ten seconds. I was thinking you were aiming for always resampling the audio so that the SNES is always synchronized to the video refresh rate; hence never any dropped frames.
I was hoping to offer two modes, one that always guarantees no resampling of audio, but will have video frame dropping; and one that guarantees no dropped frames, but will always resample audio. Are these two impractical / impossible?

2007-02-02

Quote:

While I should hopefully be able to stay within that range when emulation stays at 100%, I would like to cover cases when speed drops below 100%.

By 100%, do you mean how fast it's running relative to a SNES? I don't think you're going to have much luck handling a situation where the emulator drops to 80% real-time for a second without significant audio glitches, unless you add a lot of latency. I can imagine it. Start emulating a game and see "buffering..."

Quote:

I guess I'm just worried that resampling audio to any inconsistent rate will be audibly noticeable.

An inconsistent rate? You should be resampling the audio to a consistent rate, as in, an equal time delay between each sample.
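One way to get that equal spacing across block boundaries is to keep the fractional read position between calls instead of restarting it for every block. The sketch below uses plain linear interpolation purely as an example; the type and function names are hypothetical, and the caller is assumed to provide an output array large enough for the block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    double step;   /* input samples advanced per output sample     */
    double pos;    /* fractional read position, kept across blocks */
    int16_t prev;  /* last input sample of the previous block      */
} lin_resampler;

static void lin_init(lin_resampler *r, double in_rate, double out_rate)
{
    assert(in_rate > 0.0 && out_rate > 0.0);
    r->step = in_rate / out_rate;
    r->pos = 0.0;
    r->prev = 0;
}

/* Consume one input block and write every output sample it yields.
   pos is measured from prev, so pos in [0,1) lies between prev and in[0].
   Returns the number of samples written. */
static size_t lin_process(lin_resampler *r, const int16_t *in, size_t in_count,
                          int16_t *out, size_t out_cap)
{
    size_t written = 0;
    while (r->pos < (double)in_count) {
        assert(written < out_cap);  /* caller sized out too small */
        size_t i = (size_t)r->pos;
        double frac = r->pos - (double)i;
        double a = (i == 0) ? r->prev : in[i - 1];
        double b = in[i];
        out[written++] = (int16_t)(a + (b - a) * frac);
        r->pos += r->step;
    }
    r->pos -= (double)in_count;     /* carry the remainder forward */
    if (in_count > 0)
        r->prev = in[in_count - 1];
    return written;
}
```

Feeding the same samples as one block or as several smaller blocks produces the same output here, so block size no longer influences pitch. A rate adjustment would only change step, leaving pos untouched.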

Quote:

You should probably look at the average of several buffer readings so you don't change the rate really often (in case the reading itself jumps around often).

Doing this causes more severe rounding, and pitch rate changes even more drastically, but less frequently:
eg instead of 31119, 32007, 32002, 32003; now you have:
32000, 32000, 32000, 32011

If you're averaging values, then they will vary less between readings, not more. Averaging is a form of low-pass filtering.
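As a small illustration of averaging acting as a low-pass filter, here is a one-pole smoother (an exponential moving average). This is my own example rather than anything from the emulator, and the smoothing factor alpha is an assumed tuning parameter.

```c
#include <assert.h>

/* Move `state` a fraction `alpha` of the way toward each new reading.
   Smaller alpha means heavier smoothing and slower response. */
static double smooth(double state, double reading, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    return state + alpha * (reading - state);
}
```

With alpha = 0.25, a reading that jumps from 32000 to 32011 moves the smoothed value only to 32002.75, so each step is smaller than the raw jump.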

Quote:

Ah, ok. So what you want me to do is get the best resampling rate possible, and then always resample to that constant rate, and occasionally add/drop a video frame. This will at least make dropped frames far, far less frequent than once every ten seconds.

Right. But one frame glitch every 10 seconds is already pretty good, in my opinion.

Quote:

I was thinking you were aiming for always resampling the audio so that the SNES is always synchronized to the video refresh rate; hence never any dropped frames.

That's what you want(ed) to do, and what I was mostly offering ideas about. The simpler solution is what I prefer, since it's mostly self-explanatory.

Quote:

I was hoping to offer two modes, one that always guarantees no resampling of audio, but will have video frame dropping; and one that guarantees no dropped frames, but will always resample audio. Are these two impractical / impossible?

I doubt you'll avoid frame glitches on a general-purpose operating system. If you really want to give it a fair chance, write some simple test applications that find out whether it's possible at all. One of them shouldn't do any audio, just video.

I've been working on my polyphase (FIR) resampler again and will put some time into allowing really fine ratio adjustment.

2007-02-02

Quote:

By 100%, do you mean how fast it's running relative to a SNES? I don't think you're going to have much luck handling a situation where the emulator drops to 80% real-time for a second without significant audio glitches, unless you add a lot of latency. I can imagine it. Start emulating a game and see "buffering..."

Aww :(
And the problem with the buffering is that it delays input responsiveness. Not to mention, video buffering eats up a ton of memory. So, I guess I'll just have to specify that if you want clear audio, you'll have to cap emulation speeds at 50% for P4s, 25% for P2s, or for 486s that people still insist on running, 10%, heh.

Quote:

If you're averaging values, then they will vary less between readings, not more. Averaging is a form of low-pass filtering.

Hmm, my averaging technique was: instead of testing every sample to see when we hit the next audio buffer, we test every 256 SNES audio samples generated. So the # of samples->output samples is always a multiple of 256. It was cleaning out the audio a little more, but the differences seemed more extreme as a result (eg instead of 4,4,5,5,5,4,5; you get 4,4,4,4,4,4,8).
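The effect described here is quantization rather than averaging, which a toy model can show. The sketch assumes an idealized case where ring boundaries fall every 800 samples and the position is only checked every 256 samples; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* First polled sample count at or after a ring boundary. */
static size_t poll_round_up(size_t boundary, size_t poll_interval)
{
    assert(poll_interval > 0);
    return (boundary + poll_interval - 1) / poll_interval * poll_interval;
}

/* Input samples collected between two consecutive detected boundaries. */
static size_t detected_block(size_t prev_boundary, size_t boundary,
                             size_t poll_interval)
{
    return poll_round_up(boundary, poll_interval)
         - poll_round_up(prev_boundary, poll_interval);
}
```

In this model most blocks come out as 768 samples (3 x 256), and every so often one is 1024 (4 x 256) to catch up, which is the "same, same, same, big jump" pattern in the post. A true average would instead spread the correction across all blocks.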

Quote:

I've been working on my polyphase (FIR) resampler again and will put some time into allowing really fine ratio adjustment.

Sounds interesting. I already can't tell the difference between linear and cubic, despite it being very evident on graphs (hence I didn't bother writing a 4-tap hermite resampler), but I'm really not resampling by more than 50%, either.

2007-02-03

Ok, I've tried again for the 67th time to get the audio working plus the video syncing to vblank and dropping frames it can't possibly draw without tearing.

It's basically your method, blargg, but I haven't enabled the audio resampling to try and even out the audio<>video refresh rate ratio just yet (I actually tried it using a quick hack, and it helped a tiny bit, but there's a much larger problem somewhere making the whole thing not work).

I posted the code I was using, and an explanation of how it works, here:
http://board.zsnes.com/phpBB2/viewtopic ... 973#139973

I don't suppose anyone sees any obvious flaws with this? I don't even know where things are going wrong, sadly :(

2007-02-03

In the Super NES hardware, are the 65C816 CPU, the SPC700 CPU, and the DSP clocked from the same crystal, such that there are exactly 32040.00000 samples per 3579545.454 CPU cycles? Or is there unit-to-unit variation in their clock speeds?

2007-02-03

tepples, I'm guessing that you're going to suggest to take advantage of variations to simplify the task at hand. As in, if the SPC is clocked by a separate crystal (which it is), then make slight adjustments to the emulated crystal's rate rather than trying to resample the output. I took advantage of something similar in my NTSC filter regarding the output aspect ratio, since TVs can vary.

2007-02-03

Darn, seems no one knows why my code isn't working :/

tepples, there are two crystal clocks inside the SNES. There's one shared by S-CPU, S-PPU1, S-PPU2 and S-WRAM at ~315/88*6mhz, and another one shared by the S-SMP (SPC700), S-DSP and the 64k WRAM they share (don't know if that one has a name or not), running at ~32040*768hz.
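For reference, the two clock figures work out as in the sketch below. The helper names are made up for the example; the constants are the ones given in the post (315/88 MHz is the NTSC colorburst frequency).

```c
#include <assert.h>

/* Clock shared by S-CPU, S-PPU1, S-PPU2 and S-WRAM: 315/88 MHz times 6. */
static double cpu_clock_hz(void)
{
    return 315.0e6 / 88.0 * 6.0;   /* about 21,477,272.7 Hz */
}

/* Clock shared by S-SMP and S-DSP: 32040 Hz sample rate times 768. */
static long smp_clock_hz(void)
{
    return 32040L * 768L;          /* 24,606,720 Hz */
}

/* Clock cycles that elapse per output sample at a given sample rate. */
static double cycles_per_sample(double clock_hz, double sample_rate)
{
    assert(sample_rate > 0.0);
    return clock_hz / sample_rate;
}
```

At these nominal values there are about 670.33 CPU-side clock cycles per audio sample, and that ratio is what stays constant in emulation.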

Although both clocks obviously have precision variances in real life, in emulation I can at least guarantee a constant ratio of clock cycles between the two processors.

I really don't wish to fake the emulated speeds of either processor to get them more synced up (in fact, EWJ2 sound effects break if you change the sound clock enough to get 32khz output). Though I realize that since I don't (and really can't) emulate a true crystal clock's variance in realtime, the 32040*768hz approach is still more accurate than 32000*768hz. I will have to resample the audio output and drop video frames, sadly.

2007-02-03

Quote:

Darn, seems no one knows why my code isn't working :/

The answer is probably that you're trying to solve it in a way that goes beyond your skills (remember, debugging takes twice the effort of coding). First thing, write a minimal app from scratch that manifests the problem. If possible, write all the relevant code from scratch and keep it damn simple. Real-time behavior can get complex very easily. From what you've written, it sounds like you're debugging all this in bsnes. I know the urge to hack away and put off doing a clean prototype.

I've also done more work on resamplers (did you get the ones I e-mailed yesterday?) and linear is quite fine for SNES. I just did an all-integer implementation of cosine interpolation, which is a bit better than linear. I love this fast sin approximation:

Code:

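#include <assert.h>
#include <math.h>

/* Parabolic approximation of sin(x), valid only for -pi <= x <= pi. */
double fast_sin( double x )
{
    double pi = 3.14159265358979323846;
    assert( -pi <= x && x <= pi );
    
    x /= pi;  /* normalize to -1..1 */
    return (1 - fabs( x )) * x * 4;  /* exact at 0, +/-pi/2 and +/-pi */
}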

2007-02-04

Quote:

The answer is probably that you're trying to solve it in a way that goes beyond your skills (remember, debugging takes twice the effort of coding). First thing, write a minimal app from scratch that manifests the problem. If possible, write all the relevant code from scratch and keep it damn simple. Real-time behavior can get complex very easily. From what you've written, it sounds like you're debugging all this in bsnes. I know the urge to hack away and put off doing a clean prototype.

You're correct, I am not skilled enough to debug code running in realtime like this. I think I see the problem, however. I simply can't reliably detect the start of vblank and have the video code blitted to the screen. No matter how many places I tell bsnes to check to see if we've reached vblank (over 40,000 times a second), it's still not enough and I miss entire vblank periods 20% of the time. Most likely, there are too many parts in bsnes that eat up so much CPU time that it jumps right over vblank. I patched all the obvious ones, adding checks to every scanline rendered both by the PPU and the filter, to every audio sample generated, and during every wait state inside the audio sync code. I even removed all Sleep(1) calls in case that 1ms were to kick me out of vblank. Still wasn't good enough.

It's too bad video drivers and/or API developers are too incompetent to design APIs that handle page flipping completely transparently in the background without deadlocking your applications when you request to blit the image to the screen. Realistically, all you need is a hardware interrupt that triggers the moment the vblank edge is reached. It's amazing the way a twenty year old video game system has such a device (NES NMI, etc), and yet modern computers still lack this very useful functionality.

Anyway, I've given up, sadly. I can't think of any way to do this, and I've exceeded my patience and run out of ideas. Thanks for trying to help, though. I really wish I knew how other emulators managed to do this, but the answer is probably something I wouldn't be willing to do anyway (eg force emulation to generate more samples, which would be visible to the system's sound registers).

Quote:

I've also done more work on resamplers (did you get the ones I e-mailed yesterday?) and linear is quite fine for SNES. I just did an all-integer implementation of cosine interpolation, which is a bit better than linear. I love this fast sin approximation:

I like cosine, because the graphs for it are prettier, they look almost identical to hermite, and besides the end points, very similar to cubic as well. Whereas linear looks very mechanical and unnatural, even though I agree that it sounds just fine.

Hmm, I like the sin algorithm. Perhaps I can take advantage of it. How do you convert that value to cosine again?

2007-02-04

Well, I just realized something very stupid.

I had forgotten about the way I implemented cothreads in bsnes: I basically take advantage of the fact that I don't have to back out of the CPU or SMP like normal emulators do, since I can swap between the two at any time thanks to cothreads; and so I run the currently active core as long as possible: either until one accesses the other, or until ~300,000 clock ticks have passed on one or the other (and even that is only to keep the audio buffer from emptying and the difference counters from overflowing).

This of course means that the CPU and SMP can be desynchronized by up to 1/8th of a second at any given time. Hence, the wildly fluctuating samples generated per video frame.

I can disable this behavior, but I will lose much of the benefits of cooperative multithreading and take a massive speed hit, especially on platforms where libco doesn't run as fast. At least I'll still get the cleaner code benefits. I guess I can give it a try, though I still think the resampling rate will be too sporadic to sound good.

Funny that I just now realized that when sinimas made the comment that the number of audio samples generated per video frame should be constant, and while immediately thinking "no, that's not true", realized, "wait... actually, yes, that should be true". Oh well.

2007-02-04

Quote:

I simply can't reliably detect the start of vblank and have the video code blitted to the screen. No matter how many places I tell bsnes to check to see if we've reached vblank (over 40,000 times a second), it's still not enough and I miss entire vblank periods 20% of the time. Most likely, there are too many parts in bsnes that eat up so much CPU time that it jumps right over vblank.

Can't you just add another pseudo hardware device to the emulated SNES that claims to be able to affect the CPU just after the beginning of every frame? Then that device would get control at the right spot every frame, without adding any checks in the emulator (since I'm assuming you already have the framework for this sort of event).

Quote:

It's too bad video drivers and/or API developers are too incompetent to design APIs that handle page flipping completely transparently in the background without deadlocking your applications when you request to blit the image to the screen.

Could this be because your application is asking for another page flip before the first one has occurred, and the API must block that thread until the first completes?

Quote:

Anyway, I've given up, sadly. I can't think of any way to do this, and I've exceeded my patience and run out of ideas.

This always happens to me when I don't want to slow down and approach the problem in isolation from the main project. At some point in the future I eventually do that, then spend a week or more experimenting with the concepts alone and figure it out. I have to become interested in the topic for its own sake, rather than as a mere problem to be solved and forgotten.

Quote:

Funny that I just now realized that when sinimas made the comment that the number of audio samples generated per video frame should be constant, and while immediately thinking "no, that's not true", realized, "wait... actually, yes, that should be true".

The number of samples has to vary by one or two, since there is some fraction of a sample extra each frame.
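
To put rough numbers on that, here is a small sketch using the approximate rates from the first post (~32,040 Hz audio, ~60.09 fps video), which work out to about 533.2 samples per frame. The function name and the exact-fraction bookkeeping are only for illustration.

```python
def samples_per_frame(num, den, frames):
    # num/den is the exact samples-per-frame ratio; acc carries the
    # fractional remainder (in units of 1/den sample) between frames.
    counts = []
    acc = 0
    for _ in range(frames):
        acc += num
        whole, acc = divmod(acc, den)
        counts.append(whole)
    return counts

# 32040 Hz / 60.09 fps == 3204000 / 6009 samples per frame
counts = samples_per_frame(32040 * 100, 6009, 6009)
```

With these figures every frame gets either 533 or 534 samples, and the leftover fraction decides which frames get the extra one.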

Quote:

I like cosine, because the graphs for it are prettier, they look almost identical to hermite, and besides the end points, very similar to cubic as well.

Cosine introduces discontinuities at each point, and it shows in frequency graphs. Here are the four compared (FIR using 11 point kernels), using a sweep from 16 kHz to 0 kHz in a 32 kHz sampled stream, resampled to 44.1 kHz by these.

You can see the low frequency aliases in linear and cosine (cosine comes out worse in some ways), while Hermite and FIR have one that is mostly inaudible in the upper range.

2007-02-07

Quote:

Can't you just add another pseudo hardware device to the emulated SNES that claims to be able to affect the CPU just after the beginning of every frame? Then that device would get control at the right spot every frame, without adding any checks in the emulator (since I'm assuming you already have the framework for this sort of event).

I like the idea. However, I have kind of an odd setup. I wanted to account for possibly adding chips with their own clock rates to the emulator in the future (eg DSP-1, SuperFX, SA-1, etc ... though they mostly use the S-CPU clock rate fed to the cartridge pins anyway). So what I have is one variable for each pair of clocks that need to synchronize. Right now, there's just one for S-CPU <> S-SMP. Since the S-PPU1/2 and S-DSP are not emulated at the clock level, they are just enslaved to the CPU and SMP. Therefore, for CPU<>SMP, I keep one 64-bit variable. Whenever the CPU adds clocks, I subtract clocks * smpclockrate from this value. Whenever the SMP adds clocks, I add clocks * cpuclockrate to it. If the clocks were identical (eg CPU<>PPU), then the multiplication wouldn't be necessary.
I can detect if one processor is ahead by seeing if this value is >=0 or <0, respectively.
Now, the fun part is that to save speed, when I ask one processor to sync to the other, it will run nonstop until that processor needs to access the other processor, in which case it switches contexts and runs the other processor.
The way I break out is that each time the S-CPU vcounter reaches 240 (where no video can be rendered, regardless of region or overscan settings), I context switch back to the main thread to end the "run_frame();" call.
I could add a function that does something like "keep running the SMP until the two clocks are as close to equal as possible", but the only way I could prevent it from running forever and/or switching back to the other processor is to add a check to break out right inside the core "add_clocks();" function for each processor. This would add a ton of overhead, since these functions are called millions of times a second. The same goes for replacing the function with a function pointer that I swap out: the indirect function call overhead would then add up. Right now the add_clocks functions that sync the two processors are force inlined and mainly consist of one add, mul and compare.
Lastly, this could get a lot more complex if and when more clock syncs were thrown into the mix. Still, it's a good idea, and the most viable one ...
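
For anyone following along, the counter scheme described above can be sketched roughly like this. It is an illustration only, not the bsnes code: the class and method names are invented, and the two clock rates are assumed nominal NTSC values. Python integers don't overflow, whereas the real thing uses a 64-bit variable.

```python
CPU_RATE = 21_477_272  # assumed nominal NTSC S-CPU clock, in Hz
SMP_RATE = 24_576_000  # assumed nominal S-SMP clock, in Hz

class ClockSync:
    def __init__(self, cpu_rate=CPU_RATE, smp_rate=SMP_RATE):
        self.cpu_rate = cpu_rate
        self.smp_rate = smp_rate
        # counter == smp_clocks * cpu_rate - cpu_clocks * smp_rate
        self.counter = 0

    def cpu_add_clocks(self, clocks):
        # CPU time advances: scale by the *other* clock rate and subtract.
        self.counter -= clocks * self.smp_rate

    def smp_add_clocks(self, clocks):
        # SMP time advances: scale by the *other* clock rate and add.
        self.counter += clocks * self.cpu_rate

    def smp_caught_up(self):
        # >= 0 means the SMP has covered at least as much time as the CPU.
        return self.counter >= 0
```

Multiplying each side by the other side's rate puts both on a common scale, so one emulated second on either processor changes the counter by the same magnitude.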

Quote:

Could this be because your application is asking for another page flip before the first one has occurred, and the API must block that thread until the first completes?

That's a very real possibility. However, neither DDraw nor D3D gives you a way to see whether a page flip is already pending so that you can hold off. If they did, that would be absolutely perfect.

Quote:

This always happens to me when I don't want to slow down and approach the problem in isolation from the main project. At some point in the future I eventually do that, then spend a week or more experimenting with the concepts alone and figure it out. I have to become interested in the topic for its own sake, rather than as a mere problem to be solved and forgotten.

If you're still interested in this topic, then I certainly don't mind continuing to discuss it with you. I'd like to have a definitive solution for this problem as well :D

Quote:

The number of samples has to vary by one or two, since there is some fraction of a sample extra each frame.

For the aforementioned reasons, I'm getting a lot more than that, sadly. Resampling by one sample or two per 25ms audio buffer should be quite easy.
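
As a rough idea of what that easy case looks like, here is a minimal linear-interpolation sketch that stretches or squeezes one block of x input samples into y output samples. The function name is hypothetical, and it treats each block in isolation; a streaming version would presumably carry the fractional position over from one block to the next.

```python
def resample_linear(src, out_len):
    # Map the first and last input samples onto the first and last
    # output samples, interpolating linearly in between.
    n = len(src)
    if n == 0:
        raise ValueError("need at least one input sample")
    if n == 1 or out_len == 1:
        return [float(src[0])] * out_len
    out = []
    for i in range(out_len):
        pos = i * (n - 1) / (out_len - 1)  # fractional index into src
        idx = min(int(pos), n - 2)         # left neighbour, clamped
        frac = pos - idx
        out.append(src[idx] * (1.0 - frac) + src[idx + 1] * frac)
    return out
```

For example, `resample_linear(block, 800)` on an 801-sample block would drop one sample's worth of duration by spreading the difference across the whole block instead of cutting a sample outright.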

Quote:

Cosine introduces discontinuities at each point, and it shows in frequency graphs. Here are the four compared (FIR using 11 point kernels), using a sweep from 16 kHz to 0 kHz in a 32 kHz sampled stream, resampled to 44.1 kHz by these.

...

You can see the low frequency aliases in linear and cosine (cosine comes out worse in some ways), while Hermite and FIR have one that is mostly inaudible in the upper range.

To be honest, I really don't understand the graph or what you mean, but I have virtually no experience with audio. No need to explain it in layman's terms, though. I'll take your word (and picture) for it that the FIR resampler is best. By the way, how does cubic look on that graph? Comparable, or worse than hermite?

2007-02-07

Cubic and Hermite are often used to mean the same thing: interpolation based on the value and first differential at the start and end of each interval.
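
A minimal sketch of that kind of interpolator, for reference. It assumes the slope at each end is estimated from the neighbouring samples by a central difference (the Catmull-Rom choice); the thread doesn't say which slope estimate was actually used, and the function name is made up.

```python
def hermite(y0, y1, y2, y3, mu):
    # Interpolate between y1 (mu = 0) and y2 (mu = 1).
    # Slopes at y1 and y2 are assumed central differences (Catmull-Rom).
    m1 = (y2 - y0) / 2.0
    m2 = (y3 - y1) / 2.0
    mu2 = mu * mu
    mu3 = mu2 * mu
    # Cubic Hermite basis functions.
    h00 = 2 * mu3 - 3 * mu2 + 1
    h10 = mu3 - 2 * mu2 + mu
    h01 = -2 * mu3 + 3 * mu2
    h11 = mu3 - mu2
    return h00 * y1 + h10 * m1 + h01 * y2 + h11 * m2
```

Because both the value and the slope match at each sample point, neighbouring intervals join smoothly.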

2007-02-09

Quote:

you can interpolate per sample rather than outright drop frames. I've been completely unable to think of how to do this, however.

My NES emulator does just what you're talking about. The emulation loop is synced to 60Hz (using my monitor's vsync rate, but it could use any timer). Every other frame, I check how full the DirectSound buffer is and adjust the playback frequency to compensate. I keep the buffer about 70% full so I never have to drop video frames to catch up or block waiting for the buffer to accept more data. The adjustments are small enough that there's no (obvious) audible frequency changes. This is with an 80ms buffer (at 70% full, an average of 3-4 frames latency) on a SoundBlaster X-Fi (so not sure how well it works with onboard audio).
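
The general shape of that kind of adjustment might look like the sketch below. This is not the code from the emulator, just a proportional nudge toward a target fill level; the function name, gain and clamp values are all made-up assumptions. With DirectSound, the result would presumably be applied through IDirectSoundBuffer::SetFrequency.

```python
def adjusted_playback_rate(base_rate, fill, target_fill=0.70,
                           gain=0.02, max_adjust=0.005):
    # fill is the fraction of the sound buffer currently queued (0.0..1.0).
    # Too full: play slightly faster to drain it. Too empty: slightly slower.
    error = fill - target_fill
    # Clamp the change (here +/-0.5%) so the pitch shift stays small.
    adjust = max(-max_adjust, min(max_adjust, error * gain))
    return base_rate * (1.0 + adjust)
```

Called every other frame with the current fill level, this keeps the rate within 0.5% of nominal while steering the buffer back toward 70%.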

If you're interested, I'll clean up and post the current version of the code I use (the version on the web site uses a different technique, though the idea is similar).

James