Hello, I'm sorry to ask this question here (I also asked at mameworld.info), but there's a (not so) surprising lack of people out there who can help me with this. I was hoping that perhaps someone here has encountered this issue in the past and already worked out some possible solutions.
Basically, emulation has two really important outputs: video and audio. Video typically runs close to, but not exactly at, the monitor refresh rate (in my case, ~60.09fps emulation vs ~60.00fps monitor), and the same goes for audio (~32,040Hz emulation vs ~32,000Hz sound card). So, to get smooth audio and video, you either have to resample the video (skip/duplicate frames), which is quite visible to the user, or you have to resample the audio. The latter should in theory be less noticeable if done right, since you can interpolate per sample rather than outright dropping or duplicating frames. I've been completely unable to work out how to do this, however.
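Just to put concrete numbers on the mismatch, here's a quick back-of-the-envelope sketch using my rates above (nothing here is emulator-specific, and the exact clocks vary per machine):

```cpp
#include <cstdio>

int main() {
    // Nominal rates from my setup; the exact clocks vary per machine.
    const double emuRate  = 32040.0;   // samples/sec produced by emulation
    const double cardRate = 32000.0;   // samples/sec consumed by the sound card

    const double ratio   = emuRate / cardRate;    // ~1.00125: how much the audio must be stretched
    const double surplus = emuRate - cardRate;    // ~40 extra samples generated every second

    printf("resample ratio: %.5f\n", ratio);
    printf("surplus: %.0f samples/sec (~%.2f ms of audio per second)\n",
           surplus, surplus / cardRate * 1000.0);
    return 0;
}
```

So if nothing is done, roughly 40 extra samples (about 1.25ms of audio) pile up every second, and they have to go somewhere.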
My setup is that I use DirectSound (the API itself is unimportant) with a 3-ring audio buffer, plus one very large temporary buffer. The ring size is adjustable, typically ~800 samples at 32kHz (~25ms × 3 = 75ms latency). I run emulation, and each time a sample is generated I add it to the temporary buffer and then check the current playback position in the audio buffer. As soon as I see that the playback position has entered a new ring (of the 3; it wraps), I take the temporary buffer, resample it, write it two rings ahead of the currently playing ring, and flush the temporary buffer. Why two ahead of the current ring? A: to give time for resampling, and B: I use 3 rings instead of 2 because I also want to allow syncing by dropping video frames, and that requires 3 at minimum if you think about it.
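For clarity, here's a rough sketch of that scheme (not my actual code; queryPlayCursorSamples() and writeRing() are hypothetical placeholders for the DirectSound GetCurrentPosition and Lock/Unlock calls, and the resample() stub stands in for the conversion discussed next):

```cpp
#include <cstdint>
#include <vector>

constexpr int kRings    = 3;
constexpr int kRingSize = 800;   // samples per ring at 32kHz => ~25ms each, ~75ms total

static std::vector<int16_t> temp;   // the "very large" temporary buffer
static int lastPlayRing = 0;

// Hypothetical placeholders for the real audio API calls:
static int  queryPlayCursorSamples() { return 0; }                        // playback position, in samples
static void writeRing(int /*ring*/, const int16_t* /*p*/, int /*n*/) {}   // Lock / copy / Unlock
static void resample(const std::vector<int16_t>& /*in*/, int16_t* out, int n) {
    for (int i = 0; i < n; i++) out[i] = 0;   // placeholder for the drop-sample conversion below
}

void onSampleGenerated(int16_t sample) {
    temp.push_back(sample);

    int playRing = queryPlayCursorSamples() / kRingSize;
    if (playRing == lastPlayRing) return;   // playback is still inside the same ring

    // Playback just crossed into a new ring: convert everything accumulated so
    // far into exactly one ring's worth of samples, write it two rings ahead of
    // the one now playing, and flush the temporary buffer.
    int16_t out[kRingSize];
    resample(temp, out, kRingSize);
    writeRing((playRing + 2) % kRings, out, kRingSize);
    temp.clear();
    lastPlayRing = playRing;
}

int main() {
    for (int i = 0; i < kRingSize * 2; i++) onSampleGenerated(0);   // feed some silence through the path
    return 0;
}
```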
Now, the problem is the resampling part. My output clock rate is consistent at 32kHz, but I can only roughly detect the current playback position, not every sample edge. My input clock rate is known (~32,040Hz), but the number of samples I actually generate can vary wildly depending on how much time the OS gives my emulator and how complex emulation is during the next "ring". The number of input samples can be more or less than the number of output samples needed to fill the next ring, so for now I'm using a godawful point-resample ("drop-sample") filter (I know you can practically get into quantum physics and n-dimensional trig/calculus trying to resample audio properly, but I chose the simple approach to start) to convert x input samples into y output samples.
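For reference, the point/"drop-sample" conversion I'm using is essentially just nearest-neighbor indexing; a minimal, self-contained version looks something like this (function and variable names are mine, not from any library):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Stretch or squeeze `in` (x input samples) to exactly `outCount` (y output
// samples) by stepping through the input and truncating to the nearest sample.
std::vector<int16_t> pointResample(const std::vector<int16_t>& in, int outCount) {
    std::vector<int16_t> out(outCount);
    if (in.empty() || outCount <= 0) return out;

    const double step = static_cast<double>(in.size()) / outCount;  // input samples per output sample
    double pos = 0.0;
    for (int i = 0; i < outCount; i++) {
        size_t src = static_cast<size_t>(pos);          // truncate: this is the "drop-sample" part
        if (src >= in.size()) src = in.size() - 1;
        out[i] = in[src];
        pos += step;
    }
    return out;
}

int main() {
    // e.g. one ring's worth: 7950 generated samples stretched to the 8000 the ring needs
    std::vector<int16_t> input(7950, 0);
    auto output = pointResample(input, 8000);
    printf("%zu -> %zu samples\n", input.size(), output.size());
    return 0;
}
```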
The problem is that audio is just waaay too sensitive to this, regardless of latency. A sample conversion log (with a massive ring size of 8,000 samples, ~250ms × 3 = 750ms latency) looks like this:
7950 samples -> 8000 samples
7980 samples -> 8000 samples
8020 samples -> 8000 samples
7990 samples -> 8000 samples
Every single ring buffer that gets played back has an extremely noticeable pitch shift from the previous sample block (even for very, very small sample differences, which surprised me), and the result is truly horrible to listen to. It actually sounds better to leave it crackling by not resampling the audio at all and just ignoring buffer overruns/underruns. In the rare cases where I get several blocks in a row with the same number of input and output samples, it sounds quite tolerable, again until the input sample count changes and the pitch shifts again. I can get more blocks sounding the same by testing the playback position every n samples rather than every single sample, but eventually that rounds off and one ring buffer ends up with an even more massive pitch difference from the others to compensate. I also want to handle differing timings with at least a ~5-10% tolerance (e.g. if I need 800 output samples, I want audio to sound good when the input sample count is anywhere within ~700-900 samples), since sometimes emulation will fall below 60fps, and I'd like it if the audio didn't go all to hell when that happens.
Obviously, I'd need to even out the pitch over multiple ring buffers. But my question is, how in the world is that possible when you're streaming asynchronous audio data? You can't possibly predict how many samples you'll get in the next ring/block until you emulate it, and the more you buffer to try to even things out, the worse your latency gets. At anything > 100ms, you can noticeably tell the delay between, e.g., Link swinging his sword and the sound effect for it playing back. If you delay the video as well to account for this, then your input-to-onscreen response gets delayed even more.
Please note that manipulating the CPU/SMP frequency counters to adjust the ratio of video frames per second to audio samples per second is not an option, as that affects emulation accuracy. I'm also not able to ask the emulated DSP to generate extra samples, as that would result in changes to DSP registers that are visible to program code: again, not good for accuracy.
Any help would be greatly appreciated, thanks in advance.