Ok guys, here's the problem. I'm on a mobile device, so performance is at a premium. I don't have enough CPU to emulate the APU at the full 1.79 MHz and then resample down to 44 kHz. What's the proper thing to do here? And how can I do it super fast?
I'm currently point-sampling the waveform at regular intervals; that works, but it occasionally doesn't sound very accurate.
No need to sacrifice quality, and this is super-fast (faster than anything else I've seen):
blip_buf library.
Explanation of how it works.
You don't need all 1.79 MHz worth of samples; you just need to know when the volume level changes.
Here's an example of how to write a catch-up APU that uses linear interpolation.
You aren't running each cycle separately and accumulating the average volume; instead, you call a function whenever the volume level of a channel changes.
Here's the code. It's currently written for floating point, but it's easy enough to change to fixed-point math. (Best to make samples_to_run a 32.32 fixed-point number on ARM; the UMULL instruction turns that division into a fast multiply by a reciprocal.)
The noise channel will still run the slowest, though, since it can change volume as often as every 4 CPU cycles.
Code:
/* Per-channel catch-up state (one set of these per APU channel) */
static double last_cpu_cycle;          /* CPU time this channel was last run to      */
static double old_volume;              /* channel's output level before this change  */
static double position_within_sample;  /* 0.0..1.0, how far into the current sample  */
static double accumulated_volume;      /* volume*time accumulated within that sample */
static int    sample_position;         /* next index to write in wave[]              */
static float  wave[4096];              /* output sample buffer                       */

#define CPU_SPEED     1789773.0
#define SAMPLING_RATE 44100.0

void run_channel( double new_cpu_cycle, double new_volume )
{
    double cycles_to_run  = new_cpu_cycle - last_cpu_cycle;
    last_cpu_cycle = new_cpu_cycle;
    double samples_to_run = cycles_to_run / (CPU_SPEED / SAMPLING_RATE);

    /* are we inside a fractional sample? */
    if ( position_within_sample > 0.0 )
    {
        /* will we leave the sample? */
        if ( position_within_sample + samples_to_run >= 1.0 )
        {
            double remaining = 1.0 - position_within_sample;
            accumulated_volume += remaining * old_volume;
            samples_to_run     -= remaining;
            wave[sample_position++] = (float) accumulated_volume;
        }
        else /* we don't leave the sample */
        {
            accumulated_volume     += samples_to_run * old_volume;
            position_within_sample += samples_to_run;
            old_volume = new_volume;
            return;
        }
    }

    /* run whole samples */
    int whole_samples = (int) samples_to_run;
    for ( int i = 0; i < whole_samples; i++ )
        wave[sample_position++] = (float) old_volume;
    samples_to_run -= whole_samples;

    /* entering a new partial sample */
    if ( samples_to_run > 0.0 )
    {
        accumulated_volume     = samples_to_run * old_volume;
        position_within_sample = samples_to_run;
    }
    else
        position_within_sample = 0.0;

    old_volume = new_volume;
}
I see Blargg also posted his code. He also used band-limited sound (better quality, but slower) as an option instead of just linear interpolation.
Quote:
I see Blargg also posted his code. He also used band-limited sound (better quality, but slower) as an option instead of just linear interpolation.
It can also do linear interpolation, and in my tests it's faster than even the style of approach you outline above (as well as not requiring any arcane sample calculations, so the APU code stays very, very clean). If you'd be willing, we can run some benchmarks on whatever portable platform your example is for. Essentially,
Code:
void run_channel( int new_cpu_cycle, int new_volume ) // your code
{
    int delta = new_volume - old_volume;
    old_volume = new_volume;
    blip_add_delta_fast( blip, new_cpu_cycle, delta ); // our waveform changed by delta units at new_cpu_cycle, simple
}

void blip_add_delta_fast( blip_t* blip, int time, int delta ) // library does this
{
    int f = time * factor + offset;
    short* b = buffer + (f >> 15);
    int d2 = (delta * (f & 0x7FFF)) >> 15;
    b [0] += delta - d2;
    b [1] += d2;
}
I've surveyed the alternatives and just can't see much reason to use something that complicates the emulator code and is slower as well.
Droidsound on my Archos 43 Internet Tablet (4.3" PDA running Android 2.2) and on my ASUS Nexus 7 (7" tablet running Android 4.2) uses blargg's GMEPlugin and runs at full speed.
I see the difference: your code puts in deltas, while my code calculates the final values of the samples (hence the for loop that fills in samples).
So the buffer needs to start as zeroes and get converted into sample values before it's played back. Sample buffers for a frame's worth of time are very tiny (800 samples for 1/60 s at 48000 Hz), so converting deltas to samples is really fast.
Your code is probably a lot better here.
All the division in my code is really just multiplication in disguise, so that part is no different. But taking the fractional and whole parts of the fixed-point number and using them directly as array indexes and values looks really spiffy.
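To make that concrete, converting is just a running sum over the frame's buffer; something like this (a rough sketch, not blargg's actual code, and the names are made up):
Code:
/* Turn a frame's worth of deltas into playable samples with a running sum.
   'deltas' is a zero-initialized buffer that the channel code added deltas
   into; 'integrator' carries the amplitude from frame to frame. */
void deltas_to_samples( const short* deltas, short* out, int count, int* integrator )
{
    int sum = *integrator;          /* current amplitude */
    for ( int i = 0; i < count; i++ )
    {
        sum += deltas[i];           /* each delta shifts the amplitude from here onward */
        out[i] = (short) sum;
    }
    *integrator = sum;              /* remember the amplitude for the next frame */
}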
Yes, exactly. The algorithm kind of turns everything inside-out, so that the waveform synthesis only has to touch the buffer at points where it changes. And since it's deltas, it doesn't matter what order you do them in, or whether they "overlap" (two deltas one clock apart). So you aren't keeping track of which sample you last ran to, filling intermediate samples, or doing special things when deltas occur really close together. Far fewer headaches and special cases where bugs can creep in. And on platforms where there's plenty of CPU, user code can be converted to use full band-limited synthesis just by calling blip_add_delta() (rather than the fast variant).
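To show the overall flow, a frame of emulation ends up looking roughly like the sketch below. I'm writing the blip_* calls from memory of blip_buf.h, so double-check the exact signatures there; emulate_cpu_until() and play_samples() are stand-ins for whatever your emulator and platform provide.
Code:
#include "blip_buf.h"

#define CPU_CLOCK    1789773            /* NTSC NES CPU clock */
#define SAMPLE_RATE  44100
#define FRAME_CLOCKS 29781              /* roughly one 60 Hz frame of CPU cycles */

void emulate_cpu_until( int cycles );               /* placeholder: your CPU/APU core */
void play_samples( const short* buf, int count );   /* placeholder: platform audio output */

static blip_t* blip;
static short   out_buf[1024];

void audio_init( void )
{
    blip = blip_new( SAMPLE_RATE / 10 );             /* holds 1/10 second of samples */
    blip_set_rates( blip, CPU_CLOCK, SAMPLE_RATE );
}

void run_frame( void )
{
    /* Channel code calls blip_add_delta() (or blip_add_delta_fast() on slow
       devices) with CPU-cycle timestamps whenever an output level changes. */
    emulate_cpu_until( FRAME_CLOCKS );

    blip_end_frame( blip, FRAME_CLOCKS );            /* make this frame's samples readable */
    int count = blip_read_samples( blip, out_buf, 1024, 0 );
    play_samples( out_buf, count );
}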
Doesn't any kind of interpolation introduce some latency into the audio output?
For instance, to add a band-limited step into the output, you need to copy more than just one sample to the output; you need to copy a couple of samples before the step, then the step itself, and then a couple of samples after the step. With linear interpolation, you just need to copy two samples.
The latency would probably be too small to be noticeable, unless you're running at a super-low sample rate. Plus, the size of the buffer itself would probably introduce the most significant latency.
This was an issue that was always interesting to me; is there any way to have zero-latency audio output? (Or close to it)
Interpolation requires you know the points to interpolate between before you can interpolate, yes.
Synthesis does not necessarily require this. I haven't tried blargg's library, so I don't know what requirements it has in terms of latency, but there are lots of audio processes that don't inherently require latency (IIR filters, for example).
However, on a PC, almost all audio is buffered anyway, so this shouldn't be too much of a concern. You can use ASIO to get very low latency buffered audio, if needed. I don't know of any zero-latency audio interfaces, but they could be built, in theory (though it is not good for multitasking, obviously).
Drag, a minimum-phase FIR will give an impulse response similar to an IIR's, adding only a few samples of latency even if the FIR is of a high order (64 taps or more). But this would only matter if you continuously output samples to the hardware as they became ready and the hardware had a really small buffer. That would be really demanding, since you would never be able to do much work between calculating samples. I doubt anything but a microcontroller these days even allows unbuffered audio.
As was mentioned already, you absolutely do not (and should not) need to process the APU stuff at 1.79 MHz. What I do is determine and keep track of how often each channel's output gets updated (in terms of CPU ticks), and after every CPU instruction emulated, I check whether it's time to do so yet. Likewise, I determine how many CPU ticks pass between output samples by taking the number of CPU ticks per second and dividing by the sample rate. Whenever it's time to create a new sample and put it in the output buffer, all I have to do is mix the current value from all channels into one sample. That's all.
The APU emulation takes nowhere near the CPU power of PPU emulation. In fact, if you did only the APU stuff and ignored all the graphics stuff, it would probably be fast enough to generate the audio at full speed on a 386, or possibly even a high-end 286!
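Something along these lines, as a rough sketch (the *_level() and output_sample() names are just placeholders for illustration):
Code:
#define CPU_CLOCK   1789773
#define SAMPLE_RATE 44100

int square1_level( void );  int square2_level( void );
int triangle_level( void ); int noise_level( void );  int dmc_level( void );
void output_sample( int sample );                /* placeholder: push to the output buffer */

static double next_sample_time;                  /* CPU cycle at which the next sample is due */
static const double cycles_per_sample = (double) CPU_CLOCK / SAMPLE_RATE;   /* ~40.6 */

void after_instruction( long cpu_cycle )
{
    while ( cpu_cycle >= next_sample_time )
    {
        /* mix the current level of every channel into one output sample */
        output_sample( square1_level() + square2_level()
                     + triangle_level() + noise_level() + dmc_level() );
        next_sample_time += cycles_per_sample;
    }
}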
The APU can generate 1.79 MHz-accurate audio without any speed hit. In fact, it's more cumbersome to do it at a lower rate. There's no need to muddy the emulation code with consideration of output samples.
But you don't need to explicitly run some loop of code 1.78 million times per second; I think that's what's being said here.
Dwedit wrote:
But you don't need to explicitly run some loop of code 1.78 million times per second; I think that's what's being said here.
Yeah, exactly. The results are the same. If I did run a loop 1.78 million times per second, almost all of that CPU time would be wasted. Might as well just calculate how many CPU ticks occur between one of an APU channel's ticks and the next, and only update the channel's output that often. The accuracy is still maintained, because after every CPU instruction the emulator checks whether an update is due and performs it then.
Turn it inside-out. Have your CPU keep track of the current time. When the APU is about to be written to, tell it to catch up to the current time, then do the write. The APU keeps track of the time of the next amplitude change, and just runs a loop skipping through time adding the appropriate deltas:
Code:
void apu_catchup_square( int end )
{
    while ( time < end )
    {
        int new_amp = ...          /* channel's next output level, e.g. 0 or the current volume */
        int delta = new_amp - amp;
        add_delta( time, delta );
        amp = new_amp;
        time = time + period;
    }
}
Processor usage is then proportional to the frequency of the square wave, rather than to the clock rate.
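As a concrete (hypothetical) example of hooking that into the CPU's write handler:
Code:
/* Sketch: before a write lands in an APU register, bring the affected channel
   up to the current CPU time so the change takes effect on the right cycle.
   square1_write() is a made-up name for applying the register change. */
void square1_write( unsigned addr, unsigned char data );   /* placeholder */

void cpu_write( unsigned addr, unsigned char data, int cpu_time )
{
    if ( addr >= 0x4000 && addr <= 0x4003 )    /* square 1 registers */
    {
        apu_catchup_square( cpu_time );        /* the catch-up loop above */
        square1_write( addr, data );
    }
    /* ... handle the other channels' registers the same way ... */
}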
People have claimed that it's inherently slow to render sound at 1.79 MHz as opposed to lower rates. Is there interest in submitting these algorithms to a performance test that can be run on various platforms to find out what differences there really are?
blargg: would that approach work with the DPCM channel? Since the data it's playing from memory could have changed, you can't really make it "catch up" on a register change. Or do games generally only play SFX from ROM? In that case you could also just do a catch-up on bank swaps, I suppose.
Zelex wrote:
DPCM [...] Or do games generally only play SFX from ROM?
Only the MMC5 supports mapping RAM to the area from where the DPCM fetches data. You probably don't need to worry about that.
Even then, you only need to catch the DPCM up to the present before bank switches. Also, you pretty much have to do part of that for all memory accesses anyway, to account for DPCM wait states. It might be useful to treat the DPCM sample byte fetch as an interrupt-like hardware event. That way, if your emulator uses an optimized scheme of keeping track of the next earliest interrupt-like event, it only needs to do one timestamp comparison per cycle, rather than N comparisons, where N is the total number of possible interrupt-like things that can happen (e.g. NMI, DPCM).
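A rough sketch of that scheme, with all names invented for illustration:
Code:
/* Keep only the earliest pending hardware event.  The CPU core compares the
   current cycle against one number instead of separately checking every
   possible event (NMI, frame IRQ, DPCM fetch, ...) all the time. */
long nmi_time( void );  long frame_irq_time( void );  long dmc_fetch_time( void );
void execute_one_instruction( void );
void handle_due_events( long now );   /* catch channels up, raise IRQ/NMI, fetch DPCM byte, ... */

static long cpu_time;                 /* advanced by execute_one_instruction() */
static long next_event_time;

static void recalc_next_event( void )
{
    long t = nmi_time();
    if ( frame_irq_time() < t ) t = frame_irq_time();
    if ( dmc_fetch_time() < t ) t = dmc_fetch_time();
    next_event_time = t;
}

void cpu_step( void )
{
    execute_one_instruction();
    if ( cpu_time >= next_event_time )    /* one comparison on the common path */
    {
        handle_due_events( cpu_time );
        recalc_next_event();
    }
}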
hmm, there's also accurate IRQ timing to think about as well. Which games use the APU IRQ interrupt?
lidnariq wrote:
Only the MMC5 supports mapping RAM to the area from where the DPCM fetches data. You probably don't need to worry about that.
The FDS has RAM in the DPCM area as well, though I haven't seen anyone use it to modify samples after load.
For the DMC, you'd trigger a catch-up whenever the DMC ticked over to a new byte. Seeing as there can't be any fewer than 432 6502 cycles between fetches, that probably won't have serious performance implications.
If it did, you could also possibly get by with just doing the DMC memory access at that time, stuffing those bytes into a queue to be consumed when the rest of the DMC catches up.
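If it ever did matter, the queue idea might look something like this hypothetical sketch:
Code:
/* Read each DMC sample byte at the cycle it would really be fetched (so a
   later bank switch can't corrupt it), and let the DMC catch-up consume the
   queue in timestamp order. */
typedef struct { int cpu_time; unsigned char data; } dmc_fetch_t;

static dmc_fetch_t dmc_queue[64];
static int dmc_queue_len;

unsigned char read_memory( unsigned addr );     /* placeholder: your CPU bus read */

void on_dmc_byte_fetch( int cpu_time, unsigned addr )
{
    dmc_queue[dmc_queue_len].cpu_time = cpu_time;
    dmc_queue[dmc_queue_len].data     = read_memory( addr );
    dmc_queue_len++;
}
/* apu_catchup_dmc() would then pull bytes from dmc_queue[] as it runs forward. */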
Zelex wrote:
Which games use the APU IRQ interrupt?
I can think of at least Fire Hawk, MiG, Time Lord, and a couple demos that I made.
One other thing to keep in mind: the highest frequency the APU will generate is ~447kHz. Therefore, instead of outputting a sample every cycle, you can output every other cycle (i.e., sample rate = 1.79MHz/2 = ~895kHz) and not exceed the Nyquist limit. This will reduce CPU requirements for mixing and filtering.
I haven't tried this yet, but you should be able to reduce this to every third cycle. There will be aliasing, but it won't affect frequencies below ~150kHz.
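As a rough sketch of what the every-other-cycle mixing loop might look like (helper names are placeholders):
Code:
/* Mix only on every other CPU cycle.  The intermediate buffer is still
   high-rate (~895 kHz) and gets low-pass filtered and decimated down to
   44.1/48 kHz afterwards. */
int mixed_level_at( int cycle );       /* placeholder: sum of channel levels at that cycle */

static short high_rate_buf[32768];

int mix_frame( int frame_cycles )
{
    int out = 0;
    for ( int cycle = 0; cycle < frame_cycles; cycle += 2 )
        high_rate_buf[out++] = (short) mixed_level_at( cycle );
    return out;    /* number of ~895 kHz samples produced this frame */
}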
tepples wrote:
Zelex wrote:
Which games use the APU IRQ interrupt?
I can think of at least Fire Hawk, MiG, Time Lord, and a couple demos that I made.
Just to clarify, those games rely on DMC IRQs. If Zelex is referring to frame IRQs, Dragon Quest relies on them (it will hang after a battle otherwise).
James wrote:
One other thing to keep in mind: the highest frequency the APU will generate is ~447kHz. Therefore, instead of outputting a sample every cycle, you can output every other cycle (i.e., sample rate = 1.79MHz/2 = ~895kHz) and not exceed the Nyquist limit.
Yes. I did it in my emu.