Fast apu processing - NESdev BBS

Fast apu processing
by Zelex on 2012-12-07 (#104234)

OK guys, here's the problem. I'm on a mobile device, so performance is at a premium. I don't have enough CPU to emulate the APU at the full 1.79 MHz and then resample down to 44 kHz. What's the proper thing to do here? And how can I do it super fast?

I'm currently point sampling the waveform at regular intervals, and that works but occasionally does not sound very accurate.

Re: Fast apu processing
by blargg on 2012-12-07 (#104236)

No need to sacrifice quality, and this is super-fast (faster than anything else I've seen): blip_buf library. Explanation of how it works.

Re: Fast apu processing
by Dwedit on 2012-12-07 (#104237)

You don't need all 1.79 MHz worth of samples; you just need to know when the volume level changes.

Here's an example of how to write a catch-up APU that uses linear interpolation.
You aren't running each cycle separately and accumulating the average volume; you are calling a function whenever the volume level of each channel would change.
This is in pseudocode. It's currently written for floating point, but it's easy enough to change to fixed-point math. (On ARM, it's best to change 'samples to run' into a 32.32 fixed-point number; multiplying by a precomputed reciprocal with the UMULL instruction is a very fast way to divide.)
The noise channel will still run the slowest though, since it could change volume as quickly as 4 CPU cycles.

Code:

function run channel (new cpu cycle, new volume)
   cycles to run = new cpu cycle - last cpu cycle
   last cpu cycle = new cpu cycle
   samples to run = cycles to run / (cpu speed / sampling rate)
   //are we inside a fractional sample?
   if (position within sample > 0.0)
      //will we leave the sample?
      if (position within sample + samples to run >= 1.0)
         remaining within sample = 1.0 - position within sample
         accumulated volume += remaining within sample * old volume
         samples to run -= remaining within sample
         wave[sample position] = accumulated volume
         sample position += 1
      else //we don't leave the sample
         accumulated volume += samples to run * old volume
         
         position within sample += samples to run
         old volume = new volume
         return

   //run whole samples
   whole samples to run = int(samples to run)
   for i = 0, i < whole samples to run, i += 1
      wave[sample position] = old volume
      sample position += 1
   samples to run -= whole samples to run
   
   //entering a new partial sample
   if (samples to run > 0.0)
      accumulated volume = samples to run * old volume
      position within sample = samples to run
   else
      position within sample = 0.0
   
   old volume = new volume
   return

I see Blargg also posted his code. He also used band-limited sound (better quality, but slower) as an option instead of just linear interpolation.
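
For readers who want something compilable, the pseudocode above can be sketched in C roughly as follows. This is only one possible transcription: the struct, its field names, and the emit helper are made up for illustration, it keeps floating point, and it adds a small guard against rounding that the pseudocode does not have. Each output sample ends up as the average level over that sample's duration.

```c
#include <assert.h>
#include <stddef.h>

/* All names are illustrative; they mirror the pseudocode's variables. */
typedef struct {
    double cycles_per_sample; /* cpu speed / sampling rate */
    long   last_cycle;        /* CPU cycle of the previous call */
    double old_volume;        /* level held since the previous call */
    double pos_in_sample;     /* filled fraction of the current output sample */
    double accum;             /* volume gathered so far for that sample */
    float *wave;              /* output buffer */
    size_t capacity;          /* size of wave[] in samples */
    size_t sample_pos;        /* next index to write */
} Channel;

static void emit(Channel *c, double value)
{
    assert(c->sample_pos < c->capacity);
    c->wave[c->sample_pos++] = (float)value;
}

/* Call whenever the channel's output level changes to new_volume. */
void run_channel(Channel *c, long new_cycle, double new_volume)
{
    double samples = (double)(new_cycle - c->last_cycle) / c->cycles_per_sample;
    c->last_cycle = new_cycle;

    if (c->pos_in_sample > 0.0) {             /* inside a fractional sample? */
        if (c->pos_in_sample + samples < 1.0) {
            /* We stay inside it: gather and return. */
            c->accum += samples * c->old_volume;
            c->pos_in_sample += samples;
            c->old_volume = new_volume;
            return;
        }
        /* We leave it: finish the sample and emit it. */
        double rest = 1.0 - c->pos_in_sample;
        c->accum += rest * c->old_volume;
        samples -= rest;
        if (samples < 0.0) samples = 0.0;     /* guard against rounding */
        emit(c, c->accum);
    }

    size_t whole = (size_t)samples;           /* run whole samples */
    for (size_t i = 0; i < whole; i++)
        emit(c, c->old_volume);
    samples -= (double)whole;

    /* Enter a new partial sample (or land exactly on a sample boundary). */
    c->accum = samples * c->old_volume;
    c->pos_in_sample = samples;
    c->old_volume = new_volume;
}
```

With 4 CPU cycles per output sample, a level of 1.0 held for 6 cycles followed by 0.0 for 2 cycles yields one full sample of 1.0 and one averaged sample of 0.5.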

Re: Fast apu processing
by blargg on 2012-12-07 (#104238)

Quote:

I see Blargg also posted his code. He also used band-limited sound (better quality, but slower) as an option instead of just linear interpolation.

It can also do linear interpolation, and in my tests, it's faster than even the style of approach you outline above (as well as not requiring any arcane sample calculation so that the APU code is very very clean). If you'd be willing, we can run some benchmarks on whatever portable platform your example is for. Essentially,

Code:

void run_channel( int new_cpu_cycle, int new_volume ) // your code
{
    int delta = new_volume - old_volume;
    old_volume = new_volume;
    blip_add_delta_fast( blip, new_cpu_cycle, delta ); // our waveform changed by delta units at new_cpu_cycle, simple
}

void blip_add_delta_fast( blip_t*, int time, int delta ) // library does this
{
    int f = time * factor + offset;
    short* b = buffer + (f >> 15);
    int d2 = (delta * (f & 0x7FFF)) >> 15;
    b [0] += delta - d2;
    b [1] += d2;
}

I've surveyed the alternatives and just can't see much reason to use something that complicates the emulator code and is slower as well.
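
To make the fragment above easier to experiment with, here is a self-contained sketch of the same idea: each amplitude change is split between two neighbouring slots of a delta buffer according to the fractional sample position. DeltaBuf, deltabuf_init and deltabuf_add are hypothetical names for this illustration, not the blip_buf API, and the 15 fractional bits simply follow the shifts in the fragment. It divides instead of shifting so that negative deltas behave the same on every compiler, and it assumes |delta| stays below 65536 so the multiply cannot overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FRAC_BITS = 15, BUF_SAMPLES = 1024 };

/* Hypothetical stand-in for a delta buffer; not the blip_buf API. */
typedef struct {
    uint32_t factor;                   /* samples per clock, scaled by 2^15 */
    int32_t  deltas[BUF_SAMPLES + 1];  /* +1: each delta touches two slots */
} DeltaBuf;

void deltabuf_init(DeltaBuf *b, uint32_t clock_rate, uint32_t sample_rate)
{
    memset(b, 0, sizeof *b);
    /* Fixed-point ratio sample_rate / clock_rate, rounded to nearest. */
    b->factor = (uint32_t)((((uint64_t)sample_rate << FRAC_BITS)
                            + clock_rate / 2) / clock_rate);
}

/* The waveform changed by delta units at clock `time`. */
void deltabuf_add(DeltaBuf *b, uint32_t time, int32_t delta)
{
    uint32_t f    = time * b->factor;      /* sample position, 15 fraction bits */
    uint32_t idx  = f >> FRAC_BITS;        /* whole sample index */
    int32_t  frac = (int32_t)(f & ((1u << FRAC_BITS) - 1));
    int32_t  d2   = delta * frac / (1 << FRAC_BITS);

    assert(idx < BUF_SAMPLES);
    b->deltas[idx]     += delta - d2;      /* share landing in this sample */
    b->deltas[idx + 1] += d2;              /* share spilling into the next */
}
```

Note that 15 fractional bits are coarse for real rate ratios (44100 / 1789773 scales to about 807.4 and is stored as 807), so a production version would likely want more precision in factor.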

Re: Fast apu processing
by tepples on 2012-12-07 (#104243)

Droidsound on my Archos 43 Internet Tablet (4.3" PDA running Android 2.2) and on my ASUS Nexus 7 (7" tablet running Android 4.2) uses blargg's GMEPlugin and runs at full speed.

Re: Fast apu processing
by Dwedit on 2012-12-07 (#104244)

I see the difference, your code puts in deltas, and my code calculates the final values of the samples (hence the for loop that fills in samples).
So the buffer needs to start as zeroes, and it gets converted into sample values before it's played back. Sample buffers for a frame's worth of time are tiny (800 samples for 1/60 s at 48000 Hz), so converting deltas to samples is really fast.
Your code is probably a lot better here.

All the division in my code is really just multiplication in disguise, so that part of the code is no different. But taking the fractional and whole parts of the fixed point number and directly using that as array indexes and values looks really spiffy.
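
The delta-to-sample conversion mentioned above is just a running sum. A minimal sketch, assuming 32-bit deltas and 16-bit output (the function name and the choice to clear the buffer in the same pass are illustrative, not taken from any library):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turn a buffer of deltas into output samples by accumulating them.
   `level` is the output level carried over from the previous frame;
   the new level is returned so the next frame can continue from it. */
int32_t deltas_to_samples(int32_t *deltas, int16_t *out, size_t count,
                          int32_t level)
{
    assert(deltas != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        level += deltas[i];
        deltas[i] = 0;                 /* buffer must be zero for the next frame */

        int32_t s = level;             /* clamp to the 16-bit output range */
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        out[i] = (int16_t)s;
    }
    return level;
}
```

At 48000 Hz and 60 frames per second this loop runs 800 times per frame, which is why its cost is negligible.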

Re: Fast apu processing
by blargg on 2012-12-07 (#104246)

Yes, exactly. The algorithm kind of turns everything inside-out, so that the waveform synthesis only has to touch the buffer at points where it changes. And since it's deltas, it doesn't matter what order you do them, or whether they "overlap" (two deltas one clock apart). So you aren't keeping track of which sample you last ran to, and having to fill intermediate samples, or do special things when deltas occur really close together. That means far fewer headaches and special cases where bugs can creep in. And on platforms where there's plenty of CPU, user code can be converted to use full band-limited synthesis by just calling blip_add_delta() (rather than the fast variant).

Re: Fast apu processing
by Drag on 2012-12-09 (#104352)

Doesn't any kind of interpolation introduce some latency into the audio output?

For instance, to add a band-limited step into the output, you need to copy more than just one sample to the output; you need to copy a couple of samples before the step, then the step itself, and then a couple of samples after the step. With linear interpolation, you just need to copy two samples.

The latency would probably be too small to be noticeable, unless you're running at a super-low sample rate. Plus, the size of the buffer itself would probably introduce the most significant latency.

This was an issue that was always interesting to me; is there any way to have zero-latency audio output? (Or close to it)

Re: Fast apu processing
by rainwarrior on 2012-12-09 (#104356)

Interpolation requires you know the points to interpolate between before you can interpolate, yes.

Synthesis does not necessarily require this. I haven't tried blargg's library, so I don't know what requirements it has in terms of latency, but there are lots of audio processes that don't inherently require latency (IIR filters, for example).

However, on a PC, almost all audio is buffered anyway, so this shouldn't be too much of a concern. You can use ASIO to get very low latency buffered audio, if needed. I don't know of any zero-latency audio interfaces, but they could be built, in theory (though it is not good for multitasking, obviously).

Re: Fast apu processing
by blargg on 2012-12-09 (#104357)

Drag, a minimum-phase FIR will give a similar impulse response as an IIR, giving only a few samples latency, even if the FIR is of a high order (64 or more). But this would only matter if you continuously outputted samples to the hardware as they were ready, and the hardware had a really small buffer. This would be really demanding since you would never be able to do much work between calculating samples. I doubt anything but a microcontroller these days even allows unbuffered audio.

Re: Fast apu processing
by miker00lz on 2012-12-15 (#104741)

As was mentioned already, you absolutely do not and should not need to process the APU stuff at 1.79 MHz. What I do is just determine and keep track of how often each channel's sample gets updated (in terms of CPU ticks), and after every CPU instruction emulated, I check if it's time to do so yet. Likewise, I determine how many CPU ticks pass per output sample by taking the number of CPU ticks per second and dividing by the sample rate. Whenever it's time to create a new sample and put it in the output buffer, all I have to do is mix the sample value from all channels into one sample. That's all.
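
One way to sketch the "is it time for an output sample yet?" check described above, without accumulating rounding drift, is an integer accumulator that carries the remainder between calls. The SampleClock type and function name are hypothetical; the rates in the comments are the usual NTSC CPU clock and a 44100 Hz output, used here only as example values.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t cpu_rate;     /* CPU ticks per second, e.g. 1789773 (NTSC) */
    uint32_t sample_rate;  /* output samples per second, e.g. 44100 */
    uint32_t acc;          /* remainder carried between calls, < cpu_rate */
} SampleClock;

/* Advance by the ticks the last instruction took.
   Returns how many output samples have become due. */
uint32_t sample_clock_advance(SampleClock *s, uint32_t ticks)
{
    assert(s->cpu_rate > 0);
    uint64_t total = (uint64_t)s->acc + (uint64_t)ticks * s->sample_rate;
    s->acc = (uint32_t)(total % s->cpu_rate);
    return (uint32_t)(total / s->cpu_rate);
}
```

The emulator would call this after each instruction and mix one sample from the channels' current levels for every sample reported as due; over one second of CPU ticks it reports exactly sample_rate samples.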

The APU emulation takes nowhere near the CPU power of PPU emulation. In fact, if you did only the APU stuff and ignored all graphics stuff, it would probably be fast enough to generate the audio full speed on a 386, or possibly even a high-end 286!

Re: Fast apu processing
by blargg on 2012-12-15 (#104762)

The APU can generate 1.79 MHz-accurate audio without any speed hit. In fact, it's more cumbersome to do it lower. There's no need to muddy the emulation code with consideration of output samples.

Re: Fast apu processing
by Dwedit on 2012-12-15 (#104765)

But you don't need to explicitly run some loop of code 1.78 million times per second, I think that's what's being said here.

Re: Fast apu processing
by miker00lz on 2012-12-15 (#104767)

Dwedit wrote:

But you don't need to explicitly run some loop of code 1.78 million times per second, I think that's what's being said here.

Yeah, exactly. The results are the same. If I did run a loop at 1.78 million times per second, almost all of that CPU time would be wasted. Might as well just calculate how many CPU ticks occur between one tick of an APU channel and the next, and only update the channel's sample that often. The timing resolution is still maintained, because after every CPU instruction it checks whether an update is due and, if so, performs it.

Re: Fast apu processing
by blargg on 2012-12-15 (#104770)

Turn it inside-out. Have your CPU keep track of the current time. When the APU is about to be written to, tell it to catch up to the current time, then do the write. The APU keeps track of the time of the next amplitude change, and just runs a loop skipping through time adding the appropriate deltas:

Code:

void apu_catchup_square( int end )
{
    while ( time < end )
    {
        int new_amp = ...

        int delta = new_amp - amp;
        add_delta( time, delta );

        amp = new_amp;
        time = time + period;
    }
}

Processor usage is based on the frequency of the square wave, rather than the clock rate.

People have claimed that it's inherently slow to render sound at 1.79 MHz as opposed to lower rates. Is there interest in submitting these algorithms to a performance test that can be run on various platforms to find out what differences there really are?
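
A self-contained sketch of the catch-up pattern above, for experimentation. Everything here is an assumption made for illustration: the Square struct, the plain 50% duty toggle between 0 and volume, and the event list standing in for a real delta buffer. The point it tries to show is that a register write first catches the channel up to the current time and only then changes its state.

```c
#include <assert.h>

enum { MAX_EVENTS = 64 };

typedef struct { int time; int delta; } Event;  /* stand-in for a delta buffer */

typedef struct {
    int   time;       /* time of the next amplitude change */
    int   period;     /* clocks between changes (half a wave cycle here) */
    int   volume;     /* current register value */
    int   phase;      /* 1 = high half, 0 = low half */
    int   amp;        /* last amplitude sent to the buffer */
    Event events[MAX_EVENTS];
    int   event_count;
} Square;

static void add_delta(Square *s, int time, int delta)
{
    assert(s->event_count < MAX_EVENTS);
    s->events[s->event_count].time  = time;
    s->events[s->event_count].delta = delta;
    s->event_count++;
}

/* Run the channel forward to (but not including) clock `end`. */
void square_catchup(Square *s, int end)
{
    while (s->time < end) {
        s->phase ^= 1;
        int new_amp = s->phase ? s->volume : 0;  /* assumed 50% duty */
        int delta   = new_amp - s->amp;
        if (delta != 0)
            add_delta(s, s->time, delta);
        s->amp   = new_amp;
        s->time += s->period;
    }
}

/* A write from the CPU: catch up to the present first, then apply it. */
void square_write_volume(Square *s, int now, int volume)
{
    square_catchup(s, now);
    s->volume = volume;
}
```

The loop body runs once per amplitude change, so a low-pitched square wave costs almost nothing regardless of the 1.79 MHz clock.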

Re: Fast apu processing
by Zelex on 2014-01-20 (#124251)

blargg: would that approach work with the DPCM channel? The data that it's playing from memory could have changed, so you can't really make it "catch up" on register change. Or do games generally only play SFX from ROM? In which case you could also just do a catch-up on bank swap, I suppose.

Re: Fast apu processing
by lidnariq on 2014-01-20 (#124252)

Zelex wrote:

DPCM [...] Or do games generally only play SFX from ROM?

Only the MMC5 supports mapping RAM to the area from where the DPCM fetches data. You probably don't need to worry about that.

Re: Fast apu processing
by blargg on 2014-01-21 (#124253)

Even then you only need to catch the DPCM up to the present before bank switches. Also, you pretty much have to do part of that for all memory accesses, to account for DPCM wait-states. It might be useful to treat the DPCM sample byte fetch as an interrupt-like hardware event. That way if your emulator uses an optimized scheme of keeping track of the next earliest interrupt/interrupt-like event that can happen, it only needs to do one timestamp comparison per cycle, rather than N comparisons where N is the total number of possible interrupt-like things that can happen (e.g. NMI, DPCM).
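
The "one timestamp comparison" scheme can be sketched as a tiny scheduler that caches the earliest pending event time. The event ids, struct, and function names below are hypothetical and only illustrate the idea; a real emulator would have its own event set.

```c
#include <assert.h>
#include <limits.h>

enum { EV_NMI, EV_DMC_FETCH, EV_FRAME_IRQ, EV_COUNT };
#define EV_NEVER LONG_MAX   /* marks an event that is not scheduled */

typedef struct {
    long when[EV_COUNT];    /* timestamp of each event source */
    long next;              /* cached minimum of when[] */
} Scheduler;

static void sched_refresh(Scheduler *s)
{
    s->next = EV_NEVER;
    for (int i = 0; i < EV_COUNT; i++)
        if (s->when[i] < s->next)
            s->next = s->when[i];
}

void sched_init(Scheduler *s)
{
    for (int i = 0; i < EV_COUNT; i++)
        s->when[i] = EV_NEVER;
    s->next = EV_NEVER;
}

/* Schedule (or reschedule) one event source. */
void sched_set(Scheduler *s, int ev, long when)
{
    assert(ev >= 0 && ev < EV_COUNT);
    s->when[ev] = when;
    sched_refresh(s);
}

/* Hot path: a single comparison, however many sources exist. */
int sched_due(const Scheduler *s, long now)
{
    return now >= s->next;
}

/* Slow path: return the earliest due event and unschedule it, or -1. */
int sched_pop(Scheduler *s, long now)
{
    if (now < s->next)
        return -1;
    for (int i = 0; i < EV_COUNT; i++) {
        if (s->when[i] == s->next) {
            s->when[i] = EV_NEVER;
            sched_refresh(s);
            return i;
        }
    }
    return -1;
}
```

The N-way scan only happens when an event is scheduled or fires, which is rare compared with the per-cycle check.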

Re: Fast apu processing
by Zelex on 2014-01-21 (#124296)

hmm, there's also accurate IRQ timing to think about as well. Which games use the APU IRQ interrupt?

Re: Fast apu processing
by rainwarrior on 2014-01-21 (#124297)

lidnariq wrote:

Only the MMC5 supports mapping RAM to the area from where the DPCM fetches data. You probably don't need to worry about that.

FDS has RAM in the DPCM area as well, though I haven't seen anyone use this to modify samples after load.

Re: Fast apu processing
by ReaperSMS on 2014-01-21 (#124301)

For the DMC, you'd trigger a catchup whenever the DMC ticked over to a new byte. Since there can't be fewer than 432 6502 cycles between byte fetches, that probably won't have serious performance implications.
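
The 432-cycle figure follows from the fastest DMC rate: 54 CPU cycles per output bit times 8 bits per sample byte. A small sketch of that arithmetic, using the NTSC rate table as documented on the NESdev wiki (the table is not quoted in this thread, so treat it as an outside reference; the function name is illustrative):

```c
#include <assert.h>

/* NTSC DMC periods in CPU cycles per output bit, indexed by the
   4-bit rate value written to $4010. */
static const int dmc_period_ntsc[16] = {
    428, 380, 340, 320, 286, 254, 226, 214,
    190, 160, 142, 128, 106,  84,  72,  54
};

/* CPU cycles between consecutive sample-byte fetches at a given rate. */
int dmc_cycles_per_byte(int rate_index)
{
    assert(rate_index >= 0 && rate_index < 16);
    return dmc_period_ntsc[rate_index] * 8;   /* 8 bits per byte */
}
```

At the fastest rate (index 15) this gives 432 cycles; at the slowest (index 0) it gives 3424.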

If it did, you could also possibly get by with just doing the DMC memory access at that time, and stuff those bytes into a queue to be caught up when the rest of it needs it.

Re: Fast apu processing
by tepples on 2014-01-22 (#124317)

Zelex wrote:

Which games use the APU IRQ interrupt?

I can think of at least Fire Hawk, MiG, Time Lord, and a couple demos that I made.

Re: Fast apu processing
by James on 2014-01-22 (#124319)

One other thing to keep in mind: the highest frequency the APU will generate is ~447kHz. Therefore, instead of outputting a sample every cycle, you can output every other cycle (i.e., sample rate = 1.79MHz/2 = ~895kHz) and not exceed the Nyquist limit. This will reduce CPU requirements for mixing and filtering.

I haven't tried this yet, but you should be able to reduce this to every third cycle. There will be aliasing, but it won't affect frequencies below ~150kHz.
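
The numbers above can be checked with a little integer arithmetic. This sketch assumes the ~447 kHz figure is the NTSC CPU clock divided by 4, which is a guess at where it comes from, and the function names are made up for illustration. Content between half the decimated rate and the highest APU frequency folds back to (rate - frequency), so the lowest alias sits at rate minus the highest frequency.

```c
#include <assert.h>

enum { CPU_HZ = 1789773 };   /* NTSC CPU clock in Hz */

/* Sample rate when only every nth CPU cycle is output. */
long decimated_rate_hz(int n)
{
    assert(n > 0);
    return CPU_HZ / n;
}

/* Highest frequency the APU generates, assumed to be CPU clock / 4. */
long highest_apu_hz(void)
{
    return CPU_HZ / 4;
}

/* Lowest frequency that aliased content can land on when keeping every
   nth cycle. Returns -1 if nothing exceeds the Nyquist limit, and 0 if
   aliases can reach all the way down to DC. */
long lowest_alias_hz(int n)
{
    long fs   = decimated_rate_hz(n);
    long fmax = highest_apu_hz();
    if (2 * fmax <= fs)
        return -1;           /* within Nyquist: no aliasing */
    if (fmax >= fs)
        return 0;            /* folds down to DC */
    return fs - fmax;        /* mirror image of the highest frequency */
}
```

For every second cycle the rate is about 895 kHz and nothing aliases; for every third cycle the rate is about 597 kHz and the lowest alias lands near 149 kHz, consistent with the ~150 kHz estimate.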

Re: Fast apu processing
by James on 2014-01-22 (#124320)

tepples wrote:

Zelex wrote:

Which games use the APU IRQ interrupt?

I can think of at least Fire Hawk, MiG, Time Lord, and a couple demos that I made.

Just to clarify, those games rely on DMC IRQs. If Zelex is referring to frame IRQs, Dragon Quest relies on them (will hang after battle otherwise).

Re: Fast apu processing
by Zepper on 2014-01-22 (#124322)

James wrote:

Yes. I did it in my emu.