APU mixer function efficiency

APU mixer function efficiency
by zeroone on 2016-10-11 (#180702)

The wiki contains an approximation that uses 2 lookup tables. However, since pulse, triangle and noise are 4-bit channels and DMC is a 7-bit channel, doesn't that mean a single lookup table with 2^23 = 8,388,608 entries could provide exact values? Assuming that 16-bit values are required, the lookup table only needs to be 16 MB. Are there any cache considerations for randomly reading from an array of that size?

Re: APU mixer function efficiency
by lidnariq on 2016-10-11 (#180704)

... Yes?
Enormous ones?

The largest in-die L3 cache I've heard of is on the order of 8 MiB, and you're talking about using more than twice that just for fixing APU non-linearity?

... That said, where did 2²³ come from? 4+4+7 = 15...

Honestly, even using 32 KiW isn't worthwhile. The lookup table approach is only useful on architectures where floating point or division on integers is really expensive, i.e. "before the pentium".

Re: APU mixer function efficiency
by rainwarrior on 2016-10-11 (#180705)

The lookup tables are only one way to implement the given non-linearity formulae, which themselves are only an approximation of the actual output (which still needs more research, I think).

I don't think the wiki really needs to prescribe how to implement the non-linearity; the formula is the important part. The lookup table is just an implementation detail. Whether or not it's worth doing depends entirely on the context of implementation.

...and yes, those tables are much, much smaller than what you're saying.

Re: APU mixer function efficiency
by zeroone on 2016-10-11 (#180706)

lidnariq wrote:

Honestly, even using 32 KiW isn't worthwhile. The lookup table approach is only useful on architectures where floating point or division on integers is really expensive, i.e. "before the pentium".

Let's put that claim to the test. Below is a Java application that compares 32-bit floating-point operations to lookup tables:

Code:

public final class BenchmarkApuMixers {
  
  private static final float[] pulseTable = new float[31];
  private static final float[] tndTable = new float[203];
  
  static {
    for(int i = pulseTable.length - 1; i >= 0; i--) {
      pulseTable[i] = 95.52f / (8128f / i + 100f);
    }
    for(int i = tndTable.length - 1; i >= 0; i--) {
      tndTable[i] = 163.67f / (24329f / i + 100f);
    }
  }

  public static float noTableTest(final float pulse1, final float pulse2, 
      final float triangle, final float noise, final float dmc) {
    return 95.88f / (8128f / (pulse1 + pulse2) + 100f) + 159.79f 
        / ((1f / (triangle / 8227f + noise / 12241f + dmc / 22638f)) + 100f);
  }
  
  public static float lookupTableTest(final int pulse1, final int pulse2, 
      final int triangle, final int noise, final int dmc) {
    return pulseTable[pulse1 + pulse2] + tndTable[3 * triangle + (noise << 1) 
        + dmc];
  }  

  public static void main(final String... args) throws Throwable {
    float result = 0;
    int x = 0;
    int y = 0;
    long startTime = System.nanoTime();    
    for(int i = 200_000_000; i >= 0; i--) {
      if (++x == 16) {
        x = 0;
      }
      if (++y == 128) {
        y = 0;
      }
      result += noTableTest(x, x, x, x, y);
    }    
    System.out.format("%f nanos/iteration%n", (System.nanoTime() - startTime) 
        / 200_000_000.0);
    result = 0;
    x = 0;
    y = 0;
    startTime = System.nanoTime();
    for(int i = 200_000_000; i >= 0; i--) {
      if (++x == 16) {
        x = 0;
      }
      if (++y == 128) {
        y = 0;
      }
      result += lookupTableTest(x, x, x, x, y);
    }
    System.out.format("%f nanos/iteration%n", (System.nanoTime() - startTime) 
        / 200_000_000.0);
    System.out.println(result);
  }
}

The results:

floating-point: 30.239289 nanos/iteration

lookup tables: 2.060711 nanos/iteration

That can't be right. The iterations are way too brief. This is only a 3 GHz machine.

Any suggestions for better ways to benchmark this?

Re: APU mixer function efficiency
by zeroone on 2016-10-11 (#180708)

I tried both ways in my emulator and I can't visually see a difference in the task manager. This is a i5-2400 @ 3.1 GHz 64-bit Win7 box from mid-2012. My emulator is not particularly efficient. One of the 4 cores that appear in the task manager bounces between 30% and 40% load. Overall, it uses up about 15% of the available CPU. Maybe that's fine considering today's desktops.

To my ears, there is no audible difference. But, my hearing is not that great

Re: APU mixer function efficiency
by lidnariq on 2016-10-11 (#180709)

A 203-entry LUT is short enough to fit into a few lines of L1 cache, so that might be the difference you're seeing.

A tiny LUT is definitely a very different story than a multi-megabyte one.

(FWIW, adapting your core loop to C and compiling it with gcc -O3 -ffast-math produces 15ns/loop with floats, 1.9ns/loop with LUTs, and 14ns/loop (as well as wrong results) with integers)

Re: APU mixer function efficiency
by zeroone on 2016-10-11 (#180710)

lidnariq wrote:

(FWIW, adapting your core loop to C and compiling it with gcc -O3 -ffast-math produces 15ns/loop with floats, 1.9ns/loop with LUTs, and 14ns/loop (as well as wrong results) with integers)

On what box?

Re: APU mixer function efficiency
by lidnariq on 2016-10-11 (#180711)

Haswell i3-4330. (nominally 3.5GHz peak, and 64B/line of cache)

Re: APU mixer function efficiency
by Dwedit on 2016-10-11 (#180712)

Huge lookup tables = tons of cache misses = slow

Re: APU mixer function efficiency
by zeroone on 2016-10-12 (#180717)

The reason that I am reviewing the mixer code is that I want to provide sound options that enable the user to adjust the mix via sliders similar to many other emulators. Using floats, the code is straightforward and clean. Using integer math, things become messy. For ints only, a pre-scaled LUT could be used for each channel. Or, I could multiply by a constant and right-shift. But, it's not clear if either is really faster than a multiplication by a 32-bit float. Or, in my case, if it is even worth it; after mixing, my emulator applies filtering and decimation, both of which use a ton of 64-bit float multiplications, additions and table look-ups. In the fancy mixer formula, it's really the divisions that make me cringe. We're all taught to stay away from them for speed reasons, but everyone owns a personal super computer these days.

Re: APU mixer function efficiency
by tepples on 2016-10-12 (#180719)

zeroone wrote:

In the fancy mixer formula, it's really the divisions that make me cringe. We're all taught to stay away from them for speed reasons, but everyone owns a personal super computer these days.

Perhaps it's a matter of whether you want to target PCs or smartphones. On the one hand, people tend to want to play simpler games on the go rather than at home, and mobile devices with ARM CPUs have tended to perform far worse than desktop PCs at floating-point division in the past. On the other hand, the mobile devices that people are carrying nowadays have a touch screen and tilting the whole device as their only input methods, and these don't map quite so well to the twitchy actions that NES games tend to require on the Control Pad and A and B Buttons. Clip-on gamepads for phones exist, but I've never seen one in use in person.

Re: APU mixer function efficiency
by lidnariq on 2016-10-12 (#180729)

I'm not certain exactly how the reciprocal-of-reciprocals model came about, but given how MOSFETs work it really "ought" to involve a square root instead. (wikipedia)

Re: APU mixer function efficiency
by zeroone on 2016-10-12 (#180734)

The LUT approximation deviates from the float version by more than 10% when DMC is not being used. For instance, plug-in the max values for pulse1, pulse2, triangle, and noise, and set dmc to 0. Maybe there is an audible difference.

Re: APU mixer function efficiency
by rainwarrior on 2016-10-12 (#180736)

zeroone wrote:

Yeah, it's unclear why the lookup table implementation on the wiki is using different numbers than blargg's formula, maybe mostly because it's using 3 magic values instead of summing triangle + noise + dmc. (Is the actual mix an analog combination of 3 DACs or a digital sum output to 1 DAC?)

I hinted at this earlier, but you should know that this formula is an old approximation that blargg made up years ago to fit the data available at that time. It's not really that accurate to begin with, but it's better than the linear approximation that was used before it. Between Visual2A03, and the wealth of knowledge and available expertise this community has accumulated since then I'm sure someone could do a lot better if they were to take a crack at it now. (I intend to try once I'm finished my game project, if nobody else gets to it before then.)

As for whether you can hear a difference, probably you won't. The analog mix between the two APU outputs is highly variable from system to system anyway; every release of NSFPlay I get complaints on both sides of "this channel is too loud, it's quieter on my NES" and the other way around. Every NES is a little different in that way, and it's a much stronger difference than the kind of thing you'll notice between these approximations of the same formula.

Re: APU mixer function efficiency
by zeroone on 2016-10-12 (#180737)

According to the wiki, "When the values for one of the groups are all zero, the result for that group should be treated as zero rather than undefined due to the division by 0 that otherwise results."

This is not the case. In the limit, as the pulse channels approach 0, pulse_out smoothly approaches 0. Similarly, tnd_out smoothly approaches a constant. Consequentially, the checks are not necessary if the formulas are reduced to the expression below:

Code:

  public static float mix(final int pulse1, final int pulse2, 
      final int triangle, final int noise, final int dmc) {    

    return (0.9588f * (pulse1 + pulse2)) / (pulse1 + pulse2 + 81.28f) 
        - 361.733f / (2.75167f * triangle + 1.84936f * noise + dmc + 226.38f) 
            + 1.5979f;
  }

Also, with fewer divisions, it should run a lot faster.

Re: APU mixer function efficiency
by lidnariq on 2016-10-12 (#180741)

rainwarrior wrote:

(Is the actual mix an analog combination of 3 DACs or a digital sum output to 1 DAC?)

Analog, for both output pins.

Effectively, the current out of a pin should be

I_D = K(V_G-V_S-Vth)²(1+λ(V_D-V_S-Vdssat))
and the voltage of the pin should be
V_S = 100Ω·I_D.

K is the sum of the digital values for each of the channels, each of which have their own linear scaling factors. (Because each of the MOSFETs making up the DACs in the 2A03 have different widths and lengths)

V_G and V_D are both 5V. Vth, λ, and Vdssat are very approximately 1V, 0, and irrelevant.

Solving for V_S—which is to say V_OUT—yields (1+800K-sqrt(1600K+1))÷200K ... but obviously actually sitting down and more exactly determining the other parameters would produce a better result.