SNES APU sample data
by funnyguy on 2015-08-11 (#152907)

I have been working on an emulator, and currently the DSP only produces a popping sound,
even though the SPC700 can key on the voice channels (any of the 8).

I played with the open-source Tic-Tac-Toe program and the sound is messy.
The voice channels do key on, but the sound is only "Thud.. Thud.. Thud"
instead of music.

Can anyone share some SPC700 and DSP register data?

It would be excellent if someone had bit rate reduction (BRR) data together with per-cycle snapshots of the envelope, the interpolation register, and the Gaussian ring register contents.

I think my problem lies in a misunderstanding of the BRR registers and the Gaussian interpolation.
It would be helpful to see a real example of how the data gets converted and used.

Thanks

Re: SNES APU sample data
by funnyguy on 2015-08-11 (#152913)

I should add that the volume, key on and key off seem OK, because the "Thud Thud Thud" sound has approximately the same loudness and start/stop timing as other emulators playing the music of the open-source Tic-Tac-Toe game.

So I think the SPC700 and the interface to the 65816 CPU are OK, and the bug most likely lies in the interpolator, which generates the tone frequency.

So if anyone knows of an SNES emulator that can print snapshots of those registers, that would be of great help too.

Re: SNES APU sample data
by funnyguy on 2015-08-12 (#153019)

Let me highlight some discrepancies I found between various documents.

The Gaussian interpolator

In Anomie's SPC_DSP document,

Quote:

// 4-point gaussian interpolation
i = voice[x].interpolation_index >> 12; // 0 <= i <= 4
d = (voice[x].interpolation_index >> 4) & 0xff; // 0 <= d <= 255
outx = ((gauss[255-d] * voice[x].BRRdata[i+0]) >> 11);
outx += ((gauss[511-d] * voice[x].BRRdata[i+1]) >> 11);
outx += ((gauss[256+d] * voice[x].BRRdata[i+2]) >> 11);
// The above 3 wrap at 15 bits signed. The last is added to that, and is
// clamped rather than wrapped.
outx = ((outx & 0x7FFF) ^ 0x4000) - 0x4000;
outx += ((gauss[ 0+d] * voice[x].BRRdata[i+3]) >> 11);
CLAMP15(outx);
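The wrap expression in the quoted code can be read as "keep the low 15 bits, then sign-extend from bit 14". The sketch below contrasts that with clamping; the helper names wrap15 and clamp15 are made up for illustration, and the clamp range is an assumption about what CLAMP15 means rather than something taken from Anomie's document.

```c
#include <stdint.h>

/* Wrap x into the 15-bit signed range -0x4000..0x3FFF: keep the low
   15 bits, then sign-extend from bit 14. Same expression as the quote. */
static int32_t wrap15(int32_t x) {
    return ((x & 0x7FFF) ^ 0x4000) - 0x4000;
}

/* Clamp (saturate) x to the same range. Assumed meaning of CLAMP15. */
static int32_t clamp15(int32_t x) {
    if (x > 0x3FFF) return 0x3FFF;
    if (x < -0x4000) return -0x4000;
    return x;
}
```

A value one step past the positive limit shows the difference: wrap15(0x4000) jumps to -0x4000, while clamp15(0x4000) stays at 0x3FFF.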

In Higan's implementation,

Quote:

int DSP::gaussian_interpolate(const voice_t& v) {
  //make pointers into gaussian table based on fractional position between samples
  int offset = (v.interp_pos >> 4) & 0xff;
  const int16* fwd = gaussian_table + 255 - offset;
  const int16* rev = gaussian_table + offset; //mirror left half of gaussian table

  offset = v.buf_pos + (v.interp_pos >> 12);
  int output;
  output = (fwd[ 0] * v.buffer[offset + 0]) >> 11;
  output += (fwd[256] * v.buffer[offset + 1]) >> 11;
  output += (rev[256] * v.buffer[offset + 2]) >> 11;
  output = (int16)output;
  output += (rev[ 0] * v.buffer[offset + 3]) >> 11;
  return sclamp<16>(output) & ~1;
}

Furthermore, another document, No$SNS by Martin, says:

Quote:

4-Point Gaussian Interpolation
Interpolation is applied on the 4 most recent 15bit BRR samples (new,old,older,oldest), using bit4-11 of the pitch counter as interpolation index (i=00h..FFh):
out = ((gauss[0FFh-i] * oldest) SAR 10) ;-initial 16bit value
out = out + ((gauss[1FFh-i] * older) SAR 10) ;-no 16bit overflow handling
out = out + ((gauss[100h+i] * old) SAR 10) ;-no 16bit overflow handling
out = out + ((gauss[000h+i] * new) SAR 10) ;-with 16bit overflow handling
out = out SAR 1 ;-convert 16bit result to 15bit

Note the difference in the offset into v.buffer (the BRR sample buffer):
Anomie's code does not add the voice pointer v.buf_pos that higan's does.

Furthermore, since v.buf_pos is supposed to point at data 12 entries old (which is about to be overwritten with new BRR data),
why is the Gaussian interpolator reading data that far back?

Going by No$SNS, the oldest-to-new data should be BRRData[v.buf_pos - 4], BRRData[v.buf_pos - 3], BRRData[v.buf_pos - 2], BRRData[v.buf_pos - 1].
Also note that No$SNS uses SAR 10 where the other two designs shift by 11 (but it applies one more SAR 1 at the final stage).
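One possible reading of the shift difference, as a hedged sketch: if the sample buffer holds doubled (16-bit) samples, shifting each product by 11 gives the same terms as shifting by 10 on undoubled 15-bit samples, and the final "& ~1" then equals twice the final "SAR 1" result. Shifting by 11 on undoubled samples truncates earlier and can differ. The function names are hypothetical, overflow handling is left out, and which sample scaling each document assumes is my assumption, not something the quotes state.

```c
#include <stdint.h>

/* Arithmetic shift right without relying on >> of negative values. */
static int32_t sar(int32_t x, int n) { return x < 0 ? ~(~x >> n) : x >> n; }

/* Products of doubled samples shifted by 11, LSB cleared at the end. */
static int32_t interp_doubled_sar11(const int32_t g[4], const int32_t s15[4]) {
    int32_t out = 0;
    for (int k = 0; k < 4; k++) out += sar(g[k] * (2 * s15[k]), 11);
    return out & ~1;
}

/* Products of 15-bit samples shifted by 10, sum shifted by 1 at the end. */
static int32_t interp_sar10_then_sar1(const int32_t g[4], const int32_t s15[4]) {
    int32_t out = 0;
    for (int k = 0; k < 4; k++) out += sar(g[k] * s15[k], 10);
    return sar(out, 1);
}

/* Products of 15-bit samples shifted by 11, nothing else. */
static int32_t interp_plain_sar11(const int32_t g[4], const int32_t s15[4]) {
    int32_t out = 0;
    for (int k = 0; k < 4; k++) out += sar(g[k] * s15[k], 11);
    return out;
}
```

With two products of 1536 each (g = 3, sample = 512), the first form gives 2, the second gives 1 (the same value at half scale), and the third gives 0, so the last variant is not bit-identical to the other two.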

Can some expert explain this to me?

Furthermore, if only the 4 most recent samples are needed, why do we need to store up to 12 samples? (In this respect higan seems more correct, since it may use up to 12 samples.) By the way, I tried all 3 implementations and they all give me messy sound output.

Re: SNES APU sample data
by jwdonal on 2015-08-13 (#153184)

The absolute best way to start understanding the apu_dsp.txt document is by reading it along with Blargg's SNES SPC emulator (http://blargg.parodius.com/libs/snes_spc-0.9.0.zip). Nocash's document is also an excellent resource to go along with apu_dsp.txt. Nocash's document is sometimes written in more "plain english" than apu_dsp.txt so it helps a lot. But I still had to use both to implement my SPC emulator.

The only way I finally began to understand how the BRR sample buffer (or really most of the stuff in the DSP) worked is by inserting real-time debug statements into Blargg's emulator and comparing that with the description provided in apu_dsp.txt.

The BRR sample buffer is a ring buffer that holds 3 groups of 4 samples each (so it always holds 12 samples at a time). As the interpolation position increases (i.e. the point at which it exceeds 0x4000) you turn the ring buffer and decode the next group of 4 BRR samples.
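A minimal sketch of that ring buffer, assuming a plain 12-entry array indexed modulo 12. The names Voice, voice_tap and voice_advance are hypothetical, the position update is inferred from debug logs rather than copied from Blargg's source, and no clamping of the position is shown.

```c
#include <stdint.h>

#define BRR_BUF_SIZE 12

typedef struct {
    int16_t buf[BRR_BUF_SIZE]; /* 3 groups of 4 decoded samples */
    int buf_pos;               /* group to overwrite next: 0, 4 or 8 */
    int interp_pos;            /* position, integer part in bits 12 and up */
} Voice;

/* Read interpolator tap n (0..3), wrapping around the ring. */
static int16_t voice_tap(const Voice *v, int n) {
    int idx = v->buf_pos + (v->interp_pos >> 12) + n;
    return v->buf[idx % BRR_BUF_SIZE];
}

/* Advance one output sample. When the position has reached 0x4000,
   store the next 4 decoded samples over the oldest group and turn
   the ring, then keep only the fractional part plus the pitch. */
static void voice_advance(Voice *v, const int16_t decoded[4], int pitch) {
    if (v->interp_pos >= 0x4000) {
        for (int k = 0; k < 4; k++) v->buf[v->buf_pos + k] = decoded[k];
        v->buf_pos = (v->buf_pos + 4) % BRR_BUF_SIZE;
    }
    v->interp_pos = (v->interp_pos & 0x3FFF) + pitch;
}
```

Starting from buf_pos = 4 and interp_pos = 0x4358 with a pitch of 0x3A6, one call stores the new group in entries 4..7, moves buf_pos to 8 and leaves interp_pos at 0x06FE. The taps are then read starting at entry 8, the oldest group, which suggests why the buffer holds more than the 4 samples a single interpolation needs.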

...Uh...holy crap...it actually sounds like I know what the hell I'm talking about. Haha. Well, only took me 3 years to figure it out. :-P

Anomie's apu-dsp.txt document is without a doubt the most technically dense document that I've ever had to interpret. The information is all there but it's extremely "compressed" and the document NEVER repeats itself. So you have to decompress the information on the fly while reading it and also remember everything you have read so far in order to understand the next sentence. It's a real pain in the ass, but I am so thankful to Anomie and Blargg for writing it!

In addition, Blargg's C code can be very difficult to read at times because he is very talented and likes to optimize things to the extreme. But if you put forth the effort in understanding his code you will end up a better programmer. I learned things from reading his code that I didn't even know were possible in C. It's pretty awesome...but also very time consuming to understand. Heh.

Re: SNES APU sample data
by funnyguy on 2015-08-16 (#153423)

Thanks jwdonal for the pointer, but I don't have an environment to compile snes_spc-0.9.0.zip. It seems to be a library only. Can you share the "skin" or application driver and the toolchain for it?

After reading Anomie's apudsp.txt again last weekend, I realized that it can access 8 entries of BRRdata because

i = voice[x].interpolation_index >> 12; // 0 <= i <= 4

Since the variable i can be 0 to 4, and the taps are BRRdata[i+0] to BRRdata[i+3], the code can access the range BRRdata[0] to BRRdata[7].

I just need to confirm that BRRdata[0] is the first entry to be replaced in the ring buffer when a new set of 4 BRR samples is decoded.

Re: SNES APU sample data
by jwdonal on 2015-08-16 (#153425)

All you need is Visual Studio Express which is a free download from Micro$oft. I'm using the 2012 version. Also, his emulator is not just a library (although it can be used that way), he actually has a full demo that you can run and specify any arbitrary SPC and it will generate a .wav file for you.

Regarding the ring buffer implementation it should be clear what you need to do once you get Blargg's example up and running and start inserting debug statements.

Re: SNES APU sample data
by funnyguy on 2015-08-19 (#153632)

Thanks jwdonal for the suggestion. I made a great leap forward: instead of thud... thud... thud... I now get music in the wrong tone.

Anyway, my emulator focuses on emulating the hardware as closely as possible. Since I believe hardware resources were expensive in those days, I used 15-bit ring counter entries and used adders instead of multipliers in the BRR decoding filters, which raises another problem.

In many designs, the BRR decoding filters are written as

else // s += p1 * 0.8984375 - p2 * 0.40625
{
    // s += (p1 * -13) >> 7; // <---- original design uses a multiplier
    // s += (p2 * 3) >> 4;   // <---- another multiplier
    s += (p1 * -1) >> 7; // *1 <--- adder 1, my design
    s += (p1 * -1) >> 5; // *4 <--- adder 2
    s += (p1 * -1) >> 4; // *8 <--- adder 3
    s += p2 >> 4;        // *1 <--- adder 4
    s += p2 >> 3;        // *2 <--- adder 5
}

Since I believe a multiplier occupies far more silicon area than an adder, I guess the original design used adders instead of a multiplier. Nevertheless, this gives a difference of 1 to 3 least significant bits between the adder and multiplier versions.

The question is: did the original design actually use a multiplier? Since a multiplier introduces rounding errors, how do we ensure the hardware LSBs are actually the same as the emulator's?

PS : In case you are interested, the music SPC is King Arthur's World, the tic-tac-toe game is developed by Tim Soft

Re: SNES APU sample data
by jwdonal on 2015-08-19 (#153640)

If you read apudsp.txt it tells you precisely how they implemented it with adds and shifts only - which is most definitely how the original hardware did it.

Whether or not a multiply or shift+add is more efficient is totally dependent on the target system and what you're trying to do. If you're making a software emulator on a modern processor you can be pretty certain that a single multiply is going to be a whole heck of a lot faster than a bunch of shifts+adds.

Also, using multiplies vs shift+adds will not necessarily generate different results. It depends entirely on the input values and the precision you keep. In the case of the BRR decoding the fractional values can be exactly represented by a fixed-point number with enough bits of precision (can't remember how many, but it's not a lot). So using shift+add or multiplies will still give the exact same result in this case.
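To illustrate the point about precision, here is a small sketch for the factor 13/128 used in the filter above. Building 13x out of adds and shifting once matches the multiply exactly; shifting each term before adding truncates three times and can come out lower. The function names are made up for this example, and this is not claimed to be the exact hardware sequence.

```c
#include <stdint.h>

/* Arithmetic shift right without relying on >> of negative values. */
static int32_t sar(int32_t x, int n) { return x < 0 ? ~(~x >> n) : x >> n; }

/* Reference: one multiply, one shift. */
static int32_t scale_mul(int32_t x) {
    return sar(x * 13, 7);
}

/* Adders only: build 13x = x + 4x + 8x at full width, then shift once. */
static int32_t scale_add_then_shift(int32_t x) {
    int32_t x2 = x + x, x4 = x2 + x2, x8 = x4 + x4;
    return sar(x + x4 + x8, 7);
}

/* Adders only, but every term is shifted (truncated) before the sum. */
static int32_t scale_shift_then_add(int32_t x) {
    return sar(x, 7) + sar(x, 5) + sar(x, 4);
}
```

For x = 1640 the first two return 166, while the per-term version returns 12 + 51 + 102 = 165. So the choice between multiplier and adders does not matter by itself; where the truncation happens does.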

Re: SNES APU sample data
by funnyguy on 2015-08-20 (#153693)

I ran the DSP_SPC filter design once with the multiplier and once with shifts and adds; these are the results for kaw-01.spc (King Arthur's World). For clarity, I cropped the output to the first BRR decode that shows some non-zero data.

Using the multiplier (the original implementation):

Quote:

Voice 3C
Gaussian interpolate v->interp_pos [04358] v->bufpos[04] offset [35] ROM [00C4][04D7][025E][0007] IN [0000][0000][0000][0000]- out[0000]
Element 0 [0000][0000][0000][0000] [0000][0000][0000][0000] [0000][0000][0000][0000]
V3C output [00000] v-> env [07FF] m.t_output [00000]
V4 start
v->interp_pos [4358] m.t_pitch [003A6]
BRR_Decode v->buf_pos[4] header[9C] nybbles[1B0F]
Before Element [0000][0000][0000][0000] [0000][0000][0000][0000] [0000][0000][0000][0000]
Nym[1] S[00100] P1[00000] P2[00000] out[00200]
Nym[FFFFFFFB] S[FFFFFB00] P1[00200] P2[00000] out[FFFFF998]
Nym[0] S[00000] P1[FFFFF998] P2[00100] out[FFFFF2DC]
Nym[FFFFFFFF] S[FFFFFF00] P1[FFFFF2DC] P2[FFFFFCCC] out[FFFFEB96]
After Element [0000][0000][0000][0000] [0200][FFFFF998][FFFFF2DC][FFFFEB96] [0000][0000][0000][0000]
Voice out for voice [1] [00000]
V4 end v->interp_pos [06FE] m.t_pitch [003A6]
Voice out for voice [1] [00000]

The same data using a shifter and several adders:

Quote:

Voice 3C
Gaussian interpolate v->interp_pos [04358] v->bufpos[04] offset [35] ROM [00C4][04D7][025E][0007] IN [0000][0000][0000][0000]- out[0000]
Element 0 [0000][0000][0000][0000] [0000][0000][0000][0000] [0000][0000][0000][0000]
V3C output [00000] v-> env [07FF] m.t_output [00000]
V4 start
v->interp_pos [4358] m.t_pitch [003A6]
BRR_Decode v->buf_pos[4] header[9C] nybbles[1B0F]
Before Element [0000][0000][0000][0000] [0000][0000][0000][0000] [0000][0000][0000][0000]
Nym[1] S[00100] P1[00000] P2[00000] out[00200]
Nym[FFFFFFFB] S[FFFFFB00] P1[00200] P2[00000] out[FFFFF998]
Nym[0] S[00000] P1[FFFFF998] P2[00100] out[FFFFF2DA]
Nym[FFFFFFFF] S[FFFFFF00] P1[FFFFF2DA] P2[FFFFFCCC] out[FFFFEB90]
After Element [0000][0000][0000][0000] [0200][FFFFF998][FFFFF2DA][FFFFEB90] [0000][0000][0000][0000]
Voice out for voice [1] [00000]
V4 end v->interp_pos [06FE] m.t_pitch [003A6]
Voice out for voice [1] [00000]

Data in green are the actual content in the BRR ring buffer.
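Both logs can be reproduced with a short sketch of the filter used by this block (header 9C, i.e. shift 9 with the filter from the code posted earlier). The function name and the per_term flag are hypothetical; the nybble scaling, the halving of the older sample and the doubling of the output are inferred from the S, P2 and out columns of the logs, and the 16-bit clamp before doubling is an assumption.

```c
#include <stdint.h>

/* Arithmetic shift right without relying on >> of negative values. */
static int32_t sar(int32_t x, int n) { return x < 0 ? ~(~x >> n) : x >> n; }

/* Decode 4 nybbles. p1 and p2 are the two previous outputs.
   per_term = 0: multiply, then shift once (first log).
   per_term = 1: shift each term, then add (second log). */
static void brr_decode4(const int nyb[4], int shift, int per_term,
                        int32_t p1, int32_t p2, int32_t out[4]) {
    for (int k = 0; k < 4; k++) {
        int32_t s = sar(nyb[k] * (1 << shift), 1); /* "S" in the log */
        int32_t h2 = sar(p2, 1);                   /* "P2" in the log */
        s += p1 - h2;
        if (per_term) {
            s += sar(-p1, 7) + sar(-p1, 5) + sar(-p1, 4);
            s += sar(h2, 4) + sar(h2, 3);
        } else {
            s += sar(p1 * -13, 7);
            s += sar(h2 * 3, 4);
        }
        if (s > 32767) s = 32767;                /* assumed 16-bit clamp */
        if (s < -32768) s = -32768;
        s = ((s * 2 + 32768) & 0xFFFF) - 32768;  /* double, wrap to 16 bits */
        out[k] = s;
        p2 = p1;
        p1 = s;
    }
}
```

Feeding the nybbles 1, -5, 0, -1 (hex 1B0F) with shift 9 and zero history gives 0x0200, 0xF998, 0xF2DC, 0xEB96 with per_term = 0 and 0x0200, 0xF998, 0xF2DA, 0xEB90 with per_term = 1, matching the "After Element" lines of the two logs. The first mismatch appears at the third sample, where the per-term shifts sum to 165 instead of 166.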

***********************

BTW, I have a question. I know the pitch modulation from the previous voice is

pitch += ((voice[x-1].outbuffer >> 5 ) * voice[x].PITCH) >> 10;

What if voice[x-1].outbuffer is negative (i.e. has its sign bit, the MSB, set)?
Should I just add the negative pitch delta to the interpolation position, thus "reducing" it?
Or should I make voice[x-1].outbuffer positive before computing the pitch addition, so that it is always positive?

Re: SNES APU sample data
by jwdonal on 2015-08-21 (#153699)

If you are getting different results when multiplying vs shift+add I can promise that you are doing something wrong. I know because I've co-simulated my SPC emulator with Blargg's emulator for nearly 15,000 SPCs now and haven't had a single mismatch. His emulator uses multiplies, mine uses shift and adds like the real hardware would. Keep trying!

As for the pitch/interp calculation you can answer all of your questions purely by inserting debug statements into Blargg's code. Got a question about how pitch works? Debug statements! Got a question how BRR decoding works? Debug statements! Got a question about the gauss filter? Debug statements! Got questions about echo processing? Debug statements!

But a more direct answer to your question is that the pitch calculation is performed as a signed operation. But the interpolation index is unsigned. That is, you never "back up" the ring buffer. You're always moving forward in time. I checked and apudsp.txt does not state that the pitch calculation is signed and the interpolation calculation is unsigned. It really should, that's kind of bad that it doesn't. Heh. But fortunately with Blargg's emulator you can figure out anything. On the other hand, nocash's fullsnes.txt does actually tell you that one is signed and the other isn't if you look at it carefully enough. This is why I had to use both those docs to figure this stuff out. They complement each other very well.
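A quick sketch of why the signed calculation never moves backwards, using the formula quoted in the question. It assumes the previous voice's output is a 16-bit signed value and the pitch is non-negative; pitch_modulate is a made-up name.

```c
#include <stdint.h>

/* Arithmetic shift right without relying on >> of negative values. */
static int32_t sar(int32_t x, int n) { return x < 0 ? ~(~x >> n) : x >> n; }

/* pitch: non-negative pitch value of voice x.
   prev_out: signed 16-bit output of voice x-1 (assumed range). */
static int32_t pitch_modulate(int32_t pitch, int32_t prev_out) {
    /* prev_out >> 5 lies in -1024..1023, so the signed delta lies in
       -pitch..pitch-1 and the sum cannot drop below zero. */
    return pitch + sar(sar(prev_out, 5) * pitch, 10);
}
```

With pitch 0x3A6, a previous output of -32768 gives a modulated pitch of 0 and a previous output of 32767 gives 1867. A negative output slows the voice down, at most to a standstill, but the position added per sample is never negative.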

Re: SNES APU sample data
by funnyguy on 2015-09-05 (#154679)

I finally figured out that I need to build the adders in exactly the sequence described in Anomie's document
to achieve the same results as Blargg's program.

This holds even though there is loss of precision and rounding error in the intermediate steps.

If you instead widen the adders to improve the resolution, build one big adder and sum everything
(i.e. make the adders up to 25 bits wide and do a lossless add), then shift back to 15 bits, the answers differ after a few hundred iterations.