2A03-D+PCM - Method for high-quality low-CPU sample playback

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#180636)
So, I was brainstorming yesterday and had an idea for a method of sample playback on the NES at high quality, with little CPU overhead and small size (2 bits per sample on average).

And this is the result: https://github.com/VitinhoCarneiro/2a03-d-plus-pcm

Unfortunately I'm not any good at 6502 ASM programming (though I might give it a try), so I just did a simulation of the encoding/decoding process in C. So, no ROMs or demos yet, though I've uploaded a sample of the program output here: https://www.youtube.com/watch?v=qgnXLjX4EMI

This codec is based on ideas from za909's MintaBOOM sample engine, available here: viewtopic.php?f=22&t=14520

How it works:
-The codec combines 7-bit PCM with DPCM deltas and uses the DMC interrupt for triggering samples.
-The audio is divided into 2-byte blocks, containing 8 samples.
-The first byte is a 7-bit PCM value to be written to the DMC delta counter.
-The second byte is an index into the DPCM sample table - the table is basically a sequence from 0x00 to 0xFF, so the value stored at each index is equal to the index itself.
-On playback, the PCM value is written to the delta counter, then the 1-byte DPCM sample is triggered and the interrupt is set to fire at the end of the sample; the CPU then returns to normal code until the interrupt fires.
-The sample rate can be varied by simply changing the DMC frequency value, since that's what triggers the sample playback - this also opens possibilities for sample repitching.
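
The block format in the list above can be modelled in a few lines of Python. This is only a sketch, not the encoder from the linked repository: the function names and the brute-force search over all 256 DPCM bytes are assumptions made for illustration. It relies on the DMC behaviour that each bit moves the 7-bit output level by 2 (up for a 1, down for a 0, with no change if the result would leave 0..127) and that bits are shifted out least-significant first.

```python
def dpcm_offsets(dpcm_byte):
    """Cumulative level change after each of the 8 bits, ignoring clamping."""
    offsets, total = [], 0
    for bit in range(8):                      # the DMC shifts bits out LSB first
        total += 2 if (dpcm_byte >> bit) & 1 else -2
        offsets.append(total)
    return offsets


def decode_block(pcm, dpcm_byte):
    """Simulate one 2-byte block: load the 7-bit level, then play 8 delta bits."""
    level = pcm & 0x7F
    out = []
    for bit in range(8):
        if (dpcm_byte >> bit) & 1:
            if level <= 125:                  # a 1 bit adds 2 unless it would overflow
                level += 2
        elif level >= 2:                      # a 0 bit subtracts 2 unless it would underflow
            level -= 2
        out.append(level)
    return out


def encode_block(target):
    """Hypothetical encoder: try every DPCM byte and keep the best fit.

    target is a list of 8 samples in the range 0..127.
    Returns (pcm, dpcm_byte) minimising the squared error.
    """
    best = None
    for dpcm_byte in range(256):
        offsets = dpcm_offsets(dpcm_byte)
        # Starting level that centres this delta pattern on the target.
        mean = sum(t - o for t, o in zip(target, offsets)) / 8
        pcm = min(127, max(0, int(mean + 0.5)))
        decoded = decode_block(pcm, dpcm_byte)
        error = sum((t - d) ** 2 for t, d in zip(target, decoded))
        if best is None or error < best[0]:
            best = (error, pcm, dpcm_byte)
    return best[1], best[2]
```

For example, `decode_block(*encode_block(samples))` gives the 8 levels the DMC would output for one block of `samples`, which makes it easy to compare the result against the input.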

Based on these ideas, this method of sample playback should be very easy to integrate into games/demos, since it should take very few CPU cycles to trigger a new block of samples every time the interrupt fires. The exception is code that requires cycle-accurate execution, since the interrupt will disrupt its timing.

Besides low overhead, this codec also has a high compression rate (2 bits per sample, as opposed to the usual 8-bit padded PCM) and still sounds good: even at ~32kbps ($C playback rate), it sounds way better than plain DPCM at $F.
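
The figures in that paragraph can be checked with a little arithmetic. The sketch below assumes NTSC timing (CPU clock of 1789773 Hz, DMC periods of 106 cycles per bit at rate $C and 54 at rate $F); the function names are made up for illustration.

```python
NTSC_CPU_HZ = 1789773                  # NTSC 2A03 CPU clock (assumed region)
DMC_PERIOD = {0xC: 106, 0xF: 54}       # CPU cycles per output bit, NTSC rate table


def sample_rate_hz(rate_index):
    """Output sample rate for a DMC rate index."""
    return NTSC_CPU_HZ / DMC_PERIOD[rate_index]


def bitrate_bps(rate_index, bits_per_sample):
    """Stored bits consumed per second of playback."""
    return sample_rate_hz(rate_index) * bits_per_sample


BLOCK_BITS, BLOCK_SAMPLES = 16, 8      # 2-byte block holding 8 samples
bits_per_sample = BLOCK_BITS / BLOCK_SAMPLES

print(f"codec at $C: {bitrate_bps(0xC, bits_per_sample) / 1000:.1f} kbit/s")
print(f"plain DPCM at $F: {bitrate_bps(0xF, 1) / 1000:.1f} kbit/s")
print(f"size versus 8-bit PCM: {bits_per_sample / 8:.0%}")
```

Under these assumptions the codec at $C comes out near 33.8 kbit/s, in the same ballpark as the ~32kbps quoted, and about the same as plain DPCM at $F (about 33.1 kbit/s), so the comparison is between two streams of similar size.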

I might try making a simple NES decoder for sample playback as a proof-of-concept for this method. But anyone is free to study and modify the source code, improve on it, and even make demos out of it.

I'd love to hear your feedback on this. I feel like this could be a game-changer for sample playback on the NES.

(PS: I'm thinking about a version of this that will use only 1 byte per 8 samples, using 4-bit ADPCM and vector-quantized DPCM sample blocks. I might have to test if it will decompress well, though...)
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180641)
So... this is the same as the regular 1-bit DPCM format but augmented by letting it do an arbitrary jump every 8 samples?

Hmm, I would guess that the jumps at ~4kHz would split the bandwidth, giving more fidelity to lower frequencies (under ~2kHz) via the added jumps, and possibly make more headroom for higher frequencies in the 1-bit DPCM stream? I could see this being a significant improvement, though it would be easier to hear in an example if you made a video/recording comparing plain 1-bit DPCM encoding to your method, side by side.


Some thoughts about implementing it:

DPCM samples have a 64 byte memory alignment, so a table of 256 1-byte DPCM samples actually requires 16k of space. Something with RAM in the DPCM area (e.g. FDS) could avoid having a table, though. (You could stick other data in the 63 bytes between samples, though, if you needed to take up that space, but it's quite inconvenient.)

Also the IRQ happens when the DPCM sample byte is fetched, not when it's finished playing, so the stream should have the DPCM data shifted ahead by one. (This might make using the IRQ trickier.)

Your IRQ will happen every 432 cycles, so I'd guess this would take up at least 10% of the CPU?

Games also have the problem of needing to do an OAM DMA once per frame, which will take 514 cycles, overlapping at least 1 sample, so there's a problem to solve regarding that interruption. (If not accommodating sprite animation, might be acceptable to use a lot more CPU?)
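
The timing figures in the last two points can be reproduced with a short calculation. This is a rough sketch assuming the NTSC DMC rate table; the 43-cycle handler cost is a placeholder picked only because it corresponds to roughly 10% at rate $F, not a measured value.

```python
# NTSC DMC rate table: CPU cycles per output bit for rate indices $0..$F.
DMC_PERIODS = [428, 380, 340, 320, 286, 254, 226, 214,
               190, 160, 142, 128, 106, 84, 72, 54]
OAM_DMA_CYCLES = 514                   # figure used in the post above


def irq_interval(rate_index):
    """CPU cycles between IRQs when each 1-byte sample holds 8 bits."""
    return DMC_PERIODS[rate_index] * 8


def cpu_share(rate_index, handler_cycles):
    """Fraction of CPU time spent in the handler (handler_cycles is a guess)."""
    return handler_cycles / irq_interval(rate_index)


def irqs_due_during_dma(rate_index):
    """Lower bound on IRQs that come due while one OAM DMA stalls the CPU."""
    return OAM_DMA_CYCLES // irq_interval(rate_index)


for rate in (0xC, 0xF):
    print(f"${rate:X}: IRQ every {irq_interval(rate)} cycles, "
          f"{cpu_share(rate, 43):.1%} CPU at 43 cycles per IRQ, "
          f"at least {irqs_due_during_dma(rate)} IRQ(s) due during OAM DMA")
```

At rate $F the interval is 432 cycles, shorter than one OAM DMA, so at least one block boundary always lands inside the DMA. At rate $C the interval is 848 cycles, so a DMA can only delay a single IRQ rather than span a whole block.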

Edit: samples are 64 byte aligned, not 16, as tepples points out below.
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180643)
It's worse than that. DPCM is 64-byte aligned, meaning a 256-entry table would use the first byte of every 64-byte segment across the whole fixed bank, reducing the whole fixed bank to carefully chopped up 63-byte segments. (The lengths are 1 plus a multiple of 16 bytes.) This makes the vector quantization angle even more attractive, as only the last kilobyte needs the 63-byte treatment.
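
To make the alignment constraint concrete, here is a small sketch of how the DMC address and length registers map to memory, and what a 256-entry identity table does to the $C000-$FFFF region. The helper names are made up for illustration.

```python
DMC_BASE = 0xC000


def dmc_address(a):
    """CPU address selected by writing a (0..255) to $4012."""
    return DMC_BASE + a * 64


def dmc_length(l):
    """Sample length in bytes selected by writing l (0..255) to $4013."""
    return l * 16 + 1


def build_identity_table():
    """16 KiB image of $C000-$FFFF with value v stored at dmc_address(v).

    Bytes left as None are free for other data (63 per segment).
    """
    bank = [None] * 0x4000
    for value in range(256):
        bank[dmc_address(value) - DMC_BASE] = value
    return bank


bank = build_identity_table()
free = sum(1 for b in bank if b is None)
print(f"table spans ${dmc_address(0):04X}-${dmc_address(255):04X}, "
      f"{free} bytes free in 63-byte runs")
```

Note that the free-byte count here includes the interrupt vectors at $FFFA-$FFFF, which a real ROM cannot use for anything else.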

Still, 10% of the CPU is a lot better than 99%.

And I wonder if VQing DPCM alone could be a good way to save some space.
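
As a very rough illustration of what vector-quantizing the DPCM bytes could look like, the sketch below keeps a small codebook of the most common byte patterns and maps every other byte to the codeword whose waveform is closest. Everything here is an assumption made for illustration: the 16-entry codebook size (suggested by 16 x 64 bytes = 1 kilobyte), the frequency-based codebook selection, and the squared-error distance. A real encoder would likely want something smarter.

```python
from collections import Counter


def offsets(dpcm_byte):
    """Cumulative level change after each bit (LSB first), ignoring clamping."""
    out, total = [], 0
    for bit in range(8):
        total += 2 if (dpcm_byte >> bit) & 1 else -2
        out.append(total)
    return out


def build_codebook(stream, size=16):
    """Pick the `size` most frequent DPCM bytes as codewords (a naive choice)."""
    return [byte for byte, _ in Counter(stream).most_common(size)]


def quantize(stream, codebook):
    """Replace each DPCM byte with the index of the closest codeword."""
    tracks = [offsets(c) for c in codebook]
    indices = []
    for byte in stream:
        target = offsets(byte)
        errors = [sum((t - c) ** 2 for t, c in zip(target, track))
                  for track in tracks]
        indices.append(errors.index(min(errors)))
    return indices
```

With 16 codewords each index fits in 4 bits, so only 16 table entries (and 16 of the 64-byte segments) are needed instead of 256.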
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180645)
rainwarrior wrote:
Hmm, I would guess that the jumps at ~4kHz would split the bandwidth, giving more fidelity to lower frequencies (under ~2kHz) via the added jumps, and possibly make more headroom for higher frequencies in the 1-bit DPCM stream? I could see this being a significant improvement, though it would be easier to hear in an example if you made a video/recording comparing plain 1-bit DPCM encoding to your method, side by side.


As originally posted, I've uploaded a sample of the program output here: https://www.youtube.com/watch?v=qgnXLjX4EMI

Definitely much better than plain DPCM.

EDIT: I've uploaded a plain DPCM version encoded by a modified version of my program: https://drive.google.com/file/d/0B4aSs6 ... sp=sharing
The quality difference is pretty noticeable, especially in the snare drums, which sound pretty muffled.

Quote:
It's worse than that. DPCM is 64-byte aligned, meaning a 256-entry table would use the first byte of every 64-byte segment across the whole fixed bank, reducing the whole fixed bank to carefully chopped up 63-byte segments. (The lengths are 1 plus a multiple of 16 bytes.) This makes the vector quantization angle even more attractive, as only the last kilobyte needs the 63-byte treatment.


Well, that sucks, but there can always be some use for 256 63-byte segments (who knows... I could even interleave the samples there if I do it right)... Or I could just go with VQ, since there's probably a lot of redundancy within the delta blocks.
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180663)
VitinhoCarneiro wrote:
The quality difference is pretty noticeable, especially in the snare drums, which sound pretty muffled.

Yeah, thanks for the DPCM example to compare against.

Less muffled snares make sense to me. One of 1-bit DPCM's big failings is that loud low frequencies tend to mask higher frequencies, a problem which I believe your method solves very well (what I meant about headroom before).
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180668)
Interesting idea.
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180672)
Cool idea indeed. I wonder if this can be done without stealing too much CPU time.
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180685)
Very cool idea, but

tepples wrote:
It's worse than that. DPCM is 64-byte aligned, meaning a 256-entry table would use the first byte of every 64-byte segment across the whole fixed bank, reducing the whole fixed bank to carefully chopped up 63-byte segments. (The lengths are 1 plus a multiple of 16 bytes.) This makes the vector quantization angle even more attractive, as only the last kilobyte needs the 63-byte treatment.

For code, I think you can manage to deal with it by using branch/jump instructions accordingly, but for lookup tables bigger than 63 entries... it's really unusable. And I fear lookup tables are necessary for a lot of applications.
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#180686)
Storing the PCM data in there would be no problem though. Very easy to just skip every 64th byte.
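
A minimal sketch of that idea, assuming the whole 16 KiB bank is given over to the identity table plus sample data (it ignores the interrupt vectors and any code that would also have to live there). The function names are hypothetical.

```python
SEGMENT = 64                           # DPCM sample alignment in bytes
BANK_SIZE = 0x4000                     # $C000-$FFFF


def physical_offset(logical):
    """Bank offset of the n-th payload byte, skipping every 64-byte boundary."""
    return logical + logical // (SEGMENT - 1) + 1


def pack(payload):
    """Interleave the identity DPCM table with payload bytes in one bank."""
    capacity = BANK_SIZE - BANK_SIZE // SEGMENT
    if len(payload) > capacity:
        raise ValueError(f"payload exceeds {capacity} bytes")
    bank = bytearray(BANK_SIZE)
    for value in range(256):
        bank[value * SEGMENT] = value  # table entry at the start of each segment
    for i, byte in enumerate(payload):
        bank[physical_offset(i)] = byte
    return bank


def unpack(bank, length):
    """Read payload bytes back, stepping over the table entries."""
    return bytes(bank[physical_offset(i)] for i in range(length))
```

On the 6502 the same skip amounts to one extra increment whenever the low 6 bits of the read pointer wrap to zero.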
Re: 2A03-D+PCM - Method for high-quality low-CPU sample play
by on (#182141)
Is it feasible to consider multiple virtual channels for sample playback in the vein of SuperNSF using this technique with 2 byte blocks of 8 samples each?

I guess for sample retriggering and sample offset commands the chunks would be a lot longer and you'd have less control... Also, an instance of the "decoding" would be run at every sample initialization, so if you were running 2-4 additional virtual channels mixed over one another you'd also have to deal with volume control and the additional CPU overhead...