Like on NES some time back, here's some data for which things to pick on N64. I plan to bench some audio and video codecs next. All on gcc 8.2 -O3.
Code:
Results from cen64, which slightly differs from hw (~5%).
Text decompression, source LGPLv3 7.5kb, speed in kb/s
Algo | Ratio | Speed | License, comments
-------------------------------------------
zstd | 0.333 | 1457 | BSD, requires ~160kb RAM
zlib | 0.343 | 2823 | zlib, requires ~4kb RAM (tinfl)
lzo | 0.402 | 4773 | GPL, no RAM required
lz4hc | 0.475 | 10471 | BSD, no RAM required
lzjb | 0.591 | 4998 | CDDL, no RAM required, nemequ github version
Audio, 10s 44100 Hz mono clip, % realtime
Algo | Ratio | Speed | License, comments
-------------------------------------------
Speex | 0.038 | 208 | BSD, fixed point
Vorbis 128 | 0.158 | 410 | BSD, tremor lowram, measured ~35kb
Vorbis 96 | 0.122 | 458 |
Vorbis 64 | 0.089 | 498 |
Vorbis 48 | 0.068 | 498 |
Opus 64 | 0.099 | 215 | BSD, fixed point, measured ~95kb
Opus 48 | 0.075 | 229 |
Opus 32 | 0.049 | 252 |
MP3 128 | 0.131 | 215 | PD, no RAM required, lieff/minimp3
MP3 96 | 0.109 | 215 |
MP3 64 | 0.087 | 219 |
MP3 32 | 0.044 | 430 | Lame chose to downsample to 22kHz and mpeg-2l3
isac 56 | 0.105 | 234 | 32 kHz, ~400kb RAM usage
Audio, 10s 16000 Hz mono clip, % realtime
Algo | Ratio | Speed | License, comments
-------------------------------------------
Speex | 0.071 | 582 | BSD, fixed point
Vorbis 64 | 0.173 | 1066 | BSD, tremor lowram, measured ~32kb
Vorbis 48 | 0.142 | 1165 |
Vorbis 32 | 0.111 | 1206 |
Opus 64 | 0.266 | 252 | BSD, fixed point
Opus 48 | 0.199 | 264 |
Opus 32 | 0.135 | 276 |
Video, 5s 320x136 25fps clip, xvid simple profile L3, 247 kbps
libxvidcore (GPL) decoding to I420: 98% realtime
Zstd is pretty disappointing given how hyped it is. Barely better compression than zlib and much slower, with huge RAM usage.
How does Zstandard compare to implementations of Deflate other than zlib's, such as 7-Zip's or Google's Zopfli? I ask because I'm familiar with these two in particular from the advzip and advpng tools in AdvanceCOMP. Decompression speed probably wouldn't differ much, but compression would probably be slower, and the rate might differ.
Would you be interested in results for DTE and Huffword codecs as a low water mark? But I'll admit these may not be quite as useful on Nintendo 64 as they are on a small-RAM, execute-in-place environment like the NES.
Zstd on modern computers beats even the best zlib implementations, according to all reports. The small size of the test data here probably hinders it a bit, and it's not very speedy on an old MIPS like this. I'll probably use zlib for everything, it hits the sweet spot here.
As for any additional codecs, sure, I'll add any data points.
Results for lzjb? (Yes I'm aware lz4 is a faster implementation/replacement for lzjb)
I wonder if Zstd is here specifically crippled by the N64's tiny cache.
lzjb and speex added. Speex compresses quite well, but at these speeds it's not that suitable for many voices at once. For cutscenes or RPG-style talking to one character, it should work great.
lzjb was fast to test, but I had never even heard of it. Why do you find it interesting? ZFS was a Sparc thing, no MIPS relation.
SILK, the low-rate voice mode in Opus, is similar to Speex in several ways. Is Opus on the whole too slow for the N64? Or Codec 2?
It's on the list to test, along with vorbis and mp3.
calima wrote:
lzjb was fast to test, but I had never even heard of it. Why do you find it interesting? ZFS was a Sparc thing, no MIPS relation.
ZFS isn't a "Sparc" thing, it was originally a "Solaris thing". Around the time of the Oracle buy-out of Sun, OpenIndiana/Illumos happened (think: open-source Solaris), which then resulted in parts of ZFS becoming open-source (though under CDDL), which resulted in it being imported into FreeBSD and a fusefs version for Linux (slow). This all later resulted in OpenZFS and ZFS on Linux -- so now FreeBSD, Linux, and OpenIndiana/Illumos all have ZFS (regardless of arch; x86, amd64, aarch64/ARM, etc.).
Why I found it interesting: because I've known it to be faster than gzip, faster than zlib, but slower than lz4 (which is extremely new), and wanted to see how it performed on the N64.
I don't know if there's some "easy to add" code that would be testable, but gzip and bzip2 (for text) might be interesting as well. I've seen many cases where text compresses better with gzip than bzip2, and in other cases the exact opposite. Another one to consider might be some bare-bones native Huffman implementation, although I wouldn't be surprised if one of the previously-tested algorithms dynamically implements something like that.
For audio, you might look into Codec 2, which is known for being OSS and having extremely high compression rates, but again I have no idea how easy this would be to add/test.
PKZIP and Gzip use the same Deflate algorithm as zlib, and bzip2 is very RAM-intensive, on the order of 1 MB, which is one-fourth of the N64's RAM.
How much of the decoding for MP3, Vorbis, Speex, Opus, Codec 2, or FLAC can be done on the RSP? FLAC can be turned into a time-domain lossy codec using the LossyWAV preprocessor, which bit-crushes each 512-byte block with noise-shaped dithering so that FLAC has fewer significant bits to code.
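To illustrate the bit-crushing idea only, here is a minimal sketch, not LossyWAV's actual algorithm. It assumes 16-bit signed samples and a fixed, hand-picked number of bits to drop per block; the real tool chooses that number per block from a noise analysis and applies noise shaping, neither of which is modelled here. `crush_block` and its parameters are hypothetical names.

```python
def crush_block(samples, bits_to_remove):
    """Round every 16-bit sample in a block to a multiple of 2**bits_to_remove.

    Simplified stand-in for LossyWAV-style bit reduction: no noise analysis,
    no noise shaping, just rounding so the low bits of the block are zero.
    """
    step = 1 << bits_to_remove
    lo = -32768              # always a multiple of step for steps up to 32768
    hi = 32768 - step        # largest multiple of step that fits in 16 bits
    out = []
    for s in samples:
        q = ((s + step // 2) // step) * step  # round to the nearest multiple
        out.append(max(lo, min(hi, q)))       # clamp after rounding
    return out

block = [1000, -1000, 5, 32767]
print(crush_block(block, 4))
```

After this step every sample in the block has its low bits cleared, which is the property a lossless coder can exploit to spend fewer bits per sample.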
Audio codecs generally don't benefit from SIMD, there isn't any vectorizable processing going on. The RSP's SU (scalar unit) lacks multiply instructions and 64-bit instructions, so even using that as a "second core" would be slower than the main core. I expect graphics processing to take most of the RSP's frame time anyway.
calima wrote:
Audio codecs generally don't benefit from SIMD, there isn't any vectorizable processing going on.
Not even inverse FFTs or MDCTs or filtering?
The amount of data in one audio packet is so small, that the overhead usually kills any speedup. FFT/DCT for images is a different case, if you process a mb at once instead of a hundred bytes.
... How bad is the overhead? I haven't yet found any documentation on how writing code for the RDP/RSP works.
Well, I was speaking in general, as in even on x86_64 you won't get much speedup if any from vectorizing parts of audio decoding. You have to load data to the vector, often from unaligned or scattered addresses, do the calculation, and store. The load/permute/store steps may make the 8x/16x processing step speedup worthless if you don't have much data. More so if the vectorizable parts alternate with non-vectorizable.
RSP specifically: vector loads have three delay slots, meaning effectively it takes four instructions worth for an aligned, perfect load. Then you have to DMA in and out of the 4kb memory, giving further overhead. "SGI_Nintendo_64_RSP_Programmers_Guide.pdf" is available on the ultra64.ca site, as well as a RDP register doc.
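The overhead argument above can be put into a toy cost model. This is only a sketch: the cycle counts, lane width, and fixed overhead below are made-up illustrative numbers, not measurements from the RSP or any other CPU, and `vector_speedup` is a hypothetical helper.

```python
def vector_speedup(n_elems, lanes, scalar_cost, vector_cost, overhead):
    """Estimated speedup of a vectorized loop over its scalar version.

    scalar_cost: cycles per element in scalar code (assumed)
    vector_cost: cycles per vector operation covering `lanes` elements (assumed)
    overhead:    fixed cycles per call for load/permute/store/DMA setup (assumed)
    """
    scalar_cycles = n_elems * scalar_cost
    vector_ops = -(-n_elems // lanes)  # ceiling division
    vector_cycles = overhead + vector_ops * vector_cost
    return scalar_cycles / vector_cycles

# A tiny audio-sized chunk versus a large image-sized one, same hypothetical costs.
small = vector_speedup(n_elems=16, lanes=8, scalar_cost=1, vector_cost=1, overhead=40)
large = vector_speedup(n_elems=65536, lanes=8, scalar_cost=1, vector_cost=1, overhead=40)
print(f"small chunk: {small:.2f}x, large chunk: {large:.2f}x")
```

With these assumed numbers the small chunk ends up slower than scalar code, while the large chunk gets close to the full lane-width speedup, which is the pattern described for audio packets versus image data.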
I've read pretty much all N64 docs by now. In some ways it's better and in others worse than expected. For example there is no flipped Z comparison mode, and rendering triangles is very much a PITA, but on the other hand the RSP will allow many kinds of software pixel effects. Gaussians, additive rendering, better scaling algos, maybe even some form of shadow mapping.
Vorbis is quite fast, much more than on the Rockbox codec benchmark page.
Run your benchmarks against the samples actually provided by RockBox?
http://download.rockbox.org/test_files/
Your 96kbit/sec sample corresponds to needing a 20.47MHz MIPS, which puts it MHz-for-MHz better than about 3/4 of the tests there. (Kinda surprising, but maybe not comparing apples to apples?)
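For anyone wanting to redo the MHz conversion for other rows of the table, a small sketch follows. It assumes the N64 CPU clock is 93.75 MHz and that decode speed scales linearly with clock, which ignores memory effects; `mhz_for_realtime` is a hypothetical helper name.

```python
# Assumption: the N64's VR4300 CPU runs at 93.75 MHz.
N64_CPU_MHZ = 93.75

def mhz_for_realtime(percent_realtime, cpu_mhz=N64_CPU_MHZ):
    """Clock (in MHz) at which the decoder would run at exactly 100% realtime,
    assuming speed scales linearly with clock frequency."""
    return cpu_mhz * 100.0 / percent_realtime

# Vorbis 96 kbps, 44.1 kHz mono: 458% realtime in the results table.
print(f"{mhz_for_realtime(458):.2f} MHz")
```

The same function applied to the other Vorbis rows gives figures that can be set against the Rockbox page, keeping in mind the mono/stereo and sample differences.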
Too much trouble IMHO. The clip I used (Nightwish), the bitrate, and being mono are representative of my future workloads.
Rockbox's vorbis_96.ogg is stereo, while I used mono. So a near-2x difference is plausible.
Opus and MP3 added. Both far too slow sadly. I also tried MP2 since minimp3 supports it and the Rockbox page says it's faster, but it was exactly the same speed. Minimp3 looks like it's slower than mpg123 and libavcodec anyway, but it was the only freely licensed decoder I found. AC3 and Cook are also listed as fast on that page, but AC3 has no freely licensed decoder, and Cook lacks both a decoder and an encoder. So it looks like Vorbis works best, probably at a lower sample rate than 44.1 kHz.
calima wrote:
Minimp3 looks like it's slower than mpg123 and libavcodec anyway, but it was the only freely licensed decoder I found.
How is libavcodec not "freely licensed"? The libavcodec/mpegaudio* files in FFmpeg's repository appear to be LGPLv2.1, which is a free software license. Use of LGPLv2.1 software in a statically linked program requires providing object code files at cost to anyone who owns a lawfully made copy of the program for relinking with a modified library. Even recent Nintendo consoles' system software includes LGPLv2.1 software.
calima wrote:
AC3 and Cook are also listed as fast on that page, but AC3 has no freely licensed decoder
The libavcodec/ac3dec* files are LGPLv2.1 as well. AC3 was standardized as part of the U.S. TV system ATSC, under the name "ATSC A/52". This turns up liba52, but that's copylefted (GPL version 2 or later) and thus not very practical for commercial games in the present market.
You know very well what was meant, and why LGPL code is not viable for commercial games on many platforms.
The fact that I wrote "Even recent Nintendo consoles' system software includes LGPLv2.1 software." indicates that I don't know "why LGPL code is not viable for commercial games on many platforms", or at least that I can't explain adequately to others following this topic. What makes Nintendo 64 in particular among the "many platforms" to which you refer?
It is a long topic that I'm not going to discuss now.
In other news, I stumbled upon a voip codec called isac. Google bought it and released under BSD. No data anywhere, not on performance, not on audio quality, but the wikipedia page makes it sound worth a try.
It seems Factor 5's MORT codec has been cracked, but only for decode. This happened about a year ago, so I doubt there's an encoder unless somebody's found some Factor 5 tools lying around...
isac sounds ok at its 56kbps rate, but it's slower than Vorbis. 234% realtime, but not directly comparable since it's 32 kHz.
It really looks like minimp3 is a bit of a lemon performance-wise. Most other metrics I've seen everywhere show vorbis at equal bitrate being about 120% the computational cost of mp3...
Yep.
I thought about trying musepack, but a listening test at 96 kbps says it sounds terrible, and its recommended rate of ~170 would take too much space. I guess audio codecs are now about exhausted. FLAC and ADPCM halve the size, making them suitable for sfx but not for music. Heavier ADPCM variants that go down to 1/4 or 1/8 sound too bad. Tremor may be somewhat optimizable, but this MIPS does not have any special instructions, limiting the potential. I guess I'll need to add profiling support to cen64.
Kinda disappointed that nothing was suitable as-is, I don't want to spend over 10% of a frame on audio, total. Oh well.
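To relate the "% realtime" figures to that budget, a quick sketch. It reads "10% of a frame" as 10% of CPU time overall, which is an interpretation, and the helper names are hypothetical.

```python
def cpu_share_percent(percent_realtime):
    """Share of CPU time (in %) needed to decode in real time."""
    return 100.0 * 100.0 / percent_realtime

def required_realtime(budget_percent):
    """Decode speed (in % realtime) needed to stay within a CPU budget."""
    return 100.0 * 100.0 / budget_percent

# Figures taken from the results tables.
for name, speed in [("Vorbis 64, 44.1 kHz", 498),
                    ("Vorbis 64, 16 kHz", 1066),
                    ("Vorbis 32, 16 kHz", 1206)]:
    print(f"{name}: {cpu_share_percent(speed):.1f}% of CPU")

print(f"10% budget needs {required_realtime(10):.0f}% realtime")
```

By this reading a decoder has to reach roughly 1000% realtime to fit the budget, which only the 16 kHz Vorbis rows manage; the 44.1 kHz Vorbis rows sit at about a fifth of the CPU.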
What does F-Zero X do? I believe the music is 32 kHz, and in the expansion kit it's in stereo. I'm pretty sure compute time was a priority for that game - actually, wasn't that why they went with streaming in the first place?
Also, how does BRR sound? I seem to recall getting decent results on whole tracks... Is VAG a noticeable improvement?
BRR, VAG, and LossyWAV+FLAC are forms of ADPCM, whose bitrate is too high for music.
The objection to 4:1 compression was not that it was too big, but that it sounded too bad. BRR is not that far off 4:1 and should sound better than vanilla ADPCM.
Then again, 8:1 was mentioned, and 96 kbps as a target suggests a very high desired compression ratio.
My first question remains: what did F-Zero X do? (Whatever it did, it seems it took too much space on a cartridge for stereo, but on a bulky disk it was fine. This suggests an acceptable processor load and a potentially borderline compression ratio.)
I actually want to target 64 kbps or lower, 96 was just the lowest listening test that included musepack.
F-Zero X seems to be a form of 16-bit ADPCM:
https://fzerocentral.org/viewtopic.php?t=14170
https://github.com/jombo23/N64-Tools/tr ... oolUpdated
edit: and apparently it spent 35mb on music alone.
What's the computational load for a softsynth vs a decoder?
MIDI playback can be fast, but at least timidity is about 10x slower vs vorbis decoding on x86_64. Lots of official N64 games used MIDI, but it's quite a lot easier to find musicians for normal formats vs MIDI. I'm not otherwise opposed to MIDI or tracker approaches, though finding optimized, liberally licensed decoders and sample banks might also be an issue.
cen64 is now able to profile.
ERP, a commercial N64 dev, mentions on beyond3d that the RDRAM random latency is 64 cpu cycles. Putting it here too to spread the info. That's almost modern-cpu level cost for a cache miss, damn.
My understanding is that this wasn't just the inherent RDRAM latency, but the total cost of the CPU having to interface with the RAM via the memory controller on the RCP. I also didn't get the impression that it was always exactly 64 cycles; that number might be an average or a worst-case scenario.
But yes, it's been noted that optimizing for size often gives the best speedup with N64 code...
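For a rough sense of scale, here is a back-of-the-envelope sketch of what that miss cost means per frame. It takes the quoted 64 cycles at face value (it may be an average or a worst case, as noted), assumes a 93.75 MHz CPU clock, and uses a hypothetical 30 fps target and a made-up miss count.

```python
CPU_HZ = 93_750_000   # assumed N64 CPU clock
MISS_CYCLES = 64      # figure quoted above; may be average or worst case

def stall_fraction(misses_per_frame, fps=30, miss_cycles=MISS_CYCLES):
    """Fraction of a frame's CPU cycles spent waiting on cache misses."""
    cycles_per_frame = CPU_HZ / fps
    return misses_per_frame * miss_cycles / cycles_per_frame

# Hypothetical workload: 10,000 cache misses in one 30 fps frame.
print(f"{stall_fraction(10_000) * 100:.1f}% of the frame stalled")
```

Under these assumptions a few thousand misses per frame already cost several percent of the CPU, which fits the advice to keep code and data small.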
I had tried that with some of the audio codecs, and -Os was something like 20% slower than -O3.
Obviously it depends on the code, and presumably the compiler too. I just meant that cache misses are known to be a fairly big deal on N64, and since the cache is not huge, it's something to watch out for.
calima wrote:
I had tried that with some of the audio codecs, and -Os was something like 20% slower than -O3.
I didn't try with MIPS, however something that I observed is that gcc often generates smaller code with -O3 than with -Os, at least when compiling for ARM Cortex-M3 devices. That alone may explain the difference.
Did a video test, just for fun and to see what the scale would be. I couldn't find any liberally licensed mpeg1, mpeg2, or mpeg4 decoders, so I just did one test with Xvid (which is GPL).
The results are rather astounding. Even though it's the Simple profile, lacking in quality compared to xvid with its full features, it should still look better than mpeg1 and many mpeg2 streams. 98% realtime and quite watchable quality. The color conversion can probably be done by the RSP, and the texture filter supports YUV natively too. All just with C, xvid has no MIPS optimizations.
At the bitrate used, 247 kbps, add 64 kbps for audio, and that makes 40 kb/s, 2.4 mb/min of video. That's plenty of FMVs.
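The storage arithmetic can be checked with a few lines. This sketch assumes kbps means 1000 bits per second and uses decimal kB/MB; `stream_size` is a hypothetical helper.

```python
def stream_size(video_kbps, audio_kbps):
    """Return (kB per second, MB per minute) for a combined A/V stream.

    Assumes 1 kbps = 1000 bits/s, 1 kB = 1000 bytes, 1 MB = 1000 kB.
    """
    total_kbps = video_kbps + audio_kbps
    kbytes_per_s = total_kbps / 8
    mbytes_per_min = kbytes_per_s * 60 / 1000
    return kbytes_per_s, mbytes_per_min

kb_s, mb_min = stream_size(video_kbps=247, audio_kbps=64)
print(f"{kb_s:.1f} kB/s, {mb_min:.2f} MB/min")
```

This works out to a little under 39 kB/s and about 2.33 MB/min, so the 40 kb/s and 2.4 mb/min quoted are the same figures rounded up.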
Sorenson Spark and Xvid are codecs in the H.263 family. Their R/D is a step above MPEG-2 but just below H.264. How well does Theora run? It's BSD licensed, and its R/D should be comparable to Xvid.
Theora is several times slower on x86, so not even worth trying. Its quality is also lower.