93143 wrote:
You can of course change up to 8 arbitrary CGRAM entries per scanline with HDMA. A few years ago I wrote a scheduler in Matlab that starts with the top 256 colours (or fewer if I need space for something else) and goes down the picture changing palette entries that aren't currently needed when out-of-palette colours show up. If it can't find a free HDMA slot, or a single line has too many colours for the palette, it fails and I have to go re-quantize to a lower total colour count and try again. Obviously it would be better to have a tool that simultaneously optimizes the image and schedules the HDMA, but that seems like a lot of work...
I actually got a pretty decent processor going with several days of work that does that harmonious quantization you're wondering about.
The idea I went with was to originally quantize images based on bucketing all of the pixels, determining a bounding volume across the RGB 'axes', and then splitting the bucket with the widest bounding volume into two buckets. Repeat until the max palette size is hit. More or less how most modern palettizing processes operate these days.
The extension to support HDMA I did was to continue splitting buckets, sort of treating the scanline as an additional axis to split on:
-Once the number of buckets is more than the size of the base palette, identify the first and last scanline that each bucket of pixels has
-Across the current list of buckets, identify all potential HDMA candidates by seeing if one bucket's first scanline was greater than some other bucket's last scanline, flagging each one as such
-If the number of buckets that don't get roped in via HDMA are less than the max palette size, then split based on colour, and start again, to re-evaluate potential HDMA candidates (i.e. splitting on colour may have resulted in new HDMA possibilities)
-If we're at the max palette size, though, then over the full list of buckets, find one with a gap of scanlines where a pixel does not exist, and find some other bucket that could be a candidate for hdma, e.g. checking the following
Code:
// check if hdmaCandidate's final scanline is before our first possible scanline
// this only takes into account bucket pairs that are like so:
// -|||-------|||----- <- Bucket - split this along the scanline
// ------|||---------- <- hdmaCandidate?
// check if hdmaCandidate's first non-gap sequence (that we can see) is within
// bucket's gap. this takes into account bucket pairs like so:
// -|||-------|||----- <- Bucket
// ------|||------|||- <- hdmaCandidate?
...if an hdmaCandidate was found, then the bucket being evaluated is split across the scanline. Afterwards, start this loop again.
And repeat until no more hdma candidates are discovered.
It's still a little slower than I'd like, in particularly due to one chunk of the algo that basically runs O(n^3) with number of buckets, but with maxcolours of 256 and maxhdmachannels of 8, I'm able to run it over the
Kodak image library at an average of ~1.9 seconds per img (single-threaded; ofc processing them in bulk, it's trivial to parallelize). Also, the calculation of the colour deltas for buckets could be a lot better, e.g. the colour deltas are all computed in 8b per channel, not 5b, which is why even for the "0 hdma" case shown below, there aren't actually 256 colours in use, because there were, apparently some duplicates after final quantization, so we must have done some bucket splits that were redundant and may have been better suited on some other stuff.
The following PSNR values are compared against versions of the source images that were scaled down to SNES' max height or width, corrected for the different pixel size, and direct quantization to R5G5B5. The colour counts are all measuring unique quantized colours as well; there's almost assuredly some cases where one colour may get HDMA'd out and then get HDMA'd in again, but that still counts as one colour.
and some of the more notable comparisons (left-to-right: 'original', 15bpp quantized original, 8-channel hdma, 0-channel hdma):
#03:
#15:
#23:
If anyone is interested, I'm willing to post the source code up somewhere for perusal. There are some bits of ISPC code and the afore-mentioned parallelism is provided via Msft's ConcRT library, so it's not super platform agnostic, but if anyone is interested in a different take on this problem, I'd be happy to oblige.