Caching Optimization for CHR Rendering

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#3974)
CHR data is rendered to the screen vastly more often than it is changed, so it makes sense for a NES emulator to store it in a format best suited to rendering. In my experimental NES emulator I've optimized CHR rendering (background and sprites) by caching an optimized representation of CHR data and updating it whenever CHR data changes. The cache has two main aspects: determining when it needs updating, and the format of the cached data.

The cache stores a transformation of the CHR data, so it needs to be updated whenever CHR data changes. This is handled by keeping a flag for each tile and setting it when that tile's data is written to. A global flag is also kept to indicate whether any CHR changes have occurred since the cache was last updated. Then when CHR rendering is about to occur the cache is updated if necessary.
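A minimal sketch of this bookkeeping, assuming a CHR-RAM write hook (the names chr_write, update_cache, and decode_tile are illustrative, not from any particular emulator):

```c
#include <stdint.h>

enum { chr_size = 0x2000, tile_count = chr_size / 16 };

static uint8_t chr [chr_size];          // raw CHR data (pattern tables)
static uint8_t tile_dirty [tile_count]; // per-tile "needs re-decode" flags
static int     any_dirty;               // global "cache is stale" flag

// Hypothetical hook called for every write into CHR RAM.
static void chr_write( uint16_t addr, uint8_t data )
{
    chr [addr] = data;
    tile_dirty [addr / 16] = 1; // 16 bytes per tile
    any_dirty = 1;
}

// Called just before rendering; re-decodes only the tiles that changed.
static void update_cache( void )
{
    if ( !any_dirty )
        return;
    for ( int t = 0; t < tile_count; t++ )
        if ( tile_dirty [t] )
        {
            tile_dirty [t] = 0;
            // decode_tile( t ); // rebuild this tile's cache entry (not shown)
        }
    any_dirty = 0;
}
```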

For CHR ROM the cache only needs to be generated once, since the data will never change. Bank switching in the CHR area (PPU $0000-$1FFF) doesn't matter, because the cache covers the actual CHR data in VRAM or VROM, not what the PPU currently sees through the mapping.

In my emulator I render graphics to an offscreen graphics buffer with 8 bits per pixel. I use palette entries 32-63 for the 32 NES palette entries, leaving room for the host system at the beginning of the palette. The cached CHR data is just a reordering of the original data to allow shifting and masking to quickly generate the 8-bit-per-pixel format. This keeps the cached data to a minimum size, lessening impact on the host's processor cache. There is also a separate cache with pixels horizontally flipped.
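The flipped cache doesn't need a second converter: reversing the bits of each CHR byte before conversion yields the horizontally mirrored tile. A sketch, using the standard swap-halves bit-reversal trick (reverse_bits is an illustrative helper name):

```c
#include <stdint.h>

// Reverse the 8 bits of a CHR byte so the leftmost pixel becomes the
// rightmost. Feeding reversed bytes through the normal tile converter
// produces a horizontally flipped cached tile.
static uint8_t reverse_bits( uint8_t n )
{
    n = (uint8_t) ((n >> 4) | (n << 4));                   // swap nibbles
    n = (uint8_t) (((n >> 2) & 0x33) | ((n << 2) & 0xCC)); // swap bit pairs
    n = (uint8_t) (((n >> 1) & 0x55) | ((n << 1) & 0xAA)); // swap adjacent bits
    return n;
}
```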

Each cached tile consists of four pairs of lines, and each pair is stored in a 4-byte integer. The 2-bit pixels for the two lines are reordered in the cache to allow for quick extraction:

Code:
12345678          CHR pixels (2 bits per pixel)
ABCDEFGH

A1E5B2F6C3G7D4H8  Cache (4 bytes)
-1---2---3---4--  Masked pixels
---5---6---7---8
A---B---C---D---
--E---F---G---H-

uint32_t* pixels = ...                       // pixel buffer to draw into
uint32_t mask = 0x03030303;                  // mask to extract pixels
int attrib = 2;                              // attribute bits (0-3)
uint32_t offset = (8 + attrib) * 0x04040404; // palette base (32 + attrib*4) replicated to 4 pixels

uint32_t pair = *cache++;                    // read pair of lines from cache
pixels [0] = ((pair >> 4) & mask) + offset;  // extract pixels 1234
pixels [1] = ((pair >> 0) & mask) + offset;  // extract pixels 5678
pixels += pitch;                             // next line
pixels [0] = ((pair >> 6) & mask) + offset;  // extract pixels ABCD
pixels [1] = ((pair >> 2) & mask) + offset;  // extract pixels EFGH
pixels += pitch;                             // next line


On a 64-bit CPU, groups of four lines (rather than two) could be stored in each cache word, roughly doubling extraction throughput.
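To make the 64-bit idea concrete, here is one possible layout (a sketch, not code from the post; spread8, pack_quad, and extract_line are hypothetical names): four lines share one 64-bit word, one byte per pixel column, with line L in bits 2L..2L+1 of each byte, so a single shift and mask yields eight 8-bit pixels at once. Whether this actually doubles throughput depends on the host, and the wider words need a matching converter and flipped cache.

```c
#include <stdint.h>

// Spread the 8 bits of a plane byte into the low bit of each of 8 bytes,
// leftmost pixel (MSB) in the lowest byte.
static uint64_t spread8( uint8_t n )
{
    uint64_t r = 0;
    for ( int i = 0; i < 8; i++ )
        r |= (uint64_t) ((n >> (7 - i)) & 1) << (i * 8);
    return r;
}

// Pack four lines (low/high plane bytes) of a tile into one 64-bit word;
// line L occupies bits 2L..2L+1 of each byte, one byte per pixel column.
static uint64_t pack_quad( const uint8_t* lo, const uint8_t* hi )
{
    uint64_t q = 0;
    for ( int line = 0; line < 4; line++ )
        q |= (spread8( lo [line] ) | (spread8( hi [line] ) << 1)) << (line * 2);
    return q;
}

// Extract one line's 8 pixels (one byte each) plus the palette offset.
static uint64_t extract_line( uint64_t quad, int line, int attrib )
{
    uint64_t mask   = 0x0303030303030303ull;
    uint64_t offset = (uint64_t) (8 + attrib) * 0x0404040404040404ull;
    return ((quad >> (line * 2)) & mask) + offset;
}
```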

To handle sprite transparency, the mask can be efficiently calculated from the pixels by subtracting the base pixels (before offset) from 0x80808080 and shifting the result right by 2. Each byte of the result has its lower 5 bits clear for a transparent pixel and set for an opaque one. The upper 3 bits of each mask byte are unreliable (bit 5 comes out inverted, and bits 6-7 pick up bits shifted in from the neighboring byte), but that doesn't matter: every final pixel value lies in 32-63, so bit 5 is always set and bits 6-7 are always clear in both the sprite and the background, and those mask bits end up selecting between identical values. For example, (0x80808080 - 0x02030001) >> 2 = 0x1F9F601F.

Code:
uint32_t bg = *pixels;                            // get background pixels
uint32_t sp = (pair >> 4) & 0x03030303;           // extract sprite pixels (before offset)
uint32_t mask = (0x80808080 - sp) >> 2;           // calculate transparency mask
*pixels = ((sp + offset) & mask) | (bg & ~mask);  // combine sprite and background


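Wrapped up as a helper (blend_sprite is an illustrative name, not from the post), the whole transparency step is:

```c
#include <stdint.h>

// Blend four 2-bit sprite pixels over four background pixels using the
// subtract-and-shift transparency mask. 'sp' holds raw pixel values (0-3),
// one per byte; 0 means transparent. 'offset' is the replicated palette
// base, so all final pixel values lie in 32-63 -- the trick relies on that.
static uint32_t blend_sprite( uint32_t bg, uint32_t sp, uint32_t offset )
{
    uint32_t mask = (0x80808080u - sp) >> 2; // low 5 bits set where opaque
    return ((sp + offset) & mask) | (bg & ~mask);
}
```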
My emulator only applies PPU changes every 8 scanlines. Relaxing this to every scanline wouldn't be too complex; mid-line changes would probably be simpler to handle without using the cache at all.

For completeness, here is a function that converts CHR data to the cached format:

Code:
// Expands each of the 8 bits in n into separate nybbles of result.
// In: 12345678  Out: 0x15263748
uint32_t expand( uint32_t n )
{
    //                         12345678
    //                  12345678
    //           12345678
    //    12345678
    // ---1---5---2---6---3---7---4---8
    return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111;
}

void convert_tile( const uint8_t* chr, uint32_t* cache )
{
    // convert one chr tile to a cached tile
    for ( int n = 4; n--; )
    {
        *cache++ =  (expand( chr [0] ) << 0) |
                    (expand( chr [8] ) << 1) |
                    (expand( chr [1] ) << 2) |
                    (expand( chr [9] ) << 3);
        chr += 2;
    }
}
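As a quick sanity check of the whole pipeline, expand() and convert_tile() (repeated below for self-containment) can be paired with a decoder that applies the same shifts as the rendering loop. The decoder draw_tile and its byte ordering are illustrative: pixel 1 sits in the most significant byte of each extracted word, matching the diagram above; the in-memory byte order of the real pixel buffer depends on host endianness.

```c
#include <stdint.h>

static uint32_t expand( uint32_t n )
{
    return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111;
}

static void convert_tile( const uint8_t* chr, uint32_t* cache )
{
    for ( int n = 4; n--; )
    {
        *cache++ = (expand( chr [0] ) << 0) |
                   (expand( chr [8] ) << 1) |
                   (expand( chr [1] ) << 2) |
                   (expand( chr [9] ) << 3);
        chr += 2;
    }
}

// Decode a cached tile back to one pixel value (0-3) per byte, 8x8,
// using the same shift-and-mask extraction as the rendering loop.
static void draw_tile( const uint32_t* cache, uint8_t out [8][8] )
{
    uint32_t mask = 0x03030303;
    for ( int pair = 0; pair < 4; pair++ )
    {
        uint32_t p = cache [pair];
        uint32_t rows [2][2] = {
            { (p >> 4) & mask, (p >> 0) & mask }, // line 0: pixels 1-4, 5-8
            { (p >> 6) & mask, (p >> 2) & mask }, // line 1: pixels A-D, E-H
        };
        for ( int line = 0; line < 2; line++ )
            for ( int x = 0; x < 8; x++ )
                out [pair * 2 + line][x] =
                    (uint8_t) (rows [line][x / 4] >> (24 - (x % 4) * 8));
    }
}
```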

by on (#3978)
I did something similar back in the stone age. I found it worked beautifully for most of the CHR-RAM games.

I put off predecoding the CHR-ROM, as I worked in 32bpp.

The crazy idea I was kicking about at the time involved caching entire screen lines, to try and take advantage of the fact that most of the time the BG lines aren't changing that often...

Never really got around to implementing it though.

by on (#3980)
Perhaps one could convert each tile to a DirectDraw surface and render them even faster... Sprite priority, sprite #0 and the sprite overflow flag would be somewhat tricky to implement though.

by on (#3984)
Well, RockNES has used CHR caching since around the year 2000, I guess. ^_^;; Like blargg wrote, CHR-ROM requires full decoding only once, while CHR-RAM is dynamic. Of course you could map out the ROM, but it would be a pain in the neck. :) Every tile written into VRAM space is decoded and stored. I don't know if this is the best way, but it saves CPU time.

by on (#4061)
ReaperSMS wrote:
The crazy idea I was kicking about at the time involved caching entire screen lines, to try and take advantage of the fact that most of the time the BG lines aren't changing that often...


Interesting. With this you'd invalidate the cache when the nametable or pattern table was changed.

I've found that drawing tiles is actually faster than a simple pixel copy, due to the higher memory bandwidth used by the latter and modern processors being "faster" than memory. With tiles you get the advantage of cached access for often-used tiles and smaller source data (if tiles are packed). Caching the background might be faster if the graphics card is used to compose the screen.

by on (#4140)
It definitely needed some help from the graphics card to run well. The data stays planar all the way to the backend, and the sprite/bg interaction is dealt with 32 pixels at a time in a tight logic op loop.

As it turns out, this just results in most of the time getting spent converting from planar to chunky. I was trying to get the transparency handling as slim as possible, and to eliminate a lot of the BG fetch logic for the trivial cases. If I went back to do it all over, I'd probably just use OpenGL to handle things.