CHR data is rendered to the screen vastly more often than it is changed, so it makes sense for a NES emulator to store it in a format best suited to rendering. In my experimental NES emulator I've optimized CHR rendering (background and sprites) by caching an optimized representation of CHR data and updating it whenever CHR data changes. The cache has two main aspects: determining when it needs updating, and the format of the cached data.
The cache stores a transformation of the CHR data, so it needs to be updated whenever CHR data changes. This is handled by keeping a flag for each tile and setting it when that tile's data is written to. A global flag is also kept to indicate whether any CHR changes have occurred since the cache was last updated. Then when CHR rendering is about to occur the cache is updated if necessary.
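The flag scheme described above can be sketched roughly as follows. This is a minimal sketch, not the emulator's actual code: the `chr_cache_t` structure, the `chr_write` and `chr_cache_update` names, and the 8 KB CHR RAM size are all assumptions made for illustration. The conversion itself is the same as in the `convert_tile` function given later in this article.

```c
#include <stdbool.h>
#include <stdint.h>

enum { TILE_BYTES = 16, TILE_COUNT = 512 };   /* assumed: 8 KB of CHR RAM */

/* Hypothetical container for CHR data, its cache, and the dirty flags. */
typedef struct {
    uint8_t  chr[TILE_COUNT * TILE_BYTES];    /* raw CHR data */
    uint32_t cache[TILE_COUNT][4];            /* four line pairs per tile */
    bool     tile_dirty[TILE_COUNT];          /* set when a tile is written */
    bool     any_dirty;                       /* set when any tile is written */
} chr_cache_t;

/* Same conversion as expand() and convert_tile() in this article. */
static uint32_t expand(uint32_t n)
{
    return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111;
}

static void convert_tile(const uint8_t* chr, uint32_t* cache)
{
    for (int n = 4; n--; chr += 2)
        *cache++ = (expand(chr[0]) << 0) | (expand(chr[8]) << 1) |
                   (expand(chr[1]) << 2) | (expand(chr[9]) << 3);
}

/* Call on every write to CHR RAM: store the byte and mark its tile. */
static void chr_write(chr_cache_t* c, unsigned addr, uint8_t data)
{
    c->chr[addr] = data;
    c->tile_dirty[addr / TILE_BYTES] = true;
    c->any_dirty = true;
}

/* Call just before rendering. Returns the number of tiles reconverted. */
static int chr_cache_update(chr_cache_t* c)
{
    int count = 0;
    if (!c->any_dirty)
        return 0;                             /* common case: nothing changed */
    for (int i = 0; i < TILE_COUNT; i++) {
        if (c->tile_dirty[i]) {
            convert_tile(&c->chr[i * TILE_BYTES], c->cache[i]);
            c->tile_dirty[i] = false;
            count++;
        }
    }
    c->any_dirty = false;
    return count;
}
```

The global flag keeps the common case cheap: when nothing has been written since the last update, the per-tile scan is skipped entirely.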
For CHR ROM the cache only needs to be generated once, since the data will never change. Bank switching in the CHR area (PPU $0000-$1FFF) doesn't matter, since the cache holds the actual CHR data in VRAM or VROM, not what the PPU sees under the current mapping.
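To illustrate why bank switching leaves the cache untouched, here is a small sketch of looking up a cached tile through the current mapping. The `chr_map_t` structure, the `cached_tile_index` function, and the 1 KB bank granularity are assumptions for the example; real mappers differ in bank size and count.

```c
#include <stdint.h>

enum { BANK_SIZE = 0x400, TILE_BYTES = 16 };  /* assumed 1 KB CHR banks */

/* Hypothetical mapping state: bank[i] is the physical CHR offset currently
   mapped at PPU address i * BANK_SIZE. */
typedef struct {
    uint32_t bank[8];
} chr_map_t;

/* Translate a pattern-table address as the PPU sees it ($0000-$1FFF) into
   an index into the cache, which is organized by physical CHR tile. */
static uint32_t cached_tile_index(const chr_map_t* m, unsigned ppu_addr)
{
    uint32_t phys = m->bank[(ppu_addr / BANK_SIZE) & 7] + (ppu_addr % BANK_SIZE);
    return phys / TILE_BYTES;
}
```

A bank switch only changes the entries of `bank`; the cached tiles themselves stay valid, so nothing needs to be reconverted.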
In my emulator I render graphics to an offscreen graphics buffer with 8 bits per pixel. I use palette entries 32-63 for the 32 NES palette entries, leaving room for the host system at the beginning of the palette. The cached CHR data is just a reordering of the original data to allow shifting and masking to quickly generate the 8-bit-per-pixel format. This keeps the cached data to a minimum size, lessening impact on the host's processor cache. There is also a separate cache with pixels horizontally flipped.
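One possible way to fill the horizontally flipped cache is to reverse the bits of every CHR byte and then run the normal conversion on the result. This is only a sketch of that idea; the `reverse8` and `flip_tile_h` helpers are hypothetical names, and the emulator may build its flipped cache differently.

```c
#include <stdint.h>

/* Reverse the bit order of one byte (bit 7 <-> bit 0, and so on). */
static uint8_t reverse8(uint8_t b)
{
    b = (uint8_t)((b >> 4) | (b << 4));                    /* swap nybbles */
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));  /* swap bit pairs */
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));  /* swap neighbors */
    return b;
}

/* Build a horizontally flipped copy of a 16-byte CHR tile. Both bit planes
   of each line are reversed, so each 2-bit pixel moves to the mirrored
   column with its value unchanged. */
static void flip_tile_h(const uint8_t src[16], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = reverse8(src[i]);
}
```

The flipped copy can then be passed through the same tile conversion as the unflipped data, so the drawing code is identical for both caches.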
Each cached tile consists of four pairs of lines, and each pair is stored in a 4-byte integer. The 2-bit pixels for the two lines are reordered in the cache to allow for quick extraction:
Code:
12345678 CHR pixels (2 bits per pixel)
ABCDEFGH
A1E5B2F6C3G7D4H8 Cache (4 bytes)
-1---2---3---4-- Masked pixels
---5---6---7---8
A---B---C---D---
--E---F---G---H-
uint32_t* pixels = ... // pixel buffer to draw into
uint32_t mask = 0x03030303; // mask to extract pixels
int attrib = 2; // attribute bits (0-3)
uint32_t offset = (8 + attrib) * 0x04040404; // palette base 32 + attrib * 4, distributed to 4 pixels
uint32_t pair = *cache++; // read pair of lines from cache
pixels [0] = ((pair >> 4) & mask) + offset; // extract pixels 1234
pixels [1] = ((pair >> 0) & mask) + offset; // extract pixels 5678
pixels += pitch; // next line (pitch is in uint32_t units)
pixels [0] = ((pair >> 6) & mask) + offset; // extract pixels ABCD
pixels [1] = ((pair >> 2) & mask) + offset; // extract pixels EFGH
pixels += pitch; // next line
On a 64-bit CPU, groups of four lines (rather than two) could be stored in each cache word, halving the number of cache words read per tile and potentially doubling throughput.
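As a rough sketch of the 64-bit idea, one possible layout gives each of the eight pixel columns one byte of the cache word, holding the 2-bit pixels of four lines. A whole line of eight 8-bit pixels then comes out with one shift, one mask, and one add. The layout and the `spread8`, `convert_quad`, and `extract_line` names are assumptions for illustration, not something the emulator implements. As in the 32-bit code, the leftmost pixel ends up in the most significant byte.

```c
#include <stdint.h>

/* Spread the 8 bits of n into bit 0 of 8 separate bytes, with the leftmost
   pixel (bit 7 of n) in the most significant byte. */
static uint64_t spread8(uint8_t n)
{
    uint64_t r = 0;
    for (int x = 0; x < 8; x++)
        r |= (uint64_t)((n >> (7 - x)) & 1) << (56 - 8 * x);
    return r;
}

/* Convert four lines of a tile into one 64-bit cache word. chr points at
   plane 0 of the first line; plane 1 of the same line is 8 bytes later.
   Line j occupies bits 2j and 2j+1 of every byte. */
static uint64_t convert_quad(const uint8_t* chr)
{
    uint64_t quad = 0;
    for (int j = 0; j < 4; j++) {
        quad |= spread8(chr[j])     << (2 * j);
        quad |= spread8(chr[j + 8]) << (2 * j + 1);
    }
    return quad;
}

/* Extract line j (0-3) as eight 8-bit pixels and add the palette offset,
   which must be replicated into all eight bytes. */
static uint64_t extract_line(uint64_t quad, int j, uint64_t offset)
{
    return ((quad >> (2 * j)) & 0x0303030303030303ULL) + offset;
}
```

A tile then needs only two cache words, obtained with `convert_quad(chr)` and `convert_quad(chr + 4)`.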
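To handle masked graphics, the mask can be efficiently calculated from the pixels by subtracting the base pixels (before offset) from 0x80808080 and shifting right by 2. In each byte of the result, the lower 5 bits are clear for transparent pixels and set for opaque pixels; the upper 3 bits don't matter because they are identical in sprite and background pixels (both are palette entries 32-63). For example, (0x80808080 - 0x02030001) >> 2 = 0x1F9F601F. The transparent pixel (value 0) yields the byte 0x60, whose lower 5 bits are clear, while the three opaque pixels yield 0x1F or 0x9F, whose lower 5 bits are set.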
Code:
uint32_t bg = *pixels; // get background pixels
uint32_t sp = (line >> 4) & cache_mask; // extract sprite pixels
uint32_t mask = (0x80808080 - sp) >> 2; // calculate mask
*pixels = ((sp + offset) & mask) | (bg & ~mask); // combine sprite and background
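The masking arithmetic can be checked in isolation with a small sketch like the following. The `sprite_mask` and `blend4` function names are hypothetical. It assumes, as described above, that both the background pixels and the offset sprite pixels are palette entries 32-63, so their upper three bits agree.

```c
#include <stdint.h>

/* Mask for four 2-bit sprite pixels held one per byte (values 0-3).
   In each byte the lower 5 bits are set if the pixel is opaque (non-zero)
   and clear if it is transparent (zero). */
static uint32_t sprite_mask(uint32_t sp)
{
    return (0x80808080u - sp) >> 2;
}

/* Combine four sprite pixels with four background pixels. Both bg and
   sp + offset are assumed to hold palette entries 32-63 in every byte. */
static uint32_t blend4(uint32_t bg, uint32_t sp, uint32_t offset)
{
    uint32_t mask = sprite_mask(sp);
    return ((sp + offset) & mask) | (bg & ~mask);
}
```

With the example value from the text, `sprite_mask(0x02030001)` gives 0x1F9F601F, and `blend4(0x21222324, 0x02030001, 0x28282828)` gives 0x2A2B2329: the three opaque sprite pixels replace the background, while the transparent one leaves background pixel 0x23 in place.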
My emulator only handles PPU changes every 8 scanlines. Relaxing this to each scanline wouldn't be too complex. Mid-line changes would probably be simpler to handle without using the cache at all.
For completeness, here is a function that converts CHR data to the cached format:
Code:
#include <stdint.h>

// Expands each of the 8 bits in n into separate nybbles of result.
// In: 12345678 Out: 0x15263748
uint32_t expand( uint32_t n )
{
    //    12345678                          n << 21
    //           12345678                   n << 14
    //                  12345678            n << 7
    //                         12345678     n
    // ---1---5---2---6---3---7---4---8     after masking with 0x11111111
    return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111;
}

void convert_tile( const uint8_t* chr, uint32_t* cache )
{
    // convert one chr tile to a cached tile
    for ( int n = 4; n--; )
    {
        *cache++ = (expand( chr [0] ) << 0) | // plane 0 of first line
                   (expand( chr [8] ) << 1) | // plane 1 of first line
                   (expand( chr [1] ) << 2) | // plane 0 of second line
                   (expand( chr [9] ) << 3);  // plane 1 of second line
        chr += 2;
    }
}
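As a sanity check of the cached format, the following sketch converts a tile and compares every pixel obtained by shifting and masking the cache against a straightforward decode of the original CHR bit planes. The `chr_pixel`, `cached_pixel`, and `roundtrip_ok` helpers are hypothetical names added for this check. Pixels are read out of the extracted word by shifting rather than through memory, since the leftmost pixel of each group of four sits in the most significant byte.

```c
#include <stdint.h>

/* Same conversion as expand() and convert_tile() above. */
static uint32_t expand(uint32_t n)
{
    return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111;
}

static void convert_tile(const uint8_t* chr, uint32_t* cache)
{
    for (int n = 4; n--; chr += 2)
        *cache++ = (expand(chr[0]) << 0) | (expand(chr[8]) << 1) |
                   (expand(chr[1]) << 2) | (expand(chr[9]) << 3);
}

/* Reference decode: 2-bit pixel at column x, line y of a raw CHR tile. */
static unsigned chr_pixel(const uint8_t* chr, int x, int y)
{
    unsigned lo = (chr[y]     >> (7 - x)) & 1;   /* plane 0 */
    unsigned hi = (chr[y + 8] >> (7 - x)) & 1;   /* plane 1 */
    return lo | (hi << 1);
}

/* Same pixel read from the cache with the shifts used when drawing:
   4 and 0 for the first line of a pair, 6 and 2 for the second. */
static unsigned cached_pixel(const uint32_t* cache, int x, int y)
{
    uint32_t pair  = cache[y / 2];
    int      shift = ((x < 4) ? 4 : 0) + ((y & 1) ? 2 : 0);
    uint32_t four  = (pair >> shift) & 0x03030303;
    return (four >> (24 - 8 * (x & 3))) & 0xFF;
}

/* Returns 1 if all 64 pixels agree between the two decodes. */
static int roundtrip_ok(const uint8_t* chr)
{
    uint32_t cache[4];
    convert_tile(chr, cache);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            if (chr_pixel(chr, x, y) != cached_pixel(cache, x, y))
                return 0;
    return 1;
}
```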