Strategies when developing an emulator

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#40506)
I've been passively working on a NES emulator. I find it a very useful way to expand on my knowledge of NES programming. I've read documents here and on the wiki, but I am essentially still in the design phase. (I have the CPU done, and am ready to start the PPU.)

I have some questions for the people on here who have built and rebuilt their own emulators on some of the strategies they used, and on what they wish they had done differently.

How do you organize the management of the memory accessible by the CPU? What I had first done was to simply allocate 0x10000 bytes and allow the CPU to directly access it through array accesses. As I move toward the PPU, I see that this is not a good approach, since asking for something like the joypad I/O-mapped memory needs to be handled completely differently.

Is the smarter approach to keep all the CPU memory in a MemoryManager class, and have it intercept all the reads and writes? (I think yes.) Then subclass or augment this when mappers are introduced.

Should PPU memory, palette, etc. be managed by a MemoryManager class, or is it fine just to control this via a PPU class?

I have a ton more questions, but they can wait until I've gotten further in my design and implementation.

Al

by on (#40510)
For memory, I think my first approach was to break memory into equal-sized blocks, each with a potentially different pair of read/write handler functions (implemented as an array of function pointers). At some point I determined that a simple if/then cascade was faster, because some blocks of memory were accessed much more often than others (RAM is used much more than I/O). It also allows the most often-accessed areas to be handled inline, with the rest handled in a separate function. However, this if/else approach was less modular, since this single handler had to know about all hardware. So, I finally adopted a hybrid strategy, with an if/else cascade handling memory and the core hardware (PPU, APU), and an array of function pointers for external hardware like mappers and other sound chips.

The above covers reads/writes that could have side-effects. For reads of instruction bytes, I bypass the above and just use an array of pointers to banks of memory, with a bank size equal to the smallest unit a mapper can switch. I don't think any NES program would ever try to execute from an I/O address.

Code:
// Only writes are shown; reads would be handled in a similar way.
const int shift = 12;
typedef void (*write_func)( int addr, int data );
write_func write_funcs [0x10000 >> shift];

void write_outline( int addr, int data )
{
    // write_ppu and write_apu could themselves be inline
    if ( addr >= 0x2000 && addr <= 0x3FFF )
        write_ppu( addr, data );
    else if ( addr >= 0x4000 && addr <= 0x4017 )
        write_apu( addr, data );
    else
        write_funcs [addr >> shift]( addr, data );
}

inline void write( int addr, int data )
{
    if ( addr < 0x800 )
        low_ram [addr] = data;
    else
        write_outline( addr, data );
}
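
The instruction-fetch fast path described above (an array of pointers to switchable banks) might look like the following sketch; the 4 KB bank size and the names `code_banks`, `fetch_opcode`, and `map_prg` are illustrative assumptions, not from the original post.

```cpp
#include <cstdint>

const int bank_shift = 12;                    // 4 KB banks: assumed smallest switchable unit
const int bank_size  = 1 << bank_shift;
uint8_t* code_banks [0x10000 >> bank_shift];  // each entry points at a bank of PRG ROM/RAM

// Instruction fetch bypasses the read handlers entirely:
// no side-effect checks, just two array lookups.
inline uint8_t fetch_opcode( uint16_t pc )
{
    return code_banks [pc >> bank_shift] [pc & (bank_size - 1)];
}

// A mapper bank switch is then just a pointer update.
void map_prg( uint16_t cpu_addr, uint8_t* bank_data )
{
    code_banks [cpu_addr >> bank_shift] = bank_data;
}
```

When a mapper register write switches banks, only the affected `code_banks` entries change; the fetch path itself never slows down.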

by on (#40518)
I found it better to have all memory in its own namespace / memory manager, even when it's internal memory for various processors (e.g. palette RAM).

It does tend to violate OO public / private access levels, but it's very helpful for things like a debugger to have a consistent, unified architecture for all memory.

It also makes it easier when you have ambiguous ownership of shared memory (e.g. both the CPU and APU can access this external RAM chip -- which class allocates the memory for it?)

I'm the absolute last person to talk to about efficiency, but I ultimately went with a table of function pointers. One for reading, one for writing.

Code:
memory::bus.write(addr, data) { table[addr >> n]->write(addr, data); } // n = table granularity
memory::oam.write(addr, data) { oam[addr] = data; }


You will suffer a significant speed penalty, but it allows for some nice things later on. E.g. you can easily chain memory accesses for special hardware or debugging purposes.

Code:
void (*writefn)(unsigned, uint8_t) = table[addr >> n];  //save the original handler
table[addr >> n] = &hookfn;                             //install the hook in its place
void hookfn(unsigned addr, uint8_t data) { ... writefn(addr, data); }


A major benefit to blargg's approach is not having a fixed granularity size to your table. For instance:

Code:
void write(uint16_t addr, uint8_t data) {
  //order these based on access frequency, most common first
  if((addr & 0xe000) == 0x2000) write_ppu_2k_gran(addr, data);
  else if(uint16_t(addr - 0x4000) < 0x18) write_apu_24byte_gran(addr, data);
  else if(addr & 0x8000) (*mapper_write)(addr, data);
}


Now you only have to change a single function pointer to re-map the cart address write handler.
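
Re-mapping then reduces to a single assignment; a minimal sketch, with the mapper handler names (`mapper0_write`, `mmc1_write`) invented for illustration:

```cpp
#include <cstdint>

// Hypothetical mapper write handlers (stand-ins; real ones would
// latch bank registers, control mirroring, etc.).
int last_mapper = 0;
void mapper0_write(uint16_t addr, uint8_t data) { last_mapper = 0; }
void mmc1_write  (uint16_t addr, uint8_t data) { last_mapper = 1; }

// One pointer covers all of $8000-$FFFF.
void (*mapper_write)(uint16_t, uint8_t) = &mapper0_write;

void cpu_write(uint16_t addr, uint8_t data)
{
    // ...PPU/APU ranges handled as above...
    if (addr & 0x8000) (*mapper_write)(addr, data);
}
```

Loading a cart with a different mapper is then just `mapper_write = &mmc1_write;` -- no table entries to rewrite, and no fixed granularity imposed on the cart's address decoding.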

Quote:
I don't think any NES program would ever try to execute from an I/O address.


No commercial one, but I've done that a lot for testing. Fun stuff.

by on (#40657)
And according to a conversation I had with Burger Bill some years back, DOOM for the Super NES executed code from register space.

by on (#40663)
Well, for the NES, the only readable registers are $2002, $2004, $2007, $4015, $4016 and $4017.
Since the first 3 are not consecutive, one could only have an instruction in $4015-$4017, and I guess it would be possible to rely on that.
Reading $4016 or $4017 would typically execute the RTI instruction or EOR [$xx,X]. Reading $4015 could yield many different opcodes depending on the sound channels' state. For instance, if only Triangle is enabled, it would do the PHP instruction, which could be of some usefulness before an RTI at $4016. So yeah, a program could rely on that if it wants to be insane.

by on (#43651)
blargg wrote:
The above covers reads/writes that could have side-effects. For reads of instruction bytes, I bypass the above and just use an array of pointers to banks of memory, with a bank size the smallest of what a mapper can switch. I don't think any NES program would ever try to execute from an I/O address.


I have a suggestion for handling that. If you have an array of "regular" memory and you can arrange so that "special" memory accesses (such as IO ports) always load and store from somewhere else and don't depend on or change the value stored in the array for the "regular" memory at their address, then...

(1) Choose an instruction that is not executed very often (a halt instruction or something), and use its opcode byte as a "marker" byte
(2) Store this marker byte into all "special" memory addresses in the array for the "regular" memory (anything that could not just be executed directly)
(3) Change your CPU interpreter so that when it executes this particular opcode byte, it goes back and does another ifetch using whatever memory handler is necessary to read an IO port, or whatever (and be a little careful, because you might have counted its cycle already). If the "proper" read handling returns the same byte again, *then* you fall through to the actual handler for it. Otherwise, dispatch to the handler for the real instruction.

This lets you run like you currently do, doing ifetches straight out of the array, but if you ever hit something that's marked as "this is special memory" then it will repeat the ifetch in the proper way. There is basically no slowdown because the only opcode that gets slower under normal circumstances is the one rarely-executed instruction whose opcode byte you chose as the marker value. It works because ifetches see the marker value, and other kinds of read/write access never see it because they handle the "special" access without using that memory.

Of course, if your read/write handlers rely on also being able to use the memory address in the array, then it could be inconvenient to change them all to point somewhere else. YMMV.
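
A minimal sketch of the marker-byte dispatch described above, with the marker value and helper names assumed for illustration (any rarely-executed opcode works as the marker; $02 is one of the unofficial KIL/halt opcodes):

```cpp
#include <cstdint>

const uint8_t MARKER = 0x02;   // unofficial KIL opcode; assumption, any rare opcode works
uint8_t ram[0x10000];          // the "regular" memory array fetches normally read from

// Stub standing in for the full side-effecting read path
// (I/O ports store their data elsewhere, not in ram[]).
uint8_t read_with_side_effects(uint16_t addr) {
    if (addr == 0x4016) return 0x40;   // e.g. controller port / open-bus read
    return ram[addr];
}

// Instruction fetch: the fast path reads straight out of the array.
// Seeing the marker byte redirects to the slow, correct read path;
// if that ALSO returns the marker, the real instruction there is
// genuinely the marker opcode, and we fall through to its handler.
uint8_t fetch(uint16_t pc) {
    uint8_t op = ram[pc];
    if (op == MARKER)
        op = read_with_side_effects(pc);
    return op;
}
```

All "special" addresses get `MARKER` stored in `ram[]` up front, so ordinary fetches pay nothing and only fetches that land on special memory (or on a real marker instruction) take the slow path.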