6502 emulation optimization

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#24945)
I have a working 2A03 emulator, but would like to optimize it as much as possible. Blargg's website has some good information, but there are a couple of points that are still unclear.

*The same addressing modes are re-used numerous times. For instance, LDA ($nn),Y will use the same effective address calculation as ORA ($nn),Y. In general, on an x86 platform, is it faster to inline the effective address calculation (thus minimizing CALL/RET overhead), or is it faster to use subroutines for each form of calculation (thus minimizing code size and making better use of the L1 cache)?
*One suggestion I've heard is not to calculate the N and Z flags on every opcode that sets them (which is almost all of them), but instead to keep a variable holding the last data byte that affected N/Z, and to derive the flags only when they are needed. BEQ/BNE would simply check whether that byte was 0, BMI/BPL would check whether its bit 7 was set, and the flags would only have to be repacked into 2A03 format for PHP or interrupts. But if this method is used, how can the emulator handle opcodes such as BIT, PLP, or RTI, which set N and Z independently of a single result byte?
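
As far as I understand it, the suggested scheme would look roughly like the sketch below (C, with made-up names, just to illustrate what I mean):

Code:
#include <stdint.h>

/* Lazy N/Z state: the last value that affected N and Z (illustrative names). */
static uint8_t last_nz;

/* Almost every opcode that touches N/Z just records its result... */
static inline void set_nz(uint8_t result) { last_nz = result; }

/* ...and the flags are derived only when a branch actually needs them. */
static inline int flag_z(void) { return last_nz == 0; }          /* BEQ/BNE */
static inline int flag_n(void) { return (last_nz & 0x80) != 0; } /* BMI/BPL */

/* PHP and interrupts repack the flags into 2A03 format on demand. */
static inline uint8_t pack_p(uint8_t other_flags)
{
    return (uint8_t)(other_flags | (flag_n() ? 0x80 : 0) | (flag_z() ? 0x02 : 0));
}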

by on (#24946)
Inline is generally faster, although you might want to use a profiler to test different kinds of optimization. If you don't know how to use a profiler, find out.
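
If the core is in C rather than hand-written assembly, a static inline helper can give you both at once: a single copy of the source that the compiler will usually expand into each opcode handler, so there's no CALL/RET either. A rough sketch, with made-up names for the memory interface and registers:

Code:
#include <stdint.h>

/* Minimal made-up CPU state and memory, just so the sketch is self-contained. */
static uint8_t mem[0x10000];
static uint8_t A, Y;

static inline uint8_t read6502(uint16_t addr) { return mem[addr]; }

/* Shared ($nn),Y effective-address calculation.  Because it is static inline,
   the compiler is normally free to expand it into every opcode that uses the
   mode: no CALL/RET overhead, but still only one copy of the source. */
static inline uint16_t ea_indirect_y(uint8_t zp)
{
    /* The pointer's high byte comes from (zp + 1) wrapped within page zero,
       as on a real 6502. */
    uint16_t base = (uint16_t)(read6502(zp) | (read6502((uint8_t)(zp + 1)) << 8));
    return (uint16_t)(base + Y);   /* page-crossing penalty cycle ignored */
}

static void op_lda_indy(uint8_t zp) { A  = read6502(ea_indirect_y(zp)); } /* N/Z update omitted */
static void op_ora_indy(uint8_t zp) { A |= read6502(ea_indirect_y(zp)); } /* N/Z update omitted */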

To handle BIT, PLP, and RTI, you need a way to set N and Z arbitrarily. One way I can think of is to use a 16-bit (or larger) variable to hold the N/Z result. Assume the N flag set if bits 7 OR 15 are set in the variable (nz & 0x8080), and assume the Z flag set if bits 0-7 are zero ((nz & 0xFF) == 0). For most opcodes, simply set the N/Z value to the operation result (making sure bit 15 never gets set accidentally), and for opcodes such as BIT, store the N result in bit 15 and the inverse of Z in bit 0, while leaving the other bits clear. (This is not necessarily the best way to do it, but it's one way I can think of that it can be done.)
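
In C, that might look roughly like this (again, the names are just illustrative):

Code:
#include <stdint.h>

/* Lazy N/Z storage: N means "bit 7 or bit 15 set", Z means "low byte is zero". */
static uint16_t nz;

static inline int flag_n(void) { return (nz & 0x8080) != 0; }
static inline int flag_z(void) { return (nz & 0x00FF) == 0; }

/* Ordinary opcodes: the result is 8 bits, so bit 15 can never be set by accident. */
static inline void set_nz(uint8_t result) { nz = result; }

/* BIT, PLP and RTI can set N and Z independently: put N into bit 15 and the
   inverse of Z into bit 0, leaving every other bit clear. */
static inline void set_nz_separate(int n, int z)
{
    nz = (uint16_t)((n ? 0x8000u : 0u) | (z ? 0u : 1u));
}

/* Example: BIT takes Z from A & operand but N from bit 7 of the operand itself.
   (V-flag handling omitted.) */
static inline void op_bit(uint8_t a, uint8_t operand)
{
    set_nz_separate(operand & 0x80, (a & operand) == 0);
}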

by on (#24965)
If you're coding in assembly, your 6502 core should probably still fit within the L1 cache even after inlining everything.

by on (#25101)
The CPU portion of the 2A03 cores used in modern emulators is easily one of the fastest components of the emulation, performance-wise, and isn't likely to impact speed enough to be worth optimizing beyond the [obvious] basics.

by on (#25112)
I don't know if this method is the fastest, or even 'compiler-friendly', but I use jumps (goto). First, I wrote my core instruction by instruction, separated by addressing mode. Later, I started optimizing it by removing redundant code blocks that could instead be handled by jumping to a similar block (instruction). It's something like this...

Code:
//ADDRESSING #1 (offset)
//Parameter: offset (unsigned short)

CPUOP(ORA1)                     // entry label: ORA with this addressing mode
  value = readvalue(offset);    // fetch the operand
  _doCPUOP(ORA0);               // jump to the shared ORA body, which ends itself

CPUOP(ASL1)                     // entry label: ASL with this addressing mode
  value = readvalue(offset);    // read...
  ASL(value);                   // ...modify...
  writevalue(offset, value);    // ...write back
OPEND                           // goto op_end (the common exit)


- It starts inside a case statement on the addressing mode. Once the operand has been fetched (immediate, byte, word...), it jumps into the proper block to execute the instruction. CPUOP() is a jump label, and OPEND is a goto op_end. If you look closely, you'll notice this could also work as one giant case statement, but I never tried that.
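
Spelled out, the macros are roughly like below. CPUOP, _doCPUOP, OPEND, readvalue and the ORA labels are the ones from the snippet above; the memory array, cpu_execute and the dispatch switch are simplified stand-ins just so the sketch compiles.

Code:
#include <stdint.h>

/* Illustrative macro definitions only; the real ones differ in detail. */
#define CPUOP(name)     op_##name:          /* entry label for a handler       */
#define _doCPUOP(name)  goto op_##name      /* reuse another handler's body    */
#define OPEND           goto op_end         /* jump to the common exit         */

static uint8_t mem[0x10000];                /* made-up memory and accumulator  */
static uint8_t A;
static uint8_t readvalue(uint16_t addr) { return mem[addr]; }

void cpu_execute(uint8_t opcode, uint16_t offset)
{
    uint8_t value;

    /* Dispatch is simplified here: in the real core, the addressing-mode
       case statement fetches the operand and then jumps to the right label. */
    switch (opcode) {
    case 0x0D: goto op_ORA1;                /* ORA absolute, as one example    */
    default:   goto op_end;
    }

CPUOP(ORA1)                                 /* ORA, "addressing #1"            */
    value = readvalue(offset);
    _doCPUOP(ORA0);                         /* jump into the shared ORA body   */

CPUOP(ORA0)                                 /* shared ORA body                 */
    A |= value;
    /* update N/Z from A here... */
    OPEND;

op_end:
    ;                                       /* common exit: count cycles, etc. */
}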

by on (#25143)
randilyn wrote:
The CPU portion of the 2A03 cores used in modern emulators is easily one of the fastest components of the emulation, performance-wise, and isn't likely to impact speed enough to be worth optimizing beyond the [obvious] basics.

Unless your target platform contains hardware that accelerates the PPU and half the PSG. Loopy would be familiar with these.