Theoretical maximum speed for cycle-accurate NES emulation

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Theoretical maximum speed for cycle-accurate NES emulation
by on (#11276)
The most accurate NES emulators execute one opcode at a time and also update the PPU/APU/MMC states at each clock cycle. Nintendulator falls into this category. (What, if any, other emulators do so?) I can't determine what Nintendulator's full speed is on my system, since I can't find any option to turn off throttling. From what I've heard, though, it runs into speed problems with any system much below 1 GHz. Nintendulator appears to be written in a mixture of C and inline assembly. I don't know how much, if any, speed optimization has been done.

If an accurate NES emulator were written in pure, optimized assembly, how fast do you think it would run? Would it be possible to obtain 60fps on a 400 MHz or so Celeron, or is there just too much computational work to be done? I've been trying to get a NES emulator off the ground (I currently have a working cycle accurate CPU core in C), but I'm still a bit undecided about which language to use. Many emulators are written in C++, which makes the coding easier in some aspects; how much of a speed hit does this incur? Would assembly mean a major speedup, or would it be fairly subtle? What area of Nintendulator is the biggest bottleneck?
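
For anyone unfamiliar with the per-cycle approach described above, here is a rough sketch in C of what "update everything at each clock cycle" can look like. This is not Nintendulator's code; every name in it (Nes, clock_cycle, cpu_step, run_until) is made up for illustration, the PPU and APU are reduced to counters, and it assumes NTSC timing, where the PPU runs three dots per CPU cycle.

```c
#include <stdint.h>

/* Hypothetical machine state. Real chips carry far more state; plain
   counters keep the sketch small and runnable. */
typedef struct {
    uint64_t cpu_cycles;
    uint64_t ppu_dots;
    uint64_t apu_ticks;
} Nes;

/* Stand-ins for the real per-clock PPU and APU work. */
static void ppu_tick(Nes *nes) { nes->ppu_dots++; }
static void apu_tick(Nes *nes) { nes->apu_ticks++; }

/* Advance the whole machine by one CPU clock.
   Assumes NTSC timing: three PPU dots per CPU cycle. */
static void clock_cycle(Nes *nes)
{
    nes->cpu_cycles++;
    ppu_tick(nes);
    ppu_tick(nes);
    ppu_tick(nes);
    apu_tick(nes);
}

/* Stand-in for executing one opcode. A real core would call
   clock_cycle() once per bus access; here the caller supplies the
   instruction's cycle count directly. */
static void cpu_step(Nes *nes, int opcode_cycles)
{
    for (int i = 0; i < opcode_cycles; i++)
        clock_cycle(nes);
}

/* Run whole opcodes until at least `target` CPU cycles have elapsed.
   Every opcode is pretended to take 2 cycles (an assumption made
   purely to keep the example short). */
static void run_until(Nes *nes, uint64_t target)
{
    while (nes->cpu_cycles < target)
        cpu_step(nes, 2);
}
```

The cost being asked about comes from clock_cycle(): it runs about 1.79 million times per emulated second, and each call does three PPU steps plus an APU step, so any overhead inside those steps is multiplied accordingly.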
Re: Theoretical maximum speed for cycle-accurate NES emulati
by on (#11277)
Josh wrote:
I can't determine what Nintendulator's full speed is on my system, since I can't find any option to turn off throttling.


Turn off sound playback (Ctrl+S) and it'll run as fast as it can. You might also want to turn off auto-frameskip (and set frameskip to zero) if you want to measure its speed.

Josh wrote:
I don't know how much, if any, speed optimization has been done. What area of Nintendulator is the biggest bottleneck?


Reasonable optimization in the CPU core, and some very small bits in the PPU.
Overall, the PPU is probably the biggest bottleneck, with the APU a close second.

by on (#11279)
Ah. You're using the sound buffer callback for timing. That makes sense, it's about the easiest way to do it on a Win32 platform. Well, I'm seeing from 90-95fps with throttling disabled. This is on a 1.8 GHz P4 @ 2.4 GHz.

by on (#11280)
If you want to do the same with RockNES, set the sound switch to 0 and the blitter to default (256x240 NES screen size) in the config file. My last measurement was around 135 FPS on my Celeron D 2.66 GHz.

by on (#11281)
Josh wrote:
Ah. You're using the sound buffer callback for timing. That makes sense, it's about the easiest way to do it on a Win32 platform.


Actually, I'm not using the callback - I'm repeatedly polling the buffer via IDirectSoundBuffer_GetCurrentPosition (with a sleep thrown in the loop for good measure).

by on (#11285)
Josh,

I think C is fast enough for the NES; that's what I am coding mine in. To get faster you might want to use inline asm. I will do that in my emulator, but last, with the C source still as an option to maintain platform independence.

If you are working on an emulator then you need to learn how to profile and analyze your code. That is what I am doing now. (no asm)

x86
CPU ~ 100 MHz
PPU ~ 400 MHz
SDL drawing ~ 200 MHz

It's less on PPC, but I think the x86 has cache problems, whereas the PPC has a larger cache.

All I have working so far is the CPU and a partial PPU.

matt

by on (#11292)
Currently, WedNESday's stats (pure C++) are: CPU 25 MHz, PPU 450 MHz. On my 2.2 GHz P4 I get about 180 FPS.

100 MHz does seem rather a lot for your CPU core. Why is it so slow?

by on (#11293)
Not sure... still profiling.

by on (#11299)
Could you post your source code? Or show us the basics of your CPU emulator? My CPU emulator needs about 25 MHz of host CPU to run a 1.8 MHz 6502 at 60 FPS.
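
To put those figures in perspective, here is a small C helper that converts a host clock budget into host cycles spent per emulated CPU cycle. It is only a back-of-the-envelope sketch: the function and macro names are made up, it reads the "MHz" figures in this thread as the host clock rate needed for full speed, and it uses the NTSC CPU clock of about 1.789773 MHz.

```c
/* NTSC NES CPU clock in Hz. */
#define NES_NTSC_CPU_HZ 1789773.0

/* Host clock cycles spent per emulated CPU clock cycle, given the host
   clock rate (in Hz) that the emulated CPU needs to run at full speed. */
double host_cycles_per_emulated_cycle(double host_budget_hz, double emulated_hz)
{
    return host_budget_hz / emulated_hz;
}
```

With the numbers quoted in this thread, a 25 MHz budget works out to roughly 14 host cycles per emulated cycle, while a 100 MHz budget is roughly 56.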

by on (#11310)
The source will be posted, but I'm not ready to do that yet... let me work on it some more. I don't even have a good name for it yet.

matt

by on (#11405)
I think my CPU is fast now; I still have to profile that.

The PPU was slow until I made one simple change: I was accessing the palette with the same function as the PPU memory reads. I switched to direct reads and gained 25% CPU on a P3 800. Wow! I got the idea from Valgrind's cache-miss output and the fact that it's a function call that gets called over 60,000 times per PPU frame.

matt
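
The change described above might look something like this in C. This is a guess at the shape of it, not matt's actual code: the array and function names are hypothetical, VRAM is a flat array with no mapper or nametable mirroring, and the special mirroring of palette entries $3F10/$3F14/$3F18/$3F1C is left out.

```c
#include <stdint.h>

/* Hypothetical flat PPU memory below the palette ($0000-$3EFF). */
static uint8_t ppu_vram[0x3F00];
/* 32 bytes of palette RAM, mapped at $3F00-$3F1F and mirrored to $3FFF. */
static uint8_t palette_ram[32];

/* General read path: decodes the full 14-bit PPU address on every call. */
uint8_t ppu_read(uint16_t addr)
{
    addr &= 0x3FFF;
    if (addr >= 0x3F00)
        return palette_ram[addr & 0x1F];
    return ppu_vram[addr];
}

/* Fast path for the pixel renderer: it already knows it wants a palette
   entry, so it indexes the array directly and skips the address decode. */
static inline uint8_t palette_direct(unsigned index)
{
    return palette_ram[index & 0x1F];
}
```

In the per-pixel loop the renderer would call palette_direct(i) instead of ppu_read(0x3F00 + i). Both return the same byte; the second just avoids the branch and the general-purpose call for a lookup that happens tens of thousands of times per frame.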