emulator performance?

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#69794)
(In the interest of not turning this into a speed vs. accuracy debate, let's just assume that any cycle accurate emulator that can handle traditionally tough to emulate games without hacks is equal).

I've been working on optimizing my emulator's code recently and have bumped up the performance of the cycle accurate mode to >500fps (Intel i7 920 w/ Nvidia 9800 GTX). Just wondering how this compares to other emulators out there. In other words, have I been successful or do I have a lot of room for improvement?

James

by on (#69797)
It depends on how you define success. What requirements do you need to meet to succeed? Put another way, what does this performance allow that the previous less-optimized one didn't? Faster fast-forward? More simultaneous emulators at once?

by on (#69799)
Quote:
What requirements do you need to meet to succeed?...More simultaneous emulators at once?

My emulator drops to a scanline-accurate emulation mode in the selection menu because cycle-accurate is too slow. Ultimately, I'd like to run the cycle-accurate mode throughout. Still not there today (at 60 fps, at least), but there is some point where software efficiency and hardware speed will allow it. I can help one of those along.

Beyond that, there isn't a particular target in mind. I like the optimization process and am just curious as to how my work compares to others.

by on (#69802)
Sounds extremely good to me. My emulator is nowhere near that.

On my machine, I typically get 40 fps, and my emulator does not yet support sound, has major PPU issues (like SMB title screen) and only about 6 mappers implemented.

For me, I've learned a ton and have enjoyed a lot of the time spent on the emulator, so even if I can never get Battletoads working and playing at full rate, I'm still happy.

Al

by on (#69804)
I think Nesticle and Famtasia are the fastest Windows-based emulators right now, mainly because nobody ever ported LoopyNES to Windows and brought its accuracy up a few notches.

What do you consider to be "cycle-accurate"? Does that mean that it would simulate explicit reads and writes for each cycle within the instruction, and possibly execute something triggered for each access? Does that mean merely getting page-crossing timing correct?

What do you consider to be "hacks"? Detecting a game and tweaking the timing slightly? Idle loop skipping?

Idle loop skipping is some really good stuff, especially when you don't need to emulate the PPU.

by on (#69805)
Dwedit wrote:
What do you consider to be "cycle-accurate"?

Yeah, I guess that's a little vague. PPU cycle accurate. Enough for mid-scanline effects to work properly.

Dwedit wrote:
What do you consider to be "hacks"?

For example (from this thread: http://nesdev.com/bbs/viewtopic.php?t=6736), detecting Battletoads and forcing sprite 0 hits at a specific time to work around timing issues.

by on (#69812)
albailey wrote:
For me, I've learned a ton and have enjoyed a lot of the time spent on the emulator, so even if I can never get Battletoads working and playing at full rate, I'm still happy.

That's the attitude that's kept me going all these years. It was a long time before I could get Battletoads working, but all I learned along the way was the real reward. Keep it up!

by on (#69825)
James wrote:
albailey wrote:
For me, I've learned a ton and have enjoyed a lot of the time spent on the emulator, so even if I can never get Battletoads working and playing at full rate, I'm still happy.

That's the attitude that's kept me going all these years. It was a long time before I could get Battletoads working, but all I learned along the way was the real reward. Keep it up!


I couldn't agree with this more. My emulator is getting more and more accurate as the days go by--141 of 163 test ROMs passing! At least for me it runs sufficiently fast but I am having problems with others who use Win7 64-bit having sub-par performance.

The quest for accuracy and performance is most of the fun!
Re: emulator performance?
by on (#70021)
James wrote:
I've been working on optimizing my emulator's code recently and have bumped up the performance of the cycle accurate mode to >500fps (Intel i7 920 w/ Nvidia 9800 GTX). Just wondering how this compares to other emulators out there. In other words, have I been successful or do I have a lot of room for improvement?
James

My emulator is not exactly cycle-accurate (though it can handle most mid-frame PPU effects) and it runs at > 1000 FPS on an Intel i5-760 processor, for what it's worth. (This is without actually copying the PPU/APU output to the screen/sound card; i.e. just calling my "calc frame" function inside a timed loop.)

What areas of your code have you been optimizing? Find any good tricks? I've been working on speeding up my emulation core over the past month and have made about a 20% improvement. I still have some more areas I want to look into, but when I'm done I was planning on posting a list of things that happened to boost performance for my particular emulator implementation. For example, I profiled a lot of games and found that LDA (zero page) was by far the most frequent instruction (accounting for about 16% of all instructions) and added a special case for that particular opcode which sped things up. Not exactly ground-breaking stuff, but it was helpful to me so maybe it will be helpful for someone else. :)
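
A minimal sketch of the kind of special case described above, assuming a simple fetch-and-dispatch CPU core. The `CPU` class, its field names, and the `dispatch` fallback are hypothetical and are not taken from the poster's emulator; only the opcode value ($A5, LDA zero page, 2 bytes, 3 cycles) is standard 6502.

```python
LDA_ZP = 0xA5  # 6502 opcode for LDA zero page (2 bytes, 3 cycles)


class CPU:
    """Hypothetical CPU core with a fast path for the most frequent opcode."""

    def __init__(self, memory):
        self.mem = memory
        self.pc = 0
        self.a = 0
        self.zero = False
        self.negative = False
        self.cycles = 0

    def step(self):
        opcode = self.mem[self.pc]
        if opcode == LDA_ZP:
            # Fast path: handle LDA zero page inline, before the general dispatch.
            self.a = self.mem[self.mem[self.pc + 1]]
            self.zero = self.a == 0
            self.negative = self.a >= 0x80
            self.pc += 2
            self.cycles += 3
            return
        self.dispatch(opcode)

    def dispatch(self, opcode):
        # Placeholder for the general opcode table (not part of this sketch).
        raise NotImplementedError(f"opcode {opcode:#04x} not handled in this sketch")
```

Whether a check like this helps depends on how the general dispatch is implemented, so it is worth measuring with a profiler before and after.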


Quote:
At least for me it runs sufficiently fast but I am having problems with others who use Win7 64-bit having sub-par performance.

I just bought a new computer with Windows 7 64-bit and was disappointed to see that my emulator ran significantly worse than on a lesser machine running XP. Very frustrating. I think it is because I only have GDI and DirectDraw-based renderers, and neither appears to be hardware accelerated in Windows 7. Hopefully a Direct2D renderer will perform better.

by on (#70045)
Quote:
My emulator is not exactly cycle-accurate

What method are you using? Looks like it might be scanline-based and, if so, I'm interested in hearing about how you handle mid-frame effects. My scanline-based renderer is a lot faster than the cycle-accurate one, but it can't handle, for example, Marble Madness.

Quote:
What areas of your code have you been optimizing? Find any good tricks?

Nothing especially fancy. I've been doing stuff like using look up tables where it makes sense (pattern bit interleaving, attribute table stuff, etc.), and, in general, just running under a profiler and focusing on hot spots. The biggest improvements have come from rethinking stuff that's specific to my implementation.
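
As an illustration of the pattern bit interleaving lookup table mentioned above, here is a minimal sketch. It assumes the usual NES tile layout (two bit planes per row, with the high plane supplying bit 1 of each pixel); the names `SPREAD`, `interleave_row`, and `pixels` are hypothetical and do not come from the poster's code.

```python
# Hypothetical 256-entry table: SPREAD[b] moves bit i of b to bit 2*i.
SPREAD = [sum(((b >> i) & 1) << (2 * i) for i in range(8)) for b in range(256)]


def interleave_row(lo, hi):
    """Combine one row of the two pattern bit planes into eight 2-bit pixels."""
    return SPREAD[lo] | (SPREAD[hi] << 1)


def pixels(row16):
    """Unpack a 16-bit interleaved row, leftmost pixel first."""
    return [(row16 >> (2 * (7 - x))) & 3 for x in range(8)]
```

Two table lookups and an OR replace eight per-pixel shift-and-mask steps, which is the sort of saving a table like this is meant to provide.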

Quote:
DirectDraw-based renderers, and neither appears to be hardware accelerated in Windows 7.

This was why I switched from DirectDraw to Direct3D -- not just for performance reasons, but also because blits on Vista+ are no longer bilinearly filtered (yeah, I could roll my own, but...). With Direct3D, I'm simply rendering a texture-mapped quad and it's quite fast. I haven't tried Direct2D.

by on (#70057)
James wrote:
What method are you using? Looks like it might be scanline-based and, if so, I'm interested in hearing about how you handle mid-frame effects. My scanline-based renderer is a lot faster than the cycle-accurate one, but it can't handle, for example, Marble Madness.

My approach is almost tile-based; I try to do the cycle-accurate "catch-up" design but I only sync between CPU instructions; I do not sync between all of the individual stages of a single instruction. I also do some cheating in the PPU emulation to try to make the code run a little faster. It's good enough to run games like Marble Madness and Rad Racer but it's definitely a step below the most accurate emulators out there now. A re-design is probably about 6 years overdue. :D
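
A rough sketch of a catch-up loop that syncs only between CPU instructions, as described above. It assumes the NTSC ratio of 3 PPU cycles per CPU cycle; `CatchUpPPU` and `run_instructions` are hypothetical names, and the PPU here only counts cycles instead of rendering anything.

```python
PPU_PER_CPU = 3  # NTSC: three PPU cycles per CPU cycle


class CatchUpPPU:
    """Hypothetical PPU that is only advanced when asked to catch up."""

    def __init__(self):
        self.dot = 0  # total PPU cycles emulated so far

    def catch_up(self, cpu_cycles):
        """Run the PPU up to the CPU's current time; return how many cycles ran."""
        target = cpu_cycles * PPU_PER_CPU
        ran = target - self.dot
        self.dot = target
        return ran


def run_instructions(instruction_cycles, ppu):
    """Execute instructions (given as cycle counts), syncing after each one."""
    cpu_cycles = 0
    batches = []
    for cycles in instruction_cycles:
        cpu_cycles += cycles  # the whole instruction runs without interruption
        batches.append(ppu.catch_up(cpu_cycles))
    return batches
```

The PPU runs in batches of several cycles at a time, which is where the speed comes from; the cost is that anything happening in the middle of an instruction is only seen at the next sync point.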


Quote:
This was why I switched from DirectDraw to Direct3D -- not just for performance reasons, but also because blits on Vista+ are no longer bilinearly filtered (yeah, I could roll my own, but...). With Direct3D, I'm simply rendering a texture-mapped quad and it's quite fast. I haven't tried Direct2D.

It's encouraging to hear that you are getting good performance with Direct3D. As I understand it, Direct2D is just a wrapper on top of Direct3D, so it should perform similarly well.

by on (#70064)
Hmm... it would be easy enough to convert my scanline engine into a tile-based one. Might give that a try for the boost in compatibility.

Quote:
I try to do the cycle-accurate "catch-up" design but I only sync between CPU instructions; I do not sync between all of the individual stages of a single instruction.

FWIW, I'm using PPU cycles as my timebase and am calling my CPU code every 3 ticks (NTSC only). It was easy to implement and, while I could probably get the biggest boost in performance by converting this to a catch-up design, it's not as slow as I thought it would be (heck, I think it's actually pretty fast).
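
For comparison, a minimal sketch of a PPU-cycle timebase that steps the CPU on every third tick (NTSC only). The `Counter` class is a hypothetical stand-in for real CPU and PPU cores and only counts how often it is ticked; `run_ppu_cycles` is likewise an illustrative name.

```python
class Counter:
    """Stand-in for a CPU or PPU core; it only counts its ticks."""

    def __init__(self):
        self.ticks = 0

    def tick(self):
        self.ticks += 1


def run_ppu_cycles(ppu, cpu, n, phase=0):
    """Advance n PPU cycles, ticking the CPU once on every third one.

    Returns the divider phase so the next call can continue where this one ended.
    """
    for _ in range(n):
        ppu.tick()
        phase += 1
        if phase == 3:
            cpu.tick()
            phase = 0
    return phase
```

Carrying the phase between calls matters because a 341-cycle scanline is not a multiple of 3, so the CPU tick lands on a different dot on each line.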

Quote:
It's encouraging to hear that you are getting good performance with Direct3D. As I understand it, Direct2D is just a wrapper on top of Direct3D, so it should perform similarly well.

Yeah, I'm sure it will work well. My benchmarks are done with rendering enabled and I'm getting >1700 fps with the scanline engine. It's definitely not a bottleneck!

by on (#70081)
James wrote:
FWIW, I'm using PPU cycles as my timebase and am calling my CPU code every 3 ticks (NTSC only).


- Odd. I thought you should run 1 CPU cycle, then call the PPU to run 3 dots (pixels). You do the reverse... :) Interesting, anyway.

- My emu gets around 120 FPS on my 2 GHz Core 2 Duo. On a Pentium 4, it doesn't run at full speed if I use the blitter to double the image size & stretch it.

by on (#70162)
Zepper wrote:
James wrote:
FWIW, I'm using PPU cycles as my timebase and am calling my CPU code every 3 ticks (NTSC only).


- Odd. I thought you should run 1 CPU cycle, then call the PPU to run 3 dots (pixels). You do the reverse... :) Interesting, anyway.

- My emu gets around 120 FPS on my 2 GHz Core 2 Duo. On a Pentium 4, it doesn't run at full speed if I use the blitter to double the image size & stretch it.


I also do it by PPU cycles, running one CPU and APU cycle every third PPU cycle...seems the most logical way. :shock:

by on (#70163)
NESICIDE wrote:
I also do it by PPU cycles, running one CPU and APU cycle every third PPU cycle...seems the most logical way. :shock:


- You mean after the third PPU cycle...?

- Why "most logical way"? Indeed, I use PPU cycles to control the emulation timing. The only cycle counter used here is for PPU: from 0 to 341, plus the scanline counter, obviously.

I smell an offtopic discussion

by on (#70165)
I switched PocketNES over from a scanline number system to a "Total PPU cycles since prerender line" system. Whenever I actually need the scanline number, I just multiply the current timestamp by a fixed-point fraction 1/341, and subtract 1.
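
A small sketch of that fixed-point conversion, assuming a 32-bit fractional part and a reciprocal rounded up. The post does not say which precision PocketNES uses, so `SHIFT`, `RECIP`, and `scanline_from_timestamp` are illustrative choices only.

```python
DOTS_PER_LINE = 341
SHIFT = 32  # assumed fixed-point precision; not stated in the post
RECIP = (1 << SHIFT) // DOTS_PER_LINE + 1  # about 2**32 / 341, rounded up


def scanline_from_timestamp(ppu_cycles):
    """Scanline number for a timestamp counted from the start of the prerender line.

    The prerender line comes out as -1, the first visible line as 0.
    """
    return ((ppu_cycles * RECIP) >> SHIFT) - 1
```

With these constants the result matches an exact integer division by 341 for every timestamp within one 262-line NTSC frame. A narrower fractional part may not stay exact over the whole frame, so the chosen width should be checked against the largest timestamp in use.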

I managed to get a speed boost by eliminating all the times it switched out of the CPU core, incremented the scanline number, set the next timeout, then resumed the CPU core. Note that that is not likely to be a bottleneck for other emulators, but PocketNES has free PPU rasterization from the GBA's graphics hardware, so anything to speed up the CPU speeds up the emulator dramatically.

The biggest speedup came from idle loop detection: identifying which branches or jumps were endless loops, and skipping ahead until the next event happens (events like an interrupt, sprite 0 hit, etc.). The code that skips ahead divides by the number of CPU cycles one iteration of the loop takes, so no timing is lost.
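
One way the skip-ahead arithmetic could look, as a sketch only. It assumes the time left until the next event and the cost of one loop iteration are both known in CPU cycles; `skip_idle_loop` and its parameters are hypothetical and are not PocketNES code.

```python
def skip_idle_loop(now, next_event, loop_cycles):
    """Skip whole iterations of an idle loop without passing the next event.

    Returns (new_time, iterations_skipped). Only complete iterations are
    skipped, so the CPU resumes at the same point in the loop it left.
    """
    remaining = next_event - now
    if remaining <= 0:
        return now, 0
    iterations = remaining // loop_cycles
    return now + iterations * loop_cycles, iterations
```

For example, with 900 cycles until the next event and a 7-cycle loop, 128 iterations (896 cycles) are skipped and the last 4 cycles are emulated normally.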

by on (#70166)
Zepper wrote:
- You mean after the third PPU cycle...?


Aren't the two statements equivalent?

After the third:

p p pc p p pc

Every third:

p p pc p p pc

Maybe I'm not reading you right?

Zepper wrote:
- Why "most logical way"? Indeed, I use PPU cycles to control the emulation timing. The only cycle counter used here is for PPU: from 0 to 341, plus the scanline counter, obviously.


Most logical because the PPU is the fastest running thing in the system, aside from the system clock which doesn't drive much worth actually emulating [I may be wrong here but I believe it only drives dividers for other clocks].

My PPU cycle counter goes from 0 to 89341 or 89342 for NTSC.

by on (#70172)
NESICIDE wrote:
Zepper wrote:
- You mean after the third PPU cycle...?

Aren't the two statements equivalent?
[snip diagram]
Maybe I'm not reading you right?

Some people from Brazil might not be familiar with some English idioms. "Every $ordinal $event" (e.g. "every third pixel") refers to what we'd call the effect of a clock divider, and "every other $event" means "every second $event".

Quote:
Most logical because the PPU is the fastest running thing in the system, aside from the system clock which doesn't drive much worth actually emulating [I may be wrong here but I believe it only drives dividers for other clocks].

The master clock drives the PPU's color generation. Four clocks make a pixel, and six clocks make a color subcarrier cycle. This is useful to know when implementing NTSC filtering, though one can always generate RGB video (like a PlayChoice or FC Titler) and use the SNES NTSC filter instead.
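
To make the 4:6 relationship concrete, here is a tiny sketch that computes where in the subcarrier cycle each pixel starts, measured in master clocks. It assumes pixel 0 starts at phase 0, which is an arbitrary choice for illustration; `subcarrier_phase` is a hypothetical helper.

```python
MASTER_PER_PIXEL = 4       # four master clocks make a pixel
MASTER_PER_SUBCARRIER = 6  # six master clocks make a color subcarrier cycle


def subcarrier_phase(pixel_index):
    """Master-clock offset (0..5) into the subcarrier cycle where a pixel starts."""
    return (pixel_index * MASTER_PER_PIXEL) % MASTER_PER_SUBCARRIER
```

The phase repeats every three pixels (0, 4, 2, then 0 again), since three pixels span exactly two subcarrier cycles.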

by on (#70176)
Dwedit wrote:
I managed to get a speed boost by eliminating all the times it switched out of the CPU core, incremented the scanline number, set the next timeout, then resumed the CPU core. Note that that is not likely to be a bottleneck for other emulators, but PocketNES has free PPU rasterization from the GBA's graphics hardware, so anything to speed up the CPU speeds up the emulator dramatically.


How do you render the background and sprites so that code running on the 6502 can detect sprite 0 hits, if the CPU emulation doesn't switch back and forth with the PPU emulation?