Sorry for the long post.
Disch wrote:
Even if you optimize it down to the scanline that Sprite 0 is on... checks will still be causing the emu to split those 8 scanlines every 2 instructions.
OK, so a simple version's worst case would be 16 scanlines out of 240 rendered lines, about 7% of the total. That leaves 93% of the scanlines potentially rendered in an optimal way. What if you predict exactly when the sprite 0 hit will occur? Then you don't have to switch to fine rendering mode unless that prediction gets invalidated by PPU writes, which is uncommon.
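The prediction idea can be sketched roughly like this. This is a hypothetical illustration, not real emulator code: the structure, field names, and the assumption that sprite 0's first pixel is opaque are all simplifications.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: cache a prediction of where the sprite 0 hit will
   occur, and only fall back to fine (per-dot) rendering when a PPU write
   invalidates it. All names here are illustrative. */
typedef struct {
    int sprite0_y, sprite0_x;   /* from OAM */
    bool prediction_valid;      /* cleared on any mid-frame PPU write */
    int hit_scanline, hit_dot;  /* cached prediction */
} Ppu;

/* Recompute the predicted hit point. A real version would scan for the
   first overlapping opaque sprite and background pixel; here we assume
   the sprite's top-left pixel hits. */
static void predict_sprite0_hit(Ppu *p) {
    p->hit_scanline = p->sprite0_y + 1;  /* sprites render one line late */
    p->hit_dot      = p->sprite0_x + 1;
    p->prediction_valid = true;
}

static void on_ppu_write(Ppu *p) {
    /* A mid-frame write might move sprite 0 or change tiles: repredict. */
    p->prediction_valid = false;
}

/* Called once per frame: returns the scanline where the fine-grained
   rendering loop must take over; everything before it can use the
   optimized whole-scanline renderer. */
static int render_frame(Ppu *p) {
    if (!p->prediction_valid)
        predict_sprite0_hit(p);
    return p->hit_scanline;
}
```

In the common case (no mid-frame writes), the prediction holds for the whole frame and only the scanlines around the predicted hit ever pay for cycle-level emulation.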
Quote:
[Pirate Mapper 90] has an IRQ counter which can be driven by PPU reads and CPU writes... efficiently tracking all of that inside an emulator can make things extraordinarily complicated extremely fast.
This is a problem; even simple hardware driven by odd sets of signals can be very difficult to optimize. The general divide-and-conquer optimization strategy will probably help as usual: optimize for the common case and the few exceptions will just be slower.
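Even an IRQ counter clocked by PPU reads can sometimes be handled this way: instead of clocking it on every access, predict when it will trip and schedule a single future event, repredicting on the uncommon reconfiguration. A toy sketch (all names and the steady-read-rate assumption are mine, not how any real mapper emulation works):

```c
#include <assert.h>

/* Hypothetical sketch: rather than decrement an exotic IRQ counter on
   every qualifying PPU read, predict the time it will reach zero and
   schedule the IRQ as one future event. Any write that reconfigures the
   counter just pays the cost of a repredict. */
typedef struct {
    int counter;        /* counts down on each qualifying PPU read */
    long next_irq_time; /* absolute cycle of the predicted IRQ */
} IrqCounter;

/* Common case: rendering is enabled and the rate of qualifying PPU reads
   per scanline is steady, so the trip time is a simple computation. */
static void schedule_irq(IrqCounter *c, long now, int reads_per_scanline,
                         long cycles_per_scanline) {
    long scanlines = (c->counter + reads_per_scanline - 1) / reads_per_scanline;
    c->next_irq_time = now + scanlines * cycles_per_scanline;
}
```

The uncommon cases (mid-frame writes, rendering toggled off) fall back to repredicting or to per-access clocking, so correctness is preserved and only the exceptions are slow.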
Quote:
Getting it all "accurate" will lead to a very complicated program. And there's no way such a program will ever be in the same performance ballpark as something that simplified and optimized it all as much as NESticle did.
The question is, are these quirks actually invoked in common NES programs? It would be an interesting study to seek out consequential quirks invoked by real NES programs, as a way to prove that such an emulator is impossible, rather than merely speculating as we are.
byuu wrote:
I won't emulate the same thing two ways and switch between the more accurate and faster version depending on whether or not the game needs it at a certain point. That gets sloppy real quick, and you waste too much time maintaining rather than discovering new hardware quirks and such.
What's being proposed here by tepples and others is not this; the proposal is to use optimized code where it has no side-effects. For example, if a game doesn't touch the PPU registers for the entire frame, you can use optimized tile and sprite rendering, without any effect on accuracy.
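The "no side-effects" check can be as simple as a dirty flag. A minimal sketch, assuming a hypothetical CPU core that notifies the PPU of any $2000-$2007 access; the names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of "optimize where there are no side-effects":
   track whether any PPU register was touched this frame, and take the
   whole-frame fast renderer only when state is guaranteed static. */
typedef enum { RENDER_FAST_FRAME, RENDER_PER_SCANLINE } RenderPath;

static bool ppu_dirty;  /* set by the CPU core on any $2000-$2007 access */

static void cpu_wrote_ppu_register(void) { ppu_dirty = true; }

/* Called at the end of each frame to pick next frame's strategy. Output
   is identical either way; the fast path is valid only because nothing
   could have changed mid-frame. */
static RenderPath choose_render_path(void) {
    RenderPath path = ppu_dirty ? RENDER_PER_SCANLINE : RENDER_FAST_FRAME;
    ppu_dirty = false;  /* re-arm for the next frame */
    return path;
}
```

Since the fast path is only taken when it is provably equivalent, accuracy is unaffected; games that hammer the registers simply never leave the fine-grained path.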
There seems to be significant negativity towards designs that optimize an emulator's performance. In the past people had to focus on efficiency, and they often did so in ways that unnecessarily sacrificed accuracy. I find the activity enjoyable, though it has nothing to do with emulation specifically; it's the more general practice of software engineering and examining possible tradeoffs. In my emulator I've had fun keeping it efficient while still passing some of my most rigorous test ROMs (and not just "passing" them in a hacky way).
Quote:
Processors continue to get more and more psychotic. Nowadays, you execute an xchg opcode and it's sixteen times slower than three mov opcodes on Pentium IVs. That kind of thing didn't happen on the 386es ZSNES was designed for. Optimizing for a generic x86 target is a journey into madness. It's best to just follow a few simple rules (don't use obscure opcodes, try and use the full register sizes whenever possible, etc) and go with that.
Correction: the x86 architecture is psychotic. If you've used other architectures, you know how refreshing they are in their regularity and efficiency (the same way the 6502 and 65816 are). I take it using a compiler for x86 these days is generally a win?
Quote:
I'll still continue to go for simplicity, and my emulator has always been more of a self-documenting reference platform than a true user-friendly emu. And I anticipate it always remaining easier to implement new findings into my design than any emu aiming for speed. I just added on a bunch of UI stuff since I had about 10,000 people using it anyway. Heck, I use it myself since it runs at 2-3x speed on my PC, so why not?
There's nothing wrong with an emulator design that favors ease-of-maintenance over ultimate efficiency. In these discussions there seems to be a notion that only one design is right, and the others are wrong and should be avoided. All designs involve tradeoffs, and each one emphasizes some things over others: the programming skill needed to implement it, target platforms, clarity, language of implementation, efficiency, etc. There's no need to trivialize other designs as a way to justify your decisions; the very existence of a tradeoff means that no single design can meet all the demands equally, and that each design has its merits and is worth being implemented by someone.
Quote:
I really think this will become less and less of a concern in the future. Once even Pocket PCs run at 3ghz, who will care if an emu eats up 1% or 10% when the backlight eats 20x the battery life either way?
What about special features that require a fast emulator, like arbitrary seeking in a movie, real-time reverse playback, or showing a wall of emulators all running at full speed? (all of which are quite cool features to see)
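To make the cost concrete: reverse playback is typically built from periodic savestates plus re-emulation forward, which multiplies the per-frame work. A toy sketch of the arithmetic (the interval and names are illustrative assumptions, not any particular emulator's design):

```c
#include <assert.h>

/* Hypothetical sketch: with one savestate every SNAPSHOT_INTERVAL frames,
   displaying frame F in reverse means restoring the nearest earlier
   snapshot and re-emulating forward the remaining frames. */
#define SNAPSHOT_INTERVAL 60  /* one savestate per second at 60 fps */

/* Worst case: SNAPSHOT_INTERVAL - 1 extra frames of emulation to show a
   single displayed frame, so the core must run far faster than realtime. */
static int frames_to_reemulate(int target_frame) {
    return target_frame % SNAPSHOT_INTERVAL;
}
```

An emulator with only 1.1x realtime headroom can't do this smoothly, no matter how fast the host CPU gets in absolute terms.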