What determines whether an emulator is "fast" or "slow"?

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
What determines whether an emulator is "fast" or "slow"?
by on (#118082)
I've looked through the sources of emulators, and some of them say stuff like:

- PPU uses scanline-based rendering, very fast

or

- Cycle-precision CPU, very slow

I get the terminology of scanline-based rendering and cycle precision and whatnot, but how is one able to determine whether their emulator is fast or slow? Is there a certain time frame in which a CPU needs to execute, let's say, 100K cycles, which decides whether it's fast or slow? For instance, and these are completely random numbers: if a CPU takes less than 50 ms to execute said amount of cycles it's fast, 50-70 ms it's OK, and above 70 ms it's slow? Or is this just based on a programmer's intuition?
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118086)
Video game console emulation is a soft real-time application. A "fast enough" emulator can reliably finish 60 frames of CPU and PPU time in one second on the target platform. An emulator supporting fast-forward is "fast enough" when it can hit 240 fps, possibly by skipping some PPU output stages for undisplayed frames. You gave an example of your CPU executing 100K emulated cycles. On the NES, that's about 55 ms worth of cycles. Your emulator needs to execute those cycles within 55 ms, plus the PPU and APU activity that goes along with them.

If your emulator is designed for a PC running Windows or GNU/Linux, it needs to run at 60 fps on the slowest machine that meets the current Windows version's system requirements. Since Windows 7 was released in the fourth quarter of 2009, this has meant a 1 GHz Pentium 4 or Atom. (This reflects Windows 7's use on netbooks and Windows 8's use on x86 tablets.) FCEUX does; few other popular emulators do. An emulator that emulates multiple machines at once, such as nemulator's Wii-inspired menu, has to be even faster, but multi-core machines alleviate this somewhat because emulating multiple machines is an embarrassingly parallel workload.

If your emulator is designed for a device running Android OS, it needs to run at 60 fps on that device. Faster is better because whatever CPU cycles your emulator doesn't use now can be used for making urgent telephone calls later.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118087)
Wow, that's almost embarrassing. My CPU is much less efficient than that. That 100K thing was just a random number, but mine is nowhere near it, and I'm not even halfway done with my PPU (let alone the APU). Then again, it's written in Java with a ton of object overhead and it's so focused on readability that I can fully understand its slowness. Thanks for the response, your explanation sounds pretty logical.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118091)
Maybe a simpler way to put it: a fast emulator is efficient in how it calculates and processes everything it needs to do, avoiding wasted CPU time. It also matters just how much it needs to process. For example, your emulator would be much faster if it emulated games less accurately: it's faster to just draw the nametables at their scrolled position all at once at the end of the frame, and this will still work for simple games.
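The frame-at-once idea can be sketched as follows. All names here (`nametable`, `render_frame`, the pre-rendered bitmap layout) are illustrative assumptions, not from any particular emulator; note that this approach loses mid-frame effects like scroll splits:

```c
#include <stdint.h>

/* Illustrative frame-at-once background rendering: instead of tracking the
   PPU per dot or per scanline, blit the whole (pre-rendered) background
   bitmap at its final scroll position once per frame. */
enum { W = 256, H = 240, NT_W = 512, NT_H = 480 };

static uint8_t nametable[NT_W * NT_H]; /* pre-rendered 4-screen background */
static uint8_t frame[W * H];           /* one output frame */

void render_frame(int scroll_x, int scroll_y) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            /* wrap around the virtual 512x480 background */
            frame[y * W + x] =
                nametable[((y + scroll_y) % NT_H) * NT_W + (x + scroll_x) % NT_W];
}
```

Any game that changes scroll or palettes mid-frame will render incorrectly under this scheme, which is exactly the speed-vs-accuracy trade the post describes.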

I think a lot of what causes extra processing load in a NES emulator is CPU and PPU synchronization. Of course a modern PC has a huge amount of cache, but on older PCs the CPU had much smaller caches, so code size would be more of a concern, I believe. The point is, let's say you emulate one CPU cycle, or one CPU instruction, and then catch up the PPU. You'll be constantly switching between your CPU and PPU cores, which causes a lot of overhead. To avoid losing accuracy, some people use catch-up type methods. With a catch-up method you save on overhead since you only switch between the CPU and PPU cores when you really have to. The main reason you'll have to, I believe, is $2002 register reads for sprite zero or other flag detection. Otherwise you can log PPU register writes/states for rendering later, I believe. But this adds complexity to your emulator.
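The catch-up method described above can be sketched like this. The hooks (`ppu_step`, `read_2002`, the cycle counters) are hypothetical names for illustration, not any real emulator's API:

```c
#include <stdint.h>

/* Hypothetical core state: how far (in CPU cycles) each core has run. */
static uint64_t cpu_cycles; /* cycles the CPU core has executed so far  */
static uint64_t ppu_cycles; /* CPU-cycle timestamp the PPU has reached  */

/* Advance the PPU by one CPU cycle's worth of work (placeholder body). */
static void ppu_step(void) { ppu_cycles++; }

/* Catch-up: instead of ticking the PPU after every CPU cycle, let the CPU
   run ahead and replay the PPU only when its state is actually observed
   (e.g. a $2002 read) or at the end of the frame. */
static void ppu_catch_up(void) {
    while (ppu_cycles < cpu_cycles)
        ppu_step();
}

uint8_t read_2002(void) {
    ppu_catch_up();   /* PPU state must be exact before we peek at it */
    return 0x80;      /* placeholder for the real status-flag logic   */
}
```

The win is that the tight `ppu_catch_up` loop stays in one core's code (good for the instruction cache) instead of ping-ponging between CPU and PPU logic every emulated cycle.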

You have to look at what platform you are going to be on and then decide what's important. You may not need to optimize performance much or at all. As long as you don't make any errors, a modern PC CPU is going to be sufficient for a high degree of accuracy and 60 fps. Emulating newer systems, of course, will probably need optimizations.

If it's your first emulator, just focus on maintainable code, and getting things working.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118094)
MottZilla wrote:
You have to look at what platform you are going to be on and then decide what's important. You may not need to optimize performance much or at all. As long as you don't make any errors, a modern PC CPU is going to be sufficient for a high degree of accuracy and 60 fps. Emulating newer systems, of course, will probably need optimizations.

If it's your first emulator, just focus on maintainable code, and getting things working.

Pretty much exactly this.

I wrote mine a year ago with only PCs in mind. It had horribly naive bankswitching and occasionally silly switches that could be computed in a single line, but that didn't really matter because I could run it on any desktop as intended. A few weeks ago somebody asked me about porting to Android, so I started going hog wild on optimizing for ARM. It's far easier to optimize for that kind of performance once your emulator is stable and mature and you're not fighting to get the basic functionality in place. Rewriting all of my memory mappers was way easier knowing that as long as I was swapping banks correctly the image would render, and it wasn't a case of the PPU potentially being broken.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118104)
At one extreme, you have emulators written in JavaScript in a web browser that chug along slowly on a dual-core machine.
At the other extreme, you have PocketNES, written almost entirely in ARM assembly language, which runs at real time on a 16MHz GBA (using hardware accelerated graphics).
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118126)
Ah, I see. My emulator is written in Java anyway, so processor-specific optimizations really aren't that viable. To add to that, this is the first thing I've done in my life that is this low-level. I have about two years of experience with Java (my first and currently only programming language), and I literally did not know what a CPU did until about 5 months ago (what registers were, what an instruction was, or what its general purpose was anyway). Even getting the CPU to work in the first place was a complete and utter disaster, so optimizing it to the point where it can run at X times 60 fps is going to be a living hell.

Thanks for the answers and suggestions. While optimization is going to be put off for a while, the answers did give me some things to think about for the future of the emulator.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118162)
Writing your first CPU emulator is one hell of a learning experience for sure. My first was the 8086. That was.... taxing.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118173)
miker00lz wrote:
Writing your first CPU emulator is one hell of a learning experience for sure. My first was the 8086. That was.... taxing.


Haha I have to give it to you, you've got some guts to do that as your first emulator. I think I would've thrown away my laptop and lived in the woods after becoming permanently paranoid if I had to do that with the knowledge I had prior to making the 6502. Even now, I think the 8086 would be a major, several-month challenge to me that would have a pretty big chance of failure to be honest. I do, however, plan to emulate one at some point. The NES is just a start in my journey (hopefully) :D
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118185)
ArsonIzer wrote:
miker00lz wrote:
Writing your first CPU emulator is one hell of a learning experience for sure. My first was the 8086. That was.... taxing.


Haha I have to give it to you, you've got some guts to do that as your first emulator. I think I would've thrown away my laptop and lived in the woods after becoming permanently paranoid if I had to do that with the knowledge I had prior to making the 6502. Even now, I think the 8086 would be a major, several-month challenge to me that would have a pretty big chance of failure to be honest. I do, however, plan to emulate one at some point. The NES is just a start in my journey (hopefully) :D


Well, if you can do the 6502 you can definitely do the 8086. The most confusing thing on the 8086 is understanding the addressing mode byte (aka mod/reg/rm byte)... once you get that, it's really not much harder. There are more opcodes and more addressing modes, so it will take more time. There are also "group" opcodes where the first instruction byte indicates which group, then it has a modregrm byte where one of its fields indicates the exact operation. It can get pretty weird. Oh, and there are segments to worry about too, but that's not so bad. Yeah, it DID take a few months before I was able to boot DOS.

If you get around to doing it, let me know if you want a little help. Once you understand the couple confusing aspects, it does become about as easy as the 6502. :)
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118187)
Are 8086 and 65816 of similar complexity?
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118192)
I don't know 65816 assembly, but reading its specs on Wikipedia I would hazard a guess that it's roughly the same. If anything, likely a complexity edge to the 8086 because in some cases the encoding can be pretty strange-looking. There are also repetition and segment-override prefixes that must be parsed and honored. I'm not sure if the 65816 has anything like that.

This is my main CPU module for it: https://sourceforge.net/p/fake86/code/c ... ke86/cpu.c
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118194)
The one thing that bit me in the ass with x86 all the time, compared to 65xxx CPUs, was how the x86 handles status flags when doing mov or equivalent operations. Unlike the 65xxx many flags do not get set when moving a value or contents into a register, requiring you to use test or other crap. Can't tell you how often that caused bugs in my code when doing x86 nearly 20 years ago.

65xxx really spoils you in a lot of regards. Honestly the 65816 is a great piece of engineering (IMO), I just wish it had opcodes for multiplication/division (even just integer would be fine); sure, SNES/SFC has memory-mapped registers for that, but we're talking CPUs here.

P.S. -- NSFW, but: fuck x86. Awful processor if you ask me. Been around too long and way too many hacks and extensions (there are I believe two 32-bit feature flags that define what each CPU is capable of, including MMX, SSE and its bazillion derivatives, blah blah). Just so damn nasty. Kudos to all the guys who can do x86 on today's x86 CPUs; I'm glad I stopped with the 486.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118196)
Yeah, it's not the cleanest arch out there. That's for damn sure. I have no problem with the flags not being modified from a mov though. In fact, I like it and it makes more sense IMO. It's intended to move data, not make calculations. I might be annoyed in a lot of cases if I wanted my flags to stay as they are, but move some data before looking at them. It lets you be a lot more flexible with your code order and not have to worry about storing/restoring flags, and the cost of a test op here and there is negligible. It's only necessary when the value you want the flags to be set from is sitting in memory pre-calculated. I think usually decisions based on flags are going to be from real-time calculations.

Also, I may be wrong since I've never used 65816, but I would bet the 8086 spanks it when it comes to block moves and string operations using rep prefixes.

You're right though, modern x86 is NASTY!!!! And old x86 was still pretty quirky right from the beginning, but that doesn't always mean bad.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118198)
miker00lz wrote:
Also, I may be wrong since I've never used 65816, but I would bet the 8086 spanks it when it comes to block moves and string operations using rep prefixes.
Block moves are pretty fast, but decompressing RLE is faster. :lol: (Scroll down to the "results" section.)
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118204)
Joe wrote:
miker00lz wrote:
Also, I may be wrong since I've never used 65816, but I would bet the 8086 spanks it when it comes to block moves and string operations using rep prefixes.
Block moves are pretty fast, but decompressing RLE is faster. :lol: (Scroll down to the "results" section.)


That number looks bad, but Trixter is measuring an RLE bitmap RE-compressed with LZ4 and PK, and then decompressing that! :) It would be awfully slow on any CPU, really.

Decompressing a true RLE sequence directly would be lightning on an 8086. For each byte run, put the byte value in AL and the run length in CX, then issue a REP STOSB operation. Very, very fast. If it's an even-length sequence you could even have the value in both AH and AL, shift CX right by one and issue a REP STOSW instead. It then can write two bytes in the same number of clock cycles. Come to think of it, you can do it with odd-length sequences too as long as you check the carry flag after the right shift and add a single byte extra into the output when it's set, after the REP STOSW.

Consider this code. Assume the byte value is in AL, and the run length is in CX:
Code:
mov ah, al
shr cx, 1
rep stosw
jnc done
stosb
done:


The 8086 can be very efficient for a lot of workloads.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118205)
But when you factor in the 8086's HORRID cycle-per-cycle performance compared to the 65xxx, is it really faster at all? Not really.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118206)
I think if they were running at the same clock speeds, it would depend on the workload. In a lot of cases, no, probably not. In others (say, RLE decompression) I think it can, especially when 16-bit memory operations are in heavy use. It would actually be pretty interesting to see numbers. Can anybody write a good, optimized equivalent of the code I gave for the 65816? Then we can check some timing tables and do the math. :)

Other workloads would be fun to check out too, no doubt there are cases where the 65816 would win... it wouldn't fare too well against the 8086 for division or multiplication, heh. Again though, I've never coded 65816 so these are my guesses. I'm just going by what I've read on it. I could be proven wrong.

(Remember, I'm talking 8086 here.. not 8088! The 8088 just sucks.)

The 8086 may take more cycles to read opcodes and fetch memory data, but it has some more complex instructions that, in some situations, can take care of things all at once where the 65xxx needs multiple instructions. My guess is that on average, things may work out around the same. I didn't go into this trying to say x86 is a clock-per-clock monster; I was just arguing that writing its assembly code isn't too bad.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118207)
Is rep stosw even used today? Most things I see use MMX extensions or SSE or higher, since they tend to operate on 64 (or is it 128?)-bit values and thus can push data around faster. Hell if I know, once the extensions started coming in is where I jumped ship (and am glad I did so).

Regarding the equivalent on the 65816: you're thinking of mvn / mvp (for "big chunks of data") or pei / pea (stack-based) for copying data around. No, these are not particularly fast, but as with everything in computing, it's conditional -- particularly how much data you're moving, where from and where to, and so on. In many cases mvn / mvp are slower than writing the loop yourself, but there are cases where they're faster. Have I ever used these opcodes in real-world stuff? pei / pea yes, mvn / mvp no (manual loops in the code I was writing were faster the easy majority of the time). This was back when I was part of a demo and utility group.

Moving tons of data around is not particularly what any member of the 65xxx family is designed for. During the era it was designed in, copying craploads of data around was not commonplace; in many cases a program designed with that method in mind was considered sloppy or badly designed. Yes, every situation is different, yadda yadda, so please (I'm looking at you, Tepples) don't get nitpicky over what I say. Any old 65xxx programmer knows what I say is true/fair in this regard. Usually this was rectified by using a DMA chip of sorts (hint: PCs had these too, you know).

And for what it's worth, native x86 (meaning without extensions) is not particularly impressive either in this regard. Yes, 386 and above have instructions that can iterate or repeat something on a series of bytes, often useful for things like Pascal-style strings (where you know the length in advance) or C-style strings (if you know the length ahead of time), but they're "generic" operations -- meaning they're not intended for strings specifically, for example.

For things like string manipulation, go look at some of IBM's mainframe processors like the IBM 360/370 -- the bloody thing has an opcode called CUUTF which converts a string in memory from Unicode (this would mean UTF-16, I believe) to UTF-8, amongst other things.

If you want to split hairs and compare x86 cycle counts to 65816 cycle counts, please do -- I used to do this all the time back in the very early 90s while developing stuff for both my IIGS as well as my PC -- and in a lot of cases the 65816 cycle-count-wise blew the x86 out of the water. But some iterative operations the x86 performed better. After I went pure PC I stopped caring, but x86 opcodes like imul / idiv were a dream come true to a 65xxx person.

I also can't really stand x86 assembly because of its variable-length mnemonics (meaning they're not all 3 characters long); code just gets ugly and hard to read. Didn't like 68K for the same reason. I know that's a trivial argument, but for me it was always a pain in the ass to read x86 because of this. Don't tell me "so just indent/add spaces so it looks good" -- maybe for code you're writing, but for disassemblers or any kind of real-time work, nope, nobody cares. Just dump a bunch of text at you, enjoy parsing it with your eyeballs. And don't even get me started about Intel vs. AT&T syntax (FYI: I hate AT&T syntax, utter backwards nonsense). We had none of this syntactical-difference nonsense on the 65xxx (nor the 68K, from what I understand).

But -- none of this really matters now because the 65xxx series at this point is no longer a mainstream processor. So really we're just bitching/crying over shit that doesn't matter at the end of the day. *makes jacking-off motions* :-) I have no experience with ARM so I don't know how it compares, but it's at least a contender against x86, so that might make a more fair present-day comparison.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118208)
Yeah, but we (or at least I) were talking about 65816 and 8086, not modern x86 with a 65816. It's not that big a deal to me either. I'm not going to go cry myself to sleep if we don't count clock ticks. 3gengames just brought up cycle timing and I thought it would be interesting. I'm not nerd-raging or anything here. BTW, when you say "386 and above have instructions that can iterate or repeat something on a series of bytes" ... if you were talking about REP STOSW, etc. that was actually present in the original 8086.

Maybe none of this DOES matter at the end of the day either, but hey this is a NES emudev forum. We talk about nothing but ancient technology so it fits right in. :)

EDIT: Wow, this thread got off-topic... mostly my fault.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118210)
I didn't spend much time with the original 8086 or 8088. The first x86 CPU I began programming for was the 386, where I spent 90% of my time pissing and moaning about the segmented memory model. It wasn't until I was introduced to protected mode and DOS extenders that I really started "enjoying" x86, due to the linear memory model and 32-bit registers with int 10h hooks (dos32 extender FTW!).

Proof of this fact is you correcting me -- specifically that the REP(etition) prefix on STOxx ops was available dating back to the original 8086. See, I just assumed it was introduced with the 386, because every programmer I worked with at the time was like "ugh no you don't want to use 286 or 8086 and especially not 8088", so I never bothered.

I can do the cycle counting for your loop (65816 vs. 8086) tomorrow, or maybe one of the non-US folks on the forum can do so in the interim.

The code you pasted is kind of half-ass anyway. No offence, I'll explain:

Code:
mov ah, al
shr cx, 1
rep stosw
jnc done
stosb
done:


  • I don't see what the point of the mov is here, other than to initialise AH from whatever AL is. Why not just mov ax,1234h or something literal?
  • stosw stores the 16-bit contents of AX at [ES:DI], but there's no setting of ES nor DI anywhere. It matters if we're comparing cycles, since setting ES and DI should be included in the total number of cycles. If you feel this doesn't matter, no problem -- then really all we need to compare is the cycle count of rep stosw followed by a stosb. Which leads me to...
  • rep stosw repeats the stosw operation CX number of times, incrementing DI by 2 every iteration (or decrementing, if the direction flag is set). But shr cx, 1 shifts-right the CX register by 1, i.e. divides whatever CX is by 2 (65xxx equivalent is lsr). We don't know what CX is initially, so the number of times the rep stosw runs is undefined. It matters because, as I said earlier, sometimes mvn/mvp can be faster than manual loops.
  • I can't find docs quickly that say anything about what the state of the carry is with stoXX, so I don't know what the jnc does; is it to help handle even vs. odd counts of data moves? That's all I can think of
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118212)
koitsu wrote:
Code:
mov ah, al
shr cx, 1
rep stosw
jnc done
stosb
done:


  • I don't see what the point of the mov is here, other than to initialise AH from whatever AL is. Why not just mov ax,1234h or something literal?
  • stosw stores the 16-bit contents of AX at [ES:DI], but there's no setting of ES nor DI anywhere. It matters if we're comparing cycles, since setting ES and DI should be included in the total number of cycles. If you feel this doesn't matter, no problem -- then really all we need to compare is the cycle count of rep stosw followed by a stosb. Which leads me to...
  • rep stosw repeats the stosw operation CX number of times, incrementing DI by 2 every iteration (or decrementing, if the direction flag is set). But shr cx, 1 shifts-right the CX register by 1, i.e. divides whatever CX is by 2 (65xxx equivalent is lsr). We don't know what CX is initially, so the number of times the rep stosw runs is undefined. It matters because, as I said earlier, sometimes mvn/mvp can be faster than manual loops.
  • I can't find docs quickly that say anything about what the state of the carry is with stoXX, so I don't know what the jnc does; is it to help handle even vs. odd counts of data moves? That's all I can think of


It's half-ass because it's not a whole program. Of course CX and ES:DI need to be set. I was only trying to show Joe the meat of it, but that was before I suggested cycle counting, so I didn't write an entire program. BTW, I really don't care if we count cycles. It was just a thought. I wasn't clear though, I know. I was imagining an arbitrary RLE-compressed data stream to just look like this:

Code:
XY XY XY XY ............ <end of file>


Where in each XY pair, X = byte value of run, Y = length of run. Like I said in that post, the byte value goes (X) into AL and then the run length (Y) goes into CX. I would have copied AL into AH just because it wouldn't have made sense to duplicate a byte in the stream itself for no reason if you're attempting to compress something. Then shift CX right by one since we're writing the output run two bytes at a time. If the run we decompressed was even in length, carry is now clear after the shift and we're done... otherwise if it was set, the run length was odd so we need to append that remaining byte to the end with the single STOSB.

I did explain in the original post what AL and CX are going into it, but not what I was imagining as an input stream.

Also, the STOxx ops don't affect flags. STOSW is equivalent in function to: MOV ES:[DI], AX ... but then after that, yes, if the direction flag is set it decrements DI by 2... or increments it by 2 if the flag is clear (which of course it should be for decompressing this; it can be forced clear with a CLD at the start of the program).



JNC is Jump when Not Carry.
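For readers who don't speak 8086, the same XY-pair decoder can be rendered in C. This is a sketch of the scheme described above (the name `rle_decode` and the buffer handling are illustrative, not from the post):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode an RLE stream of XY pairs, where X = byte value and Y = run
   length, mirroring the 8086 routine: X goes in AL (and AH), Y in CX,
   then REP STOSW / STOSB writes the run.  Returns output length.
   Caller must ensure `out` is large enough for the decoded data. */
size_t rle_decode(const uint8_t *in, size_t in_len, uint8_t *out) {
    size_t o = 0;
    for (size_t i = 0; i + 1 < in_len; i += 2) {
        uint8_t value = in[i];     /* the run's byte value (AL/AH)   */
        uint8_t run   = in[i + 1]; /* the run's length (CX)          */
        for (uint8_t n = 0; n < run; n++)
            out[o++] = value;      /* REP STOSW/STOSB territory      */
    }
    return o;
}
```

The word-at-a-time trick in the assembly (shift the count right, store words, patch the odd byte) is purely a speed optimization; byte-at-a-time as above produces identical output.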
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118213)
Found a better opcode list (thank you people who actually document how opcodes affect CPU flags, sheesh). So:

* The mov ah,al now makes sense because you're essentially just wanting to fill memory with a value (16 bits at a time). Got it.

* The jnc done is indeed to deal with even/odd number of bytes, and in the case of an odd number, to store the last byte (since stosw copies 16 bits at a time). The carry being set or clear comes from the preceding shr cx, 1. So okay, got it. Easy enough to do in 65816, but again the length (number in CX) actually matters because of the whole mvn/mvp vs. manual loop situation.

For shits and giggles, let's just say we're trying to fill 513 bytes of data (an odd number) with the value $ffff, but only a byte is specified ($ff). Cycle counts per opcode are in brackets ([]).

Code:
  sep #$30             ; [3] Set A/X/Y to 8-bit size
  lda #$ff             ; [2] $ff = upper and lower half of the 16-bit value we want to fill
  tay                  ; [2] Copy A ($ff) into Y
  xba                  ; [3] Swap upper and lower bytes of A (yes you can do this when A size is 8-bit!)
  tya                  ; [2] Copy Y ($ff) into lower byte of A; 16-bit A now contains $ffff
  rep #$30             ; [3] Set A/X/Y to 16-bit size
  tay                  ; [2] Y=$ffff
  lda #513             ; [3] 513 bytes to transfer (odd number)
  lsr                  ; [2] Divide by 2, A will now contain 256 ($100) with carry set to 1
  bcs OddNum           ; [!] If carry set, then lsr had a leftover (i.e. odd number)
  asl                  ; [2] Back to the (even) byte count, used as a byte offset
  tax                  ; [2] X = byte offset of the end of the fill
  tya                  ; [2] A=$ffff
- sta >ESDILocation-2,x ; [6] Write $ffff at ESDILocation-2+X (full 24-bit address) -- LOOP
  dex                  ; [2] ...step back one word (two bytes)... -- LOOP
  dex                  ; [2] ...repeat -- LOOP
  bne -                ; [!] ....until X=0 -- LOOP
  sep #$30             ; [3] Set A/X/Y to 8-bit
  bra Done             ; [3] GTFO
OddNum:
  asl                  ; [2] A=512, the even part of the byte count
  tax                  ; [2] X = byte offset of the end of the even part
  tya                  ; [2] A=$ffff
- sta >ESDILocation-2,x ; [6] Write $ffff at ESDILocation-2+X (full 24-bit address) -- LOOP
  dex                  ; [2] ...step back one word (two bytes)... -- LOOP
  dex                  ; [2] ...repeat -- LOOP
  bne -                ; [!] ....until X=0 -- LOOP
  sep #$30             ; [3] Set A/X/Y to 8-bit -- LOOP
  sta >ESDILocation+512 ; [5] Write that odd byte ($ff) just past the even part -- LOOP
Done:


The cycle count labelled [!] is 2 cycles if the branch IS NOT taken, otherwise 3 cycles. In most cases in the above code, it will be taken.

Is this optimised? No. More on that in a moment.

Is it ugly and long? Yes, because most of the work stems from the "setup" -- first the need to take an 8-bit accumulator value and copy the value into the high byte of the accumulator, and then dealing with the number of bytes and so on. If I had made this a "general subroutine" where you'd simply push the byte you wanted filled onto the stack, followed by another push of the length (in bytes), the routine would be a lot shorter and maybe use fewer cycles. Not sure.

The reason the code looks doubled (more or less) is because I did not care to elegantly handle the situation where the number of bytes being transferred was odd vs. even (i.e. storing something in a temporary variable to indicate such, etc.). My focus was not on code length but a bit more on cycles. It's totally possible to make this routine shorter and handle the odd vs. even thing in a more sane manner.

ANYWAY... ;-)

The most important part here is the cycle counts within the loops. I've labelled them with -- LOOP, including the odd-byte tail where the sep #$30 is actually needed (to write an 8-bit value vs. 16-bit) along with the actual write itself.

So let's do the math of the loops, given the number of iterations within the loops that we know:

513 bytes = $100 (256) loop iterations + 8 extra cycles (sep + sta at the end)

Loop itself: 6 cycles for the sta + 2 + 2 cycles for the two dex + 3 cycles for the bne (except the last iteration, where the bne is 2 cycles)

So: 255 * (6 + 2 + 2 + 3) = 3315 cycles
3315 cycles + 12 cycles for the final iteration (6 + 2 + 2, plus the non-branch-taken bne at 2) = 3327 cycles
3327 cycles + the 8 extra cycles (sep + sta at the end) = 3335 cycles

To repeat: is it possible to optimise this routine? Absolutely. Getting rid of the sta >ESDILocation-2,x (24-bit addressing, STA Absolute Long Indexed X) and turning it into sta ESDILocation-2,x (16-bit addressing, STA Absolute Indexed X) is the best choice, since you save 1 cycle per every sta in that case. The catch is that you have to know in advance what the bank of the 24-bit address ESDILocation is, and that's doable in lots of ways (again: how a programmer chooses to design things).

The reason I chose to use a 24-bit addressing STA is because [ES:DI] -- it's been a while, so if I get this wrong, I apologise -- allows you to write to any segment (ES) starting at offset DI, up to 65536 bytes. I wanted to make this "easy", so I just went with a linear 24-bit addressing store. More realistically though, as I said, setting B (bank) to the destination followed by a 16-bit store would save a cycle per every loop iteration, and is "more akin" to the segmented memory model (hard to explain what I mean by this, sorry, it's late and I'm tired (see below)) in this way, along with the "up to 65536 byte" limitation too.

Another optimisation possibility is to do self-modifying code. I'm not sure how much time this would save, so this is speculative. I've written such code, but I really don't enjoy it and try to stay away from it on the 65816. On 6502, with only 64K of addressing space (and often even less RAM and ROM), it's more common.

There is also the possibility of using a stack-based copying method, which may save other cycles. I will admit it has been a long time since I've used pea/pei to do this, so my familiarity there is a bit rusty, but I'm sure I could figure it out. Note: I used this methodology in the aforementioned link (re: demo/utility group) to do some text scrolling. There's a point where the stack-based method is faster than the manual loop.

So I imagine it is very possible to get the cycle count down to around 3000 for the same number of bytes (words) transferred.

Now let's bring mvn / mvp into the picture (rather than a manual loop), but speaking on a general level because I took diphenhydramine and melatonin earlier and I'm really getting dozy (plus angry that I've spent time doing this rather than reading my awesome scifi book in comfy bed, but hey that's my own fault not anyone elses):

The mvn / mvp opcodes take a whopping 7 cycles PER BYTE moved. Not per word (16-bit), but per byte. The CPU does the move itself (the entire CPU blocks/waits until the opcode is done, obviously). Whether it operates internally on a per-byte basis or per-word, I do not know, but I bet you it's per-byte given the nature of how it works. Yeah, it's expensive.
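For the emulator authors in the thread: a minimal C sketch of how an emulator might model mvn's per-byte behaviour, charging the documented 7 cycles per byte. The flat `mem` array and all names (`bus_read`, `struct cpu`, etc.) are illustrative toys, not taken from any real emulator.

```c
#include <stdint.h>

/* Toy flat memory; a real emulator dispatches through its memory map here. */
static uint8_t mem[1 << 17];

static uint8_t bus_read(uint32_t addr)             { return mem[addr & 0x1FFFF]; }
static void    bus_write(uint32_t addr, uint8_t v) { mem[addr & 0x1FFFF] = v; }

struct cpu {
    uint16_t a;                  /* bytes to move, minus one (MVN moves A+1 bytes) */
    uint16_t x, y;               /* source and destination offsets */
    uint8_t  src_bank, dst_bank;
    uint64_t cycles;
};

static void do_mvn(struct cpu *c)
{
    /* MVN copies A+1 bytes, incrementing X and Y; A ends at 0xFFFF. */
    do {
        bus_write(((uint32_t)c->dst_bank << 16) | c->y,
                  bus_read(((uint32_t)c->src_bank << 16) | c->x));
        c->x++;
        c->y++;
        c->cycles += 7;          /* the 7 cycles per byte discussed above */
    } while (c->a-- != 0);
}
```

Moving 1024 bytes this way costs 7168 emulated cycles, which is exactly why the manual unrolled loop discussed earlier can beat it.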

I hope someone else here (probably byuu) comes along and says "holy shit dude, that routine is crap, you are REALLY out of practice" and optimises the hell out of it. That's how the guys at MIT did things on the old PDPs -- they would literally revamp or optimise each other's code to shave off cycles or bytes here and there. Hell, Bill Gates did this too.

Now you see why a little inexpensive DMA chip alongside the 65816 for memory transfers really, really helps.

P.S. -- Despite my sort of annoyed/irritated tone (just how I am, nothing personal), this is probably the most thorough 65816 work I've done since working on the SNES/SFC many years ago and my days in the IIGS scene. BIG nostalgia for me here. Really brought back memories -- especially since my IIGS is sitting in a storage bin behind me. I also had to bust out ActiveGS (didn't have KEGS laying around) to run the IIGS mini-assembler (call -151 ; !) to write some code to test some of my theories about certain opcodes behaving certain ways in 8-bit vs. 16-bit mode. So thank you, miker00lz, for the trip back in time. I do appreciate it.

P.P.S. -- I edited the initial code at the start to remove use of a temporary DP variable because I realised I could safely use other registers for it.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118220)
miker00lz wrote:
ArsonIzer wrote:
miker00lz wrote:
Writing your first CPU emulator is one hell of a learning experience for sure. My first was the 8086. That was.... taxing.


Haha I have to give it to you, you've got some guts to do that as your first emulator. I think I would've thrown away my laptop and lived in the woods after becoming permanently paranoid if I had to do that with the knowledge I had prior to making the 6502. Even now, I think the 8086 would be a major, several-month challenge to me that would have a pretty big chance of failure to be honest. I do, however, plan to emulate one at some point. The NES is just a start in my journey (hopefully) :D


Well, if you can do the 6502 you can definitely do the 8086. The most confusing thing on the 8086 is understanding the addressing mode byte (aka mod/reg/rm byte)... once you get that it's really not much harder. There are more opcodes and more addressing modes, so it will take more time. There are also "group" opcodes where the first instruction byte indicates which group, then it has a modregrm byte where one of its fields indicates the exact operation. It can get pretty weird. Oh, and there are segments to worry about too, but that's not so bad. Yeah, it DID take a few months before I was able to boot DOS.

If you get around to doing it, let me know if you want a little help. Once you understand the couple confusing aspects, it does become about as easy as the 6502. :)


The thing that scares me about the 8086 is the huge and more complex x86 architecture. The few dozen 6502 opcodes compared to a laundry list of instructions seems like hell to me, and all the terminology you're using while arguing with the other guys is like a foreign language. I had literally no experience with lower-level stuff like I mentioned before, so to get to the point of knowledge you guys have would take me many years, and that's how long I'm guessing it would take me to get an 8086 running, let alone have it run DOS (which is actually a long-term goal I have for emulation). Another thing: doesn't the 8086 have an FPU? I can only assume that for mathematical dimwits like myself, it's hell to implement. Of course, I'm assuming this; maybe the FPU isn't hard at all to implement.

Another thing would be the graphics. If I actually wanted to RUN DOS like DosBox does (running actual games as well), I'd have to implement some kind of complex graphics card (relative to the NES or SNES). Did you manage to do that too?

PS: I will definitely take you up on that. The moment I start on an 8086 is the moment you might receive 200 messages a day of me flipping out because I can't get another opcode right. Watch your spambox my friend.

PPS: Just kidding, but seriously, if I ever get started on that I might ask you some questions if that's ok :P
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118224)
ArsonIzer wrote:
Another thing, doesn't the 8086 have an FPU? I can only assume that for the mathematical dimwits like myself, it's a hell to implement. Of course, I'm assuming this, maybe the FPU isn't hard at all to implement.

http://en.wikipedia.org/wiki/Intel_8086#Floating_point

TL;DR -- the 8086 has support for an FPU add-on chip (mainboards would have sockets for it), usually the Intel 8087 (though some other vendors made their own), which you'd buy, plug in, and get a bunch more opcodes and FPU-specific registers. I imagine emulating this really isn't that bad, considering that any present-day language you'd use (C, etc.) could "emulate" it using the language's own floating-point operations, along with some limitations you'd have to code in yourself.
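To make the "emulate it with the host language's own math" idea concrete, here's a hedged sketch of modelling the 8087's eight-deep register stack with host floating point. The real registers are 80-bit; C's long double matches that on x86 hosts, and elsewhere you'd lose some precision. All names here are made up for illustration.

```c
#include <stdint.h>

/* Illustrative model of the 8087 register stack: ST(0)..ST(7) as a
 * circular buffer of host long doubles, with 'top' pointing at ST(0). */
typedef struct {
    long double st[8];
    int top;
} fpu_t;

static void fpu_push(fpu_t *f, long double v)
{
    f->top = (f->top - 1) & 7;   /* stack grows downward, wrapping mod 8 */
    f->st[f->top] = v;
}

static long double fpu_get(const fpu_t *f, int i)   /* read ST(i) */
{
    return f->st[(f->top + i) & 7];
}

/* FADD ST(0), ST(i) -- add ST(i) into the stack top */
static void fpu_fadd(fpu_t *f, int i)
{
    f->st[f->top] += fpu_get(f, i);
}
```

A real implementation also needs the control/status words, rounding modes, and the tag word, but the arithmetic core really is just host math like this.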

Just reviewing that page, particularly all the qword ptr [edi+ebx] crap, is further stuff that makes me hate x86. It's so easy to get lost in the syntax and "addon words" of the assembly language. I swear, knowing when and when not to use things like dword ptr and brackets for certain addressing modes/methods was such a pain in the ass. I remember doing 320x200 graphics (segment A000 (or A0000 in protected mode)) and saying things like "Why the hell do I have to use brackets here? I don't get it, I'm not wanting indirect addressing... or am I? GRAAAHHHH!!!" and "Why can't I just hard-code the value of the address I want to use in the instruction? Why must I use a register?"

Sorry, 65xxx just makes all this stuff really bloody obvious when you look at it. Not to a newbie, no, but there's a lot LESS to grasp overall. I find the syntax to be easier to understand. Friends of mine who have tried repeatedly to learn 65xxx (I think 6502 in particular) but fail, for example, have no problem learning LISP. This still baffles my mind to no end, from a syntactical standpoint anyway. It's one of those things where if I could sit down with them for a week and step them through basic assembly programming I think they'd understand it, but languages that add tons of crap on top -- or abstraction of any sort -- make it harder for the person to actually know what's going on under the hood. Off-topic big time, but this is exactly why I loathe things like Java. Too many layers of crap that can go wrong between you and the CPU. :-)
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118225)
koitsu - I'm dead tired and typing this from my laptop in bed, but good work. That was fast. I'll check it out in more detail tomorrow.

ArsonIzer - Nope, the 8086 on its own has no FPU at all, fortunately. What you're thinking of is the 8087 co-processor (edit: koitsu beat me to it), which is required by virtually no software out there. I don't really understand the FPU stuff in detail either. As for graphics, CGA and 320x200 256-color MCGA are dead simple to emulate compared to the NES. Not even close, really. They're simple bitmaps (CGA is interlaced, but it's still simple) mapped in the CPU's address space. Couldn't be any easier. If you really want to be super-accurate there is scanline timing involved, but for 99.999% of programs it doesn't matter.
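The "interlaced, but still simple" CGA layout can be shown in a few lines of C. In 320x200 4-color mode, even scanlines start at the beginning of the 0xB8000 buffer and odd scanlines start 0x2000 bytes in, with 80 bytes per row and four 2-bit pixels packed per byte. This is a sketch of the address math only, with a plain byte array standing in for video memory.

```c
#include <stdint.h>

/* Byte offset of pixel (x, y) within the CGA 320x200 4-color buffer:
 * odd scanlines live in a second 8 KiB half starting at offset 0x2000. */
static uint32_t cga_offset(int x, int y)
{
    return (uint32_t)((y & 1) * 0x2000 + (y >> 1) * 80 + x / 4);
}

/* Extract the 2-bit color of pixel (x, y); the leftmost pixel of each
 * byte sits in the high bits (bits 7..6). */
static int cga_pixel(const uint8_t *vram, int x, int y)
{
    uint8_t b = vram[cga_offset(x, y)];
    int shift = (3 - (x & 3)) * 2;
    return (b >> shift) & 3;
}
```

An emulator's frame renderer is essentially this in a double loop, plus palette lookup, which is why it's so much simpler than the NES PPU's tile/attribute/sprite machinery.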

Yes, I implemented the graphics in my emu. (everything except EGA and 640x480 4-color VGA, but I'll get to that eventually)

There are some screenshots from it here: http://sourceforge.net/projects/fake86/
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118233)
Now attempting a segue back to topic.

We've been discussing aspects of an emulated CPU that can help determine whether an emulator is fast or slow:
  • how complex the emulated CPU is
  • instructions per clock (higher can sometimes be emulated faster)
  • clock speed (higher usually means emulated slower)
Instructions in the 8080 family (8080, Z80, Game Boy, 8086) tend to take far more cycles than instructions on a 6502. This makes a 6502 roughly as fast as an 8080-family CPU running at a much higher clock rate. This is how the NES and Atari 800 got away with a 1.8 MHz 6502 when the ColecoVision, MSX, Master System, and Game Gear were using a 3.6 MHz Z80: the higher IPC cancels out the lower clock rate. The same is true of the 65816 in the Super NES and the 68000 in the Genesis, much to the chagrin of Sega "blast processing" fanboys. But an emulator is mostly concerned with the externally visible behavior: 1. how many instructions it can run, and 2. whether the reads, writes, and interrupts happen on the correct cycle relative to the other devices on the bus.
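The IPC point above is just arithmetic, and a tiny helper makes it concrete. The cycles-per-instruction averages below are rough, workload-dependent guesses (commonly quoted ballparks, not measurements), so treat the numbers as illustrative only.

```c
/* Rough instruction throughput in millions of instructions per second,
 * given a clock rate and an assumed average cycles-per-instruction. */
static double emulated_mips(double clock_hz, double avg_cycles_per_insn)
{
    return clock_hz / avg_cycles_per_insn / 1e6;
}

/* With guessed averages of ~3.5 cycles/insn for the 6502 and ~8 for
 * the Z80, a 1.79 MHz NES 6502 and a 3.58 MHz Master System Z80 land
 * in the same ~0.45-0.51 MIPS neighborhood. */
```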
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118908)
miker00lz wrote:
ArsonIzer - Nope, the 8086 on its own has no FPU at all, fortunately. What you're thinking of is the 8087 co-processor (edit: koitsu beat me to it), which is required by virtually no software out there. I don't really understand the FPU stuff in detail either. As for graphics, CGA and 320x200 256-color MCGA are dead simple to emulate compared to the NES. Not even close, really. They're simple bitmaps (CGA is interlaced, but it's still simple) mapped in the CPU's address space. Couldn't be any easier. If you really want to be super-accurate there is scanline timing involved, but for 99.999% of programs it doesn't matter.

Yes, I implemented the graphics in my emu. (everything except EGA and 640x480 4-color VGA, but I'll get to that eventually)

There are some screenshots from it here: http://sourceforge.net/projects/fake86/


Damn, I'm impressed, but I have a somewhat newb-ish question. What determines the memory map of an 8086 combined with, for instance, a CGA? Let's say I want to emulate them like I'm emulating the NES: I know that the NES has the NMI, RESET and IRQ vectors at $FFFA-$FFFF (in that order), the PPU registers at $2000-$2007, controller registers at $4016 and $4017, and so on. Obviously, the 8086 doesn't work exclusively with CGA or MCGA or whatever, so what determines the locations of, for instance, the CGA's registers (if it has any)? I've tried finding documents about it but there's nothing but a mediocre description of the CGA on Wikipedia. Let's say you run MS-DOS, and DOS writes to a certain memory address expecting a CGA register, but instead an MCGA card is present; what happens? While I understand the basic premise of hardware and emulation, I don't yet understand how various pieces of hardware with different specifications can just work together like that.

PS: Sorry for asking this so late, but I just today remembered your reply.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118909)
ArsonIzer wrote:
Obviously, the 8086 doesn't work exclusively with CGA or MCGA or whatever, so what determines the locations of, for instance, the CGA's registers (if it has any)? I've tried finding documents about it but there's nothing but a mediocre description of the CGA on Wikipedia.
Fortunately, IBM actually released a set of incredibly complete documents for the IBM PC—the "IBM Personal Computer XT Technical Reference manual"—and thus also the CGA. These documents explain exactly what everything is and how it works. It's basically the equivalent of Disch's documentation for Nintendo mappers, but written in schematics and datasheets instead of prose and tables. (Also consider looking for "Ralf Brown's Interrupt List".)

(For reference, the CGA has the MC6845 at I/O addresses 0x3D4 and 0x3D5, mirrored across 0x3D0-0x3D7. Later compatible things didn't mirror that full range. It also has a bunch of other control registers from 0x3D8-0x3DC.)
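For an emulator, that parenthetical translates almost directly into an I/O-port decoder. A hedged C sketch, using the addresses lidnariq gives (MC6845 index/data at 0x3D4/0x3D5 mirrored across 0x3D0-0x3D7, control registers at 0x3D8-0x3DC); the register storage and names are illustrative.

```c
#include <stdint.h>

static uint8_t crtc_index;        /* currently selected MC6845 register */
static uint8_t crtc_regs[18];     /* the MC6845's registers R0-R17 */
static uint8_t cga_ctrl[8];       /* 0x3D8-0x3DF mode/color/status slots */

static void cga_io_write(uint16_t port, uint8_t value)
{
    if (port >= 0x3D0 && port <= 0x3D7) {
        if ((port & 1) == 0)          /* even addresses mirror 0x3D4 */
            crtc_index = value & 0x1F;
        else if (crtc_index < 18)     /* odd addresses mirror 0x3D5 */
            crtc_regs[crtc_index] = value;
    } else if (port >= 0x3D8 && port <= 0x3DC) {
        cga_ctrl[port - 0x3D8] = value;
    }
    /* anything else on the bus just isn't this card's problem */
}
```

The card simply ignores ports it doesn't decode, which is also the answer to "what happens when software pokes a register the installed card doesn't have": usually nothing at all.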


Quote:
Let's say you run MS-DOS, and DOS writes to a certain memory address expecting a CGA register, but instead an MCGA card is present, what happens?
The same badness that happens if you try to use the video driver for an ATI video card when you actually have an Intel card instead. Certain features are completely compatible (e.g. "VGA compatible"); many aren't. In the bad old days, you'd sometimes have quite a dance of moving things around to get everything to play together.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118911)
lidnariq wrote:
Fortunately, IBM actually released a set of incredibly complete documents for the IBM PC—the "IBM Personal Computer XT Technical Reference manual"—and thus also the CGA. These documents explain exactly what everything is and how it works. It's basically the equivalent of Disch's documentation for Nintendo mappers, but written in schematics and datasheets instead of prose and tables. (Also consider looking for "Ralf Brown's Interrupt List".)

(For reference, the CGA has the MC6845 at I/O addresses 0x3D4 and 0x3D5, mirrored across 0x3D0-0x3D7. Later compatible things didn't mirror that full range. It also has a bunch of other control registers from 0x3D8-0x3DC.)


Yeah, too bad I can't really read those schematics and datasheets, so I'd have to learn that first. I appreciate the info though, I came across the document but I thought it would be more about the low hardware-level aspects of components, like pin arrangement and voltage levels and stuff, so I skipped it. I'll look more into it though.

lidnariq wrote:
The same badness that happens if you try to use the video driver for an ATI video card when you actually have an Intel card instead. Certain features are completely compatible (e.g. "VGA compatible"); many aren't. In the bad old days, you'd sometimes have quite a dance of moving things around to get everything to play together.


While I understand the driver thing, I don't understand how you can run a disk image like MS-DOS and have it run normally whether a CGA, MCGA, or whatever graphics card is installed. What I mean is that MS-DOS can for instance write to address $F00 to set certain pixels on the screen or whatever, because it expects some kind of graphics register there. How does the processor/graphics card make sure that the write is received as a write to the expected register, even though there are multiple possibly connected cards with different registers/behavior?
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118912)
ArsonIzer wrote:
While I understand the driver thing, I don't understand how you can run a disk image like MS-DOS and have it run normally whether a CGA, WCGA, or whatever graphics card is installed. What I mean is that MS-DOS can for instance write to address $F00 to set certain pixels on the screen or whatever, because it expects some kind of graphics register there. How does the processor/graphics card make sure that the write is received as a write to the expected register, even though there are multiple possibly connected cards with different registers/behavior?

AFAIK, programs relied on tests to detect what hardware was present. MS-DOS most likely only made use of the most basic features that every video card was supposed to offer (it's only text after all!) but any program that wished to use more advanced features would probably have to look for a "signature" indicating that said features are indeed present.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118913)
Some PC graphics programming books from the early SVGA era even had a list of cards, how to probe for them, and in what order to probe for them, because the probe sequence for one brand of card would cause another brand of card to lock up. At that time, input protection was so poor that the wrong poke could damage a Hercules monochrome graphics card or the monitor connected to it (I forget which).
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118916)
So let's say a game like Ultima VI (as shown in one of the fake86 screenshots by miker00lz) is run on a Hercules/CGA card, what would happen? Would the game crash/give errors and why? Because it can't detect its minimally required graphics card, or because certain VGA-specific instructions are failing on the machine/emulator?
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118922)
ArsonIzer wrote:
While I understand the driver thing, I don't understand how you can run a disk image like MS-DOS and have it run normally whether a CGA, MCGA, or whatever graphics card is installed.
That's part of the purpose of the BIOS and DOS itself. They provide a set of globally provided functions ("interrupts" at the time, "syscalls" in modern parlance) that abstract away much of the lower-level functionality. It's just that the scale and scope of these functions were very limited, especially compared to the entirety of Win32s.

ArsonIzer wrote:
How does the processor/graphics card make sure that the write is received as a write to the expected register, even though there are multiple possibly connected cards with different registers/behavior?
For the x86, there are two separate address spaces: RAM and I/O. They're both readable and writeable, but the latter can't be executed from. Also, because the I/O space is distinct, there are very few configuration registers in RAM (they're almost all in the I/O space instead).
So drawing pixels onscreen was almost always some variant on "call [System Call Interface] 0x10 with configuration parameters, allow the video card BIOS to set up the screen, and then write data to memory starting at 0xa0000 (EGA and newer), 0xb0000 (MDA), or 0xb8000 (CGA)". (How do you know which? Documentation...)
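On the emulator side, that last step ("write data to memory starting at...") is just a memory-map dispatch: a CPU write lands either in system RAM or in whichever video window the emulated card claims. A hedged sketch using the window bases from the paragraph above; the buffer sizes and names are illustrative.

```c
#include <stdint.h>

static uint8_t ram[0xA0000];         /* 640 KiB of conventional memory */
static uint8_t cga_vram[0x4000];     /* 16 KiB CGA window at 0xB8000 */

static void mem_write(uint32_t addr, uint8_t value)
{
    if (addr >= 0xB8000 && addr < 0xBC000)
        cga_vram[addr - 0xB8000] = value;   /* CGA text/graphics buffer */
    else if (addr < 0xA0000)
        ram[addr] = value;                  /* plain conventional RAM */
    /* writes to unclaimed addresses fall off the bus and are ignored */
}
```

Swapping the emulated card then just means claiming a different window (0xA0000 for EGA/VGA, 0xB0000 for MDA) in this one dispatch function.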

tepples wrote:
Some PC graphics programming books from the early SVGA era even had a list of cards, how to probe for them, and in what order to probe for them, because the probe sequence for one brand card would cause another brand of card to lock up.
As I recall it, most software just completely gave up and simply asked the end-user what graphics card they had, presumably because the heuristics were so lousy and mistakes would usually crash.

tepples wrote:
At that time, input protection was so poor that the wrong poke could damage a Hercules monochrome graphics card or the monitor connected to it (I forget which).
The monitor. High voltage power electronics are comparatively easy to toast; 5V logic is comparatively amazingly resilient.

ArsonIzer wrote:
So let's say a game like Ultima VI (as shown in one of the fake86 screenshots by miker00lz) is run on a Hercules/CGA card, what would happen? Would the game crash/give errors and why? Because it can't detect its minimally required graphics card, or because certain VGA-specific instructions are failing on the machine/emulator?

Very little visible would happen, but it's unlikely the game would crash. Depending on exactly how they wrote the game, it might even quit cleanly with a "where's my VGA?". Otherwise, because the VGA was specifically designed to put the majority of its configuration registers in a completely different location than the CGA, it's most likely the game would start, clear the screen, and then you'd be left with a blank text mode screen with a blinking cursor. You might hear the game music.
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118933)
Ultima 6 DID have a hercules mode...
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118934)
But this game probably doesn't:
Image
Re: What determines whether an emulator is "fast" or "slow"?
by on (#118939)
Thanks for all the info guys, things are starting to clear up. Just gotta learn to understand those datasheets and whatnot and I'll be fine :D

tepples wrote:
But this game probably doesn't:
Image


Lol, talk about off topic. That game sure brings back some memories though. I used to play it on the PS1 as a kid, and I never managed to complete it at that time because it had no way of saving progress. You could collect some of those pots/urns or whatever and get a password, but I sucked as a kid and never got them. Ironically, this is one of those PS1 games that makes me want to get better at emulation so I can one day write a PS1 emu :D.