A long time ago (something like 8 years ago) I thought it would be really cool to write an emulator. Of course I didn't know anything back then, but now I actually have a decent understanding of both the NES and programming.
I had made two previous attempts at writing a NES emulator, before I'd ever written any ASM for the NES, and neither went anywhere. But now I'm making another attempt. I believe I understand the concept, and I've got a CPU interpreter partially written. By partially I mean not every opcode is there yet, and I can't be 100% sure the ones that are there are all correct.
But everything I've done so far was enough to show some life. My little "Pong" demo that I wrote ages ago for the NES shows the name tables being written to, with some hacky graphics emulation. I can also see the menu of the NESTEST ROM. But I have some confusion as to what could be causing one issue.
I started the CPU core pretty simple: it resets to the vector and "fetches" an opcode. It then either executes the opcode if I support it, or tells me what it was if I don't. I used Donkey Kong (JU) as my test ROM. I would repeatedly load the game and let it run until it hit an opcode I hadn't supported yet, then go add support for that one, and kept doing this. I would "try" to verify the opcodes were working the way they should by comparing with other emulators.
I eventually got the thing to just loop forever without hitting any unsupported opcodes. I figured out (I think so, at least) that the game was just waiting around for NMI, or maybe for a NES register to return a value. After adding NMI I got to a bunch more opcodes. I verified the addresses it was executing were legitimately executed addresses, and not errors, by tracing the same code in a completed emulator (FCEU). Eventually I was back to the same thing: the game would run and never hit any unsupported opcodes.
I went to implement the "graphics" emulation, if you want to call it that, but it's really just a function to draw whatever values are in Name Table 0 on screen. I could see Donkey Kong and Donkey Kong Jr. both cleared the name table to all #$24, the "blank" tile for those games. I could also see nestest and my Pong demo. This is the point I'm at now.
My problem is that Donkey Kong and DK Jr. never write anything else to the screen, and the program counter is looping in the same range. I'm going to investigate what it's looping on to try to figure out what to do next. Update: it wasn't looping like I had thought at one point. For all I know the game is running, though I'm not sure why I cannot see anything plotted on the name table other than the blank tiles.
Now the point of this topic: what advice do you have for writing a NES emulator? Keep in mind I specifically mean advice for someone who is new to writing a NES emulator, such as what games or NES programs you suggest would be good targets to get working first. I have read here before that many people make the mistake of thinking Super Mario Bros. is the easiest to get running. I assume a game like Donkey Kong would be easier: no scrolling, no sprite 0 hit, no ROM banking, etc.
So anyone got any tips?
1) Write a tracer (if you haven't already). Bugs and timing issues are next to impossible to solve without one.
2) Get and pass as many CPU test ROMs as you can. nestest.nes is particularly nice because it doesn't require a PPU to be emulated in order to run it.
3) This is in the same vein as #2 -- but focus on the CPU first. CPU bugs will usually be the biggest reason for emu muckups. PPU bugs will usually cause graphical glitches and APU bugs will cause sound glitches, but CPU bugs can and will cause every possible kind of glitch, making them very hard to diagnose.
4) Pointers are your friend. NT mirroring, CHR/PRG swapping, and tons of other things can be handled painlessly and easily with proper use of pointers. Bulk memory copying is slow and cumbersome.
that's about all I can think of.
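To illustrate #4, here's a minimal sketch of pointer-based nametable mirroring (the names are made up; ciram stands in for the console's internal 2K of VRAM):
Code:
unsigned char  ciram[0x800];   // the console's 2K of nametable RAM
unsigned char* ntPage[4];      // one pointer per 1K nametable slot

void SetMirrorHorizontal()     // NT0=NT1, NT2=NT3
{
    ntPage[0] = ntPage[1] = &ciram[0x000];
    ntPage[2] = ntPage[3] = &ciram[0x400];
}

void SetMirrorVertical()       // NT0=NT2, NT1=NT3
{
    ntPage[0] = ntPage[2] = &ciram[0x000];
    ntPage[1] = ntPage[3] = &ciram[0x400];
}

// $2000-$2FFF accesses become a table lookup -- no memory ever gets copied
unsigned char ReadNT(unsigned short adr)
{
    return ntPage[(adr >> 10) & 3][adr & 0x3FF];
}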
Well, I'll be doing what you suggest and adding a tracer of some kind so I can better understand what is going on. I agree that it's virtually impossible to understand what is happening with nothing more than a few values displayed on screen.
And you were right, Disch. After writing a disassembling/tracer output and reading the log, I found two bugs that were preventing the title screen from appearing. For one, my NMI was not executing correctly: it was going to the NMI vector + 1, though the odd thing is I have no idea why. I ended up fixing it by deciding that when an NMI is triggered, I do everything, then return from the function rather than proceeding through the CPU core.
The other bug: I forgot to update the flags on opcode B1, which meant it never took a branch and never loaded the title. Now I can read "DONKEY KONG" in my text output of the nametable, with marked tiles where all the text and such would be. So ironing out my CPU will certainly be top priority.
Keep everything modular, flexible, and try to emulate at the lowest level you can manage; it will pay off later on in accuracy and probably speed after optimization.
Some more advice:
Don't copy/paste or keep multiple copies of the same code. You shouldn't be having a problem with a single opcode not setting flags, because that opcode should probably be sharing code with other similar opcodes.
That is to say: if you have LDA immediate and ADC absolute, then LDA absolute should not require any additional code other than an additional 'case' statement to tie together your 'LDA' code and your 'absolute' code. This way, if you have a problem with LDA, it will become visible much sooner and be much more obvious that you have a bug (meaning you can correct it sooner -- hidden bugs that you have to coax out are less preferable than bugs that stand up and shout at you).
Plus, if you find and fix a bug or make a technical correction, then if all opcodes share the same code you only have to make the change once, rather than updating a dozen opcodes. For example, if you found out you were wrapping Indirect,Y incorrectly, it's much easier to just change your Indirect,Y code than it would be to update and fix every single opcode that uses Indirect,Y.
This kind of thing can be accomplished pretty well by function inlining. Here's a snippet of my ADC absolute code to give the idea:
Code:
void NES_INLINE AdRdAb(NESCPU& cpu) // Absolute
{
    cpu.adr  = Rd(cpu.PC);        ++cpu.PC; // cycle 1
    cpu.adr |= (Rd(cpu.PC) << 8); ++cpu.PC; // cycle 2
    cpu.val  = Rd(cpu.adr);                 // cycle 3
}

...

void NES_INLINE _ADC(NESCPU& cpu)
{
    register u16 tmp = cpu.A + cpu.val + (cpu.fC != 0);
    cpu.fC = (tmp >= 0x0100);
    cpu.fV = (tmp ^ cpu.A) & (tmp ^ cpu.val) & 0x80;
    cpu.fN = cpu.fZ = cpu.A = (u8)tmp;
}

...

// the ops
op = Read(mCPU->PC);
++mCPU->PC;
switch(op)
{
    ...
    /* ADC */
    case 0x69: AdRdIm(*mCPU); _ADC(*mCPU); break;
    case 0x65: AdRdZp(*mCPU); _ADC(*mCPU); break;
    case 0x75: AdRdZx(*mCPU); _ADC(*mCPU); break;
    case 0x6D: AdRdAb(*mCPU); _ADC(*mCPU); break;
    case 0x7D: AdRdAx(*mCPU); _ADC(*mCPU); break;
    case 0x79: AdRdAy(*mCPU); _ADC(*mCPU); break;
    case 0x61: AdRdIx(*mCPU); _ADC(*mCPU); break;
    case 0x71: AdRdIy(*mCPU); _ADC(*mCPU); break;
    ...
A trick for cycle tallying:
With the exception of stack opcodes like PHA, RTI, etc., and one or two oddball instructions (JMP), the addressing mode directly dictates the number of cycles any given instruction uses. That is, Absolute always uses 4 cycles... Zero Page always uses 3... Indirect,Y always uses 5+1 cycles, etc.
Note this is true for read-only instructions like LDA, ADC, CMP, and for write instructions like STA, STX, STY. Read/modify/write instructions like INC and ASL use different numbers, but are still just as predictable: Absolute always uses 6 cycles, Zero Page always uses 5, etc.
I have three sets of addressing mode functions: one for read-only ops, one for write ops, and one for read/modify/write ops. You can then use these functions to tally your CPU cycles rather than building and using a lookup table.
Of course this method isn't really better or worse than using a lookup table. I just found that building the lookup table is time consuming and dull... and it's very easy to make an ever-so-subtle mistake that will be really hard to find.
(In case you were wondering, I don't tally cycles this way in my above code -- I actually just tally a cycle in my Rd() and Wr() functions, and perform all the dummy reads and writes the instructions do. This works because the CPU either reads or writes a byte on every cycle an instruction takes.)
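(If it helps, a bare-bones sketch of that idea -- the Rd/Wr names match my snippet above, but the cycle counter and the MemRead/MemWrite dispatchers are made-up names:)
Code:
static u32 gCycles; // running CPU cycle tally

u8 Rd(u16 adr)
{
    ++gCycles;           // every read costs exactly one CPU cycle
    return MemRead(adr); // your memory-map dispatch goes here
}

void Wr(u16 adr, u8 val)
{
    ++gCycles;           // every write costs exactly one CPU cycle
    MemWrite(adr, val);
}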
I've gotten further with the CPU test ROM, though I am getting some error codes. I'm planning on redoing a lot of the CPU code as you suggested, because as I was writing it, I found that I could probably make things a lot neater and reuse the same code rather than copying and pasting it a million times. :p
As far as accuracy goes, I don't need or intend for it to be perfect or close to that. I think that'd be a pretty big goal for a first try, and besides, I'm not really looking to somehow top the great emulators others have created.
What I've been doing is not like you've suggested, where CPU instructions are broken down into individual cycles. I've got it set up so it executes each instruction and increases the clock counter. Right now my biggest concern is a working CPU core anyway, so it doesn't have to be perfect, it just has to work. And I'm thinking that will require, or at least go better with, changing things so that I take advantage of opcodes & addressing modes rather than doing each opcode number individually.
I'll probably have some questions about the CPU too. The test ROM has certainly raised some strange issues, like an error about overflow and carry on the INX and DEX tests -- even though the 6502.txt document says those instructions don't affect those flags anyway.
MottZilla wrote:
As far as accuracy goes, I don't need or intend for it to be perfect or close to that. I think that'd be a pretty big goal for a first try, and besides, I'm not really looking to somehow top the great emulators others have created.
I wouldn't recommend you try for a high level of accuracy on your first go, either. I've rewritten my emu a few times now, and each time I've made changes based on what I learned from my past attempts.
There's no way you'll be able to plan for and work around every issue that comes up your first time out of the gate -- so I wouldn't worry too much about it. That being said... I would say try to get things as accurate as you feel comfortable with. Details which might seem relatively insignificant can sometimes cause some games to go horribly wrong. But at the same time, it's not worth killing yourself over every little detail until you have a better grasp on things. You'll have to find some middle ground that you're happy with.
In short: keep doing what you're doing =P. You may get the urge to go back and rewrite later... but when you do you'll be amazed at how much easier it is.
Quote:
I've got it setup so it executes each instruction and increases the clock counter.
That's totally fine.
Quote:
I'll probably have some questions about the CPU too. The test ROM has certainly raised some strange issues, like an error about overflow and carry on the INX and DEX tests -- even though the 6502.txt document says those instructions don't affect those flags anyway.
If the test ROM is yelling at you for that, it may be because you're changing C or V on INX/DEX when you shouldn't be (INX/DEX only change N and Z -- other flags should not be changed from their previous state).
Also -- for a quick reference, I would recommend obelisk over 6502.txt:
http://www.obelisk.demon.co.uk/6502/reference.html
6502.txt does a better job at giving details of what each instruction does... but when it comes to opcode listings and other reference stuff it has a few typos which can be a real pain.
Still use 6502.txt if it's easier for you, but cross-check the info with that obelisk page to make sure you're not assigning an instruction to the wrong opcode or something like that.
Well, the thing about INX and DEX: as far as I could tell, I was only changing N and Z. I didn't touch C or V. But I'm rewriting all the instructions now to use addressing mode functions which are then paired with instruction functions, e.g. LDA(AM_Immediate()). I don't think it's as fast as it could be in execution, but I'm more concerned with the readability and maintainability of the code.
From what you say it sounds like I'm on the right track. And I'll start using the obelisk page to cross-reference; before, I relied solely on 6502.txt, which has some typos that were probably generated when the paper original was machine-read into a text file -- things like capital Ds becoming 0s.
So far, as I've been changing over from the code in the switch() cases to functions, I haven't broken anything. This approach seems much better than the initial one. You definitely learn as you go what works, what doesn't, and what is better.
Update: As I've gone on converting my core into functions, I've fixed a number of instructions that were not working properly. As a result, Donkey Kong and Donkey Kong Jr. (my test games) now proceed past the title screen, and I can actually see the name table loaded with their first levels. I'm still converting and verifying the instructions I've done, and after they are all converted I believe I have a few more to add. I'm still using the CPU test ROM and I still get errors on certain things, but I'm sure I'll iron that all out eventually.
I do have a question though. The CPU test error codes often have entries that don't say much more than the instruction that failed. For example, the last ADC Immediate test fails on me for some reason, but it doesn't tell me why.
Yeah, nestest doesn't really get into specifics. All I can say is double-check your ADC code and make sure you're setting flags properly.
V, in particular, tends to give people the most trouble. For ADC, V is set when:
positive + positive = negative
or
negative + negative = positive
and is cleared on all other cases. Another way to think of how V works is to look at the signed number range... V does for signed numbers what C does for unsigned:
unsigned range = 0 to 255 ($00-$FF)
signed range = -128 to 127 ($80-$7F)
Just as C is set when the addition produces a number higher than 255 ($FF) -- V is set when the addition produces a number higher than 127 ($7F)
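A concrete pair of examples: $50 + $50 = $A0 leaves C clear but sets V, since in signed terms that's 80 + 80 = -96 ($A0 reads as a negative number). Meanwhile $D0 + $D0 = $1A0 sets C but leaves V clear, since -48 + -48 = -96, and $A0 really is -96.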
Can you spot anything wrong with this code?
Code:
fixed
When I replaced either one of the ADCs or the SBCs with the function call instead of the garbage I had in the case, Donkey Kong no longer shows the correct name table setup for the title and the level. I'm not sure why it broke.
Code:
if( (CPU_A + Value + Carry ) < CPU_A) // Check if Carry will Result
WTF. How is that supposed to work?
Shouldn't this be something like this?
Code:
if (CPU_A + value + carry >= 256)
I believe the idea was that, if you have an 8-bit value and you are adding a number to it, then if the number you end up with is less than what you started with, you wrapped around. However, when I went back to my original code, which does it with an int and a > 0xFF check, it worked again. I guess I was thinking that by adding a bunch of 8-bit values together it would wrap and never be greater than 0xFF. I dunno. I'm tired I guess. :p Thanks for pointing that out. Anyway, I took my newer code and fixed that fuckup. Everything is fine now.
It's OK, I made that thinking-error once too ;p. It would only work if you stored the sum into an 8-bit variable before the "if". Even then it won't work, though: if A=$FF, value=$FF, and carry is set, the result is the same as A, but it causes a carry anyway.
I believe I have all the opcodes implemented now. However, I'm getting error codes with nestest. Most if not all refer to the final SBC error code in their respective blocks. Does anyone know what that means? These are my ADC and SBC functions; if there's some error I've overlooked, let me know.
Code:
void ADC(unsigned char Value)
{
    unsigned char Carry = CPU_P & 0x01;

    // Check for Carry
    if( (CPU_A + Value + Carry) > 0xFF ) // Check if Carry will Result
    {
        CPU_SETC = 1;
    }
    else
    {
        CPU_SETC = 0;
    }

    // Check for Zero
    if( (CPU_A + Value + Carry) == 0 ) // Check if Zero will Result, Set Flag accordingly.
    {
        SetZero();
    }
    else
    {
        ClearZero();
    }

    // Check for Overflow
    CPU_TEMP = CPU_A + Value + Carry;
    CPU_SETV = 0;
    if(!((CPU_A ^ Value)&0x80) && !((CPU_A ^ CPU_TEMP)&0x80))
        CPU_SETV = 1;
    if(CPU_SETV)
    {
        SetOverflow();
    }
    else
    {
        ClearOverflow();
    }

    // Do ADC Operation
    CPU_A = CPU_A + Value + Carry;
    if(CPU_SETC == 1)
    {
        SetCarry();
    }
    else
    {
        ClearCarry();
    }
}

void SBC(unsigned char Value)
{
    ADC(Value ^ 0xFF);
}
Value is returned by the appropriate address mode function.
You're setting V incorrectly
!((CPU_A ^ Value)&0x80) <-- ensures CPU_A and Value have the same sign
!((CPU_A ^ CPU_TEMP)&0x80) <-- ensures A and Temp have the same sign
this translates to:
positive + positive = positive
negative + negative = negative
which is not quite how V works.
You want A and Val to have the same sign, but A and Temp to have opposing signs. So you're close... but you have a ! in there that you shouldn't have:
Code:
if(!((CPU_A ^ Value)&0x80) && !((CPU_A ^ CPU_TEMP)&0x80))
                              ^
                              |
                              remove that
if(!((CPU_A ^ Value)&0x80) && ((CPU_A ^ CPU_TEMP)&0x80))
That will ensure A,Val have the same sign, but A,Temp have opposing signs:
Positive + Positive = Negative
Negative + Negative = Positive
The rest looks good to me. Your "^ 0xFF" SBC trick will work exactly right.
Well, I fixed the overflow calculation. Thanks for that. Now I get error code 71 on immediate, which is the first error code for SBC; before it was 75, the last for immediate. So something is still wrong somewhere... I'll be going over my opcodes, since I now believe every opcode is implemented in the new form. Maybe one of the opcodes is setting carry incorrectly, which throws off the SBC.
I did try running some different games, and they all do something. Some of them, though, end up looping with NMI disabled. Not sure why that is yet.
To cover the obvious:
make sure that opcode $E9 is doing SBC immediate and not some other instruction/addressing mode?
I forgot to set N after ADC (and SBC). I get OKs on most tests now.
Indirect X, Indirect Y, and Zero Page X give errors, but everything else says OK. =) Also, Donkey Kong no longer gets stuck in that endless loop with NMI disabled. Instead I can see Donkey Kong animating and such, until attract mode ends and it goes back to the title. =)
Update: I fixed Zero Page indexed. I didn't realize quite how it worked and that it had to wrap.
Update: Seems I had the same issue with Indirect X, where I forgot I needed to wrap within the zero page. That leaves the Indirect Y addressing mode problems. I imagine I should look into zero page wrapping there too. :p
Update: Indirect Y needed zero page wrapping as well. Now I pass all the CPU tests. Time to refine a few things, I suppose, and then look at adding real graphics support.
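(For anyone hitting the same three errors, a minimal sketch of the wrapping involved -- Rd/u8/u16 as in Disch's snippet earlier; the cpu fields are assumed, and the (u8) casts are the whole point:)
Code:
// Zero Page,X: the index add wraps within page zero ($FF + 1 -> $00, not $100)
u16 AdrZpX(NESCPU& cpu)
{
    u8 zp = Rd(cpu.PC); ++cpu.PC;
    return (u8)(zp + cpu.X);
}

// (Indirect,X): both pointer bytes come from the zero page, wrapped
u16 AdrIndX(NESCPU& cpu)
{
    u8 ptr = (u8)(Rd(cpu.PC) + cpu.X); ++cpu.PC;
    return Rd(ptr) | (Rd((u8)(ptr + 1)) << 8);
}

// (Indirect),Y: the pointer fetch wraps within the zero page; Y is added to
// the full 16-bit address afterward (that addition does NOT wrap to page zero)
u16 AdrIndY(NESCPU& cpu)
{
    u8 ptr = Rd(cpu.PC); ++cpu.PC;
    u16 base = Rd(ptr) | (Rd((u8)(ptr + 1)) << 8);
    return (u16)(base + cpu.Y);
}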
Update: I figured out how I wanted to handle graphics (as far as CPU access and rendering access go). I broke it into 1K chunks so I can be flexible enough. I've also put together a routine to decode and draw NES tiles. I can finally SEE the games running by drawing the nametables. I'm quite happy with my progress, since I have only been at this for a week in my free time.
MottZilla wrote:
I believe the idea was that, if you have an 8-bit value and you are adding a number to it, then if the number you end up with is less than what you started with, you wrapped around. However, when I went back to my original code, which does it with an int and a > 0xFF check, it worked again. I guess I was thinking that by adding a bunch of 8-bit values together it would wrap and never be greater than 0xFF. I dunno. I'm tired I guess. :p Thanks for pointing that out. Anyway, I took my newer code and fixed that fuckup. Everything is fine now.
What happened to you there is that in C/C++, when you do arithmetic on integer types smaller than int, they get promoted to int to do the arithmetic. So it's exactly like hap said.
Code:
if( (CPU_A + Value + Carry ) < CPU_A) // Check if Carry will Result
In this case, CPU_A, Value and Carry get promoted to ints. When it adds them together, it's adding three ints, and the result is an int. Then it's comparing int < CPU_A, again with int on both sides. To really compare an 8-bit value to CPU_A, you would have to cast the thing on the left to an 8-bit type. I think it would still get promoted back to int to do the comparison, but it would only have the bottom 8 bits set.
It's important to remember the rule about small integer types getting promoted to int before doing math on them, because you can get surprised if you do something like this:
Code:
short X = (myShort1 + myShort2) >> 2;
If you expected it to add two 16-bit values and then SHR a 16-bit value and store it in the 16-bit variable... that's not what happens. Both the add and the shift promote their arguments to int, so any carry into bit 16 of the add will end up in bit 14 of your result.
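(A concrete illustration of that promotion, with made-up values:)
Code:
#include <stdio.h>

int main(void)
{
    unsigned short a = 0xC000, b = 0xC000;

    /* a and b are promoted to int, so a + b is 0x18000: the carry into
       bit 16 is kept, and the >> 2 then moves it down into bit 14. */
    unsigned short x = (unsigned short)((a + b) >> 2);

    printf("%04X\n", (unsigned)x); /* prints 6000; pure 16-bit math would give 2000 */
    return 0;
}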
[Edit:
I found this PDF with Google. It has some good slides describing how integer promotions work, and some surprising consequences. E.g. if you have an unsigned char C = 0x55 and you write ~C, what you actually get is a negative *signed* int with all the high bits set (-86, not 0xAA)! If you then assign it to an unsigned char variable, nothing bad will happen -- but I wouldn't recommend doing arithmetic with that value unless you completely understand the integer promotion rules.
]
It's a bad idea to assume the sizes of the integral types in C and C++. It's much better to use unsigned int with bit masks. For example:
(unsigned char) x -> x & 0xFF
~x -> x ^ 0xFF or x ^ 0xFFFF
x + y < x -> ((x + y) & 0xFF) < x
If you feel you must use integral types instead of masks, use uint8_t and uint16_t, accessible with #include <stdint.h>.
Well, it's good to know, as prior to this project I never had to worry much about the sizes of variables. But I understand now why what I tried to do failed to work as intended. I'll read that PDF when I get the time. You're all very helpful.
As for my emulator: last night I added sprites, and with that plus drawing the first name table, it was enough to play Donkey Kong and some other games. But strangely, Mario Bros. exhibits some interesting behavior. When the player starts, he falls to the very bottom of the screen. He cannot move. The enemies come out and also fall to the bottom, but they CAN move, and you die. :p
If anyone has a guess as to what could cause this, let me know. I'll look into it when I get back to work on this, but I think it's an interesting problem. I don't think it could be CPU core related, since I passed all the tests, so I was thinking it's related to the NES hardware side. I don't think I emulate the CPU reading from video memory yet, so possibly that, or many other things perhaps.
I agree with what blargg said about the data types. I use stdint.h and most of my data types are uint8_t or uint16_t; that way you know exactly what types they are.
matt
I fixed Mario Bros and Pac-Man. They needed to read from the PPU. I think that's strange though to read from the nametables to handle collision. But either way, that fixes any game that needs to read from VRAM now. =)
MottZilla wrote:
I think that's strange though to read from the nametables to handle collision.
I doubt either game does that. SMB reads from CHR-ROM because that's where its title screen arrangement is stored. Early games stored data in CHR-ROM once they ran out of space in PRG.
blargg wrote:
It's a bad idea to assume the sizes of the integral types in C and C++.
Two specifications impose constraints on a C compiler: the C standard and each platform's application binary interface (ABI). In C, char means byte, and C guarantees that char is always at least 8 bits (CHAR_BIT >= 8).[1]
[1] The ABIs of the most popular platforms (x86, PowerPC, ARM) guarantee that CHAR_BIT == 8, making unsigned char and uint8_t equivalent.
If you do choose to rely on an aspect of your platform's ABI, there are methods to perform assertions at compile time. (Caution: a #if block isn't always the best choice, as it doesn't work for sizeof and other things that are evaluated after preprocessing is done.) I use code similar to this:
Code:
#define CTASSERT(name, condition) \
extern const char name[(condition) ? 1 : -1];
CTASSERT(char_is_8_bits, CHAR_BIT == 8)
CTASSERT(int_is_4_bytes, sizeof(int) == 4)
If an ABI assertion fails, compilation fails with a diagnostic about an array of negative size. That's a lot better than compiling a binary whose behavior isn't well defined.
Quote:
If you feel you must use integral types instead of masks, use uint8_t and uint16_t, accessible with #include <stdint.h>.
Not all compilers have been updated with <stdint.h>, which C99 introduced. Sometimes you have to rely on "config.h" or ABI assertions on platforms where you don't have a good C99 compiler. But in this specific case, I think masks are the better choice.
It's more straightforward to verify the ranges of the types, which is ultimately what you would be relying on:
#include <limits.h>
#if UCHAR_MAX != 0xFF || USHRT_MAX != 0xFFFF || UINT_MAX != 0xFFFFFFFF
#error "unsigned char must be 8 bits, unsigned short must be 16 bits, and unsigned int must be 32 bits"
#endif
Disch wrote:
MottZilla wrote:
I think that's strange though to read from the nametables to handle collision.
I doubt either game does that. SMB reads from CHR-ROM because that's where its title screen arrangement is stored. Early games stored data in CHR-ROM once they ran out of space in PRG.
Oh, so it's more likely they stored an array of the collision map in CHR-ROM? Either way, Pac-Man and Mario Bros. have no collision data if you aren't responding to VRAM read requests.
Anyway, my emulator has been coming along. I've added sprite support: sprite DMA transfers, OAM I/O, 8x8 sprites, 8x16 sprites, sprite flipping, and sprites in their correct palettes. After that, before working on rendering the screen properly, I wanted to work on attribute tables for the background tile palettes. Right now I've just been rendering NT #0 at VBlank with sprites, so I could test simple games before working on a real renderer.
I never quite understood, with my attempts at NES homebrew, how anyone could work with the attribute table. It seemed like such a bastard to me. But I knew there was a way to calculate it, and I did, with success. So now the emulator is doing pretty well, I think. =) I even cheated to make SMB run by setting sprite #0 hit after a certain number of cycles. The background color is correct too, which was the main reason I wanted to run it right now anyway, since I'd heard many people make an error in mirroring which causes the background to be black.
About variable sizes: I don't have a problem with them, I just never knew what you were actually working with when adding various types together. :p
Super Mario Land on the GB used the tilemap in VRAM for collisions and identifying tiles.
But then, the VRAM on the Game Boy and Game Boy Color is mapped into CPU space (even if not dual-ported), so that was more practical.
A trick I came up with for the renderer. It actually consists of several parts:
1) Pre-render CHR to a separate graphics buffer (where individual pixels are stored in their own byte). That will make rendering faster and easier, since you don't have to decode 2bpp repeatedly.
For CHR-ROM this can be done once, on ROM load.
For CHR-RAM you'll need to re-decode an 8x1 section of the tile every time CHR-RAM is written to via $2007. This isn't a big deal, since $2007 isn't written to anywhere NEAR as often as pixels are rendered... so this approach still pays off.
Needless to say, you'll still have to maintain the raw CHR buffers (you can't replace them with these graphics buffers), because you'll still need to respond to $2007 reads and other things.
2) When you decode CHR, each pixel can be one of 4 colors (2bpp). Have these colors be:
0x00, 0xFD, 0xFE, 0xFF
Don't use 0,1,2,3 -- the reason is explained below.
3) When you're applying attribute bits to this CHR, your attribute will be 0x00, 0x04, 0x08, or 0x0C as you'd expect... but don't use those values directly. OR them with 0x03:
0x03, 0x07, 0x0B, 0x0F
4) With this setup, attributes and transparency can be easily applied with a simple AND operation, rather than the conditionals and ORs you might otherwise need:
Code:
outputpixel = decoded_chr_pixel & attribute;
I found that before I did this trick... I had to have something like the following:
Code:
output_pixel = decoded_chr_pixel;
if(output_pixel != 0)
output_pixel |= attribute;
The single AND is preferable to the conditional+OR.
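To see it with actual numbers: a color-1 pixel decodes to 0xFD, an attribute of 0x08 becomes 0x0B after the OR, and 0xFD & 0x0B = 0x09 -- exactly attribute|color. A transparent pixel is 0x00, and 0x00 AND anything is 0, so transparency falls out for free. That's why the non-zero colors have all their upper bits set.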
Anyway just a trick. You don't have to use it... I'm just throwing ideas at you ^^
Well, I have quite alot of games running fairly well now. Including UNROM games like MegaMan and Contra. =)
For rendering right now I started off with an inaccurate tile based, full screen renderer.
For CHRROM games, I decode all tiles on load of the ROM into an array of tiles (enough for 256kb CHRROM). These are used for quicker and easier rendering. I also have an array of pointers to point to the ROM data that is swapped in for reading from the PPU. Sections can be as small as 1K.
For CHR-RAM, there is an array (512 bytes) which consists of 0s on load. Anytime CHR-RAM is written to, it figures out which tile was modified and marks that it must be decoded before it can be rendered. Then my renderer checks for needed updates before drawing.
As for the values, the tile arrays use 0,1,2,3. I haven't had any issue with that yet.
1943 is giving me issues (sprites not appearing) I need to trace so I'll have that to look at tomorrow. But also I will get to work on the "Line Renderer" which will render a line after the CPU completes it, which will allow for the sprite 0# and other split screen scrolling effects. That will get SMB and I think Excitebike working.
I have to say working on the graphics has been alot more fun than working on the CPU core was. Not that it wasn't fun, but the CPU was more time consuming, frustrating, and the rewards weren't easily visable. However with graphics today, I went from a static name table display with incorrect colors all the way to a display with correct colors, full sprites, scrolling, etc.
Overall I'm just happy to have gotten somewhere with this project. I really wasn't sure I would get anywhere with this until I first saw Donkey Kong plotting something on the name table. And that was trumped when I finally say the actual graphics.
- Cool tip, but here's how the CHR is decoded:
Code:
unsigned char layerA = (src[8] & 0xAA) | ((*src >> 1) & 0x55);
unsigned char layerB = ((src[8] & 0x55) << 1) | (*src & 0x55);
unsigned char *buf = dst;
*buf = (layerA >> 6); buf++;
*buf = (layerB >> 6); buf++;
*buf = (layerA >> 4) & 3; buf++;
*buf = (layerB >> 4) & 3; buf++;
*buf = (layerA >> 2) & 3; buf++;
*buf = (layerB >> 2) & 3; buf++;
*buf = layerA & 3; buf++;
*buf = layerB & 3;
- src is a pointer into the CHR data.
- dst is a pointer to the decoded CHR data.
I decode CHR by masking off the bits needed, bit shifting, and combining them to get the final value (0, 1, 2, or 3). It really isn't very hard to do. In fact, I did it before this project: I was making a NES map editor for a homebrew ROM and wanted to be able to load the NES graphics rather than a converted BMP.
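(For reference, a minimal sketch of that straightforward mask-shift-combine approach -- not the exact code, just the same idea:)
Code:
// Decode one 8x8 NES tile (16 bytes: bitplane 0, then bitplane 1)
// into 64 bytes, one pixel per byte, values 0-3.
void DecodeTile(const unsigned char* src, unsigned char* dst)
{
    for(int y = 0; y < 8; y++)
    {
        unsigned char plane0 = src[y];     // low bit of each pixel
        unsigned char plane1 = src[y + 8]; // high bit of each pixel
        for(int x = 0; x < 8; x++)
        {
            int bit = 7 - x;               // leftmost pixel is bit 7
            dst[y * 8 + x] = ((plane0 >> bit) & 1)
                           | (((plane1 >> bit) & 1) << 1);
        }
    }
}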
I'm sure there are many ways to decode CHR; there's no one right way to do it. Your way, Fx3, I'd have to give some study to fully understand. I'm sure it works, but so does my way, which I find much easier to understand. After all, I wrote it. :p
I'm sure I'll have some questions when it comes time to emulate the APU. But so far the hardest part has been getting the CPU up and working.
Fx3's algorithm appears to put the odd pixels (x=1, 3, 5, 7) in one "layer" and the even pixels (x=0, 2, 4, 6) in the other.
tepples wrote:
blargg wrote:
It's a bad idea to assume the sizes of the integral types in C and C++.
Two specifications impose constraints on a C compiler: the C standard and each platform's application binary interface (ABI). In C, char means byte, and C guarantees that char is always at least 8 bits (CHAR_BIT >= 8).[1]
[1] The ABIs of the most popular platforms (x86, PowerPC, ARM) guarantee that CHAR_BIT == 8, making unsigned char and uint8_t equivalent.
There are several things that are not guaranteed in C or C++ (except maybe in C99 or the latest version of the C++ spec, I dunno?). However, they are true on practically all platforms that anyone has made in the past 20 years, and will continue to be true basically forever:
(1) A byte is 8 bits, and types exist which are 8, 16 and 32 bits in size. Modern compilers all support 64-bit integer types also. Except you might not know which types are which size! What most people do is simply define their own types for known sizes. Then if you want to support multiple compilers or port to a different platform, it's easy to supply alternate definitions.
In my own code, I usually use the following definitions:
Code:
typedef unsigned char U8;
typedef signed char S8;
typedef unsigned short U16;
typedef signed short S16;
typedef unsigned int U32;
typedef signed int S32;
typedef unsigned long long U64;
typedef signed long long S64;
Then I use those types everywhere, so that it is easy for me to keep track of what is going on when I do arithmetic or other operations on them. The only time I would use "int" or "unsigned" is as a loop counter where I'm not doing any operations with the counter that mix it with those fixed-size types. For example, if I'm only using it to index an array or something, then I might use "int" or "unsigned". But even then I tend to prefer U32 or S32 for loop counters. If it makes you feel better, then typedef these to the new language types (uint8_t or whatever) but I've personally never bothered to do that.
(2) Integers are stored using 2's complement representation for negative integers (i.e. the top bit is the sign bit, there is only one representation of zero--with all bits clear--and the representation of -1 is the number with all bits set. Contrast this with floating-point numbers, where they actually have *two* representations of zero). No one has made a machine with other int representations for at least 20 years.
(3) NULL pointers to any data type (including void*) can be represented by a bit-pattern of all clear bits. So you can (for example) use memset(data, 0, sizeof(MyStruct)); to clear a structure, and assume that any pointers in it are now NULL. The C/C++ languages actually allow the implementation to use almost anything they want for a NULL pointer--even different values for different types! But nobody does this, and too much existing code would break if they ever tried to change it. So go ahead and assume it.
(4) Most platforms nowadays are "32-bit", which means sizeof(int)==4 and sizeof(void*)==4 (in fact, the size of any pointer type, except C++ pointer-to-member types, should be 32 bits). If you want to be future-proof for 64-bit platforms, it's a good idea to keep in mind that their pointer types might be 64 bits instead of 32. But supporting those two combinations should be plenty for most code (unless you plan to port it to cell phones or something... and most of those have 32-bit processors now anyway).
(5) "Natural" alignment: this is not guaranteed on every platform, but it works on all x86-based platforms (as well as all of the common PPC-, Sparc- and ARM-based platforms, and probably most others). Basically, small types like to be aligned to their size (i.e. a 4-byte integer type should be aligned on a 4-byte boundary, i.e. bottom 2 bits of its address should be zero). Structures need alignment and size to the largest alignment of any of their members. *Also a structure's size is rounded up to a multiple of its alignment by adding padding at the end*, so that if you have an array of that struct, the members of the array are all properly aligned. Classes == structures (but if there are any virtual methods or virtual base classes, assume the compiler added some crud to your structure that you can't see to support the virtual stuff). On some platforms, a mis-aligned type is harmless (on x86 this is anything 8-byte-aligned or less), though it is probably slower to access. In other cases it is NOT harmless and causes the program to crash! So compilers have to insert extra code to do misaligned accesses (which is a lot slower), AND they have to know that they're doing it---so if you cast a structure pointer to an aligned U64* for example, you might get crashes because you tricked the compiler into thinking the data accessed through the pointer would be aligned when it isn't.
Anyway, you can avoid nearly all alignment problems if you use "natural alignment" for all of your data: Simply don't change structure packing from the compiler default (some people like #pragma pack(1) and such, but I always avoid them because of these alignment requirements), and always put the larger members of your structure first, *or* count the sizes of the members to make sure the later ones are properly aligned:
Code:
struct Foo
{
    U8   m_type;
    U8   m_flags;
    U16  m_blockSize; // <-- offset 2, "natural" alignment == 2
    U8*  m_pData;     // <-- offset 4, "natural" alignment == 4 (on most platforms anyway)
    U16  m_dataAge;   // <-- offset 8, "natural" alignment == 2
    U16  m_padding0;  // <-- only exists to make the next field 4-byte aligned
    U32  m_counter;
};
Two things to notice about this little example:
(1) I assumed that sizeof(U8*) == sizeof(U32) == 4. You can always check that with a compile-time assertion, but it's true on all 32-bit platforms. (NOT necessarily on some of the newer 64-bit platforms, though! There the compiler would insert an extra 4 bytes of padding before the m_pData field!)
(2) I inserted a 2-byte m_padding0 field just so that m_counter would have the proper alignment. Actually, the compiler will insert padding by itself (if it's necessary, and unless you've told it not to)... but I prefer to stick to the "natural alignment" rule by inserting padding fields myself, so that the compiler never has to add them. That makes it easier to manually add up the size of the structure at a glance, too.
[Edit: I forgot to describe the main usefulness of the "natural alignment" rule... many platforms, such as x86 for example, have rules where a 2-, 4- or 8-byte type can have any alignment you want, but if it happens to cross a cache line boundary then it will be slower to access (sometimes much slower). Or they have rules where the integer types support misaligned accesses but the floating point types don't. So if you just stick to "natural alignment", then you guarantee that no 4-byte or 8-byte type is ever going to cross a 32- or 64-byte cache line boundary, and you avoid having to deal with any of those special cases. "Natural alignment" is a simple rule that's easy to follow, and will avoid 99% of potential alignment problems for most code.]
Anyway, just some ideas. Happy coding!
mozz wrote:
tepples wrote:
Two specifications impose constraints on a C compiler: the C standard and each platform's application binary interface (ABI).
There are several things that are not guaranteed in C or C++ (except maybe in C99 or the latest version of the C++ spec, I dunno?). However, they are true on practically all platforms that anyone has made in the past 20 years
That's what I meant by ABI constraints.
Quote:
(1) A byte is 8 bits
CHAR_BIT (number of bits in a byte) can be larger than 8 on some digital signal processors, which might have, say, 32-bit bytes. But I agree that most of us won't ever write NES emulators for such architectures.
Quote:
In my own code, I usually use the following definitions:
Code:
typedef unsigned char U8;
typedef signed char S8;
typedef unsigned short U16;
typedef signed short S16;
typedef unsigned int U32;
typedef signed int S32;
typedef unsigned long long U64;
typedef signed long long S64;
Those names sound familiar. Did you learn them from the GBA scene?
Quote:
Classes == structures
That's actually true per the C++ standard. Within C++, the only difference between the two is the default access (public vs. private) of members declared before the first access specifier.
Quote:
Anyway, you can avoid nearly all alignment problems if you use "natural alignment" for all of your data
You'll also have to use byte-wise I/O for file formats, as plenty of common file formats (such as .bmp) do not use natural alignment.
Quote:
(3) NULL pointers to any data type (including void*) can be represented by a bit-pattern of all clear bits. So you can (for example) use memset(data, 0, sizeof(MyStruct)); to clear a structure, and assume that any pointers in it are now NULL.
Being portable costs very little in this case. Instead of
Code:
MyStruct* s = ...
memset( s, 0, sizeof *s );
you can do
Code:
MyStruct* s = ...
static const MyStruct zero = { 0 };
*s = zero;
This will work properly even if MyStruct has floating-point types in it. If you are declaring MyStruct locally, you can even just do
Code:
MyStruct s = { 0 };
Quote:
(4) Most platforms nowadays are "32-bit", which means sizeof(int)==4 and sizeof(void*)==4 (in fact, the size of any pointer type, except C++ pointer-to-member types, should be 32 bits). If you want to be future-proof for 64-bit platforms, it's a good idea to keep in mind that their pointer types might be 64 bits instead of 32.
If you're coding for a modern platform, why not use intptr_t (or uintptr_t)? The reader then knows that you're stuffing a pointer into an int, and it's guaranteed portable.
Quote:
(5) "Natural" alignment: this is not guaranteed on every platform, but it works on all x86-based platforms (as well as all of the common PPC-, Sparc- and ARM-based platforms, and probably most others). Basically, small types like to be aligned to their size.
This pretty much has to be the case, because it's guaranteed that the elements of an array of T are sizeof (T) bytes apart. So the only way a type's alignment wouldn't be sizeof (T) as well is if proper alignment required some fixed offset -- for example, if sizeof (int) were 4 and proper alignment required that its address % 4 be some non-zero value.
tepples wrote:
CHAR_BIT (number of bits in a byte) can be larger than 8 on some digital signal processors, which might have, say, 32-bit bytes. But I agree that most of us won't ever write NES emulators for such architectures.
Who would refer to it as a "Byte" rather than a "Word" if it's more than 8 bits large?
blargg wrote:
This will work properly even if MyStruct has floating-point types in it. If you are declaring MyStruct locally, you can even just do
Code:
MyStruct s = { 0 };
Interesting, thanks for that idiom!
Usually when memset is used, it's for something that is not statically initialized (e.g. an array of structures on the stack or something), but memset is pretty slow on some platforms anyway, so your way might be faster for individual stack-allocated structs. And being portable is nice too, of course!
blargg wrote:
If you're coding for a modern platform, why not use intptr_t (or uintptr_t)? The reader then knows that you're stuffing a pointer into an int, and it's guaranteed portable.
That might work, but then you either have to avoid assuming a fixed size for those types (is it 4 bytes or 8? It depends on the size of the pointers, *and also the size of int on your platform*), or you have to check sizeof(intptr_t) in your code, at which point I'd rather be making my own union type anyway.
Depending on your reason for doing such tricks (I usually encounter them in the context of a memory size optimization), you might need to know the pointer size, in which case you are better off using your fixed-size types (as well as putting a compile-time assertion near the code that uses them, which serves to both document and check the assumption). What I've found over the years is that I hate programming with types whose sizes I don't know.
But there is no easy way to avoid it if you want a pointer-sized union... oh well.
blargg wrote:
Quote:
(5) "Natural" alignment: this is not guaranteed on every platform, but it works on all x86-based platforms (as well as all of the common PPC-, Sparc- and ARM-based platforms, and probably most others). Basically, small types like to be aligned to their size.
This pretty much has to be the case, because it's guaranteed that for an array of T, elements will be sizeof (T) bytes apart. So the only way a type's alignment wouldn't be sizeof (T) bytes as well is if it were at some offset, for example if sizeof (int) were 4 and proper alignment required that its address % 4 be some non-zero value.
Close, but don't forget that you can have a type T where sizeof(T)==4 but the alignment required for T is only 1, for example. The "natural alignment" rule suggests that you align on size anyway, even if that is more than the CPU strictly requires, so e.g. 8-byte double variables should be aligned on an 8-byte boundary, even if some platforms would be perfectly happy with a 4-aligned or 1-aligned double.
Many compilers will already align structure members to natural alignment for you (all x86 compilers I know of do this by default). Knowing this rule means you can put the fields in the struct in an order where the compiler doesn't have to insert padding (or inserts only minimal padding). For example, if you have some U8s and some U32s in the same struct, either put all the U32s first, or make sure you group four U8s together, so that the U32s are 4-byte aligned. If you don't do that, the compiler might need to insert more padding in the struct to satisfy its alignment rules. I'm not sure if it's legal according to the C/C++ specs for compilers to *reorder* the fields in your struct, but I've never seen a compiler that does that; instead they just add padding whenever the next field would not otherwise be properly aligned.
blargg wrote:
why not use intptr_t (or uintptr_t)?
Because not everybody has a C99 compiler. And because C++ compilers aren't yet required to provide C99's new types as an extension.
Dwedit wrote:
tepples wrote:
CHAR_BIT (number of bits in a byte) can be larger than 8 on some digital signal processors, which might have, say, 32-bit bytes. But I agree that most of us won't ever write NES emulators for such architectures.
Who would refer to it as a "Byte" rather than a "Word" if it's more than 8 bits large?
octet n. A vector of eight bits. [From Latin octo = eight.]
byte n. A vector of bits whose size is that of an "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment" (C standard, clause 3.6). [From "bite", modified in spelling to distinguish it from "bit".]
word n. A vector of bits whose size is a machine's preferred size for integers, floats, or addresses.
On x86, PowerPC, MIPS, and ARM, a byte is the same size as an octet. On some specialized architectures, a byte is the same size as a word. C makes no explicit provision for architectures that have different sizes of bytes for different regions of memory, such as the VRAM of some Nintendo handhelds.
Back to, you know, writing emulators...
My emulator is coming along nicely, I think. I started working on MMC1, and it was a bitch because of various things that weren't very clear to me. But I've managed to get it working, I think, for everything with the exception of the 32K switching mode. Does anyone know of an MMC1 game that uses 32K switching? Also, are there any games where I need to worry about what happens to PRG when you change between 32K and 16K modes, what gets mapped where, etc.?
I also finally wrote a real scanline renderer. Prior to this I was just using a hacked-up version of my tiled screen renderer. This should allow me to better emulate sprite 0 hit, I think.
Most importantly, perhaps, I found out my emulator timing was totally broken. NMI would happen at a constant rate and all, but it wasn't the correct number of cycles, and the VBlank period and such was just missing. Whoops. :p
Not so important but nice: I looked at Loopy's docs to figure out why SMB's status bar flickered. That fixed it, and I'm pretty happy with my progress now.
MottZilla wrote:
Does anyone know of an MMC1 game that uses 32K switching?
Dragon Warrior 3 + 4 use it.
How about one that doesn't also use the 512K or 1024K cart banking? :p
1024K MMC1 cart banking does not exist, no matter how many DW4 overdumps you find.
The "Forbidden Four" multicart example by tepples uses 32K bankswitching and a size of 256K.
MottZilla wrote:
Code:
void ADC(unsigned char Value)
{
    unsigned char Carry = CPU_P & 0x01;

    // Check for Carry
    if( (CPU_A + Value + Carry) > 0xFF ) // Check if Carry will Result
    {
        CPU_SETC = 1;
    }
    else
    {
        CPU_SETC = 0;
    }

    // Check for Zero
    if( (CPU_A + Value + Carry) == 0 ) // Check if Zero will Result, Set Flag accordingly.
    {
        SetZero();
    }
    else
    {
        ClearZero();
    }

    // Check for Overflow
    CPU_TEMP = CPU_A + Value + Carry;
    CPU_SETV = 0;
    if(!((CPU_A ^ Value)&0x80) && !((CPU_A ^ CPU_TEMP)&0x80))
        CPU_SETV = 1;
    if(CPU_SETV)
    {
        SetOverflow();
    }
    else
    {
        ClearOverflow();
    }

    // Do ADC Operation
    CPU_A = CPU_A + Value + Carry;
    if(CPU_SETC == 1)
    {
        SetCarry();
    }
    else
    {
        ClearCarry();
    }
}

void SBC(unsigned char Value)
{
    ADC(Value ^ 0xFF);
}
A simply hideous amount of code, if I may say, for something as simple as ADC. Here is WedNESday's code:
Code:
CPU.TMP2 = (char)CPU.A + (char)CPU.Databus + CPU.CF;
if(CPU.TMP2 < -128 || CPU.TMP2 > 127)
    CPU.OF = 0x40;
else
    CPU.OF = 0x00;
CPU.NF = CPU.ZF = CPU.A = CPU.CF = CPU.A + CPU.Databus + CPU.CF;
CPU.CF >>= 8;
No memory addressing provided; CPU.Databus holds the byte fetched. In my experience, ifs and elses are what slow down an emulator the most, especially in any pixel-rendering functions. And as for calling ADC(Value ^ 0xFF) for your SBC code, you should really __forceinline everything to make sure that it is as fast as possible.
WedNESday wrote:
you should really __forceinline everything to make sure that it is as fast as possible.
__forceinline is not a C++ keyword, but rather one of those "MSVC only" keywords that VC++ adds. I would recommend against using any sort of compiler add-on that isn't part of the standard (any function/keyword that is preceded by underscores should throw a red flag -- avoid all of them). "inline" should suffice... and is probably better than __forceinline anyway, since inlining doesn't always produce faster code, and 'inline' lets the compiler detect those instances whereas __forceinline will not.
Preferably, I would even #define calling conventions elsewhere in the code and use the #defines rather than using the calling convention directly. This way, if you run into calling convention problems with other compilers or platforms, you can easily change the #define and remove all the related problems:
Code:
// what I would recommend against
void inline BadExample()
{
}

// what I would recommend
// in some header file
#define NES_INLINE inline

// in source
void NES_INLINE GoodExample()
{
}
Fx3's ADC code:
Code:
CPUOP(ADC0)
    offset = cpu->A + value;
    if(cpu->status & C_BIT)
        offset++;
    cpu->status &= ~(C_BIT | V_BIT);
    if(offset & 0xFF00)
        cpu->status |= C_BIT;
    if((cpu->A ^ offset) & (value ^ offset) & 0x80)
        cpu->status |= V_BIT;
    cpu->A = (unsigned char)(offset);
    set_sz_flags(cpu->A);
OPEND
Yes, well, this is the first emulator I've ever written, thus I'm not concerned with the code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient, that'd be something to worry about later.
MottZilla wrote:
Yes, well, this is the first emulator I've ever written, thus I'm not concerned with the code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient, that'd be something to worry about later.
Right, I agree. Anyway, I sense a grain of salt in your commentary... we're just showing examples for you, as advice/tips only, which can target some optimization, and I bet you're not allowed to kick off.
Here's a fully portable version, written for clarity. All variables are of type int. Who says efficiency and clarity are always at odds?
Code:
overflow = ((a ^ 0x80) + (operand ^ 0x80) + carry - 0x80) & 0x100;
temp = a + operand + carry;
carry = temp >> 8;
a = temp & 0xFF;
// update negative and zero flags based on a
// ...
EDIT: Actually, Wednesday's overflow checking is clearer. Untested:
Code:
temp = (int8_t) a + (int8_t) operand + carry;
overflow = (temp < -128 || temp > 127);
carry = ((a + operand + carry) >> 8) & 1; // carry must come from the unsigned sum, not the signed temp
a = temp & 0xFF;
// update negative and zero flags based on a
// ...
Fx3 wrote:
MottZilla wrote:
Yes, well, this is the first emulator I've ever written, thus I'm not concerned with the code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient, that'd be something to worry about later.
Right, I agree. Anyway, I sense a grain of salt in your commentary... we're just showing examples for you, as advice/tips only, which can target some optimization, and I bet you're not allowed to kick off.
Not at all. I was mainly addressing WedNESday, who was insulting my code. :p
I do appreciate all the help and insight.
Hmm... it's criticism, as far as I can tell. ^_^;;
I suppose. I just took offense at the word "hideous", considering this is a topic about n00b emulator tips. :p
My emulator runs a lot of games now, but I'm having issues with sprite 0. I can't seem to get any luck so far with implementing an accurate emulation of it, which is important for a lot of games. Right now I just sort of fake it so it works decently enough.
I also went and added support for AxROM (mapper 7) so I could try out Battletoads. The scrolling was very much messed up. Part of it seems to have to do with updating the scroll at sprite 0 in the non-standard way. I tried adding that in, and it helped, but it wasn't always right. I didn't put too much into it though, since my sprite 0 timing is BS anyway.
I'm not really sure what I want to do next. Ideally, though, it would be getting cycle-accurate (at least close to it) sprite 0 hit. That might require some more rendering adjustments and such. I had tried doing it one way that should have worked, but for some reason it didn't seem to help at all.
MottZilla wrote:
Fx3 wrote:
MottZilla wrote:
Yes, well, this is the first emulator I've ever written, thus I'm not concerned with the code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient, that'd be something to worry about later.
Right, I agree. Anyway, I sense a grain of salt in your commentary... we're just showing examples for you, as advice/tips only, which can target some optimization, and I bet you're not allowed to kick off.
Not at all. I was mainly addressing WedNESday, who was insulting my code. :p
I do appreciate all the help and insight.
I wasn't being insulting; it was just a bit shocking to see so much code, that's all. Btw, Fx3's ADC code is just as big, but blargg's seems nice and small. And yes, you're right, it's your first emulator, so you don't have to worry too much about efficiency at this stage; just get the damn games to work and worry about other things later. As for inline/__forceinline, since the opcodes are only called from one place in the emulator, __forceinline is the best option IMO.
- I like to discuss programming skills and optimizations. Mr. Wed, considering the number of ADC opcodes in a game... well, it's quite rare if you compare it with LDA, for example. And I don't think my code is as big as the previous one, heh ^_^;;
Fx3 wrote:
- I like to discuss programming skills and optimizations. Mr. Wed, considering the number of ADC opcodes in a game... well, it's quite rare if you compare it with LDA, for example. And I don't think my code is as big as the previous one, heh ^_^;;
Of course it's not as big. And I know that certain opcodes are called more times than others, but I have spent ages constantly refining each opcode to make it as fast as I can. Btw Disch, since each opcode is called from only one place in the CPU code, you would want them inlined. You don't need to worry too much about the code getting too big, etc.; that's only a concern if you are constantly calling the same inline function in too many places. If you write an opcode and only call it once inside the switch(WhichOpcode) bit, then the code ends up exactly the same size. I must admit, I stole the concept of __forceinline rather than just inline from Nintendulator anyway.
What does __forceinline do? It sounds to me like it replaces the function call with inline code? So it sounds to me like the compiler doesn't actually call your function but instead any part of code that uses it actually has it placed right in there? Let me know if my idea is close. :p
Anyway, I'm curious: what do you guys know about Battletoads? It seems to be one of the best games for testing emulator accuracy, and it conveniently uses one of the simplest memory mappers.
I've read you need pretty accurate Sprite #0 hit flag timing, and accurate timing in general. The game runs on my emulator, but the problem I'm having is the scroll offset. The game appears to use the scrolling technique that loopy's document is about. I "tried" to implement scrolling adjustments mid-frame for Battletoads. It was close but not correct.
I do seem to have the name table switching by writing to $2006 correct, at least enough that Super Mario's status bar doesn't flicker. But I'm not clear on how you change the scroll offset.
I think what I did was take the first write to $2006 and mask for the lower 2 bits, then shift those left 3 bits. I saved that number until the second write. Then I would combine those 2 bits with the second write masked for the upper 3 bits and shifted right 5 bits. Then I would set ScrollY to that value * 8.
As I said it was close and in some places it was correct but not everywhere.
WedNESday wrote:
__forceinline is the best option IMO.
Just to reiterate...
__forceinline is not a standard C++ keyword. I'd say it's never a good idea to use it, simply because it may give you trouble with compilers that don't support it -- which will make portability or even a public source release somewhat problematic.
Plus inline (which is a standard keyword) does the same job. The only difference is that __forceinline doesn't detect conditions where inlining isn't favorable. You say it's favorable for CPU functions and I don't disagree -- but the truth is you shouldn't substitute your judgement for the compiler's. Inline doesn't always mean faster... and in the off chance you happen to inline a function where inlining reduces performance, the compiler will correct that (that's its job) -- whereas with __forceinline you end up screwing yourself.
Any situation where it is favorable to have stuff inlined, inline works just as well as __forceinline.
So yeah -- __forceinline should never be the way to go, IMO.
But again -- this is another reason to #define calling conventions, since whoever compiles your source can change the define to inline rather than __forceinline if they choose -- rather than having to go and change every function.
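For example, a minimal sketch of that kind of define (CPU_INLINE is just an illustrative name):
Code:
/* One define controls the convention everywhere; whoever builds the
   source edits this single line instead of touching every function. */
#define CPU_INLINE inline   /* or __forceinline, at the builder's choice */

static CPU_INLINE unsigned char LoByte(unsigned short w)
{
    return (unsigned char)(w & 0xFF);   /* trivial example function */
}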
EDIT (MottZilla replied while I was typing)
Quote:
What does __forceinline do? It sounds to me like it replaces the function call with inline code? So it sounds like the compiler doesn't actually call your function, but instead any code that uses it has the function's body placed right in there? Let me know if my idea is close. :p
Yeah sounds like you have it right. Function inlining makes it so that when you call a function, it doesn't actually jump to that function -- rather, the function gets sort of copy/pasted into the area that calls it.
This is good because there's a little overhead for function calling (variables pushed on stack and whatnot) which is avoided if the function is inlined.
But it can also be bad because it can greatly bloat code size, which may cause the program to run slower.
Quote:
Anyway, I'm curious: what do you guys know about Battletoads?
It's very picky about timing. If your NMI isn't timed just right, or if your sprite 0 hit is a little off, the game can very easily deadlock on level 2. It's also picky about when in the scanline you reset the horizontal scroll and increment the Y scroll, etc. Doing these at the wrong times can cause it to deadlock.
Quote:
I do seem to have the name table switching by writing to $2006 correct, at least enough that Super Mario's status bar doesn't flicker. But I'm not clear on how you change the scroll offset.
How it works is that the PPU address set by $2006 is the same address the PPU uses to fetch tiles to render. During rendering, every time the PPU fetches a tile it increments the address so that it points to the next tile to be displayed. I'm not really sure it helps to think of it in terms of a scroll offset.
For example... if the game sets the PPU address to $1234 by writing to $2006, this means that the next tile fetched comes from $2234 ($0234 + $2000), with a fine Y scroll of 1 ($1000 >> 12). In effect this translates to:
Y scroll: $89
X scroll: $A0 (to $A7... depending on the fine X scroll set by $2005)
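In code, decoding that address might look something like this (a sketch; variable names are illustrative):
Code:
/* Decoding a 15-bit PPU address into its scroll fields.
   Bit layout (per loopy's doc): yyy NN YYYYY XXXXX. */
void DecodeExample(void)
{
    unsigned v         = 0x1234;                /* the $2006 example above */
    unsigned coarse_x  =  v        & 0x1F;      /* bits 0-4                */
    unsigned coarse_y  = (v >> 5)  & 0x1F;      /* bits 5-9                */
    unsigned nametable = (v >> 10) & 0x03;      /* bits 10-11              */
    unsigned fine_y    = (v >> 12) & 0x07;      /* bits 12-14              */
    unsigned tile_addr = 0x2000 | (v & 0x0FFF); /* -> $2234                */
    unsigned scroll_y  = coarse_y * 8 + fine_y; /* -> $89                  */
    unsigned scroll_x  = coarse_x * 8;          /* -> $A0; add fine X from $2005 */
}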
WedNESday wrote:
I have spent ages constantly refining each opcode to make it as fast as I can.
Only a handful of opcodes are used regularly, while some are virtually never used. Optimization effort is best spent on the former.
WedNESday wrote:
Btw Disch, since the opcodes are called only once in the CPU code, you would want them to be inlined. You don't need to worry too much about the code getting too big etc.; that's only if you are constantly calling the same function inline in too many places. If you write an opcode and only call it once inside of the switch(WhichOpcode) bit, then the code would be exactly the same size.
The inline version would probably even be smaller, since the outlined one would have function call overhead. But you sometimes want a once-called function outlined if it's used rarely. If it were inline, it'd use more of the cache, since its beginning and end would be kept in the cache by the often-used code around it, and branches over it would have to hit a different cache line. It might also stress the optimizer enough that it can't optimize other parts of the function as well. As Disch says below, by using regular inline, you allow the compiler to detect things like this.
Disch wrote:
[...] the truth is you shouldn't substitute your judgement for the compiler's. Inline doesn't always mean faster... and in the off chance you happen to inline a function where inlining reduces performance, the compiler will correct that (that's its job) -- whereas with __forceinline you end up screwing yourself.
I'd normally say that the programmer sometimes knows best, but in this case, you have a very good point. With profile-guided optimization (often called PGO), it really can decide best as to what should be inlined.
But like I always say, with optimization the only authority is how something affects the speed of the actual program. If something really does speed up your program, then it's good. WedNESday, do you have any numbers for speedups you've gotten with your techniques?
Well, what you're saying can't be entirely true. Battletoads runs on my emulator and doesn't lock up. However, that's probably because Sprite Hit is faked. "Lockups" with the game seem to happen because it's stuck waiting for the flag to be set.
Like I said, the game runs, and I managed to get the scroll updates partially correct. It's always the correct scroll, give or take 8 pixels, and I tried without luck to account for those 8 pixels. But these scrolling tricks seem to tell me that I may be handling drawing the screen in a way that won't work out so well for tricky games. For instance, I have Scroll X and Scroll Y registers that I draw the screen from, and I don't keep track of the VRAM pointer other than for I/O. It seems, though, that you should be basing your rendering on the VRAM pointer.
I've been thinking that for Sprite Hit I'd have to redo the whole rendering anyway. But I'm pretty sure these changes to the VRAM pointer are the reason for all the scroll issues in games on my emulator.
Disch wrote:
Plus inline (which is a standard keyword) does the same job. The only difference is that __forceinline doesn't detect conditions where inlining isn't favorable. You say it's favorable for CPU functions and I don't disagree -- but the truth is you shouldn't substitute your judgement for the compiler's.
Unless the compiler's judgment is failing. Compilers have become smarter over the past couple decades, but they're still not perfect at determining which functions could benefit from inlining. It's entirely possible that with a given combination of compiler and CPU, #define FORCEINLINE __forceinline might produce a faster time for the test suite than #define FORCEINLINE inline. But before you start doing funny stuff like this, make sure your test suite works.
Quote:
But again -- this is another reason to #define calling conventions
Agreed. The tradeoffs vary based on instruction set, microarchitecture, and compiler. When compiler flags aren't enough, macros are a comparatively clean way to abstract over this.
But why even care about speed? My 7-year-old PC runs a game in Nestopia with the NTSC filter at 60 fps. Unless maybe you want to make a ROM picker that looks like the PS1 demo discs or the Wii Menu, emulating 12 systems at once.
MottZilla wrote:
Well, what you're saying can't be entirely true. Battletoads runs on my emulator and doesn't lock up. However, that's probably because Sprite Hit is faked. "Lockups" with the game seem to happen because it's stuck waiting for the flag to be set.
Well, since your Sprite #0 code is faked, the results cannot be relied upon, no matter whether all games work perfectly or not. It's no good saying 'Oh, game X works fine but game Y doesn't' or 'I've had no problems so far with my Sprite #0 code'. It just doesn't count.
tepples wrote:
But why even care about speed? My 7-year-old PC runs a game in Nestopia with the NTSC filter at 60 fps. Unless maybe you want to make a ROM picker that looks like the PS1 demo discs or the Wii Menu, emulating 12 systems at once.
It's good practice to make an emulator as fast as possible. In fact it's necessary. If you had to choose between a fast emulator and a slow emulator with the same compatibility, which one would you use? And frankly, if you make a NES emulator that requires a quad-core 2.66GHz Pentium with 4GB RAM, then to put it politely, your programming skills blow big time. Plus I think it's good fun too. I could write an emulator for any console very quickly, but it would be full of switch/case statements and such and would run at 1 FPS at most. When writing an emulator, I believe it's speed that dictates most of our time and effort.
Well, that's true, but I just think it's misleading not to point out that the game gets locked up because of an endless S0Hit loop. After all, Nesticle played the game. ;p
I do agree that you should make at least a reasonable effort to make your code efficient so the requirements aren't through the roof. Though you could always do some optimization later on.
MottZilla wrote:
Well, that's true, but I just think it's misleading not to point out that the game gets locked up because of an endless S0Hit loop. After all, Nesticle played the game. ;p
You are joking. Anyway, go ahead and write your stuff. Once it's working, congrats. That's enough for me.
Yes I'm joking. Because we all know Nesticle is the best. >:p
But back to things that matter. Do you guys have your emulators keep track of a VRAM pointer, or a VRAM pointer plus a reload value for it? And do you then go along updating the pointer as you render, and also update it according to various things like register writes? And of course, do you use it for rendering?
I ask because when I started I didn't know much, so I just assumed you have a Scroll X register and a Scroll Y register, which I know now don't exist. But my emulator takes Scroll X and Scroll Y from $2005 register writes, and that is what I base the drawing on. This worked until I got to games that manipulate the VRAM pointer manually to adjust the scroll.
The way I see it, if I rewrite the rendering so that it uses the VRAM pointer, it would make handling scrolling virtually automatic, and I guess it should make a lot of things easier. So I'm just wondering how you guys do it.
PocketNES doesn't track the VRAM address as the screen renders, but PocketNES isn't an ordinary NES emulator.
If you properly track the VRAM address, emulating MMC3 IRQs becomes much easier.
VRAM address? I thought it was loopy_t (the latch)... with loopy_v being the real one. Oh no, wait... it's like *another* VRAM address, as the PPU renders & accesses tiles. Weird.
Yes, the phrase "The VRAM Address" is referring to loopy_v, if you want to call it that.
Dwedit wrote:
Yes, the phrase "The VRAM Address" is referring to loopy_v, if you want to call it that.
Actually no. It's an internal address built while the PPU is accessing its memory. I never understood it 100%, but that's it.
Well, the only thing important other than your obvious I/O through $2007 is manipulating things for the scroll effects used in games. That's what I'm having trouble with, besides needing to implement accurate Sprite Hit. Does anyone have an easy-to-understand explanation of how mid-frame scrolling updates by games are handled?
From what I've read, you can manipulate an address that is formed by $2005 and $2006 writes, and on the 2nd write to $2006 (according to the latch), the scroll and nametables are set to reflect the value you wrote into the registers.
The thing is that I tried implementing this in various ways, but none functioned quite right. Though it's perfectly possible that my sprite 0 being wrong is responsible.
1. You must follow loopy's PPU logic (docs on nesdev).
2. Once it's done, go to the screen rendering. It's easy, but I had to figure it out by myself. With a few lines of code, it works nicely.
The thing I don't like about loopy's document: he uses 1's and 0's instead of x's and .'s to indicate which bits are affected.
Dwedit wrote:
The thing I don't like about loopy's document: he uses 1's and 0's instead of x's and .'s to indicate which bits are affected.
A hexadecimal format would be fine. ^_^;;
The thing is, I did follow loopy's doc and it still has issues.
How do you properly convert the values written to $2005 and $2006 to new X/Y coordinates?
Coarse X bits: 0-4 (in units of 8 pixels)
Coarse Y bits: 5-9 (in units of 8 pixels)
Most significant X bit: 10 (in units of an entire nametable, 256 pixels)
Most significant Y bit: 11 (in units of an entire nametable)
Fine Y bits: 12-14 (in units of 1 pixel)
Fine X is separate from loopy_v
Alright, hold onto your pants, time for some of my code. Tell me if something is screwed up. There must be something.
Code:
if(Address==0x2005) // Write to the "Scroll" register (2x write reg)
{
    if(PPU_LATCH==0)
    {
        Scroll_X=Byte;
        PPU_LATCH=1;
        // Mid-frame stuff:
        // Fine horizontal scroll, immediately effective.
        MF_FH=Byte&0x07;
        Scroll_X=(Scroll_X&0xF8)|MF_FH;
        // Coarse horizontal scroll (tile column), upper 5 bits.
        MF_HT=(Byte&0xF8)>>3;
    }
    else
    {
        Scroll_Y=Byte;
        PPU_LATCH=0;
        // Mid-frame stuff:
        // Fine vertical scroll, lower 3 bits.
        MF_FV=Byte&0x07;
        // Coarse vertical scroll (tile row), upper 5 bits.
        MF_VT=(Byte&0xF8)>>3;
    }
}
if(Address==0x2006) // Write to the PPU_ADDR register (2x write reg)
{
    if(PPU_LATCH==0)
    {
        VRAM_POINTER=(Byte<<8);
        PPU_LATCH=1;
        // Mid-frame stuff:
        MF_NT=(Byte&0x0C)>>2;           // nametable select
        MF_FV=(Byte&0x30)>>4;           // low 2 bits of FV; bit 2 cleared
        MF_VT=(MF_VT&7)|((Byte&3)<<3);  // high 2 bits of VT
    }
    else
    {
        VRAM_POINTER=VRAM_POINTER|Byte;
        PPU_LATCH=0;
        // Mid-frame stuff:
        MF_HT=Byte&0x1F;                      // low 5 bits -> HT
        MF_VT=(MF_VT&0x18)|((Byte&0xE0)>>5);  // low 3 bits of VT
        Scroll_Y=(MF_VT*8)+MF_FV;
        Scroll_X=(MF_HT*8)+MF_FH;
        PPU_CTRL&=0xFC;
        PPU_CTRL|=MF_NT;
    }
}
From what you posted I think I see some errors already, but please point them out in case I don't figure it out.
Edit: Well, this fixed the Zelda 2 title screen scrolling. Checking again to make sure I'm affecting the right bits everywhere. Your post definitely helped.
Edit: Updated code block.
I don't think that code you wrote is correct on $2005 writes. It looks like it's not affecting loopy_t as it's supposed to.
You can also just try using loopy_t and loopy_v directly in scrolling code.
While a scanline renders, loopy_v and fine X update. After the scanline finishes, fine X ends up back where it started (because it drew 256 pixels).
Basically there are several different events that affect scrolling...
* Writes to PPU
* Pixel clock 256 (I think?) on a scanline
Copies X scroll bits from loopy_t to loopy_v (not fine X scroll)
* "Start of Frame" (When exactly is this? Is it pixel clock 0 of first visible scanline, or pixel clock 256 of dummy scanline?)
Copies loopy_t to loopy_v (which affects all scrolling except fine X scroll)
Because X scrolling updates are delayed until pixel clock 256, when the game writes a new X scroll value to $2005 (done between pixel clock 256 and 341), it needs to include coarse X for the scanline AFTER the upcoming scanline, and fine X for the upcoming scanline. If you emulate that wrong, you get nasty artifacts in Slalom. Lots of emulators don't get it right.
Your code has flaws; although it will work with 99% of all games, it doesn't update loopy_v correctly on $2005 writes followed by $2006 writes.
How would the order affect it? Don't I just need to move the right bits into certain values after the $2005 1st, $2005 2nd, $2006 1st, and $2006 2nd writes?
The games that have scrolling issues for me are pretty important: Ninja Gaiden's cinema scenes, Zelda's vertical scrolling, and Battletoads.
MottZilla, you must use the following:
Code:
switch (Address) {
    case 0x2005:
        ...
        break;
    case 0x2006:
        ...
        break;
}
Especially when it comes down to any code that goes into the RenderPixel function. You could use if/else, but those jump tables are so neat. Trust me, a couple of extra if/switch statements inside RenderPixel and your FPS vanish.
Until the emulator handles this correctly, speed isn't an issue. If speed were, you could incorrectly handle it with maximum efficiency like this:
{
/* empty */
}
I know at this stage speed isn't important, but having a switch instead of a bunch of if/elses makes it clearer for him to read/debug etc.
WedNESday wrote:
I know at this stage speed isn't important, but having a switch instead of a bunch of if/elses makes it clearer for him to read/debug etc.
- Indeed, as if it were basic enough for a newbie.
The register writes are in the CPU code. Also, I am rendering pixels and it still runs fine for me. ;o
Now can you tell me anything about why my scrolling doesn't work in Battletoads, the Ninja Gaiden cinema scenes, and Zelda's vertical scrolling?
One question I have: while I now have pixel-accurate Sprite Hit, Ninja Gaiden's status bar has a single half line of garbage. Similarly, the Zelda 2 title screen has a single half line of garbage before the split point. Am I supposed to delay nametable and scroll changes (except for the lower scroll X bits) until the end of the line?
Also, does Battletoads rely on the fact that while the screen is off, the VRAM pointer does not advance? I noticed in Battletoads & Double Dragon's intro the "film strip" with character names at the top has the correct sprite 0, but it looks like all the tile data is drawn higher than it should be. Right now I always advance where in the nametable I'm fetching from.
And can anyone tell me what Dwedit meant by my $2005 writes followed by $2006 writes not being right?
You're going to have to provide us with some more information, like some source code, as to why your scrolling is off.
For small enough switches, the compiler will just spit it out as an if-else tree anyways.
You really should look carefully at loopy's doc or the 2C02 technical reference, but here is the gist:
There is a series of latches that are loaded from various bits of the registers on particular writes: FV, V, H, VT, HT, and FH. Those are Fine Vertical, Vertical nametable, Horizontal nametable, Vertical Tile, Horizontal Tile, and Fine Horizontal.
FV, V, H, VT, and HT have counters associated with them, and the output of those counters is what the PPU uses to address memory.
$2000's nametable bits sets the V and H latches
$2005's first write sets the HT and FH latches
$2005's second write sets the FV and VT latches
$2006's first write sets the low two bits of FV, V, H, and the high 2 bits of VT
$2006's second write sets the low three bits of VT, and all of HT
The counters get loaded with the values in the latches at specific times:
at the beginning of scanline 20 (end of vblank), if sprite or BG rendering is enabled: FV, V, H, VT, HT
on the second write to $2006: FV, V, H, VT, HT
at the end of hblank: H and HT
This is why you need the latches separate from the counters. Writes to $2000 and $2005 (other than FH) will not affect rendering until the beginning of the next frame or scanline. Writes to $2006 will affect the counters, and thus rendering, immediately upon the second write.
It's best to just implement the logic directly from the 2C02 doc. It keeps things the simplest: these latches update on these register writes, these counters get loaded at these times during rendering. It'll Just Work.
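As a rough illustration, here's a sketch of that latch/counter scheme in C (variable and function names are just illustrative; the load points follow the list above):
Code:
/* Latches ("loopy_t" plus fine X) and counters ("loopy_v"). */
static unsigned fv_l, v_l, h_l, vt_l, ht_l, fh;  /* latches  */
static unsigned fv_c, v_c, h_c, vt_c, ht_c;      /* counters */
static int toggle;                   /* shared $2005/$2006 write toggle */

static void Write2000(unsigned char b)
{
    v_l = (b >> 1) & 1;  h_l = b & 1;            /* nametable bits */
}

static void Write2005(unsigned char b)
{
    if (!toggle) { ht_l = b >> 3;  fh   = b & 7; }   /* 1st write */
    else         { vt_l = b >> 3;  fv_l = b & 7; }   /* 2nd write */
    toggle ^= 1;
}

static void Write2006(unsigned char b)
{
    if (!toggle) {                                   /* 1st write */
        fv_l = (b >> 4) & 3;                         /* FV bit 2 cleared */
        v_l  = (b >> 3) & 1;
        h_l  = (b >> 2) & 1;
        vt_l = (vt_l & 0x07) | ((b & 3) << 3);       /* high 2 bits of VT */
    } else {                                         /* 2nd write */
        vt_l = (vt_l & 0x18) | (b >> 5);             /* low 3 bits of VT  */
        ht_l = b & 0x1F;
        /* counters load immediately on the 2nd write */
        fv_c = fv_l;  v_c = v_l;  h_c = h_l;  vt_c = vt_l;  ht_c = ht_l;
    }
    toggle ^= 1;
}

/* At the beginning of scanline 20, if sprite or BG rendering is enabled: */
static void FrameStartCopy(void)
{
    fv_c = fv_l;  v_c = v_l;  h_c = h_l;  vt_c = vt_l;  ht_c = ht_l;
}

/* At the end of each HBlank: */
static void HBlankCopy(void)
{
    h_c = h_l;  ht_c = ht_l;
}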
<offtopic>
1. Depending on the source, one "if()" less means +2 FPS.
2. value = table[index] is faster than value = *(table + index).
3. "value = color; if (cond) value &= mask;" is slower than "value = color & mask;"
Boo...
</offtopic>
Late to the party, but... here's my 2 cents on __forceinline.
It's a useful tool for occasional optimizations, however:
(1) Since it's a compiler extension and not a portable C++ keyword, wrap it in a macro like NES_FORCEINLINE. Use the preprocessor to detect the compilers that support it, and just define it to "inline" for any others.
[Edit: the GCC equivalent of __forceinline is __attribute__((always_inline)); plain __inline__ is just a synonym for inline.]
(2) NEVER USE IT WITHOUT PROFILING FIRST. The only time you should use it is when you've discovered that the compiler chooses NOT to inline your little function, but it's called very often and profiling shows that forcing the compiler to inline it results in a faster program. DON'T ASSUME IT WILL BE FASTER, because often it won't. Only force the compiler to inline when you *know* it should do it and isn't doing it.
(3) If you have several call sites, one of which you want force-inlined and the others you don't, consider duplicating the source of the function into two copies (one to be force-inlined, the other left up to the compiler's discretion). This can make code harder to maintain, but has less risk of code bloat than just using forceinline on a large function.
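Point (1) might look something like this (a sketch; NES_FORCEINLINE is an illustrative name, and __attribute__((always_inline)) is GCC's spelling of forced inlining):
Code:
/* Detect compilers that support forced inlining; fall back to inline. */
#if defined(_MSC_VER)
#define NES_FORCEINLINE __forceinline
#elif defined(__GNUC__)
#define NES_FORCEINLINE __attribute__((always_inline)) inline
#else
#define NES_FORCEINLINE inline
#endif

static NES_FORCEINLINE int CarrySet(unsigned char p)
{
    return p & 0x01;   /* trivial example: 6502 carry flag test */
}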
By the way, here's another extension I read about somewhere, though I've never tried it (it might be MSVC-only):
Code:
__assume(x < 256)
Does anyone know if this helps MSVC or GCC with the codegen for the bytecode dispatch switch statements? Most compilers like to generate unnecessary range checks that can slow them down quite a bit, even if the switch expression has an 8-bit type. I haven't checked any of them lately to see how they handle this; however, I've heard of at least one project which had a custom .obj-patching tool to remove those range checks from their bytecode interpreter after GCC compiled it.
__assume sounds VC-only. I can't recall ever hearing about it for GCC.
In most cases, the range check won't mean dick for FPS. The number of ifs mostly depends on the code and the compiler, and where that if is. If it's per frame, who cares; per scanline, it might matter; per pixel, and yeah, you've done something horribly wrong.
Fx3 wrote:
<offtopic>
1. Depending on the source, one "if()" less means +2 FPS.
2. value = table[index] is faster than value = *(table + index).
3. "value = color; if (cond) value &= mask;" is slower than "value = color & mask;"
Boo...
</offtopic>
1. Only 2 FPS? And the rest, dude. Stick another if/switch inside the pixel function and you will lose a lot more than that.
2 & 3: You are dead right about those.
mozz wrote:
By the way, here's another extension I read about somewhere, though I've never tried it (it might be MSVC-only):
Code:
__assume(x < 256)
MSVC-only AFAIK. It optimizes nicely with switch-cases if you use default: __assume(0);. It doesn't seem to help much otherwise, or at least not in the 2005 edition. Still nice to have for production code, using something like this:
Code:
#include <cassert>
#ifdef NDEBUG
#if defined(_MSC_VER) && _MSC_VER >= 1300
#define NES_ASSERT(x) __assume(x)
#else
#define NES_ASSERT(x) ((void)0)
#endif
#else
#define NES_ASSERT(x) assert(x)
#endif
Fx3 wrote:
2. value = table[index] is faster than value = *(table + index).
If that's true, I'd switch compilers if I were you. They are exactly the same thing, just different ways of writing it. Here are others:
Code:
value = index[table];
value = *(index+table);
Never forget to make "Release" builds. Debug builds have optimizations turned off.
ReaperSMS wrote:
__assume sounds like VC only. Can't recall ever hearing about it for GCC.
In most cases, the range check won't mean dick for fps. # of ifs mostly depends on the code and the compiler, and where that if is. If it's per frame, who cares, per scanline, it might matter, per pixel and yeah, you've done something horribly wrong.
I guess NES emulators do more work per instruction than some other kinds of bytecode interpreter. (Maybe the project I was thinking of was a Smalltalk or Java bytecode interpreter. Those only do a couple instructions worth of work for many of their bytecodes, and the dispatch overhead can be quite significant.)
It's not so much less work per instruction as a low instruction rate overall. Other bytecode systems don't bother with execution throttling, as they aim at peak performance. For a NES emulator, there's not a whole lot of point in breaking 1.789MHz, which isn't too difficult given that individual instruction handlers burn 2-7 virtual cycles apiece.
Most of the time will be spent in the PPU/APU.
Now, branching certainly costs inside the CPU core, but it doesn't matter much in the end when you consider the sheer clock advantage of the host (a 3GHz host has roughly 1,700 cycles to spend per 1.79MHz NES clock). You could burn nearly 3000 cycles per instruction and still keep up. Most of the time, you won't break 50 cycles per instruction. Also, no amount of hinting to clean up the dispatch will help when the indirect jump through the jump table (and it will almost certainly be a jump table) is unpredictable. The only real way around that is dynarec or STC, which are needlessly complicated for the problem at hand.
Thank you ReaperSMS for your explanation. I have gotten my renderer working off the "loopy_v" and "loopy_t" terms, and for many games it is fine. But I seem to be having an issue whenever a game modifies scrolling mid-frame, usually via Sprite 0 waiting.
What happens is that in games such as SMB and Castlevania, the static bar at the top is indeed static. But the area below the status bar constantly jiggles horizontally. From the looks of it, every frame the scrolled portion of the background is drawn 1 pixel further right until it hits 8 and wraps back to 0. I'm not sure what is causing it yet. But I believe I am properly fetching H and HT from loopy_t at HBlank.
Sounds like you've got FH screwed up somehow, off by one or so.
<newbie> What is the cool usage of assert() after all? </newbie>
Fx3 wrote:
<newbie> What is the cool usage of assert() after all? </newbie>
assert(expr) is a way of saying "at this point in the program, (expr) is true". In a debug build, if it turns out to actually be false at run-time, the program will display an error message and exit. But in a release build (one with NDEBUG defined) the code assert(whatever) is defined away to nothing.
So, some rules of thumb:
* Use assert to check your assumptions at run-time. It can help you catch programming mistakes (bugs), and verify assumptions about the host compiler/system (e.g. if your program requires that sizeof(void*)==4 in order to function properly, then assert(sizeof(void*)==4) somewhere near the top of main() seems like a good idea).
* Use assert(0); or assert(!"Should never get here"); or something like that, for places in the control flow that should be unreachable if your program is working correctly. (E.g. in the default case of a switch statement, if the program is never supposed to go outside of the regular cases of the switch.) The idiom assert(0) can be thought of as saying "assert not reached".
* DON'T use assert to detect run-time errors (such as bad input from the user, or bad data read from a file), because assert() compiles to nothing in release builds. Use regular if statements and write normal error handling code for those cases.
* Some people like to program in a "design by contract" style, where each routine has certain "preconditions" which must be true when the routine is called, and then is guaranteed to meet certain "postconditions" when it finishes. There are also sometimes "invariant conditions" such as a loop invariant (must be true on each iteration of the loop), or a class invariant (must be true at the end of the constructor, and must be true at the beginning and end of each method call except for the destructor). Anyway, assert can be used to check these things at run-time, though it's a bit more clumsy than doing it in a language with language-level support for DBC. But it's still very workable.
Assertions are a form of documentation: they declare the intent of the programmer ("things are supposed to be like this, or my program is not working correctly"), but they declare it in a way that can be checked at run-time (at least in debug builds). If you use them well, they are a handy tool for gaining confidence that your code is actually working the way you think it is.
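A small illustration of those rules of thumb (the helper names are hypothetical):
Code:
#include <assert.h>

static unsigned char palette_ram[32];

static unsigned char ReadPalette(int index)
{
    /* Precondition check: callers must pass 0..31. A violation is a
       programming mistake, not a run-time error, so assert is right. */
    assert(index >= 0 && index < 32);
    return palette_ram[index];
}

static int MirrorMode(int header_bit)
{
    switch (header_bit) {
        case 0: return 0;   /* horizontal mirroring */
        case 1: return 1;   /* vertical mirroring   */
        default:
            assert(!"Should never get here");
            return 0;
    }
}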
I figured out why it was jiggly. For whatever reason, my timing is not correct as far as PPU and CPU sync, or perhaps just the timing of the VBlank flag setting.
When I read the trace output from my emulator, the problem is that the scroll is being updated outside of HBlank, causing the fine X for the split scroll area to be incremented a random number of times before reaching the area it was intended for.
So I need to figure out what exactly is wrong with my PPU timing. I've noticed 2 problems actually. One of them is the jiggle from writing outside HBlank, but there is another issue after playing for a few minutes where it seems to start rendering the picture from the wrong address, like it might be a sync issue. I'll have to figure it out. I think the most likely problem is that the timing between when the VBlank flag is set and when games expect HBlank is incorrect in my emulator.
I did make a temporary fix which checks for a mid-frame fine X write outside of HBlank and delays it until HBlank. But that is not ideal, as I shouldn't have to do that once I have the timing correct.
I found another bunch of issues, and now it's actually looking much nicer. For one thing, I had misarranged things and was corrupting the VRAM pointer by updating it when it shouldn't be, that sort of thing. Now Ninja Gaiden and Zelda look great. In fact, now everything seems to look great, but there's a strange issue in Zelda 2; I have to look at what is causing that...
MottZilla wrote:
but there's a strange issue in Zelda 2; I have to look at what is causing that...
<hunch>
don't allow the user to press both Left+Right or Up+Down at the same time. Some games didn't count on an NES controller making this possible and thus will glitch very weirdly when the user presses both simultaneously.
Games I know of where this is an issue include Battletoads and Zelda 2.
</hunch>
Disch wrote:
<hunch>
don't allow the user to press both Left+Right or Up+Down at the same time. Some games didn't count on an NES controller making this possible and thus will glitch very weirdly when the user presses both simultaneously.
Games I know of where this is an issue include Battletoads and Zelda 2.
</hunch>
For the sake of accuracy, such a feature should be made optional. I seem to recall at least one game that had debugging features activated by pressing two or more directions simultaneously.
This makes me wonder if any Japanese games mapped such features to Start or Select on controller 2, since a stock Famicom with hard-wired controllers doesn't have those buttons?
Yes, be sure to filter that out as Disch says. The bug in the snake level of Battletoads took a long time to track down. There's a past discussion about this, along with code to handle it more intelligently than "if left+right or up+down then report that no direction is pressed".
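The crude version would look something like this (a sketch; the name is illustrative, and it assumes the standard controller bit order A, B, Select, Start, Up, Down, Left, Right in bits 0-7):
Code:
/* Cancel physically-impossible opposing directions before the game
   reads the pad. Simple approach: drop both when both are held. */
unsigned char FilterOpposingDirections(unsigned char pad)
{
    if ((pad & 0x30) == 0x30) pad &= ~0x30;  /* Up+Down    -> neither */
    if ((pad & 0xC0) == 0xC0) pad &= ~0xC0;  /* Left+Right -> neither */
    return pad;
}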
No, the problem I had in Zelda 2 wasn't related to input. I don't filter input, but I also haven't had any issues, since I play with a gamepad anyway.
I fixed a bunch of stuff today. I even have MMC3 support with IRQs. IRQs were really easy to get working, in fact; really, the whole MMC3 was a breeze. I thought people said it was supposed to be hard? Although I am not emulating the scanline counter exactly the way it would really behave (based on A12).
Anyway, I still have plenty of problems to deal with, but I also have a lot more games to play with now. =)
MottZilla wrote:
Although I am not emulating the scanline counter exactly the way it would really behave (based on A12).
Making fake MMC3 IRQs by counting scanlines is easy. Doing it the real way with A12 is the hard part.
I'm guessing that Kick Master will probably crash after you pick up the first item.
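For reference, the "fake" scanline-counting approach is roughly this (a sketch with illustrative names, clocked once per rendered scanline while rendering is enabled; real hardware clocks the counter on PPU A12 rising edges instead):
Code:
static unsigned char irq_counter, irq_latch; /* latch = value written to $C000 */
static int irq_reload, irq_enabled, irq_pending;

void MMC3_ClockScanline(void)
{
    if (irq_counter == 0 || irq_reload) {
        irq_counter = irq_latch;   /* reload from the $C000 latch */
        irq_reload  = 0;
    } else {
        irq_counter--;
    }
    if (irq_counter == 0 && irq_enabled)
        irq_pending = 1;           /* assert the CPU IRQ line */
}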