darryl.revok wrote:
So is it typical to disable rendering during NMI?
I personally don't do it, but I was curious, so I just debugged a handful of games I had lying around, and there doesn't seem to be a consensus. These are the games that did turn rendering off:
Code:
Super Mario Bros.
DuckTales
Street Fighter 2010
Bucky O'Hare
And these are the ones that didn't:
Code:
Alfred Chicken
Balloon Fight
Felix the Cat
Galaxian
The 3-D Battles of WorldRunner
Gimmick!
I intentionally didn't test Battletoads or anything known to use forced blanking, because those obviously have to disable rendering.
Anyway, I guess that disabling rendering can also be seen as a safety measure... If your NMI handler blows the VRAM access budget by accident, there will be no persistent VRAM corruption; the screen will just jump for a frame. I guess that's why most of the games that do disable rendering do it: it minimizes the damage in case something goes wrong.
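Here's a rough sketch of that pattern (the $2001 values are just examples; use whatever settings your game normally renders with):
Code:
nmi:
   pha                ; preserve A (save X/Y too if the handler uses them)
   lda #%00000000
   sta $2001          ; rendering off: a late $2007 write can't corrupt the picture
   ;; ...all VRAM updates go here...
   ;; ...then set the scroll with $2005/$2006...
   lda #%00011110     ; example value: background and sprites enabled
   sta $2001          ; rendering back on for the next frame
   pla
   rti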
Quote:
I was looking this up a little bit, and it seems like not disabling rendering is wasting some cycles.
Yeah, you can buy yourself a few more cycles of VRAM access during the pre-render scanline that way. I think I can get by without using those cycles, because there's some cleaning up I have to do in my NMI handler before setting the scroll anyway, and the pre-render scanline is a good time to do it.
Quote:
In my quest for more cycles, I removed all of the reads from $2002. This made my IRQs a little jittery, so I added in one single read from $2002 at the beginning of the NMI, instead of before each nametable write, and that fixed the issue.
Most of the time I don't read $2002 at all, and I've never had any problems with that. I'm extra careful to always perform $2005/$2006 writes in pairs, though.
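If you do keep one read, the top of the NMI is the right place for it. A minimal sketch (the address is just an example):
Code:
nmi:
   bit $2002          ; one read resets the first/second write toggle
   lda #$23
   sta $2006          ; high byte first...
   lda #$C0
   sta $2006          ; ...then low byte, always as a pair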
Quote:
1. Okay, I understand that $2002 is read so that you can say for sure that $2006 is expecting a high byte, but I'm wondering why it ever wouldn't be. Every time that I've had to write to $2006 thus far, I've used two bytes. Is there ever a time that something will be done intentionally that causes a mismatch in address order, or is this just an error-proofing method, in case NMI hits at an inopportune time, or something?
If $2006 ever isn't expecting the high byte, it must be a bug in your code. This flag doesn't change unless you read $2002 or write to $2005/$2006. I just noticed you didn't mention $2005... was that just an omission, or did you not know that these registers share the even/odd write flag?
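Here's a contrived example of how the shared flag can bite (not something you'd write on purpose):
Code:
   lda #$00
   sta $2005          ; 1st write: the toggle now expects a 2nd byte
   lda #$20
   sta $2006          ; taken as the 2ND $2006 byte (the low one!)
   ;; the PPU address ends up wrong; a $2002 read beforehand would
   ;; have reset the toggle so $2006 expected the high byte again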
Quote:
2. Is there any particular reason why $2006 seems to be the only part of the system that's expecting the high byte first?
Maybe the engineers who designed the PPU thought programmers would like to write addresses in the order humans read them. I don't think there's any technical reason for this; the designers simply had to pick a byte to go first, and they decided on the high byte.
Quote:
Okay, this one isn't related to the high/low latch, but something else I've been wondering. How long does setting the PPU to the +32 increment stay in effect?
AFAIK, until you change that. I don't think there's anything automatic touching that setting.
Quote:
Let's say I'm writing a column of tiles. I write #%00000100 to $2000 for +32 mode. Now if I want to write attributes next, do I have to write #$00 to $2000, or will it default to +1 on the next write?
It will definitely not default back; you have to change it yourself. If you're writing attributes for columns, though, increments of 32 bytes can still be useful: since each row of attributes is 8 bytes long, 32 bytes is equivalent to 4 rows, so you can update a full column by setting the address of the 1st byte and writing the 1st and 5th bytes, then setting the address of the 2nd byte and writing the 2nd and 6th bytes, and so on. This way you can write the 8 bytes of a column while setting the address only 4 times, instead of 8.
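Something like this sketch, where attr_buf (the column's 8 attribute bytes, top to bottom) and column (0-7) are made-up names:
Code:
   bit $2002          ; put the write toggle in a known state
   lda #%10000100     ; example $2000 value: NMI on, +32 increment
   sta $2000
   ldy #0
attr_loop:
   lda #$23
   sta $2006          ; the attribute table starts at $23C0
   tya
   asl
   asl
   asl                ; Y * 8 = offset of attribute row Y
   ora #$C0
   ora column
   sta $2006          ; $C0 + row * 8 + column
   lda attr_buf,y
   sta $2007          ; row Y
   lda attr_buf+4,y
   sta $2007          ; row Y+4 (the address just advanced by 32)
   iny
   cpy #4
   bne attr_loop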
Quote:
I'm looking for any options available to save cycles.
If you have any loops at all, you should really look into unrolling them. Even partially unrolling can have incredible results. For example, if you have a loop that counts each byte that's being copied, that's one decrement instruction + a branch for each byte (a total of 5 cycles), which is a lot of overhead for a single byte. If you're copying 20 bytes, that's 100 cycles you're losing, while only 160 cycles are actually being spent copying bytes (assuming 8 cycles per byte). If you partially unroll that loop and count pairs of bytes instead, copying 2 bytes per iteration, you'll be cutting back that overhead by half! The more you unroll, the less overhead you'll have.
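As a sketch, here's a 20-byte copy to $2007 before and after a 2x unroll ("buffer" is a made-up label; this version walks the buffer backwards, so the data has to be stored last byte first, or count up instead if the order matters). Stepping the counter by 2 costs one extra DEX, so this exact version saves a bit less than half, and the overhead keeps shrinking the further you unroll:
Code:
   ;; rolled: DEX + BNE = 5 cycles of overhead per byte
   ldx #20
copy_byte:
   lda buffer-1,x     ; 4 cycles
   sta $2007          ; 4 cycles
   dex                ; 2 cycles
   bne copy_byte      ; 3 cycles

   ;; unrolled x2: 7 cycles of overhead per 2 bytes (3.5 per byte)
   ldx #20
copy_pair:
   lda buffer-1,x
   sta $2007
   lda buffer-2,x
   sta $2007
   dex
   dex
   bne copy_pair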