Way too slow...

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#47818)
Hi there!

I'm currently working on a particle engine. I have a first draft ready, but unfortunately so far my efforts are way too slow.

My goal was to move 128 Tile pixels at 60Hz, but my current best approach moves only 96 pixels at 15Hz :oops:

My main VBLANK code is probably near perfect, I think (the first loop of the NMI code does 96 PPU RAM updates per frame), but maybe I'm missing something substantial.

I have uploaded my source and binary, hoping that maybe someone here has an idea to make things faster :)

=> http://home.arcor.de/cybergoth/apocalypse.zip

Greetings,
Manuel

by on (#47823)
Well, I don't think it is too slow, in fact it is pretty cool. In any case it is faster than my pseudo-mode 7 demo, which took around 2-3 seconds to compute a frame (I didn't optimize my code).

Is only the pattern table being uploaded, or the name table too?

by on (#47825)
Bregalad wrote:
Well, I don't think it is too slow, in fact it is pretty cool. In any case it is faster than my pseudo-mode 7 demo, which took around 2-3 seconds to compute a frame (I didn't optimize my code).


Thanks for the heads-up. I found the thread here and tried your demo, that's really impressive stuff! :shock:

Bregalad wrote:
Is only the pattern table being uploaded, or the name table too?


Both, yes. I need 4 writes per dot: Erase old dot+tile and draw anew. I just randomized dot positions for the moment and implemented the simplest movement scheme possible, but every dot can actually be freely(*) positioned.

*Right now I'm doing nothing against tile clashes, so the last drawn dot always "wins" the tile :wink:
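
As a rough sanity check on those numbers (a back-of-the-envelope sketch, not taken from the posted source): if the 96 PPU RAM updates per frame mentioned in the first post are single writes, and each dot costs 4 of them, the 15Hz figure falls out directly. The function and constant names below are made up for illustration.

```python
# Back-of-the-envelope throughput check for the particle engine.
# Assumption: the "96 PPU RAM updates per frame" are single writes,
# and every dot costs 4 of them (erase dot, erase tile, draw dot, draw tile).

FRAME_RATE_HZ = 60        # frame rate quoted in the thread
WRITES_PER_DOT = 4

def dot_update_rate(writes_per_frame, dot_count):
    """Return how often (in Hz) every dot gets redrawn."""
    dots_per_frame = writes_per_frame // WRITES_PER_DOT
    frames_per_pass = -(-dot_count // dots_per_frame)  # ceiling division
    return FRAME_RATE_HZ / frames_per_pass

def writes_needed(dot_count, target_hz):
    """Return the writes per frame needed to redraw dot_count dots at target_hz."""
    return dot_count * WRITES_PER_DOT * target_hz // FRAME_RATE_HZ

print(dot_update_rate(96, 96))   # current engine: 96 dots
print(writes_needed(128, 60))    # goal: 128 dots at 60Hz
```

Under these assumptions the current engine lands on 15Hz, and the goal of 128 dots at 60Hz would need 512 writes per frame, more than five times the current 96.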

Basically I have a few general directions of thought for speed-up:

1. Something like a mapper that allows toggling the same RAM between PPU and CPU. Does something like that exist?

2. Selfmodifying code. Should definitely speed it up some, but could require up to 15*128 Bytes of RAM.

3. Updating the PPU just-in-time. I assume as long as I'm ahead of the raster beam I can do whatever I want, regardless of whether it's still VBLANK or not? (Still, Y-sorting 128 objects during a frame might be impossible as well...)

4. Optimize by restricting/patternizing the dot movement. Something I will possibly do later.

by on (#47829)
Quote:

1. Something like a mapper that allows toggling the same RAM between PPU and CPU. Does something like that exist?

No, this is not electronically possible. Even if you used dual-port RAM there would still be issues. Maybe it would be possible if you inserted a modern, super-fast multi-megahertz DSP in the cartridge that multiplexed RAM reads and writes for both chips in a transparent fashion.
Quote:
3. Updating the PPU just-in-time. I assume as long as I'm ahead of the raster beam I can do whatever I want, regardless of whether it's still VBLANK or not? (Still, Y-sorting 128 objects during a frame might be impossible as well...)

No, you need to be in VBlank or forced VBlank to access the PPU's RAM. You can, however, force VBlank for part of the frame to get more RAM writes.

by on (#47832)
Cybergoth wrote:
*Right now I'm doing nothing against tile clashes, so the last drawn dot always "wins" the tile :wink:

Ah, I thought I saw some of the pixels disappearing for small amounts of time...! =)

Quote:
1. Something like a mapper that allows toggling the same RAM between PPU and CPU. Does something like that exist?

I think the MMC5 will let you access its internal RAM from the CPU address space even while the screen renders, but that memory can only be used for a name table, I think. I might have got that wrong, I haven't read about the MMC5 in a while, but I think it's not what you're looking for anyway.

Quote:
2. Selfmodifying code. Should definitely speed it up some, but could require up to 15*128 Bytes of RAM.

That's nearly all of the internal RAM, but shouldn't be a problem if you use a cart with 8KB of extra RAM.

Quote:
3. Updating the PPU just-in-time. I assume as long as I'm ahead of the raster beam I can do whatever I want, regardless of whether it's still VBLANK or not?

Like Bregalad said, there is no way around that. The address register we use to write data to the PPU (accessed through $2006) is also used by the PPU during rendering, so accessing it outside of VBlank will corrupt whatever is being rendered at the moment. As Bregalad suggested, you can turn rendering off manually for a few extra scanlines of PPU access.

by on (#47835)
tokumaru wrote:
Ah, I thought I saw some of the pixels disappearing for small amounts of time...! =)


I'm undecided yet if I'm going to do something against it or just allow it to happen ;)

tokumaru wrote:
I think the MMC5 will let you access its internal RAM from the CPU address space even while the screen renders, but that memory can only be used for a name table, I think. I might have got that wrong, I haven't read about the MMC5 in a while, but I think it's not what you're looking for anyway.


That mapper's quite a beast! :shock:

I'd assume that's a configuration that'll never be used for homebrew efforts, unless one is going to cannibalize original carts :shock: :shock: :shock:

tokumaru wrote:
That's nearly all of the internal RAM, but shouldn't be a problem if you use a cart with 8KB of extra RAM.


While still in tech-demo stages I might just try it once to see how much speed up it provides. Probably not really worth the tradeoff :?

tokumaru wrote:
Like Bregalad said, there is no way around that. The address register we use to write data to the PPU (accessed through $2006) is also used by the PPU during rendering, so accessing it outside of VBlank will corrupt whatever is being rendered at the moment.


That's quite interesting. I think I got that idea from reading some tech notes from Ian Bell coming with his Tank demo, where it says he's cycle counting in order to know the position of the beam or somesuch? Maybe I just misunderstood that part.

tokumaru wrote:
As Bregalad suggested, you can turn rendering off manually for a few extra scanlines of PPU access.


Is the default VBLANK window already maxed out or can you already gain some cycles here without shrinking the resolution? I'm thinking about some unused overscan area or the missing top/bottom lines.

In that case, is there some sample code available that maxes out vblank time?

by on (#47836)
To expand on what Bregalad said, This document shows exactly what the PPU does each scanline. It accesses VRAM with every cycle it has available except for one, and since 1 PPU cycle is 1/3 of a CPU cycle, it doesn't help you that much. So it's impossible to access VRAM while a scanline is being rendered without messing up the video. However, you CAN access the PPU's internal registers, like the scroll registers, during certain times while a scanline is being rendered. The document there, along with loopy's "The Skinny on NES Scrolling," tell you how to do so.

Quote:
Quote:
I think the MMC5 will let you access its internal RAM from the CPU address space even while the screen renders, but that memory can only be used for a name table, I think. I might have got that wrong, I haven't read about the MMC5 in a while, but I think it's not what you're looking for anyway.

That mapper's quite a beast! :shock:

I'd assume that's a configuration that'll never be used for homebrew efforts, unless one is going to cannibalize original carts :shock: :shock: :shock:


The MMC5's internal ram, when it's used for enhancing the graphics, is usually used as a second name and attribute table. It allows you to use up to 16384 different tiles in the background at the same time, and also lets you use 1 palette per tile, instead of the 2 by 2 tile area that normal attribute tables use. It only works with one screen mirroring though, and doesn't really help with sprites. More info is on the NESdevWiki.

Quote:
Is the default VBLANK window already maxed out or can you already gain some cycles here without shrinking the resolution? I'm thinking about some unused overscan area or the missing top/bottom lines.

In that case, is there some sample code available that maxes out vblank time?


Depends on the TV. If it's NTSC then some scanlines are probably being chopped off. Most emulators remove the first and last 8 scanlines, though real NTSC TVs might remove more or less, and they might remove more from the top than the bottom or vice versa. But 8 from the top and bottom is usually a safe bet. If it's PAL then it's displaying more scanlines than the NES does, approximately 260 scanlines according to the NESdevWiki. So with PAL you shouldn't remove any scanlines from the display if you don't have to. But then again with PAL you get 70 scanlines of VBlank, so you probably don't need to.

That document by Ian Bell probably refers to counting cycles so that the tank demo always enables and disables the PPU at the same scanlines. The reason he needed to do so was because he was using a mapper (UNROM I think) that didn't have scanline interrupts. If you use something like an MMC3 or better though it'll probably have the option to use scanline interrupts, which means you don't have to count cycles to know when a certain scanline is being rendered. The Tank demo is an example of a program that disables rendering for extended VBlank. So are Battletoads and this one test program by Celius. I forget what it's called though.

by on (#47837)
Cybergoth wrote:
Is the default VBLANK window already maxed out or can you already gain some cycles here without shrinking the resolution? I'm thinking about some unused overscan area or the missing top/bottom lines.

In that case, is there some sample code available that maxes out vblank time?

LJ65 turns off rendering about nine lines early so that it can blit a whole 200-byte playfield, plus OAM and the palette, in one NTSC vblank.

CartCollector wrote:
This document shows exactly what the PPU does each scanline. It accesses VRAM with every cycle it has available except for one

Unless a mapper queues up (address, data) pairs to write to VRAM and executes the queue during times when the data read by the PPU doesn't matter. The document lists the following memory fetch phases when the PPU appears to ignore what it reads:
  • 125-128: Unused thirty-fourth sliver of the background
  • 129, 130, 133, 134, ..., 157, 158: Garbage nametable bytes
  • 169 and 170: PPU is frozen while waiting 5 dots for horizontal blanking to end

But then that might be almost as hard to build as MMC5, with independent front-side and back-side PPU address buses. And like MMC5, it might screw up on mostly-NES-compatible consoles using "NOAC" chipsets. If you really want to write to VRAM during rendering, then try programming for the TurboGrafx-16 or the Game Boy Advance.

by on (#47842)
The latest program I've written allows a lot of room for VRAM writes. But all the code running during the displayed part of the screen (and vblank) has to take the same number of cycles every frame. I didn't even use the sprite #0 hit yet. But the end part was just a delay loop, which would work fine for sprite #0 hit. It's a totally wrong setup, but what I'm doing takes several frames to finish the main loop and then updates the screen all at once when it's ready. Having some fun with the nametables. :)

Using the sprite #0 hit detect is the common way to combine variable code with timed code. But it only works once per frame. An IRQ would be better, but generally only ASIC-based mappers have them.

by on (#47843)
Quote:
LJ65 turns off rendering about nine lines early so that it can blit a whole 200-byte playfield, plus OAM and the palette, in one NTSC vblank.


How do you know when to turn the screen back on? Is the blit guaranteed to not take over 30 scanlines of time? I get 30 from 20 normal VBlank + 9 extended VBlank + 1 scanline before normal VBlank that the PPU doesn't render but doesn't trigger the VBlank NMI for either.

Quote:
If you really want to write to VRAM during rendering, then try programming for the TurboGrafx-16 or the Game Boy Advance.


Or the Atari 800, Atari 5200, or Commodore 64. All of which were released before the Famicom ;)

by on (#47847)
CartCollector wrote:
How do you know when to turn the screen back on? Is the blit guaranteed to not take over 30 scanlines of time?

Correct. I timed a 22-line update plus palette update plus OAM DMA, and it didn't exceed 3500 cycles, even with DPCM stealing cycles. That's 9 forced blank lines, 1 post-render line, 20 vblank lines, and the first 100 CPU cycles of the pre-render line. But then you do get most of the pre-render scanline too.
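
That budget can be checked with a little arithmetic. The sketch below assumes the usual NTSC figure of 341 PPU dots per scanline at 3 dots per CPU cycle (about 113.67 CPU cycles per line); the function and parameter names are only illustrative.

```python
from fractions import Fraction

# Assumption: NTSC timing, 341 PPU dots per scanline, 3 PPU dots per CPU cycle.
CYCLES_PER_LINE_NTSC = Fraction(341, 3)

def blank_budget(forced_lines, post_render_lines=1, vblank_lines=20, extra_cycles=100):
    """CPU cycles available with rendering off, as a float."""
    lines = forced_lines + post_render_lines + vblank_lines
    return float(lines * CYCLES_PER_LINE_NTSC + extra_cycles)

print(blank_budget(9))                   # 30 lines plus 100 cycles of pre-render
print(blank_budget(0, extra_cycles=0))   # no forced blank, no pre-render time
```

With 9 forced-blank lines this comes to roughly 3510 cycles, so an update measured at under 3500 cycles fits with a small margin.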

by on (#47851)
I wrote what I consider to be a pretty great PPU updating routine. It uses 12 scanlines of extended Vblank, but it can do a lot (in my mind).

Each frame, it does:

64 tile writes (row)
30 tile writes (column)
16 attribute writes (row)
8 attribute writes (column)
32 entry palette update
10 * (1 CHR RAM tile) or (6 Miscellaneous PPU writes)
Sprite DMA
Sets Scroll

Unfortunately, since I'm using extended Vblank, it has to take an exact amount of cycles every frame. So it's about 3600 cycles for all of that, but it's worth it. And for each of the CHR RAM tile routines, one can decide to do 6 miscellaneous PPU writes instead if they so desire, which is handy if there's a BG update that's not related to scrolling over level data.

by on (#47863)
Thanks guys for all your feedback. I'll see what I can do with all the ideas I got now for the next revision of the engine :)
Re: Way too slow...
by on (#47983)
Cybergoth wrote:
my current best approach moves only 96 pixels at 15Hz


By significantly changing that approach I managed to almost double the speed of my particle engine (without any vblank extension!) :D

Now it updates 94 particles at 30Hz.

=> http://home.arcor.de/cybergoth/apocalypse.zip

by on (#48003)
Nice! Looks like a couple 'lemmings' blew up in outer space. :D

by on (#48011)
Looks great! Now all you have to do is something like this (fast forward to around 6:30): http://www.demoscene.tv/prod.php?id_prod=12945

by on (#48018)
Hehe... it'd be hard to beat the C64 in this discipline ;)

I didn't find a video of the C64 version, but this is what I'm aiming at:
http://www.youtube.com/watch?v=JHKFqz7e-so

by on (#48025)
Uh, yeah, wow... I don't know much (next to nothing, actually) about the C64 but that looks pretty insane. Hopefully I can accomplish something similar on the NES. Doynax, a user here, did a really cool polygon filling engine (with real time 3D) that I was pretty inspired by, and it actually led me to design a polygonal movie engine. It basically shows predefined frames of 2D polygons in sequence, so I could make a movie like the one that was shown for the NES, but nothing would be rendered with 3D calculations, and it would have a really low frame rate (I didn't optimize it enough, so it runs at like 6-8 FPS, haha).

by on (#48026)
The C64 is a completely insane platform and the whole C64 development scene is completely insane. It's really surprising how the system is only able to do crap natively, but how it can be tweaked to do absolutely awesome things by fooling the VIC II (the C64's PPU) with tricky timing and stuff.
Also, the CPU and the VIC II access video RAM on alternating cycles, so it is possible to read and write VRAM at any time, at the price of a very low CPU frequency (about half that of the NES: less than 1 MHz!)

by on (#48033)
I said something LIKE. I didn't command you to go out and make something better.

But yeah you could probably do something closer to that if you were using PAL (the vast majority of C64 demos are in PAL because most sceners (people who make demos) are in Europe) and chopped off some scanlines. Even if you didn't remove a whole bunch of scanlines you could probably do some cool 3D effects. I mean PAL gives you 70 lines of VBlank to work with, so that's 3.5x the amount of VBlank time to work with compared to NTSC. So if the only bottleneck in your program is VBlank, you could draw 3.5x more dots per frame or speed up your updates without extending VBlank.

And you don't need a whole lot of dots to make a decent looking 3D object. Look at this NES demo for instance (ROM here).

But what you've got right now is great. If you want to do something else with it, that's your choice.

by on (#48035)
That was beautiful! Wow, I was totally fooled by the spinning gears at first, but looking at the pattern tables, it was obvious how it was done (4x4 pixels, all possible combinations fit into the pattern table). But it looked surprisingly great for the pixels being so big. I still don't understand the twisting tubes though. I saw that in another demo, and I don't know how it's done, and think it looks totally awesome. The winding path was awesome too. And the spinning box is really impressive looking, though I think I can see how it was done. That would be perfect for something like a developer logo, though it might take up too much graphical space. I'm very impressed. It really inspires me to optimize my polygonal movie engine and start making something cool.
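
For what it's worth, here is one way the "4x4 pixels, all possible combinations" trick could be laid out. This is only a guess at the scheme, not the demo's actual code: it assumes each 8x8 tile is split into four 4x4 blocks, each block shows one of the 4 background colors, and the tile index packs the four 2-bit colors. That gives 4^4 = 256 tiles, exactly one pattern table. The bit layout of the index is an arbitrary choice made for this sketch.

```python
def quadrant_tile(index):
    """Build the 16 CHR bytes (8 bytes plane 0, then 8 bytes plane 1) for one tile.

    Hypothetical index layout: bits 0-1 top-left color, bits 2-3 top-right,
    bits 4-5 bottom-left, bits 6-7 bottom-right.
    """
    colors = [(index >> shift) & 3 for shift in (0, 2, 4, 6)]
    planes = []
    for plane in (0, 1):
        for row in range(8):
            left, right = colors[0:2] if row < 4 else colors[2:4]
            byte = 0
            if (left >> plane) & 1:
                byte |= 0xF0   # left four pixels of this row
            if (right >> plane) & 1:
                byte |= 0x0F   # right four pixels of this row
            planes.append(byte)
    return bytes(planes)

# One full pattern table: 256 tiles of 16 bytes each.
pattern_table = b"".join(quadrant_tile(i) for i in range(256))
print(len(pattern_table))
```

With a table like this, the name table alone selects the color of every 4x4 block, so an animation would only need name table writes rather than pattern table writes.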

by on (#48037)
CartCollector wrote:
I mean PAL gives you 70 lines of VBlank to work with, so that's 3.5x the amount of VBlank time to work with compared to NTSC. So if the only bottleneck in your program is VBlank, you could draw 3.5x more dots per frame or speed up your updates without extending VBlank.

This is wrong, because the CPU is slower in PAL consoles. The number of CPU cycles in a PAL scanline is 15/16 of its NTSC counterpart. Taking that into account, the PAL CPU has "only" 3.28 times as much VBlank time as NTSC (not that it really changes anything), but also, when doing raster timing tricks, you only have 106.5 cycles per scanline instead of 113.7, which means less computation is possible between raster splits.
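
The 15/16 and 3.28 figures can be reproduced with a short calculation. This sketch assumes 341 PPU dots per scanline on both systems, 3 PPU dots per CPU cycle on NTSC and 3.2 on PAL, and the 20 and 70 VBlank lines quoted in this thread; the names are only illustrative.

```python
from fractions import Fraction

# Assumptions: 341 PPU dots per scanline on both systems,
# 3 PPU dots per CPU cycle on NTSC and 3.2 on PAL,
# 20 vblank lines on NTSC and 70 on PAL (as quoted in the thread).
DOTS_PER_LINE = 341
NTSC_CYCLES_PER_LINE = Fraction(DOTS_PER_LINE, 3)
PAL_CYCLES_PER_LINE = Fraction(DOTS_PER_LINE * 5, 16)   # 341 / 3.2

def vblank_cycles(lines, cycles_per_line):
    """CPU cycles available in the given number of vblank scanlines."""
    return lines * cycles_per_line

ratio = vblank_cycles(70, PAL_CYCLES_PER_LINE) / vblank_cycles(20, NTSC_CYCLES_PER_LINE)

print(float(NTSC_CYCLES_PER_LINE))                  # about 113.67
print(float(PAL_CYCLES_PER_LINE))                   # 106.5625
print(PAL_CYCLES_PER_LINE / NTSC_CYCLES_PER_LINE)   # 15/16
print(float(ratio))                                 # 3.28125
```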

Quote:
I still don't understand the twisting tubes though. I saw that in another demo, and I don't know how it's done, and think it looks totally awesome.

I guess it's just two tubes in the pattern tables with all possible distances, and they change the scroll each scanline.

by on (#48038)
CartCollector wrote:
But what you've got right now is great. If you want to do something else with it, that's your choice.


I can only work on such fun stuff in short bursts, e.g. like during those two vacation weeks I just had. Now it'll rest again for a while, until another window opens where I can dedicate some time to it. That's why I try to work very focused towards my goal.

What I can offer is my code though. The dots' positions and directions are stored in the "XPOS", "YPOS" and "CURRDIR" arrays, and they're accessed through xposPtr, yposPtr and currdirPtr.

If you want to experiment with it, you can just plug in different movement code by tweaking lines 252-327. A and X are free to use there, Y indexes the current dot :)