Line plotting engine

Line plotting engine
by hcs on 2011-12-29 (#88109)

I've been working on a little line plotting engine for a few days, after having abandoned it since August. Here's the current test version: http://hcs64.com/files/vectest3.nes

The square is intentionally writing over the triangle, I'm trying to demonstrate that it is actually overwriting it every frame. The triangle can be redrawn by pressing select. If you press start you can manually rotate the square with the D-pad. It hits around 5 or 6 FPS with this scene.

12x21 tiles of CHR-RAM are used for each frame. Each tile is used twice; the screen is split so that the left half only displays the low bitplane and the right half the high, this is simply accomplished with attribute tables and palettes. This gives a 192x168 pixel area where we can render arbitrary 1-bit images.

The main trouble is accessing CHR-RAM. As it can only be safely read and written during NMI, I came up with a scheme to allow the main code to generate a "display list" which would be run during the next NMI. This is written to a ring buffer, which at present takes up almost half of RAM.

I made a decision early on not to use WRAM, so it is not possible to store the bitmap in CPU-accessible RAM for buffering purposes. Thus the NMI routine must:
- set the VRAM address
- read the currently set pixels
- perform the set or clear operation
- store this somewhere temporarily (since VRAM is accessed serially)
- set the address back to the start
- write the stored data back

There are quite a few optimizations on this theme. For instance, if we are updating a dirty tile we can update bits and perform the necessary copy simultaneously. I have a system that tracks for each tile whether it was updated on the even or odd frame, and how many primitives currently occupy that tile. that when the last prim is removed I can do a fast clear of the tile.

My end goal for the moment is to get a bit more speed (already 30% faster than two days ago) and do something in the style of the old Windows Mystify screensaver.

Source for NESHLA is at https://github.com/hcs64/Nestify/

---
Current ROM http://hcs64.com/files/nestify7.nes

by tepples on 2011-12-29 (#88115)

Wireframe with dirty tiles? Sounds like Ian Bell's tank demo.

Mystify screensaver? Sounds like Qix.

by hcs on 2011-12-29 (#88117)

Cool, I didn't realize there was source available for the Tank Demo. I'll hold off on looking at it for a while, but it should be fun to see what I've missed.

by hcs on 2011-12-29 (#88141)

Very little progress with speed, but I do have the full Mystify effect running:
http://hcs64.com/files/nestify1.nes

Pretty awful speed, gets under 1 FPS regularly. But I did what I set out to do lo those months ago.

by thefox on 2011-12-29 (#88148)

Not bad for not using WRAM (pretty sure Tank Demo does).

by hcs on 2011-12-29 (#88149)

Nope, though Elite uses WRAM the Tank Demo does not. Though it also doesn't persist stuff across frames, which is where a lot of the unavoidable CHRRAM access comes from.

by tepples on 2011-12-29 (#88154)

You technically don't need to persist stuff across frames if you just redraw the four polygons from scratch every frame.

by hcs on 2011-12-30 (#88181)

The "lines erase each other" behavior is part of the effect I'm going for.

If I wanted non-interfering lines I'd have to be a lot smarter than I am now to redraw without having the last 3 polygons still have to pull from CHRRAM to merge blocks. So unless I can draw the 4 lines of each side simultaneously (which is worth looking into for other uses) I don't get much advantage, and additionally I'd need to redraw these 16 long lines each frame. Real wireframe has the advantage of fewer intersections (and near-intersections), and the majority (corners) are completely predictable, so when drawing you can hold onto the corner between lines. Which is something I ought also to do, but corners are just a small part of the performance issues here.

Which is not to say that I shouldn't do all this stuff much faster than it is currently done.

by Kreese on 2011-12-30 (#88183)

Cool. Do you actually plot dots, or is it tile swapping?

I want to see some line vector cube, hidden surface! Add some sprites and music too.

by tepples on 2011-12-30 (#88188)

What the tank demo does is store a copy of the nametable in RAM and then use that for allocating tiles in each frame. Before plotting each pixel, it looks in the RAM copy of the nametable to check whether the tiles are already allocated, and if not, allocates the next numbered tile and writes it back to the nametable. Then during what appears to be forced blanking, it rewrites the nametable and all used tiles using some sort of double buffering technique.

by hcs on 2011-12-30 (#88191)

Kreese wrote:
Cool. Do you actually plot dots, or is it tile swapping?
I want to see some line vector cube, hidden surface! Add some sprites and music too.

It's all naive Bresenham dot drawing, which is again part of the problem. I'd like to flesh it out more fully (some music would be fun), but it's already running way too slow at the moment. I don't think sprites would mix well with the ridiculous framerate, either. I considered a cube but it just didn't seem that exciting.

tepples wrote:
What the tank demo does is store a copy of the nametable in RAM and then use that for allocating tiles in each frame. Before plotting each pixel, it looks in the RAM copy of the nametable to check whether the tiles are already allocated, and if not, allocates the next numbered tile and writes it back to the nametable. Then during what appears to be forced blanking, it rewrites the nametable and all used tiles using some sort of double buffering technique.

I implemented it with a totally static nametable on the assumption that it would just be more trouble to reallocate everything. Maybe I need to reconsider that.

My current scheme is to ship tiles out as soon as possible, but that is counterproductive when I just need to read them in again. It may make sense to ditch the code-based ring buffer for a tile- (or line-)based LRU, and have the eviction and writing (until the end of the frame) handled directly during vblank. This would make more efficient use of memory, but at the cost of more precious vblank cycles. I don't like the potential for thrashing, though it can't be much worse than what I have now, and when under the cache size it would be well better.

Maybe I should go further in the direction of cycle counting and allow the line drawing to happen opportunistically in vblank? So many choices...

by hcs on 2012-01-02 (#88280)

Got another 10% or so out of it with a simplified dlist buffering strategy. This way, whatever is available with the NMI hits gets used, and a BRK is used to easily tell how far it was able to execute. Still a shambling monstrosity.

http://hcs64.com/files/nestify2.nes

Interestingly, it is actually slower (4%) on the rotating square benchmark. This is probably due to it wasting time once the vblank has filled up, whereas it used to be able to run ahead and start filling the next vblank. It does generally run faster on Mystify, more testing shows by 13%. It seems like I need to bring back the old multiple buffering while still allowing an incomplete dlist to run.

---

All around faster with a more flexible buffering method.
http://hcs64.com/files/nestify3.nes

by hcs on 2012-01-03 (#88355)

Just when I was about to give up, fixed some ugly bugs (should always use bcc/bcs for unsigned compares) for both stability and speed, it's almost tolerable now. Also made some modifications to the scheme I was using for opportunistic processing, and turned off the PPU while updating to be safe and not screw everything up if I wind up a few cycles over.

http://hcs64.com/files/nestify4.nes

by Bregalad on 2012-01-04 (#88386)

Nice,
what are the differences between this and the "Tank Demo by Ian Bell" that already exists on the NESdev main page ? I think your demoes "only" updates the pattern table while Ian Bell's updated both pattern and name table, but I'm not so sure how they made it that fast.

by tepples on 2012-01-04 (#88388)

The secret is that the tank demo's tiles are double buffered, and any blank tile is never written to VRAM.

by hcs on 2012-01-04 (#88397)

tepples wrote:
The secret is that the tank demo's tiles are double buffered, and any blank tile is never written to VRAM.

Yeah, I think that's the big one. Another is that each frame is buffered as much as possible in main RAM before sending to CHRRAM, mine doesn't attempt to do that and has to do a lot of work reading stuff out of CHRRAM and immediately putting it back whenever it updates a tile again. As I fully double buffer the whole pattern table I also have to waste time copying anything still dirty between frames (though I can often pair this with a frame's first update).

I'm theoretically able to push 32 tiles per frame (60 cycles per tile, 12 to set VRAM addr, 6 for each line), Bell's actually is somehwhat slower than that (93 cycles per tile, 29-ish+8*8 for each line), and then it still needs to update the name table. All of those updates are managed by nice families of subroutines in ROM, though, whereas my code spends a lot of time generating opcodes in RAM that it then runs straight through during vblank.

Where I lose is in making updates: if I had to update every line of a tile it costs 160 cycles (12 adr, 4 dummy read, 4 read+2 immed op+3 ZP write*8, 12 adr 6 write*8 + 12 for a JSR/RTS). I could improve that by getting rid of the JSR, but then I wind up using almost twice as much buffer for it (the ZP writes store into immediate ops in a routine that we then jump to for the write phase), and both buffer and vblank cycles end up being limiting factors at different times.

Besides the massive difference in skills, it's also a difference in goal. I wanted to get as close as possible to a fully bitmapped display. The tank demo wanted to do 3D wireframes (and a nice BG) and does 'em best. Whenever the framerate starts to slow to a crawl, that is something Bell's engine simply couldn't do with its tile budget. I still haven't read through the source, though, so this is based on watching it in FCEUX's debugger.

by tepples on 2012-01-04 (#88403)

Blargg and I wrote a fully bitmapped display driver once, but it was designed primarily for text (an e-book reader), not wireframes. If I were to clone Mystify, I'd use an engine designed for wireframes.

by hcs on 2012-01-04 (#88406)

Mystify isn't a well-behaved wireframe, though. There's a lot more line intersections at hard-to-predict places, very long lines running close to each other, and you need to do clears as well as sets.

I'd be interested to see that ebook reader, but I wasn't able to find a link on your site, is it released?

by tepples on 2012-01-04 (#88408)

hcs wrote:
Mystify isn't a well-behaved wireframe, though. There's a lot more line intersections at hard-to-predict places, very long lines running close to each other, and you need to do clears as well as sets.

A full clear is implicit in each frame with an engine like Bell's.

hcs wrote:
I'd be interested to see that ebook reader, but I wasn't able to find a link on your site, is it released?

No, but it was discussed in this topic.

by hcs on 2012-01-04 (#88412)

tepples wrote:
A full clear is implicit in each frame with an engine like Bell's.

But that's why it isn't applicable,
1) Since it messes up the interference effect I remember
2) It'd be even slower drawing 16 huge (potentially screen-diagonal) lines each frame. I suppose we could take each set (those tracking the evolution of a side of the polygon) and draw them together to save a lot of duplicated effort combining them later. Still not sure what would be a good way to handle the other intersection cases.

The vblank actually isn't the limiting factor right now, it's the line drawing code itself. I'm thinking of using Bresenham's other "Run Length Slice" algorithm on longer lines, where the cost of the division can be amortized. Even my naive implementation could probably be improved with some thought, I've still got a lot of ROM. I should really take a look at Bell's method for line plotting, that much at least should be completely applicable to the full bitmap model.

tepples wrote:
hcs wrote:
I'd be interested to see that ebook reader, but I wasn't able to find a link on your site, is it released?

No, but it was discussed in this topic.

Cool, thanks.

by hcs on 2012-01-17 (#88970)

Got the non-vblank processing to catch up by some optimizations in how dlists are generated, about 17% faster that v5. Now the limitation is vblank time again.
http://hcs64.com/files/nestify5.nes

by tepples on 2012-01-17 (#88980)

Looking great.

by Ambit_Energy on 2012-01-24 (#89217)

thefox wrote:
Not bad for not using WRAM (pretty sure Tank Demo does).

Yap, not bad... All good

by hcs on 2012-01-25 (#89262)

Thanks spam guy!

I've been working hard on this for the past week, but only made small progress until the past two days.

1. I've ripped out the old 6502 generating code in favor of a threaded code implementation, where I keep a ring buffer of subroutine addresses-1 on the stack page, and the NMI routine executes these directly with each RTS. Arguments for the routines (VRAM addresses and bitmap data) are kept in a separate array. I put the cycle cost of each subroutine in the byte before it, so the address-1 tables allow that to be easily looked up. There is a whole lot of fairly uninterestingly code to support all the variant routines, in some cases I have 32 versions of each so I don't need to spend another byte specifying the high 8 bits of the address. It has expanded greatly as a result, from just under 8KB in v5 to 30KB in v6. I have Python generating some code (as well as lookup tables), though I don't really use it as more than a powerful macro preprocessor. I'm becoming less and less enthused with NESHLA as I deal with its limitations and faults.

2. The biggest improvement is an aggressive caching system, which dynamically allocates 64 tiles that are kept in RAM until they become empty. Any more must still go out to CHRRAM. Cache is interleaved with other memory in order to make the best use of space.

Overall it is now about 25% faster than last time. The slow parts aren't much faster, but the fast parts are speedy.

http://hcs64.com/files/nestify6.nes

by Bregalad on 2012-01-25 (#89264)

Tank demo does not use WRAM, it says "32k PRG and nothing else".

I guess the big difference is that Tank Demo (and Elite) relies on the fact there is many completely empty tiles on the screen so they only upload the necessary characters and the name table. This demo however does only update characters, and have a completely free "bitmap surface" in the middle of the screen that is hardwired and never changed.
This implies more character updating, but no tame table updating.

by hcs on 2012-01-25 (#89266)

Yeah, I noted that in my response, spam guy just picked a line at random.
---
Another day, another 10%: http://hcs64.com/files/nestify7.nes

I realized that I could use the stack pointer directly to index the data that the routines would deal with, that and putting one byte of the 9 on the zero page (interleaved with single byte values) saved 5 cycles per command, and simplified writing considerably (this version uses 2KB less ROM than v6).

I also reworked caching and double buffering again, cache writes are now limited to only the lines that need to be written to a frame. I had implemented this before but due to various issues (worked out by drawing a lot of state machines) it wasn't providing measurable benefit.

Also caught a bunch of cycle counting bugs that were causing v6 to bounce occasionally when it took too long for vblank. That took away some small bit of speed for the sake of caution, the other changes much more than made up for it.

I tried doing color cycling, but most colors look awful with vertically-oriented diagonals on a black background, on my old CRT anyway, so I've kept that on its own branch for now.

by Bregalad on 2012-02-13 (#89922)

Well this inspired me to do my own vector graphics code.
And guess what I did all it in an afternoon (well an afternoon + an evening to be exact).

It works quite differently form hcs' and from Ian Bell's I think I took the advantages of both and tried to do my best on it. If someone is interested just say so.

I can currently draw arbitrary pixels and almost* arbitrary lines on a 256x128 pixels monochrome surface.
As for speed it seems I can draw approximately a dozen of vectors per frame I have no idea if this is supposed to be good or bad.

(* I still have bugs with lines longers than 64 pixels I guess it's due to some kind of overflow in 8-bit signed numbers and I need twice the lenght in the line drawing algorithm)

The real problem is that I lack any source for vector graphics to play with my engine. I can draw static images but it gets boring very quickly, and I don't want to code crazy matix rotations or stuff of the like.

The goal would be to have some vertex "video streaming" where I could just have a list of 2D coordinates, and render to the screen. Anyone have an idea how I could do that other than manually entering all the cordinates (which comes quickly annoying and tedious) ?

by hcs on 2012-02-13 (#89924)

Cool, I'd like to see it. For some point of comparison, it takes 12 frames for Nestify to draw its initial set of 16 (long, overlapping) lines. The next update (erasing 4 and drawing 4 lines) takes 9 frames, and once it gets going it's usually around 7 (though with notable slowdowns). I haven't thoroughly measured (or counted lines) in the Tank Demo but it usually completes an update in one or two frames.

Something like a few rotating regular polygons is easy to do with a sine table, if you want something to test with that takes no artistry. With N entries you just pick x, x+N/4, x+2N/4, x+3N/4 and you have the angles of the corners of a square, just march x along to rotate it.

The cool thing about using Mystify as a test is that it is really easy to set up, and it exercises pretty much every possible scenario of combined writes, erasures, reuse of tiles, line lengths, etc. It's a decent stress test, which makes it an interesting challenge to a rasterizer.

by Bregalad on 2012-02-14 (#89939)

Thanks for the idea of doing rotating polygons, it's not very exiting but it gets the job done.

I think I can now share a working version of my engine here :
http://dl.dropbox.com/u/23465629/NES_junk/vector.nes
This ROM is optimized for NTSC (since it's what most emus default to) but I can make the engine works just as well on a PAL console. Oh and I tested both on my real NESes, and both works perfect.

It takes 3 frames to plot 10 vectors (in fact it takes about 2.5 frames but yeah this rounds up to 3). Since half of a frame is reserved for updates anyways I can draw approx. 5 vectors per frame, but it really depends on the lenght of the segment, since each pixel is plotted individually the number of white pixels in a frame directly determines how long it is to render this frame.

by tokumaru on 2012-02-14 (#89950)

Pretty smooth demo. You know what I would just love? To see the NES doing animated sequences like the ones in Another World... I'm sure that a LOT of optimization would be necessary for that!

by hcs on 2012-02-14 (#89951)

Bregalad wrote:
Thanks for the idea of doing rotating polygons, it's not very exiting but it gets the job done.

I think I can now share a working version of my engine here :
http://dl.dropbox.com/u/23465629/NES_junk/vector.nes

Very nice, and in ~ 1K! Whereas my massive monthlong project does only very slightly better at the same task: vectest4

Timing on the TV for 4 full revolutions gives 20.3 FPS for yours (as expected) and 23 FPS for mine (as is hardly predictable due to a lot of dynamic behavior).

I'd love to see the source (mine is up on the vectest branch of Nestify on github), though at 1K it shouldn't be hard to read through a disassembly.

by Bregalad on 2012-02-15 (#89957)

I don't think it's such an achievement to make it so small because there is nothing that really can take up much space.
I just have an unrolled loop that takes quite some space and the sine table for the rotation of course.

I uploaded the source here, but don't complain if you think it's a huge mess !
Especially the line plotting part which I arranged from some C code I found on the NES and which still needs some cleanup/optimisation/fixes.

To compile it you'll need WLA-DX and some other files but I think you wanted to see it just to look at it and not to comile it so it should not be a problem.

Basically how the engine function is like that :
The main difference between it and yours (and Ian Bells) is that I don't use any kind of Double Buffering. For some reason I just wanted to get rid of it whenever possible to simplify the logic, which also means faster code.
I also wanted something in the middle of the screen, the borders not being important. This leaves me the possibility to do VRAM updates at the top and bottom quarter of the screen using forced blanking.
At first I thought about a system that would update half of data at the bottom of a frame, then the other half at the top of the following frame, so that I didn't need double buffering, because all data could be updated between the end of the display section of the first frame and the start of the display section of the next frame.

In order to update as few data as possible I thought it was better to take Bell's route : Update the map as well as chars (instead of updating only chars like your demo), that way I avoid updating a lot of blank chars uselessly, at the price of having to update the map as well.
Since I use a 256x128 surface, I have 512 map bytes to update every time an image is rendered.

Finally I was able to get something even better than I planned as I was able to update 512 characters bytes and 512 map bytes in a quarter of the screen with simple partially unrolled code. I always update this even if not all characters are used, so that the timing is constant and everything is simpler, I can update all data in one frame instead of splitting it in two.

I could use either the top quarter of the screen or the bottom quarter of the screen. The problem when doing it at the bottom is that I can't synchronize easily, I'd have to wait for VBlank, then wait for the bottom of the screen, then use forced blanking and update. But all this waiting is a waste of precious CPU time, and I didn't want to jerk around DMC IRQs unless there was no other solution, so I simply decided to update at the top quarter of the screen.
This leaves the bottom section completely untouched, I didn't use it but you could put other graphics, such as text, there.

That way I have absolutely no constraint on FPS when it comes to updates, it can be as fast as 60FPS or as slow as you want and that will never be a problem.
You could even render an image, freeze it, run another program that overwrites all the RAM, then render another image, it will never be a problem.

The only real limitation (aside CPU power obviously) is the fact only 64 tiles can be used for this. I think my engine could be made to extend for updating 64 other tiles, limit the speed to max. 30FPS and use some kind of double buffering on the tiles, but I did not fee like doing that.

Quote:
Pretty smooth demo. You know what I would just love? To see the NES doing animated sequences like the ones in Another World... I'm sure that a LOT of optimization would be necessary for that!

This is not really related as this is no wireframe graphics, but I don't think a lot of optimization would be needed, most of it can be pre-rendered, and use sprites for what moves. The only trick would be use solid BG tiles for inner part of sprites which are way too big to be displayed.

I don't think it'd take a lot of optimisation, but rather a lot of artistic work and a lot of ROM space.
I think there was a Famicom game I tried once with such impressive graphics, but I don't remember its name as it was in Japanese and was not "rememberable". Of course you couldn't go far in the game without knowing japanese so I didn't care much about the game but it had impressive graphics.

by hcs on 2012-02-15 (#89982)

Bregalad wrote:
I don't think it's such an achievement to make it so small because there is nothing that really can take up much space.
I just have an unrolled loop that takes quite some space and the sine table for the rotation of course.

Sure, but being relatively compact and fairly efficient with RAM could make it more feasible to use, potentially.

Quote:
Especially the line plotting part which I arranged from some C code I found on the NES and which still needs some cleanup/optimisation/fixes.
C code on the NES? Or the net? Doesn't look too bad, I got the gist of it.

Quote:
In order to update as few data as possible I thought it was better to take Bell's route : Update the map as well as chars (instead of updating only chars like your demo), that way I avoid updating a lot of blank chars uselessly, at the price of having to update the map as well.
Yeah, you win in the end if you don't have to update empty, but with the size I'm using I couldn't find a good way of keeping track of tile mappings.

Quote:
The only real limitation (aside CPU power obviously) is the fact only 64 tiles can be used for this. I think my engine could be made to extend for updating 64 other tiles, limit the speed to max. 30FPS and use some kind of double buffering on the tiles, but I did not fee like doing that.
Yeah it seems like it would be straightforward to use different sets of tiles, just init NextChar to something different for each buffer. Then you can take as long as you need updating patterns and names, you could perform the swap just by switching to a different nametable.

Regarding filled polygons, I've been giving some thought to that, but gladly I haven't let it take over my life yet like this last project did. Having something worth displaying at a fairly low framerate is a big issue.

by Bregalad on 2012-02-16 (#90003)

I meant C code I found on the net, sorry for the lapsus.