Split from: viewtopic.php?f=12&t=14281&p=172090#p172076
> The sprites should render into a line buffer once each scanline (like the real hardware does), rather than searching through the list of 32 active sprites every pixel.
Do you mean pre-render all 512 pixels for the scanline all at once?
I have trouble believing the real hardware sets aside room to hold 512 pixels+metadata (priority, transparent, etc.)
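If that is the idea, I picture something roughly like this. To be clear, this is a sketch in made-up names, not higan code, and it assumes 8-pixel sprites and a 256-pixel line just to keep it short:

```cpp
// Hypothetical sketch, not higan's code: decode the (at most 32) in-range
// sprites once per scanline into a small line buffer, so the per-pixel fetch
// becomes a plain array read instead of a search over the active sprite list.
#include <cstdint>
#include <cstring>

struct ObjPixel {
  uint8_t color;     // palette index; 0 = transparent
  uint8_t priority;  // OBJ priority bits
};

struct ActiveSprite {
  int16_t x;         // screen X of the sprite's first pixel
  uint8_t row[8];    // decoded tile row for this scanline (0 = transparent)
  uint8_t priority;
};

void renderObjLine(const ActiveSprite* list, int count, ObjPixel line[256]) {
  std::memset(line, 0, sizeof(ObjPixel) * 256);
  for(int s = 0; s < count; s++) {          // iterate in OAM order
    for(int i = 0; i < 8; i++) {
      int x = list[s].x + i;
      if(x < 0 || x > 255) continue;        // off-screen
      if(list[s].row[i] == 0) continue;     // transparent sprite pixel
      if(line[x].color != 0) continue;      // earlier (lower-index) sprite wins
      line[x] = {list[s].row[i], list[s].priority};
    }
  }
}
```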
> As for the windows, you have them doing a large amount of branch-heavy calculations
Yeah, this code is modeled more like hardware. Real hardware isn't going to care about branch conditions. It'll be doing many of those operations in parallel, and it has the same amount of time to compute the window information regardless of whether it can short-circuit some unnecessary tests.
I do this a lot actually in the PPU core. But it's probably worthwhile to rein that in a bit.
And yes, I know ... "making C++ code look like hardware design is stupid"; not least because C++ is nothing like even Verilog in operation. I'm willing to compromise a bit here where reasonable.
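To illustrate what I mean by that (a toy two-window test, not higan's code and not the exact SNES combination rules), the hardware-ish style evaluates every term unconditionally and only combines the results at the end:

```cpp
// Toy illustration only: every term is computed, like parallel hardware
// would, rather than short-circuiting out of the tests early.
#include <cstdint>

struct Window {
  bool enable;
  bool invert;
  uint8_t left, right;
};

bool windowTest(const Window& one, const Window& two, uint8_t logic, uint8_t x) {
  // Both windows are always evaluated; note & instead of && throughout.
  bool a = one.enable & (((x >= one.left) & (x <= one.right)) ^ one.invert);
  bool b = two.enable & (((x >= two.left) & (x <= two.right)) ^ two.invert);
  switch(logic & 3) {
  case 0:  return a | b;     // OR
  case 1:  return a & b;     // AND
  case 2:  return a ^ b;     // XOR
  default: return !(a ^ b);  // XNOR
  }
}
```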
> at the time you were very emphatic "I don't care how slow accuracy is because there's balanced and performance for that"
Important to stress that I still have my limits on this point. But yes, we need to speed up the accuracy profile, otherwise the ~10% of my userbase that remains will drop to ~2% of where it used to be with the next release.
> especially if you're willing to store them as byte-packed bitmasks and not as a kazillion bools.
This stuff I find is kind of iffy on modern hardware. The L1/L2 cache sizes are getting bigger and bigger on modern CPUs, and doing &(1<<bit) tests along the way tends to cost more than the smaller footprint saves. I've heard some lamenting about the std::vector<bool> bit-packing specialization, for instance. (Then again, on 64-bit PowerPC, sizeof(bool)==8, so ... yeah. Cache overhead becomes a much bigger concern there.)
blargg was the king of doing these extreme bit-manipulation optimizations (I loved 'bit-twiddling hacks' as well), but they tend to become less and less effective on modern CPUs. *Especially* the ones that try to throw in 2-3 extra math ops to eliminate a conditional expression.
Although that's a double-edged sword. I'd bet money that neither of the above holds on ARM, which would explain why higan is just pathetically awful on ARM, way beyond where it should be.
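For concreteness, the trade-off being discussed, in made-up types:

```cpp
// Byte-packed bitmask vs. one bool per flag. The packed form is 8x smaller,
// but every test pays for a shift and mask; the plain array is a byte load.
#include <cstdint>

struct PackedFlags {
  uint8_t bits[32] = {};                   // 256 flags in 32 bytes
  bool test(unsigned i) const { return bits[i >> 3] & (1u << (i & 7)); }
  void set(unsigned i)        { bits[i >> 3] |= (1u << (i & 7)); }
};

struct PlainFlags {
  bool flags[256] = {};                    // one byte per flag on most ABIs
  bool test(unsigned i) const { return flags[i]; }
  void set(unsigned i)        { flags[i] = true; }
};
```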
...
Going on my own tangent ... something of an idea I've been kicking around my head for a while now ...
libco is of course a cooperative threading library where every thread gets its own stack. And this is essential for complex logic devices like the SNES CPU and SMP (short of enslavement, which I refuse to do.) However, it's pretty inefficient for cores that are really just single switch tables, which is obviously the case for both the PPU and DSP.
I'm very happy with the consistency through my scheduler right now, so I don't want to have some cores be state machines, and others be threads, and have really different styles of code to synchronize them.
However ... I've been thinking, it may be possible to write an abstraction layer around libco itself, and provide the wrapped co_create with new functionality where a stack size of 0 means "no stack at all", or in other words, "don't change out the stack frame."
Essentially, it would just act like setjmp/longjmp at this point (may be better to implement it in assembler to avoid OS-specific red tape inside those functions.) So long as we don't co_switch away in a deeper function, then this could potentially end up being much closer to the speed of a regular switch table, but it would keep the state manipulation code out of our core (the only state would be registers.)
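As a very rough sketch of what that wrapper's shape could be (purely hypothetical; the Thread and thread_create names are mine, none of this exists in libco today, and the stackless path is left as comments because it's exactly the risky part):

```cpp
// Hypothetical wrapper: stack size 0 selects a "stackless" thread that shares
// the caller's stack (register-only context); anything else falls through to
// the real co_create.
#include <libco.h>

struct Thread {
  cothread_t handle = nullptr;  // real libco cothread when size > 0
  bool stackless = false;       // register-only context when size == 0
  void (*entry)() = nullptr;
};

inline Thread thread_create(unsigned size, void (*entry)()) {
  Thread t;
  t.entry = entry;
  if(size == 0) {
    t.stackless = true;
    // Would capture only the register context here (hand-written assembler
    // rather than setjmp/longjmp, to avoid the OS-specific red tape).
    // Restrictions: entry may never switch away from inside a nested call,
    // and may not keep locals alive across a switch.
  } else {
    t.handle = co_create(size, entry);
  }
  return t;
}
```

The appealing part is that callers would keep using the same create/switch calls either way; only the cores that can live within those restrictions would pass a size of 0, so the scheduler stays uniform.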
However, this may not work if our stackless coroutine has local variable state. It's kind of risky. C++17 is potentially adding stackless coroutines (which allocate objects on first invocation to hold local variables), but from what I hear, the implementations so far are very poor performers. We may be able to wrap our libco abstraction around that later on. And at that point, we could end up with performance equivalent to a state machine, without having to change the way higan's schedulers work at all.