Split from: viewtopic.php?f=12&t=14281&p=172090#p172076
> The sprites should render into a line buffer once each scanline (like the real hardware does), rather than searching through the list of 32 active sprites every pixel.
Do you mean pre-render all 512 pixels for the scanline all at once?
I have trouble believing the real hardware sets aside room to hold 512 pixels+metadata (priority, transparent, etc.)
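If that is the idea, I picture something roughly like this. To be clear, this is a sketch in made-up names, not higan code, and it assumes 8-pixel sprites and a 256-pixel line just to keep it short:

```cpp
// Hypothetical sketch, not higan's code: decode the (at most 32) in-range
// sprites once per scanline into a small line buffer, so the per-pixel fetch
// becomes a plain array read instead of a search over the active sprite list.
#include <cstdint>
#include <cstring>

struct ObjPixel {
  uint8_t color;     // palette index; 0 = transparent
  uint8_t priority;  // OBJ priority bits
};

struct ActiveSprite {
  int16_t x;         // screen X of the sprite's first pixel
  uint8_t row[8];    // decoded tile row for this scanline (0 = transparent)
  uint8_t priority;
};

void renderObjLine(const ActiveSprite* list, int count, ObjPixel line[256]) {
  std::memset(line, 0, sizeof(ObjPixel) * 256);
  for(int s = 0; s < count; s++) {          // iterate in OAM order
    for(int i = 0; i < 8; i++) {
      int x = list[s].x + i;
      if(x < 0 || x > 255) continue;        // off-screen
      if(list[s].row[i] == 0) continue;     // transparent sprite pixel
      if(line[x].color != 0) continue;      // earlier (lower-index) sprite wins
      line[x] = {list[s].row[i], list[s].priority};
    }
  }
}
```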
> As for the windows, you have them doing a large amount of branch-heavy calculations
Yeah, this code is modeled more like hardware. Real hardware isn't going to care about branch conditions. It'll be doing many of those operations in parallel, and it has the same amount of time to compute the window information regardless of whether it can short-circuit some unnecessary tests.
I do this a lot actually in the PPU core. But it's probably worthwhile to rein that in a bit.
And yes, I know ... "making C++ code look like hardware design is stupid"; not least because C++ is nothing like even Verilog in operation. I'm willing to compromise a bit here where reasonable.
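To illustrate what I mean by that (a toy two-window test, not higan's code and not the exact SNES combination rules), the hardware-ish style evaluates every term unconditionally and only combines the results at the end:

```cpp
// Toy illustration only: every term is computed, like parallel hardware
// would, rather than short-circuiting out of the tests early.
#include <cstdint>

struct Window {
  bool enable;
  bool invert;
  uint8_t left, right;
};

bool windowTest(const Window& one, const Window& two, uint8_t logic, uint8_t x) {
  // Both windows are always evaluated; note & instead of && throughout.
  bool a = one.enable & (((x >= one.left) & (x <= one.right)) ^ one.invert);
  bool b = two.enable & (((x >= two.left) & (x <= two.right)) ^ two.invert);
  switch(logic & 3) {
  case 0:  return a | b;     // OR
  case 1:  return a & b;     // AND
  case 2:  return a ^ b;     // XOR
  default: return !(a ^ b);  // XNOR
  }
}
```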
> at the time you were very emphatic "I don't care how slow accuracy is because there's balanced and performance for that"
Important to stress that I still have my limits on this point. But yes, we need to speed up the accuracy profile, otherwise the ~10% of my userbase that remains will drop to ~2% of where it used to be with the next release.
> especially if you're willing to store them as byte-packed bitmasks and not as a kazillion bools.
This stuff I find is kind of iffy on modern hardware. The L1/L2 cache sizes are getting bigger and bigger on modern CPUs, and doing &(1<<bit) tests along the way tends to cost more than the smaller footprint saves. I've heard some lamenting about the std::vector<bool> bit-packing specialization, for instance. (Then again, on 64-bit PowerPC, sizeof(bool)==8, so ... yeah. Cache overhead becomes a much bigger concern there.)
blargg was the king of doing these extreme bit-manipulation optimizations (I loved 'bit-twiddling hacks' as well), but they tend to become less and less effective on modern CPUs. *Especially* the ones that try to throw in 2-3 extra math ops to eliminate a conditional expression.
Although that's a double-edged sword. I'd bet money that neither of the above holds on ARM, which would explain why higan is just pathetically awful on ARM, way beyond where it should be.
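For concreteness, the trade-off being discussed, in made-up types:

```cpp
// Byte-packed bitmask vs. one bool per flag. The packed form is 8x smaller,
// but every test pays for a shift and mask; the plain array is a byte load.
#include <cstdint>

struct PackedFlags {
  uint8_t bits[32] = {};                   // 256 flags in 32 bytes
  bool test(unsigned i) const { return bits[i >> 3] & (1u << (i & 7)); }
  void set(unsigned i)        { bits[i >> 3] |= (1u << (i & 7)); }
};

struct PlainFlags {
  bool flags[256] = {};                    // one byte per flag on most ABIs
  bool test(unsigned i) const { return flags[i]; }
  void set(unsigned i)        { flags[i] = true; }
};
```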
...
Going on my own tangent ... something of an idea I've been kicking around my head for a while now ...
libco is of course a cooperative threading library where every thread gets its own stack. And this is essential for complex logic devices like the SNES CPU and SMP (short of enslavement, which I refuse to do.) However, it's pretty inefficient for cores that are really just single switch tables, which is obviously the case for both the PPU and DSP.
I'm very happy with the consistency through my scheduler right now, so I don't want to have some cores be state machines, and others be threads, and have really different styles of code to synchronize them.
However ... I've been thinking, it may be possible to write an abstraction layer around libco itself, and provide the wrapped co_create with new functionality where a stack size of 0 means "no stack at all", or in other words, "don't change out the stack frame."
Essentially, it would just act like setjmp/longjmp at this point (may be better to implement it in assembler to avoid OS-specific red tape inside those functions.) So long as we don't co_switch away in a deeper function, then this could potentially end up being much closer to the speed of a regular switch table, but it would keep the state manipulation code out of our core (the only state would be registers.)
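As a very rough sketch of what that wrapper's shape could be (purely hypothetical; the Thread and thread_create names are mine, none of this exists in libco today, and the stackless path is left as comments because it's exactly the risky part):

```cpp
// Hypothetical wrapper: stack size 0 selects a "stackless" thread that shares
// the caller's stack (register-only context); anything else falls through to
// the real co_create.
#include <libco.h>

struct Thread {
  cothread_t handle = nullptr;  // real libco cothread when size > 0
  bool stackless = false;       // register-only context when size == 0
  void (*entry)() = nullptr;
};

inline Thread thread_create(unsigned size, void (*entry)()) {
  Thread t;
  t.entry = entry;
  if(size == 0) {
    t.stackless = true;
    // Would capture only the register context here (hand-written assembler
    // rather than setjmp/longjmp, to avoid the OS-specific red tape).
    // Restrictions: entry may never switch away from inside a nested call,
    // and may not keep locals alive across a switch.
  } else {
    t.handle = co_create(size, entry);
  }
  return t;
}
```

The appealing part is that callers would keep using the same create/switch calls either way; only the cores that can live within those restrictions would pass a size of 0, so the scheduler stays uniform.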
However, this may not work if our stackless coroutine has local variable state. It's kind of risky. C++17 is potentially adding stackless coroutines (which allocate objects on first invocation to hold local variables), but from what I hear, the implementations so far are very poor performers. We may be able to wrap our libco abstraction around that later on. And at that point, we could end up with performance equivalent to a state machine, without having to change the way higan's schedulers work at all.