Optimizing multiple conditionals? (specifically modern CPUs)

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Optimizing multiple conditionals? (specifically modern CPUs)
by on (#240145)
Sorry if this is a silly question, but I haven't seen much literature on this subject, other than double condition checking with subtraction. I've been playing with x86-64 Linux assembly and was making code that would search through a buffer until it finally reached ascii '0'-'9', 'A'-'F', or 'a'-'f' (for hexadecimal conversion). While I could accomplish this with a "jump if less than" and a "jump if greater than" instruction for each range, I figure there must be a faster way of doing this, particularly on a modern pipelined processor where you generally want to avoid branching.

About branching, it's unfortunate branch prediction doesn't appear to be too standardized. I understand this is largely due to processor manufacturers trying to develop smarter branch prediction algorithms, but this also makes it more difficult to optimize code (not for any specific processor) dealing with branches. (I don't really understand why the Pentium 4 branch prediction hints later went unused; I figure even a compiler has more knowledge than a CPU about program flow...)
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240149)
Apart from using an AND (or OR) to fold 'A'-'F' and 'a'-'f' into one range and avoid a third range check, there's not really much you can do. You might be able to keep subtracting from the original value: since you have a byte operand, the CPU will have a few 8 bits' worth of ALU to spare, so it might be able to do original - X, original - Y, original - Z in parallel, versus doing original - x, result - y, result - z.
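
A rough C sketch of that idea (the scheduling itself is of course up to the CPU): three independent subtractions from the original value, which don't depend on one another and can overlap, instead of a dependent chain.

Code:
#include <stdint.h>

/* Each range check subtracts from the original character, so the three
   subtractions have no data dependencies between them and can execute
   in parallel on a superscalar core. */
static int is_hex_digit_3way(uint8_t c)
{
    int digit = (uint8_t)(c - '0') <= 9;   /* '0'-'9' */
    int upper = (uint8_t)(c - 'A') <= 5;   /* 'A'-'F' */
    int lower = (uint8_t)(c - 'a') <= 5;   /* 'a'-'f' */
    return digit | upper | lower;
}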
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240150)
Subtract 0x30
First condition: <= 9
AND ~0x20
Subtract 0x11
Second condition: <= 5

This gets it down to two comparisons...
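
In C-ish terms, a minimal sketch of the same idea (the helper name is just for illustration), relying on unsigned wraparound so each range test is a single compare against the range length:

Code:
#include <stdint.h>

/* Two-comparison hex digit test: shift each range so it starts at 0,
   then let unsigned wraparound reject everything outside it. */
static int is_hex_digit(uint8_t c)
{
    uint8_t d = c - 0x30;               /* '0'-'9' -> 0-9, everything else wraps high */
    if (d <= 9)
        return 1;
    d = (uint8_t)((d & ~0x20) - 0x11);  /* fold 'a'-'f' onto 'A'-'F', then shift to 0-5 */
    return d <= 5;
}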
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240152)
Wow, thank you both! :) I don't know why, but the example at the link in my post was a bit hard for me to understand; I now realize the whole point of the subtraction is to shift the range of values so that the first number in the range becomes 0. (Conversely, I suppose you could add until the largest number in the range is 0xFFFF or whatever and check for greater than or equal, but then you have to worry about register size.)

I had thought it was silly that seemingly no processors (at least none that I know of) are able to perform an OR operation on the processor flags rather than only overwriting them, but it occurs to me that for the majority of cases this wouldn't gain much.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240160)
Drew Sebastino wrote:
I don't really understand why the Pentium 4 branch prediction hints later went unused

Because it wasn't very useful.

Even 6502 had something equivalent. Any branch instruction takes more time if taken than if not, so the optimization: put the most common case in the not-taken path. (In practice a little more complicated, but I'm illustrating the idea.)

Intel branch instructions have a similar natural difference between branch taken or not. One way is slower than the other. Even without a hint, a compiler could usually reorganize code to put the common case on the fast side of the branch.
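
(For reference, the usual way to hand that kind of knowledge to a modern compiler is an annotation on the condition rather than a prefix on the instruction. A rough sketch using GCC/Clang's __builtin_expect extension; the likely/unlikely macro names are just a common convention:)

Code:
/* Tell the optimizer which way the branch usually goes, so it can keep
   the common case on the straight-line (fall-through) path. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value)
{
    if (unlikely(value < 0))   /* rare error path gets moved out of the hot path */
        return -1;
    return value * 2;          /* common case stays on the fall-through path */
}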

So... all the hint did was let you reverse the natural speed order of a branch instruction. Since in many cases you can just reorganize the code, the times where a hint prefix on the branch could really help are relatively rare. It just wasn't that great a tool.

Then you get to a later generation CPU with dynamic branch prediction, and it blows static prediction out of the water. The hint which was now only marginally useful is now entirely useless. The dynamic prediction will do a better job, and can't make use of the hint anyway. The hint is just a wasted byte at that point.

Drew Sebastino wrote:
I figure even a compiler has more knowledge than a CPU about program flow...

Compilers know a little bit of relevant useful information. For example they are likely to know when something is a simple loop counter, and can optimize that accordingly. There are a lot of cases where the compiler doesn't know much at all about which branch is going to be taken more, though.

Consider a utility that loads a file, processes its bytes, outputs some other file. There's no place in most high level languages to tell the compiler much about what kind of data is expected to be found in the file. You might know that e.g. your files are filled with mostly 0 bytes but the compiler won't.

Profile-guided optimization can help with this a bit. You record some runs of the program and save the analysis, and then the compiler can use that information to inform its own optimizer. This generally does improve performance a little bit, but it tends to require a lot of setup work (e.g. re-running the profile test every time you build).

This is actually one of the areas where "hand optimization" on the part of the programmer can help most: you can know more about the incoming data than a compiler can, and that knowledge can be applied, whether by revising the algorithm at a higher level, reorganizing the data itself, tweaking the high level code to get a more appropriate result from the compiler, rewriting something critical in assembly, etc.

Drew Sebastino wrote:
About branching, it's unfortunate branch prediction doesn't appear to be too standardized... ...this also makes it more difficult to optimize code (not for any specific processor) dealing with branches.

This is not really a problem. They're not standardized because successive generations are generally just better at predicting than previous ones. Nothing is lost there, and even if you somehow optimized your code for a previous generation in a way that doesn't carry forward (extremely unlikely), the newer CPU would be faster anyway. It won't matter.

Optimizing branches is a more or less generic problem: the rule of thumb is just to try and set up branches so that the same direction is taken in clusters. Randomly going left and right is the worst case. Going left a little more than right is better. Going only left many times, then switching and going only right many times is the best case.
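
A rough sketch of what that rule of thumb looks like in code (not a benchmark): the work is identical either way, and only the order of the data decides how predictable the branch is.

Code:
#include <stddef.h>

/* If `data` alternates randomly around the threshold, the branch mispredicts
   constantly; if the data is sorted (long runs of the same outcome), the
   predictor gets it right almost every time and the loop runs much faster. */
long count_above(const int *data, size_t n, int threshold)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] > threshold)   /* prediction quality depends entirely on the data's order */
            count++;
    }
    return count;
}

Sorting the input first is the classic way to turn the worst case into the best case here, at the cost of the sort itself.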

There really isn't much need (or ability) to get CPU-specific with this kind of optimization, for the most part. There are isolated cases but the bulk of the problem is generic.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240167)
rainwarrior wrote:
Because it wasn't very useful.

Even 6502 had something equivalent. Any branch instruction takes more time if taken than if not, so the optimization: put the most common case in the not-taken path. (In practice a little more complicated, but I'm illustrating the idea.)

The 6502 has nothing equivalent. In the case of a loop you have to branch backwards every iteration, even though taken is by far the most common case; the 6502's "branch prediction" is effectively always-not-taken, which is awful for loops. In the case of an if/else statement you'd be correct, but another instruction has to be used at the end of the "if" part to skip the "else" part, so the optimisation of keeping the most common case not-taken can sometimes be reversed. In practice we often don't mind this when writing 6502 code, because it's already enough trouble as it is.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240180)
rainwarrior wrote:
(In practice a little more complicated, but I'm illustrating the idea.)

My whole point was to appeal to a platform the OP is more familiar with. By choosing what is in or out of a branch, you save a cycle on the 6502. That's the only important part of the illustration.

6502 branch optimization is more complicated than just inverting a branch (the savings are too small, it's inadequate for loops, an if/else needs a jump that inverts the savings, etc.), but I thought the principle was clear. That kind of branch optimization did work better on the CPUs we're really talking about.

On a Pentium the branch penalty comes from a pipeline stall rather than simply from whether the branch is taken or not, so prediction is the only relevant factor. For loops, the earlier static prediction schemes already treated a backward branch (i.e. the most common kind for a loop) as predicted taken. The hint prefix did open up some nuanced optimization possibilities, but only marginally beyond what was already there without it.

That's the point: the hint didn't gain much, and dynamic prediction made it obsolete.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240207)
There was also the time I used bit twiddling hacks to eliminate branches from Saturated Arithmetic code used in Snes9x. To implement saturated addition, you only concern yourself with the bit just above the top of the value becoming set. Shift it right, multiply it by your mask, then OR with the mask, and you've saturated the number. When I did it for Snes9x, I combined the Red and Blue channels into the same 32-bit register to improve speed (see the sketch after the examples below).

Example: Branchless 8 bit saturated add:
//add the numbers
C = A + B;
//look at high bit only
D = C >> 8;
//Multiply by a mask (0xFF)
D *= 0xFF;
//Restrict range to 00-FF, then OR with FF if the high bit had become set before.
C = (C & 0xFF) | D;

Another Example: Branchless 8 bit saturated subtract:
//subtract the numbers
C = A - B;
//If number became negative, isolate the sign only, and invert it
D = ((C >> 31) & 1) ^ 1;
//Multiply by FF to make an AND mask (value 0 if the subtraction went negative, FF otherwise)
D *= 0xFF;
//Do the AND mask
C = C & D;
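
For the packed-channels idea mentioned above, here is a rough sketch (not the actual Snes9x code; the channel layout is just an assumption) of doing two 8-bit saturated adds in one 32-bit register with the same shift/multiply/OR trick:

Code:
#include <stdint.h>

/* Two 8-bit channels live in bits 0-7 and 16-23 of one 32-bit word, with the
   bytes in between left clear so each channel's carry lands in bit 8 / bit 24. */
uint32_t sat_add_two_channels(uint32_t a, uint32_t b)
{
    uint32_t sum   = a + b;                 /* both channel sums at once */
    uint32_t carry = sum & 0x01000100u;     /* isolate each channel's carry-out bit */
    uint32_t mask  = (carry >> 8) * 0xFFu;  /* 0xFF in every channel that overflowed */
    return (sum & 0x00FF00FFu) | mask;      /* clamp the overflowed channels to 0xFF */
}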
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240215)
Oh yeah, that's the other generic way to optimize branches: get rid of them!

Actually you'll find compilers do this automatically for a lot of simple things these days. Sometimes it's better to do e.g. a multiply and an add than a branch. A pipeline stall might easily be worth a few arithmetic instructions.
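
For example, a conditional assignment can be rewritten as pure arithmetic along these lines (a sketch; whether it actually beats a branch or a CMOV depends on the data and the compiler):

Code:
#include <stdint.h>

/* Branchless select: returns a if cond is nonzero, else b.
   0 - 1 gives an all-ones mask, 0 - 0 gives an all-zeros mask. */
uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = 0u - (uint32_t)(cond != 0);  /* 0xFFFFFFFF or 0x00000000 */
    return (a & mask) | (b & ~mask);
}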
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240224)
Dwedit wrote:
There was also the time I used bit twiddling hacks to eliminate branches from Saturated Arithmetic code used in Snes9x.

Hmm... Is there a particular reason you did not use any of the SSE instructions for this? I think that's the only place saturated arithmetic is even directly available on x86.
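
(For reference, the packed saturating adds are exposed through the SSE2 intrinsics; a minimal sketch of a 16-byte unsigned saturated add, with the function and pointer names being purely illustrative:)

Code:
#include <emmintrin.h>   /* SSE2 */

/* Saturating add of 16 unsigned bytes at once (PADDUSB);
   each byte of the result clamps at 0xFF instead of wrapping. */
void sat_add_16_bytes(unsigned char *dst, const unsigned char *a, const unsigned char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
}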
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240400)
Dwedit wrote:
There was also the time I used bit twiddling hacks to eliminate branches from Saturated Arithmetic code used in Snes9x.

On the other hand, I've done similar stuff to avoid a branch, only to have the optimizing compiler put the branch back in.

It does make me suspect that blindly removing branches isn't always faster, since the branchless code tends to be more complicated than branching code. I didn't think to benchmark it at the time, though.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240407)
Note that use of comparison operators (==, !=, <, <=, >, >=), logical operators (&& and ||), or the conditional operator (?:) implies a branch, so if you used any of them, you added a branch back in.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240409)
Dwedit wrote:
Note that use of comparison operators (==, !=, <, <=, >, >=)
That's not quite true. LAHF allows the flags register to be used in arithmetic. Whether the flags are in a useful encoding, or whether there's any way to coax the compiler into emitting LAHF, is another question. (A few earlier x86-64 Intel cores don't support it, but everything that's 32-bit only, less than ten years old, or made by AMD does. GCC's manual says that it will use the related SAHF to optimize a few math builtins.)
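
(The more common pattern in practice is that the comparison result is materialized as a 0/1 value, which compilers typically lower to CMP + SETcc with no jump at all; whether a given compiler actually does that isn't guaranteed. A minimal sketch:)

Code:
/* A comparison used as a value rather than as a branch condition. */
int is_in_order(int a, int b)
{
    return a <= b;   /* 0 or 1; typically CMP + SETLE, no branch */
}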
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240425)
Don't forget CMOV 8-)

Quote:
I've been playing with x86-64 Linux assembly and was making code that would search through a buffer until it finally reached ascii '0'-'9', 'A'-'F', or 'a'-'f' (for hexadecimal conversion).

There's probably an SSE/AVX version that blows everything else out of the water.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240429)
Rather surprisingly, there doesn't seem to be; text processing probably just isn't a big enough resource hog for this sort of thing.

I did, however, design a function that converts a 64 bit hexadecimal number to 16 ascii characters using several different sse instructions. I don't think this could be much, if any, faster without utilizing newer instructions.

Code:
  bswap rax                     ;reverse the order of bytes (number needs to be backwards)
  movq xmm1, rax                ;move low nibble qword into low qword of xmm1
  shr rax, 4
  movq xmm0, rax                ;move high nibble qword into low qword of xmm0
  punpcklbw xmm0, xmm1          ;interleave bytes
  pand xmm0, [lowNibbleMask]    ;and with mask to get low nibbles of xmm0
  movapd xmm1, xmm0
  pcmpgtb xmm1, [checkIfAtoF]   ;find which nibbles are represented as A through F
  psubusb xmm1, [correctAtoF]   ;subtract with saturation to yield 7 for these nibbles
  paddb xmm1, [asciiNumStart]   ;add the ascii base number offset
  paddb xmm0, xmm1              ;add xmm1 to xmm0 to obtain correct ascii values
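
For reference, here is a rough C/SSE2 intrinsics rendering of the same routine (my own sketch, not the original code; the constant values are guesses at what lowNibbleMask, checkIfAtoF, correctAtoF and asciiNumStart hold, and __builtin_bswap64 is a GCC/Clang builtin):

Code:
#include <stdint.h>
#include <emmintrin.h>   /* SSE2; x86-64 only because of the 64-bit moves */

/* Writes 16 hex characters (no terminator) for a 64-bit value, highest digit first. */
void u64_to_hex16(uint64_t value, char out[16])
{
    value = __builtin_bswap64(value);                          /* same role as BSWAP */
    __m128i lo  = _mm_cvtsi64_si128((long long)value);         /* supplies each byte's low nibble */
    __m128i hi  = _mm_cvtsi64_si128((long long)(value >> 4));  /* supplies each byte's high nibble */
    __m128i nib = _mm_unpacklo_epi8(hi, lo);                   /* interleave: hi0, lo0, hi1, lo1, ... */
    nib = _mm_and_si128(nib, _mm_set1_epi8(0x0F));             /* keep only the low nibble of each byte */
    __m128i adj = _mm_cmpgt_epi8(nib, _mm_set1_epi8(9));       /* 0xFF where the nibble is A-F */
    adj = _mm_subs_epu8(adj, _mm_set1_epi8((char)0xF8));       /* saturating subtract: 7 for A-F, 0 otherwise */
    adj = _mm_add_epi8(adj, _mm_set1_epi8(0x30));              /* '0' base, or '0' + 7 = 'A' - 10 */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(nib, adj));  /* nibble + offset = ASCII character */
}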


Also, unrelated, but does anyone know if there's any way for nasm/yasm to know the data size of macro arguments? (For example, al, ax, eax, rax.) I have macros that generate slightly different code based on one argument, and I currently pass that argument's size as a separate argument, but it would be nice if I could omit it.



Going off on a tangent now, I remember first getting into x86 and thinking: "why on earth do new instructions keep being added? How are there not enough already?" but now I keep wanting to use instructions that were introduced only a few years ago, lol. However, I've restricted myself to nothing more recent than SSE2; Notepad++ (the version I have, anyway) doesn't even recognize anything newer and won't format it.

I had heard x86 (especially modern x86-64) was a nightmare to program for, but I find it rather fun; there are plenty of registers that you often don't even need to use because immediate performance is so good, and while there are a bazillion different instructions, once you get to learn most of them, they can really make solving various different problems much easier. (Of course, if I cared about ease of development, I wouldn't touch assembly at all, lol.) Not to mention the documentation is excellent; meanwhile, even with ARM, you need to look through several different documents. As for PowerPC and many other (dead) ISAs: good luck.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240441)
POWER9 was released in 2017 and POWER10 is coming next year. Someone forgot to tell it it's dead :P
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240442)
Is "defunct in devices intended for home and small business use" more honest?
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240443)
My Blackbird begs to differ.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240500)
I finally found (or made, rather) a way to find the size of an argument in NASM:

Code:
%macro find_argument_size 1
  %defstr argument %1
 
  %substr test argument 1,4
  %if test ==   'byte'
  %define size  'byte'
  %elif test == 'word'
  %define size  'word'
  %else
    %substr test argument 1,5
    %if test ==   'dword'
    %define size  'dword'
    %elif test == 'qword'
    %define size  'qword'
 
 
    %else
      %substr test argument 1,2
      %if test == 'si' || test == 'di' || test == 'bp' || test == 'sp'
        %substr test argument 3,1
        %if test ==  'l'
        %define size 'byte'
        %else
        %define size 'word'
        %endif
 
 
      %else
        %substr test argument 1,1
        %if test == 'a' || test == 'b' || test == 'c' || test == 'd'
          %substr test argument 2,1
          %if test ==  'x'
          %define size 'word'
          %else
          %define size 'byte'
          %endif
 
 
        %elif test == 'e'
        %define size  'dword'
 
 
        %elif test == 'r'
          %substr test argument 3,1
          %if test ==   'b'
          %define size  'byte'
          %elif test == 'w'
          %define size  'word'
          %elif test == 'd'
          %define size  'dword'
          %else
            %substr test argument 4,1
            %if test == 'b'
            %define size  'byte'
            %elif test == 'w'
            %define size  'word'
            %elif test == 'd'
            %define size  'dword'
            %else
            %define size  'qword'
            %endif
          %endif
        %endif
      %endif
    %endif
  %endif
%endmacro

It's pretty ugly though... :lol: I didn't implement segment registers, but I really don't see that being a problem. However, it only works with lowercase; I don't know if there is some existing functionality to convert to uppercase or lowercase, but if there isn't, I can easily make a solution.

Does anyone know if there is a good website for posting code like this? (I'm not counting github; I think that's more geared toward projects and not code snippets, and it isn't x86 assembly oriented) I figure this could be useful for other assembly programmers. If not, as I now know, web hosting is cheap, lol.
Re: Optimizing multiple conditionals? (specifically modern C
by on (#240503)
GitHub has a "Gist" service for single-file repositories.