Drew Sebastino wrote:
I don't really understand why the Pentium 4 branch prediction hints later went unused
Because it wasn't very useful.
Even the 6502 had something equivalent: any branch instruction takes more time if taken than if not taken, so the optimization is to put the most common case on the not-taken path. (In practice it's a little more complicated than that, but that's the idea.)
Intel branch instructions have a similar natural difference between taken and not taken: one way is slower than the other. Even without a hint, a compiler could usually reorganize code to put the common case on the fast side of the branch.
So... all the hint did was let you reverse the natural speed order of a branch instruction. Since in many cases you can just reorganize the code, the times where a hint prefix on the branch could really help are relatively rare. It just wasn't that great a tool.
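For what it's worth, the usual way to feed that expectation to the compiler nowadays isn't an instruction prefix at all, but a source-level hint the compiler can use when laying out the code. A minimal sketch, assuming GCC or Clang's __builtin_expect (the function and its error check here are just made up for illustration):

Code:
#include <stddef.h>

/* Convenience macros seen in a lot of C codebases: they tell the
   compiler which outcome is expected, so it can keep the common case
   on the straight-line (fall-through) path. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: count nonzero bytes, with an error check that
   almost never fires. The compiler will tend to move the rare case out
   of line, keeping the hot loop compact. */
long count_nonzero(const unsigned char *buf, size_t len)
{
    long count = 0;
    if (unlikely(buf == NULL))
        return -1;               /* rare case, off the hot path */
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            count++;
    }
    return count;
}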
Then you get to a later generation CPU with dynamic branch prediction, and it blows static prediction out of the water. The hint, which was only marginally useful before, is now entirely useless. The dynamic prediction will do a better job, and can't make use of the hint anyway. The hint is just a wasted byte at that point.
Oziphantom wrote:
I figure even a compiler has more knowledge than a CPU about program flow...
Compilers do know a little bit of relevant information. For example, they are likely to know when something is a simple loop counter, and can optimize that accordingly. There are a lot of cases, though, where the compiler doesn't know much at all about which branch is going to be taken more often.
Consider a utility that loads a file, processes its bytes, and outputs some other file. There's no place in most high level languages to tell the compiler much about what kind of data is expected to be found in the file. You might know that e.g. your files are filled with mostly 0 bytes, but the compiler won't.
Profile-guided optimization can help with this a bit. You record some runs of the program and save the analysis, and then the compiler can use that information to inform its own optimizer. This generally does improve performance a little bit, but it tends to require a lot of setup work (e.g. re-running the profile test every time you build).
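As a rough sketch of what that setup looks like with GCC or Clang (the exact flag names and profile handling vary by compiler and version, and the input file here is purely hypothetical):

Code:
/* pgo_demo.c - a branch whose bias only shows up at run time.
 *
 * Typical profile-guided build, roughly:
 *   1. cc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
 *   2. ./pgo_demo representative_input.bin    (writes profile data)
 *   3. cc -O2 -fprofile-use pgo_demo.c -o pgo_demo
 * The third step lets the optimizer see which way the branch below
 * usually went during step 2.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    long zeros = 0, others = 0;
    int c;
    FILE *f = (argc > 1) ? fopen(argv[1], "rb") : stdin;
    if (!f)
        return 1;
    while ((c = getc(f)) != EOF) {
        if (c == 0)   /* the compiler can't guess how often this is true... */
            zeros++;
        else          /* ...but a recorded profile can tell it */
            others++;
    }
    printf("%ld zero bytes, %ld other bytes\n", zeros, others);
    return 0;
}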
This is actually one of the areas where "hand optimization" on the part of the programmer can help most: you can know more about the incoming data than a compiler can, and that knowledge can be applied, whether by revising the algorithm at a higher level, reorganizing the data itself, tweaking the high level code to get a more appropriate result from the compiler, rewriting something critical in assembly, etc.
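To make that concrete with the mostly-zero file from above: a programmer who knows the data can skip whole runs of zeros a word at a time, which is an algorithm-level change that no branch hint or recorded profile is going to produce by itself. A minimal sketch (process_byte is just a placeholder for whatever the real work is):

Code:
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Placeholder for the real per-byte work. */
static void process_byte(unsigned char b) { (void)b; }

/* Process a buffer known (by the programmer, not the compiler) to be
 * mostly zero bytes: test 8 bytes at a time and skip zero words
 * entirely, falling back to byte-by-byte work only when something
 * nonzero shows up. */
void process_sparse(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i + 8 <= len) {
        uint64_t word;
        memcpy(&word, buf + i, 8);   /* safe unaligned load */
        if (word == 0) {             /* common case: whole word is zeros */
            i += 8;
            continue;
        }
        for (size_t j = 0; j < 8; j++)
            if (buf[i + j] != 0)
                process_byte(buf[i + j]);
        i += 8;
    }
    for (; i < len; i++)             /* leftover tail */
        if (buf[i] != 0)
            process_byte(buf[i]);
}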
Drew Sebastino wrote:
About branching, it's unfortunate branch prediction doesn't appear to be too standardized... ...this also makes it more difficult to optimize code (not for any specific processor) dealing with branches.
This is not really a problem. They're not standardized because successive generations are generally just better at predicting than previous ones. Nothing is lost there, and even if you somehow optimized your code for a previous generation in a way that doesn't carry forward (extremely unlikely), the newer CPU would be faster anyway. It won't matter.
Optimizing branches is a more or less generic problem: the rule of thumb is just to try and set up branches so that the same direction is taken in clusters. Randomly going left and right is the worst case. Going left a little more than right is better. Going only left many times, then switching and going only right many times is the best case.
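The classic illustration of that rule of thumb is running the same loop over the same values before and after sorting: once the data is sorted, the branch goes the same way in long stretches and gets dramatically cheaper on any CPU with dynamic prediction. A rough sketch (results depend on the CPU, and an aggressive compiler may vectorize the branch away entirely, so treat it as illustrative):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

static long sum_big(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= 128)     /* taken about half the time on random data */
            sum += v[i];
    return sum;
}

int main(void)
{
    static int data[N];
    for (size_t i = 0; i < N; i++)
        data[i] = rand() % 256;

    /* Unsorted: the branch direction is essentially random -> mispredicts. */
    clock_t t0 = clock();
    long a = sum_big(data, N);
    clock_t t1 = clock();

    /* Sorted: all the "not taken" iterations come first, then all the
     * "taken" ones -> the predictor is almost always right. */
    qsort(data, N, sizeof data[0], cmp_int);
    clock_t t2 = clock();
    long b = sum_big(data, N);
    clock_t t3 = clock();

    printf("unsorted: sum=%ld, %ld ticks\n", a, (long)(t1 - t0));
    printf("sorted:   sum=%ld, %ld ticks\n", b, (long)(t3 - t2));
    return 0;
}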
There really isn't much need (or ability) to get CPU-specific with this kind of optimization, for the most part. There are isolated cases but the bulk of the problem is generic.