tepples wrote:
WedNESday wrote:
Code:
temp = (unsigned long)tempfunc - ((unsigned long)Cache+5)
Worked! Thanks for that, although I did just read on a site that there is no x86 equivalent of the 6502 $20/$4C opcodes.
Which is faster by the way, the $E8 or the $FF $15 way?
(Now I need a site with x86 timings that are POST 80486...)
There are three styles: U-V pipes (Pentium, Pentium MMX, Larrabee), 4-1-1 pipes (Pentium Pro, Pentium II, Pentium III, Pentium M), and NetBurst (Pentium 4). At some point, the pipes become so deep that you might have to use a sort of profile-guided optimization: generate the code several different ways, test them all, and link only the fastest correct version into the final binary.
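The "generate several versions, test them all, keep the fastest" idea can be sketched in plain C. This is a toy harness of my own (the mul9 candidates and pick_fastest name are made up for illustration): it checks each candidate for correctness against a reference, times it, and returns the fastest correct one, which you could then select at build or startup time.

```c
#include <stdio.h>
#include <time.h>

/* Two candidate codings of the same operation (multiply by 9). */
static unsigned mul9_a(unsigned x) { return x * 9; }
static unsigned mul9_b(unsigned x) { return (x << 3) + x; }

typedef unsigned (*mul9_fn)(unsigned);

/* Time each candidate over many iterations and return the fastest
   one whose results match the straightforward reference. */
static mul9_fn pick_fastest(mul9_fn cand[], int ncand)
{
    mul9_fn best = cand[0];
    double best_t = 1e30;
    for (int c = 0; c < ncand; c++) {
        /* correctness check first: a fast wrong answer is useless */
        for (unsigned x = 0; x < 1000; x++)
            if (cand[c](x) != x * 9) goto next;
        {
            clock_t t0 = clock();
            volatile unsigned sink = 0;   /* keep the loop from being optimized away */
            for (unsigned i = 0; i < 10000000u; i++)
                sink += cand[c](i);
            double dt = (double)(clock() - t0);
            if (dt < best_t) { best_t = dt; best = cand[c]; }
        }
    next:;
    }
    return best;
}
```

In a real build you would measure once, then link (or dispatch through a function pointer to) only the winner.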
By the way, I would ignore NetBurst -- the Pentium 4 is an evolutionary dead end, and modern Intel chips other than Larrabee are all based on Core2, which is an enhancement of the original PPro/P2/P3 out-of-order architecture. I think on Core2 the 4-1-1 decoding template is actually more like 4-1-1-1 or something... but it's a lot less important anyway because of micro-op fusion; decoding will usually not be a bottleneck, but rather loads (even cache hits), execution, and branch mispredictions.
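As for the snippet quoted above: the $E8 form is CALL rel32, where the 32-bit displacement is measured from the byte after the 5-byte instruction -- which is exactly what the +5 in that expression accounts for. A minimal sketch of patching one in at runtime (emit_call_rel32 is my own name, not from the quote; the memcpy writes host byte order, which matches the target since x86 is little-endian):

```c
#include <stdint.h>
#include <string.h>

/* Patch a 5-byte relative CALL (E8 rel32) at 'site' so it targets 'dest'.
   The displacement is relative to the END of the instruction, hence +5. */
static void emit_call_rel32(unsigned char *site, const void *dest)
{
    int32_t rel = (int32_t)((intptr_t)dest - ((intptr_t)site + 5));
    site[0] = 0xE8;              /* CALL rel32 opcode */
    memcpy(site + 1, &rel, 4);   /* displacement, little-endian on x86 */
}
```

The $FF $15 form (CALL through a memory operand) skips this fixup arithmetic but costs an extra load.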
Here are a couple of suggestions to get good performance on modern x86 chips:
(1) avoid cache misses when you can, especially L2 cache misses which will waste a hundred cycles or more.
(2) avoid long dependence chains (several instructions in a row each depending on the results of the previous instruction). Dependence chains are okay, but you want to have several of them going at once and keep them short where you can.
(3) don't use obsolete (slow) instructions, and don't use the AH/BH/CH/DH registers. Be careful with INC and DEC, for example: they write only a subset of the flags, so they can create extra dependencies that slow down your code (you can avoid this by using the larger ADD and SUB instructions instead).
(4) don't do any of the "slow" things with loads and stores: unaligned access that crosses a cacheline boundary, large load after small store with overlapping addresses, etc.
(5) Be careful with partial registers. If you want to load a 16-bit value, load it into a 32-bit register with MOVZX EAX, word ptr [foo] or whatever, and then do 32-bit math on it. It's better to do shifts and adds on full 32-bit registers than to write something like MOV AH, byte ptr [foo] ; ADD EBX, EAX, where the ADD reads EAX right after a partial write to AH and stalls.
(6) Try to avoid unpredictable branches. Random branches will be mispredicted about 50% of the time, but it can sometimes be even worse. Note that indirect branches (jumps and calls) are hard for the CPU to predict if they can go to a lot of different addresses; the classic case is the opcode dispatch in a bytecode interpreter. You might be able to mitigate the penalty by loading the target address into a register as early as possible (I remember something about jumps and calls through a register being "nearly free" on PPro/P2/P3 if the register was loaded around 40 cycles in advance).
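To illustrate suggestion (1), here's a toy sketch (the 1024x1024 size and function names are my own): both functions compute the same sum, but the row-order walk touches consecutive elements on the same cache line, while the column-order walk jumps 4 KiB between accesses and can miss on nearly every one.

```c
#include <stdlib.h>

enum { N = 1024 };

/* Row-major walk: consecutive accesses share a cache line, so with
   4-byte ints a 64-byte line gives roughly one miss per 16 elements. */
static long sum_rows(int (*m)[N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major walk of the same data: each access lands N*4 bytes
   past the previous one, defeating the cache on a large matrix. */
static long sum_cols(int (*m)[N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Same answer either way; only the access order (and the miss count) differs.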
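Suggestion (2) as a sketch (my own example, not from the post): the serial version is one long dependence chain where every add waits for the previous result, while the second version splits the work into four independent accumulators the out-of-order core can run at once.

```c
/* One long dependence chain: add i+1 cannot start until add i retires. */
static unsigned sum_serial(const unsigned *a, int n) {
    unsigned s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four short independent chains: the CPU can overlap them, hiding
   most of the add latency. */
static unsigned sum_4way(const unsigned *a, int n) {
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```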
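And here is the interpreter-dispatch case from (6) made concrete (a toy stack machine of my own; opcode names are made up). The switch compiles down to one indirect jump whose target depends on the next bytecode, so with many opcodes and data-dependent programs that single branch is nearly unpredictable:

```c
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Tiny stack-machine loop.  The switch becomes an indirect jump
   through a table -- the hard-to-predict branch described in (6). */
static int run(const int *code)
{
    int stack[16], sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++;              break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp];   break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp];   break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

Techniques like replicating the dispatch at the end of each handler (threaded dispatch) give the predictor one branch per opcode instead of one shared branch, which can help a lot here.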