tcc816 math optimization

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
View original topic
tcc816 math optimization
by on (#100128)
tcc816 is still one of the very few and the best free 65816 C compilers available, but it is certainly far from being perfect. One of things that make quite some trouble is that its math routines are really slow. It seems that they were done just to work somehow, without care for performance at all. I'm not too good in 65816 code, so I can improve it only moderately, but I think other people may suggest some routines and optimizations. Would be nice to join efforts to improve what we have for all snes-sdk or PVSnesLib users benefit.

Take a look at the existing code

Obvious improvements are to utilize 16 bit math where possible instead of 8 bit, and to unroll all loops, as few extra K of math code won't do big difference for SNES anyway. I'm sure there are more.
Re: tcc816 math optimization
by on (#100129)
For 16/8 division & modulo and 8*8 multiplication (and maybe 16*8 multiplication?) there's hardware support on the SNES that could be used. See e.g. this code.
Re: tcc816 math optimization
by on (#100130)
I know, but the docs says it won't work in Mode7, and Mode7 is one of the most interesting and distinctive SNES features, as well as one that needs fast math.
Re: tcc816 math optimization
by on (#100131)
Hmm. Which doc is that?

Anyway, there's also the mode7 coefficient matrix math registers ($211B-$2120). And if you really need to do a lot of 2D or 3D coordinate calculation I would suggest looking at the DSP-1.
Re: tcc816 math optimization
by on (#100133)
My mistake, it is related exactly to the matrix registers and the result that could be read from $2134-36. So at least part of the math code could be done using the HW resources. Still would be nice to improve remaing routines.

I know about DSP1, but I need to improve the compiled code performance itself, not a particular case.
Re: tcc816 math optimization
by on (#100134)
Simple improvement for tcc__shldi3 (saves 5 cycles/iteration):

Code:
tcc__shldi3:
lda.b 6,s ; hi word
sta.b tcc__r1
lda.b 8,s ; shift count
beq +
tax
lda.b 4,s ; low word
- asl a
rol.b tcc__r1
dex
bne -
sta.b tcc__r0
rtl
+
lda.b 4,s ; low word
sta.b tcc__r0
 rtl


The same change could be applied for SHR.
Re: tcc816 math optimization
by on (#100136)
Thanks. However, it seems that this routine never gets called - not in Classic Kong, at least (~8000 lines of C code). Or am I missing something?
Re: tcc816 math optimization
by on (#100137)
tcc-816 optimizes some shits with immediate shift counts IIRC (unrolling).
Re: tcc816 math optimization
by on (#100140)
I checked how many times these math routines are called in Classic Kong, and results are unexpected (by me).

tcc__mul, tcc__div are called about 10 times.
tcc__udiv is called a bit more, still not much.

Anything else is never called. I have some modulo operations in the game, but none of routines that supposedly do modulo operation are called.
Re: tcc816 math optimization
by on (#100142)
If your modulo operations are "% some_immediate_power_of_two" then they will be (or should be) optimized to an AND operation.
Re: tcc816 math optimization
by on (#100143)
They are immediate not power of two (3, 6, 10, 100, 1000, 10000 etc), also variables. I don't rely on compiler and always use & or shifts in case of power of two values math.
Re: tcc816 math optimization
by on (#100144)
Shiru, doesn't tcc816 support generation of assembly listing? That's an easy way to find out how it implements those divisions/remainders.

mic_ wrote:
tcc-816 optimizes some shits with immediate shift counts IIRC (unrolling).

Freudian slip?
Re: tcc816 math optimization
by on (#100145)
I wouldn't say it is easy. It generates assembly mess without any tabs and without C source in comments. Also, the code it generates is huge and unoptimal. Doable, of course, I just don't think it is that important right now how exactly it does (probably calls the division routine), I just didn't expect that most of the math routines are never used in a large program and that basically just three of them needs to be optimized.
Re: tcc816 math optimization
by on (#100149)
@thefox: Could be :P Probably just work-induced afternoon coma though.

Here's what tcc-816 gives me:

Code:
void opt_test3(unsigned int arg)
{
   int j;

   j /= 3;
   j = arg % 10;
}

Compiles to
Code:
opt_test3:
.ifgr __opt_test3_locals 0
tsa
sec
sbc #__opt_test3_locals
tas
.endif
lda -2 + __opt_test3_locals + 1,s
sta.b tcc__r0
tax
lda.w #3
jsr.l tcc__div   ; <-- calls tcc_div
lda.b tcc__r9
sta -2 + __opt_test3_locals + 1,s
lda 3 + __opt_test3_locals + 1,s
sta.b tcc__r0
tax
lda.w #10
jsr.l tcc__udiv   ; <-- calls tcc_udiv
stx.b tcc__r0
txa
sta -2 + __opt_test3_locals + 1,s
.ifgr __opt_test3_locals 0
tsa
clc
adc #__opt_test3_locals
tas
.endif
rtl


And this
Code:
void opt_test2(unsigned int arg)
{
   int j;

   j *= 7;
   j /= 4;
   j = arg % 16;
}

Compiles to
Code:
opt_test2:
.ifgr __opt_test2_locals 0
tsa
sec
sbc #__opt_test2_locals
tas
.endif
lda -2 + __opt_test2_locals + 1,s
sta.b tcc__r0
asl a
asl a
asl a
sec
sbc.b tcc__r0     ; <-- strength reduction: j*7 converted to (j<<3)-j
sta -2 + __opt_test2_locals + 1,s
sta.b tcc__r0
cmp #$8000
ror tcc__r0
cmp #$8000
ror tcc__r0   ; <-- strength reduction: j/4 converted to j>>2
lda.b tcc__r0
sta -2 + __opt_test2_locals + 1,s
lda 3 + __opt_test2_locals + 1,s
and #15   ; <-- strength reduction: arg%16 converted to arg&15
sta.b tcc__r0
sta -2 + __opt_test2_locals + 1,s
.ifgr __opt_test2_locals 0
tsa
clc
adc #__opt_test2_locals
tas
.endif
rtl
Re: tcc816 math optimization
by on (#100153)
Don't have time to sit around counting cycles / working it all out, but I thought I'd share this. Taken directly from "Programming the 65816 including the 6502, 65C02, and 65802":

16-bit multiplication:

Code:
; operand 1: 16-bits in DP location MCAND1
; operand 2: 16-bits in DP location MCAND2
; result: 16-bits in accmulator
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)

MCAND1 = $80     ; DP location $80
MCAND2 = $82     ; DP location $82

mymult:
  lda #0         ; initialise result

- ldx MCAND1     ; get operand 1
  beq done       ; if operand 1 is zero, done
  lsr MCAND1     ; get right bit, operand 1
  bcc +          ; if clear, no addition to previous products
  clc            ; else add operand 2 to partial result
  adc MCAND2

+ asl MCAND2     ; now shift operand 2 left for possible addition next time
  bra -

done:
  rts


16-bit division w/ remainder:

Code:
; 16-bit divide: X / A -> QUOTNT; remainder in X
; QUOTNT is a 16-bit direct page cell
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)
; no special handling for divide by zero (returns $FFFF as quotient)

QUOTNT = $80     ; DP location $80

mydiv:
  stz QUOTNT     ; initialise quotient to 0
  ldy #1         ; iniitalise shift count to 1

- asl a          ; shift divisor; test leftmost bit
  bcs +          ; branch when get leftmost bit
  iny            ; else increment shift count
  cpy #17        ; max count (all zeros in divisor)
  bne -          ; loop if not done

+ ror a          ; push shifted-out bit back

; now divide by subtraction
- pha            ; push divisor
  txa            ; get dividend into accumulator
  sec
  sbc 1,s        ; subtract divisor from dividend
  bcc +          ; branch if can't subtract; dividend still in X
  tax            ; store new dividend; carry=1 for quotient

+ rol QUOTNT     ; shift carry->quotient (1 for divide, 0 for not)
  pla            ; pull divisor
  lsr a          ; shift divisor right for next subtract
  dey            ; decrement count
  bne -          ; branch to repeat unless count is 0
  rts


An alternate solution -- and this is what most of us ended up doing on the 6502, 65c02, and 65816 universally -- is to generate a pre-calculated table of all values and simply do table lookups. It's honestly the fastest, aside from obvious multiple-of-2 cases (which we know aren't always the case in this scenario). 6502 folks do this all the time since 200+ cycles per multiply is considered extreme (bringing it down to around 90 cycles).

BTW, you could consider using $4202/3 for multiplication, but the end result may be slower than the above options since there's an 8-cycle delay between when $4203 is set and when you get your result in $4216/7. The same applies for division using $4204/5/6, with a 16-cycle delay after $4206 is set to when you can get your result from $4216/7. Most games I know of didn't use these registers because of that reason, but it depends ultimately on how much math you plan on doing. :-)
Re: tcc816 math optimization
by on (#100156)
koitsu, this is really fun, you are provided exactly the same code from exactly the same source as in the current libtcc.asm version.

mic_'s examples of hardware divider usage take about 60 cycles, by the way, including necessary delays.
Re: tcc816 math optimization
by on (#100164)
Ha! Imagine that. I guess the only difference is that mine has inline comments, haha. :-)

Using the hardware registers on the SNES/SFC is probably your best bet then, as the only thing that can beat that is a large pre-calculated table. The downside is that during the 8 and 16-cycle wait times, you could actually be doing some other things (potentially), but with a subroutine you're just going to have a bunch of NOPs. Possibly using in-line macros (instead of a subroutine) would be more efficient (not just for saving JSR/RTS or JSL/RTL cycles but for those 8 or 16 cycles where you could do other things)? It means more "wasted" ROM space but the trade off is additional flexibility. I'm not sure if 8/16 cycles really matters in your project or not.
Re: tcc816 math optimization
by on (#100215)
This should be 2 cycles faster per iteration than the current version:

Code:
tcc__udiv:
stz.b tcc__r9
ldy #1
- asl a
bcs +
iny
cpy #17
bne -
+ ror a
- sta.b tcc__r5
txa             
sec             
sbc.b tcc__r5 
bcc +           
tax             
+ rol.b tcc__r9
lda tcc__r5     
lsr a           
dey           
bne -           
rtl


Or this, which should be one additional cycle faster in some cases (when "bcc +" is taken at least half of the time):

Code:
tcc__udiv:
stz.b tcc__r9
ldy #1
- asl a
bcs +
iny
cpy #17
bne -
+ ror a
- sta.b tcc__r5
cpx tcc__r5     
bcc +           
txa             
sbc.b tcc__r5
tax             
+ rol.b tcc__r9
lda.b tcc__r5 
lsr a           
dey             
bne -           
rtl
Re: tcc816 math optimization
by on (#100219)
Thanks. What I done now, using code suggestions from this thread, is:

Code:
tcc__mul:
   lda #0
   .repeat 4
   .repeat 4
   ldx.b tcc__r9
   beq ++
   lsr.b tcc__r9
   bcc +
   clc
   adc.b tcc__r10
+   asl.b tcc__r10
   .endr
++
   .endr
    rtl


tcc__udiv:
   stz.b tcc__r9
   ldy #1
   .repeat 16
    asl a
   bcs tcc__udiv1
   iny
   .endr
tcc__udiv1:
    ror a
-    sta.b tcc__r5
   cpx tcc__r5     
   bcc +           
   txa             
   sbc.b tcc__r5
   tax             
+    rol.b tcc__r9
   lda.b tcc__r5
   lsr a           
   dey             
   bne -           
   rtl