I don't have time to sit around counting cycles or working it all out, but I thought I'd share this. Taken directly from "Programming the 65816: Including the 6502, 65C02, and 65802":
16-bit multiplication:
Code:
; operand 1: 16 bits in DP location MCAND1 (destroyed by the routine)
; operand 2: 16 bits in DP location MCAND2 (destroyed by the routine)
; result: low 16 bits of the product, in the accumulator
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)
MCAND1 = $80 ; DP location $80
MCAND2 = $82 ; DP location $82
mymult:
lda #0 ; initialise result
- ldx MCAND1 ; get operand 1
beq done ; if operand 1 is zero, done
lsr MCAND1 ; get right bit, operand 1
bcc + ; if clear, no addition to previous products
clc ; else add operand 2 to partial result
adc MCAND2
+ asl MCAND2 ; now shift operand 2 left for possible addition next time
bra -
done:
rts
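To sanity-check the multiply routine off the console, here is a rough Python model of the same shift-and-add loop. This is only a sketch: the function name `mymult_model` and the `MASK16` constant are mine, not from the book, and it models the arithmetic only, not cycle counts or flags.

```python
MASK16 = 0xFFFF  # everything wraps at 16 bits, as with 16-bit A and memory

def mymult_model(mcand1, mcand2):
    """Model of the shift-and-add loop; returns the low 16 bits of the product."""
    acc = 0                                  # lda #0
    while mcand1 != 0:                       # ldx MCAND1 / beq done
        carry = mcand1 & 1                   # lsr MCAND1 moves bit 0 into carry
        mcand1 >>= 1
        if carry:                            # bcc + skips the add when carry is clear
            acc = (acc + mcand2) & MASK16    # clc / adc MCAND2
        mcand2 = (mcand2 << 1) & MASK16      # asl MCAND2
    return acc

if __name__ == "__main__":
    print(hex(mymult_model(300, 200)))
```

Note that the loop ends as soon as operand 1 runs out of set bits, so putting the smaller value in MCAND1 means fewer iterations.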
16-bit division w/ remainder:
Code:
; 16-bit divide: X / A -> QUOTNT; remainder in X
; QUOTNT is a 16-bit direct page cell
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)
; no special handling for divide by zero (the quotient returned is not meaningful)
QUOTNT = $80 ; DP location $80
mydiv:
stz QUOTNT ; initialise quotient to 0
ldy #1 ; initialise shift count to 1
- asl a ; shift divisor; test leftmost bit
bcs + ; branch when get leftmost bit
iny ; else increment shift count
cpy #17 ; max count (all zeros in divisor)
bne - ; loop if not done
+ ror a ; push shifted-out bit back
; now divide by subtraction
- pha ; push divisor
txa ; get dividend into accumulator
sec
sbc 1,s ; subtract divisor from dividend
bcc + ; branch if can't subtract; dividend still in X
tax ; store new dividend; carry=1 for quotient
+ rol QUOTNT ; shift carry->quotient (1 for divide, 0 for not)
pla ; pull divisor
lsr a ; shift divisor right for next subtract
dey ; decrement count
bne - ; branch to repeat unless count is 0
rts
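The divide routine is easier to follow with a model you can step through. Below is a rough Python sketch of it: first the divisor is shifted left until its top set bit falls out (counting shifts in Y), then it is shifted back right one place per step while trial-subtracting from the dividend. The name `mydiv_model` is mine, and the carry handling after `cpy #17` is my reading of the instruction set, so treat the zero-divisor behaviour as an assumption to confirm in an emulator.

```python
def mydiv_model(x, a):
    """Model of mydiv: x = dividend (X), a = divisor (A). Returns (quotient, remainder)."""
    quotnt = 0                               # stz QUOTNT
    y = 1                                    # ldy #1
    while True:
        carry = (a >> 15) & 1                # asl a: top bit goes to carry
        a = (a << 1) & 0xFFFF
        if carry:                            # bcs +
            break
        y += 1                               # iny
        if y == 17:                          # cpy #17: equal, so carry is set and bne falls through
            carry = 1
            break
    a = (carry << 15) | (a >> 1)             # ror a: put the shifted-out bit back
    while y != 0:
        diff = x - a                         # txa / sec / sbc 1,s
        if diff >= 0:                        # no borrow: keep the new dividend, carry = 1
            x = diff
            carry = 1
        else:                                # borrow: dividend stays in X, carry = 0
            carry = 0
        quotnt = ((quotnt << 1) | carry) & 0xFFFF   # rol QUOTNT
        a >>= 1                              # pla / lsr a
        y -= 1                               # dey / bne -
    return quotnt, x

if __name__ == "__main__":
    print(mydiv_model(1000, 7))
```

In this model a zero divisor runs all 17 steps and leaves whatever bit pattern falls out of the loop in the quotient, so callers should check for zero before calling.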
An alternate solution, and this is what most of us ended up doing on the 6502, 65c02, and 65816 alike, is to generate a pre-calculated table of values and simply do table lookups. It's honestly the fastest approach, aside from the obvious power-of-2 cases, which can be done with plain shifts (and which we know aren't always the case in this scenario).
6502 folks do this all the time, since 200+ cycles per multiply is considered extreme; table lookups bring it down to around 90 cycles.
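For what it's worth, one common form of the table approach is the "quarter-square" trick, which needs only a small table instead of one entry per operand pair. I'm not claiming this is the exact table layout meant above; it's just a sketch in Python of the idea, with names (`QUARTER_SQUARES`, `table_mult8`) that are mine. It relies on the identity a*b = floor((a+b)^2 / 4) - floor((a-b)^2 / 4).

```python
# f(n) = floor(n*n / 4) for n in 0..510, which covers every a + b for 8-bit a and b
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]

def table_mult8(a, b):
    """8-bit x 8-bit unsigned multiply using two table lookups and one subtraction."""
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]

if __name__ == "__main__":
    print(table_mult8(200, 123))
```

On real hardware the table would live in ROM and the two lookups would be indexed loads, so the cost is mostly the index arithmetic rather than a bit-by-bit loop.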
BTW, you could consider using the hardware registers at $4202/3 for multiplication, but they only do an 8-bit x 8-bit unsigned multiply, and the end result may be slower than the above options since there's an 8-cycle delay between writing $4203 and being able to read your result from $4216/7. The same applies to division using $4204/5/6 (16-bit dividend, 8-bit divisor): there's a 16-cycle delay after writing $4206 before you can read the quotient from $4214/5 and the remainder from $4216/7. Most games I know of didn't use these registers for that reason, but it ultimately depends on how much math you plan on doing. :-)
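To give a feel for why the registers don't automatically win for 16-bit math: the hardware multiplier takes 8-bit operands, so a 16-bit result has to be stitched together from several 8-bit multiplies, each with its own wait. Here's a rough Python sketch of that stitching. `hw_mult8` is a hypothetical stand-in for "write $4202, write $4203, wait, read $4216/7", and nothing here models the actual cycle counts.

```python
def hw_mult8(a, b):
    """Stand-in for the hardware multiplier: 8-bit x 8-bit unsigned, 16-bit result."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b

def mult16_via_hw(m1, m2):
    """Low 16 bits of m1 * m2, built from three 8-bit hardware multiplies."""
    lo1, hi1 = m1 & 0xFF, m1 >> 8
    lo2, hi2 = m2 & 0xFF, m2 >> 8
    result = hw_mult8(lo1, lo2)                    # low x low: full 16 bits count
    result += (hw_mult8(lo1, hi2) & 0xFF) << 8     # cross terms: only the low byte
    result += (hw_mult8(hi1, lo2) & 0xFF) << 8     # lands inside the low 16 bits
    return result & 0xFFFF                         # hi x hi falls entirely above bit 15

if __name__ == "__main__":
    print(hex(mult16_via_hw(300, 200)))
```

So a 16-bit multiply this way costs three register round trips plus the adds, which is where the "may be slower" comes from; for a single 8-bit multiply the registers are a much easier sell.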