Found a better opcode list (thank you, people who actually document how opcodes affect CPU flags, sheesh). So:
* The mov ah,al now makes sense because you essentially just want to fill memory with a value (16 bits at a time). Got it.
* The jnc done is indeed to deal with an even/odd number of bytes, and in the case of an odd number, to store the last byte (since stosw stores 16 bits at a time). The carry being set or clear comes from the preceding shr cx, 1. So okay, got it. Easy enough to do in 65816, but again the length (number in CX) actually matters because of the whole mvn/mvp vs. manual loop situation.
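A rough Python model of what those pieces do together on the x86 side. This is only a sketch: the function name and the bytearray standing in for memory at ES:DI are made up for illustration, and the single-byte store after the jnc (presumably a stosb) is an assumption, since all that is stated here is that the last byte gets stored.

```python
def fill_like_rep_stosw(memory, di, cx, al):
    """Model of the x86 fill: memory is a bytearray, di the start offset,
    cx the length in bytes, al the fill byte. Returns the final di."""
    ax = (al << 8) | al          # mov ah,al: same byte in both halves of AX
    carry = cx & 1               # shr cx,1 shifts the low bit into carry...
    cx >>= 1                     # ...and leaves the word count in CX
    for _ in range(cx):          # rep stosw: store AX, advance DI by 2
        memory[di] = ax & 0xFF
        memory[di + 1] = ax >> 8
        di += 2
    if carry:                    # jnc done falls through when carry is set
        memory[di] = al          # assumed single-byte store for the odd byte
        di += 1
    return di


memory = bytearray(520)
end = fill_like_rep_stosw(memory, 0, 513, 0xFF)
print(end, memory[:513].count(0xFF), memory[513])  # prints 513 513 0
```

For 513 bytes that is 256 word stores plus one trailing byte, which is the same split the 65816 version has to make.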
For shits and giggles let's just say we're trying to fill 513 bytes of data (an odd number) with the value $ffff, but only a byte is specified ($ff). Cycle counts per opcode are in brackets ([]).
Code:
        sep #$30                ; [3] Set A/X/Y to 8-bit size
        lda #$ff                ; [2] $ff = upper and lower half of the 16-bit value we want to fill
        tay                     ; [2] Copy A ($ff) into Y
        xba                     ; [3] Swap upper and lower bytes of A (yes you can do this when A size is 8-bit!)
        tya                     ; [2] Copy Y ($ff) into lower byte of A; 16-bit A now contains $ffff
        rep #$30                ; [3] Set A/X/Y to 16-bit size
        tay                     ; [2] Y=$ffff
        lda #513                ; [3] 513 bytes to transfer (odd number)
        lsr                     ; [2] Divide by 2, A will now contain 256 ($100) with carry set to 1
        bcs OddNum              ; [!] If carry set, then lsr had a leftover (i.e. odd number)
        tax                     ; [2] X = number of words (16-bits) to write
        tya                     ; [2] A=$ffff
-       sta >ESDILocation,x     ; [6] Write $ffff to ESDILocation+X offset (full 24-bit address) -- LOOP
        dex                     ; [2] ...repeat -- LOOP
        bne -                   ; [!] ....until X=0 -- LOOP
        sep #$30                ; [3] Set A/X/Y to 8-bit
        bra Done                ; [3] GTFO
OddNum:
        tax                     ; [2] X = number of words (16-bits) to write
        tya                     ; [2] A=$ffff
-       sta >ESDILocation,x     ; [6] Write $ffff to ESDILocation+X offset (full 24-bit address) -- LOOP
        dex                     ; [2] ...repeat -- LOOP
        bne -                   ; [!] ....until X=0 -- LOOP
        sep #$30                ; [3] Set A/X/Y to 8-bit -- LOOP
        sta >ESDILocation       ; [6] Write that odd byte ($ff) to ESDILocation (same as ESDILocation,x in this case) -- LOOP
Done:
The cycle count labelled [!] is 2 cycles if the branch IS NOT taken, otherwise it is 3 cycles. In most cases in the above code, it will be taken.
Is this optimised? No. More on that in a moment.
Is it ugly and long? Yes, because most of the work stems from the "setup" -- first the need to take an 8-bit accumulator value and copy the value into the high byte of the accumulator, and then dealing with the number of bytes and so on. If I had made this in a "general subroutine" where you'd simply push the byte you wanted filled onto the stack, followed by another push of the length (in bytes), the routine would be a lot shorter and maybe use fewer cycles. Not sure.
The reason the code looks doubled (more or less) is because I did not care to elegantly handle the situation where the number of bytes being transferred was odd vs. even (i.e. storing something in a temporary variable to indicate such, etc.). My focus was not on code length but a bit more on cycles. It's totally possible to make this routine shorter and handle the odd vs. even thing in a more sane manner.
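One thing worth sanity-checking in a loop like this is the index step. X in sta >ESDILocation,x is a byte offset, and a 16-bit sta writes offsets X and X+1, so a word count in X with a single dex per store makes consecutive stores overlap. Here's a rough Python model (not 65816 code; touched_offsets and its parameters are made up for illustration) of which offsets get written with the index stepping by 1 as in the listing, versus stepping by 2 with the decrement done before the store:

```python
def touched_offsets(start_x, step, decrement_first):
    """Return the set of byte offsets written by a counted-down store loop.

    Each store is modelled as a 16-bit write covering offsets X and X+1.
    The loop ends when X reaches 0 after a decrement, like dex / bne.
    """
    offsets = set()
    x = start_x
    while True:
        if decrement_first:
            x -= step
        offsets.update((x, x + 1))   # 16-bit store at offset X
        if not decrement_first:
            x -= step
        if x == 0:                   # bne falls through once X hits 0
            break
    return offsets


as_posted = touched_offsets(256, 1, decrement_first=False)
byte_stepped = touched_offsets(512, 2, decrement_first=True)
print(min(as_posted), max(as_posted), len(as_posted))           # 1 257 257
print(min(byte_stepped), max(byte_stepped), len(byte_stepped))  # 0 511 512
```

Stepping by bytes means a second dex per iteration, i.e. 2 more cycles per iteration on top of the counts annotated in the listing, and the odd trailing byte would then belong at offset 512 rather than offset 0.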
ANYWAY... ;-) The most important part here is the cycle counts within the loops (I've labelled them with -- LOOP, including the odd-byte count where the sep #$30 is actually needed (to write an 8-bit value vs. 16-bit) and the actual write itself).
So let's do the math of the loops, given the number of iterations within the loops that we know:
513 bytes = $100 (256) loop iterations + 9 extra cycles (sep + sta at the end)
Loop itself: 6 cycles for the sta + 2 cycles for the dex + 3 cycles for the bne (except the last iteration, where the bne is not taken and costs 2 cycles)
So: 255 * (6 + 2 + 3) = 2805 cycles
2805 cycles + 10 cycles for the last iteration (6 + 2 + 2) = 2815 cycles
2815 cycles + the 9 extra cycles (sep + sta at the end) = 2824 cycles
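The same tally as a small Python sketch, using the per-opcode counts annotated in the listing (the constant and function names are made up for illustration, not taken from any tool). It charges every one of the 256 iterations its sta and dex, including the last one where the bne falls through, so for 513 bytes it comes out at 2824:

```python
STA_LONG_X_16BIT = 6   # sta >ESDILocation,x as annotated in the listing
DEX = 2
BNE_TAKEN = 3
BNE_NOT_TAKEN = 2
SEP = 3
STA_ODD_BYTE = 6       # final sta >ESDILocation as annotated in the listing


def fill_loop_cycles(byte_count):
    """Cycles spent in the store loop plus the odd-byte tail.

    Setup code before the loop is not counted. Assumes at least one word.
    """
    words, odd = divmod(byte_count, 2)
    cycles = words * (STA_LONG_X_16BIT + DEX)   # every iteration stores and decrements
    cycles += (words - 1) * BNE_TAKEN           # all but the last branch are taken
    cycles += BNE_NOT_TAKEN                     # last bne falls through
    if odd:
        cycles += SEP + STA_ODD_BYTE            # switch to 8-bit, write the odd byte
    return cycles


print(fill_loop_cycles(513))  # 2824
```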
To repeat: is it possible to optimise this routine? Absolutely. Getting rid of the sta >ESDILocation,x (24-bit addressing, STA Absolute Long Indexed X) and turning it into sta ESDILocation,x (16-bit addressing, STA Absolute Indexed X) is the best choice, since you save 1 cycle per every sta in that case. The catch is that you have to know in advance what the bank of the 24-bit address ESDILocation is, and that's doable in lots of ways (again: how a programmer chooses to design things).
The reason I chose to use a 24-bit addressing STA is because [ES:DI] -- it's been a while, so if I get this wrong, I apologise -- allows you to write to any segment (ES) starting at offset DI, up to 65536 bytes. I wanted to make this "easy", so I just went with a linear 24-bit addressing store. More realistically though, as I said, setting B (bank) to the destination followed by a 16-bit store would save a cycle per every loop iteration, and is "more akin" to the segmented memory model (hard to explain what I mean by this, sorry, it's late and I'm tired (see below)) in this way, along with the "up to 65536 byte" limitation too.
Another optimisation possibility is to do self-modifying code. I'm not sure how much time this would save so this is speculative. I've written such code, but I really don't enjoy it and try to stay away from it on the 65816. On 6502, with only 64K of addressing space (and often even less RAM and ROM) it's more common.
There is also the possibility of using a stack-based copying method, which may save other cycles. I will admit it has been a long time since I've used pea/pei to do this, so my familiarity there is a bit rusty, but I'm sure I could figure it out. Note: I used this methodology in the aforementioned link (re: demo/utility group) to do some text scrolling. There's a point where the stack-based method is faster than the manual loop.
So I imagine it is very possible to get the cycle count down to the low-2000s for the same number of bytes (words) transferred.
Now let's bring mvn / mvp into the picture (rather than a manual loop), but speaking on a general level because I took diphenhydramine and melatonin earlier and I'm really getting dozy (plus angry that I've spent time doing this rather than reading my awesome scifi book in comfy bed, but hey that's my own fault, not anyone else's):
The mvn / mvp opcodes take a whopping 7 cycles PER BYTE moved. Not per word (16-bit), but per byte. The CPU does the move itself (entire CPU blocks/waits until the opcode is done, obviously). Whether it operates internally on a per-byte basis or per-word, I do not know, but I bet you it's per-byte given the nature of how it works. Yeah, it's expensive.
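To put a number on that, a back-of-the-envelope comparison in Python. The 7 cycles per byte is the figure above; the 11 cycles per word is the sta + dex + taken bne cost of the manual loop from earlier in this post. The function names are made up for illustration, and setup cost on either side is ignored.

```python
MVN_MVP_CYCLES_PER_BYTE = 7        # 7 cycles per byte moved
LOOP_CYCLES_PER_WORD = 6 + 2 + 3   # sta + dex + taken bne, per 16-bit word


def block_move_cycles(byte_count):
    """Cost of moving byte_count bytes with mvn / mvp."""
    return MVN_MVP_CYCLES_PER_BYTE * byte_count


def manual_loop_cycles_approx(byte_count):
    """Ballpark only: ignores setup, the odd tail byte and the final untaken bne."""
    return LOOP_CYCLES_PER_WORD * (byte_count // 2)


print(block_move_cycles(513))          # 3591
print(manual_loop_cycles_approx(513))  # 2816
print(LOOP_CYCLES_PER_WORD / 2)        # 5.5 cycles per byte for the manual loop
```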
I hope someone else here (probably byuu) comes along and says "holy shit dude, that routine is crap, you are REALLY out of practise" and optimises the hell out of it. That's how the guys at MIT did things on the old PDPs -- they would literally revamp or optimise each other's code to shave off cycles or bytes here and there. Hell, Bill Gates did this too.
Now you see why a little inexpensive DMA chip alongside the 65816 for memory transfers really, really helps.
P.S. -- Despite my sort of annoyed/irritated tone (just how I am, nothing personal), this is probably the most thorough 65816 work I've done since working on the SNES/SFC many years ago and my days in the IIGS scene.
BIG nostalgia for me here. Really brought back memories -- especially since my IIGS is sitting in a storage bin behind me. I also had to bust out ActiveGS (didn't have KEGS laying around) to run the IIGS mini-assembler (call -151 ; !) to write some code to test some of my theories about certain opcodes behaving certain ways in 8-bit vs. 16-bit mode. So thank you, miker00lz, for the trip back in time. I do appreciate it.
P.P.S. -- I edited the initial code at the start to remove use of a temporary DP variable because I realised I could safely use other registers for it.