The Difficulty of ARM Assembly

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
The Difficulty of ARM Assembly
by on (#233531)
I recently started programming on an STM32 microcontroller (ARM Cortex-M0 processor) for college and was naïve enough to try programming in assembly. There are few ways to work with immediate numbers, and the limitations seem to change heavily from instruction to instruction (sometimes it's an 8-bit value that can be shifted, sometimes it's a regular 16-bit number, and sometimes it's even a 12-bit number). There's never any absolute addressing either, due to the 32-bit instruction size, and the limitations of relative addressing appear just as random as those of immediate values. It's confusing enough for a human (or me at least) that I wouldn't be surprised if a compiler generated substantially faster code.
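To make the "8-bit value that can be shifted" case concrete, here is a minimal Python sketch of the classic 32-bit ARM (A32) data-processing immediate, where a constant is encodable only if it equals an 8-bit value rotated right by an even amount. This models the A32 encoding only; the 16-bit Thumb instructions on a Cortex-M0 are more restricted still. The function names are made up for illustration.

```python
def rotate_left32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


def a32_immediate_fields(value):
    """Return (imm8, rot) if value == imm8 rotated right by 2*rot, else None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        # Rotating left by 2*rot undoes a right-rotation by 2*rot.
        imm8 = rotate_left32(value, 2 * rot)
        if imm8 <= 0xFF:
            return imm8, rot
    return None


# 0x102 fails because it would need an odd rotation; 0x12345678 has too many bits set.
for constant in (0xFF, 0xFF000000, 0x104, 0x102, 0x12345678):
    print(hex(constant), a32_immediate_fields(constant))
```

Anything that comes back as None has to be built up over several instructions or loaded from memory.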
Re: The Difficulty of ARM Assembly
by on (#233534)
ARM assemblers usually deliberately allocate a region in the CODE segment near any given routine called a "constant pool" for exactly this reason.
Re: The Difficulty of ARM Assembly
by on (#233554)
Yeah, ARM and RISC in general are designed for compilers. The idea is that a smaller instruction set means you can't take any complex paths, which reduces the search space for a compiler, so the compiler makes code about as good as hand-written code. That being said, when I was doing M0 the code gcc was making was horrendous.

Basically the instructions are a fixed bit length, so the bits left for a parameter are whatever remains after encoding the instruction itself. As the ARM instruction set has been moving more and more toward CISC, they have started to need instruction packing.

Typically when you hand-write asm for some RISC you use a "higher level asm", although I don't know of one for ARM. On MIPS you have MAL, which is a helper for TAL.
Re: The Difficulty of ARM Assembly
by on (#233574)
ARMv6-M (the core of Cortex-M0) is a bastardization; it's a stripped down, not very orthogonal copy of ARMv7-M (Cortex-M3). For instance, on the ARMv6-M, many of the instructions can only act on r0-r7, and the destination is forced to be the same as one of the operands; very few instructions can use all registers. And indeed, you can only move an 8-bit literal, or load a 32-bit constant from a constant pool. If you want something more pleasant to work with, consider a Cortex-M3: far fewer restrictions, more fun to work with.
Re: The Difficulty of ARM Assembly
by on (#233575)
But how does ARMv6-M compare to ARMv4 Thumb, as used in Game Boy Advance? Wikipedia says it supports "most" Thumb instructions and "some" Thumb-2 instructions. Does Thumb cause more of a problem than it did on GBA?
Re: The Difficulty of ARM Assembly
by on (#233576)
This is probably a stupid question, but for what reason would a processor be forced to use a certain instruction width? (Well, THUMB is either 16 or 32 bit, but cannot be larger than that). ...I was just about to say that I'm confused as to why there would be a 16 bit instruction that contains the address of a 32 bit number that still needs to be loaded, but I suppose with a 16 bit data bus, the waste of having that address included in the instruction doesn't really matter...

And I was running into problems before I wrote "processor cpu32_6m". :|

And what did Thumb do on the GBA?
Re: The Difficulty of ARM Assembly
by on (#233585)
Drew Sebastino wrote:
This is probably a stupid question, but for what reason would a processor be forced to use a certain instruction width?
Theoretically, it makes things a lot simpler, because you don't need special lookup tables or handling to deal with variable length instructions.

In practice ... it turns out that a fixed length of 32 bit instructions is actually kinda lousy. Cache pressure and memory bandwidth are often the biggest hindrances to any modern CPU, and the simplest way to address that is to make your instructions shorter. (edit: and having to refer to a constant pool that's not in the literal flow of instructions means you need to special case cache prefetch anyway, so there's less benefit) And the cost of evaluating where the program counter needs to be isn't large.

Hence SuperH, THUMB, and MIPS16le. (btw, THUMB's always 16-bit)

Quote:
And what did Thumb do on the GBA?
Basically exactly what I said above: more instructions can fit into the GBA's internal 32KB of 32-bit RAM and 256KB of 16-bit RAM, take less time to execute from the 256KB internal RAM, and take less time to fetch or execute from the cart.
Re: The Difficulty of ARM Assembly
by on (#233605)
basically in the old days, RAM was small and slow. so things like the Z80 and the 68K would use large amounts of die, to have fancy FSMs and variable length instructions to get the most out of small RAM and they had clocks to spend where they don't touch the bus to work out what to do next etc.

RAM got cheaper and faster, and did so faster than transistors got smaller. So RISC ditched the fancy FSMs and microcode to just hit RAM hard and fast, as this gave more data through the CPU. Since it knows it's going to get data each clock, the pipeline was simplified and more of the chip could be used for instructions.

Then cooling got better, and process shrunk faster than external bus speed could be sped up. Then cache became the way to solve the slow 'FSB', which brings us back to RAM being precious and packing more in being a lot better, as the CPUs can hit cache at 100MHz but RAM at 33MHz. Thus CISC started to pull back, and now even ARM is CISC, ditching the RISC purity in favor of power.

The really dumb aspect is making a 32bit cpu with a 32bit instruction size; if you have a 32bit bus and a 16bit cpu it would make a lot more sense. This is what Thumb is, it drops you to a 16bit CPU but still has a 32bit bus. Thus it can get instruction + data every clock, which really boosts your speed. Just you can't go over 64K anymore. Personally I think going to a 24bit CPU would be the sweet spot. Honestly, when I'm coding I very rarely need more than 65,536 values, normally I'm doing <1000, but I can see for things like spreadsheets etc 65,000 is not enough. However 16,777,216 is probably plenty for 98% of the time. The issue then becomes that it limits you to a measly 16MB. Doubling it up to get 48bit pointers, however, gets you to 281,474,976,710,656 bytes, or 256TB, which I think is overkill.
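As a quick check on the pointer-width arithmetic, here is a small Python sketch that prints how many bytes each pointer width can address, assuming byte addressing. The helper names are just for illustration.

```python
def addressable_bytes(pointer_bits):
    """Number of distinct byte addresses for a pointer of the given width."""
    return 1 << pointer_bits


def human_readable(byte_count):
    """Format a power-of-two byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    index = 0
    while byte_count >= 1024 and index < len(units) - 1:
        byte_count //= 1024
        index += 1
    return f"{byte_count} {units[index]}"


for bits in (16, 24, 32, 48):
    total = addressable_bytes(bits)
    print(f"{bits}-bit pointers: {total:,} bytes = {human_readable(total)}")
```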
Re: The Difficulty of ARM Assembly
by on (#233616)
I like coding ARM in ASM. The instructions, addressing modes, and register set are much more powerful than 6502 or the like. Needing the literal pool for 32bit immediates might be a bit unfamiliar at first. But if you get familiar with it then you have 32bit maths, and memory accesses with auto-increasing addresses, and ALU opcodes that could do obscure things like "IF equal THEN r0=r2 xor (r3*8)" in a single opcode & single clock cycle, and there are enough registers to store operands & pointers & loop counters in registers instead of RAM.

At least ARM can do that. THUMB should be able to do most of that, too, but it might come up with some confusing restrictions & its syntax has confusing rules about whether/which opcodes update flags. I don't know if THUMB-2 has fixed some of those restrictions and syntax issues.

Using compiler code: What I have seen in commercial games on GBA and NDS consoles isn't optimized at all. You would need to be really confused to create anything equivalent in ASM.

Oziphantom wrote:
This is what Thumb is, it drops you to a 16bit CPU but still has a 32bit bus. Thus it can get instruction + data every clock

Uh, that is vice versa and still not quite right.
The CPU is 32bit no matter if using THUMB or ARM (it can do 32bit maths and has 32bit address space).

THUMB 16bit opcodes can be faster than 32bit opcodes if your memory is "uncached memory with 16bit databus" (if your memory doesn't have that restriction then THUMB is just smaller, but not actually faster).

If you think that 16bit opcode and 16bit data can be transferred through 32bit databus within a single clock cycle: No, they can't. What you mean might be memory systems with separate data cache and code cache, that might work in a single clock cycle - but that's unrelated to using 32bit ARM opcodes or 16bit THUMB opcodes.
Re: The Difficulty of ARM Assembly
by on (#233621)
the GBA is 16bit bus, no cache right?

I also thought that it made it more practical to do 16bit operations, in that you ignore the upper half and just focus on the lower half of registers. But it has been a long time, and a lot of ARM variants since :D maybe it was 16 registers not 16bits...
Re: The Difficulty of ARM Assembly
by on (#233624)
Game Boy Advance has a 32-bit bus to BIOS, IWRAM, and MMIO, and a 16-bit bus to most other memory (ROM, EWRAM, VRAM, CGRAM, and OAM). IWRAM is also fairly small (32768 bytes) yet with fewer wait states than EWRAM or ROM, so if ARM in IWRAM is too big, Thumb in IWRAM may make sense.
Re: The Difficulty of ARM Assembly
by on (#233892)
If I was programming the GBA in assembly, I'd probably dedicate a register as an index into a table of constants.
Re: The Difficulty of ARM Assembly
by on (#233897)
You don't need an indexed table of constants, you just use the program counter for that.
There's even a pseudo-instruction for that: `ldr r0,=0x12345678`, which transforms to a PC-relative load to a local literal pool.

Now an indexed table of global variables, that's far more useful.
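For anyone curious what that pseudo-instruction turns into, here is a rough Python sketch of the bookkeeping, assuming the 16-bit Thumb `LDR Rt, [PC, #imm8*4]` form, where the base is the instruction address plus 4, rounded down to a multiple of 4. The addresses and function names are hypothetical, and a real assembler does considerably more (range checking across sections, pool splitting, and so on).

```python
def thumb_ldr_literal_imm8(instr_addr, literal_addr):
    """Compute the imm8 field for a Thumb PC-relative literal load."""
    base = (instr_addr + 4) & ~3          # PC reads as instr + 4, word-aligned
    offset = literal_addr - base
    if offset < 0 or offset > 1020 or offset % 4 != 0:
        raise ValueError("literal out of range for a 16-bit Thumb LDR")
    return offset // 4


def build_literal_pool(loads, pool_addr):
    """loads: list of (instr_addr, constant). Returns (pool_words, imm8_by_instr)."""
    pool = []
    imm8_by_instr = {}
    for instr_addr, constant in loads:
        if constant not in pool:          # share one pool slot per distinct constant
            pool.append(constant)
        literal_addr = pool_addr + 4 * pool.index(constant)
        imm8_by_instr[instr_addr] = thumb_ldr_literal_imm8(instr_addr, literal_addr)
    return pool, imm8_by_instr


# Three hypothetical `ldr rX,=const` sites sharing a pool placed at 0x120.
pool, fields = build_literal_pool(
    [(0x100, 0x12345678), (0x106, 0xDEADBEEF), (0x10A, 0x12345678)], 0x120)
print([hex(word) for word in pool], fields)
```

The two loads of 0x12345678 end up pointing at the same pool word, which is why the pool only holds two entries.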
Re: The Difficulty of ARM Assembly
by on (#233898)
... and that's essentially what a Global Offset Table (GOT) is.
Re: The Difficulty of ARM Assembly
by on (#234147)
Dwedit wrote:
You don't need an indexed table of constants, you just use the program counter for that.
There's even a pseudo-instruction for that: `ldr r0,=0x12345678`, which transforms to a PC-relative load to a local literal pool.

Now an indexed table of global variables, that's far more useful.


How does the assembler know where to put the table?
Re: The Difficulty of ARM Assembly
by on (#234148)
Either at the end of a section (like the .text section), or where you put in .pool to manually force a literal pool at that location. Usually a compiler generating ASM functions will stick a literal pool at the end of every function.
Re: The Difficulty of ARM Assembly
by on (#234150)
Just out of curiosity, what ARM assemblers for x86 machines are available? I only found FASMARM, but I figure there has to be more than that.
Re: The Difficulty of ARM Assembly
by on (#234153)
GCC cross compilers for ARM (such as DevKitARM) include the GNU Assembler.
Re: The Difficulty of ARM Assembly
by on (#234400)
I have to say that I've grown to appreciate RISC more in the past year or so. My computer architecture class is teaching 32-bit MIPS and I got a crash course on ARM thumb-0 in one of my classes last year. I like ARM a bit more though in that its instructions seem more similar to 6502, which I guess you could say is my native assembly language. Is there a general consensus on which is superior (MIPS/ARM) in terms of its application for embedded systems?

It makes me interested in thinking about what something like 65xx or Z80 would look like if it had been extended to 32/64 bit, before the whole RISC/CISC debate was really a thing.
Re: The Difficulty of ARM Assembly
by on (#234404)
Sogona wrote:
It makes me interested in thinking about what something like 65xx or Z80 would look like if it had been extended to 32/64 bit, before the whole RISC/CISC debate was really a thing.

Off-topic, but: you're probably unaware of the ill-fated 65832 (obviously CMOS): https://downloads.reactivemicro.com/Ele ... asheet.pdf

Every time I read that preliminary data sheet I get sad. Still limited to 3 registers (1 "main" register + 2 indexing-only), no native mul/div, blah blah. The additional addressing modes are not "super" helpful either. It really wasn't a CPU that the 90s would have benefited from, so it doesn't surprise me the 65816 was where it pretty much ended. 68K and x86 "won". And since then, we've been "stuck" with x86 and PC architecture, the latter of which at this point is *makes cat vomit noises*.
Re: The Difficulty of ARM Assembly
by on (#234406)
Sogona wrote:
Is there a general consensus on which is superior (MIPS/ARM) in terms of its application for embedded systems?
I mean, honestly, most ISAs are equally ok for embedded systems. 8051s clocked at 100MHz were typical a decade ago..

Currently there seem to be more ARM licensees than any other ISA, but that's just a comment about mindshare, not suitability. Personally, I hope we see more RISC-V cores.

koitsu wrote:
with x86 and PC architecture, the latter of which at this point is *makes cat vomit noises*.
So you've played with Sun's and/or SGI's x86-based offerings?
Re: The Difficulty of ARM Assembly
by on (#234408)
lidnariq wrote:
with x86 and PC architecture, the latter of which at this point is *makes cat vomit noises*.
So you've played with Sun's and/or SGI's x86-based offerings?

To clarify: I'm not particularly fond of x86 (I stopped bothering to follow it with the introduction of the 486, and it seems I picked a good time to bow out, for my own sanity. A lot of present-day x86 code I can't even read due to all the extensions and ridiculousness; throw things like LOCK RMW instruction prefixes on top of the mix for SMP and I'm like "yeah, I'm done". Then you got the IDT, the GDT and LDT, MSRs, PAE, SMM/SMI, VT, *plus* all the nuances of x64... Yeah, no thanks). I sadly never had the opportunity to play with PowerPC or Alpha at the time, though to be fair those are both RISC.

But I'm *really* not fond of PC architecture in general at this stage. I would classify it as quite possibly the longest-running hack-kludge tech project that humanity has ever kept going. There was a point in the mid-90s where a single person could just about understand the architecture -- now, we're so far from it that nobody can -- yet nearly all (and I do mean ALL!) the legacy support is still in place for numerous reasons. Can you even remember all the bus types at this point? (I can remember most of them, even wonky crap like MGA.) We've learned a lot over the past ~40 years, but from all the corporate committees driving "standards" that are an abomination (APICs, ACPI, TPM, and UEFI all come to mind. I'd almost include USB but that's really not PC-specific) I don't think anyone in PC-land will ever stop and say "we should really just start over" because that'd be a huge undertaking and would be killing the cash cow.

For example, if you've ever used a Sparc, there's a lot of things there that architecture-wise felt like no-brainers and just worked -- concepts like OpenBoot and having a native serial console from the get go.

For this reason, I think overall, ARM architecture (obviously also RISC) is at least a breath of fresh air. Admittedly I don't follow it to the level I should, but that may change for me given some things I can't really talk about (professional reasons).
Re: The Difficulty of ARM Assembly
by on (#234410)
What is superior is determined like you'd think, money. How much per cpu, per board, per coder, are there ready libraries so you need less coding...

ARM is popular, but it used to be that MIPS was cheaper for a certain cpu level. Then your use case may need a specific accelerator/DSP/whatever, which limits choices. I've recently done both MIPS and POWER things, PPC is nice too, even if it's not used much in embedded anymore.
Re: The Difficulty of ARM Assembly
by on (#234457)
I looked up ARM Cortex-M0 instruction set and I was disappointed that indexed addressing only has a 5-bit offset, but then I realized that you get "6-bit" range when using 16-bit loads/stores, and "7-bit" range with 32-bit loads/stores, which is enough for object memory slot accessing.
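A small Python sketch of that scaling, assuming the 16-bit Thumb load/store form with a 5-bit immediate that is multiplied by the access size in bytes. The function names are mine, not anything from the ARM documentation.

```python
IMM5_MAX = 31  # largest value that fits in the 5-bit offset field


def max_offset(access_size):
    """Largest byte offset reachable for a 1-, 2- or 4-byte access."""
    return IMM5_MAX * access_size


def is_encodable(offset, access_size):
    """True if `offset` is a multiple of the access size and fits in imm5."""
    return offset >= 0 and offset % access_size == 0 and offset // access_size <= IMM5_MAX


for name, size in (("byte", 1), ("halfword", 2), ("word", 4)):
    print(f"{name}: offsets 0..{max_offset(size)} in steps of {size}")
```

So a struct of up to 32 words (128 bytes) can be reached from one base register with word accesses, which matches the "7-bit" observation.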
Re: The Difficulty of ARM Assembly
by on (#234464)
koitsu wrote:
But I'm *really* not fond of PC architecture in general at this stage. I would classify it as quite possibly the longest-running hack-kludge tech project that humanity has ever kept going. There was a point in the mid-90s where a single person could just about understand the architecture -- now, we're so far from it that nobody can -- yet nearly all (and I do mean ALL!) the legacy support is still in place for numerous reasons.
This is not true; lots of instruction sets have been dropped over the years. They are up to the point of dropping some SSE instructions, so I would think that 8086-MMX would all be dropped as well. The "drivers" offer support for such instructions in that they emulate them on modern CPUs. There was an interesting talk by an AMD engineer at one of the security conferences about CPU errata and patching, where he mentioned "we are finally dropping XXX set of instructions"; sadly I've not been able to find it in a search. I think it was also the same talk where Apple bigged up their Hunt For Red October levels of security for iCloud.

Also if you want to look at the very current PC landscape, ARM PCs are making another stab. You can get an HP Envy X2 with either a Snapdragon ARM processor or an Intel processor. The rumors keep circling for ARM Macs.

However RISC is basically dead, although RISC-V is making a comeback, and The Raspberry Pi foundation just became a silver tier member, so RISC-V 'pi' might be in the works which would be nice. I think RISC-V has a shot, as people love free and the pure open source nature of it has the Linux diehards in a froth.

For a 32bit 6502, the Mega65's version of the 4510 has 24 and 32bit extension, however it is still a "6502", not sure if it also adds the Z register.

PPC is rubbish and it's shocking how it is still being used..

MIPS was good but it is now very old and ARM has kept marching towards CISC and has SIMD cores, so bang for buck I imagine ARM will slaughter MIPS. I'm not sure that a PSP will really outdo a DS in terms of CPU power; yes the PSP is more powerful but I don't think it's 3.5x as powerful for its 3.5x clock rate.
Re: The Difficulty of ARM Assembly
by on (#234466)
I looked at RISC-V and what's up with the weird scrambled immediate encoding?
Re: The Difficulty of ARM Assembly
by on (#234470)
To get more speed, you need more instructions per fetch, so it has a pile of options to encode things smaller and add extra instructions on the tail.. it's still RISC of course ;) There is some info on how it works conceptually and compared to ARM and am64 here https://www.youtube.com/watch?v=Ii_pEXKKYUg
Re: The Difficulty of ARM Assembly
by on (#234517)
psycopathicteen wrote:
I looked at RISC-V and what's up with the weird scrambled immediate encoding?

Immediate operand values on RISC-V are scrambled to reduce multiplexers (muxes) inside the CPU, so that a particular operand (destination or first source) can always appear in the same bit position within the opcode. Each mux adds die area and gate delay. Die area increases power consumption and reduces area that can be used for data cache. Gate delay reduces maximum clock rate.
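Here is a short Python sketch of what that looks like for the base RV32I I-type and S-type formats. The S-type immediate is split into two pieces so that rs1 and rs2 stay in the same bit positions as in other formats, and the sign bit is always instruction bit 31. The helper names are made up; the field positions follow the base ISA layout.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign


def decode_i_imm(inst):
    """I-type: imm[11:0] sits in inst[31:20]."""
    return sign_extend(inst >> 20, 12)


def decode_s_imm(inst):
    """S-type: imm[11:5] in inst[31:25], imm[4:0] in inst[11:7]."""
    return sign_extend(((inst >> 25) << 5) | ((inst >> 7) & 0x1F), 12)


def encode_s(imm, rs2, rs1, funct3, opcode):
    """Pack an S-type instruction; imm is split around the register fields."""
    imm &= 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | ((imm & 0x1F) << 7) | opcode)


def rs1_field(inst):
    """rs1 is at inst[19:15] in both I-type and S-type."""
    return (inst >> 15) & 0x1F


sw = encode_s(8, rs2=2, rs1=1, funct3=0b010, opcode=0x23)   # sw x2, 8(x1)
addi = 0xFFF10093                                          # addi x1, x2, -1
print(hex(sw), decode_s_imm(sw), rs1_field(sw))
print(hex(addi), decode_i_imm(addi), rs1_field(addi))
```

The decoder can pull rs1 from the same wires for both instructions without first knowing which format it is looking at.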
Re: The Difficulty of ARM Assembly
by on (#234530)
I found this page with an explanation. What's also interesting is that bit 31 is always the sign bit.

https://stackoverflow.com/questions/394 ... g-variants
Re: The Difficulty of ARM Assembly
by on (#234538)
koitsu wrote:
Sogona wrote:
It makes me interested in thinking about what something like 65xx or Z80 would look like if it had been extended to 32/64 bit, before the whole RISC/CISC debate was really a thing.

Off-topic, but: you're probably unaware of the ill-fated 65832 (obviously CMOS): https://downloads.reactivemicro.com/Ele ... asheet.pdf

Every time I read that preliminary data sheet I get sad. Still limited to 3 registers (1 "main" register + 2 indexing-only), no native mul/div, blah blah. The additional addressing modes are not "super" helpful either. It really wasn't a CPU that the 90s would have benefited from, so it doesn't surprise me the 65816 was where it pretty much ended. 68K and x86 "won". And since then, we've been "stuck" with x86 and PC architecture, the latter of which at this point is *makes cat vomit noises*.


Wow, only one extra instruction. They could've used the WDM instruction as an extension to another set of 256 instructions, and did useful stuff like ALU instructions with index registers, and register-register ALU instructions.
Re: The Difficulty of ARM Assembly
by on (#234548)
65xx at 32-bit (or 16-bit for that matter) would work so much better if they just widened the data bus. You'd lose drop-in compatibility, but the alternative - which WDC seems to have chosen - is to let 65xx die chained to the archaic bus interface of the 6502. Imagine if x64 was still using the same pinout from the 8088 because they'd taken "PC compatibility" too literally...
Re: The Difficulty of ARM Assembly
by on (#234593)
93143 wrote:
65xx at 32-bit (or 16-bit for that matter) would work so much better if they just widened the data bus. You'd lose drop-in compatibility, but the alternative - which WDC seems to have chosen - is to let 65xx die chained to the archaic bus interface of the 6502. Imagine if x64 was still using the same pinout from the 8088 because they'd taken "PC compatibility" too literally...

I have always lamented that the 65816 (and '832 also) were bound by requirements to be able to emulate the 6502 and run pre-written 6502 code. Think how much better they could have performed without that requirement. As for the 40-pin requirement, I don't know why that was there, since the '816 was not quite pin-compatible anyway. (The 65802 was an '816 that could be dropped into a 6502 socket and give many of the benefits of the '816, just staying in the first 64K of memory map.) A 48-pin DIP, which was also a standard size, would have at least removed the requirement to multiplex the high byte of the address bus. The 68K used a 64-pin DIP. Unfortunately Apple made 6502 emulation capability a requirement for buying the '816 for their IIGS. It was also a shame that Apple management limited the IIGS to 2.8MHz because they didn't want it to make the MacIntoshes look bad. I don't know if the '832 came with the same requirement from a potential customer, but although it was designed, it never got made.

I propose such a 32-bit 6502, starting with the third post of the 6502.org topic, "Improving the 6502, some ideas." It really just takes the 65816 and expands the data bus, non-multiplexed address bus, and all registers (except maybe the status register) to 32 bits, getting rid of page and bank boundaries and requirements. That way the bank registers, direct-page register, and stack-pointer register become merely offsets and you can still address the entire 4 gigaword space from anywhere. Absolute address modes become the same thing as ZP (or DP), except with the data or program bank offset rather than the direct-page offset. Long addressing is the same except with no offset applied. Operands are always picked up in a single 32-bit memory read cycle. The 6502 flavor is strongly preserved. It has not been made so far; but although I have not gotten into programmable logic and FPGAs, I might emulate it with a PIC microcontroller. The performance that way would be extremely poor, but it would let me experiment with the instruction set.

The 32-bit pseudo-6502 that seems to be most likely to someday reach reality is Michael Barry's 65m32. The link goes to a topic in the AnyCPU forum though because more of the 6502 flavor is lost in this processor. He does make it more efficient, merging the operand with the instruction so they can be fetched in a single cycle in cases where the operand is 24 bits or less. So for example LDA $123456 is all fetched in a single memory cycle. The operand for LDA $12345678 would have to be separate from the instruction.
Re: The Difficulty of ARM Assembly
by on (#234597)
I think the M65 4510 is the closest to being "made". It has a full 6502 emulation mode, but also has a turbo 6502 mode that uses the larger bus to pull in opcode + param data for faster execution. He was able to push it up to 192MHz but it broke the 8bit feel and the way the "8bit processors worked", so it's locked to 48MHz. It has 32bit extensions as well as the Z register.
Re: The Difficulty of ARM Assembly
by on (#234614)
I just thought of a way to make an improved 6502-like 8-bit CPU. The instruction format will be like this:

bits 0-3: instruction field
bits 4-7: operand field

The operand field contains hardwired combinations of register/register pairs or memory/register pairs.

Code:
0000: a, #imm
0001: a, dp
0010: a, abs
0011: a, (dp)
0100: a, abs,x
0101: a, abs,y
0110: a, (dp),x
0111: a, (dp),y
1000: a, x
1001: a, y
1010: x, #imm
1011: y, #imm
1100: x, abs
1101: y, abs
1110: x, abs,y
1111: y, abs,x
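A minimal Python sketch of how a decoder for this format might split an opcode byte, taking the layout literally: bits 0-3 select the instruction and bits 4-7 select the operand pair. The 16 instructions themselves aren't assigned yet, so the sketch only returns the instruction field as a number; everything beyond the table above is hypothetical.

```python
# Operand pairs indexed by the 4-bit operand field, copied from the table above.
OPERAND_PAIRS = [
    "a, #imm", "a, dp", "a, abs", "a, (dp)",
    "a, abs,x", "a, abs,y", "a, (dp),x", "a, (dp),y",
    "a, x", "a, y", "x, #imm", "y, #imm",
    "x, abs", "y, abs", "x, abs,y", "y, abs,x",
]


def decode(opcode):
    """Split an 8-bit opcode into (instruction field, operand pair)."""
    instruction = opcode & 0x0F          # bits 0-3
    operand = (opcode >> 4) & 0x0F       # bits 4-7
    return instruction, OPERAND_PAIRS[operand]


print(decode(0x25))   # instruction 5 with operand field 0010
print(decode(0xF0))   # instruction 0 with operand field 1111
```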
Re: The Difficulty of ARM Assembly
by on (#234656)
I think everyone here should take the time to look at the 6809 before reinventing the wheel. You might be surprised what was available in 1978, and used all the way into the 90s, particularly in arcade games. Programmers manual (PDF with a really crappy web front end)
Re: The Difficulty of ARM Assembly
by on (#234669)
It's too bad Motorola cut the cord on the 6809 so early on.
Re: The Difficulty of ARM Assembly
by on (#234671)
The 6809 had a really nice instruction set, but it didn't really perform any better than the 6502. The 65816 is a much better upgrade IMO, and the SuperCPU ran it at 20MHz over 20 years ago.
Re: The Difficulty of ARM Assembly
by on (#234677)
psycopathicteen wrote:
I just thought of a way to make an improved 6502-like 8-bit CPU. The instruction format will be like this:

bits 0-3: instruction field
bits 4-7: operand field

The operand field contains hardwired combinations of register/register pairs or memory/register pairs.

Code:
0000: a, #imm
0001: a, dp
0010: a, abs
0011: a, (dp)
0100: a, abs,x
0101: a, abs,y
0110: a, (dp),x
0111: a, (dp),y
1000: a, x
1001: a, y
1010: x, #imm
1011: y, #imm
1100: x, abs
1101: y, abs
1110: x, abs,y
1111: y, abs,x

You do realize that is how it basically already works right?
Re: The Difficulty of ARM Assembly
by on (#234683)
A 4-bit operand/mode field is indeed similar to how bits 4-2 from an actual 6502 instruction work in group 1 instructions (bits 1-0 = 01):

00001 (dd,X)
00101 dd
01001 #ii
01101 aaaa
10001 (dd),Y
10101 dd,X
11001 aaaa,Y
11101 aaaa,X

Similarly for group 2 RMW instructions (bits 1-0 = 10):

00110 dd
01010 A
01110 aaaa
10110 dd,X
11110 aaaa,X

But psycopathicteen's proposal is slightly closer to orthogonal, as it allows use of X or Y instead of A as the target in more cases. I'm under the impression that you can't get true orthogonality with indexed addressing modes in an 8-bit opcode; you need prefixes or 16-bit opcodes for that.
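To illustrate the existing 6502 layout, here is a Python sketch that splits an opcode into the conventional aaa/bbb/cc fields (bits 7-5, 4-2, 1-0), with group 1 as cc = 01 and group 2 as cc = 10. It is deliberately incomplete: group 2 is limited to the four shift/rotate instructions, since STX, LDX, DEC and INC have exceptions in that column, and the immediate form of STA is rejected because it is not a documented opcode.

```python
GROUP1_OPS = ["ORA", "AND", "EOR", "ADC", "STA", "LDA", "CMP", "SBC"]
GROUP1_MODES = ["(dd,X)", "dd", "#ii", "aaaa", "(dd),Y", "dd,X", "aaaa,Y", "aaaa,X"]
SHIFT_OPS = ["ASL", "ROL", "LSR", "ROR"]
SHIFT_MODES = {0b001: "dd", 0b010: "A", 0b011: "aaaa", 0b101: "dd,X", 0b111: "aaaa,X"}


def split_opcode(opcode):
    """Return the (aaa, bbb, cc) bit fields of an 8-bit opcode."""
    return opcode >> 5, (opcode >> 2) & 0b111, opcode & 0b11


def describe(opcode):
    """Name the instruction and mode for the regular cases, else None."""
    aaa, bbb, cc = split_opcode(opcode)
    if cc == 0b01:
        if GROUP1_OPS[aaa] == "STA" and GROUP1_MODES[bbb] == "#ii":
            return None                       # store-immediate does not exist
        return f"{GROUP1_OPS[aaa]} {GROUP1_MODES[bbb]}"
    if cc == 0b10 and aaa < 4 and bbb in SHIFT_MODES:
        return f"{SHIFT_OPS[aaa]} {SHIFT_MODES[bbb]}"
    return None                               # outside this simplified model


for opcode in (0xA9, 0xBD, 0x91, 0x0A, 0x1E, 0x89):
    print(f"${opcode:02X}", describe(opcode))
```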
Re: The Difficulty of ARM Assembly
by on (#234697)
koitsu wrote:
I think everyone here should take the time to look at the 6809 before reinventing the wheel. You might be surprised what was available in 1978, and used all the way into the 90s, particularly in arcade games. Programmers manual (PDF with a really crappy web front end)


If you go to Download Options on that page, you can get the actual PDF file.
Re: The Difficulty of ARM Assembly
by on (#234702)
Garth wrote:
The 6809 had a really nice instruction set, but it didn't really perform any better than the 6502. The 65816 is a much better upgrade IMO, and the SuperCPU ran it at 20MHz over 20 years ago.


Why does the 65816 perform better? Is it because the 6809 takes an extra instruction fetch on indexing instructions? Or is it just because Motorola didn't sell it at faster speeds?
Re: The Difficulty of ARM Assembly
by on (#234703)
tepples wrote:
A 4-bit operand/mode field is indeed similar to how bits 4-2 from an actual 6502 instruction work in group 1 instructions (bits 1-0 = 01):

00001 (dd,X)
00101 dd
01001 #ii
01101 aaaa
10001 (dd),Y
10101 dd,X
11001 aaaa,Y
11101 aaaa,X

Similarly for group 2 RMW instructions (bits 1-0 = 10):

00110 dd
01010 A
01110 aaaa
10110 dd,X
11110 aaaa,X

But psycopathicteen's proposal is slightly closer to orthogonal, as it allows use of X or Y instead of A as the target in more cases. I'm under the impression that you can't get true orthogonality with indexed addressing modes in an 8-bit opcode; you need prefixes or 16-bit opcodes for that.


Also 2 of the "addressing modes" are register to register modes.

Also what would be nice would be:
-adds without carry
-shifting instructions on X and Y
-barrel shifting on A
-register swaps

Of course I don't know if all these would fit in 256 instructions, but I do know that the 6502 has a lot of unused opcodes, and the 65816 spends a lot of opcodes with long addressing modes and stuff that makes sense for a computer, but not so much for a video game system.
Re: The Difficulty of ARM Assembly
by on (#234710)
psycopathicteen wrote:
Garth wrote:
The 6809 had a really nice instruction set, but it didn't really perform any better than the 6502. The 65816 is a much better upgrade IMO, and the SuperCPU ran it at 20MHz over 20 years ago.


Why does the 65816 perform better? Is it because the 6809 takes an extra instruction fetch on indexing instructions? Or is it just because Motorola didn't sell it at faster speeds?

My information about the 6809 performance comparison is from a very knowledgeable friend who really likes the 6809. He writes, "6809 is hobbled by a wearisome prevalence of dead cycles. Even a simple operation such as an 8-bit load using Absolute address mode takes 5 cycles on 6809, as compared to 4 cycles on 6502. Using Direct-Page/Zero-Page mode the numbers are 4 cycles as compared to 3 cycles."

As for the 65816, my 65816 Forth runs two to three times as fast as my 6502 Forth, at a given clock speed. It's primarily that the '816 is so much more efficient at handling the 16-bit cells than the '02 which has to take 8 bits at a time and increment addresses or indexes in between and such. Here's the simple example of @ (pronounced "fetch"), which takes a 16-bit address placed on the top of the data stack and replaces it with the 16-bit contents of that address.

First for 6502:
Code:
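       LDA  (0,X)       ; Fetch the low byte of the cell whose address is on top of the data stack.
       PHA              ; Save it while the address gets bumped.
       INC  0,X         ; Increment the address low byte to point at the high byte of the cell.
       BNE  fet1        ; No wrap, so the address high byte is unchanged.
       INC  1,X         ; Wrapped: carry into the address high byte.
fet1:  LDA  (0,X)       ; Fetch the high byte of the cell.
       JMP  PUT
; and elsewhere, PUT which is used in so many places is:
PUT:   STA  1,X         ; Replace the high byte of the top stack cell.
       PLA              ; Recover the low byte saved earlier.
       STA  0,X         ; Replace the low byte of the top stack cell.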


For the '816, the whole thing is only:
Code:
       LDA  (0,X)
       STA  0,X         ; For the '816, PUT is only one 2-byte instruction anyway, so there's no sense in jumping to it.


@ was given such a short name because it's one of the things used most. You can see the difference in the code length, 2 instructions for the 65816 versus 10 for the 6502.

Then there are the 816's extra instructions and addressing modes that improve efficiency, like MVN and MVP (the memory-move instructions), and the stack-relative addressing which helps in looping for incrementing the index and comparing to the limit and getting the indexes for nested loops. (Think of the typical I and J indexes in nested BASIC FOR-NEXT loops, except that you can do it in Forth without taking variable space.) These are just a couple off the top of my head.

Note of course that I'm not talking about running '02 Forth on an '816, something which in itself would not result in any performance gain. The '816 Forth was re-written to take advantage of the 816's extra capabilities.

Then because of the shorter assembly code, it became practical to re-write a lot of secondaries as primitives.
Re: The Difficulty of ARM Assembly
by on (#234796)
One CPU I would love to get into if I could start all over again and if it had become a popular juggernaut in home computers is the Hitachi 6309. It's compatible with the Motorola 6809 but can run at a higher clock speed, and in enhanced mode, it executes in fewer cycles, has more instructions, and adds extra 8-bit accumulators so that 16- and 32-bit math can be done.
Re: The Difficulty of ARM Assembly
by on (#234808)
I just looked up the 6309; looks like a fun processor to work with, but it's unfortunate nothing very popular ever used it.

I have a processor in that same situation, and it's the 65ce02. From what I've read of it, it's mostly less capable than the 65816, being closer to the 65c02, except for one big difference, and that's a third "z" index register, which I'm very envious of... Minimum cycles per instruction is also 1 instead of 2, so performance might actually be better in certain areas. Looks like it was only used for the Amiga serial port card though...
Re: The Difficulty of ARM Assembly
by on (#234811)
Drew Sebastino wrote:
I just looked up the 6309; looks like a fun processor to work with, but it's unfortunate nothing very popular ever used it.

I have a processor in that same situation, and it's the 65ce02. From what I've read of it, it's mostly less capable than the 65816, being closer to the 65c02, except for one big difference, and that's a third "z" index register, which I'm very envious of... Minimum cycles per instruction is also 1 instead of 2, so performance might actually be better in certain areas. Looks like it was only used for the Amiga serial port card though...


Yeah, there were a lot of cool processors made in the late 80s that barely got any use because of all the hype the 68000 got. People just focused way too much on register size and bit size, completely ignoring instruction efficiency.
Re: The Difficulty of ARM Assembly
by on (#234821)
Concerning the thread title, all those speculations about existing or non-existing 6xxx variants could be summarized in one question:
What if a hypothetically improved 6xxx were ARM?
And the answer to that question is:
ARM is improved 6xxx.

So why don't you just get a GBA, NDS, DSi, or 3DS and learn ARM-Assembly? Then you would have a NES/SNES-like console with the kind of faster processor that you are talking about.
Re: The Difficulty of ARM Assembly
by on (#234828)
nocash wrote:
So why don't you just get a GBA, NDS, DSi, or 3DS and learn ARM-Assembly?

Copyright technicalities in some countries ban sale of flash adapters for (say) Nintendo DS but not those for NES.
Re: The Difficulty of ARM Assembly
by on (#234842)
Well, that shouldn't stop homebrew. The nice thing about GBA and up is that they can boot homebrew via Link Cable (or via Wifi in later models (and additionally from built-in SD card slot in even later models (but Wifi is definitely most useful for testing/developing homebrew code))).

Having or not having flashcards is really a non-issue for homebrew on those consoles (unless one wanted to sell homebrew games in cartridge format (or unless one would somehow manage to make a game that exceeds the internal memory limits of the console, though that's unlikely to happen for people who are familiar with things like NROM limitations)).

I have been doing quite a lot of coding on those handhelds (mainly rev-engineering stuff), and I have never needed a flashcard for that, and I couldn't even imagine what a flashcard could be useful for (well, except for running code from people who have exceeded the internal memory limitations, like people who made a "hello world" as a movie clip).

Concerning ease of use, the DSi should be currently offering the fastest and most comfortable way to boot homebrew, immediately after power-up, and without needing any special hardware like link cables or cartridges. And a used DSi is potentially cheaper than NES flashcarts.
Re: The Difficulty of ARM Assembly
by on (#234848)
Drew Sebastino wrote:
I have a processor in that same situation, and it's the 65ce02. From what I've read of it, it's mostly less capable than the 65816, being closer to the 65c02, except for one big difference, and that's a third "z" index register, which I'm very envious of... Minimum cycles per instruction is also 1 instead of 2, so performance might actually be better in certain areas. Looks like it was only used for the Amiga serial port card though...

Remember that the Z register is really limited and the only addressing mode it's got is (zeropage),z, but I've had tons of situations where even that would be very helpful to have. Other goodies are STX Absolute,y, STY Absolute,x and conditional jumps, though those are really just a code size/speed improvement rather than dramatically changing how programs are written.
Re: The Difficulty of ARM Assembly
by on (#234855)
What assemblers are there for ARM? I have been trying to find something and google isn't being helpful. Assembler is supposed to be the thing that turns code into program, not the code itself...
Re: The Difficulty of ARM Assembly
by on (#234858)
Quote:
What assemblers are there for ARM? I have been trying to find something and google isn't being helpful. Assembler is supposed to be the thing that turns code into program, not the code itself...


I use FASMARM for NDS coding.
Re: The Difficulty of ARM Assembly
by on (#234862)
A lot of GBA and DS devs use GNU Binutils as a cross-assembler and linker. It comes as part of devkitARM.
Re: The Difficulty of ARM Assembly
by on (#234864)
TmEE wrote:
Assembler is supposed to be the thing that turns code into program, not the code itself...

Right. "Assembler" is what the tool is called. The language is "assembly language," not "assembler." The constant misuse of terms does bother me.
Re: The Difficulty of ARM Assembly
by on (#234871)
Nocash, your POV seems to limit homebrew severely. If you have an advanced console, possibly even with 3d ability, why would you make a NES-style game? Any amount of decent art, music or sfx will take more space than the free RAM.
Re: The Difficulty of ARM Assembly
by on (#234886)
Uh, yes, I grew up in the '80s, so I am quite critical about details like "is that program really worth its memory requirements?" The ratio between code and data could easily vary by a factor of 1000 (e.g. think of a slideshow with uncompressed bitmaps). And the amount and quality of work, artwork, features, etc. may also vary in similar fashion. If the result looks good then it's fine, if the size is reasonable then it's even better, and I think it's most impressive if it's tightly squeezed and still looks good.

Anyways, available memory for code/data on handhelds without flashcards:
GBA allows loading 256Kbytes to RAM (originally posted as 128Kbytes).
NDS allows loading 2.75Mbytes to RAM.
DSi allows loading 6.25Mbytes to RAM, and, if really needed, you could have a further 128Mbytes on SD card.
I don't think that the available memory sizes do restrict homebrew too much. Or if they do: please blame that on Nintendo, it isn't my fault!
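
To put rough numbers on the slideshow example, here is a small Python sketch of how many uncompressed full-screen bitmaps fit into those load limits. The limits are the ones listed above. The frame sizes are an assumption (one screen at 2 bytes per pixel: 240x160 on GBA, 256x192 on NDS/DSi), and the names are made up for illustration.

```python
KIB = 1024

# Load limits as listed above (256K, 2.75M, 6.25M).
LOAD_LIMIT = {"GBA": 256 * KIB, "NDS": 2816 * KIB, "DSi": 6400 * KIB}

# Assumption: one full-screen image at 2 bytes per pixel.
FRAME_BYTES = {"GBA": 240 * 160 * 2, "NDS": 256 * 192 * 2, "DSi": 256 * 192 * 2}

def frames_that_fit(console, code_bytes=0):
    """Whole uncompressed frames that fit after reserving code_bytes for code."""
    free = LOAD_LIMIT[console] - code_bytes
    return max(free, 0) // FRAME_BYTES[console]
```

Under those assumptions the GBA holds 3 such frames, the NDS 29, and the DSi 66. Reserving 32K for code on the GBA drops it to 2, which is why a slideshow of raw bitmaps hits the limit long before a tile-based game does.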

The other issue is that many homebrew programs may rely on loading data from flashcards, not because they actually need the flashcard for a good reason, but rather because the programmers didn't know that they could make games without flashcards.
That's no problem if you are developing a game yourself - but it could be a problem if you want to download games made by other people.
Re: The Difficulty of ARM Assembly
by on (#234907)
What? GBA is 384K of total memory (video, fast iwram, and ewram), not exactly 128k. It can load sizes approaching 256k from link cables. I've used every byte of the GBA's memory before, even doing stupid things like storing code in Palette memory (no$gba gives warnings when I jump there).
Re: The Difficulty of ARM Assembly
by on (#234911)
Oops, yes, 256Kbytes for GBA, sorry. When comparing NES/SNES video with the 2D engine in NDS and GBA, I meant that the video hardware works similarly (so one could focus on learning ASM without needing to worry about video), but of course one could use more colors & sounds than NES, and with the GBA's 256Kbyte RAM, there's enough space to make a colorful equivalent to a 32K NES game.

3D video in ASM is probably quite rare. Though it might work quite well with 2.5Mbyte RAM. I think there are some "big" PSX games that don't use more than that amount of memory (and have the remaining 600Mbyte+ padded with CD-DA audio tracks or cut-scene movies).

I don't know which countries are banning flashcarts, but one could get started without a flashcart, and it should take some work to exceed the RAM limit, but if that happens... then one could still consider importing a flashcart one way or another.

TmEE wrote:
What assemblers are there for ARM? I have been trying to find something and google isn't being helpful. Assembler is supposed to be the thing that turns code into program, not the code itself...

I am using the "utility --> assembler" function in no$gba for ARM source code. It's small and simple and fast. A small sample source code file is included in Magic Floor, and a bigger one is in Wifiboot (both on the no$gba webpage).

The downside is that macros are almost completely unsupported, there's a hardcoded 4MB size limit, and the syntax is somewhat different from normal ARM code (if you are using the ".nocash" directive; without that it should also accept the more common #0xNNN immediate format).

The no$gba assembler isn't so bad if you just want to start coding. But you might prefer other tools for bigger projects. (I am not too sure if there are many bigger ARM/ASM projects; most people seem to use compiler code on ARM, sometimes mixed with some ASM functions.) As far as I know, the standard package for homebrew on Nintendo handhelds would be devkitpro; it's freeware, and it includes compiler & assembler & libraries for gba/nds/dsi.
Basic things like 2D video and sound should work without needing libraries, but libs could be useful for some additional NDS/DSi features, like touchscreen configuration, wifi, or accessing SD cards with FAT filesystem.