I've never programmed for any 8080-family CPU before, unless you count very brief inline assembly for 8086 back in 1998 or so. But I'm thinking of learning to program the Game Boy. I gather that it'll be a huge change from 6502 assembly or C. I found
ASMSchool and was startled by the dearth of addressing modes.
Say you have
x attributes for each of
y actors. There are two ways to organize this in memory:
- Some CPUs prefer an array of structures, where each actor's properties are contiguous in memory. These include machines with an addressing mode that uses a small offset to a full pointer held in a register, such as the Zilog Z80 (IX-relative modes), Motorola 68000, and ARMv4.
- Other CPUs prefer a structure of arrays, where (for example) all X positions, all Y positions, all frame numbers, etc. are contiguous. These include machines with an addressing mode that uses a constant full pointer plus an offset smaller than a pointer, such as the MOS/WDC 6502 family.
In another topic,
adam_smasher mentioned that the LR35902 (aka GBZ80) is the oddball of the bunch. The closest thing it has to an indexed addressing mode is (HL): align the table to a 256-byte boundary, load H with the high byte of the table's address, and use L as the index into it.
So if a program uses the first few bytes of a 256-byte page for a lookup table, what does it then do with the rest of the page? If it's ROM, does it stick some subroutine or some sequentially accessed data there? Or do Game Boy games just eat the cost of calculating addresses in software, making the work-per-clock ratio between the 6502 and LR35902 even lower than I had guessed?
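For concreteness, my understanding is that the (HL) lookup described above boils down to something like this (table name and placement are my own assumptions):

```
; Table aligned to a 256-byte page; index in A
LD H, HIGH(Table)  ; H = high byte of the table's base address
LD L, A            ; L = index
LD A, [HL]         ; A = Table[index]
```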
I too always wondered what the options for addressing data on the GB were. A full Z80 looks very versatile, but the Game Boy CPU doesn't look particularly suited to the manipulation of large blocks of memory.
I'm not a very proficient Z80 programmer but I know that the Z80 indexed instructions are incredibly slow and should be avoided anywhere performance is important (i.e. in a game). The performant way to access memory on a Z80 is the same as the only way to do it on an 8080 or GB: calculate the complete effective address (base + array offset + structure member offset) and store it in a register pair (HL, DE or BC). To walk an array, use the 16-bit INC instruction, the GB's postincrement instructions (which are only available for HL), or ADD the stride to the register pair (the fastest way to do this is to store the stride in another register pair and use a 16-bit ADD).
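For instance, walking an array of 16-byte records with the stride held in DE (label and record size are assumptions for the sake of the sketch) costs just one 16-bit ADD per element:

```
LD HL, ActorArray  ; base of the array (assumed label)
LD DE, 16          ; stride = record size
LD B, NUM_ACTORS
.loop:
LD A, [HL]         ; access the record's first byte
; ... process this record ...
ADD HL, DE         ; advance to the next record in 2 machine cycles
DEC B
JR NZ, .loop
```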
Basically, the 8080 family is a bit more like a RISC CPU in that calculating effective addresses is a distinct operation rather than something that gets folded into an "addressing mode". You could even say that the fundamental insight of RISC was "hey, instead of having all these addressing modes, let's just go back to the 8080 but add more and wider registers, because as long as you don't register spill doing address calculations explicitly ends up just as fast as doing them implicitly".
Register pressure makes arrays-of-structures painful on the 8080. You do okay if you're only processing one array at a time, but if you've got two (even if only one is an array-of-structures) you need a register pair for one pointer, a register pair for the other pointer, a register pair to hold the strides, and a loop coun... uh oh, you're already out of registers.
The addition of the postincrement instructions on the GB only makes structures-of-arrays even more of a winner.
AWJ wrote:
Basically, the 8080 family is a bit more like a RISC CPU in that calculating effective addresses is a distinct operation rather than something that gets folded into an "addressing mode". You could even say that the fundamental insight of RISC was "hey, instead of having all these addressing modes, let's just go back to the 8080 but add more and wider registers, because as long as you don't register spill doing address calculations explicitly ends up just as fast as doing them implicitly".
Yet practical RISC architectures ended up having a 68000-style pointer + short displacement as an available addressing mode for practical struct field access, rather than requiring an explicit addition every time to seek to the element of an array holding the value for a particular actor (in structure-of-arrays) or to the field of an actor's struct (in array-of-structures). In architectures where the load/store stage of the pipeline sits after the ALU stage, such as classical MIPS, address generation like this is essentially free.
In MIPS, the most "by-the-book" RISC design,
lw $rt, 4($rs) reads from address rs + 4. ARM is even more flexible:
- ldr r0, [r1, #4] reads from address r1 + 4
- ldr r0, [r1, #4]! (pre-increment) adds 4 to r1 and reads from the new address
- ldr r0, [r1], #4 (post-increment) reads from address r1, then adds 4 to r1
- ldr r0, [r1, r2, lsl #2] (register indexed scaled) reads from address r1 + (r2 << 2)
Quote:
Register pressure makes arrays-of-structures painful on the 8080.
And register pressure is one of the first things that the RISCs discarded.
Quote:
You do okay if you're only processing one array at a time, but if you've got two (even if only one is an array-of-structures) you need a register pair for one pointer, a register pair for the other pointer, a register pair to hold the strides, and a loop coun... uh oh, you're already out of registers.
The addition of the postincrement instructions on the GB only makes structures-of-arrays even more of a winner.
Say I store 16 bytes' worth of properties for each actor (player or active enemy) in a side-scrolling game across 16 parallel arrays. Position and velocity are already 10 bytes: X (screen, pixel, subpixel), X velocity (pixel/frame, subpixel/frame), Y (pixel, subpixel), Y velocity (pixel/frame, subpixel/frame), and facing direction. On top of that are actor class, state/animation frame, state transition time, and a couple more bytes related to loading the frame's tiles into VRAM. Would I then need to add the actor ID (in one register) to the base of each array to form HL for every single access?
All right, I guess "structures-of-arrays are faster" was an oversimplification. The case I was thinking of was where you're only accessing one member of each structure in your loop. What matters is that data that you access sequentially is sequential in memory, so that you can use postincrement as much as possible and minimize the instructions spent recalculating addresses. So you'd definitely put the bytes of an array of 16/24/32-bit integers together, not interleave them like you might on a 6502. But if you've got one loop that does collision detection and another loop that does animation and metasprite assembly, and the data sets used by those loops are disjoint, it would make sense to split your "actors" into two structs, one with the collision detection data and the other with the animation data.
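In source terms, such a split might look like this (the names and field sizes are mine, just to illustrate):

```
.bss
; Everything the collision loop touches, contiguous:
ActorPhys: .res NUM_ACTORS * 8 ; per actor: x(2), y(2), dx(2), dy(2)
; Everything the animation/metasprite loop touches, separate:
ActorAnim: .res NUM_ACTORS * 4 ; per actor: frame(1), timer(1), tile(1), attr(1)
```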
The absence of indexed addressing modes seems rough, but you can get a ton of mileage from arranging your data so that the CPU can get at it easily; the 4MHz GBZ80 can probably go pretty close to toe-to-toe with a 1.79MHz 6502 for most 2D game logic...as long as the GBZ80 programmer takes care to lay out their data properly.
Even if you can't keep everything page-aligned, if you can be sure your table never crosses page boundaries, you can strip out the carry logic from the most general sample code I wrote; that'll save you 12 cycles.
As AWJ suggests, if you're going to be operating on an entire structure at once, you can use the array-of-structures format, ideally writing your code to walk through the members you want to access.
You can also line up related data across pages, so that once you've indexed your first table, you can get related data just by loading a new value into H (4 cycles, if you can just use INC H).
Or, if possible, you could just try to rethink or restructure your engine to not operate on the entire structure at once.
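The cross-page trick in particular is worth a sketch (table names and placement are assumed: Actor_X fills one page and Actor_Y the very next one), since a single INC H hops between parallel tables at the same index:

```
LD H, HIGH(Actor_X)
LD L, A        ; A = actor index
LD B, [HL]     ; B = X position
INC H          ; HL now points into Actor_Y at the same index
LD C, [HL]     ; C = Y position
```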
But if I rely on the fact that one field follows another, and I or someone else on my team reorders the fields so that a different subroutine can walk through them with INC, such a reorganization may end up breaking the assumption. As far as I can tell, this means that for every INC or DEC in a subroutine that accesses more than one such field, I'll have to put an assembly-time assertion that the header file didn't change the order of the fields since I wrote that particular subroutine. And when such an assertion breaks, that's a lot of code I'll have to touch and re-test.
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs greatly from that for the striped-arrays for 6502 (C64, Atari, NES, Super NES) or the pointer+offset of a 68000 (Genesis, Amiga) or full Z80 (Game Gear), and it might not have been trivial to find programmers experienced with its idiosyncrasies.
I don't see how not to operate on the entire structure at once. An actor has to move horizontally and vertically, eject itself from any obstacles, and then update its animation frame. Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?
Quote:
But if I rely on the fact that one field follows another, and I or someone else on my team reorders the fields so that a different subroutine can walk through them with INC, such a reorganization may end up breaking the assumption. As far as I can tell, this means that for every INC or DEC in a subroutine that accesses more than one such field, I'll have to put an assembly-time assertion that the header file didn't change the order of the fields since I wrote that particular subroutine. And when such an assertion breaks, that's a lot of code I'll have to touch and re-test.
Yes, unfortunately, this is the case. Or, more realistically, one doesn't bother with any sort of compile-time assertions and just keeps everything in their head and prays.
Quote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs greatly from that for the striped-arrays for 6502 (C64, Atari, NES, Super NES) or the pointer+offset of a 68000 (Genesis, Amiga) or full Z80 (Game Gear), and it might not have been trivial to find programmers experienced with its idiosyncrasies.
Possibly; I've never disassembled a troublesome GB game to say. Are there any particularly slow ones you can think of? I wouldn't mind at least trying to take a peek and evaluate.
Quote:
I don't see how not to operate on the entire structure at once. An actor has to move horizontally and vertically, eject itself from any obstacles, and then update its animation frame. Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?
Something like that, sure. My game engine works much that way.
adam_smasher wrote:
Quote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs
Possibly; I've never disassembled a troublesome GB game to say. Are there any particularly slow ones you can think of? I wouldn't mind at least trying to take a peek and evaluate.
When I played
Balloon Kid and
Super Mario Land 2, their slow frame rate was something I noticed.
Quote:
Quote:
Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?
Something like that, sure. My game engine works much that way.
Even that can get troublesome, as ejection tends to touch the entire position, especially on slopes. And then you need five methods for each actor type instead of just
move().
A lot of commercial GB games contain really terrible and inefficient code. I've seen a lot of GB code that's effectively literally-translated 6502 (the first two Final Fantasy Legend games are bad for this; the third is significantly better). I've seen games that put the stack in high RAM--I guess they were told "accessing high RAM is faster" and thoroughly misunderstood what was meant. That's the problem with using a custom CPU architecture. I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...
I'm guessing most people probably would just copy the slots to a static region of memory.
AWJ wrote:
I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...
Probably not. The Konami-1 variant of the 6809
reportedly just XORs each opcode (not its operand) with a formula based on A3 and A1. That could be compensated at assembly time. Anyone who had coded for a CoCo, Dragon, FM-7, or Williams arcade could probably jump right in.
But then arcade code didn't have to be quite as efficient as console code anyway, as the manufacturer could just throw more hardware at it.
tepples wrote:
AWJ wrote:
I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...
Probably not. The Konami-1 variant of the 6809
reportedly just XORs each opcode (not its operand) with a formula based on A3 and A1. That could be compensated at assembly time. Anyone who had coded for a CoCo, Dragon, FM-7, or Williams arcade could probably jump right in.
But then arcade code didn't have to be quite as efficient as console code anyway, as the manufacturer could just throw more hardware at it.
That's not the CPU I'm talking about. The Konami-1 was just a 6809 with scrambled opcodes, but the Konami-2 had 8 general-purpose output pins that were usually used for ROM banking (indeed they're listed as A16-A23 on schematics) and ISA changes that go beyond simple scrambling. The Konami-1 was used in early-1980s games like Gyruss; the Konami-2 was used until the early 1990s (The Simpsons was one of the last games that ran on it).
In
this topic, Axelay recommends 6502-style parallel tables with start addresses aligned to a power of two, which lets instructions like
set 5,L and
res 5,L point HL at different tables in a page.
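In other words (table placement assumed here: two 32-byte tables sharing one page, the second starting at base+$20):

```
LD H, HIGH(Tables) ; page holding both tables
LD L, A            ; A = index into the first table
LD B, [HL]         ; read from the first table
SET 5, L           ; flip HL to the same index in the second table
LD C, [HL]         ; read from the second table
RES 5, L           ; and back to the first
```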
It's been a while since I last touched any z80, but every time an indexed array lookup was needed, the pointer always had to be calculated by hand.
The Game Boy has a treasure trove of RAM compared to the NES, so it shouldn't be a problem to align your actor memory so that an individual actor's memory doesn't cross a page boundary. Once you do, all you have to do is store your actor's base pointer in a register pair, and the only calculations you'd need to do are on the low byte. If your actors' memory sizes are a power of two, it's a simple matter of ORing the desired index into the low byte, and then later using an AND mask to recover the base pointer so you can OR another index.
That's all for array-of-structures. Structure-of-arrays is the exact same thing, except instead of attribute offsets being 0, 1, 2, 3, etc, they're now $00, $10, $20, $30, etc. You also trade limitations; instead of actor size being a power of two and actor count being free, you now have actor count being a power of two and actor size being free. Use whatever's more efficient for the memory usage, your code will be exactly the same regardless.
If you need more than 256 bytes worth of actors, you should use array-of-structures, because you'll likely be calculating pointers for attributes (8-bit operation if no actors cross a page boundary) more often than you'll be calculating base pointers for actors (16-bit operation always).
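A sketch of the power-of-two array-of-structures case (the base label and field constant are my assumptions; 16-byte actors in a page-aligned table):

```
; B = actor index (0-15), FIELD_Y = offset of the desired field (0-15)
LD H, HIGH(Actors)
LD A, B
SWAP A         ; index * 16
OR FIELD_Y     ; OR in the field offset -- no carry possible
LD L, A
LD A, [HL]     ; read the field
; later: AND $F0 on L recovers the actor's base so you can OR another field
```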
As an interesting aside, the ZX Spectrum optimized its
framebuffer specifically for Z80 HL addressing; Increasing L moves to the next 8x8 character on the screen, increasing H moves to the next scanline within that character. It might not be the same as DMG-Z80, but it might be worth checking out what ZX Spectrum programmers have to say. That's how I came across
this helpful bit-twiddling reference after all.
tepples wrote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower
There were Master System games that ventured into that 30 fps territory too. Earlier titles.
A lot of Z80 programmers tend to boast about the available regs to work with on the Z80 compared to accumulator-based processors (65x, 6809, etc.). But I always found it to be the complete opposite. Data registers are kind of a moot point on the 65x simply because it has a lot of direct memory addressing modes (and a fast mode: direct or zero page). A lot more operations actually have to go through the A reg on the Z80, whether from another reg or from indirection (address regs). I always felt a constraint of constantly juggling things - way more than what might be done with A on the 65x. And having ZP as off-processor address registers (address vectors) feels so free in comparison. Even the 68K felt a tiny bit cramped in this respect compared to the 65x (only 7 address regs; SP is the 8th address reg).
Of course, the context of 65x to me is not limited to the NES - so my view of optimization and use of quick LUTs for logic are probably expanded compared to the NES environment.
The C64 vs. Speccy wars concluded that a full Z80 (with IX and IY) has roughly 1/3 the IPC of a 6502.
Z80 assembly came next, as I needed a means to play sound without tying up the main CPU. It was much more painful than 68K, which had pretty much spoiled me. x86 still felt worse though... Nowadays I also do Master System and SC-3000 / SG-1000 stuff; a whole game in Z80 isn't actually all that bad.
Is that based on IX/IY (SG1K/SMS/GG only) or some other way to step through fields of an actor structure?
I pretty much only use IX (IXL, IXH) and IY (IYL, IYH) as temporary variables; all else goes by structures aligned to 256 bytes and BC/DE/HL addressing, with a lot of incrementing of C/E/L rather than directly specifying which element to access through them. Autoincrement comes for free on 68K, and incrementing is faster on Z80 than directly specifying the element to access too, so data is always laid out in the order of use to accommodate that approach.
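Concretely, "laid out in the order of use" means a routine touches its fields with nothing but increments. A sketch (the field order and register assignments here are assumed for illustration):

```
; HL -> this actor's x_sub; fields declared in exactly the order read:
; x_sub, then x_pix; velocity held in C (sub) and B (pix)
LD A, [HL]
ADD C          ; x_sub += dx_sub
LD [HL], A
INC L          ; next field in declared order
LD A, [HL]
ADC B          ; x_pix += dx_pix + carry
LD [HL], A
```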
TmEE wrote:
data is always laid out in the order of use
That's what others were recommending but I'm somehow not fully grokking. Say there are 16 bytes of state for each actor in an action platformer:
X position (24 bits; 16.8)
X velocity (16 bits; 8.8 signed)
Y position (16 bits; 8.8)
Y velocity (16 bits; 8.8 signed)
Current frame
Timing state
Facing direction
Height of last hitbox to hit this actor relative to the actor's feet; used for collision response
Health
Actor type ID
VRAM location for actor's sprite cels
Are there some generic rules of thumb for field layout to ensure "data is always laid out in the order of use"? If not, how can I predict "the order of use" in all cases? Do I need to prototype all the routines used by a sample of the actor types in a high-level language, and then reorder the fields to be either after or one bit different from the previous field before translating the routines to Z80/LR35902?
Well, I feel as though I've suggested this before, but the best rule of thumb in general is probably "structs of arrays" rather than "arrays of structs".
Otherwise you do your best, focus on the needs of the most speed-sensitive code, maybe iterate on your design a few times, and if worst comes to worst, you might need to suck it up and manually index - which, as long as you keep your data page-aligned and/or so that it never crosses page boundaries, really isn't too bad: we're talking about, in 6502 terms, no more than a handful of extra cycles. Outside of a vblank handler, it's extremely rare that you really need to worry about that in the average game.
Jaa, sometimes you just have to suck it up. My process is iterative, with refactors as new ways to improve something present themselves. Prototyping in a higher-level language probably gets you somewhere sooner.
Found this via
Why "logic" is bullshit (RANT):
Both the 6502 and GB CPU have a small instruction set, but the advantage on the 6502 side is with the addressing modes for the instructions and fewer bottlenecks in the processor architecture/design.
[...]
If you're specifically doing bitmap drawing effects (software blitting) at the given GB screen resolution, then relatively speaking the GB 1 MHz CPU might be faster for the task than the NES at its native resolution. But in absolute terms, in a normal game engine, I would say that it's behind.
I'm continuing to read through Stef's posts in that topic to see if it addresses the problem I'm seeing, that of random access to the properties of a particular element of an array of objects.
(all cycle counts below for the GBZ80 are /4, for easier comparison with the 6502):
Assuming the index you want is in A and the array is page-aligned and < 256 bytes,
Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
LD L, A
LD A, [HL]
That's 7 cycles.
Better yet, use parallel arrays of one-byte fields (structure of arrays) rather than multi-byte structures, and you can get rid of the ADD FieldOffset, saving 2 cycles:
Code:
LD H, ArrayBase >> 8
LD L, A
LD A, [HL]
That's 5 cycles.
If the array isn't page-aligned but doesn't wrap past page boundaries, add another two cycles to ADD ArrayBase & $00FF.
Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
ADD ArrayBase & $00FF
LD L, A
LD A, [HL]
That's 9 cycles.
If the array isn't page-aligned and might wrap, add another three cycles (a taken JR NC when there's no wrap; an untaken JR NC plus INC H when there is):
Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
ADD ArrayBase & $00FF
JR NC, .nc
INC H
.nc:
LD L, A
LD A, [HL]
That's 12 cycles.
So: if you don't take care with your memory layout, you lose performance. But nothing about what you want to do is impossible or even difficult. It's usually somewhat slower than the 6502 but not terribly so, depending on a bunch of platform-specific factors on both sides (is your base address in zero page? your index in one of the index registers? do you cross page boundaries? what kind of math do you need to do to get the final index? etc etc). Usually this doesn't matter too much, unless you're in a tight loop - in which case you write/arrange your memory for speed, and the GBZ80 holds its own just fine against the 6502 in that context.
Far and away the biggest problem here is that code for a fully general operation is a bit unwieldy. Wrap it in a macro if you'd like.
This is a non-issue.
This has probably been covered well in the thread already, but I think you can get pretty far without indexed addressing.
As I recently revealed, I've been perusing the entire source code for Donkey Kong, and it's actually surprisingly rare that it uses the IX and IY index registers that the Z80 has over the 8080. Instead it uses a lot of INC and DEC on the 16 bit HL register, and adds the DE register to cycle through object tables. It's in no way as elegant as what we are used to with the 6502, but having access to 16 bit additions on registers used for addressing opens up a whole new toolset - I'm assuming that's also possible on the Game Boy's CPU.
You can, and you can also auto-increment/decrement HL on reads for free.
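For example, gathering one actor's three position bytes with the free post-increment (the field layout is assumed):

```
; HL -> this actor's ysub
LD A, [HLI]    ; ysub; HL advances for free
LD B, A
LD A, [HLI]    ; y
LD C, A
LD A, [HL]     ; yhi
LD D, A
```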
It might be easier for readers to appreciate how this is a non-issue if we try a concrete example. A 6502 subroutine to simulate movement of an object under the influence of gravity might look like this:
Code:
GRAVITY = 48 ; /256 pixel per frame^2
NUM_ACTORS = 16
; The following arrays are not aligned to the start of a page,
; but none cross a page boundary.
.bss
.align NUM_ACTORS
; Displacement from top of map in 1/256 pixel units
; range: 0.000 to 32767.996 pixels
actor_ysub: .res NUM_ACTORS
actor_y: .res NUM_ACTORS
actor_yhi: .res NUM_ACTORS
; Signed velocity in 1/256 pixel per frame units
; range: -16.000 to 16.000 pixels/frame
actor_dysub: .res NUM_ACTORS
actor_dy: .res NUM_ACTORS
.code
move_actor_x_vertically:
; Step 1: Apply acceleration due to gravity
clc
lda #GRAVITY
adc actor_dysub,x
sta actor_dysub,x
lda #0
adc actor_dy,x
sta actor_dy,x
; Step 2: Add velocity to displacement
clc
lda actor_dysub,x
adc actor_ysub,x
sta actor_ysub,x
lda actor_dy,x
adc actor_y,x
sta actor_y,x
; Sign extend the velocity
lda actor_dy,x
and #$80
beq :+
lda #$FF
:
adc actor_yhi,x
sta actor_yhi,x
rts
I'd like to see the idiomatic translation of this to Z80 or LR35902. Perhaps what I'm missing is some sort of insight on how "and adds the DE register to cycle through object tables" plays out in practice.
Presumably you'd want to do this to each of your actors in a loop? Then I'd probably do it like this:
Code:
GRAVITY = 48 ; /256 pixel per frame^2
NUM_ACTORS = 16
.bss
; Displacement from top of map in 1/256 pixel units
; range: 0.000 to 32767.996 pixels
.align NUM_ACTORS
Actor_Y: .res NUM_ACTORS * 3
; Signed velocity in 1/256 pixel per frame units
; range: -16.000 to 16.000 pixels/frame
.align NUM_ACTORS
Actor_DY: .res NUM_ACTORS * 2
.code
ApplyGravityToVelocities:
LD HL, Actor_DY
LD B, NUM_ACTORS
LD C, GRAVITY
.loop:
LD A, [HL]
ADD C
LD [HLI], A
JR NC, .nc
INC [HL]
.nc:
INC L ; skip past high-byte
DEC B
JR NZ, .loop
RET
ApplyVelocities:
LD DE, Actor_Y
LD HL, Actor_DY
LD B, NUM_ACTORS
.loop:
;; add low byte
LD A, [DE] ; get ysub
ADD [HL] ; add dysub
LD [DE], A ; set ysub
INC E
INC L
;; add middle byte
LD A, [DE] ; get y
ADC [HL] ; add dy + carry(ysub + dysub)
LD [DE], A ; set y
INC E
;; adjust high byte
LD A, [HLI] ; get dy(hi), move to next dy(lo)
BIT 7, A
JR Z, .pos
.neg:
JR C, .next
LD A, [DE]
DEC A
LD [DE], A
.pos:
JR NC, .next
LD A, [DE]
INC A
LD [DE], A
.next:
INC E
DEC B
JR NZ, .loop
RET
I think that's right? It's not tested, and I'm not totally sure I understand the signed arithmetic bit.
I would want to apply gravity to actors with some types and not to actors with other types. Based on the code you presented, you appear to suggest structuring the actor update loops in a form that conceptually resembles the single-instruction, multiple-data (SIMD) approach used by shaders on modern GPUs:
- Apply step 1 to all actors.
- Apply step 2 to all actors.
- Apply step 3 to all actors.
But if the table contains both actors that fall and actors that do not fall, each step needs to include a determination of whether to apply or skip the step for a particular actor:
- Apply step 1 to those actors whose combination of type and state uses step 1.
- Apply step 2 to those actors whose combination of type and state uses step 2.
- Apply step 3 to those actors whose combination of type and state uses step 3.
In this sort of SIMD-like structure, I don't see how you'd enable or disable individual steps based on an actor's type and state without using additional register pairs for pointers to lookup tables from type and state to bitfields of which steps shall be performed on objects in that type and state.
Please forgive my naivete. It's just that arranging enemy AI using SIMD rather than straight-through code is an entirely new concept to me.
I think you can pretty straightforwardly adapt my code to only apply the operations to one actor - the only real difference would maybe be expanding the Y position to 4 bytes, to index into the array with two quick shifts.
I just wrote the code as I would have written it, without having that additional criteria in mind. I tend to work in the "SIMD" style, and it often does help things on the Game Boy because of the auto-incrementing. But in retrospect I should have kept it more similar to your original code - I probably muddled things, especially because this code didn't end up really needing to be very SIMD-ish.
Code:
;;; A - contains actor # to apply gravity to
ApplyGravityToVelocity:
PUSH HL
LD H, Actor_DY >> 8
ADD A, A ; multiply by two, since each velocity is two bytes (RLA would shift the stale carry flag into bit 0)
ADD Actor_DY & $00FF ; add the array base, since we're not necessarily page aligned
LD L, A
LD A, [HL]
ADD GRAVITY
LD [HLI], A
JR NC, .nc
INC [HL] ; propagate any carry into the velocity's high byte
.nc:
POP HL
RET
If it's only gravity that's selective, then you can still do the actual motions using the separate routine I gave, all at once.
Code:
.bss
.align PAGE
Actor_UsesGravity: .res NUM_ACTORS
.code
UpdateActors:
CALL ApplyGravity
CALL ApplyVelocities
RET ; obviously this tail call could be optimized out
ApplyGravity:
LD HL, Actor_UsesGravity ;; page aligned, so L = 0
LD B, NUM_ACTORS
.loop:
LD A, [HL]
AND A
JR Z, .next ; flag clear, so this actor skips gravity
LD A, L
CALL ApplyGravityToVelocity ; note that I made the above subroutine preserve HL!
.next:
INC L
DEC B
JR NZ, .loop
RET
That gets complicated when each actor ends up with dozens of different
Uses flags, one for each possible line that could be in or not in a particular actor's script, that need to get copied from the actor's prototype. It gets doubly complicated when a line in the Grand Unified Script needs to turn a bunch of
Uses flags on and off when an actor goes in or out of a particular state.
Today I decided to look at how commercial games solve the
SOA vs. AOS dilemma. But I discovered that very few
Game Boy games listed on Data Crystal actually have their RAM map substantially filled in. The first one I found was
that of Wario Land, which has an 8-entry actor table at $A200, where the actors occupy $A200-$A213, $A220-$A233, $A240-$A253, ..., $A2E0-$A2F3. I guess this validates the
array of 32-byte-aligned actor structures.
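Reaching actor i's slot in a layout like that is cheap, because all eight 32-byte slots sit in page $A2:

```
; A = actor index (0-7)
SWAP A         ; * 16
ADD A, A       ; * 32; can't overflow for indexes 0-7
LD L, A
LD H, $A2      ; HL -> base of actor i's slot at $A200 + i*32
LD A, [HL]     ; first byte of the slot
```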
I've recently started tumbling down the Z80 rabbit hole.. Only I have a full Z80 at my disposal and 6502 that I can jump back to when the going gets too hot for the Z80
I'm starting to think that the dispatch is the way to solve the issue.
We are used to doing
Get thing,x
and state
bne _next
lda otherData,x
adc moreData,x
sta otherData,x
and state2
bne _next2
...thing2
while for Z80 it might be better to use the "use bits" as a dispatch.
ld use
<<2
add Base Pointer
call
this way you put a function that knows what the use cases are and can handle the bits without having to look them up. Saving the need to index into multiple tables.
Not done too much experimentation yet, as I'm still trying to find an assembler that isn't circa 1982. Or one that is 1987 spec but isn't trapped on a machine that is slow...
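A concrete GB-flavored version of that dispatch idea might look like this (HandlerTable is my name; it's assumed to be page-aligned and to hold 2-byte handler addresses):

```
; A = the actor's "use" byte (0-127)
LD H, HIGH(HandlerTable)
ADD A, A       ; * 2: each entry is a 2-byte address
LD L, A
LD A, [HLI]    ; low byte of the handler's address
LD H, [HL]     ; high byte
LD L, A
JP HL          ; jump to the handler (use a CALL trampoline if you need to return)
```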
My bad luck continues. I searched Google for 8080 record field access and 8080 struct field access, hoping to stumble on some idiom that has become common practice, but most results were some website hosted on port 8080, not Intel 8080. I tried 8080 assembly struct field access, and Google tried to second guess me with "Missing: 8080"
If you put a search term in quotes Google won't show you results without it.
You might try looking at the output of 8080 or GBZ80 C compilers when interacting with structs to see how they handle the problem.
adam_smasher wrote:
You might try looking at the output of 8080 or GBZ80 C compilers
In other words, the godbolt solution. I'd've tried that if SDCC were any good. See
"To C or not to C?" by ISSOtm.
ISSOtm wrote:
[GBDK is] built on an ancient build of SDCC, which is known to generate poor (bloated) and often straight up wrong code.
What C compilers targeting 8080 are any good? Any luck with, say,
BDS C?
It might be worth giving SDCC a shot anyway - it's still under active development, and the aforementioned "ancient build of SDCC, which is known to generate poor (bloated) and often straight up wrong code" that GBDK is based on is 17 years old.
In GBDev Discord, ISSOtm announced an
RGBDS macro pack to define structs and is trying to figure out how to best distribute it. Alongside this came some practical idioms for struct field access.
Code:
; Prep: 3 mcycles each
ld de,self
ld bc,other_actor
; Random field load/store: 7 mcycles, BC preserved
ld hl,offsetof(Actor, xsub) ; 3
add hl,de ; 2
ld a,[hl] ; 2
; Compare 6502: 5 cycles (minus 1 for load not crossing page)
lda actor_xsub,x ; 4
; Random field arithmetic: 8 mcycles, BC preserved
ld hl,offsetof(Actor, xsub) ; 3
add hl,de ; 2
ld l,[hl] ; 2
add a,l ; 1
; Compare 6502: 4 cycles (plus 1 for crossing page)
add actor_xsub,x ; 4
; Store constant in field: 8 mcycles, ABC preserved
ld hl,offsetof(Actor, frame) ; 3
add hl,de ; 2
ld [hl],FRAME_JUMP ; 3
; Compare 6502: 7 cycles and A is clobbered
lda #FRAME_JUMP ; 2
sta actor_xsub,x ; 5
Roughly: 8080 is faster for sequential access, and 6502 is faster for random access. Try counting
aaaa,X vs.
(dd),Y accesses in your NES program to estimate what parts might become slower or faster respectively.
I've since collected a lot of this into a
wiki article.