I've never programmed for any 8080-family CPU before, unless you count very brief inline assembly for 8086 back in 1998 or so. But I'm thinking of learning to program the Game Boy. I gather that it'll be a huge change from 6502 assembly or C. I found
ASMSchool and was startled by the dearth of addressing modes.
Say you have
x attributes for each of
y actors. There are two ways to organize this in memory:
- Some CPUs prefer an array of structures, where each actor's properties are contiguous in memory. These include machines with an addressing mode that uses a small offset to a full pointer held in a register, such as the Zilog Z80 (IX-relative modes), Motorola 68000, and ARMv4.
- Other CPUs prefer a structure of arrays, where (for example) all X positions, all Y positions, all frame numbers, etc. are contiguous. These include machines with an addressing mode that uses a constant full pointer plus an offset smaller than a pointer, such as the MOS/WDC 6502 family.
In another topic,
adam_smasher mentioned that the LR35902 (aka GBZ80) is the oddball of the bunch. The closest thing it has to an indexed addressing mode is (HL): align the table to a 256-byte boundary, load H with the high byte of the table's address, and use L as the index into it.
So if a program uses the first few bytes of a 256-byte page for a lookup table, what does it then do with the rest of the page? If it's ROM, does it stick some subroutine or some sequentially accessed data there? Or do Game Boy games just eat the cost of calculating addresses in software, making the work-per-clock ratio between the 6502 and LR35902 even lower than I had guessed?
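For concreteness, my understanding is that the (HL) lookup described above boils down to something like this (table name and placement are my own assumptions):

```
; Table aligned to a 256-byte page; index in A
LD H, HIGH(Table)  ; H = high byte of the table's base address
LD L, A            ; L = index
LD A, [HL]         ; A = Table[index]
```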
I too always wondered what the options for addressing data on the GB were. A full Z80 looks very versatile, but the Game Boy CPU doesn't look particularly suited to the manipulation of large blocks of memory.
I'm not a very proficient Z80 programmer but I know that the Z80 indexed instructions are incredibly slow and should be avoided anywhere performance is important (i.e. in a game). The performant way to access memory on a Z80 is the same as the only way to do it on an 8080 or GB: calculate the complete effective address (base + array offset + structure member offset) and store it in a register pair (HL, DE or BC). To walk an array, use the 16-bit INC instruction, the GB's postincrement instructions (which are only available for HL), or ADD the stride to the register pair (the fastest way to do this is to store the stride in another register pair and use a 16-bit ADD).
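For instance, walking an array of 16-byte records with the stride held in DE (label and record size are assumptions for the sake of the sketch) costs just one 16-bit ADD per element:

```
LD HL, ActorArray  ; base of the array (assumed label)
LD DE, 16          ; stride = record size
LD B, NUM_ACTORS
.loop:
LD A, [HL]         ; access the record's first byte
; ... process this record ...
ADD HL, DE         ; advance to the next record in 2 machine cycles
DEC B
JR NZ, .loop
```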
Basically, the 8080 family is a bit more like a RISC CPU in that calculating effective addresses is a distinct operation rather than something that gets folded into an "addressing mode". You could even say that the fundamental insight of RISC was "hey, instead of having all these addressing modes, let's just go back to the 8080 but add more and wider registers, because as long as you don't register spill doing address calculations explicitly ends up just as fast as doing them implicitly".
Register pressure makes arrays-of-structures painful on the 8080. You do okay if you're only processing one array at a time, but if you've got two (even if only one is an array-of-structures) you need a register pair for one pointer, a register pair for the other pointer, a register pair to hold the strides, and a loop coun... uh oh, you're already out of registers.
The addition of the postincrement instructions on the GB only makes structures-of-arrays even more of a winner.
AWJ wrote:
Basically, the 8080 family is a bit more like a RISC CPU in that calculating effective addresses is a distinct operation rather than something that gets folded into an "addressing mode". You could even say that the fundamental insight of RISC was "hey, instead of having all these addressing modes, let's just go back to the 8080 but add more and wider registers, because as long as you don't register spill doing address calculations explicitly ends up just as fast as doing them implicitly".
Yet practical RISC architectures ended up having a 68000-style pointer + short displacement as an available addressing mode for practical struct field access, rather than requiring an explicit addition every time to seek to the element of an array holding the value for a particular actor (in structure-of-arrays) or to the field of an actor's struct (in array-of-structures). In architectures where the load/store stage of the pipeline sits after the ALU stage, such as classical MIPS, address generation like this is essentially free.
In MIPS, the most "by-the-book" RISC design,
lw $rt, 4($rs) reads from address rs + 4. ARM is even more flexible:
- ldr r0, [r1, #4] reads from address r1 + 4
- ldr r0, [r1, #4]! (pre-increment) adds 4 to r1 and reads from the new address
- ldr r0, [r1], #4 (post-increment) reads from address r1, then adds 4 to r1
- ldr r0, [r1, r2, lsl #2] (register indexed scaled) reads from address r1 + (r2 << 2)
Quote:
Register pressure makes arrays-of-structures painful on the 8080.
And register pressure is one of the first things that the RISCs discarded.
Quote:
You do okay if you're only processing one array at a time, but if you've got two (even if only one is an array-of-structures) you need a register pair for one pointer, a register pair for the other pointer, a register pair to hold the strides, and a loop coun... uh oh, you're already out of registers.
The addition of the postincrement instructions on the GB only makes structures-of-arrays even more of a winner.
Say I store 16 bytes' worth of properties for each actor (player or active enemy) in a side-scrolling game across 16 parallel arrays. Position and velocity are already 10 bytes: X (screen, pixel, subpixel), X velocity (pixel/frame, subpixel/frame), Y (pixel, subpixel), Y velocity (pixel/frame, subpixel/frame), and facing direction. On top of that are actor class, state/animation frame, state transition time, and a couple more bytes related to loading the frame's tiles into VRAM. Would I then need to add the actor ID (in one register) to the base of each array to form HL for every single access?
All right, I guess "structures-of-arrays are faster" was an oversimplification. The case I was thinking of was where you're only accessing one member of each structure in your loop. What matters is that data that you access sequentially is sequential in memory, so that you can use postincrement as much as possible and minimize the instructions spent recalculating addresses. So you'd definitely put the bytes of an array of 16/24/32-bit integers together, not interleave them like you might on a 6502. But if you've got one loop that does collision detection and another loop that does animation and metasprite assembly, and the data sets used by those loops are disjoint, it would make sense to split your "actors" into two structs, one with the collision detection data and the other with the animation data.
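In source terms, such a split might look like this (the names and field sizes are mine, just to illustrate):

```
.bss
; Everything the collision loop touches, contiguous:
ActorPhys: .res NUM_ACTORS * 8 ; per actor: x(2), y(2), dx(2), dy(2)
; Everything the animation/metasprite loop touches, separate:
ActorAnim: .res NUM_ACTORS * 4 ; per actor: frame(1), timer(1), tile(1), attr(1)
```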
The absence of indexed addressing modes seems rough, but you can get a ton of mileage from arranging your data so that the CPU can get at it easily; the 4MHz GBZ80 can probably go pretty close to toe-to-toe with a 1.79MHz 6502 for most 2D game logic...as long as the GBZ80 programmer takes care to lay out their data properly.
Even if you can't keep everything page-aligned, if you can be sure your table never crosses page boundaries, you can strip out the carry logic from the most general sample code I wrote; that'll save you 12 cycles.
As AWJ suggests, if you're going to be operating on an entire structure at once, you can use the array-of-structures format, ideally writing your code to walk through the members you want to access.
You can also line up related data across pages, so that once you've indexed your first table, you can get related data just by loading a new value into H (4 cycles, if you can just use INC H).
Or, if possible, you could just try to rethink or restructure your engine to not operate on the entire structure at once.
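The cross-page trick in particular is worth a sketch (table names and placement are assumed: Actor_X fills one page and Actor_Y the very next one), since a single INC H hops between parallel tables at the same index:

```
LD H, HIGH(Actor_X)
LD L, A        ; A = actor index
LD B, [HL]     ; B = X position
INC H          ; HL now points into Actor_Y at the same index
LD C, [HL]     ; C = Y position
```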
But if I rely on the fact that one field follows another, and I or someone else on my team reorders the fields so that a different subroutine can walk through them with INC, such a reorganization may end up breaking the assumption. As far as I can tell, this means that for every INC or DEC in a subroutine that accesses more than one such field, I'll have to put an assembly-time assertion that the header file didn't change the order of the fields since I wrote that particular subroutine. And when such an assertion breaks, that's a lot of code I'll have to touch and re-test.
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs greatly from that for the striped-arrays for 6502 (C64, Atari, NES, Super NES) or the pointer+offset of a 68000 (Genesis, Amiga) or full Z80 (Game Gear), and it might not have been trivial to find programmers experienced with its idiosyncrasies.
I don't see how not to operate on the entire structure at once. An actor has to move horizontally and vertically, eject itself from any obstacles, and then update its animation frame. Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?
Quote:
But if I rely on the fact that one field follows another, and I or someone else on my team reorders the fields so that a different subroutine can walk through them with INC, such a reorganization may end up breaking the assumption. As far as I can tell, this means that for every INC or DEC in a subroutine that accesses more than one such field, I'll have to put an assembly-time assertion that the header file didn't change the order of the fields since I wrote that particular subroutine. And when such an assertion breaks, that's a lot of code I'll have to touch and re-test.
Yes, unfortunately, this is the case. Or, more realistically, one doesn't bother with any sort of compile-time assertions and just keeps everything in their head and prays.
Quote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs greatly from that for the striped-arrays for 6502 (C64, Atari, NES, Super NES) or the pointer+offset of a 68000 (Genesis, Amiga) or full Z80 (Game Gear), and it might not have been trivial to find programmers experienced with its idiosyncrasies.
Possibly; I've never disassembled a troublesome GB game to say. Are there any particularly slow ones you can think of? I wouldn't mind at least trying to take a peek and evaluate.
Quote:
I don't see how not to operate on the entire structure at once. An actor has to move horizontally and vertically, eject itself from any obstacles, and then update its animation frame. Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?
Something like that, sure. My game engine works much that way.
adam_smasher wrote:
Quote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs
Possibly; I've never disassembled a troublesome GB game to say. Are there any particularly slow ones you can think of? I wouldn't mind at least trying to take a peek and evaluate.
When I played
Balloon Kid and
Super Mario Land 2, their slow frame rate was something I noticed.
Quote:
Quote:
Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?
Something like that, sure. My game engine works much that way.
Even that can get troublesome, as ejection tends to touch the entire position, especially on slopes. And then you need five methods for each actor type instead of just
move().
A lot of commercial GB games contain really terrible and inefficient code. I've seen a lot of GB code that's effectively literally-translated 6502 (the first two Final Fantasy Legend games are bad for this; the third is significantly better). I've seen games that put the stack in high RAM--I guess they were told "accessing high RAM is faster" and thoroughly misunderstood what was meant. That's the problem with using a custom CPU architecture. I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...
I'm guessing most people probably would just copy the slots to a static region of memory.
AWJ wrote:
I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...
Probably not. The Konami-1 variant of the 6809
reportedly just XORs each opcode (not its operand) with a formula based on A3 and A1. That could be compensated at assembly time. Anyone who had coded for a CoCo, Dragon, FM-7, or Williams arcade could probably jump right in.
But then arcade code didn't have to be quite as efficient as console code anyway, as the manufacturer could just throw more hardware at it.
tepples wrote:
AWJ wrote:
I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...
Probably not. The Konami-1 variant of the 6809
reportedly just XORs each opcode (not its operand) with a formula based on A3 and A1. That could be compensated at assembly time. Anyone who had coded for a CoCo, Dragon, FM-7, or Williams arcade could probably jump right in.
But then arcade code didn't have to be quite as efficient as console code anyway, as the manufacturer could just throw more hardware at it.
That's not the CPU I'm talking about. The Konami-1 was just a 6809 with scrambled opcodes, but the Konami-2 had 8 general-purpose output pins that were usually used for ROM banking (indeed they're listed as A16-A23 on schematics) and ISA changes that go beyond simple scrambling. The Konami-1 was used in early-1980s games like Gyruss; the Konami-2 was used until the early 1990s (The Simpsons was one of the last games that ran on it).
In
this topic, Axelay recommends 6502-style parallel tables with start addresses aligned to a power of two, which lets instructions like
set 5,L and
res 5,L point HL at different tables in a page.
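In other words (table placement assumed here: two 32-byte tables sharing one page, the second starting at base+$20):

```
LD H, HIGH(Tables) ; page holding both tables
LD L, A            ; A = index into the first table
LD B, [HL]         ; read from the first table
SET 5, L           ; flip HL to the same index in the second table
LD C, [HL]         ; read from the second table
RES 5, L           ; and back to the first
```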
It's been a while since I last touched any z80, but every time an indexed array lookup was needed, the pointer always had to be calculated by hand.
The Game Boy has a treasure trove of RAM compared to the NES, so it shouldn't be a problem to align your actor memory so that an individual actor's memory doesn't cross a page boundary. Once you do, all you have to do is store your actor's base pointer in a register pair, and the only calculations you'd need to do are on the low byte. If your actors' memory sizes are a power of two, it's a simple matter of ORing the desired index into the low byte, and then later using an AND mask to recover the base pointer so you can OR another index.
That's all for array-of-structures. Structure-of-arrays is the exact same thing, except instead of attribute offsets being 0, 1, 2, 3, etc, they're now $00, $10, $20, $30, etc. You also trade limitations; instead of actor size being a power of two and actor count being free, you now have actor count being a power of two and actor size being free. Use whatever's more efficient for the memory usage, your code will be exactly the same regardless.
If you need more than 256 bytes worth of actors, you should use array-of-structures, because you'll likely be calculating pointers for attributes (8-bit operation if no actors cross a page boundary) more often than you'll be calculating base pointers for actors (16-bit operation always).
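A sketch of the power-of-two array-of-structures case (the base label and field constant are my assumptions; 16-byte actors in a page-aligned table):

```
; B = actor index (0-15), FIELD_Y = offset of the desired field (0-15)
LD H, HIGH(Actors)
LD A, B
SWAP A         ; index * 16
OR FIELD_Y     ; OR in the field offset -- no carry possible
LD L, A
LD A, [HL]     ; read the field
; later: AND $F0 on L recovers the actor's base so you can OR another field
```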
As an interesting aside, the ZX Spectrum optimized its
framebuffer specifically for Z80 HL addressing; Increasing L moves to the next 8x8 character on the screen, increasing H moves to the next scanline within that character. It might not be the same as DMG-Z80, but it might be worth checking out what ZX Spectrum programmers have to say. That's how I came across
this helpful bit-twiddling reference after all.
tepples wrote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower
There were Master System games that ventured into that 30 fps territory too. Earlier titles.
A lot of Z80 programmers tend to boast about the available regs to work with on the Z80 compared to accumulator-based processors (65x, 6809, etc.). But I always found it to be the complete opposite. Data registers are kind of a moot point on the 65x simply because it has a lot of direct memory addressing modes (and a fast mode: direct or zero page). A lot more operations actually have to go through the A reg on the Z80, whether from another reg or from indirection (address regs). I always felt a constraint of constantly juggling things - way more than what might be done with A on the 65x. And having ZP as off-processor address registers (address vectors) feels so free in comparison. Even the 68K felt a tiny bit cramped in this respect compared to the 65x (only 7 address regs; SP is the 8th address reg).
Of course, the context of 65x to me is not limited to the NES - so my view of optimization and use of quick LUTs for logic are probably expanded compared to the NES environment.
The C64 vs. Speccy wars concluded that a full Z80 (with IX and IY) has roughly 1/3 the IPC of a 6502.
Z80 assembly came next, as I needed a means to play sound without tying up the main CPU. It was much more painful than 68K, which had pretty much spoiled me. x86 still felt worse though... Nowadays I also do Master System and SC-3000 / SG-1000 stuff; a whole game in Z80 isn't actually all that bad.
Is that based on IX/IY (SG1K/SMS/GG only) or some other way to step through fields of an actor structure?
I pretty much only use IX (IXL, IXH) and IY (IYL, IYH) as temporary variables; all else goes by structures aligned to 256 bytes and BC/DE/HL addressing, with a lot of incrementing of C/E/L rather than directly specifying which element to access through them. Autoincrement comes for free on 68K, and incrementing is faster on Z80 than directly specifying the element to access too, so data is always laid out in the order of use to accommodate that approach.
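Concretely, "laid out in the order of use" means a routine touches its fields with nothing but increments. A sketch (the field order and register assignments here are assumed for illustration):

```
; HL -> this actor's x_sub; fields declared in exactly the order read:
; x_sub, then x_pix; velocity held in C (sub) and B (pix)
LD A, [HL]
ADD C          ; x_sub += dx_sub
LD [HL], A
INC L          ; next field in declared order
LD A, [HL]
ADC B          ; x_pix += dx_pix + carry
LD [HL], A
```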
TmEE wrote:
data is always laid out in the order of use
That's what others were recommending but I'm somehow not fully grokking. Say there are 16 bytes of state for each actor in an action platformer:
X position (24 bits; 16.8)
X velocity (16 bits; 8.8 signed)
Y position (16 bits; 8.8)
Y velocity (16 bits; 8.8 signed)
Current frame
Timing state
Facing direction
Height of last hitbox to hit this actor relative to the actor's feet; used for collision response
Health
Actor type ID
VRAM location for actor's sprite cels
Are there some generic rules of thumb for field layout to ensure "data is always laid out in the order of use"? If not, how can I predict "the order of use" in all cases? Do I need to prototype all the routines used by a sample of the actor types in a high-level language, and then reorder the fields to be either after or one bit different from the previous field before translating the routines to Z80/LR35902?
Well, I feel as though I've suggested this before, but the best rule of thumb in general is probably "structs of arrays" rather than "arrays of structs".
Otherwise you do your best, focus on the needs of the most speed-sensitive code, maybe iterate on your design a few times, and if worst comes to worst, you might need to suck it up and manually index - which, as long as you keep your data page-aligned and/or so that it never crosses page boundaries, really isn't too bad: we're talking about, in 6502 terms, no more than a handful of extra cycles. Outside of a vblank handler, it's extremely rare that you really need to worry about that in the average game.
Jaa, sometimes you just have to suck it up. My process is iterative, with refactors as new ways to improve something present themselves. Prototyping in a higher-level language probably gets you somewhere sooner.
Found this via
Why "logic" is bullshit (RANT):
Both the 6502 and GB CPU have a small instruction set, but the advantage on the 6502 side is with the addressing modes for the instructions and fewer bottlenecks in the processor architecture/design.
[...]
If you're specifically doing bitmap drawing effects (software blitting) at the given GB screen resolution, then relatively speaking the GB 1 MHz CPU might be faster for the task than the NES at its native resolution. But in absolute terms, in a normal game engine, I would say that it's behind.
I'm continuing to read through Stef's posts in that topic to see if it addresses the problem I'm seeing, that of random access to the properties of a particular element of an array of objects.
(all cycle counts below for the GBZ80 are /4, for easier comparison with the 6502):
Assuming the index you want is in A and the array is page-aligned and < 256 bytes,
Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
LD L, A
LD A, [HL]
That's 7 cycles.
Better yet, use parallel arrays of one-byte fields (structure of arrays) rather than multi-byte structures, and you can get rid of the ADD FieldOffset, saving 2 cycles:
Code:
LD H, ArrayBase >> 8
LD L, A
LD A, [HL]
That's 5 cycles.
If the array isn't page-aligned but doesn't wrap past page boundaries, add another two cycles to ADD ArrayBase & $00FF.
Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
ADD ArrayBase & $00FF
LD L, A
LD A, [HL]
That's 9 cycles.
If the array isn't page-aligned and might wrap, add another three cycles (a taken JR NC when there's no wrap; an untaken JR NC plus INC H when there is):
Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
ADD ArrayBase & $00FF
JR NC, .nc
INC H
.nc:
LD L, A
LD A, [HL]
That's 12 cycles.
So: if you don't take care with your memory layout, you lose performance. But nothing about what you want to do is impossible or even difficult. It's usually somewhat slower than the 6502 but not terribly so, depending on a bunch of platform-specific factors on both sides (is your base address in zero page? your index in one of the index registers? do you cross page boundaries? what kind of math do you need to do to get the final index? etc etc). Usually this doesn't matter too much, unless you're in a tight loop - in which case you write/arrange your memory for speed, and the GBZ80 holds its own just fine against the 6502 in that context.
Far and away the biggest problem here is that code for a fully general operation is a bit unwieldy. Wrap it in a macro if you'd like.
This is a non-issue.
This has probably been covered well in the thread already, but I think you can get pretty far without indexed addressing.
As I recently revealed, I've been perusing the entire source code for Donkey Kong, and it's actually surprisingly rare that it uses the IX and IY index registers that the Z80 has over the 8080. Instead it uses a lot of INC and DEC on the 16 bit HL register, and adds the DE register to cycle through object tables. It's in no way as elegant as what we are used to with the 6502, but having access to 16 bit additions on registers used for addressing opens up a whole new toolset - I'm assuming that's also possible on the Game Boy's CPU.
You can, and you can also auto-increment/decrement HL on reads for free.
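For example, gathering one actor's three position bytes with the free post-increment (the field layout is assumed):

```
; HL -> this actor's ysub
LD A, [HLI]    ; ysub; HL advances for free
LD B, A
LD A, [HLI]    ; y
LD C, A
LD A, [HL]     ; yhi
LD D, A
```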
It might be easier for readers to appreciate how this is a non-issue if we try a concrete example. A 6502 subroutine to simulate movement of an object under the influence of gravity might look like this:
Code:
GRAVITY = 48 ; /256 pixel per frame^2
NUM_ACTORS = 16
; The following arrays are not aligned to the start of a page,
; but none cross a page boundary.
.bss
.align NUM_ACTORS
; Displacement from top of map in 1/256 pixel units
; range: 0.000 to 32767.996 pixels
actor_ysub: .res NUM_ACTORS
actor_y: .res NUM_ACTORS
actor_yhi: .res NUM_ACTORS
; Signed velocity in 1/256 pixel per frame units
; range: -16.000 to 16.000 pixels/frame
actor_dysub: .res NUM_ACTORS
actor_dy: .res NUM_ACTORS
.code
move_actor_x_vertically:
; Step 1: Apply acceleration due to gravity
clc
lda #GRAVITY
adc actor_dysub,x
sta actor_dysub,x
lda #0
adc actor_dy,x
sta actor_dy,x
; Step 2: Add velocity to displacement
clc
lda actor_dysub,x
adc actor_ysub,x
sta actor_ysub,x
lda actor_dy,x
adc actor_y,x
sta actor_y,x
; Sign extend the velocity
lda actor_dy,x
and #$80
beq :+
lda #$FF
:
adc actor_yhi,x
sta actor_yhi,x
rts
I'd like to see the idiomatic translation of this to Z80 or LR35902. Perhaps what I'm missing is some sort of insight on how "and adds the DE register to cycle through object tables" plays out in practice.
Presumably you'd want to do this to each of your actors in a loop? Then I'd probably do it like this:
Code:
GRAVITY = 48 ; /256 pixel per frame^2
NUM_ACTORS = 16
.bss
; Displacement from top of map in 1/256 pixel units
; range: 0.000 to 32767.996 pixels
.align NUM_ACTORS
Actor_Y: .res NUM_ACTORS * 3
; Signed velocity in 1/256 pixel per frame units
; range: -16.000 to 16.000 pixels/frame
.align NUM_ACTORS
Actor_DY: .res NUM_ACTORS * 2
.code
ApplyGravityToVelocities:
LD HL, Actor_DY
LD B, NUM_ACTORS
LD C, GRAVITY
.loop:
LD A, [HL]
ADD C
LD [HLI], A
JR NC, .nc
INC [HL]
.nc:
INC L ; skip past high-byte
DEC B
JR NZ, .loop
RET
ApplyVelocities:
LD DE, Actor_Y
LD HL, Actor_DY
LD B, NUM_ACTORS
.loop:
;; add low byte
LD A, [DE] ; get ysub
ADD [HL] ; add dysub
LD [DE], A ; set ysub
INC E
INC L
;; add middle byte
LD A, [DE] ; get y
ADC [HL] ; add dy + carry(ysub + dysub)
LD [DE], A ; set y
INC E
;; adjust high byte
LD A, [HLI] ; get dy(hi), move to next dy(lo)
BIT 7, A
JR Z, .pos
.neg:
JR C, .next
LD A, [DE]
DEC A
LD [DE], A
.pos:
JR NC, .next
LD A, [DE]
INC A
LD [DE], A
.next:
INC E
DEC B
JR NZ, .loop
RET
I think that's right? It's not tested, and I'm not totally sure I understand the signed arithmetic bit.
I would want to apply gravity to actors with some types and not to actors with other types. Based on the code you presented, you appear to suggest structuring the actor update loops in a form that conceptually resembles the single-instruction, multiple-data (SIMD) approach used by shaders on modern GPUs:
- Apply step 1 to all actors.
- Apply step 2 to all actors.
- Apply step 3 to all actors.
But if the table contains both actors that fall and actors that do not fall, each step needs to include a determination of whether to apply or skip the step for a particular actor:
- Apply step 1 to those actors whose combination of type and state uses step 1.
- Apply step 2 to those actors whose combination of type and state uses step 2.
- Apply step 3 to those actors whose combination of type and state uses step 3.
In this sort of SIMD-like structure, I don't see how you'd enable or disable individual steps based on an actor's type and state without using additional register pairs for pointers to lookup tables from type and state to bitfields of which steps shall be performed on objects in that type and state.
Please forgive my naivete. It's just that arranging enemy AI using SIMD rather than straight-through code is an entirely new concept to me.
I think you can pretty straightforwardly adapt my code to only apply the operations to one actor - the only real difference would maybe be expanding the Y position to 4 bytes, to index into the array with two quick shifts.
I just wrote the code as I would have written it, without having that additional criteria in mind. I tend to work in the "SIMD" style, and it often does help things on the Game Boy because of the auto-incrementing. But in retrospect I should have kept it more similar to your original code - I probably muddled things, especially because this code didn't end up really needing to be very SIMD-ish.
Code:
;;; A - contains actor # to apply gravity to
ApplyGravityToVelocity:
PUSH HL
LD H, Actor_DY >> 8
ADD A, A ; multiply by two, since each velocity is two bytes (RLA would shift the stale carry flag into bit 0)
ADD Actor_DY & $00FF ; add the array base, since we're not necessarily page aligned
LD L, A
LD A, [HL]
ADD GRAVITY
LD [HLI], A
JR NC, .nc
INC [HL] ; propagate any carry into the velocity's high byte
.nc:
POP HL
RET
If it's only gravity that's selective, then you can still do the actual motions using the separate routine I gave, all at once.
Code:
.bss
.align PAGE
Actor_UsesGravity: .res NUM_ACTORS
.code
UpdateActors:
CALL ApplyGravity
CALL ApplyVelocities
RET ; obviously this tail call could be optimized out
ApplyGravity:
LD HL, Actor_UsesGravity ;; page aligned, so L = 0
LD B, NUM_ACTORS
.loop:
LD A, [HL]
AND A
JR Z, .next ; flag clear, so this actor skips gravity
LD A, L
CALL ApplyGravityToVelocity ; note that I made the above subroutine preserve HL!
.next:
INC L
DEC B
JR NZ, .loop
RET
That gets complicated when each actor ends up with dozens of different
Uses flags, one for each possible line that could be in or not in a particular actor's script, that need to get copied from the actor's prototype. It gets doubly complicated when a line in the Grand Unified Script needs to turn a bunch of
Uses flags on and off when an actor goes in or out of a particular state.
Today I decided to look at how commercial games solve the
SOA vs. AOS dilemma. But I discovered that very few
Game Boy games listed on Data Crystal actually have their RAM map substantially filled in. The first one I found was
that of Wario Land, which has an 8-entry actor table at $A200, where the actors occupy $A200-$A213, $A220-$A233, $A240-$A253, ..., $A2E0-$A2F3. I guess this validates the
array of 32-byte-aligned actor structures.
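Reaching actor i's slot in a layout like that is cheap, because all eight 32-byte slots sit in page $A2:

```
; A = actor index (0-7)
SWAP A         ; * 16
ADD A, A       ; * 32; can't overflow for indexes 0-7
LD L, A
LD H, $A2      ; HL -> base of actor i's slot at $A200 + i*32
LD A, [HL]     ; first byte of the slot
```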
I've recently started tumbling down the Z80 rabbit hole.. Only I have a full Z80 at my disposal and 6502 that I can jump back to when the going gets too hot for the Z80
I'm starting to think that the dispatch is the way to solve the issue.
We are used to doing
Get thing,x
and state
bne _next
lda otherData,x
adc moreData,x
sta otherData,x
and state2
bne _next2
...thing2
while for Z80 it might be better to use the "use bits" as a dispatch.
ld use
<<2
add Base Pointer
call
this way you put a function that knows what the use cases are and can handle the bits without having to look them up. Saving the need to index into multiple tables.
Not done too much experimentation yet, as I'm still trying to find an assembler that isn't circa 1982. Or one that is 1987 spec but isn't trapped on a machine that is slow...
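A concrete GB-flavored version of that dispatch idea might look like this (HandlerTable is my name; it's assumed to be page-aligned and to hold 2-byte handler addresses):

```
; A = the actor's "use" byte (0-127)
LD H, HIGH(HandlerTable)
ADD A, A       ; * 2: each entry is a 2-byte address
LD L, A
LD A, [HLI]    ; low byte of the handler's address
LD H, [HL]     ; high byte
LD L, A
JP HL          ; jump to the handler (use a CALL trampoline if you need to return)
```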
My bad luck continues. I searched Google for 8080 record field access and 8080 struct field access, hoping to stumble on some idiom that has become common practice, but most results were some website hosted on port 8080, not Intel 8080. I tried 8080 assembly struct field access, and Google tried to second guess me with "Missing: 8080"
If you put a search term in quotes Google won't show you results without it.
You might try looking at the output of 8080 or GBZ80 C compilers when interacting with structs to see how they handle the problem.
adam_smasher wrote:
You might try looking at the output of 8080 or GBZ80 C compilers
In other words, the godbolt solution. I'd've tried that if SDCC were any good. See
"To C or not to C?" by ISSOtm.
ISSOtm wrote:
[GBDK is] built on an ancient build of SDCC, which is known to generate poor (bloated) and often straight up wrong code.
What C compilers targeting 8080 are any good? Any luck with, say,
BDS C?
It might be worth giving SDCC a shot anyway - it's still under active development, and the aforementioned "ancient build of SDCC, which is known to generate poor (bloated) and often straight up wrong code" that GBDK is based on is 17 years old.
In GBDev Discord, ISSOtm announced an
RGBDS macro pack to define structs and is trying to figure out how to best distribute it. Alongside this came some practical idioms for struct field access.
Code:
; Prep: 3 mcycles each
ld de,self
ld bc,other_actor
; Random field load/store: 7 mcycles, BC preserved
ld hl,offsetof(Actor, xsub) ; 3
add hl,de ; 2
ld a,[hl] ; 2
; Compare 6502: 5 cycles (minus 1 for load not crossing page)
lda actor_xsub,x ; 4
; Random field arithmetic: 8 mcycles, BC preserved
ld hl,offsetof(Actor, xsub) ; 3
add hl,de ; 2
ld l,[hl] ; 2
add a,l ; 1
; Compare 6502: 4 cycles (plus 1 for crossing page)
add actor_xsub,x ; 4
; Store constant in field: 8 mcycles, ABC preserved
ld hl,offsetof(Actor, frame) ; 3
add hl,de ; 2
ld [hl],FRAME_JUMP ; 3
; Compare 6502: 7 cycles and A is clobbered
lda #FRAME_JUMP ; 2
sta actor_xsub,x ; 5
Roughly: 8080 is faster for sequential access, and 6502 is faster for random access. Try counting
aaaa,X vs.
(dd),Y accesses in your NES program to estimate what parts might become slower or faster respectively.
I've since collected a lot of this into a
wiki article.