Flight Minigames (canceled)

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
View original topic
Flight Minigames (canceled)
by on (#127325)
Update: Unfortunately I've stopped working on this game. Attached is the last version of the source.

Looking for a NES version of Flappy Bird?
Flappy Block from Sly Dog Studios
Family Bird from sayonari
"flappy.nes" from Nioreh

----
Flight Minigames is intended to be a collection of 8 6 similar "one button games" for the multicart project where you control the vertical movement of the sprite to avoid obstacles approaching horizontally. Planned modes include clones of the mechanics of The Helicopter Game, Balloon Fight and Flappy Bird.

The most important lesson I learned during competition period is that gradual air resistance feels much better then speed caps. Also fixed point numbers with more then Q16.16 precision is overkill.

I'm not sure what lesson I learned for v0.2, maybe that system stack can be used for saving bytes in ROM.

List of things needed to be done has been moved into a TODO.txt file.
Re: Flight Minigames
by on (#127328)
Balloon Fight uses a denominator of 256 (8 subpixel bits) for positions and velocities.
Re: Flight Minigames
by on (#127329)
Right, G in Balloon Fight is $00.06 per ntsc frame (about 84.7px/s^2) with a speed cap of about 75.1px/s.

I'm not sure exactly what's going on with the upward thrust, but looks like it either adds (on top of G) $00.12 per frame or $00.78 for a single flap.

If only I was able to watch ram values for flappy bird.
Re: Flight Minigames
by on (#127844)
It's been awhile since I last posted but I got somebody with a copy of flappy bird on a rooted tablet to let me screen record some gameplay.

After a bit of video frame slicing and counting pixels I came up with these numbers. Unlike other prior work by Frank Noschese, I'm setting the width of the bird (from tail to beak) to 1 meter.

Bird width: 1 m (reference meter)
Field height: 10.7872 m
Gravity: 29.9426 m/s^2
Speed after flap: 8.6320 m/s
Speed cap: 14.0957 m/s
Horizontal scroll: 3.5493 m/s

Reading the linked article again, I somehow totally missed the video downloads before doing the experiments myself.
Re: Flight Minigames
by on (#128111)
Do you have the most recent build? I think the one you linked too might be an older version.
Re: Flight Minigames
by on (#128120)
The most recent published build is v0.1. Sorry if being under "Older versions" and with a label of "incomplete" is confusing. It should match the version I sent you last in email as they both should have a md5sum of 3f0d3ca924f8c1e2bcd20d1c7a2ca144.

The progress of v0.2 (with 2 complete games) has been slow.
Re: Flight Minigames
by on (#128126)
Ok, I just can't get to a menu like it shows in your screenshot at the top. I'm just trying to get some screenshots for the website.
Re: Flight Minigames
by on (#128540)
I pretty much did nothing in the month of April, and it seems like the only thing I did this week was rewrite the core adder and converted all fixed point numbers from a 6 byte 16.32 format to a 4 byte 9.23 format.

I want to ask for help, but it's hard to formulate how or what one can help with.

Also this is probably the one piece of code that the cpu will spend a good bulk of it's time on.
Code:
;;
; multi purpose multi byte operation
; [API sniped but will be described in full source]
bignum_op:
-
  lda (temp+0), y
  eor temp+5    ; a value of $ff will make this a subtraction
  adc (temp+2), y
  sta (temp+0), y
  iny
  dec temp+4
  bne -
  rts
Re: Flight Minigames
by on (#128542)
I think I am understanding your code. Maybe a bit faster, assuming temp+5 is either 0 or FFh

Code:
 
  ; is it negative?
  bit temp+5
  bpl positive
  ; negative:
  sec
  ldx temp+4
-
  lda (temp+0), y
  sbc (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  beq exit
positive:
  clc
  ldx temp+4
-
  lda (temp+0), y
  adc  (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
exit:
  rts


Store x back to temp+4 if needed.
Re: Flight Minigames
by on (#128543)
I attempting to go the extreme side of the size/speed trade off, aiming for small size. So I thought it was cool when I saw a way to pack addition and subtraction in the same subroutine loop, and with the same loop I'm clearing bytes of memory, which would allow for copying bytes and negating.

Maybe I just need to stop spending all my time on size optimization...
Re: Flight Minigames
by on (#128545)
Unless you are approaching the limits of your chosen mapper, personally wouldn't worry too much about size, unless that is part of the challenge. Even if you need some simple bankswitching, it's 2014; memory is relatively cheap. Also, regarding that routine - if your values are going to be multiples of two bytes always, you could partially unroll it into two lda/adc/sta steps.

Edit: Maybe ignore that last part about unrolling more, may not work with this code.
Re: Flight Minigames
by on (#128555)
One thing I know I will have difficulty with, and have yet to address, is sound engine and musical composition.

My first thought was to use famitone2 and place the entire famitone page in zeropage to reduce the opcode size (my target ROM size is 4 or 8 KiB). I also though about somehow using temple's sound engine, but I'm not clear about the pros and cons of each. In any-case I've got 128 zeropage bytes reserved for sound, and I hope I can get the music code and data to fit in the 2.5KiB ROM space left.

Then of course I believe I lack the musical composition skills anyway.
Re: Flight Minigames
by on (#128569)
My sound engine should comfortably fit into your ZP budget, and as for ROM, this 3.8K NSF containing the entire soundtrack of Concentration Room, Thwaite, Zap Ruder, and an old falling block game should show how ROM-efficient it can get while delivering competent triangle drums and semi-pitched hi-hats. If you want, I can package it with a manual, and I might even have time to compose some happy flappy tunes for you.
Re: Flight Minigames
by on (#128572)
A package with a manual would be nice, even if I ultimately don't use it. Just don't end up using up too much time meant for RHDE.
Re: Flight Minigames
by on (#129015)
So famitone2 may not be ideal to stuff entirely into zeropage. I just counted, according to the var definitions, the number of RAM bytes used for music and one sound effect stream. It's 140 bytes.
I don't know how the sound effect data is formated, but I think it's like a raw list of APU commands or something.

I think should start looking into tepples sound engine. I believe it's pretty much music.s sound.s and musicseq.* in RHDE.

I do like the 64 note range of famitone2 and my edit that allows different tempo and pitch setups. I'm hoping that would be easy to port over those features.
Re: Flight Minigames
by on (#129018)
43110 wrote:
I think should start looking into tepples sound engine. I believe it's pretty much music.s sound.s and musicseq.* in RHDE.

I do like the 64 note range of famitone2

Patterns in my music engine support a TRANSPOSE command that changes the base note for the remainder of a pattern. For example, TRANSPOSE,12 shifts the rest of the pattern up by an octave, and RHDE's combat theme uses this. It just has the two-octave range for a more compact encoding of the common case.

Quote:
and my edit that allows different tempo and pitch setups.

Which "tempo and pitch" setups are you talking about?

(And is this tune happy or crappy?)
Re: Flight Minigames
by on (#129020)
ahh, two-octave local range per pattern range makes sense. The 64 note range I'm talking about is the period table.

Tempo and pitch for adjusting between NTSC, PAL, and Dendy TV systems. I hacked that in famitone2 by using the upper two bits of a variable.

tepples wrote:
(And is this tune happy or crappy?)

umm, both? It'll be good for the flappy mode featuring a butterfly. Unless you want to exclusively use it for that shoot-em-up game.
Re: Flight Minigames
by on (#129322)
43110 wrote:
Also this is probably the one piece of code that the cpu will spend a good bulk of it's time on.
Code:
;;
; multi purpose multi byte operation
; [API sniped but will be described in full source]
bignum_op:
  ldx temp+4 ; edit, don't touch temp+4
-
  lda (temp+0), y
  eor temp+5    ; a value of $ff will make this a subtraction
  adc (temp+2), y
  sta (temp+0), y
  iny
  dex ; edit, don't touch temp+4
  bne -
  rts


If you are using these for four byte values, then this lets you set the size once for a series of operations without having to reset it all the time. But count up how many times they appear ... if they appear too many times, the overhead of clear and carry and setting up the EOR byte will overwhelm the 11 bytes saved, and you are better off with:

Code:
bignm_add:
  clc
  ldx temp+4
-
  lda (temp+0), y
  adc (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts

bignm_sub: ; above subtracts 0 from 2 and stores in 0 ~ is that the right way around?
  sec
  ldx temp+4
-
  lda (temp+2), y
  sbc (temp+0), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts


Its only a narrow range of number of calls of bignum_op where it actually nets saving space, when extra calling overhead is taken into account. If there are like two routines that add and two that subtract, you should be ahead, by a few bytes, but even at four routines that do each, the extra call overhead may offset the squeezing of bytes from the called routines.

Similar for clear and copy:

Code:
big_cpy:
  ldx temp+4
-
  lda (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts

big_clr:
  ldx temp+4
  lda #0
-
  sta (temp+0), y
  iny
  dex
  bne -
  rts


... allow clearing and copying blocks at a time (set temp+4=y=0), but used during four byte FP integer operations, allows temp+4 to hold the same size index and allows avoiding set-up overhead. There's a double purpose version using "AND temp+5", but at a net saving of 9 bytes, its a close run thing whether extra set-up overhead swamps the crunching of the factored operation.

In the middle case, if the program flow allows the temp+5 set-up to be stable over multiple operations, so there is not much overhead setting temp+5, but the carry flag is set or cleared more than four times total in calling big_num, you save space just by setting up distinct entry points to do that:
Code:
;;
bignm_sub:
  sec
  bcs +
bignm_add:
  clc
+
  ldx temp+4
-
  lda (temp+0), y
  eor temp+5    ; a value of $ff will make this a subtraction
  adc (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts


... though I still wonder whether its:

OP2 - OP1 =: OP1

... or ... OP1 - OP2 =: OP1 ... in which case it should be "lda (temp+2),y", etc.
Re: Flight Minigames
by on (#129340)
I've updated the first posted the work I have done this month (v0.2). There's nothing apparently new in the compiled ROM, but the source has changed a bit. Additionally I'm now planing for 6 games instead of 8.

About what BruceMcF posted:
My motivation for even merging the two multi byte operations was to rid of sequences of jump subroutines, and the major optimization was actually having to do with loading all the prams from three byte tables (see line 1058 in the source of v0.2).

I was trying to preserve X through the whole chain of subroutine calls so that at the top I can do a loop like
Quote:
ldx #0
-
jsr do_math_step
inx
cpx #32
bcc -

but looking again at what I have now, I really should change that to a counter in zeropage, since I had to make an unnecessary stack push in the multiplication subroutine.

Also if you want to see another thing I did with a hard to compute the trade off, I took an idea of having list of addresses in stack and made my game state selector push such a list instead of having a sequence of jsr in a subroutine.
Re: Flight Minigames
by on (#129390)
43110 wrote:
About what BruceMcF posted:

I was trying to preserve X through the whole chain of subroutine calls so that at the top I can do a loop like
Quote:
ldx #0
-
jsr do_math_step
inx
cpx #32
bcc -

but looking again at what I have now, I really should change that to a counter in zeropage, since I had to make an unnecessary stack push in the multiplication subroutine.

Yes, its frequently better to use the X register in inner loops and an index register for outer loops.

Quote:
Also if you want to see another thing I did with a hard to compute the trade off, I took an idea of having list of addresses in stack and made my game state selector push such a list instead of having a sequence of jsr in a subroutine.


Also, if you have a pool of precomputed "RTS addresses" in a known page, you can set up outer factor routines as jump targets:

Code:
;;
routine1: ; routine cannot span a page, cannot start on 0 location in a page
   lda #>rtn1+beg-1
   ldx #>rtn1_end+1 ; rtn1_end generated from source so points to low byte
vec_call:
   sta temp+5
-
   lda vec_pool,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call set of vectors

routine2: ; see warnings above
   lda #>rtn2+beg-1
   ldx #>rtn2_end+1
   bne vec_call
...


And further crunching in storage of the vector sequences is possible from vectors being allowed to overlap:

Code:
;;
vec_pool:
rtn1_beg:
   doop1-1 ; first operation of routine1
   doop2-1
   doop3-1
rtn2_beg:
   doop4-1 ; first operation of routine2
   doop5-1
rtn1_end:
   doop6-1 ; last operation of routine1
rtn2_end:
   doop7-1 ; last operation of routine2
   ; ...


The fixed space overhead of that particular stack stuffing is 12 bytes (there are of course post-indexed versions that allow the vectors to be anywhere in visible memory). The outer routines break even compared to identical subroutine factor code at six unique operations calls per outer routine, with one byte saved per inner operation call after that point. And of course any overlapping operations in the pool are basically free space to all but one of the calling routines.

So that rewards programming that relies on lots of little factors that are heavily re-used, lego-block style, and factor design that favors the use of overlapping "phrases" of the operation inner factors.
Re: Flight Minigames
by on (#129397)
I had to read that several times before I got it. While I am aiming for subroutine reuse, I don't think things will logically interleave like that. But you did mention putting subroutines in the same page, and I then saw that the address really only have 10 bits for a 4KiB ROM. That means I'm now wasting 6 bits for every address in the list, and that's not too far from having a whole implied byte automatically inserted by the list loader.

Edit: Nevermind, I miscounted. 4KiB is 12 bits, causing only 4 bits of waste per address.

but before I start arranging assembly code like a falling block puzzle game, I need to program the pieces first.

On a related note, I have about 640 unused bits from the note_period table. What can I put there that uses only the upper 5 bits of 128 bytes?
Re: Flight Minigames
by on (#129401)
43110 wrote:
I had to read that several times before I got it. While I am aiming for subroutine reuse, I don't think I'll have things logically knitted like that. But you did mention putting subroutines in the same page, and I then saw that the address really only have 10 bits for a 4KiB ROM. That means I'm now wasting 6 bits for every address in the list, and that's not too far from having a whole implied byte automatically inserted by the list loader.

But you also want the list loader to be concise.

Without the "falling block puzzle game" side of it, you can have counted vectors anywhere in the ROM without a count byte by using the high bit to flag the beginning of the list, but after working that through, its a bit more intricate and brittle than I like.

Vectors with an index to their own end in front don't let you play the "sliding block puzzle" games, but they are easier to follow what is going on.

And there really is not any binding constraint with the vector pool approach in terms of how many vectors it supports. If you have so many lists of subroutine calls that you've used up one vector pool, you've saved more than enough space to cover the overhead for a second vector pool page.

Code:
;;
;    NB. A "vector pool" is a page with a set of subroutine address vectors in it.
;    The leading byte is the position in the page of the last byte in the vector.

routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   stx temp+5
   lda vpool1,x
   tax
-
   lda vpool1,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call subroutine vectors in turn from stack page

routine2:
   ldx #<vector2
   jmp go_pool1

routine3: ; NB. routine3 is only four ops, so there is no space saving from the vector
   jsr op3_1
   jsr op3_2
   jsr op3_3
   jmp op3_4

routine4:
   ldx #<vector4
   jmp go_pool1

; ... and so on, until & unless the vpool1 page is full up, then (supposing it was 30 routines that filled it up):

routine31: ; the first entry in vector pool 2.
   ldx #<vector31

go_pool2:
   lda vpool2,x
   tax
-
routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   stx temp+5
   lda vpool2,x
   tax
-
   lda vpool2,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call subroutine vectors in turn from stack page

routine32:
   ldx #<vector2
   jmp go_pool2
...


... so the limit of distinct vectors to a single binary page is not an overall limit on the number of vectors altogether.

And if (as supposed above), 30 vectors filled up a vector page, then using the vector page saved 30 bytes, since each vector pool call is one byte shorter than passing a full address, which

But if you want to really crunch ... yeah the addresses can be packed. EDIT ~ yeah, 4K is 12 bits, so not just can't be bothered packing more than two, CAN'T pack more then two ~ 2 bits to 4 bits fixed put in place w/out more edit markets ~ the original was to crunch routines into the 4K "Golden RAM" in the C64 above the ROM in the memory map and below the I/O, but it was reconstructed from memory, and I got some of the details off ~ plus multiplied when I should have been dividing (oops). /EDIT

If there are an odd number of routines, its the address at the head of the list that is not unpacked, so in the vector pool, an odd list would like like (dummy, so op_# is the place of the op in the sequence, not the actual name, psuedo-assembler syntax):
.byte
.word op1
.byte <op_2, <op_3, ((>op_2 .AND. $0F)*16)+(>op_3 .AND $0F)
.byte <op_4, <op_5, ((>op_4 .AND. $0F)*16)+(>op_5 .AND $0F)
.byte <op_6, <op_7, ((>op_6 .AND. $0F)*16)+(>op_7 .AND $0F)
...

... with the packed address business best constructed as a macro in your favorite 6502 macro assembler, to get it right once and avoid lurking typos.

Note that the high bit of the address is stripped by the packing process and replaced by the routine.

Code:
;;
;    NB. The leading byte is the size of the packed string of vectors in bytes.

routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   lda vpool1,x
   sta temp+5
   tax
-
   lda vpool1,x
   tay
   and #$0F
   ora #$80
   pha

   lda vpool1,x
   pha
   dex
   cmp temp+5
   beq +

   lda vpool1,x
   tya
   lsr a
   lsr a
   lsr a
   sec
   ror a
   pha

   lda vpool1,x
   pha
   dex
   cmp temp+5
   bne -
+
   rts ; call subroutine vectors in turn from stack page


44 vs 16 bytes, so 28 bytes extra space required by the processing. 56 pairs of subroutine vectors crunched together and that breaks even.

So that is worth considering if the vector pool page is more than half full.

And since each vector pool has its own engine, if you have own full vector pool page and one only quarter full, you can put packed lists in the full one, and unpacked lists in the only partly full one.

Personally, if I was using any, I'd use the simpler one, because its easier to just define the routines using ".word" assembler instructions with the label for the routine.

This is all a kind of building a Forth-like inner interpreter without a Forth assembler/compiler system running in the NES.
Re: Flight Minigames
by on (#129463)
Well I tried, but Summer is on me and I have no time until the start of the next school year in August. I look forward to the multi-cart production.

Before I go, here's the list of games I had planned in this:
  • Game A: like the Helicopter game
  • Game B: An upside-down version of Balloon Fight physics. What should the character be?
  • Game C: A Flying Saucer that travels in a sine wave unless you press A, (I can place the sine table in the high byte of the note period table)
  • Game D: A paper airplane that always travels ±45°.
  • Game E: An electric spark that travels on one of two lanes. The A button switch between the two.
  • Game F: flappy butterfly A fluttering butterfly.
Re: Flight Minigames
by on (#129471)
BruceMcF wrote:
Code:
;;
;    NB. A "vector pool" is a page with a set of subroutine address vectors in it.
;    The leading byte is the position in the page of the last byte in the vector.

routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   stx temp+5
   lda vpool1,x
   tax
-
   lda vpool1,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call subroutine vectors in turn from stack page
... {etc.}


Note that a second way to use the redundant 4 bits per address is as data.

List: .byte <op1, >op1 .AND $0F, <op2, >op2 .AND $0F, ...

Call:
A-reg: index to first address in list
X-reg: index of first address of following routine

Code:
routine1:
   lda #<lst_rtn1
   ldx #<lst_rtn2
   jmp call_vec
...

call_vec:
   sta temp+5
-
   dex
   lda vec_pool,x ; high byte address, data in high nybble
   and #$0F
   ora #$80
   pha
   dex
   lda vec_pool,x ; low byte address
   pha
   cpx temp+5
   bne -
   rts


This packs operation scripts of up to 128 routines total into a page, since the length information is entirely embedded by the call. Having an all-address pool makes it reasonably straightforward to use the high nybble of the even address as 128 nybbles of data.

I'm thinking of butterfly as something that would fly with a bit of unpredictable imprecision, so it's something where a wrap-around sequence of +/- trim values could be laid out to make it "fluttery", but various processes that could be given character with a cycle of signed 4 bit integers (-8 through +7) could work as well.

nnyb2int: ; signed high nybble to signed integer
php ; remember sign
lsr a
lsr a
lsr a
lsr a
plp
bpl +
ora #$F0
+ rts

; flutvflo = $07 for 16, 8-step cycles; $0F for 8, 16 step cycles, $1F for 4, 32 step cycles
flutvflo:
.byte $0F

inc_flutter: ; increment wrap around based on mask in flutvflo:
inc flutter
lda flutter
bit flutvflo
bne +
clc
sbc flutvlfo ; this is a trick ~ "clc / sbc N" is the same as "sec / sbc {N+1}", and the mask+1 is the wrap-around
sta flutter
+ lsr a
tax
inx
lda vec+pool,x ; data is in high nybble of high byte of address
rts

Have a good summer!
Re: Flight Minigames
by on (#130791)
I been studying a self made disassembly of a old Nintendo game, and it's been an interesting experience, so I guess I'll bump this thread for the question I want to ask instead of making another thread. I hope I can get this game started again.

Suppose I do all the logic of the entire game by lots of small subroutines. Some subroutines might be used only once and inlined or be called as the last thing in another subroutine and jmp'ed to. A conditional jump might be optimized into a direct branch by rearranging code, or a conditional return might branch to a rts above the routine if the one below is to far away.

Is there some tool that can generate a call graph and automatically make these optimizations? If not, what parser can I work off of to quickly make my own optimizer? (I'm ok with python stuff)
Re: Flight Minigames
by on (#130799)
Years ago, I wrote a program in Python to make a call graph of a ca65 program. This is what the graph looked like. Should I try posting the program?
Re: Flight Minigames
by on (#130823)
Sure, why not.
Re: Flight Minigames (canceled)
by on (#132142)
I've given up on this project so that I can clear my mind for the planning of another nes game, and to give time to gear myself up for the next competition.