I am making an RPG, and the max HP and the current HP are two different 16-bit variables. I want to display this value on the screen, but in decimal. So I want to read these bytes:
$20 and $00
and I want this:
8192
to show up on screen. I know of a way I could do this, but it'd take like 3 seconds, and that's SLOW. What do you guys suggest I do for this?
new wiki >
programming libraries >
16-bit binary to decimal > approx. 700 cycles to convert a 16-bit positive integer to an array of 5 decimal digits
Thank you, I'm sorry, but 1/4 of this is C, and I don't know C. What's the main idea? Like what does the routine mainly do? Does it divide the hex digits by something, and output a decimal number? This is a little confusing. I'm sorry, once again, I don't know C, so a lot of that is just nonsense to me. I know a little C, but not much. I know like how to make a currency conversion DOS program, but that's it.
The content of the code block on the page is the entire content of a "bcd.s" file for use with CA65. I see no C, just pure 6502 assembly language. Perhaps you're getting hung up on the first four lines of the subroutine, which contain expressions that set the initial values. These expressions use
CA65 expression syntax, which (granted) may remind a fellow of expressions in the C language.
I've added an explanation of the algorithm to the code block.
I'm sorry, I'm really physically ill right now, so I have no idea what's going on during your routine. I don't mean to be difficult, but that is made with an assembler that I am really not familiar with. I use WLA-DX, and it is nice raw 6502. This section:
bcdConvert:
lda #$80 >> ((BCD_BITS - 1) & 3)
sta curDigit
ldx #(BCD_BITS - 1) >> 2
ldy #BCD_BITS - 5
You were right, I have NO idea what's happening here. Would you translate that to raw 6502 for me? I seriously have NO clue what that does. I looked at the syntax page, and I still don't know. Sorry, don't mean to be bothersome. I'm just ill with the flu or something, and I am so tired, I can't comprehend anything. Sorry again...
EDIT: Also:
ceil(log10(2) * x)
what is that?
Quote:
I'm sorry, I'm really physically ill right now, so I have no idea what's going on during your routine. [...] Sorry, don't mean to be bothersome. I'm just ill with the flu or something, and I am so tired, I can't comprehend anything. Sorry again...
Here's an idea: turn off the computer and get some rest.
>> is a right shift. $100 >> 1 = $80.
& is bitwise AND. $11 & $10 = $10.
log10() is the base-10 logarithm function.
ceil rounds up to the next whole value.
Use google for further information about logarithms.
Celius wrote:
I'm sorry, I'm really physically ill right now, so I have no idea what's going on during your routine. I don't mean to be difficult, but that is made with an assembler that I am really not familiar with. I use WLA-DX, and it is nice raw 6502.
I believe that WLA-DX supports a similar expression syntax. Look in the
WLA-DX manual for a section "11... Arithmetics".
Quote:
This section:
bcdConvert:
lda #$80 >> ((BCD_BITS - 1) & 3)
sta curDigit
ldx #(BCD_BITS - 1) >> 2
ldy #BCD_BITS - 5
You were right, I have NO idea what's happening here. Would you translate that to raw 6502 for me?
If you have BCD_BITS set to 19 (the default), it evaluates to the following:
Code:
lda #$20
sta curDigit
ldx #4
ldy #14
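For anyone who wants to double-check the arithmetic, the three expressions can be evaluated in a few lines of Python (using BCD_BITS = 19, the default mentioned above):

```python
BCD_BITS = 19  # the default from the wiki routine

# lda #$80 >> ((BCD_BITS - 1) & 3)
a = 0x80 >> ((BCD_BITS - 1) & 3)   # 18 & 3 = 2, so $80 >> 2 = $20

# ldx #(BCD_BITS - 1) >> 2
x = (BCD_BITS - 1) >> 2            # 18 >> 2 = 4

# ldy #BCD_BITS - 5
y = BCD_BITS - 5                   # 14

print(hex(a), x, y)  # -> 0x20 4 14
```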
Quote:
Also:
ceil(log10(2) * x)
what is that?
log10() is used to mean a base-10 logarithm. For instance, log10(2) = 0.30103.
ceil() means round up to the nearest integer (e.g. ceil(1.1), ceil(1.5), ceil(1.9) and ceil(2.0) all evaluate to 2).
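Putting the two together: ceil(log10(2) * x) tells you how many decimal digits you need for an x-bit number. A quick sanity check in Python:

```python
import math

def decimal_digits(bits):
    # maximum number of decimal digits needed to print a 'bits'-bit value
    return math.ceil(math.log10(2) * bits)

print(decimal_digits(16))  # 16-bit value -> 5 digits (max 65535)
print(decimal_digits(24))  # 24-bit value -> 8 digits (max 16777215)
```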
I always used the conversion routine I reverse-engineered from Final Fantasy. It works fine and supports up to 6 digits (24 bits).
Bregalad wrote:
I always used the conversion routine I reverse-engineered from Final Fantasy.
But did you rewrite it from scratch, or did you flat out copy it? You claim it supports up to 999,999, but in how many cycles?
It does quick repeated subtraction, using a different table for each byte weight and each digit.
tepples' routine seems pretty good. 700 cycles is pretty fast for a 16-bit conversion.
I wasn't able to get it to run OK though, back when he first posted it, 'cause of those expressions. At first I tried to evaluate them myself but got wrong results, and since then I haven't tried again. But now that he's posted what they evaluate to by default, I'll give it another try soon.
I'm still trying to understand the logic too, but it does seem promising.
The logic is the same as a plain old long division routine, except instead of subtracting powers of 2, it subtracts what the powers of 2 end up as after conversion from BCD to binary.
I've come up with my own new routine now, and it takes less time than 1 VBlank. I had a variable decrement to #$FF at the end of the routine, and in my NMI, I checked to see if the variable was decremented, and if it wasn't, incremented VblankCounter. The VblankCounter remained 0, so it took less than 1 VBlank to do that. I will comment the code and put it up later, like tomorrow. What I do is I check how many Thousands there are, then Hundreds, then Tens, then Ones. It actually works nicely. There might be a bug or two in it that I need to fix, but I don't think so. It should be pretty fast. Also, I only have to check up to $270F, because that's 9999, and that's the max HP for my RPG. The monsters have more, of course, but their HP won't be shown on screen. It might be, but I'll think about that later. I'll post the code tomorrow.
If you don't want to cycle-count the whole thing (as I did), one thing you can do to profile a subroutine is wait for sprite 0, turn on grayscale mode, call the subroutine, and turn off grayscale mode. Then run the modified program and see how tall the white bar is. Every 9 scanlines on NTSC are worth 1023.0 cycles.
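The conversion from bar height to cycles is simple arithmetic: an NTSC scanline is 341 PPU dots at 3 dots per CPU cycle, i.e. about 113.667 CPU cycles per scanline (hence the 9-scanline rule of thumb above). A small sketch of the math:

```python
NTSC_CYCLES_PER_SCANLINE = 341 / 3  # ~113.667 CPU cycles per scanline

def scanlines_to_cycles(scanlines):
    # rough CPU-cycle cost for a grayscale bar of the given height
    return scanlines * NTSC_CYCLES_PER_SCANLINE

print(round(scanlines_to_cycles(9)))   # -> 1023, matching the rule of thumb
print(round(scanlines_to_cycles(15)))  # -> 1705
```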
Celius wrote:
I've come up with my own new routine now, and it takes less time than 1 VBlank. I had a variable decrement to #$FF at the end of the routine, and in my NMI, I checked to see if the variable was decremented, and if it wasn't, incremented VblankCounter. The VblankCounter remained 0, so it took less than 1 VBlank to do that.
That is a horrible way to time the routine. Because of precision, I mean. "Less than 1 VBlank" is terribly vague, and not necessarily fast. I usually count the cycles on my routines, but that can get pretty boring for longer stuff with more conditional branches. tepples' idea is very good. It's easy to translate scanlines to CPU cycles, and you can easily see the variations in execution time as the grayscale bar varies its height. I'll try that sometime.
You can easily write a routine to print how long something takes. Wait for NMI (not VBL, since polling $2002 can miss frames), run the code you want to time, then wait for the next NMI while counting how many cycles it takes. Counting cycles in this case means counting how many times your loop iterates, which translates to 2+4+3 = 9 cycles per iteration, plus whatever an outer loop needs when incrementing the high byte of the counter. This is a bit more complex to initially implement. It could even be extended to cycle accuracy by having the loop check successive NMIs, refining from its initial ~9 cycle accuracy measurement.
Or just set the bg color to red, run it 20 times, set it back to black, and see how many scanlines it takes.
Well, I counted the cycles manually, and to be quite honest, it's not as fast as tepples'. But the amount of time it takes is relative. The max amount of time it can take (if the number going through is 9999 in decimal,) is around 1300 cycles. I think that's pretty okay. I'm going to comment my code and put it up in a little while. I don't know if you will fully understand it if you just read the raw code. I only use these instructions:
beq - 2
beq (if taken) - 3
bmi - 2
bmi (if taken) - 3
bne - 2
bne (if taken) - 3
clc - 2
adc immediate - 2
sec - 2
sbc immediate - 2
inc zeropage - 5
dec zeropage - 5
jmp absolute - 3
lda immediate - 2
lda zeropage - 3
sta zeropage - 3
rts - 6
The RTS is at the end of the routine, of course. I have this bad habit of making sure the routine doesn't come out with unwanted values, so for instance, I'll do things like this:
lda SomeVar
lsr a
lsr a
lsr a
lsr a
sta SomeVar
lda SomeVar
See the unnecessary lda SomeVar at the end? Yeah, I do that all the time, and it's a bad habit, I know. You may see things like that in my routine. So don't freak out when you see that.
Edit: Here it is:
http://www.freewebs.com/the_bott/subroutines.asm . I've put comments on it. I'm a little out of it right now because I have a fever over 100 degrees, and I just took some medicine, so I may have put really bad comments on it. Tell me if you don't understand what's going on. And you'll notice that I have how many cycles it takes for each section of the code. And I also have how many cycles it takes for it to go through 10 times. I've counted each branch as 2 cycles, but don't murder me for it. The amount of cycles is really relative. But if every part of the code was executed, and the 16-bit number going through was $270F, it'd be around 1300 cycles.
Also, you'll notice I have completely stupid names for labels. I always do that. I name labels off the top of my head. Sorry if that bothers any of you.
An easy way to know how fast a piece of code executes is to write $1f to $2001 at its beginning and $1e to $2001 at its end. The smaller the white bar on the screen, the faster your program is. That is VERY useful; I use it VERY often. It also gives a good indication of when the routine is called in the frame, etc...
You'll most probably NEED to write longer numbers for EXP and Gil. Once you've got the concept for 16-bit, it isn't much more complicated to do it for 24-bit. Such routines don't need to be *very* quick, since you'll hardly ever call them more than once a frame. But they shouldn't be too slow, either. I think you don't have to worry much about speed unless your routine takes a whole frame to execute or so.
Sorry, my code that I posted has a bug, but I fixed it. I can't explain it now. I am really out of it, and I feel so dizzy, it sucks. I have another bug and I can't figure out what's wrong. I will talk more later, when I feel better. Hopefully that's soon. Sorry, once again.
Since we're sharing BIN2DEC routines, here's mine:
Code:
BinToDec:
lda #$00
sta Decimal+0
sta Decimal+1
sta Decimal+2
sta Decimal+3
sta Decimal+4
ldx #$10
BitLoop:
asl Binary+0
rol Binary+1
ldy Decimal+0
lda Table, y
rol
sta Decimal+0
ldy Decimal+1
lda Table, y
rol
sta Decimal+1
ldy Decimal+2
lda Table, y
rol
sta Decimal+2
ldy Decimal+3
lda Table, y
rol
sta Decimal+3
rol Decimal+4
dex
bne BitLoop
rts
Table:
.db $00, $01, $02, $03, $04, $80, $81, $82, $83, $84
This converts a 16-bit binary value into 5 decimal digits, each one using a byte. It's not as fast as tepples': it always takes 920 cycles, no matter the input value. I'm sure this can be optimized; some day I'll figure out how. =)
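For readers more comfortable with a high-level language, here's a rough Python model of what the loop above does. The table values and the rotate-with-carry behavior are transcribed from the assembly, but the function name and structure are mine, not part of the original routine:

```python
# Table from the routine: digits 0-4 map to themselves; 5-9 are reduced
# by 5 and get bit 7 set, so that doubling carries a 1 into the next digit.
TABLE = [0x00, 0x01, 0x02, 0x03, 0x04, 0x80, 0x81, 0x82, 0x83, 0x84]

def bin_to_dec(binary):
    dec = [0, 0, 0, 0, 0]           # dec[0] = ones ... dec[4] = ten-thousands
    for _ in range(16):             # one pass per input bit
        carry = (binary >> 15) & 1  # asl Binary+0 / rol Binary+1
        binary = (binary << 1) & 0xFFFF
        for d in range(4):          # lda Table,y / rol / sta Decimal+d
            a = TABLE[dec[d]]
            new_carry = (a >> 7) & 1
            dec[d] = ((a << 1) | carry) & 0xFF
            carry = new_carry
        dec[4] = ((dec[4] << 1) | carry) & 0xFF  # rol Decimal+4
    return dec

print(bin_to_dec(8192))   # -> [2, 9, 1, 8, 0], i.e. "08192" least digit first
print(bin_to_dec(65535))  # -> [5, 3, 5, 5, 6]
```

Each pass doubles the whole decimal number and adds the incoming bit, so after 16 passes the digit array equals the input.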
As for timing the routines, I like the graphical way, it's quick and easy. But the other method that can give you the exact cycle count is pretty good too, but requires some special setup, if I understand it correctly.
To count cycles, blargg's method is technically interesting, but in practice nobody would do that, because implementing it is ANNOYING, and it prevents you from using any other routine in parallel during the frame.
Tokumaru, your method is interesting. Do you find it best to have a routine that is always slow, or one that can vary from slow to fast? If I remember correctly, this was discussed LONG ago, and you were saying that a routine is better if it always takes a long time to accomplish, as long as the time doesn't vary regardless of parameters.
Okay, I'm still a little out of it, but I have a question. What is your "equation" to do hex to decimal? What is the main idea of your routine? Do you get what I'm asking?
Quote:
To count cycles, blargg's method is technically interesting, but in practice nobody would do that, because implementing it is ANNOYING, and it prevents you from using any other routine in parallel during the frame.
I thought the point was to time a piece of code in isolation. If you use modular design methods, you can write code once and use it many times afterwards without any rewriting. This allows you to invest more time than you would normally for a limited-scope routine. If one wrote what I described, one would be able to accurately time any routine with very little effort. Make a few changes and retime, "Ahhh, that's 10 clocks faster." etc.
As always, it is the designer's job to choose what's appropriate for the job at hand. I only offer ideas that might be useful in some contexts.
Bregalad wrote:
To count cycles, blargg's method is technically interesting, but in practice nobody would do that, because implementing it is ANNOYING, and it prevents you from using any other routine in parallel during the frame.
I guess it's interesting to test standalone routines (such as binary to decimal conversions), but doing it to a piece of code within a bigger game would suck.
Quote:
Tokumaru, your method is interesting. Do you found it is the best to have a routine always slow, or that can vary from slow to fast ? If I remember correctly, this has been discuted LONG ago, and you were saying that a routine is better if it always take long time to accomplish, but that the time doesn't vary regarless of parameters.
Yeah, we had a long talk about this. This code is the best I came up with after that talk. I'm really not sure if I prefer slow but constant code. The logical choice would be to have stuff execute faster when possible. But when you have a lot of those, the total time can get pretty unpredictable.
If you were able to code a game where all the logic is constant-timed, you'd never lose a frame, 'cause you'd just know how much time everything will take. I know it's impossible to code a fully constant-timed game engine, but that was just hypothetical.
I wouldn't like to be surprised if by any chance all the variable-timed routines in my game decided to hit their worst cases all together. I know it is unlikely, but anyway.
So I'm not really sure what I prefer. I try to mix variable and constant-time stuff together, so that the absolute worst case (all routines in the worst case) won't be so heavy.
Celius wrote:
What is your "equation" to do hex to decimal? What is the main idea of your routine?
Celius, the basic idea behind this method is that you shift bit by bit out of a number (the binary one) into another (in this case, the decimal one). For example, if you shifted all bits from a binary number into another binary number, you'd get the same value (it would be useless, sure, but I said that to show you can reconstruct a number from the bits shifted out of another).
Each shift doubles the value (x2). To make the second number decimal, not binary/hex, this code changes every digit higher than 4 (5, 6, 7, 8 or 9). That is because if those digits were doubled normally, they'd result in $A, $C, $E, $10 and $12, values that wouldn't match their decimal representation (digits 0, 1, 2, 3 and 4 are doubled to 0, 2, 4, 6 and 8, which are the same in decimal and in hex, so they need no special handling). 5 is changed to $80. Bit 7 is set so that it is shifted into the next byte and represents the "1" of the number "10" (ten decimal), which is 5 doubled. 6 is changed to $81, so that when it is doubled (shifted) the "1" in bit 7 moves to the next byte and becomes the "1" of the number "12" (twelve decimal), and the "1" in bit 0 is doubled to become the "2" of "12". That's the idea for all the other digits.
You basically force an overflow when there wouldn't be any. Hex digits only overflow after digit $F. To simulate the decimal system we force them to overflow after digit $9. This routine only does it before the number is doubled, so the transformations are applied from digit 5 on. And I like to do it all through a table instead of checking if the digit is bigger than 4. You could check if the number is higher than 4 and apply the changes accordingly, and you'd get a variable-time routine.
Well, the advantages of each approach:
- Execution time that varies a lot: You don't have to break your head making your code as fast as possible, since it will usually not run at its slowest. Unfortunately, your game may fall victim to slowdowns if the worst case is encountered (typically a lot of enemies on screen, which means more sprites to draw, more collisions to check and more AI to execute).
- Execution time that doesn't vary at all: You can make sure your game will never slow down, and you can do stuff like check for sprite zero AFTER doing game logic, which allows a status bar at the bottom of the screen or so. Unfortunately, you'll always have to break your head and waste ROM to make your code as fast as possible, and since you can never go past a frame, you'll have to reduce your engine's possibilities.
I think Konami and Nintendo games do EVERYTHING in the NMI, and make sure that game logic never takes longer than a single frame, which is necessary to avoid interrupt chains, i.e. another NMI occurring before the first one returns.
Most other developers did it the "standard" way (I say standard because it is the way I use; Nintendo or Konami may not find it standard).
Has anyone ever seen Mario, Zelda, Metal Gear, Castlevania, Contra or whatever slow down? The only game I remember having serious slowdowns is Megaman 3.
I wonder how Contra handles so many enemies and bullets without slowing down. The game I'm developing now supports 8 objects per screen, which can be bullets, enemies or items. The enemies and some items are loaded, and the others would be automatically generated from AI code (not done yet). But each object takes a lot of time, for both AI and sprite handling. Especially when moving fast, an object has to check for collision at each pixel along the way, or else it would go through walls.
A constant-timed game is just a dream. No serious game could be like that. But even when I make variable-time routines I try not to let them vary so much. Of course there is no other option for the active objects: you must handle all that are present, and that varies a lot. Unless you wasted some time on objects that are not present, but that feels stupid.
As for objects moving too fast and going through walls, I don't worry too much about it now. At first I thought I should check all the pixels between 2 points, so I wouldn't miss a thing. But that seemed like a huge waste of time. In that project, the worlds are composed of 16x16-pixel blocks, so I just restrict movement to 16 pixels at a time. There is no way you'll miss a block like this, and 16 pixels per frame is pretty fast... I don't think I'll ever want something moving that fast. I guess it depends on the details of the world and the speed the object is moving.
Speaking of code profiling, I have a version of Virtuanes with some code profiling features that Norix sent to me a while back. It's pretty nice, you write to a couple registers at the start/stop of the routine you want to profile and it displays the cycles used (also the average and maximum if it runs several times). Anyone want that?
Memblers wrote:
Speaking of code profiling, I have a version of Virtuanes with some code profiling features that Norix sent to me a while back. It's pretty nice, you write to a couple registers at the start/stop of the routine you want to profile and it displays the cycles used (also the average and maximum if it runs several times). Anyone want that?
Sounds cool, and I like VirtuaNES a lot.
Tokumaru, to be quite honest, I have no idea what's going on in your routine. You should comment it, so I can follow it a bit better. You use so many shifts, it's hard to follow. I understand when you say that if you shift SomeVar ($81) left, you'll have 1 in the carry flag and 2 in SomeVar, which represents 12 in decimal, which divided by two equals 6. Yes, I get that, but what good does that do? I'm sorry, I just don't get your routine. Would you comment it for me? I won't make this another "Object Collision" thread, I promise. Unlike then, I actually understand the 6502 now. I didn't realize it, but back then, I actually didn't understand the 6502 at all.
I could comment the code, but that would be fairly pointless. The actual explanation of how the thing works is in the post. The comments would be like "shift bit out of binary number", "fix the units digit from the table", "shift bit into units digit", and repeat that for the tens, hundreds, etc. The real trick is in the values in the table, and I explained what they mean in the post.
I can't blame you for not understanding the explanation, I suck at explaining things.
You can view the binary number as a series of 1's that have been doubled a few times. The number 7, for example (binary 111), is a one that has been doubled twice (and turned into 4), plus a one that has been doubled once (2), plus a one that has not been doubled. That's the rule for binary numbers. The bits are numbered from 0 to 7. This number is the amount of times a one has been doubled, if there is a one at that position.
The idea here is to reconstruct the number, doubling each bit as many times as in the original number, but force the results to be represented with decimal digits. I first shift out the highest bit (bit number 15, whose value is 32768, or 2^15) and put that in the units digit. The whole decimal number will be shifted 16 times (that's why there are so many shifts), so that bit can represent the number 32768 after all the shifts.
The first few digits double just fine: 1 turns into 2, 2 turns into 4, 3 turns into 6 and 4 turns into 8, but from 5 on we get results that do not match the decimal representation. That's why the values are transformed, so that they do match the decimal representation. 5 is turned into 128 (binary 10000000) right before the shift (doubling), so that the one is carried over to the next digit (since 10 must be represented by 2 digits, the one has to be carried over to the next). 6, which would turn into $C if shifted normally, is changed to 129 (binary 10000001), so that the top one is carried over and the lower one is doubled to 2, resulting in a perfect "12", which is the decimal double of 6.
We basically force the numbers to double to their decimal representation, not the binary/hex one. Whenever an unwanted result is going to happen, we change the value so that we get the desired result.
LADADIDA!!!!!!!!!!!!!!!!!!!!!!!!
I am singing because I'm so happy!!
Okay, I just think it's kind of cool, but... I've just come up with a BinToDec routine that takes <DrumRoll> 500 cycles!! I am going to go to bed soon, but I should have it posted by tomorrow! There are NO loops, and it takes 500 cycles every time. Well, if there are no loops, then of course it will. The code is big, though. I haven't bug tested it yet, but I'm sure it will work properly. There are a couple of cons about it: it uses about 20 bytes of RAM, and it takes up a lot of space. Well, that's 2. I'll see if I can spot any more tomorrow.
If you want a small hex->bcd routine (in regards to both code size and memory usage), here's one kevtris found a while ago (and optimized significantly).
Code:
hex2bcd_input: .res 3 ;(adjust as necessary)
hex2bcd_output: .res 4 ;(adjust as necessary)
hex2bcd:
LDY #24 ;set this to the number of bits worth of input
LDA #0
STA hex2bcd_output+0
STA hex2bcd_output+1
STA hex2bcd_output+2
STA hex2bcd_output+3 ;clear result regs - insert others here
BEQ rotate
loop: LDX #3 ;this needs to be len(hex2bcd_output) - 1
bcdfix: LDA hex2bcd_output,X
ADC #$03
BIT val_08
BEQ :+
STA hex2bcd_output,X
: LDA hex2bcd_output,X
ADC #$30
BPL :+
STA hex2bcd_output,X
: DEX
BPL bcdfix
rotate: ROL hex2bcd_input+0
ROL hex2bcd_input+1
ROL hex2bcd_input+2 ;insert any others here
ROL hex2bcd_output+3 ;etc.
ROL hex2bcd_output+2
ROL hex2bcd_output+1
ROL hex2bcd_output+0 ;rotate results
DEY
BNE loop
RTS
val_08: .byte $08 ;because we need to BIT #$08, but you can't BIT immediate
The only trouble is that it's somewhat slow - a 24-bit number took around 3500 cycles (>30 scanlines) to convert. A reduced routine to work on a 16-bit number took around 1750 cycles (~15 scanlines), about half as long.
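The routine above is the classic "double dabble" (shift-and-add-3) algorithm on packed BCD: before each shift, any BCD nibble that is 5 or more gets 3 added, so that doubling it carries correctly into the next decimal digit. Here's a hedged Python model of the same idea (not a line-by-line translation of the assembly; names and defaults are mine):

```python
def hex2bcd(value, in_bits=24, out_bytes=4):
    # Packed-BCD result, most significant byte first (two digits per byte).
    out = [0] * out_bytes
    for i in range(in_bits):
        # Add-3 adjust: a BCD nibble >= 5 would double past 9, so bump it
        # by 3 now; after the shift it carries into the next digit.
        for j in range(out_bytes):
            if (out[j] & 0x0F) >= 0x05:
                out[j] += 0x03
            if (out[j] & 0xF0) >= 0x50:
                out[j] += 0x30
        # Shift the whole output chain left one bit, feeding in the input.
        carry = (value >> (in_bits - 1 - i)) & 1
        for j in range(out_bytes - 1, -1, -1):
            out[j] = (out[j] << 1) | carry
            carry = out[j] >> 8
            out[j] &= 0xFF
    return out

print([hex(b) for b in hex2bcd(8192)])      # -> ['0x0', '0x0', '0x81', '0x92']
print([hex(b) for b in hex2bcd(16777215)])  # -> ['0x16', '0x77', '0x72', '0x15']
```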
Celius wrote:
I've just come up with a BinToDec routine that takes <DrumRole> 500 cycles!! I am going to go to bed soon, but I should have it posted by tomorrow!
I'd like to see that! =)
Quote:
There are NO loops, and it takes 500 cycles every time. Well, if there are no loops, then of course it will.
Well, you could have no loops but still have some decisions here and there. Branches aren't always loops.
Quote:
There are a couple of cons about it: it uses about 20 bytes of RAM, and it takes up a lot of space.
Yeah, that is a lot. How much ROM does it take? If it's completely unrolled, I guess a lot.
I'm usually a fan of speed, not size. I hope this didn't sound gay. But I guess it did. Well, you know what I'm talking about: I'd rather have a quick routine, that takes a bunch of ROM, 'cause usually my programs have to be fast.
Okay, well it's up! It works, I've tested it, and after manually counting out all the cycles, it was about 520 to 550. I've made slight changes to it since I counted, so I'll just assume it's around there. It's right here:
http://www.freewebs.com/the_bott/bintodec.asm
The main idea of this routine is quite simple.
I have a couple of tables with decimal representations of certain hex values. I have one table that defines every decimal representation for every $1000. Meaning I'll have the decimal representation for $1000, $2000, $3000, $4000, etc. I have that in one table. And I have the decimal representations for every $0X00. Same idea, just for like $0100, $0200, etc. Then I have two more tables for $00X0 and $000X. I don't know if you've ever really thought about it, but $FFFF is the same as $F000 + $0F00 + $00F0 + $000F. And that would mean 65535 is the same as: 61440 + 3840 + 240 + 15. 61440 being the decimal representation of $F000, 3840 being $0F00, etc. Then I simulate pen-and-paper math. I set it up just like I'd write all 4 of those numbers down in separate rows:
Code:
61440
3840
240
15
I add the last digit of each decimal number together, just like you would on pen-and-paper addition. But what if it reaches over $09? It would be $0A! You know what this means? We have another lookup table. If you look at the table, it goes:
...$8,$9,$10,$11,$12
just like decimal numbers would. So the lower 4 bits of the result is what you store in the Final-Result-Ones-Digit. The result will be the 5 in 65535. Then you shift the high 4 bits over, and you add that to the next column of numbers. It's just like pen-and-paper addition. Do you understand? What do you think? I know there are a lot of unnecessary things in the routine, and I'll fix them. If you spot something that's totally unnecessary, I'm pretty sure I already know. I'll fix it later; I just put that up for now to show you the idea.
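Since the linked .asm may not survive, here's a hedged Python sketch of the idea as described: four tables give the decimal digits of each nibble's weight, and the columns are summed with a binary-to-BCD "adjust" table like the ...$8,$9,$10,$11,$12 one above (all names here are mine, not Celius's):

```python
# Decimal digits (ten-thousands ... ones) of n * weight, precomputed.
# In the 6502 version these would be ROM tables indexed by nibble.
def digit_table(weight):
    return [[(n * weight // 10**p) % 10 for p in (4, 3, 2, 1, 0)]
            for n in range(16)]

TABLES = [digit_table(w) for w in (4096, 256, 16, 1)]

# The "...$08,$09,$10,$11,$12..." table: binary sum -> BCD, so the low
# nibble is the digit and the high nibble is the carry to the next column.
ADJUST = [(s // 10) * 16 + (s % 10) for s in range(40)]

def bin_to_dec(value):
    rows = [TABLES[i][(value >> shift) & 0xF]
            for i, shift in enumerate((12, 8, 4, 0))]
    digits = [0] * 5
    carry = 0
    for col in range(4, -1, -1):           # ones column first, like on paper
        bcd = ADJUST[carry + sum(row[col] for row in rows)]
        digits[col] = bcd & 0x0F
        carry = bcd >> 4
    return digits

print(bin_to_dec(0xFFFF))  # -> [6, 5, 5, 3, 5], i.e. 65535
```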
Hum... I like your idea! It uses some of the ideas from my routine (jumping from $9 to $10) but takes some shortcuts, like the "pre-conversion" where you take the values from tables. You work on a byte-by-byte basis instead of bit-by-bit. Very interesting.
I haven't tried the code yet, as it's not compatible with the assembler I use, so I'll have to change it a little, but I will try it soon.
This is a very clever idea you've got. Of course it isn't perfect, but the algorithm sounds very good.
I think the only problem is doing the addition in decimal manually at the end. This doesn't seem very optimized, but it seems like there isn't really any alternative here, since your method relies on the decimal components of the hex number being added together. At least this is a very interesting idea.
Hum... I think I have an idea to improve that algorithm... I'm thinking of a way to arrange the tables that might improve the speed of the addition at the end. I don't really know if I have something, but I might. I'll try it later.
Well, I'm glad you all like it, or think it's interesting! I am proud of it, but the idea is so simple. I think I'm going to go to the next level (in BinToDec routines...) and come up with a DecToBin routine. I don't know if anyone has really thought about this, or has a need. But I do. I am going to make an NES program that will make my square root table for me for my raycaster. I know the logic behind finding the square root of a number. A perfect square is much easier than a non-whole square root. It involves long divisions and multiplications of non-whole numbers. Actually, the equation is:
G = ((N / G) + G) / 2   (loop 5 times)
SqRt - square root (the final value of G)
N - number
G - guess
loop 5 - repeat the equation 5 times
The last part is the always differing part. It depends on what the guess is. You literally guess what the square root is, though. You guess something like 5 if you're finding the square root of 30. You do one loop, and you use the result as the next guess. It really works.
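What Celius is describing is Newton's method (the "Babylonian" square root). A small sketch, using his example of guessing 5 for the square root of 30 (function name is mine):

```python
def sqrt_newton(n, guess, passes=5):
    # Each pass averages the guess with n / guess; the result of one
    # pass becomes the guess for the next.
    g = float(guess)
    for _ in range(passes):
        g = (n / g + g) / 2
    return g

root = sqrt_newton(30, 5)
print(root)  # ~5.477, and root**2 is ~30
```

Convergence is quadratic, so 5 passes from a reasonable guess is already at floating-point precision.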
tokumaru wrote:
Hum... I like your idea! It uses some of the ideas from my routine (jumping from $9 to $10) but takes some shortcuts, like the "pre-convertion" where you take the values from tables. You work on a byte-by-byte basis instead of bit-by-bit. Very interesting.
But is it worth the extra ROM space vs. working bit-by-bit, when bit-by-bit can finish a 5-digit (16-bit) number in 700 cycles?
tepples wrote:
But is it worth the extra ROM space vs. working bit-by-bit, when bit-by-bit can finish a 5-digit (16-bit) number in 700 cycles?
Would that be a typo? Do you mean byte-by-byte?
I think that it really depends on a lot of things: whether you care how big routines are, what kind of game you are making, how fast your game needs things to be. This was just recently discussed in the "Bit Reversing" thread. I took your side, and it came across to me that you should not worry about code size; you should worry about code speed and game data size. By game data, I mean levels, music, etc.
Celius wrote:
tepples wrote:
But is it worth the extra ROM space vs. working bit-by-bit, when bit-by-bit can finish a 5-digit (16-bit) number in 700 cycles?
Would that be a typo? Do you mean byte-by-byte?
No, bit-by-bit.
My binary to decimal subroutine runs through a tight loop 15 times to generate four BCD digits, and then it uses the remainder as the least significant digit. In fact, the same algorithm implemented in C on the ARM7TDMI CPU of the Game Boy Advance (which has a hardware multiplier but no hardware divider) is second only to reciprocal multiplication in speed.
Another thing: Is a manually unrolled loop worth the loss in maintainability?
Oh sorry, I didn't catch that "vs." in there. Now it all makes sense, and I see how it's not a typo. Okay, as I've said before, I do not understand how your method works, because I never really use any of those expressions. I always just use plain old 6502. The only thing I will use is something like SomeVar+X, where X is whatever value you want. Otherwise, I won't use C-esque expressions, because I do not understand them.
And I agree, your routine is overall better, because it is way smaller and only slightly slower than mine. It depends on whether you have the space to spare. If you can spare it, and have absolutely no more than 600 cycles left in your NMI routine or something, then you should go with my routine. But if that isn't the case, it's probably better to stick with yours. But it's best to stick with the one you understand. That's why I will stick with mine.
Oh, and is my code really unmaintainable?
Celius wrote:
Otherwise, I won't use like C-esk expressions, because I do not understand them.
Keep the idea around for when it is needed, because it has its place. Like in my speech synth, I had a hell of a time figuring out how to get my sample addresses into the DPCM register's format. After manually padding the sample files, I ended up using expressions, which are fairly simple.
Code:
sample_addr_index:
.word (voc00-$C000)/$40
.word (voc01-$C000)/$40
.word (voc02-$C000)/$40
.word (voc03-$C000)/$40
sample_length_index:
.word (voc01-voc00)/16 ; a
.word (voc02-voc01)/16 ; ae
.word (voc03-voc02)/16 ; aw
.word (voc04-voc03)/16 ; ar
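For reference, the arithmetic those expressions encode can be sketched in C. This is a hedged illustration, not Memblers' actual tool: the address register value is (sample start − $C000)/64, and the length entries above are byte counts divided by 16. The sample addresses used in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the CA65 expressions above (not Memblers' actual code).
   The DPCM address entry holds (sample_start - $C000) / 64, and the
   length entry is the sample size in bytes divided by 16. */
static uint8_t dpcm_addr_entry(uint16_t sample_start)
{
    return (uint8_t)((sample_start - 0xC000u) / 64u);
}

static uint8_t dpcm_len_entry(uint16_t start, uint16_t end)
{
    return (uint8_t)((end - start) / 16u);
}
```

For a hypothetical sample at $C040 that is $440 bytes long, these give an address entry of 1 and a length entry of $44.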
BCD routine looks good; I always appreciate the fastest code, because I like there to be room left for more stuff. Is it free for anyone to use?
Memblers wrote:
BCD routine looks good; I always appreciate the fastest code, because I like there to be room left for more stuff. Is it free for anyone to use?
At least mine is (like zlib).
Well, I'd be kind of a jackass to say it isn't free. I'd prefer if you wrote your own code, but the idea is free to use. As long as I am remembered for coming up with it... I don't think that would happen, but if you're going to use it, maybe you could put my name somewhere in the credits. You could get away with not doing it, and I wouldn't go disassemble your game and look for my routine, but I'd appreciate some credit.
I was reading this thread the other day and I got interested in trying to come up with a fast binary-to-decimal algorithm.
Warning: I am not a 6502 programmer. I don't have any 6502 assemblers lying around =)
The code below is completely and utterly untested and probably won't assemble without some fixing up. So even if you think the idea is sound, you will want to throw my code away and write your own from scratch. I just hacked this up so I could make a good guess about how large/fast/slow it is.
....pretending for a moment that this code works as-is (which is highly unlikely!)...
it would be a 244-byte routine that converts a 16-bit unsigned integer into decimal in about 269 cycles or less.
(Not counting the cost of actually writing out the decimal char once you decide what digit it is).
Code:
; the algorithm goes like this:
; t = 40000; if (N >= t) { N -= t; d[0]+=4; }
; t = 20000; if (N >= t) { N -= t; d[0]+=2; }
; t = 10000; if (N >= t) { N -= t; d[0]+=1; }
; putchar(d[0])
;
; t = 6000; if (N >= t) { N -= t; d[1]+=6; }
; t = 3000; if (N >= t) { N -= t; d[1]+=3; }
; t = 2000; if (N >= t) { N -= t; d[1]+=2; }
; t = 1000; if (N >= t) { N -= t; d[1]+=1; }
; putchar(d[1])
;
; t = 600; if (N >= t) { N -= t; d[2]+=6; }
; t = 300; if (N >= t) { N -= t; d[2]+=3; }
; t = 200; if (N >= t) { N -= t; d[2]+=2; }
; t = 100; if (N >= t) { N -= t; d[2]+=1; }
; putchar(d[2])
;
; t = 60; if (N >= t) { N -= t; d[3]+=6; }
; t = 30; if (N >= t) { N -= t; d[3]+=3; }
; t = 20; if (N >= t) { N -= t; d[3]+=2; }
; t = 10; if (N >= t) { N -= t; d[3]+=1; }
; putchar(d[3])
; putchar(n)
; In each group of tests (i.e. the 3 or 4 tests for one digit),
; notice that at most 2 of the tests succeed for any given input:
; first digit: 0-6 = {none, 1, 2, 2+1, 4, 4+1, 4+2}
; other digits: 0-9 = {none, 1, 2, 3, 3+1, 3+2, 6, 6+1, 6+2, 6+3}
; The if-tests that are short-circuited execute in a mere 4 cycles
; most of the time (!!!). For a few input values (those where the
; high byte is equal and we have to compare the low byte also) it
; will take 13 cycles.
; Worst-case, it takes about 28 cycles when not short-circuited.
; Some of the later tests are cheaper (everything 200 and less)
; There is also an overhead of 3 bytes per digit (to reset Y to zero).
; Quick back-of-the-envelope calculation for the number of cycles:
; Upper bound: 3+(28+28+13) + 3+(28+28+13+13) + 3+(28+28+4+4) + 6+(16+12+4+4)+3 = 269
; Typical: 3+(28+17+4) + 3+(28+17+ 4+ 4) + 3+(22+14+4+4) + 6+(16+9+4+4)+3 = 197
; The code size is approximately 245 bytes (no table needed fortunately).
; It needs two bytes in the zero page (one of which is a read-only constant of zero).
; EVEN IF EVERY SINGLE JUMP WAS NOT TAKEN (which is impossible) the routine would
; take at most 316 cycles.
; keep d in zeropage memory?
; NH in X
; NL in zeropage (ZP_NL)
; TODO: X must contain high byte
ldy ZP_ZERO ; 2 : 3
cpx HI_40000 ; 2 : 2
bmi L2 ; 2 : 2 if (NH < HI(40000)) goto L2
bne L1 ; 2 : 2 if (NH != HI(40000)) goto L1
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_40000-1) ; 2 : 2
L1: bmi L2 ; 2 : 2 if (NL < LO(40000) goto L2
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_40000 ; 2 : 2
tax ; 1 : 2
tya ; 1 : 2 add 4 to digit!
adc #4 ; 2 : 2
tay ; 1 : 2
L2:
cpx HI_20000 ; 2 : 2
bmi L4 ; 2 : 2 if (NH < HI(20000)) goto L4
bne L3 ; 2 : 2 if (NH != HI(20000)) goto L3
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_20000-1) ; 2 : 2
L3: bmi L4 ; 2 : 2 if (NL < LO(20000) goto L4
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_20000 ; 2 : 2
tax ; 1 : 2
iny ; 1 : 2 add 2 to digit!
iny ; 1 : 2
L4:
cpx HI_10000 ; 2 : 2
bmi L6 ; 2 : 2 if (NH < HI(10000)) goto L6
bne L5 ; 2 : 2 if (NH != HI(10000)) goto L5
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_10000-1) ; 2 : 2
L5: bmi L6 ; 2 : 2 if (NL < LO(10000) goto L6
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_10000 ; 2 : 2
tax ; 1 : 2
iny ; 1 : 2 add 1 to digit!
L6:
; *************************************************
; TODO: need code here to output first digit from Y
; *************************************************
ldy ZP_ZERO ; 2 : 3
cpx HI_6000 ; 2 : 2
bmi L8 ; 2 : 2 if (NH < HI(6000)) goto L8
bne L7 ; 2 : 2 if (NH != HI(6000)) goto L7
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_6000-1) ; 2 : 2
L7: bmi L8 ; 2 : 2 if (NL < LO(6000) goto L8
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_6000 ; 2 : 2 note: cf=0 after this
tax ; 1 : 2
tya ; 1 : 2 add 6 to digit!
adc #6 ; 2 : 2
tay ; 1 : 2
L8:
cpx HI_3000 ; 2 : 2
bmi L10 ; 2 : 2 if (NH < HI(3000)) goto L10
bne L9 ; 2 : 2 if (NH != HI(3000)) goto L9
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_3000-1) ; 2 : 2
L9: bmi L10 ; 2 : 2 if (NL < LO(3000) goto L10
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_3000 ; 2 : 2 note: cf=0 after this
tax ; 1 : 2
iny ; 1 : 2 add 3 to digit!
iny ; 1 : 2
iny ; 1 : 2
L10:
cpx HI_2000 ; 2 : 2
bmi L12 ; 2 : 2 if (NH < HI(2000)) goto L12
bne L11 ; 2 : 2 if (NH != HI(2000)) goto L11
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_2000-1) ; 2 : 2
L11: bmi L12 ; 2 : 2 if (NL < LO(2000) goto L12
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_2000 ; 2 : 2 note: cf=0 after this
tax ; 1 : 2
iny ; 1 : 2 add 2 to digit!
iny ; 1 : 2
L12:
cpx HI_1000 ; 2 : 2
bmi L14 ; 2 : 2 if (NH < HI(1000)) goto L14
bne L13 ; 2 : 2 if (NH != HI(1000)) goto L13
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_1000-1) ; 2 : 2
L13: bmi L14 ; 2 : 2 if (NL < LO(1000) goto L14
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_1000 ; 2 : 2 note: cf=0 after this
tax ; 1 : 2
iny ; 1 : 2 add 1 to digit!
L14:
; **************************************************
; TODO: need code here to output second digit from Y
; **************************************************
ldy ZP_ZERO ; 2 : 3
cpx HI_600 ; 2 : 2
bmi L16 ; 2 : 2 if (NH < HI(600)) goto L16
bne L15 ; 2 : 2 if (NH != HI(600)) goto L15
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_600-1) ; 2 : 2
L15: bmi L16 ; 2 : 2 if (NL < LO(600) goto L16
sta ZP_NL ; 2 : 3
txa ; 1 : 2
sbc HI_600 ; 2 : 2 note: cf=0 after this
tax ; 1 : 2
tya ; 1 : 2 add 6 to digit!
adc #6 ; 2 : 2
tay ; 1 : 2
L16:
cpx HI_300 ; 2 : 2
bmi L18 ; 2 : 2 if (NH < HI(300)) goto L18
bne L17 ; 2 : 2 if (NH != HI(300)) goto L17
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
lda ZP_NL ; 2 : 3
sbc (LO_300-1) ; 2 : 2
L17: bmi L18 ; 2 : 2 if (NL < LO(300) goto L18
; At this point NH must be zero and we can skip a few calculations.
iny ; 1 : 2 add 3 to digit!
iny ; 1 : 2
iny ; 1 : 2
L18:
cmp #200 ; 2 : 2
bmi L19 ; 2 : 2
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
sbc #200-1 ; 2 : 2
iny ; 1 : 2 add 2 to digit!
iny ; 1 : 2
L19:
cmp #100 ; 2 : 2
bmi L20 ; 2 : 2
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
sbc #100-1 ; 2 : 2
iny ; 1 : 2 add 1 to digit!
L20: sta ZP_NL ; 2 : 3
; *************************************************
; TODO: need code here to output third digit from Y
; *************************************************
ldy ZP_ZERO ; 2 : 3
lda ZP_NL ; 2 : 3
cmp #60 ; 2 : 2
bmi L21 ; 2 : 2
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
sbc #60-1 ; 2 : 2
tax ; 1 : 2
tya ; 1 : 2 add 6 to digit!
adc #6 ; 2 : 2
tay ; 1 : 2
txa ; 1 : 2 get N back for the next compare
L21:
cmp #30 ; 2 : 2
bmi L22 ; 2 : 2
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
sbc #30-1 ; 2 : 2
iny ; 1 : 2 add 3 to digit!
iny ; 1 : 2
iny ; 1 : 2
L22:
cmp #20 ; 2 : 2
bmi L23 ; 2 : 2
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
sbc #20-1 ; 2 : 2
iny ; 1 : 2 add 2 to digit!
iny ; 1 : 2
L23:
cmp #10 ; 2 : 2
bmi L24 ; 2 : 2
; NOTE: cf=0 so sbc will borrow, so subtract one to compensate
sbc #10-1 ; 2 : 2
iny ; 1 : 2 add 1 to digit!
L24:
sta ZP_NL ; 2 : 3
; **************************************************
; TODO: need code here to output fourth digit from Y
; **************************************************
; *****************************************************
; TODO: need code here to output final digit from ZP_NL
; *****************************************************
Edit: DOH... I think my estimate is a little low. I forgot that taken branches cost a cycle or two. Still, there should be no more than about 15 of those, and no more than about 9 in the 'upper bound' case. So it's probably still well under 300 cycles.
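As a sanity check on the idea itself (not on the untested assembly, which the author says to throw away), the same digit-by-digit algorithm can be written in C. The 6502 version interleaves high-byte/low-byte compares; stripped of that, the algorithm reduces to this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* C model of the unrolled routine above: for each digit, trial-subtract
   scaled step values chosen so that at most two tests succeed per digit.
   Fills d[0..4] with the decimal digits, most significant first. */
static void to_decimal(uint16_t n, uint8_t d[5])
{
    static const uint16_t step[4][4] = {
        { 40000, 20000, 10000, 0 },    /* top digit only needs 4,2,1 */
        {  6000,  3000,  2000, 1000 },
        {   600,   300,   200, 100 },
        {    60,    30,    20, 10 },
    };
    static const uint8_t add[4][4] = {
        { 4, 2, 1, 0 }, { 6, 3, 2, 1 }, { 6, 3, 2, 1 }, { 6, 3, 2, 1 },
    };
    uint32_t rem = n;
    for (int i = 0; i < 4; i++) {
        d[i] = 0;
        for (int j = 0; j < 4; j++) {
            if (step[i][j] != 0 && rem >= step[i][j]) {
                rem -= step[i][j];
                d[i] = (uint8_t)(d[i] + add[i][j]);
            }
        }
    }
    d[4] = (uint8_t)rem;  /* what's left over is the ones digit */
}
```

For example, 8192 comes out as the digits 0, 8, 1, 9, 2.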
mozz wrote:
I was reading this thread the other day and I got interested in trying to come up with a fast binary-to-decimal algorithm.
Warning: I am not a 6502 programmer. I don't have any 6502 assemblers lying around =)
The
cc65 toolchain is rawther popular here.
Quote:
Code:
; the algorithm goes like this:
; t = 40000; if (N >= t) { N -= t; d[0]+=4; }
; t = 20000; if (N >= t) { N -= t; d[0]+=2; }
; t = 10000; if (N >= t) { N -= t; d[0]+=1; }
; putchar(d[0])
;
; t = 6000; if (N >= t) { N -= t; d[1]+=6; }
; t = 3000; if (N >= t) { N -= t; d[1]+=3; }
; t = 2000; if (N >= t) { N -= t; d[1]+=2; }
; t = 1000; if (N >= t) { N -= t; d[1]+=1; }
; putchar(d[1])
;
; t = 600; if (N >= t) { N -= t; d[2]+=6; }
; t = 300; if (N >= t) { N -= t; d[2]+=3; }
; t = 200; if (N >= t) { N -= t; d[2]+=2; }
; t = 100; if (N >= t) { N -= t; d[2]+=1; }
; putchar(d[2])
;
; t = 60; if (N >= t) { N -= t; d[3]+=6; }
; t = 30; if (N >= t) { N -= t; d[3]+=3; }
; t = 20; if (N >= t) { N -= t; d[3]+=2; }
; t = 10; if (N >= t) { N -= t; d[3]+=1; }
; putchar(d[3])
; putchar(n)
; In each group of tests (i.e. the 3 or 4 tests for one digit),
; notice that at most 2 of the tests succeed for any given input:
; first digit: 0-6 = {none, 1, 2, 2+1, 4, 4+1, 4+2}
; other digits: 0-9 = {none, 1, 2, 3, 3+1, 3+2, 6, 6+1, 6+2, 6+3}
That's pretty much what my code does, except it does 8, 4, 2, 1 on each digit, not 6, 3, 2, 1. Does the lack of a 4, 2, 1 case speed things up significantly?
tepples wrote:
That's pretty much what my code does, except it does 8, 4, 2, 1 on each digit, not 6, 3, 2, 1. Does the lack of a 4, 2, 1 case speed things up significantly?
I think 6,3,2,1 is better than 8,4,2,1 for the following reason: If the digit turns out to be a 7, only two if-tests will succeed in my version (the 6-test and the 1-test). In the other version, the 4-test, 2-test and 1-test would all succeed, which means it would run slower.
That's what I meant by this bit:
Quote:
; In each group of tests (i.e. the 3 or 4 tests for one digit),
; notice that at most 2 of the tests succeed for any given input:
; first digit: 0-6 = {none, 1, 2, 2+1, 4, 4+1, 4+2}
; other digits: 0-9 = {none, 1, 2, 3, 3+1, 3+2, 6, 6+1, 6+2, 6+3}
Do you mind if I ask, how fast is your code? If speed is more important than size, unrolling it the way I did probably makes sense. The routine weighs around 245 bytes and ought to execute in well under 300 cycles (perhaps around 285 cycles??).
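One way to compare the two splits is to count how many trial subtractions succeed for each possible digit value. A small C sketch (my own illustration, not code from either routine):

```c
#include <assert.h>

/* Counts how many of the trial subtractions succeed when extracting a
   single decimal digit (0-9) with the given step values. A zero entry
   means "no test" (useful when a digit position uses fewer steps). */
static int tests_passed(int digit, const int steps[4])
{
    int count = 0;
    for (int i = 0; i < 4; i++) {
        if (steps[i] != 0 && digit >= steps[i]) {
            digit -= steps[i];
            count++;
        }
    }
    return count;
}
```

With the 6,3,2,1 split no digit ever needs more than two successful tests; with 8,4,2,1 the digit 7 needs three.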
mozz wrote:
tepples wrote:
That's pretty much what my code does, except it does 8, 4, 2, 1 on each digit, not 6, 3, 2, 1. Does the lack of a 4, 2, 1 case speed things up significantly?
I think 6,3,2,1 is better than 8,4,2,1 for the following reason: If the digit turns out to be a 7, only two if-tests will succeed in my version (the 6-test and the 1-test). In the other version, the 4-test, 2-test and 1-test would all succeed, which means it would run slower.
If the digit turns out to be an 8 or 4, how fast does your version go? And have you tried 6, 3, 2, 1 vs. 7, 4, 2, 1 (
what the US post office uses)? I just chose 8, 4, 2, 1 because it let me construct the number a bit at a time.
Quote:
Do you mind if I ask, how fast is your code?
About 700 cycles, but the way it's written allows it to be easily extended to 24-bit (up to 16 million) numbers as well.
Sorry for my almost useless reply; I have trouble understanding most of your algorithms, but I'd just say that in an actual game, speed isn't that important. Of course, a number shouldn't waste 25% of the frame just to be converted to decimal, but I think 500 cycles versus 700 cycles will make no difference in practice. Due to the limited amount of data you can upload to the PPU in VBlank, you'd never want to convert more numbers in the internal logic than those that will get refreshed on the display anyway.
Also, space isn't that important either, though I don't think unrolling loops is a big deal, because you'd never want to convert more than about 4 numbers per frame.
Quote:
About 700 cycles, but the way it's written allows it to be easily extended to 24-bit (up to 16 million) numbers as well.
That is quite important, on the other hand. In an RPG, there are a lot of large numbers involved, especially money-related numbers (your current funds and the price lists in shops). A 24-bit to 6-digit conversion routine will be required in most RPGs.
tepples wrote:
If the digit turns out to be an 8 or 4, how fast does your version go?
It looks like 2 tests would pass then (6+2=8, 3+1=4).
I'd have to agree with mozz that 6,3,2,1 is the slightly better way to go. When you look at the numbers:
Code:
digit   6,3,2,1 tests   8,4,2,1 tests
=====================================
  9           2               2
  8           2               1
  7           2               3
  6           1               2
  5           2               2
  4           2               1
  3           1               2
  2           1               1
  1           1               1
  0           0               0
total        14              15
It always comes down to either 1 or 2 checks... except for the worst-case '7' digit in the 8,4,2,1 setup, which requires a 3rd check that isn't necessary with the 6,3,2,1 setup.
Bregalad wrote:
In an RPG, there are a lot of large numbers involved, especially money-related numbers (your current funds and the price lists in shops). A 24-bit to 6-digit conversion routine will be required in most RPGs.
Unless you store the numbers in base 100. For instance, the number 314,159 stored in base 100 looks like this:
Code:
.byt 59, 41, 31
Arithmetic in base 100 is straightforward (untested code). Here's addition:
Code:
base100_addBtoA:
clc
lda numA
adc numB
cmp #100
bcc :+
sbc #100
:
sta numA
lda numA+1
adc numB+1
cmp #100
bcc :+
sbc #100
:
sta numA+1
lda numA+2
adc numB+2
cmp #100
bcc :+
sbc #100
:
sta numA+2
rts
Conversion of a byte of base 100 into a pair of digits is as easy as 2 table lookups: a /10 table and a mod 10 table.
Tetramino uses base 100 for the lower 2 digits of the score and base 256 for the rest.
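The two-lookup conversion described above can be sketched in C. On the NES the two 100-entry tables would be precomputed in ROM; building them in a loop here is just for illustration, and this is not code from Tetramino:

```c
#include <assert.h>
#include <stdint.h>

/* A base-100 byte (0-99) splits into two decimal digits with one /10
   lookup and one mod-10 lookup; no division happens at display time. */
static uint8_t div10_table[100];
static uint8_t mod10_table[100];

static void init_tables(void)
{
    for (int i = 0; i < 100; i++) {
        div10_table[i] = (uint8_t)(i / 10);
        mod10_table[i] = (uint8_t)(i % 10);
    }
}

static void base100_byte_to_digits(uint8_t b, uint8_t *tens, uint8_t *ones)
{
    *tens = div10_table[b];
    *ones = mod10_table[b];
}
```

The byte 59 from the `.byt 59, 41, 31` example splits into the digits 5 and 9.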
Quote:
Tetramino uses base 100 for the lower 2 digits of the score and base 256 for the rest.
Sounds complicated. I'd rather go for either one method or the other, but not a mix of both.
I still think it is nicer to have numbers in binary, because in RPG shops it is easier to multiply the price by the number of items when the player buys something, compare whether the player has enough money to buy stuff, and add/subtract amounts from the main money total.
Of course it is absolutely possible to do everything in BCD, but I just prefer plain binary, and this decreases the risk of math errors. I don't think players will be happy to figure out that they aren't actually paying the right price for items just because the programmer got his BCD arithmetic wrong and didn't notice. Of course that shouldn't happen as long as the programmer knows what he's doing, but I'm not too good at this kind of stuff. Binary-to-decimal routines can be tested pretty easily by running a large range of numbers through them, and binary math is just easier.
Bregalad wrote:
That is quite important, on the other hand. In an RPG, there are a lot of large numbers involved, especially money-related numbers (your current funds and the price lists in shops). A 24-bit to 6-digit conversion routine will be required in most RPGs.
Even simpler is to notice the score counters in most games: it's impossible to score 1 point. Actually, you do score one point; it's just that there are always 2 or more zeros at the end of the score. Seems to be pretty standard.
Don't know if that would apply as much to RPGs though, you might want to find 36 gold but still have the cheapest item in the store be 100 gold.
Bregalad wrote:
Quote:
Tetramino uses base 100 for the lower 2 digits of the score and base 256 for the rest.
Sound like complicated. I'd go rather for either one or the other method, but not a mix of both.
It actually works rawther well in Tetramino because only soft drop (down on Control Pad) and hard drop (up on Control Pad) can modify the lower digits. Line clearing scores are always 100, 300, 700, or 1500, partly for the reasons that Memblers pointed out.
Memblers wrote:
Even simpler is that you notice stuff like the score counters on most games, it's impossible to score 1 point. Because you do score one point, it's just that there's always 2 or more zeros at the end of the score. Seems to be pretty standard.
Well, it seems so in most games. But I find that pretty stupid, so I prefer using unrounded numbers. I have monsters drop 225 points in the game I'm writing right now; that doesn't look so wrong. Mostly 5/0 will be used for the least significant digit, but at least I'm going to make full use of my 6-digit BCD score variable. (Yes, I do everything in BCD because the score is the only number with more than two digits displayed, but I wouldn't do that in an RPG, where a lot of numbers are displayed.)
I have another idea for a BinToDec routine; its speed varies, though.
Anyone can come up with code for this idea, I don't care. I don't even know if I'm the first to come up with this.
Take a value like $FFFF. See how many times $2710 goes into it. Multiply that result by $6000. Let's see, that would be...
$24000
Now let's see how many times $3E8 goes into it. That times $600 would be
$18600
How many times does $64 go into it? Multiply that by $60, and get
$F5A0
Now, last but not least, how many times does $A go into it? Multiply that by $6 and get:
$9996
Add those 4 numbers together
Code:
$24000
$18600
$F5A0
$9996
And get $55536.... Now what? Add the original hex value to it ($FFFF), and get $65535. Isn't that cool? It's just too bad that the idea may not make a faster routine... You can try, it's a new idea at least. (I think...).
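The trick generalizes to any 16-bit value, not just $FFFF: adding (n/10)*6 + (n/100)*$60 + (n/1000)*$600 + (n/10000)*$6000 to n yields a hex number whose digits read as n's decimal digits. Here is the identity checked in C (whether the divisions can be made fast on a 6502 is a separate question):

```c
#include <assert.h>
#include <stdint.h>

/* Celius's identity: each hex digit position holds 16 where decimal
   needs only 10, so adding 6 * (n / 10^k) * 16^(k-1) at every position
   turns a binary value into a BCD-style result. */
static uint32_t decimal_as_hex(uint16_t n)
{
    uint32_t sum = (uint32_t)(n / 10000u) * 0x6000u
                 + (uint32_t)(n / 1000u)  * 0x600u
                 + (uint32_t)(n / 100u)   * 0x60u
                 + (uint32_t)(n / 10u)    * 0x6u;
    return sum + n;  /* hex digits of the result = decimal digits of n */
}
```

For the HP example from the start of the thread, $2000 (8192) comes out as $8192.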
Celius: "See how many times $2710 goes into it" means using division, which on the 6502 is either repeated subtraction or long division. On 6502, generic 16 bit/16 bit division is a comparatively slow operation, and so is 16 bit*16 bit multiplication. Some of the BCD routines posted here are about as fast as
one such division.
Your divisors are all powers of ten:
- $2710 is 10000
- $03E8 is 1000
- $0064 is 100
- $000A is 10
That's a start. But if you select your divisors such that the quotient is always 0 or 1, each division reduces to a single subtraction. For example, the divisors might be 40000, 20000, 10000, 8000, 4000, 2000, 1000, etc. Then each comparison result becomes one bit of the BCD output. This is how my code works, and mozz's works similarly.
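The divisor chain described above can be modeled in C: each trial subtraction that succeeds contributes exactly one bit of a packed-BCD result. This is a sketch of the idea, not a transcription of the bcd.s routine:

```c
#include <assert.h>
#include <stdint.h>

/* Trial-subtract 8, 4, 2, 1 times each power of ten; every comparison
   that succeeds sets one bit of the packed-BCD output, so no general
   division is ever needed. */
static uint32_t to_packed_bcd(uint16_t n)
{
    uint32_t bcd = 0;
    uint32_t rem = n;
    uint32_t p10 = 10000;
    for (int digit = 4; digit >= 1; digit--) {
        for (int bit = 3; bit >= 0; bit--) {
            uint32_t t = p10 << bit;       /* 8,4,2,1 times 10^digit */
            if (rem >= t) {
                rem -= t;
                bcd |= 1u << (4 * digit + bit);
            }
        }
        p10 /= 10;
    }
    return bcd | rem;                      /* remainder 0-9 is the ones digit */
}
```

Each digit costs at most four compare-and-subtract steps, which is why the whole conversion stays cheap.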
I was about halfway done working on a routine when I posted. I didn't think my code would really be fast, but I thought maybe someone else would come up with better code based on my idea. What I was doing was checking how many 10000s went into a number; if 1 went in, that means 10 1000s go into it, 100 100s go into it, and 1000 10s go into it. When no more 10000s went in, I'd check how many 1000s go into what was remaining. If 1 1000 goes into it, that means 10 more 100s go into it, and 100 more 10s go into it. You get the point, I'm sure. I wouldn't do a whole check for every divisor: by knowing that 10 100s go into it as well as 1 1000, I'm saving A LOT of time. It didn't seem like it was going to be THAT slow; probably around 600 cycles. But my other routine's faster, so I wouldn't use it.
By the way, I sort of got what you were saying, but I'm just curious, why would you divide a 4 hex-digit number by a 5 hex-digit number? Maybe I wasn't getting what you were saying. Like $40000, why would you use that as the divisor in a 16-bit BinToDec routine?
Celius wrote:
tepples wrote:
But if you select your divisors such that the quotient is always 0 or 1, each division reduces to a single subtraction. For example, the divisors might be 40000, 20000, 10000, 8000, 4000, 2000, 1000, etc. Then each comparison result becomes one bit of the BCD output.
By the way, I sort of got what you were saying, but I'm just curious, why would you divide a 4 hex-digit number by a 5 hex-digit number? Maybe I wasn't getting what you were saying. Like $40000, why would you use that as the divisor in a 16-bit BinToDec routine?
I said 40000 ($9C40), not 262144 ($40000). Did I slip in a $ by mistake? If you prefer reading hex, the sequence is
- Digit 10^4: $9C40, $4E20, $2710
- Digit 10^3: $1F40, $FA0, $7D0, $3E8
- Digit 10^2: $320, $190, $C8, $64
- Digit 10^1: $50, $28, $14, $0A
- Digit 10^0: Remainder
No, it's just that you jumped from 10000 to 8000, so it seemed to be hex. I don't know why you'd go from 10000 to 8000 unless it was hex.
Here's how my algorithm works. Suppose it prints 3-digit numbers and I want it to print the number N = 730.
It would start with a digit D = 0, and assuming that 0 <= N <= 999 (i.e. no more than 3 digits).
First, it would check if N >= 600. If so, it would decrease N by 600, and increase D by 6. Now you have a number 0 <= N <= 399 (in our case, it would be 130).
Next you check if N >= 300. If so, decrease N by 300, and increase D by 3. 130 fails this test so we do nothing here. We now know N <= 299.
Then you check if N >= 200. Again, 130 fails this test. By this point we know that N <= 199.
Then you check for N >= 100. Since it's 130, you decrease N by 100 and increase D by 1. Now you have N <= 99 (in other words, a two-digit number) and you know that the first digit was D (in our case, 7).
Repeat for the next digit, using 60, 30, 20, 10.
Anyway, it actually prints 5-digit numbers but that is the general idea.
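One digit of that walkthrough, written out in C (a toy illustration of the test sequence, not mozz's actual code):

```c
#include <assert.h>

/* Peels the hundreds digit off a value in 0..999 using the
   600/300/200/100 tests; at most two of the tests can succeed. */
static int hundreds_digit(int *n)
{
    int d = 0;
    if (*n >= 600) { *n -= 600; d += 6; }
    if (*n >= 300) { *n -= 300; d += 3; }
    if (*n >= 200) { *n -= 200; d += 2; }
    if (*n >= 100) { *n -= 100; d += 1; }
    return d;
}
```

For N = 730 this returns 7 and leaves 30 behind, exactly as in the walkthrough.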
I cannot understand why the comparison method would be so fast. It sounds rather silly to me.
The only one that sounds good to me is Celius's idea of first converting each hex nybble to decimal and then adding all the components in decimal, or simply doing it the other way around, as I've always done: first separate all the components of the decimal number by subtraction.
In other words, say you want to convert the number 765432 to decimal:
While the number is greater than or equal to the hex equivalent of 100000, subtract that and increase digit 6.
While the number is greater than or equal to the hex equivalent of 10000, subtract that and increase digit 5.
Etc., etc., down to digit 1.
This is accurate, and you can add as many digits as you want very easily.
Also it takes very little space to implement. Only con: it is kind of slow.
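The steps above can be sketched in C; the sketch shows both why the routine is small and why its running time depends on the digit values:

```c
#include <assert.h>
#include <stdint.h>

/* Repeatedly subtract each power of ten while it still fits, counting
   in the matching digit. The loop counts (and thus the cycles on a
   6502) grow with the digit values, which is the "kind of slow" part. */
static void repeated_subtraction(uint32_t n, uint8_t d[6])
{
    static const uint32_t weight[5] = { 100000, 10000, 1000, 100, 10 };
    for (int i = 0; i < 5; i++) {
        d[i] = 0;
        while (n >= weight[i]) {
            n -= weight[i];
            d[i]++;
        }
    }
    d[5] = (uint8_t)n;
}
```

Converting 765432 takes 7+6+5+4+3 subtraction iterations, while 111111 takes only five.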
This is a very good idea, and not terribly large when implemented unrolled for maximum speed.
Thanks to that, I finally understand tepples' code too. I really don't know what would be best: having more than 2 subtractions (in the case of a 7) but getting a bit that can be used to build the digit directly (as in tepples' routine), or minimizing the subtractions but having to actually add values to the digit (as in the routine mozz presented). What would be quicker: 2 additions or 4 shifts?
Bregalad wrote:
Only con, it is kind of slow.
I have another con: execution time varies A LOT! Converting 00000 will be insanely faster than converting 59999. Can you imagine a game that shows a lot of numbers, and it starts out fast and smooth, but as the player advances and collects more items, increases the score, etc., the game gets slower... OK, this is a bit drastic. I know that even if you have a lot of numbers on screen, it's not like you'll convert them all every frame. It's just that I hate huge variations in execution time, so that's a big con for me.
Anyway, even strategy games and RPGs which show a lot of numbers will only show a few at the same time, or refresh a few in VBlank. So even in the worst case, where you have to convert several numbers like '999999' to show them in VBlank, it will probably not take more time than a dozen scanlines, which sounds about right.
Celius wrote:
No, it's just that you jumped from 10000 to 8000, so it seemed to be hex. I don't know why you'd go from 10000 to 8000 unless it was hex.
Each number that is trial-subtracted corresponds to one bit of the BCD output, if you haven't figured it out yet.
tokumaru wrote:
It's just that I hate huge variations in execution time, so that's a big con for me.
So test your game with all values at maximum. For games maintaining a constant frame rate, it is often better for the worst-case execution of a routine to be improved, even if it reduces the best-case performance.
Mozz, your algorithm works very much like mine did way back early in the topic. What I did was check how many 10000s went in, then 1000s, then 100s, 10s, then 1s. I did it kind of crappily, and your routine is obviously much better, but it's the same idea, kind of. My old one converted $270F in 1200 cycles; the higher the number, the slower it was, so it would be REALLY slow for $FFFF. Now that I think about it, I think it only did up to 4 digits...
I get your idea, mozz. I'm almost sure I understand yours, tepples. I just have to do a manual test to see if I have it.
Celius wrote:
Mozz, your algorithm works very much like mine did way way early on in the topic. What I did was I checked how many 10000s went in, then 1000s, then 100s, 10s, then 1s.
That's similar to the routine Bregalad uses, not mozz. The ideas presented by mozz and tepples are much faster. To represent 9999 your first routine and Bregalad's routine will perform 9 subtractions per digit. Their routines would perform only 2 subtractions per digit, and that's a very big difference.
I know that they do bigger subtractions so there are only one or two per digit, but it's the same idea of increasing a place's digit by a certain amount depending on what you subtracted. It's kind of a dumb comparison to his routine, but whatever.
Mozz's algorithm interested me so I put together a working version and tested it thoroughly. The current interface accepts a 16 or 8 bit value and outputs a 2, 3, 4, or 5 byte decimal value. It assembles to 251 bytes and has a BSD-style license which poses almost no restrictions on use. Here is the source and a test program which assemble cleanly with NESASM:
xa_to_decimal.asm
xa_to_decimal_test.asm
UPDATE: Added ability to specify ASCII or other output without any added overhead (adds constant to each output byte, except the ones). Also added entry points to get fewer output digits, i.e. if your number is in the range 0-999, you could ask for only 3 decimal digits and save 100 clocks. This makes it very flexible, and didn't hurt speed at all.
xa_to_5_digits
XA to 5-byte decimal in ten_thousands, thousands, Y, X, A
Best: 99, Average: 169.5, Worst: 231
xa_to_4_digits
XA to 4-byte decimal in thousands, Y, X, A
Best: 79, Average: 127.6, Worst: 173
xa_to_3_digits
XA to 3-byte decimal in Y, X, A
Best: 54, Average: 74.8, Worst: 94
a_to_3_digits
A to 3-byte decimal in Y, X, A
Best: 43, Average: 51.5, Worst: 60
a_to_2_digits
A to 2-byte decimal in X, A (Y is unaffected)
Best: 28, Average: 34.4, Worst: 40
Any chance of a 24-bit version?
You know, because even 16-bit values have their limits...
Quietust wrote:
Any chance of a 24-bit version?
I was wondering that myself. I'm guessing this will significantly slow down the routine. I mean, it's a jump from 5 digits to 8 digits.
I'll probably attempt to implement yxa_to_5_digits on up through yxa_to_8_digits. The main problem is that these exceed the 6502's registers, so they will require lots of swapping with memory. The comparisons will also be more involved with three bytes. It'd be nice to have a whole family of routines through 24 bits.
Yeah, and I liked your idea of entry points, where you can use the same routine for a range of different precisions as needed. And a big routine could always be cut down depending on how large your numbers are.
Extending to 24-bit input and 8 decimal digits out didn't turn out as bad as I thought. Here are the new routines I added and their timings (and the previous routines, for completeness):
Code:
yxa_to_8_digits: Best: 186, Average: 344, Worst: 475
yxa_to_7_digits: Best: 170, Average: 322, Worst: 440
yxa_to_6_digits: Best: 139, Average: 254, Worst: 348
yxa_to_5_digits: Best: 104, Average: 179, Worst: 236
xa_to_5_digits: Best: 99, Average: 170, Worst: 231
xa_to_4_digits: Best: 79, Average: 128, Worst: 173
xa_to_3_digits: Best: 54, Average: 75, Worst: 94
a_to_3_digits: Best: 43, Average: 52, Worst: 60
a_to_2_digits: Best: 28, Average: 34, Worst: 40
Support for 6 or more digits is in yxa_to_decimal.asm, and the original source has been modified as well:
xa_to_decimal.asm
yxa_to_decimal.asm
Unfortunately the test program had to become more extensive in order to try every possible 24-bit input value. I ended up adding my NES CPU emulator and having it run the routine for every value. This allowed the test program to conveniently generate the timing information automatically. For the 8 digit output, it took 2.5 minutes for the program to run, which translates to about 53 minutes of emulated NES time! (average of 344 clocks * $FFFFFF / 1789773 = 3224 seconds)
Quote:
I liked your idea of entry points, where you can use the same routine for a range of different precisions as needed. And a big routine could always be cut down depending on how large your numbers are.
Yeah, the multiple entry points was a lucky side-effect of the algorithm. As you say, any of the upper digit code can be removed if you don't need it in a particular program.
I liked the method you used, but not the high code size (613 bytes) or being forced to use all the available registers to hold values (which can get messy and confusing at times), so I threw together a looped routine that works on the same principle but takes up less space. Code size came out at 280 bytes, and while it is significantly slower than your version, it's still way faster than the one I posted earlier. It currently doesn't support multiple entry points for variable output digits, but that's easy enough to add by replacing the initial LDY in each routine with a different value corresponding to the proper offsets in the data tables.
This one assembles with CA65:
Code:
.define HEX_BYTE2(val) .LOBYTE(.HIWORD(val))
.define HEX_BYTE1(val) .HIBYTE(.LOWORD(val))
.define HEX_BYTE0(val) .LOBYTE(.LOWORD(val))
hex2bcd_output: .res 8
hex2bcd_input: .res 3
.proc hex2bcd_init
LDY #7
: STA hex2bcd_output,Y
DEY
BPL :-
RTS
.endproc
.proc hex2bcd_24bit
LDY #0
loop: LDA hex2bcd_otable,Y
BMI done
TAX
SEC
LDA hex2bcd_input+0
SBC hex2bcd_b0table,Y
PHA
LDA hex2bcd_input+1
SBC hex2bcd_b1table,Y
PHA
LDA hex2bcd_input+2
SBC hex2bcd_b2table,Y
BCS okay
PLA
PLA
BCC end
okay: STA hex2bcd_input+2
PLA
STA hex2bcd_input+1
PLA
STA hex2bcd_input+0
LDA hex2bcd_output,X
CLC
ADC hex2bcd_vtable,Y
STA hex2bcd_output,X
end: INY
BNE loop
done: LDA hex2bcd_input+0
CLC
ADC hex2bcd_output+7
STA hex2bcd_output+7
RTS
.endproc
.proc hex2bcd_16bit
LDY #9
loop: LDA hex2bcd_otable,Y
BMI done
TAX
SEC
LDA hex2bcd_input+0
SBC hex2bcd_b0table,Y
PHA
LDA hex2bcd_input+1
SBC hex2bcd_b1table,Y
BCS okay
PLA
BCC end
okay: STA hex2bcd_input+1
PLA
STA hex2bcd_input+0
LDA hex2bcd_output,X
CLC
ADC hex2bcd_vtable,Y
STA hex2bcd_output,X
end: INY
BNE loop
done: LDA hex2bcd_input+0
CLC
ADC hex2bcd_output+7
STA hex2bcd_output+7
RTS
.endproc
.proc hex2bcd_8bit
LDY #19
loop: LDA hex2bcd_otable,Y
BMI done
TAX
SEC
LDA hex2bcd_input+0
SBC hex2bcd_b0table,Y
BCC end
okay: STA hex2bcd_input+0
LDA hex2bcd_output,X
CLC
ADC hex2bcd_vtable,Y
STA hex2bcd_output,X
end: INY
BNE loop
done: LDA hex2bcd_input+0
CLC
ADC hex2bcd_output+7
STA hex2bcd_output+7
RTS
.endproc
hex2bcd_b2table:
.byte HEX_BYTE2(10000000)
.byte HEX_BYTE2( 6000000),HEX_BYTE2( 3000000),HEX_BYTE2( 2000000),HEX_BYTE2( 1000000)
.byte HEX_BYTE2( 600000),HEX_BYTE2( 300000),HEX_BYTE2( 200000),HEX_BYTE2( 100000)
.byte HEX_BYTE2( 60000),HEX_BYTE2( 30000),HEX_BYTE2( 20000),HEX_BYTE2( 10000)
.byte HEX_BYTE2( 6000),HEX_BYTE2( 3000),HEX_BYTE2( 2000),HEX_BYTE2( 1000)
.byte HEX_BYTE2( 600),HEX_BYTE2( 300),HEX_BYTE2( 200),HEX_BYTE2( 100)
.byte HEX_BYTE2( 60),HEX_BYTE2( 30),HEX_BYTE2( 20),HEX_BYTE2( 10)
hex2bcd_b1table:
.byte HEX_BYTE1(10000000)
.byte HEX_BYTE1( 6000000),HEX_BYTE1( 3000000),HEX_BYTE1( 2000000),HEX_BYTE1( 1000000)
.byte HEX_BYTE1( 600000),HEX_BYTE1( 300000),HEX_BYTE1( 200000),HEX_BYTE1( 100000)
.byte HEX_BYTE1( 60000),HEX_BYTE1( 30000),HEX_BYTE1( 20000),HEX_BYTE1( 10000)
.byte HEX_BYTE1( 6000),HEX_BYTE1( 3000),HEX_BYTE1( 2000),HEX_BYTE1( 1000)
.byte HEX_BYTE1( 600),HEX_BYTE1( 300),HEX_BYTE1( 200),HEX_BYTE1( 100)
.byte HEX_BYTE1( 60),HEX_BYTE1( 30),HEX_BYTE1( 20),HEX_BYTE1( 10)
hex2bcd_b0table:
.byte HEX_BYTE0(10000000)
.byte HEX_BYTE0( 6000000),HEX_BYTE0( 3000000),HEX_BYTE0( 2000000),HEX_BYTE0( 1000000)
.byte HEX_BYTE0( 600000),HEX_BYTE0( 300000),HEX_BYTE0( 200000),HEX_BYTE0( 100000)
.byte HEX_BYTE0( 60000),HEX_BYTE0( 30000),HEX_BYTE0( 20000),HEX_BYTE0( 10000)
.byte HEX_BYTE0( 6000),HEX_BYTE0( 3000),HEX_BYTE0( 2000),HEX_BYTE0( 1000)
.byte HEX_BYTE0( 600),HEX_BYTE0( 300),HEX_BYTE0( 200),HEX_BYTE0( 100)
.byte HEX_BYTE0( 60),HEX_BYTE0( 30),HEX_BYTE0( 20),HEX_BYTE0( 10)
hex2bcd_otable:
.byte 0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,$FF
hex2bcd_vtable:
.byte 1,6,3,2,1,6,3,2,1,6,3,2,1,6,3,2,1,6,3,2,1,6,3,2,1
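For readers more comfortable with a high-level language, here is a rough Python model of what the looped routine above does. The table layout and digit indexing are my reading of the CA65 code (otable/vtable and the b0–b2 tables), so treat this as a sketch rather than a translation:

```python
# Rough model of the table-driven loop: each row pairs an output
# digit index with an amount to subtract and the value to add to
# that digit.  Rows are tried once each, from largest to smallest.
TABLE = []
for power in range(7, 0, -1):               # 10^7 down to 10^1
    values = [1] if power == 7 else [6, 3, 2, 1]
    for v in values:
        TABLE.append((7 - power, v * 10 ** power, v))

def hex2bcd(n):
    """Convert 0 <= n < 2**24 to eight decimal digits, MSD first."""
    digits = [0] * 8
    for idx, sub, val in TABLE:
        if n >= sub:
            n -= sub
            digits[idx] += val
    digits[7] = n                            # what's left is the ones digit
    return digits
```

For example, `hex2bcd(0x2000)` yields the digits of 8192, matching the value from the start of the thread.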
How hard do you think it would be to update your test program to test and profile these functions? When I timed the 24-bit version I got around 1500 cycles for a 5-digit number, but then I noticed that b2table was crossing a page boundary.
Oh, I didn't follow any of that! Are you guys still using the method that subtracts constant numbers and compares them in order to find each digit (I'm not sure I understand this perfectly)?
Using the A, X and Y registers to pass the input number isn't optimal in my opinion.
The best way is to have only a pointer in zero page that points to where the binary number to be converted sits in memory, and to call the routine at a different label depending on the number of digits you want.
Just compare the two following pieces of code:
Code:
lda #<Number
sta PointerL
lda #>Number
sta PointerH
jsr ConvertBinNmr
or
Code:
lda Number+2
tay
lda Number+1
tax
lda Number
jsr ConvertBinNmr
Well, they aren't so different, but I just find the first one looks more professional. A totally subjective preference, though.
Bregalad wrote:
Well, they aren't so different, but I just find the first one looks more professional. A totally subjective preference, though.
I'd say it's a good opportunity for writing a macro for calling the routine, you can have it look as professional as you want while still being the fastest and smallest method (and only take up one line of source, which is great if you do it a lot, less chance for typos also).
Why does your 2nd example do LDA Number+2 / TAY, by the way? Just do LDY Number+2.
Nice work everyone, and thanks. Maybe some day I'll have a program now that displays actual decimal numbers from hex (I can think of plenty of uses, XMODEM transfer status for one).
Quietust's table-based version passed the exhaustive test. I then optimized it to run about twice as fast and fit in one 256-byte page. Here is the original source and three progressively optimized versions, with comments at the top about all the changes:
hex2bcd_opt.zip
This code could also have multiple entry points added for the desired number of output digits. Maybe I'll do that tomorrow.
UPDATE ... or today.
I think I've pretty much beaten this to death here. I added entry points for arbitrary numbers of digits, and an intermediate optimized version with just the 24-bit code, if size is your biggest concern (it comes in at 192 bytes). The other fast code from yesterday is also included.
to_decimal_asm.zip
Summary so far:
Code:
xa_to_decimal.asm:
yxa_to_8_digits: Best: 186, Average: 344, Worst: 475
yxa_to_7_digits: Best: 170, Average: 322, Worst: 440
yxa_to_6_digits: Best: 139, Average: 254, Worst: 348
yxa_to_5_digits: Best: 104, Average: 179, Worst: 236
xa_to_5_digits: Best: 99, Average: 170, Worst: 231
xa_to_4_digits: Best: 79, Average: 128, Worst: 173
xa_to_3_digits: Best: 54, Average: 75, Worst: 94
a_to_3_digits: Best: 43, Average: 52, Worst: 60
a_to_2_digits: Best: 28, Average: 34, Worst: 40
610 bytes (279 if you leave out 6-8 digit support)
hex2bcd.asm
hex2bcd_24bit_8dig Best: 600, Average: 815, Worst: 940
hex2bcd_24bit_7dig Best: 572, Average: 777, Worst: 877
hex2bcd_24bit_6dig Best: 452, Average: 608, Worst: 687
hex2bcd_24bit_5dig Best: 332, Average: 439, Worst: 497
hex2bcd_16bit_5dig Best: 288, Average: 383, Worst: 441
hex2bcd_16bit_4dig Best: 222, Average: 289, Worst: 329
hex2bcd_16bit_3dig Best: 130, Average: 165, Worst: 191
hex2bcd_8bit_3dig Best: 55, Average: 67, Worst: 78
hex2bcd_8bit_2dig Best: 48, Average: 55, Worst: 60
277 bytes
hex2bcd_small.asm
hex2bcd_24bit_8dig: Best: 768, Average: 1073, Worst: 1226
hex2bcd_24bit_7dig: Best: 741, Average: 1036, Worst: 1164
hex2bcd_24bit_6dig: Best: 621, Average: 867, Worst: 974
hex2bcd_24bit_5dig: Best: 501, Average: 698, Worst: 784
hex2bcd_16bit_5dig: Best: 471, Average: 661, Worst: 754
hex2bcd_16bit_4dig: Best: 381, Average: 529, Worst: 594
hex2bcd_16bit_3dig: Best: 261, Average: 360, Worst: 404
hex2bcd_8bit_3dig: Best: 201, Average: 270, Worst: 309
hex2bcd_8bit_2dig: Best: 141, Average: 191, Worst: 214
192 bytes
See? Get a few smart people interested in solving a problem, and look what happens!
Blargg and Quietust, thank you for following through on this. Now everybody who actually writes their own NES games can benefit. This is exactly what I hoped would happen.
Edit: @Bregalad and Celius: the algorithm I proposed works on an idea similar to binary search (see the Wikipedia article).
For each digit, it essentially does a binary search for the value of the digit. With 10 possible values for the digit, you need to do at most ceil(log2(10)) == 4 tests to figure out which one it is.
One subtle part is this: once you know the value of the digit, you have to subtract it from N anyway so that you can get on with figuring out the next digit. So rather than handle the "high" case and the "low" case differently for each comparison, I keep things simple by just subtracting as I go! That effectively makes both cases the same.
So if you have 4 digits' worth of number to figure out, and you start out comparing the number to 6000, then either it's less than 6000 and we do nothing (the low case), or it's >= 6000 (the high case), in which case we subtract that 6000 right away and add 6 to the temporary where we store the value of the digit we will output. Because of the subtraction, the remaining part of the number is now less than 4000 whether we subtracted 6000 from it or not, which makes it simple to do the next step (comparing it to 3000).
The other subtle part, which Disch elaborated on earlier, is how I chose the 4 points to compare the digit to such that at most two of the comparisons will succeed. Unfortunately, 3 comparisons are not enough in some cases (they could only distinguish 2^3 == 8 different values). If we had 16 possible digits to look for, we would need the four points to be 8, 4, 2, 1, but since there are only 10 possible digits the result might be, 6, 3, 2, 1 is slightly faster when the digit happens to be a 7 (because 6+1 = 7, whereas the other case is 4+2+1 = 7).
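To make the idea concrete, here is a small Python sketch of the subtract-as-you-go scheme for a 4-digit number. This is my own illustration of the algorithm described above, not a translation of any of the posted 6502 code:

```python
def to_digits(n):
    """Extract the decimal digits of 0 <= n <= 9999 using at most
    four subtract-and-test steps (6, 3, 2, 1) per digit place."""
    digits = []
    for place in (1000, 100, 10, 1):
        d = 0
        for k in (6, 3, 2, 1):
            if n >= k * place:      # the "high" case: subtract as we go
                n -= k * place
                d += k
        digits.append(d)
    return digits
```

With the example from the start of the thread, `to_digits(8192)` returns `[8, 1, 9, 2]`.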
mozz wrote:
the algorithm I proposed works on a similar idea to binary search (see wikipedia article). For each digit, it essentially does a binary search for the value of the digit. With 10 possible values for the digit, you need to do at most ceil(log2(10)) == 4 tests to figure out which one it is.
I actually started out by trying to implement a binary search, but it was quite unwieldy.
Quote:
Once you know what the value of the digit is, you have to subtract that number from N anyways so that you can get on with figuring out the next digit. So rather than handle the "high" case and the "low" case differently for each comparison, I keep things simple by just subtracting as I go! That effectively makes both cases the same.
That didn't work in your code, since the subtraction of the low byte is bypassed when the high byte is greater than the comparison value. Below, when X > 23, it jumps to gt6000. In your code, the SBC would only have been done in place of the CMP #112 if X = 23. Also, the way I coded it, I didn't keep re-loading the lowest byte, so even if I found a way to use the SBC in place of a comparison, I'd have to undo the subtraction if the number turned out to be lower than the digit value.
Code:
cpx #23
bcc lt6000
bne gt6000
cmp #112
bcc lt6000
gt6000:
sbc #112
tay
txa
sbc #23
tax
tya
On the other hand, Quietust's hex2bcd (and my further optimizations of it) do as you describe: subtract the digit value from a temporary copy of the value, and if the result isn't negative, writes this new value back. I enjoyed the simplicity of this when working with hex2bcd.
Quote:
since we only have 10 possible digits that the result might be, 6,3,2,1 are slightly faster when the digit happens to be a 7 (because 6+1 = 7, where the other case is 4+2+1 = 7)
I delved deeper into this in some ways. I first noted that only 3 out of 10 digits end in a 1, therefore 70% of the time the last comparison will not match, so the code can be optimized for the loop ending with a non-match. Also, supporting multiple entry points, I used the digits 7, 4, 2, 1 for the ten thousands place (in both versions), allowing the 16-bit case to check digits 4, 2, 1, and the 24-bit case to check for digits 7, 4, 2, 1. This allowed the 24-bit code to jump to the 16-bit code after checking for 7. The 7, 4, 2, 1 set is almost as good as 6, 3, 2, 1, and doesn't ever use more than two for a given digit.
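A quick way to convince yourself that both digit sets behave as described is a throwaway Python check (my own verification aid, not part of any posted routine):

```python
def worst_case(parts):
    """Greedily subtract `parts` from each decimal digit 0..9;
    assert every digit is recovered exactly, and return the
    worst-case number of subtractions that fire."""
    worst = 0
    for d in range(10):
        n, fired = d, 0
        for k in parts:
            if n >= k:              # subtract-as-you-go, largest first
                n -= k
                fired += 1
        assert n == 0, (parts, d)   # the digit must decompose exactly
        worst = max(worst, fired)
    return worst
```

Both `worst_case((6, 3, 2, 1))` and `worst_case((7, 4, 2, 1))` come out to 2, confirming that neither set ever needs more than two subtractions for a single digit.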
The code still needs to be documented a bit better, and perhaps hex2bcd should have the multiple entry points moved to another file (or added as macros, as I did to hex2bcd_small), since they bloat it a bit.
blargg wrote:
That didn't work in your code since the subtraction of the low byte is bypassed when the high byte is greater than the comparison value. Below, when X > 23, it jumps to gt6000. In your code, the SBC would only have been done in place of the CMP #112 if X = 23.
Whoops! That's what happens when I write code at three in the morning and never test it.
blargg wrote:
I delved deeper into this in some ways. I first noted that only 3 out of 10 digits end in a 1, therefore 70% of the time the last comparison will not match, so the code can be optimized for the loop ending with a non-match. Also, supporting multiple entry points, I used the digits 7, 4, 2, 1 for the ten thousands place (in both versions), allowing the 16-bit case to check digits 4, 2, 1, and the 24-bit case to check for digits 7, 4, 2, 1. This allowed the 24-bit code to jump to the 16-bit code after checking for 7. The 7, 4, 2, 1 set is almost as good as 6, 3, 2, 1, and doesn't ever use more than two for a given digit.
That is very clever. I can see how 24-bit support might be useful, e.g. for experience points in an RPG. Having entry points for a known number of digits is a great idea too. Note that if you aren't sure how large the number is but it's usually small, you could do one short-circuit test at the start of the routine (e.g. if the high byte is zero, jump over a bunch of code).
blargg wrote:
Quietust's table-based version passed the exhaustive test. I then optimized it to run about twice as fast and fit in one 256-byte page. Here is the original source and three progressively optimized versions, with comments at the top about all the changes:
hex2bcd_opt.zip
Out of curiosity, why do you handle the 40000 case within the 24-bit code and then skip it when falling through to 16-bit? Is there something gained by not handling it directly within the 16-bit case? And why don't you do the same thing when going from 16-bit to 8-bit (you end up doing the 200 case twice)?
For reference, the following code appears to work properly and is only 233 bytes long. When I ran it on the 24-bit number 0, it took only 581 cycles to complete, though I don't know whether larger numbers were made slower in the process.
Code:
hex2bcd_output: .res 8
hex2bcd_input: .res 3
.proc hex2bcd_init
LDY #7
: STA hex2bcd_output,Y
DEY
BPL :-
RTS
.endproc
.scope hex2bcd
conv_24bit:
LDY #9
CLC
loop1: LDA hex2bcd_input+0
SBC b0table+9,Y
LDA hex2bcd_input+1
SBC b1table+9,Y
TAX
LDA hex2bcd_input+2
SBC b2table,Y
BCS good1
DEY
BPL loop1
BMI conv_16bit
good1: STA hex2bcd_input+2
STX hex2bcd_input+1
LDX otable+9,Y
LDA hex2bcd_output,X
ADC vtable,Y
STA hex2bcd_output,X
LDA hex2bcd_input+0
SBC b0table+9,Y
STA hex2bcd_input+0
CLC
DEY
BPL loop1
conv_16bit:
LDY #8
CLC
loop2: LDA hex2bcd_input+0
SBC b0table,Y
TAX
LDA hex2bcd_input+1
SBC b1table,Y
BCS good2
DEY
BPL loop2
BMI conv_8bit
good2: STX hex2bcd_input+0
STA hex2bcd_input+1
LDX otable,Y
LDA hex2bcd_output,X
ADC vtable+3,Y
STA hex2bcd_output,X
DEY
BPL loop2
conv_8bit:
LDA hex2bcd_input+0
CMP #200
BCC :+
SBC #200
INC hex2bcd_output+5
INC hex2bcd_output+5
: CMP #100
BCC :+
SBC #100
INC hex2bcd_output+5
: LDX #0
CMP #60
BCC :+
SBC #60
LDX #6
: CMP #30
BCC :+
SBC #30
INX
INX
INX
: CMP #20
BCC :+
SBC #20
INX
INX
: CMP #10
BCC :+
SBC #10
INX
CLC
: ADC hex2bcd_output+7
STA hex2bcd_output+7
TXA
ADC hex2bcd_output+6
STA hex2bcd_output+6
RTS
vtable: .byte 6,0,1,2,5,0,1,2,5,0,1,3
otable: .byte 5,5,4,4,4,4,3,3,3,3,2,2,2,2,1,1,1,1,0
.define HEX_BYTE2(val) .LOBYTE(.HIWORD(val))
.define HEX_BYTE1(val) .HIBYTE(.LOWORD(val))
.define HEX_BYTE0(val) (.LOBYTE(.LOWORD(val))-1)
b0table:
.byte HEX_BYTE0( 300),HEX_BYTE0( 600)
.byte HEX_BYTE0( 1000),HEX_BYTE0( 2000),HEX_BYTE0( 3000),HEX_BYTE0( 6000)
.byte HEX_BYTE0( 10000),HEX_BYTE0( 20000),HEX_BYTE0( 40000),HEX_BYTE0( 70000)
.byte HEX_BYTE0( 100000),HEX_BYTE0( 200000),HEX_BYTE0( 300000),HEX_BYTE0( 600000)
.byte HEX_BYTE0( 1000000),HEX_BYTE0( 2000000),HEX_BYTE0( 3000000),HEX_BYTE0( 6000000)
.byte HEX_BYTE0(10000000)
b1table:
.byte HEX_BYTE1( 300),HEX_BYTE1( 600)
.byte HEX_BYTE1( 1000),HEX_BYTE1( 2000),HEX_BYTE1( 3000),HEX_BYTE1( 6000)
.byte HEX_BYTE1( 10000),HEX_BYTE1( 20000),HEX_BYTE1( 40000),HEX_BYTE1( 70000)
.byte HEX_BYTE1( 100000),HEX_BYTE1( 200000),HEX_BYTE1( 300000),HEX_BYTE1( 600000)
.byte HEX_BYTE1( 1000000),HEX_BYTE1( 2000000),HEX_BYTE1( 3000000),HEX_BYTE1( 6000000)
.byte HEX_BYTE1(10000000)
b2table:
.byte HEX_BYTE2( 70000)
.byte HEX_BYTE2( 100000),HEX_BYTE2( 200000),HEX_BYTE2( 300000),HEX_BYTE2( 600000)
.byte HEX_BYTE2( 1000000),HEX_BYTE2( 2000000),HEX_BYTE2( 3000000),HEX_BYTE2( 6000000)
.byte HEX_BYTE2(10000000)
.endscope
hex2bcd_24bit = hex2bcd::conv_24bit
hex2bcd_16bit = hex2bcd::conv_16bit
hex2bcd_8bit = hex2bcd::conv_8bit
Quote:
Out of curiosity, why do you handle the 40000 case within the 24-bit code and then skip it when falling through to 16-bit?
The ten thousand values subtracted are 70000, 40000, 20000, and 10000. In the 24-bit case, once it's eliminated 70000, the number could still be 69999 ($01116F). If it jumped to the 16-bit code immediately, the still non-zero most significant byte would be ignored, making the value look like 4463. Of course I had the benefit of the validator, so I could just adjust the overlap until it worked, then figure out why. :)
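The failure mode blargg describes is easy to demonstrate outside of 6502 code with a few lines of Python (just an illustration of the arithmetic, not part of the routine):

```python
# 69999 fails the 70000 test, so nothing gets subtracted, yet its
# most significant byte is still nonzero ($01116F).  Code that then
# only looks at the low 16 bits would misread the value as 4463.
n = 69999
assert n < 70000              # the 70000 subtraction does not fire
assert n >> 16 != 0           # but the high byte is still nonzero
assert n & 0xFFFF == 4463     # what a premature 16-bit view would see
```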
Quote:
When I ran it on the 24-bit number 0, it took only 581 cycles to complete,
In the most recent posted code I list the worst-case values so you can easily invoke them when testing in a program. I tested your changes and they broke the 24-bit and 16-bit routines for the reason described above. I went ahead and spent a while (ugh, 3 hours now that I check) optimizing the routine some more.
I was able to eliminate the entry for 200 from the table and have the 16-bit case do an optimized check for this at the end, simply seeing if the high byte is non-zero, in which case it subtracts 200 without any further checks. If the high byte is zero, then it uses the normal hex2bcd_8bit code.
I found another optimization using the 7/4/2/1 group: if the 7 test passes, the 4 test can be skipped. I incorporated this into the ends of the 16-bit and 24-bit match loops, having them skip the last entry if the next-to-last entry matched.
I was able to reduce the vtable size further by changing the millions to the 1/2/4/7 pattern, allowing the 16-bit and 24-bit routines to share the same vtable from the beginning. I also had to fix a bug in the HEX_BYTE macros: rather than subtract 1 from the low byte, I needed to subtract 1 from val in all of them.
I rearranged the match handlers to fall through to the loop, and for the non-match loop case to fall through to the next smaller handler (i.e. 24 falls into 16, 16 falls into 8), eliminating more branches.
Finally, I added short-circuiting that immediately jumps to the 16- and 8- bit handlers once the upper bits of the value become zero. This is just two branches that can be removed if desired, as they slightly increase the worst-case performance.
I removed the init routine from the source since it was cluttering it up. I also wasn't able to use the local symbol names as my ca65 gave a diagnostic (same for using :+, which sucks), and I don't feel like checking for a new version, getting it to compile, and seeing if these are fixed.
hex2bcd5.asm
Best case is for value 0. Worst case values listed.
hex2bcd_24bit Best: 106, Average: 760, Worst: 962 (n=16768641)
hex2bcd_16bit Best: 77, Average: 357, Worst: 439 (n=34640)
hex2bcd_8bit Best: 55, Average: 67, Worst: 78 (n=240)
252 bytes
If the two lines marked "optimization only" are removed: (note improved worst-case)
hex2bcd_24bit Best: 589, Average: 788, Worst: 924 (n=15964654)
hex2bcd_16bit Best: 278, Average: 361, Worst: 421 (n=34640)
248 bytes
blargg wrote:
Quote:
Out of curiosity, why do you handle the 40000 case within the 24-bit code and then skip it when falling through to 16-bit?
The ten thousand values subtracted are 70000, 40000, 20000, and 10000. In the 24-bit case, once it's eliminated 70000, the number could still be 69999 ($01116F). If it jumped to the 16-bit code immediately, the still non-zero most significant byte would be ignored, making the value look like 4463. Of course I had the benefit of the validator, so I could just adjust the overlap until it worked, then figure out why.
Oh yeah, forgot about that.
To be fair, I never would have actually seen the 256-299 case, since my score and timer both go by multiples of 50; as for the 65536-69999 case, I just wasn't patient enough to try large enough values.
Come to think of it, it would probably be worth it to just divide the score and bonus timer by 10 internally and just pad them with an extra 0.
** BUMP **
Was looking for this lib, but the links are all dead. Specifically I'm looking for the one previously at this link:
it's now http://h1.ripway.com/blargg/temp/to_decimal_asm.zip
which was the 2nd link in blargg's post here:
http://nesdev.com/bbs/viewtopi ... 7351#17351
If someone has this, could they re-link it somewhere? Or if it exists somewhere else, could someone give a link?
Thanks!
For any old links of mine, try replacing "www.io.com/~greens" with "h1.ripway.com/blargg". So this one becomes
http://h1.ripway.com/blargg/temp/to_decimal_asm.zip
I was mainly interested in using this to display a kind of coordinate thing up top in my game. I used the one for 2 digits and it works fine for me, so thanks. No more hex output!
I just revised my Bin to Dec routines, and I thought I'd post them up if anyone is interested:
http://www.freewebs.com/the_bott/BinToDec.asm
The major con to this is that it assembles to exactly 1000 bytes, but it's the same speed every time, no matter what the number is:
8 bits: 118 cycles
16 bits: 263 cycles
24 bits: 475 cycles
I like those speeds quite a bit, but it's true it's quite space-hogging (it's not a big deal for me though).
One could take out the 24 bit routine (and the tables it uses) to cut it all down to roughly 700 bytes (maybe less), or one could even cut the 8 bit routine out if it didn't seem worthwhile.
For anyone who hasn't been following the thread since 3 years ago, my method revolves around the fact that a value like $3A5D is equivalent to $3000 + $A00 + $50 + $D. It uses tables of pre-calculated hex-to-decimal values for each hex digit place. So one table contains the pre-converted decimal values for $0000, $1000, $2000, $3000, ... $E000, and $F000, while another contains the values for $000, $100, $200, $300, ... $E00, and $F00, etc. It simulates pen-and-paper decimal addition using the data from these tables, and returns each decimal digit in a ZP variable. It also uses 11 bytes of zero page. So all you have to do is put the hex values in Hex0, Hex1, and Hex2 (just the first for 8 bits, the first 2 for 16, and all three for 24-bit conversion), and it returns the values in DecOnes, DecTens, DecHundreds, DecThousands, DecTenThousands, DecHundredThousands, DecMillions, and DecTenMillions (obviously depending on the bit width of the hex value).
Just thought I'd post for anyone who's interested.
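In high-level terms, the method described above amounts to something like the following Python sketch (my reading of the description; the real routine looks the per-nibble decimal contributions up in pre-built tables rather than computing them):

```python
def bin_to_dec(value, n_bytes=3):
    """Split the value into hex nibbles, take each nibble's decimal
    contribution (nibble * 16^place, pre-tabulated in the original),
    and sum the contributions with pen-and-paper decimal carries."""
    digits = [0] * 8                 # ones digit first
    for place in range(n_bytes * 2):
        contrib = ((value >> (4 * place)) & 0xF) * 16 ** place
        i, carry = 0, 0
        while contrib or carry:      # long addition, digit by digit
            total = digits[i] + contrib % 10 + carry
            digits[i], carry = total % 10, total // 10
            contrib //= 10
            i += 1
    return digits                    # DecOnes ... DecTenMillions
```

With the 16-bit example from the start of the thread, `bin_to_dec(0x2000, 2)` produces the digits of 8192, ones first.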