I just noticed this document on kindred's homepage:
http://www.crazysmart.net.au/kindred/fi ... nst_op.pdf
Is this table based on some official Sony document or on logic analyzer traces? The total cycles taken by each instruction agree with bsnes/higan, but the order of operations for many instructions (particularly complex ones like the calls) is completely different.
Seems likely, given the "Kris Bleakley 2014" in the corner, that he'd be the person to chase down to ask...
Is that Overload? Just want to make sure before I send a PM...
Whois data for the domain implies yes.
Yes, Kris Bleakley is Overload. He is the author of Kindred (formerly Super Sleuth.)
The cycle timings in higan are from blargg. He wrote test ROMs that would test every possible instruction (obviously ones like SLEEP/STOP were not possible), and would report errors if your cycle orderings were wrong.
I can't recall if he said he tested CALL instructions and the like, or just the ALU instructions. I do remember that INCW/DECW were the most surprising ones of all (to me, anyway. I get why it's nicer for the CPU design, but boy is DECW weird.)
Unfortunately, I lost those test ROMs, and nobody else seems to have them.
So, in the absence of a test ROM to the contrary, I'd say we should stick with blargg's results that are in higan.
(Aside: I'm still really disappointed there's no external pin for the S-SMP to trigger interrupts. I would really like to emulate SLEEP properly.)
Okay, "many instructions" appears to have been an exaggeration. As far as I can see after a bit more comparison, there are two differences between bsnes behaviour and Overload's document:
1) According to Overload, there's no such thing as an "internal operation" on the SPC700. Like the 6502, every cycle is either a read or a write; the ones which bsnes treats as "internal" are mostly re-reads from the previously read address. However, there's apparently something special about the MOV (X)+,A opcode--I've PMed Overload to ask for clarification, but it looks like he's saying it does do a dummy read from (X) (just like all other store instructions do) but that read somehow bypasses the internal registers, thus it can't be detected using the reset-on-read timer registers.
2) Call and return instructions; the timing of the stack accesses is totally different between bsnes and Overload's document. Verifying these via pure software testing would be stupendously difficult, because the stack is fixed to page $01 and there aren't any MMIO registers there. You'd have to point one of the DSP voices at stack memory, play a sample timed such that the DSP reads the stack just as the SMP is writing to it, and examine the echo buffer. Frankly the timings for these instructions in bsnes look like guesses to me (the "internal" operations in RET come after the stack pops? Highly unlikely) and Overload's look a lot more realistic.
On a tangent, I notice one of the fork maintainers that you haven't yet banned from your forum called you out for something that I think I've also pointed out (and I've changed in bsnes-classic). higan treats the SMP "output" communication ports (the ones the S-CPU can read via $2140-2143) as if the S-CPU can directly read four bytes of APU RAM, which seems utterly impossible hardware-wise. If I understand the APU correctly (I've never done any SNES sound programming, or sound programming period) here's a fairly easy way to test whether higan is correct:
- On the SMP, write something distinctive to $F4-$F7, e.g. DE AD BE EF
- Set the DSP echo buffer so that it wraps around and spills into zero page (e.g. start at $F900 with a size of $800)
- Play a nice complex looping sample with echo enabled
- Once the sample is playing, on the S-CPU side, read $2140-2143 in a loop and display the read values onscreen
If higan's implementation is correct, the values read by the S-CPU will constantly change as the DSP writes into the echo buffer. If I'm correct, the S-CPU will only see the values written by the SMP in step 1.
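To make the two outcomes concrete, here is a rough C++ sketch of the two models being compared. It is not code from higan or bsnes-classic; `ApuModel` and all of its members are made-up names, and it assumes an SMP store to $F4-$F7 lands both in a port latch and in the RAM underneath.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model, not higan or bsnes-classic code: two ways the S-CPU
// could see the SMP's output ports ($F4-$F7 on the SMP side).
struct ApuModel {
  std::array<uint8_t, 0x10000> ram{};  // 64 KiB of APU RAM
  std::array<uint8_t, 4> portOut{};    // separate SMP->CPU port latches

  // SMP store to $F4-$F7 (assumption: updates the latch and the RAM behind it).
  void smpWritePort(unsigned n, uint8_t value) {
    portOut[n & 3] = value;
    ram[0xf4 + (n & 3)] = value;
  }

  // DSP echo write: goes straight to RAM and never touches the latches.
  void dspEchoWrite(uint16_t address, uint8_t value) { ram[address] = value; }

  // Model A: the S-CPU reads APU RAM directly (the behaviour attributed to higan).
  uint8_t cpuReadViaRam(unsigned n) const { return ram[0xf4 + (n & 3)]; }

  // Model B: the S-CPU reads the dedicated latches.
  uint8_t cpuReadViaLatch(unsigned n) const { return portOut[n & 3]; }
};
```

After the SMP writes $DE to port 0 and the echo buffer then overwrites $00F4, model A returns the echo byte while model B still returns $DE. That difference is what the on-screen readout in the test would show.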
I can write a program to do this, but someone who's already got a bit of experience with APU programming (Revenant?) can probably do it a lot more easily and quickly.
> According to Overload, there's no such thing as an "internal operation" on the SPC700. Like the 6502, every cycle is either a read or a write; the ones which bsnes treats as "internal" are mostly re-reads from the previously read address.
That's ... I really have trouble believing something that significant snuck by everyone for that long, especially with blargg's tests to determine individual cycle ordering for SPC700 opcodes. To be clear, I'm not saying Overload is wrong. I would just be stunned that blargg missed that detail when he tested this.
> and Overload's look a lot more realistic.
I can't say whether mine are right or wrong, but I know better than to design things based on how they "should" work. That proves to actually be correct approximately 15% of the time.
If Overload confirms he used a logic analyzer to trace these out, or if he made some test ROMs he can provide, then that's evidence enough to go with his design.
If not, then both approaches are guessing, and we should add this to a list of things that need hardware verification.
> higan treats the SMP "output" communication ports (the ones the S-CPU can read via $2140-2143) as if the S-CPU can directly read four bytes of APU RAM, which seems utterly impossible hardware-wise.
You may be right. Prove to me it's wrong and I'll make the change. Again, the guiding philosophy with the SNES core was, "don't make changes because they feel right, make them after verifying on hardware." I'm not changing that methodology after hitting 100% compatibility with this approach.
You're welcome to make changes you haven't verified are correct with your fork of v073 from 2010 if you like.
> I notice one of the fork maintainers that you haven't yet banned from your forum
If you're gonna bring that up here, you should mention why I did that:
https://board.byuu.org/viewtopic.php?p=12684#p12684
https://board.byuu.org/viewtopic.php?p=8333#p8333
If I let you talk to me like that, then I'd have to let everyone on my forum talk to me like that, or admit that helpful people get special privileges.
byuu wrote:
That's ... I really have trouble believing something that significant snuck by everyone for that long, especially with blargg's tests to determine individual cycle ordering for SPC700 opcodes. To be clear, I'm not saying Overload is wrong. I would just be stunned that blargg missed that detail when he tested this.
Almost all the "internal operations" indicated by Overload either read from PC or from the stack. The stack never has side effects on reads, and the PC only does if you're executing code out of the timer registers, which is fairly impossible to do in a controllable way.
I spotted another group of instructions where higan differs from Overload's doc: the (indirect),y addressing mode.
higan does:
Code:
read direct page address
idle
read indirect address LSB
read indirect address MSB
read data
(write data)
Overload has:
Code:
read direct page address
read indirect address LSB
read indirect address MSB
idle (re-read indirect address MSB)
read data
(write data)
For what it's worth, Overload's version matches the 6502 and 65816 (though the idle cycle is conditional on those CPUs). It also makes more sense--Y is added to the indirect address, not to the direct page address.
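As a sketch of what a logic analyzer would tell apart here, the two orderings can be written out as bus traces. Everything below is illustrative: `Cycle`, `higanOrder` and `overloadOrder` are invented names, the direct page is assumed to sit at $0000, the pointer fetch is assumed to wrap within the page, and the idle cycle is shown driving the most recently read address (the re-read theory from earlier in the thread).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One bus cycle as seen from outside the chip: an address plus a label.
struct Cycle {
  uint16_t address;
  std::string what;
};

// Cycles after the opcode fetch, in the order listed for higan.
std::vector<Cycle> higanOrder(uint16_t operandAddress, uint8_t dp,
                              uint16_t pointer, uint8_t y) {
  return {
    {operandAddress, "read direct page address"},
    {operandAddress, "idle"},  // assumed to repeat the last read address
    {dp, "read indirect address LSB"},
    {uint8_t(dp + 1), "read indirect address MSB"},
    {uint16_t(pointer + y), "read data"},
  };
}

// Same cycles in the order given in Overload's document.
std::vector<Cycle> overloadOrder(uint16_t operandAddress, uint8_t dp,
                                 uint16_t pointer, uint8_t y) {
  return {
    {operandAddress, "read direct page address"},
    {dp, "read indirect address LSB"},
    {uint8_t(dp + 1), "read indirect address MSB"},
    {uint8_t(dp + 1), "idle (re-read indirect address MSB)"},
    {uint16_t(pointer + y), "read data"},
  };
}
```

Both traces have the same length and end on the same data address; only the position (and therefore the bus address) of the idle cycle differs.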
This document dates to 2007-04-21. That date corresponds fairly closely with the release of bsnes 0.020, which has this in the changelog:
Quote:
- Corrected all S-SMP cycle timings to be hardware accurate. Thanks to blargg for creating an amazing test ROM that tested every possible opcode
All of the cycle timings in higan except one match this document exactly, including the ones which are indicated in the document as "guesses" (i.e., you guessed it, the various stack instructions). The one place higan doesn't match the 2007 document is (indirect),y (the document matches Overload's findings). I wonder if it's possible that you used the (indirect,x) timings for (indirect),y by mistake?
Quote:
If Overload confirms he used a logic analyzer to trace these out, or if he made some test ROMs he can provide, then that's evidence enough to go with his design. If not, then both approaches are guessing, and we should add this to a list of things that need hardware verification.
I asked Overload via PM and he says he used a logic analyzer.
Quote:
You may be right. Prove to me it's wrong and I'll make the change. Again, the guiding philosophy with the SNES core was, "don't make changes because they feel right, make them after verifying on hardware." I'm not changing that methodology after hitting 100% compatibility with this approach.
That was meant to be a beg to anyone else reading this thread to write that test ROM for me, not to you to change higan.
I can write a test ROM for the I/O behavior tomorrow night if nobody else does.
Revenant wrote:
I can write a test ROM for the I/O behavior tomorrow night if nobody else does.
Thanks, that would be great. For testing purposes, if you revert this commit in bsnes-classic then it should behave the same way as higan (i.e. CPU reads from $2140-2143 will access the actual APU RAM).
Note that I reply in the order I read things. Later comments are after having read more.
> Almost all the "internal operations" indicated by Overload either read from PC or from the stack.
I noticed that in his PDF. But until he chimes in here to tell us he's confirmed this on real hardware (and hopefully how), it's conjecture.
> The stack never has side effects on reads, and the PC only does if you're executing code out of the timer registers, which is fairly impossible to do in a controllable way.
If it's completely impossible to observe then, loath as I am to say this (I'll suppress my gag reflex for it) ... it's not really relevant to emulation ... so we'll need a logic analyzer to prove things before I'll change them.
If it is possible, then we can make a test ROM to prove the behavior.
> I wonder if it's possible that you used the (indirect,x) timings for (indirect),y by mistake?
Probably. It's also possible that the test ROM from blargg didn't test (indirect),y; or that, incredibly, the SPC700 didn't match the 65xx (since, you know, it's not a 65xx. Just a cleanroom clone of one with disguised opcode mnemonics.) I sure wish someone had that test ROM from him still.
> I asked Overload via PM and he says he used a logic analyzer.
Any chance he can chime in here himself?
I believe you, but I'd like the official record to be more than "a guy said another guy said he did that in a private message. Go find a random thread on an NES development forum to see for yourself."
I'm sorry to be a pain about this. You need to understand that I tried the opposite approach of just adding things that felt right to my Game Boy core, because it's not a system I'm set up to run my own hardware tests on and well ... it's been a disaster. Compatibility is worse today than it was when I started.
But ... again, I'll trust you and start working on this change. The way I usually do this is to change idle() so that it does a dummy read from PC without incrementing. For the weird ones where it reads from some other place instead of idling, we can replace idle() with the reads they are doing instead.
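A minimal sketch of that idea, with invented names (`Core`, `busLog`) and no claim that this is how higan's classes are laid out: `idle()` defaults to a discarded read from PC, and an overload takes an explicit address for the odd cases.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU core fragment; only the bus-facing helpers are modelled.
struct Core {
  std::array<uint8_t, 0x10000> ram{};
  uint16_t pc = 0;
  std::vector<uint16_t> busLog;  // every address driven on the bus, in order

  uint8_t read(uint16_t address) {
    busLog.push_back(address);
    return ram[address];
  }

  // Real operand fetch: the value is used and PC advances.
  uint8_t fetch() { return read(pc++); }

  // Default idle: dummy read from PC, value discarded, PC not incremented.
  void idle() { read(pc); }

  // Idle that reads from some other place instead, for the weird cases.
  void idle(uint16_t address) { read(address); }
};
```

With PC at $0200, a `fetch()` followed by `idle()` logs $0200 then $0201 and leaves PC at $0201.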
> That was meant to be a beg to anyone else reading this thread to write that test ROM for me, not to you to change higan.
That's fair! It's good to raise issues about corner cases instead of just assuming we are right already.
And just to reiterate the above again, I'm not saying I am right. If I am, I'm not going to gloat. If I'm not, I'll fix it and thank you and the tester for finding the issue. I just want proof before making changes.
I know every bsnes/higan forker loves to put passive aggressive comments into their Git changelogs, but one of these days ... even if you guys are right most of the time ... eventually, you're gonna be wrong. And now you're gonna have a comment about how it's literally impossible my behavior was correct, and lo and behold.
Notice how I never say in absolute terms that I'm right. That's because I don't know until I confirm it for myself. It's really better to be cautious about these things. No matter how unreasonable you think I am, everyone can see for themselves how SNES emulation used to be prior to my unorthodox approaches.
(sorry for double post, but the former was too long already.)
Went through the document, tried to identify all discrepancies, noted concerns with Overload's document.
Overall, I don't think idle() is going to cut it. Even in the cases where it's a direct idle, the address bus indicates it's re-reading the last fetched byte. So it sounds like having a memory address register to keep track of the last read-from address would do the job ... but then there's cases like CALL that seem to have an idle cycle that reads from the stack *before* actually reading from the stack.
So right now, I'm not really sure the best way to handle these idle cycles correctly.
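For reference, the memory-address-register idea would look something like the sketch below. The names (`Bus`, `mar`) are hypothetical, and whether a write should also update the register is an open question; here only reads do.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bus handler that remembers the last address it read from.
struct Bus {
  std::array<uint8_t, 0x10000> ram{};
  uint16_t mar = 0;           // last read-from address
  std::vector<uint16_t> log;  // addresses as they would appear on the pins

  uint8_t read(uint16_t address) {
    mar = address;
    log.push_back(address);
    return ram[address];
  }

  void write(uint16_t address, uint8_t data) {
    log.push_back(address);  // assumption: writes leave mar untouched
    ram[address] = data;
  }

  // Idle cycle: re-drive the last read address and discard the value.
  void idle() { log.push_back(mar); }
};
```

This covers the plain re-read cases without the core naming an address, but it cannot express an idle cycle that touches the stack before any real stack access has happened, which is exactly the CALL problem described above.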
My biggest technical concern right now is ... when there are multiple reads from the same address, how do we know -definitively- which one is the value actually used? Like say you fetch the direct page address from PC+1, then there's another read from PC+1 ... what if the value differs between the two reads? Which read matters? I know, not likely, but still ... I'm guessing the second case.
Discrepancies found:
Code:
AbsoluteBitModify:
  idle() => read(absolute)
AbsoluteIndexedWrite:
  consistency cleanup: perform absolute+index in-place
DirectWriteDirect:
  idle() => load(target)
DirectWriteImmediate:
  idle() => load(direct)
DirectReadWord:
  idle() => load(direct+0)
IndirectIndexed(Read,Write):
  idle() => load(direct+1)
IndirectXIncrement(Read,Write):
  second idle() => ... load(X) again?
  something about not affecting registers on the cycle before ...
IndirectXWriteIndirectY:
  second idle() => load(X)
Pull,PLP:
  second idle() => reads from SP without incrementing SP ...
Push:
  second idle() => unknown purpose
BBC,BBS,BNEDirect,BNEDirectDecrement,BNEDirectX:
  second fetch(), idle() => wrong ordering
BRK:
  move first idle() to top of function
  push PC, P before reading VA
JSPDirect:
  second idle() => goes after push; reads from SP
JSRAbsolute:
  second and third idle() => goes after push; reads from SP
JST:
  two idle()s go to top, one after read(absolute)
RTI,RTS:
  two idle()s go to top; second is a stack read
STW:
  consistency cleanup: turn into StoreWord(A)
Document concerns:
Code:
when there is an operand fetch followed by an idle cycle (eg DO fetch) ...
Which fetch actually counts for DO? The first or the second?
Same question for 19b
17. Implied i: it's not made clear which instructions take how many cycles.
I know the answers, but they should be documented, yes?
19b. Relative rel: what does it mean for the data bus to have Offset on cycle 4?
20b. Stack s
Where is cycle 3 reading from?
CLR, SET addr:bit do not appear to be documented?
what the christ is with condition (5)? Madness >_>
byuu wrote:
what the christ is with condition (5)? Madness >_>
I wonder if all the redundant reads are actually artifacts of the memory controller that Sony bolted on (remember that most SPC700 chips are MCUs with completely internal RAM and ROM and no external bus at all) and don't affect the internal registers. I'm waiting for a PM from Overload confirming my speculation, but based on the number of pins connecting the SMP to the DSP, it looks like it's a 6502-style bus, which has no way to distinguish a "read" from an "internal operation".
i.e. the SPC700 core itself is doing an idle cycle, and its internal registers don't see it as a read, but the external bus has no way to indicate that, so from outside the chip it looks like a re-read of the last address that was read from (the address pins are effectively "open bus").
In that case it would be safe to emulate them as true idle cycles, because no addresses that aren't SMP-internal have any side effects on reads.
Semi-offtopic: for the 6502, 65816 and SPC700, the opcode switch tables look a lot nicer and more logical if instead of putting the opcodes in numerical order you arrange them like this:
Code:
case 0x00:
case 0x20:
case 0x40:
case 0x60:
case 0x80:
case 0xa0:
case 0xc0:
case 0xe0:
case 0x01:
case 0x21:
[....]
case 0x1f:
case 0x3f:
case 0x5f:
case 0x7f:
case 0x9f:
case 0xbf:
case 0xdf:
case 0xff:
It shouldn't make any difference to code generation/performance, because the compiler should see that there are 256 contiguous cases (even if they're in jumbled order) and compile it as a single indexed indirect jump.
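A toy version of the grouped layout, just to show the shape. `column()` is a made-up function that only fills in the first and last groups from the listing above; it says nothing about what those opcodes do, and it does not by itself prove the jump-table claim (that would need a look at the compiler output).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example: case labels grouped by the low five bits of the
// opcode rather than listed in numerical order. Returns the group's low
// five bits, or -1 for groups this sketch leaves out.
int column(uint8_t opcode) {
  switch (opcode) {
  case 0x00: case 0x20: case 0x40: case 0x60:
  case 0x80: case 0xa0: case 0xc0: case 0xe0:
    return 0x00;

  case 0x1f: case 0x3f: case 0x5f: case 0x7f:
  case 0x9f: case 0xbf: case 0xdf: case 0xff:
    return 0x1f;

  default:
    return -1;  // the remaining 30 groups are omitted here
  }
}
```

Each group shares one body via fallthrough, so instructions that differ only in their top three bits sit next to each other in the source.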
> i.e. the SPC700 core itself is doing an idle cycle, and its internal registers don't see it as a read, but the external bus has no way to indicate that, so from outside the chip it looks like a re-read of the last address that was read from (the address pins are effectively "open bus").
That's a good working theory. It would also answer my question that the first cycle to read the same address would be the value used, and not the second.
What I'm wondering is if we can find a way to emulate this so that we can just say idle(); without having to say -where- it's reading every time. That would be really nice. Even better, if we can get the logic out of the SPC700 CPU core and into the bus handler of the SMP class. The main limitation there is instructions like RETI, where it's reading from the stack pointer -before- the first increment on its idle cycle ... a value that wouldn't be on the bus from a previous read.
> Semi-offtopic: for the 6502, 65816 and SPC700, the opcode switch tables look a lot nicer and more logical if instead of putting the opcodes in numerical order you arrange them like this:
I agree completely. The reason I went numerically was to fill in the table initially, where it would be very easy to spot any missing instructions. I never really bothered to redesign the tables to a grouped ordering after finishing. Partly because there'd be bikeshedding about which groups should go before which other groups.
For similar bikeshed-avoidance reasons, I've been ordering my actual instruction implementations alphabetically.
Well in any case if you haven't seen it and were interested, I've been cleaning up my SPC700 core since the start of v102 or so. Still a work in progress, but you can see it here:
https://gitlab.com/higan/higan/tree/mas ... sor/spc700
You probably won't be happy to find that I've reverted the BitField stuff. It didn't actually affect performance, and I didn't like that I couldn't pass BitField parameters to functions, because typeof BitField<0> != BitField<3>.
I did get rid of the "global state" aa, rd, dp, sp, bit, ya nonsense that was a hangover from when each cycle was a step inside of a state machine back in ... gods ... v020 or so.
The worst of it is still the disassemblers. The WDC65816 core is actually using snprintf still >_>
I'm wondering whether it'd be better just to stick most of the instruction implementations directly inside the switch table, and only use functions for things that are actually reused across multiple instructions (the algorithms for adc/sbc/etc., and addressing modes that have the same timing across multiple instructions)
You can still parameterize SetFlag etc. with bitfields, you just pass the mask or the bit index (which are public static members of the bitfield classes) instead of a reference to the bitfield object. I.e. you can't do this:
Code:
void SetFlag(Flag &flag, bool value)
{
    flag = value;
}
SetFlag(C, true);
SetFlag(V, false);
but you can do this:
Code:
void SetFlag(uint8 mask, bool value)
{
    psw = (value ? psw | mask : psw & ~mask);
}
SetFlag(C.mask, true);
SetFlag(V.mask, false);
If you think the ".mask" is ugly you can hide it in a macro or something
> I'm wondering whether it'd be better just to stick most of the instruction implementations directly inside the switch table
Especially if we used fallthrough and grouped ordering as you suggested previously, then one might as well. Doesn't really make much of a difference as I would sure hope that a single function call used only in one place would be inlined.
Of course if you really want to compress the code, there's always the microcoding approach. It would make understanding the code near impossible, but eg Bisqwit managed to get a 6502 core (with illegal ops) into less than 100 lines of code that way (and no, none of the lines were >80 characters.)
> only use functions for things that are actually reused across multiple instructions
I continue to dither on the instructions only used once.
Like, you can have sta direct,x; stx direct,y; sty direct,x so then I have storeDirect(uint8& reg, uint8& index) but then you can only have sta (direct+x), so is it better to have storeIndexedIndirect() [or just outright calling it STA_IDPX or something], or storeIndexedIndirect(uint8& reg, uint8& index)?
> You can still parameterize SetFlag etc. with bitfields
Yeah, that's how I was doing it before.
> If you think the ".mask" is ugly you can hide it in a macro or something
I used to have a deathly allergy to the existence of any and all macros (save for the L lastCycle(); hack), but lately I think it's a whole lot nicer for the innards of a CPU core to be able to say "A, X, Y" instead of "regs.a, regs.x, regs.y" everywhere.
Okay, I tried to update the SPC700 code to follow Overload's cycle timings. Hopefully I got everything right. He's missing some instructions, and doesn't list cycle counts for some, and Direct, X Relative, CBNE has one cycle listed as RWB 0/1 that can only ever be a 1 (it's only one instruction and it never writes.) And of course, I'm not perfect, sadly ._.
I am using idle(uint16 address) to indicate that the cycle is "internal", but it has the same functionality as "read".
I also had to lose the very handy fetch/load/store/push/pull helper functions, because those all fell back on regular reads, and three of them affected registers. Instead I moved to only using read/write with two functions, page(uint8) and stack(uint8) that just transform into a full 16-bit address, and use those inside read/write.
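Roughly, the two address helpers amount to the following. This is a standalone sketch rather than the code in the repository: the direct page flag is passed in explicitly here (the `directPageFlag` parameter is an invention for the example), and it assumes the usual SPC700 layout where the P flag selects $00xx or $01xx and the stack lives in page $01.

```cpp
#include <cassert>
#include <cstdint>

// Direct page address: the P flag picks page $00 or page $01.
uint16_t page(bool directPageFlag, uint8_t address) {
  return uint16_t((directPageFlag ? 0x0100 : 0x0000) | address);
}

// Stack address: always in page $01.
uint16_t stack(uint8_t address) {
  return uint16_t(0x0100 | address);
}
```

So `read(stack(S + 1))` can express a stack read at the next slot without touching S, and `idle(page(true, 0x34))` names $0134 without any register side effects, which is the point of dropping the old push/pull/load helpers.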
Not crazy about it, but in this way, it's way easier than having no-increment/decrement versions of fetch/push/pull, and some way to signal load/store as being idle cycles.
Also dropped the address++ semantics and instead went with address+0, address+1, since sometimes you'd read from one of the two addresses twice. For both PC and S, I always inc/dec when doing the actual read anyway, and then use +1,-1 as appropriate to match the address bus locations in the PDF. In this way, the weirdness is isolated to the idle() cycle addresses.
I merged more functions to reduce repetition, and expanded the three-letter acronym functions (XCN => ExchangeNibble).
Probably the last redesign change I still want to make is to consider merging the read/modify/write variants with some sort of flag inside of them like "Compare/Write"; and then also use that to drop the op function comparisons I use to specialize some of the functions now.
As we know, the logic analyzer can't show us which read actually matters when the same address is read twice in a row, the first or the second one. This could end up wrong for any instructions like that, and I'm not sure there's a way we can ever find out which is correct.
I put source comments on the weird (x)+ cycle 3 case. We really need a test ROM for this one. spc700cyc.txt indicates the read version happens at cycle 3, the write version happens at cycle 4. I'd have expected both to happen on cycle 4, but then the read being at cycle 3 is probably how Overload observed this oddity: it was probably reading the underlying RAM value instead of the register value for mov (x+),a. If it were the discarded idle read, then he couldn't observe it was pulling underlying RAM.
https://gitlab.com/higan/higan/blob/mas ... ctions.cpp
Which instructions are missing from Overload's list?
byuu wrote:
but then the read being at cycle 3 is probably how Overload observed this oddity: it was probably reading the underlying RAM value instead of the register value for mov (x+),a. If it were the discarded idle read, then he couldn't observe it was pulling underlying RAM.
No, the write is the one where the "bypassing registers" effect is noticeable. If you MOV (X)+,A and X is pointing at one of the timer registers which are reset when read ($FD through $FF), that timer won't get reset at all. Anomie's 2007 document and nocash's fullsnes document both point out this anomaly (it's an anomaly because almost all writes on the SPC700 do a discarded read from the same address first and thus do reset the timers).
You've just made it so that MOV (X)+,A does reset timer registers, which is well known to be wrong!
There are two possibilities for how MOV A, (X)+ works which are indistinguishable with a logic analyzer and tricky but possible to distinguish with software testing:
Possibility 1:
Cycle 3 reads from underlying RAM and is discarded
Cycle 4 reads from RAM or an internal register and is kept
Possibility 2:
Cycle 3 reads from RAM or an internal register and is kept
Cycle 4 reads from wherever and is discarded (standard duplicate read)
To distinguish them, you'd have to start the instruction exactly 3 cycles before the timer being read is due to tick over, so that reading the timer on Cycle 3 and Cycle 4 will get different values.
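Here is a toy simulation of that experiment. It is only a model of the argument: `Timer` and `runLoad` are invented, the timer is reduced to a single counter that ticks at the start of one chosen cycle and clears when read, and the instruction is reduced to "which of cycles 3 and 4 supplies the value for A".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical clear-on-read timer counter.
struct Timer {
  uint8_t counter = 0;
  unsigned tickCycle = 0;  // counter increments at the start of this cycle
  unsigned now = 0;

  void step() {
    if (++now == tickCycle) counter++;
  }

  uint8_t read() {
    uint8_t value = counter;
    counter = 0;  // reading resets the counter
    return value;
  }
};

// Runs the four cycles of the load and returns what ends up in A.
// possibility 1: cycle 3 reads plain RAM (ignored here), cycle 4 is kept.
// possibility 2: cycle 3 is kept, cycle 4 is a discarded duplicate read.
uint8_t runLoad(int possibility, unsigned tickCycle) {
  Timer timer;
  timer.tickCycle = tickCycle;
  uint8_t a = 0xff;
  for (unsigned cycle = 1; cycle <= 4; cycle++) {
    timer.step();
    if (possibility == 1 && cycle == 4) a = timer.read();
    if (possibility == 2 && cycle == 3) a = timer.read();
    if (possibility == 2 && cycle == 4) timer.read();  // value thrown away
  }
  return a;
}
```

With the tick lined up on cycle 4, possibility 1 leaves 1 in A and possibility 2 leaves 0. With the tick far away both leave 0, which is why the alignment has to be exact.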
I assume that that's exactly what blargg did back in 2007, because in Anomie's document every instruction where the final read cycle can be made to read from a timer register is "confirmed by blargg"; it's only instructions that only read from the stack which are marked as guesses.
In short, I'm almost certain that possibility 2 is the one that's correct. It's consistent with Overload's findings to the extent that can be distinguished with a logic analyzer, and consistent with blargg's findings to the extent that can be distinguished via software testing.
> Which instructions are missing from Overload's list?
SLEEP, STOP, and I thought CLRn/SETn was missing, but it looks like it's there but being called CLR1/SET1.
I know, if you call SLEEP or STOP, you're never recovering anyway. But I'm wondering if it waits one idle cycle or two between each execution (plus the opcode fetch.) I copied the 65816 WAI's two idle cycles for now.
> No, the write is the one where the "bypassing registers" effect is noticeable.
Ah, I see. You're gonna have to forgive me, I have eight different emulated systems all rattling around inside my head, so some details are fuzzy these days. It's not like back in the bsnes days I'm afraid.
So then ... it's gotta be one of these two designs, right?
Code:
//mov (x),a
auto SPC700::instructionIndirectXRead(fpb op) -> void {
  idle(PC);
  uint8 data = read(page(X)); //this *WILL* reset the timers, and A will be the old timer value
  A = alu(A, data);
}

//mov a,(x)
auto SPC700::instructionIndirectXWrite(uint8& data) -> void {
  idle(PC);
  idle(page(X)); //this *WILL* reset the timers
  write(page(X), data);
}

//POSSIBILITY ONE
//mov (x+),a
//I have to swap cycles 3 and 4 from the Git repo
auto SPC700::instructionIndirectXIncrementRead(uint8& data) -> void {
  idle(PC);
  idle(); //this will not reset the timers (or even if it does, we won't notice due to next cycle)
  data = read(page(X)); //this *WILL* reset the timers
  ZF = data == 0;
  NF = data & 0x80;
}

//mov a,(x+)
auto SPC700::instructionIndirectXIncrementWrite(uint8& data) -> void {
  idle(PC);
  idle(); //this will not reset the timers, period
  write(page(X++), data);
}

//POSSIBILITY TWO
//mov (x+),a
auto SPC700::instructionIndirectXIncrementRead(uint8& data) -> void {
  idle(PC);
  data = readButNotFromIORegistersBecauseFuckYouThatsWhy(page(X)); //this will return internal RAM
  read(page(X++)); //this *WILL* reset timers, but we won't get the timer value into A
  ZF = data == 0;
  NF = data & 0x80;
}

//mov a,(x+)
auto SPC700::instructionIndirectXIncrementWrite(uint8& data) -> void {
  idle(PC);
  readButNotFromIORegistersBecauseFuckYouThatsWhy(page(X)); //this will *not* affect the timers
  write(page(X++), data);
}
> To distinguish them, you'd have to start the instruction exactly 3 cycles before the timer being read is due to tick over, so that reading the timer on Cycle 3 and Cycle 4 will get different values.
That sounds like it would work. Revenant, I don't suppose you'd be up for writing this test too? ^-^
> In short, I'm almost certain that possibility 2 is the one that's correct. It's consistent with Overload's findings to the extent that can be distinguished with a logic analyzer
It doesn't sound like the logic analyzer revealed any evidence supporting either possibility. It cannot determine which cycle's read is actually used internally.
> and consistent with blargg's findings to the extent that can be distinguished via software testing.
Has blargg really confirmed this exact scenario? We're 100% proof positive of that? The million dollar problem here is that blargg's test ROMs are gone.
If we had his ROM, I could fire it up on possibilities 1 and 2, and one of them should fail if he wrote a test as you described.
So here's the thing ... possibility one is way more sane and logical to me. But I also have 13 years of experience with the SNES shitting all over my notions of sane and logical, so I concur with you. Possibility two is the most likely because "fuck you (emudevs), that's why."
Still ... I'd really like to have a test ROM to prove this.
Your possibility 2 for reading is wrong (it doesn't even increment X). It should be:
Code:
auto SPC700::instructionIndirectXIncrementRead(uint8& data) -> void {
idle(PC);
data = read(page(X)); //this *WILL* reset the timers, and A will be the old timer value
idle(page(X++)); // this might or might not reset the timers a second time; it's almost impossible to test
ZF = data == 0;
NF = data & 0x80;
}
According to blargg, the read whose value matters is on cycle 3, not cycle 4, and like I said I'm assuming that he used the timers to test all these opcode timings.
If MOV A,(X)+ couldn't read the timers at all (it returned underlying RAM, or always returned 0 due to resetting the timer and then reading it again) I'm sure blargg would have noticed and would have made a note of such surprising behaviour.
ETA: also you've got the canonical mnemonics backwards. SPC700 is Intel style (destination first), not Motorola style (source first). MOV A,(X)+ is a load, MOV (X)+,A is a store.
... I can't handle this. Every single post changes small details and it keeps throwing me off.
> Your possibility 2 for reading is wrong
My possibility 2 does have X++. Possibility 1 doesn't, due to a copy paste typo.
> ETA: also you've got the canonical mnemonics backwards. SPC700 is Intel style (destination first), not Motorola style (source first).
Sigh. Okay, I'm just going to use target=source syntax, that way it's obvious what I mean and people don't have to keep track of which ordering every CPU uses.
...
Let's try this one more time. If this fails, then I'm giving up.
readIO() = this read DOES affect timers; returns RAM outside $f0-ff
readRAM() = this read does not affect timers; returns underlying RAM value
Note: when readRAM()'s read is discarded, it may be a true idle cycle that doesn't actually read.
However, this won't have any possible effect on emulation, so let's not worry about that.
For cycles 3 and 4 ... the possibilities are:
a=(x++) -- we are reading the memory at (X), storing it in A, and incrementing X
[Alpha] readRAM(x), a=readRAM(x++)
[Beta] readRAM(x), a=readIO(x++)
[Gamma] readIO(x), a=readRAM(x++)
[Delta] readIO(x), a=readIO(x++)
[Epsilon] a=readRAM(x), readRAM(x++)
[Zeta] a=readRAM(x), readIO(x++)
[Eta] a=readIO(x), readRAM(x++)
[Theta] a=readIO(x), readIO(x++)
(x++)=a -- we are storing the value of A into (X), and incrementing X
[Iota] readRAM(x), writeRAM(x++,a)
[Kappa] readRAM(x), writeIO(x++,a)
[Lambda] readIO(x), writeRAM(x++,a)
[Mu] readIO(x), writeIO(x++,a)
Many of these possibilities are nonsensical. I'm just elaborating every possible one so we can get this right.
Most likely, Kappa is correct for writes. Reads seem to be more confusing, but logically I'd say it was Beta, but the notes from blargg and Overload seem to indicate that case is most definitely wrong. It sounds like you favor either Eta or Theta. Of the two, Eta is more reasonable to me.
byuu wrote:
Most likely, Kappa is correct for writes.
Agree.
Quote:
Reads seem to be more confusing, but logically I'd say it was Beta, but the notes from blargg and Overload seem to indicate that case is most definitely wrong. It sounds like you favor either Eta or Theta. Of the two, Eta is more reasonable to me.
Agree with Eta or Theta.
> According to blargg, the read whose value matters is on cycle 3, not cycle 4
I mean, this directly contradicts Overload's document.
If blargg is correct and cycle 3 is the read that matters (eg it's the one to reset the timers), then Overload is wrong in saying cycle 3 can't access internal registers. If Overload is correct, then cycle 4 is the one that matters. From a logical design perspective, I like cycle 4 being the one that reads from or writes to IO, and cycle 3 being a read from internal RAM. Which is Beta.
It's not clear from either of their documentation which of the two cycle reads actually set A, either.
And to throw out more insane theories ... is there a possibility that simply reading from $fd-ff takes another cycle to wrap up after the fact, and the read -is- actually working with IO, just ignored because those registers are busy? Because I really don't believe there's specialization of logic inside the SMP core for just this one instruction. Even if we figure out how to emulate it, it would be nice to understand the why as well. I feel like if the timer register reads are the ONLY thing we're going on, that it's not enough information to definitively say that internal IO is ignored for all of $f0-ff.
byuu wrote:
I mean, this directly contradicts Overload's document.
If blargg is correct and cycle 3 is the read that matters (eg it's the one to reset the timers), then Overload is wrong in saying cycle 3 can't access internal registers. If Overload is correct, then cycle 4 is the one that matters. From a logical design perspective, I like cycle 4 being the one that reads from or writes to IO, and cycle 3 being a read from internal RAM. Which is Beta.
Reads have different timing from writes in every other SPC700 addressing mode, so why do you expect them to be perfectly symmetrical in this one?
Unfortunately, your OCD obsession with beautiful, logical symmetry has nothing to do with how electronic hardware works. After more than a decade as an emulator developer across multiple systems by multiple manufacturers, surely you should stop being surprised by now when your intuitions about what the most "logical" way for something to work are wrong so much more often than they are right?
Quote:
It's not clear from either of their documentation which of the two cycle reads actually set A, either.
The read that can hit I/O is definitely the read that sets A, because blargg would have noticed if reading from $FD-$FF via that addressing mode always set A to 0, or set A to the underlying APU RAM instead of the timer value.
(snip utterly unverifiable speculation)
Here's what we know: writing to $FD-$FF by MOV (X)+,A doesn't reset the timer, unlike literally every other addressing mode. Reading from $FD-$FF by MOV A,(X)+ does reset the timer and set A to its previous value, exactly like every other addressing mode. As "unlikely" and "impossible" as you find it, as loudly as it makes your OCD scream, those are the objective facts.
> Unfortunately, your OCD obsession with beautiful, logical symmetry has nothing to do with how electronic hardware works. After more than a decade as an emulator developer across multiple systems by multiple manufacturers, surely you should stop being surprised by now when your intuitions about what the most "logical" way for something to work are wrong so much more often than they are right?
I mean I did say like four times in this thread alone that hardware, especially SNES hardware, never acts in the logical way you would expect.
That's not going to stop me from thinking logically and wishing things were that way. When you and I star in that next Star Trek reboot, you can be Kirk and I'll be Spock :P
> The read that can hit I/O is definitely the read that sets A, because blargg would have noticed if reading from $FD-$FF via that addressing mode always set A to 0, or set A to the underlying APU RAM instead of the timer value.
I sure hope so. In absence of confirmation, that's still the more likely outcome at least. So that narrows us to: Beta, Delta, Eta, Theta.
In my view, I find it very unlikely that both reads are going to hit IO. Especially not if A is set on the fourth cycle. So that narrows us further to: Beta, Eta.
So if Overload is right that cycle 3 is the ignored one, then the answer is Beta.
If blargg is right that cycle 4 is the ignored one, then the answer is Eta.
> Reads have different timing from writes in every other SPC700 addressing mode, so why do you expect them to be perfectly symmetrical in this one?
Writes *gain* an extra cycle in all those other modes. That is not the case here.
> writing to $FD-$FF by MOV (X)+,A doesn't reset the timer, unlike literally every other addressing mode.
Right, Kappa will result in that effect.
> Reading from $FD-$FF by MOV A,(X)+ does reset the timer and set A to its previous value, exactly like every other addressing mode.
Right, so that rules out Alpha, Gamma, Epsilon, Zeta.
...
So, logically I think it's Beta. Based on my experience with the SNES, I would bet money on it being Eta, although there is a slim chance it is Delta or Theta.
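To make the elimination above mechanical, here is a minimal C++ sketch of a toy model. It is not higan code, and `simulate` and `survivors` are names made up for illustration. It assumes that a readIO() of a timer returns the counter and clears it, that a readRAM() returns the underlying RAM byte, and that the timer does not tick between cycles 3 and 4. It keeps only the read candidates where A ends up holding the old timer value and the timer ends up cleared.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Kind of access a cycle performs (readRAM vs readIO in the list above).
enum class Access { RAM, IO };

struct Candidate {
  const char* name;
  Access cycle3;
  Access cycle4;
  int loadCycle;  // cycle whose value ends up in A: 3 or 4
};

// The eight read candidates, in the order they are listed in the thread.
const std::array<Candidate, 8> kCandidates{{
  {"Alpha",   Access::RAM, Access::RAM, 4},
  {"Beta",    Access::RAM, Access::IO,  4},
  {"Gamma",   Access::IO,  Access::RAM, 4},
  {"Delta",   Access::IO,  Access::IO,  4},
  {"Epsilon", Access::RAM, Access::RAM, 3},
  {"Zeta",    Access::RAM, Access::IO,  3},
  {"Eta",     Access::IO,  Access::RAM, 3},
  {"Theta",   Access::IO,  Access::IO,  3},
}};

struct Outcome {
  uint8_t a;      // value loaded into A
  uint8_t timer;  // timer counter after the instruction
};

// Toy model (assumption): an IO read returns the counter and clears it,
// a RAM read returns the byte underneath and leaves the counter alone.
Outcome simulate(const Candidate& c, uint8_t timer, uint8_t ram) {
  assert(c.loadCycle == 3 || c.loadCycle == 4);
  uint8_t a = 0;
  for (int cycle = 3; cycle <= 4; ++cycle) {
    const Access kind = (cycle == 3) ? c.cycle3 : c.cycle4;
    uint8_t value = ram;
    if (kind == Access::IO) {
      value = timer;
      timer = 0;
    }
    if (cycle == c.loadCycle) a = value;
  }
  return {a, timer};
}

// Candidates that leave A == old timer value and the timer cleared.
std::vector<std::string> survivors(uint8_t timer, uint8_t ram) {
  std::vector<std::string> out;
  for (const auto& c : kCandidates) {
    const Outcome o = simulate(c, timer, ram);
    if (o.a == timer && o.timer == 0) out.push_back(c.name);
  }
  return out;
}
```

With a nonzero timer value that differs from the RAM byte, e.g. `survivors(5, 0x77)`, this toy model keeps Beta, Eta and Theta. It also drops Delta, because the second readIO() sees the counter the first one just cleared. On hardware a tick between the two cycles could change that, so treat the Delta result as an artifact of the no-tick assumption rather than proof.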
AWJ wrote:
I can write a program to do this, but someone who's already got a bit of experience with APU programming (Revenant?) can probably do it a lot more easily and quickly.
Alright, finally got around to this.
http://revenant1.net/smpechotest.sfc
The result:
(that is, $214x is reading back the initial SMP->CPU port writes, not echo buffer data)
I can try to test the other thing later, I suppose.
Thanks a ton, Revenant. As I expected, that was much faster and much nicer than I could have done it. I'm impressed that you managed to play actual music essentially without access to zero page.
Okay, I had to take it down and reupload because I was dumb and didn't delay for long enough before enabling the echo to reliably avoid clobbering my one unnecessarily large sample on the real hardware. Don't write test ROMs at 2:30 AM, kids.
Anyway, you get the idea. The real console works pretty much the way I figured it did (i.e. with a bunch of simple D-latches or whatever between the CPU and SMP).
Shoot, I was going to post that hex_usr just made the same demo last night, but was too tired. Still, thank you for the extra test to confirm!
This has been fixed in upstream, by giving SPC700::port(Read,Write) its own 4x8-bit buffer.
So obviously this confirms that the DSP can modify the internal $f4-f7 RAM, and that the CPU does not read back the internal RAM, but separate 4x8-bit latches. But one question I still have is, if the CPU writes to $2140-2143, do those bytes show up when the DSP then reads from underlying $f4-f7 RAM?
Definitely not. There are 8 latches, four CPU-to-SMP (though the SMP can clear them to 0) and four SMP-to-CPU and all totally separate from RAM. The CPU having direct access to APU RAM (either reading or writing) makes no sense and is simply impossible when the SMP and DSP are both already constantly using it. Look how difficult it is for the GSU or SA-1 to share ROM/RAM with the CPU.
I imagine the reason you got confused and originally implemented it the way you did is because whenever the SMP writes to any of $F0-$FF the data also falls through to APU RAM as well as to the internal register.
Code:
SMP writes to $F1 -> value written goes to underlying APU RAM, CPU-to-SMP latches set to 0 depending on set bits
SMP writes to $F4-$F7 -> value written goes to SMP-to-CPU latch (where the CPU can see it) and also to underlying APU RAM (where the DSP can see it)
SMP reads $F4-$F7 -> reads CPU-to-SMP latch
DSP reads from or writes to $F0-FF -> reads/writes APU RAM, never SMP registers
CPU writes to $2140-$2143 -> value written goes to CPU-to-SMP latch, does not affect APU RAM
CPU reads from $2140-$2143 -> reads SMP-to-CPU latch, cannot see APU RAM
Every one of these cases was already correct in higan except "CPU reads from $2140-$2143".
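The access rules listed above can be written down as a small stand-alone model. This is only a sketch under the assumptions in that list: `ApuPorts` and its method names are hypothetical, not higan's API, and every other I/O register ($F0, $F2-$F3, $F8-$FF) is left out, so reads of those simply fall through to RAM here.

```cpp
#include <array>
#include <cstdint>

// Hypothetical toy model of the eight port latches plus APU RAM.
struct ApuPorts {
  std::array<uint8_t, 4> cpuToSmp{};   // CPU writes $2140-$2143, SMP reads $F4-$F7
  std::array<uint8_t, 4> smpToCpu{};   // SMP writes $F4-$F7, CPU reads $2140-$2143
  std::array<uint8_t, 0x10000> ram{};  // APU RAM, visible to the SMP and the DSP

  // CPU side: only ever touches the latches, never APU RAM.
  void cpuWrite(unsigned port, uint8_t data) { cpuToSmp[port & 3] = data; }
  uint8_t cpuRead(unsigned port) const { return smpToCpu[port & 3]; }

  // SMP side: every write also falls through to APU RAM.
  void smpWrite(uint16_t addr, uint8_t data) {
    if (addr == 0x00f1) {
      // CONTROL: bit 4 clears input ports 0-1, bit 5 clears input ports 2-3.
      if (data & 0x10) cpuToSmp[0] = cpuToSmp[1] = 0;
      if (data & 0x20) cpuToSmp[2] = cpuToSmp[3] = 0;
    } else if (addr >= 0x00f4 && addr <= 0x00f7) {
      smpToCpu[addr & 3] = data;
    }
    ram[addr] = data;
  }
  uint8_t smpRead(uint16_t addr) const {
    if (addr >= 0x00f4 && addr <= 0x00f7) return cpuToSmp[addr & 3];
    return ram[addr];  // simplification: other I/O registers are not modelled
  }

  // DSP side: sees APU RAM only, never the SMP registers.
  uint8_t dspRead(uint16_t addr) const { return ram[addr]; }
  void dspWrite(uint16_t addr, uint8_t data) { ram[addr] = data; }
};
```

In this model a CPU write to port 0 is visible to `smpRead(0x00f4)` but not to `dspRead(0x00f4)`, while an SMP write to $F4 is visible to both `cpuRead(0)` and `dspRead(0x00f4)`.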
Perfect, thank you. Sorry to be so pedantic, but thanks for your patience.
Here's the updated code, which passes hex_usr's and Revenant's test ROMs:
Code:
alwaysinline auto SMP::readRAM(uint16 addr) -> uint8 {
if(addr >= 0xffc0 && io.iplromEnable) return iplrom[addr & 0x3f];
if(io.ramDisable) return 0x5a; //0xff on mini-SNES
return apuram[addr];
}
alwaysinline auto SMP::writeRAM(uint16 addr, uint8 data) -> void {
//writes to $ffc0-$ffff always go to apuram, even if the iplrom is enabled
if(io.ramWritable && !io.ramDisable) apuram[addr] = data;
}
auto SMP::readPort(uint2 port) const -> uint8 {
return io.port[port];
}
auto SMP::writePort(uint2 port, uint8 data) -> void {
io.port[port] = data;
}
auto SMP::readBus(uint16 addr) -> uint8 {
uint result;
switch(addr) {
case 0xf0: //TEST -- write-only register
return 0x00;
case 0xf1: //CONTROL -- write-only register
return 0x00;
case 0xf2: //DSPADDR
return io.dspAddr;
case 0xf3: //DSPDATA
//0x80-0xff are read-only mirrors of 0x00-0x7f
return dsp.read(io.dspAddr & 0x7f);
case 0xf4: //CPUIO0
case 0xf5: //CPUIO1
case 0xf6: //CPUIO2
case 0xf7: //CPUIO3
synchronize(cpu);
return cpu.readPort(addr);
case 0xf8: //RAM0
return io.ram00f8;
case 0xf9: //RAM1
return io.ram00f9;
case 0xfa: //T0TARGET
case 0xfb: //T1TARGET
case 0xfc: //T2TARGET -- write-only registers
return 0x00;
case 0xfd: //T0OUT -- 4-bit counter value
result = timer0.stage3;
timer0.stage3 = 0;
return result;
case 0xfe: //T1OUT -- 4-bit counter value
result = timer1.stage3;
timer1.stage3 = 0;
return result;
case 0xff: //T2OUT -- 4-bit counter value
result = timer2.stage3;
timer2.stage3 = 0;
return result;
}
return readRAM(addr);
}
auto SMP::writeBus(uint16 addr, uint8 data) -> void {
switch(addr) {
case 0xf0: //TEST
if(r.p.p) break; //writes only valid when P flag is clear
io.clockSpeed = (data >> 6) & 3;
io.timerSpeed = (data >> 4) & 3;
io.timersEnable = data & 0x08;
io.ramDisable = data & 0x04;
io.ramWritable = data & 0x02;
io.timersDisable = data & 0x01;
io.timerStep = (1 << io.clockSpeed) + (2 << io.timerSpeed);
timer0.synchronizeStage1();
timer1.synchronizeStage1();
timer2.synchronizeStage1();
break;
case 0xf1: //CONTROL
io.iplromEnable = data & 0x80;
if(data & 0x30) {
//one-time clearing of APU port read registers,
//emulated by simulating CPU writes of 0x00
synchronize(cpu);
if(data & 0x20) {
cpu.writePort(2, 0x00);
cpu.writePort(3, 0x00);
}
if(data & 0x10) {
cpu.writePort(0, 0x00);
cpu.writePort(1, 0x00);
}
}
//0->1 transition resets timers
if(!timer2.enable && (data & 0x04)) {
timer2.stage2 = 0;
timer2.stage3 = 0;
}
timer2.enable = data & 0x04;
if(!timer1.enable && (data & 0x02)) {
timer1.stage2 = 0;
timer1.stage3 = 0;
}
timer1.enable = data & 0x02;
if(!timer0.enable && (data & 0x01)) {
timer0.stage2 = 0;
timer0.stage3 = 0;
}
timer0.enable = data & 0x01;
break;
case 0xf2: //DSPADDR
io.dspAddr = data;
break;
case 0xf3: //DSPDATA
if(io.dspAddr & 0x80) break; //0x80-0xff are read-only mirrors of 0x00-0x7f
dsp.write(io.dspAddr & 0x7f, data);
break;
case 0xf4: //CPUIO0
case 0xf5: //CPUIO1
case 0xf6: //CPUIO2
case 0xf7: //CPUIO3
synchronize(cpu);
writePort(addr, data);
break;
case 0xf8: //RAM0
io.ram00f8 = data;
break;
case 0xf9: //RAM1
io.ram00f9 = data;
break;
case 0xfa: //T0TARGET
timer0.target = data;
break;
case 0xfb: //T1TARGET
timer1.target = data;
break;
case 0xfc: //T2TARGET
timer2.target = data;
break;
case 0xfd: //T0OUT
case 0xfe: //T1OUT
case 0xff: //T2OUT -- read-only registers
break;
}
writeRAM(addr, data); //all writes, even to MMIO registers, appear on bus
}
Please ignore that all eight registers are most likely inside the SMP core. I'm aware of that.
CPU writes to $214x go to a CPU-side 4x8-bit array, CPU reads come from the SMP-side 4x8-bit array.
SMP writes to $f4-f7 go to an SMP-side 4x8-bit array, SMP reads come from the CPU-side 4x8-bit array.
SMP writes to $f1 can write to the CPU-side 4x8-bit array (which is why all eight bytes are physically in the SMP.)
SMP writes to $f4-f7 will also modify the internal APU RAM that the DSP can then read back.
DSP reads from $f4-f7 are from underlying APU RAM, DSP writes to $f4-f7 are to underlying APU RAM.
...
Also, I added emulation of the glitch with mov (x)+ modes. I'm pretty confident writes are correct, but we still need to create test ROMs to narrow down the true behavior of reads.
That all looks right to me.
byuu wrote:
Shoot, I was going to post that hex_usr just made the same demo last night, but was too tired. Still, thank you for the extra test to confirm!
Damn, that's what I get for not checking your forum as often as this one. Hopefully mine was useful anyway.
Re: the test for mov a,(x)+: if my understanding of the SMP timing is correct, would this theoretically work?
Code:
mov x, #$ff
mov $fc, #1 ; timer 2 ticks every 16 CPU cycles
mov $f1, #0 ; reload timer
mov $f1, #4 ; enable timer 2 (bit 2 of $f1; #1 would enable timer 0)
; wait 12 (16-4) cycles
xcn
xcn
nop
; read timer 2
mov a, (x)+
If I have this right, then cycle 3 of the read will occur 15 cycles after the timer starts, and cycle 4 occurs 16 cycles after, with the timer ticking up in between the two. I have a feeling I might be missing some crucial timing detail, though, so please feel free to correct me before I try making another test ROM.
I don't think that will work. According to bsnes, it looks like the non-programmable stage of the timers (the one that gives T2 a different base frequency from T0 and T1) ticks constantly whether or not each timer is enabled via $F1. You'd have to enable the timer and then do some blargg voodoo to synchronize your instruction execution with it.
Argh, you're right. Good catch.
I wonder if there's a way you could reliably manipulate that stage by (temporarily) changing the timer speed bits in the test register, assuming bsnes/higan is correct about how those actually work. It'd probably be a long shot though.
Revenant (or anyone with a flashcart handy) can you run "test_speed.smc" from http://snescentral.com/article.php?id=1115 on real hardware and snap a screenshot? I just want to check that it really prints "done" and not "passed" on hardware, and the numbers it prints in current higan still match hardware.
I can check later tonight when I get home unless someone beats me to it again.
(Should I run the test_timer_* ones too or are those not as important?)
Revenant wrote:
I can check later tonight when I get home unless someone beats me to it again.
(Should I run the test_timer_* ones too or are those not as important?)
You might as well run all of them if it's no inconvenience, but the test_speed one is the one I'm currently reverse engineering.
ETA: by "all of them" I only mean test_timer_speed*, not the freezing ones or whatever.
> You might as well run all of them if it's no inconvenience, but the test_speed one is the one I'm currently reverse engineering.
I think blargg's convention was to print "done" on tests where he didn't check the values against known good values. The rest will be "passed" or "failed", sometimes with a CRC32.
test_speed is testing the S-SMP TEST register's speed control. d7,d6 represent the SMP speed:
00 = 100% speed
01 = 50% speed
10 = deadlock
11 = 10% speed
So when you look at blargg's test numbers, you're seeing this:
0a, 1a, 2a, 3a = 100% speed = ~256
4a, 5a, 6a, 7a = 50% speed = ~128
8a, 9a, aa, ba = deadlock (cannot test)
ca, da, ea, fa = 10% speed = ~25.6
The actual reality of these speed bits is even more nuanced, however. Different SNES consoles, even when they're the exact same motherboard revision, will lock up with different values. Some will lock with d6,d7!=0. Others will run two speed modes. We've never found an SNES that could run d6,d7=2 without freezing.
d5,d4 represent the timer speed. That algorithm is more complicated, but it's:
Code:
io.timerStep = (1 << io.clockSpeed) + (2 << io.timerSpeed);
So when you see his results, the values repeat after every four increments of the upper nibble, because only the lower two bits of that nibble (d5,d4) affect the timer speed.
d3,d2,d1,d0 are timerEnable, ramDisable, ramWritable, timerDisable. Yes, it's very weird to have both an enable and disable bit on the timers. This is set to 0xa for obvious reasons on power-up.
EDIT: forgot that we actually did figure out the d5,d4. Added details after refreshing my memory on them.
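For reference, the bit layout described in this post can be decoded like this. It is a minimal sketch: `TestRegister` and `decodeTest` are illustrative names, and the `timerStep` formula is simply the higan one quoted above, not an independently verified hardware fact.

```cpp
#include <cstdint>

// Decoded view of the S-SMP TEST register ($00F0), following the post above.
struct TestRegister {
  unsigned clockSpeed;   // d7,d6
  unsigned timerSpeed;   // d5,d4
  bool timersEnable;     // d3
  bool ramDisable;       // d2
  bool ramWritable;      // d1
  bool timersDisable;    // d0
  unsigned timerStep;    // derived with the higan formula quoted above
};

TestRegister decodeTest(uint8_t data) {
  TestRegister t{};
  t.clockSpeed    = (data >> 6) & 3;
  t.timerSpeed    = (data >> 4) & 3;
  t.timersEnable  = (data & 0x08) != 0;
  t.ramDisable    = (data & 0x04) != 0;
  t.ramWritable   = (data & 0x02) != 0;
  t.timersDisable = (data & 0x01) != 0;
  t.timerStep     = (1u << t.clockSpeed) + (2u << t.timerSpeed);
  return t;
}
```

Decoding the power-up value `0x0a` gives timers enabled, RAM writable, both speed fields 0 and a `timerStep` of 3.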
http://imgur.com/a/cmdk3
Here are four runs of test_speed. The values (whatever they are) average slightly lower than what higan shows (i.e. I never see 252 on the real hardware).
All of the test_timer_speed ROMs hang after displaying only two rows of values, but appear to be consistent with higan otherwise.
Okay, I've pretty thoroughly disassembled and reverse engineered test_speed.smc. Using some slightly tricky code, it stores a BRA -2 ($2F $FE) into the last two SMP input ports (W:$2142-2143 on the CPU side, R:$00F6-00F7 on the SMP side) and then tells the IPL ROM to jump there, so that the SMP ends up executing a tight infinite loop right out of its input ports. The CPU then controls the SMP by writing two bytes worth of opcodes (one two-byte opcode or two one-byte ones) at a time to $2140-2141, writing $FC to $2143 (changing the BRA -2 to a BRA -4), giving the SMP just enough time to take the changed loop and execute the stored opcode once, and then changing $2143 back to $FE. The sequence of opcodes it executes by this method is as follows:
Code:
CD 00 MOV X=#$00
E8 xx MOV A=#$xx (xx is the TEST register value)
C4 F0 MOV $F0=A
00 3D NOP; INC X (for this one there's a delay loop before the CPU resets the BRA target)
E8 0A MOV A=#$0A (this one has a delay loop too)
C4 F0 MOV $F0=A (this one has a delay loop too)
D8 F4 MOV $F4=X
(at this point the CPU reads $2140 to get the value of X written by the SMP)
E8 00 MOV A=#$00
C4 F4 MOV $F4=A
The results the test program prints to the screen for each value of TEST are the values read from $2140 at line 8: basically, the number of times the SMP was able to execute NOP; INC X; BRA -4 while the CPU was in its delay loop.
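The reason flipping one byte at $2143 switches the SMP between spinning and executing the stored opcodes is just relative-branch arithmetic. A minimal sketch, with `branchTarget` as a made-up helper name: the signed displacement is added to the address of the instruction following the two-byte BRA.

```cpp
#include <cstdint>

// Target of a two-byte SPC700 relative branch whose opcode sits at opcodeAddr.
// The signed 8-bit displacement is applied to the address of the next
// instruction, i.e. opcodeAddr + 2.
uint16_t branchTarget(uint16_t opcodeAddr, uint8_t displacement) {
  const uint16_t next = static_cast<uint16_t>(opcodeAddr + 2);
  return static_cast<uint16_t>(next + static_cast<int8_t>(displacement));
}
```

With the BRA opcode at $00F6, a displacement of $FE (-2) gives $00F6, so the branch jumps to itself. A displacement of $FC (-4) gives $00F4, so the SMP runs the two bytes the CPU placed in $2140-$2141 and then hits the BRA again.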
The important thing is that during this test the SMP doesn't access RAM even once; it's executing entirely out of its I/O ports. So the test program results are fully consistent with nocash's theory that TEST d6-d7 controls the number of cycles an I/O port access takes and d4-d5 controls the number of cycles a RAM access takes. If that theory is right, the complicated formula in higan relating TEST d4-d7 to the timers is simply an artifact of the proportion of RAM accesses to I/O port accesses that blargg's timer test program does.
I also suspect that a speed of 2 (for either RAM or I/O) takes 5 cycles but only clocks the timers 4 times, and a speed of 3 takes 10 cycles but only clocks the timers 8 times. That would explain why the SMP runs at 1/10th speed while executing out of I/O ports while TEST.bits(6, 7) == 3, but the fastest the timers can apparently go relative to SMP instruction execution is only 8 times normal (when TEST.bits(4,7) == 0xF).
Another possibility for the 8/10 inconsistency is that blargg's timer test program makes the SMP do some ROM accesses in addition to RAM and I/O, and that ROM accesses always take only 1 cycle. I guess I'll have to disassemble it as well to be sure.
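Written out as code, the model being discussed looks roughly like this. It is a sketch of the hypothesis only: the 1/2/5/10 cycle costs and the 1/2/4/8 timer clocks are the guesses stated above, the cost for speed 2 is untested since that setting locks up real consoles, and `cyclesFor` and `timerClocks` are illustrative names.

```cpp
#include <cstdint>

// Which TEST field governs an access, per nocash's theory as described above.
enum class Region { Io, Ram };

// Hypothesised cost in cycles of one access for speed settings 0..3.
unsigned accessCycles(unsigned speed) {
  static const unsigned kCycles[4] = {1, 2, 5, 10};
  return kCycles[speed & 3];
}

// Hypothesised number of timer clocks generated by one such access:
// 4 clocks for the 5-cycle case and 8 clocks for the 10-cycle case.
unsigned timerClocks(unsigned speed) {
  return 1u << (speed & 3);
}

// Cycles one access takes for a given TEST value:
// d7,d6 select the I/O speed and d5,d4 select the RAM speed.
unsigned cyclesFor(uint8_t test, Region region) {
  const unsigned speed =
      (region == Region::Io) ? (test >> 6) & 3 : (test >> 4) & 3;
  return accessCycles(speed);
}
```

Under these assumptions `cyclesFor(0xca, Region::Io)` is 10 while `cyclesFor(0xca, Region::Ram)` stays at 1, which is the kind of split a test running purely out of the I/O ports cannot see.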
nocash's model is certainly more elegant, but let's note a few things.
First, he missed the purpose of $00f0.d2 as being RAM disable. You can still run code out of the I/O ports with RAM disabled. Particularly troublesome as he analyzed my uPD96050 emulation and used my new instruction mnemonics, but I guess he didn't look at my SMP implementation.
Second, he mentions $00f0.d6,d7 as controlling not just I/O but ROM access timing ... I can't imagine he's talking about the IPLROM. So ... what ROM?
This is kind of a theme with nocash. He's often right, but when he hits undocumented things, sometimes his theories are just off the wall, and he doesn't note them as theories, and he provides no proof of his claims. I suppose that's where we come in (well, Revenant mostly) :P
The $00f8,$00f9 note I am much more inclined to believe because he is pointing to actual CPU pins. But I do wonder where he got the P4/P5/P5RD labels from. Similarly, documentation indicates $00fa-00fc is TnTARGET, not TnDIV. He's just renaming registers as he chooses to.
Again, I'm not foolish enough to say he's wrong. But I'd love to see some tests to indicate if he's right. Especially one to prove the 20% speed case. Looking at blargg's test_timer_speed:
0 = 2731 (100%)
1 = 1639 (60%)
2 = 911 (33%)
3 = 482 (17.5%)
Not quite 1/2/5/10. But you seem to have a better grasp on how executing code out of I/O registers can affect timers, so ... I'm willing to make the change he talks about into a higan fork branch, and see if blargg's tests still align properly.
Next, he's implying that NOP's idle cycle is a "RAM timing" and TCALL's three idle cycles are "I/O timing." Yet Overload's logic analyzer traces show that the first cycle of both has the program counter on the bus with RWB=1. In other words, this sounds like memory reads.
If both Overload and blargg were able to detect the mov (x)+ anomaly, then how would they have missed that other opcodes had weird effects like that?
Again, it's nocash. What even is the "SPC700 waitstates on internal cycles" table? How did he make that? Where is it from? How did he verify that? No answers. Just a table and "just go with it."
And finally, I/O writes fall through and update the underlying APU RAM. So what happens if your I/O speed is 1 but your RAM speed is 10? Does the RAM write just silently fail? Or silently work anyway? Maybe it'd fail if your -actual- APU RAM were timed ten times slower, whereas since it's perfectly capable of 1-waitstate operation, it'll just always work in this case?
Revenant, I feel pretty bad for working you like a mule but I've got another test program I'd like you to write. This one should be simpler than the echo buffer test.
Upload this tiny program to the SMP (somewhere that won't get clobbered by the IPL ROM--$0200 would be good) and jump to it:
Code:
mov $f1,#$b0 ; enable IPL ROM; clear input ports
mov $f4,#$00 ; clear output port $F4
: mov a,$f5
beq :- ; loop until the CPU writes nonzero to $2141
mov $f0,$f4 ; set TEST to whatever the CPU wrote to $2140
jmp $ffc0 ; jump to IPL ROM entry point
Then, have the CPU write different TEST values to $2140-2141 and time how long it takes for the IPL ROM to clear APU zero page and put out the $AA handshake. Something like this on the CPU side:
Code:
(A contains value to write to TEST; $0A, $1A, $4A and $5A should be safe)
($CA and $DA are probably safe too, at least on Revenant's SNES)
rep #$30
ora #$FF00 ; set high bits which will be written to $2141
tay
lda #$2100
tcd ; use direct page to make our loop tighter
sep #$20
lda #$AA
ldx #$0000
sty $40 ; = $2140
: inx
beq hung
cmp $40 ; = $2140
bne :- ; loop until $2140 = #$AA (meaning IPL finished clearing zero page)
(now print TEST value and X to the screen)
hung:
(whoops, looks like we hung the SMP with a bad TEST value...)
The idea is to see which of $F0 d4-d7, if any, affect execution speed when the SMP is accessing ROM and RAM (the IPL ROM startup code runs out of ROM and writes to RAM).
byuu wrote:
What even is the "SPC700 waitstates on internal cycles" table? How did he make that? Where is it from? How did he verify that? No answers. Just a table and "just go with it."
Agree that that table is probably nonsense. Nocash claims NOP's idle cycle has "RAM timing", but blargg's test_speed.smc does NOPs and isn't affected by d4-d5 at all.
Quote:
So what happens if your I/O speed is 1 but your RAM speed is 10? Does the RAM write just silently fail? Or silently work anyway? Maybe it'd fail if your -actual- APU RAM were timed ten times slower, whereas since it's perfectly capable of 1-waitstate operation, it'll just always work in this case?
The TEST register is probably meant to slow down the SMP while it's running in an ICE or something like that, and isn't meant to be touched on a production system. It's called TEST, after all.
Quote:
But you seem to have a better grasp on how executing code out of I/O registers can affect timers
The test program I disassembled (test_speed.smc) doesn't use or touch the SMP timers at all.
Sorry, can you provide source code for that ROM? Even in bsnes-classic, TEST d4-d5 seem to be affecting the result by more than rounding error and $DA is locking up the SMP, and I'm not sure what's going on... You don't have NMI enabled while you're doing the timing loops, do you?
Okay, looking again, I indeed had NMI enabled, which did affect the results. I uploaded a new version of the ROM.
Also, the reason $DA and higher were failing was because I wasn't waiting on the SMP program to write zero to $F4, so the CPU would write to the port, then the SMP would write $f1 to clear the input port, then the CPU would time out waiting for a response to something that the SMP never actually saw.
I fixed that and ended up seeing something strange on the hardware: after writing $CA, $DA, $EA, or $FA, the CPU<->SMP communication seems to break down and zero is never read from $2140 again. (Instead, I get a constant $CC, which tells me that attempting to restart the SMP program by forcing a jump to $200 may or may not have actually succeeded). See this screenshot.
The actual values returned when I test each of those four individually (averaged over several runs each):
$CA -> $1A40
$DA -> $1AE2
$EA -> $1CCE
$FA -> $2000
Hopefully that's still useful. I'm not really sure if the weird port issue that happens after those four specific values is somehow my fault or not.
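To put those four numbers side by side, here is a small sketch that converts them to a relative increase over the $CA run. It assumes the printed value is the CPU-side loop counter from the proposed test (so larger means the SMP took longer), and `increasePermille` is a made-up helper. The four TEST values differ only in d5,d4.

```cpp
#include <cstdint>

struct Sample {
  uint8_t test;    // value written to TEST
  uint32_t loops;  // averaged CPU loop count reported above
};

// The four measurements quoted above.
const Sample kSamples[4] = {
  {0xca, 0x1a40},
  {0xda, 0x1ae2},
  {0xea, 0x1cce},
  {0xfa, 0x2000},
};

// Increase over the $CA run in tenths of a percent, using integer math.
unsigned increasePermille(const Sample& s) {
  const uint32_t base = kSamples[0].loops;
  return static_cast<unsigned>((s.loops - base) * 1000 / base);
}
```

This works out to roughly 2.4% for $DA, 9.7% for $EA and 21.9% for $FA relative to $CA.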
Here's the somewhat messy source to the CPU-side program. Hopefully it's clear what it's actually attempting to do. The SMP code is identical to what you already posted.
Thanks once again.
It looks like on your particular SNES, when TEST is >= $CA the IPL ROM's RAM-clearing loop works (allowing one test to succeed) but some later step, either the IPL ROM comms loop or executing out of RAM, is hanging the SMP.
At any rate, it looks like nocash is approximately correct: TEST d6-d7 seems to affect access speed for both I/O ports and IPL ROM, and TEST d4-d5 affects access speed for RAM. TEST = $5A takes almost exactly twice as long as TEST = $0A, and TEST = $FA (when it works without crashing) takes almost exactly ten times as long as TEST = $0A.
Now to see if I can implement that in bsnes and get the SMP timer tests (which are apparently more precise?) to match...
AWJ wrote:
It looks like on your particular SNES, when TEST is >= $CA the IPL ROM's RAM-clearing loop works (allowing one test to succeed) but some later step, either the IPL ROM comms loop or executing out of RAM, is hanging the SMP.
Considering my SNES is an old SHVC-CPU-01, I wonder if running the ROM on consoles with a separate vs. integrated APU would make a difference here. Or maybe my specific unit is just broken in a really unimportant way, who knows.
Either way, nice to see that the numbers look good. Hopefully matching it up with the other test ROMs isn't too much of a challenge.
Well how about that. It looks like nocash bested myself, anomie, and blargg in understanding the TEST register's top four bits. His model also passes all of blargg's test_*speed ROMs. Oh well, blargg at least figured out RAM disable which nocash didn't, heehee.
Great catch bringing this stuff up, AWJ. I'd have missed both Overload's PDF and nocash's TEST documentation.
This gives us a wonderful test opportunity as well. You know how (x)+ writes don't read the timer? Well if we execute enough (x)+ instructions with RAM at 2 (5 waits, 20% speed) and I/O at 3 (10 waits, 10% speed), then we can find out more information on what that cycle is actually doing. It might even end up being a 100% speed cycle. We can alternate between X fetching $fd and $00 for comparison as well.
The hard part will be trying to model emulation of when real hardware will lock up. It definitely varies per SNES, which makes this way harder. I'm guessing our best bet is simply to print a warning to a debugger/terminal saying TEST.d4-d7!=0. But, it'll always be one way to detect an emulator ... just, not so practical once you confirm it's not :P
Note: if anyone else is reading this thread and doesn't know ... the reason the numbers don't match perfectly is that the real SNES has separate oscillators for the CPU and APU. These values drift on real hardware. Adjusting the values in emulation subtly changes the output values, as you would expect. Trying to match the current oscillator rates of 20 year old hardware that's undoubtedly drifted somewhat out of spec and has a natural variance between parts is pretty silly. All we need to do is get really close, which we have.
Quote:
Agree that that table is probably nonsense. Nocash claims NOP's idle cycle has "RAM timing", but blargg's test_speed.smc does NOPs and isn't affected by d4-d5 at all.
I think we can say for sure it's junk now.
Great work! Are you using the address on the bus (based on Overload) to adjudicate which speed to use for "idle" cycles, or are you treating some of them as always-IO/ROM? And are you applying my 4-for-5 and 8-for-10 hypothesis for the timers?
Sorry Revenant, I've got one more test for you to run. This one can use exactly the same CPU-side program as before, only the SMP-side is different:
Code:
mov $f1,#$b0 ; enable IPL ROM; clear input ports
mov x,#$7f ; set up x for loop
mov $f4,#$00 ; clear output port $F4
: mov a,$f5
beq :-
mov $f0,$f4 ; set TEST
: mul ya ; (fetch, ?dummy operand?, 7 ?idle?)
inc x ; (fetch, ?dummy operand?)
bmi :- OR bne :- ; (fetch, operand, 2 ?idle? when branch taken)
mov $f0,#$0a ; reset TEST to default
jmp $ffc9 ; jump directly to IPL ROM handshake (skip RAM clear)
I'd like you to test both with BMI and with BNE on line 9, because nocash claims there's a difference between BPL/BVC/BCC/BNE and BMI/BVS/BCS/BEQ (which seems highly unlikely to me, but whatever). Use every value of TEST you can that doesn't lock up your SNES's SMP (including $8A/$9A/$AA/$BA, just in case they work on your SNES with this opcode sequence).
The goal is to see if there are any "idle cycles" that are always port/ROM-speed regardless of the address (we're already pretty sure that there aren't any that are always RAM-speed).
> Are you using the address on the bus (based on Overload) to adjudicate which speed to use for "idle" cycles, or are you treating some of them as always-IO/ROM?
Code:
uint waitStates[] = {1, 2, 5, 10};
if((addr & 0xffc0) == 0xffc0 && iplromEnable) return waitStates[TEST.bits(6,7)]; //ROM
if((addr & 0xfff0) == 0x00f0) return waitStates[TEST.bits(6,7)]; //IO
return waitStates[TEST.bits(4,5)]; //RAM
As per Overload's PDF, I'm acting like there's no such thing as pure idle cycles, except of course for the weird (x)+ case ... not sure what to do with that one.
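To make that lookup easier to poke at outside the emulator, here is a minimal self-contained sketch of the same address decode. The helper names (`bits`, `waitStatesFor`) and the idea of passing TEST as a plain byte are assumptions for illustration, not higan's actual interface.

```cpp
#include <cassert>
#include <cstdint>

// Extract bits lo..hi (inclusive) from a register value.
static unsigned bits(uint8_t value, unsigned lo, unsigned hi) {
  return (value >> lo) & ((1u << (hi - lo + 1)) - 1);
}

// Wait states for one bus cycle at `addr`, given the TEST register value.
// Hypothetical standalone version of the snippet above.
unsigned waitStatesFor(uint16_t addr, uint8_t test, bool iplromEnable) {
  static const unsigned waitStates[4] = {1, 2, 5, 10};
  if((addr & 0xffc0) == 0xffc0 && iplromEnable) return waitStates[bits(test, 6, 7)];  // IPL ROM
  if((addr & 0xfff0) == 0x00f0) return waitStates[bits(test, 6, 7)];  // I/O ports
  return waitStates[bits(test, 4, 5)];  // RAM
}
```

With TEST = $1A, for example, a RAM access costs 2 wait states while an access to $F4 still costs 1.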
> And are you applying my 4-for-5 and 8-for-10 hypothesis for the timers?
I left the timer step as (1 << TEST.bits(6,7)) + (2 << TEST.bits(4,5)); because anything else broke test_timer_speed.
> Sorry Revenant, I've got one more test for you to run.
No interest in that (x)+ test with separate ROM+IO/RAM speeds? :/
byuu wrote:
I left the timer step as (1 << TEST.bits(6,7)) + (2 << TEST.bits(4,5)); because anything else broke test_timer_speed.
That means we aren't done yet. Bragging about still passing blargg's timer speed tests when you haven't actually changed the timer behaviour is cheating.
Also, that lookup table should really be static const.
It is static const, don't worry.
I wasn't meaning to cheat on timers. I just meant that emulating the wait states passed Revenant's tests without breaking blargg's.
The timer thing is trickier. The number of cycles it takes to advance the stage 1 counters is based off both settings, and doesn't seem to care how long RAM vs IO accesses take. In other words, it only cares how many cycles are executed -- not how many wait states each cycle takes.
Code:
(1 << clockSpeed[ROM/IO speed]) + (2 << timerSpeed[RAM speed])
00 = 1 wait state
01 = 2 wait states
10 = 5 wait states
11 = 10 wait states
(.d7,d6) (.d5,d4)
(1 << 0) + (2 << 0) = 3
(1 << 0) + (2 << 1) = 5
(1 << 0) + (2 << 2) = 9
(1 << 0) + (2 << 3) = 17
(1 << 1) + (2 << 0) = 4
(1 << 1) + (2 << 1) = 6
(1 << 1) + (2 << 2) = 10
(1 << 1) + (2 << 3) = 18
(1 << 2) + (2 << 0) = 6
(1 << 2) + (2 << 1) = 8
(1 << 2) + (2 << 2) = 12
(1 << 2) + (2 << 3) = 20
(1 << 3) + (2 << 0) = 10
(1 << 3) + (2 << 1) = 12
(1 << 3) + (2 << 2) = 16
(1 << 3) + (2 << 3) = 24
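The table above can be regenerated from the formula with a few lines of code. This is only a sketch of the formula as quoted; `timerStep` is a made-up name, and it says nothing about whether the formula matches hardware outside blargg's test.

```cpp
#include <cassert>
#include <cstdint>

// Timer step per SMP cycle under the formula quoted above:
// (1 << ROM/IO speed) + (2 << RAM speed).
unsigned timerStep(uint8_t test) {
  unsigned romIoSpeed = (test >> 6) & 3;  // TEST d6-d7
  unsigned ramSpeed   = (test >> 4) & 3;  // TEST d4-d5
  return (1u << romIoSpeed) + (2u << ramSpeed);
}
```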
byuu wrote:
It is static const, don't worry.
I wasn't meaning to cheat on timers. I just meant that emulating the wait states passed Revenant's tests without breaking blargg's.
The timer thing is trickier. The number of cycles it takes to advance the stage 1 counters is based off both settings, and doesn't seem to care how long RAM vs IO accesses take. In other words, it only cares how many cycles are executed -- not how many wait states each cycle takes.
Again, you're confusing your emulator implementation with the hardware. My hypothesis is that RAM cycles and port/ROM cycles each clock the timers according to the length of time they individually take, and that formula of yours is simply an artifact of the fact that blargg's test program happens to execute exactly twice as many RAM cycles as it does port/ROM cycles. If you modified the test program so that the ratio of RAM cycles to port/ROM cycles it executed was different, that formula would no longer work (i.e. higan wouldn't produce the same results as hardware in the modified test).
Well, my attempts at decoupling it have failed. If I try to tick the timers per wait state, it doesn't matter how many ticks are required per iteration for stage0->stage1 tick, the test always fails. And it won't tell me the value of the failed test because that would be much too convenient, wouldn't it?
Hopefully you'll have better luck than I in getting the timers to work as a side effect of improving TEST.d4-d7 as nocash stated.
I really should have given the last test ROM a less clever (or at least better-punctuated) name, because thanks to my day job, I keep reading it as "SMPTEsttest".
Anyway,
AWJ wrote:
Sorry Revenant, I've got one more test for you to run. This one can use exactly the same CPU-side program as before, only the SMP-side is different:
http://revenant1.net/smpidletest_bmi.sfc
http://revenant1.net/smpidletest_bne.sfc
http://imgur.com/a/E2Q8O (spoiler alert: they're the same)
$CA and up actually work this time (with considerably higher values than I was expecting), but I assume that's due to the test register getting restored before going back to IPL land. $8A, etc. are still problematic.
Yeah, something very strange is happening with $CA and up on your SNES--probably no coincidence that the previous ROM and blargg's test ROMs lock up on your machine.
Anyway, let's just look at $0A, $1A, $4A and $5A. Translating the cycle counts into decimal:
Code:
0A (IO 0, RAM 0): 666 cycles
1A (IO 0, RAM 1): 930 cycles
4A (IO 1, RAM 0): 1061 cycles
5A (IO 1, RAM 1): 1324 cycles
As expected, 5A takes almost exactly twice as many cycles as 0A. 1A and 4A take intermediate amounts of cycles, which means that some of the cycles in the loop are IO/ROM cycles even though the loop is running entirely out of RAM. We can use the ratios of these four cycle counts to determine how many of the cycles are IO/ROM and how many are RAM.
Let x be the number of cycles out of the 15 cycles in the loop that are IO/ROM cycles. Then, solve either of these linear equations for x:
Code:
666 * x / 15 + 1324 * (15 - x) / 15 = 930 (solution: x = 8.98)
666 * (15 - x) / 15 + 1324 * x / 15 = 1061 (solution: x = 9.00)
How about that, a nice round number. It looks like 9 cycles are IO/ROM cycles and the remaining 6 cycles are RAM cycles. My educated guess would be that the first [PC+1] cycle of each one-byte instruction is a "real" read cycle (with speed depending on what memory it's accessing), and the additional idle cycles of the mul ya (and presumably other one-byte instructions that take 3 or more cycles) and the two idle cycles of a taken branch are IO/ROM cycles.
I think that's what nocash is getting at with his chart--it's meant to be the number of idle cycles of each instruction that are always IO/ROM cycles. But some of the numbers in his chart don't make a lot of sense (mainly the conditional branches and some of the stack instructions) and will have to be re-checked.
Anyway, now we have a straightforward method to test the idle cycles of any SPC700 instruction we want: design a loop that runs only that instruction and ones with known timing, run it with TEST = 0A, 1A, 4A and 5A, and do the ratio math.
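The ratio math is just linear interpolation between the all-fast and all-slow measurements, so it can be written down as a small helper. This is a sketch under the same assumption as the equations above (cycle time scales linearly with the share of slowed cycles); `estimateIoCycles` and its parameter names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Estimate how many of `loopCycles` cycles per iteration run at IO/ROM speed.
//   fast  : count with both classes at default speed (TEST = $0A)
//   slow  : count with both classes slowed            (TEST = $5A)
//   mixed : count with only one class slowed          (TEST = $1A or $4A)
//   ioSlowed : true if `mixed` slowed the IO/ROM cycles ($4A),
//              false if it slowed the RAM cycles ($1A)
double estimateIoCycles(double fast, double slow, double mixed,
                        double loopCycles, bool ioSlowed) {
  // Fraction of the loop's cycles that belong to the slowed class.
  double slowedShare = (mixed - fast) / (slow - fast);
  double slowedCycles = slowedShare * loopCycles;
  return ioSlowed ? slowedCycles : loopCycles - slowedCycles;
}
```

Feeding in the four measurements above (666, 1324, 930 and 1061 with a 15-cycle loop) gives roughly 8.98 from the $1A run and 9.00 from the $4A run, matching the hand-solved values.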
ETA: Another educated guess I would make is that "always IO/ROM" idle cycles are "real" idle cycles that don't affect internal SMP registers (i.e. don't reset the timers) regardless of what address appears on the external bus.
Just for byuu, here's a sequence to test the mov (x)+ and mov (x) instructions:
Code:
mov $f1,#$b0 ; enable IPL ROM; clear input ports
mov $00,#$00 ; set up counter for loop
mov $f4,#$00 ; clear output port $F4
: mov a,$f5
beq :-
mov $f0,$f4 ; set TEST
: mov x,#$01 ; (fetch, operand)
mov a,(x)+ OR mov (x)+,a OR mov a,(x) OR mov (x),a ; (fetch, ?, ?, ?)
inc $00 ; (fetch, operand, read, write)
bne :- ; (fetch, operand, idle, idle)
mov $f0,#$0a ; reset TEST to default
jmp $ffc9 ; jump directly to IPL ROM handshake (skip RAM clear)
As-is (with RAM vs ROM/IO timing supported):
With turning the two idle cycles in branch instructions, plus all but one in the MUL instruction to always-ROM/IO:
(note that this is not concrete proof that branches are 2, multiplies are 7.)
Given that emulation doesn't have the lockup conditions, these test ROMs can go ahead and test 8a,9a,aa,ba. Or if we just want to home in on these cycle counts, we only really need to test 0a-7a, which seems much more reliable. Clearly ca-fa is behaving pathologically on Revenant's SNES console.
But ... this is just getting insane. Rabbit holes are fun, but this is so far beyond anything that will ever be useful to any degree. It was bad enough that I/O cycles became read cycles with -very- strange addresses, but now there's some new mystery where some are secretly 'true' I/O cycles (even though the addresses show up on the bus) and some aren't, meaning Overload's logic analyzer can't even reveal this?
And apparently nocash's table won't work for us, as there are obvious errors like with BNE/BMI, and no test ROMs to verify any of it. So we pretty much have to scrap the whole thing.
Are we really going to try and emulate this exactly, and write test ROMs to time every single SPC700 instruction (well, you can break them into addressing mode groups, so probably 60 or so tests) ... even the ones that are insanely hard to test like TCALL, RTI, etc (and not test SLEEP/STOP since that's impossible)? Is Revenant going to be up for writing all of those tests? Ideally, we need it to be automated, too. Just dumping numbers on the screen is too laborious. Yet due to CPU/APU oscillator differences, we can't do exact matches. So he'll have to use ranged values. If the value is within +/- ~5%, it's a pass. Otherwise it's a failure.
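For what it's worth, the ranged comparison itself is trivial. A minimal sketch, where `withinTolerance` is a hypothetical helper and the 5% default is just the figure floated above, not a measured bound on oscillator drift:

```cpp
#include <cassert>
#include <cmath>

// Pass if `measured` is within +/- `tolerance` (as a fraction) of `expected`.
bool withinTolerance(unsigned measured, unsigned expected, double tolerance = 0.05) {
  double diff = std::fabs(static_cast<double>(measured) - static_cast<double>(expected));
  return diff <= tolerance * static_cast<double>(expected);
}
```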
And even if we do go this far ... we're still not going to know -which- cycles in each instruction are forced to use the ROM/IO timing, and which can use RAM timing. Yes, we can probably guess correctly, but they're gonna be just that ... guesses. But then I suppose it really doesn't matter if side effects are impossible to observe anyway.
One thing is ... I'd like to propose we stop calling TEST.d4,d5 "RAM timing" and TEST.d6,d7 "ROM/IO timing." It doesn't fit with this new extra behavior. Now it's more like "ROM/IO/IDLE timing." That's too clunky. I think we should change it to d4,d5 = external wait-states/timing, d6,d7 = internal wait-states/timing. The APU RAM is attached to the DSP (or really, just not inside of the SMP itself is all that matters here), and the SMP goes through the DSP to get to it, whereas IPLROM, I/O, and true idle are all internal to the SMP. Are you guys in agreement with that? If not, please suggest something better :)
byuu wrote:
Are we really going to try and emulate this exactly, and write test ROMs to time every single SPC700 instruction (well, you can break them into addressing mode groups, so probably 60 or so tests) ... even the ones that are insanely hard to test like TCALL, RTI, etc
We only need to test the instructions that have at least one "idle" cycle, and if we start seeing obvious patterns we can skip/assume a lot of them. TCALL is not hard to test at all; just disable IPL ROM and put a suitable vector into high RAM. RTI/RTS are easy to test once we know the timing for PUSH.
Frankly I think blargg-style pass/fail test ROMs are dumb and harmful, and I'm surprised if you don't agree after all the trouble you've gone through trying to make your Game Boy both pass various blargg tests and run difficult commercial games at the same time. Also, look how bsnes passed all of blargg's SMP $F0 test ROMs for ten years despite having a completely incorrect conception of what the register is actually doing. Test ROMs are all well and good, but doing an opaque series of operations, CRC32ing the results (adding a further layer of opaqueness) and finally printing either "pass" or "fail" doesn't contribute to hardware understanding, it just encourages emulator authors to game the tests.
> We only need to test the instructions that have at least one "idle" cycle, and if we start seeing obvious patterns we can skip/assume a lot of them.
Okay, so then the question is ... are we really going to do this? We need to determine:
OR1,EOR1 addr:bit
MOV1 addr:bit (may not be the same as the former)
Absolute indexed
Branch
Branch on bit
CBNE, DBNZ
DBNZ Y-- (quite different from the others; though we should really test all four separately)
BRK
CALL absolute
PCALL
TCALL
CMC
DAA,DAS
MOVW
CMPW
CMP immediate to direct page
Direct
Direct indexed
DIV
XCN
CLR<flag>,SET<flag>
Implied instructions (NOP, CMC, TAX [mov x,a], etc)
Indexed indirect
Indirect indexed
(x) (should do reads and writes separate)
(x)+ (should do reads and writes separate)
(x),(y) (should do CMP (x),(y) separate but could go off other cases from before)
JMP indirect,x
MUL
PUSH
POP
RTI
RTS
TSB,TRB
STOP (untestable)
SLEEP (untestable)
The current test from Revenant is no good as there's more than one instruction with idle cycles, and right now, we need to start from nothing if we're gonna do this right. The first test should only test one instruction and be unambiguous.
And as you said, certain ones need to be tested before we can test others.
> Also, look how bsnes passed all of blargg's SMP $F0 test ROMs for ten years despite having a completely incorrect conception of what the register is actually doing.
Well that's emulation in a nutshell. I could be working on higan until I keel over at the age of 97 (as if, have you seen my diet?), and there would still surely be plenty of things completely wrong in the SNES core.
Let's not lose sight of the fact we're exhausting weeks of effort here to emulate a TEST register that not one single game, licensed or not, ever actually uses. The only thing that uses TEST are, unsurprisingly, test ROMs.
I'm not saying it's not worth the effort. But let's not pretend that this is a serious flaw in emulation either.
> Test ROMs are all well and good, but doing an opaque series of operations, CRC32ing the results (adding a further layer of opaqueness)
I've ranted plenty about CRC32s. blargg's DMG APU ones in particular that print ten pages of numbers, then a PASS/FAIL CRC32. The source code doesn't have a single comment on what the test is doing, what it's showing, or how to pass it. I get the impression blargg himself didn't actually know.
I actually offered a $50 bounty if anyone could help pass his tests in my DMG core, and had no takers. I gave up after a week of trying, but I suspect the true answer is finer grained cycle timing for latching registers and such.
> it just encourages emulator authors to game the tests.
You can find dozens of rants by me about emudevs being encouraged to pass tests blindly at the expense of working on actually important things; and this situation being hurt by sites like tasvideos ranking emulators based on a raw percentage of how many test ROMs they pass. (And I say this as someone who gets a 100% on the SNES tests list, so you know it's not just me being bitter about scoring badly.)
Nevermind that one test is "basic ADD/SUB flags are correct" and the other is "correct number of dummy read cycles versus internal I/O cycles for ADD when using SMP TEST register." If they had two tests, both would be worth 50% of your total score. That's insane.
However ... I still want these test ROMs for emudev use. I don't want to keep a big text document to familiarize myself with each test every time I run it, and what I should generally be expecting, and I want an easy regression tester. I may clean up the SMP core again in another five years, and I want a set of ROMs I can run to hopefully catch any mistakes I've made.
byuu wrote:
Is Revenant going to be up for writing all of those tests?
Hell no
Instead of compiling a different test ROM for every instruction, I wonder if it'd make more sense to just write some code to allow selecting one instruction at runtime, writing it into SMP RAM a good number of times (which would be better for this purpose than using a loop, if I understand the branch instructions' timing correctly) and then executing that and timing the results.
Although with enough loops we can compute the answer either way -- you could use JMP instead of BRA.
A simple framework will work for most tests, but will not work for the stack manipulating ones. PUSH and POP may be okay if we let them overflow and wrap around the stack repeatedly.
https://github.com/awjackson/bsnes-clas ... 02d56f78fb
Note that none of the changes based on Overload's findings are in bsnes-classic yet--with this commit all "idle" cycles, including [pc+1] dummy operands, are treated as IO/ROM cycles (though we already know that's not quite correct, because two of the idle cycles in that mul/inc/bne loop are definitely RAM cycles).
If I run the blargg timer speed tests in this branch, some of the numbers change by 1 one way or the other (but never more than 1) but all the tests still show "passed". If I change the wait_states[] or timer_ticks[] lookup tables at all, the numbers change much more, and sometimes the tests even show "failed".
Revenant wrote:
Instead of compiling a different test ROM for every instruction, I wonder if it'd make more sense to just write some code to allow selecting one instruction at runtime, writing it into SMP RAM a good number of times (which would be better for this purpose than using a loop, if I understand the branch instructions' timing correctly) and then executing that and timing the results.
In order to get big enough numbers to minimize rounding error, we need to run each instruction a couple hundred times. There's no problem using loops if we verify the behaviour of the branch instructions first.
The call instructions are no harder to verify than any other instruction, they just need a bit of setup ahead of time (i.e. plunking suitable vectors in high RAM)
Still, the idea of an interactive test sounds good. Also, that way we can start with educated-guess emulation (e.g. my hypothesis that [pc+1] dummy operands are real reads and every other "idle" is an IO/ROM cycle) and zero in on the instructions that appear to diverge from that after emu-vs-hardware testing.
> https://github.com/awjackson/bsnes-clas ... 02d56f78fb
So according to your code ...
If the current cycle wait state is 0, you get 3 ticks of the timer stage 0.
If it's 1, 6 ticks.
If it's 2, 12 ticks.
If it's 3, 24 ticks.
So the real ratio is like:
24 clocks to 3 ticks.
48 clocks to 6 ticks.
120 clocks to 12 ticks.
240 clocks to 24 ticks.
So it changes from a 1/8th ratio to a 1/10th ratio on the upper two. That's peculiar, but if it works, it works.
I think speed values of 2 and 3 are supposed to be clock dividers of 4 and 8, but because of some interaction with the S-DSP (which actually generates the S-SMP's clock signal) they end up actually taking at least 5 or 10 cycles respectively, sometimes much longer (see Revenant's bizarre result with the mul ya test), and sometimes wedging the clock generator permanently.
The timer_ticks[] values being multiples of 3 is a relic of the old timer_step formula. You can divide them all by 3, also divide the per-timer template arguments by 3 (so 64/64/8 instead of 192/192/24), and everything works out exactly the same:
https://github.com/awjackson/bsnes-clas ... 44d23c6dec
(tested with blargg's tests, Revenant's tests, and Tales of Phantasia's intro)
> I think speed values of 2 and 3 are supposed to be clock dividers of 4 and 8, but because of some interaction with the S-DSP (which actually generates the S-SMP's clock signal) they end up actually taking at least 5 or 10 cycles respectively, sometimes much longer (see Revenant's bizarre result with the mul ya test), and sometimes wedging the clock generator permanently.
Interesting. And yeah, I don't really see us emulating the lock-ups. That's getting too pedantic even for me. Would rather put that effort into the CPU<>DMA crash on R1 CPUs that some homebrew actually hits by accident.
> The timer_ticks[] values being multiples of 3 is a relic of the old timer_step formula. You can divide them all by 3, also divide the per-timer template arguments by 3 (so 64/64/8 instead of 192/192/24), and everything works out exactly the same:
True, that's a nice simplification. And not to nitpick, but at this point I'd suggest dropping timer_ticks and just use:
Code:
unsigned ticks = 1 << speed;
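Putting the two pieces side by side makes the 1/8th versus 1/10th oddity easy to see. This sketch assumes the figure of 24 clocks per unit wait state from the ratios listed above and the divided-by-3 tick values; `cycleCost` and `CycleCost` are made-up names, not code from either emulator.

```cpp
#include <cassert>

// Cost of one SMP cycle at a given speed setting (0..3).
struct CycleCost {
  unsigned clocks;  // master clocks consumed by the cycle
  unsigned ticks;   // timer stage 0 ticks credited for the cycle
};

CycleCost cycleCost(unsigned speed) {
  static const unsigned waitStates[4] = {1, 2, 5, 10};
  // 24 clocks per unit wait state is taken from the ratios quoted above.
  return {24 * waitStates[speed], 1u << speed};
}
```

Settings 0 and 1 work out to 24 clocks per tick, while settings 2 and 3 work out to 30, which is the same 4-for-5 and 8-for-10 pattern discussed earlier.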
Just for the sake of satisfying my own curiosity, here is a capture of smptesttest.sfc on BMF54123's SNS-101, in which CA/DA/EA/FA are actually usable. Meanwhile, his SNS-CPU-GPM-02 does the same thing as my SHVC-CPU-01, so I think my hunch about it being an issue with pre-1CHIP units might have been correct.
If that's really the case, then doing something akin to smpidletest (where CA etc. makes the SMP run slowly but still eventually recover) could be a way for software to tell 1CHIP/mini consoles apart from previous revisions, if one ever wanted/needed to do that for some reason.
I've just received a PM from Overload. He's done additional testing with a logic analyzer, updated his document, and confirmed a number of my intuitions:
The second cycle of one-byte instructions (and one two-byte instruction) is indeed a kind of dummy operand fetch which uses the external clock divider (TEST bits 4-5) if executing from RAM. Other internal operation cycles always use the internal clock divider and don't trigger read side effects from internal SMP registers, regardless of the address they put on the external bus. The oddball is dbnz y,rr, which has both a dummy operand fetch on cycle 2 and a real operand fetch on cycle 4. It's probably because that instruction shares microcode with the instructions that have a direct-page operand and a relative operand.
blargg was right about mov a,(x)+: the third cycle is the read and the fourth cycle is an internal operation. Whereas for mov (x)+,a the third cycle is an internal operation and the fourth cycle is the write. If I were to guess why this addressing mode works differently from all the other register/memory addressing modes (e.g. (x)), it's probably because the other modes share microcode with adc et al, but the (x)+ mode only exists for mov so it's microcoded specially.
TEST bits 4-7 are clock dividers of 2/4/8/16 applied to the clock coming from the S-DSP, which is already divided by 12 (so a final divider of 24/48/96/192). Dividers of 8 or 16 cause the S-DSP output clock to become "not stable" (seen from the software side by us as a 25% slowdown in the best case and a total loss of responsiveness in the worst case).
Pin 16 (CPUK on the schematic) is the 2.048 MHz clock input from the S-DSP. Pin 15 is R/'W (low on writes, high on reads and internal operations, same as a 6502). Pin 14 is clock output (roughly equivalent to phi2 on a 6502, but its duty cycle is 25% low/75% high rather than 50%/50%). It looks to me like the SPC700 runs on a 4-phase clock internally like a 6809, rather than 2-phase like a 6502--maybe that's how it's able to do RMW ops without an idle cycle between the read and the write.
For emulation purposes, the address on the bus during real internal operations (not dummy operand fetches) seems pretty much irrelevant.
AWJ wrote:
Dividers of 8 or 16 cause the S-DSP output clock to become "not stable" (seen from the software side by us as a 25% slowdown in the best case and a total loss of responsiveness in the worst case)
That only applies to the internal divider (bits 6-7), right?
An external divider setting of 2 or 3 seems less likely to lock up (at least with the mixtures of instructions and internal/external cycles we've been doing) but it still seems to cause a 25% slowdown on RAM cycles. Compare the results of all our tests with TEST=$FA to TEST=$0A. $FA makes all the tests take exactly 10 times as long as seen from the S-CPU.
Good news and bad news.
The good news is that I've implemented all of Overload's new findings on IO cycles, plus I have the (x)+ case right for both reads and writes now.
Further, I've implemented the SMP as running at DSP/12. For the CPU cycles, I consume {2,4,10,20} cycles to simulate the glitchy behavior where 8,16 are not evenly divisible by 12. Yet I still run the timers by {2,4,8,16}. As a result of this, I've reduced the timer stage 0 counters to {128, 128, 16}.
Note that I could run the SMP at DSP/24 and use {1,2,5,10}, {1,2,4,8}, and {64,64,8}, but I figured I'd be more self-documenting and put a lot of notes about this behavior and its glitchiness into the smp/timing.cpp file.
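The claim that the two parameterizations behave identically can be checked mechanically. The sketch below only compares the constants quoted above; the `Scheme` struct and `equivalent` function are hypothetical and are not taken from smp/timing.cpp.

```cpp
#include <cassert>

// One way of expressing SMP timing, in units of (DSP clock / clockDivider).
struct Scheme {
  unsigned clockDivider;  // 12 or 24
  unsigned cycle[4];      // clock units consumed per CPU cycle, per speed setting
  unsigned timer[4];      // timer ticks credited per CPU cycle, per speed setting
  unsigned stage0[3];     // stage 0 counter periods for timers 0, 1, 2
};

const Scheme dsp12 = {12, {2, 4, 10, 20}, {2, 4, 8, 16}, {128, 128, 16}};
const Scheme dsp24 = {24, {1, 2, 5, 10},  {1, 2, 4, 8},  {64, 64, 8}};

bool equivalent(const Scheme& a, const Scheme& b) {
  for(unsigned s = 0; s < 4; s++) {
    // Each CPU cycle must consume the same number of DSP clocks.
    if(a.cycle[s] * a.clockDivider != b.cycle[s] * b.clockDivider) return false;
    for(unsigned t = 0; t < 3; t++) {
      // Each CPU cycle must advance the same fraction of a stage 0 period
      // (compared by cross-multiplication to stay in integers).
      if(a.timer[s] * b.stage0[t] != b.timer[s] * a.stage0[t]) return false;
    }
  }
  return true;
}
```

`equivalent(dsp12, dsp24)` holds for the values given, so the choice between them is purely about readability.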
We now closely match every test by Revenant, and still pass test_speed by blargg.
The bad news is that we fail test_timer_speed now on 1A (and most certainly on the others as well.) Since blargg doesn't print failed values, I traced the ROM and determined higan is getting 1561 for a timer value of 1A, whereas it wants ~1639 to pass. My suspicion is we have converted some cycles that really do read from RAM into idle cycles erroneously.
It's possible that I made a mistake somewhere, but I was super cautious this time, and since all of Revenant's stuff passes ... I think we may still have more stuff to discover here.
All the same, I'll link to the Git repo for the new code once it's been pushed.
byuu wrote:
Code:
auto SMP::wait(maybe<uint16> addr) -> void {
static const uint cycleWaitStates[4] = {2, 4, 10, 20};
static const uint timerWaitStates[4] = {2, 4, 8, 16};
uint waitStates = io.externalWaitStates;
if(!addr) waitStates = io.internalWaitStates;
//snip rest
Excessive C++ cleverness has bitten you in the back. This code is failing to distinguish between an argument of 0 and no argument, and turning accesses to address 0 (which is RAM) into internal accesses. I haven't bothered to disassemble the timer tests (since they worked for me on the first try) but I can tell from the debugger that they do use address 0.
Also, you've changed the order things happen in read() and write(). Before you were doing the read/write and then advancing the timers, now you're advancing the timers and then doing the read/write. I don't think this is the cause of the failure or even that it's necessarily wrong, just pointing it out because you do have to pay close attention to these things (for the S-CPU in particular, it makes a big difference to many edge cases exactly what order things are done in CPU::read() and CPU::write()).
Aside, I don't think the glitchiness with dividers of 8 or 16 has anything to do with "being divisible by 12". Dividing by n and then dividing by m is arithmetically equivalent to dividing by (m * n). Whether m is divisible by n or n is divisible by m is irrelevant. I think the S-DSP just isn't happy when the S-SMP's clock output is too slow. Remember that the S-DSP outputs a clock to the S-SMP and the S-SMP divides that clock and outputs it back to the S-DSP--it's a mutual interaction.
ETA:
Changing the subject, I just noticed that in higan you're initializing the S-DSP ENDX to random(0), which means that Magical Drop will never work if randomization is disabled. Surely it should be random(0xff) instead (that's what I've done in bsnes-classic).
Here's my hypothesis for what is going on with the S-DSP initial state on real hardware. The initial state of each voice is completely random, which means that each voice is playing (from a random sample address and with random parameters) when the chip is powered on. If software doesn't touch any of the registers for a voice, eventually it will finish playing (it'll read a BRR header byte that has the END bit set and the LOOP bit clear) and set its corresponding bit in ENDX. There are two cases where this can fail to happen: if the random chunk of RAM that a voice is playing from happens to parse as a looping sample (and never gets overwritten by software to something that doesn't parse as a looping sample), or if the voice has a frequency of 0. Thus, by the time the IPL ROM passes control to an uploaded program, ENDX is usually 0xFF but occasionally one or two bits are clear, and those bits may or may not eventually get set depending on random chance and RAM contents.
This would explain why Magical Drop occasionally fails on certain real consoles, but certainly doesn't fail 25% of the time.
Cydrak disassembled the test.
Code:
8f4af0 mov $f0, #$4a ; set timings (modified per test)
8f81f1 mov $f1, #$81 ; enable IPL, timer 0
8f00fa mov $fa, #$00 ; set timer 0
8f0100 mov $00, #$01
8f0001 mov $01, #$00 ; $0000 = 1
e4fd lda $fd ; reset ticks
e800 lda #$00
8d00 ldy #$00
f8fdf0fc -; ldx $fd; beq - ; wait on timer tick
7a00f8fdf0fa -; adw $00; ldx $fd; beq - ; count loops to next tick
8f0af0 mov $f0, #$0a ; restore default timings
daf6 stw $f6 ; post loop results and sync S-CPU
8f55f4 mov $f4, #$55
e8dd64f4d0fc lda #$dd; -; cmp $f4; bne -
5fc0ff jmp $ffc0 ; return to IPL
Here are the expected ranges.
Code:
; TEST loops
; $0a $0a8f <= X < $0ac6
; $1a $0656 <= X < $0677
; $2a $0384 <= X < $0397
; $3a $01dd <= X < $01e6
; $4a $07eb <= X < $0814
; $5a $0548 <= X < $0563
; $6a $032a <= X < $033b
; $7a $01c2 <= X < $01cb
; $ca $032b <= X < $033c
; $da $02a4 <= X < $02b1
; $ea $01fa <= X < $0205
; $fa $0151 <= X < $0158
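For anyone wanting to check a reported count against that table automatically, here is a rough sketch. The struct and function names are hypothetical; only the numbers come from the table above, and each range is treated as half-open (low <= X < high) as written.

```cpp
#include <cstdint>

// One row of the expected-range table: the TEST register value and the
// half-open interval [low, high) of acceptable loop counts.
struct ExpectedRange {
  std::uint8_t test;
  std::uint16_t low;
  std::uint16_t high;
};

// Values copied from the table above.
constexpr ExpectedRange expectedRanges[] = {
  {0x0a, 0x0a8f, 0x0ac6}, {0x1a, 0x0656, 0x0677},
  {0x2a, 0x0384, 0x0397}, {0x3a, 0x01dd, 0x01e6},
  {0x4a, 0x07eb, 0x0814}, {0x5a, 0x0548, 0x0563},
  {0x6a, 0x032a, 0x033b}, {0x7a, 0x01c2, 0x01cb},
  {0xca, 0x032b, 0x033c}, {0xda, 0x02a4, 0x02b1},
  {0xea, 0x01fa, 0x0205}, {0xfa, 0x0151, 0x0158},
};

// Hypothetical helper: true if `loops` falls inside the expected range for
// the given TEST value. TEST values missing from the table return false.
constexpr bool loopsInRange(std::uint8_t test, std::uint16_t loops) {
  for(const auto& range : expectedRanges) {
    if(range.test == test) return loops >= range.low && loops < range.high;
  }
  return false;
}
```

TEST values with no row in the table (such as $8a) simply report false here, which is a choice made for this sketch rather than anything the test ROM does.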
The problem turned out to be that I missed the (8) footnote on 6d. I probably missed the (9) footnote on 11 as well. It's a little tricky reading this PDF. Here is a crude fix:
Code:
auto SPC700::instructionDirectReadWord(fpw op) -> void {
  uint8 address = fetch();
  uint16 data = load(address + 0);
  if(op == &SPC700::algorithmLDW) load(address + 0);
  else idle();
  data |= load(address + 1) << 8;
  YA = alu(YA, data);
}
(the other MOVW is in DirectWriteWord.)
But anyway, all the tests pass now, hooray!
> Excessive C++ cleverness has bitten you in the back. This code is failing to distinguish between an argument of 0 and no argument, and turning accesses to address 0 (which is RAM) into internal accesses.
Unfortunately, that's not correct.
if(!addr) is testing the explicit operator bool() const of maybe<uint16>, which returns true if the maybe has a value in it, false if it's nothing. It won't see an address of zero until executing *addr to get the underlying value.
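A minimal sketch of that distinction, using C++17's std::optional<uint16_t> as a stand-in for nall's maybe<uint16> (that substitution is an assumption, and both helper names are hypothetical):

```cpp
#include <cstdint>
#include <optional>

// Stand-in for maybe<uint16>: an empty optional means "no address",
// i.e. an idle cycle.
using Address = std::optional<std::uint16_t>;

// Hypothetical helper: true only when no address was supplied.
// !addr tests whether a value is present; it never looks at the value.
bool isIdle(const Address& addr) {
  return !addr;
}

// Hypothetical helper: true when an address is present and equal to zero.
// Reading the stored value takes an explicit dereference.
bool isAddressZero(const Address& addr) {
  return addr && *addr == 0;
}
```

With these, isIdle(std::nullopt) is true while isIdle(Address(std::uint16_t(0))) is false, so an access to address 0 is not mistaken for an idle cycle.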
> Also, you've changed the order things happen in read() and write(). Before you were doing the read/write and then advancing the timers, now you're advancing the timers and then doing the read/write.
Yeah, we had $2137/$4201 to confirm that difference on the CPU side. It probably exists on the SMP side too, but this new emulation code makes this very difficult. And I'm not even sure when the reads happen when the divider is not set to 0 (or effectively 2 cycles.)
> Remember that the S-DSP outputs a clock to the S-SMP and the S-SMP divides that clock and outputs it back to the S-DSP--it's a mutual interaction.
Ah well. It's not like we're gonna be emulating the chance of crashing with this register anyway :/
> Changing the subject, I just noticed that in higan you're initializing the S-DSP ENDX to random(0), which means that Magical Drop will never work if randomization is disabled. Surely it should be random(0xff) instead (that's what I've done in bsnes-classic).
There's no option to disable randomization currently. I'll keep that in mind though.
It seems you know about the oddities with that title's game over screen. We can make a separate topic to work through that if you'd like. I'm very interested in what's going on there. But again, we'll need to confirm things before I'll make changes, and this one's probably not gonna have an "easy mode" like the SMP courtesy of Overload, heheh.
byuu wrote:
Unfortunately, that's not correct.
if(!addr) is testing the explicit operator bool() const of maybe<uint16>, which returns true if the maybe has a value in it, false if it's nothing. It won't see an address of zero until executing *addr to get the underlying value.
Are you absolutely sure about that? If I apply the following change in bsnes-classic so that address 0 is treated as internal:
Code:
diff --git a/bsnes/snes/smp/memory/memory.cpp b/bsnes/snes/smp/memory/memory.cpp
index b577ca6..8fdd4cd 100644
--- a/bsnes/snes/smp/memory/memory.cpp
+++ b/bsnes/snes/smp/memory/memory.cpp
@@ -175,6 +175,7 @@ alwaysinline void SMP::op_buswrite(uint16 addr, uint8 data) {
}
unsigned SMP::speed(uint16 addr) const {
+ if(addr == 0) return status.clock_speed;
if((addr & 0xfff0) == 0x00f0) return status.clock_speed;
if(addr >= 0xffc0 && status.iplrom_enabled) return status.clock_speed;
return status.ram_speed;
then blargg's timer tests fail exactly the same way as they do for you.
> Are you absolutely sure about that?
I am absolutely sure, yes. You read the post before I had a chance to update it with the actual bug.
There was an "addw $00" instruction, and I was performing a read instead of an idle for cycle 4. $1A meant the idle would've been slower than the read, hence the test was completing too quickly.
The test completes properly now, as do all other tests thus far. You're right that I'm being a bit too clever with maybe<uint16>, but well ... you know me >_>
I didn't want to have to type out "wait(0, true)" for idle(), and "wait(addr, false)" for read(), write().
Anyway ... I think we finally did it! A million thanks to you, Revenant, Cydrak, and Overload! We finally have the TEST register fully emulated in bsnes/higan, after thirteen long years of mystery! :D
Tangent: I thought about passing the address bus value along with idle() just so whatever inherits from Processor::SPC700 could spy on the value if there were some need for it ... but honestly it's just gonna be a source of confusion. Nothing else that we emulate is ever gonna use an SPC700, so it's just busy work for a side effect that is irrelevant to emulation. Much like emulating a CIC, for example. Overload's PDF has that information, so that's good enough for technical documentation on it.
See, even I'm not right all the time
!0 being false is PHP-tier insanity and makes all my programming senses scream in anguish. If it makes sense to you then go ahead and keep on doing it that way, but don't be surprised when everyone except your most devoted followers prefers to fork older versions of your code that don't contain quite as much black magic.
> !0 being false is PHP-tier insanity and makes all my programming senses scream in anguish.
C++14 added optional<T> as well. nall/maybe is equivalent in functionality to that, only it has a few other small features in addition.
> don't be surprised when everyone except your most devoted followers prefers to fork older versions of your code that don't contain quite as much black magic.
If they want to use your fork from 2010 that's full of all of my various coding styles from 2004 to 2010 (hooray, ppu-balanced!), plus your own coding style changes, they're more than welcome to.
You may even end up with more users than people running higan official. That's fine, it's not a popularity contest to me. Hopefully not to you, either. One of these days, someone is going to come along and write a fast+accurate SNES emulator like gambatte, mGBA, BlastEm, etc. And overnight, bsnes/higan and all 18+ forks of it are going to be mostly abandoned by users, frontend emulators (RA, Openemu, Bizhawk), etc.
Basically, you get to choose between clean code and running fast. Clean code is also way easier to work with and improve, which is good for idiots like me (and I do mean that, I'm not very smart, never claimed to be.) The tricky part of clean code is that reasonable people disagree on what is cleaner. You have people on one extreme like endrift writing all code in C. You have people on the other extreme like Nemesis creating separate C++ classes for every individual CPU instruction. Then you have crazy people like me bringing back cooperative threading like it's 1989.
Anyway, as long as you're not outright disparaging, I'm willing to continue talking with you to better both of our emulator versions.
You may've been wrong on maybe this time, but you also may've been right. I appreciate you were trying to help fix a bug in my codebase either way.
In general, I'd just really like more civility in emudev. I know I was awful at it in the past, but I really am trying to be nice going forward. The MAME dev comments finally slowed down, and now the Bizhawk team has just been savagely condescending lately with my stuff, and it's really wearing me down.
byuu wrote:
C++14 added optional<T> as well. nall/maybe is equivalent in functionality to that, only it has a few other small features in addition.
optional<T> is only standard in C++17; it's experimental in C++14. And even if it's standard, using it with integer types, or with any type which has an implicit bool conversion of its own (e.g. a container which is false when empty) strikes me as a dangerous thing to do.
AWJ wrote:
!0 being false is PHP-tier insanity
In my experience, PHP is no more or less insane than JavaScript. Several stylistic recommendations by Douglas Crockford in JavaScript: The Good Parts apply equally to PHP.
But what are the semantics of java.lang.Integer in boolean context?
byuu wrote:
Then you have crazy people like me bringing back cooperative threading like it's 1989.
Or like the JavaScript programmers who have rediscovered asynchronous concurrency.
Okay, I fixed idle cycle 4 for ADDW, SUBW, and MOVW (read), plus idle cycle 4 for BBC, BBS, and CBNE.
All of blargg's and Revenant's tests are passing nicely now. Thanks again, everyone!
...
By the way, I should note that AWJ referred to the SPC700 assembler syntax as "opcode target,source" ... however, Overload's PDF uses "opcode source,target"
For instance, see 14b. Indirect (x+):
Cycle 3 is marked as condition (5) => internal operation for MOV A,(X+)
Cycle 4 is marked as condition (6) => internal operation for MOV (X+),A
The read version that loads (X+) and puts the value into A has cycle 4 as the idle cycle.
The write version that writes A into (X+) has cycle 3 as the idle cycle.
Also see 6d. Direct dp:
Cycle 4 is marked as condition (8) => internal operation for ADDW, SUBW, and MOVW YA,dp
Which would mean that this condition doesn't apply for MOVW dp,YA
But it's clear that the three condition (8) instructions are the ones that read from direct page, not write to it.
Not too important, it's just ... one more reason why I hate the official SPC700 assembler syntax.
I have a lot of ideas about how to make that PDF more readable, but I don't have the time or interest in trying to improve it myself.
byuu wrote:
By the way, I should note that AWJ referred to the SPC700 assembler syntax as "opcode target,source" ... however, Overload's PDF uses "opcode source,target"
For instance, see 14b. Indirect (x+):
Cycle 3 is marked as condition (5) => internal operation for MOV A,(X+)
Cycle 4 is marked as condition (6) => internal operation for MOV (X+),A
The read version that loads (X+) and puts the value into A has cycle 4 as the idle cycle.
The write version that writes A into (X+) has cycle 3 as the idle cycle.
You're right, Overload has these two backwards.
Quote:
Also see 6d. Direct dp:
Cycle 4 is marked as condition (8) => internal operation for ADDW, SUBW, and MOVW YA,dp
Which would mean that this condition doesn't apply for MOVW dp,YA
But it's clear that the three condition (8) instructions are the ones that read from direct page, not write to it.
This one is consistent with the official syntax.
ADDW YA,dp, SUBW YA,dp, and MOVW YA,dp are all read instructions and all have an internal operation on cycle 4.
I was bored, so make of this what you will ...
Basically ... the cycle wait states of {2, 4, 10, 20} become {2, 4, ???, 248}, but only for internal wait states. External wait states are still {2, 4, 10, 20}, and timer wait states are still {2, 4, 8, 16}.
That's a hell of a jump from 20 to 248 ... I can't make any sense of how this would change between models so drastically. The missing internal[2] value could be 10, or it could be 124, or something else entirely. Obviously we can't test it if every real system locks up when we try.
I have no idea if I should even bother emulating this behavior with a switch in the code for documentation purposes.
Super unoptimized code, modified to match Revenant's values:
Code:
auto SMP::wait(maybe<uint16> addr) -> void {
  static const uint cycleWaitStatesInternal[4] = {2, 4, 10, 248};  //10 is unverified
  static const uint cycleWaitStatesExternal[4] = {2, 4, 10, 20};
  static const uint timerWaitStates        [4] = {2, 4, 8, 16};

  bool internal = false;
  if(!addr) internal = true;  //idle cycles
  else if((*addr & 0xfff0) == 0x00f0) internal = true;  //IO registers
  else if(*addr >= 0xffc0 && io.iplromEnable) internal = true;  //IPLROM

  step(internal ? cycleWaitStatesInternal[io.internalWaitStates] : cycleWaitStatesExternal[io.externalWaitStates]);
  stepTimers(internal ? timerWaitStates[io.internalWaitStates] : timerWaitStates[io.externalWaitStates]);
}
How do smptesttest and test_timer_speed* behave when using 248 wait states?
I double-checked test_timer_speed on my SNES and it hangs as early as 2A, rather than CA, for whatever reason. Not sure if the same issue is at play here.
I can already guarantee blargg's test_speed ROMs will fail. He wasn't aware of these alternate timing numbers.
I'll try smptesttest tomorrow.
I know blargg's test would fail, but I'm curious if using all those extra wait states would manage to somehow reproduce the issue on some consoles where using certain values with those tests makes the SMP appear to become totally unresponsive.