I have been running Blargg's test roms through Nintendulator (much thanks to Blargg and Q for these!).
Specifically, I am using this test from the instr_test suite: 01-implied.nes
http://blargg.8bitalley.com/parodius/ne ... est-v3.zip
Fairly early on, this sequence occurs:
Code:
E881 C8 INY A:00 X:FF Y:FF P:E5 SP:FB CYC:198 SL:244
E882 D0 FB BNE $E87F A:00 X:FF Y:00 P:67 SP:FB CYC:204 SL:244
E884 E6 0F INC $0F = 01 A:00 X:FF Y:00 P:67 SP:FB CYC:210 SL:244
Remember that the CYC values here are PPU cycles. The trace shows opcode D0 (BNE) taking 6 PPU cycles, which is equivalent to 2 CPU clock cycles. The best information I can find says that this instruction takes between 3 and 5 CPU cycles to complete:
3 if branch is not taken
4 if the branch is taken
5 if the branch is taken, and a page boundary crossed
Is Nintendulator's timing off here, or am I misunderstanding this situation? Thanks for any advice.
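For anyone reading traces like the one above, here is a rough sketch of turning the CYC column into CPU cycles. It assumes NTSC timing (3 PPU cycles per CPU cycle, 341 PPU cycles per scanline) and that CYC is the counter value before the instruction on that line runs. The function name is made up for illustration and is not part of Nintendulator.

```c
#include <stdint.h>

/* Assumption: NTSC timing, 3 PPU cycles per CPU cycle, 341 PPU cycles per scanline. */
enum { PPU_PER_CPU = 3, PPU_PER_SCANLINE = 341 };

/* CPU cycles elapsed between two consecutive trace lines, given their
   CYC columns. If the later value is smaller, assume the counter wrapped
   once at the end of the scanline. (Hypothetical helper.) */
int cpu_cycles_between(int cyc_before, int cyc_after)
{
    int ppu = cyc_after - cyc_before;
    if (ppu < 0)
        ppu += PPU_PER_SCANLINE;
    return ppu / PPU_PER_CPU;
}
```

With the trace above, cpu_cycles_between(204, 210) gives 2 for the BNE, matching the 6 PPU cycles seen in the log.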
2 CPU cycles if branch isn't taken, 3 if it is, 4 if it crosses a page.
http://www.obelisk.demon.co.uk/6502/reference.html is a good basic reference for instruction timing. You might be misled if you read something like
http://nesdev.com/6502_cpu.txt (notice the "Notes:" section though), because the fetch of the next opcode is included among the steps and won't need to be done again.
For accurate emulation, you could just follow the steps in the latter document of course, but you will still need to keep in mind that trickiness for the branch instructions.
Thanks for the help everyone, I was just using inaccurate info.
It isn't really inaccurate, it's more that the last cycle of an opcode overlaps the first cycle of the next opcode as it can start decoding ahead of time. For example, EOR imm takes up 3 cycles in theory (one to fetch opcode, one to fetch data, one to execute instruction), but since in the last cycle it doesn't need to touch memory it fetches the opcode for the next instruction, effectively making it last 2 cycles.
EDIT: not sure if that's the exact opcode, trying to remember what I saw once regarding C64 timings (which also uses the 6502).
Thanks again for the help everyone, could someone explain this one to me as well?
This trace from Nintendulator shows a BEQ instruction taking 9 PPU clocks (3 CPU clocks). I don't see how this is the case.
To me it seems like this should take 4 CPU clocks. It should use the two standard cycles, plus the optional cycle since the branch is taken, plus another cycle since it crosses a page boundary.
Code:
CFFC C9 5A CMP #$5A A:5A X:81 Y:69 P:25 SP:FB CYC:286 SL:1
CFFE F0 05 BEQ $D005 A:5A X:81 Y:69 P:27 SP:FB CYC:292 SL:1
D005 A9 AA LDA #$AA A:5A X:81 Y:69 P:27 SP:FB CYC:301 SL:1
My best guess is that I am not understanding the page boundary properly - to me this looks like a crossing.
PC=CFFE
fetch opcode then increment PC
PC=CFFF
fetch offset then increment PC
PC=D000
add offset to PC, no carry from low byte so no extra cycle
Thank you! I couldn't sleep last night thinking about this issue - embarrassed that the answer is so obvious. Really appreciate the explanation of that one.
The answer was never obvious to me at least, and it wasn't helped by the ambiguous "when a branch crosses a page, an extra cycle is taken" that most describe. It should be something like, "if the branch is to an instruction that begins on a different page than the instruction just after the branch begins on, an extra cycle is taken".
I think the rule of thumb is that if the high byte of the address bus changes you have to add an extra cycle (low and high bytes are updated on separate cycles, which can lead to some interesting situations with hardware if you aren't careful...).
"If the high byte changes" is ambiguous. A taken branch whose opcode is at $80FF involves a change in the high byte of the PC. So does one whose opcode is at $80FE and branch offset at $80FF. For example, it might not be clear to everyone that the branch offset is fetched, the PC incremented like normal, before the addition takes place. It might seem like the addition occurs with the PC at the branch offset.
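To make the rule concrete, here is a minimal sketch of branch timing as described in this thread: 2 cycles, plus 1 if taken, plus 1 more if the target is on a different page than the instruction following the branch. The function name and signature are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Cycles used by a relative branch whose opcode byte sits at opcode_addr.
   (Hypothetical helper, not taken from any emulator.) */
int branch_cycles(uint16_t opcode_addr, uint8_t offset, bool taken)
{
    /* The opcode fetch and the offset fetch each increment PC, so the
       addition starts from the address of the next instruction. */
    uint16_t next_pc = (uint16_t)(opcode_addr + 2);
    int cycles = 2;

    if (taken) {
        /* The offset is a signed 8-bit value. */
        uint16_t target = (uint16_t)(next_pc + (int8_t)offset);
        cycles += 1;
        /* Extra cycle only if the high byte had to be fixed up. */
        if ((target & 0xFF00) != (next_pc & 0xFF00))
            cycles += 1;
    }
    return cycles;
}
```

For the BEQ at $CFFE with offset $05, next_pc is $D000 and the target is $D005, so the result is 3. A taken branch at $80FE with offset $05 also gives 3, since the addition starts from $8100.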
Sik wrote:
I think the rule of thumb is that if the high byte of the address bus changes you have to add an extra cycle (low and high bytes are updated on separate cycles, which can lead to some interesting situations with hardware if you aren't careful...).
Better rule of thumb: if you add an 8-bit value to a 16-bit value and there is a carry out of bit 7, you add an extra cycle to fix the high 8 bits*
Example:
Code:
word.lo_byte += byte;
if ( word.lo_byte < byte ) // carry out of bit 7
{
    cycle();               // extra cycle to fix up the high byte
    word.hi_byte += 1;
}
*Aside from write-only instructions with Absolute X, Absolute Y, and Zero Page Indirect Y addressing. There the extra cycle is fixed and always taken.

Also PC increments while fetching opcode/operands don't observe this penalty.
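As a sketch of the footnote above, here is how the read and write cases differ for absolute,X addressing. The base counts are assumptions taken from the usual 6502 timing tables (LDA abs,X is 4 cycles plus 1 on a page cross, STA abs,X is always 5), and the function name is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Cycle count for an absolute,X access such as LDA abs,X (read)
   or STA abs,X (write). (Hypothetical helper.) */
int abs_indexed_cycles(uint16_t base, uint8_t index, bool is_write)
{
    uint8_t lo = (uint8_t)(base & 0xFF);
    uint8_t sum = (uint8_t)(lo + index);
    bool carry = sum < index;   /* carry out of bit 7 of the low byte */

    if (is_write)
        return 5;               /* fix-up cycle is always spent on writes */
    return carry ? 5 : 4;       /* reads pay it only when the carry occurs */
}
```

So a read from $10F0,X with X=$20 costs 5 cycles, a read from $1000,X costs 4, and a write costs 5 either way.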
blargg wrote:
"If the high byte changes" is ambiguous. A taken branch whose opcode is at $80FF involves a change in the high byte of the PC. So does one whose opcode is at $80FE and branch offset at $80FF. For example, it might not be clear to everyone that the branch offset is fetched, the PC incremented like normal, before the addition takes place. It might seem like the addition occurs with the PC at the branch offset.
Actually this reminds me, there's a massive issue when doing absolute branches (i.e. full address instead of offset): if the address (the operand) happens to cross page zero, it will not be read properly. It will read the low byte from the end of the page and the high byte from the beginning of the same page, instead of the beginning of the next page. This is because the CPU doesn't increment the address properly when reading.
I thought the JMP bug was just for indirect jumps like JMP ($xxFF), not for absolute jumps like $xxFE: JMP $9000.
tepples wrote:
I thought the JMP bug was just for indirect jumps like JMP ($xxFF), not for absolute jumps like $xxFE: JMP $9000.
Correct.
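For reference, a small sketch of the indirect jump behaviour being discussed: when the pointer's low byte is $FF, the high byte of the target is fetched from the start of the same page. Here mem is assumed to be a flat 64 KiB memory image, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Target address of JMP (ptr) on an NMOS 6502.
   mem is assumed to point at a flat 64 KiB memory image. (Hypothetical helper.) */
uint16_t jmp_indirect_target(const uint8_t *mem, uint16_t ptr)
{
    /* Only the low byte of the pointer is incremented; the high byte
       is never carried into, so $xxFF wraps to $xx00. */
    uint16_t hi_addr = (uint16_t)((ptr & 0xFF00) | ((ptr + 1) & 0x00FF));
    return (uint16_t)(mem[ptr] | (mem[hi_addr] << 8));
}
```

With $00 at $02FF, $90 at $0200 and $80 at $0300, JMP ($02FF) goes to $9000 rather than $8000. An absolute JMP like JMP $9000 just reads its two operand bytes through normal PC increments, so it is not affected.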