SNES Timing Questions

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
SNES Timing Questions
by on (#175512)
Hey all, I posted this on Reddit and they forwarded me here (after giving me some helpful information). Here's the question:

I've been working on a SNES emulator for fun, and have found some decent information, except for actual timing information. I'm probably just overthinking this, or maybe I just need more information, but here's my conundrum. There's this page with some timing information: http://wiki.superfamicom.org/snes/show/Timing

It says the SNES master clock runs at about 21.477 MHz, that internal operations take 6 cycles, that memory accesses take 6, 8 or 12 cycles depending on the region, and that there are 1364 cycles per scanline (most of the time). This is cool: I just need to figure out which instructions take which timings and I'll have an idea of how many instructions to process per frame.
Then, I get to the instruction timings: http://wiki.superfamicom.org/snes/show/65816+Reference

This references CPU cycles, and none are even more than 8 cycles (many are between 1 and 6), and CPU cycles are not the same as master clock cycles. I found that the CPU can run at 2.68MHz most of the time, but can also run at 3.58 MHz or 1.79 MHz.

I'm wondering, does anyone have any good information on this stuff? Most of what I find appears to be from the exact same source, and has these two different ways of talking about timing, which doesn't make sense to me. Can someone help me make sense of this, or point me to a source that can give me a good idea about these things? Thanks ahead of time!
Re: SNES Timing Questions
by on (#175515)
hatfarm wrote:
This references CPU cycles, and none are even more than 8 cycles (many are between 1 and 6), and CPU cycles are not the same as master clock cycles. I found that the CPU can run at 2.68MHz most of the time, but can also run at 3.58 MHz or 1.79 MHz.
Right. Every CPU instruction takes some number of CPU cycles; each CPU cycle in turn takes 6, 8, or 12 master clock cycles depending on which memory it's accessing.

"Internal" cycles on the CPU, and reads from or writes to "Fast" memory regions, specifically the upper 3/8th of memory when enabled (banks $80-$BF pages $80-$FF and banks $C0-$FF all pages) and most registers (banks $00-$3F and $80-$BF, pages $20-$3F and $42-$5F), take place in 6 master clock cycles.

Cycles reading or writing to "normal" memory regions (banks $00-$3F, pages $00-$1F and $60-$FF; banks $40-$7F all pages; and that same upper 3/8th of address space when fast memory is not enabled) take 8 master clock cycles.

Finally, cycles reading or writing to "slow" memory, which is only banks $00-$3F and $80-$BF, pages $40 and $41, take 12 master clock cycles.
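
The three region lists above can be written out as a plain lookup. This is only a readable sketch of those rules, not code from any emulator: the name accessSpeed and the fastRom parameter (standing in for the fast memory enable) are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: master clocks per CPU access for a 24-bit address.
// fastRom is an assumed flag meaning "fast memory is enabled".
unsigned accessSpeed(uint32_t addr, bool fastRom) {
  const unsigned bank = (addr >> 16) & 0xFF;
  const unsigned page = (addr >> 8) & 0xFF;
  const bool systemBank = (bank & 0x40) == 0;  // banks $00-$3F and $80-$BF
  const bool upperHalf = (bank & 0x80) != 0;   // banks $80-$FF

  if (systemBank && page < 0x80) {
    if (page <= 0x1F) return 8;   // pages $00-$1F: normal
    if (page <= 0x3F) return 6;   // pages $20-$3F: registers, fast
    if (page <= 0x41) return 12;  // pages $40-$41: slow
    if (page <= 0x5F) return 6;   // pages $42-$5F: registers, fast
    return 8;                     // pages $60-$7F: normal
  }
  // Pages $80-$FF of banks $00-$3F/$80-$BF, and all of banks $40-$7F/$C0-$FF.
  return (upperHalf && fastRom) ? 6 : 8;
}
```

For example, accessSpeed(0x808000, true) gives 6, while the same call with fastRom false gives 8.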
Re: SNES Timing Questions
by on (#175519)
Okay, so if I'm understanding this correctly then, the instruction CPU cycles are multiplied by the master clock cycles to get the total number of cycles an instruction takes? Is that right?

Thanks for the link, I hadn't found that site, I'll definitely be giving it a thorough read.
Re: SNES Timing Questions
by on (#175521)
No, it's far more involved than just multiplied.

For example, a NOP instruction takes two CPU cycles: one to fetch the byte that is the instruction, and one internal. That's either going to take 8+6=14 master clock cycles, or 6+6=12 master clock cycles, depending on where it's executing from.
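
One way to picture this is to treat an instruction as a list of CPU cycles and add up the cost of each one. The sketch below assumes a made-up Cycle struct and helper names; it only restates the NOP arithmetic from the post.

```cpp
#include <vector>

// Hypothetical model: each CPU cycle is either internal (always 6 master
// clocks) or a memory access whose cost depends on the region touched.
struct Cycle {
  bool internal;         // true: no bus access
  unsigned accessSpeed;  // 6, 8, or 12; ignored for internal cycles
};

// Sums the master clocks of every CPU cycle in one instruction.
unsigned masterClocks(const std::vector<Cycle>& cycles) {
  unsigned total = 0;
  for (const Cycle& c : cycles) total += c.internal ? 6 : c.accessSpeed;
  return total;
}

// NOP: one opcode fetch plus one internal cycle.
unsigned nopMasterClocks(unsigned fetchSpeed) {
  return masterClocks({{false, fetchSpeed}, {true, 0}});
}
```

With these definitions nopMasterClocks(8) is 14 and nopMasterClocks(6) is 12, matching the two cases above.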
Re: SNES Timing Questions
by on (#175528)
Also note that the CPU reference on superfamicom.org, while useful for programming, is grossly oversimplified. For one thing, it only gives the minimum number of CPU cycles for each instruction, ignoring cycle-adding cases like 16-bit mode, non-page-aligned DP, and so on. Also the explanations sometimes leave a lot to be desired. This and this have more detail. So does fullsnes, apparently, but it's a bit cryptic...

EDIT: I'm sorry; the superfamicom.org page does in fact mention the cycle-adding cases at the bottom. Just make sure you look that far...

Also look up DRAM refresh. The CPU actually only gets 1324 master cycles per scanline, because the memory controller stalls it for 40 master clocks right about the middle of each scanline to refresh the contents of WRAM. Exact timing on this is kinda squirrelly and varies between models.
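
As a rough illustration of what that means for an emulator's main loop, here is a sketch of a per-scanline budget. It assumes the stall can be treated as a flat 40-clock deduction (it ignores where in the scanline the refresh actually lands), and runScanline and stepInstruction are hypothetical names.

```cpp
// Constants taken from the figures quoted in this thread.
constexpr int kClocksPerScanline = 1364;  // most scanlines
constexpr int kRefreshStall = 40;         // DRAM refresh pause
constexpr int kCpuClocksPerScanline = kClocksPerScanline - kRefreshStall;

// Runs instructions until this scanline's budget is spent and returns the
// overshoot, which the caller passes back in as carriedOver next time.
// stepInstruction is an assumed callback returning master clocks consumed.
template <typename Step>
int runScanline(int carriedOver, Step stepInstruction) {
  int remaining = kCpuClocksPerScanline - carriedOver;
  while (remaining > 0) remaining -= stepInstruction();
  return -remaining;
}
```

An instruction rarely ends exactly on the scanline boundary, so the overshoot is carried forward instead of being dropped.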

Also, DMA has some funky timing associated with it, but the Timing page on superfamicom.org seems to have the appropriate information.
Re: SNES Timing Questions
by on (#175529)
93143 wrote:
non-page-aligned DP

What is "page-aligned"? I know I've heard you say that before, but I'm curious now. I thought using a 16-bit accumulator instead of 8 was the only thing that could add a cycle. Actually, I may be hallucinating, but does another cycle get added if you're using 16-bit x and y, even if you're not moving data in or out of them (so not stx, ldy, etc.)? In other words, is "lda $00,x" going to take one more cycle with a 16-bit x?

I don't need to worry about per cycle stuff yet, but I have a bad feeling I will as stuff gets tight.
Re: SNES Timing Questions
by on (#175530)
You can set the direct page to start anywhere from $00:0000 to $00:ffff via the 16-bit DP register. If you set it to something that isn't a multiple of $0100, however, it will add an extra cycle to direct page addressing instructions.

There are actually a number of different things that can add extra cycles to instructions. This is probably the best resource I've found at the moment; it's better than the one on the SFC wiki at least, though, uh... it should be noted that I did spot a typo for either bytes or cycles in this table at one point, and then forgot where it was. Good luck?
Re: SNES Timing Questions
by on (#175532)
Espozo wrote:
What is "page-aligned"?

Zero bottom byte. A "page" is 256 bytes, just like a bank is 65,536 bytes. If the bottom byte of DP is zero, the CPU can just take the operand of a direct-page instruction (an 8-bit address) and stick the top byte of DP on top to generate the absolute address. But if the bottom byte of DP is not zero, it has to actually add the 8-bit direct page address to the full 16-bit DP to generate the absolute address, and that takes longer.
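
A minimal sketch of that rule for a plain (unindexed) direct page access follows. The struct and function names are invented for illustration, and indexed and emulation-mode wrapping cases are deliberately left out.

```cpp
#include <cstdint>

struct DirectPageAccess {
  uint16_t address;      // effective address within bank $00
  unsigned extraCycles;  // 1 when the low byte of DP is non-zero
};

// d is the 16-bit direct page register, operand is the 8-bit offset
// taken from the instruction.
DirectPageAccess directPage(uint16_t d, uint8_t operand) {
  DirectPageAccess r;
  r.address = static_cast<uint16_t>(d + operand);  // wraps within bank $00
  r.extraCycles = (d & 0x00FF) ? 1 : 0;
  return r;
}
```

With d = $1200 the result is simply the operand with $12 on top and no penalty; with d = $1280 a real addition is needed and extraCycles is 1.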

Quote:
is "lda $00,x" going to take one more cycle with a 16 bit x?

No, but "lda $0000,x" does. Direct page instructions take an extra cycle for indexing regardless of the size of X/Y.
Re: SNES Timing Questions
by on (#175533)
Espozo wrote:
93143 wrote:
non-page-aligned DP

What is "page-aligned"? I know I've heard you say that before, but I'm curious now. I thought using a 16-bit accumulator instead of 8 was the only thing that could add a cycle. Actually, I may be hallucinating, but does another cycle get added if you're using 16-bit x and y, even if you're not moving data in or out of them (so not stx, ldy, etc.)? In other words, is "lda $00,x" going to take one more cycle with a 16-bit x?

"Pages" on 65xx CPUs are 256 bytes, i.e. $0000-00FF is page 0 (hence the term "zero page" in 6502/65c02), page 1 is $0100-01FF, etc..

The "common" cycle additions on 65816, for some operations (it varies per addressing mode):

1. Add 1 cycle if 16-bit accum (or 16-bit X/Y, for opcodes like ldx and ldy)
2. Add 1 cycle if low byte of D (direct page register) is a value other than $00
3. Add 1 cycle if, when using indexed addressing (ex. lda $12FF,x), the access crosses a page boundary

#1: Should be obvious.

#2: Already covered by Nicole and 93143.

#3: Consider what happens if you do (assume 16-bit accumulator) ldx #1 ; lda $12FF,x. The accumulator will get loaded with data from address $1300 and address $1301. This costs an extra cycle because the effective address has to wrap a page ($12-->$13) when doing the calculation ($12FF + 1).

Branch instructions also have cycle penalties for page crossing, as well as whether or not the branch is taken (branches taken cost an extra cycle). Other opcodes and addressing modes have similar cycle penalties.

And yes, the cycle penalties can "stack" (meaning you can have two of them applying at the same time to cause, say, a 2-cycle penalty).
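
To show how the penalties stack for one concrete case, here is a sketch for LDA absolute,X only. The base count of 4 CPU cycles comes from the WDC opcode tables; the function name and parameters are hypothetical, and the 16-bit index rule follows the earlier remark that "lda $0000,x" costs an extra cycle with 16-bit X.

```cpp
#include <cstdint>

// Hypothetical helper for a single addressing mode: LDA absolute,X.
// accum16 / index16 say whether A and X are currently 16 bits wide.
unsigned ldaAbsoluteXCycles(uint16_t base, uint16_t x,
                            bool accum16, bool index16) {
  unsigned cycles = 4;                  // opcode, two operand bytes, data read
  if (accum16) cycles += 1;             // second data byte
  const bool crossed = ((base + x) & 0xFF00) != (base & 0xFF00);
  if (index16 || crossed) cycles += 1;  // address fix-up cycle
  return cycles;
}
```

The ldx #1 ; lda $12FF,x example with a 16-bit accumulator and 8-bit index comes out as ldaAbsoluteXCycles(0x12FF, 1, true, false), which is 6.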

Please refer to the book Programming the 65816 (including the 6502, 65C02, and 65802) by Western Design Center. For the 2015/03/17 (54MByte) version, refer to Chapter 18 and pay attention to the subscript items / footnotes at the end of each opcode.

Welcome to how/why counting cycles for program efficiency/timing is difficult.
Re: SNES Timing Questions
by on (#175540)
koitsu wrote:
Welcome to how/why counting cycles for program efficiency/timing is difficult.

Is there an easier way? I'm not 100% concerned about pixel perfect reproduction, but would like to be pretty close.

Thanks everyone for the help!
Re: SNES Timing Questions
by on (#175557)
hatfarm wrote:
koitsu wrote:
Welcome to how/why counting cycles for program efficiency/timing is difficult.

Is there an easier way? I'm not 100% concerned about pixel perfect reproduction, but would like to be pretty close.

For programmers: no, there is not an easier way. You literally sit down and start counting cycles manually. Here's an example (but for a routine someone wrote for the NES/6502).

For CPU emulation, "instruction timing" is not that difficult, because the cycle counts and the "adjustments" for certain criteria (see my previous post) are documented. I even refer to the WDC document you can use in said previous post.

I can't really help you with timing/frequencies involving separate clocks, but lidnariq already covered that.

As for the different operational speeds (specifically 1.79MHz vs. 2.68MHz vs. 3.58MHz): these are for NTSC (PAL is different). 1.79MHz is speed when accessing things like controller ports (for buttons or peripheral I/O; specifically MMIO regs $4000-41FF in banks $00-3F). 2.68MHz (a.k.a. "SlowROM") is the normal operating speed for most things (see lidnariq's post), and 3.58MHz (a.k.a. "FastROM") is what can be used for certain banks/memory regions (and can be toggled in real-time via MMIO register $420D bit 0). All these frequencies are divisions of the master clock speed (crystal) of 21.47727MHz.
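
The three speeds fall straight out of dividing the master clock by 12, 8, and 6. A small sketch (the constant and function names are made up for illustration):

```cpp
// NTSC master clock as quoted above: 21.47727 MHz.
constexpr double kMasterClockHz = 21477270.0;

// Effective CPU rate when every cycle costs clocksPerCycle master clocks.
constexpr double cpuRateHz(unsigned clocksPerCycle) {
  return kMasterClockHz / clocksPerCycle;
}
```

cpuRateHz(12), cpuRateHz(8), and cpuRateHz(6) come out at roughly 1.79 MHz, 2.68 MHz, and 3.58 MHz respectively.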

My advice is that if you want to do a SNES emulator, start on a 65816 emulation core (unless there's already one out there you can use -- no idea). You're not going to get "pretty graphics and sound" up and running, but you'd at least start to get some actual games *running* (though they'll likely get stuck in infinite loops waiting on SNES MMIO registers to return certain values -- that's normal. Take baby steps!).
Re: SNES Timing Questions
by on (#175568)
Here is how you compute the speed (6,8,12 clocks) of any memory address on the SNES:

Code:
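unsigned CPU::speed(unsigned addr) const {
  // Bit 22 or bit 15 set: only banks $80-$ff can run at romSpeed.
  if(addr & 0x408000) return addr & 0x800000 ? romSpeed : 8;
  // $0000-$1fff and $6000-$7fff.
  if(addr + 0x6000 & 0x4000) return 8;
  // $2000-$3fff and $4200-$5fff.
  if(addr - 0x4000 & 0x7e00) return 6;
  // $4000-$41ff.
  return 12;
}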


Where romSpeed is 6 when $420d.d0=1, and 8 when $420d.d0=0.

I know this routine is cryptic (it took me a long time to come up with this), but it's the smallest and fastest possible implementation of the logic. Many smart people have tried to best me on this routine with lookup tables and other such tricks, but nothing ends up faster or simpler than the above.

If you want to know the regions, you can reference the docs linked earlier.
Re: SNES Timing Questions
by on (#175686)
Thank you so much for that! It's way better than what I had.

I want to make sure I understand the timing.

Here's roughly what I have for the BRA instruction:
Code:
CPUCycleCount = FAST_CPU_CYCLE + (this.memory.getMemAccessCycleTime(this.pbr, this.pc) << 1);


The reason I'm thinking it is this way, is because we have a single mem access for the instruction fetch and another for the PC incrementer value. Then, the FAST_CPU_CYCLE (which is 6 master cycles) because of the internal addition and moving to the PC (getting us to the 3 CPU cycles that are supposed to be used by the instruction).

Is this the right thinking? Looking at the manual, sometimes a single instruction can grab a word vs a byte, but I'm not seeing that be reflected with instruction fetches and operand fetches.

Thanks again for all your help!
Re: SNES Timing Questions
by on (#175691)
hatfarm wrote:
Is this the right thinking? Looking at the manual, sometimes a single instruction can grab a word vs a byte, but I'm not seeing that be reflected with instruction fetches and operand fetches.

Speaking strictly about branch opcodes (excluding the brl opcode):

1. Most branch instructions are 2 bytes in length: 1 for the opcode, 1 for the operand. The operand byte is essentially signed, thus branches can only go back 128 bytes or forward 127 bytes (relative to the address of the instruction following the branch). They're "PC relative", rather than absolute addresses.

2. Most branch instructions cost 3 CPU cycles unconditionally.

3. If emulation mode is enabled (CPU flag e=1), and the branch is taken (i.e. for conditional branches, the conditional proves true), and the effective address calculated crosses a page boundary, then there is an additional 1 cycle penalty.
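
For the PC-relative part, a sketch of the target calculation for a 2-byte branch. The offset is applied to the address of the instruction that follows the branch; branchTarget is a hypothetical name, and the sign conversion is done arithmetically so it does not depend on how the compiler converts to signed types.

```cpp
#include <cstdint>

// pcOfBranch is the address of the branch opcode within the program bank.
uint16_t branchTarget(uint16_t pcOfBranch, uint8_t operand) {
  const int next = pcOfBranch + 2;  // address after the 2-byte instruction
  const int offset = operand < 0x80 ? operand : operand - 0x100;  // -128..127
  return static_cast<uint16_t>(next + offset);  // wraps within the bank
}
```

An operand of $FE is -2, so branchTarget(0x8000, 0xFE) is $8000: a branch to itself.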

Speaking generally about instructions and their lengths:

There are several instructions which "grab" more than a word (word = 16-bits) as part of their operand. Opcodes that use long addressing, for example, have operands that consist of 3 bytes (so the entire instruction is 4 bytes). An example would be opcode $af (ex. lda $123456), which uses absolute long addressing.
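
As a sketch of how a 4-byte instruction maps onto bus accesses, take lda $123456: every instruction byte is fetched in its own CPU cycle, followed by one data read per byte loaded. The helper below is hypothetical and assumes both data bytes sit in the same speed region.

```cpp
// progSpeed: master clocks per access where the code is running (6 or 8).
// dataSpeed: master clocks per access at the target address (6, 8, or 12).
unsigned ldaLongMasterClocks(unsigned progSpeed, unsigned dataSpeed,
                             bool accum16) {
  const unsigned fetches = 4;              // opcode + 3 operand bytes
  const unsigned reads = accum16 ? 2 : 1;  // one data read per byte loaded
  return fetches * progSpeed + reads * dataSpeed;
}
```

That is 5 CPU cycles with an 8-bit accumulator and 6 with a 16-bit one; from 8-clock ROM reading 8-clock memory, the 8-bit case costs 40 master clocks.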
Re: SNES Timing Questions
by on (#175693)
Yeah, but the manual implies that sometimes it takes a single cycle to do that, and sometimes it takes multiple cycles (at least in the case of a word vs a byte). If it's a 4 byte instruction, how many cycles is that going to take?
Re: SNES Timing Questions
by on (#175695)
Also, are all integers signed? I've got a "getWord" function, but I wasn't sure if that should return a signed or unsigned value, or if there are times when it's both.
Re: SNES Timing Questions
by on (#175696)
hatfarm wrote:
Yeah, but the manual implies that sometimes it takes a single cycle to do that, and sometimes it takes multiple cycles (at least in the case of a word vs a byte). If it's a 4 byte instruction, how many cycles is that going to take?

Where in the manual is this "implied"? The CPU cycle counts are defined clearly and are static, barring the conditionals that might cause them to take more time. If you want T-phase tear-downs of each addressing mode, that's available as well, but you do not need to worry about that level of granularity when emulating the CPU. Honest.

If we're talking about memory access times etc. then that's a different subject and one I can't really talk about (the other hardware guys here can).
Re: SNES Timing Questions
by on (#175697)
hatfarm wrote:
Also, are all integers signed? I've got a "getWord" function, but I wasn't sure if that should return a signed or unsigned value, or if there are times when it's both.

When you say "are all integers signed", I need to know what you're talking about *specifically*. This question is sort of loaded, in the sense that it sounds like something someone familiar with higher level languages (particularly C) would ask. I don't mean that in a judgy way either!

Are we talking about branch instructions? If so, the operand is treated as a signed number. Otherwise no, most things are unsigned.

However, many instructions (esp. load instructions) keep track of whether or not the MSB in the resulting value or modified value is set (this is reflected in the CPU flag n, which stands for negative) (the CPU flag z (zero flag) is also a commonly modified one, defining whether or not the result was 0 or not). Whether or not the underlying 65816 program *chooses* to make use of the n flag is up to the programmer. In other words: values are just values. What CPU flags are modified by an instruction are documented per-opcode/per-addressing-mode.
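
In emulator terms that usually means keeping the value unsigned and just testing bits. A minimal sketch, with invented names, of updating n and z after a load:

```cpp
#include <cstdint>

struct Flags {
  bool n;  // negative: top bit of the result at the current width
  bool z;  // zero: result is 0 at the current width
};

// width16 selects the 16-bit register size; otherwise only the low byte counts.
Flags setNZ(uint16_t result, bool width16) {
  const uint16_t mask = width16 ? 0xFFFF : 0x00FF;
  const uint16_t sign = width16 ? 0x8000 : 0x0080;
  Flags f;
  f.n = (result & sign) != 0;
  f.z = (result & mask) == 0;
  return f;
}
```

The same value can set n at one width and not the other: $80 is negative as a byte but not as a 16-bit word.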

Do you have the time to sit down and read the manual (not skim it)? If you do, I think it'd be worthwhile. The WDC manual (originally from Ron Lichty and David Eyes) actually reads fairly easily a lot of the time, meaning it's not necessarily "hard" reading material. It goes over 6502 and 65c02 as well, so you can have a good understanding of the 8-bit CPUs it was based on (and that information applies to emulation mode as well).
Re: SNES Timing Questions
by on (#175700)
koitsu wrote:
Where in the manual is this "implied"? The CPU cycle counts are defined clearly and are static, barring the conditionals that might cause them to take more time. If you want T-phase tear-downs of each addressing mode, that's available as well, but you do not need to worry about that level of granularity when emulating the CPU. Honest.

If we're talking about memory access times etc. then that's a different subject and one I can't really talk about (the other hardware guys here can).

I do mean memory access times. I actually would be interested in T-phase teardowns of these things, but would probably just think they were cool and move on. I'm not interested in super hardcore perfect reproduction, I just want to make sure I'm not missing anything brutal that will cause my times to slip.

I can't remember the exact line, but I think it said that during a memory read (LDA/STA), it'd read/write 2 bytes, but there were other times when it would read a single byte, and those take the same amount of time.
Re: SNES Timing Questions
by on (#175701)
koitsu wrote:
2. Most branch instructions cost 3 CPU cycles unconditionally.

Huh? No, it's only unconditional branches that always take 3 cycles. Conditional branches only take 3 cycles if they're taken; otherwise they take 2. (Plus the page boundary thing in emulation mode, of course.)
Re: SNES Timing Questions
by on (#175702)
koitsu wrote:
When you say "are all integers signed", I need to know what you're talking about *specifically*. This question is sort of loaded, in the sense that it sounds like something someone familiar with higher level languages (particularly C) would ask. I don't mean that in a judgy way either!

Are we talking about branch instructions? If so, the operand is treated as a signed number. Otherwise no, most things are unsigned.

However, many instructions (esp. load instructions) keep track of whether or not the MSB in the resulting value or modified value is set (this is reflected in the CPU flag n, which stands for negative) (the CPU flag z (zero flag) is also a commonly modified one, defining whether or not the result was 0 or not). Whether or not the underlying 65816 program *chooses* to make use of the n flag is up to the programmer. In other words: values are just values. What CPU flags are modified by an instruction are documented per-opcode/per-addressing-mode.

Do you have the time to sit down and read the manual (not skim it)? If you do, I think it'd be worthwhile. The WDC manual (originally from Ron Lichty and David Eyes) actually reads fairly easily a lot of the time, meaning it's not necessarily "hard" reading material. It goes over 6502 and 65c02 as well, so you can have a good understanding of the 8-bit CPUs it was based on (and that information applies to emulation mode as well).


I'm definitely a higher level dude (C, Java, Python, JavaScript, all recently), but I do understand low level details. I worked on a team that determined whether IO signals were valid or not coming from sensors, and we looked at timing diagrams and such quite often. I did write linux device drivers for what was essentially the NES controller in my OS course a few years back (I was old in school), but I haven't done much assembly work since then. I also designed/wrote VHDL for a RISC microprocessor, so I have the background to understand this stuff, I'm just a bit rusty.

I don't really remember dealing with multibyte data (and that was x86, not 65816), so that's why I'd like to know. In what instances are we getting a signed integer (8 or 16 bit) and in what instances is it not? I have to select the right datatypes when I'm parsing this, so I want to make sure I do so.

I've been skimming the manual (basically implementing an instruction at a time and reading what I can from the manual). I understand the flags and stuff, they're basically the same as other processors I've worked with. I didn't actually see that bit about the BRA instruction's offset being signed, which would be a big deal if I had screwed it up. However, I've mostly been working on this stuff late at night, so my reading comprehension could be impaired a bit by my sleepiness.

Anyway, thank you so much for your helpful (and quick!) responses!
Re: SNES Timing Questions
by on (#175703)
hatfarm wrote:
I do mean memory access times.

Okay, how those fit into the picture is something folks like lidnariq or byuu or others would have to comment on. Purely from a 65816 programmer's perspective, stuff like that has never been a "big" focus of mine. I care exclusively about CPU cycle counts and not, say, how many memory clocks/cycles reading from some memory bus takes. Does it matter? Yes, but that's something nobody has ever been able to explain to me in such a way where it makes tremendous sense. For example, I can tell you that sure, using 3.58MHz memory access is faster than 2.68MHz, and sure, you should try to benefit from that (I think it's 120ns vs. 200ns memory access times? I'm thinking of ROMs here). Likewise, I can tell you a 65816 on a 10MHz clock/crystal definitely runs faster than a 2MHz one (I know this first-hand from having an Apple IIGS accelerator card :-) ). But that's somewhat anecdotal. The hardware bits are what I tend to stray away from.

hatfarm wrote:
I actually would be interested in T-phase teardowns of these things, but would probably just think they were cool and move on.

Sometimes the teardowns help in understanding "how" the CPU does something, but generally speaking it isn't very helpful for emulation (IMO). The references I'm thinking of are here (for 6502 exclusively) and here (for 65816 exclusively).

hatfarm wrote:
I can't remember the exact line, but I think it said that during a memory read (LDA/STA), it'd read/write 2 bytes, but there were other times when it would read a single byte, and those take the same amount of time.

I'd need a reference to this quote/concern. It sounds to me like it's talking about register size, because the 65816 allows you to dynamically (at run-time) change the size (8-bit vs. 16-bit) of the accumulator and the X/Y index registers. 16-bit takes an extra cycle (for what should be an obvious reason), but it's consistent.

Random tip in passing: the two opcodes you're going to have the biggest problem with are probably adc and sbc. Some other 65816 opcodes will cause grief too, but these two are still the biggest stumbling points for emulator authors, given the two's complement arithmetic and how the overflow flag fits into the picture. They've been discussed heavily here. I always refer people to this thread (the post from blargg has the easiest proper implementation).
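
For orientation, here is one common way to write the binary-mode case. It is a sketch, not the implementation from the linked thread: it covers 8-bit width only, leaves out decimal mode entirely, and the struct and function names are made up.

```cpp
#include <cstdint>

struct AdcResult {
  uint8_t value;
  bool c, v, n, z;
};

// 8-bit binary-mode ADC only; decimal mode and 16-bit width are omitted.
AdcResult adc8(uint8_t a, uint8_t operand, bool carryIn) {
  const unsigned sum = a + operand + (carryIn ? 1u : 0u);
  AdcResult r;
  r.value = static_cast<uint8_t>(sum);
  r.c = sum > 0xFF;
  // Overflow: both inputs share a sign and the result's sign differs.
  r.v = (~(a ^ operand) & (a ^ sum) & 0x80) != 0;
  r.n = (r.value & 0x80) != 0;
  r.z = r.value == 0;
  return r;
}

// In binary mode, SBC is ADC with the operand's bits inverted.
AdcResult sbc8(uint8_t a, uint8_t operand, bool carryIn) {
  return adc8(a, static_cast<uint8_t>(~operand), carryIn);
}
```

For instance, adc8(0x7F, 0x01, false) yields $80 with v and n set and c clear, which is the classic signed overflow case.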
Re: SNES Timing Questions
by on (#175704)
Thank you again! So glad someone pointed me here!
Re: SNES Timing Questions
by on (#175705)
Also, I already had to handle the SBC logic (through CPX), and I think I got it right, based on what's in that post. However, definitely not as elegantly written as Blargg's.
Re: SNES Timing Questions
by on (#175706)
93143 wrote:
koitsu wrote:
2. Most branch instructions cost 3 CPU cycles unconditionally.

Huh? No, it's only unconditional branches that always take 3 cycles. Conditional branches only take 3 cycles if they're taken; otherwise they take 2. (Plus the page boundary thing in emulation mode, of course.)

Sorry, you're correct. I could explain what I meant by my statement, but all it'd do is add further confusion to the thread. So I'll correct myself and clarify:

Branch instructions which have conditionals (ex. bcc, bcs, beq, bmi, bne, bpl, bvc, bvs) take 2 cycles by default. If the conditional proves true (branch is taken), there's an additional 1 cycle penalty. There's an additional 1 cycle penalty on top of that if in emulation mode, and the branch is taken, and the effective address crosses a page boundary.

The bra instruction takes 3 cycles, and there's an additional 1 cycle penalty on top of that if in emulation mode, and the branch is taken, and the effective address crosses a page boundary.

The brl instruction takes 4 cycles.
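
Those rules for the 2-byte branches can be sketched as below. The function names are hypothetical; bra is simply the taken = true case, and brl is not covered. The master clock version assumes the extra cycles are internal ones, so they cost 6 master clocks each while the two fetches run at the program region's speed.

```cpp
// CPU cycles for a short (2-byte) branch.
unsigned branchCpuCycles(bool taken, bool emulationMode, bool crossesPage) {
  unsigned cycles = 2;  // opcode fetch + offset fetch
  if (taken) {
    cycles += 1;                                    // branch taken
    if (emulationMode && crossesPage) cycles += 1;  // page-cross penalty
  }
  return cycles;
}

// Master clocks: two fetches at fetchSpeed, plus 6 per internal cycle.
unsigned branchMasterClocks(unsigned fetchSpeed, bool taken,
                            bool emulationMode, bool crossesPage) {
  const unsigned cpu = branchCpuCycles(taken, emulationMode, crossesPage);
  return 2 * fetchSpeed + (cpu - 2) * 6;
}
```

A bra running from 8-clock memory in native mode comes out at 2 * 8 + 6 = 22 master clocks.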
Re: SNES Timing Questions
by on (#175707)
hatfarm wrote:
Also, I already had to handle the SBC logic (through CPX), and I think I got it right, based on what's in that post. However, definitely not as elegantly written as Blargg's.

cpx doesn't affect the v flag, though; adc/sbc do. You probably got the c flag right though. The WDC manual actually goes over all this too (chapter 9) if you want a long step-by-step walkthrough.
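
A compare can be sketched as a subtraction whose result is thrown away, updating only n, z, and c. This is an 8-bit illustration with invented names.

```cpp
#include <cstdint>

struct CompareFlags {
  bool n, z, c;
};

// 8-bit compare (cmp/cpx/cpy style): v is intentionally not touched.
CompareFlags compare8(uint8_t reg, uint8_t operand) {
  const uint8_t diff = static_cast<uint8_t>(reg - operand);
  CompareFlags f;
  f.n = (diff & 0x80) != 0;
  f.z = diff == 0;
  f.c = reg >= operand;  // carry set means no borrow was needed
  return f;
}
```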
Re: SNES Timing Questions
by on (#175725)
hatfarm wrote:
So glad someone pointed me here!


You're welcome :D

I knew the fine folks here would be able to help you. I'm glad this discussion came up too, because I might dabble with 65816 ASM eventually (via the Super Game Boy). I'll be bookmarking this for reference.
Re: SNES Timing Questions
by on (#175763)
hatfarm wrote:
Also, are all integers signed? I've got a "getWord" function, but I wasn't sure if that should return a signed or unsigned value, or if there are times when it's both.


When you're writing a CPU emulator, you should use unsigned integers almost everywhere, and explicitly test the sign bit (e.g. flags.n = (bool)(result & 0x8000)) or cast to a signed type on an as-needed basis (regs.pc += (int8_t)operand). The reason is that in C and C++, signed integer overflow is undefined behaviour. Code like the following:

Code:
void add(int16_t operand) {
  int16_t result = regs.a + operand; // regs.a is an int16_t
  flags.n = (result < 0);
  // compute other flags
  regs.a = result;
}


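is not guaranteed by the language to give the result you expect for flags.n (or even for result). In this particular case, on platforms where int is wider than 16 bits, regs.a and operand are promoted to int before the addition, so the sum itself does not overflow, but converting the out-of-range sum back to int16_t is implementation-defined in C and in C++ before C++20; with int-sized or wider signed operands the addition itself would be undefined behaviour. (int16_t)(32767 + 32767) does not necessarily equal -2 in C.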
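
An unsigned version of the same routine might look like the sketch below. Regs, Flags, and the explicit parameters are assumptions made so the example is self-contained; carry-in and the overflow flag are left out.

```cpp
#include <cstdint>

struct Regs {
  uint16_t a;
};
struct Flags {
  bool n, z, c;
};

// Hypothetical unsigned counterpart of add(): all arithmetic is done in a
// wider unsigned type, so wraparound is well defined.
void add(Regs& regs, Flags& flags, uint16_t operand) {
  const uint32_t sum = static_cast<uint32_t>(regs.a) + operand;
  const uint16_t result = static_cast<uint16_t>(sum);  // modular truncation
  flags.n = (result & 0x8000) != 0;  // test the sign bit explicitly
  flags.z = result == 0;
  flags.c = sum > 0xFFFF;
  regs.a = result;
}
```

Adding 32767 to 32767 this way leaves $FFFE in regs.a with n set, on every conforming compiler.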
Re: SNES Timing Questions
by on (#185019)
byuu wrote:
Here is how you compute the speed (6,8,12 clocks) of any memory address on the SNES:

Code:
unsigned CPU::speed(unsigned addr) const {
  if(addr & 0x408000) return addr & 0x800000 ? romSpeed : 8;
  if(addr + 0x6000 & 0x4000) return 8;
  if(addr - 0x4000 & 0x7e00) return 6;
  return 12;
}


Where romSpeed is 6 when $420d.d0=1, and 8 when $420d.d0=0.

I'm pretty sure the region 80-BF:8000..FFFF should be included in the romSpeed calculation, but the "& 0x408000" prevents it? Unless it really is always slow...
Re: SNES Timing Questions
by on (#185021)
> I'm pretty sure the region 80-BF:8000..FFFF should be included in the romSpeed calculation, but the "& 0x408000" prevents it?

Nope. Either bit allows the condition to pass.

if(addr & 0x408000) is true for:
00-ff:8000-ffff
40-7f:0000-ffff
c0-ff:0000-ffff

The next ternary inside, addr & 0x800000, narrows the range again to:
80-ff:8000-ffff
c0-ff:0000-ffff

The failure condition there captures both ROM regions that must be slow, and WRAM:
00-3f:8000-ffff
40-7d:0000-ffff
7e-7f:0000-ffff
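
The two masks can be pulled apart into small predicates to check the ranges listed above. The function names are made up for illustration.

```cpp
#include <cstdint>

// True when bit 22 or bit 15 of the 24-bit address is set (either is enough).
bool passesFirstTest(uint32_t addr) {
  return (addr & 0x408000) != 0;
}

// True when the first test passes and bit 23 is also set (banks $80-$ff).
bool usesRomSpeed(uint32_t addr) {
  return passesFirstTest(addr) && (addr & 0x800000) != 0;
}
```

So $80:8000 passes on bit 15 alone and uses romSpeed, while $80:2100 has neither bit set and falls through to the later checks.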
Re: SNES Timing Questions
by on (#185022)
byuu wrote:
Either bit allows the condition to pass.

Ah, right.

That's the bit I was missing :P