What kind of 3D calculations were used in SuperFX games?

2014-08-22

OK so I ended up in a discussion about the SuperFX and decided it'd be better to just see what the games actually did that impacted their framerate (SuperFX games aren't exactly fast for the most part...). So here's the question: does anybody know what kind of calculations were used in those games? (I mean actually used, not just speculation because I already know of many algorithms that were common back in the day) What's usually the bottleneck? Because it'd be nice to know.

No need to post code here, just discussing what approaches they took is fine.

2014-08-23

Given that the GSU (aka SuperFX) is only a specialized DSP without hardware support for rasterization, the bottleneck must in most cases have been triangle filling. Then of course, to minimize the amount of filling needed, a lot of cycles were probably spent instead on culling and sorting triangles before actually committing to rasterize one. Vertex transformation isn't exactly cheap either, but compared to other machines of the era (including mc68k) it was quite a beast at multiplications.

2014-08-23

Actually, it does have a dedicated PLOT instruction specifically meant for rasterization (it places a pixel at the given coordinates and then advances the coordinates to the next pixel). It only covers the filling part though, not the part where it gives shape to the triangle, but it's something.

It's worth noting that from what I've been seeing it's extremely slow at accessing memory (either ROM or RAM), those instructions eat up several more cycles than the rest, so all code must run from the instruction cache if you don't want to slow down like crazy (and that cache is only 512 bytes long). I imagine that avoiding memory accesses is probably a priority.

Also I noticed that while 8-bit multiplication is actually extremely fast, 16-bit multiplication is even slower than memory accesses =/ So if vertex computation relies a lot on 16-bit multiplication, yeah, it can be slow too (indeed, not as slow as the CPUs of that era, but still lots of cycles are lost compared to the rest of the code).

2014-08-23

I happen to work with one of the lead programmers of Star Fox, so if you guys want me to present him with specific/precise questions I can do so.

2014-08-24

How does the FX compare against the DMA at filling pixels?

2014-08-24

（・・！）

May as well get out of the way the stuff that brought this discussion in the first place (tell him to take his time, this is a lot):

1) What algorithms are used to process the vertices? Both transformation and projection.

2) What algorithm is used to raster (render) the triangles? Related, is there any special calculation to discard backfacing triangles or did it just rely on the algorithm failing early when the wrong culling is used?

3) What were the biggest bottlenecks when programming the SuperFX? (I'm aware that accessing memory is horribly slow, but I'd like to know what the most serious problems were that actually arose during programming)

4) Not directly related to Star Fox but yes to the SuperFX, is it possible to render to sprites? (as in having the entire screen covered with sprites) I did find info about PLOT being able to render to sprites but I have no idea what limits there are to that.

psycopathicteen wrote:

How does the FX compare against the DMA at filling pixels?

Considering the SuperFX can render during active scan while DMA can't (and that the PLOT instruction can do dithering and automatically handle pixel positioning)... I'd say it wouldn't even be a fair comparison =P

2014-08-24

Actually it can. You can DMA from CPU ROM to CPU RAM, using the CPU RAM access port at $2180.

2014-08-24

Sik wrote:

4) Not directly related to Star Fox but yes to the SuperFX, is it possible to render to sprites? (as in having the entire screen covered with sprites)

Yes, which Yoshi's Island does.

2014-08-24

Wait, now that you mention it, there's the title screen... OK, discard that question =P

2014-08-25

I wish to ask a more generic question:

What was the development process like?

1) Was there any sort of SuperFX processor simulator that ran on the development PC that could be used for testing snippets of code, or did the code constantly have to get burnt to UV erasable eproms or something else like a rom emulator?

2) Did/Does the SuperFX CPU itself have any sort of debugging features (single step, register dump) that can be triggered? If this is too specific, was debugging done: 1) with an LED, 2) writing to some port or memory location 3) using a full SNES dev system and just printing to the screen. 4) just simulated on the development PC as related to the question before.

3) Although I realize you were working on the software, do you recall any discussions about why the SuperFX boards had to start using a dedicated clock resonator circuit instead of the 21 MHz signal from the cartridge edge? My opinion is that it was too unreliable due to signal strength and general contact corrosion, but I'd rather ask it from the source. Or was it a software performance reason (the 21 MHz gets gated off during DRAM refresh, I think)?

2014-08-25

Sik wrote:

Actually, it does have a dedicated PLOT instruction specifically meant for rasterization (it places a pixel at the given coordinates and then advances the coordinates to the next pixel). It only covers the filling part though, not the part where it gives shape to the triangle, but it's something.

It's worth noting that from what I've been seeing it's extremely slow at accessing memory (either ROM or RAM), those instructions eat up several more cycles than the rest, so all code must run from the instruction cache if you don't want to slow down like crazy (and that cache is only 512 bytes long). I imagine that avoiding memory accesses is probably a priority.

Also I noticed that while 8-bit multiplication is actually extremely fast, 16-bit multiplication is even slower than memory accesses =/ So if vertex computation relies a lot on 16-bit multiplication, yeah, it can be slow too (indeed, not as slow as the CPUs of that era, but still lots of cycles are lost compared to the rest of the code).

Sure, there's the PLOT instruction and related settings registers. With a dedicated rasterizer I mean something more like "here's a memory offset to 3 or more screen space coordinates, fill the area for me", or at least something like blitter filling on the Amiga.

2014-08-25

That's exactly the role the SuperFX is meant to fill! The 65816 gives it the data and tells it "render this for me". You could look at it like a programmable "blitter" =P

2014-08-25

I wonder if someone's alluding to the difference between the programmable vertex shader (RSP) and the fixed-function rasterizer (RDP) in the Reality Coprocessor of the Nintendo 64 console. The Super FX, as I understand it, has to act as both the vertex shader and the rasterizer.

2014-08-25

Sik wrote:

That's exactly the role the SuperFX is meant to fill! The 65816 gives it the data and tells it "render this for me". You could look at it like a programmable "blitter" =P

Speaking of, if you don't mind I'd also like to ask a question, namely how they handled interoperability between the scpu and gsu. How exactly did they split the tasks between the two processors and which one did what? What did the scpu do other than read pads, poke commands into the spc700, update the game engine, ppu, hdma registers and dma framebuffers into vram? Or, other than that, was it just idly waiting for the stop flag in $3030 the rest of the time?

2014-08-25

What I'd love is a timeline about the whole Argonaut project:

Something like
1987: r-e the NES
1988: Think about custom co-processing mapper for NES
19xx: Demo 3-D game to Nintendo
19xx: Nintendo shows us SFC for the first time; move coding over to that.
...
199x: Develop portable game system HW for Nintendo, they never use it.

I'd love any info about projects, both successful and cancelled. :-)

2014-08-25

I'd like to politely request someone take control of this thread and formulate all the questions that people want answered. I really need something clear/concise, like an itemised list, not just random jumbled posts intermixed with questions and statements (that doesn't apply to all of you by the way, just some :-) ).

I will bold and enlarge the font for this statement: the questions need to be precise and as terse as possible (within reason).

If someone could please do that, and reach a general consensus of what's wanted, I can submit them to Krister and see what he says. There is no guarantee he will respond (I sent him Email a week ago relating to work stuff and he hasn't responded, so I think he might be on vacation or just very very busy), so just please keep all that in mind. (The work he does where we're employed is quite important)

If there are general questions that are more vague, I'd suggest reading interviews like this one, and see if your questions are already answered there.

2014-08-25

Like so?

psychopathicteen wrote:

1)How does the SuperFX compare against the DMA at filling pixels?

Sik wrote:

1) What algorithms are used to process the vertices? Both transformation and projection.

2) What algorithm is used to raster (render) the triangles?

3) [split, trim] Related, is there any special calculation [in Starfox/the SuperFX?] to discard backfacing triangles?

4)[trim] What were the biggest bottlenecks when programming [with] the SuperFX?

whicker wrote:

1)What was the development process like?

2) [paraphrase]Did you debug on a PC or on the SNES? If on the SNES, how?

3) [trim] Did/Does the SuperFX CPU itself have any sort of debugging features?

4) [trim] Although I realize you were working on the software, do you recall any discussions about why the SuperFX boards had to start using a dedicated clock resonator circuit instead of the 21 MHz signal from the cartridge edge?

ARM9 wrote:

1)[paraphrase, trim]How did they handle interoperability between the scpu and gsu; how exactly did they split the tasks between the two processors and which one did what?

ccovell wrote:

1) What I'd love is a timeline about the whole Argonaut project.

2) I'd love any info about projects, both successful and cancelled. :-)

93143 wrote:

I have a few technical Super FX questions

Revised List:

1) What are the absolute hardware bottlenecks on blitting (using PLOT with color #0 not written, or only PLOTting part of a pixel cache, so it has to read the old data from RAM before writing the new data back)?
1b) How many cycles does it take to empty the secondary pixel cache under those circumstances?
1c) How about transferring the primary cache to the secondary, once the secondary is free?

2) Apparently ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? For both reading and writing? And does this impact the answer(s) for (1)? Did this change at all between chip/board revisions?

3) Is the instruction cache on the latest version(s) of the GSU 256 bytes or 512 bytes? I'd like to be sure.

ed: bolded questions specific to SuperFX
ed2: updated 93143's questions

2014-08-25

[double-posting to provide partial answers when possible, and to definitively separate from the question-list]

whicker wrote:

2) [paraphrase]Did you debug on a PC or on the SNES? If on the SNES, how?

I have an unsourced recollection that devs did not have to burn chips, but had a cable-driven SNES. This is most likely a mangled memory from watching the Making of Solstice videos...which had a hint of Equinox. Take with several grains of salt.

psychopathicteen wrote:

1)How does the FX compare against the DMA at filling pixels?

The interview koitsu linked suggests 40x.

Quote:

3) [split, trim] Related, is there any special calculation to discard backfacing triangles?

Yes.[/useless] If you don't mean specifically in Starfox/SuperFX, either "dot product of the plane" or a cross-product of two edges will let you determine a triangle's facing.[source: Flights of Fantasy] Note that this is basically telling you from which direction the vertices are ordered clockwise; thus, one has to be consistent about that being the in- or outward direction. I expect other methods exist.
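For illustration, the cross-product-of-two-edges test can be sketched as below. This is a generic Python sketch, not anything taken from Star Fox or SuperFX code; the function names, the screen-space (y grows downward) convention, and the `front_is_clockwise` flag are all assumptions made for the example.

```python
def signed_area2(p0, p1, p2):
    """Twice the signed area of a screen-space triangle.

    This is the z component of the cross product of the edges
    p0->p1 and p0->p2. With y growing downward (screen coordinates),
    a positive result means the vertices appear clockwise on screen.
    """
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)


def is_backfacing(p0, p1, p2, front_is_clockwise=True):
    """Return True if the triangle should be discarded.

    front_is_clockwise is the (assumed) modelling convention; the
    important part is using the same convention for every triangle.
    Degenerate (zero-area) triangles are discarded as well.
    """
    area2 = signed_area2(p0, p1, p2)
    if front_is_clockwise:
        return area2 <= 0
    return area2 >= 0
```

Swapping any two vertices flips the sign of the result, which is why the winding convention has to be consistent across the whole model.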

2014-08-25

koitsu wrote:

I'd like to politely request someone take control of this thread and formulate all the questions that people want answered. I really need something clear/concise, like an itemised list, not just random jumbled posts intermixed with questions and statements (that doesn't apply to all of you by the way, just some :-) ).

Yeah, seriously, this got out of hand. This was supposed to ask just about some specific working internals of the code >.>' (specifically SuperFX code) I wasn't even expecting people to ask other questions. At the rate this is going, soon we'll end up with people asking about stuff not related to even the SNES at all. If you want a good set of questions, just stick with the initial list I posted for now (except question #4, for which we already settled that the answer was "yes").

Myask wrote:

Quote:

3) [split, trim] Related, is there any special calculation to discard backfacing triangles?

Yes.[/useless] If you don't mean specifically in Starfox/SuperFX, either "dot product of the plane" or a cross-product of two edges will let you determine a triangle's facing.[source: Flights of Fantasy] Note that this is basically telling you from which direction the vertices are ordered clockwise; thus, one has to be consistent about that being the in- or outward direction. I expect other methods exist.

To rasterize a triangle you may want to reorder the vertices to make it easier for the algorithm to scan the lines. You could do this on the assumption a certain winding is always used. If you do it this way, when the wrong winding is used, the two extremes of the scanlines will cross right in the first row, effectively killing the triangle before it even starts getting rendered. The end result is that you get backface culling for free without having to resort to doing any maths.

2014-08-25

Sik wrote:

To rasterize a triangle you may want to reorder the vertices to make it easier for the algorithm to scan the lines. You could do this on the assumption a certain winding is always used. If you do it this way, when the wrong winding is used, the two extremes of the scanlines will cross right in the first row, effectively killing the triangle before it even starts getting rendered. The end result is that you get backface culling for free without having to resort to doing any maths.

I don't follow. To rasterize a triangle, you must trace two vertical lines, one on the left side of the triangle, one on the right side of the triangle. For each pair of trace points, proceeding down, you render the pixels across from one to the other. If you could somehow (how?) arrange the sides to be reversed if the triangle is back facing, yes the horizontal render would be skipped, but you'd still be wasting effort tracing the sides. A simple cross product to determine facing would let you skip the whole operation. Why do you want to avoid "doing any maths"?

The usual way I've seen it done is to sort the three points top to bottom, which gives you one long vertical side, and the other side is split into two segments (though if the mid and bottom are on the same line, for example, one of these segments has length 0). This gives you an upper wedge (fanning out from the top point to a flat bottom) and a lower wedge (a flat top converging to the bottom point). You then trace the sides of the wedge, drawing pixels across between them one line at a time. If you do things this way, you have to do the backface culling before sorting the points. Once you've sorted the three points, you no longer have any information about winding because you've changed their order.
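A minimal sketch of the sort-then-wedges approach described above, in Python for readability. It is a generic illustration, not SuperFX code; the function name, the span tuple format, and the use of plain integer division for edge stepping are assumptions.

```python
def fill_triangle(p0, p1, p2):
    """Return a list of (y, x_left, x_right) spans covering the triangle.

    The three points are sorted top to bottom first, which gives one
    long side (top -> bottom) and one side split in two at the middle
    point. Sorting discards the original winding order, so any
    backface test has to happen before this function is called.
    """
    (x0, y0), (x1, y1), (x2, y2) = sorted((p0, p1, p2), key=lambda p: p[1])

    def edge_x(xa, ya, xb, yb, y):
        # x position of edge a->b at scanline y (floor rounding).
        if yb == ya:
            return xa
        return xa + (xb - xa) * (y - ya) // (yb - ya)

    spans = []
    for y in range(y0, y2 + 1):
        x_long = edge_x(x0, y0, x2, y2, y)
        if y < y1:
            x_short = edge_x(x0, y0, x1, y1, y)   # upper wedge
        else:
            x_short = edge_x(x1, y1, x2, y2, y)   # lower wedge
        spans.append((y, min(x_long, x_short), max(x_long, x_short)))
    return spans
```

Feeding the same three points in a different order produces the same spans, which shows that the winding information really is gone after the sort.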

There are indeed alternatives to backface culling (e.g. it's not strictly necessary if you have a depth buffer or painter's algorithm), or different ways to specify it (e.g. per-triangle face normal), but winding order of the vertices requires the least amount of extra data (i.e. none) and is very simple to compute. I've never seen any other method used in a game situation; it really is the "standard" way to do it.

2014-08-26

Three adds, two multiplies, and a compare to check for discard is cheaper than rendering it. And, like removing those polygons that are behind the viewpoint, it removes about half of them for a mere O(n-polys) operation. Much cheaper than having to determine where to start it and including it in the proper order.

rainwarrior wrote:

To rasterize a triangle, you must trace two vertical lines,

Well, Bresenham's algorithm is rather cheap...

rainwarrior wrote:

I've never seen any other method used in a game situation

I suppose the BSP precalculations used by Doom, Quake, and I think Unreal for solid geometry don't count, being more of the "make it unnecessary" variety (and BSP uses two-faced polygons besides). I'm not sure what they use for checking the mobile geometry of elevators, doors, enemies, etc.

Sik wrote:

To rasterize a triangle you may want to reorder the vertices to make it easier for the algorithm to scan the lines. You could do this on the assumption a certain winding is always used.

I'm not sure I get it either, as that sounds more like y-ordering than winding...?[diagram]

2014-08-26

Yeah, Y ordering, except you just rotate the vertices (i.e. don't flip their order), that preserves the winding.
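To make the difference between rotating and sorting concrete, here is a small Python sketch. The helper names and the tie-break rule (lowest y, then lowest x) are assumptions for the example, and nothing here claims to be what any SuperFX game did.

```python
def signed_area2(tri):
    """Twice the signed area; the sign encodes the winding order."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)


def rotate_top_first(tri):
    """Rotate the vertex list so the topmost vertex comes first.

    Only the starting point changes; the cyclic order (and therefore
    the winding) is preserved. Ties are broken by the smaller x, which
    is an arbitrary choice made for this sketch.
    """
    tri = list(tri)
    i = min(range(3), key=lambda k: (tri[k][1], tri[k][0]))
    return tri[i:] + tri[:i]


def sort_top_first(tri):
    """Plain sort by y, which may reverse the cyclic order."""
    return sorted(tri, key=lambda p: p[1])
```

With `tri = [(9, 4), (0, 0), (5, 9)]`, the rotated list keeps the sign of `signed_area2`, while the sorted list flips it. A rasterizer that assumes a fixed winding after rotation would therefore see its left and right edges swapped for a triangle of the opposite winding.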

2014-08-26

I'm not sure which comparisons you're using to decide which of three rotations to pick, that will exclude, say, 1-3-2 (bottom-left) from happening and partially-rendering.

2014-08-26

Myask wrote:

rainwarrior wrote:

I've never seen any other method used in a game situation

I suppose the BSP precalculations used by Doom, Quake, and I think Unreal for solid geometry don't count, being more of the "make it unnecessary" variety (and BSP uses two-faced polygons besides). I'm not sure what they use for checking the mobile geometry of elevators, doors, enemies, etc.

Yes, that's what I was referring to when I said with a depth buffer or painter's algorithm (i.e. BSP) it's not necessary to reject back-faces.

From what I remember reading, for movable stuff it's just regular backface culling with possibly some sub-sorting of convex objects on a character. It doesn't use a BSP on the triangles of the character, but I think it substitutes a simple stand-in for it into the world's BSP so it gets occluded by the world. That's what I recall, anyway. I am of a mind to dig that stuff up and read it again now though...

There was another fun tidbit about Quake, that perspective correct texturing was implemented, but as a tradeoff for performance it's applied at the endpoints of a horizontal span, and then the span is subdivided recursively (with perspective correction) until individual spans are short enough, at which point the spans are rendered with linear texturing. They tuned the size where rendering starts until they were happy with the balance of visual appearance vs. performance.
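The subdivision idea as described in that paragraph can be sketched like this. It follows the recollection above rather than Quake's actual source; the function names, the single texture coordinate `u`, and the `max_len=16` threshold are placeholders chosen for the example.

```python
def perspective_u(x, x0, x1, u0, z0, u1, z1):
    """Exact perspective-correct u at screen position x (needs x1 > x0).

    u/z and 1/z are linear in screen space, so interpolate those and
    divide at the end.
    """
    t = (x - x0) / (x1 - x0)
    inv_z = (1 - t) / z0 + t / z1
    u_over_z = (1 - t) * u0 / z0 + t * u1 / z1
    return u_over_z / inv_z


def span_u(x0, x1, u0, z0, u1, z1, max_len=16):
    """Return u for every pixel x0..x1 using subdivided linear spans.

    Spans longer than max_len (must be >= 1) are split at the midpoint,
    where u is computed with the expensive perspective divide. Short
    spans are filled with cheap linear interpolation.
    """
    out = {}

    def recurse(xa, ua, xb, ub):
        if xb - xa <= max_len:
            for x in range(xa, xb + 1):
                t = (x - xa) / (xb - xa) if xb > xa else 0.0
                out[x] = ua + t * (ub - ua)
            return
        xm = (xa + xb) // 2
        um = perspective_u(xm, x0, x1, u0, z0, u1, z1)
        recurse(xa, ua, xm, um)
        recurse(xm, um, xb, ub)

    recurse(x0, u0, x1, u1)
    return [out[x] for x in range(x0, x1 + 1)]
```

Raising `max_len` means fewer divides and more visible warping; lowering it moves the result toward fully perspective-correct texturing.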

Myask wrote:

I'm not sure I get it either, as that sounds more like y-ordering than winding...?

Ah, is Y-ordering an alternative backface removal technique? Does it put restrictions on rotation to preserve the ordering?

2014-08-26

I have a few technical Super FX questions, but I'm not sure if they're the sort of thing one would need to ask one of the Argonauts, and anyway they partly overlap Sik's questions. Does anyone happen to know the answer to these?

1) What are the absolute hardware bottlenecks on blitting (using PLOT with color #0 not written, so it has to read the old data from RAM before writing the new data back, if I understand correctly)? How many cycles does it take to empty the secondary pixel cache under those circumstances (I'm thinking not less than 48 cycles for 8bpp, or less for lower bit depth)? How about transferring the primary cache to the secondary, once the secondary is free?

2) Per the manual, ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? And does this impact the answer for (1)?

3) Is the GSU2's instruction cache 256 bytes (as byuu keeps saying re: the Super FX in general) or 512 bytes (as the manual claims)? I'm thinking the latter, and that the former was for an earlier chip rev, but I'd like to be sure.

4) Is there anything special that a developer using a GSU2 would need to do in order to use those 6 MB of extra ROM at the top of the CPU's memory map? (Preferably in FastROM mode?) As far as I know no games ever did...

2014-08-26

93143 wrote:

2) Per the manual

Where is this?

Y'know, it'd be useful for both the SNESDev and GBDev subfora to have stickies with links to reference documents/knowledge bases.

2014-08-26

93143 wrote:

1) What are the absolute hardware bottlenecks on blitting (using PLOT with color #0 not written, so it has to read the old data from RAM before writing the new data back, if I understand correctly)? How many cycles does it take to empty the secondary pixel cache under those circumstances (I'm thinking not less than 48 cycles for 8bpp, or less for lower bit depth)? How about transferring the primary cache to the secondary, once the secondary is free?

Isn't 8bpp (mode 7) packed instead of planar? That'd make it much faster than 2bpp and 4bpp since one pixel would need only one byte access (also, the documentation says that 48 cycles is the maximum for PLOT if I recall correctly)

93143 wrote:

2) Per the manual, ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? And does this impact the answer for (1)?

Where is this mentioned? The opcode documentation seems to assume the clock doesn't affect the speed (it mentions three different speeds depending on whether it's running from ROM, RAM or the cache, but nothing else except for opcodes that may be affected by the GSU status)

93143 wrote:

4) Is there anything special that a developer using a GSU2 would need to do in order to use those 6 MB of extra ROM at the top of the CPU's memory map? (Preferably in FastROM mode?) As far as I know no games ever did...

Doesn't it use the same bank numbers the 65816 uses?

Myask wrote:

Y'know, it'd be useful for both the SNESDev and GBDev subfora to have stickies with links to reference documents/knowledge bases.

You'd have some people here complain about copyright infringement then. It's in chapter 2-9 of book II if you get the SNES documentation though (it describes every opcode of the SuperFX).

2014-08-26

Sik wrote:

Isn't 8bpp (mode 7) packed instead of planar?

I can't find that data in the manual. Did Doom use Mode 7? I saw a reference that said it didn't. The SNES has two other modes with 8bpp layers, and they're planar. Plus I don't see anything in the manual that indicates radically different behaviour when plotting 8bpp.

Do you even need the PLOT functionality if you're using packed-pixel?

Quote:

93143 wrote:

2) Per the manual, ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? And does this impact the answer for (1)?

Where is this mentioned?

Page 2-7-2, right before section 7.1.2 begins. There may be multiple versions out there...

Quote:

93143 wrote:

4) Is there anything special that a developer using a GSU2 would need to do in order to use those 6 MB of extra ROM at the top of the CPU's memory map? (Preferably in FastROM mode?) As far as I know no games ever did...

Doesn't it use the same bank numbers the 65816 uses?

The ROM area I'm talking about is outside the GSU's jurisdiction. It outright can't see it, nor can it cut off the S-CPU's access to it. I thought the GSU was limited to 2 MB of ROM (or possibly 4 MB if you fiddled with the memory map?), but this is a totally separate block and that limit doesn't apply.

I don't see any real reason why an actual game couldn't have used it (except that a GSU was expensive, ROM was expensive, FastROM was expensive, add 'em all up and...). I'm mostly worried about unemulated behaviour, either in an emulator or on the SD2SNES when/if it supports Super FX, but I don't understand SNES memory mapping well enough to be confident that there isn't a gotcha hiding in there even in terms of real/original hardware...

2014-08-27

93143 wrote:

I can't find that data in the manual. Did Doom use Mode 7? I saw a reference that said it didn't. The SNES has two other modes with 8bpp layers, and they're planar. Plus I don't see anything in the manual that indicates radically different behaviour when plotting 8bpp.

What modes are actually supported by PLOT?

93143 wrote:

Do you even need the PLOT functionality if you're using packed-pixel?

Yes, because the main advantage of the PLOT instruction is that it converts screen coordinates into tilemapped coordinates.

Note that it's possible to draw without that instruction (by writing to RAM directly). It's perfectly possible Doom isn't even using the PLOT instruction, when drawing vertical spans there isn't much of a problem (every consecutive pixel is always at the same distance in bytes).
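As a rough picture of the coordinate translation PLOT saves you from doing by hand, here is a Python sketch that locates a pixel in a bitplaned SNES tile bitmap. The bitplane layout (planes stored in interleaved pairs, 16 bytes per pair, leftmost pixel in bit 7) is the standard SNES tile format. The column-major tile arrangement and the `tiles_per_column` parameter are assumptions for illustration and should be checked against the GSU documentation before relying on them.

```python
def plot_address(x, y, bpp, tiles_per_column):
    """Locate pixel (x, y) in a bitplaned, tile-based bitmap.

    Returns (plane_offsets, bit): one byte offset per bitplane, plus
    the bit number that holds this pixel in each of those bytes.
    """
    bytes_per_tile = 8 * bpp                       # 8 rows, bpp bytes each
    # ASSUMPTION: tiles are laid out column by column in the bitmap.
    tile = (x // 8) * tiles_per_column + (y // 8)
    row_offset = tile * bytes_per_tile + (y % 8) * 2
    # Planes 0/1 are interleaved, planes 2/3 follow 16 bytes later, etc.
    plane_offsets = [row_offset + (p // 2) * 16 + (p % 2) for p in range(bpp)]
    bit = 7 - (x % 8)                              # leftmost pixel = bit 7
    return plane_offsets, bit
```

Under these assumptions, moving down one pixel adds 2 bytes to every plane offset within a tile and takes a larger jump at a tile boundary, while a packed-pixel bitmap would keep one byte per pixel at a simpler fixed stride.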

93143 wrote:

Page 2-7-2, right before section 7.1.2 begins. There may be multiple versions out there...

Looks like I skipped it =/ Odd that they didn't mention it in the opcode details though.

93143 wrote:

The ROM area I'm talking about is outside the GSU's jurisdiction. It outright can't see it, nor can it cut off the S-CPU's access to it. I thought the GSU was limited to 2 MB of ROM (or possibly 4 MB if you fiddled with the memory map?), but this is a totally separate block and that limit doesn't apply.

I don't see any real reason why an actual game couldn't have used it (except that a GSU was expensive, ROM was expensive, FastROM was expensive, add 'em all up and...). I'm mostly worried about unemulated behaviour, either in an emulator or on the SD2SNES when/if it supports Super FX, but I don't understand SNES memory mapping well enough to be confident that there isn't a gotcha hiding in there even in terms of real/original hardware...

Well, technically the GSU is wired directly to the cartridge (it could even be wired to access parts not accessible to the 65816), so in the worst case they could just wire the banks to the relevant portions... The only limitation here would be Nintendo's policies =P

2014-08-27

Sik wrote:

93143 wrote:

I can't find that data in the manual. Did Doom use Mode 7? I saw a reference that said it didn't. The SNES has two other modes with 8bpp layers, and they're planar. Plus I don't see anything in the manual that indicates radically different behaviour when plotting 8bpp.

What modes are actually supported by PLOT?

2bpp, 4bpp (tile and sprite, the latter behaving differently iirc) and 8bpp. You also have some settings in the plot option register for dithering and "transparency" (skipping on color 0) etc. 8bpp can be used for both mode7 and mode3/4, it depends on how you upload it to vram (port $2118 byte transfers interleaved or word to 2118/2119) and how you build your map.

Quote:

Well, technically the GSU is wired directly to the cartridge (it could even be wired to access parts not accessible to the 65816), so in the worst case they could just wire the banks to the relevant portions... The only limitation here would be Nintendo's policies =P

The gsu sits between the cartridge rom/ram and the scpu so the address bus on the gsu is the limit here, which can only address 2MiB on all but the first version. Shouldn't be too much of a hassle to increase that on something like the sd2snes.

2014-08-27

ARM9 wrote:

8bpp can be used for both mode7 and mode3/4, it depends on how you upload it to vram (port $2118 byte transfers interleaved or word to 2118/2119) and how you build your map.

You sure? How would you account for packed-pixel vs. bitplane?

Quote:

Well, technically the GSU is wired directly to the cartridge (it could even be wired to access parts not accessible to the 65816), so in the worst case they could just wire the banks to the relevant portions... The only limitation here would be Nintendo's policies =P

The gsu sits between the cartridge rom/ram and the scpu so the address bus on the gsu is the limit here, which can only address 2MiB on all but the first version. Shouldn't be too much of a hassle to increase that on something like the sd2snes.

I'm referring to the memory maps in the manual. The GSU doesn't see anything above bank 71, but the S-CPU can access a bunch of stuff past that point, including 2 MB of LoROM from 80 to BF and 4 MB of HiROM from C0 to FF. This 6 MB ROM region is in addition to what the GSU can access, and according to the diagrams in sections 1.3 and 1.4, the CPU can access this ROM irrespective of what the GSU is doing or the status of the access control switch.

2014-08-27

Re: 8bpp available in mode 3, mode 4, and mode 7: this is correct.

2014-08-28

I don't care what the manual says, the layout is:

[SNES cartridge connector] <-> [GSU] <-> [ROM]

Only one of them can actually read back valid ROM contents at a time, because you can't have two chips reading the same chip at different locations at the exact same time. It's not physically possible.

The SA-1 is the only coprocessor that appears to do it, but in fact it uses another logic block that controls memory accesses. It will actually stall the SA-1 CPU when the host SNES CPU is accessing the same chip at the same time. Which as you can imagine, results in the code taking longer to execute. The GSU does not have this logic, and neither does the Cx4.

Now ... if you wire up your own cart, you can easily have:

[SNES cartridge connector] <-> [GSU] <-> [ROM1]
[SNES cartridge connector] <-> [ROM2]

Where obviously the GSU won't be able to access ROM2, but the SNES CPU can continue to use ROM2 while the GSU is using ROM1.

Now, what is the max ROM the GSU can address? I'd probably go with what the docs say, depending on each revision. But that's strictly a matter of how many ROM address pins there are on the GSU chip itself.

Cheating this with the sd2snes won't do you much good, since unfortunately that chip's not emulated there (yet? may prove too demanding for the FPGA used.)

2014-08-28

byuu wrote:

I don't care what the manual says, the layout is:

[SNES cartridge connector] <-> [GSU] <-> [ROM]
[...]
Now ... if you wire up your own cart, you can easily have:

[SNES cartridge connector] <-> [GSU] <-> [ROM1]
[SNES cartridge connector] <-> [ROM2]

I have avoided seeing the manual. But based on what's been said so far in this topic, it appears the manual mentions the latter configuration, which never ended up being used in a commercial game due to cost.

2014-08-28

koitsu wrote:

Re: 8bpp available in mode 3, mode 4, and mode 7: this is correct.

For the SNES, or for the Super FX?

2014-08-28

93143 wrote:

koitsu wrote:

Re: 8bpp available in mode 3, mode 4, and mode 7: this is correct.

For the SNES, or for the Super FX?

SNES. I know absolutely *jack squat* about the Super FX or any extension chips.

2014-08-28

Okay, found a reference that isn't the dev manual (not as explicit about the circuit layout, unfortunately):

viewtopic.php?f=12&t=5964&hilit=Additional&start=45#p103957

nocash wrote:

GSU Memory Map (at SNES Side)
This is more or less as already known. The 8K at xx:6000h-xx:7FFFh is always mirroring to 700000h-701FFFh (no matter if the "xx" bank is 00h..3Fh or 80h..BFh).

Code:

  00-3F:3000-34FF  GSU I/O Ports
  00-3F:6000-7FFF  Mirror of 70:0000-1FFF (ie. FIRST 8K of Game Pak RAM)
  00-3F:8000-FFFF  Game Pak ROM in LoRom mapping (2Mbyte max)
  40-5F:0000-FFFF  Game Pak ROM in HiRom mapping (mirror of above 2Mbyte)
  70-71:0000-FFFF  Game Pak RAM       (128Kbyte max, usually 32K or 64K)
  78-79:0000-FFFF  Additional "Backup" RAM  (128Kbyte max, usually none)
  80-BF:3000-32FF  Mirror of GSU I/O Ports
  80-BF:6000-7FFF  Mirror of 70:0000-1FFF (ie. FIRST 8K of Game Pak RAM)
  80-BF:8000-FFFF  Additional "CPU" ROM LoROM (2Mbyte max, usually none)
  C0-FF:0000-FFFF  Additional "CPU" ROM HiROM (4Mbyte max, usually none)
  Other Addresses  Open Bus

The above "Additional" areas aren't installed on existing boards (=are seen as open bus).

tepples wrote:

byuu wrote:

I don't care what the manual says, the layout is:

[SNES cartridge connector] <-> [GSU] <-> [ROM]
[...]
Now ... if you wire up your own cart, you can easily have:

[SNES cartridge connector] <-> [GSU] <-> [ROM1]
[SNES cartridge connector] <-> [ROM2]

I have avoided seeing the manual. But based on what's been said so far in this topic, it appears the manual mentions the latter configuration, which never ended up used in a commercial game due to cost.

As far as I can tell, that's exactly right. The SNES is supposed to be wired straight into the "additional" ROM and RAM, parallel to the GSU and the memory behind it. But no games actually did this...
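
For anyone who wants to poke at nocash's map from a script, here is a minimal sketch of a lookup over the table quoted above. The ranges are copied exactly as quoted (including the differing I/O port ranges for banks 00-3F and 80-BF); the names `REGIONS`, `classify` and `resolve_ram_mirror` are made up for this example and don't come from fullsnes or any emulator.

```python
# SNES-side view of a GSU cart, transcribed from the quoted map.
# Each entry: (first bank, last bank, first offset, last offset, label)
REGIONS = [
    (0x00, 0x3F, 0x3000, 0x34FF, "GSU I/O ports"),
    (0x00, 0x3F, 0x6000, 0x7FFF, "Game Pak RAM mirror"),
    (0x00, 0x3F, 0x8000, 0xFFFF, "Game Pak ROM (LoROM)"),
    (0x40, 0x5F, 0x0000, 0xFFFF, "Game Pak ROM (HiROM)"),
    (0x70, 0x71, 0x0000, 0xFFFF, "Game Pak RAM"),
    (0x78, 0x79, 0x0000, 0xFFFF, "backup RAM"),
    (0x80, 0xBF, 0x3000, 0x32FF, "GSU I/O ports"),
    (0x80, 0xBF, 0x6000, 0x7FFF, "Game Pak RAM mirror"),
    (0x80, 0xBF, 0x8000, 0xFFFF, "additional CPU ROM (LoROM)"),
    (0xC0, 0xFF, 0x0000, 0xFFFF, "additional CPU ROM (HiROM)"),
]


def classify(bank, offset):
    """Return the label of the region that bank:offset falls into."""
    for first_bank, last_bank, first_off, last_off, label in REGIONS:
        if first_bank <= bank <= last_bank and first_off <= offset <= last_off:
            return label
    return "open bus"  # "Other Addresses" row of the quoted map


def resolve_ram_mirror(bank, offset):
    """Map xx:6000-7FFF onto 70:0000-1FFF; return None for other addresses."""
    if classify(bank, offset) != "Game Pak RAM mirror":
        return None
    return (0x70, offset - 0x6000)
```

On the existing boards the two "additional" ROM rows and the backup RAM row would read as open bus, as the quote says; the table only describes what the mapping allows.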

2014-09-01

If it's not too late, I'd just like to reiterate that if anyone on here knows the answers to my questions, I do not want them put to an original Star Fox programmer who is reportedly very busy and might, given the time elapsed, have to look up detailed chip information like anyone else. For instance, in light of nocash's old post and byuu's newer one, I consider my question #4 answered.

Also, it turns out nocash has enough data in his fullsnes document that I don't need to reference the manual for my questions.

Revised list:

1) What are the absolute hardware bottlenecks on blitting (using PLOT with color #0 not written, or only PLOTting part of a pixel cache, so it has to read the old data from RAM before writing the new data back)?
1b) How many cycles does it take to empty the secondary pixel cache under those circumstances?
1c) How about transferring the primary cache to the secondary, once the secondary is free?

2) Apparently ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? For both reading and writing? And does this impact the answer(s) for (1)? Did this change at all between chip/board revisions?

3) Is the instruction cache on the latest version(s) of the GSU 256 bytes or 512 bytes? I'd like to be sure.

2014-09-03

93143 wrote:

2) Apparently ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? For both reading and writing? And does this impact the answer(s) for (1)? Did this change at all between chip/board revisions?

Since the RAM access is documented to be similar to ROM in most cases (other than where executing in RAM would impact RAM access) I'd think fullsnes is correct on this point.
Storing to RAM (sm, st, sbk) uses a buffer so the CPU can continue executing opcodes without having to wait (except when running code in RAM). If you execute other code while RAM is being written, you can perform 1-2 cycle writes (when running in cache). This is all documented in the PDF that I'll assume you have; you should read through the GSU chapter.
It's a bit inconsistent and just plain wrong at times, but it's the best we have at this point until somebody finds argonaut documents.
>According to cache description (page 132), Cache-Code is 6 times faster than ROM/RAM. However, according to opcode descriptions (page 160 and up), cache is only 3 times faster than ROM/RAM. Whereas, maybe 6 times refers to 21MHz mode, and 3 times to 10MHz mode?

93143 wrote:

3) Is the instruction cache on the latest version(s) of the GSU 256 bytes or 512 bytes? I'd like to be sure.

512 bytes, all revisions, it's in the manual, fullsnes and bsnes. And you can test it yourself with $3100-$32FF. Where'd you read that it's 256 bytes?
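
The size follows directly from that address window. A trivial sketch of the arithmetic, where `window_size` is just a name picked for this example:

```python
def window_size(first_address, last_address):
    """Number of bytes in an inclusive address range."""
    return last_address - first_address + 1


# The cache window mentioned above, $3100-$32FF as seen from the SNES side.
cache_bytes = window_size(0x3100, 0x32FF)
print(f"cache window: {cache_bytes} bytes (0x{cache_bytes:X})")
```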

As for question 1, I'm curious about this as well. It's not documented in the manual, other than plot having a worst case of 48 cycles. Generally you want to put as much general processing as possible between plot and load/store instructions. Considering the worst case, it might be wise to try to put more code after a plot before you access RAM.
If you want exact timings, consider profiling on hardware.

2014-09-04

ARM9 wrote:

93143 wrote:

2) Apparently ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? For both reading and writing? And does this impact the answer(s) for (1)? Did this change at all between chip/board revisions?

Since the RAM access is documented to be similar to ROM in most cases (other than where executing in RAM would impact RAM access) I'd think fullsnes is correct on this point.

I'd think so too, but it doesn't seem to be too definite on the subject, what with all the question marks and the caveat about poor documentation...

Quote:

Storing to ram (sm,st,sbk) uses a buffer so the cpu can continue executing opcodes without having to wait (except when running code in ram). If you execute other code while ram is being written you can perform 1-2 cycle writes (when running in cache).

Yeah, but that doesn't change the fundamental fact that the throughput to RAM is one byte every X cycles, which would bottleneck a sufficiently lean continuous write loop.
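
A toy model of what I mean, as a sketch only. It assumes a single-entry write buffer, one store per loop iteration, and that a new store stalls until the previous write has drained; the cycle counts passed in are placeholders for the unknown X, not measured values, and `simulate_store_loop` is a name invented for this example.

```python
def simulate_store_loop(iterations, work_cycles, write_cycles):
    """Cycles until a loop with one buffered store per iteration finishes.

    work_cycles:  cycles of instructions per iteration, including issuing the store
    write_cycles: cycles the RAM needs to complete one write (the "X" above)
    """
    now = 0          # current cycle
    ram_free_at = 0  # cycle at which the pending write has drained
    for _ in range(iterations):
        now = max(now, ram_free_at)       # stall if the buffer is still busy
        ram_free_at = now + write_cycles  # the write proceeds in the background
        now += work_cycles                # other work overlaps with the write
    return now


def steady_state_cycles(work_cycles, write_cycles):
    """Per-iteration cost once the loop has settled."""
    return max(work_cycles, write_cycles)
```

With enough work per iteration the write is completely hidden, but once `work_cycles` drops below `write_cycles` the loop runs at the RAM's pace no matter how lean the code is.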

According to my calculations, the application I have in mind (a port of a bullet hell shooter) is pretty much right on the edge of the chip's capabilities. The difference between 24 and 40 cycles for a 4bpp cache flush with unset bit-pend flags could be the difference between being able to exactly duplicate the original bullet patterns and having to simplify them.

I do not want to have to simplify the patterns, because that probably means rebalancing the game, which I don't trust myself to do.

I suppose I could leave the chip in low-speed mode and overclock it, but that's cheating (good luck getting Nintendo to agree to let you do that for a commercial release), and might result in errors with the memory used in the original games...
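
To put the 24 versus 40 cycle difference in perspective, here is a back-of-the-envelope sketch. Every constant in it is an assumption for illustration: the GSU running at the NTSC master clock rate, a flat 60 frames per second, an 8-pixel row per pixel cache, and the chip doing nothing but flushing for the whole frame, which a real game could never manage. Treat the results as upper bounds, not predictions.

```python
GSU_CLOCK_HZ = 21_477_272  # assumed: high speed mode at the NTSC master clock
FRAME_RATE_HZ = 60         # assumed: roughly 60 frames per second
PIXELS_PER_FLUSH = 8       # assumed: one pixel cache holds an 8-pixel row


def flushes_per_frame(cycles_per_flush, busy_fraction=1.0):
    """Upper bound on cache flushes per frame.

    busy_fraction is the share of the frame the GSU could spend flushing
    (1.0 is the unrealistic best case).
    """
    cycles_per_frame = GSU_CLOCK_HZ / FRAME_RATE_HZ * busy_fraction
    return int(cycles_per_frame // cycles_per_flush)


for cycles in (24, 40):
    flushes = flushes_per_frame(cycles)
    print(f"{cycles} cycles per flush: at most {flushes} flushes "
          f"({flushes * PIXELS_PER_FLUSH} pixels) per frame")
```

Whatever the real constants turn out to be, the ratio stays the same: 40-cycle flushes only give 60% of the fill rate of 24-cycle flushes.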

Quote:

93143 wrote:

3) Is the instruction cache on the latest version(s) of the GSU 256 bytes or 512 bytes? I'd like to be sure.

512 bytes, all revisions, it's in the manual, fullsnes and bsnes. And you can test it yourself with $3100-$32FF. Where'd you read that it's 256 bytes?

Well, byuu has used the number a few times. I figured there had to be a reason...

Quote:

If you want exact timings, consider profiling on hardware.

I guess that would be ideal, but I don't really have the resources (or skills) to do that right now. Ultimately I may well end up running on a real GSU, but I'd rather not have to choose between doing that up front (stalling the whole project until I can get the time and resources together) and potentially getting a nasty surprise after writing a ton of code...

I suppose I could just assume higan is close enough and test it there, but byuu has complained about Super FX timing in the past and I don't know if the current GSU code is as accurate as the core system emulation...

2016-12-03

Been a while since anyone posted in this thread. I've actually been worried that I hijacked the thread and prevented the original questions from being answered... Is the opportunity still open? Did I miss the resolution?

I was going to take this opportunity to ask the forum experts a question about FROM/TO/WITH, but it turns out the answer was RTFM, so... bump, I guess.

I here reproduce the list of questions, with one of mine deleted because it's been answered. (My remaining questions are basically just a more detailed version of psycopathicteen's question, possibly too detailed for the amount of time that's passed (byuu et al. may be a better source), and honestly we already know what the answers probably are, so I don't consider them high priority...)

psycopathicteen wrote:

1) How does the SuperFX compare against the DMA at filling pixels?

Sik wrote:

1) What algorithms are used to process the vertices? Both transformation and projection.

2) What algorithm is used to raster (render) the triangles?

3) [split, trim] Related, is there any special calculation [in Starfox/the SuperFX?] to discard backfacing triangles?

4) [trim] What were the biggest bottlenecks when programming [with] the SuperFX?

whicker wrote:

1) What was the development process like?

2) [paraphrase] Did you debug on a PC or on the SNES? If on the SNES, how?

3) [trim] Did/Does the SuperFX CPU itself have any sort of debugging features?

4) [trim] Although I realize you were working on the software, do you recall any discussions about why the SuperFX boards had to start using a dedicated clock resonator circuit instead of the 21 MHz signal from the cartridge edge?

ARM9 wrote:

1) [paraphrase, trim] How did they handle interoperability between the scpu and gsu; how exactly did they split the tasks between the two processors and which one did what?

ccovell wrote:

1) What I'd love is a timeline about the whole Argonaut project.

2) I'd love any info about projects, both successful and cancelled. :-)

93143 wrote:

1) What are the absolute hardware bottlenecks on blitting (using PLOT with color #0 not written, or only PLOTting part of a pixel cache, so it has to read the old data from RAM before writing the new data back)?
1b) How many cycles does it take to empty the secondary pixel cache under those circumstances?
1c) How about transferring the primary cache to the secondary, once the secondary is free?

2) Apparently ROM access in high speed mode (21 MHz) is 5 cycles instead of 3. Is the same true of RAM access? For both reading and writing? And does this impact the answer(s) for (1)? Did this change at all between chip/board revisions?