SDD1 FGPA implementation

SDD1 FGPA implementation
by magno on 2018-07-15 (#221146)

This project began some years ago, around 2011, when I became interested in implementing the S-DD1 chip on an FPGA using Andreas Naive's documentation of the chip. I first created the core files, which decoded the 2BPP and 4BPP modes; 8BPP was buggy back then and the last mode was not implemented. I tried to contact some guys from SD2SNES to ask if they were interested in the project but never got an answer, and I gave up the project due to lack of interest and some personal issues.

But finally, some months ago I finished it all, checked that all modes worked perfectly and decided to implement the chip on a Zedboard. My final goal is to connect the board to SNES using a custom interface board and check it on the real hardware.

I uploaded a video to YouTube in case some of you are interested:

https://youtu.be/dsewU6s4Nrs

Re: SDD1 FGPA implementation
by srg320 on 2018-07-15 (#221184)

Hi.
I have also been working on an SDD1 FPGA implementation for my SNES FPGA. The decoder works, but it does not work yet when embedded in the SNES.

I'm interested, how many master cycles does it take to decode first two bytes (one row) and the rest in your implementation?

Re: SDD1 FGPA implementation
by magno on 2018-07-16 (#221237)

srg320 wrote:

I'm interested, how many master cycles does it take to decode first two bytes (one row) and the rest in your implementation?

Well, it depends on the bit depth and the context. For 2BPP and context 3, it takes 31 master cycles from the write to $4801 until the first 2 bytes are ready in the output FIFO, plus 1 cycle to get the first byte onto the DMA data bus. That breaks down as 15 cycles to fill the pipeline, 16 master cycles to complete each pixel on 2 bitplanes, and 1 cycle to read from the output FIFO.
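
The cycle budget above can be tallied in a few lines of Python. This is only a sketch of the arithmetic for the 2BPP, context 3 case; the constant and function names are illustrative and do not come from the actual design.

```python
# Cycle counts quoted above for 2BPP, context 3 (names are illustrative).
PIPELINE_FILL_CYCLES = 15  # cycles to fill the decoder pipeline
ROW_DECODE_CYCLES = 16     # one row: 16 bits across 2 bitplanes
FIFO_READ_CYCLES = 1       # moving the first byte from the FIFO to the bus


def cycles_until_fifo_ready() -> int:
    """Master cycles from the $4801 write until 2 bytes sit in the output FIFO."""
    return PIPELINE_FILL_CYCLES + ROW_DECODE_CYCLES


def cycles_until_first_byte_on_bus() -> int:
    """Master cycles from the $4801 write until the first byte is on the DMA bus."""
    return cycles_until_fifo_ready() + FIFO_READ_CYCLES


if __name__ == "__main__":
    print(cycles_until_fifo_ready())         # 31
    print(cycles_until_first_byte_on_bus())  # 32
```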

Re: SDD1 FGPA implementation
by srg320 on 2018-07-17 (#221303)

My decoder takes 16 master cycles for 16 bits of 2 bitplanes (plus 5 cycles for the first 2 bitplanes, to read the header and initialize the decoder). When I created the DMA module and the SDD1, I based them on this post.
I think the decoder should start after the write to $420B and the fetch of the next opcode (from line 2490 in the post). But then only a few cycles remain to initialize the decoder and decode the first 2 bitplanes. Also, we must not forget about the pause for DRAM refresh.
Maybe the SDD1 chip decodes 2 bitplanes in 8 master cycles.

Re: SDD1 FGPA implementation
by magno on 2018-07-17 (#221313)

srg320 wrote:

My decoder takes 16 master cycles for 16 bits of 2 bitplanes

Yes, any compliant decoder must decode 16 bits in 16 master cycles; otherwise, the DMA would "ask" for data faster than the decoder generates the output bytes.
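
To make the rate argument concrete, here is a rough Python sketch. It assumes the usual SNES DMA rate of one byte every 8 master cycles; the function names are made up for illustration, and only the sustained rate is checked, not start-up latency.

```python
from fractions import Fraction

DMA_CYCLES_PER_BYTE = 8  # assumed SNES DMA rate: one byte per 8 master cycles


def dma_bits_per_cycle() -> Fraction:
    """Bits the DMA consumes per master cycle."""
    return Fraction(8, DMA_CYCLES_PER_BYTE)


def decoder_keeps_up(bits: int, cycles: int) -> bool:
    """True if a decoder producing `bits` every `cycles` cycles sustains the DMA rate."""
    return Fraction(bits, cycles) >= dma_bits_per_cycle()


if __name__ == "__main__":
    print(decoder_keeps_up(16, 16))  # True: 16 bits in 16 cycles matches the DMA
    print(decoder_keeps_up(16, 20))  # False: the DMA would outrun the decoder
```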

srg320 wrote:

(plus 5 cycles for the first 2 bitplanes, to read the header and initialize the decoder).

That sounds fast! Do you use an input FIFO for decoding? Andreas Naive wondered whether it was necessary, and it looks like it is for high decoding ratios (up to 128:1).

srg320 wrote:

I think the decoder should start after the write to $420B and the fetch of the next opcode (from line 2490 in the post). But then only a few cycles remain to initialize the decoder and decode the first 2 bitplanes.

I'm pretty sure decoding starts after writing to $4801; I take that point as t = 0 (start of time) to measure master cycles. If I measure time from the write to $420B, my decoder is as fast as 4 master cycles, but that's tricky, because after triggering DMA (i.e., after writing to $420B) data must be present on the data bus at most 7 cycles later so the CPU latches the data in the 8th cycle.

srg320 wrote:

Also, we must not forget about the pause for DRAM refresh.

DRAM refresh doesn't affect the S-DD1 at all; in fact, that cartridge connector pin is not even routed to the chip (neither in Star Ocean nor in SFA2). When DRAM refresh occurs, the CPU is halted, so the /RD and /WR strobes are driven to '1'. The S-DD1 must obey those strobes to present decoded data to the DMA.

srg320 wrote:

Maybe the SDD1 chip decodes 2 bitplanes in 8 master cycles.

I thought a lot about this; at first, I also thought the chip might have some kind of parallelism, but if you check the decoding algorithm carefully, context information is only available after decoding the current pixel, so it's impossible to decode the (n+1)th bit until the nth bit has been decoded to create the context.
So, if the S-DD1 is designed to be fully synchronous, 1 pixel is decoded each clock cycle. If the S-DD1 is only partially synchronous, there is a high risk of combinational loops, so latches must be instantiated.
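
The serial dependency can be seen in a toy model. The sketch below is NOT the S-DD1 algorithm (no Golomb codes, no real context model); it is a hypothetical context-adaptive bit predictor, written only to show that the context for each bit is built from bits already decoded, which is what rules out decoding two bits in parallel.

```python
def toy_encode(bits, order=2):
    """Turn bits into prediction residuals using the previous `order` bits as context."""
    prediction = {}  # context -> last bit observed in that context
    residuals = []
    for i, bit in enumerate(bits):
        ctx = tuple(bits[max(0, i - order):i])
        guess = prediction.get(ctx, 0)
        residuals.append(guess ^ bit)
        prediction[ctx] = bit
    return residuals


def toy_decode(residuals, order=2):
    """Invert toy_encode. Each step needs the bits decoded in earlier steps."""
    prediction = {}
    out = []
    for r in residuals:
        ctx = tuple(out[-order:])  # unavailable until the previous bits exist
        guess = prediction.get(ctx, 0)
        bit = guess ^ r
        prediction[ctx] = bit
        out.append(bit)
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
    print(toy_decode(toy_encode(data)) == data)  # True
```

Because `ctx` in `toy_decode` is read from `out`, iteration n+1 cannot begin until iteration n has appended its bit.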

Re: SDD1 FGPA implementation
by srg320 on 2018-07-17 (#221315)

magno wrote:

srg320 wrote:

(plus 5 cycles for the first 2 bitplanes, to read the header and initialize the decoder).

That sounds fast! Do you use an input FIFO for decoding? Andreas Naive wondered whether it was necessary, and it looks like it is for high decoding ratios (up to 128:1).

I mean 5 cycles for decoder init and header + 16 cycles for 2 bitplanes = 21.
I do not use a FIFO; I wait for the end of the DMA RD signal (every second one) and then start decoding the next 2 bitplanes.

magno wrote:

I'm pretty sure decoding starts after writing to $4801

OK, what will happen if 43x2-43x4 is written after writing $4801, or if more than one channel is selected for the SDD1? And how will the SCPU fetch the next opcode after writing to $4801 if the SDD1 is using the ROM bus? Or does the SDD1 read ROM between SCPU accesses, when RD and WR are high?

Re: SDD1 FGPA implementation
by magno on 2018-07-17 (#221317)

srg320 wrote:

I do not use a FIFO; I wait for the end of the DMA RD signal (every second one) and then start decoding the next 2 bitplanes.

How would you fetch data from ROM if you needed to decode a 7th-order Golomb code for each output pixel? You'd need 1 input byte per output pixel (i.e., per master cycle); to achieve that, your ROM would need an access time of 47 ns or better. What if you had to maintain this input rate because the context is switching with every output pixel?
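
For reference, the 47 ns figure is just the master clock period. A quick sketch, assuming the NTSC master clock of about 21.477 MHz (the function names are illustrative):

```python
MASTER_CLOCK_HZ = 21_477_272  # NTSC SNES master clock, assumed here


def master_cycle_ns() -> float:
    """Length of one master cycle in nanoseconds."""
    return 1e9 / MASTER_CLOCK_HZ


def max_rom_access_ns(bytes_per_master_cycle: float) -> float:
    """Longest ROM access time that still delivers the requested input rate."""
    return master_cycle_ns() / bytes_per_master_cycle


if __name__ == "__main__":
    # Worst case described above: one input byte for every master cycle.
    print(round(max_rom_access_ns(1.0), 2))  # 46.56
```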

srg320 wrote:

magno wrote:

I'm pretty sure decoding starts after writing to $4801

OK, what will happen if 43x2-43x4 is written after writing $4801, or if more than one channel is selected for the SDD1? And how will the SCPU fetch the next opcode after writing to $4801 if the SDD1 is using the ROM bus? Or does the SDD1 read ROM between SCPU accesses, when RD and WR are high?

These are good questions, I should check in the real hardware, but I haven't had free time to mount the components on my interface board (between zedboard and SNES). My guess is:

srg320 wrote:

what will happen if 43x2-43x4 is written after writing $4801

You shouldn't do that; in fact, neither SO nor SFA2 does. But if you did, the S-DD1 would start decoding from the last source address it had sniffed from the SNES data bus, I guess.

srg320 wrote:

if more than one channel is selected for the SDD1

That's not a problem: you select which DMA channel to sniff by writing to $4800 and which channel to decode by writing to $4801. If you trigger a decompression from a channel other than the one you sniffed, nothing happens, i.e., the DMA is filled with the same byte on each beat. I checked this on emulators, so maybe it is not accurate.

srg320 wrote:

how will the SCPU fetch the next opcode after writing to $4801 if the SDD1 is using the ROM bus?

The SCPU is much slower than the master clock (its cycles are 6 or 8 master cycles long), so it is easy to time-multiplex accesses from the SCPU and the S-DD1 decompression core. But you need an input FIFO for the data that feeds the decompression core.

srg320 wrote:

does the SDD1 read ROM between SCPU accesses, when RD and WR are high?

That can happen only during DMA: the DMA engine stalls the SNES CPU while DMA is in progress; the CPU resumes after all bytes are transferred. In any other case, the S-DD1 decompression core doesn't need to access ROM, since no decompression is running.
The only situation where a collision occurs is between the write to $4801 and the write to $420B, because both the decompression core and the SCPU need data.
Star Ocean has some padding instructions between them (PLA - PHA) to delay the start of DMA so the SDD1 has enough time to read the first words from ROM and begin decoding.

Re: SDD1 FGPA implementation
by srg320 on 2018-07-17 (#221320)

magno wrote:

srg320 wrote:

what will happen if 43x2-43x4 is written after writing $4801

You shouldn't do that; in fact, neither SO nor SFA2 does. But if you did, the S-DD1 would start decoding from the last source address it had sniffed from the SNES data bus, I guess.

srg320 wrote:

if more than one channel is selected for the SDD1

That's not a problem: you select which DMA channel to sniff by writing to $4800 and which channel to decode by writing to $4801. If you trigger a decompression from a channel other than the one you sniffed, nothing happens, i.e., the DMA is filled with the same byte on each beat. I checked this on emulators, so maybe it is not accurate.

srg320 wrote:

how will the SCPU fetch the next opcode after writing to $4801 if the SDD1 is using the ROM bus?

The SCPU is much slower than the master clock (its cycles are 6 or 8 master cycles long), so it is easy to time-multiplex accesses from the SCPU and the S-DD1 decompression core. But you need an input FIFO for the data that feeds the decompression core.

The only situation where a collision occurs is between the write to $4801 and the write to $420B, because both the decompression core and the SCPU need data.
Star Ocean has some padding instructions between them (PLA - PHA) to delay the start of DMA so the SDD1 has enough time to read the first words from ROM and begin decoding.

Thanks, you helped me a lot.

Re: SDD1 FGPA implementation
by marvelus10 on 2018-07-18 (#221475)

Have you guys joined the Classic Gaming Discord? There is quite a lot of discussion on new SD2SNES projects there.

Re: SDD1 FGPA implementation
by magno on 2018-07-19 (#221498)

marvelus10 wrote:

Have you guys joined the Classic Gaming Discord? There is quite a lot of discussion on new SD2SNES projects there.

Yes, sure!

Re: SDD1 FGPA implementation
by Markfrizb on 2018-07-19 (#221583)

Forgive me for asking, but why not tackle the SFA2 decompression similar to what was done with Star Ocean? Then a "standard" cart could be used? Or am I missing the point?

Re: SDD1 FGPA implementation
by magno on 2018-07-20 (#221593)

Markfrizb wrote:

Forgive me for asking, but why not tackle the SFA2 decompression similar to what was done with Star Ocean? Then a "standard" cart could be used? Or am I missing the point?

Well, it's more exciting to replicate the chip than to decompress the graphics XD It could be done, of course, but the task is more tedious and less creative. Moreover, implementing the chip could be useful for the scene to create hacks that use it.

Re: SDD1 FGPA implementation
by 93143 on 2018-07-20 (#221599)

magno wrote:

Moreover, implementing the chip could be useful for the scene to create hacks that use it.

At least two large potential S-DD1 projects have been talked about here, although they are just speculation at this point (and although they aren't hacks, they do have copyright issues):

- a port of Metal Slug, as close to arcade-perfect as possible with period hardware
- a re-port of Street Fighter Alpha 2, employing advanced techniques to fix the music and sound, loading pauses, and cut-down graphics and animation

Both of these projects should fit comfortably in the S-DD1's available ROM space (which I believe is 16 MB addressable plus 3.875 MB in parallel) with the graphics compressed. Neither one is especially likely to fit in an ordinary cartridge, particularly since software decompression would take too much S-CPU power to be feasible, and the SA-1 imposes an 8 MB limit. In both cases, using the MSU1 would defeat the purpose of the project.

It could be argued that neither of these projects is likely to happen, but it's nice to know that they could.

Re: SDD1 FGPA implementation
by Markfrizb on 2018-07-20 (#221648)

Quote:

- a re-port of Street Fighter Alpha 2, employing advanced techniques to fix the music and sound, loading pauses, and cut-down graphics and animation

So the game has issues already.... humph, wasn't aware of that.

Quote:

Well, it's more exciting to replicate the chip than to decompress the graphics XD It could be done, of course, but the task is more tedious and less creative. Moreover, implementing the chip could be useful for the scene to create hacks that use it.

But wouldn't that require a few options --- either A, someone would need to buy the FPGA SDD1 PCB (presumably a dev board), or B, wouldn't this lead to more cart destruction because some potential future games/hacks would need the SDD1? Can the SD2Snes run the SDD1 games?

Star Ocean, since it's been decompressed, can play on an OEM-style cart. https://youtu.be/_c2OoGkPA4o (video has no sound)
Truthfully, I'd like to see it decompressed, but I do understand the drive to replicate the SDD1 chip also. If I understood how FPGAs worked, I'd probably do the same thing.

Re: SDD1 FGPA implementation
by 93143 on 2018-07-20 (#221654)

Markfrizb wrote:

So the game has issues already.... humph, wasn't aware of that.

It's a port of a CPS2 game, so it was never going to be perfect.

It's just that I and others feel that it was probably possible to do better. Maybe it would have been unreasonable to expect better under reasonable time and budget constraints, or for an affordable price. Maybe it was a corporate afterthought or a contractual obligation and didn't get a reasonable schedule or budget. Maybe the RAM-limited nature of the next-gen consoles made devs wary of pushing the SNES too hard for fear of making the PlayStation look bad. Maybe the programmers were just lazy or incompetent. Or maybe it really is as good as it can get on the hardware - but I doubt it.

The graphics seem to be smaller than they need to be, the screen is letterboxed, and the animations are missing frames. Preliminary calculations suggest that it may be possible to remedy all of these things with a sufficiently advanced animation engine and more ROM.

The vocals are muddy, the music is terrible, and the game has loading pauses where everything freezes. These things are intimately related, and I think it's possible to fix all of them at once with a high-bandwidth HDMA streaming scheme. Even if I'm wrong, there are multiple examples of games that handle on-the-fly ARAM loading better than this one.

The game also has some slowdown, and I don't really see why it should.

Re: SDD1 FGPA implementation
by creaothceann on 2018-07-21 (#221666)

93143 wrote:

The graphics seem to be smaller than they need to be, the screen is letterboxed, and the animations are missing frames. [...] The game also has some slowdown, and I don't really see why it should.

AI?

Re: SDD1 FGPA implementation
by Señor Ventura on 2018-07-21 (#221678)

93143 wrote:

Markfrizb wrote:

So the game has issues already.... humph, wasn't aware of that.

It's a port of a CPS2 game, so it was never going to be perfect.

It's just that I and others feel that it was probably possible to do better. Maybe it would have been unreasonable to expect better under reasonable time and budget constraints, or for an affordable price. Maybe it was a corporate afterthought or a contractual obligation and didn't get a reasonable schedule or budget. Maybe the RAM-limited nature of the next-gen consoles made devs wary of pushing the SNES too hard for fear of making the PlayStation look bad. Maybe the programmers were just lazy or incompetent. Or maybe it really is as good as it can get on the hardware - but I doubt it.

The graphics seem to be smaller than they need to be, the screen is letterboxed, and the animations are missing frames. Preliminary calculations suggest that it may be possible to remedy all of these things with a sufficiently advanced animation engine and more ROM.

The vocals are muddy, the music is terrible, and the game has loading pauses where everything freezes. These things are intimately related, and I think it's possible to fix all of them at once with a high-bandwidth HDMA streaming scheme. Even if I'm wrong, there are multiple examples of games that handle on-the-fly ARAM loading better than this one.

The game also has some slowdown, and I don't really see why it should.

Maybe one day it can be fixed, just as is happening today with Final Fight.

Take a look at this; its author deserves a lot of credit:
http://www.baddesthacks.net/forums/view ... &start=260

I'm sorry for the intrusion, magno ^^u

Re: SDD1 FGPA implementation
by 93143 on 2018-07-21 (#221697)

creaothceann wrote:

AI?

Is that typically a large chunk of the frame in a Super NES fighting game? I imagine it uses a bytecode scripting system, and it may be possible to substantially optimize it (maybe even hardcode it if all else fails), but I'm not an expert...

Markfrizb wrote:

But wouldn't that require a few options --- either A, someone would need to buy the FPGA SDD1 PCB (presumably a dev board), or B, wouldn't this lead to more cart destruction because some potential future games/hacks would need the SDD1?

You can get what looks like an original copy of Star Ocean for less than $20, although it varies a fair bit. SFA2 seems to be in the $40 range. (Weird, considering it's nowhere near as highly regarded and was sold outside Japan. I'm guessing it's because Star Ocean has been extracted and can be cloned without an S-DD1, while SFA2 hasn't.)

Can an FPGA beat those prices?

Quote:

Can the SD2Snes run the SDD1 games?

They're still on the incompatibility list. The Super FX games aren't, which means the list was updated very recently. But if this project works properly on real hardware, it should be possible to implement it, right? That should reduce demand for repros.

Re: SDD1 FGPA implementation
by Fisher on 2018-07-25 (#221974)

A thing that I've found interesting is that there's an early prototype that didn't use the SDD1.
Seems to be an early version, but maybe the chip can be just left out by using a larger ROM.

Re: SDD1 FGPA implementation
by magno on 2018-07-30 (#222333)

Fisher wrote:

A thing that I've found interesting is that there's an early prototype that didn't use the SDD1.
Seems to be an early version, but maybe the chip can be just left out by using a larger ROM.

Yes, that can be done; Neviksti did it for Star Ocean and I myself did it for Star Ocean too, resulting in a 65.5 Megabit ROM. My way of doing it was dumping each and every S-DD1 compressed chunk and then re-inserting each one as decompressed data. That is tedious, but you get the minimum ROM size.

Re: SDD1 FGPA implementation
by 93143 on 2018-07-30 (#222334)

magno wrote:

I myself did it for Star Ocean too, resulting in a 65.5 Megabit ROM.

...wait, what?

That substantially alters my estimate of the potential compression ratio available. Both SFA2 and Metal Slug should still fit without cuts, but it's a lot closer than it was. On the other hand, Star Ocean isn't really very much like either of those games...

What kind of average compression ratio do you see with graphics?

Re: SDD1 FGPA implementation
by srg320 on 2018-07-30 (#222335)

magno wrote:

I myself did it for Star Ocean too

Does Star Ocean have 8BPP mode data?

Re: SDD1 FGPA implementation
by magno on 2018-07-30 (#222345)

93143 wrote:

magno wrote:

I myself did it for Star Ocean too, resulting in a 65.5 Megabit ROM.

...wait, what?

Is 65.5 Megabit bigger or smaller than you expected?

93143 wrote:

What kind of average compression ratio do you see with graphics?

I could make calculations about it if you are interested.

Most of the S-DD1 chunks are 8x16 4BPP sprite tiles, i.e., 64 bytes per chunk, which are compressed down to 40~50 bytes per chunk.
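
The 64-byte figure and the resulting per-chunk ratios follow directly from the tile geometry. A small sketch (the helper name is illustrative):

```python
def tile_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of a planar tile."""
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    raw = tile_bytes(8, 16, 4)
    print(raw)  # 64
    # Compressed sizes quoted above: roughly 40 to 50 bytes per chunk.
    for compressed in (40, 50):
        print(compressed, compressed / raw)  # ratios 0.625 and 0.78125
```

So a typical sprite chunk shrinks to somewhere between about 62% and 78% of its original size.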

srg320 wrote:

Does Star Ocean have 8BPP mode data?

I can't remember any 8BPP modes.

Please let me check when I have some free time, and I will be able to provide reliable information about it.

Re: SDD1 FGPA implementation
by tepples on 2018-07-30 (#222347)

magno wrote:

93143 wrote:

magno wrote:

I myself did it for Star Ocean too, resulting in a 65.5 Megabit ROM.

...wait, what?

Is 65.5 Megabit bigger or smaller than you expected?

To me, the most unexpected part was that it's so close to a power of 2 bits: 67.1 megabits, which is 64 mebibits or 64 of what the game industry called "megabits" in the mask ROM era. 65.5 megabits sounds close to 64*1000*1024 bits, which has the same mixed-up logic as the old "1.44 MB" floppy disk, which held 1.44*1000*1024 bytes using the most common MFM sector format (2 sides, 80 tracks per side, 18 sectors per track, 512 bytes per sector).
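
The unit arithmetic is easy to check. A minimal sketch; the constant names are illustrative, and the floppy geometry assumes the standard two-sided format:

```python
MEBIBIT = 1024 * 1024  # the "megabit" of the mask ROM era
MEGABIT = 1000 * 1000  # SI megabit


def floppy_bytes(sides=2, tracks=80, sectors=18, sector_bytes=512) -> int:
    """Capacity of a "1.44 MB" floppy in the common MFM format."""
    return sides * tracks * sectors * sector_bytes


if __name__ == "__main__":
    print(64 * MEBIBIT / MEGABIT)         # 67.108864
    print(64 * 1000 * 1024 / MEGABIT)     # 65.536
    print(floppy_bytes())                 # 1474560
    print(floppy_bytes() == 1440 * 1024)  # True
```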

And I too am curious about typical compression rates over the S-DD1 corpus. If you have time, it'd be interesting to read what causes a particular kind of graphic to be compressed more or less than average, or comparison with the rate performance of the SPC7110 (another context-adaptive arithmetic decoder). At least this might help inform design of pure-software tile codecs for Game Boy and Super NES.

Re: SDD1 FGPA implementation
by magno on 2018-07-30 (#222354)

tepples wrote:

65.5 megabits sounds close to 64*1000*1024 bits

In fact, my Star Ocean version is 65.5 * 1024 * 1024 bits. It spans 132 LoROM banks (32768 bytes each) and 64 HiROM banks (65536 bytes each).
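
As a rough check of the bank arithmetic, the sketch below adds up the bank counts as quoted. With those counts the total comes to 65 * 1024 * 1024 bits, a little under the 65.5 figure, so one of the numbers is presumably rounded or approximate. The helper name is illustrative.

```python
LOROM_BANK_BYTES = 32 * 1024  # 32768 bytes per LoROM bank
HIROM_BANK_BYTES = 64 * 1024  # 65536 bytes per HiROM bank
MEBIBIT = 1024 * 1024


def rom_size_bits(lorom_banks: int, hirom_banks: int) -> int:
    """Total ROM size in bits for the given bank counts."""
    total_bytes = lorom_banks * LOROM_BANK_BYTES + hirom_banks * HIROM_BANK_BYTES
    return 8 * total_bytes


if __name__ == "__main__":
    bits = rom_size_bits(132, 64)
    print(bits)            # 68157440
    print(bits / MEBIBIT)  # 65.0
```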

tepples wrote:

And I too am curious about typical compression rates over the S-DD1 corpus. If you have time, it'd be interesting to read what causes a particular kind of graphic to be compressed more or less than average, or comparison with the rate performance of the SPC7110 (another context-adaptive arithmetic decoder). At least this might help inform design of pure-software tile codecs for Game Boy and Super NES.

That sounds interesting, although I still don't know the SPC7110's internals.

93143 wrote:

magno wrote:

I myself did it for Star Ocean too, resulting in a 65.5 Megabit ROM.

...wait, what?

That substantially alters my estimate of the potential compression ratio available. Both SFA2 and Metal Slug should still fit without cuts, but it's a lot closer than it was. On the other hand, Star Ocean isn't really very much like either of those games...

What kind of average compression ratio do you see with graphics?

I checked; the Japanese version of Star Ocean has 4395668 uncompressed bytes, which are compressed down to 2180646 bytes, pretty close to 50%. That includes all graphics chunks.
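
For anyone who wants the exact ratio, here is the calculation from the two byte counts above (a trivial sketch; the names are illustrative):

```python
UNCOMPRESSED_BYTES = 4_395_668
COMPRESSED_BYTES = 2_180_646


def compression_ratio(compressed: int, uncompressed: int) -> float:
    """Compressed size as a fraction of the uncompressed size."""
    return compressed / uncompressed


if __name__ == "__main__":
    ratio = compression_ratio(COMPRESSED_BYTES, UNCOMPRESSED_BYTES)
    print(f"{ratio:.1%}")                         # 49.6%
    print(UNCOMPRESSED_BYTES - COMPRESSED_BYTES)  # 2215022 bytes saved
```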

Re: SDD1 FGPA implementation
by 93143 on 2018-07-30 (#222366)

magno wrote:

Is 65.5 Megabit bigger or smaller than you expected?

Smaller. The neviksti hack is 12 MB, so I assumed that was the uncompressed size of the game.

magno wrote:

I checked; the Japanese version of Star Ocean has 4395668 uncompressed bytes, which are compressed down to 2180646 bytes, pretty close to 50%. That includes all graphics chunks.

Okay, that sounds better. Scared me for a moment.

Metal Slug seems to have 16 MB of C-ROM, based on the MAME manifest (though the MAME ROMs add up to more than 193 Mbit, so some of them are probably partly empty). Could be 14-15 MB rescaled to SNES resolution, since the graphics probably won't reduce by the full 256/304 ratio because of tiling. If that compresses down to about 7-8 MB, and audio can be squeezed down quite a bit because the SNES can re-pitch samples, it may be possible to do the game in about 9-10 MB, maybe with 8 MB behind the S-DD1 and 2 MB in parallel. At worst I imagine 12 MB would be more than adequate, and by the time the S-DD1 came out the N64 was on the cusp of hosting games that big.

SFA2 has 20 MB of graphics and 4 MB of audio data. The former could easily be well below 16 MB when rescaled to SNES resolution, and since a lot of it is cartoony with a bunch of flat colour it might compress better than Star Ocean or Metal Slug. Even if it doesn't, 7-8 MB is still a plausible estimate. The audio should compress to about 2 MB with BRR, and we're left with a very similar scenario as with Metal Slug.

So they should both work about as well as I had thought. Thanks for the estimate.

Re: SDD1 FGPA implementation
by magno on 2018-07-30 (#222394)

93143 wrote:

Smaller. The neviksti hack is 12 MB, so I assumed that was the uncompressed size of the game.

Neviksti's version is 12 MByte, but most of the ROM space is not filled with data. He left the first 48 Mbit (the original ROM) untouched, then used 3 banks to make a big lookup table pairing each compressed address with its new uncompressed location in the expanded ROM. Then he added all of Dejap's GFX packs in the rest of the ROM.
The fact is that the original 48 Mbit can be exploited, because part of it is S-DD1 compressed data (about 18 Mbit), and hundreds of Dejap's GFX chunks are not valid, so they shouldn't be included in the expanded ROM.
If you do the calculations, the original game has about 35 megabits of uncompressed data; 18 megabits of that uncompressed data can be placed in the original ROM, and the rest of it in the expanded space: 48 Mbit (original) + 17 Mbit (expanded) = 65 Mbit.
You can check this in the ROM layout document I uploaded to RHDN.

93143 wrote:

Metal Slug seems to have 16 MB of C-ROM, based on the MAME manifest (though the MAME ROMs add up to more than 193 Mbit, so some of them are probably partly empty). Could be 14-15 MB rescaled to SNES resolution, since the graphics probably won't reduce by the full 256/304 ratio because of tiling. If that compresses down to about 7-8 MB, and audio can be squeezed down quite a bit because the SNES can re-pitch samples, it may be possible to do the game in about 9-10 MB, maybe with 8 MB behind the S-DD1 and 2 MB in parallel. At worst I imagine 12 MB would be more than adequate, and by the time the S-DD1 came out the N64 was on the cusp of hosting games that big.

SFA2 has 20 MB of graphics and 4 MB of audio data. The former could easily be well below 16 MB when rescaled to SNES resolution, and since a lot of it is cartoony with a bunch of flat colour it might compress better than Star Ocean or Metal Slug. Even if it doesn't, 7-8 MB is still a plausible estimate. The audio should compress to about 2 MB with BRR, and we're left with a very similar scenario as with Metal Slug.

So they should both work about as well as I had thought. Thanks for the estimate.

That Metal Slug sounds exciting!!

Re: SDD1 FGPA implementation
by 93143 on 2018-07-31 (#222466)

magno wrote:

That Metal Slug sounds exciting!!

It would be more exciting if there was any realistic prospect of it happening any time soon. Everybody who might contribute, including me, is busy doing other stuff.

But it's certainly fun to think about and talk about. And there is one member here who arrived harbouring the intention to port Metal Slug to the SNES, and has given it a lot of thought, even if he now feels it would be too big a task for him alone. A few other members are known to have been working on stuff that could be applicable to such a port, such as predictive animation, dynamic sprite VRAM allocation, and HDMA audio data streaming. Not to mention an FPGA implementation of the S-DD1... I can't rule out the possibility that this project may get off the ground someday.

Re: SDD1 FGPA implementation
by srg320 on 2018-07-31 (#222471)

So, with magno’s help I finally got the SDD1 running on my FPGA SNES. I'm not sure whether it will work on real hardware. It is also untested in 8BPP modes. So far I have achieved 3 master cycles for ROM access. Also, I don't know how the SDD1 should work if more than one DMA channel is selected for it (is the header common or different?).

sorry for the quality
https://youtu.be/FWvs6r8bg44
https://youtu.be/muA43DjAKS4

Re: SDD1 FGPA implementation
by magno on 2018-08-01 (#222474)

srg320 wrote:

So, with magno’s help I finally got the SDD1 running on my FPGA SNES. I'm not sure whether it will work on real hardware. It is also untested in 8BPP modes. So far I have achieved 3 master cycles for ROM access. Also, I don't know how the SDD1 should work if more than one DMA channel is selected for it (is the header common or different?).

sorry for the quality
https://youtu.be/FWvs6r8bg44
https://youtu.be/muA43DjAKS4

How did you record those videos? You say you don't know if your implementation will work on hardware, so... on which hardware did you run SFA2 and SO?
As for ROM access, 3 master cycles is the minimum for FastROM, since that means a 139 ns access time. FastROM is slower than 120 ns, so 3 master cycles is the proper access time for the SDD1 core.

EDIT:
OK, taking a look at the source code you provided, that implementation CAN'T work, either on real hardware or on any other platform. The design lacks 3 signals needed to manage the ROM output, and what's more, there is no SRAM_CS at all, so Star Ocean wouldn't work: if there is no SRAM present in the cartridge, the audio engine stalls and the game freezes.
I'm still wondering how those videos were recorded...

Re: SDD1 FGPA implementation
by srg320 on 2018-08-01 (#222476)

magno wrote:

don't know if your implementation will work on hardware, so... on which hardware did you run SFA2 and SO?

I run it on my FPGA SNES, my own board with a 5" 800x480 RGB LCD. I've never seen a real console.

magno wrote:

As for ROM access, 3 master cycles is the minimum for FastROM, since that means a 139 ns access time. FastROM is slower than 120 ns, so 3 master cycles is the proper access time for the SDD1 core.

For ROM I use two 4Mx8 55 ns SRAMs, so it works in 2 cycles.
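
Both figures (3 cycles for a 120 ns ROM, 2 cycles for a 55 ns SRAM) fall out of rounding the access time up to whole master cycles. A sketch, again assuming the NTSC master clock and ignoring any setup or routing margin a real design would need; the function names are illustrative:

```python
import math

MASTER_CLOCK_HZ = 21_477_272  # NTSC SNES master clock, assumed here


def cycles_for_access(access_ns: float) -> int:
    """Smallest whole number of master cycles that covers the access time."""
    return math.ceil(access_ns * MASTER_CLOCK_HZ / 1e9)


def window_ns(cycles: int) -> float:
    """Duration of `cycles` master cycles in nanoseconds."""
    return cycles * 1e9 / MASTER_CLOCK_HZ


if __name__ == "__main__":
    print(cycles_for_access(120))  # 3 (a 120 ns FastROM part)
    print(cycles_for_access(55))   # 2 (a 55 ns SRAM)
    print(int(window_ns(3)))       # 139, the figure quoted for 3 cycles
```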

Re: SDD1 FGPA implementation
by magno on 2018-08-01 (#222477)

srg320 wrote:

I run it on my FPGA SNES, my own board with a 5" 800x480 RGB LCD. I've never seen a real console.

Oh, do you mean SuperNT maybe? Or did you implement full SNES in an FPGA? That would be great then!

srg320 wrote:

For ROM I use two 4Mx8 55 ns SRAMs, so it works in 2 cycles.

But you don't enable the ROM from your SDD1 implementation: there is no ROM_CS or ROM_OE, which are necessary even if you use SRAM as ROM...
And anyway, you need SRAM for Star Ocean. Some variables used by the audio engine are stored in SRAM, so if no SRAM is present, the game hangs. In your S-DD1 implementation, there isn't any SRAM_CS, so... how do you enable SRAM? If SRAM is not enabled, then those variables can't be read by the audio engine and Star Ocean wouldn't work...

Re: SDD1 FGPA implementation
by srg320 on 2018-08-01 (#222478)

magno wrote:

Or did you implement full SNES in an FPGA? That would be great then!

Yes.

magno wrote:

But you don't enable the ROM from your SDD1 implementation: there is no ROM_CS or ROM_OE, which are necessary even if you use SRAM as ROM...
And anyway, you need SRAM for Star Ocean. Some variables used by the audio engine are stored in SRAM, so if no SRAM is present, the game hangs. In your S-DD1 implementation, there isn't any SRAM_CS, so... how do you enable SRAM? If SRAM is not enabled, then those variables can't be read by the audio engine and Star Ocean wouldn't work...

Of course SRAM is present: 8 Mbyte, split into 6 Mbyte for ROM and 2 Mbyte (8 Kbyte used) for backup SRAM. When I'm home, I'll post its source.

Re: SDD1 FGPA implementation
by magno on 2018-08-01 (#222479)

srg320 wrote:

magno wrote:

Or did you implement full SNES in an FPGA? That would be great then!

Yes.

Great!! It would be nice if there were some open-source projects for SNES FPGA implementation. They could be used for homebrew, getting the most out of the SNES, adding features...

srg320 wrote:

Of course SRAM is present: 8 Mbyte, split into 6 Mbyte for ROM and 2 Mbyte (8 Kbyte used) for backup SRAM. When I'm home, I'll post its source.

Maybe there is something I missed or you don't understand what I mean

S-DD1 chip is the address decoder for ROM and backup SRAM in the real cartridge, and so it should be in your FPGA design (there can't be 2 address decoders driving the same signals), so anytime CPU accesses ROM or backup SRAM, S-DD1 decodes the address and then reads ROM or backup SRAM.
In your SDD1 implementation, there isn't any output signal for enabling ROM access, even if you implement ROM in SRAM chips, so if SNES reads from $C0:8001... how do you enable ROM to read from it? There isn't such signal in your design.
As for backup SRAM, the same happens... But in this case, if CPU reads from backup SRAM to get some data needed by audio engine, your S-DD1 implementation won't decode the backup SRAM address, so the CPU won't get the proper data. Then, the audio engine will stall, and Star Ocean will freeze. In your videos, Star Ocean runs fine, so there is something weird on that...

Re: SDD1 FGPA implementation
by srg320 on 2018-08-01 (#222482)

magno wrote:

Great!! It would be nice if there were some open-source projects for SNES FPGA implementation. They could be used for homebrew, getting the most out of the SNES, adding features...

In the future I want to build something like a hardware debugger over USB. The schematic of my board is attached.

magno wrote:

Maybe there is something I missed, or maybe you don't understand what I mean.

The S-DD1 chip is the address decoder for ROM and backup SRAM in the real cartridge, and so it should be in your FPGA design (there can't be two address decoders driving the same signals). Any time the CPU accesses ROM or backup SRAM, the S-DD1 decodes the address and then reads ROM or backup SRAM.
In your SDD1 implementation there isn't any output signal for enabling ROM access, even if you implement ROM in SRAM chips. So if the SNES reads from $C0:8001... how do you enable the ROM to read from it? There is no such signal in your design.
The same happens for backup SRAM... But in this case, if the CPU reads backup SRAM to get data needed by the audio engine, your S-DD1 implementation won't decode the backup SRAM address, so the CPU won't get the proper data. Then the audio engine will stall and Star Ocean will freeze. In your videos Star Ocean runs fine, so there is something weird about that...

The xxxmap.vhdl files implement the cartridge interface.

Re: SDD1 FGPA implementation
by magno on 2018-08-01 (#222526)

srg320 wrote:

The xxxmap.vhdl files implement the cartridge interface.

Ahh, ok, ok. You see? I missed something! :mrgreen:

In the first files you shared, there wasn't any interface for controlling ROM or SRAM, so that implementation could never work, whatever the implementation of the SNES (real hardware or FPGA). I see that your mapper is the top-level wrapper for the SDD1 core; now it makes more sense.
How did you finally resolve the SNES CPU / SDD1 core collision after writing to $4801?

Re: SDD1 FGPA implementation
by srg320 on 2018-08-02 (#222532)

magno wrote:

How did you finally resolve the SNES CPU / SDD1 core collision after writing to $4801?

The SDD1 accesses ROM when both CPURD and CPUWR are high, which always lasts 3 master cycles. Thus, one CPU cycle allows one ROM read for the SDD1. Between the write to $4801 and the start of DMA there are 9-10 CPU cycles. I use the first 6 to load compressed data: 1 for the header and current data, 1 for the next data, and 4 for the FIFO.
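
The slot accounting in this post can be sketched as follows. This is only an illustration of the arithmetic: the function and variable names are hypothetical, and the rule "one ROM read per CPU cycle while CPURD and CPUWR are both high" is taken from the post, not from measured hardware behaviour.

```python
# Hypothetical sketch of the prefetch budget described in the post above.

def sdd1_may_read_rom(cpu_rd: int, cpu_wr: int) -> bool:
    """The bus is treated as free for the SDD1 when both strobes are high."""
    return cpu_rd == 1 and cpu_wr == 1


# ROM reads the decoder wants to complete before DMA starts (from the post):
# 1 for header + current data, 1 for the next data, 4 to fill the FIFO.
PREFETCH_PLAN = ["header + current data", "next data"] + ["fifo"] * 4


def spare_slots(cpu_cycles_before_dma: int) -> int:
    """Assuming one ROM read per CPU cycle, return the unused slots."""
    return cpu_cycles_before_dma - len(PREFETCH_PLAN)


if __name__ == "__main__":
    for cycles in (9, 10):
        print(cycles, "CPU cycles before DMA ->", spare_slots(cycles), "spare")
```

With the 9-10 CPU cycles quoted in the post, this leaves a margin of a few slots, which only holds if every CPU cycle really does offer a usable read window.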

If I say something wrong, please correct me.

Re: SDD1 FGPA implementation
by magno on 2018-08-02 (#222533)

srg320 wrote:

Thus, one CPU cycle allows one ROM read for the SDD1.

Nope, a CPU cycle is 6 master clocks in FastROM, or 8 master clocks in SlowROM (not the case here). The read cycle is just a portion of one CPU cycle.

srg320 wrote:

Between the write to $4801 and the start of DMA there are 9-10 CPU cycles. I use the first 6 to load compressed data: 1 for the header and current data, 1 for the next data, and 4 for the FIFO.

Nope, after writing to $4801 there are several CPU cycles before DMA starts. In Star Ocean, the sequence is:

Code:

1) STA $4801 -> 4 CPU cycles
2) PHA -> 3 CPU cycles
3) PLA -> 4 CPU cycles
4) STA $420B -> 4 CPU cycles
5) DMA triggers now after some master cycles

In steps 2 to 4, there could be a collision when the CPU fetches an instruction while the SDD1 reads compressed data.
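
The cycle counts in the listing above can be added up with a short sketch. The per-instruction counts are copied from the post; the helper names are hypothetical, and converting to master clocks at 6 clocks per cycle is a simplification that only gives a lower bound, since stack accesses to WRAM take 8 master clocks.

```python
# Cycle counts copied from the listing above; the rest is an assumption.
SEQUENCE = [
    ("STA $4801", 4),
    ("PHA", 3),
    ("PLA", 4),
    ("STA $420B", 4),
]

MASTER_CLOCKS_PER_FAST_CYCLE = 6  # FastROM figure quoted in the thread


def cpu_cycles(sequence):
    """Total CPU cycles for the whole sequence."""
    return sum(count for _, count in sequence)


def cycles_after_first_instruction(sequence):
    """CPU cycles spent in steps 2 to 4, where a collision could occur."""
    return sum(count for _, count in sequence[1:])


def lower_bound_master_clocks(n_cpu_cycles):
    """Lower bound: treats every CPU cycle as a 6-clock (fast) cycle."""
    return n_cpu_cycles * MASTER_CLOCKS_PER_FAST_CYCLE


if __name__ == "__main__":
    window = cycles_after_first_instruction(SEQUENCE)
    print("total CPU cycles:", cpu_cycles(SEQUENCE))
    print("CPU cycles in steps 2-4:", window)
    print("at least", lower_bound_master_clocks(window), "master clocks")
```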

Re: SDD1 FGPA implementation
by tepples on 2018-08-02 (#222535)

Could the S-DD1 be prefetching the PHA during the $4801 write cycle and the PLA during the "internal operation" cycles of PHA?

Or could the S-DD1 be accessing memory during the half-cycle when the 65816 isn't accessing memory? I know the Apple II and Commodore 64 access RAM at 2.04 MHz even though the 6502 in those computers is only 1.02 MHz, as the video circuit gets priority during M2 low phase.

Re: SDD1 FGPA implementation
by srg320 on 2018-08-02 (#222537)

Maybe this will help you understand what I mean:

Attachment:

Документ1.png

Re: SDD1 FGPA implementation
by srg320 on 2018-08-02 (#222538)

tepples wrote:

Could the S-DD1 be prefetching the PHA during the $4801 write cycle and the PLA during the "internal operation" cycles of PHA?

Instead of PHA / PLA, SFA2 uses:

Code:

STA $00
STA $00

and

Code:

LDY $00
LDY $00

Re: SDD1 FGPA implementation
by magno on 2018-08-02 (#222551)

srg320 wrote:

Maybe this will help you understand what I mean:

Attachment:

Документ1.png

I understand what you mean, but that timing diagram is not correct. The address bank appears on the data bus at most 33 ns after the PHI2 falling edge (your signals are rising-edge-aligned, but I think the correct alignment is to the falling edge, as with PHI2). This byte is latched, and after latching it is decoded to generate CPU_RD and CPU_WR. The latch's propagation time is at most 36 ns (assuming a 74LS573 as the latch, although this logic is implemented inside the 5A22, so it is probably less), which makes 69 ns after the PHI2 falling edge; 69 ns is 3 master cycles. But read data hold (tDHR) is 10 ns, so CPU_RD will be '1' for at most 3 master cycles. Assuming CPU_RD will always stay in the '1' state for 3 cycles could be dangerous.

Of course, all this applies to real hardware, not to an FPGA SNES implementation.
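
For readers who want to redo the delay arithmetic, here is a minimal conversion sketch. The 33 ns and 36 ns figures come from the post; the NTSC master clock of about 21.477 MHz is an assumption added here (PAL consoles differ), and the helper names are hypothetical. At that clock one master cycle is about 46.6 ns, so 69 ns works out to roughly 1.5 master cycles and 3 master cycles to about 140 ns; the sketch only does the unit conversion and does not settle which figure the post intended.

```python
MASTER_CLOCK_HZ = 21_477_272  # NTSC SNES master clock (assumption; PAL differs)

T_BANK_VALID_NS = 33  # bank byte valid after PHI2 falling edge (from the post)
T_LATCH_NS = 36       # 74LS573-style latch propagation delay (from the post)


def master_period_ns(clock_hz=MASTER_CLOCK_HZ):
    """Length of one master clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def ns_to_master_cycles(ns, clock_hz=MASTER_CLOCK_HZ):
    """Express a delay in (fractional) master clock cycles."""
    return ns / master_period_ns(clock_hz)


def master_cycles_to_ns(cycles, clock_hz=MASTER_CLOCK_HZ):
    """Express a number of master clock cycles in nanoseconds."""
    return cycles * master_period_ns(clock_hz)


if __name__ == "__main__":
    total_ns = T_BANK_VALID_NS + T_LATCH_NS
    print(f"bank valid + latch delay: {total_ns} ns")
    print(f"= {ns_to_master_cycles(total_ns):.2f} master cycles")
    print(f"3 master cycles = {master_cycles_to_ns(3):.1f} ns")
    print(f"FastROM CPU cycle (6 clocks) = {master_cycles_to_ns(6):.1f} ns")
```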

Re: SDD1 FGPA implementation
by srg320 on 2018-08-02 (#222552)

Thanks magno, you are the only one who gives very useful information. I have little experience in this.

magno wrote:

The address bank appears on the data bus at most 33 ns after the PHI2 falling edge (your signals are rising-edge-aligned, but I think the correct alignment is to the falling edge, as with PHI2). This byte is latched, and after latching it is decoded to generate CPU_RD and CPU_WR. The latch's propagation time is at most 36 ns (assuming a 74LS573 as the latch, although this logic is implemented inside the 5A22, so it is probably less), which makes 69 ns after the PHI2 falling edge; 69 ns is 3 master cycles. But read data hold (tDHR) is 10 ns, so CPU_RD will be '1' for at most 3 master cycles. Assuming CPU_RD will always stay in the '1' state for 3 cycles could be dangerous.

Can you point me to where I can find out more about this, such as links or books?

Re: SDD1 FGPA implementation
by magno on 2018-08-02 (#222553)

srg320 wrote:

Thanks magno, you are the only one who gives very useful information. I have little experience in this.

Thanks to you! It is really nice to have such a technical conversation with people who know what they are talking about.
I read all this information in the W65C816S datasheet from Western Design Center (year 2010); table 4-2 has the timings for Vcc = 5 V (even though the column says "14MHz", those timings apply because they depend on Vcc, not on the maximum frequency). Figure 5-1 shows the bank latching circuit, which I think the SNES is based on too.

Re: SDD1 FGPA implementation
by srg320 on 2018-08-02 (#222554)

magno wrote:

I read all this information in the W65C816S datasheet from Western Design Center (year 2010); table 4-2 has the timings for Vcc = 5 V (even though the column says "14MHz", those timings apply because they depend on Vcc, not on the maximum frequency). Figure 5-1 shows the bank latching circuit, which I think the SNES is based on too.

Found it, I will study it.

In the near future I plan to solder the cartridge slot and I will try to connect a real cartridge.

Re: SDD1 FGPA implementation
by magno on 2018-08-02 (#222556)

Hope this handmade timing diagram will help to explain why I don't think CPU_RD is 3 master cycles in FastROM mode:

Attachment:

File comment: STA.w $4801 timing diagram (incomplete)

STA_4801.jpg

There are 2 more CPU cycles left, but this is sufficient to illustrate how it works.

tLATCH is the latch's propagation delay from enable = '1' to output valid.
tDECOD is the combinational propagation delay to decode CPU_RD from the full address (bank + offset) and the RWB signal.
When the second read cycle starts (read $01 from ROM), CPU_RD is low, and it goes to '1' after the bank latch changes its output. The latch is enabled during the low phase of PHI2, but the new bank byte is not present on the data bus until 33 ns after the PHI2 falling edge. During those 33 ns of undetermined state, CPU_RD is '1' again.
Maybe I am missing something, but I've been thinking a lot about this lately and can't find any other explanation...

Re: SDD1 FGPA implementation
by srg320 on 2018-08-02 (#222563)

According to the logic analyzer traces from this post (LISTING8.txt), CPU_RD/CPU_WR is 3/3 and 3/5 master cycles long when the CPU is working and 4/4 when DMA is working.
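
One way to sanity-check those figures is to add the two numbers of each pair, as in the sketch below. Reading "3/3", "3/5" and "4/4" as the lengths of the two phases of one access is an assumption made here, and the names are hypothetical. Under that reading the totals match the 6-clock FastROM and 8-clock SlowROM cycles mentioned earlier in the thread, and the 8 master clocks that a DMA byte transfer takes.

```python
# Phase lengths in master cycles, as quoted from the logic analyzer traces.
# Treating each pair as (first phase, second phase) of one access is an
# assumption of this sketch.
TRACE_PHASES = {
    "CPU 3/3": (3, 3),
    "CPU 3/5": (3, 5),
    "DMA 4/4": (4, 4),
}


def access_length(phases):
    """Total master cycles of one access: the sum of its two phases."""
    return sum(phases)


if __name__ == "__main__":
    for name, phases in TRACE_PHASES.items():
        print(name, "->", access_length(phases), "master cycles per access")
```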

Re: SDD1 FGPA implementation
by magno on 2018-08-02 (#222564)

srg320 wrote:

According to the logic analyzer traces from this post (LISTING8.txt), CPU_RD/CPU_WR is 3/3 and 3/5 master cycles long when the CPU is working and 4/4 when DMA is working.

Yes, but I can't tell why those logs differ from my timing diagram. Maybe bank latching in the SNES is not done as explained in the W65C816S datasheet.

Re: SDD1 FGPA implementation
by srg320 on 2018-08-03 (#222566)

magno wrote:

tDECOD is combinational propagation delay to decode CPU_RD from full address (bank + offset) and RWB signal.

Why do you think that CPU_RD is decoded from the address and not just from RWB? I think it is something like ~(RWB & PHI2).

Re: SDD1 FGPA implementation
by magno on 2018-08-03 (#222567)

srg320 wrote:

magno wrote:

tDECOD is combinational propagation delay to decode CPU_RD from full address (bank + offset) and RWB signal.

Why do you think that CPU_RD is decoded from the address and not just from RWB? I think it is something like ~(RWB & PHI2).

Because some instructions have an Internal Operation cycle while RWB is '1', so CPU_RD would go low even if no read cycle is executing.

Re: SDD1 FGPA implementation
by srg320 on 2018-08-03 (#222568)

magno wrote:

Because some instructions have an Internal Operation cycle while RWB is '1', so CPU_RD would go low even if no read cycle is executing.

To determine an Internal Operation cycle I use VDA and VPA; maybe that is wrong.

Re: SDD1 FGPA implementation
by magno on 2018-08-03 (#222569)

srg320 wrote:

magno wrote:

Because some instructions have an Internal Operation cycle while RWB is '1', so CPU_RD would go low even if no read cycle is executing.

To determine an Internal Operation cycle I use VDA and VPA; maybe that is wrong.

No, you are not wrong: VDA and VPA are useful for detecting an Internal Operation cycle. But then /CPU_RD would be decoded from RWB, PHI2, VDA and VPA, and that is a guess, not a known fact... It may well work that way, but I don't know.
/ROMSEL is decoded from address and bank, so maybe there is no need for /CPU_RD to be decoded from them again.
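
The two guesses discussed in these posts can be compared with a small truth-table sketch. Both functions are hypotheses from the thread, not confirmed 5A22 behaviour, and the function names are hypothetical. The only 65816 fact relied on is that an Internal Operation cycle has VDA = VPA = 0 while RWB stays high.

```python
def cpu_rd_n_simple(rwb, phi2):
    """Hypothesis A: /CPU_RD = ~(RWB & PHI2). Returns 0 when asserted."""
    return 0 if (rwb and phi2) else 1


def cpu_rd_n_qualified(rwb, phi2, vda, vpa):
    """Hypothesis B: as A, but only when VDA or VPA marks a valid address."""
    return 0 if (rwb and phi2 and (vda or vpa)) else 1


if __name__ == "__main__":
    # (description, RWB, PHI2, VDA, VPA)
    cases = [
        ("opcode fetch", 1, 1, 1, 1),
        ("data read", 1, 1, 1, 0),
        ("internal operation", 1, 1, 0, 0),
        ("data write", 0, 1, 1, 0),
    ]
    for name, rwb, phi2, vda, vpa in cases:
        a = cpu_rd_n_simple(rwb, phi2)
        b = cpu_rd_n_qualified(rwb, phi2, vda, vpa)
        print(f"{name:20s} A: /CPU_RD={a}  B: /CPU_RD={b}")
```

The two hypotheses differ only on the internal operation row, which is exactly the case raised in the objection above.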

Re: SDD1 FGPA implementation
by srg320 on 2018-08-04 (#222648)

https://forums.nesdev.com/viewtopic.php?p=211643#p211643
https://forums.nesdev.com/viewtopic.php?p=189332#p189332

Re: SDD1 FGPA implementation
by Señor Ventura on 2018-08-04 (#222681)

93143 wrote:

It would be more exciting if there was any realistic prospect of it happening any time soon. Everybody who might contribute, including me, is busy doing other stuff.

But it's certainly fun to think about and talk about. And there is one member here who arrived harbouring the intention to port Metal Slug to the SNES, and has given it a lot of thought, even if he now feels it would be too big a task for him alone. A few other members are known to have been working on stuff that could be applicable to such a port, such as predictive animation, dynamic sprite VRAM allocation, and HDMA audio data streaming. Not to mention an FPGA implementation of the S-DD1... I can't rule out the possibility that this project may get off the ground someday.

The best part is always the theory XD

Maybe the SNES has the colors, the sound, and most of the animations at a rate similar to the original, but I definitely think it lacks the processing power and sprite fill rate (you only have to see how it runs in the original Neo Geo game).

But there used to be a lot of legend that led us to believe some ports were impossible. A large number of games could definitely fit on the SNES at almost the same level as the original arcade games.

magno wrote:

Nope, a CPU cycle is 6 master clocks in FastROM, or 8 master clocks in SlowROM (not the case here). The read cycle is just a portion of one CPU cycle.

That damned WRAM...

8.43 KB per frame could have been a reality.

Re: SDD1 FGPA implementation
by 93143 on 2018-08-04 (#222684)

Señor Ventura wrote:

a lack of processing

Sure, the Neo Geo had a 12 MHz 68000, and the game still had slowdown. But the game engine was not very efficient, and they had to rewrite it for Metal Slug X. Also, the SNES CPU is perhaps close to twice as efficient per clock as the 68000. A homebrew port would be heavily optimized; I'd still expect slowdown, but it might not be all that horrible.

Note that the original game is 30 fps. This means that it has about twice as much going on as a hypothetical 60 fps game would be able to manage with the same CPU load, and comparisons with existing SNES software (which is almost all 60 fps) should take this into account.

Quote:

and sprites fill rate

To be fair, the Neo Geo had to build its backgrounds out of sprites, so the overdraw advantage wasn't quite as large as it looked. And most of the bosses could easily be done with BG layers on SNES. But yes, it's likely there would be a fair amount of sprite dropout/flicker during busy segments.

We discussed the idea of a SNES Metal Slug port recently, over here: viewtopic.php?f=23&t=17374

Re: SDD1 FGPA implementation
by Señor Ventura on 2018-08-05 (#222698)

93143 wrote:

We discussed the idea of a SNES Metal Slug port recently, over here: viewtopic.php?f=23&t=17374

Perfect, then it would be better if I respond there (sorry for the off-topic, magno ^^).

Re: SDD1 FGPA implementation
by nocash on 2018-08-12 (#223371)

tepples wrote:

67.1 megabits, which is 64 mebibits or 64 of what the game industry called "megabits" in the mask ROM era.

That mask ROM era hasn't ended yet. As far as I know it's still absolutely common to use terms like "gigabit" or "gigabyte" for things like modern FLASH chips and SD cards, with "giga" meaning 1024^3; one reason is that chip manufacturing more or less requires a power of 2 for the storage array. SD cards, on the other hand, usually have an area reserved for automatic replacement of worn-out sectors, so the available user space may appear to be less than expected, but that's unrelated to the 1000 vs 1024 thing.
Magnetic discs don't have that requirement for powers of 2, so manufacturers often mean (1000^n) when they say "kilo/mega/giga" in their specifications (and for transfer rates and clock rates it's even more common to use 1000^n). Of course it would be neat to avoid that confusion, only those new scientific units like "mebibits" sound far too much like childish language to me : /
As for myself, I would rather avoid adopting those terms. Another approach would be to define the exact value in Hz or bytes somewhere in the introduction/specs, and then use an abbreviated value in the rest of the document, like 1234567Hz (1.2MHz) or 400h bytes (1Kbyte).
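
The 1000 vs 1024 arithmetic and the "exact value first, abbreviation after" convention suggested above can be illustrated with a short sketch. The helper names are hypothetical; the figures reproduce the quoted 67.1 megabits = 64 mebibits and the 400h bytes (1Kbyte) example.

```python
def describe_bits(n_bits):
    """Show a bit count in decimal megabits (10^6) and binary mebibits (2^20)."""
    return (f"{n_bits} bits ({n_bits / 1e6:.1f} megabits, "
            f"{n_bits // 2**20} mebibits)")


def describe_bytes(n_bytes):
    """Exact hex value first, abbreviated binary-K value in parentheses."""
    return f"{n_bytes:X}h bytes ({n_bytes // 1024}Kbyte)"


if __name__ == "__main__":
    print(describe_bits(64 * 2**20))
    print(describe_bytes(0x400))
```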

magno wrote:

Hope this handmade timing diagram will help to explain why I don't think CPU_RD is 3 master cycles in FastROM mode:

I've no idea which timings are correct... but your handmade diagrams & schematics are really pretty : )

Re: SDD1 FGPA implementation
by TmEE on 2018-08-12 (#223393)

Decimal bytes and bits are the exception not the rule so those should instead be bastardized. I refer to such things using e after the K/M/G/T etc. in my stuff and will not start using the i after them to refer to binary equivalents. A floppy can be 1.44MeB and some random HDD 250GeB.

Re: SDD1 FGPA implementation
by warham on 2018-11-15 (#228918)

I think these folks will be interested in converting this project to MiSTer and the Cyclone V:
http://www.atari-forum.com/viewtopic.ph ... 76#p358276