experimental NES multicore architecture

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
View original topic
experimental NES multicore architecture
by on (#43783)
quote from another thread (in General):
tepples wrote:
Hydra by XGameStation: Imagine a console with even less fixed-function video hardware than an Atari 2600. Your code has to generate each pixel by writing signal values to a DAC. But there are eight programmable 32-bit CPU cores that run at up to 20 MIPS, so one can dedicate half the cores to a soft PPU and still be able to run game and audio logic.


The 8-core, 32-bit chip featured there is the Propeller (P8X32A). Truth be told, I've been quite curious about it ever since I heard about it's release. So finally, last month for my birthday I got myself a Propeller demo board w/ book (not the Hydra) for the purpose of evaluating it for performing strange tricks on an NES cart (which could become the eventual Squeedo rev3).

Here's the possibilities I want to explore:
-sound, there's a multitude of possibilities. It could easily support both IRQ and expansion-wired audio.
-PPU, the Propeller has 32 I/Os - instructions at 80mhz are maybe 50ns(?), it may be possible for the chip to actually be the CHR memory. Someone could make the Hellraiser game, it'd work as dual-ported RAM, the chip could render the tiles, lots of unique possibilities..
-CPU, it'd like to write a 6502 emulator to meet or exceed the 2A03's speed. That way you could use your existing code if you wanted parallel processing (up to 8 6502s). I need to write the emulator using less than 500 instructions, it sounds doable.

Of course there's a lot more it could do (USB and other interfaces would be obvious), these are just the more unusual ones.

edit: I will start on this after Garage Cart #2 is ready to go out the door.

by on (#44166)
I'm not sure how practical it'd be to write multicore 6502 programs, even though it might sound cool ;) A high-speed mode for the CPU (like on the TurboGrafx) would be neat though, if the rest of the system can keep up. So would a beefed up PPU.

by on (#44186)
mic_ wrote:
I'm not sure how practical it'd be to write multicore 6502 programs, even though it might sound cool ;)

The "turbo loader" for Commodore 64 disk-based games was a multicore 6502 program.

by on (#45338)
The Propeller ain't got no stack. And it can't (usefully) run out of ROM only RAM, unless you don't want to support subroutine calls. It has to modify bytes in the code space to mark its return. And being forced to load your code into RAM to run it (not counting a bootstrap mode) means that huge bounty of 64K on-chip is not so huge anymore.

While the COGs are really cool, and the chip is fast and cheap. As a C programmer I'm not pleased with the non-traditional instruction set architecture. To me the propeller is nothing but a collection of coprocessors.

Multicore 6502 is very practical if they can have some private per processor memory and some shared memory/device between them. It helps greatly if there is a way to cheaply synchronize them, either through a message (irq), shared high speed fifo. Shared memory alone is rough because without a CPU level lock operation it's hard (but not impossible!) to reliably read-write a mutex or semaphore without creating a race.

When you want to put cores that take less than 10K transistors on a chip (like a 6502) I start to think you want 128 or so of them. And when you have that many you'll want a very different architecture than 6502, like it would be nice in that scenarios to have the ability to gang 2,4 or 8 cores together to run their instruction pointers in lock-step and issue up to 8 instructions at a time to operated in parallel with a coprocessor that is part of the gang-of-8 to provide an operation to swizzle registers between cores. (not unlike SSE, NEON, SPU and AltiVec). But I digress ..

by on (#45344)
Jon wrote:
The Propeller ain't got no stack

If you have indexed addressing, then you have a stack. If you also have an instruction to jump to the address in a register, you have a call stack. I thought ARM already demonstrated that (mov pc, lr anyone?). Or does the propeller not even have those?

Quote:
Shared memory alone is rough because without a CPU level lock operation it's hard (but not impossible!) to reliably read-write a mutex or semaphore without creating a race.

I thought atomic INC and DEC instructions were exactly what was needed to make a semaphore.

Quote:
When you want to put cores that take less than 10K transistors on a chip (like a 6502)

I forget: how many transistors are in an ARM7TDMI?

by on (#45359)
tepples wrote:
If you have indexed addressing, then you have a stack. If you also have an instruction to jump to the address in a register, you have a call stack. I thought ARM already demonstrated that (mov pc, lr anyone?). Or does the propeller not even have those?


As far as I can tell, the answer is no. I urge you to go onto parallax's site and look over the manuals for the prop. The thing is, even thought (in my opinion) it is weird. It's still very powerful, and people have done some really great projects with it. (OpenStomp, HYDRA, etc). the SPIN bytecode is actually more normal, and even thought it's just a bytecode it's still relatively fast. (you still will want to write prop assembler for timing related thing).

Quote:
I thought atomic INC and DEC instructions were exactly what was needed to make a semaphore.


From my limited understanding, there is no bus atomic way to do INC/DEC on 6502 (many other processors have this same issue). INC/DEC is, at the lowest level, a read, modify, write. If bus access is atomic only on a single bus cycle for a "dumb" participant and it takes at least 2 bus cycles to a read then write for increment/decrement, then I believe you are stuck. If you had some IO memory where it went into a state for 2 or more cycles where writes to the same address would force a bus lock, then I think you can do it. CPU A reads value at X, cpu B tries to access X but it is "locked" pausing that unit until some controller stops asserting a control line, CPU A modifies value and writes it back to X. the clock counter times out and releases the assert allowing cpu B to continue. This is just a hypothetical solution that I pulled out of thin air, so take it with a grain of salt.

Quote:
I forget: how many transistors are in an ARM7TDMI?


I believe the base arm7tdmi core without any peripherials is 45K gates. The original ARM2 was more like 25-35K iirc. (about 20% smaller basically is what I remember)

(quotes liberally reformatted)

by on (#45378)
First, thanks for the great input.

Yeah, this chip could always use more memory. Instructions are 4 bytes, and the shared RAM is 32kB. There could be another chip in the series, but who knows if/when that will be.

Spin looks OK, but I want to program it in assembly since that's the only way I ever do anything. And actually your assembly code needs to fit in 2kB, which is 512 instructions. Sounds like fun to me. The instruction set is suited for self-modifying code, some people have written programs that read out of the shared memory, and something similar to that is how I imagine a 6502 emulator would work.

Quote:
Multicore 6502 is very practical if they can have some private per processor memory and some shared memory/device between them. It helps greatly if there is a way to cheaply synchronize them, either through a message (irq), shared high speed fifo. Shared memory alone is rough because without a CPU level lock operation it's hard (but not impossible!) to reliably read-write a mutex or semaphore without creating a race.


For the emulator I had considered having some portion of zero-page (or all) to be local to the cog, all the other memory in the main shared RAM. I figured there should be some custom 6502 instructions added in (use macros in asm) to use the native locking instructions.

Quote:
If you have indexed addressing, then you have a stack. If you also have an instruction to jump to the address in a register, you have a call stack. I thought ARM already demonstrated that (mov pc, lr anyone?). Or does the propeller not even have those?


There's no stack, and there's no indexing either. Each 32-bit word (or whatever it'd be called) contains an opcode, source address, and destination address. The remaining bits are quite interesting - you can (dis)allow the instructions to affect the status flags, and also conditional execution which can (dis)allow instructions from being executed based on the status flags. So you modify the 'source' in the instruction to do something like indexing.

by on (#45611)
Semi-offtopic.. A pretty cool example of what can be done with a propeller board and a lot of work: http://www.linusakesson.net/files/lft_turbulence_h264_presentation_and_cam_1280x720.avi

by on (#45907)
That is a really cool demo. The best one I've seen. And all that, from a 32kB EEPROM.. Now I have a reason to hook up the VGA-out from Propeller demo board. :)

It's a shame there's not a hardware multiplier. I bet that's why there's no volume control on the sound channels there.

by on (#60276)
6502 multiprocessing has a number of issues that will complicate the hell out of it, the atomic bus issue is only one of them.

The 2A03 lacks an external RDY line, so there's no way to stall the CPU out in the case of a conflict. There's also no SYNC line, so you can't easily detect instruction fetches. Even if those did exist, RDY only affects reads, so your general shared mem arbitration policy would require all accesses to be reads or RMWs, with the grant running from the first read cycle to shared memory to SYNC going active again for that processor.

Allowing one to use STA/X/Y to the shared memory would require some extra logic to snoop the bus for the instruction at SYNC high, and then snoop the address during the rest of the instruction fetch, doing the bus arbitration during the last read. Depending on how fast your logic is, this might involve adding a cycle to every STx instruction to let the decoder figure out if it needs to sieze the shared bus or not.

On top of all that, you have to isolate each 6502 from the shared memory with tristate buffers for all the address lines, R/W, and all the data lines, since there's no way to disable them externally otherwise. Though, you'd need to do that anyways, since the 6502 has no such thing as an internal operation cycle, where it doesn't need the bus.

The 65816 might be a better choice, though maximum throughput would still require the tristate buffers, as while it does have IO cycles, there aren't a lot of them. Decent 8/16 bit multiprocessing is usually going to require Z80s or 80xx's, since they tend to support wait states on just about anything, have a full complement of bus control signals, and have enough cycles where they don't care about the bus to give a second or third chip plenty of time to get something done.

The arcade Galaga does this with 3 Z80s, and a ton of custom chips for bus interfacing.

e: damnit, I missed, hit the wrong forum, and thought this was a new thread :fail:

by on (#60288)
Right on, this thread needs brought back anyways since I've made progress.

The previous idea with the Propeller chip, the NES CPU wasn't in on the multiprocessing, it would have communicated to all the other CPUs through a port. Later the idea evolved to moving it to the PPU bus, so you'd have multiple emulated 6502 CPUs (or native asm) basically running inside of the PPU memory. I think that could be very interesting. Unfortunately, the fastest routine in prop asm I could come up with was not quite fast enough to make it work, in theory. The next version of the Propeller chip is supposed to be faster (pipelined for 1 clock vs 4 clock instructions), so maybe I'll look at that approach again whenever that gets produced.

So instead now I'm heading towards something entirely different, I'll likely use a PIC32 (MIPS32 @ 80mhz) but what's for sure, it will have a lot nicer interface to the NES than the PIC18 on the old Squeedo.