Question about OAM and vertical comparator

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
by on (#65688)
Hi, I'm trying to emulate the PPU in an FPGA, but first I'm writing a simple PPU emulator in order to test all the functions. Right now I'm reading the patent (U.S. Patent 4824106) and Brad Taylor's NTSC 2C02 technical reference.

My first question is:
Are the OAM and the temporary memory in the same RAM, so that the OAM address register and the temporary memory register are multiplexed in order to address the same RAM (OAM + Temp)?


The other question is:
How is the 4-bit result of the comparator stored in the temporary memory if there is no direct connection to the OAM data bus in the PPU diagram?

Brad Taylor says that the 4-bit result from the comparator is stored in the temporary memory, but I don't see a bus that connects this result to the "OAM + Temp" data bus.





I hope someone can understand what I'm trying to ask and clarify how this part of the PPU works.

by on (#65692)
Don't take Brad's doc (or a patent) literally. I found it best to read emulator source and study the forum here and the wiki. Once you understand the operation, choose the hardware implementation that makes the most sense to you. When you learn of more hardware nuances, update the implementation.

OAM is copied to "temp RAM" so it wouldn't really make sense for them to be the same.

What 4-bit comparator are you referring to?

OAM is either dual port (CPU & "temp RAM") or is multiplexed between CPU & "temp RAM" with one port having priority. Temp RAM then feeds the counters, attribute registers and shift registers.

by on (#65733)
kyuusaku wrote:
What 4-bit comparator are you referring to?

I think it refers to the result of scanline_y - sprite_y, evaluated for each sprite on each scanline.

by on (#65746)
In that case the comparator is implemented as an 8-bit half-adder, an 8-bit inverter and an 8-bit NOR gate.

copy_sprite_to_temp = ~|(~current_line + sprite_y)

by on (#65781)
kyuusaku wrote:
OAM is copied to "temp RAM" so it wouldn't really make sense for them to be the same.


In the hardware it really makes sense, because during pixel rendering the PPU is drawing the contents of the sprites found in range for this scanline (these are in the special buffer for sprites), while at the same time it is searching for the sprites valid on the next scanline. So the temporary memory is filled with 8 sprites (it has 8x4 bytes, but instead of the vertical position it holds the comparator result that tells which row is going to be drawn).

When horizontal blank is reached, the PPU scans this temporary memory in order to fill the buffer with the horizontal position, attribute bits and the 2 pattern bytes.

kyuusaku wrote:
What 4-bit comparator are you referring to?


In the patent you see a comparator between scanline#+1 and the Ypos of the sprite, so the IN-RANGE validator receives 5 bits that only tell you to draw that sprite if the result is less than 8 (the subtraction is unsigned).


But when the sprite is valid, the 4-bit result tells you which line you are going to fetch from the pattern table.


kyuusaku wrote:
OAM is either dual port (CPU & "temp RAM") or is multiplexed between CPU & "temp RAM" with one port having priority.
Temp RAM then feeds the counters, attribute registers and shift registers.


The problem here is that the OAM (Motion Picture Attribute Table in the patent, fig. 5) is 256 bytes (really 64 x 29 bits) with an 8-bit address register, but the temporary memory is drawn within the same memory, so the OAM and the temporary memory are in the same entity (256 + 32 addresses). Then we need two different address registers in the same RAM, one for the OAM and one for the temporary memory, but only the OAM needs to be dual ported between the CPU and the PPU internals.

See this part of the schematic from this patent.

by on (#65782)
Drawing conclusions on where some chunk of memory might live in reality from the patent diagrams is not a good idea. The address register for the temp memory is just a counter that gets cleared at the start of the active line (for evaluating) and again at the start of hblank (for fetching). The "temporary memory" could be just about anything, from a raw copy of the sprite entry in question to an index and a y-offset per sprite entry that gets examined in hblank.

by on (#65791)
Quote:
In the hardware it really makes sense, because during pixel rendering the PPU is drawing the contents of the sprites found in range for this scanline (these are in the special buffer for sprites)

The line is delayed, so the buffer is for the NEXT line. These registers are held until they are used for pattern table fetching.

Quote:
In the patent you see a comparator between scanline#+1 and the Ypos of the sprite, so the IN-RANGE validator receives 5 bits that only tell you to draw that sprite if the result is less than 8 (the subtraction is unsigned).

What is your question about the comparator? I wasn't thinking in my last post: you NOR the high 5 bits for 8x8 sprites and the high 4 bits for 8x16.
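
A minimal C sketch of the in-range test described here, offered as an illustration rather than as a statement of how the silicon works. The function names are made up, and the caller is assumed to pass whichever line value the comparator sees (the patent compares against scanline#+1). The difference is taken as an 8-bit value, the high bits decide whether the sprite is in range, and the low bits give the row inside the sprite.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: 8-bit difference between the line being evaluated
 * and the sprite's Y byte. The result wraps modulo 256. */
static inline uint8_t sprite_row_diff(uint8_t line, uint8_t sprite_y)
{
    return (uint8_t)(line - sprite_y);
}

/* In range when the high bits of the difference are all zero:
 * high 5 bits (mask 0xF8) for 8x8 sprites, high 4 bits (mask 0xF0) for 8x16.
 * Comparing the masked value with zero is the NOR of those bits. */
static inline bool sprite_in_range(uint8_t diff, bool tall_sprites)
{
    uint8_t high_mask = tall_sprites ? 0xF0 : 0xF8;
    return (diff & high_mask) == 0;
}

/* Row inside the sprite: low 3 bits for 8x8, low 4 bits for 8x16. */
static inline uint8_t sprite_row(uint8_t diff, bool tall_sprites)
{
    return (uint8_t)(diff & (tall_sprites ? 0x0F : 0x07));
}
```

For example, line 20 against a sprite with Y byte 16 gives a difference of 4, which is in range for both sprite sizes and selects row 4.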

Quote:
The problem here is that the OAM (Motion Picture Attribute Table in the patent, fig. 5) is 256 bytes (really 64 x 29 bits) with an 8-bit address register, but the temporary memory is drawn within the same memory, so the OAM and the temporary memory are in the same entity (256 + 32 addresses). Then we need two different address registers in the same RAM, one for the OAM and one for the temporary memory, but only the OAM needs to be dual ported between the CPU and the PPU internals.

What is the problem? You don't know that they're the same. I think temp RAM is separate, and simply addressed by a "found valid sprite" counter.
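
A rough C model of that arrangement, under loud assumptions: OAM and the temp RAM are two separate arrays, and the temp RAM is written through a "found valid sprite" counter. The struct and function names are hypothetical, the temp RAM is pre-filled with $FF as the wiki's sprite evaluation page describes, and the sketch ignores cycle timing and the sprite overflow behaviour entirely.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define OAM_SPRITES  64
#define TEMP_SPRITES 8

typedef struct {
    uint8_t oam[OAM_SPRITES * 4];   /* Y, tile, attributes, X per sprite */
    uint8_t temp[TEMP_SPRITES * 4]; /* separate temporary memory */
    uint8_t found;                  /* "found valid sprite" counter, 0..8 */
} sprite_unit;

/* Scan all of OAM for one line and copy up to 8 in-range entries
 * into the temp RAM, addressed by the found counter. */
static inline void evaluate_sprites(sprite_unit *u, uint8_t line, bool tall)
{
    int height = tall ? 16 : 8;

    memset(u->temp, 0xFF, sizeof u->temp);
    u->found = 0;

    for (int n = 0; n < OAM_SPRITES && u->found < TEMP_SPRITES; n++) {
        const uint8_t *entry = &u->oam[n * 4];
        int diff = (int)line - (int)entry[0];   /* line minus sprite Y */

        if (diff >= 0 && diff < height) {
            memcpy(&u->temp[u->found * 4], entry, 4);
            u->found++;
        }
    }
}
```

The OAM read index (n) and the temp RAM write index (found) advance independently, which is why two address pointers are needed whatever the physical layout is.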
Re: Question about OAM and vertical comparator
by on (#65801)
alexwinger wrote:
My first question is:
Are the OAM and the temporary memory in the same RAM, so that the OAM address register and the temporary memory register are multiplexed in order to address the same RAM (OAM + Temp)?

I believe that OAM and TempM are separate units because their suggested structures differ (64x29 vs 32x8)
and they are referred to as separate entities. But why are you so concerned about their internal structure?
As I see it, nowadays there is little sense in making such RAM units authentic when targeting an FPGA.
For example, in those days it was worthwhile to save two hundred SRAM bits on a die (3*64 in the case of OAM).
But today the smallest separately addressable RAM block found in a low-cost Altera Cyclone II FPGA (which I use for my hw emu) is 9*256 bits. So why bother?


alexwinger wrote:
The other question is:
How is the 4-bit result of the comparator stored in the temporary memory if there is no direct connection to the OAM data bus in the PPU diagram?

Brad Taylor says that the 4-bit result from the comparator is stored in the temporary memory, but I don't see a bus that connects this result to the "OAM + Temp" data bus.

Let's take a look at http://wiki.nesdev.com/w/index.php/PPU_sprite_evaluation:
Quote:
3. Cycles 256-319: Sprite fetches (8 sprites total, 8 cycles per sprite)
    * 1-4: Read the Y-coordinate, tile number, attributes, and X-coordinate of the selected sprite

This statement implies that the Y coordinate must actually be stored in TempM. So there is no need to store the calculated sprite line (i.e. the 4-bit result from the comparator) separately. Just route the q outputs of both OAM and TempM through a multiplexer to the comparator and it's done. Yes, this requires an 8-bit 2-to-1 multiplexer, but, well, who cares? :)
Re: Question about OAM and vertical comparator
by on (#65806)
dekoder wrote:
Let's take a look at http://wiki.nesdev.com/w/index.php/PPU_sprite_evaluation:
Quote:
3. Cycles 256-319: Sprite fetches (8 sprites total, 8 cycles per sprite)
    * 1-4: Read the Y-coordinate, tile number, attributes, and X-coordinate of the selected sprite

This statement implies that the Y coordinate must actually be stored in TempM

But there's really no reason to do so. One could take the "Y coordinate" in this case to mean the 0-15 sprite_texel_y from the adder. It's exclusive-ORed with the vertical flip attribute and combined with the tile number and two bits from PPUCTRL ($2000) to form the source address for a sliver fetch.
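
A hedged C sketch of that address formation, assuming the usual register layout (PPUCTRL bit 3 selects the 8x8 sprite pattern table, PPUCTRL bit 5 selects 8x16 sprites, OAM attribute bit 7 is vertical flip). The function and macro names are invented for illustration, and `row` is the 0-15 value coming out of the adder.

```c
#include <stdbool.h>
#include <stdint.h>

#define ATTR_VFLIP   0x80 /* OAM attribute bit 7: vertical flip */
#define CTRL_SPR_TBL 0x08 /* PPUCTRL bit 3: pattern table for 8x8 sprites */
#define CTRL_SPR_16  0x20 /* PPUCTRL bit 5: 8x16 sprites */

/* Pattern table address of one 8-pixel sliver of a sprite.
 * high_plane selects the second bit plane (8 bytes after the first). */
static inline uint16_t sprite_sliver_addr(uint8_t row, uint8_t tile,
                                          uint8_t attr, uint8_t ppuctrl,
                                          bool high_plane)
{
    bool tall = (ppuctrl & CTRL_SPR_16) != 0;
    uint16_t table, index;

    /* Vertical flip is an exclusive-OR over the row bits. */
    if (attr & ATTR_VFLIP)
        row ^= tall ? 0x0F : 0x07;

    if (tall) {
        /* 8x16: tile bit 0 picks the table, row bit 3 picks the half. */
        table = (tile & 0x01) ? 0x1000 : 0x0000;
        index = (uint16_t)((tile & 0xFE) | ((row >> 3) & 1));
    } else {
        table = (ppuctrl & CTRL_SPR_TBL) ? 0x1000 : 0x0000;
        index = tile;
    }

    return (uint16_t)(table | (index << 4) | (high_plane ? 8 : 0) | (row & 7));
}
```

With 8x8 sprites, tile $12 and row 3 this yields $0123 for the low plane and $012B for the high plane.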
Re: Question about OAM and vertical comparator
by on (#65810)
tepples wrote:
But there's really no reason to do so. One could take the "Y coordinate" in this case to mean the 0-15 sprite_texel_y from the adder. It's exclusive-ORed with the vertical flip attribute and combined with the tile number and two bits from PPUCTRL ($2000) to form the source address for a sliver fetch.

Well, of course he or she can (and actually suggested doing it this way), but as I see it, the question was not about how to get the pattern line number but about the right moment to do it.
Re: Question about OAM and vertical comparator
by on (#65828)
dekoder wrote:
alexwinger wrote:
The other question is:
How is the 4-bit result of the comparator stored in the temporary memory if there is no direct connection to the OAM data bus in the PPU diagram?

Brad Taylor says that the 4-bit result from the comparator is stored in the temporary memory, but I don't see a bus that connects this result to the "OAM + Temp" data bus.

Let's take a look at http://wiki.nesdev.com/w/index.php/PPU_sprite_evaluation:
Quote:
3. Cycles 256-319: Sprite fetches (8 sprites total, 8 cycles per sprite)
    * 1-4: Read the Y-coordinate, tile number, attributes, and X-coordinate of the selected sprite

This statement implies that the Y coordinate must actually be stored in TempM. So there is no need to store the calculated sprite line (i.e. the 4-bit result from the comparator) separately. Just route the q outputs of both OAM and TempM through a multiplexer to the comparator and it's done. Yes, this requires an 8-bit 2-to-1 multiplexer, but, well, who cares? :)


Thanks dekoder, this info helped me understand the sprite part of the PPU, and it makes sense:
I think that the OAM and the temp memory are in the same memory array, because they are accessed in different cycles; the drawing suggests it because they share the same buses and the same symbol. I think the address buses are multiplexed, with a ninth bit that enables each area. Mmm, maybe the multiplexer is a bit of a waste, so I would prefer them to be separate entities with their data buses interconnected.

The result of the comparator is never stored; instead its 2 output buses are used, 1 for the OAM entry Y evaluation and 1 for the row # for pattern fetching.

I'm going to write some code in C to test this. If you are interested, I'll show you the results.

dekoder wrote:
I believe that OAM and TempM are separate units because their suggested structures differ (64x29 vs 32x8)
and they are referred to as separate entities. But why are you so concerned about their internal structure?
As I see it, nowadays there is little sense in making such RAM units authentic when targeting an FPGA.
For example, in those days it was worthwhile to save two hundred SRAM bits on a die (3*64 in the case of OAM).
But today the smallest separately addressable RAM block found in a low-cost Altera Cyclone II FPGA (which I use for my hw emu) is 9*256 bits. So why bother?



I just want to emulate the PPU as accurately as possible, so first I really want to understand the patent info and other documents.
I can model these kinds of RAM bit by bit with many ports. So why not replicate the hardware with minor changes? I'm using a Xilinx Spartan-3E FPGA. I want to buy an Altera for testing purposes.

Do you have any online info of your hard-emu?
Do you use VHDL or Verilog?
Re: Question about OAM and vertical comparator
by on (#65829)
tepples wrote:
dekoder wrote:
Let's take a look at http://wiki.nesdev.com/w/index.php/PPU_sprite_evaluation:
Quote:
3. Cycles 256-319: Sprite fetches (8 sprites total, 8 cycles per sprite)
    * 1-4: Read the Y-coordinate, tile number, attributes, and X-coordinate of the selected sprite

This statement implies that the Y coordinate must actually be stored in TempM

But there's really no reason to do so. One could take the "Y coordinate" in this case to mean the 0-15 sprite_texel_y from the adder. It's exclusive-ORed with the vertical flip attribute and combined with the tile number and two bits from PPUCTRL ($2000) to form the source address for a sliver fetch.


The problem here is that you can't access these 4 bytes at the same time.

We need 8 bytes to get all the info into the buffer memory:

1.-You read the Y-coordinate just to feed the comparator, which feeds the vertical inversion module (I think the vertical inversion module is a holding register, because we haven't loaded the attribute part of the sprite yet).
2.-The attribute part is loaded, so the vertical and horizontal inversion modules hold these values.
3.-The tile number is loaded into the picture address register.
4.-The X data is loaded into the buffer memory. The info for the pattern fetch is complete and we are going to fetch the 2 pattern bytes from the pattern table, so we have to wait 4 clock cycles (2 byte fetches from PPU memory).
5.-We are waiting.
6.-The first pattern byte has arrived, but we need to wait another 2 cycles.
7.-We have to wait just one more clock cycle.
8.-All data is complete for 1 sprite. Now the same happens for the next sprite, until all 8 are transferred from the temporary memory.


The last 4 cycles are a waiting state with the last X data held on the data bus. I think that's the reason for the 4-cycle repeated read of the X data in the "PPU sprite evaluation" doc.
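
To make the 8-cycle slot concrete, here is a small C sketch of which temporary memory byte would be on the bus at each step. It follows the byte order in the wiki text quoted above (Y, tile number, attributes, X) rather than the order in the numbered list, and it assumes the X byte simply stays selected for steps 5 to 8. The names are hypothetical.

```c
#include <stdint.h>

enum { TEMP_Y = 0, TEMP_TILE = 1, TEMP_ATTR = 2, TEMP_X = 3 };

/* Byte offset inside one 4-byte temp entry for step 0..7 of a slot.
 * Steps 0-3 walk Y, tile, attributes, X; steps 4-7 keep reading X. */
static inline uint8_t temp_read_offset(unsigned step)
{
    return (uint8_t)(step < 4 ? step : TEMP_X);
}

/* Address into the 32-byte temp memory for sprite slot 0..7, step 0..7. */
static inline uint8_t temp_read_addr(unsigned slot, unsigned step)
{
    return (uint8_t)((slot & 7) * 4 + temp_read_offset(step & 7));
}
```

Slot 2, step 1 reads address 9 (the tile number of the third sprite found), and steps 4 to 7 of slot 0 all read address 3.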

I'm not really sure, because I have only just read the document.

by on (#65834)
There's a lot of nonsense going around in here regarding what could and could not be done, and what is and is not a reasonable internal architecture.

The PPU sprite evaluation page, AFAIK, is tested on the real thing. Whatever you come up with should be able to produce the output it describes.

Patent diagrams are notorious for not telling the whole story, or even a particularly accurate story. Most notably, there are 3 8-bit outputs coming out of that temp memory block, and one of those goes directly to the CPU lines, so it could be reading more than one thing at a time, and routing one of them to the CPU anyways.

If it's built internally like the 2A03 is (fairly likely), then there isn't a register involved so much as some transmission gates acting as a dynamic latch for the comparator input. You're not implementing that directly in an FPGA.

by on (#65835)
alexwinger wrote:
Thanks dekoder, this info helped me understand the sprite part of the PPU, and it makes sense:
I think that the OAM and the temp memory are in the same memory array, because they are accessed in different cycles; the drawing suggests it because they share the same buses and the same symbol. I think the address buses are multiplexed, with a ninth bit that enables each area. Mmm, maybe the multiplexer is a bit of a waste, so I would prefer them to be separate entities with their data buses interconnected.

You're welcome :) But let me clarify the topic a bit. The matter discussed is somewhat superfluous because it depends on how you look at it. At the low hardware level there is just no sensible way to make a RAM unit with depth != 2^n (you'll just end up wasting more resources on address decoding). So even if OAM+TempM does have a shared address bus in real hardware (the bus being open-drain, so there is no need for an explicit multiplexer), they are internally two separate entities with depths of 64 and 32, with the "ninth bit" acting as chip (and bus) select. They can even have a similar open-drain shared output bus, and it is the only bus routed to the subtractor. It made sense 30 years ago. Now it doesn't. FPGAs don't have internal open-drain outputs. You can describe them in HDL, but you'll end up with a multiplexed bus after synthesis. In our case we definitely need two separate address pointer registers in the sprite evaluation state (one holding the OAM address for reading, the second the TempM address for writing), so there is absolutely no point in multiplexing them onto a single bus. You can do it, but it is really impractical for internal logic when targeting an FPGA. Even if you do it that way, your synthesis software will throw the multiplexer away and simply route each pointer to its memory unit - one for OAM, another for TempM. The same consideration suggests making separate data_in and data_out buses for an internal CPU, for example, even if the real-life CPU has just one shared open-drain data bus.
But you do have to use a multiplexer for the OAM and TempM outputs that go to the subtractor, of course.

Feel free to reply :)

alexwinger wrote:
The result of the comparator is never stored; instead its 2 output buses are used, 1 for the OAM entry Y evaluation and 1 for the row # for pattern fetching

Yep, you can NOR the upper N bits of the result. If you use a 9-bit result, then N = 5 for tall sprites, N = 6 otherwise. But you will hardly express it in terms of NOR in HDL anyway :)
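
A small C illustration of the 9-bit variant, with made-up function names and no claim about the real chip. Bit 8 of the result is the borrow, so a sprite whose Y byte is larger than the line number is rejected automatically instead of wrapping around into a small positive value.

```c
#include <stdbool.h>
#include <stdint.h>

/* 9-bit difference line - sprite_y; bit 8 is set when the subtraction borrows. */
static inline uint16_t row_diff9(uint8_t line, uint8_t sprite_y)
{
    return (uint16_t)(((unsigned)line - (unsigned)sprite_y) & 0x1FF);
}

/* NOR of the upper N bits of the 9-bit result:
 * N = 5 (mask 0x1F0) for tall 8x16 sprites, N = 6 (mask 0x1F8) for 8x8. */
static inline bool in_range9(uint16_t diff9, bool tall_sprites)
{
    uint16_t upper = tall_sprites ? 0x1F0 : 0x1F8;
    return (diff9 & upper) == 0;
}
```

For line 5 and a Y byte of 250 the 9-bit result is 0x10B. The borrow bit is set, so the sprite is out of range, whereas the low 8 bits alone (0x0B) would have looked like row 11 of a tall sprite.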


alexwinger wrote:
I just want to emulate the PPU as accurately as possible...

Then you should definitely check the forum a lot. You can find info on all sorts of edge cases and quirks.
Just as an example:
http://nesdev.com/bbs/viewtopic.php?t=6424
http://nesdev.com/bbs/viewtopic.php?t=4647

by on (#65838)
alexwinger wrote:
Do you have any online info of your hard-emu?
Do you use VHDL or Verilog?

No. VHDL.

At the moment my emu is at an early stage of development. I work on it from time to time in my spare time without any schedule, and I don't like blogging (it would take my precious time). I do all my hw designs using a Terasic Altera DE2 board. It is really great in that it has almost all the needed chips already soldered for you. It took me two weeks of intense coding to make a first working release of the PPU (spr 0 hit test and spr overflow not implemented at all, many bugs) + NTSC encoder. It was so boring watching the almost static screen, so I added Peter Wendrich's great and perfect (I really treat this core this way; the ones you can find on opencores.org have much less precise timings) 6502 core (http://www.syntiac.com/fpga64.html). I just disabled BCD and got NEStress passing all tests (at that moment I didn't know this test is not so great) and Galaxians (he-he :) running. Half a year later (today) there is not a lot of progress; it's very difficult to find time to spare on the topic. The current state:
    1. No APU, though it is on the todo list and there is a sketch
    2. No mappers other than 0 because there is no rom loader at the moment (the highest priority) and I use FPGA's internal memory as ROM.
    3. PPU passing all blargg's tests (they are really handy), except for the "palette after reset" one that doesn't matter
    4. Super Mario Bros. running, no visual glitches but without any sound it's just not fun at all
    5. Playstation 2 controllers as gamepads
    6. Double coherent video output: analog NTSC S-video from the VGA DAC (found on the DE2) and a digital 320x240 color TFT (Terasic LCM, sold separately :))

As you can see, no point in publishing yet.

Why VHDL? Well, it has more consistent syntax and semantics for my taste. I'm a software engineer with no formal education in electronics, so when I found myself interested in it several years ago, I made a comparison of VHDL vs Verilog and chose the former (though I'm a C/C++, not Ada/Pascal guy). And VHDL has been open from the start, which makes some difference for me.
Oh, and jwdonal has been using SystemVerilog for his emu for some time.