Hello,
I'm new to the forums but have been lurking in #nesdev for a few months. Making yet another emulator.
The information on PPU frame events and timing on the wiki felt a bit scattered (and is contradictory in a few spots :/), so I put together an SVG diagram in Inkscape to get an overview and help my own understanding. I think it might be useful to other people too, so I'm putting it here. Tell me what you think.
Link to SVG file:
https://www.dropbox.com/s/84k9ypwct3zwomu/ppu.svg
Edit: Rearranged things slightly for better clarity.
Here are a few questions that came up while I was drawing it btw:
1.
http://wiki.nesdev.com/w/index.php/Skinny says that the horizontal position in v is incremented at dot 256:
Quote:
If rendering is enabled, the PPU increments the horizontal position in v many times across the scanline, it begins at dots 328 and 336, and will continue through the next scanline at 8, 16, 24... 240, 248, 256 (every 8 dots across the scanline until 256).
Is that a typo, or are both the vertical and the horizontal position updated at dot 256? Doesn't matter that much since the horizontal bits are reloaded at dot 257 anyway, but might as well get it right.
2. When is the secondary OAM cleared? Only at the beginning of the visible scanlines?
3. How are the two-cycle VRAM fetches split up? What addresses appear on the bus during each cycle? (I think it has to do with loading the VRAM address in two steps, but not sure.)
Some of the information on that page predates analysis of Visual2C02, so there are actually some minor differences:
1. The visible scanlines are 0-239, VBLANK is scanlines 241-260, the prerender scanline is 261, and the postrender scanline is 240.
2. The fetches are actually offset by 1 cycle: 0 is idle, 1-2 is NT byte, 3-4 is AT byte, 5-6 is low BG tile, 7-8 is high BG tile (and the H increment happens during 8), etc. The last 2 fetches in the scanline happen at 321-328 (with H increment during 328) and 329-336 (with H inc at 336), and two garbage fetches at 337-338 and 339-340 (and an idle cycle at 0 where all it does is pulse the Address Latch Enable output).
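The fetch schedule in point 2 can be sketched as a lookup from dot number to fetch type. This is only an illustration of the ranges quoted above; the names `BgFetch`, `bg_fetch_at`, and `increments_h` are made up for the example and do not come from any emulator or from Visual2C02.

```cpp
#include <cassert>

// Hypothetical names; the dot ranges are the ones given in the post above.
enum class BgFetch { Idle, Nametable, Attribute, PatternLow, PatternHigh, None };

// Background fetch in progress at a given dot (0-340) of a rendering scanline.
// Dots 257-320 belong to the sprite fetches and are reported as None here.
BgFetch bg_fetch_at(int dot) {
    assert(dot >= 0 && dot <= 340);
    if (dot == 0) return BgFetch::Idle;
    if (dot >= 337) return BgFetch::Nametable;  // the two garbage NT fetches
    const bool tile_fetch = (dot <= 256) || (dot >= 321 && dot <= 336);
    if (!tile_fetch) return BgFetch::None;
    switch (((dot - 1) & 7) / 2) {  // each fetch takes two dots
        case 0:  return BgFetch::Nametable;
        case 1:  return BgFetch::Attribute;
        case 2:  return BgFetch::PatternLow;
        default: return BgFetch::PatternHigh;
    }
}

// True on the dots where the horizontal position is incremented:
// 8, 16, ..., 256, and then 328 and 336.
bool increments_h(int dot) {
    assert(dot >= 0 && dot <= 340);
    const bool tile_fetch = (dot >= 1 && dot <= 256) || (dot >= 321 && dot <= 336);
    return tile_fetch && (dot & 7) == 0;
}
```

With this numbering the H increment always lands on the second dot of the high BG tile fetch, which is why the last increments of a scanline fall on 328 and 336.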
Regarding your questions:
1. Yes, that is correct - cycle 256 increments both H and V, and cycle 257 reloads H.
2. Secondary OAM is cleared during cycles 1-64 of every scanline where "rendering" fetches are done (including the pre-render scanline), evaluation occurs during cycles 65-256, and then the fetches happen at 257-264 (with the first 4 cycles being reads from secondary OAM and the second 4 cycles being the tile reads from VRAM), 265-272, 273-280, 281-288, 289-296, 297-304, 305-312, and 313-320.
3. For VRAM reads during rendering, during the 1st cycle the PPU outputs the address it's going to read and raises the ALE (Address Latch Enable) line for the first half of the cycle, and during the 2nd cycle it pulls /RD low for the entire cycle and performs the read. Reading or writing $2007 is notably different - after your $2007 access finishes, it waits 1 full cycle, then it outputs the address and raises ALE for one full cycle, then lowers it for one full cycle, then spends a 3rd cycle performing the read or write (with /RD or /WR pulled low for the entire cycle).
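As a rough sketch of answer 2, the sprite-side work on a visible scanline can be split by dot like this. The enum and function names are hypothetical, and the sketch only encodes the dot ranges given above (with dot 0 treated as idle); it says nothing about what happens on the pre-render line.

```cpp
#include <cassert>

// Hypothetical names; the ranges are the ones from answer 2 above.
enum class SpritePhase { Idle, ClearSecondaryOam, Evaluate, FetchSprites, None };

SpritePhase sprite_phase_at(int dot) {
    assert(dot >= 0 && dot <= 340);
    if (dot == 0)   return SpritePhase::Idle;
    if (dot <= 64)  return SpritePhase::ClearSecondaryOam;
    if (dot <= 256) return SpritePhase::Evaluate;
    if (dot <= 320) return SpritePhase::FetchSprites;
    return SpritePhase::None;  // 321-340: background prefetch for the next line
}

// Which of the 8 sprite slots is being fetched during dots 257-320.
int sprite_slot_at(int dot) {
    assert(dot >= 257 && dot <= 320);
    return (dot - 257) / 8;
}

// Within each 8-dot slot, the first 4 dots read secondary OAM and the
// last 4 dots are the tile reads from VRAM.
bool is_vram_tile_read(int dot) {
    assert(dot >= 257 && dot <= 320);
    return ((dot - 257) & 7) >= 4;
}
```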
Before I update the diagram: Where does the skipped tick occur? On the first pixel of the first scanline?
Here's an updated version based on the new information. Tell me if you spot any errors.
The linked SVG file has been updated as well.
Should be an idle cycle at the beginning of sprite evaluation (I lump secondary OAM clear in with it ATM). Fixed.
I'll separate secondary OAM clear and sprite evaluation, but I have a question first:
Is sprite evaluation done at all on the prerender scanline, or just the OAM clear? If it's done, then what prevents it from finding sprites on the first visible scanline?
Beannaich noted that the two dummy nametable fetches are both proper nametable fetches (since they're the same fetch as at the beginning of the next scanline), so the last one shouldn't be labeled "AT". Fixed.
Should VBLANK be before or after the main 240 scanlines? (I know that it doesn't really matter.)
From Quietust's post it looks like the visible scanlines should be first, so the numbering is still off in that respect in my diagram. I'm guessing that's based on how the PPU counts scanlines internally.
As far as I've understood, the proper NTSC vblank is 22 scanlines long and includes the prerender and postrender scanlines by the way.
Did some tracing with beannaich in visual2c02, and it looks like the vblank flag goes high on v/h:241/1 (note: h = 1, not 0), and low on v/h:261/1 (exactly 20 scanlines later).
Here's a trace around the beginning of v:241 (cycle, hpos, vpos, various stuff, and then vbl_flag at the end):
Quote:
330088 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330088 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330089 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330089 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330090 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330090 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330091 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330091 000 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 0
330092 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330092 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330093 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330093 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330094 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330094 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330095 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
330095 001 0f1 0000 0802 1e 1 0 1 1 1 0802 0 02 1
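Put together with Quietust's scanline numbering, the traced flag timing could be modelled roughly as below. This is a sketch with made-up names (`LineKind`, `VblFlag`), and it deliberately leaves out everything else that touches the flag, such as CPU reads of $2002.

```cpp
#include <cassert>

// Hypothetical names; scanline numbering as the PPU counts it internally.
enum class LineKind { Visible, PostRender, VBlank, PreRender };

LineKind line_kind(int scanline) {
    assert(scanline >= 0 && scanline <= 261);
    if (scanline <= 239) return LineKind::Visible;
    if (scanline == 240) return LineKind::PostRender;
    if (scanline <= 260) return LineKind::VBlank;
    return LineKind::PreRender;
}

struct VblFlag {
    bool set = false;

    // Call once per PPU dot with the current counters.
    void tick(int scanline, int dot) {
        if (dot != 1) return;           // both edges were seen at h = 1, not h = 0
        if (scanline == 241) set = true;
        else if (scanline == 261) set = false;
    }
};
```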
By the way (since this seems undocumented):
Additional node names (like vbl_flag in this case) can be found in http://www.qmtpro.com/~nes/chipimages/v ... denames.js. These can then be added to 'Trace these too:' (or searched for using 'Find:'). The simulation can be sped up a lot by disabling animations and logging. (Still takes 1h+ to simulate an entire frame though.)
Did more tracing in visual 2c02 and figured out some more stuff:
- The starting frame is an even frame, with the prerender scanline being 341 ticks long. (The simulation starts on the prerender scanline - the last line of the frame.)
- The next frame is an odd frame, with the skipped tick being at the end of the prerender scanline. Pixel: 339 jumps to Pixel: 0 instead of from 340 to 0 like on even frames.
- The sprite overflow flag is cleared on the same tick as the VBL flag, as expected.
I should've logged vbl_flag on the odd frame as well to see that it gets cleared one tick earlier as expected, but forgot. >:|
ulfalizer wrote:
Did more tracing in visual 2c02 and figured out some more stuff:
- The starting frame is an even frame, with the prerender scanline being 341 ticks long. (The simulation starts on the prerender scanline - the last line of the frame.)
- The next frame is an odd frame, with the skipped tick being at the end of the prerender scanline. Pixel: 339 jumps to Pixel: 0 instead of from 340 to 0 like on even frames.
- The sprite overflow flag is cleared on the same tick as the VBL flag, as expected.
I should've logged vbl_flag on the odd frame as well to see that it gets cleared one tick earlier as expected, but forgot. >:|
The cycle could still be skipped at the beginning depending on how visual 2C02 calculates the pixel value for the UI. If vbl_flag is cleared on dot 0 instead of dot 1, that'll prove that.
EDIT: Ran another frame of simulation, this time with vbl_flag logged, here are the results:
Code:
Cycle H V VBL
357368 000 105 1
357368 000 105 1
357369 000 105 1
357369 000 105 1
357370 000 105 1
357370 000 105 1
357371 000 105 1
357371 000 105 1
357372 001 105 0 <- VBL still cleared on dot 1..
357372 001 105 0
357373 001 105 0
357373 001 105 0
357374 001 105 0
357374 001 105 0
357375 001 105 0
357375 001 105 0
Cycle H V VBL
358724 153 105 0
358724 153 105 0
358725 153 105 0
358725 153 105 0
358726 153 105 0
358726 153 105 0
358727 153 105 0
358727 153 105 0
358728 000 000 0 <- Skipped clock
358728 000 000 0
358729 000 000 0
358729 000 000 0
358730 000 000 0
358730 000 000 0
358731 000 000 0
358731 000 000 0
Cycle H V VBL
1356 153 105 0
1356 153 105 0
1357 153 105 0
1357 153 105 0
1358 153 105 0
1358 153 105 0
1359 153 105 0
1359 153 105 0
1360 154 105 0 <- Non-skipped clock
1360 154 105 0
1361 154 105 0
1361 154 105 0
1362 154 105 0
1362 154 105 0
1363 154 105 0
1363 154 105 0
1364 000 000 0
1364 000 000 0
1365 000 000 0
1365 000 000 0
1366 000 000 0
1366 000 000 0
1367 000 000 0
1367 000 000 0
It's pretty odd. Does this mean that a nametable fetch is dropped on shortened pre-render lines?
Here's a log around the end of the prerender line on even frames (no skipped tick):
Code:
hpos vpos vramaddr_t vramaddr_v io_db io_ab io_rw io_ce rd wr ab ale db vbl_flag spr0_hit spr_overflow
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
154 105 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 1 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 1 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 1 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 1 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 1 1 1000 0 00 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
Here's the same area for odd frames:
Code:
hpos vpos vramaddr_t vramaddr_v io_db io_ab io_rw io_ce rd wr ab ale db vbl_flag spr0_hit spr_overflow
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
153 105 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
000 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 1 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
001 000 0000 0002 1e 1 0 1 1 1 2002 0 02 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
002 000 0000 0002 1e 1 0 1 0 1 2000 0 00 0 0 0
It looks like you go from v/h=261/339 to v/h=0/0 on odd frames instead of from v/h=261/340 to v/h=0/0 like on even frames. It also looks like the last tick of the last dummy nametable fetch "moves down" to fill the idle tick at the beginning of the first visible scanline.
I guess a reasonably close way to represent that in the diagram would be to say that the idle tick at the beginning of the first visible scanline is optional. (I messed up pretty bad in the last version of the diagram. The skipped tick is off by an entire scanline. I hate +-1 errors.)
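In counter terms, the behaviour seen in the two logs could be sketched like this. `PpuCounters` and `tick` are made-up names. The `rendering_enabled` condition is an assumption on my part: rendering was on throughout these traces, so the logs themselves don't show what happens with rendering off.

```cpp
#include <cassert>

// Hypothetical model of the h/v counters as they appear in the logs.
struct PpuCounters {
    int h = 0;               // dot, 0-340
    int v = 0;               // scanline, 0-261 (261 = pre-render)
    bool odd_frame = false;  // the simulation's starting frame is even
};

void tick(PpuCounters& c, bool rendering_enabled) {
    // Odd frames: v/h = 261/339 is followed directly by 0/0.
    // Assumption: this only happens while rendering is enabled.
    const bool skip = c.odd_frame && rendering_enabled && c.v == 261 && c.h == 339;
    if (c.h == 340 || skip) {
        c.h = 0;
        if (c.v == 261) {
            c.v = 0;
            c.odd_frame = !c.odd_frame;
        } else {
            ++c.v;
        }
    } else {
        ++c.h;
    }
}
```

Counting ticks from 0/0 back to 0/0 with this model gives 341 * 262 = 89342 for an even frame and one less for an odd frame.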
Attaching a diff.
Makes sense. Instead of the last cycle of pre-render, the first cycle of raster 0 can be skipped. I noticed some possible Visual2C02 bugginess, if you look at the AB for the tile fetches at the end of pre-render (happens on all lines, though):
Code:
321: $2000 <- name 1/2
322: $2000 <- name 2/2
..
329: $2001 <- name 1/2
330: $2000 <- name 2/2
the same occurs for the two dummy fetches:
337: $2002
338: $2000 <-+ why does the address go back to $2000 during the second cycle of nametable fetch?
339: $2002 |
340: $2000 <-+
The same thing seems to happen with attributes:
323: $23C0 <- attr 1/2
324: $2300 <- attr 2/2
I'm chalking that up to bugginess for now.
beannaich wrote:
Makes sense. Instead of the last cycle of pre-render, the first cycle of raster 0 can be skipped. I noticed some possible Visual2C02 bugginess, if you look at the AB for the tile fetches at the end of pre-render (happens on all lines, though):
Code:
321: $2000 <- name 1/2
322: $2000 <- name 2/2
..
329: $2001 <- name 1/2
330: $2000 <- name 2/2
the same occurs for the two dummy fetches:
337: $2002
338: $2000 <-+ why does the address go back to $2000 during the second cycle of nametable fetch?
339: $2002 |
340: $2000 <-+
The same thing seems to happen with attributes:
323: $23C0 <- attr 1/2
324: $2300 <- attr 2/2
I'm chalking that up to bugginess for now.
Might be related to the PPU loading the address in two stages too. Can't remember exactly how that works at the moment.
Here's an updated diagram with the skip location fixed and the lines renumbered to reflect how the PPU views things.
One thing that still isn't clear to me is how the idle ticks at the beginning of the scanlines affect rendering. Will there be a black column on the far left on the display, or does the timing work out differently somehow?
ulfalizer wrote:
One thing that still isn't clear to me is how the idle ticks at the beginning of the scanlines affect rendering. Will there be a black column on the far left on the display, or does the timing work out differently somehow?
There wouldn't be a black column because the rendering pipeline is initialized in the pre-render scanline. The two buffered tiles would cover the gap. The 16-bit shift registers for the bit-planes aren't clocked until dot 1, allowing for a complete display even with the skipped cycle.
beannaich wrote:
ulfalizer wrote:
One thing that still isn't clear to me is how the idle ticks at the beginning of the scanlines affect rendering. Will there be a black column on the far left on the display, or does the timing work out differently somehow?
There wouldn't be a black column because the rendering pipeline is initialized in the pre-render scanline. The two buffered tiles would cover the gap. The 16-bit shift registers for the bit-planes aren't clocked until dot 1, allowing for a complete display even with the skipped cycle.
Ah, yeah, I think I see how that works now. The "fill levels" at the different ticks would be
Code:
0: [xxxx xxxx] <- [xxxx xxxx]
1: [xxxx xxxx] <- [xxxx xxx-]
2: [xxxx xxxx] <- [xxxx xx--]
3: [xxxx xxxx] <- [xxxx x---]
4: [xxxx xxxx] <- [xxxx ----]
5: [xxxx xxxx] <- [xxx- ----]
6: [xxxx xxxx] <- [xx-- ----]
7: [xxxx xxxx] <- [x--- ----]
8: [xxxx xxxx] <- [xxxx xxxx]
9: [xxxx xxxx] <- [xxxx xxx-]
...
Here's a new version with some clarifying notes about rendering.
ulfalizer wrote:
Might be related to the PPU loading the address in two stages too. Can't remember exactly how that works at the moment.
That would almost definitely be it - if you pay attention, you'll notice that the last 2 digits of AB are always exactly the same as DB, because they're actually the same signal, aliased together to make it easier to read the logs.
Quietust wrote:
That would almost definitely be it - if you pay attention, you'll notice that the last 2 digits of AB are always exactly the same as DB, because they're actually the same signal, aliased together to make it easier to read the logs.
So for example:
Code:
H AB DB
$001: $2002 $00
$002: $20FF $FF
Is that why on the pinout there is AD0-7 and A8-13?
Nintendo has used multiplexed address and data buses to reduce pin count in a lot of its designs. Others I can think of are Nintendo 64 and Game Boy Advance.
tepples wrote:
Nintendo has used multiplexed address and data buses to reduce pin count in a lot of its designs. Others I can think of are Nintendo 64 and Game Boy Advance.
GBA with its cartridge interface, latching the 24-bit address through D0-15 and A16-23, then using 16-bit incrementing counters on sequential reads and only supplying A16-23? It certainly reduces pin count, but makes a mess out of the cartridge interface and imposes timing penalties for non-sequential accesses.
beannaich wrote:
[GBA seek/read] certainly reduces pin count, but makes a mess out of the cartridge interface and imposes timing penalties for non-sequential accesses.
As opposed to the NES PPU interface that imposes a timing penalty on all accesses?
tepples wrote:
As opposed to the NES PPU interface that imposes a timing penalty on all accesses?
Touche.
EDIT: Ulfalizer, I believe the shift registers are left-shift, with carry from bit 15 being used for output. Then every 8 clocks, the bottom 8 bits are loaded in with the most recently fetched data.
beannaich wrote:
tepples wrote:
As opposed to the NES PPU interface that imposes a timing penalty on all accesses?
Touche.
EDIT: Ulfalizer, I believe the shift registers are left-shift, with carry from bit 15 being used for output. Then every 8 clocks, the bottom 8 bits are loaded in with the most recently fetched data.
That's what I was trying to describe, but I used "top" instead of "bottom". (On the right side feels like top to me, but if you think about the bits "bottom" might make more sense.)
The output is determined by fine_x I think, which acts as a pointer into the left/top shift reg. That's why you need 16 bits - to make sure you never "run out" of pixels even if fine_x is set to 7.
ulfalizer wrote:
The output is determined by fine_x I think, which acts as a pointer into the left/top shift reg. That's why you need 16 bits - to make sure you never "run out" of pixels even if fine_x is set to 7.
Yeah, it's a tad weird though, since the value of fine_x really means that bit (15 - fine_x) is output.
Code:
         FEDCBA98 76543210
         00110011 00110011
         ^
         |
fine_x(0)+

         FEDCBA98 76543210
         00110011 00110011
                ^
                |
fine_x(7)-------+
Conceptually, the background does the following every dot for output (C++):
Code:
#include <cstdint>
using u8_t  = std::uint8_t;   // 8-bit unsigned
using u16_t = std::uint16_t;  // 16-bit unsigned

// get bits from shifters
u8_t attr = ( attr_shifter ) & 0x3U;
// fine_x ^ 0xF is the same as 15 - fine_x for fine_x in 0..7
u8_t bit0 = ( bit0_shifter >> ( fine_x ^ 0xFU ) ) & 0x1U;
u8_t bit1 = ( bit1_shifter >> ( fine_x ^ 0xFU ) ) & 0x1U;
// clock shifters to the next pixel
bit0_shifter = u16_t( bit0_shifter << 1 );
bit1_shifter = u16_t( bit1_shifter << 1 );
// 'clock' in new data once every 8 dots
if ( ( hclock & 0x7U ) == 0x0U )
{
    attr_shifter = u8_t( attr_shifter << 2 ) | attr_fetch; // attr_shifter is only clocked once per tile (every 8 dots)
    bit0_shifter |= bit0_fetch; // low 8 bits are empty after 8 left shifts
    bit1_shifter |= bit1_fetch;
}
u8_t color = ( attr << 2 ) | ( bit1 << 1 ) | bit0;
if ( color ) // don't output 0
{
    output_bg( color );
}
beannaich wrote:
ulfalizer wrote:
The output is determined by fine_x I think, which acts as a pointer into the left/top shift reg. That's why you need 16 bits - to make sure you never "run out" of pixels even if fine_x is set to 7.
Yeah, it's a tad weird though, since the value of fine_x really means that bit (15 - fine_x) is output.
Makes things consistent with the non-fine scroll at least, with increasing values pushing things to the left.
Here's a new version that includes the OAM clears. Spot any errors?
Removed the shift reg note as well, since it was a bit ambiguous and confusing.
Edit: Fixed a minor error on the 'Visible scanlines' line around h=279-305.
Old version was a bit misleading. The sprite tile vram fetches still occur on the pre-render line. Added a note about sprite evaluation instead.
Hi ulfalizer,
I would like to thank you for your work on this diagram (and everybody who helped you of course)!
It will be very useful for people like me who don't speak English.
Rid wrote:
Hi ulfalizer,
I would like to thank you for your work on this diagram (and everybody who helped you of course)!
It will be very useful for people like me who don't speak English.
No problem. Your English seems fine though.
If the diagram looks correct now, perhaps it could be linked from relevant parts of the wiki. I could update the out-of-date pages while I'm at it.
That is a very nice diagram.
Bisqwit wrote:
That is a very nice diagram.
Thanks.
Bisqwit wrote:
That is a very nice diagram.
Biggest understatement since nesdev's conception.
WedNESday wrote:
Bisqwit wrote:
That is a very nice diagram.
Biggest understatement since nesdev's conception.
Glad you like it. I'm thinking about doing another one with all the cpu/ppu/ciram/cart address/data bus connections, since I'm still a bit hazy on some of the details of that.
Talked to Quietust and did some more tracing in visual 2c02, and it looks like the secondary OAM clear and sprite evaluation are skipped on the pre-render line. Updating the diagram to reflect this.
This might also mean that it's possible for sprites to affect the rendering of the first scanline somehow.
I'm just starting out, trying to understand.
Are you sure that the cycle (NT-> AT-> LBG-> HBG) should continue to tick 256 (32 cycles), and not to 240 (30 cycles)?
After all, two of those fetch cycles were already done on the previous line.
Plazm wrote:
I'm just starting out, trying to understand.
Are you sure that the cycle (NT-> AT-> LBG-> HBG) should continue to tick 256 (32 cycles), and not to 240 (30 cycles)?
After all, two of those fetch cycles were already done on the previous line.
There are a total of 34 BG tuple fetches per scanline. 32 in the active portion (h:1-256), and two at the end of the raster for the NEXT line (h:321-336). Of these 34 fetches, only 33 are ever actually used in rendering, the tuple at h:249-256 isn't necessary but some mappers rely on this fetch for timing purposes.
Each tuple takes 8 dots to fetch, in the following pattern (shown in relative clocks):
Code:
+0: Name address is on the bus
+1: Name data is on the bus (read occurs here)
+2: Attr address is on the bus
+3: Attr data is on the bus (read occurs here)
+4: Bit0 address is on the bus
+5: Bit0 data is on the bus (read occurs here)
+6: Bit1 address is on the bus
+7: Bit1 data is on the bus (read occurs here)
The reason for the 2 cycles per fetch is due to the AD0-AD7 pins being multiplexed between address and data, as previously mentioned. During the first of the two cycles, something like $2010 would be seen, and during the second cycle something like $2055 would be seen. The data fetched is in the bottom 8 bits, while the top 6 remain "open bus" from the address.
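To make the two-cycle split concrete, here is a small sketch of what an address latch on AD0-AD7 has to do. The struct and method names are invented for the example, and the whole thing is a simplified model that ignores half-cycle timing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an external latch on the multiplexed AD0-AD7 pins.
struct PpuBusLatch {
    std::uint8_t latched_low = 0;

    // First cycle: ALE is raised and AD0-AD7 carry the low address byte,
    // so the latch captures it.
    void address_cycle(std::uint16_t bus) {
        latched_low = static_cast<std::uint8_t>(bus & 0xFF);
    }

    // Second cycle: AD0-AD7 now carry data, A8-A13 still hold the upper
    // address bits. The address seen by memory is the upper bits from the
    // bus combined with the latched low byte.
    std::uint16_t memory_address(std::uint16_t bus) const {
        return static_cast<std::uint16_t>((bus & 0x3F00) | latched_low);
    }
};
```

Using the example values from the post: after `address_cycle(0x2010)`, a bus value of `0x2055` during the read cycle still addresses `0x2010`, and the `0x55` in the low byte is the data being returned.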
beannaich wrote:
Of these 34 fetches, only 33 are ever actually used in rendering
Why 33? (There are 32 tiles in a line.)
Plazm wrote:
beannaich wrote:
Of these 34 fetches, only 33 are ever actually used in rendering
Why 33? (There are 32 tiles in a line.)
To allow for fine horizontal scrolling. ((256 pixels + 8 pixel fine scroll) / 8 pixels per tile) = 33 tiles.
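The same arithmetic as a tiny worked example: count how many 8-pixel tiles a 256-pixel line touches when it starts `fine_x` pixels into the first tile. The function name is just for illustration.

```cpp
#include <cassert>

// Number of 8-pixel tiles overlapped by 256 visible pixels that start
// fine_x pixels into the first tile.
int tiles_spanned(int fine_x) {
    assert(fine_x >= 0 && fine_x <= 7);
    const int last_pixel = fine_x + 255;  // offset of the last visible pixel
    return last_pixel / 8 + 1;            // index of its tile, plus one
}
```

With `fine_x` at 0 the line fits exactly in 32 tiles; any other value makes it straddle a 33rd.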
Thanks a lot. Now this section is clear to me. It's also clear why there are two prefetched tiles, not just one.
Here's an updated version that points out that the last tile fetch is unused and adds a note about fine x.
Edit: upper<->lower. I keep thinking right->upper for some reason.
ulfalizer wrote:
Here's an updated version that points out that the last tile fetch is unused and adds a note about fine x.
The part about the shift registers is wrong. Fine X indexes into the upper 8-bits, inversely (15-fine x). They shift left, every 8 dots the lower 8-bits are reloaded with the fetched data.
beannaich wrote:
ulfalizer wrote:
Here's an updated version that points out that the last tile fetch is unused and adds a note about fine x.
The part about the shift registers is wrong. Fine X indexes into the upper 8-bits, inversely (15-fine x). They shift left, every 8 dots the lower 8-bits are reloaded with the fetched data.
See the edit I made just before you posted that.
Edit: Or slightly after I guess.
ulfalizer wrote:
See the edit I made just before you posted that.
Edit: Or slightly after I guess.
Well played
There may be some new info on OAM today, I wrote a test for it, just need someone with a TV with overscan and a powerpak to test it for me.
If I wanted to link the diagram from the wiki, would it be best to just link directly to the forum, or is there some possibly more permanent place I could put it? (Don't have any 24/7 servers myself. :/)
In that case, it'd be best to upload the diagram to the wiki. I've added the "trusted" group to your account, so you don't need to wait to become autoconfirmed. MediaWiki is supposed to support rendering SVGs to PNG at display time, as it does on Wikimedia Commons and Wikipedia. But MediaWiki relies on external programs for this rendering, and those might not be correctly configured on the server. I'd recommend uploading the SVG and the PNG, and then if you get an error about running "convert" after uploading the SVG, report it in the wiki section of the BBS.
Linked it from a few different relevant spots on the wiki. I know some browsers still don't scroll nicely through large SVG files, so I made the PNG the primary link.
ulfalizer wrote:
Linked it from a few different relevant spots on the wiki. I know some browsers still don't scroll nicely through large SVG files, so I made the PNG the primary link.
On the rendering section it says dot 0 outputs the first pixel of the scanline. Is that right? Wouldn't that cause a missing column of color at pixel 16?
Code:
clock 340: 1111 1111 1111 1111
clock 0: 1111 1111 1111 1110
...
clock 7: 1111 1111 0000 0000
clock 8: 1111 1110 1111 1111
^
missing pixel -----+
Or is there some race condition I'm unaware of?
beannaich wrote:
ulfalizer wrote:
Linked it from a few different relevant spots on the wiki. I know some browsers still don't scroll nicely through large SVG files, so I made the PNG the primary link.
On the rendering section it says dot 0 outputs the first pixel of the scanline. Is that right? Wouldn't that cause a missing column of color at pixel 16?
Code:
clock 340: 1111 1111 1111 1111
clock 0: 1111 1111 1111 1110
...
clock 7: 1111 1111 0000 0000
clock 8: 1111 1110 1111 1111
^
missing pixel -----+
Or is there some race condition I'm unaware of?
I assumed it worked that way from some previous discussion in #nesdev but haven't actually checked in Visual 2C02. Tick 8 would have to both reload the lower shift reg and then shift it once into the upper one if the first pixel is output at dot 0 it seems, which does sound a bit weird.
If the first pixel is output at dot 1, then maybe the updated timing is identical to the old one, only we've renamed dots according to the internal PPU counters.
Did some initial poking around in visual2C02, what I saw was ... strange.
tile_l appeared as $00ff during pre-render, and ROR'd starting at dot 76, ROR'ing one bit at a time until its value was $f807. The next clock was not $fc03 as you'd expect, it was instead $ff03. tile_h appears to follow a similar pattern, but delayed by about 6 dots.
The condensed version of this behavior (intermediate cycles removed, each line represents a single dot) is:
Code:
h v t_l t_h
04b 105 00ff ffff
04c 105 807f ffff
04d 105 c03f ffff
04e 105 e01f ffff
04f 105 f00f ffff
050 105 f807 ffff
051 105 ff03 00ff
052 105 ff81 807f
053 105 ffc0 c03f
054 105 ffe0 e01f
055 105 fff0 f00f
056 105 fff8 f807
057 105 fffc fc03
058 105 fffe fe01
EDIT: Both high and low tile fetches were $ff from the chr-rom area.
EDIT: After more poking around, I got the access pattern for a single tuple. I was correct about the shift registers not being clocked at dot 0. I was, however, wrong about them being left-shift.
Code:
h v t_l t_h t_buf
000 000 aaaa aaaa aa
001 000 aaaa aaaa aa
002 000 d555 5555 aa
003 000 eaaa 2aaa aa
004 000 f555 1555 aa
005 000 faaa 0aaa aa
006 000 fd55 0555 aa
007 000 feaa 02aa aa
008 000 ff55 0155 aa
009 000 aaaa aaaa aa
Note about the above: t_h is inverted, so I un-inverted it for clarity. Shifting begins on dot 2, with data being clocked in on dot 9.
Figured out what's going on with the right-shifting. Turns out the pattern stored in tile_l is both reversed and bit-flipped, so that e.g. $01 turns into $7F. $AA was an unlucky choice of test data, as that turns into $AA.
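The "reversed and bit-flipped" transform is easy to check by hand. Here is a small sketch (function names are made up for the example) that reproduces both values mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the bit order of a byte (bit 0 <-> bit 7, bit 1 <-> bit 6, ...).
std::uint8_t reverse_bits(std::uint8_t b) {
    std::uint8_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = static_cast<std::uint8_t>((r << 1) | ((b >> i) & 1));
    }
    return r;
}

// How a fetched pattern byte shows up in the traced tile_l node,
// according to the observation above: reversed, then inverted.
std::uint8_t as_seen_in_tile_l(std::uint8_t pattern) {
    return static_cast<std::uint8_t>(~reverse_bits(pattern));
}
```

$01 reverses to $80 and inverts to $7F, while $AA reverses to $55 and inverts back to $AA, which is why $AA hid the effect.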
Here's some related Visual 2C02 "bug reports":
- "// lower pattern bit shift register, NOT inverted!" in nodenames.js should probably say something like "// lower pattern bit shift register, reversed and inverted ($01 -> $7F)". Haven't checked what holds for tile_h yet, but it might be wrong too.
- If you watch just tile_l1, it ends up with non-0/1 values like 3E and 3F. That's probably a bug. The other bits always get 0/1 at least.
Edit: tile_h seems to be reversed but *not* inverted. Maybe the nodes could be renamed to turn the shift registers into left-shift instead.
Some quick questions:
Do changes in pixel_color correspond immediately to changes in what's currently being drawn? The waveform graph seems to lag a bit behind it, but maybe that's just due to the way it's drawn.
Does pixel_color include sprites? If so, the "BG pixel color, sent to EXT pins" comment seems a bit misleading.
Starting a miniature documentation on sprites. So far:
Dot 1-64 with rendering enabled: reading $2004 gives $FF.
Dot 65 with rendering enabled: start of evaluation as previously observed.
Dot 65-256 with rendering enabled: reading $2004 gives the currently fetched byte from OAM, that is, the byte being used by OAM evaluation.
Dot 257 with rendering enabled: $2003 is reset to 0.
Unanswered questions: Is $2003 driven to 0 after dot 257? If not, is it possible to write $2003 after 257 and what is the effect of this on OAM evaluation? What happens when enabling rendering at random points during a scanline (Particularly, dots 1-64)?
I'm pretty sure the first pixel is output during dot 2 now. Putting $FF in $0000 and $0001 (second one needed to get a sprite zero hit during scanline 1) and disabling the $2003/$2004 writes so you get an all-0 OAM entry for sprite 0 you can see the following:
- the tile_l/h shift registers are shifted at dot 2
- pixel_color changes at dot 2
- the sprite 0 flag goes high during dot 2 (of scanline 1)
I tried using $7F, $7F and $3F, $3F for the pattern as well, and the sprite 0 flag gets set during dot 3 and dot 4 like you'd expect. The sprite 0 flag seems to go high during the second half of the PPU tick btw...
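The three experiments fit a simple rule: the flag goes high at dot 2 plus the x position of the first opaque pixel. A sketch of that rule, assuming the same setup as above (sprite 0 at X = 0, background and sprite using the same pattern row); the function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// X position (0-7) of the leftmost set bit in a pattern row, or -1 if none.
int first_opaque_x(std::uint8_t row) {
    for (int x = 0; x < 8; ++x) {
        if (row & (0x80 >> x)) return x;
    }
    return -1;
}

// Dot at which spr0_hit went high in the experiments above, assuming
// sprite 0 sits at X = 0 and the background uses the same pattern row.
// Returns -1 if the row has no opaque pixels.
int sprite0_hit_dot(std::uint8_t row) {
    const int x = first_opaque_x(row);
    return x < 0 ? -1 : 2 + x;
}
```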
For future reference, here's some interesting signals to trace: spr0_hit, tile_l/h, pixel_color, vid_
OK, this is how it seems to work (though I don't know why):
Sprite zero hits take effect as if the image starts drawing at h=2 (the third PPU tick on the scanline). The actual pixel/dot corresponding to that point is output two PPU ticks later though, at h=4 (and the last pixel is correspondingly output at h=260).
Any idea why the PPU delays the pixels like that?
Here's a new version with the updated sprite 0/pixel output timing information. What assumptions do sprite 0 test roms make w.r.t. timing btw?
Don't think I've screwed up anywhere, but would be nice to have some third-party verification. beannaich?
spr0_hit and vid_ are good nodes to watch.
If more people want to help out, I could put together a short "user's manual" for Visual 2C02 by the way.
Please do, all this information is absolutely fantastic! I'd love to be able to understand Visual 2C02 better. It's proving to be an awesome tool.
Okay, here goes, per the labels in the attached picture:
Code:
======== (1) ========
This starts, stops, and resets the simulation. The 'Scanline:' and 'Pixel:'
status displays are based on internal PPU counters and should be
self-explanatory. The starting state to use when resetting the simulation can
be selected with the radio buttons near (5).
======== (2) ========
This is a list of register accesses to be carried out, going in sequence from top
to bottom. The simulated 2C02 isn't attached to any other simulated devices, and
the way to access registers is by adding them to this list. For example, 'W 1 1e'
decodes as 'write $1E to $2001'. Reads can be significant for some registers,
which is why they're included. (Note that you don't get any value "back" for reads
though.)
Register accesses can be removed by clicking on the '-' and added by clicking
on the '+'. A '-' in the R/W column means 'no-op' (use the numpad to input the
'-').
The '*' is just to show the current access. You can click on it to jump to that
point in the sequence.
======== (3) ========
Memory display. Can also be used to modify memory.
- 3F00-3F1F is the palettes. Some of the cells are mirrors.
- S000-S11F is OAM. For example, S000 would be the y position for sprite 0.
* S000-S0FF is the primary OAM.
* S100-S11F is the secondary OAM (normally not directly accessible).
- 0000-03FF is the pattern tables. This 1KB segment is mirrored eight times
to fill out the entire CHR space.
- 2000-23FF is nametables. The simulation uses a kind of "one-screen low"
mirroring, and the data here is mirrored to fill out the entire nametable
space.
======== (4) ========
This is a video output waveform display. It's based on the vid_ node. If you
run the simulation without changing anything first, you will just see some
level changes and squiggly stuff here near the end of each scanline, which is
the NTSC hsync/colorburst, etc.
======== (5) ========
Pretty self-explanatory. Node numbers or node names (e.g. "spr0_hit") can be
entered in the Find: box to locate them in the diagram. (This won't be used
here.)
======== (6) ========
Tracing stuff. Additional nodes to trace can be added in the "Trace these too:"
box as a space-separated list (e.g. "spr0_hit tile_l vid_").
The cycle column is based on the master clock, which the PPU divides by four.
Each line in the trace is actually a half-cycle, so there's 4*2 = 8 lines per PPU
tick.
======== Finding nodes to trace =======
A list of nodes can be found in
http://www.qmtpro.com/~nes/chipimages/visual2c02/nodenames.js . For nodes that
have many bits, e.g. finex0, finex1, finex2, you can trace all of them at once
by using 'finex' as the node name.
======== Performance hint ========
Turning off tracing and unticking "Animate during simulation" and "Show sprite
RAM contents" can massively speed up the simulation.
======== Tutorial: Outputting some pixels =======
1. Put 81 at pattern table address 0000. This will make the palette index for
each pixel of the first row of the first tile, in order, '10000001'. (Putting
81 at 0008 as well would make it '30000003', etc.)
(Since the nametables are initialized to 0 by default, this is the tile that
will be used for all the background tiles by default.)
2. Change the value of 3F01, which is the BG palette entry that will be used.
20 seems to work fine.
3. Run the simulation (and note the Performance hint section). The first line is the
pre-render line, so nothing will be seen here. At scanline 1, you should see
some pixels being output in the waveform display corresponding to the 81
pattern.
======== Some things to look out for ========
- Note that the default register writes might move around sprite 0 and do
other stuff, so you might have to remove some of them or manually modify
memory later to get the state you want.
- There's a bunch of sprites sitting at (0,0). If sprites are enabled and all
use a black tile, this means you will see black for the first 8 pixels of
scanlines 1-8 (sprites don't start drawing until scanline 1 at the earliest
since the y OAM coordinate is one less than the actual position).
Edit: s/mirrored four times/mirrored eight times/
Edit 2: s/2000-23F0/2000-23FF/
Edit 3: Clarify register access column
Edit 4: Clarify cycles in the trace window
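For anyone wondering how step 1 of the tutorial turns 81 into '10000001' (and into '30000003' once 0008 holds 81 as well), here is a small C++ sketch of combining the two bit planes of one tile row. The function name decodeTileRow is just something made up for this example.

```cpp
#include <array>
#include <cstdint>

// Combine one row of a tile's two bit planes into eight 2-bit palette indices.
// Bit 7 of each plane byte is the leftmost pixel of the row.
std::array<std::uint8_t, 8> decodeTileRow(std::uint8_t lowPlane, std::uint8_t highPlane) {
    std::array<std::uint8_t, 8> row{};
    for (int x = 0; x < 8; ++x) {
        const int bit = 7 - x;                     // leftmost pixel first
        const int lo = (lowPlane >> bit) & 1;      // plane at address N
        const int hi = (highPlane >> bit) & 1;     // plane at address N + 8
        row[x] = static_cast<std::uint8_t>((hi << 1) | lo);
    }
    return row;
}
```

With the tutorial's values, decodeTileRow(0x81, 0x00) gives 1,0,0,0,0,0,0,1 and decodeTileRow(0x81, 0x81) gives 3,0,0,0,0,0,0,3.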
ulfalizer wrote:
Don't think I've screwed up anywhere, but would be nice to have some third-party verification. beannaich?
I'll be confirming a lot of this information as I begin implementing this into my emulator.
After some discussion in #nesdev I felt a little bad about maybe making basic emulator implementation seem way trickier than it really is, so I added a note to the diagram.
Edit: Delete a "not".
ulfalizer wrote:
After some discussion in #nesdev I felt a little bad about maybe making basic emulator implementation seem way trickier than it really is, so I added a note to the diagram.
I could never get low level rendering to work until I had a firm grasp on high level operation. That should be the natural progression of any emulator: start at a high level, then slowly convert things to low level. To anyone using the PPU diagram when writing a new emulator, especially with no previous experience, all I have to say is "Good luck".
A wiki page for the Visual 2C02 tutorial with some small corrections is now at
http://wiki.nesdev.com/w/index.php/Visual_2C02 .
There was some contradictory information in the notes on where the first pixel is output.
So I know that there is some curiosity about why you've observed pixel rendering being delayed about 4 dots. I have a theory which may or may not make sense, but here it goes.
The pattern of behavior for the PPU is basically the following:
dots: 0-64 = clear S-OAM
dots: 65-256 = fill S-OAM with data for next scanline.
at the same time:
dots 0-256 = render background and sprite pixels for the current scanline.
But, as far as I understand that means that there is a possible conflict if this is done naively. If sprite 0's x position is say 32 pixels into the scanline, that means that its data will be overwritten with $FFs well before it is rendered, making it end up as garbage!
My hypothesis is that the 4 pixel delay for rendering is enough to allow an algorithm (whose specifics I have not worked out yet) to avoid this conflict.
Thoughts?
There are three separate areas of OAM:
- The normal display list, 64 entries, written with $2003/$2004/$4014
- Next line OAM, 8 entries, for sprites on the next scanline
- Counters and shifters
From 65-256, the PPU scans the display list for in-range entries and copies the first 8 to next line OAM. While this is going on, the counters and shifters are feeding the compositor. From 257-320, the PPU fetches an 8x1 pixel pattern sliver for each of the sprites in next line OAM while copying it to the counters and shifters.
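As a rough illustration of the scan described above, here is a high-level C++ sketch that picks the first 8 in-range entries from the 64-entry display list. It is not cycle-accurate: it ignores the per-dot timing and anything to do with sprite overflow, and it assumes 8-pixel-tall sprites unless told otherwise. OamEntry, NextLineOam and evaluateSprites are names invented for the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One display list entry, in OAM byte order.
struct OamEntry {
    std::uint8_t y, tile, attr, x;
};

// The "next line OAM": at most 8 entries for the following scanline.
struct NextLineOam {
    std::array<OamEntry, 8> entries{};
    std::size_t count = 0;
};

// Scan the display list during `scanline` and copy the first 8 in-range
// entries. The selected sprites are the ones drawn on the next scanline.
NextLineOam evaluateSprites(const std::array<OamEntry, 64>& displayList,
                            int scanline, int spriteHeight = 8) {
    NextLineOam out;
    for (const OamEntry& e : displayList) {
        const int row = scanline - e.y;              // row of the sprite on this line
        if (row < 0 || row >= spriteHeight) continue; // not in range
        if (out.count == out.entries.size()) break;   // already have 8
        out.entries[out.count++] = e;
    }
    return out;
}
```

The sprite drawing units would then be loaded from the returned entries during 257-320, which is why overwriting this list during the next line doesn't disturb what is currently being drawn.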
Ah, that's a good point, I forgot about that. Oh well, back to the drawing board.
I wonder if it would be useful to use a 2 dimensional array of function pointers to implement this diagram. Then do something like this for every pixel:
Code:
(ppu->*runStep[screenX][screenY])();
An array of 90,000 function pointers would kill your cache.
tepples wrote:
An array of 90,000 function pointers would kill your cache.
You are right.
I need another way to organize my PPU code; right now it is a big list of if statements.
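One possible middle ground, sketched in C++ below: since most of the per-dot behaviour depends only on the dot and not on the scanline, a table with one small entry per dot (341 bytes) can replace the roughly 341 x 262 function pointers. The sketch only covers the sprite phases mentioned in this thread (clear, evaluate, fetch), and the exact start of the clear phase (dot 1 here) is an assumption. The enum and table names are made up.

```cpp
#include <array>
#include <cstdint>

// What the sprite logic is doing at a given dot of a rendering scanline.
enum class SpritePhase : std::uint8_t { Idle, ClearSecondary, Evaluate, FetchPatterns };

constexpr int kDotsPerScanline = 341;  // dots 0-340

// Build the table once. It is indexed by dot only, so it stays tiny.
std::array<SpritePhase, kDotsPerScanline> makeSpritePhaseTable() {
    std::array<SpritePhase, kDotsPerScanline> table{};  // all Idle initially
    for (int dot = 0; dot < kDotsPerScanline; ++dot) {
        if (dot >= 1 && dot <= 64) {
            table[dot] = SpritePhase::ClearSecondary;   // assumed range 1-64
        } else if (dot >= 65 && dot <= 256) {
            table[dot] = SpritePhase::Evaluate;
        } else if (dot >= 257 && dot <= 320) {
            table[dot] = SpritePhase::FetchPatterns;
        }
    }
    return table;
}

const std::array<SpritePhase, kDotsPerScanline> kSpritePhase = makeSpritePhaseTable();
```

The per-tick code then becomes a switch on kSpritePhase[dot], with the scanline type (visible, pre-render, vblank) checked separately, which should be a lot friendlier to the cache than a 2D pointer table.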
proxy wrote:
So I know that there is some curiosity about why you've observed pixel rendering being delayed about 4 dots. I have a theory which may or may not make sense, but here it goes.
The pattern of behavior for the PPU is basically the following:
dots: 0-64 = clear S-OAM
dots: 65-256 = fill S-OAM with data for next scanline.
at the same time:
dots 0-256 = render background and sprite pixels for the current scanline.
But, as far as I understand that means that there is a possible conflict if this is done naively. If sprite 0's x position is say 32 pixels into the scanline, that means that its data will be overwritten with $FFs well before it is rendered, making it end up as garbage!
My hypothesis is that the 4 pixel delay for rendering is enough to allow an algorithm (whose specifics I have not worked out yet) to avoid this conflict.
Thoughts?
Like tepples said, the contents of secondary OAM don't matter while sprite zero hit detection is taking place. There are eight internal sprite output units (my terminology - can't think of anything clearer) that handle the actual sprite drawing - the secondary OAM is just a list of sprites with which to initialize them (during ticks 257-320).
What I suspect is going on is that the first pixel leaves the shifters at h=2. The palette entry for the pixel then needs to be looked up, which takes another two ticks, so that the first pixel is drawn at h=4. (Sprite zero hit detection only needs to know whether the pattern bits are both zero, and so doesn't need the palette lookup.)
Edit: "Sprite drawing units" would be a bit clearer.
The reloading ticks were off by one. The shifters are actually reloaded at ticks 9,17,25,...,257 and not 8,16,24,...,256.
Quote:
The background shift registers shift during each of dots 2...257 and 322...337, inclusive.
I'm trying to make sense out of Visual 2C02 and was wondering about this. If I understand this correctly, tile_l0 - tile_l15 would be the low BG shift register, and new data gets placed in 8-15.
Now, if I put 0x01 at 0x0000, I can see that it arrives in the shift reg at the dots mentioned in the diagram (9, 17, 25 etc), but isn't the shifting part off? Going from pixel 1 to pixel 2 shifts the register, and going from 257 to 258 does not from what I can tell. If dots, ticks and what the simulation calls "Pixel:" are the same thing (...), isn't this off by one? Shouldn't it be 1-256, 321-336?
fred wrote:
Quote:
The background shift registers shift during each of dots 2...257 and 322...337, inclusive.
I'm trying to make sense out of Visual 2C02 and was wondering about this. If I understand this correctly, tile_l0 - tile_l15 would be the low BG shift register, and new data gets placed in 8-15.
Now, if I put 0x01 at 0x0000, I can see that it arrives in the shift reg at the dots mentioned in the diagram (9, 17, 25 etc), but isn't the shifting part off? Going from pixel 1 to pixel 2 shifts the register, and going from 257 to 258 does not from what I can tell. If dots, ticks and what the simulation calls "Pixel:" are the same thing (...), isn't this off by one? Shouldn't it be 1-256, 321-336?
Bit busy with moving at the moment, but I'll leave a quick reply for now without double-checking stuff in Visual 2C02.
When I say that the shift registers shift "during" a particular dot, I mean that the effect of the shift is seen at that dot (in the real thing there'd also be a short transition period at the beginning of the dot before things settle down). Dot 2 is the earliest tick where you see the shift registers shift, and dot 257 is the last tick. If you completely ignore transition delays (like the simulator usually does), I guess it would be most accurate to say that the shift registers shift between dot 1 and dot 2, etc.
Suggestions for how things could be rephrased to be less ambiguous would be welcome. I'm primarily a SW guy, so it's possible that I'm missing some standard terminology.
(Note that the above usage is consistent with things being loaded or cleared "at" a particular dot meaning that they change at that dot, etc. Perhaps it would be clearer to use "at" instead of "during" for the shifts too.)
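To pin down the convention, here is a tiny C++ sketch using the "effect is visible at that dot" reading: a shift that is visible at dot N happens on the transition from dot N-1 to dot N. The function names are invented for the example, and the reload predicate only covers the 9...257 range stated in this thread.

```cpp
// True if the background shift registers show a new shifted value at `dot`,
// i.e. the shift happened between dot - 1 and dot.
bool bgShiftVisibleAt(int dot) {
    return (dot >= 2 && dot <= 257) || (dot >= 322 && dot <= 337);
}

// True if freshly fetched tile data is visible in the shifters at `dot`
// (ticks 9, 17, 25, ..., 257). Other reload points are not modelled here.
bool bgReloadVisibleAt(int dot) {
    return dot >= 9 && dot <= 257 && (dot % 8) == 1;
}
```

An emulator that applies the shift at the end of a tick would use the ranges 1-256 and 321-336 for the same behaviour, which is the off-by-one being discussed.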
Ah, I see how you mean. I didn't think of it that way! But it is true that the effect of the shift is first seen at dot 2. Hmm.