So I'm playing around with splitting the screen twice without a mapper. The first split will use sprite0 and be around scanline 32. The second split will be around scanline 208. I'm trying to use the dmc irq to approximate it since it doesn't have to be exact (I'm extending vblank for chr-ram writes).
The main loop (without all the init code):
Code:
; ...
ldx #$40 ;
stx PORT2 ;
ldx #0 ;
stx SND_CLOCK ;
; ...
cli
_loop:
jmp _loop
This is the code called during the nmi handler. The sample is located at $C000 and is $401 bytes of $00.
Code:
lda sndChannelEnable ; disable dmc channel
and #$0F ;
sta SND_CLOCK ;
lda #%10001111 ; set sample ctrl
sta SND_DMC_CTRL ;
lda #$00 ; set dac
sta SND_DMC_DA ;
lda #$00 ; set sample addr
sta SND_DMC_ADDR ;
lda #$40 ; set sample length
sta SND_DMC_DL ;
lda sndChannelEnable ; enable dmc channel
ora #$10 ;
sta sndChannelEnable ;
sta SND_CLOCK ;
The IRQ handler is empty for now:
Code:
IRQBRK:
sei
rti
I've been able to split the screen with mapper IRQs, but I'm not sure if I'm initializing the APU correctly. Any thoughts?
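As a quick sanity check on the register values above, here is a minimal Python sketch that decodes them. It assumes the labels SND_DMC_CTRL, SND_DMC_ADDR and SND_DMC_DL map to $4010, $4012 and $4013 (the post does not say so), and it uses the NTSC rate table and the address/length formulas as documented on the NESdev wiki. The names `decode_dmc`, `settings` and `playback_frames` are made up for illustration.

```python
# NTSC DMC rate table (CPU cycles per output bit), indexed by the low nibble of $4010.
NTSC_DMC_PERIODS = [428, 380, 340, 320, 286, 254, 226, 214,
                    190, 160, 142, 128, 106, 84, 72, 54]
NTSC_CPU_CYCLES_PER_FRAME = 29780.5  # average over odd and even frames


def decode_dmc(ctrl, addr, length):
    """Decode the values written to $4010, $4012 and $4013."""
    return {
        "irq_enabled": bool(ctrl & 0x80),
        "loop": bool(ctrl & 0x40),
        "cycles_per_bit": NTSC_DMC_PERIODS[ctrl & 0x0F],
        "start_address": 0xC000 + addr * 64,
        "length_bytes": length * 16 + 1,
    }


# Values from the NMI code above (assumed register mapping, see lead-in).
settings = decode_dmc(0b10001111, 0x00, 0x40)

# Rough time to play the whole sample: bytes * 8 bits * cycles per bit.
playback_cycles = settings["length_bytes"] * 8 * settings["cycles_per_bit"]
playback_frames = playback_cycles / NTSC_CPU_CYCLES_PER_FRAME

print(settings)
print(playback_cycles, round(playback_frames, 1))
```

Under these assumptions the address and length decode to $C000 and $401 bytes as intended, but at rate $F the sample takes roughly 442,800 CPU cycles, close to 15 frames. An IRQ near scanline 208 (about 229 lines after the NMI) would need a sample on the order of 60 bytes instead.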
A while back, I made a demo that does two splits, and does it to within a granularity less than one scanline. I plan to post it tonight to see what someone else can do with it.
never-obsolete wrote:
That will deadlock your program.
1) I is already set by the IRQ. The SEI does nothing.
2) RTI will pull status from the stack, resulting in I being cleared again
3) Because RTI will clear I, another IRQ will occur immediately, causing a deadlock in the IRQ handler.
You need to acknowledge the IRQ. I forget how this is done on the DMC. I think you just read $4015, but you'll have to double-check.
For the first part of your code, I'm not familiar with those weird labels so I can't help you.
Like Dish says, you MUST acknowledge an IRQ before doing cli or rti, or any instruction that returns; otherwise you get an endless loop of IRQs (the only exception is an IRQ triggered by a brk instruction, but that's clearly not the case here).
In the case of DMC IRQs, you should read $4015 before returning. If DMC is your only source of IRQs, you can just discard the result and assume the source of the IRQ was the DMC; otherwise you'd have to acknowledge all possible sources of IRQs, check whether their bit was set, and act accordingly.
I recommend doing something like this in your IRQ handler:
Code:
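IRQ
pha ; save A, since the handler overwrites it
bit $4015 ; V = frame IRQ flag (bit 6), N = DMC IRQ flag (bit 7)
bvs _noDMC ;This line can be deleted if DMC is the only possible IRQ source
lda #$0f ; writing $4015 clears the DMC IRQ flag; the read above does not
sta $4015 ; (assumed value: DMC stopped, other four channels left enabled)
lda GrayScale_toggle
eor #$01
sta GrayScale_toggle
ora #$1e
sta $2001
_noDMC
pla
rti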
This will toggle grayscale mode on each IRQ. If your two split points are set up correctly, the part of the screen between them should be gray.
D'oh, I forgot about acknowledging the irq. Thanks for the heads up.
@Tepples: I'll wait to see your example. I have a feeling I'm not setting up the DMC registers correctly.
It turned out that I had lost half the source code for this effect when my laptop died a few months ago. But I still had the working ROM on my PowerPak, and I managed to teach myself enough about da65 to recreate working source code.
DPCM Split
Mapper 0, runs on PowerPak. README file inside archive explains how it works. It still has glitches that someone like blargg might be able to help find and fix.
Your technique is interesting. If I understand correctly, you play a 1-byte sample and count the cycles before the first IRQ, to compensate for the jitter.
The problem is that you have to wait through a dozen IRQs for a dozen scanlines, so if this technique were used for a longer time between the two splits, this constant stream of useless IRQs would waste very significant CPU time for nothing (half-killing the purpose of interrupts, although it's still better than wasting ALL the CPU time).
Would this technique work as well if you play a 1-byte sample, count the # of cycles before the interrupt, then play a 17 or 33 byte sample to have only 2 IRQs in total ?
Another thing is that you use sprite-0 hit (with 7-cycle jitter I guess) for the first split, and start your IRQ technique from there. If you instead reversed things and used the IRQ for the first split and sprite 0 for the second, you could do that technique with 3-cycle jitter instead, and it could perform better. Although of course this will waste 100% of CPU time between the 1st and 2nd split instead of between the top and the 1st split, so it all depends on what you want to achieve.
EDIT : Yet another thing is that you could, in the first loop, wait for $4015.7 to rise, while having the I flag set. This would prevent the IRQ, and possibly (possibly not?) increase the accuracy of the code and simplify the code inside the IRQ (which would only execute once per frame).
Bregalad wrote:
The problem is that you have to wait through a dozen IRQs for a dozen scanlines, so if this technique were used for a longer time between the two splits, this constant stream of useless IRQs would waste very significant CPU time for nothing
It's significant, but the 37 cycles out of 432 still fall within my 10% of CPU time budget.
Code:
irq_handler:
bit $4015 ; acknowledge dpcm irq
inc irqs
; schedule another IRQ
pha
lda enable_timer
sta $4015
pla
rti
And it's not entirely useless to know how much time I have left before the IRQ fires.
Quote:
Would this technique work as well if you play a 1-byte sample, count the # of cycles before the interrupt, then play a 17 or 33 byte sample to have only 2 IRQs in total ?
Playing a 17-byte sample would allow skipping the first IRQs if the splits are far enough apart ((17+1)*8*54 = 7776 CPU cycles or 69 scanlines). But the demo's splits are only about 10 bytes apart, and it would be two more instructions in the IRQ to handle switching to 1-byte samples after the initial 17- or 33-byte sample. I'll consider it in the second.
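The arithmetic in that reply can be checked with a few lines of Python. This is only a sketch: it takes the post's count of 17+1 byte periods as given, uses 54 CPU cycles per bit (rate $F on NTSC), and the helper name `delay_for` is made up for illustration.

```python
CPU_CYCLES_PER_BIT = 54          # DMC rate $F on NTSC
BITS_PER_BYTE = 8
CPU_CYCLES_PER_LINE = 341 / 3    # 341 PPU dots per line, 3 dots per CPU cycle


def delay_for(byte_periods):
    """Return (CPU cycles, scanlines) spanned by a number of DMC byte periods."""
    cycles = byte_periods * BITS_PER_BYTE * CPU_CYCLES_PER_BIT
    return cycles, cycles / CPU_CYCLES_PER_LINE


cycles, lines = delay_for(17 + 1)
print(cycles, round(lines, 1))
```

This gives 7776 cycles, or about 68.4 scanlines, which the post rounds up to 69.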
Quote:
Although of course this will waste 100% of CPU time between the 1st and 2nd split instead of between the top and the 1st split, so it all depends on what you want to achieve.
What motivated my development of this technique was trying to display a background picture that needs more than 4 KiB of CHR while decompressing a page of text to put into the next picture. Here, sprite 0 would be in a fixed place (doesn't matter where as long as it's present), and the code would wait for the end of vblank by waiting for sprite 0 flag to turn off.
Quote:
EDIT : Yet another thing is that you could, in the first loop, wait for $4015.7 to rise, while having the I flag set.
The number of cycles in the loop that measures the IRQ jitter has to match the number of cycles in the loop that compensates, and it takes more time to read $4015 than the zero-page variable that the IRQ handler updates.
Quote:
What motivated my development of this technique was trying to display a background picture that needs more than 4 KiB of CHR while decompressing a page of text to put into the next picture. Here, sprite 0 would be in a fixed place (doesn't matter where as long as it's present), and the code would wait for the end of vblank by waiting for sprite 0 flag to turn off.
That's right, and no offense tepples, but the way you're using the IRQ with this "inc variable" method that you're polling completely kills the purpose of an IRQ, which is to free the main CPU to do another task.
Doing it is completely dumb - you might as well have the main code run a timed loop between the splits; the result would be the same.
I understand that this is a proof of concept, but the main code needs to be doing something other than polling for this to be of any use.
So the IRQ handler will need not only to acknowledge, but also to handle the split and the loop that compensates for the jitter. That's why I suggest you wait for a fake IRQ just by polling $4015 after the first split (in our case) - this should simplify the code and avoid an if/else statement in your IRQ routine.
Bregalad wrote:
Quote:
What motivated my development of this technique was trying to display a background picture that needs more than 4 KiB of CHR while decompressing a page of text to put into the next picture.
That's right, and no offense tepples, but the way you're using the IRQ with this "inc variable" method that you're polling completely kills the purpose of an IRQ, which is to free the main CPU to do another task.
I understand that, but finalizing the code for this other task may take weeks. I have a day job, my workouts, and sometimes younger cousins who visit over summer break.
Quote:
So the IRQ handler will need not only to acknowledge, but also to handle the split and the loop that compensates for the jitter.
It doesn't wait for IRQ ten times; it waits for the number of elapsed IRQs to reach 10. So even if I do something that takes the time of eight IRQs to complete before I jump into that loop, I can still get into the loop in time. But now I bet tokumaru is itching to remind me that "something" may occasionally exceed this time, causing an occasional glitch. So my ultimate plan is to move more of the scroll split processing into the IRQ handler, which should add two cycles to the "just count" case.
As the Tourette's Guy channeling the late Robert Stack would put it, "UPDATE!"
I got the whole effect to run in the NMI and IRQ handlers.
This version is a rock-solid letterbox generator for NTSC NES.
EDIT: Attached DPCM Letterbox demo here as a backup
Are you saying that with your method it is possible to use DMC IRQ's for effects that were only possible with special mappers? That's awesome!
Is it precise enough to allow you to change the scroll mid-frame without artifacts? If yes, I'd say you just did a great thing for the NESDEV community. I always wanted a method to do raster effects without having to use mappers or 100% of the CPU time.
And if you keep sample playback disabled during VBlank and a while after that I guess you can read the controller then, without having to worry about corruption, right?
EDIT: I imagine it would be hard to make this effect dynamic, right? I mean, if you need the split to move up and down as opposed to keeping it steady.
Congratulations tepples, this is really great.
Quote:
Are you saying that with your method it is possible to use DMC IRQ's for effects that were only possible with special mappers? That's awesome!
Yeah, he's saying this. I think it only works if there is a sprite zero hit first, though. Also I'm not sure if it's possible to get 3 or more splitpoints that way, but anyway 2 is still much better than just 1 without any mapper.
Quote:
And if you keep sample playback disabled during VBlank and a while after that I guess you can read the controller then, without having to worry about corruption, right?
You can read the controller multiple times anyway so I don't think this is a problem unless you're like REALLY tight on timing.
Bregalad wrote:
I think it only works if there is a sprite zero hit first, though.
He said that the purpose of the sprite hit was to time the first IRQ, and that you don't need it if your NMI handler is constant-timed (like mine is).
Quote:
You can read the controller multiple times anyway
Well, you can, but why would you? I'd rather have my NMI handler read the controllers before doing any DMC IRQ stuff and use that one read for the rest of the frame.
Quote:
so I don't think this is a problem unless you're like REALLY tight on timing.
I don't like the idea of indefinitely reading the controllers until 2 consecutive reads match, so I'd rather just read them when it's safe to do so.
Could the DMC technique be used for screen masking for vertical mirroring?
tokumaru wrote:
Is it precise enough to allow you to change the scroll mid-frame without artifacts?
It's doing exactly that in one of the steps of the IRQ handler. Search for your username in src/reset.s.
Bregalad wrote:
Also I'm not sure if it's possible to get 3 or more splitpoints that way
You can get as many split points as you need, provided they're spaced at least 9 or so lines apart.
Dwedit wrote:
Could the DMC technique be used for screen masking for vertical mirroring?
That's exactly what this demo does, except I didn't set vertical mirroring in the header.
tepples wrote:
It's doing exactly that in one of the steps of the IRQ handler. Search for your username in src/reset.s.
Heh, I saw that, thanks for the credits! =)
Anyway, if this can really sub for a scanline counter, that's awesome. I will take a better look at the code, but since the split points in your demo are fixed, are you manually adjusting them or can the code dynamically move the split point?
Quote:
You can get as many split points as you need, provided they're spaced at least 9 or so lines apart.
Another thing to consider is that it's not possible to have an IRQ fire at the very top of the screen... but I guess that you could just check for those cases and use timed code instead.
I don't know if it's clear, but I just want to make sure that this technique can completely replace a standard scanline counter. Even if a bit of timed code is necessary to fine tune the splits, all types of effects should be possible, right?
Each IRQ is worth 54*8*3/341 = 3.8 NTSC lines or 50*8*3.2/341 = 3.754 PAL lines. So you could have the NMI compute two additional values: on which IRQ to activate a split point, and how much extra time after that IRQ to wait before triggering the split. Then instead of switching to slow mode on the second action, the second IRQ would check for the calculated split IRQ count and then wait for the calculated time.
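The two figures at the start of that post can be reproduced directly. A minimal sketch, using only the constants quoted in the post (54 and 50 CPU cycles per bit, 3 and 3.2 PPU dots per CPU cycle, 341 dots per line); the function name `lines_per_irq` is made up for illustration.

```python
DOTS_PER_LINE = 341


def lines_per_irq(cpu_cycles_per_bit, dots_per_cpu_cycle):
    """Scanlines covered by one 1-byte (8-bit) DMC sample period."""
    return cpu_cycles_per_bit * 8 * dots_per_cpu_cycle / DOTS_PER_LINE


ntsc_lines = lines_per_irq(54, 3)    # NTSC: 3 PPU dots per CPU cycle
pal_lines = lines_per_irq(50, 3.2)   # PAL: 3.2 PPU dots per CPU cycle

print(round(ntsc_lines, 3), round(pal_lines, 3))
```

Both results agree with the post: roughly 3.80 lines per IRQ on NTSC and 3.754 on PAL.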
Multiple splits would have to be spaced at least 9 or 10 lines apart, unlike MMC3 which can go as low as 2.
When I first thought up this technique, I was just hoping to get one split in so that I could display tiles from the first pattern table in the top half of the screen and tiles from the second in the bottom half, so that I could fill the screen with text. I exceeded my own expectations.
I think that making an interface that takes scanline numbers and handles the complicated details (possibly using look-up tables) to make sure a certain piece of code runs when that scanline is reached is definitely something worth pursuing.
I'd like to make some experiments using your idea if you don't mind, is that OK, tepples?
I just thought that since we know how many cycles are there between IRQs we could maybe request IRQs using cycles, not scanlines (like some mappers do), and every IRQ we subtract the known number of cycles from the total, and when the remainder is less than the time between two IRQs the rest of the time is spent with timed code. Does anyone have a better idea?
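The cycle-based idea above could be sketched like this in Python. It is only an illustration of the bookkeeping, not anyone's actual implementation: `plan_split` is a hypothetical name, 432 cycles per IRQ assumes 1-byte samples at rate $F on NTSC, and the loop mirrors what an IRQ handler would do by subtracting a constant from a remaining-cycles counter on each interrupt.

```python
CYCLES_PER_IRQ = 8 * 54  # one 1-byte sample at rate $F on NTSC


def plan_split(total_cycles, cycles_per_irq=CYCLES_PER_IRQ):
    """Return (number of IRQs to count, cycles left for timed code)."""
    irqs = 0
    remaining = total_cycles
    # Each IRQ subtracts the known interval; stop once less than one
    # interval is left, and burn the remainder with timed code.
    while remaining >= cycles_per_irq:
        remaining -= cycles_per_irq
        irqs += 1
    return irqs, remaining


# Example: about 100 scanlines (100 * 341 / 3, rounded to 11367 cycles).
irqs, leftover = plan_split(11367)
print(irqs, leftover)
```

With these numbers the split is reached after 26 IRQs plus 135 cycles of timed code. Handler overhead and IRQ jitter are ignored here and would have to be subtracted from the leftover in a real routine.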
tokumaru wrote:
I'd like to make some experiments using your idea if you don't mind, is that OK, tepples?
Go ahead; that's what I made it for.
tepples, I have some questions about your code (probably because I'm an APU newbie):
You have a single sample byte, $AA, correct? I didn't see you set the sample address ($4012) anywhere, why is that? Also, at some point you "slow down" and increase the sample length, but how can you do that if you only have 1 $AA byte?
tokumaru wrote:
You have a single sample byte, $AA, correct? I didn't see you set the sample address ($4012) anywhere, why is that?
My mistake. Not audible in practice. I was lucky.
Quote:
Also, at some point you "slow down" and increase the sample length, but how can you do that if you only have 1 $AA byte?
Mistake #2. Also not audible in practice.
I heard it, but I was wearing headphones and had the volume up pretty high at the time, until I turned it down.
But what's the correct way to encode silence then? I tried hacking tepples' ROM with more $AA's after the first one and the noise got worse... The only time it appeared to stay silent was when I put 17 $FF's, but that makes no sense to me...
Encode silence as all decreasing bits, so that it runs into the bottom rail.
Tepples, I really think a string of $00s is the way to play silent samples. Just write $00 to $4011 at reset and you'll be SURE it will have no effect at all.
Other than this, it's really great. I saw that it was possible to avoid completely sprite #0 hit if you have a cycle-timed NMI which can be an interesting alternative as well. It's really a great discovery you made, even if the "trick" to have an IRQ fire after one byte of DMC and count the time until it triggers is simple, you just had to think about it, and congratulations for that.
Maybe you can't have splitpoints too close together, but if that were needed you could use timed code instead (without wasting too many CPU cycles), so it's not an issue. Scanline IRQs can definitely be considered a luxury.
I'm thinking of something like this:
The programmer can specify the number of scanlines until the next split, and the exact time this is done in the scanline is the time the split will happen, but X scanlines down.
Use a frequency that will result in one byte of DMC finishing playing in slightly less than a multiple of 3 scanlines (to avoid fractional cycles - which means it will have to be a multiple of 16 in PAL?), so that when the IRQ fires it's timed to set up the next one at exactly the same point in the scanline, causing the IRQs all to fire with a whole number of scanlines between them.
Each IRQ checks if the number of scanlines left until the split is less than that number of scanlines between IRQs, and if so it uses timed code until the desired scanline (a small function that waits X scanlines shouldn't be hard to code).
Theoretically this would allow splits to be very close to each other, since the system would detect that the number of scanlines (as low as 1?) was smaller than the number of scanlines between IRQs and would use timed code instead. What you can't do is have 16+ split points evenly spaced across the screen, or there would be no time left for game logic.
Not as good as normal mapper IRQs because more processing time will be spent on a series of IRQs before the actual split is reached, but is still infinitely better than sprite 0 hits. Also, the more the IRQs are spaced, less time will be spent before the split, but more time will be spent by the timed code that waits for the exact scanline, so we must find a healthy compromise.
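To see which DMC rates come close to the "slightly less than a multiple of 3 scanlines" condition described above, here is a small Python sketch. It assumes the NTSC rate table as documented on the NESdev wiki and uses the fact that 3 NTSC scanlines are exactly 341 CPU cycles; `slack_for`, `table` and `best_rate` are names made up for illustration.

```python
# NTSC DMC rate table (CPU cycles per output bit), indexed by rate $0-$F.
NTSC_DMC_PERIODS = [428, 380, 340, 320, 286, 254, 226, 214,
                    190, 160, 142, 128, 106, 84, 72, 54]
CYCLES_PER_3_LINES = 341  # 3 lines * 341 dots / 3 dots per CPU cycle


def slack_for(rate):
    """Return (scanlines, slack cycles) for one sample byte at this rate.

    scanlines is the smallest multiple of 3 lines that is at least as
    long as one byte; slack is how many CPU cycles the byte falls short.
    """
    byte_cycles = 8 * NTSC_DMC_PERIODS[rate]
    groups = -(-byte_cycles // CYCLES_PER_3_LINES)  # ceiling division
    return groups * 3, groups * CYCLES_PER_3_LINES - byte_cycles


table = {rate: slack_for(rate) for rate in range(16)}
best_rate = min(table, key=lambda rate: table[rate][1])

for rate, (lines, slack) in table.items():
    print(f"rate ${rate:X}: {lines} lines minus {slack} cycles")
```

Under these assumptions rate $2 has the smallest shortfall (8 cycles short of 24 lines), with rate $D next (10 cycles short of 6 lines). Whether the IRQ handler can reliably absorb that shortfall is a separate question that this sketch does not answer.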
I'll have to try this whenever I get the time.
Knowledgeable NESDEV masters, please help me once again: in Nestopia and FCEUX, as soon as I start playing a DMC sample of size 1 an IRQ fires, but in Nintendulator the IRQ only happens after the sample. What gives? How does a real console behave in this situation? (I'm sorry I have to ask this, but I'm kinda unable to use my NES for a while)
Also, tepples, did you test your demo on hardware? I'm asking because even though it works with Nestopia and FCEUX, with Nintendulator there are some nasty glitches and the thing doesn't look stable at all.
Make sure you're using the newest version of Nintendulator. Older versions have some nasty OAM bugs.
In the version of Nintendulator I have, it blinks occasionally, but is otherwise stable.
Then I tested the newest version of Nintendulator, and it's rock stable there.
Dwedit wrote:
In the version of Nintendulator I have, it blinks occasionally, but is otherwise stable.
Then I tested the newest version of Nintendulator, and it's rock stable there.
You are right, it works in the latest beta build. Apparently this version also has an IRQ firing when the 1-byte sample starts, like the other 2 emulators... Should I assume this is the correct behavior then?
From the info on the wiki, I guess the IRQ is triggered after the last sample byte is fetched. In the case of a 1-byte sample, that is shortly after you start the sample, but at a random point within some range of time.
Unless you can provide test ROMs I can't test this in hardware easily.
tokumaru wrote:
Also, tepples, did you test your demo on hardware?
I tested only on 1. the ancient version of FCEU in Ubuntu's repo and 2. my NES + PowerPak. Now that I have some spare time to test it on a Windows machine, I see that FCEU and Nestopia have it right and Nintendulator has it wrong. And now that you mention it, I think I may have had my TV on mute for some other reason, and my laptop's internal speaker has never been good at reproducing low frequencies.
As for the IRQ triggering immediately after starting playback, by the time it gets to the measure loop, either 1 or 2 IRQs have elapsed. The code around "cmp #2" below the measuring loop compensates for the extra time when two IRQs happen between enabling and measuring.
I just tested on my top loader and it seems that there always is an IRQ as soon as the DMC is started.
I also tested your demo, tepples, and at least for me it's not as rock solid as it should be. From time to time the text section appears out of place for a frame or so. Sometimes there are also glitches at the far right of the scanlines where the scroll changes, but that's hard to debug because each TV hides a different amount of pixels.
tokumaru wrote:
I also tested your demo, tepples, and at least for me it's not as rock solid as it should be. From time to time the text section appears out of place for a frame or so.
I'm testing on a PowerPak. That might have caused some difference in the power-up state, which in turn might have caused the noise (or lack thereof). But even after a few presses of reset to randomize the PPU-CPU alignment, I can't get the window to blink.
Quote:
Sometimes there are also glitches at the far right of the scanlines where the scroll changes, but that's hard to debug because each TV hides a different amount of pixels.
My HDTV underscans a bit when zoomed out. I can see both the left and right border, and on my NES + PowerPak, all the artifacts are inside the border area.
If the sample buffer is empty, then starting a one-byte DMC sample will result in an immediate IRQ. One way of avoiding this is to start it twice in a row, clear the IRQ flag, then enable IRQ.
tepples wrote:
I'm testing on a PowerPak.
So am I. It must be the boot ROM version or something else then (does the top loader use a different CPU revision?). It doesn't blink gray like it does on Nintendulator though, the text just appears in the wrong place for a frame.
blargg wrote:
If the sample buffer is empty, then starting a one-byte DMC sample will result in an immediate IRQ. One way of avoiding this is to start it twice in a row, clear the IRQ flag, then enable IRQ.
Ah, good to know this is documented behavior. Thanks for the tip.
I'm getting occasional glitches on my NTSC frontloader, with a devcart that acts like NROM. Powered up and after a few seconds, some glitches. Taking a look at the technique, since it's pretty slick.
Quote:
If the sample buffer is empty, then starting a one-byte DMC sample will result in an immediate IRQ. One way of avoiding this is to start it twice in a row, clear the IRQ flag, then enable IRQ.
If that is true, then the sample fetch happens as soon as you enable the DMC, then I see no purpose in counting the cycles until the first IRQ (that should be compensated later), if this happens always immediately.
Bregalad wrote:
If that is true, then the sample fetch happens as soon as you enable the DMC, then I see no purpose in counting the cycles until the first IRQ (that should be compensated later), if this happens always immediately.
It actually counts the cycles until the second IRQ. Even if the first is immediate, the second isn't because the sample buffer isn't empty.
Yeah, the first one fires right away, but the time between it and the second one varies.
I don't get it at all. If you play ONE sample, why are there TWO IRQs?
I take it that the IRQ itself starts a sample. So if you start a one-byte sample outside the IRQ when the sample buffer is empty, it immediately reads that one byte, then finds it's done, so it fires the IRQ immediately. Then the IRQ starts another one-byte sample. The sample buffer is already full, so this second sample has to wait until the buffer becomes empty, thus the second IRQ is delayed.
Then why play multiple samples at all if the first sample is immediate ? Just play a sample and the IRQ will be sync'ed with the video (that is with the 7-cycle interrupt jittering + the jitter you could possibly have at the time of setting up the DMC sample)
Because the sample buffer starts out empty, the first sample immediately finishes (from the memory reader's point of view) and thus immediately causes an IRQ. Because this is immediate, it is useless for synchronizing to the internal clock of the DMC. The time between the first IRQ and the second IRQ varies based on the phase of the clock of the DMC, and this is what the measuring loop measures. When I get time in front of an NES, I plan to experiment with starting the first two samples from the NMI handler, the first with no IRQ (to fill the sample buffer) and the second with an IRQ (to start measuring).
None of this measuring would be necessary if it were possible to restart the clock of the DMC, but this is not possible on the NES.
Well then play a sample longer than 1 byte (17 or 33 bytes for example) for a delay before the IRQ.
If I used a 17-byte sample for measuring, that would waste an unacceptable amount of time that I could be using for game logic or to prepare the VRAM update buffer. So I play a 1-byte sample to prime the buffer, and then I play a 1-byte sample to measure the phase.
I don't get it. If the first sample always starts immediately, then if you start a 17-byte sample, the IRQ will happen after the 16th byte has been fetched, which will take a constant amount of time, and that removes the need for any measuring.
Again, you cannot reset the DMC's sample byte timer. It's always outputting either 8 silence bits or the 8 bits of a sample from its buffer. It always does things 8 bits at a time, and once it's started 8 silence bits or 8 bits from the buffer, you cannot have any effect until it's finished these 8 bits. Thus, on an NTSC NES the delay between when you start a sample and the first bit of the first sample byte actually starts playing can be up to around 432 cycles (if you've got the DMC at maximum frequency). It is this variable delay that tepples' code is timing.
I guess I'll just give up on understanding the DMC's behavior and use sprite zero hits.
Here's what the output unit does: Hey, is there a sample byte in the buffer? Nope, oh well, I'm going to output silence for the next 8 timer periods. ... OK, all done. Is there a sample byte in the buffer now? Yes, OK, I've got the sample, clear the buffer now. Here I go outputting the 8 delta bits over the next 8 timer periods. ... OK, done, is a sample byte in the buffer?
Then, there's the memory reader that keeps the sample buffer filled. If the sample buffer is empty and there are any remaining bytes to be played, it reads the next byte from RAM and puts it into the sample buffer. If that was the last byte of the sample and IRQ is enabled, it sets the IRQ flag.
So the only thing feeding the output unit is the timer and the one-byte sample buffer. If you start out with the sample buffer empty and no sample playing, and then start a one-byte sample, you can see how the memory reader will immediately read that byte and put it into the sample buffer, then trigger an IRQ, since there are no more bytes to be READ (it's irrelevant that the sample buffer byte hasn't been played yet; the memory reader just cares about anything more that needs to be read from RAM).
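The behavior described in the last three paragraphs can be modeled in a few lines of Python. This is an idealized sketch based only on that description: real hardware has extra fetch delays that are ignored here, the class and function names (`Dmc`, `cycles_until_second_irq`) and the 20-cycle handler delay are made up, and `start` only models starting a sample when no bytes remain.

```python
BITS_PER_BYTE = 8
PERIOD = 54                            # CPU cycles per bit, rate $F on NTSC
OUTPUT_CYCLE = BITS_PER_BYTE * PERIOD  # 432 CPU cycles per 8-bit output cycle


class Dmc:
    def __init__(self, phase):
        self.phase = phase        # cycles since the output unit began its 8 bits
        self.buffer_full = False
        self.bytes_remaining = 0
        self.irq = False

    def start(self, length):
        """Model a $4015 write that starts a sample and clears the IRQ flag."""
        self.irq = False
        self.bytes_remaining = length
        self._memory_reader()

    def _memory_reader(self):
        # Refill the one-byte buffer whenever it is empty and bytes remain.
        if not self.buffer_full and self.bytes_remaining > 0:
            self.buffer_full = True
            self.bytes_remaining -= 1
            if self.bytes_remaining == 0:
                self.irq = True   # last byte has been READ, not yet played

    def tick(self):
        """Advance one CPU cycle."""
        self.phase += 1
        if self.phase == OUTPUT_CYCLE:
            self.phase = 0
            # Output unit begins a new 8-bit cycle: it takes the buffered
            # byte if there is one (otherwise it plays 8 bits of silence).
            self.buffer_full = False
            self._memory_reader()


def cycles_until_second_irq(phase, handler_delay=20):
    dmc = Dmc(phase)
    dmc.start(1)                  # buffer was empty: IRQ is immediate
    assert dmc.irq
    for _ in range(handler_delay):
        dmc.tick()
    dmc.start(1)                  # restart from the IRQ handler
    waited = 0
    while not dmc.irq:
        dmc.tick()
        waited += 1
    return waited


for phase in (0, 200, 400, 420):
    print(phase, cycles_until_second_irq(phase))
```

In this model the first IRQ is always immediate, while the wait for the second one depends on where the output unit was in its 8-bit cycle when the sample was started. When the cycle boundary falls inside the handler delay, the second start also fires at once, which matches the occasional back-to-back IRQs reported elsewhere in the thread.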
Bregalad, it all sounds very weird to me too, especially since I don't know much about the APU. But as I'm testing this myself, I can see that everything blargg is saying makes sense. I wouldn't have a clue what he's saying if I weren't experimenting with this stuff right now.
Well, how can the output unit output silence? DMC samples either increase or decrease the counter. So does it have a third state that doesn't affect the counter, or does it just output zero and keep the counter decreasing until it reaches 0?
In fact I guess the last DMC sample should be considered a bug in Nintendo's implementation. Nintendo probably intended samples to be 0, 16, 32, .... bytes long.
So with this system, an IRQ fires as soon as a sample finishes, but there is a bug that makes the sample play one more byte.
The CPU is randomly aligned with that 8-stage "output unit" (if it had a more precise name that would probably make things less confusing), which is constantly running at one of those 16 possible speeds, so the only way to sync with it is to play a sample and see when it finishes. Because it's a terrible waste of CPU time to play a 16-byte sample, even at the fastest speed $f, you play a 0-byte sample, and an IRQ triggers immediately. In the IRQ handler, you play a second 0-byte sample, and because the "output unit" is busy playing the buggy sample, the IRQ only happens when the output unit has finished playing this sample. At this IRQ, you KNOW that the "output unit" is on its reload cycle. Now you can play whatever sample you want for a timed delay, and the third IRQ will come at a predictable time.
In other words, you have to play 2 dummy samples at the start, and to have 2 dummy IRQs. Right ?
Talk about a major headache.
Bregalad wrote:
Well, how can the output unit output silence? DMC samples either increase or decrease the counter.
If I understand correctly, if you set it to 0 through $4011 and use only zeroes for samples it can't go any lower than 0, thus you get a flat line.
Bregalad wrote:
Well, how can the output unit output silence? DMC samples either increase or decrease the counter. So does it have a third state that doesn't affect the counter
Yes. "Not playing" is this third state, during which the output unit does not output -2 or +2 values. But transitions from "not playing" to "playing" or vice versa always happen on the multiple of 8 sample periods.
Quote:
In other words, you have to play 2 dummy samples at the start, and to have 2 dummy IRQs. Right ?
It's not strictly necessary to play the first dummy sample with IRQs turned on.
I can't get this thing to be steady... the 1 or 2 IRQs that fire when the first sample starts throw all the calculations off, and I get up to 1 scanline of variation...
EDIT: Maybe I should give up on this. I mean, the technique looks promising and all, and I believe my idea for generalizing it (so that you can just pass a number of scanlines to a function and get your IRQ at the correct time) is good, but the fact that I'm messing with something I haven't quite mastered (the APU) makes this a hit-and-miss job. Not only do I dislike getting the right results by luck without actually understanding the reason, but this also takes too much time that could be better spent on my game.
I got some good things out of this though, such as a function that waits a specified number of scanlines that works for PAL and NTSC dynamically (you just need a flag indicating whether the console is PAL or NTSC). I was gonna use it to wait for the exact scanline of the split after the IRQ before it. If anyone is interested I can share it.
Before I give up I guess I'll try one more thing: blargg, can you please tell me in more detail how I can avoid that first IRQ (i.e. the exact order in which I have to perform the writes)?
Set interrupt mask, start it, acknowledge interrupt immediately?
Or just have your IRQ decrement a counter and return if the counter isn't the right value.
The problem is that it seems that sometimes there are 2 immediate IRQs, and this is giving me a hard time calculating the delay correctly (like I said it varies by 1 scanline or so). I'm afraid that even if I get rid of the first IRQ there will be a second "ghost" IRQ from time to time. Don't ask me what's up with that double interrupt, but from looking at tepples' code it seems he too noticed it.
The priming playback causes an immediate IRQ as the last byte of the wave entered the buffer. The eight-sample period ends during the IRQ handler, causing the buffer to empty into the shift register. The IRQ handler starts another playback, and another IRQ triggers because the buffer again has the last byte of a wave. Look around not_double_irq: in the source code to guess how I detect and compensate for that. (Hint: if the IRQ handler becomes appreciably longer, you need to increase 'lda #6' by the difference in length in eight-cycle units.)
I am adding comments before the measuring loop to clarify this. I plan to release a new version tomorrow anyway because the NMI handler was using the Y register when it shouldn't have been.
I understand the idea, but there are a lot of things in your code I don't get (lack of comments?), so I tried implementing it in my own way.
For example, I didn't understand why you count up in order to find out what the delay is. Since the IRQ will fire before the location where you actually wanted it to, to me it made more sense to set the counter to what the ideal delay should be and decrease it in the 8-cycle loop until the IRQ fires, so that whatever is left in Y is the number of 8-cycle units you have to wait before performing the raster effects. To me that made more sense than the "adc #176" you have there, that I can't figure out.
I managed to detect the double IRQs and compensate, but for some reason that reduces my error margin to half a scanline at best. There's probably some other variation I'm not taking into account.
tokumaru wrote:
For example, I didn't understand why you count up in order to find out what the delay is. Since the IRQ will fire before the location where you actually wanted it to, to me it made more sense to set the counter to what the ideal delay should be
I first experimented with this back in November of last year at a coin laundry. By then, I wasn't entirely sure what the "ideal" delay should be, so I counted up from zero.
Quote:
and decrease it in the 8-cycle loop until the IRQ fires, so that whatever is left in Y is the number of 8-cycle units you have to wait before performing the raster effects. To me that made more sense than the "adc #176" you have there, that I can't figure out.
Think of it as 'adc #-80'. The use of "negative" numbers with an up-counter should be familiar to anyone who has ever programmed the Game Boy or GBA audio system: both the Game Boy tone generators and the GBA sample timer use up-counters that reload when they become zero. The PackBits RLE format, used by MacPaint and my nametable tools, also uses an up-counter for run lengths, though it stops at 1 instead of 0.
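The "adc #176" versus "adc #-80" equivalence is ordinary 8-bit wraparound. A minimal Python sketch of it follows; the helper name `adc8` is made up for illustration, and it assumes the carry flag is clear (the real ADC also adds the carry in):

```python
def adc8(a, operand):
    """8-bit add with wraparound, modelling ADC with carry clear (an assumption)."""
    return (a + operand) & 0xFF

# Adding 176 gives the same 8-bit result as subtracting 80, for every counter value.
for count in range(256):
    assert adc8(count, 176) == (count - 80) & 0xFF

print(adc8(100, 176))  # 100 - 80 = 20
```

So an up-counter followed by "adc #176" carries the same information as a down-counter started at the ideal delay, just offset.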
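tokumaru, I believe this would work:
Code:
sei ; keep IRQs masked during setup
lda #$10
sta $4015 ; start the DMC
nop
sta $4015 ; write again to clear the IRQ flag the first start may have set
cli ; let later IRQs through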
Sorry I can't test it right now.
That appears to work blargg, thanks. Apparently I can make things a bit more stable now. Let's see if this goes anywhere...
EDIT: Thanks to blargg I think I got it. And thanks to tepples for the idea, of course. I got minimum variation now, that certainly fits in the HBlank area. Now on to implementing the functions that manage the whole thing and execute the splits precisely at the requested scanlines. If I succeed I'll be sure to share it with you all.
I wish it wasn't so late so that I could keep working on this. I'm really excited about having dynamic water and background bosses in my game now!
tokumaru wrote:
That appears to work blargg, thanks. Apparently I can make things a bit more stable now. Let's see if this goes anywhere...
EDIT: Thanks to blargg I think I got it. And thanks to tepples for the idea, of course. I got minimum variation now, that certainly fits in the HBlank area. Now on to implementing the functions that manage the whole thing and execute the splits precisely at the requested scanlines. If I succeed I'll be sure to share it with you all.
Cool! I'm really anxious to see if you get this done.
thefox wrote:
Cool! I'm really anxious to see if you get this done.
It looks promising. Expect something later tonight.
Try removing the NOP between the STA $4015. I think it might delay the IRQ in a very rare case. I'm not sure it's needed. I thought it might be due to the delay from starting DMC to the sample being read and then the buffer getting filled.
blargg wrote:
Try removing the NOP between the STA $4015.
OK, but it doesn't seem to be causing any problems.
I've updated my own demo. Highlights:
- Pumped full of comments
- Scroll using Control Pad
- Runs the priming playback with IRQs turned off
- Removed some misuses of Y register
tepples wrote:
I've updated my own demo.
Cool. Looks very stable.
blargg wrote:
Try removing the NOP between the STA $4015.
Just for the record, things became worse without it. With the NOP I can stare at the screen for minutes without seeing a single glitch, but without it something goes wrong every few seconds and the split point moves significantly.
The NES is the final authority on all matters of proper behavior
There are still a LOT of things to clean up and adjust (the timing is still pretty off), but it looks like the idea works. Here's the program changing the color emphasis bits at dynamic scanlines (this is still NTSC only):
http://membler-industries.com/tokumaru/dmc.nes
As soon as it's acceptably stable I'll release the source code. All you'll have to do to use this is call a function with a number of scanlines in the accumulator, and after that number of scanlines a function will be called. It's inside this function that you'll code your raster effects. You don't have to understand how it works internally, you just have to trust that the function will be called at the correct time.
Not everything is perfect though. There is the overhead of a few IRQs firing before the actual split point (they are 18 scanlines apart on NTSC and 16 on PAL), and there can be a lot of waiting near the split point. I'm not sure how much CPU time will be stolen from your main thread, but it's certainly less than 100%! =) If your game loop is not so intense you can probably get away with 2 or 3 splits in some special areas (bosses and such).
Both tepples' and tokumaru's demos seem to work flawlessly on my NTSC toploader.
However, on my PAL NES, tepples' demo seems unstable on some resets (it randomly scrolls wrong) while it is okay on other resets, and tokumaru's demo falls flat on its face, flickering a lot.
Tokumaru, your demo looks like nothing but in fact it's awesome (except for the fact it's NTSC only but I hope it's technically portable to PAL). Did you make a method that combines IRQ + timed code to be able to do something like that?
Bregalad wrote:
Both tepples' and tokumaru's demos seem to work flawlessly on my NTSC toploader.
However, on my PAL NES, tepples' demo seems unstable on some resets (it randomly scrolls wrong) while it is okay on other resets, and tokumaru's demo falls flat on its face, flickering a lot.
My ROM isn't "polished" by any means, if you pay attention you can even see that I haven't hidden the split point in HBlank, there's still a lot to do timing wise.
Quote:
Tokumaru, your demo looks like nothing but in fact it's awesome (except for the fact it's NTSC only but I hope it's technically portable to PAL).
Yeah, as a demo it looks like shit, but that's not the point. The goal here is that you can have a graphical effect in any scanline you want without sacrificing 100% of your CPU time and all you need is the relative scanline number (i.e. how many scanlines after you call the function), without having to worry about how it's done. Once it's fully working I'll make something that looks nicer.
My goal is to make it PAL compatible dynamically, it just needs to initialize a couple of variables differently, which in my game I intend to do shortly after reset, when I detect the type of the console. So the same code should work for both PAL and NTSC, you don't even need to recompile.
Quote:
Did you make a method that combines IRQ + timed code to be able to do something like that?
Yes. In NTSC the IRQs are roughly ("roughly" means that the difference has to be taken into account later) 18 scanlines apart, and 16 in PAL. Every time an IRQ fires, those numbers are subtracted from the number of scanlines until the split, and when that number becomes smaller than 18 (NTSC) or 16 (PAL) IRQs are stopped and the routine that waits scanlines with timed code is called. Of course this is combined with tepples' method of counting the time until the first IRQ to know how out of sync the APU is in relation to the PPU so that this can be compensated.
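The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the subtraction logic: the function name is hypothetical, the intervals are treated as exactly 18 or 16 scanlines, and the "roughly" correction and the APU/PPU phase compensation mentioned in the post are left out.

```python
def plan_split(scanlines_until_split, irq_interval):
    """Return (irq_count, leftover_scanlines) for a requested split.

    irq_interval is 18 for NTSC or 16 for PAL (values from the post).
    The leftover is what the timed-code wait routine would have to cover.
    """
    irq_count = 0
    remaining = scanlines_until_split
    while remaining >= irq_interval:
        remaining -= irq_interval  # one more DMC IRQ fires before the split
        irq_count += 1
    return irq_count, remaining

print(plan_split(100, 18))  # NTSC: (5, 10)
print(plan_split(100, 16))  # PAL:  (6, 4)
```

With these numbers the timed-code part never has to cover a full IRQ interval.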
Quote:
My goal is to make it PAL compatible dynamically, it just needs to initialize a couple of variables differently, which in my game I intend to do shortly after reset, when I detect the type of the console. So the same code should work for both PAL and NTSC, you don't even need to recompile.
It's nice I'd have to store only 1 file on the powerpak, but it sounds like it would introduce a major complication/overhead in the code.
(If you haven't noticed yet, I always do separate NTSC/PAL versions of my demos, but they always compile from the same file: I just add a ".define PAL" to enable PAL mode and use conditional assembly.) Not that you are forced to go this way, but it simplifies things when writing timed code.
Bregalad wrote:
It's nice I'd have to store only 1 file on the powerpak
I really like the idea of having a single ROM and have it work in whatever system you run it.
Quote:
but it sounds like it would introduce a major complication/overhead in the code.
At first I thought so too, but if you consider that timed code consists basically of lots of waiting, it's easy to dedicate some of this waiting to check which system it is and wait a little more or a little less without any real overhead.
Quote:
(If you haven't noticed yet, I always do separate NTSC/PAL versions of my demos, but they always compile from the same file: I just add a ".define PAL" to enable PAL mode and use conditional assembly.) Not that you are forced to go this way, but it simplifies things when writing timed code.
I used to do that, but once I noticed it wouldn't be so hard to have both in the same binary I thought it was worth the trouble. At first it's a bit hard to consider all the branches and make sure that all possible paths use the correct number of cycles, but once this is done you don't have to worry about it ever again. I think it's nice.
tokumaru wrote:
Bregalad wrote:
but it sounds like it would introduce a major complication/overhead in the code.
At first I thought so too, but if you consider that timed code consists basically of lots of waiting, it's easy to dedicate some of this waiting to check which system it is and wait a little more or a little less without any real overhead.
"Because of physics changes, Mario can make a certain jump on NTSC but not on PAL." That's something I do not want to see. But it still happens in practice, with the relative strengths of characters in Super Smash Bros. Melee and Super Smash Bros. Brawl depending on the TV system for which the product was designed. Compare the NTSC tier list for Melee to the PAL tier list.
Not all of us can afford to import a PAL NES and a PAL TV to test on.
tepples wrote:
"Because of physics changes, Mario can make a certain jump on NTSC but not on PAL." That's something I do not want to see.
Yeah, I don't plan on changing the physics in my game to compensate for the slower frame rate of PAL. Even though in theory that doesn't look hard to do, there will be rounding errors in the physics parameters that will cause characters to behave differently.
Quote:
Not all of us can afford to import a PAL NES and a PAL TV to test on.
Yeah, all my PAL testing is done on emulators. I believe most TVs can handle PAL just fine though, but importing a PAL NES just for testing programs is a bit too much for me.
If you don't change the physics, it becomes something like
"Because the PAL version runs 20% slower, stage 3/7/11 of Battletoads becomes very significantly less hard to beat on the PAL version".
Also, I keep separate NTSC and PAL versions because it's what Nintendo did. That might sound like a stupid/biased reason, but it's a reason all the same.
Bregalad wrote:
"Because the PAL version runs 20% slower, stage 3/7/11 of Battletoads becomes very significantly less hard to beat on the PAL version".
That's the single reason why I have a TV system check on the first screen of LJ65. If the TV system doesn't match the TV system for which the build's speed constants are intended, then it puts up a TV system mismatch error message and freezes. (Yes, I complain about the NTSC PS2 not converting the TV system of all-region PAL DVDs, but it's also hundreds of times faster than the NES.)
Blargg, that "nop" has proved to be very necessary. Without it things go completely crazy on hardware, even though Nintendulator and Nestopia are OK either way.
Actually, things are working very well on hardware, and not so much on emulators. On hardware, the variation is only 24 pixels or so, an area that can be easily hidden in HBlank, but both Nintendulator and Nestopia have occasional frames every few seconds where the variation goes up to 56 or so pixels, the split point happens much earlier than normal for some unknown reason.
*BUMP*
The way in which the first sample is started makes a lot of difference it seems... I was having some sudden timing variations (40 some pixels) every few seconds, and out of desperation, I tried different things and this is what proved to be most stable, both on emulators and real hardware (while experimenting I found other combinations that worked either on hardware or emulators, but this is the only one that seems to work for both):
Code:
sei
lda #$10
sta APUSTATUS
sta APUSTATUS
sta APUSTATUS
cli
Yup, three $4015 writes in a row. Don't ask me why. Does anyone have any theories on why this is the case? Most importantly, do you think this could have some undesired side effect?
Seems the edge case I was worried about is real. Here are your three STA $4015 instructions:
Code:
IRQ  DMA  Buffer  Output
-------------------------------
 0    0   empty   busy
STA $4015
 0    1   empty   busy
 1    0   full    busy
 1    0   empty   new cycle
STA $4015
 0    1   empty   busy
 1    0   full    busy   <-- this was the problem
STA $4015
 0    1   full    busy
...
 0    1   full    busy
 0    1   empty   new cycle
 1    0   full    busy
IRQ at proper time
At the beginning, the DMA unit has nothing to fetch, the one-byte buffer is empty, and the output unit is almost done with its current cycle.
After the first STA $4015, the DMA unit has one byte to fetch, which it does immediately. This fills the sample buffer, and sets the IRQ flag. Just after that, the output unit completes its cycle and empties the sample buffer.
By the time of the second STA $4015, the sample buffer is empty again, and thus it gets immediately filled and the IRQ flag set almost immediately again. If you had done the CLI after this, the IRQ would have been delayed until after the instruction AFTER the CLI (due to the way the 6502 effectively delays clearing the I flag by one instruction). So your IRQ would be late.
But if you instead do a third STA $4015, you are ensured that the output unit is busy and the sample buffer full, so that you don't get an immediate IRQ.
It's nice to have others working on new techniques like this and experimenting until they work well on hardware.
Bump, since this came up in http://nesdev.com/bbs/viewtopic.php?p=76844#76844. tokumaru, did you ever get the PAL version of this working and if so, source would be appreciated.
thefox wrote:
tokumaru, did you ever get the PAL version of this working and if so, source would be appreciated.
I'm not sure. The PAL version was exactly like the NTSC one, the only difference was how many scanlines apart the interrupts were, since I wanted to use a frequency that would result in intervals close to whole scanlines, to waste as few cycles as possible.
I never finished this for good, because I lost interest in this technique when I realized that about 20% of a frame's time would be spent on a single split. I'll check the files though (I don't even remember where they are, but I'm sure there's a backup somewhere) and will share whatever I have. The code is far from clean though, there were still a lot of little things left to do before this could be considered a reusable piece of code. A problem I could never solve is that once in a while there would be sudden variations of 40 or so pixels, which could easily screw up scrolling changes.
Now that I think of it, I might have even made a single program that is NTSC & PAL compatible, based on console detection at startup. I'm sure I had plans for this, but I'll have to check if I actually implemented it. I'll let you know later today.
tokumaru wrote:
I never finished this for good, because I lost interest in this technique when I realized that about 20% of a frame's time would be spent on a single split.
Aah, I remembered there was a catch to this.
Do you think it would be possible to cut down on the CPU usage if one was more clever about the choice of frequencies/etc?
On another note, even if a generic routine can't be made efficiently, it would still be useful to have good documentation and/or an ("offline") application to generate code for DMC timed splits based on input parameters.
A demo with multiple variable split points uses a lot of CPU time, as it would have to spin a lot longer to wait for the right line. But the letterbox demo (two fixed split points) uses 20 IRQs per frame, most of which take only about 36 cycles (if I remember correctly) to handle. Even counting other overheads, I imagine it's taking less than 5% of CPU time. I'd have to profile it harder to get a more precise figure.
Yeah, my solution has so much overhead because it's supposed to act like a scanline counter that can generate interrupts for any given scanline. Hardcoded split points are surely much faster.
I found the most recent archive of my work on this (and it appears I was working on PAL compatibility the last time I touched it, because the assembled ROM works on PAL but not on NTSC), but I'm having trouble uploading it using the ftp Memblers set up for me. I'll try again after dinner.
Why use 20 IRQs when you could just play a longer sample with an adjusted length/duration ratio, and use only 1 of them?
I think the CPU usage is about 9 scanlines, plus a few more because of the imperfection of the length/duration ratio (you'll have to use the one which will fire an IRQ just before the desired split point, and wait the remaining time the old way).
So I'll say 10 scanlines, which is about 4% CPU time in NTSC.
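As a quick check of that percentage, here is a tiny Python sketch. It assumes 262 scanlines per NTSC frame and 312 per PAL frame, figures that are not stated in the thread; the function name is made up.

```python
NTSC_LINES_PER_FRAME = 262  # assumption: standard NTSC NES frame
PAL_LINES_PER_FRAME = 312   # assumption: standard PAL NES frame

def cpu_share(scanlines_spent, lines_per_frame=NTSC_LINES_PER_FRAME):
    """Fraction of one frame's CPU time taken by a wait of the given length."""
    return scanlines_spent / lines_per_frame

print(f"{cpu_share(10):.1%}")                       # 3.8%
print(f"{cpu_share(10, PAL_LINES_PER_FRAME):.1%}")  # 3.2%
```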
Bregalad wrote:
Why use 20 IRQs when you could just play a longer sample with an adjusted length/duration ratio, and use only 1 of them?
Because we have to measure how long it takes for the first IRQ to fire (so that this difference is used to compensate the timing later), and if you use a larger interval you'll waste a lot of time waiting for this IRQ to happen.
EDIT:
Here's all the work I've done on this. Don't expect any documentation or clean code though. Feel free to do whatever you want with this.
Bregalad wrote:
Why use 20 IRQs when you could just play a longer sample with an adjusted length/duration ratio, and use only 1 of them?
I already do play a longer sample through much of the big area. But the only sample lengths one can reasonably use for this are 1 and 17 bytes.
Well can't you play the shortest 1-byte sample at rate $0f for the initial tuning, then play a longer/slower sample for a second IRQ, then wait a variable time that adds up to a constant with the first delay ? I'm pretty sure you could.
Of course if you play a slower sample you lose resolution, so you'd have to find the best length/rate ratio for this. Then it'd be a tradeoff between playing a longer sample, thus wasting ROM, or playing a shorter one and wasting some CPU time counting useless IRQs before the interesting one.
IIRC, the whole problem with DMC IRQs is that the APU runs at its own pace, meaning that even if you give it the command to start playing a sample at the same point in the frame every time, the IRQ will not happen at the same time.
If we count how long it takes for the IRQ to fire (this is one of the waiting loops), we know what the error is, and we can use this information to compensate for the error later (this is the second waiting loop), when it's time for the raster effect.
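A toy Python model of this two-loop compensation follows. It assumes the measuring loop costs 8 cycles per iteration (the figure used earlier in the thread), picks an arbitrary budget of 60 units, and ignores every fixed delay between the two loops. All names are hypothetical.

```python
LOOP_CYCLES = 8  # assumed cost of one iteration of the measuring loop

def cycles_until_effect(irq_delay_cycles, budget_units):
    """Total variable delay, given how late the first IRQ fired.

    First loop: count 8-cycle iterations until the IRQ fires.
    Second loop: wait whatever is left of the fixed budget.
    """
    measured_units = irq_delay_cycles // LOOP_CYCLES
    extra_units = budget_units - measured_units
    return irq_delay_cycles + extra_units * LOOP_CYCLES

# Try every possible phase of a 1-byte sample at the fastest rate (432 cycles).
totals = [cycles_until_effect(delay, 60) for delay in range(432)]
print(min(totals), max(totals))  # 480 487
```

In this model the sum of the two waits stays within one loop iteration (under 8 cycles) of the budget, whatever the APU phase was.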
Maybe it is actually possible to detect the error using a shorter/faster sample to reduce the amount of waiting, but it might not be trivial to dynamically vary the frequency during the frame as the split point comes closer.
No, no that's not what I mean.
Here is a timing diagram of what I mean:
Code:
------------------
NMI Vblank sync
Sprite DMA, VRAM updates, etc... (constant time)
------------------
start playback of short sample (varying time)
enter in a loop polling $4015 and counting iterations
--------------------
"IRQ #1" -> count time before the pseudo-IRQ (an actual IRQ isn't needed and can be disabled with SEI, this is automatically done in an NMI routine by the way).
Start playback of long/slow sample (fixed time)
RTI (also clears the I flag)
....sound, game engine, etc. code here....
-------------------
IRQ #2 -> disable DMC channel
Wait for constant time minus the calculated one
Wait for an adjustment time for the raster effect (compensate for lack of precision in length/speed of the long sample)
Raster effect register writes
RTI
----------------------
end of frame here
----------------------
I think this is what would make the most sense to me, and I think it should work, but the adjustment time that compensates for the lack of precision must not be too big, or else too much CPU time will be wasted.
Bregalad wrote:
Well can't you play the shortest 1-byte sample at rate $0f for the initial tuning, then play a longer/slower sample for a second IRQ, then wait a variable time that adds up to a constant with the first delay ?
I already do this, sort of: I use a 1-byte sample and a 17-byte sample. The 1-byte sample takes 1*8*54 = 432 CPU cycles, or 432*3/341 = 3.8 scanlines. The 17-byte sample takes 17*8*54 = 7344 CPU cycles, or 7344*3/341 = 64.6 scanlines. I already do play a 17-byte sample to skip large areas. But there's nothing in between a 1- and 17-byte sample. On the other hand, you're right that I haven't investigated playing the 1-byte sample at maximum rate (54 cycle period) and then switching to 1-byte samples at a slower rate (longer period) in order to skip areas smaller than 64 scanlines.
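The arithmetic in that post can be reproduced with a short Python sketch. It uses the 54-cycle period and the 341/3 CPU cycles per NTSC scanline that the post's formulas imply; the function name is made up.

```python
from fractions import Fraction

NTSC_CPU_CYCLES_PER_SCANLINE = Fraction(341, 3)  # implied by "cycles*3/341"

def sample_duration(sample_bytes, period):
    """Return (cpu_cycles, scanlines) for a DMC sample: 8 bits per byte."""
    cycles = sample_bytes * 8 * period
    return cycles, float(cycles / NTSC_CPU_CYCLES_PER_SCANLINE)

for size in (1, 17):
    cycles, lines = sample_duration(size, 54)
    print(size, cycles, round(lines, 1))
# 1 432 3.8
# 17 7344 64.6
```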
OK, I took the time to compute the number of scanlines for each possible DMC length and rate, for both NTSC and PAL (hopefully I didn't make any accidental errors).
I rounded all values up to the next higher integer, because if you want to wait N scanlines, you want the IRQ to happen BEFORE the Nth scanline after setting up the IRQ.
(samples taking longer than a frame are represented by stars)
Code:
NTSC                Rate
Length                $0  $1  $2  $3  $4  $5  $6  $7  $8  $9  $a  $b  $c  $d  $e  $f
------------------------------------------------------------------------------------
1-byte (8 bits)       31  27  24  23  21  18  16  16  14  12  10  10   8   6   6   4
17-byte (136 bits)    **  **  **  **  **  **  **  ** 228 192 170 154 127 101  87  65
33-byte (264 bits)    **  **  **  **  **  **  **  **  **  **  **  **  ** 196 168 126
49-byte (392 bits)    **  **  **  **  **  **  **  **  **  **  **  **  **  **  ** 187
PAL                 Rate
Length                $0  $1  $2  $3  $4  $5  $6  $7  $8  $9  $a  $b  $c  $d  $e  $f
------------------------------------------------------------------------------------
1-byte (8 bits)       30  27  24  23  21  18  16  15  14  12  10   9   8   6   5   4
17-byte (136 bits)    **  **  **  **  **  **  **  ** 225 189 169 151 126 100  85  64
33-byte (264 bits)    **  **  **  **  **  **  **  **  **  **  **  **  ** 194 164 124
49-byte (392 bits)    **  **  **  **  **  **  **  **  **  **  **  **  **  **  ** 184
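For anyone who wants to re-derive these numbers, here is a Python sketch. The period lists are the commonly documented DMC rate tables (CPU cycles per output bit); only the 54-cycle entry appears in this thread, so treat the rest as an assumption. The scanline lengths of 341/3 (NTSC) and 341/3.2 (PAL) CPU cycles are assumptions as well.

```python
# Assumed DMC periods in CPU cycles per bit, indexed by rate $0..$f.
NTSC_PERIODS = [428, 380, 340, 320, 286, 254, 226, 214,
                190, 160, 142, 128, 106, 84, 72, 54]
PAL_PERIODS = [398, 354, 316, 298, 276, 236, 210, 198,
               176, 148, 132, 118, 98, 78, 66, 50]

def scanlines_rounded_up(sample_bytes, period, pal=False):
    """Sample duration in scanlines, rounded up, using integer math only."""
    cycles = sample_bytes * 8 * period
    # NTSC: 341/3 cycles per line. PAL: 341/3.2 = 1705/16 cycles per line.
    num, den = (16, 1705) if pal else (3, 341)
    return -(-cycles * num // den)  # ceiling division

def table_row(sample_bytes, pal=False):
    periods = PAL_PERIODS if pal else NTSC_PERIODS
    return [scanlines_rounded_up(sample_bytes, p, pal) for p in periods]

print(table_row(1))            # NTSC 1-byte row
print(table_row(1, pal=True))  # PAL 1-byte row
print(table_row(17)[8:])       # NTSC 17-byte row, rates $8..$f
```

Under those assumptions the computed rows agree with the table; entries shown as stars are simply not printed as stars here.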
Because a PAL interrupt will always happen at about the same time as, or a bit sooner than, an NTSC interrupt, I will use the NTSC table exclusively to pick the "best" setting here:
Code:
Delta     Best opt. for IRQ
1-3       Timed code
4-5       Length $0, rate $f
6-7       Length $0, rate $d
8-9       Length $0, rate $c
10-11     Length $0, rate $a
12-13     Length $0, rate $9
14-15     Length $0, rate $8
16-17     Length $0, rate $6
18-20     Length $0, rate $5
21-22     Length $0, rate $4
23        Length $0, rate $3
24-26     Length $0, rate $2
27-30     Length $0, rate $1
31-64     Length $0, rate $0
65-86     Length $1, rate $f
87-100    Length $1, rate $e
101-125   Length $1, rate $d
126       Length $2, rate $f
127-153   Length $1, rate $c
154-167   Length $1, rate $b
168-169   Length $2, rate $e
170-186   Length $1, rate $a
187-191   Length $3, rate $f
192-195   Length $1, rate $9
196-227   Length $2, rate $d
228-239   Length $1, rate $8
Therefore, after the necessary initial wait of 4 scanlines, which is needed for the sync because of how the DMC works, you should compute the number of scanlines that remain before the split and use the lookup table above.
Note that in most cases it's unnecessary to have a sample longer than 17 bytes, because it will only save a couple of scanlines in the final fine-tuned timed code wait. The only exception is if you want to wait something close to 220-227 scanlines, where there is a significant gap and it's best to use a 33-byte sample.
Also note the huge gap between 31 and 64 scanlines, where there is nothing better than 31 scanlines available. If you wanted to wait, let's say, exactly 63 scanlines and use only one IRQ, you'd have to set an IRQ for line 31 and wait the remaining 32 lines "by hand", wasting a lot of CPU time. Therefore for this case I'd really want to use a second IRQ, to save CPU time.
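The lookup can also be generated rather than typed in. The Python sketch below picks, for a given delta, the longest NTSC sample that still ends at or before it. It assumes the commonly documented NTSC period table (only the 54-cycle value appears in this thread), a sample length of register*16+1 bytes, and 341/3 CPU cycles per scanline. The function name is hypothetical.

```python
# Assumed NTSC DMC periods in CPU cycles per bit, indexed by rate $0..$f.
NTSC_PERIODS = [428, 380, 340, 320, 286, 254, 226, 214,
                190, 160, 142, 128, 106, 84, 72, 54]

def best_setting(delta):
    """Return (length_reg, rate, scanlines) or None if timed code is needed."""
    best = None
    for length_reg in range(4):            # $0..$3 -> 1, 17, 33, 49 bytes
        bits = (length_reg * 16 + 1) * 8
        for rate, period in enumerate(NTSC_PERIODS):
            cycles = bits * period
            lines = -(-cycles * 3 // 341)  # scanlines, rounded up
            if lines <= delta and (best is None or cycles > best[3]):
                best = (length_reg, rate, lines, cycles)
    return None if best is None else best[:3]

print(best_setting(3))    # None: too short for any IRQ, use timed code
print(best_setting(63))   # (0, 0, 31): leaves 32 lines to wait by hand
print(best_setting(126))  # (2, 15, 126)
```

Preferring the largest cycle count among candidates is what makes ties such as rate $6 versus rate $7 (both rounding up to 16 lines) come out in favor of the slower rate.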
So there is probably no way more than 2 IRQs and more than 8 scanlines of "timed code" would ever be required for this, for ANY split.
I agree it's not the most amazing thing in the world, but remember all this is possible WITHOUT ANY MAPPER.
This is a very interesting idea, Bregalad. Suddenly I feel very tempted to try coding the "definitive mapperless scanline counter" again...! =)