Hi guys!
I'm developing an NES emulator. I tested it with nestest in automated mode (I don't have a PPU yet, just the CPU working). I have a problem with this line:
Quote:
CDAD BA TSX A:00 X:00 Y:99 P:27 SP:FB CYC: 58 SL:257
The expected result is that the X register holds the value 0xFB.
It's not working in my emulator because the value 0xFB is never pushed onto the stack. Even with the log I can't find where it's pushed.
So can someone tell me what I'm missing here?
Thanks.
TSX copies the stack pointer itself, not any specific value from the stack, into X.
SP:FB + TSX = X:FB
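As a minimal sketch of the point above: TSX only touches registers and flags, never stack memory. The `Cpu` struct and its field names here are hypothetical, not taken from any particular emulator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register file; the field names are illustrative only.
struct Cpu {
    std::uint8_t a = 0, x = 0, y = 0;
    std::uint8_t s = 0xFD;  // stack pointer (low byte of an address in $0100-$01FF)
    std::uint8_t p = 0x24;  // status flags
};

constexpr std::uint8_t FLAG_Z = 0x02;
constexpr std::uint8_t FLAG_N = 0x80;

// TSX: copy the stack pointer into X, then update N and Z from the result.
// Nothing is read from or written to stack memory.
inline void tsx(Cpu& cpu) {
    cpu.x = cpu.s;
    cpu.p = static_cast<std::uint8_t>(cpu.p & ~(FLAG_Z | FLAG_N));
    if (cpu.x == 0)   cpu.p |= FLAG_Z;
    if (cpu.x & 0x80) cpu.p |= FLAG_N;
}
```

With S = 0xFB, as in the quoted log line, calling `tsx` leaves S unchanged and puts 0xFB in X.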
Ahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh!!!!!!!!!!!!!!!!!!
Now it makes sense.
Thanks for your help. I would have never figured it out by myself, I was so focused on the fact that everything was about push and pop that I didn't even read the doc correctly.
OK, now all the opcodes work correctly, but can someone explain to me what CYC and SL are in the nestest log? At first I thought that CYC was the number of cycles, but it doesn't seem to be that.
Thanks for your help.
Vash wrote:
can someone explain to me what CYC and SL are in the nestest log?
Given that CYC wraps around at 341, I'd say that's the PPU cycle within the scanline, which is what SL means.
That log really shouldn't be like that. It should either have just a cycle counter that never rolls over, or no cycles at all, since it's not a timing test anyway.
Thank you for your answer.
So if I got it right, CYC is the number of cycles of the opcodes, wrapped around at 341.
I don't understand how SL is calculated, because in the nestest log it begins at 241 and sometimes goes back to -1.
Vash wrote:
So if I got it right, CYC is the number of cycles of the opcodes, wrapped around at 341.
Each scanline is 341 PPU cycles long, no matter if the console is NTSC (where each CPU cycle equals 3 PPU cycles) or PAL (where each CPU cycle equals 3.2 PPU cycles).
Quote:
I don't understand how SL is calculated, because in the nestest log it begins at 241 and sometimes goes back to -1.
There are 262 scanlines in a frame, 240 of those are visible. The log probably starts at 241 because that's the start of VBlank. The one numbered -1 is the pre-render scanline.
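A rough sketch of how the CYC and SL columns could be reproduced on NTSC, assuming 3 PPU dots per CPU cycle and the scanline numbering described above. `PpuPosition` and its members are hypothetical names, and the sketch ignores any frame-length variation.

```cpp
#include <cassert>

// Hypothetical PPU position tracker for NTSC, using the numbering seen in the
// nestest log: CYC is the dot within the scanline (0-340), SL is the scanline
// (-1 pre-render, 0-239 visible, 240 post-render, 241-260 VBlank).
struct PpuPosition {
    int cyc = 0;
    int sl = 241;  // the log starts at the beginning of VBlank

    // Advance by a number of CPU cycles; each NTSC CPU cycle is 3 PPU dots.
    void advanceCpuCycles(int cpuCycles) {
        assert(cpuCycles >= 0);
        cyc += cpuCycles * 3;
        while (cyc >= 341) {
            cyc -= 341;
            ++sl;
            if (sl > 260) sl = -1;  // wrap from the last VBlank line to pre-render
        }
    }
};
```

Calling `advanceCpuCycles` with each instruction's cycle count after it executes should give values to compare against the log.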
I was going to remove that from the log (and zip it), but then I noticed that it has a disassembly, and even shows what is read from memory by read instructions. We really should have a log that just shows PC, A, X, Y, P, S, with no disassembly or anything else. That'd be much easier to generate for someone just getting their CPU core working.
blargg wrote:
We really should have a log that just shows PC, A, X, Y, P, S, with no disassembly or anything else.
It looks like a job for a Perl one-liner.
Well, yeah, it's simple to do with a regexp-style replacement. But we must first agree on the format before updating it. Here's one I used to give the following format:
Code:
find: (....) ............................ (.........................)[^\r]+
replace: PC:\1 \2
Also went and made these replacements, since P doesn't implement bits 4 and 5:
" P:2" " P:0"
" P:6" " P:4"
" P:A" " P:8"
" P:E" " P:C"
Changed SP to S, since S is the proper name of the register.
Changed : to = sign, the proper mathematical symbol.
We end up with this:
PC=C000 A=00 X=00 Y=00 P=04 S=FD
Any other changes before I update the Wiki with this one?
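For someone generating their own trace to diff against such a condensed log, a sketch of a formatter for the proposed line format might look like this. `formatTraceLine` is a hypothetical helper; masking P with 0xCF mirrors the replacements above that drop bits 4 and 5.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Format one trace line as "PC=C000 A=00 X=00 Y=00 P=04 S=FD".
// Bits 4 and 5 of P are masked off, since they are not real flag storage.
std::string formatTraceLine(std::uint16_t pc, std::uint8_t a, std::uint8_t x,
                            std::uint8_t y, std::uint8_t p, std::uint8_t s) {
    char buf[48];
    int n = std::snprintf(buf, sizeof buf,
                          "PC=%04X A=%02X X=%02X Y=%02X P=%02X S=%02X",
                          static_cast<unsigned>(pc), static_cast<unsigned>(a),
                          static_cast<unsigned>(x), static_cast<unsigned>(y),
                          static_cast<unsigned>(p & 0xCF),
                          static_cast<unsigned>(s));
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    return std::string(buf);
}
```

Printing one such line before each instruction executes gives a file that can be compared line by line with a plain text diff.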
blargg wrote:
Well, yeah, it's simple to do with a regexp-style replacement. But we must first agree on the format before updating it.
My idea was that each emulator author would make his own RE to thin the log down to the data he needs. On the other hand, the ability to disassemble is a good first step toward making a debugger.
OK, maybe I'll post a zip of the full log, and this condensed version. No reason not to provide the latter, as a starter that's trivial to compare with.
Well, now I have my CPU working, so I will start the PPU, but there is something I don't understand.
When is the frame rendered?
The PPU renders continuously, from line 0 through line 239, at one dot per CYC in the log (three dots per NTSC CPU cycle). A few fetches for line 0 are performed at the start of line -1.
Vash wrote:
When is the frame rendered?
The PPU is constantly working alongside the CPU. It keeps repeating this cycle: 20 Vblank scanlines, 1 dummy scanline, 240 picture scanlines, 1 dummy scanline. It never stops. It's the responsibility of the game program to sync itself to this cycle.
OK, so the CPU and PPU are working in parallel, but in my emulator I wanted to do something that looks like a game loop, like this:
while(GAME_RUNNING)
{
    if(timeElapsed>=tick)
        Game.update();
    Game.render();
}
So basically my question is when do I stop the CPU emulator to render a frame?
The most accurate method is to render three pixels, then perform one CPU cycle, and repeat. That's slow, so various catch-up schemes are used. As for when you give the GUI control, the common pattern that I've seen is to stop the emulator at the start of line 240 (the post-render line); line 241 is the start of vertical blanking.
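The lockstep scheme described above could be sketched roughly as follows for NTSC. `LockstepMachine` and its members are hypothetical placeholders; a real core would do actual CPU and pixel work in the two step functions.

```cpp
#include <cassert>

// Hypothetical lockstep core: one CPU cycle, then three PPU dots, repeated.
struct LockstepMachine {
    long cpuCycles = 0;
    int dot = 0;
    int scanline = 0;
    bool frameReady = false;

    void stepCpuCycle() { ++cpuCycles; }  // placeholder for real CPU work

    void stepPpuDot() {  // placeholder for real pixel output
        if (++dot == 341) {
            dot = 0;
            if (++scanline == 261) scanline = -1;    // after line 260 comes pre-render
            if (scanline == 240) frameReady = true;  // the picture is complete
        }
    }

    // Run until the start of line 240, then hand control back to the GUI.
    void runFrame() {
        frameReady = false;
        while (!frameReady) {
            stepCpuCycle();
            for (int i = 0; i < 3; ++i) stepPpuDot();
        }
    }
};
```

The GUI loop would call `runFrame()` once per displayed frame and present the picture afterwards. Because 89342 is not a multiple of 3, the emulator stops a dot or two past the line boundary on some frames.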
OK, so the emulator main loop could look something like this:
Code:
while(EMULATOR_RUNNING)
{
    if(cycle < 262) // number of scanline
        CPU.run(&cycle);
    PPU.render();
}
There are 262 scanlines, but 341 PPU cycles in each scanline. More like 89342 PPU cycles total (29780.66... CPU cycles).
You need to expect the game to write to the PPU during rendering time, because even Super Mario Bros changes the scrolling location part way through draw time.
The more I read stuff, the less I understand. I'm completely lost.
I'm OK with the cycle counts: 341 PPU cycles per scanline times 262 scanlines gives 89342 PPU cycles. As 1 CPU cycle = 3 PPU cycles, we end up with about 29780 CPU cycles.
What do you mean by "the game writes to the PPU during rendering time"?
tepples wrote:
The most accurate method is to render three pixels, then perform one CPU cycle, and repeat.
Are you racist against people living in PAL territories ?
Let's first adopt some terms that don't all sound the same. We don't need to use "cycle" for everything.
Cycle: CPU cycle. For example, two cycles in a NOP
Pixel: time PPU spends rendering a single pixel
Clock: the 21477272.7 Hz master timebase
Therefore:
1 clock = 1/21477272.7 second
1 pixel = 4 clocks = 1/5369318 second
1 cycle = 3 pixels = 12 clocks = 1/1789772.7 second
1 scanline = 341 pixels (in most cases) = 113.67 cycles
1 frame = 262 scanlines = 29780.67 cycles = 1/60.1 second
For PAL:
1 clock = 1/26601712.5 second
1 pixel = 5 clocks = 1/5320342.5 second
1 cycle = 3.2 pixels = 16 clocks = 1/1662607 second
1 scanline = 341 pixels = 106.5625 cycles
1 frame = 312 scanlines = 33247.5 cycles = 1/50 second
Then we can talk of these things with different one-word terms, and not get confused.
Bregalad wrote:
tepples wrote:
The most accurate method is to render three pixels, then perform one CPU cycle, and repeat.
Are you racist against people living in PAL territories ?
Russia is a PAL territory. The Dendy famiclone uses a /15 CPU instead of a /16 one like the official PAL NES, resulting in 3 pixels per cycle, and a PPU that makes NMI at scanline 291 instead of 241 like the official PAL NES. It appears the newbie hasn't yet appreciated the concept of two processors running in parallel. If I listed all variants of the NES architecture immediately, it would confuse the newbie even more.
For Dendy:
1 clock = 1/26601712.5 second
1 pixel = 5 clocks
1 cycle = 3 pixels = 15 clocks
1 scanline = 341 pixels = 113.67 cycles
1 frame = 312 scanlines = 35464 cycles
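The three tables above can be expressed as data, which makes the derived figures easy to check. This is only a sketch: `RegionTiming` and its members are hypothetical names, and a fixed 341 dots per scanline is assumed.

```cpp
#include <cassert>

// Hypothetical per-region timing table built from the figures listed above.
struct RegionTiming {
    double masterHz;        // master clock frequency
    int clocksPerDot;       // master clocks per PPU pixel
    int clocksPerCpuCycle;  // master clocks per CPU cycle
    int scanlinesPerFrame;

    double dotsPerCpuCycle() const {
        return static_cast<double>(clocksPerCpuCycle) / clocksPerDot;
    }
    double cpuCyclesPerFrame() const {
        return 341.0 * scanlinesPerFrame * clocksPerDot / clocksPerCpuCycle;
    }
    double framesPerSecond() const {
        return masterHz / (341.0 * scanlinesPerFrame * clocksPerDot);
    }
};

const RegionTiming NTSC{21477272.7, 4, 12, 262};
const RegionTiming PAL{26601712.5, 5, 16, 312};
const RegionTiming DENDY{26601712.5, 5, 15, 312};
```

Keeping the ratios as integer clock counts, rather than as 3 or 3.2, lets an emulator count time in master clocks and avoid fractional cycles.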
Vash wrote:
What do you mean by "the game writes to the PPU during rendering time"?
Here are the most common changes made to the PPU's state during rendering...
* Change the scrolling location for a status bar.
* Change the scrolling location many times because we want wave backgrounds or it's a racing game.
* Bankswitch the CHR so that different graphics are drawn after a certain scanline.
* Change which pattern table backgrounds and sprites use.
Then some more tricky stuff that games can do...
* Bankswitch the CHR more than once within the same scanline (Punch Out, Marble Madness, Fire Emblem, etc...)
* Disable rendering so the game can write to video ram, then re-enable rendering later within the same frame. (Wizards and Warriors 3)
* Disable rendering, then write a second sprite table, then re-enable rendering (Day Dreamin Davey, RC Pro am, Stunt Kids, some other games)
Any renderer that looks at the PPU's initial state at the start of the frame (scroll position, CHR banks mapped in, which pattern tables to use, size of sprites) and attempts to draw the entire screen using only that initial state won't do a very good job; even Super Mario Bros won't scroll correctly.
You need at least scanline-level accuracy of PPU state changes. Even then, scanline-level accuracy isn't good enough for Punch Out, which needs pixel-level accuracy.
But you don't need to keep switching between CPU code and PPU code every instruction. You can instead use a catch-up method, where you wait until the emulated game makes a PPU write or the frame ends, then draw the number of pixels that have elapsed.
It took a while, but I found an overview of "catch-up" and "timestamp" related techniques in this article on our wiki. Let me know about anything that you don't understand in this article so that I can go fix it.
Why don't you guys support emulating the core one cycle at a time, not just "x cycles = instruction y"? Wouldn't that help the emulation a lot for REALLY close timing things?
Running one cycle at a time is slow because it needs to keep the state of both the emulated CPU and PPU in the host CPU's L1 cache, and not all host CPUs are big enough for that. Efficient emulators use catch-up techniques to keep the host CPU's attention on only one emulated part at once yet still act as if the components run at the same time. Drop the catch-up, as you suggest, and you have an emulator like Nintendulator or bsnes, which last time I checked didn't run too well on netbooks.
tepples wrote:
The most accurate method is to render three pixels, then perform one CPU cycle, and repeat. That's slow, so various catch-up schemes are used. As for when you give the GUI control, the common pattern that I've seen is to stop the emulator at the start of line 240 (the post-render line); line 241 is the start of vertical blanking.
- Absolutely true.
I had to (additionally) create a queue system, right after an instruction, for things like switching to GUI or sound output updates/poll.
Vash wrote:
OK, so the emulator main loop could look something like this:
Code:
while(EMULATOR_RUNNING)
{
    if(cycle < 262) // number of scanline
        CPU.run(&cycle);
    PPU.render();
}
The problem with this approach is that many games (and I mean LOTS of them) make use of the fact that the CPU and PPU run side by side. These games modify certain PPU parameters as the image renders in order to change the rendered image in some way. This is used for status bars, parallax scrolling, color changes, things like that. If you ignore those timed changes and only render the image based on the final state of the PPU, almost every game will look wrong, and many might even hang (the ones that rely on sprite 0 hits).
A common solution to this problem is the "catch up" method. You basically run the CPU until the program tries to make any changes to the PPU or the frame ends, at which point you make the PPU catch up to the CPU by rendering the necessary number of pixels.
Of course you still have to consider events external to the CPU that might affect the program flow, such as sprite 0 hits or IRQs. You have to predict when those will happen so that you can update the system's state accordingly at the correct times.
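A bare-bones sketch of the catch-up idea for NTSC, assuming 3 dots per CPU cycle. `CatchUpSystem` and its members are hypothetical, the rendering itself is a placeholder, and the prediction of sprite 0 hits and IRQs mentioned above is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical catch-up scheduler: the CPU runs ahead freely, and the PPU is
// only brought up to date when its state is about to change or the frame ends.
struct CatchUpSystem {
    long long cpuCycles = 0;  // CPU time elapsed so far
    long long ppuDots = 0;    // PPU time already rendered
    int catchUps = 0;         // how many times the PPU had to catch up
    std::uint16_t lastAddr = 0;
    std::uint8_t lastValue = 0;

    void renderDots(long long n) { ppuDots += n; }  // placeholder for drawing

    void catchUp() {
        const long long target = cpuCycles * 3;  // NTSC: 3 dots per CPU cycle
        assert(target >= ppuDots);
        if (target > ppuDots) {
            renderDots(target - ppuDots);
            ++catchUps;
        }
    }

    void runCpu(int cycles) { cpuCycles += cycles; }  // no PPU work here

    void writePpuRegister(std::uint16_t addr, std::uint8_t value) {
        catchUp();  // first draw everything that came before the write
        lastAddr = addr;  // then apply the new state (placeholder)
        lastValue = value;
    }

    void endFrame() { catchUp(); }
};
```

The key ordering is in `writePpuRegister`: pixels up to the moment of the write are drawn with the old state, and only then does the new value take effect.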
OK, thanks for your help. Things are getting clearer now.
I have started to write the PPU and I'm still working with nestest. I'm launching nestest from the reset vector and not from C000 anymore, and the program seems to loop. Here's my log:
Code:
c004 78 Set Disable Interrupt Flag A=0 X=0 Y=0 P=24 SP=fd
c005 d8 Clear Decimal Flag A=0 X=0 Y=0 P=24 SP=fd
c006 a2 Load X, Immediate ff A=0 X=ff Y=0 P=a4 SP=fd
c008 9a Transfer X to Stack Pointer A=0 X=ff Y=0 P=a4 SP=ff
c009 ad Load In A, Absolute @ 2002 = 80 A=80 X=ff Y=0 P=a4 SP=ff
c00c 10 Branch on Plus offset = fb A=80 X=ff Y=0 P=a4 SP=ff
c00e ad Load In A, Absolute @ 2002 = 0 A=0 X=ff Y=0 P=26 SP=ff
c011 10 Branch on Plus offset = fb A=0 X=ff Y=0 P=26 SP=ff
c00e ad Load In A, Absolute @ 2002 = 0 A=0 X=ff Y=0 P=26 SP=ff
c011 10 Branch on Plus offset = fb A=0 X=ff Y=0 P=26 SP=ff
c00e ad Load In A, Absolute @ 2002 = 0 A=0 X=ff Y=0 P=26 SP=ff
c011 10 Branch on Plus offset = fb A=0 X=ff Y=0 P=26 SP=ff
c00e ad Load In A, Absolute @ 2002 = 0 A=0 X=ff Y=0 P=26 SP=ff
c011 10 Branch on Plus offset = fb A=0 X=ff Y=0 P=26 SP=ff
At the start, $2002 = 80 because we're in VBlank, so the BPL isn't taken. Once it's read, bit 7 is cleared, so $2002 = 00 and the BPL is always taken.
I don't see what's going wrong here.
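One possible reading of the log above, offered as an assumption rather than a diagnosis: the `LDA $2002` / `BPL` pair waits for the next VBlank, so it only exits if the PPU timing code sets bit 7 again at the start of each VBlank. A sketch of that flag, with hypothetical names and only bit 7 modeled:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the VBlank flag in $2002 (bit 7 only).
struct PpuStatus {
    bool vblank = false;

    // Called by the PPU timing code, not by the CPU.
    void enterVblank() { vblank = true; }
    void leaveVblank() { vblank = false; }

    // CPU read of $2002: returns the flag in bit 7, then clears it.
    std::uint8_t read() {
        const std::uint8_t value = vblank ? 0x80 : 0x00;
        vblank = false;
        return value;
    }
};
```

If `enterVblank()` is never called again after the first read, every later read returns 0 and the wait loop spins forever, which would match the repeating log lines.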
nestest is not a PPU tester. Its tests are not designed to exercise the PPU in a way that yields results whose cause can be easily tracked down.
You're right, except that all the PPU test ROMs from the wiki display their results on screen, which means the PPU has to work.
Not so; the ppu_vbl_nmi test also outputs the result to $6000 (WRAM). See the readme.
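A sketch of reading that result without a working PPU. The layout used here is an assumption to check against the test ROM's readme: a status byte at $6000 ($80 and above while the test is still running, otherwise the result code), the signature bytes DE B0 61 at $6001-$6003, and zero-terminated text from $6004. `TestStatus` and `readTestStatus` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical decoded view of the test output area.
struct TestStatus {
    bool valid = false;    // signature bytes were found
    bool running = false;  // status byte is $80 or above (assumed meaning)
    int code = 0;          // raw status byte
    std::string text;      // message written by the test
};

// wram[0] corresponds to CPU address $6000.
TestStatus readTestStatus(const std::vector<std::uint8_t>& wram) {
    TestStatus st;
    if (wram.size() < 5) return st;
    if (wram[1] != 0xDE || wram[2] != 0xB0 || wram[3] != 0x61) return st;
    st.valid = true;
    st.running = wram[0] >= 0x80;
    st.code = wram[0];
    for (std::size_t i = 4; i < wram.size() && wram[i] != 0; ++i)
        st.text.push_back(static_cast<char>(wram[i]));
    return st;
}
```

An emulator could poll this once per frame and print `text` when `valid` is true and `running` is false.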
Thanks for the info. I didn't know the output was also in the RAM.
I've tried the implied test ROM from instr_test-v2 but I got errors:
Code:
C:\Users\Vash\Desktop\Rom\instr_test-v2\rom_singles\01-implied.nes
01 de b0 61
2A ROL A
0A ASL A
6A ROR A
4A LSR A
8A TXA
98 TYA
AA TAX
A8 TAY
E8 INX
C8 INY
CA DEX
88 DEY
38 SEC
18 CLC
F8 SED
D8 CLD
78 SEI
58 CLI
B8 CLV
EA NOP
1A NOP
3A NOP
5A NOP
7A NOP
DA NOP
FA NOP
01-implied
Failed
This is weird because even NOP is failing, whereas NOP doesn't do anything.
Argh, I hadn't put up a newer version that works around APU emulation bugs. It's up now in the CPU tests page as instr_test-v3.zip.
I've tried v3 but I still get the same output:
Code:
C:\Users\Vash\Desktop\Rom\instr_test-v3\rom_singles\01-implied.nes
01 de b0 61
2A ROL A
0A ASL A
6A ROR A
4A LSR A
8A TXA
98 TYA
AA TAX
A8 TAY
E8 INX
C8 INY
CA DEX
88 DEY
38 SEC
18 CLC
F8 SED
D8 CLD
78 SEI
58 CLI
B8 CLV
EA NOP
1A NOP
3A NOP
5A NOP
7A NOP
DA NOP
FA NOP
01-implied
Failed
Does it come from my implementation? What do I have to do to make it work? Apparently it's not a CPU problem.
I've tried the implied test V4. It tries to execute opcode 0x02, which is the undocumented opcode KIL. It's supposed to stop the execution of the program. I didn't implement it, as I don't know how to handle it.
Is it supposed to execute this instruction?
I'm trying to pass the cpu_interrupts test. I have a problem with the cli_latency rom. Here is the log :
Code:
C:\Users\Vash\Desktop\Rom\cpu_interrupts_v2\rom_singles\1-cli_latency.nes
0B DE B0 61
Unacknowledged IRQ shouldn't let any mainline code run
1-cli_latency
Failed #11
The thing is, I have no idea what it means. What is an unacknowledged IRQ?
An unacknowledged IRQ is one that hasn't been acknowledged yet.
Vash wrote:
What is an unacknowledged IRQ?
The CPU has an IRQ pin that tells it to interrupt the program execution and call the IRQ routine, and if you don't acknowledge the IRQ the state of that signal will not change, so when the IRQ routine finishes it will be called again, and again, and again, effectively locking up the program.
So does it mean that when $4017 raises an IRQ, we have to read $4015 at the start of the IRQ handler?
When $4017 sets the IRQ flag, it will keep telling the CPU to cause an IRQ, until the CPU reads from $4015 (or writes $40 or $C0 to $4017) to tell it to stop requesting an IRQ. $4017 doesn't have any other way of knowing that the CPU handled the IRQ; it doesn't know that the CPU has entered the IRQ handler.
Ok So I guess that this error comes from the fact that something's wrong in my flag handling.
I have another question about those two registers: $4017 and $4015.
The wiki says that when we set $4017.6, $4015.6 is cleared. Does this work the other way around too? When we read $4015, $4015.6 is cleared, so is $4017.6 set?
Quote:
When we read $4015, $4015.6 is cleared, so is $4017.6 set?
No. There is a separate frame IRQ flag, and the bit last written to $4017.D6. If the latter was last written with a 1, the IRQ flag is never set. Reading $4015 merely clears the IRQ flag; it doesn't affect the last value written to $4017.D6. Usually the Wiki will note significant things like this.
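The two separate pieces of state described above could be modeled like this. It is a sketch with hypothetical names, covering only bit 6 of each register; `sequencerRequestsIrq` stands in for whatever point in the frame sequencer would raise the IRQ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the frame IRQ: one flag, plus the last value written
// to bit 6 of $4017. The two are independent pieces of state.
struct FrameIrq {
    bool inhibit = false;  // last value written to $4017 bit 6
    bool flag = false;     // frame IRQ flag, readable as $4015 bit 6

    // Called when the frame sequencer reaches its IRQ step.
    void sequencerRequestsIrq() {
        if (!inhibit) flag = true;  // never set while inhibited
    }

    void write4017(std::uint8_t value) {
        inhibit = (value & 0x40) != 0;
        if (inhibit) flag = false;  // setting bit 6 also clears the flag
    }

    // Reading $4015 reports the flag and clears it; inhibit is untouched.
    std::uint8_t read4015() {
        const std::uint8_t value = flag ? 0x40 : 0x00;
        flag = false;
        return value;
    }

    // Level seen by the CPU's IRQ input from this source.
    bool irqLine() const { return flag; }
};
```

Because `irqLine()` stays true until the flag is cleared, a handler that returns without reading $4015 or writing $4017 would be re-entered immediately.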
Thanks for your help. I passed the cli latency test but now I have a problem with 01-vbl_basics. It says : "VBL period is too short with BG off".
I've read from the wiki that when BG is off, the complete frame is one PPU clock shorter. So instead of being 341*262=89342 PPU clocks, it should be 89341.
As it says that the VBL period is too short and not too long, I don't get it.
My VBL period is 21 * 341 = 7161 PPU clocks. Even when I make it longer, it doesn't work.
Does anyone see what I'm missing here?
Vash wrote:
I've read from the wiki that when BG is off, the complete frame is one PPU clock shorter.
Which page on the wiki says this?
http://wiki.nesdev.com/w/index.php/Full_palette_demo
Quote:
The PPU has two frame lengths, short and long, and an internal flag that toggles every frame and determines whether the frame will be short or long. A long frame is 341*262 PPU clocks, while a short is one PPU clock less. The missed clock occurs around the end of VBL, and only if rendering is disabled.
Yeah, that's a typo, it should be "and only if rendering is enabled." as can be seen here:
http://wiki.nesdev.com/w/index.php/PPU_frame_timing
BTW, this error on that page is a perfect example of why I hate duplicating information; it leads to inconsistencies. My advice: if a page is describing a palette demo, don't take anything it says about anything other than the palette demo as fact, merely a suggestion of what you might find documented elsewhere.
Yes you're right but when I searched "vbl period" in the wiki, this page is the only one returned.
So the VBlank length is:
- 6820 PPU cycles when BG is off.
- 6820 PPU cycles when BG is on with an even frame.
- 6819 PPU cycles when BG is on with an odd frame.
But even when I use a bigger number, I still get this error: "VBL period is too short with BG off"
The period is the time between occurrences. On NTSC, the VBL period is 89342 pixels when rendering is disabled.
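In other words, the number being tested is the length of a whole frame, not the length of the VBlank interval. A sketch of that length for NTSC, following the corrected wiki statement quoted earlier (the short frame only happens with rendering enabled). `frameLengthInDots` and `shortFrameFlag` are hypothetical names; which frames count as short depends on the PPU's internal toggle.

```cpp
#include <cassert>

constexpr int kDotsPerScanline = 341;
constexpr int kScanlinesPerFrame = 262;

// Time from one VBlank start to the next, in PPU dots (NTSC).
// shortFrameFlag models the internal flag that toggles every frame.
int frameLengthInDots(bool renderingEnabled, bool shortFrameFlag) {
    int length = kDotsPerScanline * kScanlinesPerFrame;  // 89342
    if (renderingEnabled && shortFrameFlag) {
        --length;  // one dot is skipped on short frames
    }
    return length;
}
```

With rendering disabled, the result is 89342 regardless of the flag.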
It's 89342 PPU cycles, so it's 29780.66 CPU cycles.
It's still not working.
Here is what I do: every 29780.66 CPU cycles, $2002.7 is set to 1 to indicate the start of VBlank. The same applies with BG on in an even frame. For an odd frame with BG on, $2002.7 is set every 29780.33 CPU cycles (one PPU cycle less).
First, saying it's "not working" is not as useful as describing the symptoms. Do you mean "it's still giving the same error"?
What happens if you increase it by 10 cycles? Decrease by 10 cycles?
I have tried increasing the number of cycles: from 29780.66 CPU cycles up to 90374 CPU cycles, I get the "VBL period is too short with BG off" error. Then at 90375 CPU cycles, I get the "VBL period is way off" error.
Here is my main loop to show you how I do it :
Code:
numberOfCyclePerFrame = getFrameCPUCycleLength();
if (m_cpuCycle < numberOfCyclePerFrame)
{
    executeInstruction(&m_cpuCycle);
    numberOfCyclePerFrame = getFrameCPUCycleLength();
    if (ppuCycle >= 6820 && vBlank)
    {
        clearVBlank();
        vBlank = false;
    }
    //NMI
    if(launchNMI())
        NMI();
    //IRQ
    if(launchIRQ())
        IRQ();
    //reset previous cycle count
    previousCycle = m_cpuCycle;
}
else
{
    //reset the number of cycle
    m_cpuCycle = m_cpuCycle - (int)(numberOfCyclePerFrame/3);
    setVBlank();
    vBlank = true;
    //reset the ppu cycle counter
    ppuCycle = m_cpuCycle * 3;
}
Basically, what I do is execute instructions for 29780.66 or 29780.33 CPU cycles (depending on the BG and the even/odd frame). Once that number of cycles has passed, I set $2002.7 to 1 (because it's the start of a new frame, so it's the VBlank). At 6820 PPU cycles, I set $2002.7 to 0 because it's the end of the VBlank (20 scanlines after the start: 20 * 341).
What units is numberOfCyclePerFrame in? The if makes it look like it's in cycles, but the part where you divide it by 3 makes it look like it's in pixels. At this point, you're probably going to have to write your own logging so you can get things basically behaving correctly. Once an emulator gets too far incorrect, a test ROM can't make much sense of what it's doing.
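One way to avoid mixing units is to keep all frame bookkeeping in a single integer unit, for example PPU dots, and convert CPU cycles as they are added. The sketch below assumes NTSC with rendering disabled and a frame that starts at the beginning of VBlank. `FrameClock` and its members are hypothetical, and each call is assumed to add far less than a frame.

```cpp
#include <cassert>

constexpr long long kDotsPerCpuCycle = 3;
constexpr long long kDotsPerFrame = 341LL * 262;  // 89342, rendering disabled
constexpr long long kVblankDots = 341LL * 20;     // 6820

// Hypothetical frame clock counting only in PPU dots.
struct FrameClock {
    long long dotInFrame = 0;  // 0 = start of VBlank
    bool vblank = true;        // flag is set at the start of the frame
    int vblankStarts = 0;      // VBlank starts seen after the initial one

    void addCpuCycles(int cycles) {
        assert(cycles >= 0);
        dotInFrame += cycles * kDotsPerCpuCycle;
        if (vblank && dotInFrame >= kVblankDots) {
            vblank = false;  // end of VBlank, 20 scanlines in
        }
        if (dotInFrame >= kDotsPerFrame) {
            dotInFrame -= kDotsPerFrame;  // keep the remainder, still in dots
            vblank = true;
            ++vblankStarts;
        }
    }
};
```

Because the leftover dots carry into the next frame, no fractional CPU cycle count such as 29780.66 is ever needed; three frames take exactly 89342 CPU cycles.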