The relative speed of the NES and Game Boy can be calculated in several ways.
SM83 vs. 6502
The Nintendo Entertainment System and the Super Game Boy accessory have the same 945/44 = 21.47 MHz master clock. The NTSC NES divides the master clock by 12 to make the 1.79 MHz 6502 clock. SGB divides the clock by 5 (4.30 MHz) before passing it to the Game Boy's LR35902 system on chip, whose multicycle implementation in turn divides it by 4 (1.07 MHz) for its Sharp SM83 CPU core. Thus the effective clock rate of the Game Boy is 3/5 (60%) of the NES clock rate, which opens the debate about whether SM83 makes it up in work per clock.
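To make the clock arithmetic concrete, here is a minimal Python sketch that uses only the master clock and dividers quoted above. The variable names are my own, and the figures describe the NTSC NES and the Super Game Boy as discussed here.

```python
from fractions import Fraction

# Master clock shared by the NTSC NES and the Super Game Boy, in MHz.
MASTER_MHZ = Fraction(945, 44)

nes_cpu_mhz = MASTER_MHZ / 12      # 6502 clock on the NTSC NES
sgb_input_mhz = MASTER_MHZ / 5     # clock the SGB feeds to the LR35902
sgb_cpu_mhz = sgb_input_mhz / 4    # SM83 machine-cycle rate

# Exact ratio of Game Boy (on SGB) machine cycles to NES CPU cycles.
ratio = sgb_cpu_mhz / nes_cpu_mhz

print(f"NES CPU clock:   {float(nes_cpu_mhz):.4f} MHz")
print(f"SGB input clock: {float(sgb_input_mhz):.4f} MHz")
print(f"SM83 cycle rate: {float(sgb_cpu_mhz):.4f} MHz")
print(f"SM83 / 6502:     {ratio} ({float(ratio):.0%})")
```

The master clock cancels out of the ratio, so the 3/5 figure depends only on the dividers 12 and 5 × 4 = 20.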
Stack instructions: A push and pop on SM83 take 7 cycles total, same as 6502, but they handle 2 bytes at a time. SM83's RET is faster than 6502's RTS by 2 cycles, reducing the penalty for subroutine calls. The indirect jump instruction JP (HL) takes 1 cycle, which is faster than the 6502 idiom of loading the high byte, PHA, loading the low byte, PHA, then RTS.
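One way to make the stack comparison more quantitative is to convert SM83 cycle counts into the time they take measured in 6502 cycles, using the 3/5 clock ratio. This is a rough sketch, not a benchmark. The helper name is hypothetical, and the 6-cycle RTS and 4-cycle RET are the standard counts behind the "2 cycles faster" remark.

```python
from fractions import Fraction

# SM83 machine cycles elapsed per 6502 cycle at SGB and NTSC NES clock rates.
CLOCK_RATIO = Fraction(3, 5)

def in_nes_cycles(sm83_cycles):
    """Express a duration in SM83 cycles as the equivalent number of 6502 cycles."""
    return Fraction(sm83_cycles) / CLOCK_RATIO

# Push + pop round trip: 7 cycles on both CPUs,
# but SM83 moves 2 bytes per round trip and 6502 moves 1.
sm83_stack_per_byte = in_nes_cycles(7) / 2
mos_stack_per_byte = Fraction(7)

# Subroutine return: RET is 4 SM83 cycles, RTS is 6 cycles on 6502.
sm83_ret = in_nes_cycles(4)
mos_rts = Fraction(6)

print(f"push+pop per byte: SM83 {float(sm83_stack_per_byte):.2f} vs 6502 {float(mos_stack_per_byte):.2f}")
print(f"return:            SM83 {float(sm83_ret):.2f} vs 6502 {float(mos_rts):.2f}")
```

Under these assumptions the 2-byte push and pop still win per byte after the clock penalty, while RET comes out slightly slower than RTS in wall-clock time despite using fewer cycles.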
ALU instructions: SM83's lack of a penalty cycle for "implied"-mode instructions helps. The Intel-style carry (as opposed to MOS/ARM-style carry) allows a fast idiom for sign-extending A: RLCA followed by SBC A copies bit 7 to all bits of A. (It has to be SBC, not SUB: SUB A always leaves zero.) There's an 8-bit rotate in addition to the 6502's 9-bit one, an arithmetic right shift that copies old bit 7 to new bit 7 (no need for CMP #$80 ROR A), and a nibble swap instruction. But there's no sign flag after ALU operations, and testing bit 7 of an ALU result needs another cycle or two for a compare or bit-test instruction.
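The sign-extension idiom can be checked with a small model of the two instructions. This sketch simulates RLCA followed by SBC A (subtract with carry; a plain SUB A would always leave zero) on an 8-bit accumulator. The function names are mine, and only the A register and carry flag are modeled.

```python
def rlca(a):
    """Rotate A left one bit: old bit 7 goes to both bit 0 and the carry flag."""
    carry = (a >> 7) & 1
    return ((a << 1) | carry) & 0xFF, carry

def sbc_a_a(a, carry):
    """SBC A,A: compute A - A - carry with Intel-style borrow.

    The result is 0x00 when carry is clear and 0xFF when it is set,
    and the carry flag comes out the same as it went in.
    """
    result = (a - a - carry) & 0xFF
    return result, carry

def sign_fill(a):
    """Copy bit 7 of A to all 8 bits, as the RLCA / SBC A pair does."""
    a, carry = rlca(a)
    a, carry = sbc_a_a(a, carry)
    return a

for value in (0x00, 0x7F, 0x80, 0xFF):
    print(f"${value:02X} -> ${sign_fill(value):02X}")
```

The result is the high byte you would pair with the original A to get a sign-extended 16-bit value.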
Memory instructions: With its 2-cycle 16-bit increments and autoincrement for the pointer register HL, SM83 is arguably faster than 6502 for sequential access to arrays, especially those larger than 256 bytes. But for random access, as I've mentioned elsewhere, SM83 resembles Intel's 8080 in lacking the 6502's rich indexed addressing modes. Thus random access to a field of a structure, such as the fields of an actor in a game, requires radically reorganizing structures in memory and planning ahead around the order in which the fields will be accessed. Later I'll post the workarounds that I discovered.
RAM: Game Boy has more. This tilts some space-time tradeoffs; I'd be interested to read how this plays out in practice.
C language: Making a game that runs with minor changes between PC and either NES or Game Boy often involves writing the game logic in C and only the I/O (input, audio, graphics) and systems parts of the engine in assembly. ISSOtm has written thoughts about C on Game Boy. Instructions that involve HL and SP allow for a larger hardware stack, reducing some of the soft-stack penalty that cc65 has to pay.
Surrounding chipset
VRAM bandwidth: GB and NTSC NES have almost the same count of cycles per scanline (114 vs. 113.667). NTSC NES has 20.5 lines of vertical blanking (assuming half of prerender is "borrowed") while GB has 10. GB, however, doesn't support extending blanking to add more VRAM update time. An unrolled copy to VRAM without the popslide technique is 6 cycles/byte on GB compared to 8 on NES. But because the GB PPU is faster relative to the CPU (4 dots per CPU cycle on GB, 3 on NTSC NES) and narrower (160 dots vs. 256), GB also has the majority of its scanline (at least 64 cycles) open for VRAM reading and writing during horizontal blanking. This makes it practical for a loop to copy 8 bytes to VRAM after each of the 144 scanlines even without the GBC's CHR HDMA feature, or 1152 bytes per screen, so long as you take care to avoid tearing. Still, nothing beats the bandwidth of having all your tiles in CHR ROM at a slight cost in flexibility, though GBC has banked CHR RAM.
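As a back-of-the-envelope check on these numbers, the sketch below turns the scanline lengths, blanking heights, and copy costs into bytes per frame. The dot counts per line (456 on GB, 341 on NTSC NES) are standard figures not stated above. The results are upper bounds: they ignore interrupt entry, OAM DMA, loop overhead, and everything else competing for blanking time.

```python
from fractions import Fraction

# Scanline length in CPU cycles: 456 dots / 4 on GB, 341 dots / 3 on NTSC NES.
GB_CYCLES_PER_LINE = Fraction(456, 4)     # 114
NES_CYCLES_PER_LINE = Fraction(341, 3)    # about 113.667

# Vertical blanking height in scanlines, as given in the text.
GB_VBLANK_LINES = 10
NES_VBLANK_LINES = Fraction(41, 2)        # 20.5

# Cost of an unrolled copy to VRAM in cycles per byte, as given in the text.
GB_COPY_CYCLES = 6
NES_COPY_CYCLES = 8

# Whole bytes that fit in vblank if nothing else runs there.
gb_vblank_bytes = int(GB_CYCLES_PER_LINE * GB_VBLANK_LINES // GB_COPY_CYCLES)
nes_vblank_bytes = int(NES_CYCLES_PER_LINE * NES_VBLANK_LINES // NES_COPY_CYCLES)

# GB only: 8 bytes after each of the 144 visible scanlines.
gb_hblank_bytes = 8 * 144

print(f"NES vblank budget: {nes_vblank_bytes} bytes")
print(f"GB vblank budget:  {gb_vblank_bytes} bytes")
print(f"GB hblank budget:  {gb_hblank_bytes} bytes")
print(f"GB total:          {gb_vblank_bytes + gb_hblank_bytes} bytes")
```

On these assumptions the NES wins in vblank alone, and the Game Boy pulls ahead only once the hblank copies are counted.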
OAM DMA on Game Boy takes 160 cycles, running at 1 cycle per byte like Super NES DMA, as opposed to 514 cycles (2 per byte) on the NES. Because it doesn't pause the CPU during DMA execution, only HRAM (its counterpart to NES zero page or GBA IWRAM) is accessible, and DMA is normally done by a 10-byte subroutine in HRAM. OAM DMA is also possible mid-frame provided sprite rendering is turned off, allowing it to be moved out of vblank and into the status bar.
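To compare the two OAM DMA costs in wall-clock terms rather than cycles, this sketch converts them to microseconds. It assumes the Super Game Boy clock (master clock divided by 20) for the Game Boy side; the helper name is hypothetical.

```python
from fractions import Fraction

# Shared master clock in Hz, then each CPU's cycle rate.
MASTER_HZ = Fraction(945_000_000, 44)
GB_CPU_HZ = MASTER_HZ / 20     # SM83 on Super Game Boy
NES_CPU_HZ = MASTER_HZ / 12    # 6502 on NTSC NES

def microseconds(cycles, hz):
    """Duration of a number of CPU cycles at a given clock rate, in microseconds."""
    return float(Fraction(cycles) / hz * 1_000_000)

gb_dma_us = microseconds(160, GB_CPU_HZ)
nes_dma_us = microseconds(514, NES_CPU_HZ)

print(f"GB OAM DMA:  {gb_dma_us:.1f} us")
print(f"NES OAM DMA: {nes_dma_us:.1f} us")
print(f"GB / NES:    {gb_dma_us / nes_dma_us:.2f}")
```

Even at 60% of the clock rate, the Game Boy's DMA finishes in roughly half the time.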
Scrolling: Like the Super NES, the Game Boy lacks the oddball 30-row nametable height. This simplifies some designs for nametable update packets. Monochrome doesn't have attributes at all; GBC is like MMC5 EXRAM or Super NES nametables in that it has a second byte plane of attributes whose addresses parallel those of the nametable. But without special support for +32 increment, a nametable column copy loop is slightly slower than 6 cycles/byte.
Frame rate: Both the NES and Game Boy run at close to 60 frames per second. The original green screen Game Boy (DMG) takes several refreshes to change a pixel from light to dark or vice versa. The Game Boy Pocket and Game Boy Color take fewer, but still not fast enough to make 30 Hz flicker as noticeable as it would be on SGB. This lets developers get away with engines that run on twos, on threes, or even on fours (Balloon Kid), for 30, 20, or 15 fps.
I'm interested in fleshing out the arguments both ways to make them more quantitative as opposed to hand-wavy.