Assuming you want to avoid all graphical glitches, the earliest you can start messing with the PPU address is just after it increments the Y scroll (PPU cycle 257 on the scanline). You must also finish and have the scroll values reset before it begins fetching bytes for the next scanline (cycle 320, meaning you must be done by 319). This leaves you a window of 62 PPU cycles, or roughly 20 CPU cycles at NTSC's 3 PPU cycles per CPU cycle.
Now note that you only have 20 CPU cycles if you happen to hit HBlank at EXACTLY the right time, which typically won't be the case, since the last CPU instruction might have left you 1 cycle into HBlank. So to be safe you should probably plan around having 18 cycles or less, depending on the length of the instructions you use. But for this, let's just say you have 18.
That gives you time for 4 writes (STA, STX, STY absolute are 4 cycles each) and I guess 1 load (LDA, LDX or LDY immediate for the final 2 cycles). If you have A, X, Y all set up with what you want written before HBlank, you may be able to take advantage of all these writes.
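For anyone who wants to check the arithmetic, here's a quick sketch of the budget in Python (just a calculator, obviously not something that runs on the NES). It assumes NTSC timing with 3 PPU cycles per CPU cycle, and the variable names are my own.

```python
# Cycle budget for the HBlank window described above.
# Assumption: NTSC timing, where 1 CPU cycle = 3 PPU cycles.
PPU_PER_CPU = 3

WINDOW_START = 257   # first PPU cycle where it is safe to touch the address
WINDOW_END = 319     # last PPU cycle before fetches for the next line begin

ppu_window = WINDOW_END - WINDOW_START      # 62 PPU cycles
cpu_window = ppu_window // PPU_PER_CPU      # 20 whole CPU cycles

SAFETY_MARGIN = 2    # allowance for not hitting HBlank exactly on time
budget = cpu_window - SAFETY_MARGIN         # 18 CPU cycles

STORE_ABSOLUTE = 4   # STA/STX/STY absolute
LOAD_IMMEDIATE = 2   # LDA/LDX/LDY immediate

# Fit as many 4-cycle stores as possible, then use what is left for loads.
writes, leftover = divmod(budget, STORE_ABSOLUTE)
loads = leftover // LOAD_IMMEDIATE

print(ppu_window, cpu_window, budget, writes, loads)   # 62 20 18 4 1
```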
Now, you'll want the last 2 writes to be resetting the scroll -- so that means you're stuck with the last 2 writes being $2006 writes. Which I guess would make the write before that your palette write (to $2007). Making this stunt just BARELY possible... with a minor problem:
Fine Y scroll will be lost, since you can't set it with just $2006 writes alone (the first $2006 write clears the top bit of fine Y), and you don't have time to alternate $2005/$2006 writes to set it properly. This means that you won't be able to change a palette color on every scanline, but only on half of them (the ones where the fine Y scroll for the next scanline would be less than 4).
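To see where the "less than 4" comes from: the first $2006 write only carries 6 bits into the upper part of the address and clears the bit above them, which is the high bit of fine Y. Here's a rough Python sketch for working out the two bytes you'd write to restore the scroll. The function name and arguments are made up for illustration, and it assumes the usual yyy NN YYYYY XXXXX layout of the 15-bit PPU address.

```python
def scroll_to_2006(coarse_x, coarse_y, fine_y, nametable):
    """Return the (first, second) bytes to write to $2006 for a scroll position.

    Assumed address layout, high bit to low bit:
        yyy NN YYYYY XXXXX  (fine Y, nametable, coarse Y, coarse X)
    """
    if not 0 <= coarse_x <= 31:
        raise ValueError("coarse_x must be 0..31")
    if not 0 <= coarse_y <= 31:
        raise ValueError("coarse_y must be 0..31")
    if not 0 <= nametable <= 3:
        raise ValueError("nametable must be 0..3")
    # The first $2006 write clears the top bit of the address, so only
    # fine Y values 0..3 can be set this way.
    if not 0 <= fine_y <= 3:
        raise ValueError("fine_y above 3 cannot be set through $2006 alone")

    high = (fine_y << 4) | (nametable << 2) | (coarse_y >> 3)
    low = ((coarse_y & 0x07) << 5) | coarse_x
    return high, low


high, low = scroll_to_2006(coarse_x=5, coarse_y=12, fine_y=2, nametable=1)
print(f"${high:02X} ${low:02X}")   # $25 $85
```

Those two values are what would go in X and Y before HBlank. Fine X isn't touched by $2006 at all, so it stays at whatever the last $2005 write set.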
Now note that only the second $2006 write needs to come after PPU cycle 257 (since that's the only one that changes the PPU address directly). The first can come a little earlier. In the end, you'll probably want something like this:
Code:
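LDX #desired_scroll_high   ; 2 cycles, set up before HBlank
LDY #desired_scroll_low    ; 2 cycles
LDA #$3F                   ; 2 cycles, high byte of the palette address
STA $2006                  ; 4 cycles, note - key time 1 .. see below.
LDA #palette_address_you_want_to_change ; 2 cycles, low byte of the palette address
STA $2006                  ; 4 cycles, note - key time 2 .. see below. HBlank starts now
LDA #desired_color         ; 2 cycles
STA $2007                  ; 4 cycles, the palette write
STX $2006                  ; 4 cycles, restore scroll (high byte)
STY $2006                  ; 4 cycles, restore scroll (low byte)
; end HBlank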
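A quick tally of what that sequence costs once HBlank starts, sketched in Python. The cycle counts are the standard 6502 ones (4 for an absolute store, 2 for an immediate load), and the list is just my transcription of the code above. Only the final write cycle of the second STA $2006 has to land inside HBlank, so that instruction only counts for 1 cycle.

```python
# Standard 6502 timings for the two addressing modes used in the sequence.
CYCLES = {"store_absolute": 4, "load_immediate": 2}

# Instructions from the second $2006 write (key time 2) to the end.
sequence = [
    ("STA $2006", "store_absolute"),            # key time 2
    ("LDA #desired_color", "load_immediate"),
    ("STA $2007", "store_absolute"),
    ("STX $2006", "store_absolute"),
    ("STY $2006", "store_absolute"),
]

total = sum(CYCLES[kind] for _, kind in sequence)

# Only the last (write) cycle of the first instruction must fall in HBlank,
# so its 3 earlier cycles can run before HBlank begins.
in_hblank = total - (CYCLES["store_absolute"] - 1)

print(total, in_hblank)   # 18 15
```

So the whole tail is 18 cycles, of which 15 have to fit inside the window.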
Key time 1 (that first $2006 write) can be dangerous depending on the H scroll of the screen. At this time (before HBlank starts), the PPU is still fetching/rendering tiles, and in the process is incrementing the PPU address by 1 every 8 PPU cycles. This wouldn't be a problem, except that if it increments the address so that the low 5 bits wrap from $xx1F, it will reload bit 10 from the temp PPU address, which this write is changing. If this write sets that bit the wrong way, it's okay as long as only the last fetched tile is affected (since it's never rendered to the screen). If you mess up the tile before that, you could have up to 7 distorted pixels on the right side; the tile before that, up to 15 pixels; and so on.
In this case, "distorted" only means the tiles could be loaded from the wrong nametable ($2400 instead of $2000, or $2C00 instead of $2800). So if both nametables contain the same tiles for this part of the screen (or more simply, if you have horizontal mirroring), this problem is avoided altogether.
The Key time 2 (that second $2006 write) is what's most important. HBlank (scanline cycle 257) must hit BEFORE the final cycle of that instruction (the write cycle). If not, you risk distorting the next scanline (and possibly the rest of the frame).
Of course the hardest part of all this (and wouldn't you know it, the part I didn't cover ;P ) is actually FINDING HBlank with any degree of accuracy. IRQs are slow and coarse, and could leave you up to 7 or so cycles off (chopping your slim-but-doable 18 available cycles down to a don't-even-bother 12 cycles). Sprite 0 hit might be good enough since BIT is moderately quick (4 cycles), but with a 4 cycle error you'd only have about 15 cycles (and if you count the cycles in my code above, you'll see I need AT LEAST 15 cycles for it to work, so this could be very dangerous).
EDIT -- also note that this is all theoretical. As has been said, this just flat out might not work on the real system.