Code:
lda {word_a}
asl #2
eor {word_b}
and #$cccc
tay
eor {word_b}
xba
asl
tax
lda lut,x
ror
sta {word_b}
tya
lsr #2
eor {word_a}
xba
asl
tax
lda lut,x
ror
sta {word_a}
It works by taking 2 words of packed pixels like this (each run of four letters is one 4-bit pixel):
Code:
ccccddddaaaabbbb
gggghhhheeeeffff
Swap bits between the two words so that planes 0 and 1 are separated from planes 2 and 3:
Code:
ccggddhhaaeebbff
ccggddhhaaeebbff
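For anyone who wants to check the bit swap off-target, here is a rough Python model of the eor/and/eor part of the routine. The function name and the 0x3333 masks in the comments are just for illustration, and it only covers the swap itself, not the xba or the LUT lookup that follow.

```python
MASK = 0xCCCC  # upper two bits of every nibble

def swap_plane_pairs(word_a, word_b):
    """Model of the 2-bit swap between two 16-bit words (hypothetical helper name)."""
    # lda word_a / asl #2 / eor word_b / and #$cccc / tay
    t = (((word_a << 2) & 0xFFFF) ^ word_b) & MASK
    # eor word_b: upper pair of each nibble now comes from word_a's lower pair,
    # so this word holds the low two bits (planes 0 and 1) of all 8 pixels
    new_b = t ^ word_b
    # tya / lsr #2 / eor word_a: lower pair of each nibble now comes from
    # word_b's upper pair, so this word holds planes 2 and 3
    new_a = (t >> 2) ^ word_a
    return new_a, new_b
```

In other words, new_a ends up as (a & 0xCCCC) | ((b >> 2) & 0x3333) and new_b as ((a << 2) & 0xCCCC) | (b & 0x3333), which is the same thing the diagram shows with letters.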
Then swap high and low bytes:
Code:
aaeebbffccggddhh
aaeebbffccggddhh
Then shift left to get the index for a 64kB conversion LUT (the high bit falls into the carry), load the entry, and then rotate the high bit back in, and we're done.
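The asl / lda lut,x / ror step can be modeled like this in Python. The table contents are an assumption on my part: I'm using a plain "odd bits to the high byte, even bits to the low byte" deinterleave as a stand-in conversion, and the names are made up. The part that follows from the code itself is that each entry has to be stored shifted left by one, since the ror after the load shifts it back down while pulling the carry in as bit 15. That also means the conversion has to leave bit 15 where it is.

```python
def deinterleave(v):
    """Stand-in conversion (assumed): odd bits -> high byte, even bits -> low byte."""
    hi = lo = 0
    for i in range(8):
        lo |= ((v >> (2 * i)) & 1) << i
        hi |= ((v >> (2 * i + 1)) & 1) << i
    return (hi << 8) | lo

# One 16-bit entry per 15-bit input: 32768 entries * 2 bytes = 64kB.
# Entries are pre-shifted left by one so the ror lines them up again.
LUT = [(deinterleave(i) << 1) & 0xFFFF for i in range(0x8000)]

def lookup(word):
    carry = word >> 15                   # asl: bit 15 falls into the carry
    x = (word << 1) & 0xFFFF             # asl / tax: byte offset into the table
    entry = LUT[x >> 1]                  # lda lut,x (16-bit load at offset x)
    return (carry << 15) | (entry >> 1)  # ror: carry comes back in as bit 15
```

With this stand-in table, lookup(w) gives the same result as deinterleave(w) for any 16-bit w, even though only 15 bits are used as the index.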
I didn't count how many cycles this takes but I believe this would be pretty fast, as long as you already have 2 pixels per byte in the first place.
Edit: Found out this takes 69 cycles total for the 8 pixels, so 8.625 cycles per pixel.