My attempt at RLE compression

My attempt at RLE compression
by nicklausw on 2016-04-09 (#167951)

I wanted a reason to practice my python skills, and so I've tried to make my own RLE compression thingy for the SNES.

The way the format works is like this. Say you have these bytes:

Code:

00 ff 00 00 ff ff 00 00 00 00 00 00 ff ff ff ff ff ff

The new file is made up of pairs of two: one byte for how many of the second byte should be copied.
An FF is appended at the end to mark where the routine should stop copying, so the output would be:

Code:

01 00 01 ff 02 00 02 ff 06 00 06 ff ff

The .py file does this perfectly:

Code:

#!/usr/bin/env python3
# basic rle compression program
# by nicklausw

# this format works as follows.
# one byte is read, and that is
# how many of the next byte to copy over. repeat.
# when the first byte is $ff, the copy is done.

import sys
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("in_file", help="the chr data to be rle'd")
parser.add_argument("out_file", help="the rle'd output")
args = parser.parse_args()

f_in = open(args.in_file, 'rb')
f_out = open(args.out_file, 'wb')

# python doesn't need variable initialization i think,
# but for clarity's sake.
byte_n = 0 # byte
byte_c = 1 # byte count 

for c in f_in.read():
  if c != byte_n or byte_c == 0xFE:
    f_out.write(bytes([byte_c]))
    f_out.write(bytes([byte_n]))
    byte_c = 1
    byte_n = c
  else:
    byte_c += 1

# now catch the last time around
f_out.write(bytes([byte_c]))
f_out.write(bytes([byte_n]))

# the end mark
f_out.write(bytes([0xFF]))

f_in.close()
f_out.close()

The problem is the routine to copy the bytes to vram:

Code:

.proc rle_copy_ppu
  setxy16
  seta8
  ldy #$00

 loop: 
  seta8
  lda (rle_cp_ram), y
  cpa #$ff
  beq done
  seta16
  and #$ff
  tax
  iny
  seta8
  lda (rle_cp_ram),y
  jsr rle_loop
  iny
  jmp loop
  
done:
  rts

rle_loop:
  seta16
  ;and #$ff
 loop2:
  sta PPUDATA
  dex
  bne loop2
  rts
.endproc

It's called like this:

Code:

; copy font
  setaxy16
  stz PPUADDR  ; we will start video memory at $0000
  lda #font
  sta rle_cp_ram
  jsr rle_copy_ppu

Now, this is the expected output:

And this is the actual output:

I'll give more info on my code if need be, but can anyone see anything wrong that might cause an error in things being copied over?

Re: My attempt at RLE compression
by bazz on 2016-04-09 (#167955)

I don't have time to look in-depth at your code, but I noticed that you are only using a byte for the count. So, if your original data has more than 254 repetitions (because you are also using 255 as a terminator), then you will need to accommodate this. Rather than make the count a 16-bit value and consistently waste space, you can just use a "divide and modulus" strategy to make combination entries to create the proper total.

EDIT: (Actually, your code is already handling this just fine!)

EDIT: supplying the actual data you are using for the RLE (2bpp font) would be helpful too.

Re: My attempt at RLE compression
by tokumaru on 2016-04-09 (#167957)

That's an incredibly wasteful variation of RLE, seeing as every non repeated byte gets a "1" inserted in front of it! I wouldn't be surprised if this caused expansion instead of compression in a lot of cases!

The most common solution is to specify runs of uncompressed bytes, just like you represent runs of compressed bytes. You can use bit 7 to specify whether a run is compressed or uncompressed, and bits 0-6 to specify the length, from 1 to 128. Or you can assume that compressed and uncompressed runs will alternate, and use all 8 bits for the length. In this case, you'd need to allow runs of 0 length, which you'd use between two runs of the same type.

Another option is to use auto-triggered RLE, where a length is only specified after 2 (or 3) repeated bytes show up, and more are assumed to come. This assumption might be wrong, in which case the length will be 0.

Re: My attempt at RLE compression
by nicklausw on 2016-04-09 (#167958)

Bazz: good point, but it doesn't affect the file I'm RLE-ing.

tokumaru: thanks for the suggestion, but if I can't even get this basic RLE to work then why over-complicate it more?

Anyway, I fixed the python file, the first byte was copied either once too many times if it's 0 or once too few if not. The bug is still present.

Code:

#!/usr/bin/env python3
# basic rle compression program
# by nicklausw
# public domain.

# this format works as follows.
# one byte is read, and that is
# how many of the next byte to copy over. repeat.
# when the first byte is $ff, the copy is done.

import sys
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("in_file", help="the chr data to be rle'd")
parser.add_argument("out_file", help="the rle'd output")
args = parser.parse_args()

f_in = open(args.in_file, 'rb')
f_out = open(args.out_file, 'wb')

# python doesn't need variable initialization i think,
# but for clarity's sake.
byte_n = int.from_bytes(f_in.read(1), byteorder='little') # byte
byte_c = 0 # byte count

f_in.seek(0) 

for c in f_in.read():
  if c != byte_n or byte_c == 0xFE:
    f_out.write(bytes([byte_c]))
    f_out.write(bytes([byte_n]))
    byte_c = 1
    byte_n = c
  else:
    byte_c += 1

# now catch the last time around
f_out.write(bytes([byte_c]))
f_out.write(bytes([byte_n]))

# the end mark
f_out.write(bytes([0xFF]))

f_in.close()
f_out.close()

Re: My attempt at RLE compression
by bazz on 2016-04-09 (#167959)

tokumaru wrote:

It can only be considered wasteful because people here are generalizing (OP is not designing for any particular dataset or hasn't communicated that) -- and it's been on my mind how I believe a lot of concepts are discussed in this forum "generally" -- when we all know everything changes when a specific project provides actual constraints.

Now is a good time to mention "use-case." I bring this up because there's a difference between designing a general RLE or an RLE specific to a dataset. This same notion applies to almost any other part of development.

Personally, I am going to use RLE for my joypad recordings that run my game's demo mode, and it is likely to ALWAYS have repeats -- in which case, using a bit to specify uncompressed or compressed data would actually be wasteful (waste half a byte per entry). So always consider your use-case if you are designing for some specific dataset.

On the other hand, I'm not sure if there are better compression algos for such a dataset.

Let's also consider the size of the decompression routine in ROM compared to the volume of compressed data we will be using, and ensure there is still a sizable benefit.

Re: My attempt at RLE compression
by dougeff on 2016-04-09 (#167960)

Edit...somehow I missed that you were talking about graphic compression for SNES. Never mind.

Re: My attempt at RLE compression
by bazz on 2016-04-09 (#167961)

Quote:

tokumaru: thanks for the suggestion, but if I can't even get this basic RLE to work then why over-complicate it more?

I am proud of you.

I am going to take a closer look at it now

Re: My attempt at RLE compression
by bazz on 2016-04-09 (#167962)

no reason to make all of 8 / 16 / 8 bits changes to the accumulator.. just clear hi-8 before start loop

Next -- the problem appears to be in the vram write loop -- you only load the accumulator with a byte, but you are writing a whole word to VRAM, with the high byte as zero.. Is this intentional???? I thought you wanted to copy the bytes directly to VRAM as they are?? But in this case, you are only using them as low bytes of a word.... Since I'm not sure... I'm not going to finish any sort of solution.. But at least have done some refactoring below. (mainly removing all the rep/sep 's) (This could be refactored even further if you don't need the high byte zero at all).

I'm assuming PPUDATA is the VRAM, which is a $2118/$2119 IIRC? Don't forget to have properly configured the increment setting ($2115) (I won't add this in my code.

---- my version (incomplete)

Code:

.proc rle_copy_ppu
  setxy16
  seta8
  ldy #$00
  lda #0   ; clear Accum. hi-byte
  xba

 loop: 
  lda (rle_cp_ram), y
  cpa #$ff
  beq done
  tax
  iny
  lda (rle_cp_ram),y
  jsr rle_loop
  iny
  bra loop
  
done:
  rts

; IN: X = count
;      A = byte
rle_loop:
; INCOMPLETE / WILL NOT WORK (for above mentioned reasons in forum post)
; Please finish reading forum post and reply so we can work out your needs
  sta PPUDATA   ; this is only writes the low byte.
  dex
  bne rle_loop
  rts
.endproc

-----
So now, let's talk about a solution if you do indeed want to write bytes directly to VRAM (and not just lobytes of a word with the high byte as zero, as you started with). we've identified how it's an issue since writing to VRAM requires maintaining an index of whether we are currently writing to the
low byte or high byte of the VRAM register.. and also, if we do the last transfer, ending on the low byte, we'll need to manually write an empty high byte or (some other solution) to get it transferred.

Because of this added difficulty -- well, you can figure it out if you'd like.. Shouldn't be too hard -- but also consider an alternative of decompressing to RAM and DMA'ing -- once again,
it's just an alternative -- and with no project constraints it doesn't matter (go with VRAM direct, it sounds more challenging

Re: My attempt at RLE compression
by nicklausw on 2016-04-09 (#167967)

Is there enough RAM to store 8 KB of data for OAM? Because I'm scared of having to figure out the words and stuff for a direct VRAM copy.

Or is there a way for PPUDATA to just take bytes?

Re: My attempt at RLE compression
by bazz on 2016-04-09 (#167974)

> Is there enough RAM to store 8 KB of data for OAM?

s/OAM/VRAM ??? There are ~~64KB~~ 128KB of SysRam

> Or is there a way for PPUDATA to just take bytes?

There is only a way to have the VRAM word address increment based on lo-byte access or hi-byte access, but there is no single byte-port to write to.

----

Let's just verify that your decompression routine works -- you can decompress into SysRAM, and use a debugger to dump the RAM to file. then compare your decompressed section (you may need to trim in a hex editor) with the original data. (I use vbindiff for this)

Re: My attempt at RLE compression
by bazz on 2016-04-09 (#167983)

Code:

byte_n = int.from_bytes(f_in.read(1), byteorder='little') # byte
byte_c = 0 # byte count

f_in.seek(0)

should be

Code:

byte_n = int.from_bytes(f_in.read(1), byteorder='little') # byte
byte_c = 1 # byte count

There's no point in re-seeking back the byte you've already read.

Caution: I'm not a big python programmer, let alone python3 - so if there's other points of improvement, I can't see them X_X

I tested out your compression program -- it seems to work just fine. And, it does automatically cover > 254 repetitions

I used a nifty perl command to print large repetitions of characters for testing (I actually learned this from "Hacking: The Art of Exploitation" like 14 years ago xD)

perl -e 'print "A"x256' > test.txt

Have fun proving your SNES decompression works ^_^

Re: My attempt at RLE compression
by tokumaru on 2016-04-09 (#167987)

nicklausw wrote:

tokumaru: thanks for the suggestion, but if I can't even get this basic RLE to work then why over-complicate it more?

You're right. I actually didn't realize you were having trouble making your current design work until I had typed my entire post, but I ended up posting it anyway. I can't help much with the SNES side of things, so please consider what I wrote as tips for the future.

Re: My attempt at RLE compression
by nicklausw on 2016-04-10 (#168011)

Okay, this routine copies to non-zeropage RAM properly (to $100), but there isn't enough of it available so everything screws up.

You say there's 64 KB of RAM (I said OAM on mistake)? Where? As that would fix everything.

The variables:

Code:

.segment "ZEROPAGE"
  rle_cp_ram: .res 2
  rle_cp_num: .res 2
  
.segment "BSS"
  rle_cp_dat: .res 2 ; a LOT more

The fixed routine:

Code:

.proc rle_copy_ram
  setxy16
  seta8
  ldy #$00
  sty rle_cp_num
  lda #0   ; clear Accum. hi-byte
  xba

 loop:
  lda (rle_cp_ram), y
  cpa #$ff
  beq done
  tax
  iny
  lda (rle_cp_ram),y
  jsr rle_loop
  iny
  bra loop
 
done:
  rts

; IN: X = count
;      A = byte
rle_loop:
  phy
  ldy rle_cp_num
rle_inter:
  sta rle_cp_dat,y
  iny
  dex
  bne rle_inter
  sty rle_cp_num
  ply
  rts
.endproc

Re: My attempt at RLE compression
by tepples on 2016-04-10 (#168012)

The Super NES has 64 KiB of VRAM and 128 KiB of work RAM at $7E0000-$7FFFFF. (The bottom 8 KiB of work RAM is mirrored into the bottom 8 KiB of banks $00-$3F and $80-$BF.) One common pattern is to decompress tile data to work RAM and then schedule a DMA to copy a few KiB to VRAM during the next vertical blanking period.

Re: My attempt at RLE compression
by nicklausw on 2016-04-10 (#168015)

So do I have to do fancy bank switching, or do long references (lda $7fe000) work fine?

Re: My attempt at RLE compression
by dougeff on 2016-04-10 (#168017)

Quote:

So do I have to do fancy bank switching, or do long references (lda $7fe000) work fine?

It looks from tepples example, that you would be reading from the ROM, decompressing to the WRAM, and then DMA ing to the VRAM.

I would assume you would use...STA long,X or something, for the second part.

Re: My attempt at RLE compression
by bazz on 2016-04-10 (#168020)

Thanks Tepples for correcting me, Sysram is 128KB ..

all other comments are on the right track

Re: My attempt at RLE compression
by bazz on 2016-04-10 (#168046)

I just want to happily report that I wrote the asm code to stream RLE joypad log data to my game's joypad routine for its demo mode. It works

. It only required an extra byte of RAM to track the count -- and it made my data 1/4 the size

I also wrote a RLE decompression program for python3 based on nicklausw's scheme.

Code:

#!/usr/bin/env python3
# basic rle decompression program
# by bazz
# public domain.

# There is no error detection

import sys
import argparse
import struct

parser = argparse.ArgumentParser()
parser.add_argument("in_file", help="the RLE data to be decompressed")
args = parser.parse_args()

with open(args.in_file, 'rb') as f_in:
  while True:
    count = int.from_bytes(f_in.read(1), byteorder='little')
    if count == 0xFF:
      break
    byte = f_in.read(1)
    while count > 0:
      sys.stdout.buffer.write(byte)
      count -= 1

I am a big fan of writing output to terminal, and deciding if I want to pipe output to other programs or to actual files. That's why I do not have the decompressor write explicitly to a file. I modified my local compressor to behave the same way.

Re: My attempt at RLE compression
by nicklausw on 2016-04-11 (#168103)

Success! But now...Question time!

1. On the Z80, pushing and pulling from the stack is discouraged in loops because it's slow. Is it discouraged on the SNES? Because the only way I could get long RAM addressing while not making the rest of the code problematic was this:

Code:

rle_inter:
  phx
  phy
  plx
  sta $7e2000,x   ; this is only writes the low byte.
  phx
  ply
  plx
  iny
  dex
  bne rle_inter
  sty rle_cp_num
  ply
  bra rle_loop_done

2. How do you get CA65 to figure out that you're trying to use long addressing? This is why I had to write out $7e2000 above. EDIT: figured this one out. CA65 says it's "far" addressing.

Re: My attempt at RLE compression
by tepples on 2016-04-11 (#168109)

nicklausw wrote:

On the Z80, pushing and pulling from the stack is discouraged in loops because it's slow. Is it discouraged on the SNES?

A 16-bit push takes 4 cycles, and a 16-bit pull takes 5.

Quote:

Because the only way I could get long RAM addressing while not making the rest of the code problematic was this:

Code:

rle_inter:
  phx
  phy
  plx
  sta $7e2000,x   ; this is only writes the low byte.
  phx
  ply
  plx
  iny
  dex
  bne rle_inter
  sty rle_cp_num
  ply
  bra rle_loop_done

I'd have to see it in context to see whether you could rearrange the use of registers to minimize stack use, such as using Y for the source (especially using [dd],Y addressing) and X for the destination.

Quote:

2. How do you get CA65 to figure out that you're trying to use long addressing? This is why I had to write out $7e2000 above. EDIT: figured this one out. CA65 says it's "far" addressing.

To force far (24-bit) addressing, prefix the address with f:.

Re: My attempt at RLE compression
by nicklausw on 2016-04-11 (#168113)

K then, I'll give the function more context.

variables: (I'm using your cfg file)

Code:

.segment "ZEROPAGE"
  rle_cp_ram: .res 2
  rle_cp_num: .res 2
  
.segment "BSS7E" : far
  rle_cp_dat: .res 8192 ; 8 KB?

Function:

Code:

.proc rle_copy_ram
  setxy16
  seta8
  ldy #$00
  sty rle_cp_num
  lda #0   ; clear Accum. hi-byte
  xba

 loop:
  lda (rle_cp_ram), y
  cpa #$ff
  beq done
  tax
  iny
  lda (rle_cp_ram),y
  bra rle_loop
rle_loop_done:
  iny
  bra loop
 
done:
  rtl

; IN: X = count
;      A = byte
rle_loop:
; INCOMPLETE / WILL NOT WORK (for above mentioned reasons in forum post)
; Please finish reading forum post and reply so we can work out your needs
  phy
  ldy rle_cp_num
rle_inter:
  phx
  phy
  plx
  sta rle_cp_dat,x   ; this is only writes the low byte.
  phx
  ply
  plx
  iny
  dex
  bne rle_inter
  sty rle_cp_num
  ply
  bra rle_loop_done
.endproc

Function being called, I guess:

Code:

; copy font
  setaxy16
  lda #font & $ffff
  sta rle_cp_ram
  jsl rle_copy_ram

Re: My attempt at RLE compression
by tepples on 2016-04-11 (#168121)

Assuming rle_cp_ram is the source address, and the data is in the current data bank. If so, rle_cp_src might be a clearer name.

In this loop, you may want to put Y (source index) on the stack first so that you can use Y as a remaining length counter.

See if this (untested) makes any sense:

Code:

.segment "ZEROPAGE"
  rle_cp_src: .res 2    ; Renamed: serves as pointer (within current bank) to compressed data
  rle_cp_index: .res 2  ; Renamed: serves as index into rle_cp_dat
 
.segment "BSS7E" : far
  rle_cp_dat: .res 8192 ; 8 KB?

;;
; Decompresses data to rle_cp_dat using a simple RLE scheme.
; @param DBR:rle_cp_src pointer to compressed data
; @return rle_cp_index = 4
.proc rle_copy_ram
  setxy16
  seta8
  ldy #$00
  sty rle_cp_index
  tya  ; clear low and high bytes of accumulator

 loop:
  lda (rle_cp_src),y
  cpa #$ff  ; RLE data is terminated by a run length of $FF
  beq done  ; But what does a run length of 0 do?
  tax
  iny
  lda (rle_cp_src),y
  iny
  phy

  ; At this point, Y (source index) is saved on the stack,
  ; A is the byte to write, and X is the length of the run.
  txy
  ldx rle_cp_index
  ; And here, Y is the length of the run, A is the byte to write,
  ; and X is the index into the decompression buffer.
rle_inter:
  sta rle_cp_dat,x
  inx
  dey
  bne rle_inter

  stx rle_cp_index
  ply  ; Restore source index
  bra loop
 
done:
  rtl
.endproc

I wonder if it could be made even faster by DMAing that byte from ROM to the WRAM's B bus port instead of using this CPU fill.

Re: My attempt at RLE compression
by nicklausw on 2016-04-12 (#168221)

The code worked after I added a little something, because x's higher bit never gets cleared before transferring to y (causing way too much byte copying):

Code:

; At this point, Y (source index) is saved on the stack,
  ; A is the byte to write, and X is the length of the run.
  txy
  
  ; additional code starts here
  
  ; no higher byte!
  pha
  seta16
  tya
  and #$ff
  tay
  seta8
  pla

Nothing worse about assembly programming than learning the limitations of registers.

Does anyone know of any documents that explain more about what exactly DMA does? Because as far as I know, it just does super fast data transfers and no one really questions why.

Re: My attempt at RLE compression
by 93143 on 2016-04-12 (#168228)

http://wiki.superfamicom.org/snes/show/DMA+&+HDMA
http://wiki.superfamicom.org/snes/show/Timing
http://problemkaputt.de/fullsnes.htm#snesdmatransfers

What exactly are you looking for?

Re: My attempt at RLE compression
by nicklausw on 2016-04-12 (#168232)

Thanks. I wasn't looking for anything exactly, just wanted to know how DMA works. So basically it reads data without passing by the CPU first?

Also, here's the ROM for what I've been working on throughout this thread. It's a..."platformer". Quotes because there's no sprites yet, and even once they're there, it's not really designed to be scrolling.

I don't want to release source right now, so I'll just say this: the ROM...
Initializes the SNES registers (thanks koitsu),
Loads a little palette through DMA (thanks tepples and AntonioND for bgr macro),
Decompresses some RLE data to RAM and DMA's to VRAM (thanks tepples, bazz, (and myself [semi-sarcasm?] because this is the only part where I got kind of original)),
Draws stuff to the screen (thank-yous already handled for that),
and fades in the screen (I actually figured this out myself).
Extra thank you's to tepples for various stuff like the cfg file.

The point I'm trying to make with all these weird thank-yous is that this is one crazy system. :lol:

Re: My attempt at RLE compression
by nicklausw on 2016-04-12 (#168255)

Update, here's a git repo with it all. The set-up is a little weird, so if for some reason you want to play with it then be sure that the makefile suits your environment.

Re: My attempt at RLE compression
by nicklausw on 2016-04-13 (#168384)

Another update I guess! The RLE now handles many individual symbols.

Used to, this:

Code:

00 01 00 01 00 01

Would basically be expanded to this during "compression":

Code:

01 00 01 01 01 00 01 01 01 00 01 01

Now, longer statements are available with the format of:
$fe: to warn the decomp routine to direct copy these bytes.
$xx: the next byte is the number of bytes ahead to copy.
And then there's the bytes.

So now these bytes would become:

Code:

fe 06 00 01 00 01 00 01

This reduced the size of the font in ROM by 2 KB. Not a crazy amount given the size available on a SNES cartridge, but this is more complicated than I thought I'd be willing to get, so I'm happy.

It's all in the repo.