It's just a variant of Cinepak made to work with palettes. The base blocks are 2×2, 4×2 and 4×4, with a dictionary for each size (up to 256 entries each). Any block that doesn't match an entry in those dictionaries has to be stored as-is. I don't remember how audio is stored, beyond it being streamed alongside the video (the sample rate is quite low, though).
Not sure if this holds true for all versions (the decoder was constantly being updated over time, and Sega made sure only the latest version was being used, if the addendums are anything to go by), but I know that's how at least some of them work.
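To make the dictionary idea a bit more concrete, here's a rough Python sketch of how a single block could be resolved. This is only an illustration of the concept: the function names, the "dict"/"raw" tags and the way blocks are passed around are all made up by me, and the real bitstream layout is not something I'm claiming to know.

```python
# Hypothetical sketch only: names and layout are assumptions, not the real format.
BLOCK_SIZES = {"2x2": (2, 2), "4x2": (4, 2), "4x4": (4, 4)}  # (width, height)
MAX_ENTRIES = 256  # so a dictionary index fits in one byte


def make_dictionaries():
    """One lookup table per block size, each holding up to 256 blocks."""
    return {name: [] for name in BLOCK_SIZES}


def add_entry(dictionaries, size, pixels):
    """Store a block of palette indices and return its index in the table."""
    width, height = BLOCK_SIZES[size]
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match block size")
    table = dictionaries[size]
    if len(table) >= MAX_ENTRIES:
        raise ValueError("dictionary is full")
    table.append(tuple(pixels))
    return len(table) - 1


def decode_block(dictionaries, size, kind, payload):
    """Return the palette indices of one block, row by row.

    kind == "dict": payload is an index into the dictionary for `size`.
    kind == "raw":  payload is the block's pixels, stored as-is.
    """
    width, height = BLOCK_SIZES[size]
    if kind == "dict":
        return dictionaries[size][payload]
    if kind == "raw":
        if len(payload) != width * height:
            raise ValueError("pixel count does not match block size")
        return tuple(payload)
    raise ValueError("unknown block kind: " + repr(kind))


if __name__ == "__main__":
    dicts = make_dictionaries()
    index = add_entry(dicts, "2x2", [1, 1, 2, 2])
    print(decode_block(dicts, "2x2", "dict", index))         # (1, 1, 2, 2)
    print(decode_block(dicts, "2x2", "raw", [5, 6, 7, 8]))   # (5, 6, 7, 8)
```

The point is just that a block found in a dictionary costs one index, while anything else costs the full set of palette indices for that block.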
EDIT:
Or just look here instead of taking my word for it. I don't guarantee the accuracy of my statements =P (although that seems to be the gist of the basics).