Can I optimize this sorting routine?

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
View original topic
Can I optimize this sorting routine?
by on (#121309)
I have a collection of 28 actors. I want to sort them by their y value. Since the actors are each 8 bytes, I figured it would be faster to sort an array of indexes to the actors. This is an implementation of the insertion sort routine.

I'm posting this question because this is my first NES project, I'm still getting used to the limitations of the 6502 (particularly all the addressing modes!), and I'm hoping that any insights people could offer could speed up all the coding I do in the future. Can anyone think of a way that I could speed up this code?

The vast majority of the time in the sorting routine is spent doing the following (at label '_jSort'):
1. y = the current position in the collection to be sorted.
2. save y
3. a = the index of the actor at the current position in the collection.
4. transfer a to y
5. a = the y value of the actor at the index in y.
>>> my comparison and swap code happens here
6. restore y to the saved value (restoring it to the current position in the collection to be sorted).
7. Decrement y.
8. loop to 1.

This is the complete routine. Any thoughts how I could optimize this code?
Code:
; -------------------------------- [ SortActors ] --------------------------------
; In:     A is the superchunk index (0-3) that contains the list of Static Actors we are sorting.
; Notes: Uses $00-$05 in zp.
SortActors:
.scope
.alias    _PointerSort      $00
.alias    _PointerYVals     $02
.alias    _key              $04
.alias    _y                $05

    and #$03                 ; clip A to $00-$03
    `add $03                 ; a = a + 3
   
    sta [_PointerSort+1]     ; _PointerSort = ptr to sort array.
    sta [_PointerYVals+1]    ; _PointerYVals = to y-values.
    lda #StaticSortArray
    sta _PointerSort
    lda #StaticsY
    sta _PointerYVals

    ; insertion sort
    ldx #01
    _iSort:
        txa                            ; y = x
        tay                            ;
        lda (_PointerSort),y           ; a = actor[x]
        sty _y                         ;
        tay                            ;
        pha                            ; push a (index pointed to by x)
        lda (_PointerYVals),y          ; a = actor[x].y
        sta _key                       ; key = actor[x].y
        ldy _y
        dey                            ; y = x - 1
        _jSort:                        ; while (key < actor[y].y && y >= 0) {
            lda (_PointerSort),y       ; ; a = actor[y].y
            sty _y                     ; ; _y = y
            tay                        ;
            lda (_PointerYVals),y      ;
            cmp _key                   ;
            ldy _y                     ; ;    y = _y
            bcc _iNext                 ;
            lda    (_PointerSort),y    ;
            iny                        ;    actor[y+1] = actor[y]
            sta (_PointerSort),y       ;
            dey                        ;
            dey                        ;     y = y - 1
            bpl _jSort                 ; }
        _iNext:
        iny                            ; actor[y+1] = actor[x]
        pla                            ; pull a (index pointed to by x)
        sta (_PointerSort),y           ;
        inx
        cpx #StaticCount
        bne _iSort
    rts
Re: Can I optimize this sorting routine?
by on (#121310)
I don't have any objective numbers to back my suggestion, but instead of looping through the objects just for sorting, I would do the sorting at the same time as processing their movement, since the sorting is directly related to their positions. Whenever an object's Y coordinate changes, check the appropriate neighbor object (which one will depend on whether the coordinate increased or decreased) to decide whether they should switch places (more switches might be needed, so you gotta keep checking in that direction until no more switches are needed).

Based on my experience, looping is pretty slow, so you should avoid looping through the same entities more than once per frame. If you're gonna loop though the objects to run their AI, take the opportunity to do everything else you have to do with them, like rendering their sprites or sorting their coordinates. I know it might seem crazy to draw objects when you haven't processed them all yet, because it means that their state isn't yet final, but the solution I found for this is to render the graphics based on data from the last frame (i.e. render sprites before updating the positions), which you know is consistent. A 1 frame delay isn't really noticeable to players, and the time you save by not looping through all the objects again is considerable.

Another thing that might help is using a linked list of objects. If you store references to the previous and next objects in each object's RAM, you can navigate through the objects in a single list, instead of having to jump between the list of objects and the list of sorted object indices. Linked lists also make it possible to exclude unused object slots, you simply don't link to them, so when you only have a few active objects you won't waste time looking at empty slots.

I wouldn't do these modifications unless I was actually experiencing slowdowns though, which I'm not sure is your case.

EDIT: I realize it would be cleaner to have simpler specialized routines that you can call in the appropriate order, but depending on the amount of data you have to process per frame on such a limited machine as the NES, you'll eventually see the need to combine tasks in order to save time. It isn't always pretty, but sometimes it's necessary in order to get the performance that a complex game needs.
Re: Can I optimize this sorting routine?
by on (#121320)
The classic optimization for insertion sort is due to David Shell. Instead of starting by comparing actors 1 unit apart, make three passes over your data. First compare actors 10 units apart, then actors 4 units apart, then actors 1 unit apart. I used Shell sort (but with Knuth's 13, 4, 1 sequence) in the early version of my "Who's Cuter" tech demo. (It has since been replaced with mergesort.)

Another way is to do a single round of bubble sort. This may produce momentarily missorted items for one frame if one item skips past numerous items, but so long as the items don't move drastically from frame to frame like a blue hedgehog on methamphetamine, the player won't notice. I used this "eventual bubble sort" in the 2011 update of my Sprite Cans demo as part of a sprite rotation technique. (The old version of Sprite Cans used exact insertion sort and would occasionally slow down.)
Re: Can I optimize this sorting routine?
by on (#121326)
As far I know bubble sort is the fastest for small datas (lower than 16-32) and then you'll have to ressort to a complicated algorithm to sort more data (that you probably don't want to implement in 6502 ASM ;) )
Re: Can I optimize this sorting routine?
by on (#121327)
I don't think Bubble Sort is fastest in any case except the already sorted case. This is a surprising suggestion to me. It is sometimes the fastest sort to code, as insertion sort and bubble sort are usually very easy in concept and execution, but their running time is still O(n^2) for randomly sorted lists whether those lists are small or large.

Bubble sort and insertion sort can both be good for sorting a little bit at a time, like tepples suggested, but if you need to sort everything from scratch each frame, I would suggest an O(n log n) sort like Heap Sort or Merge Sort. Heap Sort is usually easier to do in-place than Merge Sort (which typically needs a temporary buffer to copy back and forth from). Heap Sort usually takes a little more code as well, though.

There are algorithms like Intro Sort which use something like Quick Sort until the division of the list gets small enough, at which point it switches to Heap Sort. The choice of Heap Sort for smaller parts of the list is partly due to Heap Sort's random-access patterns working best when the working set can be cached, and breaking up the worst-case problems of Quick Sort (Quick Sort is expected O(n log n) but O(n^2) in some rare/worst cases). Quick Sort, on the other hand, has linear access patterns, so it works better on very large sets when caching is an issue.

The caching thing isn't really relevant for the 6502, where you're not going to have a random-access performance penalty anyway. Operating on a pointer list is a good way to minimize the inefficiency due to movement of data, so it's good you've done that as an optimization already. If your comparison step is also optimized, the only thing left is to use an algorithm that reduces the actual number of comparisons and data-swaps you end up doing.

Edit: just thinking about the efficiency of bubble sort, to ballpark the efficiency breakdown, it really is efficient on an almost-sorted list. Roughly speaking, if there are less than log2(n) items out of place it will be faster than Heap Sort (which always takes the same amount of time). Similarly speaking if you need to move more than log2(n) items per frame, you might as well switch to an O(n log n) sort and do the whole list, since it becomes more efficient to just sort the whole list with a Heap/Merge sort at that point. This is why bubble sort is actually a good choice for sorting a little bit of the list each frame.
Re: Can I optimize this sorting routine?
by on (#121328)
Speaking of the implementation optimization, rather than choosing the most effective sorting algorithm, I'd try to get rid from frequent indirect indexed addressing by storing lists in fixed memory areas and probably try to unroll the loop, since it is just 28 objects.
Re: Can I optimize this sorting routine?
by on (#121350)
I was just saying that bubble sort is fast for small datas.
Of course it's O(n^2), but if n is sufficiently small this is not a problem. This is a tremendous problem if n is large.

Heap sort is very efficient for large data, but you don't want to implement it in 6502 ASM. You could but I bet it'd be at least a dozen times bigger and slower for each iteration of the loop. All the extra effort is only worth if the data is large enough so that the gain in complexity matters.

Bubble Sort is n^2 only in the absolute worst case where data is sorted in the opposite order. If is also extremely straightforward to implement, in 6502 asm it should be like 10-15 instructions or so. I like it that you can make the sort stable or not by changing the algorithm slightly. I use it for sorting the sprites I display on screen.
Re: Can I optimize this sorting routine?
by on (#121351)
For the case of sorting sprites for display, where you want to turn nearly-sorted data into even closer to sorted data with the occasional mistake, you can handle the worst fast-moving elements by running an ascending pass of bubble sort followed by a descending pass. If you need exactitude, such as sort-and-sweep collision detection that relies on coordinates being monotonic, with the coordinate of a higher-numbered element always being greater than or equal that of the previous element, Shell sort is probably the best option below about 100 elements.
Re: Can I optimize this sorting routine?
by on (#121352)
Bregalad wrote:
Heap sort is very efficient for large data, but you don't want to implement it in 6502 ASM. You could but I bet it'd be at least a dozen times bigger and slower for each iteration of the loop. All the extra effort is only worth if the data is large enough so that the gain in complexity matters.


I would suspect a heap sort to have about 6x the amount of code compared to a bubble sort, but per move+compare+swap step I would sincerely doubt the implementation would be any worse than 50% as efficient, easily made up by the algorithmic efficiency for if you need to sort more than ~log2(n) elements. You need a lot of extra code for handling left/right branches etc. but you don't take all of that code in a single equivalent step.

Anyhow, as stated before, if you can retain the mostly-in-order list from the previous frame then just stick with bubble sort.
Re: Can I optimize this sorting routine?
by on (#121355)
I didn't even question the sorting algorithm in my response because I think the algorithm is not very important in this case, what's important is reducing indirection (which means less saving and restoring of registers and slow pointer manipulation) and the number of times you iterate over the objects, which IMO are the 2 things responsible for making a task like this slow.

What I'm suggesting is something like the code below right after the object is done being updated for the frame:
Code:
   ;object attributes are stored as a structure of arrays (interleaved)
   ;x contains the current object's slot index (unsorted: 0 = first, 1 = second, etc.)
   ;each object holds indices to the one before it and to the one after it (sorted linked list)
   ;DeltaY is some variable that was used to move the object vertically
   bit DeltaY ;check whether the object moved up or down
   bmi ComparePrevious
CompareNext:
   ldy NextObjectIndex, x ;have y point to the next object
   bmi Done ;don't do anything if this is the last object
   lda ObjectY, y ;compare the coordinates
   cmp ObjectY, x
   bcs Done ;if NextY >= CurrentY, we're done
   ;SWAP THIS OBJECT AND THE NEXT HERE
   jmp CompareNext ;keep comparing
ComparePrevious:
   ldy PreviousObjectIndex, x ;have y point to the previous object
   bmi Done ;don't do anything if this is the first object
   lda ObjectY, x ;compare the coordinates
   cmp ObjectY, y
   bcs Done ;if CurrentY >= PreviousY, we're done
   ;SWAP THIS OBJECT AND THE PREVIOUS HERE
   jmp ComparePrevious ;keep comparing
Done:

This way you don't have to iterate over the objects more than once, and there's also no need to deal with pointers. The only downside I can see is that if two objects are close and to each other and moving in the same direction vertically at the similar speeds, they might be swapped and un-swapped after each one is moved every frame unnecessarily.

EDIT: It now occurred to me that swapping items in a doubly linked list involves changing a lot of indices... Say, if you have a list of 5 items [A B C D E] and you need to swap B and D, you have to change B's and D's previous and next (4 indices), A's next, C's previous, C's next and E's previous. That's 8 indices you have to change, so this part of the process would probably slow things down a bit.
Re: Can I optimize this sorting routine?
by on (#121483)
Thanks for all the advice about sorting algorithms! I'm going to stick with Insertion sort for now, because I already have it implemented. With regards to the equation I posted...
Shiru wrote:
Speaking of the implementation optimization, rather than choosing the most effective sorting algorithm, I'd try to get rid from frequent indirect indexed addressing by storing lists in fixed memory areas
I think I could replace the third indirect indexed load by using the undocumented LAX instruction and then TXA to restore A later in the inner loop (saving 3 cpu cycles per loop iteration). Given that Nintendo isn't going to fail my game for using an undocumented opcode, is it okay to use this instruction in my source?
Re: Can I optimize this sorting routine?
by on (#121484)
LAX immediate is reportedly unstable, but the other LAX opcodes should be fine.
Re: Can I optimize this sorting routine?
by on (#121488)
'LAX #immed' isn't even called that, because it doesn't Load A and X. The instruction in its place is sometimes called ATX instead. Regardless, don't use it.
Re: Can I optimize this sorting routine?
by on (#121631)
Using illegal opcodes makes Nintendulator's debug window go nuts (given that the LAX instruction is in the inner loop, it's called well over 100 times when sorting 28 randomized objects). I'd hate to foist that upon someone who wants to play my game; I'm going to pass on using it.

That said, learning about why the illegal opcodes also gave me a better understanding of the various addressing modes of the 6502, which was very helpful.
Re: Can I optimize this sorting routine?
by on (#121649)
pops wrote:
Using illegal opcodes makes Nintendulator's debug window go nuts (given that the LAX instruction is in the inner loop, it's called well over 100 times when sorting 28 randomized objects). I'd hate to foist that upon someone who wants to play my game; I'm going to pass on using it.

There's a setting to disable those messages (CPU -> Log Invalid Opcodes), but I understand your opinion. In my opinion, that setting should be off by default.
Re: Can I optimize this sorting routine?
by on (#121674)
Iterative merge sort is easy to write, and always takes the same amount of time. Maybe it's overkill.