We were evaluating a vendor product for viewing a certain kind of data file (mostly 80-byte records of various types, but with variable-length TIFF images embedded inside; all the text is EBCDIC; yeah, it was designed by a committee too). These files are typically large (200MB to 1GB).
This vendor app was written in dot-net by a total fucking idiot. His code took 45 minutes to read an 800MB file over a network drive mount (client and server on the same network segment). So I debugged it using tcpdump/wireshark, procexp and filemon. His code would issue a 4K read request to the kernel for EACH 1-, 2-, or 4-byte field within each record. Overlapping reads. Literally:
Code:
lseek(fd, 0, SEEK_SET);
read(fd, buf, 4096);
lseek(fd, 2, SEEK_SET);
read(fd, buf, 4096);
...
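For contrast, the sane pattern is to pull each fixed-length record (or a big chunk of them) into memory with one call and decode the fields from the buffer. This is my own sketch, not their eventual fix; the file name and the 2-byte type field at offset 0 are made up for illustration:
Code:
// Read whole 80-byte records in one call each and pick the fields out of
// the in-memory buffer, instead of one seek + 4K read per 1/2/4-byte field.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr std::size_t kRecordLen = 80;   // fixed-length records per the file spec

bool read_record(std::FILE* f, unsigned char (&rec)[kRecordLen]) {
    return std::fread(rec, 1, kRecordLen, f) == kRecordLen;
}

int main() {
    std::FILE* f = std::fopen("data.bin", "rb");   // hypothetical file name
    if (!f) return 1;
    unsigned char rec[kRecordLen];
    while (read_record(f, rec)) {
        // Fields are just offsets into the buffer; no extra kernel round trips.
        std::uint16_t type;
        std::memcpy(&type, rec + 0, sizeof type);  // illustrative 2-byte field at offset 0
        // ... decode the remaining fields, convert EBCDIC text, etc.
    }
    std::fclose(f);
    return 0;
}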
I filed a bug report. He didn't understand the problem. I sent in my data. He finally got it. They fixed the bug, but wanted to up their price by $2K for the fix. I said fuck this (if you haven't gathered, I was pissed at their sloppiness).
Over one weekend I wrote a Win32 GUI app in C++ using the pure Win32 API (no MFC, no dot-crap, no WinForms; GDI+ to read the TIFFs from a memory stream). I mem-map the data file with a 256MB window that slides 64MB at a time. My app can process an 800MB file from local disk in 7 seconds (when it's hot in the cache) and a 1GB file over the network at near wire speed (12 seconds or so).
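The sliding window is nothing exotic. Roughly, it looks like the sketch below; this is a from-memory reconstruction of the approach, not the shipped code, and the class and member names are invented:
Code:
// Sliding memory-mapped window: a 256MB read-only view that advances in
// 64MB steps, so the whole multi-GB file never has to be resident at once.
#include <windows.h>
#include <cstdint>

class SlidingMap {
public:
    static constexpr std::uint64_t kView = 256ull << 20;  // 256MB window
    static constexpr std::uint64_t kStep =  64ull << 20;  // slide 64MB at a time

    bool open(const wchar_t* path) {
        file_ = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                            OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
        if (file_ == INVALID_HANDLE_VALUE) return false;
        LARGE_INTEGER sz;
        if (!GetFileSizeEx(file_, &sz)) return false;
        size_ = static_cast<std::uint64_t>(sz.QuadPart);
        map_ = CreateFileMappingW(file_, nullptr, PAGE_READONLY, 0, 0, nullptr);
        return map_ != nullptr && remap(0);
    }

    // Return a pointer to file offset 'off', sliding the view if needed.
    // (The real app keeps record boundaries inside the window; omitted here.)
    const std::uint8_t* at(std::uint64_t off) {
        if (off < base_ || off >= base_ + viewLen_) {
            if (!remap((off / kStep) * kStep)) return nullptr;
        }
        return view_ + (off - base_);
    }

    ~SlidingMap() {
        if (view_) UnmapViewOfFile(view_);
        if (map_) CloseHandle(map_);
        if (file_ != INVALID_HANDLE_VALUE) CloseHandle(file_);
    }

private:
    bool remap(std::uint64_t base) {
        if (view_) UnmapViewOfFile(view_);
        std::uint64_t len = (base + kView <= size_) ? kView : size_ - base;
        view_ = static_cast<const std::uint8_t*>(
            MapViewOfFile(map_, FILE_MAP_READ,
                          static_cast<DWORD>(base >> 32),
                          static_cast<DWORD>(base & 0xFFFFFFFF),
                          static_cast<SIZE_T>(len)));
        base_ = base;
        viewLen_ = len;
        return view_ != nullptr;
    }

    HANDLE file_ = INVALID_HANDLE_VALUE;
    HANDLE map_ = nullptr;
    const std::uint8_t* view_ = nullptr;
    std::uint64_t base_ = 0, viewLen_ = 0, size_ = 0;
};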
His app was a memory hog. Mine uses a fair bit of memory, but nothing terrible. I designed a hybrid data structure (part tree, part linked list) to hold the file's hierarchy in memory as I parse it. The user can then search for records and the app displays the image and the other record data.
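I won't post the real structure, but the idea is close to the classic first-child/next-sibling layout: tree links give you the hierarchy, sibling links give you cheap in-file-order traversal. A rough sketch (the field names are illustrative only, not my actual layout):
Code:
// Each node sits in a tree (the file's hierarchy) and in a linked list of
// its siblings, so records can be walked in file order without extra copies.
#include <cstdint>

struct Node {
    std::uint64_t offset;      // byte offset of the record in the mapped file
    std::uint32_t type;        // record type code
    Node*         firstChild;  // tree edge: first record nested under this one
    Node*         nextSibling; // list edge: next record at the same level
};

// Depth-first walk in file order, e.g. for searching records.
template <typename Fn>
void visit(const Node* n, Fn fn) {
    for (; n; n = n->nextSibling) {
        fn(*n);
        visit(n->firstChild, fn);
    }
}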
I wanted to open-source the app and dump it on the Internet. As much as my bosses thought that would be funny, they did not permit me to.
My point is (other than a little bragging):
Data structures and IO patterns _do_ matter. Even incredibly powerful modern computers and networks can be brought to their knees by bad coding practices.