Cyril Zhang via freebsd-hackers
2021-05-17 20:25:56 UTC
I am looking to change the default algorithm used in sort from
heapsort to mergesort. This would significantly improve the runtime of
sort, even after this patch: https://reviews.freebsd.org/D30170.
I couldn't find any rationale for why heapsort was chosen as the
default. The original commit from 2012, which introduced heapsort as
the default, says "Change default sort method to mergesort, which has
a better worst case performance than qsort", seemingly indicating that
mergesort was meant to be the default.
Here is some data. Sorting a 9999999-line file of integers between 1 and 100:
$ time sort --heapsort -n -o /dev/null rand.txt
22.99 real 22.69 user 0.31 sys
$ time sort --mergesort -n -o /dev/null rand.txt
10.76 real 10.47 user 0.29 sys
$ time gsort --parallel=1 -o /dev/null rand.txt
6.39 real 5.77 user 0.62 sys
$ time sort --heapsort -o /dev/null rand.txt
26.42 real 26.16 user 0.26 sys
$ time sort --mergesort -o /dev/null rand.txt
9.97 real 9.72 user 0.25 sys
$ time gsort --parallel=1 -o /dev/null rand.txt
8.25 real 7.58 user 0.66 sys
Same file, but pre-sorted numerically (a particular trait of libc's
mergesort implementation, which sort uses, is that it is linear on
sorted data):
$ time sort --heapsort -n -o /dev/null sorted.txt
15.97 real 15.64 user 0.33 sys
$ time sort --mergesort -n -o /dev/null sorted.txt
3.97 real 3.69 user 0.27 sys
$ time gsort --parallel=1 -o /dev/null sorted.txt
3.75 real 3.07 user 0.68 sys
A real-world example, sorting a file with 10403181 lines of text:
$ time sort --heapsort -o /dev/null cscope.1
38.52 real 37.86 user 0.66 sys
$ time sort --mergesort -o /dev/null cscope.1
18.57 real 17.82 user 0.75 sys
$ time gsort --parallel=1 -o /dev/null cscope.1
15.75 real 12.19 user 3.56 sys
The main consideration is memory usage. The mergesort implementation
in libc allocates extra memory, and in the sort program this amounts
to about n * sizeof(void *) extra memory where n is the number of
lines in the input. In practice, this doesn't seem to be much, and is
dwarfed by the memory allocations that occur in the pre-processing
stage of sort, even if each line contains only a single character.
Using Valgrind's massif tool, I found that mergesort accounts for only
about 7% of the program's total memory usage; equivalently, mergesort
adds about 7.5% extra memory on top of what the rest of the program
allocates.
Lastly, this changes the default sorting algorithm from a non-stable
sort to a stable one, which could affect existing programs that have
been incorrectly relying on the output order of equal keys even when
-s is not specified.
Here is the Phabricator review for the patch:
https://reviews.freebsd.org/D30319. Does anyone see any problems with
this change?