Cyril Zhang via freebsd-hackers
2021-05-17 20:25:56 UTC
I am looking to change the default algorithm used in sort from
heapsort to mergesort. This would significantly improve the runtime of
sort, even after this patch: https://reviews.freebsd.org/D30170.
I couldn't find any rationale for why heapsort was chosen as the
default. The original commit from 2012, which introduced heapsort as
the default, says "Change default sort method to mergesort, which has
a better worst case performance than qsort", seemingly indicating that
mergesort was meant to be the default.
Here is some data. Sorting a 9999999-line file of integers between 1 and 100:
$ time sort --heapsort -n -o /dev/null rand.txt
22.99 real 22.69 user 0.31 sys
$ time sort --mergesort -n -o /dev/null rand.txt
10.76 real 10.47 user 0.29 sys
$ time gsort --parallel=1 -o /dev/null rand.txt
6.39 real 5.77 user 0.62 sys
$ time sort --heapsort -o /dev/null rand.txt
26.42 real 26.16 user 0.26 sys
$ time sort --mergesort -o /dev/null rand.txt
9.97 real 9.72 user 0.25 sys
$ time gsort --parallel=1 -o /dev/null rand.txt
8.25 real 7.58 user 0.66 sys
Same file, but pre-sorted numerically (a particular trait of libc's
mergesort implementation, which sort uses, is that it is linear on
sorted data):
$ time sort --heapsort -n -o /dev/null sorted.txt
15.97 real 15.64 user 0.33 sys
$ time sort --mergesort -n -o /dev/null sorted.txt
3.97 real 3.69 user 0.27 sys
$ time gsort --parallel=1 -o /dev/null sorted.txt
3.75 real 3.07 user 0.68 sys
A real-world example, sorting a file with 10403181 lines of text:
$ time sort --heapsort -o /dev/null cscope.1
38.52 real 37.86 user 0.66 sys
$ time sort --mergesort -o /dev/null cscope.1
18.57 real 17.82 user 0.75 sys
$ time gsort --parallel=1 -o /dev/null cscope.1
15.75 real 12.19 user 3.56 sys
The main consideration is memory usage. The mergesort implementation
in libc allocates extra memory, and in the sort program this amounts
to about n * sizeof(void *) extra memory where n is the number of
lines in the input. In practice, this doesn't seem to be much, and is
dwarfed by the memory allocations that occur in the pre-processing
stage of sort, even if each line contains only a single character.
Using Valgrind's massif tool, I found that mergesort accounts for only
about 7% of the program's total memory usage; equivalently, mergesort
adds about 7.5% extra memory on top of what the rest of the program
allocates.
Lastly, this changes the default sorting algorithm from a non-stable
sort to a stable one, which could affect existing programs that have
been incorrectly relying on the output order of equal keys even when
-s is not specified.
Here is the Phabricator review for the patch:
https://reviews.freebsd.org/D30319. Does anyone see any problems with
this change?