Re: git: 8dcf3a82c54c - main - libc: Implement bsort(3) a bitonic type of sorting algorithm.

From: Jessica Clarke <jrtc27_at_freebsd.org>
Date: Wed, 19 Apr 2023 20:17:57 UTC
On 19 Apr 2023, at 20:27, Hans Petter Selasky <hselasky@freebsd.org> wrote:
> 
> On 4/19/23 19:51, Jessica Clarke wrote:
>> On 19 Apr 2023, at 18:24, Hans Petter Selasky <hselasky@freebsd.org> wrote:
>>> 
>>> On 4/19/23 17:46, Brooks Davis wrote:
>>>> On Wed, Apr 19, 2023 at 12:06:26PM +0000, Hans Petter Selasky wrote:
>>>>> The branch main has been updated by hselasky:
>>>>> 
>>>>> URL: https://cgit.FreeBSD.org/src/commit/?id=8dcf3a82c54cb216df3213a013047907636a01da
>>>>> 
>>>>> commit 8dcf3a82c54cb216df3213a013047907636a01da
>>>>> Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
>>>>> AuthorDate: 2022-09-08 10:16:43 +0000
>>>>> Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
>>>>> CommitDate: 2023-04-19 12:04:22 +0000
>>>>> 
>>>>>     libc: Implement bsort(3) a bitonic type of sorting algorithm.
>>>>>          The bsort(3) algorithm works by swapping objects, similarly to qsort(3),
>>>>>     and does not require any significant amount of additional memory.
>>>>>          The bsort(3) algorithm doesn't suffer from the processing time issues
>>>>>     known the plague the qsort(3) family of algorithms, and is bounded by
>>>>>     a complexity of O(log2(N) * log2(N) * N), where N is the number of
>>>>>     elements in the sorting array. The additional complexity compared to
>>>>>     mergesort(3) is a fair tradeoff in situations where no memory may
>>>>>     be allocated.
>>>>>          The bsort(3) APIs are identical to those of qsort(3), allowing for
>>>>>     easy drop-in and testing.
>>>>>          The design of the bsort(3) algorithm allows for future parallell CPU
>>>>>     execution when sorting arrays. The current version of the bsort(3)
>>>>>     algorithm is single threaded. This is possible because fixed areas
>>>>>     of the sorting data is compared at a time, and can easily be divided
>>>>>     among different CPU's to sort large arrays faster.
>>>>>          Reviewed by:    gbe@, delphij@, pauamma_gundo.com (manpages)
>>>>>     Sponsored by:   NVIDIA Networking
>>>>>     Differential Revision:  https://reviews.freebsd.org/D36493
>>>> Why was this needed?  I'm not really a fan to adding new, non-standard
>>>> sorts without a clear use case.  I understand it has some specific
>>>> advantages vs qsort, but are we going to use it?  Doing so will make our
>>>> code less portable so we almost certainly can't s/qsort/bsort/.
>>> 
>>> Hi Brooks,
>>> 
>>> My long term plan is to get bsort() to replace qsort(), but because the two algorithms have different characteristics, then some people may complain it is good for me, but not for you. I want there to be an option besides from qsort(), which does not use malloc() as an integral part of sorting. And is faster than O(N*N) sorting (still the worst case for qsort in FreeBSD).
>> Why not do an adaptive qsort instead like other standard libraries?
>> Also, nothing actually says qsort doesn’t allocate memory; in fact,
>> glibc’s own documentation states that one should not assume it is
>> in-place and doesn’t allocate.
> 
> Hi Jessica,
> 
>>> qsort() is frequently used to do all kinds of sorting, and some pointed out that qsort() can technically be any kind of sorting algorithm, but looking around it is not.
>> Because there are variants that are guaranteed n log n? Which is better
>> than your n log^2 n.
> 
> Only so-called mergesort() algorithms can do sorting in O(N log N) time, from what I know. And it needs to allocate working memory, when the sorting arrays get large. From past experience, malloc() is a problem in the fast path. If only mergesort() could pre-allocate that memory, or have a pointer to pass working memory. What is the point of allocating and freeing memory over and over again, in certain applications, doing frequent sorting. The API is broken.

pdqsort is n log n time, in-place and doesn’t allocate, and is used,
for example, for Rust’s standard sort_unstable.

>>> When I build applications of my own, I always use mergesort(), which depend on malloc(). Having a dependency on a certain memory allocator to get performance is not good.
>>> 
>>> I want to distinguish from qsort() by calling it bsort(). If people use bsort() they know what they get cross-platform.
>> No they don’t, bsort can mean multiple things.
> 
> In the C-code standard domain, can you give examples of different meanings of bsort() which are established? I'm not aware of any such existing usage.

The b could be bitonic, bubble, binary insertion or something else?
Searching the literature turns up
https://dl.acm.org/doi/pdf/10.1145/3341.3348 as an algorithm described
in the April 1985 Communications of the ACM.

>> If you want a specific
>> algorithm, put it in your program, but please don’t, just use a
>> standardised function like qsort.
> 
> Sorting is a quite generic thing.
> 
> Likewise with <sys/queue.h> we support four different ways to make lists. And then after epoch(9) was introduced came another three ways, basically replicas of <sys/queue.h> with different properties (See: sys/contrib/ck/include/ck_queue.h)
> 
> Personally I think the same about mergesort(), qsort() and bsort(). They are so different and unique ways to sort that we should have all of them in libc. They each serve a purpose.

Except they’re not, because all of what bsort provides can be provided
by qsort. All people care about are being able to quickly sort their
data and whether it’s a stable sort. Occasionally people care about
allocating, but not often. We have multiple list implementations in
sys/queue.h because they support different sets of operations. Arguably
SLIST and LIST should be subsumed by STAILQ and TAILQ respectively,
though.

Meanwhile the CK variants exist because they’re lockless data
structures, which the sys/queue.h ones are not (and shouldn’t be).

>>> If people use qsort() the output is random basically. It helps very little my application works on FreeBSD, but not Linux, for example.
>> No, the output is sorted, which is far from random. And if you need a
>> stable sort you should ask for one.
> 
> Let me rephrase: Random with regards to execution time.
> 
> mergesort() : not for realtime applications (syscalls can sleep when getting working memory)
> qsort() : usually fast, but not always
> bsort() : Worst case is "N * log2 N" compared to "N * N" (qsort), and no syscalls are involved.

Modified qsort() : Wort case is “N * log N”, and no syscalls are involved.

Please just adapt qsort to not have degenerate behaviour; it will
improve our implementation of a standard C function and render your
bsort() entirely obsolete.

>>> In FreeBSD qsort() is typically used for sorting files up to 16K entries per directory. Even if qsort() can explode in time, probably it's not that important. But when using qsort() for sorting millions of mathematical expressions, for example, doing number analysis, this is unacceptable.
>>> 
>>> I think "C.A.R. Hoare's quicksort" technique is designed for single CPU systemsf only. Even if the best-case average is "N*log2(N)", that amount of processing cannot be split by multiple CPUs. The algorithm is serial as such.
>>> 
>>> The bsort() algorithm is much more NCPU friendly, because it can split the work into fixed size smaller and independent work loads. Otherwise the work load doubles every time you merge two sorted lists. And when the number of lists to merge is fewer than the number of CPUs available, your system runs out of guts basically.
>> I highly doubt you want a libc sort routine to start spawning threads.
> 
> Right. I'm thinking more like Grand Dispatch Central, though not that common in FreeBSD, we still have a wiki page:
> 
> https://wiki.freebsd.org/GrandCentralDispatch
> 
> bsort() itself doesn't have to create threads to be SMP'ed, but can provide the basic functions for SMP sorting.
> 
> When sorting records, the clever way to do it, is to make a list of pointers to those records, and sort based on that, swapping pointers first, then records in the end.
> 
> Technically you could have 4 CPUs go in a tight spin on different parts of the bsort() algorithm, and by use of existing atomic compare and set pointer operations, you could make a fully deterministic sorting algorithm, even if the different CPUs clobber on the same memory locations.

That sounds like a great way to get a big game of cache line ping pong.

> This is something I want to explore later.
> 
> Neither mergesort() nor qsort() can do that.

You can still parallelise them, there’s just a serial step to merge or
partition respectively.

>> That also seems extremely contradictory to your claim about wanting one
>> that doesn’t allocate memory, but creating threads is going to do much
>> more than just that, and would scream POLA violation to me.
> 
> See my explanation above. TAILQ_INSERT_TAIL() can also be used in multithreading environments, without having to link to pthreads.

Only under a lock. At which point multithreading isn't relevant.

Please revert this all. FreeBSD’s libc isn’t a place for your pet
projects. New public interfaces in something as core as libc should
have good justification, and I still do not see that here.

>>>> I also note that the swap code is pointlessly slow for size > 256 and
>>>> should almost certainly use aliasing with matching words like memcpy
>>>> implementations do.  Doing so would make it easier to port this code to
>>>> CHERI where that is required.
>>> 
>>> I totally agree about the swap code being pointless. And I tried to look where is memswap(), but it was not there. This kind of swapping is done many places, and we could benefit from having a compiler supported memswap() function. What do you think?
>> We agree: https://github.com/CTSRD-CHERI/cheribsd/issues/207. But the
>> world doesn’t have that in C.
> 
> Yes, that's a good idea! I'll try to follow that lead.
> 
>> We don’t need new non-standard sorting routines in libc.
> 
> We don't need English tea either ;-)

I don’t know, nor do I care, what tea has to do with sorting routines.

Jess