Add HOWTO.md (#10)
Add HOWTO and update READMEs
meooow25 authored May 5, 2024
1 parent 0af3d12 commit 72100e8
Showing 4 changed files with 332 additions and 34 deletions.
299 changes: 299 additions & 0 deletions HOWTO.md
@@ -0,0 +1,299 @@
# Using this library

This is a short guide to demonstrate how this library may be used. Familiarity
with some form of arrays and mutable arrays in Haskell is expected. If your
array type is not present here, it should still be possible to adapt some of the
code below to your use case.

## Sorting boxed elements

This library offers the function

```hs
sortArrayBy
  :: (a -> a -> Ordering)  -- ^ comparison
  -> MutableArray# s a
  -> Int                   -- ^ offset
  -> Int                   -- ^ length
  -> ST s ()
```

So how does one use this?

* The first parameter is a comparison function that will be used to order the
elements.
* The second parameter is a [`MutableArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableArray-35-)
with elements of type `a`. This is a primitive array type provided by GHC.
This array will be sorted in place.
* The third and fourth parameters are `Int`s which demarcate a slice of the
array. Elements in this slice will be sorted, and other elements will not be
touched.
* The return type is an `ST` action. If you are not familiar with `ST`, please
see the documentation for [`Control.Monad.ST`](https://hackage.haskell.org/package/base-4.19.0.0/docs/Control-Monad-ST.html).

Clearly, to use `sortArrayBy`, an important step is to put the elements to be
sorted into a `MutableArray#`. The most convenient way to do this depends on how
the elements are stored prior to sorting.

### Example 1: [`MVector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Mutable.html#t:MVector)

Consider that we need to sort a mutable vector `MVector` from the `vector`
library. This is quite easy, and in fact we do not need to put elements anywhere
because the underlying representation of an `MVector` is a `MutableArray#`! We
only need to get it out of the `MVector`.

```hs
import Control.Monad.Primitive (PrimMonad(..), stToPrim) -- from the package "primitive"
import Data.Primitive.Array (MutableArray(..)) -- also from "primitive"
import Data.Vector.Mutable (MVector(..))

import qualified Data.SamSort as Sam

-- | Sort a mutable vector in place.
sortMV :: (PrimMonad m, Ord a) => MVector (PrimState m) a -> m ()
sortMV = sortMVBy compare

-- | Sort a mutable vector in place using a comparison function.
sortMVBy :: PrimMonad m => (a -> a -> Ordering) -> MVector (PrimState m) a -> m ()
sortMVBy cmp (MVector off len (MutableArray ma)) =
  stToPrim $ Sam.sortArrayBy cmp ma off len
```
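
Because an `MVector` carries its own offset and length, sorting only part of a vector needs no extra work: slicing the vector first is enough. The following is a minimal sketch of that idea, assuming the `vector` and `primitive` packages; `sortSliceVBy` is a name made up for this illustration and is not part of any library.

```hs
import Control.Monad.Primitive (PrimMonad(..), stToPrim)
import Data.Primitive.Array (MutableArray(..))
import qualified Data.Vector as V
import Data.Vector.Mutable (MVector(..))
import qualified Data.Vector.Mutable as MV

import qualified Data.SamSort as Sam

-- | Same as sortMVBy above, repeated so that this block stands alone.
sortMVBy :: PrimMonad m => (a -> a -> Ordering) -> MVector (PrimState m) a -> m ()
sortMVBy cmp (MVector off len (MutableArray ma)) =
  stToPrim $ Sam.sortArrayBy cmp ma off len

-- | Return a copy of the vector in which only the @len@ elements starting at
-- index @off@ are sorted. Hypothetical helper for illustration. 'MV.slice'
-- raises an error if the requested slice is out of bounds.
sortSliceVBy :: (a -> a -> Ordering) -> Int -> Int -> V.Vector a -> V.Vector a
sortSliceVBy cmp off len = V.modify (\mv -> sortMVBy cmp (MV.slice off len mv))
```

For instance, `sortSliceVBy compare 1 3 (V.fromList [9,3,2,1,0])` sorts the three middle elements and leaves the first and last in place, giving `[9,1,2,3,0]`.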

### Example 2: [`Vector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector.html#t:Vector)

Now consider sorting an (immutable) `Vector`, again from the `vector` library.
Since we cannot mutate it, we will return a sorted copy. The most convenient way
here is to thaw to an `MVector` and sort it as we did above.

```hs
import Data.Vector (Vector)
import qualified Data.Vector as V

-- | Sort a vector.
sortV :: Ord a => Vector a -> Vector a
sortV = sortVBy compare

-- | Sort a vector using a comparison function.
sortVBy :: (a -> a -> Ordering) -> Vector a -> Vector a
sortVBy cmp v = V.create $ do
  mv <- V.thaw v
  sortMVBy cmp mv  -- from Example 1 above
  pure mv
```

We can test it out in GHCi.

```hs
>>> sortV (V.fromList [5,2,6,3,4,1])
[1,2,3,4,5,6]
>>> import Data.Ord (comparing)
>>> sortVBy (comparing length) (V.fromList ["Lunar","11.3","Candle","Magic"])
["11.3","Lunar","Magic","Candle"]
```

### Example 3: List

Let us now try to sort a list, like [`Data.List.sort`](https://hackage.haskell.org/package/base-4.19.0.0/docs/Data-List.html#v:sort)
does. We will need to move the elements from the list into a `MutableArray#`.

I recommend using the [`primitive`](https://hackage.haskell.org/package/primitive-0.9.0.0/docs/Data-Primitive-Array.html)
library for this task. `primitive` provides boxed wrappers over GHC primitive
types and functions to work with them. While it is possible to do this without
any library, it is easiest to use what is already available. If you are unable
to use `primitive`, you can take a peek at the relevant definitions there and
use them directly.

```hs
import Control.Monad.Primitive (stToPrim)
import qualified Data.Foldable as F
import Data.Primitive.Array (MutableArray(..))
import qualified Data.Primitive.Array as A

import qualified Data.SamSort as Sam

-- | Sort a list.
sortL :: Ord a => [a] -> [a]
sortL = sortLBy compare

-- | Sort a list using a comparison function.
sortLBy :: (a -> a -> Ordering) -> [a] -> [a]
sortLBy cmp xs = F.toList $ A.runArray $ do
  let a = A.arrayFromList xs
      n = A.sizeofArray a
  ma@(MutableArray ma') <- A.thawArray a 0 n
  stToPrim $ Sam.sortArrayBy cmp ma' 0 n
  pure ma
```

In GHCi,
```hs
>>> sortL ["Fall","In","The","Dark"]
["Dark","Fall","In","The"]
>>> import Data.Ord (Down(..), comparing)
>>> sortLBy (comparing Down) [3.4,8.5,9.1,7.9,3.1,6.2]
[9.1,8.5,7.9,6.2,3.4,3.1]
```

> [!TIP]
>
> Avoid `Data.List`'s `sort` and `sortBy` when a large number of elements need
> to be fully sorted and performance is a concern. Sorting lists is quite
> inefficient. Put the elements in a mutable array and use this (or some other)
> sorting library instead.

## Sorting `Int`s

Converting to a `MutableArray#` and sorting, as explained in the above section,
should cover the majority of use cases. However, sometimes it is not the
best option. For instance, we may be storing `Int`s in an unboxed array for
efficiency. Having to pull them out and box them for sorting does not sound
good.

The second function provided by this library is

```hs
sortIntArrayBy
  :: (Int -> Int -> Ordering)  -- ^ comparison
  -> MutableByteArray# s
  -> Int                       -- ^ offset in Int#s
  -> Int                       -- ^ length in Int#s
  -> ST s ()
```

As you might have guessed, this sorts an unboxed array of `Int`s. We can use
this whenever we need to sort `Int`s, or even other types that may be cheaply
converted to and from `Int`s (like `Word`).
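
As a rough illustration of both points, the sketch below sorts a list of `Int`s by copying it into a `MutableByteArray` from the `primitive` package, and then sorts `Word`s by converting them to `Int`s and comparing them as `Word`s. The helper names `sortIntsLBy` and `sortWordsL` are hypothetical, and going through a list is only for demonstration.

```hs
import Control.Monad.ST (runST)
import Data.Ord (comparing)
import Data.Primitive.ByteArray (MutableByteArray(..))
import qualified Data.Primitive.ByteArray as BA
import Data.Primitive.Types (sizeOf)

import qualified Data.SamSort as Sam

-- | Sort a list of Ints through an unboxed array. Hypothetical helper.
sortIntsLBy :: (Int -> Int -> Ordering) -> [Int] -> [Int]
sortIntsLBy cmp xs = runST $ do
  let n = length xs
  -- newByteArray takes a size in bytes, so scale by the size of an Int.
  mba@(MutableByteArray mba') <- BA.newByteArray (n * sizeOf (0 :: Int))
  mapM_ (\(i, x) -> BA.writeByteArray mba i x) (zip [0 ..] xs)
  -- Offset and length are counted in Ints, not bytes.
  Sam.sortIntArrayBy cmp mba' 0 n
  ba <- BA.unsafeFreezeByteArray mba
  pure [BA.indexByteArray ba i | i <- [0 .. n - 1]]

-- | Sort Words by storing them as Ints and comparing them as Words.
-- Int and Word have the same size, so fromIntegral round-trips exactly.
sortWordsL :: [Word] -> [Word]
sortWordsL = map fromIntegral . sortIntsLBy (comparing toWord) . map fromIntegral
  where
    toWord i = fromIntegral i :: Word
```

Note that the comparison function must convert back to `Word` before comparing. Comparing the stored `Int`s directly would order large `Word`s, which wrap around to negative `Int`s, before small ones.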

### Example 1: [Unboxed `MVector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Unboxed-Mutable.html#t:MVector)

Let us sort a mutable unboxed `MVector Int`. Like with the boxed `MVector`,
we do not need to move the elements because the underlying representation is a
`MutableByteArray#`.

```hs
import Control.Monad.Primitive (PrimMonad(..), stToPrim)
import Data.Primitive.ByteArray (MutableByteArray(..))
import qualified Data.Vector.Primitive as VP
import qualified Data.Vector.Unboxed.Mutable as VUM
import qualified Data.Vector.Unboxed.Base as VUB

import qualified Data.SamSort as Sam

sortVUMInt :: PrimMonad m => VUM.MVector (PrimState m) Int -> m ()
sortVUMInt = sortVUMIntBy compare

sortVUMIntBy
  :: PrimMonad m
  => (Int -> Int -> Ordering) -> VUM.MVector (PrimState m) Int -> m ()
sortVUMIntBy cmp mv = case mv of
  VUB.MV_Int (VP.MVector off len (MutableByteArray ma')) ->
    stToPrim $ Sam.sortIntArrayBy cmp ma' off len
```
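
An immutable unboxed `Vector Int` can be handled much like the boxed `Vector` earlier: sort a mutable copy and freeze it. The sketch below does this with `modify` from `Data.Vector.Unboxed`. It repeats `sortVUMIntBy` so that the block stands alone, and `sortVUIntBy` is a hypothetical name chosen for this illustration.

```hs
import Control.Monad.Primitive (PrimMonad(..), stToPrim)
import Data.Primitive.ByteArray (MutableByteArray(..))
import qualified Data.Vector.Primitive as VP
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM
import qualified Data.Vector.Unboxed.Base as VUB

import qualified Data.SamSort as Sam

-- | Same as sortVUMIntBy above.
sortVUMIntBy
  :: PrimMonad m
  => (Int -> Int -> Ordering) -> VUM.MVector (PrimState m) Int -> m ()
sortVUMIntBy cmp mv = case mv of
  VUB.MV_Int (VP.MVector off len (MutableByteArray ma')) ->
    stToPrim $ Sam.sortIntArrayBy cmp ma' off len

-- | Return a sorted copy of an unboxed vector of Ints.
-- 'VU.modify' runs the in-place sort on a copy of the input vector.
sortVUIntBy :: (Int -> Int -> Ordering) -> VU.Vector Int -> VU.Vector Int
sortVUIntBy cmp = VU.modify (\mv -> sortVUMIntBy cmp mv)
```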

> [!WARNING]
>
> Do not try changing the element type above to sort any other unboxed vector.
> `MutableByteArray#` is the underlying representation for many unboxed vectors,
> but it would be incorrect to use the above code if the element type is not
> `Int`.

## Sorting by index

We have now covered sorting boxed values, and sorting `Int`s. What about other
types in unboxed arrays?

### Example 1: [Unboxed `Vector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Unboxed.html#t:Vector)

Consider that we need to sort an unboxed vector of some type `a`. The `vector`
library is designed in a way that the underlying representation of an unboxed
vector can be anything depending on the type `a`. We cannot assume anything
about it.

We know that we can index such a vector efficiently. We also know that we can
construct such vectors from an `Int -> a` using the handy `generate` function. We
will use these facts to sort such a vector.

First we will create an `Int` vector, the elements of which will be indices into
the `a` vector. Then we will sort this `Int` vector using a comparison function
that indexes the `a` vector and compares `a`s. Finally, we will construct a
vector with `a`s in the order of the sorted indices.

This technique is general enough that we can sort any flavor of `Vector`
(boxed, `Unboxed`, `Prim`, `Storable`), so let us use `Vector.Generic` to
define the functions.

```hs
import Control.Monad.Primitive (stToPrim)
import Data.Primitive.ByteArray (MutableByteArray(..))
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Primitive as VP
import qualified Data.Vector.Primitive.Mutable as VPM

import qualified Data.SamSort as Sam

-- | Sort a vector.
sortByIdxVG :: (Ord a, VG.Vector v a) => v a -> v a
sortByIdxVG = sortByIdxVGBy compare

-- | Sort a vector using a comparison function.
sortByIdxVGBy :: VG.Vector v a => (a -> a -> Ordering) -> v a -> v a
sortByIdxVGBy cmp v = VG.generate n (VG.unsafeIndex v . VP.unsafeIndex ixa)
  where
    n = VG.length v
    cmp' i j = cmp (VG.unsafeIndex v i) (VG.unsafeIndex v j)
    ixa = VP.create $ do
      ixma <- VPM.generate n id
      case ixma of
        VPM.MVector off len (MutableByteArray ma') ->
          stToPrim $ Sam.sortIntArrayBy cmp' ma' off len
      pure ixma
```

In fact, this technique is general enough to be used whenever indices can
be sorted, using any method, and not just with this library!
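
To illustrate that the technique does not depend on this library, here is a small sketch that sorts the indices with `Data.List.sortBy` instead. It is meant only to show the shape of the idea, since sorting a list of indices is not efficient. The names `sortByIdxListBy` and `example` are made up for this illustration.

```hs
import Data.List (sortBy)
import Data.Ord (comparing)
import qualified Data.Vector as V
import qualified Data.Vector.Generic as VG

-- | Sort a vector by sorting a list of its indices with Data.List.sortBy,
-- then reading the elements back in index order. Hypothetical helper.
sortByIdxListBy :: VG.Vector v a => (a -> a -> Ordering) -> v a -> v a
sortByIdxListBy cmp v = VG.fromListN n (map (VG.unsafeIndex v) sortedIxs)
  where
    n = VG.length v
    -- Compare two indices by comparing the elements they point to.
    cmp' i j = cmp (VG.unsafeIndex v i) (VG.unsafeIndex v j)
    sortedIxs = sortBy cmp' [0 .. n - 1]

-- | A small usage example on a boxed vector of pairs.
example :: V.Vector (Int, Int)
example = sortByIdxListBy (comparing snd) (V.fromList [(6, 4), (5, 4), (1, 2)])
```

Since `Data.List.sortBy` is stable, `example` evaluates to `[(1,2),(6,4),(5,4)]`, with the two pairs that share a second component kept in their original order.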

Sorting by index is more beneficial the larger the elements are in memory,
since moving around index `Int`s is cheaper than moving around the elements
themselves.

We can see that the sort works as expected in GHCi.

```hs
>>> import Data.Ord (comparing)
>>> import qualified Data.Vector.Unboxed as VU
>>> let v = VU.fromList [(6,4),(5,4),(1,2)] :: VU.Vector (Int,Int)
>>> sortByIdxVGBy (comparing snd) v
[(1,2),(6,4),(5,4)]
```

And we can see [in benchmarks](https://github.com/meooow25/samsort/tree/master/compare#4-sort-105-int-int-ints-unboxed)
that sorting by index is indeed more efficient than sorting directly, for
elements of type `(Int, Int, Int)` and sort implementations which support both.

## Sorting unboxed arrays of small elements

So sorting by index is more beneficial the larger the element is, but what
about small elements? Perhaps we need to sort an unboxed array of `Word8`s, or
`Float`s?

Our options as seen above are

* Convert to a boxed array and sort. Lots of avoidable allocations and slow
comparisons.
* Sort by index. Better, but has avoidable allocations in the form of the
index array.

Neither is ideal. The most efficient way to sort small elements is to sort the
array of such elements directly. Unfortunately, this library cannot be used to
do this because there are only two functions, one to sort boxed values, and one
to sort `Int`s. If we must use this library, sorting by index is the method of
choice. It is not ideal, but it will not be slow either.

The [`primitive-sort`](https://hackage.haskell.org/package/primitive-sort)
library may also be a good choice for this task. It can sort such small elements
efficiently, though it has some drawbacks (not adaptive, cannot sort a slice,
cannot sort using a comparison function, more dependencies).

[`vector-algorithms`](https://hackage.haskell.org/package/vector-algorithms)
is also able to sort small elements, however it turns out to be
[slower in practice](https://github.com/meooow25/samsort/tree/master/compare#5-sort-105-word8s-unboxed).
46 changes: 22 additions & 24 deletions README.md
@@ -6,20 +6,19 @@ A stable adaptive mergesort implementation

## Features

-This is a lightweight library offering a high performance primitive sort
-function. The function sorts a GHC
-[`MutableArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableArray-35-)
-in place.
+This is a lightweight library offering two high performance sort functions:
+
+* `sortArrayBy` sorts a GHC [`MutableArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableArray-35-)
+of boxed elements in place.
+* `sortIntArrayBy` sorts a GHC [`MutableByteArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableByteArray-35-)
+of `Int#`s in place.

There are no dependencies outside of `base`. This means that this library is
not tied to array abstractions from any particular library. This also means
-that you may need to write a wrapper function that sorts your flavor of Haskell
-array, such as ones from
-[`primitive`](https://hackage.haskell.org/package/primitive-0.9.0.0/docs/Data-Primitive-Array.html#t:MutableArray),
-[`vector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Mutable.html#t:MVector),
-[`array`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Arr.html#t:STArray),
-or elsewhere. You can find an example with `primitive`
-[here](https://github.com/meooow25/samsort/blob/82b7b9c84919a6d44484df9375a63d26c0520716/compare/Main.hs#L61-L64).
+that you may need to write a few lines of code to get a `MutableArray#` or
+`MutableByteArray#` from your data, which can then be sorted. See
+[`HOWTO.md`](https://github.com/meooow25/samsort/blob/master/HOWTO.md)
+for a guide.

If you need to use this library in an environment where you cannot depend on
other packages, you may simply copy the lone source file
@@ -34,21 +33,20 @@ to your project.
* The sort is adaptive, i.e. the sort identifies and uses ascending and
descending runs of elements occurring in the input to perform less work. As a
result, the sort is $O(n)$ for already sorted inputs.
-* The performance is comparable to and in many cases better than the comparison
-sorts from the [vector-algorithms](https://hackage.haskell.org/package/vector-algorithms)
-library. See [the benchmarks](https://github.com/meooow25/samsort/tree/master/compare)
+* The sort is the fastest among implementations from other libraries in most
+scenarios. [See the benchmarks](https://github.com/meooow25/samsort/tree/master/compare)
for details.

-## FAQ
-
-#### Why not use \<insert strategy\>?
+## Known issues

-I'm open to changing the implementation if an alternative is demonstrated to
-perform better, as long as the sort remains stable and adaptive.
+Ideally, this library would offer only an algorithm, capable of sorting arrays
+of any flavor. To support different arrays we would need to rely on some
+abstraction, either from another library (like `vector`), or created here. We
+cannot do either of those while also keeping the library as lightweight as it
+is now.

-#### How do I sort an unboxed array with this library?
+## Contributing

-You can't. To sort different types of arrays, I would have to rely on an
-existing library's abstractions (like `vector-algorithms` relies on `vector`),
-or roll my own. This goes against the goal of keeping the library lightweight. I
-do not have a solution to this problem at the moment.
+Questions, bug reports, documentation improvements, code contributions welcome!
+Please [open an issue](https://github.com/meooow25/samsort/issues) as the first
+step. Slow performance counts as a bug!
18 changes: 9 additions & 9 deletions compare/README.md
@@ -6,15 +6,15 @@ is at [`result/result.csv`](result/result.csv).

## Comparing with other libraries

-| Label | Library | Function |
-| --- | --- | --- |
-| `ssArray` | `samsort` | `Data.SamSort.sortArrayBy` |
-| `ssIntArray` | `samsort` | `Data.SamSort.sortIntArrayBy` |
-| `vaHeap` | `vector-algorithms` | `Data.Vector.Algorithms.Heap.sortBy` |
-| `vaIntro` | `vector-algorithms` | `Data.Vector.Algorithms.Intro.sortBy` |
-| `vaMerge` | `vector-algorithms` | `Data.Vector.Algorithms.Merge.sortBy` |
-| `vaTip` | `vector-algorithms` | `Data.Vector.Algorithms.Tim.sortBy` |
-| `ps` | `primitive-sort` | `Data.Primitive.Sort.sortMutableBy` |
+| Library | Function | Stable | Adaptive | Label |
+| --- | --- | --- | --- | --- |
+| `samsort` | `Data.SamSort.sortArrayBy` | Yes | Yes | `ssArray` |
+| `samsort` | `Data.SamSort.sortIntArrayBy` | Yes | Yes | `ssIntArray` |
+| `vector-algorithms` | `Data.Vector.Algorithms.Heap.sortBy` | No | No | `vaHeap` |
+| `vector-algorithms` | `Data.Vector.Algorithms.Intro.sortBy` | No | No | `vaIntro` |
+| `vector-algorithms` | `Data.Vector.Algorithms.Merge.sortBy` | Yes | No | `vaMerge` |
+| `vector-algorithms` | `Data.Vector.Algorithms.Tim.sortBy` | Yes | Yes | `vaTim` |
+| `primitive-sort` | `Data.Primitive.Sort.sortMutableBy` | Yes | No | `ps` |

An `-i` suffix indicates that the sort was done by index.
