From 72100e87770898626e09964794357bbdcd951944 Mon Sep 17 00:00:00 2001 From: Soumik Sarkar Date: Sun, 5 May 2024 13:19:06 +0530 Subject: [PATCH] Add HOWTO.md (#10) Add HOWTO and update READMEs --- HOWTO.md | 299 ++++++++++++++++++++++++++++++++++++++++++++++ README.md | 46 ++++--- compare/README.md | 18 +-- samsort.cabal | 3 +- 4 files changed, 332 insertions(+), 34 deletions(-) create mode 100644 HOWTO.md diff --git a/HOWTO.md b/HOWTO.md new file mode 100644 index 0000000..7c84b01 --- /dev/null +++ b/HOWTO.md @@ -0,0 +1,299 @@ +# Using this library + +This is a short guide to demonstrate how this library may be used. Familiarity +with some form of arrays and mutable arrays in Haskell is expected. If your +array type is not present here, it should still be possible to adapt some of the +code below to your use case. + +## Sorting boxed elements + +This library offers the function + +```hs +sortArrayBy + :: (a -> a -> Ordering) -- ^ comparison + -> MutableArray# s a + -> Int -- ^ offset + -> Int -- ^ length + -> ST s () +``` + +So how does one use this? + +* The first parameter is a comparison function that will be used to order the + elements. +* The second parameter is a [`MutableArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableArray-35-) + with elements of type `a`. This is a primitive array type provided by GHC. + This array will be sorted in place. +* The third and fourth parameters are `Int`s which demarcate a slice of the + array. Elements in this slice will be sorted, and other elements will not be + touched. +* The return type is an `ST` action. If you are not familiar with `ST`, please + see the documentation for [`Control.Monad.ST`](https://hackage.haskell.org/package/base-4.19.0.0/docs/Control-Monad-ST.html). + +Clearly, to use `sortArrayBy`, an important step is to put the elements to be +sorted into a `MutableArray#`. The most convenient way to do this depends on how +the elements are stored prior to sorting. 
+ +### Example 1: [`MVector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Mutable.html#t:MVector) + +Consider that we need to sort a mutable vector `MVector` from the `vector` +library. This is quite easy, and in fact we do not need to put elements anywhere +because the underlying representation of an `MVector` is a `MutableArray#`! We +only need to get it out of the `MVector`. + +```hs +import Control.Monad.Primitive (PrimMonad(..), stToPrim) -- from the package "primitive" +import Data.Primitive.Array (MutableArray(..)) -- also from "primitive" +import Data.Vector.Mutable (MVector(..)) + +import qualified Data.SamSort as Sam + +-- | Sort a mutable vector in place. +sortMV :: (PrimMonad m, Ord a) => MVector (PrimState m) a -> m () +sortMV = sortMVBy compare + +-- | Sort a mutable vector in place using a comparison function. +sortMVBy :: PrimMonad m => (a -> a -> Ordering) -> MVector (PrimState m) a -> m () +sortMVBy cmp (MVector off len (MutableArray ma)) = + stToPrim $ Sam.sortArrayBy cmp ma off len +``` + +### Example 2: [`Vector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector.html#t:Vector) + +Now consider sorting an (immutable) `Vector`, again from the `vector` library. +Since we cannot mutate it, we will return a sorted copy. The most convenient way +here is to thaw it to an `MVector` and sort it as we did above. + +```hs +import Data.Vector (Vector) +import qualified Data.Vector as V + +-- | Sort a vector. +sortV :: Ord a => Vector a -> Vector a +sortV = sortVBy compare + +-- | Sort a vector using a comparison function. +sortVBy :: (a -> a -> Ordering) -> Vector a -> Vector a +sortVBy cmp v = V.create $ do + mv <- V.thaw v + sortMVBy cmp mv -- from Example 1 above + pure mv +``` + +We can test it out in GHCi.
+ +```hs +>>> sortV (V.fromList [5,2,6,3,4,1]) +[1,2,3,4,5,6] +>>> import Data.Ord (comparing) +>>> sortVBy (comparing length) (V.fromList ["Lunar","11.3","Candle","Magic"]) +["11.3","Lunar","Magic","Candle"] +``` + +### Example 3: List + +Let us now try to sort a list, like [`Data.List.sort`](https://hackage.haskell.org/package/base-4.19.0.0/docs/Data-List.html#v:sort) +does. We will need to move the elements from the list into a `MutableArray#`. + +I recommend using the [`primitive`](https://hackage.haskell.org/package/primitive-0.9.0.0/docs/Data-Primitive-Array.html) +library for this task. `primitive` provides boxed wrappers over GHC primitive +types and functions to work with them. While it is possible to do this without +any library, it is easiest to use what is already available. If you are unable +to use `primitive`, you can take a peek at the relevant definitions there and +use them directly. + +```hs +import Control.Monad.Primitive (stToPrim) +import qualified Data.Foldable as F +import Data.Primitive.Array (MutableArray(..)) +import qualified Data.Primitive.Array as A + +import qualified Data.SamSort as Sam + +-- | Sort a list. +sortL :: Ord a => [a] -> [a] +sortL = sortLBy compare + +-- | Sort a list using a comparison function. +sortLBy :: (a -> a -> Ordering) -> [a] -> [a] +sortLBy cmp xs = F.toList $ A.runArray $ do + let a = A.arrayFromList xs + n = A.sizeofArray a + ma@(MutableArray ma') <- A.thawArray a 0 n + stToPrim $ Sam.sortArrayBy cmp ma' 0 n + pure ma +``` + +In GHCi, +```hs +>>> sortL ["Fall","In","The","Dark"] +["Dark","Fall","In","The"] +>>> import Data.Ord (Down(..), comparing) +>>> sortLBy (comparing Down) [3.4,8.5,9.1,7.9,3.1,6.2] +[9.1,8.5,7.9,6.2,3.4,3.1] +``` + +> [!TIP] +> +> Avoid `Data.List`'s `sort` and `sortBy` when a large number of elements need +> to be fully sorted and performance is a concern. Sorting lists is quite +> inefficient. Put the elements in a mutable array and use this (or some other) +> sorting library instead.
+ +## Sorting `Int`s + +Converting to a `MutableArray#` and sorting, as explained in the above section, +should cover the majority of use cases. However, sometimes it is not the +best option. For instance, we may be storing `Int`s in an unboxed array for +efficiency. Having to pull them out and box them for sorting does not sound +good. + +The second function provided by this library is + +```hs +sortIntArrayBy + :: (Int -> Int -> Ordering) -- ^ comparison + -> MutableByteArray# s + -> Int -- ^ offset in Int#s + -> Int -- ^ length in Int#s + -> ST s () +``` + +As you might have guessed, this sorts an unboxed array of `Int`s. We can use +this whenever we need to sort `Int`s, or even other types that may be cheaply +converted to and from `Int`s (like `Word`). + +### Example 1: [Unboxed `MVector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Unboxed-Mutable.html#t:MVector) + +Let us sort a mutable unboxed `MVector Int`. Like with the boxed `MVector`, +we do not need to move the elements because the underlying representation is a +`MutableByteArray#`. + +```hs +import Control.Monad.Primitive (PrimMonad(..), stToPrim) +import Data.Primitive.ByteArray (MutableByteArray(..)) +import qualified Data.Vector.Primitive as VP +import qualified Data.Vector.Unboxed.Mutable as VUM +import qualified Data.Vector.Unboxed.Base as VUB + +import qualified Data.SamSort as Sam + +sortVUMInt :: PrimMonad m => VUM.MVector (PrimState m) Int -> m () +sortVUMInt = sortVUMIntBy compare + +sortVUMIntBy + :: PrimMonad m + => (Int -> Int -> Ordering) -> VUM.MVector (PrimState m) Int -> m () +sortVUMIntBy cmp mv = case mv of + VUB.MV_Int (VP.MVector off len (MutableByteArray ma')) -> + stToPrim $ Sam.sortIntArrayBy cmp ma' off len +``` + +> [!WARNING] +> +> Do not try changing the element type above to sort any other unboxed vector. 
+> `MutableByteArray#` is the underlying representation for many unboxed vectors, +> but it would be incorrect to use the above code if the element type is not +> `Int`. + +## Sorting by index + +We have now covered sorting boxed values, and sorting `Int`s. What about other +types in unboxed arrays? + +### Example 1: [Unboxed `Vector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Unboxed.html#t:Vector) + +Consider that we need to sort an unboxed vector of some type `a`. The `vector` +library is designed in a way that the underlying representation of an unboxed +vector can be anything depending on the type `a`. We cannot assume anything +about it. + +We know that we can index such a vector efficiently. We also know that we can +construct such vectors from an `Int -> a` using the handy `generate` function. We +will use these facts to sort such a vector. + +First we will create an `Int` vector, the elements of which will be indices into +the `a` vector. Then we will sort this `Int` vector using a comparison function +that indexes the `a` vector and compares `a`s. Finally, we will construct a +vector with `a`s in the order of the sorted indices. + +This technique is general enough that we can sort any flavor of `Vector` +(boxed, `Unboxed`, `Prim`, `Storable`), so let us use `Vector.Generic` to +define the functions. + +```hs +import Control.Monad.Primitive (stToPrim) +import Data.Primitive.ByteArray (MutableByteArray(..)) +import qualified Data.Vector.Generic as VG +import qualified Data.Vector.Primitive as VP +import qualified Data.Vector.Primitive.Mutable as VPM + +import qualified Data.SamSort as Sam + +-- | Sort a vector. +sortByIdxVG :: (Ord a, VG.Vector v a) => v a -> v a +sortByIdxVG = sortByIdxVGBy compare + +-- | Sort a vector using a comparison function. +sortByIdxVGBy :: VG.Vector v a => (a -> a -> Ordering) -> v a -> v a +sortByIdxVGBy cmp v = VG.generate n (VG.unsafeIndex v . 
VP.unsafeIndex ixa) + where + n = VG.length v + cmp' i j = cmp (VG.unsafeIndex v i) (VG.unsafeIndex v j) + ixa = VP.create $ do + ixma <- VPM.generate n id + case ixma of + VPM.MVector off len (MutableByteArray ma') -> + stToPrim $ Sam.sortIntArrayBy cmp' ma' off len + pure ixma +``` + +In fact, this technique is general enough to be used whenever indices can +be sorted, using any method, and not just with this library! + +Sorting by index is more beneficial the larger the elements are in memory, +since moving around index `Int`s is cheaper than moving around the elements +themselves. + +We can see that the sort works as expected in GHCi. + +```hs +>>> import Data.Ord (comparing) +>>> import qualified Data.Vector.Unboxed as VU +>>> let v = VU.fromList [(6,4),(5,4),(1,2)] :: VU.Vector (Int,Int) +>>> sortByIdxVGBy (comparing snd) v +[(1,2),(6,4),(5,4)] +``` + +And we can see [in benchmarks](https://github.com/meooow25/samsort/tree/master/compare#4-sort-105-int-int-ints-unboxed) +that sorting by index is indeed more efficient than sorting directly, for +elements of type `(Int, Int, Int)` and sort implementations which support both. + +## Sorting unboxed arrays of small elements + +So sorting by index is more beneficial the larger the element is, but what +about small elements? Perhaps we need to sort an unboxed array of `Word8`s, or +`Float`s? + +Our options as seen above are + +* Convert to a boxed array and sort. Lots of avoidable allocations and slow + comparisons. +* Sort by index. Better, but has avoidable allocations in the form of the + index array. + +Neither is ideal. The most efficient way to sort small elements is to sort the +array of such elements directly. Unfortunately, this library cannot be used to +do this because there are only two functions, one to sort boxed values, and one +to sort `Int`s. If we must use this library, sorting by index is the method of +choice. It is not ideal, but it will not be slow either.
+ +The [`primitive-sort`](https://hackage.haskell.org/package/primitive-sort) +library may also be a good choice for this task. It can sort such small elements +efficiently, though it has some drawbacks (not adaptive, cannot sort a slice, +cannot sort using a comparison function, more dependencies). + +[`vector-algorithms`](https://hackage.haskell.org/package/vector-algorithms) +is also able to sort small elements, however it turns out to be +[slower in practice](https://github.com/meooow25/samsort/tree/master/compare#5-sort-105-word8s-unboxed). diff --git a/README.md b/README.md index e118b1d..309f960 100644 --- a/README.md +++ b/README.md @@ -6,20 +6,19 @@ A stable adapative mergesort implementation ## Features -This is a lightweight library offering a high performance primitive sort -function. The function sorts a GHC -[`MutableArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableArray-35-) -in place. +This is a lightweight library offering two high performance sort functions: + +* `sortArrayBy` sorts a GHC [`MutableArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableArray-35-) + of boxed elements in place. +* `sortIntArrayBy` sorts a GHC [`MutableByteArray#`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Exts.html#t:MutableByteArray-35-) + of `Int#`s in place. There are no dependencies outside of `base`. This means that this library is not tied to array abstractions from any particular library. This also means -that you may need to write a wrapper function that sorts your flavor of Haskell -array, such as ones from -[`primitive`](https://hackage.haskell.org/package/primitive-0.9.0.0/docs/Data-Primitive-Array.html#t:MutableArray), -[`vector`](https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Mutable.html#t:MVector), -[`array`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-Arr.html#t:STArray), -or elsewhere. 
You can find an example with `primitive` -[here](https://github.com/meooow25/samsort/blob/82b7b9c84919a6d44484df9375a63d26c0520716/compare/Main.hs#L61-L64). +that you may need to write a few lines of code to get a `MutableArray#` or +`MutableByteArray#` from your data, which can then be sorted. See +[`HOWTO.md`](https://github.com/meooow25/samsort/blob/master/HOWTO.md) +for a guide. If you need to use this library in an environment where you cannot depend on other packages, you may simply copy the lone source file @@ -34,21 +33,20 @@ to your project. * The sort is adaptive, i.e. the sort identifies and uses ascending and descending runs of elements occuring in the input to perform less work. As a result, the sort is $O(n)$ for already sorted inputs. -* The performance is comparable to and in many cases better than the comparison - sorts from the [vector-algorithms](https://hackage.haskell.org/package/vector-algorithms) - library. See [the benchmarks](https://github.com/meooow25/samsort/tree/master/compare) +* The sort is faster than implementations from other libraries in most + scenarios. [See the benchmarks](https://github.com/meooow25/samsort/tree/master/compare) for details. -## FAQ - -#### Why not use \? +## Known issues -I'm open to changing the implemention if an alternative is demonstrated to -perform better, as long as the sort remains stable and adaptive. +Ideally, this library would offer only an algorithm, capable of sorting arrays +of any flavor. To support different arrays we would need to rely on some +abstraction, either from another library (like `vector`), or created here. We +cannot do either of those while also keeping the library as lightweight as it +is now. -#### How do I sort an unboxed array with this library? +## Contributing -You can't. To sort different types of arrays, I would have to rely on an -existing library's abstractions (like `vector-algorithms` relies on `vector`), -or roll my own.
This goes against the goal of keeping the library lightweight. I -do not have a solution to this problem at the moment. +Questions, bug reports, documentation improvements, code contributions welcome! +Please [open an issue](https://github.com/meooow25/samsort/issues) as the first +step. Slow performance counts as a bug! diff --git a/compare/README.md b/compare/README.md index 3368b51..fa8d5fe 100644 --- a/compare/README.md +++ b/compare/README.md @@ -6,15 +6,15 @@ is at [`result/result.csv`](result/result.csv). ## Comparing with other libraries -| Label | Library | Function | -| --- | --- | --- | -| `ssArray` | `samsort` | `Data.SamSort.sortArrayBy` | -| `ssIntArray` | `samsort` | `Data.SamSort.sortIntArrayBy` | -| `vaHeap` | `vector-algorithms` | `Data.Vector.Algorithms.Heap.sortBy` | -| `vaIntro` | `vector-algorithms` | `Data.Vector.Algorithms.Intro.sortBy` | -| `vaMerge` | `vector-algorithms` | `Data.Vector.Algorithms.Merge.sortBy` | -| `vaTip` | `vector-algorithms` | `Data.Vector.Algorithms.Tim.sortBy` | -| `ps` | `primitive-sort` | `Data.Primitive.Sort.sortMutableBy` | +| Library | Function | Stable | Adaptive | Label | +| --- | --- | --- | --- | --- | +| `samsort` | `Data.SamSort.sortArrayBy` | Yes | Yes | `ssArray` | +| `samsort` | `Data.SamSort.sortIntArrayBy` | Yes | Yes | `ssIntArray` | +| `vector-algorithms` | `Data.Vector.Algorithms.Heap.sortBy` | No | No | `vaHeap` | +| `vector-algorithms` | `Data.Vector.Algorithms.Intro.sortBy` | No | No | `vaIntro` | +| `vector-algorithms` | `Data.Vector.Algorithms.Merge.sortBy` | Yes | No | `vaMerge` | +| `vector-algorithms` | `Data.Vector.Algorithms.Tim.sortBy` | Yes | Yes | `vaTim` | +| `primitive-sort` | `Data.Primitive.Sort.sortMutableBy` | Yes | No | `ps` | An `-i` suffix indicates that the sort was done by index. 
diff --git a/samsort.cabal b/samsort.cabal index abd6c3a..cdae745 100644 --- a/samsort.cabal +++ b/samsort.cabal @@ -14,8 +14,9 @@ category: Data build-type: Simple extra-doc-files: - README.md CHANGELOG.md + HOWTO.md + README.md tested-with: GHC == 8.4.4