Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation improvements #37

Merged
merged 1 commit into from
Dec 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion parser-regex.cabal
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@ cabal-version: 2.4
name: parser-regex
version: 0.2.0.0
synopsis: Regex based parsers
description: Regex based parsers.
homepage: https://github.com/meooow25/parser-regex
bug-reports: https://github.com/meooow25/parser-regex/issues
license: BSD-3-Clause
Expand All @@ -15,6 +14,15 @@ extra-doc-files:
README.md
CHANGELOG.md

description:
Regex based parsers. See
.
["Regex.Text"]: To work with @Text@ from the @text@ library.
.
["Regex.List"]: To work with @String@s or lists.
.
["Regex.Base"]: To work with other sequences.

tested-with:
GHC == 9.0.2
, GHC == 9.2.8
Expand Down
131 changes: 129 additions & 2 deletions src/Regex/Base.hs
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
-- | This module exports base types and functions. You can use these to define
-- functions to work on arbitrary sequence types. If you want to work with
-- @Text@ or @String@, import and use "Regex.Text" or "Regex.List" instead.
-- functions to work on arbitrary sequence types.
--
-- If you want to work with @Text@ or @String@, import and use "Regex.Text" or
-- "Regex.List" instead.
--
module Regex.Base
(
-- * @RE@ and @Parser@
Expand Down Expand Up @@ -57,6 +60,20 @@ module Regex.Base
, R.liftA2'
, R.foldlMany'
, R.foldlManyMin'

-- * Additional information

-- ** Recursive definitions
-- $recursive-definitions

-- ** Laziness
-- $laziness

-- ** Looping parsers
-- $looping-parsers

-- ** Performance
-- $performance
) where

import qualified Regex.Internal.Regex as R
Expand Down Expand Up @@ -93,3 +110,113 @@ import qualified Regex.Internal.Parser as P
-- /don't use these functions/. Simply use @fmap@, @liftA2@, @foldlMany@ or
-- @foldlManyMin@ instead.
--

-- $recursive-definitions
--
-- It is not possible to define a @RE@ recursively. If it were permitted, it
-- would be capable of parsing more than
-- [regular languages](https://en.wikipedia.org/wiki/Regular_language).
-- Unfortunately, there is no good way\* to make it impossible to write such
-- a regex in the first place. So it must be avoided by the programmer. As an
-- example, avoid this:
--
-- @
-- re :: RE Int [Int]
-- re = liftA2 (:) (single 1) re \<|> [] \<$ single 0 -- diverges!
-- @
--
-- Instead, use appropriate combinators from this module:
--
-- @
-- re = many (single 1) <* single 0
-- @
--
-- For the same reason, be cautious when using combinators from the other
-- packages on @RE@s. Make sure that they do not attempt to construct a
-- recursive @RE@.
--
-- If you find that your regex is impossible to write without recursion,
-- you are attempting to parse a non-regular language! You need a more powerful
-- parser than what this library has to offer.
--
-- \*[Unlifted datatypes](https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/primitives.html#unlifted-datatypes)
-- can be used for this but they are too inconvenient to work with.
--

-- $laziness
--
-- Parsing is lazy in the result value, i.e. the @a@ in @RE c a@ or
-- @Parser c a@. In fact, for the algorithm used in this library, this laziness
-- is essential for good runtime complexity. However, there is little reason
-- to be lazy in other aspects, such as the elements of the sequence, @c@, or
-- the functions and regexes used in combinators. Functions are strict in such
-- arguments.
--
-- @
-- -- Lazy in the result
-- parseFoldr foldr (compile (pure ⊥)) [] = Just ⊥
-- parseFoldr foldr (compile (fmap (\\_ -> ⊥) (single 1))) [1] = Just ⊥
--
-- -- Strict in places like
-- single ⊥ = ⊥
-- fmap ⊥ r = ⊥
-- liftA2 f r ⊥ = ⊥
-- @
--

-- $looping-parsers
--
-- What should be the result of parsing an empty sequence with
-- @(many (pure ()))@?
--
-- Since @many r@ parses @r@ as many times as possible, and @pure ()@ succeeds
-- without consuming input, the result should arguably be the infinite list
-- @repeat ()@. Similarly, parsing an empty sequence with
-- @(foldlMany f z (pure ()))@ should diverge. Note that this applies to not
-- just @pure x@, but any regex that can succeed without consuming input, such
-- as @many x@, @manyMin x@, etc.
--
-- This library considers that such an outcome is not desirable in practice. It
-- would be surprising to get an infinite structure from a parser. So, in the
-- case that @many@ succeeds an infinite number of times, this library treats it
-- as succeeding /zero/ times.
--
-- By this rule, @(many (pure ()))@ on an empty sequence parses as @[]@ and
-- @(foldlMany f z (pure ()))@ parses as @z@.
--
-- This behavior makes it impossible to distinguish between zero parses and
-- infinite parses. To address this, an alternate combinator 'Regex.List.manyr'
-- is provided. This parses into a 'Regex.List.Many', a type that clearly
-- indicates if parsing succeeded without consuming input into an infinite list,
-- or if it succeeded a finite number of times.
--

-- $performance
--
-- This section describes some performance characteristics of this library,
-- without requiring a dive into the source code.
--
-- Parsing with a @RE@ is done in two distinct steps.
--
-- 1. A @RE@ is compiled to a @Parser@, which is a
-- [nondeterministic finite automaton](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton)
-- (NFA), in \(O(m)\) time. \(m\) here is the size of the @RE@, which is the
-- number of nodes in its internal tree representation. The resulting @Parser@
-- has \(O(m)\) size.
-- 2. The @Parser@ is run on a sequence in \(O(mn \log m)\) time, where \(n\) is
-- the length of the sequence. This assumes that each @(c -> Maybe a)@ function
-- used to parse individual elements takes \(O(1)\) time.
--
-- /Performance tip/: Use @(\<$)@ over @(\<$>)@, and @(\<*)@\/@(*>)@ over
-- @liftA2@\/@(\<*>)@ when ignoring the result of a @RE@. Knowing the result is
-- ignored allows compiling to a faster parser.
--
-- Memory usage for parsing is \(O(nm)\), but
--
-- * If the result of a @RE@ is ignored using @(\<$)@, @(\<*)@, or @(*>)@, only
-- \(O(m)\) memory is required.
--
-- This applies even as subcomponents. So, any subcomponent @RE@ of a larger
-- @RE@ that is only recognizing a section of the list is cheaper in terms of
-- memory.
--
2 changes: 1 addition & 1 deletion src/Regex/Internal/Regex.hs
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ import qualified Data.Foldable as F
--
-- Note that, because of bias, it is /not true/ that @a \<|> b = b \<|> a@.
--
-- /Performance note/: Prefer the smaller of equivalent regexes, i.e. prefer
-- /Performance tip/: Prefer the smaller of equivalent regexes, i.e. prefer
-- @(a \<|> b) \<*> c@ over @(a \<*> c) \<|> (b \<*> c)@.
--
data RE c a where
Expand Down
54 changes: 34 additions & 20 deletions src/Regex/List.hs
Original file line number Diff line number Diff line change
Expand Up @@ -81,7 +81,18 @@ module Regex.List
, L.replaceAll

-- * Additional information
-- $info

-- ** Recursive definitions
-- $recursive-definitions

-- ** Laziness
-- $laziness

-- ** Looping parsers
-- $looping-parsers

-- ** Performance
-- $performance
) where

import qualified Regex.Internal.Regex as R
Expand All @@ -104,9 +115,7 @@ import qualified Regex.Internal.List as L
-- * "Data.Traversable": @traverse@, @for@, @sequenceA@
--

-- $info
--
-- == Recursive definitions
-- $recursive-definitions
--
-- It is not possible to define a @RE@ recursively. If it were permitted, it
-- would be capable of parsing more than
Expand Down Expand Up @@ -134,16 +143,17 @@ import qualified Regex.Internal.List as L
-- you are attempting to parse a non-regular language! You need a more powerful
-- parser than what this library has to offer.
--
-- \* [Unlifted datatypes](https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/primitives.html#unlifted-datatypes)
-- can serve this purpose but they are too inconvenient to work with.
-- \*[Unlifted datatypes](https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/primitives.html#unlifted-datatypes)
-- can be used for this but they are too inconvenient to work with.
--
-- == Laziness

-- $laziness
--
-- Parsing is lazy in the result value, i.e. the @a@ in @RE c a@ or
-- @Parser c a@. In fact, for the algorithm used in this library, this laziness
-- is essential for good runtime complexity. However, there is little reason
-- to be lazy in other aspects, such as the values of the sequence, @c@, or the
-- functions and regexes used in combinators. Functions are strict in such
-- to be lazy in other aspects, such as the elements of the sequence, @c@, or
-- the functions and regexes used in combinators. Functions are strict in such
-- arguments.
--
-- @
Expand All @@ -157,7 +167,8 @@ import qualified Regex.Internal.List as L
-- liftA2 f r ⊥ = ⊥
-- @
--
-- == Looping parsers

-- $looping-parsers
--
-- What should be the result of @reParse (many (pure ())) ""@?
--
Expand All @@ -168,7 +179,7 @@ import qualified Regex.Internal.List as L
-- can succeed without consuming input, such as @many x@, @manyMin x@, etc.
--
-- This library considers that such an outcome is not desirable in practice. It
-- would be surprising to get an infinite structure from your parser. So, in the
-- would be surprising to get an infinite structure from a parser. So, in the
-- case that @many@ succeeds an infinite number of times, this library treats it
-- as succeeding /zero/ times.
--
Expand All @@ -181,25 +192,28 @@ import qualified Regex.Internal.List as L
-- indicates if parsing succeeded without consuming input into an infinite list,
-- or if it succeeded a finite number of times.
--
-- == Performance

-- $performance
--
-- This section may be useful for someone looking to understand the performance
-- of this library without diving into the source code.
-- This section describes some performance characteristics of this library,
-- without requiring a dive into the source code.
--
-- Parsing with a @RE@ is done in two distinct steps.
--
-- 1. A @RE@ is compiled to a @Parser@ in \(O(m)\) time, where \(m\) is the size
-- of the @RE@. This is a
-- 1. A @RE@ is compiled to a @Parser@, which is a
-- [nondeterministic finite automaton](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton)
-- (NFA).
-- (NFA), in \(O(m)\) time. \(m\) here is the size of the @RE@, which is the
-- number of nodes in its internal tree representation. The resulting @Parser@
-- has \(O(m)\) size.
-- 2. The @Parser@ is run on a list in \(O(mn \log m)\) time, where \(n\) is
-- the length of the list. Assumes every @Char@ is parsed in \(O(1)\).
-- the length of the list. This assumes that each @(c -> Maybe a)@ function
-- used to parse individual elements takes \(O(1)\) time.
--
-- /Performance note/: Use @(\<$)@ over @(\<$>)@, and @(\<*)@\/@(*>)@ over
-- /Performance tip/: Use @(\<$)@ over @(\<$>)@, and @(\<*)@\/@(*>)@ over
-- @liftA2@\/@(\<*>)@ when ignoring the result of a @RE@. Knowing the result is
-- ignored allows compiling to a faster parser.
--
-- Memory usage for parsing is \(O(nm)\).
-- Memory usage for parsing is \(O(nm)\), but
--
-- * If the result of a @RE@ is ignored using @(\<$)@, @(\<*)@, or @(*>)@, only
-- \(O(m)\) memory is required.
Expand Down
Loading