O(1) indexing on bytes #443
Comments
Could you please elaborate with type signatures and semantics of suggested functions? |
-- | O(1). Returns the char whose first code unit is at the given index.
-- Panics if the index is not on a char boundary.
index :: HasCallStack => Int -> Text -> Char
indexMaybe :: Int -> Text -> Maybe Char
-- | O(1). Returns a slice of the text given the start and end code units.
-- Panics if either index is not on a char boundary.
slice :: HasCallStack => Int -> Int -> Text -> Text
sliceMaybe :: Int -> Int -> Text -> Maybe Text
Detecting if a byte is a char boundary can be done with |
I’m not a fan of partial APIs, to be honest. Isn’t |
Well, we already have a lot of partial functions. Libraries like |
Maybe a compromise is to add this to
I recently added |
How does this break the level of abstraction and why does that matter? We are indexing by code units which make up a |
OK there's a subtlety there but that's beyond my point. Replace what I said with "
If you change the encoding of Maintaining this abstraction means that users don't have to know about the fact that It is an arbitrary and fairly strong restriction, but it at least provides a well-defined threshold to control the growth of the |
@oberblastmeister why are the functions in https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 not sufficient for your purposes? |
I think those functions would be good. I understand now that |
I thought about this for a while and I am reopening because I think this is useful, and I also thought of a good compromise. First, this function is useful, especially for compilers, parsers, and text processing algorithms. In a compiler, storing only the start and end UTF-8 indices can be much more efficient than storing lines and columns or something else. It is also flexible, because UTF-8 indices can be converted to other positions easily and efficiently. This is also useful for parsers, because tracking only byte-level indices and then converting those indices into other positions is much faster. For example, I made a parser combinator library that is 8-10x faster than attoparsec and megaparsec, and it also uses byte-level indices.
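To make the "store byte offsets, convert later" idea concrete, here is a minimal sketch of turning a UTF-8 byte offset into a line and column on demand. It works on a strict `ByteString` that is assumed to hold valid UTF-8, not on `Text`, and `lineCol` is a hypothetical name used only for illustration; it is not part of `text` or of the parser library mentioned above.

```haskell
import qualified Data.ByteString as BS
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | Hypothetical helper: convert a UTF-8 byte offset into a 1-based
-- (line, column) pair. Columns are counted in code points.
-- Assumes the input is valid UTF-8 and the offset is on a char boundary.
lineCol :: BS.ByteString -> Int -> (Int, Int)
lineCol src off = (line, col)
  where
    -- Everything strictly before the offset.
    before = BS.take off src
    -- One line per newline byte (0x0A) seen so far.
    line = 1 + BS.count 10 before
    -- Byte index just past the last newline, or 0 on the first line.
    lineStart = maybe 0 (+ 1) (BS.elemIndexEnd 10 before)
    -- Count code points on the current line by skipping continuation bytes.
    col = 1 + BS.length (BS.filter isLeading (BS.drop lineStart before))

    -- A byte of the form 10xxxxxx is a UTF-8 continuation byte.
    isLeading :: Word8 -> Bool
    isLeading b = b .&. 0xC0 /= 0x80
```

For the bytes of `"ab\ncé d"`, that is `BS.pack [97, 98, 10, 99, 0xC3, 0xA9, 32, 100]`, `lineCol` at offset 6 (the space) gives `(2, 3)`. The conversion is linear in the offset, so it suits error reporting rather than hot paths; a precomputed table of line starts would be the usual next step.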
To solve this, we should name the function |
What's unsafe w.r.t. |
I should have been more careful, because they are actually not unsafe. However, I think the interface is not ergonomic and the semantics are weird. Slicing is a more common operation than taking and dropping. I can't think of a use case where I would want taking and dropping over slicing. Silently accepting invalid UTF-8 indices seems dubious. It is also probably slower than just checking whether the given index is valid. Also, the |
I am interested in making a PR for this. We would have to check that the index is on a correct char boundary. Partial and non-partial variants would be provided.
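A rough sketch of what the boundary check and the non-partial variant could look like follows. It operates on a strict `ByteString` assumed to contain valid UTF-8 rather than on the internals of `Text`, and the names `isBoundaryAt` and `sliceMaybeUtf8` are hypothetical; the eventual PR may well differ.

```haskell
import qualified Data.ByteString as BS
import Data.Bits ((.&.))

-- | Hypothetical check: is byte index @i@ a char boundary?
-- The end of the input counts as a boundary; out-of-range indices do not.
-- Assumes the bytes are valid UTF-8.
isBoundaryAt :: BS.ByteString -> Int -> Bool
isBoundaryAt bs i
  | i < 0 || i > BS.length bs = False
  | i == BS.length bs = True
  -- Continuation bytes have the form 10xxxxxx; anything else starts a char.
  | otherwise = BS.index bs i .&. 0xC0 /= 0x80

-- | Hypothetical non-partial slice over the byte range [start, end).
-- Returns Nothing if either index is out of range or splits a char.
sliceMaybeUtf8 :: Int -> Int -> BS.ByteString -> Maybe BS.ByteString
sliceMaybeUtf8 start end bs
  | start <= end && isBoundaryAt bs start && isBoundaryAt bs end =
      Just (BS.take (end - start) (BS.drop start bs))
  | otherwise = Nothing
```

Both checks inspect a single byte each, and `BS.take` and `BS.drop` are O(1) on strict bytestrings, so the whole slice stays O(1). A partial variant could be a thin wrapper that calls `error` on `Nothing`.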