Add streaming of xlsx file support #144

Merged Dec 19, 2021 (165 commits)
Commits
93a5701
Change type alias for CellMap
jappeace Jan 27, 2021
0e241be
Try seeing if we can list everything in the test data
jappeace Jan 27, 2021
d705488
Loop trough the entire thing
jappeace Jan 28, 2021
d050598
Split of go into seperate function
jappeace Jan 28, 2021
5b42e5b
Don't print the bytestring
jappeace Jan 28, 2021
2a50042
I'm finding out it aint ordered
jappeace Jan 29, 2021
d6a5f50
Go a long way towards completion
jappeace Jan 29, 2021
ce7cff8
Use coerce for free speed
jappeace Feb 10, 2021
fa8df15
Do some rundimentary parsing
jappeace Feb 10, 2021
bc66011
Add styling the binary to see what's going on
jappeace Feb 15, 2021
d596b31
Filter out the formula
jappeace Feb 24, 2021
8404ab7
Add string table parsing, cleanup warnings
jappeace Feb 24, 2021
b88381b
THis only get's the last string (lol)
jappeace Feb 24, 2021
36dca07
Fix shared string bug
jappeace Mar 4, 2021
bdc3779
Filter empty rows
jappeace Mar 4, 2021
b5ec1e4
Try allow user to use sideffect to lookup string
jappeace Mar 5, 2021
ace6031
Revert "Try allow user to use sideffect to lookup string"
jappeace Mar 5, 2021
0455509
Make a conduit out of shared string instead of use state monad
jappeace Mar 5, 2021
c0d2231
Try fix tests after that Cellmap change
jappeace Mar 8, 2021
b55f125
Revert "Try fix tests after that Cellmap change"
jappeace Mar 8, 2021
63f91b0
Revert "Change type alias for CellMap"
jappeace Mar 8, 2021
eccf7f8
Fix the tests
jappeace Mar 8, 2021
ddc9ee8
Remove MonadIO constraint, throw errors with monadthrow
jappeace Mar 9, 2021
0616416
Remove several redundant constraints
jappeace Mar 9, 2021
5180ee7
Add haskcallstack, but it's not a good idea after reading the docs
jappeace Mar 9, 2021
db46ca3
Revert "Add haskcallstack, but it's not a good idea after reading the…
jappeace Mar 9, 2021
d17dbdc
Add bool parse support
jappeace Mar 9, 2021
9c27047
Add some docs
jappeace Mar 9, 2021
cd9be9a
Show the issue with a test
jappeace Mar 9, 2021
87ea6ba
Fix tests for shared strings
jappeace Mar 11, 2021
d94d4b9
Add nix flake template
awkure Apr 19, 2021
6329020
Add stylish haskell to nix shell and fix shell
awkure Apr 21, 2021
000e696
Update streaming parser and add docs
awkure Apr 21, 2021
fca5598
Update tests and benchmarks
awkure Apr 21, 2021
b07d300
Enable profiling and collect data for all parsers
awkure Apr 21, 2021
d0e8686
Update docs and fix error with parsing some strings
awkure May 3, 2021
7e9c4ec
Add conduit for creating shared string table
jappeace Jun 1, 2021
15b2d2e
Add geenration of shared string table
jappeace Jun 1, 2021
01fe7e6
Add support for writing rows
jappeace Jun 1, 2021
4c087fc
Solve warnings
jappeace Jun 1, 2021
eb11b3b
Make it produce a file
jappeace Jun 3, 2021
27cf20a
Add test that checks our writing code to the existing implementation
jappeace Jun 3, 2021
9d0801d
Cleanup tag soup with more use of the library
jappeace Jun 3, 2021
f94e958
I guess this made the tests pass,
jappeace Jun 3, 2021
a6762d3
Cleanup old test suite
jappeace Jun 4, 2021
929b0b6
Add support for parsing errors
jappeace Jun 4, 2021
9337b60
Add test that compares existing value to the writer
jappeace Jun 4, 2021
78429e1
Add printing of most boilerplate
jappeace Jun 4, 2021
2884eaa
Fix namespace issue for shared strings
jappeace Jun 4, 2021
1f752a5
Add namespaces to pretty much everything
jappeace Jun 7, 2021
67db7da
I think we managed to obtain relationships
jappeace Jun 7, 2021
2a6d10e
Workbooks now get loaded
jappeace Jun 7, 2021
418094a
Simplify the tests
jappeace Jun 7, 2021
2bdef1b
This passes the test of writing and reading
jappeace Jun 8, 2021
d2bbdb5
Add documentation on how to validate and how to use ssts
jappeace Jun 8, 2021
4eabcea
Make the result be digestable by excel
jappeace Jun 8, 2021
b622d43
Stop writing in and out
jappeace Jun 8, 2021
da72680
Move not api related code to an internal module
jappeace Jun 8, 2021
4c02079
Add additional notes on arguments
jappeace Jun 8, 2021
b6c8aa5
Since this realized, no longer need outdated docs
jappeace Jun 10, 2021
576a1b2
Remove implementation details from haddock
jappeace Jun 10, 2021
14cee74
Fix bug where writing of larger files got truncated to last result
jappeace Jun 23, 2021
80494eb
Parse every column of the row, not just 1 (set sheet state)
timds Jun 4, 2021
b6e03d0
Do not treat excel booleans as indices into the shared strings
timds Jun 4, 2021
000d83d
Add some possibly helpful comments to Parser.Stream
timds Jun 7, 2021
2103b06
Replace hard-coded path with getTemporaryDirectory in StreamTests
timds Jun 8, 2021
4fa6cf3
[Draft] Add an alternate Stream API that can stream certain sheets
Jun 9, 2021
b384c39
(Streaming via "zip") Include API for both XML events and SheetItem
Jun 11, 2021
55932a4
Stream module: Expose more functions & clarify docs
Jun 11, 2021
86e061b
Stream module via "zip": Add API to count number of rows in a sheet
Jun 11, 2021
4549de1
Remove getSheetXmlSource from public API
Jun 17, 2021
4b69412
Use a vector instead of a map for storing shared strings
Jun 18, 2021
6d63578
[remove me] log time taken to parse shared strings
Jun 18, 2021
4fdd845
[remove me?] add many SCC annotations to Parser.Stream module
Jun 18, 2021
ba72012
Parser.Stream: replace xml-conduit with xeno (experimental hack)
Jun 22, 2021
3cba572
Parser.Stream: Replace xeno with hexpat as the xml parser
Jun 25, 2021
23f1043
Add countrows executable for easier profiling
Jun 28, 2021
c2544f5
add to .gitignore
Jul 6, 2021
66f0347
illustration of incompatible synchronous libxml<->conduit APIs
Jul 6, 2021
1e99e19
try using concurrency to yield results of libxml in conduit
Jul 7, 2021
343632c
Modify libxml parser to take a IO callback rather than conduit CB
Jul 8, 2021
03d73a3
Stream parser: remove xeno parser code
Jul 8, 2021
7b6250b
Stream parser: Add callback-based expat interface
Jul 8, 2021
139b5a2
Ignore cabal.project.local
Jul 12, 2021
29eb326
Add nix dev setup that doesn't use flakes
Jul 12, 2021
dfb37bc
Remove parser based on zip-archive
Jul 12, 2021
89484de
Remove -fno-full-laziness
Jul 12, 2021
b0e228a
Update copyright and haddocks
Jul 12, 2021
92162ba
Rename sourceSheet to readSheet
Jul 12, 2021
5c74b85
Fix haddock typo
Jul 14, 2021
f5e66ec
Stream parser: replace Map with IntMap as CellRow type
Jul 15, 2021
3dab49b
Parse shared strings with hexpat rather than xml-conduit
Jul 15, 2021
67ad6e9
Stream parser: remove xml-conduit from the API
Jul 15, 2021
107fd01
stream parser: remove getOrParseSharedStrings from public API
Jul 15, 2021
6ea2835
stream parser: Add API to get workbook info (sheet names & numbers)
Jul 15, 2021
7c4ca8b
stream benchmark: update to reflect newer callback API
Jul 15, 2021
11b29fd
stream parsing: remove libxml usage
Jul 19, 2021
1298e73
Revert countRowsInSheet to faster version that does minimal parsing
Jul 19, 2021
1c5549d
This fixes the compile error in the tests.
jappeace Jul 19, 2021
3fc300f
Add the SheetView api
jappeace Jul 19, 2021
7aeab7a
Stream writing: Add column sizing support
jappeace Jul 21, 2021
b862b3a
Add row properties
jappeace Jul 22, 2021
c2dfbfe
Add style parsing and writing
jappeace Jul 22, 2021
afa9aea
Update src/Codec/Xlsx/Parser/Stream.hs
jappeace Jul 23, 2021
4ef6cb6
Rename sstable -> sharedStrings'
jappeace Jul 23, 2021
710422c
Fix writing of empty sheetview list.
jappeace Jul 26, 2021
74e83b8
Fix collect items reversing the row order.
jappeace Jul 26, 2021
d4b4cf1
Add another square test case that's smaller
jappeace Jul 26, 2021
c82d3b1
Filter cells or cellvalues if they contain no information
jappeace Jul 26, 2021
ba9b652
Delete profiling files, flake stuff, and count rows binary
jappeace Aug 4, 2021
f3f97af
Fix CI issues
jappeace Aug 18, 2021
0a9c163
Address review comments
jappeace Sep 1, 2021
1d19354
Fix more review comments
jappeace Sep 1, 2021
ec5fb40
Selective imports
jappeace Sep 1, 2021
2fb0eaf
Improve getWorkbookInfo API to include name and r:id attribute
Sep 15, 2021
af0e97a
Add countRowsInSheetByName function
Oct 14, 2021
8d47600
Fix confusion between sheet's r:id and sheetId
Oct 15, 2021
2f3d3ae
Revert "Fix confusion between sheet's r:id and sheetId"
Oct 15, 2021
91b4291
stream parser: Access individual sheets via relationships
Oct 15, 2021
2077ff8
stream parser: Don't expose getWorkbookRelationships as API
Oct 15, 2021
da9446e
Streaming parser: Don't error when a shared string table is absent
Oct 14, 2021
ea4127a
Streaming parser: add support for inline strings
Oct 14, 2021
17d456a
Re-enable test suite in nix
Oct 19, 2021
ee518dc
Add test that inline strings can be parsed
Oct 19, 2021
10fc4dc
Fix type in excellvaluetype docs
jappeace Oct 20, 2021
27283a9
Fix type in internal stream comment
jappeace Oct 20, 2021
220de36
Fix typo in comment in writer/stream
jappeace Oct 20, 2021
7d08d86
Fix typo in comment in writer/stream
jappeace Oct 20, 2021
868c467
expose cabal, disable profiling by default.
jappeace Oct 20, 2021
290f0ef
Add motivation hexpat copy
jappeace Oct 20, 2021
c288be1
Move typeapplications above other language pragmas
jappeace Oct 20, 2021
bbcbdb8
Expose errors, don't use error but exception mechanism.
jappeace Oct 20, 2021
37e928b
Clear typealiases, rename sharedstringstate to sharedstringsstate
jappeace Oct 20, 2021
8a94064
Fix more review comments
jappeace Oct 20, 2021
3dd5da7
Apply indented stylish rules
jappeace Oct 20, 2021
16c8e32
rename mapFold -> upsertSharedString
jappeace Oct 20, 2021
0f09fed
Add single sheetness support to the name of writesettings
jappeace Nov 10, 2021
b509078
Remove TODO comment
jappeace Nov 10, 2021
5a2c8cd
Delete inline comment about test idea
jappeace Nov 10, 2021
a0d265e
Split sheetitem into two types, one without sheet index
jappeace Nov 10, 2021
06add90
Fix docs
jappeace Nov 10, 2021
9fa44b7
Add warning to zipOpts
jappeace Nov 10, 2021
86ce3ae
replace excell -> Excel
jappeace Nov 10, 2021
3c7bb45
Fix grammar issue in comments
jappeace Nov 10, 2021
3d09f30
Fix another grammar issue
jappeace Nov 10, 2021
7a3500f
Delete commented code
jappeace Nov 10, 2021
c5d47b7
Add better description of Clark Notation
jappeace Nov 10, 2021
b905d85
Delete unused rels
jappeace Nov 10, 2021
f1ef1d7
Change comment from tab support to write multiple sheets
jappeace Dec 8, 2021
ccce953
Delete commented out code
jappeace Dec 8, 2021
b6b9e71
Remove unused pragmas
jappeace Dec 8, 2021
b38d690
Remove unused dependencies
jappeace Dec 8, 2021
f54af34
Remove Reduncant lock statement from rebase
jappeace Dec 8, 2021
d664e04
Fix lex error
jappeace Dec 8, 2021
a963af2
Add counting bench
jappeace Dec 8, 2021
7b84164
Add minor api changes for figuring out the index.
jappeace Dec 8, 2021
3220c8a
Merge pull request #20 from SupercedeTech/minor-api-changes-and-bench…
jappeace Dec 8, 2021
d1f09f8
Add benchmarks for writing with sst and existing lib call
jappeace Dec 8, 2021
e22df3a
Rename RowItem -> Row
jappeace Dec 9, 2021
90535dc
Add memoize functionality
jappeace Dec 14, 2021
b65d523
Merge pull request #21 from SupercedeTech/review-comments-iv
jappeace Dec 14, 2021
6d118b7
Remove blank line
jappeace Dec 14, 2021
bb30ade
Remove more blank lines.
jappeace Dec 14, 2021
095fa1c
Clean out docs for memoized module
jappeace Dec 15, 2021
e549d1f
Fix typo readWorkbookRelatinoships -> readWorkbookrelationships
jappeace Dec 19, 2021
16 changes: 15 additions & 1 deletion .gitignore
@@ -1,10 +1,24 @@
TAGS
cabal-dev
dist
dist-newstyle
*sandbox*
#*#
*.*~
specs
samples
.stack-work
*.lock

# nix
result
result-doc
*.lock
*.o
*.hi
*.prof
*.aux
*.hp
*.ps
.envrc
.direnv
cabal.project.local
27 changes: 27 additions & 0 deletions benchmarks/Main.hs
@@ -2,20 +2,47 @@
module Main (main) where

import Codec.Xlsx
import Codec.Xlsx.Parser.Stream
import Codec.Xlsx.Writer.Stream
import Control.DeepSeq
import Control.Lens
import Control.Monad (void)
import Criterion.Main
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LB
import qualified Data.Conduit as C
import qualified Data.Conduit.Combinators as C
import Data.Maybe

main :: IO ()
main = do
  let filename = "data/testInput.xlsx"
        -- "data/6000.rows.x.26.cols.xlsx"
  bs <- BS.readFile filename
  let bs' = LB.fromStrict bs
      parsed :: Xlsx
      parsed = toXlsxFast bs'
  idx <- fmap (fromMaybe (error "ix not found")) $ runXlsxM filename $ makeIndexFromName "Sample list"
  items <- runXlsxM filename $ collectItems idx
  deepseq (parsed, bs', idx, items) (pure ())
  defaultMain
    [ bgroup
        "readFile"
        [ bench "with xlsx" $ nf toXlsx bs'
        , bench "with xlsx fast" $ nf toXlsxFast bs'
        , bench "with stream (counting)" $ nfIO $ runXlsxM filename $ countRowsInSheet idx
        , bench "with stream (reading)" $ nfIO $ runXlsxM filename $ readSheet idx (pure . rwhnf)
        ]
    , bgroup
        "writeFile"
        [ bench "with xlsx" $ nf (fromXlsx 0) parsed
        , bench "with stream (no sst)" $
            nfIO $ C.runConduit $
              void (writeXlsxWithSharedStrings defaultSettings mempty $ C.yieldMany $ view si_row <$> items)
                C..| C.fold
        , bench "with stream (sst)" $
            nfIO $ C.runConduit $
              void (writeXlsx defaultSettings $ C.yieldMany $ view si_row <$> items)
                C..| C.fold
        ]
    ]
Binary file added data/inline-strings.xlsx
Binary file not shown.
27 changes: 27 additions & 0 deletions default.nix
@@ -0,0 +1,27 @@
let
  rev = "07ca3a021f05d6ff46bbd03c418b418abb781279"; # first 21.05 release
  url = "https://github.com/NixOS/nixpkgs/archive/${rev}.tar.gz";
  compiler = "ghc884";
  isLibraryProfiling = false;
  pkgs = import (builtins.fetchTarball url) {
    config = if isLibraryProfiling then ({
      packageOverrides = pkgs_super: {
        haskell = pkgs_super.haskell // {
          packages = pkgs_super.haskell.packages // {
            "${compiler}" = pkgs_super.haskell.packages."${compiler}".override {
              overrides = self: super: {
                mkDerivation = args: super.mkDerivation (args // {
                  enableLibraryProfiling = true;
                });
              };
            };
          };
        };
      };
    }) else {};
  };

  hpkgs = pkgs.haskell.packages."${compiler}";
in pkgs.haskell.lib.overrideCabal (hpkgs.callCabal2nix "xlsx" ./. {}) {
  libraryToolDepends = [pkgs.cabal-install];
}
2 changes: 2 additions & 0 deletions shell.nix
@@ -0,0 +1,2 @@
(import ./default.nix).env # not flake-based
# (import ./.).devShell."${builtins.currentSystem}" # flake-based
10 changes: 5 additions & 5 deletions src/Codec/Xlsx/Parser.hs
@@ -2,6 +2,7 @@
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE OverloadedStrings #-}
+{-# LANGUAGE PackageImports #-}
{-# LANGUAGE RecordWildCards #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TupleSections #-}
@@ -16,7 +17,7 @@ module Codec.Xlsx.Parser
, Parser
) where

-import qualified Codec.Archive.Zip as Zip
+import qualified "zip-archive" Codec.Archive.Zip as Zip
import Control.Applicative
import Control.Arrow (left)
import Control.Error.Safe (headErr)
@@ -27,7 +28,7 @@ import Lens.Micro
#else
import Control.Lens hiding ((<.>), element, views)
#endif
-import Control.Monad (forM, join, void)
+import Control.Monad (join, void)
import Control.Monad.Except (catchError, throwError)
import Data.Bool (bool)
import Data.ByteString (ByteString)
@@ -54,7 +55,6 @@ import Codec.Xlsx.Parser.Internal
import Codec.Xlsx.Parser.Internal.PivotTable
import Codec.Xlsx.Types
import Codec.Xlsx.Types.Cell (formulaDataFromCursor)
-import Codec.Xlsx.Types.Common (xlsxTextToCellValue)
import Codec.Xlsx.Types.Internal
import Codec.Xlsx.Types.Internal.CfPair
import Codec.Xlsx.Types.Internal.CommentTable as CommentTable
@@ -71,7 +71,7 @@ import Codec.Xlsx.Types.PivotTable.Internal
toXlsx :: L.ByteString -> Xlsx
toXlsx = either (error . show) id . toXlsxEither

-data ParseError = InvalidZipArchive
+data ParseError = InvalidZipArchive String
| MissingFile FilePath
| InvalidFile FilePath Text
| InvalidRef FilePath RefId
@@ -106,7 +106,7 @@ toXlsxEitherBase ::
-> L.ByteString
-> Parser Xlsx
toXlsxEitherBase parseSheet bs = do
-  ar <- left (const InvalidZipArchive) $ Zip.toArchiveOrFail bs
+  ar <- left InvalidZipArchive $ Zip.toArchiveOrFail bs
sst <- getSharedStrings ar
contentTypes <- getContentTypes ar
(wfs, names, cacheSources, dateBase) <- readWorkbook ar
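
The change to `InvalidZipArchive` above keeps the decoder's failure message instead of discarding it with `const`. A minimal sketch of that pattern, assuming a hypothetical `decodeArchive` as a stand-in for `Zip.toArchiveOrFail` and a cut-down `ParseError` with only the one constructor:

```haskell
import Control.Arrow (left)

-- Cut-down version of the error type: only the constructor changed in this diff.
data ParseError = InvalidZipArchive String
  deriving (Show, Eq)

-- Hypothetical stand-in for Zip.toArchiveOrFail: fails with a message on
-- empty input, otherwise returns the input length as a fake "archive".
decodeArchive :: String -> Either String Int
decodeArchive "" = Left "empty input"
decodeArchive s  = Right (length s)

-- `left` maps over the Left side only, so the decoder's message is wrapped
-- in the constructor rather than thrown away.
openArchive :: String -> Either ParseError Int
openArchive = left InvalidZipArchive . decodeArchive

main :: IO ()
main = do
  print (openArchive "")    -- prints: Left (InvalidZipArchive "empty input")
  print (openArchive "abc") -- prints: Right 3
```

With the earlier `left (const InvalidZipArchive)` form, every failure produced the same value and the caller could not tell why decoding failed.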
53 changes: 53 additions & 0 deletions src/Codec/Xlsx/Parser/Internal/Memoize.hs
@@ -0,0 +1,53 @@
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeApplications #-}

-- | I rewrote: https://hackage.haskell.org/package/unliftio-0.2.20/docs/src/UnliftIO.Memoize.html#Memoized
-- for monad trans basecontrol
-- we don't need a generic `m` anyway. it's good enough in base IO.
module Codec.Xlsx.Parser.Internal.Memoize
( Memoized
, runMemoized
, memoizeRef
) where

import Control.Applicative as A
import Control.Monad (join)
import Control.Monad.IO.Class
import Data.IORef
import Control.Exception

-- | A \"run once\" value, with results saved. Extract the value with
-- 'runMemoized'. For single-threaded usage, you can use 'memoizeRef' to
-- create a value. If you need guarantees that only one thread will run the
-- action at a time, use 'memoizeMVar'.
--
-- Note that this type provides a 'Show' instance for convenience, but no
-- useful information can be provided.
newtype Memoized a = Memoized (IO a)
  deriving (Functor, A.Applicative, Monad)
instance Show (Memoized a) where
  show _ = "<<Memoized>>"

-- | Extract a value from a 'Memoized', running an action if no cached value is
-- available.
runMemoized :: MonadIO m => Memoized a -> m a
runMemoized (Memoized m) = liftIO m
{-# INLINE runMemoized #-}

-- | Create a new 'Memoized' value using an 'IORef' under the surface. Note that
-- the action may be run in multiple threads simultaneously, so this may not be
-- thread safe (depending on the underlying action).
memoizeRef :: IO a -> IO (Memoized a)
memoizeRef action = do
  ref <- newIORef Nothing
  pure $ Memoized $ do
    mres <- readIORef ref
    res <-
      case mres of
        Just res -> pure res
        Nothing -> do
          res <- try @SomeException action
          writeIORef ref $ Just res
          pure res
    either throwIO pure res
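
The run-once behaviour of `memoizeRef` can be illustrated with a smaller, self-contained sketch. `memoizeOnce` below is a hypothetical simplification, not the module's API: it returns a plain `IO a` rather than a `Memoized a`, and unlike `memoizeRef` it does not cache exceptions.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef, writeIORef)

-- Hypothetical simplification of memoizeRef: the returned action runs
-- `action` on first use and serves the cached result afterwards.
-- Single-threaded use only; exceptions are not cached here.
memoizeOnce :: IO a -> IO (IO a)
memoizeOnce action = do
  ref <- newIORef Nothing
  pure $ do
    cached <- readIORef ref
    case cached of
      Just v  -> pure v
      Nothing -> do
        v <- action
        writeIORef ref (Just v)
        pure v

-- Runs the memoized action twice and reports both results together with
-- the number of times the underlying action was actually executed.
demo :: IO (Int, Int, Int)
demo = do
  calls <- newIORef (0 :: Int)
  expensive <- memoizeOnce $ do
    modifyIORef' calls (+ 1)
    pure 42
  a <- expensive
  b <- expensive
  n <- readIORef calls
  pure (a, b, n)

main :: IO ()
main = demo >>= print -- prints: (42,42,1)
```

The counter stays at 1 after two calls, which is the property the streaming parser presumably relies on when several sheet reads need the same lazily computed value.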