
Add streaming of xlsx file support #144

Merged
merged 165 commits on Dec 19, 2021

Conversation

jappeace
Contributor

This goes a long way toward implementing #132.

It adds both a parser and a writer module for streaming xlsx files, although, in line with the library in general, it offers "only basic functionality at the moment".

The parser doesn't use conduit because we wanted to make it as fast as xlsx2csv, which we did (credit goes to Tim). The writer is a lot slower, and in the future we may need to change its API as well to speed it up. However, both are functional now and in production at Supercede.
This doesn't mean we're set on this particular approach, and we welcome feedback.
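To illustrate the row-at-a-time shape this enables, here is a minimal sketch of constant-memory row processing; every name in it is a placeholder for illustration, not part of the new modules' actual API:

```haskell
-- Sketch only: a streaming parser in this style hands the caller a fold over
-- rows, so the accumulator is the only state held no matter how large the
-- sheet is.  Row, foldRows and countRowsWithCol1 are made-up names.
import           Data.Text (Text)
import qualified Data.Map.Strict as Map

type Row = Map.Map Int Text  -- column index -> cell text

countRowsWithCol1 :: Monad m
                  => ((Int -> Row -> m Int) -> Int -> m Int)  -- foldRows, supplied by the parser
                  -> m Int
countRowsWithCol1 foldRows = foldRows step 0
  where
    step n row
      | Map.member 1 row = pure $! n + 1
      | otherwise        = pure n
```

A streaming writer works the same way in reverse: rows are supplied one at a time and serialized as they arrive, instead of building the whole workbook in memory first.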

I'm sorry for doing this as a massive code dump, and I don't expect it to be accepted in a timely manner. For us, however, it had to be done in a timely manner, which is why we decided to work from a temporary fork.
I hope these changes are still welcome upstream, even though it's a lot in one go.


```
testXlsx :: Xlsx
testXlsx = Xlsx sheets minimalStyles definedNames customProperties DateBase1904
```
Contributor Author

This was moved to TestXlsx so StreamTest can depend on it.

@jappeace jappeace marked this pull request as draft August 18, 2021 12:38
@jappeace
Contributor Author

I've broken the microlens build because I use makePrisms. I'll define them in place instead.

@jappeace jappeace marked this pull request as ready for review August 18, 2021 16:39
@qrilka
Owner

qrilka commented Aug 18, 2021

@jappeace thanks for contributing this. I hope to find a window of time this weekend to read your PR.
I have one question though: you talk about the solution being "fast", but what about adding at least some basic benchmarks to demonstrate that?

@jappeace
Contributor Author

Yes, parsing is as fast as the best substitute for streaming, which is the xlsx2csv program. Compared to the existing code it's slower, but it can now handle Excel files with a million rows (which the existing implementation couldn't).

For a benchmark, do you mean adding it to benchmarks/Main and then comparing against the xlsx2csv program, or against the existing implementation? Or both?

@qrilka
Owner

qrilka commented Aug 19, 2021

Just choose whichever you find more appropriate.

Owner

@qrilka qrilka left a comment

GitHub doesn't allow comments on binary files; would you mind removing the unnecessary files .DS_Store and data/simple.xlsx?
See my other comments; most of them are minor, but the Writer API discrepancy seems to be an important detail that needs to be fixed.
And thanks for this great contribution.

@jappeace
Contributor Author

jappeace commented Sep 1, 2021

Sorry, I haven't addressed all the comments yet, nor added the benchmark. I'll see if I can get another window soon.

@qrilka
Owner

qrilka commented Sep 1, 2021

No problem, there seems to be no particular rush.

Add benchmark support.
The changes remove figuring out the index from
the functions I want to benchmark.
Also add a writer benchmark for streaming, results:

```
benchmarking readFile/with xlsx
time                 137.7 ms   (133.7 ms .. 140.8 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 136.5 ms   (134.1 ms .. 138.8 ms)
std dev              3.580 ms   (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 28.18 ms   (28.00 ms .. 28.48 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 28.33 ms   (28.10 ms .. 29.02 ms)
std dev              796.8 μs   (198.1 μs .. 1.473 ms)

benchmarking readFile/with stream (counting)
time                 13.57 ms   (13.49 ms .. 13.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 ms   (13.52 ms .. 13.61 ms)
std dev              120.9 μs   (96.27 μs .. 156.7 μs)

benchmarking readFile/with stream (reading)
time                 33.17 ms   (33.05 ms .. 33.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 33.11 ms   (32.93 ms .. 33.25 ms)
std dev              343.3 μs   (226.4 μs .. 545.2 μs)

benchmarking writeFile/stream
time                 88.02 ms   (87.62 ms .. 88.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.32 ms   (88.15 ms .. 88.49 ms)
std dev              290.4 μs   (181.4 μs .. 424.5 μs)
```

Add minor API changes for figuring out the index.
@jappeace
Contributor Author

jappeace commented Dec 8, 2021

These are the results of the benchmark; the streaming parser is surprisingly competitive with the existing variant:

```
benchmarking readFile/with xlsx
time                 137.7 ms   (133.7 ms .. 140.8 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 136.5 ms   (134.1 ms .. 138.8 ms)
std dev              3.580 ms   (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 28.18 ms   (28.00 ms .. 28.48 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 28.33 ms   (28.10 ms .. 29.02 ms)
std dev              796.8 μs   (198.1 μs .. 1.473 ms)

benchmarking readFile/with stream (counting)
time                 13.57 ms   (13.49 ms .. 13.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 ms   (13.52 ms .. 13.61 ms)
std dev              120.9 μs   (96.27 μs .. 156.7 μs)

benchmarking readFile/with stream (reading)
time                 33.17 ms   (33.05 ms .. 33.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 33.11 ms   (32.93 ms .. 33.25 ms)
std dev              343.3 μs   (226.4 μs .. 545.2 μs)

benchmarking writeFile/stream
time                 88.02 ms   (87.62 ms .. 88.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.32 ms   (88.15 ms .. 88.49 ms)
std dev              290.4 μs   (181.4 μs .. 424.5 μs)
```

Counting skips the parsing step and just counts the rows in an Excel file; reading does a full parse of each row.
Keep in mind the goal of streaming isn't speed but constant memory.
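The report format above is criterion's. A minimal harness of the same shape might look like the sketch below; only the two existing eager parsers are shown, the fixture path is made up, and forcing just the sheet count keeps the example free of extra NFData assumptions, so its numbers would not be comparable to those above.

```haskell
-- Sketch of a criterion harness in the shape that produces reports like the
-- above.  toXlsx and toXlsxFast are the library's existing eager parsers; the
-- streaming cases from this PR would be added as further bench entries.
module Main (main) where

import           Codec.Xlsx (toXlsx, toXlsxFast, _xlSheets)
import           Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.ByteString.Lazy as LBS

main :: IO ()
main = do
  bs <- LBS.readFile "data/testInput.xlsx"  -- hypothetical fixture
  defaultMain
    [ bgroup "readFile"
        [ bench "with xlsx"      $ nf (length . _xlSheets . toXlsx) bs
        , bench "with xlsx fast" $ nf (length . _xlSheets . toXlsxFast) bs
        ]
    ]
```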

@jappeace jappeace requested a review from qrilka December 8, 2021 16:11
@qrilka
Owner

qrilka commented Dec 8, 2021

@jappeace could you add non-streamed writing for comparison?

results:

```
benchmarking readFile/with xlsx
time                 130.6 ms   (127.9 ms .. 133.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 131.4 ms   (129.7 ms .. 133.3 ms)
std dev              2.832 ms   (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 27.49 ms   (27.29 ms .. 27.72 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.16 ms   (27.78 ms .. 28.97 ms)
std dev              1.159 ms   (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)

benchmarking readFile/with stream (counting)
time                 13.29 ms   (13.23 ms .. 13.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.32 ms   (13.28 ms .. 13.39 ms)
std dev              124.6 μs   (81.60 μs .. 214.2 μs)

benchmarking readFile/with stream (reading)
time                 32.86 ms   (32.70 ms .. 32.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.83 ms   (32.59 ms .. 32.97 ms)
std dev              373.9 μs   (155.5 μs .. 655.5 μs)

benchmarking writeFile/with xlsx
time                 83.07 ms   (82.81 ms .. 83.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 82.68 ms   (82.33 ms .. 82.85 ms)
std dev              415.2 μs   (170.9 μs .. 677.9 μs)

benchmarking writeFile/with stream (no sst)
time                 88.15 ms   (87.88 ms .. 88.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.00 ms   (87.83 ms .. 88.12 ms)
std dev              248.2 μs   (176.6 μs .. 321.5 μs)

benchmarking writeFile/with stream (sst)
time                 89.90 ms   (89.71 ms .. 90.11 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 89.95 ms   (89.85 ms .. 90.05 ms)
std dev              168.1 μs   (132.2 μs .. 223.6 μs)
```
@jappeace
Contributor Author

jappeace commented Dec 8, 2021

Benchmarks for the existing writing function; I also added the streaming variant that creates a shared strings table:

```
benchmarking readFile/with xlsx
time                 130.6 ms   (127.9 ms .. 133.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 131.4 ms   (129.7 ms .. 133.3 ms)
std dev              2.832 ms   (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 27.49 ms   (27.29 ms .. 27.72 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.16 ms   (27.78 ms .. 28.97 ms)
std dev              1.159 ms   (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)

benchmarking readFile/with stream (counting)
time                 13.29 ms   (13.23 ms .. 13.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.32 ms   (13.28 ms .. 13.39 ms)
std dev              124.6 μs   (81.60 μs .. 214.2 μs)

benchmarking readFile/with stream (reading)
time                 32.86 ms   (32.70 ms .. 32.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.83 ms   (32.59 ms .. 32.97 ms)
std dev              373.9 μs   (155.5 μs .. 655.5 μs)

benchmarking writeFile/with xlsx
time                 83.07 ms   (82.81 ms .. 83.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 82.68 ms   (82.33 ms .. 82.85 ms)
std dev              415.2 μs   (170.9 μs .. 677.9 μs)

benchmarking writeFile/with stream (no sst)
time                 88.15 ms   (87.88 ms .. 88.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.00 ms   (87.83 ms .. 88.12 ms)
std dev              248.2 μs   (176.6 μs .. 321.5 μs)

benchmarking writeFile/with stream (sst)
time                 89.90 ms   (89.71 ms .. 90.11 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 89.95 ms   (89.85 ms .. 90.05 ms)
std dev              168.1 μs   (132.2 μs .. 223.6 μs)
```

In the future we may add a faster variant for writing.
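For context, the "sst" variant above refers to the shared strings table, the part of an xlsx archive where each distinct string is stored once and cells reference it by index. A minimal sketch of the interning step such a writer performs, with made-up names rather than the PR's actual implementation:

```haskell
import           Data.Text (Text)
import qualified Data.Map.Strict as Map

-- The table maps each distinct string to the index cells will reference,
-- plus the next free index.
type SharedStrings = (Map.Map Text Int, Int)

emptySst :: SharedStrings
emptySst = (Map.empty, 0)

-- Intern one cell's text: reuse the existing index if the string was seen
-- before, otherwise allocate a new one.
intern :: Text -> SharedStrings -> (Int, SharedStrings)
intern t sst@(m, next) =
  case Map.lookup t m of
    Just i  -> (i, sst)
    Nothing -> (next, (Map.insert t next m, next + 1))
```

Maintaining that table while streaming presumably accounts for the small gap between the "no sst" and "sst" rows above.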

Owner

@qrilka qrilka left a comment

That looks good, just some minor comments.

Based on the review comment of qrilka I added a memoized module,
which cleans up reading functions such as getWorkbookRelationships
quite a bit.
This is a little slower because the zip archive gets read multiple
times, but this seemed to have no impact on the benchmarks.

Furthermore, I changed runExpat to be in IO.

The only external change is that I attached context to workbook
errors, but these are exceptions that weren't exposed in the first
place.
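The memoization idea can be sketched generically; the following is only an illustration of caching an expensive IO read (say, a parsed zip entry), not the module added in this PR:

```haskell
import Control.Concurrent.MVar (modifyMVar, newMVar)

-- Wrap an IO action so its result is computed on first use and reused
-- afterwards; the MVar also serializes concurrent first calls.
memoizeIO :: IO a -> IO (IO a)
memoizeIO action = do
  cache <- newMVar Nothing
  pure $ modifyMVar cache $ \cached ->
    case cached of
      Just x  -> pure (Just x, x)
      Nothing -> do
        x <- action
        pure (Just x, x)
```

Usage would look like `cached <- memoizeIO expensiveRead` followed by calling `cached` wherever the value is needed.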

Add TODO item for bad reading of zip
... some people have strong opinions about this.
I copy-pasted this but forgot to delete the no longer
relevant words.
@jappeace jappeace requested a review from qrilka December 19, 2021 10:55
Owner

@qrilka qrilka left a comment

With the typo fixed this looks ready to be merged.

@qrilka qrilka merged commit 5a7dc61 into qrilka:master Dec 19, 2021
@qrilka
Owner

qrilka commented Dec 19, 2021

Great work @jappeace!
This change looks good enough to release version 1.0 of the library, but I'd also like to have ocramz/xeno#53 so it could support GHC 9.2 as well.
Unfortunately I haven't had time yet to look into that problem.

@jappeace
Contributor Author

I'll see if I can fix it on Wednesday or over the holidays.
