
Add streaming of xlsx file support #144

Merged
merged 165 commits on Dec 19, 2021

Conversation

jappeace
Contributor

This goes a long way toward implementing #132.

It adds both a parser and a writer module for streaming xlsx files, although, in line with the library in general, it offers "only basic functionality at the moment".

The parser doesn't use conduit because we wanted to make it as fast as xlsx2csv, which we did (credit goes to Tim). The writer is a lot slower, and in the future we may need to change its API as well to speed it up. However, both are functional now and in production at Supercede.
This doesn't mean we're set on this particular approach, and we welcome feedback.
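To illustrate the row-at-a-time shape this enables, here is a minimal sketch of constant-memory row processing; every name in it is a placeholder for illustration, not part of the new modules' actual API:

```haskell
-- Sketch only: a streaming parser in this style hands the caller a fold over
-- rows, so the accumulator is the only state held no matter how large the
-- sheet is.  Row, foldRows and countRowsWithCol1 are made-up names.
import           Data.Text (Text)
import qualified Data.Map.Strict as Map

type Row = Map.Map Int Text  -- column index -> cell text

countRowsWithCol1 :: Monad m
                  => ((Int -> Row -> m Int) -> Int -> m Int)  -- foldRows, supplied by the parser
                  -> m Int
countRowsWithCol1 foldRows = foldRows step 0
  where
    step n row
      | Map.member 1 row = pure $! n + 1
      | otherwise        = pure n
```

A streaming writer works the same way in reverse: rows are supplied one at a time and serialized as they arrive, instead of building the whole workbook in memory first.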

I'm sorry for doing this as a massive code dump, and I don't expect it to be accepted in a timely manner. For us, however, it had to be done in a timely manner, which is why we decided to work from a temporary fork.
I hope these changes are still welcome upstream, even though it's a lot in one go.


```
testXlsx :: Xlsx
testXlsx = Xlsx sheets minimalStyles definedNames customProperties DateBase1904
```
Contributor Author

This was moved to TestXlsx so StreamTest can depend on it.

@jappeace jappeace marked this pull request as draft August 18, 2021 12:38
@jappeace
Contributor Author

I've broken the microlens build because I use makePrisms. I'll define them in place instead.

@jappeace jappeace marked this pull request as ready for review August 18, 2021 16:39
@qrilka
Owner

qrilka commented Aug 18, 2021

@jappeace thanks for contributing this. I hope to find a window of time this weekend to read your PR.
I have one question though: you talk about the solution being "fast", but what about adding at least some basic benchmarks to demonstrate that?

@jappeace
Contributor Author

Yes, parsing is as fast as the best substitute for streaming, which is the xlsx2csv program. Compared to the existing code it's slower, but it can now handle Excel files with a million rows (which the existing implementation couldn't).

For a benchmark, do you mean adding it to benchmarks/Main and then comparing against the xlsx2csv program, or against the existing implementation? Or both?

@qrilka
Owner

qrilka commented Aug 19, 2021

Just choose whichever you find more appropriate.

Owner

@qrilka qrilka left a comment

GitHub doesn't allow comments on binary files; would you mind removing the unnecessary files .DS_Store and data/simple.xlsx?
See my other comments; most of them are minor, but the Writer API discrepancy seems to be an important detail that needs to be fixed.
And thanks for this great contribution.

@jappeace
Contributor Author

jappeace commented Sep 1, 2021

Sorry, I haven't addressed all the comments yet, nor added the benchmark. I'll see if I can get another window soon.

@qrilka
Owner

qrilka commented Sep 1, 2021

No problem, there seems to be no particular rush.

Add benchmark support.
The changes remove figuring out the index from
the functions I want to benchmark.
Also add a writer benchmark for streaming, results:

```
benchmarking readFile/with xlsx
time                 137.7 ms   (133.7 ms .. 140.8 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 136.5 ms   (134.1 ms .. 138.8 ms)
std dev              3.580 ms   (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 28.18 ms   (28.00 ms .. 28.48 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 28.33 ms   (28.10 ms .. 29.02 ms)
std dev              796.8 μs   (198.1 μs .. 1.473 ms)

benchmarking readFile/with stream (counting)
time                 13.57 ms   (13.49 ms .. 13.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 ms   (13.52 ms .. 13.61 ms)
std dev              120.9 μs   (96.27 μs .. 156.7 μs)

benchmarking readFile/with stream (reading)
time                 33.17 ms   (33.05 ms .. 33.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 33.11 ms   (32.93 ms .. 33.25 ms)
std dev              343.3 μs   (226.4 μs .. 545.2 μs)

benchmarking writeFile/stream
time                 88.02 ms   (87.62 ms .. 88.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.32 ms   (88.15 ms .. 88.49 ms)
std dev              290.4 μs   (181.4 μs .. 424.5 μs)
```

Add minor API changes for figuring out the index.
@jappeace
Contributor Author

jappeace commented Dec 8, 2021

These are the results of the benchmark; the streaming parser is surprisingly competitive with the existing variant:

```
benchmarking readFile/with xlsx
time                 137.7 ms   (133.7 ms .. 140.8 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 136.5 ms   (134.1 ms .. 138.8 ms)
std dev              3.580 ms   (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 28.18 ms   (28.00 ms .. 28.48 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 28.33 ms   (28.10 ms .. 29.02 ms)
std dev              796.8 μs   (198.1 μs .. 1.473 ms)

benchmarking readFile/with stream (counting)
time                 13.57 ms   (13.49 ms .. 13.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 ms   (13.52 ms .. 13.61 ms)
std dev              120.9 μs   (96.27 μs .. 156.7 μs)

benchmarking readFile/with stream (reading)
time                 33.17 ms   (33.05 ms .. 33.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 33.11 ms   (32.93 ms .. 33.25 ms)
std dev              343.3 μs   (226.4 μs .. 545.2 μs)

benchmarking writeFile/stream
time                 88.02 ms   (87.62 ms .. 88.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.32 ms   (88.15 ms .. 88.49 ms)
std dev              290.4 μs   (181.4 μs .. 424.5 μs)
```

Counting skips the parsing step and just counts the rows in an Excel file; reading does a full parse of each row.
Keep in mind the goal of streaming isn't speed but constant memory.
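The report format above is criterion's. A minimal harness of the same shape might look like the sketch below; only the two existing eager parsers are shown, the fixture path is made up, and forcing just the sheet count keeps the example free of extra NFData assumptions, so its numbers would not be comparable to those above.

```haskell
-- Sketch of a criterion harness in the shape that produces reports like the
-- above.  toXlsx and toXlsxFast are the library's existing eager parsers; the
-- streaming cases from this PR would be added as further bench entries.
module Main (main) where

import           Codec.Xlsx (toXlsx, toXlsxFast, _xlSheets)
import           Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.ByteString.Lazy as LBS

main :: IO ()
main = do
  bs <- LBS.readFile "data/testInput.xlsx"  -- hypothetical fixture
  defaultMain
    [ bgroup "readFile"
        [ bench "with xlsx"      $ nf (length . _xlSheets . toXlsx) bs
        , bench "with xlsx fast" $ nf (length . _xlSheets . toXlsxFast) bs
        ]
    ]
```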

@jappeace jappeace requested a review from qrilka December 8, 2021 16:11
@qrilka
Owner

qrilka commented Dec 8, 2021

@jappeace could you add non-streamed writing for comparison?

results:

```
benchmarking readFile/with xlsx
time                 130.6 ms   (127.9 ms .. 133.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 131.4 ms   (129.7 ms .. 133.3 ms)
std dev              2.832 ms   (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 27.49 ms   (27.29 ms .. 27.72 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.16 ms   (27.78 ms .. 28.97 ms)
std dev              1.159 ms   (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)

benchmarking readFile/with stream (counting)
time                 13.29 ms   (13.23 ms .. 13.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.32 ms   (13.28 ms .. 13.39 ms)
std dev              124.6 μs   (81.60 μs .. 214.2 μs)

benchmarking readFile/with stream (reading)
time                 32.86 ms   (32.70 ms .. 32.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.83 ms   (32.59 ms .. 32.97 ms)
std dev              373.9 μs   (155.5 μs .. 655.5 μs)

benchmarking writeFile/with xlsx
time                 83.07 ms   (82.81 ms .. 83.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 82.68 ms   (82.33 ms .. 82.85 ms)
std dev              415.2 μs   (170.9 μs .. 677.9 μs)

benchmarking writeFile/with stream (no sst)
time                 88.15 ms   (87.88 ms .. 88.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.00 ms   (87.83 ms .. 88.12 ms)
std dev              248.2 μs   (176.6 μs .. 321.5 μs)

benchmarking writeFile/with stream (sst)
time                 89.90 ms   (89.71 ms .. 90.11 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 89.95 ms   (89.85 ms .. 90.05 ms)
std dev              168.1 μs   (132.2 μs .. 223.6 μs)
```
@jappeace
Contributor Author

jappeace commented Dec 8, 2021

Benchmarks for the existing writing function; I also added the streaming variant that creates a shared strings table:

```
benchmarking readFile/with xlsx
time                 130.6 ms   (127.9 ms .. 133.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 131.4 ms   (129.7 ms .. 133.3 ms)
std dev              2.832 ms   (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 27.49 ms   (27.29 ms .. 27.72 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.16 ms   (27.78 ms .. 28.97 ms)
std dev              1.159 ms   (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)

benchmarking readFile/with stream (counting)
time                 13.29 ms   (13.23 ms .. 13.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.32 ms   (13.28 ms .. 13.39 ms)
std dev              124.6 μs   (81.60 μs .. 214.2 μs)

benchmarking readFile/with stream (reading)
time                 32.86 ms   (32.70 ms .. 32.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.83 ms   (32.59 ms .. 32.97 ms)
std dev              373.9 μs   (155.5 μs .. 655.5 μs)

benchmarking writeFile/with xlsx
time                 83.07 ms   (82.81 ms .. 83.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 82.68 ms   (82.33 ms .. 82.85 ms)
std dev              415.2 μs   (170.9 μs .. 677.9 μs)

benchmarking writeFile/with stream (no sst)
time                 88.15 ms   (87.88 ms .. 88.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.00 ms   (87.83 ms .. 88.12 ms)
std dev              248.2 μs   (176.6 μs .. 321.5 μs)

benchmarking writeFile/with stream (sst)
time                 89.90 ms   (89.71 ms .. 90.11 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 89.95 ms   (89.85 ms .. 90.05 ms)
std dev              168.1 μs   (132.2 μs .. 223.6 μs)
```

In the future we may add a faster variant for writing.
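For context, the "sst" variant above refers to the shared strings table, the part of an xlsx archive where each distinct string is stored once and cells reference it by index. A minimal sketch of the interning step such a writer performs, with made-up names rather than the PR's actual implementation:

```haskell
import           Data.Text (Text)
import qualified Data.Map.Strict as Map

-- The table maps each distinct string to the index cells will reference,
-- plus the next free index.
type SharedStrings = (Map.Map Text Int, Int)

emptySst :: SharedStrings
emptySst = (Map.empty, 0)

-- Intern one cell's text: reuse the existing index if the string was seen
-- before, otherwise allocate a new one.
intern :: Text -> SharedStrings -> (Int, SharedStrings)
intern t sst@(m, next) =
  case Map.lookup t m of
    Just i  -> (i, sst)
    Nothing -> (next, (Map.insert t next m, next + 1))
```

Maintaining that table while streaming presumably accounts for the small gap between the "no sst" and "sst" rows above.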

Owner

@qrilka qrilka left a comment

That looks good, just some minor comments.

Based on the review comment of qrilka I added a memoized module,
which cleans up reading functions such as getWorkbookRelationships
quite a bit.
This is a little slower because the zip archive gets read multiple
times, but this seemed to have no impact on the benchmarks.

Furthermore, I changed runExpat to be in IO.

The only external change is that I attached context to workbook
errors, but these are exceptions that weren't exposed in the first
place.
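The memoization idea can be sketched generically; the following is only an illustration of caching an expensive IO read (say, a parsed zip entry), not the module added in this PR:

```haskell
import Control.Concurrent.MVar (modifyMVar, newMVar)

-- Wrap an IO action so its result is computed on first use and reused
-- afterwards; the MVar also serializes concurrent first calls.
memoizeIO :: IO a -> IO (IO a)
memoizeIO action = do
  cache <- newMVar Nothing
  pure $ modifyMVar cache $ \cached ->
    case cached of
      Just x  -> pure (Just x, x)
      Nothing -> do
        x <- action
        pure (Just x, x)
```

Usage would look like `cached <- memoizeIO expensiveRead` followed by calling `cached` wherever the value is needed.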

Add TODO item for bad reading of zip
... some people have strong opinions about this.
I copy-pasted this but forgot to delete the no longer
relevant words.
@jappeace jappeace requested a review from qrilka December 19, 2021 10:55
Owner

@qrilka qrilka left a comment

With the typo fixed this looks ready to be merged.

@qrilka qrilka merged commit 5a7dc61 into qrilka:master Dec 19, 2021
@qrilka
Owner

qrilka commented Dec 19, 2021

Great work @jappeace!
This change looks good enough to release version 1.0 of the library, but I'd also like to have ocramz/xeno#53 so it could support GHC 9.2 as well.
Unfortunately I haven't had time yet to look into that problem.

@jappeace
Contributor Author

I'll see if I can fix it on Wednesday or over the holidays.
