Add streaming of xlsx file support #144
Conversation
testXlsx :: Xlsx
testXlsx = Xlsx sheets minimalStyles definedNames customProperties DateBase1904
This was moved to TestXlsx so StreamTest can depend on it.
I've broken the microlens build because I use makePrisms. I'll define them in place instead.
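(For illustration, a minimal, self-contained sketch of the "define them in place" workaround: microlens has no Prism type, and makePrisms comes from lens's Template Haskell, but a hand-written Traversal' covers the common uses. The CellValue type below is an illustrative stand-in, not the PR's actual definition.)

```haskell
module PrismSketch where

import Lens.Micro (Traversal')

-- Illustrative stand-in sum type; the real one in the PR differs.
data CellValue
  = CellText   String
  | CellDouble Double
  deriving (Show, Eq)

-- With the lens package one would generate this via: makePrisms ''CellValue
-- A microlens-compatible substitute is a hand-written Traversal':
_CellDouble :: Traversal' CellValue Double
_CellDouble f (CellDouble d) = CellDouble <$> f d
_CellDouble _ other          = pure other
```

Viewing still reads the same as with a generated prism, e.g. `CellDouble 4.2 ^? _CellDouble`.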
@jappeace thanks for contributing this. I hope to find a window of time this weekend to read your PR.
Yes, parsing is as fast as the best substitute for streaming, which is the xlsx2csv program. Compared to the existing code it's slower, but it can now handle million-row Excel files (which the existing implementation couldn't). For a benchmark, do you mean to add it to …?
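(As an aside, the reason streaming copes with million-row files is that the consumer only ever holds one row at a time. A self-contained sketch of the idea, using stand-in types rather than the PR's actual API:)

```haskell
module StreamSketch where

import Data.IORef

-- Stand-in types for illustration; not the PR's actual API.
type Row = [String]
type RowSource = (Row -> IO ()) -> IO ()  -- feeds rows to a callback

-- Counting rows needs only a single accumulator, never the whole
-- sheet, so memory use stays constant regardless of file size.
countRows :: RowSource -> IO Int
countRows source = do
  counter <- newIORef (0 :: Int)
  source (\_row -> modifyIORef' counter (+ 1))
  readIORef counter
```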
Just choose whatever you see as more appropriate.
GitHub doesn't allow comments on binary files; would you mind removing the unnecessary files .DS_Store and data/simple.xlsx?
See my other comments; most of them are minor, but the Writer API discrepancy seems to be an important detail that needs to be fixed.
And thanks for this great contribution.
Sorry, I haven't addressed all the comments yet, nor done the benchmark; I'll see if I can get another window soon.
No problem, there seems to be no particular rush.
This allows us to add stream support per row, which should be good enough.
too much work, let's keep original design
This reverts commit 13fe548.
Better grammar
Add benchmark support. The changes remove figuring out the index from the functions I want to benchmark. Also add a writer benchmark for streaming. Results:

```
benchmarking readFile/with xlsx
time                 137.7 ms   (133.7 ms .. 140.8 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 136.5 ms   (134.1 ms .. 138.8 ms)
std dev              3.580 ms   (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 28.18 ms   (28.00 ms .. 28.48 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 28.33 ms   (28.10 ms .. 29.02 ms)
std dev              796.8 μs   (198.1 μs .. 1.473 ms)

benchmarking readFile/with stream (counting)
time                 13.57 ms   (13.49 ms .. 13.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 ms   (13.52 ms .. 13.61 ms)
std dev              120.9 μs   (96.27 μs .. 156.7 μs)

benchmarking readFile/with stream (reading)
time                 33.17 ms   (33.05 ms .. 33.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 33.11 ms   (32.93 ms .. 33.25 ms)
std dev              343.3 μs   (226.4 μs .. 545.2 μs)

benchmarking writeFile/stream
time                 88.02 ms   (87.62 ms .. 88.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.32 ms   (88.15 ms .. 88.49 ms)
std dev              290.4 μs   (181.4 μs .. 424.5 μs)
```
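(The output above is criterion's standard format. For orientation, a minimal sketch of the kind of harness that produces it; the input path is hypothetical and only the non-streamed baseline is shown, using the library's toXlsx entry point:)

```haskell
module Main (main) where

import Criterion.Main (bench, bgroup, defaultMain, whnfIO)
import qualified Data.ByteString.Lazy as LB
import Codec.Xlsx (toXlsx)

main :: IO ()
main = defaultMain
  [ bgroup "readFile"
      [ -- Non-streamed baseline: parse the entire workbook at once.
        bench "with xlsx" $ whnfIO (toXlsx <$> LB.readFile input)
      ]
  ]
  where
    input = "data/testInput.xlsx"  -- hypothetical benchmark input
```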
…marking Add minor API changes for figuring out the index.
These are the results of the benchmark; the streaming parse solution is surprisingly competitive with the existing variant:
Counting skips the parsing step and just counts the rows in an Excel file.
@jappeace could you add non-streamed writing for comparison?
Benchmarks for the existing writing function; I also added the streaming variant that creates a shared strings table. Results:

```
benchmarking readFile/with xlsx
time                 130.6 ms   (127.9 ms .. 133.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 131.4 ms   (129.7 ms .. 133.3 ms)
std dev              2.832 ms   (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 27.49 ms   (27.29 ms .. 27.72 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.16 ms   (27.78 ms .. 28.97 ms)
std dev              1.159 ms   (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)

benchmarking readFile/with stream (counting)
time                 13.29 ms   (13.23 ms .. 13.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.32 ms   (13.28 ms .. 13.39 ms)
std dev              124.6 μs   (81.60 μs .. 214.2 μs)

benchmarking readFile/with stream (reading)
time                 32.86 ms   (32.70 ms .. 32.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.83 ms   (32.59 ms .. 32.97 ms)
std dev              373.9 μs   (155.5 μs .. 655.5 μs)

benchmarking writeFile/with xlsx
time                 83.07 ms   (82.81 ms .. 83.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 82.68 ms   (82.33 ms .. 82.85 ms)
std dev              415.2 μs   (170.9 μs .. 677.9 μs)

benchmarking writeFile/with stream (no sst)
time                 88.15 ms   (87.88 ms .. 88.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.00 ms   (87.83 ms .. 88.12 ms)
std dev              248.2 μs   (176.6 μs .. 321.5 μs)

benchmarking writeFile/with stream (sst)
time                 89.90 ms   (89.71 ms .. 90.11 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 89.95 ms   (89.85 ms .. 90.05 ms)
std dev              168.1 μs   (132.2 μs .. 223.6 μs)
```

In the future we may have a faster variant for writing.
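(Background on the sst variants above: a shared strings table stores each distinct cell text once and has cells refer to it by index. A self-contained sketch of the deduplication idea, not the PR's implementation:)

```haskell
module SstSketch where

import qualified Data.Map.Strict as M
import           Data.Text (Text)

-- Assign each distinct text an index in first-seen order and map
-- every cell text to its index, as a writer emitting a shared
-- strings table would.
buildSst :: [Text] -> (M.Map Text Int, [Int])
buildSst = go M.empty []
  where
    go table out []     = (table, reverse out)
    go table out (t:ts) =
      case M.lookup t table of
        Just i  -> go table (i : out) ts
        Nothing ->
          let i = M.size table
          in  go (M.insert t i table) (i : out) ts
```

The extra Map lookups would be consistent with the sst run above coming in slightly slower than the no-sst one.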
That looks good, just some minor comments.
Based on the review comment of qrilka I added a memoize module, which cleans up reading functions such as getWorkbookRelationships quite a bit. This is a little slower because the zip archive gets read multiple times, but it seemed to have no impact on the benchmarks. Furthermore, I changed runExpat to be in IO. The only external change is that I attached context to workbook errors, but these are exceptions that weren't exposed in the first place.
Add TODO item for bad reading of zip
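(A hedged sketch of what such a memoize module might look like; the PR's real module may differ. The idea is to compute an IO action's result once and cache it for later calls:)

```haskell
{-# LANGUAGE LambdaCase #-}
module MemoizeSketch (once) where

import Data.IORef

-- Wrap an IO action so its result is computed on first use and
-- served from a cache on every subsequent call.
once :: IO a -> IO (IO a)
once action = do
  cache <- newIORef Nothing
  pure $ readIORef cache >>= \case
    Just x  -> pure x
    Nothing -> do
      x <- action
      writeIORef cache (Just x)
      pure x
```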
Add memoize functionality
... some people have strong opinions about this.
I copy-pasted this but forgot to delete the no-longer-relevant words.
With the typo fixed this looks ready to be merged.
Great work @jappeace
I'll see if I can fix it Wednesday or during the holiday.
This goes a long way toward implementing #132.
It adds both a parser and a writer module for streaming xlsx files.
Although, in line with the library in general, they offer "only basic functionality at the moment".
The parser doesn't use conduit because we wanted to make it as fast as xlsx2csv, which we did (credit goes to Tim).
The writer is a lot slower, and in the future we may need to change its API as well to speed it up.
However, both are functional now and in production at Supercede.
This doesn't mean we're set on this particular approach, and we welcome feedback.
I'm sorry for doing this as a massive code dump, and I don't expect this to be accepted in a timely manner.
However, we had to get it done quickly, which is why we decided to work from a temporary fork.
I hope these changes are still welcome in the upstream module, even though it's a lot in one go.
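(For a sense of the writer's streaming shape, a self-contained sketch of push-style row writing with stand-in types; this illustrates the design, not the PR's actual writer API:)

```haskell
module WriterSketch where

-- Stand-in row type for illustration only.
type Row = [String]

-- Push-style writing: the producer hands the sink one row at a
-- time, so the full sheet never has to exist in memory at once.
streamRows :: (Row -> IO ()) -> Int -> IO ()
streamRows sink n = mapM_ (\i -> sink ["row " ++ show i]) [1 .. n]
```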