-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow2 merge avro #17
Conversation
…d XXX but found YYY" (apache#1448) * Fix bug in projection exec * Only extract field metadata for direct field access * clippy
* Support identifiers with `.` in them * simplify * fix: clippy * fix: more clippy * Add test for "...."
…ion (apache#1430) * Fixes for working with functions in dataframes * clippy fix * Update datafusion/src/logical_plan/expr.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/src/logical_plan/expr.rs Co-authored-by: Andrew Lamb <[email protected]> * remove broken regex test * remove uneeded comments * fix: cargo fmt Co-authored-by: Andrew Lamb <[email protected]>
…che#1408) * support sum/avg agg for decimal * support sum/avg agg for decimal * suppor the avg and add test * add comments and const
* Turn off default features of ahash as they're not needed The default feature set of the ahash crate is ["std"][1]. The std feature enables features which require the standard library, namely `AHashMap` and `AHashSet`. DataFusion currently only uses `AHasher`, `CallHasher`, and `RandomState`, none of which require the standard library. This gives more control to projects depending on datafusion to minimize the amount of code they depend on. [1]: https://github.com/tkaitchuck/aHash/blob/e77cab8c1e15bfc9f54dfd28bd8820c2a7bb27c4/Cargo.toml#L24-L25 * Turn off default features of chrono as they're not needed In fact, the "oldtime" feature is [considered deprecated][1] and only included by default for backwards compatibility. The other features don't appear to be used by datafusion or ballista, so this gives projects depending on these crates more flexibility in what they choose to include. [1]: https://github.com/chronotope/chrono/blame/f6bd567bb677262645c1fc3131c8c1071cd77ec3/README.md#L88-L94
Thanks to the work of @rdettai @xudong963 and others, we are making great progress here Also added apache#122 as @liukun4515 is actively working on it cc @hntd187
* fix calculate in many_to_many_hash_partition test. * fix Clippy Lints
…#1291) * Left join could use bitmap for left join instead of Vec<bool> * fix * Fix * Finish implementation * Update hash_join.rs
… Datafusion's time types (apache#1455) * point to UL repos * Make ScalarValue::TimestampNanosecond moderately timezone aware * cargo fmt * fix ballista build * fix ballista tests * ScalarValue is only 64b on aarch64; it is still 48 on amd64 * remove debugging code * add tests for timestamp coercion * minmax test on mixed ts types, allow creation of timestamp tables with a timezone, fix a missed case in the binary ops applied to timestamp types with timezones
apache#1466) * Pass local address host so we do not get mismatch between IPv4 and IPv6 addresses
* WIP: Significant test refactoring first-pass * WIP: Significant Test refactoring first pass * Add license header, and add 1 missing test * Fix clippy warnings
* add nested struct support to python implements nested structs `col("a")['b']` * add test for indexed fields
* Fix sort on aggregate * Use ExprRewriter. * For review comment * Update datafusion/src/logical_plan/expr.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/src/logical_plan/expr.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/src/logical_plan/expr.rs Co-authored-by: Andrew Lamb <[email protected]> * Fix format. Co-authored-by: Andrew Lamb <[email protected]>
* Fix single_distinct_to_groupby for arbitrary expressions * Fix fmt Co-authored-by: James Katz <[email protected]>
…mplifier` (apache#1401) * Combine simplify and Simplifier * Make nullable more functional
* Initial implementation of variance * get simple f64 type tests working * add math functions to ScalarValue, some tests * add to expressions and tests * add more tests * add test for ScalarValue add * add tests for scalar arithmetic * add test, finish variance * fix warnings * add more sql tests * add stddev and tests * add the hooks and expression * add more tests * fix lint and clipy * address comments and fix test errors * address comments * add population and sample for variance and stddev * address more comments * fmt * add test for less than 2 values * fix inconsistency in the merge logic * fix lint and clipy
I haven't made all tests pass so this is a draft |
77a8950
to
0c524da
Compare
Got this nicety when running miri on nightly :
|
0c524da
to
257a7c5
Compare
Thank you @Igosuki ! I will take a close look at it tonight and merge it. Then 3 of us can collaborate on fixing the failing tests through follow up PRs.
I believe SIMD requiring nightly is expected? |
@houqp Yeah I mentioned it on another PR and got the answer, I didn't think it was required |
Thanks a lot for this! pass The pyo3 was not added because |
@jorgecarleitao thanks for clarifying, I hope https://github.com/houqp/arrow-datafusion/blob/arrow2_merge/datafusion/src/pyarrow.rs is a good enough replacement, I used the code in your integration tests. |
@jorgecarleitao i noticed the IPC stream writer's write API is not consistent with what we have for the file writer: https://github.com/jorgecarleitao/arrow2/blob/a886e35748b2813e6c5a122cf9603a3e59f9f80e/src/io/ipc/write/stream.rs#L58. Is this expected? It takes a slice instead of option for the ipc field argument. |
It is an API bug. It should be |
Rationale for this change
Merge latest of arrow-datafusion, with latest arrow2 that still has RecordBatch
What changes are included in this PR?
Support for latest datafusion changes : timezone, decimal in more places, stddev, variance
Dictionary has a third value which is whether the dict is sorted
Pyo3 integration changed because arrow2 uses ffi directly
Suddenly serializing ipc requires ipc fields, I put empty ipc fields where I couldn't get the schema so this may require a fix
Avro support fully migrated to arrow
simd integration is not available on stable, this requires a PR fixed on arrow2
arrow2 DataType and ScalarValue don't implement display
Some code, notably in tests could be further refactored due to better ergonomics in arrow2.
Are there any user-facing changes?
As required, no API changes externally for datafusion, however