Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow2 merge avro #17

Merged
merged 40 commits into from
Jan 13, 2022
Merged

Arrow2 merge avro #17

merged 40 commits into from
Jan 13, 2022

Conversation

Igosuki
Copy link

@Igosuki Igosuki commented Jan 12, 2022

Rationale for this change

Merge latest of arrow-datafusion, with latest arrow2 that still has RecordBatch

What changes are included in this PR?

Support for latest datafusion changes : timezone, decimal in more places, stddev, variance
Dictionary has a third value which is whether the dict is sorted
Pyo3 integration changed because arrow2 uses ffi directly
Suddenly serializing ipc requires ipc fields, I put empty ipc fields where I couldn't get the schema so this may require a fix
Avro support fully migrated to arrow
simd integration is not available on stable, this requires a PR fixed on arrow2
arrow2 DataType and ScalarValue don't implement display

Some code, notably in tests could be further refactored due to better ergonomics in arrow2.

Are there any user-facing changes?

As required, no API changes externally for datafusion, however

alamb and others added 30 commits December 15, 2021 14:46
…d XXX but found YYY" (apache#1448)

* Fix bug in projection exec

* Only extract field metadata for direct field access

* clippy
* Support identifiers with `.` in them

* simplify

* fix: clippy

* fix: more clippy

* Add test for "...."
…ion (apache#1430)

* Fixes for working with functions in dataframes

* clippy fix

* Update datafusion/src/logical_plan/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/src/logical_plan/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* remove broken regex test

* remove uneeded comments

* fix: cargo fmt

Co-authored-by: Andrew Lamb <[email protected]>
…che#1408)

* support sum/avg agg for decimal

* support sum/avg agg for decimal

* suppor the avg and add test

* add comments and const
* Turn off default features of ahash as they're not needed

The default feature set of the ahash crate is ["std"][1]. The std
feature enables features which require the standard library, namely
`AHashMap` and `AHashSet`. DataFusion currently only uses `AHasher`,
`CallHasher`, and `RandomState`, none of which require the standard
library.

This gives more control to projects depending on datafusion to minimize
the amount of code they depend on.

[1]: https://github.com/tkaitchuck/aHash/blob/e77cab8c1e15bfc9f54dfd28bd8820c2a7bb27c4/Cargo.toml#L24-L25

* Turn off default features of chrono as they're not needed

In fact, the "oldtime" feature is [considered deprecated][1] and only
included by default for backwards compatibility.

The other features don't appear to be used by datafusion or
ballista, so this gives projects depending on these crates more
flexibility in what they choose to include.

[1]: https://github.com/chronotope/chrono/blame/f6bd567bb677262645c1fc3131c8c1071cd77ec3/README.md#L88-L94
Thanks to the work of @rdettai @xudong963  and others, we are making great progress here

Also added apache#122 as @liukun4515  is actively working on it

cc @hntd187
* fix calculate in many_to_many_hash_partition test.

* fix Clippy Lints
…#1291)

* Left join could use bitmap for left join instead of Vec<bool>

* fix

* Fix

* Finish implementation

* Update hash_join.rs
… Datafusion's time types (apache#1455)

* point to UL repos

* Make ScalarValue::TimestampNanosecond moderately timezone aware

* cargo fmt

* fix ballista build

* fix ballista tests

* ScalarValue is only 64b on aarch64; it is still 48 on amd64

* remove debugging code

* add tests for timestamp coercion

* minmax test on mixed ts types, allow creation of timestamp tables with a timezone, fix a missed case in the binary ops applied to timestamp types with timezones
apache#1466)

* Pass local address host so we do not get mismatch between IPv4 and IPv6 addresses
* WIP: Significant test refactoring first-pass

* WIP: Significant Test refactoring first pass

* Add license header, and add 1 missing test

* Fix clippy warnings
* add nested struct support to python

implements nested structs `col("a")['b']`

* add test for indexed fields
* Fix sort on aggregate

* Use ExprRewriter.

* For review comment

* Update datafusion/src/logical_plan/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/src/logical_plan/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/src/logical_plan/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Fix format.

Co-authored-by: Andrew Lamb <[email protected]>
* Fix single_distinct_to_groupby for arbitrary expressions

* Fix fmt

Co-authored-by: James Katz <[email protected]>
…mplifier` (apache#1401)

* Combine simplify and Simplifier

* Make nullable more functional
* Initial implementation of variance

* get simple f64 type tests working

* add math functions to ScalarValue, some tests

* add to expressions and tests

* add more tests

* add test for ScalarValue add

* add tests for scalar arithmetic

* add test, finish variance

* fix warnings

* add more sql tests

* add stddev and tests

* add the hooks and expression

* add more tests

* fix lint and clipy

* address comments and fix test errors

* address comments

* add population and sample for variance and stddev

* address more comments

* fmt

* add test for less than 2 values

* fix inconsistency in the merge logic

* fix lint and clipy
@Igosuki
Copy link
Author

Igosuki commented Jan 12, 2022

I haven't made all tests pass so this is a draft

@Igosuki
Copy link
Author

Igosuki commented Jan 12, 2022

Got this nicety when running miri on nightly :

test tests::ballista_round_trip::q1 ... error: unsupported operation: can't call foreign function: epoll_create1
    --> /home/geps/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.7.14/src/sys/unix/selector/epoll.rs:34:9
     |
34   |         syscall!(epoll_create1(flag)).map(|ep| Selector {
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't call foreign function: epoll_create1

cf rust-lang/miri#602

@houqp houqp requested a review from yjshen January 12, 2022 19:00
@houqp
Copy link
Owner

houqp commented Jan 12, 2022

Thank you @Igosuki ! I will take a close look at it tonight and merge it. Then 3 of us can collaborate on fixing the failing tests through follow up PRs.

simd integration is not available on stable, this requires a PR fixed on arrow2

I believe SIMD requiring nightly is expected?

@Igosuki
Copy link
Author

Igosuki commented Jan 12, 2022

@houqp Yeah I mentioned it on another PR and got the answer, I didn't think it was required

@houqp houqp merged commit 257a7c5 into houqp:arrow2_merge Jan 13, 2022
@houqp
Copy link
Owner

houqp commented Jan 13, 2022

@Igosuki I have pushed your changes into my arrow2_merge branch as is. Let's collaborate on this branch to fix the remaining tests. cc @yjshen

@jorgecarleitao
Copy link

jorgecarleitao commented Jan 13, 2022

Thanks a lot for this!

pass None to the IPC fields to cause arrow2 to derive defaults for them. They are only relevant for dictionary arrays. This is in preparation to automatically re-use dictionary arrays.

The pyo3 was not added because pyo3 on arrow2 causes all dependencies to require a specific version of it (polars, datafusion-python)

@Igosuki
Copy link
Author

Igosuki commented Jan 13, 2022

@jorgecarleitao thanks for clarifying, I hope https://github.com/houqp/arrow-datafusion/blob/arrow2_merge/datafusion/src/pyarrow.rs is a good enough replacement, I used the code in your integration tests.

@houqp
Copy link
Owner

houqp commented Jan 13, 2022

@Igosuki @yjshen I fixed all the decimal tests, so we are now down to 8 test failures, mostly file format related.

@houqp
Copy link
Owner

houqp commented Jan 14, 2022

@jorgecarleitao i noticed the IPC stream writer's write API is not consistent with what we have for the file writer: https://github.com/jorgecarleitao/arrow2/blob/a886e35748b2813e6c5a122cf9603a3e59f9f80e/src/io/ipc/write/stream.rs#L58. Is this expected? It takes a slice instead of option for the ipc field argument.

@jorgecarleitao
Copy link

@jorgecarleitao i noticed the IPC stream writer's write API is not consistent with what we have for the file writer: https://github.com/jorgecarleitao/arrow2/blob/a886e35748b2813e6c5a122cf9603a3e59f9f80e/src/io/ipc/write/stream.rs#L58. Is this expected? It takes a slice instead of option for the ipc field argument.

It is an API bug. It should be Option<&[]> like the other. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.