Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move CompressionZstrong.cpp into fb directory #36

Closed
wants to merge 105 commits into from

Conversation

pedroerp
Copy link
Contributor

Summary:
Moving it to a fb/ directory prevents shipIt from exposing it to open
source.

Differential Revision: D56523416

facebook-github-bot and others added 30 commits December 14, 2023 11:56
fbshipit-source-id: 7d9c3fbad922ea80c21c9f2982049c23f5f1a470
Summary: Pull Request resolved: facebookincubator/nimble#1

Reviewed By: xiaoxmeng

Differential Revision: D52203900

fbshipit-source-id: 5263b9754f80d4f06aa68c15a8a5b182729aef17
…oken

Summary:
We have found in the S385274 that there is some con-currency issue when multiple thread try to access or update the Storage access token (BAT/SAT). This change currently introduce a lock mechanism to address this issue.

Next steps:
We have also seen that currently we call WS token service multiple time and override the token instead we can add a way to ensure that token is only fetched once. But this might require some changes in the way we use koski. As it might add an additional step to resolve the token once.

Reviewed By: Magoja

Differential Revision: D52144115

fbshipit-source-id: e49331ab51d8544969a613f8beb7405e42da0e79
Summary:
Pull Request resolved: facebookincubator/nimble#2

Split the process wide memory manager creation and get.
Deprecate the default memory manager and memory pool APIs.

X-link: facebookincubator/velox#7168

Reviewed By: mbasmanova

Differential Revision: D50513171

Pulled By: xiaoxmeng

fbshipit-source-id: 0f147c1e3c971ea4f39271fe5609a2224971509d
Summary: With this change, it is now possible to use captured encoding layout file when using Alpha transform.

Reviewed By: Sullivan-Patrick

Differential Revision: D50957360

fbshipit-source-id: c4e82a23bad81231e426ef4a026ead318eb648d0
Summary:
With this change, we now use lazy stats instead of eager stats.
This affects both "ad-hoc" encoding selection, and "replay" encoding selection.

I measured the performance of this change for "ad-hoc" encoding selection, and it seems negligible.

Transforming a file with eager stats took 1m28s on average and the same was observed for lazy.

Performance of captured encodings transform shows a big perf improvement, at 1m08s on average (~20% improvement).

Reviewed By: Sullivan-Patrick

Differential Revision: D50943760

fbshipit-source-id: d602480d2df125724dfef9283146e650f2d00da8
Summary:
Adding the ability to dump a captured encoding layout file to Alpha Dump.

Output looks as follows:
```
Node Id      Parent Id    Node Type        Node Name          Encoding Layout
0            0            Row                                 Trivial{Uncompressed}
1            0            Scalar                              Trivial{Zstrong}
2            0            Scalar                              RLE{Uncompressed}[RunLengths:FixedBitWidth{Zstrong},RunValues:Trivial{Zstrong}]
3            0            Scalar                              Trivial{Zstrong}[Lengths:Constant{Uncompressed}]
4            0            Scalar                              Trivial{Zstrong}[Lengths:Constant{Uncompressed}]
5            0            Scalar                              RLE{Uncompressed}[RunLengths:Varint{Uncompressed},RunValues:FixedBitWidth{Uncompressed}]
6            0            Scalar                              FixedBitWidth{Zstrong}
7            0            Scalar                              Constant{Uncompressed}
8            0            Scalar                              RLE{Uncompressed}[RunLengths:FixedBitWidth{Zstrong},RunValues:Trivial{Zstrong}[Lengths:FixedBitWidth{Zstrong}]]
9            0            Scalar                              FixedBitWidth{Zstrong}
10           0            Map                                 FixedBitWidth{Zstrong}
11           10           Scalar                              Trivial{Zstrong}
12           10           Scalar                              Varint{Uncompressed}
13           0            FlatMap                             Constant{Uncompressed}
14           13           Scalar           1154997419         Trivial{Zstrong}
15           13           Scalar           1451287068         Trivial{Zstrong}
16           13           Scalar           1135016111         Trivial{Zstrong}
17           13           Scalar           1354897182         RLE{Uncompressed}[RunLengths:FixedBitWidth{Zstrong},RunValues:Dictionary{Uncompressed}[Alphabet:Trivial{Uncompressed},Indices:FixedBitWidth{Zstrong}]]
18           13           Scalar           1138012430         Trivial{Zstrong}
19           13           Scalar           2072355203         Trivial{Zstrong}
20           13           Scalar           1467446921         Trivial{Zstrong}
21           13           Scalar           1097049326         Trivial{Zstrong}
22           13           Scalar           1227099726         Trivial{Zstrong}
23           13           Scalar           1944408639         RLE{Uncompressed}[RunLengths:FixedBitWidth{Zstrong},RunValues:Trivial{Zstrong}]
24           13           Scalar           110159             Trivial{Zstrong}
25           13           Scalar           1410611079         RLE{Uncompressed}[RunLengths:FixedBitWidth{Zstrong},RunValues:Trivial{Zstrong}]
...
```

Reviewed By: Sullivan-Patrick

Differential Revision: D50957981

fbshipit-source-id: 2f4cf4b44acb24c064612dcec38cf1ba793b2a09
Summary:
Allowing training of streams in parallel.
Reduced the time of training from minutes to seconds (with current simple training).

With this, we can now implement more demanding trainings (like brute forcing).

Reviewed By: HuamengJiang

Differential Revision: D51076951

fbshipit-source-id: 96acdc932116b73b7960db72790ec8fad44c7f18
The internal and external repositories are out of sync. This Pull Request attempts to brings them back in sync by patching the GitHub repository. Please carefully review this patch. You must disable ShipIt for your project in order to merge this pull request. DO NOT IMPORT this pull request. Instead, merge it directly on GitHub using the MERGE BUTTON. Re-enable ShipIt after merging.
Summary: Pull Request resolved: facebookincubator/nimble#4

Reviewed By: HuamengJiang

Differential Revision: D52280685

Pulled By: pedroerp

fbshipit-source-id: 46e94f7669b8d629861693985e437b64a8d4ed24
Summary:
DWIO contbuild is broken, because the DPP Reader tool (in the Alpha tools folder) can no longer be built in opt mode...

Until we have a fix for that, I'm moving this tool outside of the dwio folder.

I'm using this opportunity to move few other tools as well. These tools have "external" (heavy) dependencies, like Koski, which make building the alpha directory slower.

Reviewed By: HuamengJiang

Differential Revision: D52304448

fbshipit-source-id: 85a4c86a2e36dff49b8f21a3eace47ee7d34c129
Summary:
Minor cleanups in Alpha:
1. Running glean to remove unused headers
3. Fixing lint warnings

Reviewed By: HuamengJiang

Differential Revision: D52304447

fbshipit-source-id: 34c130688453386f560b1d291eed1dc5cc826c3c
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Reviewed By: dmm-fb

Differential Revision: D52356232

fbshipit-source-id: c47aba63a8b4d2a728de7a03e91f542cfcb4182d
Reviewed By: fadimounir

Differential Revision: D52322091

fbshipit-source-id: db087755928bed065b9f336f255c3c5aa44ae64e
Summary: Unify Alpha and DWRF feature flattening utils.

Reviewed By: marxhxxx

Differential Revision: D52425910

fbshipit-source-id: 2b0e4bd0ebbb7caf5ad8290e7dc9b8b1a7b24c37
Summary: Add fbpkg definition such that we can use data copy in chronos jobs.

Reviewed By: marxhxxx

Differential Revision: D52509300

fbshipit-source-id: a79c555a432681f2782850307b3aa1d7cfa9a2c4
Summary: As title, adding a cache here that keeps track of the cache value and size.

Reviewed By: kparichay, HuamengJiang

Differential Revision: D50424176

fbshipit-source-id: 7645ebb67d00ec2ec36bdc7b600bedebd9e34292
Summary:
Adding unit tests for caching behavior.

Tests two write cases - writing one vector at once vs. writing two vectors, ensures that the two readers provide the same output result w.r.t. correctness + compression.

Reviewed By: HuamengJiang

Differential Revision: D51322567

fbshipit-source-id: ff75dfbb70e35dd9a9522604c2f9b22e3feafaa6
Summary:
These variable names are needlessy difficult to understand imo

(`offsets` and `offsets_`, `childRanges` intertwined with `filteredRanges`, `casted` for a casted `arrayVector`)

Changing them to better represent what they actually are

Reviewed By: kparichay

Differential Revision: D50847850

fbshipit-source-id: f00fe5a6a905776ed7c94ad031f6f7e2f00d9871
Summary:
algorithm header is where std::reverse is defined (which is used below in
this file)

Reviewed By: tanjialiang

Differential Revision: D52709839

fbshipit-source-id: 274cab8085f17db32fd93162e4707c09d13215ff
Summary:
Adding a first sketch for an open source cmake-based build system for Alpha. For now it only compiles alpha/common, and requires a few internal dependencies to be commented out first.
In the next PRs I'll clean them up.

Pull Request resolved: facebookincubator/nimble#5

Reviewed By: zzhao0

Differential Revision: D52717233

Pulled By: pedroerp

fbshipit-source-id: a3c5dcef235dbcf6107002e24e3960457e00175a
Summary:
Remove usages of checked_memcpy() to remove dependency on internal
library.

Reviewed By: zzhao0

Differential Revision: D52741439

fbshipit-source-id: 951cb24a07d8fa66a3b113218c07b694af2a75dc
Summary:
Three fixes:
* Setting default dependency source as AUTO (tries to find locally, bundles if not found).
* Adds flag to ignore existing folly compiler warning.
* Removes StopWatch.cpp and its tests from CMake, since it relies on an internal library and is not used anywhere in Alpha today.

With this change, CMake builds successfully (though they only compile alpha/common today).

Pull Request resolved: facebookincubator/nimble#6

Reviewed By: zzhao0

Differential Revision: D52808613

Pulled By: pedroerp

fbshipit-source-id: 8c61b6eceba1360778d539477d67c721295e08dd
Summary:
Remove unused include and adding an explicit include and dependency
since the file uses a symbol from velox/common/file (InMemoryWriteFile).

Reviewed By: HuamengJiang

Differential Revision: D52823742

fbshipit-source-id: ac218a444d9128f3d6f182ae84cc12a866a135d0
Summary:
Adding a compile-time dependency on flatbuffers for now, but this can be removed in the future if we decide to check in the generated files.

Adding third-party cmake toolchain for flatbuffer support.

Pull Request resolved: facebookincubator/nimble#7

Reviewed By: zzhao0

Differential Revision: D52823929

Pulled By: pedroerp

fbshipit-source-id: 2589468a26443e26d0c2812178c65cb5240bc44c
Summary:
The order specified in designator intialization should follow the
declaration order. Some compilers refuse to compile otherwise.

Reviewed By: HuamengJiang

Differential Revision: D52823278

fbshipit-source-id: 5feb51b4b3901dacfe6cfc2d75bbcd94f8527b07
Summary: Remove unused thrift files

Reviewed By: helfman

Differential Revision: D52989670

fbshipit-source-id: ce4cf72d4663a9a1b87978ce4672c9a1344f7cbb
…cy.h

Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Reviewed By: bunnypak, dmm-fb

Differential Revision: D53011692

fbshipit-source-id: 861ca264edc6adc812c595b3a11607542a0826eb
Summary:
Pull Request resolved: facebookincubator/nimble#8

Move compression/uncompression code to a different file to be able to
selectively add them based on compilation flags. This is required to skip
zstrong in external builds.

Reviewed By: zzhao0

Differential Revision: D52897034

fbshipit-source-id: 3d9180890d957c9f66f07acaa478e2a2465e75a1
Summary: Add missing overrides

Reviewed By: zzhao0

Differential Revision: D53030470

fbshipit-source-id: 8c6291a5630b648ae002a1716fc2fabe62d23f3e
helfman and others added 27 commits April 1, 2024 13:23
Summary:
Future changes to FieldWriter (removing chunking from FieldWriter) will change the way parallelism is applied during writes.

To simplify following diffs, we revert the current implementation, and we will reimplement it after the field writer is updated.

Reviewed By: HuamengJiang

Differential Revision: D54076455

fbshipit-source-id: cb8d62e5d5585c1690af837493777c317f4ad7a8
Summary:
Until now, we had one "main" stream per schema node (*).
The (*) means that we actually needed 2 main streams for ArrayWithOffset, but we worked around it, by modeling the second stream as a child schema node (which is incorrect).

With this change, we now allow having multiple streams within each schema node.
So schema nodes now truly represent the schema tree, and each node type, describes its own stream, without affecting the schema tree.

The assumption of one main stream was used throughout the code, so I had to find all this locations and fix them.
Hopefully, I found them all (all tests pass...).

I'm splitting this change to two. One for writer and one for reader.
I'm doing so trying to make these changes smaller (although, it is still big).
Because of this split, I also separated the TypeBuilder from Type (e.g. it no longer inherits from Type).
I might bring it back at some point, although, currently, I think I like it...

NOTE: As part of this change, the layout of an encoding tree (e.g. encoding db) has changed. I didn't implement a backward compatible deserializer, as we still don't use this thing in production. Once we land this, we should regenerate captured encodings, if we have any stored. (Code should not fail if an old version is stored. It will just ignore it.)

Reviewed By: HuamengJiang

Differential Revision: D54076457

fbshipit-source-id: cf29a372a6771b91e5a23022648751d2b6dc2ae8
Summary:
Second part of the change to support multiple streams per schema node.
See description of parent diff.

Reviewed By: HuamengJiang

Differential Revision: D54076456

fbshipit-source-id: 28692a2d2c42b76d996a47ab463e08aed93d7a1a
Summary:
Future diff will introduce the need to be able to materialize empty encodings.
With current use of the C memcpy, these cases fail in opt builds, as we call memcpy with nullptr and count of zero.

We can add if statements all over to prevent calling memcpy on empty encodings, but instead of adding more branching, I opted for removing branching, by using the std::memcpy, which handles the zero size memcpy correctly.

I also used this opportunity to change some places to use std::copy, as in some cases it might have some vectorized improvements internally.

Reviewed By: HuamengJiang

Differential Revision: D55383916

fbshipit-source-id: 3ad21688dbf1f609d57963f66bf23c736a6f4ba2
Summary:
Currently, stream chunking is implemented in the FieldWriter.

However, field writer shouldn't know anything about how data is layed out in the file.
This is the responsibility of the VeloxWriter layer.

This change creates a clear separation between how FieldWriter is accumulating data and how VeloxWriter is later encoding/chunking the data.

To do so, we introduce the concept of StreamData. FieldWriter is accumulating raw data in StreamData objects, and VeloxWriter is using these objects to decide when and how to flush.

As part of this change, I had to also fix chunking for string types. Previously it was easy to disable string chunking, as FieldWriter had the context to identify string sata easily. With the new implementation, it is not that easy, so I opted for fixing string chunking.

With this change I also introduced min size (configurable) for chunking, which prevents creating chunks that are too small.

Reviewed By: HuamengJiang

Differential Revision: D54879575

fbshipit-source-id: b8fa6bca229cb13b886cfa925eb5bbddecb600a8
Summary:
Enabling optional parallel encoding during writes.

NOTE: Currently, we added a mutex to the Buffer class, which will affect all callsites. In a future diff, we will change encodings to accept two buffers, one for internal temporary allocations, and another for the final encoded result. The internal buffer will not be thread safe, whereas the exteranl one will be, thus reducing the mutex contention.

Reviewed By: HuamengJiang

Differential Revision: D55383917

fbshipit-source-id: 8de10acf5e13118fb44907ae206aedd8950de4e1
…nfusion

Summary:
X-link: facebookincubator/velox#9244

A `TypeWithId` must be a proper tree without diamonds because each node references its parent.  The constructor is reparenting the children, so taking shared ownership on children is misleading, especially when they are `const` qualified.

With this change, the constructor will take `unique_ptr` of children so they will not be shared between multiple parents.  Once the tree is constructed, the individual nodes is immutable and can be shared, so they are stored as `shared_ptr<const TypeWithId>` internally.

bypass-github-export-checks

Reviewed By: oerling

Differential Revision: D55318609

fbshipit-source-id: 12098e76110e59d364fe7eaf8448cae46103b7ed
Summary:
Original commit changeset: 8de10acf5e13

Original Phabricator Diff: D55383917

Reviewed By: prasoontelang

Differential Revision: D55647585

fbshipit-source-id: 2eda76bf58d9795c8f77579eeb48257939f1431c
Summary:
Original commit changeset: b8fa6bca229c

Original Phabricator Diff: D54879575

Reviewed By: prasoontelang

Differential Revision: D55648607

fbshipit-source-id: 1f4d657a1c5e12871441760ff62ce0af0845019d
Summary:
Original commit changeset: 3ad21688dbf1

Original Phabricator Diff: D55383916

Reviewed By: prasoontelang

Differential Revision: D55648608

fbshipit-source-id: 596133c36ac7c1c313970da322d635f6d2b13d57
Summary:
MemoryLeakCheck checks for total memory used which can vary based on seed (coming from fuzzer and repeated copies)
This diff relaxes the memory used check

Created from CodeHub with https://fburl.com/edit-in-codehub

Reviewed By: DanielMunozT

Differential Revision: D55871379

fbshipit-source-id: 5cd9b783a6a753d5552244dd9f67c2bad2928717
Summary: Previously this code conformed from clang-format 12.

Reviewed By: igorsugak

Differential Revision: D56065247

fbshipit-source-id: f5a985dd8f8b84f2f9e1818b3719b43c5a1b05b3
Summary:
Redoing diff D55383916 which was backed out at D55648608
With fix for the issue introduced (incorrect handling of multi-buffer IOBufs).

Reviewed By: HuamengJiang

Differential Revision: D55822996

fbshipit-source-id: ea1910149688ac98482186d8db58ee84ecd8900f
Summary:
This was originally introduced in D54879575 and reverted in D55648607 (because a base diff D55648608 was failing)

Reintorducing this diff, rebased on the fixed version.

Reviewed By: HuamengJiang

Differential Revision: D55822998

fbshipit-source-id: e034b53abba40f4ca750d48303c41e70ac4cffa7
Summary:
This was originally introduced in D55383917 and reverted in D55647585 (because a base diff D55648608 was failing)

Reintorducing this diff, rebased on the fixed version.

Reviewed By: HuamengJiang

Differential Revision: D55822997

fbshipit-source-id: 7d7fc6e2ccf5a4333a719354a52098fd2dfa1815
Summary: Recent change broke memory accounting for strings, which led to writing huge stripes if data had huge string columns.

Reviewed By: zzhao0, HuamengJiang

Differential Revision: D56199608

fbshipit-source-id: b95cbefd97419dd9b21bae8dc73c5f9e7b049ceb
Summary: Explicit test for verifying correct memory accounting of string streams.

Reviewed By: zzhao0

Differential Revision: D56204244

fbshipit-source-id: 316b8e6905852497a58481a018c15fbad5e3d0fb
Summary:
TSIA.

I explicitly not rename the following:
* XldbAlphaLogger
* Input/Output format strings and Serde
* Serde config keys
* JNI/Jave paths

Reviewed By: sdruzkin, pedroerp

Differential Revision: D56225190

fbshipit-source-id: 2da7e826fbbb50f3a7c212d20ac92d96cfb05b6a
The internal and external repositories are out of sync. This Pull Request attempts to brings them back in sync by patching the GitHub repository. Please carefully review this patch. You must disable ShipIt for your project in order to merge this pull request. DO NOT IMPORT this pull request. Instead, merge it directly on GitHub using the MERGE BUTTON. Re-enable ShipIt after merging.

Co-authored-by: Facebook Community Bot <[email protected]>
Summary:
Our reader lifecycle tests verify that internal velox vectors and buffers are being reused as much as possible (as long as the caller is not holding a reference to these buffers).

In these tests we verify that the pointer before and after calling next() is the same.
However, for null buffers, they might still change, depending if the data contains nulls, vs not containing any nulls (in this case, we free the nulls vector).

Since we randomly fuzz input buffers, we sometimes get parts of the vector containing no nulls and sometimes containing nulls, leading to a nulls buffer release/allocation, and this is failing the test.

I changed the tests to only verify null pointers, if both sides are not null.

Reviewed By: sdruzkin

Differential Revision: D56361752

fbshipit-source-id: f6bc6b473c289bdd6ad88cace5c221fa6bf7886d
Summary: Pull Request resolved: facebookincubator/nimble#29

Reviewed By: zzhao0

Differential Revision: D56377002

Pulled By: pedroerp

fbshipit-source-id: c4eb5ff7a382f81bbef63e9407e82f6544ecd67d
Summary: Pull Request resolved: facebookincubator/nimble#31

Reviewed By: zzhao0

Differential Revision: D56381223

Pulled By: pedroerp

fbshipit-source-id: 3f9c90d47e939324e907c7aec480d54fb56376fb
Summary: A few small fixes two unblock open source compilation.

Reviewed By: zzhao0

Differential Revision: D56381504

fbshipit-source-id: 99ca8dafb38b61c1a1aa7d0a073cc4ee4ae75d4d
Summary:
Adding missing files, fixing last compilation nit exposed by clang 16, and disabling compilation of test which currently brings too many velox dependencies.

Pull Request resolved: facebookincubator/nimble#32

Reviewed By: helfman

Differential Revision: D56453875

Pulled By: pedroerp

fbshipit-source-id: 88cf902a897cf35cb0c48f589e54c08328c8e799
Summary: Pull Request resolved: facebookincubator/nimble#34

Reviewed By: zzhao0

Differential Revision: D56494415

Pulled By: pedroerp

fbshipit-source-id: d8a6daa5649493df62c54f359b285e83c27fda6f
…nstead

Summary:
X-link: facebookincubator/velox#9566

As said

Reviewed By: kgpai, Magoja

Differential Revision: D56430332

fbshipit-source-id: 6c6698bf82c9075930630974f48d911a5b62ea8a
Summary:
Moving it to a fb/ directory prevents shipIt from exposing it to open
source.

Differential Revision: D56523416
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants