[DYOD18/19] Add LZ4 Point Access (also fixes hyrise#1522 and fixes hyrise#1516) (hyrise#1521)

* fix test

* handle empty strings in decompression

* format

* add single char test case

* fix linter

* fix indices and add more expects

* fix offset

* fix test

* move third party includes

* Add comment

* remove std::move without effect

* Remove const references and use std::move

* add advance and distance to sequential iterator

* string lz4segment point access

* inlining

* universal reference

* make offsets optional

* maybe fix test

* add second constructor

* fix copy

* fix remaining tests

* format

* remove const

* use pmr_string instead of std::string

* english

* format

* merge lz4 encoder

* merge lz4 segment

* merge lz4 iterators

* comment out string code

* Remove malicious semicolon

* fix compile errors

* make constant constexpression

* remove std::string

* rename offset function in tests

* add point access string decompression

* add string segment decompression

* debug

* typo

* possible fix

* debug

* handle string decompression edge case

* debug

* fix empty last block

* remove debug output

* fix indent lint errors and make debugassert an assert

* format

* add segment docstrings

* add string segment test for all segments

* more comments

* fix row count calls

* fix test case class

* fix class name

* fix uint

* fix

* add debug

* debug

* fix multi block string

* more debug

* more debug

* fix block index access

* remove debug

* more debug

* try different decompress method

* try larger input block size

* assert for max block size

* Fix string decode error

* remove old method call

* fix if clause

* add empty non string segment test

* remove code in empty loop

* Add extra test case

* multi block tests

* fix index in test

* proper string dictionary learning

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* skip empty int segment test

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* debug

* try to increase small dictionaries otherwise nullopt

* debug

* debug

* debug

* a little refactoring

* dictionary decompression method

* better lz4 use

* add dictionary abort for small value vectors

* single char test

* return empty dict on error

* finish single char test

* remove duplicate test case

* debug

* introduce second type param to decompress method

* better zero one test

* Revert "introduce second type param to decompress method"

This reverts commit 338e19c.

* docstring

* refactor dictionary gen

* generate dict with more data

* docstring

* Revert "debug"

This reverts commit 1dfb74d.

* Revert "debug"

This reverts commit b38ece9.

* Revert "debug"

This reverts commit 0150fe7.

* Revert "try to increase small dictionaries otherwise nullopt"

This reverts commit 069b5cf.

* fix hyrise#1516: null vector size in value segment

* lz4 estimate memory usage

* refactor dictionary generation and docstring

* move compress

* calculate metadata

* remove const

* remove unused variables

* refactor and remove duplicate code

* point access docstring

* format

* linter

* fix docstring

* more docstring

* Skip only failing test instances

* Fix LZ4 and RunLength encoding for empty Segments

* fullci

* add simple caching

* fix ternary operator

* docstring

* add nolint for std::pair unzip in variable assignment

* remove random data from string segment test

* better comment for test skipping

* caching with char vector

* fix method signature

* string caching

* wrap caching method for simple string decompression

* move wrapper methods below caching implementations

* more code deduplication

* format

* generate -> train

* typos & ternary operator to std::max

* more typo fixing

* refactor & typos

* refactor encoding empty segment test

* re-add empty loop for empty segment test

* remove duplicate empty segment test in encoded string segment test

* don't store offsets in empty string segment

* fix typo

* fix typo

* fix simdbp128 on empty segments

* comment

* size_t initialization

* remove test skipping

* Use constant for number of bits in a byte

* remove this in tests

* remove dictionary padding

* refactor lz4 iterable

* change size_t construction

* fix typo and implicit bool

* change pair constructor

* update dependencies.md

* rename previous block to cached block

* more size_t construction

* remove random null values

* fix shrinking comment

* remove repeated comment in constructor

* fix name shadowing

* improve dictionary training comment

* add general comment explaining zstd dictionary to encoder

* add comment explaining string use_caching variable

* add comment for skipping of dictionary

* make bool usage explicit

* use lz4segment::size instead of null_values.size

* use simdbp128 vector compression for string offsets

* format

* rename vector_decompressor to offset_decompressor

* clarified decompression comment

* add comment explaining block size

* only store null values vector when there are null values

* Use proper vector compression interface

* refactor lz4 test

* comments and extra test case

* refactor optional access to use value()

* format

* reset lz4 stream decoder (fix pointer overflow)

* fix another comment

* fix

* add const

* NULL to nullptr

* change debugassert to assert

* null values in iterator

* optional code style

* format

* more code style

* multi block string test

* more code style
janehmueller authored and mrks committed Mar 29, 2019
1 parent 6ef2a54 commit d6f5343
Showing 19 changed files with 1,428 additions and 223 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -31,6 +31,9 @@
 [submodule "third_party/join-order-benchmark"]
 	path = third_party/join-order-benchmark
 	url = https://github.com/gregrahn/join-order-benchmark.git
+[submodule "third_party/zstd"]
+	path = third_party/zstd
+	url = https://github.com/facebook/zstd.git
 [submodule "third_party/jemalloc"]
 	path = third_party/jemalloc
 	url = https://github.com/jemalloc/jemalloc.git
2 changes: 2 additions & 0 deletions DEPENDENCIES.md
@@ -33,8 +33,10 @@
 - cxxopts (https://github.com/jarro2783/cxxopts.git)
 - googletest (https://github.com/google/googletest)
 - libpqxx (https://github.com/jtv/libpqxx)
+- lz4 (https://github.com/lz4/lz4)
 - sql-parser (https://github.com/hyrise/sql-parser)
 - pgasus (https://github.com/kateyy/pgasus)
 - cpp-btree (https://github.com/algorithm-ninja/cpp-btree)
 - cqf (https://github.com/ArneMayer/cqf)
 - jemalloc (https://github.com/jemalloc/jemalloc)
+- zstd (https://github.com/facebook/zstd)
1 change: 1 addition & 0 deletions src/CMakeLists.txt
@@ -87,6 +87,7 @@ include_directories(
     ${PROJECT_SOURCE_DIR}/third_party/flat_hash_map
     ${PROJECT_SOURCE_DIR}/third_party/json
     ${PROJECT_SOURCE_DIR}/third_party/lz4
+    ${PROJECT_SOURCE_DIR}/third_party/zstd
 )
 
 if (${ENABLE_JIT_SUPPORT})
1 change: 1 addition & 0 deletions src/lib/CMakeLists.txt
@@ -592,6 +592,7 @@ set(
     cqf
     uninitialized_vector
     lz4
+    zstd
     custom_jemalloc
     ${FILESYSTEM_LIBRARY}
     ${Boost_CONTAINER_LIBRARY}
361 changes: 305 additions & 56 deletions src/lib/storage/lz4/lz4_encoder.hpp

Large diffs are not rendered by default.

40 changes: 27 additions & 13 deletions src/lib/storage/lz4/lz4_iterable.hpp
@@ -22,24 +22,40 @@ class LZ4Iterable : public PointAccessibleSegmentIterable<LZ4Iterable<T>> {
 
     auto decompressed_segment = _segment.decompress();
 
-    auto begin = Iterator<ValueIterator>{decompressed_segment.cbegin(), _segment.null_values().cbegin()};
-    auto end = Iterator<ValueIterator>{decompressed_segment.cend(), _segment.null_values().cend()};
+    /**
+     * If the null value vector doesn't exist, then the segment does not have any row value that is null. In that case,
+     * we can just use a default initialized boolean vector.
+     */
+    const auto null_values = _segment.null_values() ? *_segment.null_values() : pmr_vector<bool>(_segment.size());
+
+    auto begin = Iterator<ValueIterator>{decompressed_segment.cbegin(), null_values.cbegin()};
+    auto end = Iterator<ValueIterator>{decompressed_segment.cend(), null_values.cend()};
 
     functor(begin, end);
   }
 
   /**
-   * For now this point access iterator decompresses the whole segment.
+   * For the point access, we first retrieve the values for all chunk offsets in the position list and then save
+   * the decompressed values in a vector. The first value in that vector (index 0) is the value for the chunk offset
+   * at index 0 in the position list.
    */
   template <typename Functor>
   void _on_with_iterators(const std::shared_ptr<const PosList>& position_filter, const Functor& functor) const {
     using ValueIterator = typename std::vector<T>::const_iterator;
 
-    const auto decompressed_segment = std::make_shared<std::vector<T>>(_segment.decompress());
+    auto decompressed_filtered_segment = std::vector<ValueType>(position_filter->size());
+    auto cached_block = std::vector<char>{};
+    auto cached_block_index = std::optional<size_t>{};
+    for (auto index = size_t{0u}; index < position_filter->size(); ++index) {
+      const auto& position = (*position_filter)[index];
+      auto [value, block_index] = _segment.decompress(position.chunk_offset, cached_block_index, cached_block); // NOLINT
+      decompressed_filtered_segment[index] = std::move(value);
+      cached_block_index = block_index;
+    }
 
-    auto begin = PointAccessIterator<ValueIterator>{decompressed_segment, &_segment.null_values(),
+    auto begin = PointAccessIterator<ValueIterator>{decompressed_filtered_segment, &_segment.null_values(),
                                                     position_filter->cbegin(), position_filter->cbegin()};
-    auto end = PointAccessIterator<ValueIterator>{decompressed_segment, &_segment.null_values(),
+    auto end = PointAccessIterator<ValueIterator>{decompressed_filtered_segment, &_segment.null_values(),
                                                   position_filter->cbegin(), position_filter->cend()};
 
     functor(begin, end);
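
The point-access loop in this hunk requests one value per position-list entry and hands the previously used block back to `_segment.decompress`, so that consecutive accesses to the same block do not decompress it again. The following is a minimal, self-contained sketch of that caching idea only. It assumes fixed-size blocks and simulates decompression by copying; `BlockedSegment`, `materialize` and the decompression counter are hypothetical names for illustration, not Hyrise or LZ4 APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a block-compressed segment. "Decompressing" a block is simulated by
// copying it out of _values; a counter records how often that happens.
class BlockedSegment {
 public:
  BlockedSegment(std::vector<int> values, size_t block_size)
      : _values(std::move(values)), _block_size(block_size) {
    assert(_block_size > 0);
  }

  // Returns the value at `offset` and the index of the block that contains it. If
  // `cached_block_index` already names that block, `cached_block` is reused as is; otherwise
  // `cached_block` is overwritten with the newly "decompressed" block.
  std::pair<int, size_t> decompress(size_t offset, const std::optional<size_t>& cached_block_index,
                                    std::vector<int>& cached_block) {
    assert(offset < _values.size());
    const auto block_index = offset / _block_size;
    if (!cached_block_index || *cached_block_index != block_index) {
      const auto first = block_index * _block_size;
      const auto last = std::min(first + _block_size, _values.size());
      cached_block.assign(_values.begin() + static_cast<std::ptrdiff_t>(first),
                          _values.begin() + static_cast<std::ptrdiff_t>(last));
      ++_decompressed_blocks;
    }
    return {cached_block[offset % _block_size], block_index};
  }

  size_t decompressed_blocks() const { return _decompressed_blocks; }

 private:
  std::vector<int> _values;
  size_t _block_size;
  size_t _decompressed_blocks = 0;
};

// Materializes the values for all offsets in `positions`, in position-list order, so that
// result[i] belongs to positions[i].
std::vector<int> materialize(BlockedSegment& segment, const std::vector<size_t>& positions) {
  auto result = std::vector<int>(positions.size());
  auto cached_block = std::vector<int>{};
  auto cached_block_index = std::optional<size_t>{};
  for (auto index = size_t{0}; index < positions.size(); ++index) {
    const auto [value, block_index] = segment.decompress(positions[index], cached_block_index, cached_block);
    result[index] = value;
    cached_block_index = block_index;
  }
  return result;
}
```

With seven values in blocks of three, the positions `{0, 1, 2, 6, 3}` touch blocks 0, 0, 0, 2 and 1, so this sketch decompresses three blocks instead of five. Only the most recent block is cached, so a position list that alternates between blocks gains nothing from the cache.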
@@ -113,7 +129,7 @@ class LZ4Iterable : public PointAccessibleSegmentIterable<LZ4Iterable<T>> {
     using IterableType = LZ4Iterable<T>;
 
     // Begin Iterator
-    PointAccessIterator(const std::shared_ptr<std::vector<T>>& data, const pmr_vector<bool>* null_values,
+    PointAccessIterator(const std::vector<T>& data, const std::optional<pmr_vector<bool>>* null_values,
                         const PosList::const_iterator position_filter_begin, PosList::const_iterator position_filter_it)
         : BasePointAccessSegmentIterator<PointAccessIterator<ValueIterator>,
                                          SegmentPosition<T>>{std::move(position_filter_begin),
@@ -126,16 +142,14 @@ class LZ4Iterable : public PointAccessibleSegmentIterable<LZ4Iterable<T>> {
 
     SegmentPosition<T> dereference() const {
       const auto& chunk_offsets = this->chunk_offsets();
-      const auto value = (*_data)[chunk_offsets.offset_in_referenced_chunk];
-      const auto is_null = (*_null_values)[chunk_offsets.offset_in_referenced_chunk];
+      const auto value = _data[chunk_offsets.offset_in_poslist];
+      const auto is_null = *_null_values && (**_null_values)[chunk_offsets.offset_in_referenced_chunk];
       return SegmentPosition<T>{value, is_null, chunk_offsets.offset_in_poslist};
     }
 
    private:
-    // LZ4 PointAccessIterators share the materialized segment
-    std::shared_ptr<std::vector<T>> _data;
-
-    const pmr_vector<bool>* _null_values;
+    const std::vector<T> _data;
+    const std::optional<pmr_vector<bool>>* _null_values;
   };
 };

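The changed `dereference()` uses two different indices: the materialized values hold one entry per position-list entry and are indexed by `offset_in_poslist`, while the null flags (if a null vector exists at all) still hold one entry per row of the referenced chunk and are indexed by `offset_in_referenced_chunk`. A small sketch of that lookup, using plain `std::vector` in place of `pmr_vector` and a hypothetical free function `dereference_sketch` that is not part of Hyrise:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical helper for illustration. `filtered_values` holds one value per position-list
// entry; `null_values`, if present, holds one flag per row of the referenced chunk. A missing
// null vector is read as "no row is NULL".
std::pair<int, bool> dereference_sketch(const std::vector<int>& filtered_values,
                                        const std::optional<std::vector<bool>>& null_values,
                                        size_t offset_in_poslist, size_t offset_in_chunk) {
  assert(offset_in_poslist < filtered_values.size());
  const auto value = filtered_values[offset_in_poslist];
  // The short-circuit keeps the vector access from running when no null vector exists.
  const auto is_null = null_values.has_value() && (*null_values)[offset_in_chunk];
  return {value, is_null};
}
```

Keeping the null vector optional means a segment without NULLs stores no flags at all, at the cost of one extra check per dereference.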