Cleaner Data Generation #106

Merged: 48 commits, Feb 20, 2025

Commits
3afec8a
generate IRC data on a bind mount. Hopefully not to mix it with expec…
Tmonster Feb 11, 2025
52c53a7
add fixed data that cannot be regenerated
Tmonster Feb 11, 2025
cfdf57f
slowly understanding how to generate test data using different catalogs
Tmonster Feb 11, 2025
96752ae
Merge remote-tracking branch 'upstream/main' into robust-data-generation
Tmonster Feb 17, 2025
e4b7983
deleting everything in data:
Tmonster Feb 17, 2025
a11dd37
update makefile
Tmonster Feb 17, 2025
74a9a99
renaming many folders. Have an idea for a directory structure. need t…
Tmonster Feb 17, 2025
0f97b90
more removing/moving around files
Tmonster Feb 18, 2025
890e282
more moving around
Tmonster Feb 18, 2025
778391d
now we can generate both rest and local/hadoop iceberg files
Tmonster Feb 18, 2025
362943e
added more files. You can now easily generate tables by just adding s…
Tmonster Feb 18, 2025
638269a
even more changes and differences, but now everything shows up under …
Tmonster Feb 18, 2025
79c8d83
modify a number of test files
Tmonster Feb 18, 2025
e3425da
use data_generated to preserve parity with data_fixed
Tmonster Feb 18, 2025
e21488e
ignore pycache
Tmonster Feb 18, 2025
ede8aa7
fixing data generation again, making sure tests pass
Tmonster Feb 18, 2025
999802a
Rest workflow now generates data and tests on it
Tmonster Feb 18, 2025
c718e32
remove unneeded pycache
Tmonster Feb 18, 2025
29d46f8
really get rid of pycache
Tmonster Feb 18, 2025
f838fee
ignore temporary lineitem files as well
Tmonster Feb 18, 2025
c1d7144
remove unused provision files
Tmonster Feb 18, 2025
288dbed
tmp data can live in one place
Tmonster Feb 18, 2025
4569bd3
configure CI addition and rest.yml fix
Tmonster Feb 19, 2025
c922e61
create paths for docker compose files
Tmonster Feb 19, 2025
707c0d5
add tmate
Tmonster Feb 19, 2025
6a11f50
make sure python packages are installed
Tmonster Feb 19, 2025
5231231
first install requirements, then make the data
Tmonster Feb 19, 2025
349b9b6
start rest client before making the data
Tmonster Feb 19, 2025
3c58707
debug statements
Tmonster Feb 19, 2025
733f515
install docker compose when running configure ci
Tmonster Feb 19, 2025
a5586ba
docker compose should be executable
Tmonster Feb 19, 2025
a6bc9b0
attempt to download correct docker-compose
Tmonster Feb 19, 2025
9c55b11
docker compose not docker compose
Tmonster Feb 19, 2025
468edb3
remove steps in configure ci
Tmonster Feb 19, 2025
aeda19b
attempting to debug this docker compose issue
Tmonster Feb 19, 2025
e4e58e9
need to create the spark-rest subdirectory. Unsure how I feel about …
Tmonster Feb 19, 2025
20febb3
first generate rest data, then local data
Tmonster Feb 19, 2025
6df0132
install requirements in start-rest-catalog.sh
Tmonster Feb 19, 2025
6c08eb8
data/generated and data/persistent
Tmonster Feb 19, 2025
c074605
modify tests to use different paths
Tmonster Feb 19, 2025
6c32223
remove configure ci and add cache to github workflow file
Tmonster Feb 19, 2025
e71fa03
.gitignore ignores data/generated
Tmonster Feb 19, 2025
9bb5a0a
Merge branch 'main' into robust-data-generation
Tmonster Feb 19, 2025
51abdf6
do not install dependencies on start rest catalog
Tmonster Feb 19, 2025
ac45c6e
install requirements
Tmonster Feb 19, 2025
ff3c266
introduce make data_ci
Tmonster Feb 19, 2025
e0a0df3
make data requires restarting the catalog
Tmonster Feb 19, 2025
354a3ab
make data and not make data_ci
Tmonster Feb 19, 2025
16 changes: 12 additions & 4 deletions .github/workflows/Rest.yml
@@ -39,6 +39,10 @@ jobs:
         with:
           vcpkgGitCommitId: 5e5d0e1cd7785623065e77eff011afdeec1a3574
 
+      - name: Setup Ccache
+        uses: hendrikmuhs/ccache-action@main
+        continue-on-error: true
+
       - name: Build extension
         env:
           GEN: ninja
@@ -47,12 +51,16 @@
           make release
 
       - name: Start Rest Catalog
-        working-directory: scripts/
         run: |
-          ./start-rest-catalog.sh
+          make start-rest-catalog
 
+      - name: Generate data
+        run: |
+          make data
+
-      - name: Test With rest catalog
+      - name: Test with rest catalog
         env:
           ICEBERG_SERVER_AVAILABLE: 1
+          DUCKDB_ICEBERG_HAVE_GENERATED_DATA: 1
         run: |
-          make test_release
+          make test_release
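
The workflow now exports two environment variables before running the tests. How the test suite reads them is not shown in this diff, so the following is only a minimal Python sketch of the gating idea, assuming a variable counts as "on" when it is set to a non-empty value other than `0`. The helper names `env_flag` and `can_run_rest_tests` are hypothetical.

```python
import os


def env_flag(name, environ=None):
    """Return True when `name` is set to a non-empty value other than "0".

    This truthiness rule is an assumption made for illustration.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name, "") not in ("", "0")


def can_run_rest_tests(environ=None):
    """Both flags from the workflow's `env:` block must be on."""
    return (
        env_flag("ICEBERG_SERVER_AVAILABLE", environ)
        and env_flag("DUCKDB_ICEBERG_HAVE_GENERATED_DATA", environ)
    )


if __name__ == "__main__":
    print(can_run_rest_tests())
```

Passing an explicit mapping instead of relying on `os.environ` keeps the check easy to exercise without touching the real environment.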
4 changes: 4 additions & 0 deletions .gitignore
@@ -12,3 +12,7 @@ data/iceberg/generated_*
 scripts/metastore_db/
 scripts/derby.log
 scripts/test-script-with-path.sql
+scripts/data_generators/__pycache__/
+scripts/data_generators/*/__pycache__/
+scripts/data_generators/*/*/*.parquet
+data/generated/*
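
To see which paths the four new patterns cover, here is a rough Python sketch of a simplified matcher. It is not Git's implementation: it handles only `*` (which does not cross a `/`) and a trailing `/` (directory only), and it treats a path as ignored when any leading part of it matches. The example paths in the usage lines are hypothetical.

```python
import re

# The four patterns added to .gitignore in this PR.
NEW_PATTERNS = [
    "scripts/data_generators/__pycache__/",
    "scripts/data_generators/*/__pycache__/",
    "scripts/data_generators/*/*/*.parquet",
    "data/generated/*",
]


def _to_regex(pattern):
    # Simplified translation: `*` matches any run of non-slash characters.
    body = "[^/]*".join(re.escape(part) for part in pattern.rstrip("/").split("*"))
    return re.compile(body + r"\Z")


def is_ignored(path, patterns=NEW_PATTERNS):
    """Return True if the file `path` or one of its parent directories matches."""
    parts = path.split("/")
    for pattern in patterns:
        regex = _to_regex(pattern)
        dir_only = pattern.endswith("/")
        for i in range(1, len(parts) + 1):
            is_dir = i < len(parts)  # every proper prefix is a directory
            if dir_only and not is_dir:
                continue
            if regex.match("/".join(parts[:i])):
                return True
    return False


if __name__ == "__main__":
    print(is_ignored("scripts/data_generators/a/b/x.parquet"))  # True
    print(is_ignored("scripts/data_generators/a/x.parquet"))    # False
```

Under this model the `.parquet` pattern only covers files exactly two directory levels below `scripts/data_generators/`, while `data/generated/*` covers everything beneath `data/generated/` because each direct child is matched.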
17 changes: 11 additions & 6 deletions Makefile
@@ -5,18 +5,23 @@ EXT_NAME=iceberg
 EXT_CONFIG=${PROJ_DIR}extension_config.cmake
 
 # We need this for testing
-CORE_EXTENSIONS='httpfs'
+CORE_EXTENSIONS='parquet;httpfs'
 
 # Include the Makefile from extension-ci-tools
 include extension-ci-tools/makefiles/duckdb_extension.Makefile
 
+start-rest-catalog: install_requirements
+	./scripts/start-rest-catalog.sh
+
+install_requirements:
+	python3 -m pip install -r scripts/requirements.txt
+
 # Custom makefile targets
-data: data_clean
-	python3 scripts/test_data_generator/generate_iceberg.py 0.001 data/iceberg/generated_spec1_0_001 1
-	python3 scripts/test_data_generator/generate_iceberg.py 0.001 data/iceberg/generated_spec2_0_001 2
+data: data_clean start-rest-catalog
+	python3 scripts/data_generators/generate_data.py
 
 data_large: data data_clean
-	python3 scripts/test_data_generator/generate_iceberg.py 1 data/iceberg/generated_spec2_1 2
+	python3 scripts/data_generators/generate_data.py
 
 data_clean:
-	rm -rf data/iceberg/generated_*
+	rm -rf data/generated
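
The new targets form a small prerequisite chain. As a sketch of the order in which a serial `make` run (no `-j`) would visit them, the Python below walks the prerequisites left to right and depth first, building each target at most once. The `PREREQS` table is transcribed from the diff above; the function name `build_order` is hypothetical, and real `make` also considers timestamps and other rules this model ignores.

```python
# Prerequisites as written in the updated Makefile.
PREREQS = {
    "data": ["data_clean", "start-rest-catalog"],
    "data_large": ["data", "data_clean"],
    "start-rest-catalog": ["install_requirements"],
    "install_requirements": [],
    "data_clean": [],
}


def build_order(target, prereqs=PREREQS, done=None):
    """Return targets in the order a serial, depth-first run would build them."""
    done = [] if done is None else done
    for dep in prereqs.get(target, []):
        build_order(dep, prereqs, done)
    if target not in done:  # each target is built at most once
        done.append(target)
    return done


if __name__ == "__main__":
    print(build_order("data"))
    # ['data_clean', 'install_requirements', 'start-rest-catalog', 'data']
    print(build_order("data_large"))
    # ['data_clean', 'install_requirements', 'start-rest-catalog', 'data', 'data_large']
```

In this model `make data` cleans `data/generated`, installs the Python requirements, starts the REST catalog, and only then runs the generator. For `data_large`, the second mention of `data_clean` does not trigger another clean, since it was already built as a prerequisite of `data`.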
