-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add iceberg catalog read support #98
Add iceberg catalog read support #98
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I've added a few comments
CMakeLists.txt
Outdated
@@ -4,7 +4,7 @@ cmake_minimum_required(VERSION 2.8.12) | |||
set(TARGET_NAME iceberg) | |||
project(${TARGET_NAME}) | |||
|
|||
set(CMAKE_CXX_STANDARD 14) | |||
set(CMAKE_CXX_STANDARD 17) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need to bump this? I would prefer not to, to avoid any CI issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I needed it for something I can't quite recall now. I think it was used for parts for "writing"
src/catalog_api.cpp
Outdated
} | ||
return result; | ||
|
||
// throw std::runtime_error("No AWS credentials found for table"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can be removed
src/storage/ic_transaction.cpp
Outdated
} | ||
} | ||
|
||
// ICConnection &ICTransaction::GetConnection() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's just remove all commented out code. This has been copied from the uc_catalog extension which has a prototype-quality level codebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These all come from uc_catalog extension. I left them as is
test/sql/iceberg_catalog_read.test
Outdated
require httpfs | ||
|
||
statement ok | ||
CREATE SECRET ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine for a first PR, but we should think about the UX a little here: right now there is only 1 iceberg secret which is then fetched when calling ATTACH '' AS my_datalake (TYPE ICEBERG);
However it's probably desirable to be able to do something like:
CREATE SECRET irc_secret_1 (TYPE ICEBERG, ENDPOINT 'http://127.0.0.1:8181', BEARER_TOKEN 'bla');
CREATE SECRET irc_secret_2 (TYPE ICEBERG, ENDPOINT 'http://some.other.thing.com', BEARER_TOKEN 'bla');
ATTACH 'irc_secret_1' AS irc1 (TYPE ICEBERG);
ATTACH 'irc_secret_2' AS irc2 (TYPE ICEBERG);
or perhaps
CREATE SECRET irc_secret_1 (TYPE ICEBERG, SCOPE 'http://127.0.0.1:8181', BEARER_TOKEN 'bla');
CREATE SECRET irc_secret_2 (TYPE ICEBERG, SCOPE 'http://some.other.thing.com', BEARER_TOKEN 'bla');
ATTACH 'http://127.0.0.1:8181' AS irc1 (TYPE ICEBERG);
ATTACH 'http://some.other.thing.com' AS irc2 (TYPE ICEBERG);
We can have some discussion on what is nicest, these are currently also open questions for other catalog extensions such as postgres and mysql i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Named secrets should work too. If you don't provide a name, it uses __default_iceberg
src/storage/ic_table_set.cpp
Outdated
} | ||
|
||
auto &ic_catalog = catalog.Cast<ICCatalog>(); | ||
// TODO: handle out-of-order columns using position property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO is copy pasted from uc_catalog: we should figure out if this is also relevant to iceberg, if so it can stay otherwise we should remove it to avoid confusion
src/storage/ic_table_entry.cpp
Outdated
auto &get = (LogicalGet &)*op; | ||
bind_data = std::move(get.bind_data); | ||
|
||
return parquet_scan_function; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this doesn't work with tables with deletes and potentially return incorrect results. Can we add a test for this?
If it does indeed fail, we should take a look into how to solve this, or at least throw an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was quite hard to figure out, and I didn't work on insert/update/delete yet 😬
src/storage/ic_table_entry.cpp
Outdated
auto table_ref = iceberg_scan_function.bind_replace(context, bind_input); | ||
|
||
// 1) Create a Binder and bind the parser-level TableRef -> BoundTableRef | ||
auto binder = Binder::CreateBinder(context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: formatting
src/storage/ic_table_entry.cpp
Outdated
throw NotImplementedException("BindUpdateConstraints"); | ||
} | ||
|
||
struct MyIcebergFunctionData : public FunctionData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is dead code i think?
CMakeLists.txt
Outdated
src/common/utils.cpp | ||
src/common/schema.cpp | ||
src/common/iceberg.cpp | ||
src/iceberg_functions/iceberg_snapshots.cpp | ||
src/iceberg_functions/iceberg_scan.cpp | ||
src/iceberg_functions/iceberg_metadata.cpp) | ||
src/iceberg_functions/iceberg_metadata.cpp | ||
src/storage/ic_catalog.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this structure could be a bit clearer
What do you think about renaming the IC to IRC (this code is meant for the Iceberg Rest Catalog after all) and moving everything:
src/storage/irc/irc_catalog.cpp
?
This would allow us to later on create another catalog which can handle static iceberg tables, similar to duckdb/duckdb-delta#110 which could then live in src/storage/static/static_catalog.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want all of these files to be icr_*
or just ic_catalog.cpp
to be icr_catalog.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think irc_*
(not icr_
😄) everywhere that is related to the Iceberg Rest Catalog makes the most sense. We can also look to using C++ namespacing for the iceberg extension at some point, but I think just prefixing everything with IRC is fine for now.
Very excited future user here! Chiming in on a review like this may be frowned upon. Please ignore if it's a distraction. One minor request to make the ICEBERG secret agnostic of object storage provider. Rather than accepting |
src/storage/irc_table_entry.cpp
Outdated
table_data->data_source_format); | ||
} | ||
|
||
if (table_data->storage_location.find("file://") != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the hardcoded file://
maybe is unnecessary? I feel like we can omit the condition and just try to pull the credentials
src/storage/irc_table_entry.cpp
Outdated
// Inject secret into secret manager scoped to this path | ||
CreateSecretInfo info(OnCreateConflict::ERROR_ON_CONFLICT, SecretPersistType::TEMPORARY); | ||
info.name = "__internal_ic_" + table_data->table_id; | ||
info.type = "s3"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, based on the comment here, we should see if the spec returns information on the type of the secret so we don't have to hard code this kind of information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Tmonster, went through the code again and added a bunch of nitpicks, other than that it looks great
src/catalog_api.cpp
Outdated
|
||
IRCAPISchema IRCAPI::CreateSchema(const string &catalog, const string &internal, const string &schema, IRCCredentials credentials) { | ||
throw NotImplementedException("IRCAPI::Create Schema not Implemented"); | ||
// string post_data = "{\"namespace\":[\"" + schema + "\"]}"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe we can remove this commented block
src/iceberg_extension.cpp
Outdated
return make_uniq<ICTransactionManager>(db, ic_catalog); | ||
} | ||
|
||
class ICCatalogStorageExtension : public StorageExtension { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IRCStorageExtension
src/iceberg_extension.cpp
Outdated
} | ||
} | ||
|
||
// if no secret is specified we default to the unnamed mysql secret, if it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: iceberg secret
namespace duckdb { | ||
struct CreateSchemaInfo; | ||
|
||
class ICSchemaSet : public ICCatalogSet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IRCSchemaSet
namespace duckdb { | ||
class ICTransaction; | ||
|
||
class ICSchemaEntry : public SchemaCatalogEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IRCSchemaEntry
src/storage/irc_schema_entry.cpp
Outdated
ICSchemaEntry::~ICSchemaEntry() { | ||
} | ||
|
||
ICTransaction &GetUCTransaction(CatalogTransaction transaction) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ICTransaction &GetUCTransaction(CatalogTransaction transaction) { | |
IRCTransaction &GetIRCTransaction(CatalogTransaction transaction) { |
src/storage/irc_schema_entry.cpp
Outdated
throw NotImplementedException("REPLACE ON CONFLICT in CreateView"); | ||
} | ||
} | ||
// auto &ic_transaction = GetUCTransaction(transaction); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove
2023-03-11 11 k | ||
2023-03-12 12 l | ||
|
||
# test deletes (see provision.py for where deletes occur) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scan on a table with deletes should throw a notimplementedexception for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of those cases where the whole table is rewritten and there are no point deletes. I'll see if I can add a case with point deletes that throws an error
… will clean them up though
Piggy backing off of #95
This PR adds support for attaching an Iceberg catalog and be able to read iceberg tables from a datalake. It specifically supports the following sql commands:
Attach your catalog
Some tests as well to make sure everything works. There is also a double check on the config since it's possible some catalogs won't return configs. In this case the user is required to have a second key.