Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: As a last resort, try inserting TEXT into a UUID column #387

Closed
wants to merge 2 commits into from

Conversation

grigi
Copy link

@grigi grigi commented Jul 12, 2024

Have some data that is of type UUID where the source treats it as text.

Instead of failing with an exception:

NotImplementedError: Altering columns is not supported. Could not convert column 'public.users.uuid' from 'UUID' to 'TEXT'.

Just treat them as compatible.

I couldn't think of a different way to do this, as even though the Postgres integration supports UUID, not all do. And technically UUID isn't TEXT (just that it's represented as text in sql, so the DB handles it transparently), so it shouldn't be part of merge_sql_types.

@grigi grigi changed the title As a last resort, try inserting TEXT into a UUID column feat: As a last resort, try inserting TEXT into a UUID column Jul 12, 2024
@edgarrmondragon
Copy link
Member

Hi @grigi, thanks for the PR!

Could you work around this by overriding the schema to something like:

plugins:
  extractors:
  - name: tap-example
    schema:
      my_stream:
        my_uuid_column:
          type: ["string", "null"]
          format: uuid

The UUID format is already handled:

if jsonschema_type.get("format") == "uuid":

nb This is if you're using Meltano, otherwise you could pass a custom catalog file.

@grigi
Copy link
Author

grigi commented Jul 12, 2024

Thanks for the rapid response. I've tried that (using meltano, yes) but I can't get it to work.
It updates the schema generated in the catalog cache .meltano/run/tap-mysql/tap.properties.json, but when I invoke the extractor it overrides the schema in the output.

Manually updating the discovered catalog and providing it, it also gets ignored.
I'm not sure if this is a bug in meltano or pipelinewise-tap-mysql.

There's a high chance that I'm missing something 🤷

@edgarrmondragon
Copy link
Member

edgarrmondragon commented Jul 12, 2024

I think the change here can be problematic for the general use case, since a non-uuid text value can really break pipelines if the target column is UUID.

That said, we could move the add a new user setting to check right before

self._adapt_column_type(
schema_name=cast(str, schema_name),
table_name=table_name,
column_name=column_name,
sql_type=sql_type,
connection=connection,
column_object=column_object,
)

if we should even attempt to adapt column types to the schema. That way, we'd skip this problem entirely if the user considers mismatches are acceptable.

Something like check_column_types or similar, with a description like Fail the sync if the incoming stream schema doesn't match the type of an existing column.

Wdyt?

cc @visch


That said, the fact that the tap doesn't support schema overrides is bad so I dug a bit and submitted a PR transferwise/pipelinewise-tap-mysql#186.

I did test it a bit but let's hope I didn't do something terrible and the Wise folks accept it 😅.

In the meantime it can be tested by updating the pip_url. This is the project I validated these changes with:

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: "'pipelinewise-tap-mysql @ git+https://github.com/edgarrmondragon/pipelinewise-tap-mysql.git@allow-schema-overrides'"
    config:
      user: root
      host: 'localhost'
      database: mysql
      filter_dbs: mysql
    select:
    - "mysql-tests.*"
    schema:
      mysql-tests:
        my_uuid:
          type: [string, "null"]
          format: uuid
          inclusion: available

@grigi
Copy link
Author

grigi commented Jul 12, 2024

Ooh thanks for that. That PR makes it all work for me :-)
It's certainly the more correct solution.

@visch
Copy link
Member

visch commented Jul 12, 2024

I think the change here can be problematic for the general use case, since a non-uuid text value can really break pipelines if the target column is UUID.

That said, we could move the add a new user setting to check right before

self._adapt_column_type(
schema_name=cast(str, schema_name),
table_name=table_name,
column_name=column_name,
sql_type=sql_type,
connection=connection,
column_object=column_object,
)

if we should even attempt to adapt column types to the schema. That way, we'd skip this problem entirely if the user considers mismatches are acceptable.

Something like check_column_types or similar, with a description like Fail the sync if the incoming stream schema doesn't match the type of an existing column.

Wdyt?

cc @visch

That said, the fact that the tap doesn't support schema overrides is bad so I dug a bit and submitted a PR transferwise/pipelinewise-tap-mysql#186.

I did test it a bit but let's hope I didn't do something terrible and the Wise folks accept it 😅.

In the meantime it can be tested by updating the pip_url. This is the project I validated these changes with:

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: "'pipelinewise-tap-mysql @ git+https://github.com/edgarrmondragon/pipelinewise-tap-mysql.git@allow-schema-overrides'"
    config:
      user: root
      host: 'localhost'
      database: mysql
      filter_dbs: mysql
    select:
    - "mysql-tests.*"
    schema:
      mysql-tests:
        my_uuid:
          type: [string, "null"]
          format: uuid
          inclusion: available

That seems like a reasonable setting idea! Gives folks a way to allow postgres to just figure it out for you which will work in a lot of cases.

I think there's a whole flow with the other targets around different data types https://www.stitchdata.com/docs/replication/loading/understanding-table-structural-changes#columns-mixed-data-types that we haven't gone down but we could.

I kind of like the current method as it's explicit and instead of making columns you don't expect it just fails and then you have to create a solution (we could do a much better job explaining what's going on when it fails for this reason) but it all depends on the use case people have.

@edgarrmondragon edgarrmondragon self-assigned this Jul 13, 2024
@grigi
Copy link
Author

grigi commented Jul 18, 2024

I see your PR got closed without comment. That's sad.
But thanks for it. I'm going to keep on using your branch until we decommission our last MySQL servers. (Hopefully by end of the year).

So, considering that, I feel that this PR can be closed.

Thanks for your assistance!

@grigi grigi closed this Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants