feat: As a last resort, try inserting TEXT into a UUID column #387

grigi · 2024-07-12T15:45:04Z

Have some data that is of type UUID where the source treats it as text.

Instead of failing with an exception:

NotImplementedError: Altering columns is not supported. Could not convert column 'public.users.uuid' from 'UUID' to 'TEXT'.

Just treat them as compatible.

I couldn't think of a different way to do this, as even though the Postgres integration supports UUID, not all do. And technically UUID isn't TEXT (just that it's represented as text in sql, so the DB handles it transparently), so it shouldn't be part of merge_sql_types.

edgarrmondragon · 2024-07-12T15:51:58Z

Hi @grigi, thanks for the PR!

Could you work around this by overriding the schema to something like:

plugins:
  extractors:
  - name: tap-example
    schema:
      my_stream:
        my_uuid_column:
          type: ["string", "null"]
          format: uuid

The UUID format is already handled:

target-postgres/target_postgres/connector.py

Line 290 in fd881a6

if jsonschema_type.get("format") == "uuid":

nb This is if you're using Meltano, otherwise you could pass a custom catalog file.

grigi · 2024-07-12T17:08:01Z

Thanks for the rapid response. I've tried that (using meltano, yes) but I can't get it to work.
It updates the schema generated in the catalog cache .meltano/run/tap-mysql/tap.properties.json, but when I invoke the extractor it overrides the schema in the output.

Manually updating the discovered catalog and providing it, it also gets ignored.
I'm not sure if this is a bug in meltano or pipelinewise-tap-mysql.

There's a high chance that I'm missing something 🤷

edgarrmondragon · 2024-07-12T18:40:26Z

I think the change here can be problematic for the general use case, since a non-uuid text value can really break pipelines if the target column is UUID.

That said, we could move the add a new user setting to check right before

target-postgres/target_postgres/connector.py

Lines 431 to 438 in fd881a6

    
           self._adapt_column_type( 
        
               schema_name=cast(str, schema_name), 
        
               table_name=table_name, 
        
               column_name=column_name, 
        
               sql_type=sql_type, 
        
               connection=connection, 
        
               column_object=column_object, 
        
           )

if we should even attempt to adapt column types to the schema. That way, we'd skip this problem entirely if the user considers mismatches are acceptable.

Something like check_column_types or similar, with a description like Fail the sync if the incoming stream schema doesn't match the type of an existing column.

Wdyt?

cc @visch

That said, the fact that the tap doesn't support schema overrides is bad so I dug a bit and submitted a PR transferwise/pipelinewise-tap-mysql#186.

I did test it a bit but let's hope I didn't do something terrible and the Wise folks accept it 😅.

In the meantime it can be tested by updating the pip_url. This is the project I validated these changes with:

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: "'pipelinewise-tap-mysql @ git+https://github.com/edgarrmondragon/pipelinewise-tap-mysql.git@allow-schema-overrides'"
    config:
      user: root
      host: 'localhost'
      database: mysql
      filter_dbs: mysql
    select:
    - "mysql-tests.*"
    schema:
      mysql-tests:
        my_uuid:
          type: [string, "null"]
          format: uuid
          inclusion: available

grigi · 2024-07-12T19:09:03Z

Ooh thanks for that. That PR makes it all work for me :-)
It's certainly the more correct solution.

visch · 2024-07-12T19:21:02Z

I think the change here can be problematic for the general use case, since a non-uuid text value can really break pipelines if the target column is UUID.

That said, we could move the add a new user setting to check right before

target-postgres/target_postgres/connector.py

Lines 431 to 438 in fd881a6

self._adapt_column_type(

schema_name=cast(str, schema_name),

table_name=table_name,

column_name=column_name,

sql_type=sql_type,

connection=connection,

column_object=column_object,

)

if we should even attempt to adapt column types to the schema. That way, we'd skip this problem entirely if the user considers mismatches are acceptable.

Something like check_column_types or similar, with a description like Fail the sync if the incoming stream schema doesn't match the type of an existing column.

Wdyt?

cc @visch

That said, the fact that the tap doesn't support schema overrides is bad so I dug a bit and submitted a PR transferwise/pipelinewise-tap-mysql#186.

I did test it a bit but let's hope I didn't do something terrible and the Wise folks accept it 😅.

In the meantime it can be tested by updating the pip_url. This is the project I validated these changes with:
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: "'pipelinewise-tap-mysql @ git+https://github.com/edgarrmondragon/pipelinewise-tap-mysql.git@allow-schema-overrides'"
    config:
      user: root
      host: 'localhost'
      database: mysql
      filter_dbs: mysql
    select:
    - "mysql-tests.*"
    schema:
      mysql-tests:
        my_uuid:
          type: [string, "null"]
          format: uuid
          inclusion: available

That seems like a reasonable setting idea! Gives folks a way to allow postgres to just figure it out for you which will work in a lot of cases.

I think there's a whole flow with the other targets around different data types https://www.stitchdata.com/docs/replication/loading/understanding-table-structural-changes#columns-mixed-data-types that we haven't gone down but we could.

I kind of like the current method as it's explicit and instead of making columns you don't expect it just fails and then you have to create a solution (we could do a much better job explaining what's going on when it fails for this reason) but it all depends on the use case people have.

grigi · 2024-07-18T09:24:19Z

I see your PR got closed without comment. That's sad.
But thanks for it. I'm going to keep on using your branch until we decommission our last MySQL servers. (Hopefully by end of the year).

So, considering that, I feel that this PR can be closed.

Thanks for your assistance!

As a last resort, try inserting TEXT into a UUID column

1209af8

grigi changed the title ~~As a last resort, try inserting TEXT into a UUID column~~ feat: As a last resort, try inserting TEXT into a UUID column Jul 12, 2024

Satisfy ruff

ea24173

edgarrmondragon self-assigned this Jul 13, 2024

grigi closed this Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: As a last resort, try inserting TEXT into a UUID column #387

feat: As a last resort, try inserting TEXT into a UUID column #387

grigi commented Jul 12, 2024

edgarrmondragon commented Jul 12, 2024

grigi commented Jul 12, 2024

edgarrmondragon commented Jul 12, 2024 •

edited

Loading

grigi commented Jul 12, 2024

visch commented Jul 12, 2024

grigi commented Jul 18, 2024

feat: As a last resort, try inserting TEXT into a UUID column #387

feat: As a last resort, try inserting TEXT into a UUID column #387

Conversation

grigi commented Jul 12, 2024

edgarrmondragon commented Jul 12, 2024

grigi commented Jul 12, 2024

edgarrmondragon commented Jul 12, 2024 • edited Loading

grigi commented Jul 12, 2024

visch commented Jul 12, 2024

grigi commented Jul 18, 2024

edgarrmondragon commented Jul 12, 2024 •

edited

Loading