Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(store-sync): skip invalid utf-8 characters in strings before inserting into postgres #3562

Merged
merged 3 commits into from
Feb 4, 2025

Conversation

alvrs
Copy link
Member

@alvrs alvrs commented Feb 4, 2025

Currently an invalid UTF-8 character in a string type column makes the decoded postgres indexer error with PostgresError: invalid byte sequence for encoding "UTF8": 0x00. This PR adds a cleanup step before upserting values into postgres to remove 0x00 characters.

Copy link

changeset-bot bot commented Feb 4, 2025

🦋 Changeset detected

Latest commit: 0a524cd

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 30 packages
Name Type
@latticexyz/store-sync Patch
@latticexyz/cli Patch
@latticexyz/dev-tools Patch
@latticexyz/explorer Patch
@latticexyz/store-indexer Patch
@latticexyz/world-module-erc20 Patch
@latticexyz/world-modules Patch
@latticexyz/abi-ts Patch
@latticexyz/block-logs-stream Patch
@latticexyz/common Patch
@latticexyz/config Patch
create-mud Patch
@latticexyz/entrykit Patch
@latticexyz/faucet Patch
@latticexyz/gas-report Patch
@latticexyz/paymaster Patch
@latticexyz/protocol-parser Patch
@latticexyz/react Patch
@latticexyz/recs Patch
@latticexyz/schema-type Patch
solhint-config-mud Patch
solhint-plugin-mud Patch
@latticexyz/stash Patch
@latticexyz/store-consumer Patch
@latticexyz/store Patch
@latticexyz/utils Patch
vite-plugin-mud Patch
@latticexyz/world-module-callwithsignature Patch
@latticexyz/world-module-metadata Patch
@latticexyz/world Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@frolic
Copy link
Member

frolic commented Feb 4, 2025

wouldn't this corrupt the data if a \0 is in the middle of the string somewhere?

@frolic
Copy link
Member

frolic commented Feb 4, 2025

some interesting discussion here around this problem: ponder-sh/ponder#1456

@alvrs
Copy link
Member Author

alvrs commented Feb 4, 2025

wouldn't this corrupt the data if a \0 is in the middle of the string somewhere?

Postgres doesn't support \x00 bytes in strings, so the only alternative to removing the bytes here is skipping the entire record or throwing an error, both of which are worse my opinion (especially because this is often user provided data, so a user could make the indexer break for everyone by providing a malformed string).

Seems like the discussion in ponder-sh/ponder#1456 also tends towards this behaviour as default. I don't think we need to add a flag to disable this behaviour until someone requests it.

@@ -0,0 +1,10 @@
import { SchemaToPrimitives, ValueSchema } from "@latticexyz/protocol-parser/internal";

export function cleanStrings<TSchema extends ValueSchema>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Ponder's naming

Suggested change
export function cleanStrings<TSchema extends ValueSchema>(
export function removeNullCharacters<TSchema extends ValueSchema>(

encodedLengths: record.encodedLengths ?? "0x",
dynamicData: record.dynamicData ?? "0x",
}),
);
Copy link
Member

@frolic frolic Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on doing this call just before we pass into postgres, rather than here, in case we later need to use value for something?

e.g.

 await tx
    .insert(sqlTable)
    .values({
      __keyBytes: keyBytes,
      __lastUpdatedBlockNumber: blockNumber,
      ...key,
      ...removeNullCharacters(value),
    })

@frolic
Copy link
Member

frolic commented Feb 4, 2025

ohh I was worried that these fields were being used to re-encode the data before mutating via event, but I don't think that's a concern here because we still store the raw bytes and use that as the source of truth

@alvrs alvrs merged commit df5d393 into main Feb 4, 2025
16 checks passed
@alvrs alvrs deleted the alvrs/indexer-fix branch February 4, 2025 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants