
feat: Add created_at and updated_at to all tables #3225

Open
MartinquaXD opened this issue Jan 8, 2025 · 4 comments

Comments

@MartinquaXD
Contributor

Problem

This was sparked by a discussion regarding data analytics efforts. Mirroring the contents of a DB is easier if these fields are available. It's also generally nice to have timestamps for all sorts of data.

Suggested solution

Implement a DB migration that adds triggers to all tables which set created_at once when a new row gets created and updated_at whenever the row gets modified.
That means whenever a new row gets created, created_at and updated_at would be initialized to the same value.
Doing it in the DB means we don't have to adjust any Rust code, and it's probably less error-prone as well.
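
A minimal sketch of what such a migration could look like, assuming Postgres; the table name `orders` is just an illustration, and the real migration would apply the same pattern to every table:

```sql
-- Sketch only: `orders` stands in for any table; names are illustrative.
-- Adding the columns with a DEFAULT also backfills all existing rows
-- with the timestamp at which the migration runs.
ALTER TABLE orders
    ADD COLUMN created_at timestamptz NOT NULL DEFAULT now(),
    ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now();

-- Shared trigger function that bumps updated_at on every modification.
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_set_updated_at
    BEFORE UPDATE ON orders
    FOR EACH ROW
    EXECUTE FUNCTION set_updated_at();
```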

Acceptance criteria

All tables receive created_at and updated_at columns storing timestamps of the respective event.
All historic data gets backfilled with the timestamp at which the migration was applied.

@sunce86
Contributor

sunce86 commented Jan 8, 2025

Can you give us more details on why exactly this is needed?

I'd like us to challenge every decision to add data to the backend database that is not directly used by backend services (unless it's really necessary, which does seem to be the case here). Otherwise, we keep adding dependencies for other teams that use our database directly, and it gets increasingly harder to make backend changes without breaking other things.

@MartinquaXD
Contributor Author

Overall, I personally feel like created_at and updated_at (the latter less so) are generally things that make sense to have in any DB. Besides that, this suggestion came up to aid the effort of creating a unified and reliable data pipeline for any analytics needs. These fields can be used to optimize synchronizing the backend and analytics DBs after we apply a migration to the backend DB.

@fleupold
Contributor

fleupold commented Jan 8, 2025

This has come up multiple times in the past (e.g. when discussing data bucketing by month, which requires a timestamp for the solver rewards accounting script to know into which bucket a certain settlement falls, and recently when adjusting the Tenderly web3 action).

I believe the solver team has now built a workaround to fetch the timestamp for trades, and in the web3 action we have to add a ton of network-specific logic in order to express something like "check if the trade was less than 24h ago".

I agree with your worry that there is risk in adding domain-specific columns which other teams depend on and that make it harder for us to refactor later. However, the timestamps of when a DB row was created (and updated) are not domain-specific and don't contribute to the risk you mentioned (there are DBs which store this metadata by default; unfortunately, Postgres is not one of them).

@ferrau10

ferrau10 commented Jan 9, 2025

For the data pipeline this would be really useful: we copy data from the backend DB to the analytics DB. We will do a one-time full load of all the tables we use in the pipeline and then automate incremental loads using dune sync v2. We could of course look at the index of the source table and only copy data when the index does not exist in the target table, but this can be quite heavy for large datasets, and we need to be sure each table has an index. And in case a row is updated but the index does not change, we would not catch the change. We could also look at tables that have columns that increase, like block number, and only load the rows where the block number is higher than the max block number we have in the analytics DB. But if there is ever a block that somehow gets added to the backend DB later on, we would not catch that. And, as far as I know, not all tables have a column that increases linearly.
Generally, loading the rows from a source table into a target table where the row's updated_at is higher than the max(updated_at) of the target table is very easy and reliable, and it also simplifies the process as it's the same for all tables.
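
For illustration, a minimal sketch of such an incremental load, assuming both tables are reachable from one session (e.g. via a foreign data wrapper); the schema and table names are hypothetical:

```sql
-- Sketch only: backend.orders and analytics.orders are hypothetical names.
-- Copies every row that changed since the last sync into the analytics DB.
INSERT INTO analytics.orders
SELECT src.*
FROM backend.orders AS src
WHERE src.updated_at > (
    SELECT coalesce(max(updated_at), '-infinity') FROM analytics.orders
);
```

A real pipeline would upsert on each table's primary key so that updated rows replace their earlier copies instead of being inserted twice.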
