-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip duplicate upserts. #137
base: master
Are you sure you want to change the base?
Skip duplicate upserts. #137
Conversation
Hey @mmontagna, thanks for the PR! It's an interesting idea. Have you tried benchmarking it to see how much impact it has on performance? |
@ankane I haven't run anything too scientific. I am happy to run more benchmarks, just let me know what you want to see. I'll outline roughly what I've done so far below. When doing a sync from local postgres to local postgres it drops the time from ~8 seconds to overwrite a ~30MB table with 130k rows (~50 columns and 7 indices) from ~8 to ~2 seconds. When doing remote to local syncs it doesn't matter that much since the time is largely dominated by network transfer speeds. All my tests here have come out a wash but I expect for users with faster internet connections there will likely be an improvement. I've run some tests/monitored the target machines when syncing 2GB+ tables and while it doesn't save very much, if any immediate resource cost. It does prevent accumulating vacuum workload when running frequent syncs, which is really the motivating factor here. eg
|
Great, just tried it with pgbench locally and also see a significant improvement. pgbench -i -s 10 pgsync_bench1
pgbench -i -s 10 pgsync_bench2
pgsync pgbench_accounts --from pgsync_bench1 --to pgsync_bench2 --overwrite Latest version: ~21s This shouldn't change the result of the sync, so let's add it without an option. |
Alright. I've updated the PR to add it without an option. |
Don't merge this yet. There might be a breakage here around json columns |
Pushed a commit to fix that case. |
Good catch. I think it'll be an issue with any column types that don't support the equality operator (including custom types). I also think there may be an issue with Due to this, I'm not sure it's worth the extra complexity to support. |
Fair enough. I'm going to keep fiddling with this since it has the potential to vastly improve the performance/speed of some of our use-cases, but I won't expect you to merge anything upstream. You're right about the null issue, that won't effect correctness, I think, but does mean this is writing more than it needs to. |
Sounds good. fwiw, I think the null issue will affect correctness/cause it to sync less rows than desired, so you'll probably want to test different combinations of null in the source and destination. |
is there any update for this? :) |
When using the overwrite mode, pgsync does a blind upsert which ends up creating duplicate/new row versions for ALL rows involved in the sync regardless of whether or not any of the values in those rows have changed.
This is largely fine/functionally correct, but it's not ideal as it places additional write/transaction load on the receiving database which isn't strictly needed.
It might make sense to just make this behavior the default, but I haven't fully thought that through.
Any thoughts on the idea?