RCORE-2209 Treat completing a client reset as receiving a MARK message #7921

Closed
wants to merge 2 commits into from

@tgoyne tgoyne commented Jul 23, 2024

Client resets which did not recover any changes (either because changes were discarded, there was nothing to recover, or the recovered changesets became empty after merging) don't need to wait for a server round-trip to mark the reset as complete, as that round-trip merely consisted of sending a MARK to the server and waiting for a response. This partially reverts #6196 and fixes that bug by immediately removing the client reset tracker as part of the diff commit if there was nothing recovered.

Performing a client reset involves waiting for download completion and bringing the Realm file into the state it would have been in if it had completed downloading, so it should fire download completion handlers. Previously we did everything we would do on download completion except for this. Since we performed a wait for download completion after applying a client reset diff, the handlers would eventually get called, but the exact timing depended on server behavior, which is changing in QBSv2 (and the wait for download completion is removed by the above change).

coveralls-official bot commented Jul 23, 2024

Pull Request Test Coverage Report for Build thomas.goyne_494

Details

  • 428 of 471 (90.87%) changed or added relevant lines in 17 files are covered.
  • 117 unchanged lines in 22 files lost coverage.
  • Overall coverage decreased (-0.02%) to 91.106%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/realm/sync/noinst/client_history_impl.cpp | 7 | 8 | 87.5% |
| test/object-store/util/sync/sync_test_utils.cpp | 37 | 39 | 94.87% |
| src/realm/sync/noinst/pending_reset_store.cpp | 73 | 89 | 82.02% |
| test/object-store/sync/flx_sync.cpp | 178 | 202 | 88.12% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/realm/array_backlink.cpp | 1 | 91.38% |
| src/realm/dictionary.cpp | 1 | 85.16% |
| src/realm/query_engine.hpp | 1 | 93.94% |
| src/realm/sync/network/websocket.cpp | 1 | 72.43% |
| src/realm/sync/noinst/client_impl_base.cpp | 1 | 83.34% |
| src/realm/util/serializer.cpp | 1 | 90.43% |
| src/realm/uuid.cpp | 1 | 98.48% |
| test/test_dictionary.cpp | 1 | 99.83% |
| test/test_query2.cpp | 1 | 98.73% |
| src/realm/db.cpp | 2 | 92.63% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 2555: | -0.02% |
| Covered Lines: | 217435 |
| Relevant Lines: | 238662 |

💛 - Coveralls

@tgoyne tgoyne force-pushed the tg/download-completion-on-client-reset branch 4 times, most recently from 4cc2831 to a4cbf2f Compare July 31, 2024 23:47
Comment on lines 981 to 982
m_sending_session = sess;
m_sending = true;
Member Author

These need to be set before calling async_write_binary() to support the completion handler being called synchronously.

Contributor

this is a good catch, but i can't think of any issues it could have caused - was it causing some failures in your testing or something?

Member Author

The newly added test socket provider crashes without this change because it synchronously calls the completion handler.
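
The ordering issue discussed in this thread can be illustrated with a minimal sketch. `InlineSocket` and `Sender` are hypothetical stand-ins, not types from realm-core; the only assumption carried over from the thread is that a write's completion handler may run before the initiating call returns.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for a socket whose write may complete synchronously.
struct InlineSocket {
    void async_write(const std::string&, std::function<void()> on_done)
    {
        on_done(); // completes before async_write() returns
    }
};

// Hypothetical stand-in for the connection's send bookkeeping.
struct Sender {
    InlineSocket socket;
    bool sending = false;
    int completed = 0;

    void on_write_done()
    {
        // The handler relies on the state describing the write that just finished.
        assert(sending);
        sending = false;
        ++completed;
    }

    void send(const std::string& msg)
    {
        // Record the in-flight state first: the handler may run inside async_write().
        sending = true;
        socket.async_write(msg, [this] { on_write_done(); });
    }
};
```

If `sending = true` were assigned after `async_write()` instead, the inline handler would observe stale state and then have its reset overwritten, leaving `sending` stuck at `true`.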

@@ -53,14 +53,14 @@ bool MigrationStore::load_data(bool read_only)

auto tr = m_db->start_read();
// Start with a reader so it doesn't try to write until we are ready
- SyncMetadataSchemaVersionsReader schema_versions_reader(tr);
+ SyncMetadataSchemaVersionsReader schema_versions_reader(*tr);
Member Author

All of the changes to migration store, pending bootstrap store, sync metadata schema, and subscriptions are just secondary effects of making PendingResetStore::has_pending_reset() take a Group rather than a TransactionRef.

Contributor

aside from passing around a ref vs const ref to a shared_ptr, is there a reason why this changed from a Transaction to a Group?

Contributor

Aside from passing around a const ref instead of a const ref to a shared_ptr, is there a reason the parameter is a Group now and not a Transaction?

Contributor

aside from passing around a ref vs ptr, is there a reason why this changed from a Transaction to a Group?

Member Author

The root change is enabling PendingResetStore::has_pending_reset(realm->read_group()), which previously didn't work because the function expected a Transaction even though it didn't do anything which required a Transaction. All of these functions should have been taking a Group the whole time as they don't change the transaction state.
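
A rough sketch of why the wider parameter type helps. The `Group` and `Transaction` types below are simplified stand-ins, and the table name is made up for illustration; the sketch assumes only that a transaction is usable wherever a group is expected, which the `*tr` call sites in this diff suggest.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-ins; only the inheritance relationship matters here.
struct Group {
    std::set<std::string> tables;
    bool has_table(const std::string& name) const
    {
        return tables.count(name) != 0;
    }
};

struct Transaction : Group {
    void commit() {}
};

using TransactionRef = std::shared_ptr<Transaction>;

// Read-only query: it needs table access only, so it accepts any Group.
// "client_reset_metadata" is an illustrative name, not the real table name.
bool has_pending_reset(const Group& group)
{
    return group.has_table("client_reset_metadata");
}
```

A caller holding a `TransactionRef` passes `*tr`, while a caller that only has a read group can pass it directly, with no transaction required.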

@@ -308,6 +308,9 @@ StatusWith<std::shared_ptr<Realm>> async_open_realm(const Realm::Config& config)
std::shared_ptr<Realm> successfully_async_open_realm(const Realm::Config& config)
{
auto status = async_open_realm(config);
if (!status.is_ok()) {
Member Author

Unexpected errors here previously didn't log the error so it was annoying to debug.

@@ -1196,6 +1269,7 @@ TEST_CASE("flx: client reset", "[sync][flx][client reset][baas]") {
auto subs = realm->get_latest_subscription_set();
auto result = subs.get_state_change_notification(sync::SubscriptionSet::State::Complete).get();
CHECK(result == sync::SubscriptionSet::State::Complete);
SyncSession::OnlyForTesting::pause_async(*realm->sync_session()).get();
Member Author

If the server was sufficiently fast it could theoretically send the client reset we trigger below to the current sync session, which isn't what these tests want. Probably never actually happened in practice.

Contributor

FYI - in case you weren't aware, the client reset triggered by the command below just invalidates the file ident and the reset is not initiated until the session reconnects.

Comment on lines -286 to -295
else
tr->commit_and_continue_writing();
Member Author

I couldn't find anywhere that this was being used where this commit would do anything useful and it made the client reset tests which verified that exactly two commits were made more complicated.

Contributor

I'd also assume this is fine; if the caller had opened a write transaction, they will also commit it at some point.

{
// Write transaction required
REALM_ASSERT(wr_tr->get_transact_stage() == DB::TransactStage::transact_Writing);
auto reset_store = PendingResetStore::load_or_create_schema(wr_tr);
Member Author

Loading the schema here was kinda slow and it wasn't actually being used for anything.

@tgoyne tgoyne force-pushed the tg/download-completion-on-client-reset branch 2 times, most recently from 9204cfc to 65330eb Compare August 1, 2024 17:44
@tgoyne tgoyne marked this pull request as ready for review August 2, 2024 02:47
@tgoyne tgoyne requested review from jbreams and ironage August 2, 2024 02:47
@ironage ironage left a comment

Great improvements! LGTM.

CHANGELOG.md Outdated
@@ -6,6 +6,8 @@
### Fixed
* <How do the end-user experience this issue? what was the impact?> ([#????](https://github.com/realm/realm-core/issues/????), since v?.?.?)
* Sync client may report duplicate compensating write errors ([#7708](https://github.com/realm/realm-core/issues/7708), since v14.8.0).
Contributor

this fix was released yesterday so the following additions will have to be updated to go to the new section

{
std::vector<SyncMetadataTable> unified_schema_version_table_def{
{&m_table,
c_sync_internal_schemas_table,
{&m_schema_group_field, c_meta_schema_schema_group_field, type_String},
{{&m_version_field, c_meta_schema_version_field, type_Int}}}};

// Any type of transaction is allowed, including frozen and write, as long as it supports reading
REALM_ASSERT_EX(tr->get_transact_stage() != DB::transact_Ready, tr->get_transact_stage());
Contributor

Using a group everywhere makes the intent much more clear. 💯

return this;
}

TestClientReset* TestClientReset::expect_reset_error(std::optional<SyncError>& err)
Contributor

nice simplification 👍

@tgoyne tgoyne force-pushed the tg/download-completion-on-client-reset branch 3 times, most recently from 329c1c6 to 7e20b9f Compare August 5, 2024 16:27
// A socket provider which claims to always work, but when `disconnect = true`
// will actually drop all incoming and outgoing messages. This enables testing
// going offline at very specific points.
struct DisconnectingSocketProvider : sync::websocket::DefaultSocketProvider {
Contributor

more of a DiscardAllTrafficSocketProvider or something rather than a DisconnectingSocketProvider since it never actually disconnects you. Is there anything in this PR that would behave differently if there was an actual disconnect where we reset all the protocol state rather than just discarding messages?

Member Author

I don't love the name, but it's "disconnecting" in the sense of disconnecting a network cable between you and the server.

Resetting the protocol state would require tying the tests to implementation details of sync connections, while this approach lets us test it via the public API.

Contributor

more like disconnecting in the sense that an intermediate hop is dropping packets, but yeah. by "reset all the protocol state" i just meant calling websocket_closed_handler() to signal to the sync client that the connection has been closed. i think without some reworking of how the sync client handles closed connections this could be kinda tough though. I think the answer to my question is that there aren't any changes that depend on actually disconnecting the session since client_reset_if_needed() doesn't depend on having any of the Session's previous state be correct.

Member Author

For these tests I do just want to be absolutely sure that no synchronization has happened after the download of the fresh realm has completed (until it's time to allow sync to happen again) and the exact details are unimportant.

Member Author

I've reworked this type to call websocket_closed_handler() (or defer the call to DefaultSocketProvider::connect()) as it turned out that dropping packets really didn't work outside of simple cases.

@@ -947,6 +948,19 @@ void ClientHistory::update_sync_progress(const SyncProgress& progress, Downloada
root.set(s_progress_uploaded_bytes_iip,
RefOrTagged::make_tagged(uploaded_bytes)); // Throws

if (previous_upload_client_version < progress.upload.client_version) {
Contributor

does this assume that if we make any upload progress that we'll have fully uploaded all changes and we know for sure we aren't going to get another client reset from any recovered changesets? maybe now that we have compensating writes that doesn't really matter as much?

Member Author

Uploading changesets should either result in the server acknowledging the upload or sending a client reset and not both, so once our UPLOAD is acked the window for getting a client reset due to those changesets being invalid has ended.

I think this check is probably wrong and it needs to actually be checking if we've reached the client version at the time of the client reset (or at time of opening).

Member Author

It took a while to figure out how to test it but this is indeed incorrect; it marks the client reset as complete as soon as any changesets are acked rather than when all of the recovered changesets are.

@tgoyne tgoyne force-pushed the tg/download-completion-on-client-reset branch 2 times, most recently from bf89359 to c78df18 Compare August 9, 2024 03:58
@tgoyne tgoyne changed the base branch from master to tg/unuploaded-changesets August 9, 2024 03:59
@tgoyne tgoyne force-pushed the tg/download-completion-on-client-reset branch from c78df18 to edc1174 Compare August 9, 2024 16:05
Member Author

tgoyne commented Aug 9, 2024

I've updated this to track which client version was the last one recovered by a client reset and mark the client reset as complete once that version has been uploaded (and acked). This is sort of a fake bug fix relative to the behavior in practice with PBS and QBSv1. There are some extreme edge cases that now work better (e.g. recovered changesets are successfully uploaded, then device goes offline before receiving the MARK and stays offline until the client file ident expires on the server), but the primary benefit is that the behavior of an async open that triggers a client reset no longer depends on how the server handles a MARK sent while uploading changesets. With PBS/QBSv1 the MARK doesn't wait for the server to have processed those UPLOADs, and with QBSv2 it (sometimes?) does. By not relying on MARK for marking client resets complete, we preserve the existing visible behavior for client resets.
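
The completion rule described here could look roughly like the following. This is a sketch under stated assumptions, not the code in the PR: `PendingReset`, its fields, and its method names are all hypothetical, and upload acknowledgement is reduced to a single version number.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using version_type = std::uint64_t;

// Hypothetical tracker; the field and method names are not from the PR.
struct PendingReset {
    // Last client version produced by recovery; nullopt means nothing was recovered.
    std::optional<version_type> last_recovered_version;
    bool complete = false;

    // Called when the client reset diff is committed.
    void on_diff_committed()
    {
        if (!last_recovered_version)
            complete = true; // nothing to upload, so no server round-trip is needed
    }

    // Called when the server acknowledges uploads up to `acked_version`.
    void on_upload_acked(version_type acked_version)
    {
        // Acking just any changeset is not enough; the ack must reach the
        // last version that the reset recovered.
        if (last_recovered_version && acked_version >= *last_recovered_version)
            complete = true;
    }
};
```

Comparing against the last recovered version, rather than checking that upload progress merely advanced, is what distinguishes this from the earlier check that the review thread found to be incorrect.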

@tgoyne tgoyne force-pushed the tg/unuploaded-changesets branch from 9e43725 to 9252ccc Compare August 9, 2024 17:56
Base automatically changed from tg/unuploaded-changesets to master August 9, 2024 18:49
@tgoyne tgoyne force-pushed the tg/download-completion-on-client-reset branch 2 times, most recently from 0857d0e to 80604f8 Compare August 9, 2024 19:31
@tgoyne tgoyne closed this Oct 16, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2024
4 participants