-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent stale command subscription entries in Device Connection Service #1858
Comments
I think the TTD defines the timeout of such a mapping. Why not hand over the responsibility to the device connection service, to remove this latest this point in time (now+ttd). Infinispan does have a TTL mechanism, and I guess for every SQL based backend you could easily implement a cleanup job like this. |
@ctron yes, letting Infinispan do the work of actually deleting the entries would be the idea, and the TTD would define the lifespan of the entries - if there is a TTD. In the case of MQTT where there is no TTD, we could use a predefined expiry value for the device connection service entries, defined so that the automatic re-creation interval for entries of still valid command subscriptions is less than that. |
This makes it possible to restrict the lifespan of entries set via the setCommandHandlingAdapterInstance method of the Device Connection API. Signed-off-by: Carsten Lohmann <[email protected]>
This makes it possible to restrict the lifespan of entries set via the setCommandHandlingAdapterInstance method of the Device Connection API. Signed-off-by: Carsten Lohmann <[email protected]>
This makes it possible to restrict the lifespan of entries set via the setCommandHandlingAdapterInstance method of the Device Connection API. Signed-off-by: Carsten Lohmann <[email protected]>
This makes it possible to restrict the lifespan of entries set via the setCommandHandlingAdapterInstance method of the Device Connection API. Signed-off-by: Carsten Lohmann <[email protected]>
This makes it possible to restrict the lifespan of entries set via the setCommandHandlingAdapterInstance method of the Device Connection API. Signed-off-by: Carsten Lohmann <[email protected]>
Signed-off-by: Carsten Lohmann <[email protected]>
PR #1916 has been created for using the ttd as the lifespan of the mapping entry in the device connection service. For the cases where no ttd is given and a periodic refresh of the mapping entry shall be done, a new Device Connection API method is needed for the refresh operation, performing a conditional update. |
Can't we simply add an (optional) flag to the setCommandHandlingAdapterInstance operation indicating whether the request is to be handled as an update only? |
I had thought about this, using If we name the flag |
…tations. Signed-off-by: Carsten Lohmann <[email protected]>
Also change the type of the lifespan parameter from int to Duration in the methods of the DeviceConnectionInfo, DeviceConnectionClient and DeviceConnectionService interfaces. Signed-off-by: Carsten Lohmann <[email protected]>
Also change the type of the lifespan parameter from int to Duration in the methods of the DeviceConnectionInfo, DeviceConnectionClient and DeviceConnectionService interfaces. Signed-off-by: Carsten Lohmann <[email protected]>
Also change the type of the lifespan parameter from int to Duration in the methods of the DeviceConnectionInfo, DeviceConnectionClient and DeviceConnectionService interfaces. Signed-off-by: Carsten Lohmann <[email protected]>
…red. With the 'lifespan' parameter of the 'setCommandHandlingAdapterInstance' operation being mandatory to implement now for Device Connection API implementations, the explicit removal of a command handling adapter instance entry can be skipped if the entry has expired. Signed-off-by: Carsten Lohmann <[email protected]>
Signed-off-by: Carsten Lohmann <[email protected]>
…as error. Command handling adapter instance entries may expire. Therefore, getting a NOT FOUND result when invoking removeCommandHandlingAdapterInstance() shouldn't be treated as an error. Signed-off-by: Carsten Lohmann <[email protected]>
Command handling adapter instance entries may expire. Therefore, getting a NOT FOUND result when invoking removeCommandHandlingAdapterInstance() shouldn't be treated as an error. Signed-off-by: Carsten Lohmann <[email protected]>
…erInstance. Signed-off-by: Carsten Lohmann <[email protected]>
…erInstance. Signed-off-by: Carsten Lohmann <[email protected]>
…erInstance. Signed-off-by: Carsten Lohmann <[email protected]>
…erInstance. Signed-off-by: Carsten Lohmann <[email protected]>
…erInstance. Signed-off-by: Carsten Lohmann <[email protected]>
…erInstance. Signed-off-by: Carsten Lohmann <[email protected]>
Signed-off-by: Carsten Lohmann <[email protected]>
In the Hono community call last Thursday there was the question, whether we can replace the periodic adapter instance mapping entry refresh call (with its associated overhead), with something, where the stale mapping entry gets deleted when forwarding a command to that adapter instance fails. Here's a closer look at the scenario: But, consider another scenario (featuring a qdrouter mesh): Removing the adapter instance mapping entry would be a mistake here - all following commands would get released without the device getting a chance to reestablish the command subscription. To mitigate this, an increased delay with additional checks for new credits before actually removing the mapping entry could be used. Accidentally removed mapping entries can't be prevented 100% in this case though. Another solution would be the following: When a protocol adapter gets shut down, it could call An "offline" marker created by I would have a better feeling using that 2nd solution with the offline markers, preventing accidental removals of mapping entries (however rare these might occur). _ slightly differently.Instead of "setAdapterInstanceOffline", we add "setAdapterInstanceStatus` with the possible states "online", "offline" and "suspected", implemented as a cache with key "adapter instance" and "online"/"suspected" as value. "Suspected" entries get deleted after an associated lifespan and thereby become "offline". A call to "getCommandHandlerAdapterInstances" will exclude entries with "suspected" state from the result and will delete an entry with an "offline" (ie. non-existant) state. Only downside I see here is the added adapter instance status check on every |
@calohmn thanks for the detailed analysis. I didn't think about the case where another router instance in a mesh would crash. I agree that we cannot distinguish these cases properly. |
Stale records would first be excluded from That means, that for command subscriptions without a ttd, no lifespan and no periodic updates/refresh operations (triggered by the protocol adapter that initiated the subscription) are needed. Basically, the advantage of the 2nd solution is that with one device connection api request, all mapping entries gone stale because of a killed adapter instance can get deactivated. Each stale mapping entry doesn't have to get removed individually by a protocol adapter that gets a "no credit" error. |
Understood. Sounds good to me. So, we would undo the addition of the lifespan parameter to the setCommandHandlingAdapterInstance operation and instead introduce a new (mandatory) operation setAdapterInstanceStatus to the Device Connection API, right? |
As I wrote above, I think the lifespan parameter for What could be undone here would be the addition of the
Yes. Implementors wouldn't have to store the "online" state though, only a set of "offline" adapter instances is needed (I thought "setAdapterInstanceOffline" would make this more clear, but that distinction is probably not needed from an API user viewpoint). |
@calohmn @sophokles73 I think proposed solution would work, but it introduces quite a complexity and I'm not sure if the problem it solves is that big. With having only the existing methods we can do the following: If adapter is not connected to the AMQP network, devices connected to it are unreachable. I think it's OK in that situation to disconnect all connection based devices (MQTT for example) anyways. I would also think that it'd be OK to let non-connection based devices be unreachable for the rest of their TTD (this shouldn't be a long period anyway). Next time they connect, they will be remapped or the connection will be refused (if AMQP network is still unreachable). If making these devices "offline" for the rest of their TTD is problem, an adapter could maybe cache this information and remap them on reconnect. Anyways, I just wanted to see your opinions on all this before proceeding with the proposed solution. |
I see these pros and cons here:
All in all, I would rather not do this. If we want to choose an approach without the lifespan and periodic mapping entry updates, and without the added "adapter instance offline" marker, the following solution would avoid the disconnects from the devices in most cases: This will cause fewer device disconnects. Still, there are of course potentially many device connection API requests done here every time the connection to the qdrouter is lost. That's something I would rather want to avoid. So, overall, I still consider the solution with the "adapter instance offline" marker to be the cleaner and preferable one. |
Ok, here's an updated "adapter instance online/offline status" approach with reduced complexity (no cleanup job with potentially expensive traversal over all tenants and devices, no influence on
If the "no credit" error was due to a Qdrouter outage:
With that approach, there's only trivial implementation effort for implementors of the device connection API. Potentially expensive operations (validating adapter instance mapping entries for many consumers), or operations noticed by the devices (disconnecting the device connections), are only done in presumably rare cases. |
@calohmn I'm generally OK with this, but here are a couple more thoughts to consider. I still see relatively limited value in Maybe a better optimization would be to introduce Of course, we can do both if it makes sense. |
@dejanb With About An alternative idea:
I admit this doesn't look like a particularly pretty solution (other suggestions concerning the names are welcome :) ). But overall, I see this solution as having the least performance impact. The cleaner solution I still think is the one above where offline/suspected adapter instances get excluded from the |
@calohmn I started looking to all this from the beginning today and I would propose to hold a meeting to discuss it. As it we'll probably get to the solution faster. As some folks are offline this week, how about to do it early next week (Monday)? Some things on my mind right now are: I'm not sure we're supposed to deal with this problem at all at this level. Adapter having an intermittent connection with either AMQP network or device connection service is expected. And things work properly right now and are "eventually consistent" as they should be. For example, the original problem of having an issue of removing the subscription from the service. The future commands to this device will fail anyways as the device is disconnected anyways. So the fact we temporarily have a "stale entry" doesn't affect the functioning of the system. When device reconnects, this will be fixed. The only issue is that device can be deleted in the meantime, so that this entry needs to be cleaned up somehow eventually, but I wouldn't put a burden of that job on the adapters in any way. That could also happen when the device is deleted from the registry but still connected. So maybe that's something we need to handle. I think the bigger issue is that we currently have (I might be missing something) is that we don't have a way to deal with the situation when adapter is killed unexpectedly. That will leave a bunch of stale data in the service. One solution to that would be to introduce adapter status updates like proposed. But to me it seems like the job of the platform and we should be able to either provide a hook to be notified of such an event or do checks internally periodically. In regard with all this, I don't think we need to "prevent the service to return stale entries" as that does not change the behaviour of the system in any way. I would focus instead on providing a way to enable other services and/or platform on notifying this service when data needs to be updated because of the external events. Or provide a way for the service to do this periodically on its own. The other common theme in all these solutions is that we need to notify somehow the service that it has a stale information after we fail to send a command. I think if we solve the first one, we don't need this as it should be a "normal system state" from time to time. It could help however if we want to use it to trigger some kind of a cleanup or a check within the service (if we find it necessary). So, the equivalent "atticLifespan" idea would be to just have Anyhow, I hope this make at least some sense :) As I said, I think it'd be very beneficial to talk through it as we'll figure it out faster. Let me know what you think. |
Yes, sure. Monday would be fine. Let's discuss details on Gitter.
The original imagined scenario that triggered this issue here was one, where things didn't get fixed: Good points raised in your comment, how to address stale entries in general. Let's discuss them in the meeting then. |
@calohmn Thanks for the description of the scenario. That was really helpful. One question to further clarify all requirements: So the problem exists for the devices that can connect both directly and using gateways. I would assume however that they can't be connected both ways at the same time? Or that is expected scenario as well? I don't see anything that would prevent that in the current APIs. If that's not expected, one solution could be that the In other case, that's not a solution for this particular scenario. |
A device can indeed be connected both ways at the same time. It could send telemetry/events via its gateway and at the same time receive commands on a direct connection to the adapter.
Even if we change the above described behaviour, clearing the mapping entry would mean that we would also somehow have to inform the adapter, which created the direct adapter instance mapping, to remove the command subscription link to the device then. How to do that? |
…apterInstance. This parameter won't be needed anymore. Signed-off-by: Carsten Lohmann <[email protected]>
This parameter won't be needed anymore. Signed-off-by: Carsten Lohmann <[email protected]>
Scenario:
A device, that is normally only interacting via a gateway, has done a
ttd
HTTP request to the Hono HTTP protocol adapter in order to receive a command message.Having received a command message, there is some error when invoking the
removeCommandHandlingAdapterInstance
method in the Device Connection Service, as part of the protocol adapter removing the command subscription. This leads to a stalecommandHandlingAdapterInstance
entry concerning the device in the Device Connection Service.Now, for all subsequent command subscriptions only done by the gateway of the device, there is the issue that the stale
commandHandlingAdapterInstance
entry will effectively prevent commands from getting delivered to the gateway. This is due to the device-specificcommandHandlingAdapterInstance
entry getting precedence.Only a subsequent created and removed command subscription for the specific device (not the gateway), where the
removeCommandHandlingAdapterInstance
method succeeds, will mitigate such a situation.To prevent or automatically resolve such a situation, this looks like a straightforward solution:
removeCommandHandlingAdapterInstance
invocation for as long as it fails.That is easy to implement but doesn't necessarily prevent the above problem if the adapter is restarted while the
commandHandlingAdapterInstance
entry hasn't been successfully removed yet.Therefore it looks like we need to implement the following instead (or as well):
commandHandlingAdapterInstance
entries expire after a certain time and implement periodic refresh requests from the protocol adapter while a subscription is still active.The text was updated successfully, but these errors were encountered: