Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework tf listener wrapper #35

Merged

Conversation

mhidalgo-bdai
Copy link
Contributor

This PR reworks TFListenerWrapper and futures' APIs to prevent the user from running into deadlocks. Builds on #32.

@amessing-bdai
Copy link
Collaborator

What are your thoughts about a flag (defaulting to off) for a process-wide tf listener? One that a user could access similar to the process-wide node or executor?

@mhidalgo-bdai
Copy link
Contributor Author

What are your thoughts about a flag (defaulting to off) for a process-wide tf listener? One that a user could access similar to the process-wide node or executor?

Sounds useful.

Copy link
Contributor

@jbarry-bdai jbarry-bdai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looked at tf listener wrapper

Comment on lines 77 to 78
if not wait_for_frames:
timeout_sec = 0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_for_frames is about whether a path exists from frame a to frame b. if the path doesn't exist, it should return immediately if wait_for_frames is true. however if the path does exist then it should wait timeout for the transform to come in at the requested time. I think the way this is written it will return immediately if wait_for_frames is true and transform_time is beyond what's in the tf buffer even if the path exists from frame a to frame b. this is why there was a loop here previously instead of using the built in functions.

this is useful for code that may be adding frames later in the processing and therefore if the path doesn't exist, waiting isn't going to help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the path doesn't exist, it should return immediately if wait_for_frames is true.

Hmm, that is not what the original implementation does.

I think the way this is written it will return immediately if wait_for_frames is true and transform_time is beyond what's in the tf buffer even if the path exists from frame a to frame b.

Hmm, it shouldn't. Didn't we test this a couple weeks ago? In fact, it's because it waits that the test_future_transform_wait test isn't flaky due to races.

this is useful for code that may be adding frames later in the processing and therefore if the path doesn't exist, waiting isn't going to help.

I see. It's a bit of an odd use case, as this is a problem you would run into only if you had a single-threaded, synchronous program where statements are not ordered properly. That said, we can make it behave that way. Are you sure you want it to behave that way? It will expose us to races when using lookups not guarded by waits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow ok I did not write this correctly:

If wait_for_frames is False then this should return immediately if a path from frame_a to frame_b does not exist. Otherwise it should wait for the transform to exist until the timeout (or forever if the timeout is None), which I think is the normal thing the underlying TF stuff does.

We could also potentially drop this argument as I'm not really sure anyone is using it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, FTR I got the wait_for_frames semantics wrong the first time. It is not redundant with timeout_sec, it changes the scope of the wait to include (and exclude) lookups on frames for which a path does not exist yet. Fixed in 25bc002. Thanks for the rubber ducking @jbarry-bdai !

@mhidalgo-bdai mhidalgo-bdai force-pushed the rework-tf-listener-wrapper branch from 8c49fc7 to 561ada3 Compare October 18, 2023 18:02
Signed-off-by: Michel Hidalgo <[email protected]>
@@ -101,7 +101,7 @@ def test_future_transform_extrapolation_exception(
time.sleep(0.2)
timestamp = ros.node.get_clock().now()
with pytest.raises(ExtrapolationException):
tf_listener.lookup_a_tform_b(FRAME_ID, CHILD_FRAME_ID, timestamp)
tf_listener.lookup_a_tform_b(FRAME_ID, CHILD_FRAME_ID, timestamp, timeout_sec=0.0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbarry-bdai note this is necessary now as None for a timeout is always treated as infinity. Original code aliased None with 0 when wait_for_frames was False.

cache_time_s: Optional[int] = None,
) -> None:
class TFListenerWrapper:
def __init__(self, node: Optional[Node] = None, cache_time_s: Optional[float] = None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happeneed to the wait for init part? I do know people were using that...

Copy link
Contributor Author

@mhidalgo-bdai mhidalgo-bdai Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, quite a bit in fact (and not always correctly e.g. here we pass three frames). It is a problematic pattern, though. When the wrapper is bound to a node that is spinning, all is fine with waiting on a constructor. When the wrapper gets instantiated from within the constructor of the node it is bound to, it hangs, because the node can't be put to spin until it's fully constructed.

Eliminating that pattern was the original motivation behind this PR. You can still wait, but you have do to so explicitly after construction. Migrated code already accounts for this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Could we add a comment here explaining how to do that? The reason we added this was because people were constantly getting exceptions thrown from trying to use the listener too soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fc71daa. I added documentation, including basic examples, and it will also now generate warnings if you are about to shoot yourself in the foot. PTAL!

Returns:
The transform frame_a_t_frame_b at the time specified.
Raises:
All the possible TransformExceptions.
"""
if not wait_for_frames:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic now looks correct to me

Returns:
The timestamp from the latest recorded transform frame_a_t_frame_b
Raises:
All the possible TransformExceptions.
"""
transform = self._internal_lookup_a_tform_b(frame_a, frame_b, None, timeout, wait_for_frames)
transform = self.lookup_a_tform_b(frame_a, frame_b, timeout_sec, wait_for_frames)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw self._tf_buffer.get_latest_common_time(frame_a, frame_b) above. Is there a difference in the result from doing that verses this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version can wait for frames to come into existence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bug on line 119 right there 👀 Fixed it and added a test (we didn't have any for this method) in 49ee392.

Copy link
Collaborator

@bhung-bdai bhung-bdai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me, but I would want a second opinion on my concern about possible race conditions.


future.add_done_callback(done_callback)
return event.wait(timeout=timeout_sec)
context.on_shutdown(event.set)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the context has multiple events set and shuts down? Is there a possibility where two events can be waiting on each other and deadlock?

Copy link
Contributor Author

@mhidalgo-bdai mhidalgo-bdai Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the context has multiple events, all of them will be set. Events do not wait on each other. Callees wait on events. Callees could run into deadlock if, for example, they have to complete each other futures or you somehow manage to wait on a future within an on context shutdown handler (which I'm not even sure if the Context API will allow).

So I would say deadlocks are a non-issue here inasmuch as it won't happen by accident unless code is flawed.

@jbarry-bdai
Copy link
Contributor

I'm happy with the TF listener changes but haven't looked at the other files.

@mhidalgo-bdai mhidalgo-bdai merged commit d907d62 into bdaiinstitute:main Oct 19, 2023
2 checks passed
@mhidalgo-bdai mhidalgo-bdai deleted the rework-tf-listener-wrapper branch October 19, 2023 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants