Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pydeephaven clients can get a RST_STREAM with CANCELLED error in some cases #5996

Closed
jmao-denver opened this issue Aug 28, 2024 · 3 comments · Fixed by #6401, #6424 or #6478
Closed

pydeephaven clients can get a RST_STREAM with CANCELLED error in some cases #5996

jmao-denver opened this issue Aug 28, 2024 · 3 comments · Fixed by #6401, #6424 or #6478
Assignees
Labels
bug Something isn't working core Core development tasks
Milestone

Comments

@jmao-denver
Copy link
Contributor

jmao-denver commented Aug 28, 2024

Discovered during #5990

   def test_publish_fetch(self):
        plugin_client = self.session.plugin_client(self.session.exportable_objects["plot3"])
        self.assertIsNotNone(plugin_client)

        with self.subTest("Plugin object"):
            # First fetch the Plugin object, then publish it
            export_plugin_client = self.session.fetch(plugin_client)
            shared_ticket = SharedTicket.random_ticket()
            self.session.publish(export_plugin_client, shared_ticket)

            # Another session to use the shared Plugin object
            sub_session = Session()
            server_obj = ServerObject(type="Figure", ticket=shared_ticket)
            sub_plugin_client = sub_session.plugin_client(server_obj)
            payload, refs = next(sub_plugin_client.resp_stream)
            self.assertGreater(len(payload), 0)
            self.assertGreater(len(refs), 0)
            ref = refs[0]
            self.assertEqual(ref.type, "Table")
            sub_plugin_client.close()
            sub_session.close()

        with self.subTest("Fetchable in the Plugin object"):
            payload, refs = next(plugin_client.resp_stream)
            self.assertGreater(len(payload), 0)
            self.assertGreater(len(refs), 0)
            ref = refs[0]
            self.assertEqual(ref.type, "Table")
            fetched = ref.fetch()
            self.assertIsNotNone(fetched)
            self.assertEqual(fetched.size, 30)

            # Publish the fetchable
            shared_ticket = SharedTicket.random_ticket()
            self.session.publish(ref, shared_ticket)

            # Another session to use the shared fetchable
            sub_session = Session()
            sub_table = sub_session.fetch_table(shared_ticket)
            self.assertIsNotNone(sub_table)
            self.assertEqual(sub_table.size, 30)
            sub_session.close()

        with self.subTest("released Plugin object"):
            sub_session = Session()
            server_obj = ServerObject(type="Figure", ticket=shared_ticket)
            sub_plugin_client = sub_session.plugin_client(server_obj)
            self.session.release(export_plugin_client)
            with self.assertRaises(Exception):
                payload, refs = next(sub_plugin_client.resp_stream)
            sub_session.close()

        plugin_client.close()
SubTest error: Traceback (most recent call last):
  File "/Users/jianfengmao/.pyenv/versions/3.8.18/lib/python3.8/unittest/case.py", line 60, in testPartExecutor
    yield
  File "/Users/jianfengmao/.pyenv/versions/3.8.18/lib/python3.8/unittest/case.py", line 582, in subTest
    yield
  File "/Users/jianfengmao/git/deephaven-core/py/client/tests/test_plugin_client.py", line 64, in test_publish_fetch
    payload, refs = next(plugin_client.resp_stream)
  File "/Users/jianfengmao/git/deephaven-core/py/client/pydeephaven/experimental/plugin_client.py", line 128, in __next__
    resp = next(self.stream_resp)
  File "/Users/jianfengmao/Playground/venv3.8/lib/python3.8/site-packages/grpc/_channel.py", line 543, in __next__
    return self._next()
  File "/Users/jianfengmao/Playground/venv3.8/lib/python3.8/site-packages/grpc/_channel.py", line 952, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.CANCELLED
	details = "Received RST_STREAM with error code 8"
	debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:10000 {created_time:"2024-08-28T08:50:56.199878-06:00", grpc_status:1, grpc_message:"Received RST_STREAM with error code 8"}"
>

Note that this 8 error code is not a grpc status 8: Resource Exhausted, but an h2 RST_STREAM error code 8: Cancel.

It turns out that the Jetty server implementation is reading AsyncContext.complete() as not a call to end the stream as being finished correctly, but to send the CANCEL error code to the client. Likewise, the python client is the only gRPC client we're aware of that reads RST_STREAMs that arrive after the final end stream message and allow it to change the meaning of that last message.

Changing the implementation of gwt-servlet upstream does not cause any integration test failures in grpc-java, but that's not a surprise since only the python client sees this bug anyway. Our initial "fix" does however cause tomcat to start having this same bug, so that won't be an acceptable solution. We're going to open tickets with grpc-java, jetty, and tomcat, and hopefully get the servlet container providers on the same page before making a change to grpc-java itself.

In the meantime, for some cases of this, we could wrap python client calls with try/except and handle the CANCEL case as success. We can also investigate changing the behavior of specific grpc bidi streams so that this issue cannot happen, by forcing the client to half-close first.

@jmao-denver jmao-denver added bug Something isn't working triage labels Aug 28, 2024
@jmao-denver jmao-denver added this to the Backlog milestone Aug 28, 2024
@rcaudy rcaudy added core Core development tasks and removed triage labels Sep 25, 2024
jmao-denver added a commit that referenced this issue Sep 25, 2024
The skipped tests depends on a fix for #5996
@rcaudy
Copy link
Member

rcaudy commented Sep 27, 2024

I think we may want to consider two things here:

  1. Correct the tests to keep the publishing session open until the other session has exported the shared ticket.
  2. Making the error handling better.

nbauernfeind added a commit that referenced this issue Sep 30, 2024
…ef (#6155)

This is related to investigation in #5996. Swap the order of these tests to reduce
the probability of the py-grpc client from claiming the rpc failed.

One test was not testing the proper shared ticket; adjusted for correctness to test intended feature.

ObjectService change is to adopt the proper ExportObject dependency pattern. Otherwise 
correctness requires analysis/knowledge that the list of objects are all brand new server side 
exports.
@niloc132 niloc132 changed the title Read the original plot plugin stream gets a CANCELLED error if the plugin was shared with another session and was read by the other session first. pydeephaven clients can get a RST_STREAM with CANCELLED error in some cases Oct 10, 2024
niloc132 added a commit to niloc132/deephaven-core that referenced this issue Nov 19, 2024
niloc132 added a commit to niloc132/deephaven-core that referenced this issue Nov 19, 2024
@devinrsmith devinrsmith reopened this Nov 22, 2024
niloc132 added a commit that referenced this issue Nov 22, 2024
@niloc132
Copy link
Member

The original fix for this seems to break our grpc-web adapter, something to do with trailers being sent as a data frame instead of being flushed and closed with trailers. It appears to be on the server side though, not the client - we should ensure that trailers are being correctly written to other kinds of clients too.

We've also discussed adding tests with TLS - ideally we add another gradle docker flag so that all client tests can validate that they work correctly. JS API tests are somewhat more important to test this way, since we change transport implementations.

nbauernfeind pushed a commit to nbauernfeind/deephaven-core that referenced this issue Nov 29, 2024
…ven#6401)

This is a Jetty-specific workaround to avoid irritating the Python gRPC
client into failing calls that had already half-closed successfully.

See deephaven#6400
Fixes deephaven#5996
@niloc132
Copy link
Member

niloc132 commented Dec 5, 2024

Reopening, for two reasons:

  • The revert of the revert was in the commits for fix: Close Jetty h2 streams with RST_STREAM and no error code  #6424 but the final diff did not include it. Not quite clear why, perhaps because the revert in this PR was also separately applied to main. My mistake, I should have checked the diff more closely before submitting.
  • With the missing patch applied, jetty's AsyncContext default timeout is applying after any given stream has ended, causing a stack trace.
2024-12-05T16:30:09.605Z | qtp373682836-78      |  WARN | .e.j.u.t.QueuedThreadPool | Job failed

java.lang.IllegalStateException: COMPLETED
        at org.eclipse.jetty.server.HttpChannelState.sendError(HttpChannelState.java:911)
        at org.eclipse.jetty.server.HttpChannelState.nextAction(HttpChannelState.java:493)
        at org.eclipse.jetty.server.HttpChannelState.unhandle(HttpChannelState.java:418)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:650)
        at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:461)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
        at java.base/java.lang.Thread.run(Thread.java:1583)

Clients are not impacted by this, but it is very noisy in the server log for clients that specify a grpc-timeout header - namely the grpc health check that our docker images use.

@niloc132 niloc132 reopened this Dec 5, 2024
niloc132 added a commit to niloc132/deephaven-core that referenced this issue Dec 12, 2024
…eephaven#6478)

Changes from deephaven#6420 were supposed to be reverted in deephaven#6424, but apparently
git interpreted the branch's relationship with main differently than I
did.

This patch reapplies deephaven#6401, with an additional change to avoid noisy
stack traces from grpc clients that set `grpc-timeout`. At this time,
Deephaven doesn't handle timeouts nor do any default clients set
timeouts, so the impact here should be minimal.

Fixes deephaven#5996 
See deephaven#6400
devinrsmith pushed a commit that referenced this issue Dec 12, 2024
This is a Jetty-specific workaround to avoid irritating the Python gRPC
client into failing calls that had already half-closed successfully.

Since we're now using ServletOutputStream.close() in place of
AsyncContext.complete(), we need to apply the same wrapping for trailers
to close() - that is, when the stream is clsoed, we can't rely on the
servlet container sending our trailers because grpc-web trailers are
actually a DATA frame which must be explicitly written.

See #6400
Fixes #5996
Reapplies #6401
Backport #6424
Backport #6478
nbauernfeind pushed a commit to nbauernfeind/deephaven-core that referenced this issue Dec 16, 2024
niloc132 added a commit to niloc132/deephaven-core that referenced this issue Jan 17, 2025
…eephaven#6478)

Changes from deephaven#6420 were supposed to be reverted in deephaven#6424, but apparently
git interpreted the branch's relationship with main differently than I
did.

This patch reapplies deephaven#6401, with an additional change to avoid noisy
stack traces from grpc clients that set `grpc-timeout`. At this time,
Deephaven doesn't handle timeouts nor do any default clients set
timeouts, so the impact here should be minimal.

Fixes deephaven#5996
See deephaven#6400
niloc132 added a commit to niloc132/deephaven-core that referenced this issue Jan 18, 2025
…eephaven#6478)

Changes from deephaven#6420 were supposed to be reverted in deephaven#6424, but apparently
git interpreted the branch's relationship with main differently than I
did.

This patch reapplies deephaven#6401, with an additional change to avoid noisy
stack traces from grpc clients that set `grpc-timeout`. At this time,
Deephaven doesn't handle timeouts nor do any default clients set
timeouts, so the impact here should be minimal.

Fixes deephaven#5996
See deephaven#6400
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment