Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JPERF-716 background process results #13

Merged

Conversation

dagguh
Copy link
Contributor

@dagguh dagguh commented Dec 11, 2020

This blocks investigating flaky https://github.com/atlassian/infrastructure integration tests.
And those block my ShipIt project 😉
I plan to continue this stream in Innovation Time and next ShipIt.

Pros:
* can get exit code, stdout, stderr
* can `stop` even if the process died earlier
* doesn't assume or require `screen` to be installed
* conveys the need to release an external resource via `AutoCloseable`
* doesn't use unreliable Kotlin `internal` to hide members

Cons:
* maintains a connection since start to stop
* could accumulate unnecessarily big stdout - but it can be fixed later
@dagguh dagguh requested a review from a team as a code owner December 11, 2020 14:34
@@ -11,6 +11,7 @@ import java.util.concurrent.TimeUnit
*
* @see [SshConnection.stopProcess]
*/
@Deprecated(message = "Use BackgroundProcess instead")
class DetachedProcess private constructor(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do it in DetachedProcess, but none of the approaches seemed to satisfy all requirements. So I wrote the tests, which guard the requirements. Maybe someone will come in the future and will be able to make them work with DetachedProcess.

But for now, I had to resign from the requirement of starting/stopping on different Session.Commands.

@dagguh
Copy link
Contributor Author

dagguh commented Dec 14, 2020

This PR is easier to read per-commit

Copy link
Contributor

@mzyromski-atlassian mzyromski-atlassian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dagguh, great work! Sorry for reviewing with such a delay.

@dagguh
Copy link
Contributor Author

dagguh commented Jan 7, 2021

No worries mate!

@dagguh dagguh merged commit 28a0c96 into atlassian:master Jan 7, 2021
@dagguh dagguh deleted the issue/JPERF-716-background-process-results branch January 7, 2021 08:48
installPing(sshHost)

val fail = sshHost.runInBackground("nonexistent-command")
val failResult = fail.stop(Duration.ofMillis(20))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line flakes when sometimes it doesn't get the process' exit status. I'm not sure what's the condition.

java.lang.IllegalStateException: command.exitStatus must not be null
	at com.atlassian.performance.tools.ssh.WaitingCommand.waitForResult(WaitingCommand.kt:24)
	at com.atlassian.performance.tools.ssh.SshjBackgroundProcess.stop(SshjBackgroundProcess.kt:20)
	at com.atlassian.performance.tools.ssh.api.SshTest$shouldTolerateEarlyFinish$1.invoke(SshTest.kt:61)
	at com.atlassian.performance.tools.ssh.api.SshTest$shouldTolerateEarlyFinish$1.invoke(SshTest.kt:8)
	at com.atlassian.performance.tools.ssh.api.SshContainer.useSsh$ssh(SshContainer.kt:19)
	at com.atlassian.performance.tools.ssh.api.SshTest.shouldTolerateEarlyFinish(SshTest.kt:56)

I managed to make the problem frequently reproducible by wrapping the "when->then" part of the test in a repeat(100) { ... } block.

    @Test
    fun shouldTolerateEarlyFinish() {
        SshContainer().useSsh { sshHost ->
            installPing(sshHost)

            repeat(100) {
                val fail = sshHost.runInBackground("nonexistent-command")
                val failResult = fail.stop(Duration.ofMillis(20))

                Assert.assertEquals(127, failResult.exitStatus)
            }
        }
    }

Our current API of BackgroundProcess#stop (which is failing) says:

Throws if getting the [SshConnection.SshResult] fails.

So the problem is not really a contract bug.

==== Start of my thoughts on the topic ====

I personally think that if it's possible to not handle exception then by default it shouldn't break your application, so it shouldn't really be thrown. Because of this I think this is an API design bug and IMHO we should fix it by introducing API that is returning nullable process result.

If we want to let the consumer distinguish reason of failures (missing result) we can do that by allowing the overwrite of problem handlers - by default returning null, but can throw if consumer wishes to do that. This way we don't leak the implementation specific exception layer into the interface, which should be polymorphic.

==== End of my thoughts on the topic ====

Nevertheless with the current API it's not a bug, yet it's a test failure. I'm trying to figure out how we can resolve it.

Possibly we could add try-catch block?

@dagguh do you remember if this test could work as expected if it would accept the process to not return exit status? Maybe it would be fine to repeat the runInBackground + stop couple of times? It doesn't sound as the best solution for me, but I struggle to find any better.

Copy link
Contributor Author

@dagguh dagguh Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectation is clear: assertEquals(127, failResult.exitStatus). That's what happens in a shell for a nonexistent command. That doesn't happen - why? We shouldn't try to hide that dysfunction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothesis 1: this would also flake for SshConnection.execute
Hypothesis 2: it's caused by the BackgroundProcess.stop timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothesis 3: missed close, the getExitStatus Javadoc asks to "Always call close() first before inspecting the exit status."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothesis 3: missed close

Noticed the javadoc as well. I tried to add close and run the test with repeat(100), but it didn't help (unless I checked something wrong.

Copy link
Contributor

@pczuj pczuj Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20 milliseconds is the time where we wait for the command to react to our interrupt. We are sending this interrupt immediately after starting the process.

I made an experiment by changing the order of operations, so that we read stdout and log it before failing with the exit status being null. (original)

    fun waitForResult(): SshConnection.SshResult {
        command.waitForCompletion(timeout)
        command.close()
        val output = command.inputStream.readAndLog(stdout)
        val errorOutput = command.errorStream.readAndLog(stderr)
        return SshConnection.SshResult(
            exitStatus = command.exitStatus,
            output = output,
            errorOutput = errorOutput
        )
    }

The output suggests that it indeed doesn't get into the "command not found" part, because the first process that fails to read exit code has no "bash: nonexistent-command: command not found" output and only "^C".

2022-06-22T15:12:14,728Z DEBUG Test worker [] [com.atlassian.performance.tools.ssh.WaitingCommand.Companion] ^Cbash: nonexistent-command: command not found

2022-06-22T15:12:14,773Z DEBUG Test worker [] [com.atlassian.performance.tools.ssh.WaitingCommand.Companion] ^C

command.exitStatus must not be null
java.lang.IllegalStateException: command.exitStatus must not be null
	at com.atlassian.performance.tools.ssh.WaitingCommand.waitForResult(WaitingCommand.kt:26)
	at com.atlassian.performance.tools.ssh.SshjBackgroundProcess.stop(SshjBackgroundProcess.kt:20)
	at com.atlassian.performance.tools.ssh.api.SshTest$shouldTolerateEarlyFinish$1.invoke(SshTest.kt:61)
	at com.atlassian.performance.tools.ssh.api.SshTest$shouldTolerateEarlyFinish$1.invoke(SshTest.kt:8)
	at com.atlassian.performance.tools.ssh.api.SshContainer.useSsh$ssh(SshContainer.kt:19)
	at com.atlassian.performance.tools.ssh.api.SshTest.shouldTolerateEarlyFinish(SshTest.kt:56)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet this proves nothing really, it's only a clue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run it with Thread.sleep(20) before interrupt, then it succeeds

    @Test
    fun shouldTolerateEarlyFinish() {
        SshContainer().useSsh { sshHost ->
            installPing(sshHost)

            repeat(100) {
                val fail = sshHost.runInBackground("nonexistant-command")
                Thread.sleep(20)
                val failResult = fail.stop(Duration.ofMillis(20))

                Assert.assertEquals(127, failResult.exitStatus)
            }
        }
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with repeat(1000)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants