-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: multithreading support for erasure coding #1087
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good, well done. I see one minor issue with the exception handling but otherwise looks great. We should wait for @dryajov's review on this one anyhow though.
codex/codex.nim
Outdated
taskpool = Taskpool.new(numThreads = config.numThreads) | ||
info "Threadpool started", numThreads = taskpool.numThreads | ||
except Exception: | ||
raise newException(Defect, "Failure in taskpool initialization.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 Not sure we want to be casting everything as Defect and bubbling it up this way? I think we'd use a result type for this instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@munna0908 Exception
is a parent type for Defect
and CatchableError
, the latter of which all other "catchable" error types are derived from. Defect is not catchable, and therefore we should not be using except Expection
where we can avoid it (despite its usage in other places in the codebase).
In this particular case, using except CatchableError
and converting to a Defect
is a valid pattern, and it also means that if a Defect was raised in the try block, it would not be caught in the except, and would be bubbled up the stack.
See https://status-im.github.io/nim-style-guide/errors.exceptions.html for more info on exception handling best practices (TL;DR using Result is recommended)
@gmega I agree that we should probably return a Result[void]
type here, read the Result in the layer above, and then quit on error. However, the layer above currently quits on exception, and there are a fair few instances of raised Defects in the new proc, so converting everything to Result would require a decent amount of cascading changes. For the scope of this PR, I think that Defect is suitable since there is no recovery opportunity here.
So my suggestion would be something like:
try:
# start taskpool
except CatchableError as parent:
raise newException(Defect, "Failure in taskpool initialisation: " & parent.msg, parent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job Rahul! I left mostly suggestions, just one question around requiring threading to be turned on and the configuration requirements/validation.
codex/codex.nim
Outdated
taskpool = Taskpool.new(numThreads = config.numThreads) | ||
info "Threadpool started", numThreads = taskpool.numThreads | ||
except Exception: | ||
raise newException(Defect, "Failure in taskpool initialization.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@munna0908 Exception
is a parent type for Defect
and CatchableError
, the latter of which all other "catchable" error types are derived from. Defect is not catchable, and therefore we should not be using except Expection
where we can avoid it (despite its usage in other places in the codebase).
In this particular case, using except CatchableError
and converting to a Defect
is a valid pattern, and it also means that if a Defect was raised in the try block, it would not be caught in the except, and would be bubbled up the stack.
See https://status-im.github.io/nim-style-guide/errors.exceptions.html for more info on exception handling best practices (TL;DR using Result is recommended)
@gmega I agree that we should probably return a Result[void]
type here, read the Result in the layer above, and then quit on error. However, the layer above currently quits on exception, and there are a fair few instances of raised Defects in the new proc, so converting everything to Result would require a decent amount of cascading changes. For the scope of this PR, I think that Defect is suitable since there is no recovery opportunity here.
So my suggestion would be something like:
try:
# start taskpool
except CatchableError as parent:
raise newException(Defect, "Failure in taskpool initialisation: " & parent.msg, parent)
codex/erasure/erasure.nim
Outdated
blockSize, ecK, ecM: int, | ||
data: ref seq[seq[byte]], | ||
parity: ptr seq[seq[byte]], | ||
): Future[Result[void, string]] {.async.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 This is completely correct, just highlighting that another option would be to continue the pattern in this module of returning ?!void
, which is an alias forResult[void, CatchableError]
, from the questionable
lib. Then, you can return success()
or failure("something")
. Again, just a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit-link
codex/erasure/erasure.nim
Outdated
@@ -87,6 +90,21 @@ type | |||
# provided. | |||
minSize*: NBytes | |||
|
|||
EncodeTask = object | |||
result: Atomic[bool] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result
is generally a reserved word. Not sure if we should use something else? @dryajov?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emizzle Good point, I will change this to success
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit-link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be fine inside a struct/object, I don't think it will be a problem, but might be prudent to use a different name, res
/ok
?
codex/erasure/erasure.nim
Outdated
blockSize, ecK, ecM: int, | ||
data, parity: ref seq[seq[byte]], | ||
recovered: ptr seq[seq[byte]], | ||
): Future[Result[void, string]] {.async.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a good idea to tell the compiler we don't want any exceptions leaking out of this routine, so we don't have any issues with crashing thread tasks
): Future[Result[void, string]] {.async.} = | |
): Future[Result[void, string]] {.async: (raises:[]).} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for encodeAsync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3d8d133
to
f1b8dfc
Compare
codex/codex.nim
Outdated
if config.numThreads == 0: | ||
taskpool = Taskpool.new(numThreads = min(countProcessors(), 16)) | ||
elif config.numThreads < 2: | ||
raise newException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use a raiseAssert
here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think this is better checked in the conf.nim
so that the invalid value doesn't get passed the cmd validation.
codex/codex.nim
Outdated
info "Threadpool started", numThreads = taskpool.numThreads | ||
except CatchableError as parent: | ||
raise | ||
newException(Defect, "Failure in taskpool initialization:" & parent.msg, parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably do a raiseAssert
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes, Rahul. Apart from the minor suggestions (raiseAssert
instead of raise newException(Defect...
), this LGTM. So, I'll approve now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed the CancelledError changes, which brought to mind that maybe we need to consider gracefully terminating the TaskPool
in the stop
routine.
Sorry for the premature approval before!
codex/erasure/erasure.nim
Outdated
warn "Async thread error", err = exc.msg | ||
return failure(exc.msg) | ||
except CancelledError: | ||
return failure("Thread cancelled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, do we want to return a failure? I thought about this a bit further... this proc executes in the master thread, correct? If so, maybe we should propagate CancelledErrors
. So the raises annotation would be:
{.async: (raises: [CacncelledError]).}
and the except block would re-raise:
except CancelledError as e:
raise e
This brings up another question -- should the TaskPool
be gracefully terminated in stop
? Or is there no point to wait for worker threads to join master? IOW, would stopping the master thread kill the worker at the OS-level and possibly produce an out-of-band error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need to properly cleanup everything, propagate the cancelation, and then make sure we don't exit prematurely at the callsite of both asyncEncode/asyncDecode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, we can't simply exit on cancellation, we need to wait for the worker to finish, so wait threadPtr.wait()
should still be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, couldn't that create a deadlock situation where the task is cancelled with cancelAndWait
and but cancellation has wait on the thread to finish its operation before cancelling? Is there a way to forcefully kill the task and discard any errors or results?
codex/erasure/erasure.nim
Outdated
warn "Async thread error", err = exc.msg | ||
return failure(exc.msg) | ||
except CancelledError: | ||
return failure("Thread cancelled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still reviewing this changes, so lets not merge this just yet.
codex/erasure/erasure.nim
Outdated
parity: ptr seq[seq[byte]], | ||
): Future[?!void] {.async: (raises: []).} = | ||
## Create an encoder backend instance | ||
let encoder = self.encoderProvider(blockSize, ecK, ecM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backend needs to be created in the worker thread, this creates the backend on the main thread, it should also be released once done.
codex/erasure/erasure.nim
Outdated
warn "Async thread error", err = exc.msg | ||
return failure(exc.msg) | ||
except CancelledError: | ||
return failure("Thread cancelled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need to properly cleanup everything, propagate the cancelation, and then make sure we don't exit prematurely at the callsite of both asyncEncode/asyncDecode.
codex/erasure/erasure.nim
Outdated
return failure("Thread cancelled") | ||
finally: | ||
# release the encoder | ||
t.encoder.release() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you also need to release the ThreadSignalPtr. This was actually a fun one, because it took me some time to figure out what was wrong, as I was seeing things fail arbitrarily. The issue is that, ThreadSignalPtr
uses pipes undereat to communicate between threads and if you don't dispose of it properly, you'll eventually hit a limit on opened filehandles. So, you need to call threadPtr.close()
here.
codex/erasure/erasure.nim
Outdated
warn "Async thread error", err = exc.msg | ||
return failure(exc.msg) | ||
except CancelledError: | ||
return failure("Thread cancelled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, we can't simply exit on cancellation, we need to wait for the worker to finish, so wait threadPtr.wait()
should still be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job for a first pass @munna0908!
Let me summarize what needs to be changed:
- Rework how buffers are passed arround from
ptr seq[seq[]]
toptr UncheckedArray[ptr UncheckedArray[byte]]
- Dispose of the thread signal
- Handle cancelations
- Add tests to explicitely test the multithreaded functionality in
testerasure.nim
ce80d68
to
374a815
Compare
@dryajov I have accommodated all suggested changes including the test cases |
return failure(asyncExc.msg) | ||
finally: | ||
if exc of CancelledError: | ||
raise (ref CancelledError) exc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to cast it here, just re-raise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dryajov, this proc should only raise CancelledError, but exc
is of type CatchableError
.
codex/erasure/erasure.nim
Outdated
else: | ||
return failure(exc.msg) | ||
finally: | ||
threadPtr.close().expect("closing once works") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, this can be in a defer
block instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit-link
de59003
to
1c859a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
except CatchableError as exc: | ||
try: | ||
await threadFut | ||
except AsyncError as asyncExc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why AsyncError here and not CatchableError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
This PR introduces the following changes:
TaskPool
to spawn separate threads for the erasure encoding and decoding.num-threads
flag, enabling users to specify the number of CPU threads available for TaskPool workers.Docs PR