Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: multithreading support for erasure coding #1087

Merged
merged 18 commits into from
Feb 12, 2025
Merged

Conversation

munna0908
Copy link
Contributor

@munna0908 munna0908 commented Jan 23, 2025

This PR introduces the following changes:

  • Utilizes TaskPool to spawn separate threads for the erasure encoding and decoding.
  • Adds a new num-threads flag, enabling users to specify the number of CPU threads available for TaskPool workers.

Docs PR

@munna0908 munna0908 marked this pull request as draft January 23, 2025 17:20
@munna0908 munna0908 requested review from gmega and dryajov January 24, 2025 12:25
@munna0908 munna0908 self-assigned this Jan 24, 2025
@munna0908 munna0908 marked this pull request as ready for review January 24, 2025 12:26
@munna0908 munna0908 added the Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details label Jan 24, 2025
Copy link
Member

@gmega gmega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good, well done. I see one minor issue with the exception handling but otherwise looks great. We should wait for @dryajov's review on this one anyhow though.

codex/codex.nim Outdated
taskpool = Taskpool.new(numThreads = config.numThreads)
info "Threadpool started", numThreads = taskpool.numThreads
except Exception:
raise newException(Defect, "Failure in taskpool initialization.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Not sure we want to be casting everything as Defect and bubbling it up this way? I think we'd use a result type for this instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@munna0908 Exception is a parent type for Defect and CatchableError, the latter of which all other "catchable" error types are derived from. Defect is not catchable, and therefore we should not be using except Expection where we can avoid it (despite its usage in other places in the codebase).

In this particular case, using except CatchableError and converting to a Defect is a valid pattern, and it also means that if a Defect was raised in the try block, it would not be caught in the except, and would be bubbled up the stack.

See https://status-im.github.io/nim-style-guide/errors.exceptions.html for more info on exception handling best practices (TL;DR using Result is recommended)

@gmega I agree that we should probably return a Result[void] type here, read the Result in the layer above, and then quit on error. However, the layer above currently quits on exception, and there are a fair few instances of raised Defects in the new proc, so converting everything to Result would require a decent amount of cascading changes. For the scope of this PR, I think that Defect is suitable since there is no recovery opportunity here.

So my suggestion would be something like:

try:
  # start taskpool
except CatchableError as parent:
  raise newException(Defect, "Failure in taskpool initialisation: " & parent.msg, parent)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emizzle Commit-link

codex/conf.nim Show resolved Hide resolved
codex/erasure/erasure.nim Outdated Show resolved Hide resolved
codex/erasure/erasure.nim Outdated Show resolved Hide resolved
Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job Rahul! I left mostly suggestions, just one question around requiring threading to be turned on and the configuration requirements/validation.

codex/codex.nim Outdated
taskpool = Taskpool.new(numThreads = config.numThreads)
info "Threadpool started", numThreads = taskpool.numThreads
except Exception:
raise newException(Defect, "Failure in taskpool initialization.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@munna0908 Exception is a parent type for Defect and CatchableError, the latter of which all other "catchable" error types are derived from. Defect is not catchable, and therefore we should not be using except Expection where we can avoid it (despite its usage in other places in the codebase).

In this particular case, using except CatchableError and converting to a Defect is a valid pattern, and it also means that if a Defect was raised in the try block, it would not be caught in the except, and would be bubbled up the stack.

See https://status-im.github.io/nim-style-guide/errors.exceptions.html for more info on exception handling best practices (TL;DR using Result is recommended)

@gmega I agree that we should probably return a Result[void] type here, read the Result in the layer above, and then quit on error. However, the layer above currently quits on exception, and there are a fair few instances of raised Defects in the new proc, so converting everything to Result would require a decent amount of cascading changes. For the scope of this PR, I think that Defect is suitable since there is no recovery opportunity here.

So my suggestion would be something like:

try:
  # start taskpool
except CatchableError as parent:
  raise newException(Defect, "Failure in taskpool initialisation: " & parent.msg, parent)

codex/erasure/erasure.nim Show resolved Hide resolved
blockSize, ecK, ecM: int,
data: ref seq[seq[byte]],
parity: ptr seq[seq[byte]],
): Future[Result[void, string]] {.async.} =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 This is completely correct, just highlighting that another option would be to continue the pattern in this module of returning ?!void, which is an alias forResult[void, CatchableError], from the questionable lib. Then, you can return success() or failure("something"). Again, just a suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit-link

@@ -87,6 +90,21 @@ type
# provided.
minSize*: NBytes

EncodeTask = object
result: Atomic[bool]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result is generally a reserved word. Not sure if we should use something else? @dryajov?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emizzle Good point, I will change this to success

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit-link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine inside a struct/object, I don't think it will be a problem, but might be prudent to use a different name, res/ok?

blockSize, ecK, ecM: int,
data, parity: ref seq[seq[byte]],
recovered: ptr seq[seq[byte]],
): Future[Result[void, string]] {.async.} =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good idea to tell the compiler we don't want any exceptions leaking out of this routine, so we don't have any issues with crashing thread tasks

Suggested change
): Future[Result[void, string]] {.async.} =
): Future[Result[void, string]] {.async: (raises:[]).} =

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for encodeAsync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emizzle Commit-link

@munna0908 munna0908 requested review from gmega and emizzle January 28, 2025 14:54
@munna0908 munna0908 force-pushed the feature/async-erasure branch from 3d8d133 to f1b8dfc Compare January 28, 2025 14:55
codex/codex.nim Outdated
if config.numThreads == 0:
taskpool = Taskpool.new(numThreads = min(countProcessors(), 16))
elif config.numThreads < 2:
raise newException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use a raiseAssert here instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think this is better checked in the conf.nim so that the invalid value doesn't get passed the cmd validation.

codex/codex.nim Outdated
info "Threadpool started", numThreads = taskpool.numThreads
except CatchableError as parent:
raise
newException(Defect, "Failure in taskpool initialization:" & parent.msg, parent)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably do a raiseAssert as well.

Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes, Rahul. Apart from the minor suggestions (raiseAssert instead of raise newException(Defect...), this LGTM. So, I'll approve now.

Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed the CancelledError changes, which brought to mind that maybe we need to consider gracefully terminating the TaskPool in the stop routine.

Sorry for the premature approval before!

warn "Async thread error", err = exc.msg
return failure(exc.msg)
except CancelledError:
return failure("Thread cancelled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, do we want to return a failure? I thought about this a bit further... this proc executes in the master thread, correct? If so, maybe we should propagate CancelledErrors. So the raises annotation would be:

{.async: (raises: [CacncelledError]).}

and the except block would re-raise:

except CancelledError as e:
    raise e

This brings up another question -- should the TaskPool be gracefully terminated in stop? Or is there no point to wait for worker threads to join master? IOW, would stopping the master thread kill the worker at the OS-level and possibly produce an out-of-band error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to properly cleanup everything, propagate the cancelation, and then make sure we don't exit prematurely at the callsite of both asyncEncode/asyncDecode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, we can't simply exit on cancellation, we need to wait for the worker to finish, so wait threadPtr.wait() should still be called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, couldn't that create a deadlock situation where the task is cancelled with cancelAndWait and but cancellation has wait on the thread to finish its operation before cancelling? Is there a way to forcefully kill the task and discard any errors or results?

warn "Async thread error", err = exc.msg
return failure(exc.msg)
except CancelledError:
return failure("Thread cancelled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

@emizzle emizzle self-requested a review January 28, 2025 23:38
dryajov
dryajov approved these changes Jan 29, 2025
Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reviewing this changes, so lets not merge this just yet.

parity: ptr seq[seq[byte]],
): Future[?!void] {.async: (raises: []).} =
## Create an encoder backend instance
let encoder = self.encoderProvider(blockSize, ecK, ecM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backend needs to be created in the worker thread, this creates the backend on the main thread, it should also be released once done.

warn "Async thread error", err = exc.msg
return failure(exc.msg)
except CancelledError:
return failure("Thread cancelled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to properly cleanup everything, propagate the cancelation, and then make sure we don't exit prematurely at the callsite of both asyncEncode/asyncDecode.

return failure("Thread cancelled")
finally:
# release the encoder
t.encoder.release()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also need to release the ThreadSignalPtr. This was actually a fun one, because it took me some time to figure out what was wrong, as I was seeing things fail arbitrarily. The issue is that, ThreadSignalPtr uses pipes undereat to communicate between threads and if you don't dispose of it properly, you'll eventually hit a limit on opened filehandles. So, you need to call threadPtr.close() here.

warn "Async thread error", err = exc.msg
return failure(exc.msg)
except CancelledError:
return failure("Thread cancelled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, we can't simply exit on cancellation, we need to wait for the worker to finish, so wait threadPtr.wait() should still be called.

dryajov
dryajov approved these changes Jan 30, 2025
Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job for a first pass @munna0908!

Let me summarize what needs to be changed:

  • Rework how buffers are passed arround from ptr seq[seq[]] to ptr UncheckedArray[ptr UncheckedArray[byte]]
  • Dispose of the thread signal
  • Handle cancelations
  • Add tests to explicitely test the multithreaded functionality in testerasure.nim

@munna0908 munna0908 force-pushed the feature/async-erasure branch from ce80d68 to 374a815 Compare February 7, 2025 11:45
@munna0908
Copy link
Contributor Author

@dryajov I have accommodated all suggested changes including the test cases

@munna0908 munna0908 requested a review from dryajov February 10, 2025 03:17
return failure(asyncExc.msg)
finally:
if exc of CancelledError:
raise (ref CancelledError) exc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to cast it here, just re-raise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dryajov, this proc should only raise CancelledError, but exc is of type CatchableError.

else:
return failure(exc.msg)
finally:
threadPtr.close().expect("closing once works")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, this can be in a defer block instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit-link

@munna0908 munna0908 force-pushed the feature/async-erasure branch from de59003 to 1c859a8 Compare February 11, 2025 17:11
Copy link
Member

@gmega gmega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

except CatchableError as exc:
try:
await threadFut
except AsyncError as asyncExc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why AsyncError here and not CatchableError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dryajov AsyncError would be more appropriate as threadPtr.wait() raises an AsyncError, ref

Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@gmega gmega added this pull request to the merge queue Feb 12, 2025
Merged via the queue into master with commit c651488 Feb 12, 2025
19 checks passed
@gmega gmega deleted the feature/async-erasure branch February 12, 2025 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants