Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly cleanup SqlTask on worker when no longer needed #24729

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

losipiuk
Copy link
Member

Currently SqlTask objects are removed from SqlTaskManager.tasks map (cache) after timeout (15 minutes by default). Even though the object is not huge, we observed increased memory pressure up to OOM on busy clusters.

With this PR entries are dropped form SqlTaskManager as soon as they are no longer needed, when coordinator will no longer query for the information

Note: I this code is prone to race condition if last worker to acknowlege task results retransmits acknowlegement. There is a chance that a this point task is already cleaned by cleanup request from coordinator, and entry will get resurected in tasks cache map. This should be extremly rare though, and not cause any problems.

@POST
@Path("{taskId}/cleanup")
public void cleanupTask(
@PathParam("taskId") TaskId taskId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wrapping not needed

@@ -124,6 +128,7 @@ public synchronized void updateDynamicFiltersVersionAndFetchIfNecessary(long new
private synchronized void stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is not called normally. only in erronous situations. You cannot rely on it to be executed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The external entry point for DynamicFiltersFetcher request is really updateDynamicFiltersVersionAndFetchIfNecessary.
So the conditions for DynamicFiltersFetcher being completed are:

  • there is no pending query
  • all places which could call updateDynamicFiltersVersionAndFetchIfNecessary are stopped (currenty it's continuous fetcher)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sopel39 @raunaqmorarka Yeah - you are right. Do you have quick clue if there is natural place where we know that DynFilFetcher finished its work. The way it is called is kinda convoluted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes:

  • there is no pending request from DynamicFilterFetcher
  • all places which could call updateDynamicFiltersVersionAndFetchIfNecessary are stopped (it's only ContinuousTaskStatusFetcher atm IIRC)

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add testing (with DF) to TestHttpRemoteTask

import static java.util.Objects.requireNonNull;
import static java.util.concurrent.TimeUnit.MILLISECONDS;

public class RemoteTaskCleaner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a separate service for this? Cannot HttpRemoteTask do cleanup itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can - just having separate class is cleaner I think. I can put that as inner class too but I do not like that too much.

Currently SqlTask objects are removed from SqlTaskManager.tasks map
(cache) after timeout (15 minutes by default). Even though the object is
not huge, we observed increased memory pressure up to OOM on busy
clusters.

With this PR entries are dropped form SqlTaskManager as soon as they are
no longer needed, when coordinator will no longer query for the
information
@losipiuk
Copy link
Member Author

@sopel39 PTAL

@losipiuk
Copy link
Member Author

@sopel39 PTAL

@sopel39 PTAL

public void cleanupTask(TaskId taskId)
{
requireNonNull(taskId, "taskId is null");
SqlTask sqlTask = tasks.getIfPresent(taskId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could there ever be a race condition, e.g: task was not created yet?

@@ -121,6 +124,8 @@ public synchronized void stop()
future.cancel(true);
future = null;
}
remoteTaskCleaner.markTaskStatusFetcherStopped(taskStatus.get().getState());
dynamicFiltersFetcher.noMoreUpdateRequests();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be just DynamicFiltersFetcher#stop method?

@@ -253,7 +258,11 @@ void updateTaskStatus(TaskStatus newValue)
onFail.accept(new TrinoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, HostAddress.fromUri(getTaskStatus().getSelf()))));
}

dynamicFiltersFetcher.updateDynamicFiltersVersionAndFetchIfNecessary(newValue.getDynamicFiltersVersion());
synchronized (this) {
if (running) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it happen in non-aborted query?

I think that's still wrong, because last DF update could be potentially received after ContinuousTaskStatusFetcher#stop was called. We need that last request for DFs to happen, otherwise we loose filtering opportunity.

@@ -113,17 +118,38 @@ public synchronized void start()

public synchronized void updateDynamicFiltersVersionAndFetchIfNecessary(long newDynamicFiltersVersion)
{
if (noMoreUpdateRequests) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, we need that last update to happen to fetch final DFs.

Just make dynamic filters fetcher report when it's finished.

@@ -121,6 +124,8 @@ public synchronized void stop()
future.cancel(true);
future = null;
}
remoteTaskCleaner.markTaskStatusFetcherStopped(taskStatus.get().getState());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would postpone that only after pending request is finished, so that last DF update can be processed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants