
[Serve] A patch for sync down logs #4036

Open
wants to merge 114 commits into master

Conversation


@root-hbx root-hbx commented Oct 5, 2024

A patch for #3063

  • fix all previous merge conflicts in #3063
  • migrate log components to sky/serve/core.py, as required by #4046

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • sync down when only launch logs exist and the two replicas are in PROVISIONING state
  • sync down across two READY replicas
  • sync down when one replica is READY and the other has been terminated (e.g., via sky cancel on the controller)
  • a validate_log_file unit test checks whether the log file exists and the replica ID matches
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh
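The validate_log_file check listed above could look roughly like the sketch below. This is a hypothetical illustration only: the PR's actual unit test is not shown here, and the replica_<id>.log naming scheme is an assumption.

```python
import os
import re
import tempfile

# Hypothetical sketch: check that the log file exists and that its name
# encodes the expected replica ID, assuming a replica_<id>.log scheme.
def validate_log_file(path: str, replica_id: int) -> bool:
    if not os.path.exists(path):
        return False  # nothing was synced down
    match = re.search(r'replica_(\d+)\.log$', path)
    return match is not None and int(match.group(1)) == replica_id

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'replica_1.log')
    open(log_path, 'w').close()
    results = (validate_log_file(log_path, 1),
               validate_log_file(log_path, 2))
print(results)  # → (True, False)
```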

David Tran and others added 30 commits January 31, 2024 18:35
Comment on lines 788 to 805
if (sync_down_all_components or
        service_component == serve_utils.ServiceComponent.CONTROLLER):
    logger.info(
        'Starting the process to prepare and download controller logs...')
    runner = controller_handle.get_command_runners()
    controller_log_file_name = (
        serve_utils.generate_remote_controller_log_file_name(service_name))
    logger.info('Downloading the controller logs...')
    runner.rsync(source=controller_log_file_name,
                 target=os.path.join(
                     target_directory,
                     serve_constants.CONTROLLER_LOG_FILE_NAME),
                 up=False,
                 stream_logs=False)
    if not sync_down_all_components:
        single_file_synced = True
        target_directory = os.path.join(
            target_directory, serve_constants.CONTROLLER_LOG_FILE_NAME)
Collaborator

is it possible that we keep a list of source & target and use a loop to rsync them down?

Author

Got it, just to confirm:

In fact, sync_down_logs() for the controller/load balancer only deals with one instance, unlike replicas, which have multiple instances, where using a loop is more convenient.

So, the reason for "keeping a list of source & target and using a loop to rsync them down" for the controller/load balancer is mainly to reduce code duplication and simplify the code logic?

Collaborator

Yes. Currently it has a lot of duplication. Please also double-check other places in the code.
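The reviewer's suggestion could be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: FakeRunner stands in for SkyPilot's real command runner, and the file names are placeholders.

```python
import os

# Hypothetical sketch: collect (remote_source, local_name) pairs first,
# then rsync them in one loop, instead of duplicating the rsync call for
# the controller and the load balancer.
class FakeRunner:
    def __init__(self):
        self.synced = []

    def rsync(self, source, target, up, stream_logs):
        # Record the transfer instead of performing a real rsync.
        self.synced.append((source, target))

def sync_components(runner, target_directory, pairs):
    """Download every (source, local_name) pair into target_directory."""
    for source, local_name in pairs:
        runner.rsync(source=source,
                     target=os.path.join(target_directory, local_name),
                     up=False,
                     stream_logs=False)

runner = FakeRunner()
sync_components(runner, '/tmp/logs', [
    ('~/controller.log', 'controller.log'),
    ('~/load_balancer.log', 'load_balancer.log'),
])
print(len(runner.synced))  # → 2
```

With this shape, adding another component is one more tuple in the list rather than another copy of the rsync call.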

# Get record from serve state.
service_record = serve_state.get_service_from_name(service_name)
if service_record is None:
    raise ValueError(f'Service {service_name!r} does not exist.')
Collaborator

Let's add ux_utils.print_exception_no_traceback like #4111.


# These copies of logs may still be continuously updating.
# We need to synchronize the latest log data from the remote server.
for replica in service_record['replica_info']:
Collaborator

Does this include the terminal replicas? Should we continue past them?
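The filter the reviewer is asking about could be sketched as below. This is a hypothetical illustration: the status names here are placeholders, not SkyPilot's actual ReplicaStatus values.

```python
# Hypothetical sketch: skip replicas in a terminal state before syncing
# their logs, instead of attempting rsync against gone instances.
TERMINAL_STATUSES = {'FAILED', 'SHUTTING_DOWN', 'TERMINATED'}

def replicas_to_sync(replica_info):
    for replica in replica_info:
        if replica['status'] in TERMINAL_STATUSES:
            continue  # a terminal replica has no live logs to sync
        yield replica['replica_id']

ids = list(replicas_to_sync([
    {'replica_id': 1, 'status': 'READY'},
    {'replica_id': 2, 'status': 'TERMINATED'},
]))
print(ids)  # → [1]
```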

Comment on lines 830 to 832
except exceptions.CommandError as e:
    logger.info('Failed to download the logs: '
                f'{common_utils.format_exception(e)}')
Collaborator

Let's directly raise those errors. We should not silently swallow them.
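Re-raising instead of logging could look roughly like this. This is a hypothetical sketch only: the local CommandError class stands in for sky.exceptions.CommandError, and broken_rsync is a placeholder for the real rsync call.

```python
# Hypothetical sketch: surface the rsync failure to the caller with
# context, instead of only logging it and continuing.
class CommandError(Exception):
    pass

def download_logs(do_rsync) -> None:
    try:
        do_rsync()
    except CommandError as e:
        # Re-raise with context rather than swallowing the error.
        raise RuntimeError(f'Failed to download the logs: {e}') from e

def broken_rsync():
    raise CommandError('connection reset')

try:
    download_logs(broken_rsync)
except RuntimeError as e:
    caught = str(e)
print(caught)  # → Failed to download the logs: connection reset
```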

@root-hbx
Author

Thanks for the comments @cblmemo, fixed and ready for next round :)

@root-hbx root-hbx requested a review from cblmemo October 27, 2024 18:50
Comment on lines 102 to 106
def _sync_log_file(runner,
                   source,
                   target_directory,
                   target_name=None,
                   run_timestamp=None) -> None:
Collaborator

Let's add type annotations here

        single_file_synced = True
        target_directory = os.path.join(target_directory, log_file_constant)

    return runner, target_directory, single_file_synced
Collaborator

Why do we need to construct the runner inside this function?

Comment on lines 190 to 193
_sync_log_file(runner,
               remote_service_dir_name,
               sky_logs_directory,
               run_timestamp=run_timestamp)
Collaborator

Is it possible that we prepare all the logs (the replica log, controller log & lb log), construct all the targets to download, loop over them once, and then clean up?

Comment on lines 159 to 163
def _sync_down_replica_logs(service_name, run_timestamp, replica_id,
                            controller_handle, sky_logs_directory) -> None:
    """Helper function of sync_down_logs.

    - prepare and download replica log file.
    """
Collaborator

Coupled with the comment before, let's inline this function.

    """
    try:
        subprocess_utils.handle_returncode(returncode, command, error_message)
        logger.info(f'{error_message.split(".")[0]} successfully.')
Collaborator

Why do we need to print out the error message if it runs successfully?

        'download on the controller.')


def _run_command_with_error_handling(returncode, command,
Collaborator

This function feels too shallow to me. Let's inline it.

Comment on lines 783 to 785
def sync_normal_replica_logs(service_name: str, replica: dict,
                             dir_for_download: str,
                             target_replica_id: Optional[int]) -> None:
Collaborator

Not sure what a "normal" replica is. Do you mean a nonterminal replica?
Also, this function is only called once. Let's inline it to reduce fragmentation.

Author

Yes, it means a nonterminal replica here.

In fact, I'm a little confused about the relationship between inlining and redundancy here. If all these helper functions above are put back in their original places, the main functions sync_down_logs() / prepare_replica_logs_for_download() will become quite long and the logic rather complex. Should I still inline them?

@root-hbx
Author

Thanks for the detailed suggestions! @cblmemo

Before I proceed with further changes, I'd like to confirm a few details:

  1. Inlining functions that are only used once, even though it may make the main function somewhat longer.
  2. For _sync_log_file(), adding type annotations, preparing all logs (the replica log, controller log, and lb log), constructing all download targets, looping through them once, and then performing cleanup.

Last week, you mentioned there was redundancy elsewhere, but I didn't fully understand: I had encapsulated many processes into functions to simplify the logic of the main function, and after handling it this way, I didn't find any other redundancies.

If I proceed with the changes as described, I'd actually only need to reset the current PR back to here and then modify the _sync_log_file() function from that point.

Would any redundancy remain if I proceed this way? I would truly appreciate any insights and guidance provided.

@cblmemo
Collaborator

cblmemo commented Oct 31, 2024

Thanks for asking @root-hbx ! Those are good questions.

General ideas to keep in mind:

  1. Define a function for repeatedly used code blocks.
  2. Avoid very shallow functions (e.g. this), unless you are intentionally keeping everything aligned (e.g. accessing the same file path in multiple places in the code).

Several suggestions:

  1. Merge if branches with the same condition (especially when the condition is long). E.g., both the [caller of the _prepare_and_sync_log function](https://github.com/root-hbx/skypilot/blob/6f83693b6ffda98589035405f44972c831312214/sky/serve/core.py#L837-L842) and the function itself check the component (which is a very lengthy condition). Is it possible to construct the log file name outside the function and pass it in?
  2. Could we double-check in what case this will happen? E.g., here, in what case would those replica logs not exist?
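Suggestion 1 above could be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: all names here are placeholders, and recording to a list stands in for the real rsync.

```python
# Hypothetical sketch: evaluate the lengthy component condition once at
# the call site, construct the log file name there, and pass it into the
# helper, so the helper no longer repeats the check.
CONTROLLER_LOG = 'controller.log'
LOAD_BALANCER_LOG = 'load_balancer.log'

def _prepare_and_sync_log(synced, target_directory, log_file_name):
    # The helper just syncs the file it was given; appending to `synced`
    # stands in for a real rsync call.
    synced.append(f'{target_directory}/{log_file_name}')

def sync_down(component, sync_all, target_directory):
    synced = []
    # Each condition is checked exactly once, at the call site.
    if sync_all or component == 'CONTROLLER':
        _prepare_and_sync_log(synced, target_directory, CONTROLLER_LOG)
    if sync_all or component == 'LOAD_BALANCER':
        _prepare_and_sync_log(synced, target_directory, LOAD_BALANCER_LOG)
    return synced

print(sync_down('CONTROLLER', False, '/tmp/logs'))
# → ['/tmp/logs/controller.log']
```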

Also, could you please add type annotations when writing the code? They will help the type checker check your code :))

@root-hbx
Author

Thanks for your detailed advice and guidance @cblmemo . That's super helpful! I am currently refactoring the code according to the aforementioned standards :D

@root-hbx
Author

root-hbx commented Nov 2, 2024

Hi, @cblmemo ! I’ve refactored the corresponding code based on your feedback and enhanced the unit test for serve_utils. PTAL when you have a moment 🤩
