[Serve] A patch for sync down logs #4036
…s method (avoid circular import)
sky/serve/core.py
Outdated
if (sync_down_all_components or
        service_component == serve_utils.ServiceComponent.CONTROLLER):
    logger.info(
        'Starting the process to prepare and download controller logs...')
    runner = controller_handle.get_command_runners()
    controller_log_file_name = (
        serve_utils.generate_remote_controller_log_file_name(service_name))
    logger.info('Downloading the controller logs...')
    runner.rsync(source=controller_log_file_name,
                 target=os.path.join(
                     target_directory,
                     serve_constants.CONTROLLER_LOG_FILE_NAME),
                 up=False,
                 stream_logs=False)
    if not sync_down_all_components:
        single_file_synced = True
        target_directory = os.path.join(
            target_directory, serve_constants.CONTROLLER_LOG_FILE_NAME)
Is it possible to keep a list of sources and targets and use a loop to rsync them down?
Got it, just to confirm:
In fact, sync_down_logs() for the controller/load balancer only deals with one instance, unlike replicas, which have multiple instances where using a loop is more convenient.
So the reason for keeping a list of sources and targets and using a loop to rsync them down for the controller/load balancer is mainly to reduce code duplication and simplify the code logic?
Yes. Currently it has a lot of duplication. Please also double check other places in the code.
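The suggestion in this thread can be sketched roughly as follows. This is a minimal illustration, not the code that was merged: `sync_down_pairs`, the `rsync` callable, and all paths are hypothetical stand-ins for the runner and the log file names used in the PR.

```python
import os
from typing import Callable, List, Tuple


def sync_down_pairs(rsync: Callable[[str, str], None],
                    pairs: List[Tuple[str, str]]) -> List[str]:
    """Download every (source, target) pair through one shared loop."""
    synced = []
    for source, target in pairs:
        # A single call site replaces one rsync block per component.
        rsync(source, target)
        synced.append(target)
    return synced


# Hypothetical remote and local paths, for illustration only.
target_directory = os.path.join('sky_logs', 'my-service')
pairs = [
    ('remote/controller.log',
     os.path.join(target_directory, 'controller.log')),
    ('remote/load_balancer.log',
     os.path.join(target_directory, 'load_balancer.log')),
]

calls: List[Tuple[str, str]] = []
synced = sync_down_pairs(lambda s, t: calls.append((s, t)), pairs)
```

With this shape, supporting another component means appending one more pair rather than copying another rsync block.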
sky/serve/serve_utils.py
Outdated
# Get record from serve state.
service_record = serve_state.get_service_from_name(service_name)
if service_record is None:
    raise ValueError(f'Service {service_name!r} does not exist.')
Let's add ux_utils.print_exception_no_traceback like #4111.
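For readers unfamiliar with the pattern, here is a self-contained sketch of what wrapping the `raise` could look like. It assumes the helper is used as a context manager that hides the traceback; `print_exception_no_traceback_standin` and `get_service_or_raise` are hypothetical stand-ins written for this example, not the project's actual implementation.

```python
import contextlib
import sys
from typing import Dict, Iterator


@contextlib.contextmanager
def print_exception_no_traceback_standin() -> Iterator[None]:
    """Stand-in helper: exceptions raised inside print without a traceback."""
    original_limit = getattr(sys, 'tracebacklimit', 1000)
    sys.tracebacklimit = 0
    yield
    # Reached only if the block exits without raising.
    sys.tracebacklimit = original_limit


def get_service_or_raise(records: Dict[str, dict],
                         service_name: str) -> dict:
    """Look up a service record; fail with a user-facing error if absent."""
    record = records.get(service_name)
    if record is None:
        with print_exception_no_traceback_standin():
            raise ValueError(f'Service {service_name!r} does not exist.')
    return record
```

The intent is that a user who mistypes a service name sees only the one-line error message instead of an internal stack trace.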
# These copies of logs may still be continuously updating.
# We need to synchronize the latest log data from the remote server.
for replica in service_record['replica_info']:
Does this include the terminal replicas? Should we continue for them?
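One reading of this comment is that the loop should skip replicas that are already in a terminal state. A rough sketch under that assumption follows; the status names, the `TERMINAL_STATUSES` set, and the dictionary keys are hypothetical and not taken from the PR.

```python
from typing import Any, Dict, List

# Hypothetical terminal statuses; the real set is defined by the project.
TERMINAL_STATUSES = frozenset({'FAILED', 'TERMINATED'})


def replicas_to_sync(replica_infos: List[Dict[str, Any]]) -> List[int]:
    """Return the ids of replicas whose logs still need a remote sync."""
    replica_ids = []
    for replica in replica_infos:
        if replica['status'] in TERMINAL_STATUSES:
            # Terminal replicas no longer produce new log data to fetch.
            continue
        replica_ids.append(replica['replica_id'])
    return replica_ids
```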
sky/serve/serve_utils.py
Outdated
except exceptions.CommandError as e:
    logger.info('Failed to download the logs: '
                f'{common_utils.format_exception(e)}')
Let's directly raise those errors. We should not let the error pass silently.
Thanks for the comments @cblmemo, fixed and ready for next round :)
sky/serve/core.py
Outdated
def _sync_log_file(runner,
                   source,
                   target_directory,
                   target_name=None,
                   run_timestamp=None) -> None:
Let's add type annotations here.
sky/serve/core.py
Outdated
single_file_synced = True
target_directory = os.path.join(target_directory, log_file_constant)

return runner, target_directory, single_file_synced
Why do we need to construct runner inside this function?
sky/serve/core.py
Outdated
_sync_log_file(runner,
               remote_service_dir_name,
               sky_logs_directory,
               run_timestamp=run_timestamp)
Is it possible to prepare all the logs (the replica logs, controller log, and LB log), construct all the targets to download, loop over them once, and then clean up?
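The three phases described in this comment (prepare, download in one loop, clean up) could be arranged roughly as below. This is only a sketch: `sync_down_once` and its callable parameters are hypothetical names, and running cleanup in a `finally` block is an assumption about the desired behavior on failure.

```python
from typing import Callable, List, Tuple

SyncTarget = Tuple[str, str]  # (remote source, local target)


def sync_down_once(prepare_steps: List[Callable[[], List[SyncTarget]]],
                   rsync: Callable[[str, str], None],
                   cleanup: Callable[[], None]) -> int:
    """Prepare all logs, download them in one loop, then clean up."""
    # Phase 1: each component contributes its own download targets.
    targets: List[SyncTarget] = []
    for prepare in prepare_steps:
        targets.extend(prepare())
    try:
        # Phase 2: a single loop downloads everything.
        for source, target in targets:
            rsync(source, target)
    finally:
        # Phase 3: remove temporary files even if a download failed.
        cleanup()
    return len(targets)
```

Errors from `rsync` still propagate to the caller, which matches the earlier request to raise failures instead of logging them.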
sky/serve/core.py
Outdated
def _sync_down_replica_logs(service_name, run_timestamp, replica_id,
                            controller_handle, sky_logs_directory) -> None:
    """Helper function of sync_down_logs.
    - prepare and download replica log file.
    """
Coupled with the previous comment, let's inline this function.
sky/serve/core.py
Outdated
"""
try:
    subprocess_utils.handle_returncode(returncode, command, error_message)
    logger.info(f'{error_message.split(".")[0]} successfully.')
Why do we need to print out the error message if it runs successfully?
sky/serve/core.py
Outdated
'download on the controller.')


def _run_command_with_error_handling(returncode, command,
This function feels too shallow to me. Let's inline this.
sky/serve/serve_utils.py
Outdated
def sync_normal_replica_logs(service_name: str, replica: dict,
                             dir_for_download: str,
                             target_replica_id: Optional[int]) -> None:
Not sure what a "normal" replica is. Do you mean a nonterminal replica?
Also, this function is only called once. Let's inline it to reduce fragmentation.
Yes, it means a nonterminal replica here.
In fact, I'm a little confused about the relationship between inlining and redundancy here. If all the helper functions above are put back in their original place, the main functions sync_down_logs() / prepare_replica_logs_for_download() will become quite long and their logic rather complex. I'm not sure whether I should inline them.
Thanks for the detailed suggestions, @cblmemo! Before I proceed with further changes, I'd like to confirm a few details:
Last week, you mentioned there was redundancy elsewhere, but I didn't fully understand. I encapsulated many processes into functions to simplify the logic of the main function, and after handling it this way I didn't find any other redundancies. If I proceed with the changes as described, I'd only need to reset the current PR back to here and then modify the … Would any redundancy remain if I proceed this way? I would truly appreciate any insights and guidance.
784c764 to 70a7bd6
Thanks for asking, @root-hbx! Those are good questions. General ideas to keep in mind:
Several suggestions:
Also, could you please add type annotations when writing the code? It will help the type checker check your code :))
Thanks for your detailed advice and guidance, @cblmemo. That's super helpful! I am currently refactoring the code according to the aforementioned standards :D
Hi @cblmemo! I've refactored the corresponding code based on your feedback and enhanced the unit test for
A patch of #3063
… sky/serve/core.py, as required by #4046
Tested (run the relevant ones):
bash format.sh
… PROVISIONING state
… READY replicas
… READY, and another replica terminated (e.g. through sky cancel on controller)
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh