Lab1 training failed at estimator.fit #13

Open
csxwin opened this issue Mar 19, 2024 · 0 comments

csxwin commented Mar 19, 2024

I'm running lab1 on SageMaker.
Image: PyTorch 1.13 Python 3.9 CPU optimized
Kernel: Python3.9
Instance: ml.t3.medium

Here's the error message I get when running estimator.fit:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
Cell In[17], line 3
      1 # Passing True will halt your kernel, passing False will not. Both create a training job.
      2 # here we are defining the name of the input train channel. you can use whatever name you like! up to 20 channels per job.
----> 3 estimator.fit(wait=True, inputs = {'train':s3_train_path})

File /opt/conda/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
    342         return context
    344     return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:1341, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
   1339 self.jobs.append(self.latest_training_job)
   1340 if wait:
-> 1341     self.latest_training_job.wait(logs=logs)

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:2680, in _TrainingJob.wait(self, logs)
   2678 # If logs are requested, call logs_for_jobs.
   2679 if logs != "None":
-> 2680     self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2681 else:
   2682     self.sagemaker_session.wait_for_job(self.job_name)

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:5766, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
   5745 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
   5746     """Display logs for a given training job, optionally tailing them until job is complete.
   5747 
   5748     If the output is a tty or a Jupyter cell, it will be color-coded
   (...)
   5764         exceptions.UnexpectedStatusException: If waiting and the training job fails.
   5765     """
-> 5766     _logs_for_job(self, job_name, wait, poll, log_type, timeout)

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:7995, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
   7992             last_profiler_rule_statuses = profiler_rule_statuses
   7994 if wait:
-> 7995     _check_job_status(job_name, description, "TrainingJobStatus")
   7996     if dot:
   7997         print()

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:8048, in _check_job_status(job, desc, status_key_name)
   8042 if "CapacityError" in str(reason):
   8043     raise exceptions.CapacityError(
   8044         message=message,
   8045         allowed_statuses=["Completed", "Stopped"],
   8046         actual_status=status,
   8047     )
-> 8048 raise exceptions.UnexpectedStatusException(
   8049     message=message,
   8050     allowed_statuses=["Completed", "Stopped"],
   8051     actual_status=status,
   8052 )

UnexpectedStatusException: Error for Training job shuxucao-ddp-mnist-2024-03-19-03-40-53-406: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "TypeError: Descriptors cannot be created directly.
 If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
 If you cannot immediately regenerate your protos, some other possible workarounds are
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
 
 More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
 File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
 File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
 File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
 # may not use this file except in compliance with the License. A copy of
 File "<frozen importlib._bootstrap>", line 991, in _find_and_load
 File "<frozen zipimport>", line 259, in load_module
 File 

The installed pip package protobuf is version 3.20.2. Should I run this lab on Python 3.8 instead?
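
The error text names two workarounds: downgrade protobuf to 3.20.x or lower, or set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python. Since the training job runs in its own container, the protobuf version there may differ from the 3.20.2 reported above. Below is a minimal diagnostic sketch, assuming it is placed at the top of the training entry script before anything imports protobuf. The helper names (major_minor, needs_workaround, installed_protobuf_version) are hypothetical and are not part of the lab or of the SageMaker SDK.

```python
import os
from importlib import metadata  # standard library, Python 3.8+


def major_minor(version):
    """Return (major, minor) as integers from a version string such as '3.20.2'."""
    parts = (version.split(".") + ["0"])[:2]  # pad so a bare '4' still parses
    return int(parts[0]), int(parts[1])


def needs_workaround(version):
    """True when the version is newer than 3.20.x, the range the error message suggests."""
    return major_minor(version) > (3, 20)


def installed_protobuf_version():
    """Return the installed protobuf version string, or None if it is not installed."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_protobuf_version()
    # This line ends up in the training job logs, so the container's version is visible.
    print("protobuf version seen by this environment:", version)
    if version is not None and needs_workaround(version):
        # Workaround 2 from the error message: pure-Python parsing (slower).
        # It only takes effect if set before any module imports protobuf.
        os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")
```

If the printed version turns out to be newer than 3.20.x inside the container, pinning protobuf there (workaround 1 in the error message) would likely be the cleaner fix, since the environment variable trades speed for compatibility.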
