[Question] After train, how to test our own environment? #306
Comments
Yes, the location where the experimental results are saved is |
Thank you. To test, I have also tried the file
Q1: When I use Method2: run the file
LOG_DIR = /examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21
(1) For the same LOG_DIR, why are the results different? In theory, evaluating the same model should give the same results.
Q2: To run the file
Q3: In
(1) For the cost function, our goal is to satisfy the equality constraint
self.P_d+self.P_EL+self.P_EB+self.P_ES=self.P_FC+self.P_PV+self.P_buy
Why is self.P_error still very large after training to convergence? In theory, I think the cost should tend to 0. Or how should the cost be designed?
(2) For the terminated and truncated flags, should stepping stop once truncated is returned?
|
Q1
SEED = 5  # for example
from omnisafe.utils.tools import seed_all
seed_all(seed=SEED)
Then in the method
self._env.set_seed(seed=SEED)
Please note that, to ensure the rigor of the evaluation, you should use a different random seed for evaluation than the one used during training.
Q2
Q3
|
THANK YOU VERY MUCH! |
I believe this is due to the environment's random seed mechanism. The environment currently supported by OmniSafe is Safety-Gymnasium, which is based on Gymnasium, commonly used in the reinforcement learning community. In Gymnasium's seeding mechanism, the environment generates a series of random numbers from the initial random seed and uses them as seeds for subsequent resets, instead of using the same seed for every reset. For more details, please refer to: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/utils/seeding.py |
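To illustrate the behaviour described above, here is a minimal sketch using only the standard library. `ToyEnv` and `rollout_starts` are hypothetical names for illustration and are not Gymnasium's or OmniSafe's actual implementation; the sketch only assumes that an explicit seed re-creates the generator and that a reset without a seed continues drawing from it.

```python
import random


class ToyEnv:
    """Toy stand-in for an environment (hypothetical, not Gymnasium code)."""

    def __init__(self):
        self._rng = random.Random()

    def reset(self, seed=None):
        # Re-create the generator only when a seed is passed explicitly.
        if seed is not None:
            self._rng = random.Random(seed)
        # The initial state is drawn from the (possibly continued) generator.
        return self._rng.random()


def rollout_starts(seed, episodes=3):
    """Seed the first reset only, then reset without a seed."""
    env = ToyEnv()
    starts = [env.reset(seed=seed)]
    starts += [env.reset() for _ in range(episodes - 1)]
    return starts


run_a = rollout_starts(seed=0)
run_b = rollout_starts(seed=0)
run_c = rollout_starts(seed=5)

print(run_a == run_b)                 # same initial seed: identical runs
print(len(set(run_a)) == len(run_a))  # episodes within one run still differ
print(run_a != run_c)                 # a different initial seed changes the run
```

Under these assumptions, episodes inside one run differ from each other, while two runs started from the same initial seed repeat exactly, which matches the observation that results only change when the initial seed changes.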
Thank you, but in our environment we follow the seed setting in simple-env.py, as follows:
Also, there is seed=0 in CPO.yaml. How does it work? |
You can set a simple seed logic, such as adding 10 each time. That is, when you first reset the environment, the seed you pass in is 0, and for subsequent resets, you only need to pass in None, allowing the environment to automatically reset with seeds 10, 20, 30, and so on. Similarly, when your initial seed is 5, the environment will automatically reset in the order of 15, 25, 35, and so on. This logic can be easily implemented in the reset function. |
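A minimal sketch of that seed schedule, assuming a hypothetical `SeedStepEnv` class whose `reset` returns an initial state and an info dictionary. The class name, the `SEED_STEP` constant, and the returned `info` key are illustrative assumptions, not part of OmniSafe.

```python
import random


class SeedStepEnv:
    """Hypothetical environment that advances its own seed by a fixed step."""

    SEED_STEP = 10  # assumption: the "adding 10 each time" rule from the reply

    def __init__(self):
        self._seed = 0
        self._rng = random.Random(self._seed)

    def reset(self, seed=None):
        if seed is not None:
            # An explicit seed restarts the schedule from that value.
            self._seed = seed
        else:
            # No seed passed: move on to the next seed in the schedule.
            self._seed += self.SEED_STEP
        self._rng = random.Random(self._seed)
        initial_state = self._rng.random()
        return initial_state, {"seed": self._seed}


env = SeedStepEnv()
seeds = [env.reset(seed=5)[1]["seed"]]
seeds += [env.reset()[1]["seed"] for _ in range(3)]
print(seeds)  # [5, 15, 25, 35]
```

Note that this schedule is fully determined by the first seed, so two runs that both start from the same seed will see the same sequence of episodes.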
There are two places related to the seed, so which seed takes effect?
Following your advice, we modified the above. We found that although the seed differs in each episode (for example seed=10, 20, 30, 40, 50 for five episodes), on the next run the seeds are again 10, 20, 30, 40, 50, so the training result is still 'Episode reward: 17159.260873794556', the same as with seed=None or other values. So we think the seed setting in def reset() has no effect. The following is related to def set_seed(), but we did not find a solution.
(3) In omnisafe\adapter\onpolicy_adapter.py,
we modified the above as:
or deleted seed_all(self._seed), as follows:
we find that the training results can be different. We think the second seed location takes effect, that is, the seed setting in CPO.yaml. Is that right? |
I think I need to clarify the meaning of the seed mechanism:
|
Thank you, I understand your meaning.
|
The training reward should increase, and the cost should decrease.
Q1: Why is the reward on a downward trend? Our goal is to minimize economic costs, so we set a negative value, such as
Q2: Does CPO support only one constraint? Our setting is
Thank you for your reply! |
I'm sorry, but I'm not an expert in applying SafeRL to trading transactions. You need to focus on whether maximizing reward and minimizing cost can coexist. For instance, in the Safety-Gymnasium supported by OmniSafe, specifically in SafetyPointGoal1-v0, maximizing reward (reaching the goal) and minimizing cost (avoiding collisions) can coexist, meaning the agent can choose a safe path to the goal. If the environment is designed to meet this condition, then it might be because the default parameters of CPO are not well suited to your task, and you can use
OmniSafe's CPO currently does not support multiple constraints. You can try to handle this by summing up the two cost functions or taking their average, depending on their actual meanings. |
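The suggestion of folding two constraints into one cost signal can be sketched as follows. `combine_costs` is a hypothetical helper, not an OmniSafe function; it assumes each constraint already yields a non-negative scalar cost per step.

```python
def combine_costs(costs, mode="sum"):
    """Merge several per-step costs into the single cost a one-constraint
    algorithm expects. `mode` is "sum" or "mean" (illustrative options)."""
    values = [float(c) for c in costs]
    if not values:
        raise ValueError("at least one cost is required")
    total = sum(values)
    if mode == "sum":
        return total
    if mode == "mean":
        return total / len(values)
    raise ValueError(f"unknown mode: {mode!r}")


print(combine_costs([0.5, 1.5]))               # 2.0
print(combine_costs([0.5, 1.5], mode="mean"))  # 1.0
```

Whichever mode is used, the cost limit would need to be chosen on the same scale as the merged cost, and the two costs should be in comparable units so that one does not dominate the other.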
Looking forward to your reply about the above problems, thank you~ |
I apologize for the late reply. I will address your questions one by one:
sub_figures[1].set_ylim(COST_LOWER, COST_UPPER)
sub_figures[0].set_ylim(REWARD_LOWER, REWARD_UPPER)
(2) a. The original design intention of
b. If your evaluation results are very inconsistent with training, you might consider changing the deterministic strategy to a stochastic strategy, like:
act = self._actor.predict(
    obs.reshape(
        -1,
        obs.shape[-1],  # to make sure the shape is (1, obs_dim)
    ),
    deterministic=False,
).reshape(
    -1,  # to make sure the shape is (act_dim,)
)
or carefully check whether the environment imported during |
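To show what switching from a deterministic to a stochastic strategy changes, here is a toy sketch of a Gaussian policy. `select_action` is a hypothetical function and not OmniSafe's actor; it assumes the policy outputs a per-dimension mean and standard deviation.

```python
import random


def select_action(mean, std, deterministic, rng):
    """Return the mean when deterministic, otherwise sample around it."""
    if deterministic:
        # Deterministic evaluation: always the most likely action.
        return list(mean)
    # Stochastic evaluation: sample each dimension, as during training.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


mean = [0.0, 1.0]
std = [0.5, 0.5]

greedy = select_action(mean, std, True, random.Random(0))
sampled = select_action(mean, std, False, random.Random(0))

print(greedy)   # [0.0, 1.0]
print(sampled)  # a noisy action near the mean
```

If training returns were collected with sampled actions, evaluating with the mean action can behave differently, which is one possible source of a gap between training and evaluation results.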
Required prerequisites
Questions
Thank you for your work. After I successfully ran the following training code,
how do I test next?
Method1:
we modified it to
omnisafe eval ./examples/runs/CPO-{Custom0-v0}
Method2: run the file ./examples/evaluate_saved_policy.py
LOG_DIR = /examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21
So which method is right, or what is the correct way to test the trained model? What is the difference between Method1 and Method2? Is the trained model saved in examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21? Thank you~
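One quick way to check whether a run directory actually contains saved models is to list its checkpoint files. The sketch below is only an illustration: `list_checkpoints` is a hypothetical helper, and the `torch_save` subfolder and `.pt` extension are assumptions about the run directory layout that should be verified against your own LOG_DIR.

```python
import os


def list_checkpoints(log_dir):
    """Return sorted checkpoint file names found under log_dir/torch_save.

    Assumption: checkpoints are `.pt` files inside a `torch_save` subfolder.
    """
    save_dir = os.path.join(log_dir, "torch_save")
    if not os.path.isdir(save_dir):
        return []
    return sorted(
        entry.name
        for entry in os.scandir(save_dir)
        if entry.is_file() and entry.name.endswith(".pt")
    )


LOG_DIR = "./examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21"
print(list_checkpoints(LOG_DIR))
```

An empty list means either the path is wrong or no checkpoint was written, which is worth ruling out before comparing the two evaluation methods.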