Changing max_depth and planning_time for POMCP #32
I’m not following. Pseudocode would help. What exactly goes on in the for loop?
Same comment as above.
I think I get what you're trying to do. You'd like to change POMCP's hyperparameters, such as depth or simulation time, between planning steps. Currently, POMCP/POUCT doesn't support changing hyperparameters after creation. But as I commented in the other thread, it is not costly to create these instances. The search tree is saved in the agent, not in the planner.
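To illustrate, here is a minimal sketch (assuming the pomdp_py calls used later in this thread, with `agent` standing for the problem's agent): creating a fresh planner is cheap, and the search tree it builds is stored on the agent as `agent.tree`, so a new planner instance with different hyperparameters picks up where the previous one left off.

```python
from pomdp_py import POMCP

# First planner instance: plan once, which builds/extends agent.tree.
planner = POMCP(max_depth=10, discount_factor=0.95, planning_time=0.5,
                exploration_const=110, rollout_policy=agent.policy_model)
action = planner.plan(agent)
print(agent.tree)  # the search tree lives on the agent, not on the planner

# Second planner instance with different hyperparameters: it plans against
# the same agent, reusing the tree that is already stored there.
planner = POMCP(max_depth=5, discount_factor=0.95, planning_time=0.1,
                exploration_const=110, rollout_policy=agent.policy_model)
action = planner.plan(agent)
```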
I think you understood, but let me provide pseudocode anyway:

```python
for trial_n in range(n_trials):
    # Get information about the true state and apply the transition
    next_label = int(y_test[trial_n])
    true_state = TDState(next_label, 0)  # (state, time_step)
    bci_problem.env.apply_transition(true_state)

    for step_n in range(total_steps):
        # Here I would like to change the max_depth hyperparameter
        remaining_steps = pomdp_steps - step_n
        POMCP._max_depth = remaining_steps

        # Same for the planning time
        if remaining_steps == pomdp_steps:
            planning_time = 0.5
        else:
            planning_time = 0.1
        POMCP._planning_time = planning_time
```
Regarding this, I can show you a printout of the belief I compute for every trial at every time-step:
Ah, I see! So the planner just reads and writes the tree from/to the agent. Then, if I understood correctly, I can just create a new instance of the planner whenever I want to change the parameters? As in:

```python
for trial_n in range(n_trials):
    # Get information about the true state and apply the transition
    next_label = int(y_test[trial_n])
    true_state = TDState(next_label, 0)  # (state, time_step)
    bci_problem.env.apply_transition(true_state)

    for step_n in range(total_steps):
        # Here I would like to change the max_depth hyperparameter
        remaining_steps = pomdp_steps - step_n

        # Same for the planning time
        if remaining_steps == pomdp_steps:
            planning_time = 0.5
        else:
            planning_time = 0.1

        planner = POMCP(max_depth=remaining_steps, discount_factor=gamma,
                        planning_time=planning_time, exploration_const=110,
                        rollout_policy=agent.policy_model)
        action = planner.plan(problem.agent)
```

Would this suffice to model the problem as finite-horizon, or do I have to add a terminal state to the model?
Yes, in this case you would be planning with a finite horizon. You can give the last code block a try. Btw, if you haven't checked it out yet, the tree debugger feature should be very helpful here for inspecting the search tree and seeing whether it is doing what you expect.
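To make that loop concrete, here is a rough sketch of one planning step once the planner has been recreated, following the usual pomdp_py online-planning pattern (the `get_observation` helper is hypothetical, standing in for however the real observation is obtained at each step; other names come from the snippets above):

```python
# Inside the step loop, after constructing `planner` with the new
# max_depth / planning_time:
action = planner.plan(problem.agent)

# Problem-specific: obtain the real observation for this time step,
# e.g. from the classifier output (hypothetical helper).
real_observation = get_observation(trial_n, step_n)

# Keep the agent's history and belief/tree in sync with what actually
# happened, so the next planner instance can reuse the tree stored on the agent.
problem.agent.update_history(action, real_observation)
planner.update(problem.agent, action, real_observation)
```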
Thank you. It is nice that you bring up the tree debugger, as you also mentioned it on #27 when I asked about offline planning (i.e. planning only at the beginning of the trial and then using the tree for the rest of the trial). I have checked the documentation but I am not really sure what to check in this case. I guess I can check it after each planning step?
Yes, you can definitely check that. You can also use it to debug / trace down why a certain decision was made.
Could you explain further how to do this?
I'm mostly referring to traversing the tree through indexing, explained here. You can find the definition of the search tree in Figure 1 of the POMCP paper. To explain it in a nutshell, the search tree contains two types of nodes, VNode and QNode: each VNode corresponds to some history, and each QNode corresponds to a history followed by an action.
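A minimal sketch of that kind of inspection (assuming `TreeDebugger` lives under `pomdp_py.utils` as in the documentation; the `children` / `num_visits` / `value` attribute names are taken from the pomdp_py source, so check the TreeDebugger docs for the exact indexing syntax):

```python
from pomdp_py.utils import TreeDebugger

# After at least one planner.plan(agent) call, the search tree hangs off the agent.
dd = TreeDebugger(agent.tree)
print(dd)  # summary of the root VNode

# The root VNode's children are QNodes, one per action. Inspecting their
# visit counts and values shows where the simulations were spent.
for action, qnode in agent.tree.children.items():
    print(action, qnode.num_visits, qnode.value)

# Indexing the debugger traverses the tree (e.g. dd[0] for the first child);
# see the TreeDebugger documentation for the full indexing options.
```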
Thank you for the explanation. After exploring the tree, I realized almost all simulations were spent on the 'wait' action (the equivalent of 'listen' in my model). As a result, the value for a given action was only higher than 'wait' on the last step of the model, even when all the observations were consistent with that action. I changed the exploration constant to the default value (it was 110 before, from the Tiger example), and that took care of the issue with the belief staying at 1.0 for several time steps.

On a separate note, I am now experimenting with how the confusion matrix is penalized at different time steps. Since each time step obtains observations based on more data, I wanted to smooth the confusion matrix as done in Park and Kim, 2012 (bottom of page 7; the equation is not numbered). I am using the same q0 parameter (0.3) for the last time step, increasing it by 0.05 for each previous time step. With this modification, trials that receive incorrect observations in the early time steps now take longer to increase the belief in the corresponding state, giving the model time to (hopefully) receive the correct observations at later time steps and avoid false positives.

After I did that and explored the tree, I noticed the trials that produce false positives do so with a belief of 0.85. Could this be related to the exploration constant as well, since the default value is a small number?
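For concreteness, a minimal sketch of that kind of smoothing. The exact equation in Park and Kim (2012) is not reproduced here; this assumes a simple blend of each (row-normalized) confusion-matrix row with a uniform distribution, controlled by q0, with the q0 schedule described above:

```python
import numpy as np

def smooth_confusion_matrix(conf, q0):
    """Blend each row of a row-normalized confusion matrix with a uniform
    distribution. q0 is the probability mass spread uniformly; rows still
    sum to 1 afterwards. (Assumed form, not necessarily the paper's equation.)"""
    n_classes = conf.shape[1]
    return (1.0 - q0) * conf + q0 / n_classes

# Heavier smoothing at earlier time steps: q0 = 0.3 at the last step,
# +0.05 for each earlier step.
conf = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
n_steps = 4
q0_schedule = [0.3 + 0.05 * (n_steps - 1 - t) for t in range(n_steps)]
smoothed = [smooth_confusion_matrix(conf, q0) for q0 in q0_schedule]
```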
This sounds like a good thing.
I don't follow what "false positives" means in this context. Are you receiving "incorrect" observations when the belief of the true environment state is 0.85? Did you generate the observation based on a sampled state from the belief, or from the environment's true state?

Regarding the exploration constant, I can't really comment on your case. I can say that setting the exploration constant too high will result in visiting every node roughly equally often, while setting it too low may leave the search unable to find a solution, or lead it to a highly suboptimal one. There is a lot of literature on the UCB1 exploration constant. The heuristic from the POMCP paper is to set it to R_hi - R_lo (but of course this is not a rule).
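For reference, the exploration constant is the `c` in the UCB1 action-selection rule used inside the simulations (the standard rule from the POMCP paper), written out here as a small sketch:

```python
import math

def ucb1_score(q_value, n_visits_parent, n_visits_action, c):
    """UCB1 score for choosing an action during a simulation.

    q_value          -- current value estimate of the (history, action) node
    n_visits_parent  -- visit count of the history node
    n_visits_action  -- visit count of the (history, action) node
    c                -- the exploration constant discussed above
    """
    return q_value + c * math.sqrt(math.log(n_visits_parent) / n_visits_action)
```

A larger `c` inflates the bonus for rarely tried actions (so visitation becomes more uniform), while a smaller `c` makes the search commit to its current value estimates sooner.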
Thank you! The only thing I don't like is that the values for the smoothing are arbitrary. I will also try to penalize the matrix row-wise, so that instead of a heuristic prior on how much more precise later steps are than earlier steps, I let the uncertainty in the observation model decide how much each class needs to be penalized at each step.
Sorry. It means deciding to take an action that does not correspond to the true state of the environment. In the Tiger example, it would be the equivalent of the agent choosing 'open-left' when the tiger is on the right, or vice versa.
I am receiving incorrect observations for several time steps in succession, and that causes the belief for the state the observation corresponds with to reach 0.85. My intuition is that, even if that happens in earlier time steps, later time steps should (or are more likely to) receive the correct observation.
According to what I read in the POMCP paper, this requires running the problem with an exploration constant of 0 to obtain one of the two parameters. That is not really practical in the case I am studying, so I will turn to the literature to see if I can find any methods that are based on the observation model.
Hello!

I am working on the same problem discussed in #27 and, as the problem includes a limit on the number of time-steps per trial, I am trying to model it as a finite-horizon POMDP. As such, I would like to initialize `POMCP` with a `max_depth` equal to the maximum number of time-steps for the trial, and then change it at every time step so that each simulation takes the horizon into account when planning online. I noticed that the belief for a given state sometimes reaches 1 and yet the agent does not make a decision until the last or second-to-last time-step, and I thought this may be a potential cause. However, I am getting an error saying that `POMCP` has no attribute `max_depth` (or `_max_depth`). What can I do?

In a similar fashion, I would like to change the planning time: at the beginning of each trial, the model has 0.5 seconds to plan, then 0.1 seconds at each subsequent time-step.