
observation_and_action_constraint_splitter behavior #292

Open

alfoudari opened this issue Jan 23, 2020 · 3 comments

alfoudari commented Jan 23, 2020

I have been trying to use observation_and_action_constraint_splitter to mask illegal actions. This is my function:

def illegal_moves(obs):
    if tf.executing_eagerly():
        obs_flat = np.reshape(obs, (9,))
        zero_vector = np.zeros(shape=(9,), dtype=np.int32)
        mask = [np.equal(obs_flat, zero_vector)]
    else:
        mask = np.zeros(shape=(9,), dtype=np.int32)
        print("graph", obs, mask)

    return obs, mask

There are two main issues I can't wrap my head around:

Eager Execution vs Graph Execution

You can see that my function differentiates between the two. The reason is that I get the following output when I train my agent:

$ python -m agent.train
agent.train() start
graph Tensor("Squeeze_3:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_3:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
agent.train() end
agent.train() start
agent.train() end
agent.train() start
...

agent.train() executes in graph mode for the first iteration and then appears to execute eagerly for the subsequent iterations. Why is that? Should I process that batch of tensors, and if so, what session do I evaluate it with?
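(Editor's note: the behavior above is consistent with tf.function tracing rather than an actual switch to eager mode — the Python body, including print(), runs only while the function is traced into a graph; later iterations replay the compiled graph without re-running the Python code. A minimal sketch, with an illustrative probe function, not TF-Agents code:

```python
import tensorflow as tf

@tf.function
def probe(obs):
    # Python side effects such as print() run only while this function is
    # being traced into a graph; replaying the compiled graph skips them.
    print("tracing with", obs)
    return obs * 2

x = tf.constant([1, 2, 3])
probe(x)  # first call: traces, so the print above fires
probe(x)  # same input signature: replays the graph, no Python print
```

This matches the log above: the "graph ..." lines appear only while the training step is first traced, then stop.)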

Effect on the Policy

The illegal_moves function is used both in the agent definition:

agent = dqn_agent.DqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        q_network=q_net,
        observation_and_action_constraint_splitter=illegal_moves,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)

and for policies:

random_policy = random_tf_policy.RandomTFPolicy(
    time_step_spec=train_env.time_step_spec(), 
    action_spec=train_env.action_spec(),
    observation_and_action_constraint_splitter=illegal_moves)

Digging into the DQN agent code, I saw that agent.policy already applies the illegal_moves splitter I pass to the agent: here.

After training the agent and saving the policy, I ran a few tests and found that the policy is still picking illegal moves. So this raises the question: is illegal_moves actually being applied to agent.policy?
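(Editor's note: one way to check this empirically is to replay the trained policy against a legality predicate. A hypothetical helper, assuming a flattened tic-tac-toe board where 0 marks an empty cell:

```python
import numpy as np

def action_is_legal(board, action):
    # A move is legal iff the chosen cell of the 3x3 board is still empty (0).
    return np.reshape(board, (9,))[int(action)] == 0

# Inside the evaluation loop, assert on every step, e.g.:
#   assert action_is_legal(time_step.observation, action_step.action)
```

A single violating step pins down whether the deployed policy really receives the splitter.)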

Bonus question: @tf.function

I have tried decorating illegal_moves with @tf.function, which for some reason made it ineffective during training; i.e., illegal moves were still being picked. Any explanation for this?

train.py is available in this gist. Thanks!

ebrevdo (Contributor) commented Jan 29, 2020

@nealwu ptal.

nealwu (Contributor) commented Jan 29, 2020

Hi @abstractpaper, glad to hear that you're interested in action masking. In order to reproduce this, are you able to provide a more minimal example where the policy still generates illegal actions? The ideal example would be in the form of a unit test, something similar to https://github.com/tensorflow/agents/blob/master/tf_agents/policies/q_policy_test.py#L202.

One thing worth pointing out is that in the mask, 1s represent valid actions and 0s represent invalid actions. Your mask of all zeros in the graph case is actually invalid, since none of the actions are legal, so the policy will end up just choosing one randomly.
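(Editor's note: a splitter that builds the mask with TF ops works identically whether it is traced into a graph or run eagerly, avoiding the NumPy-in-graph-mode problem above. A sketch under the same tic-tac-toe assumption — 0 marks an empty, hence legal, cell:

```python
import tensorflow as tf

def observation_and_mask_splitter(obs):
    # Flatten each (possibly batched) 3x3 board into 9 cells.
    obs_flat = tf.reshape(obs, (-1, 9))
    # 1 = legal (empty cell), 0 = illegal (occupied cell).
    mask = tf.cast(tf.equal(obs_flat, 0), tf.int32)
    return obs, mask
```

Because it uses only tf ops, no executing_eagerly() branch is needed.)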

In order to ensure you get eager execution, I believe you should enable eager execution as early as possible in your program (for example at the beginning of your main function). It should be enabled by default in TF 2.0 though.
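(Editor's note: in TF 1.x that means calling the v1 compat switch before any other TF op; in TF 2.x eager execution is already on by default, so the call is unnecessary:

```python
import tensorflow as tf

# In TF 1.x this must run before any graphs are built; in TF 2.x eager
# execution is the default and this branch is skipped.
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()
```
)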

alfoudari (Author) commented

@nealwu While attempting to write a unit test for this issue, I ran into another issue that the unit test depends on: #295. One problem at a time :D
