
observation_and_action_constraint_splitter behavior #292

Open

alfoudari opened this issue Jan 23, 2020 · 3 comments

alfoudari commented Jan 23, 2020

I have been trying to use observation_and_action_constraint_splitter to mask illegal actions. This is my function:

def illegal_moves(obs):
    if tf.executing_eagerly():
        obs_flat = np.reshape(obs, (9,))
        zero_vector = np.zeros(shape=(9,), dtype=np.int32)
        mask = [np.equal(obs_flat, zero_vector)]
    else:
        mask = np.zeros(shape=(9,), dtype=np.int32)
        print("graph", obs, mask)

    return obs, mask

There are two main issues I can't wrap my head around:

Eager Execution vs Graph Execution

You can see that my function differentiates between the two. The reason is that I get the following output when I train my agent:

$ python -m agent.train
agent.train() start
graph Tensor("Squeeze_3:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_3:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
graph Tensor("Squeeze_8:0", shape=(64, 3, 3), dtype=int32) [0 0 0 0 0 0 0 0 0]
agent.train() end
agent.train() start
agent.train() end
agent.train() start
...

agent.train() executes in graph mode for the first iteration and then appears to execute eagerly for the subsequent iterations. Why is that? Should I process that batch of tensors, and if so, what session do I evaluate it with?
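(Editor's note: the behavior above is consistent with tf.function tracing rather than an actual switch to eager mode — the Python body, including print(), runs only while the function is traced into a graph; later iterations replay the compiled graph without re-running the Python code. A minimal sketch, with an illustrative probe function, not TF-Agents code:

```python
import tensorflow as tf

@tf.function
def probe(obs):
    # Python side effects such as print() run only while this function is
    # being traced into a graph; replaying the compiled graph skips them.
    print("tracing with", obs)
    return obs * 2

x = tf.constant([1, 2, 3])
probe(x)  # first call: traces, so the print above fires
probe(x)  # same input signature: replays the graph, no Python print
```

This matches the log above: the "graph ..." lines appear only while the training step is first traced, then stop.)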

Effect on the Policy

The illegal_moves function is used both in the agent definition:

agent = dqn_agent.DqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        q_network=q_net,
        observation_and_action_constraint_splitter=illegal_moves,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)

and for policies:

random_policy = random_tf_policy.RandomTFPolicy(
    time_step_spec=train_env.time_step_spec(), 
    action_spec=train_env.action_spec(),
    observation_and_action_constraint_splitter=illegal_moves)

Digging into the DQN agent code, I saw that agent.policy already applies the illegal_moves splitter I pass to the agent: here.

After training the agent and saving the policy, I ran a few tests and found that the policy is still picking illegal moves. So this raises the question: is illegal_moves actually being applied to agent.policy?
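(Editor's note: one way to check this empirically is to replay the trained policy against a legality predicate. A hypothetical helper, assuming a flattened tic-tac-toe board where 0 marks an empty cell:

```python
import numpy as np

def action_is_legal(board, action):
    # A move is legal iff the chosen cell of the 3x3 board is still empty (0).
    return np.reshape(board, (9,))[int(action)] == 0

# Inside the evaluation loop, assert on every step, e.g.:
#   assert action_is_legal(time_step.observation, action_step.action)
```

A single violating step pins down whether the deployed policy really receives the splitter.)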

Bonus question: @tf.function

I have tried decorating illegal_moves with @tf.function, which for some reason made it ineffective during training; i.e., illegal moves were still being picked. Any explanation for this?

train.py is available in this gist. Thanks!

ebrevdo (Contributor) commented Jan 29, 2020

@nealwu ptal.

nealwu (Contributor) commented Jan 29, 2020

Hi @abstractpaper, glad to hear that you're interested in action masking. In order to reproduce this, are you able to provide a more minimal example where the policy still generates illegal actions? The ideal example would be in the form of a unit test, something similar to https://github.com/tensorflow/agents/blob/master/tf_agents/policies/q_policy_test.py#L202.

One thing worth pointing out is that in the mask, 1s represent valid actions and 0s represent invalid actions. Your mask of all zeros in the graph case is actually invalid, since none of the actions are legal, so the policy will end up just choosing one randomly.
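(Editor's note: a splitter that builds the mask with TF ops works identically whether it is traced into a graph or run eagerly, avoiding the NumPy-in-graph-mode problem above. A sketch under the same tic-tac-toe assumption — 0 marks an empty, hence legal, cell:

```python
import tensorflow as tf

def observation_and_mask_splitter(obs):
    # Flatten each (possibly batched) 3x3 board into 9 cells.
    obs_flat = tf.reshape(obs, (-1, 9))
    # 1 = legal (empty cell), 0 = illegal (occupied cell).
    mask = tf.cast(tf.equal(obs_flat, 0), tf.int32)
    return obs, mask
```

Because it uses only tf ops, no executing_eagerly() branch is needed.)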

In order to ensure you get eager execution, I believe you should enable eager execution as early as possible in your program (for example at the beginning of your main function). It should be enabled by default in TF 2.0 though.
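(Editor's note: in TF 1.x that means calling the v1 compat switch before any other TF op; in TF 2.x eager execution is already on by default, so the call is unnecessary:

```python
import tensorflow as tf

# In TF 1.x this must run before any graphs are built; in TF 2.x eager
# execution is the default and this branch is skipped.
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()
```
)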

alfoudari (Author) commented

@nealwu While attempting to write a unit test for this issue, I ran into another issue that the unit test depends on: #295. One problem at a time :D
