Solving "Battleship" with POMDPs.jl/BasicPOMCP (legal action spaces that change over time) #336
Replies: 3 comments 15 replies
-
Going to make this a discussion - hope we can provide some help!
-
I haven't had time to think on this in detail; I might have some cycles later this week. But your larger question at the end reminds me vaguely of a use case I had looked into a couple of years ago.
-
I suspect that the main problem here is that "clicked" events are recorded in the global POMDP struct rather than in the state. A fairly common solution in such situations is to have the "clicked" events be part of the state. That is, the state should record which tiles have already been clicked, so this record is reset along with everything else whenever the state changes.
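A minimal sketch of that suggestion, assuming the 3x3 setup from the question below; `BSState` and its fields are hypothetical names, not code from this thread:

```julia
# Hypothetical state type: the click record travels with the state,
# so it is reset whenever the state itself is reset.
struct BSState
    ships::Vector{Int}   # 1 = ship tile, 0 = empty tile
    board::Vector{Int}   # -1 = untried, 0 = tried & missed, 1 = tried & hit
end
```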
-
Hi,
Fantastic job with this library. Looks really nice. I am trying to implement a POMDP problem that is pretty much a scaled-down version of the battleship problem in the original Silver & Veness POMCP paper (https://papers.nips.cc/paper/2010/hash/edfbe1afcf9246bb0d40eb4d8027d90f-Abstract.html).
Each state in the state space is a 3x3 grid (or an array of 9 numbers) of ones and zeros denoting the locations of the battleships, where 1 marks a ship ("hit") tile and 0 an empty ("miss") tile. There are 9 actions, one for each tile in the 3x3 grid, and two observations, "hit" or "miss."
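For concreteness, a hypothetical POMDPs.jl skeleton of this spec (all names are mine, and the state type already folds in the click record suggested in the reply above):

```julia
using POMDPs

# Hypothetical problem type; BSState is the ships-plus-click-record
# state sketched in the reply above.
struct MiniBattleship <: POMDP{BSState, Int, Symbol}
    p_ship::Float64   # assumed per-tile ship probability for fresh layouts
end

POMDPs.actions(m::MiniBattleship) = 1:9                 # one action per tile
POMDPs.observations(m::MiniBattleship) = (:hit, :miss)  # the two observations
POMDPs.discount(m::MiniBattleship) = 0.95               # assumed discount
```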
This should be a simple POMDP to implement, but the catch is that the agent cannot click on a tile twice until the state has changed (i.e., as long as the agent is in state s, each action can be chosen at most once). I've implemented this as a field of the POMDP struct called `board`, an array of -1 (tile not chosen by the agent yet), 0 (tile chosen, miss), and 1 (tile chosen, hit). I am using the `gen` function to implement the transition so that I can generate the next state, observation, and reward simultaneously. In the `gen` function, once the agent has hit all ship tiles, the board is cleared back to all -1's and the state changes. The code is pretty short and can be seen below. I tried using BasicPOMCP on this problem, but I am noticing that the board isn't getting cleared across state changes. This is probably because of the rather "hack-y" way I have implemented checking whether the agent has taken a specific action or not.
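For comparison, a sketch of a `gen` that adopts the state-carried board (reusing the hypothetical `BSState` and `MiniBattleship` above); because the click record lives in the state, it is cleared as part of the reset rather than needing a separate global wipe:

```julia
using POMDPs, Random

function POMDPs.gen(m::MiniBattleship, s::BSState, a::Int, rng::AbstractRNG)
    board = copy(s.board)                 # never mutate the incoming state
    hit = s.ships[a] == 1
    board[a] = hit ? 1 : 0                # record the click on this tile
    o = hit ? :hit : :miss
    r = hit ? 1.0 : -1.0                  # placeholder rewards
    if all(i -> s.ships[i] == 0 || board[i] == 1, 1:9)
        # Every ship tile has been hit: sample a fresh layout, clear the board.
        ships = [rand(rng) < m.p_ship ? 1 : 0 for _ in 1:9]  # placeholder layout sampler
        return (sp = BSState(ships, fill(-1, 9)), o = o, r = r)
    else
        return (sp = BSState(s.ships, board), o = o, r = r)
    end
end
```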
What would be the best way to implement this problem? I briefly considered making the entire board the observation, but I am not sure I can pass the current observation into the `gen` function — it only receives the state and action...
The larger question here is: what is the best way to implement constraints on the agent's action space based on the current history/observations?
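One hedged option, assuming the click record lives in the state as sketched above: POMDPs.jl's interface includes a state-dependent `actions(m, s)`, so legality becomes a pure function of the state. Whether a given solver such as BasicPOMCP consults the state- or belief-dependent form when it expands tree nodes is solver-specific, so check its documentation before relying on this:

```julia
# Only untried tiles are legal from state s (BSState as sketched above).
POMDPs.actions(m::MiniBattleship, s::BSState) = [i for i in 1:9 if s.board[i] == -1]
```

A fallback that works with any solver is to keep the action space fixed and give repeated clicks a large negative reward, so the planner learns to avoid them even when it cannot prune them outright.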
Thanks for the help in advance!