Defining Exo-MDPs with External Data #564
I'm trying to define an Exo-MDP (Markov Decision Process with Exogenous Inputs). The Exo-MDP is defined by two assumptions: the state splits into an endogenous component and an exogenous component, and the exogenous component's transitions are independent of the agent's actions and the endogenous state.
To begin, I'm trying to implement a purely exogenous MDP (no endogenous state), and here is the part I'm stuck on. I want to train a policy using exogenous state transitions sampled from historical/realized data. In theory this should be possible, but I'm not sure whether POMDPs.jl is the right tool for it.

The practical concern is how to inject the external data. Should I use a Channel in the MDP object? Should I use global or outer-scoped variables, or closures? Are any of these bad (or recommended) patterns when working with solvers? Another option would be to somehow feed sampled trajectories to a solver directly. That would be the cleanest approach, but I don't think POMDPs.jl is built to work like this.

I guess my biggest question is whether POMDPs.jl is fundamentally not designed for the kind of problem I'm trying to define. Thanks.
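For concreteness, here is the kind of thing I have in mind for the "put the data in the MDP object" option. It is only a sketch: `ExoFromDataMDP`, the state layout (time step, exogenous value), the action set, and the reward are all made up for illustration.

```julia
using POMDPs
using POMDPTools: Deterministic

# Sketch: carry the realized exogenous samples inside the problem object.
# State = (time step, exogenous value); actions are placeholder Ints.
struct ExoFromDataMDP <: MDP{Tuple{Int,Float64}, Int}
    exo_samples::Matrix{Float64}  # rows = realized trajectories, columns = time steps
end

POMDPs.actions(::ExoFromDataMDP) = 0:2
POMDPs.discount(::ExoFromDataMDP) = 0.95
POMDPs.initialstate(m::ExoFromDataMDP) = Deterministic((1, m.exo_samples[1, 1]))
POMDPs.isterminal(m::ExoFromDataMDP, s) = s[1] >= size(m.exo_samples, 2)

# Generative step: re-draw which historical trajectory to follow at every step.
# Whether this resampling is a sound way to consume realized data is part of my question.
function POMDPs.gen(m::ExoFromDataMDP, s, a, rng)
    t, _ = s
    traj = rand(rng, 1:size(m.exo_samples, 1))
    exo_next = m.exo_samples[traj, t + 1]
    return (sp = (t + 1, exo_next), r = -abs(exo_next - a))  # placeholder reward
end
```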
Replies: 1 comment
For reference, I got a minimal example working with closures, heap-allocated data, and `@eval` (with MCTS). It is straightforward: you share data using closures for each part of the interface (e.g. `initialstate`, `transition`, ...) that requires exogenous state, using heap-allocated data for communication between methods if necessary. I'm not using any multithreading right now for simulations or solvers, so communicating this way should be fine.

The MDP I implemented was not actually a pure Exo-MDP, because some other external data was also needed during the reward calculation, which made it a little more complicated. If you just need to sample the initial state from external data (no other exogenous state requirement), using channels would probably be best. If anyone has ideas or comments, they're appreciated.
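To make the closure pattern concrete, here is a minimal sketch of the shape of what I did, not my actual code. It uses QuickPOMDPs so the interface functions are ordinary closures over the external data, with a `Ref` standing in for the heap-allocated scratch shared between them; `exo_samples`, the state layout, and the reward are placeholder assumptions.

```julia
using POMDPs, POMDPTools, QuickPOMDPs, MCTS
using Random

# Placeholder external data: rows = realized exogenous trajectories, columns = time steps.
exo_samples = randn(MersenneTwister(1), 100, 24)

# Heap-allocated scratch shared between the interface closures (fine single-threaded;
# would need rethinking with parallel simulation or a multithreaded solver).
current_traj = Ref(1)

mdp = QuickMDP(
    # Generative model: a closure over exo_samples and current_traj.
    function (s, a, rng)
        t, _ = s
        exo_next = exo_samples[current_traj[], t + 1]
        sp = (t + 1, exo_next)
        r = -abs(exo_next - a)   # placeholder reward using the exogenous state
        return (sp = sp, r = r)
    end,
    statetype = Tuple{Int, Float64},
    actiontype = Int,
    actions = 0:2,
    discount = 0.95,
    isterminal = s -> s[1] >= size(exo_samples, 2),
    # Initial-state closure: pick a historical trajectory and record it in the scratch Ref.
    initialstate = ImplicitDistribution() do rng
        current_traj[] = rand(rng, 1:size(exo_samples, 1))
        (1, exo_samples[current_traj[], 1])
    end,
)

planner = solve(MCTSSolver(n_iterations = 200, depth = 10), mdp)
s0 = rand(MersenneTwister(2), initialstate(mdp))
a = action(planner, s0)
```

The point of the pattern is that the external data never has to live inside the problem struct: the closures capture it from the enclosing scope, and the `Ref` is the minimal shared mutable state the different interface functions need to agree on.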