Note: Some articles may be missing at the bottom of this preview page (due to length). Open README.md to get all the articles!
In this document, I share some personal notes about the latest exciting trends in research on decision-making for autonomous driving. I keep updating it 👷 🚧 😃
Template:
"title"
[Year] [📝 (paper)] [ (code)] [🎞️ (video)] [🎓 University X] [🚗 Company Y] [related, concepts]
Categories:
- Architecture and Map
- Behavioural Cloning, End-To-End and Imitation Learning
- Inverse Reinforcement Learning, Inverse Optimal Control and Game Theory
- Prediction and Manoeuvre Recognition
- Rule-based Decision Making
- Model-Free Reinforcement Learning
- Model-Based Reinforcement Learning
- Planning and Monte Carlo Tree Search
Besides, I reference additional publications in some parallel works:
- Hierarchical Decision-Making for Autonomous Driving
- Educational application of Hidden Markov Model to Autonomous Driving
- My 10 takeaways from the 2019 Intelligent Vehicle Symposium
Looking forward to your reading suggestions!
"BARK : Open Behavior Benchmarking in Multi-Agent Environments"
-
[
2020
] [📝] [] [] [ 🎓Technische Universität München
] [ 🚗Fortiss
,AID
] -
[
behavioural models
,robustness
,open-loop simulation
,behavioural simulation
,interactive human behaviors
]
Click to expand
The ObservedWorld model reflects the world that is perceived by an agent. Occlusions and sensor noise can be introduced in it. The simultaneous movement makes simulator planning cycles entirely deterministic. Source. |
Two evaluations. Left: robustness of the planning model against the transition function. The scenario's density is increased by reducing the time-headway IDM parameter of the interacting vehicles. An inaccurate prediction model impacts the performance of MCTS (2k, 4k, and 8k search iterations) and RL-based (SAC) planners. Right: an agent from the dataset is replaced with various agent behaviour models, using four different parameter sets for the IDM. Agent sets A0, A1, A2, A6 are not replaced with the IDM since this model cannot change lanes. Maintaining a specific order is key for merging, but without fine-tuning model parameters, most behaviour models fail to coexist next to replayed agents. Source. |
Authors: Bernhard, J., Esterle, K., Hart, P., & Kessler, T.
- BARK is an acronym for Behavior BenchmARK and is open-source under the MIT license.
- Motivations:
  - 1- Focus on driving behaviour models for planning, prediction, and simulation.
    - "BARK offers a behavior model-centric simulation framework that enables fast-prototyping and the development of behavior models. Behavior models can easily be integrated — either using Python or C++. Various behavior models are available ranging from machine learning to conventional approaches."
  - 2- Benchmark interactive behaviours.
    - "To model interactivity, planners must employ some kind of prediction model of other agents."
- Why are existing simulation frameworks limiting?
  - "Most simulations rely on datasets and simplistic behavior models for traffic participants and do not cover the full variety of real-world, interactive human behaviors. However, existing frameworks for simulating and benchmarking behavior models rarely provide sophisticated behavior models for other agents."
  - CommonRoad: only pre-recorded data are used for the other agents, i.e. only enabling non-interactive behaviour planning.
  - CARLA: a CARLA-BARK interface is available.
    - "Being based on the Unreal Game Engine, problems like non-determinism and timing issues are introduced, that we consider undesirable when developing and comparing behavior models."
  - SUMO: microscopic traffic simulators can model traffic flow but neglect interactions with other vehicles and do not track the accurate motion of each agent.
- Concept of simultaneous movement.
  - Motivation: make simulator planning cycles entirely deterministic. This enables the simulation and experiments to be reproducible.
  - "BARK models the world as a multi-agent system with agents performing simultaneous movements in the simulated world."
  - "At fixed, discrete world time-steps, each agent plans using an agent-specific behavior model in a cloned world – the agent’s observed world."
  - Hence the other agents can actively interact with the ego vehicle.
- Implemented behaviour models:
  - IDM + MOBIL (a minimal IDM sketch follows this list).
  - RL (SAC).
    - "The reward r is calculated using Evaluators. These modules are available in our Machine Learning module. As it integrates the standard OpenAI Gym interface, various popular RL libraries, such as TF-Agents, can be easily integrated and used with BARK."
  - MCTS. Single-agent or multi-agent.
    - [multi-agent] "Adapted to interactive driving by using information sets assuming simultaneous, multi-agent movements of traffic participants. They apply it to the context of cooperative planning, meaning that they introduce a cooperative cost function, which minimizes the costs for all agents."
  - Dataset Tracking Model.
    - The agent model tracks recorded trajectories as closely as possible.
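As a reminder of what such a rule-based longitudinal model computes, here is a minimal sketch of the classic IDM acceleration equation. The parameter values are illustrative defaults of mine, not the ones used in the BARK benchmark.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0,    # desired speed [m/s]
                     T=1.5,      # desired time headway [s] (the parameter reduced to densify scenarios)
                     a_max=1.4,  # maximum acceleration [m/s^2]
                     b=2.0,      # comfortable deceleration [m/s^2]
                     s0=2.0,     # minimum gap [m]
                     delta=4.0): # acceleration exponent
    """Intelligent Driver Model: longitudinal acceleration of the following vehicle."""
    dv = v - v_lead  # approaching rate to the leader
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))  # desired gap
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 1e-6)) ** 2)
```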
- Two evaluations (benchmarks) of the behavioural models.
  - "Prediction (a discriminative task) deals with what will happen, whereas simulation (often a generative task) deals with what could happen. Put another way, prediction is a tool for forecasting the development of a given situation, whereas simulation is a tool for exploring a wide range of potential situations, often with the goal of probing the robot’s planning and control stack for weaknesses that can be addressed by system developers." (Brown, Driggs-Campbell, & Kochenderfer, 2020).
  - 1- Behaviour prediction:
    - What is the effect of an inaccurate prediction model on the performance of an MCTS- or RL-based planner?
    - MCTS requires an explicit generative model for each transition. This prediction model, used internally, is evaluated here.
    - [Robustness also tested for RL] "RL can be considered as an offline planning algorithm – not relying on a prediction model but requiring a training environment to learn an optimal policy beforehand. The inaccuracy of prediction relates to the amount of behavior model inaccuracy between training and evaluation."
  - 2- Behaviour simulation:
    - How do planners perform when replacing human drivers in recorded traffic scenarios?
    - Motivation: combine data-driven (recorded -> fixed trajectories) and interactive (longitudinally controlled) scenarios.
    - "A planner is inserted into recorded scenarios. Others keep the behavior as specified in the dataset, yielding an open-loop simulation."
    - The INTERACTION Dataset is used since it provides maps, which are essential for most on-road planning approaches.
- Results and future works.
  - [RL] "When the other agent’s behavior is different from that used in training, the collision rate rises more quickly."
  - "We conclude that current rule-based models (IDM, MOBIL) perform poorly in highly dense, interactive scenarios, as they do not model obstacle avoidance based on prediction or future interaction. MCTS can be used, but without an accurate model of the prediction, it also leads to crashes."
  - "A combination of classical and learning-based methods is computationally fast and achieves safe and comfortable motions."
  - The authors also find imitation learning promising.
"LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving"
Click to expand
A bridge is selected based on the user AD stack’s runtime framework: Autoware.AI and Autoware.Auto , which run on ROS and ROS2 , can connect through standard open source ROS and ROS2 bridges, while for Baidu’s Apollo platform, which uses a custom runtime framework called Cyber RT , a custom bridge is provided to the simulator. Source. |
Authors: Boise, E., Uhm, G., Gerow, M., Mehta, S., Agafonov, E., Kim, T. H., … Kim, S.
- Motivations (Yet another simulator?):
  - "The LGSVL Simulator is a simulator that facilitates testing and development of autonomous driving software systems."
  - The main use case seems to be the integration with AD stacks: Autoware.AI, Autoware.Auto, Apollo 5.0, Apollo 3.0.
  - Compared to CARLA for instance, it seems more focused on development rather than research.
- The simulation engine serves three functions:
- Environment simulation
- Sensor simulation
- Vehicle dynamics and control simulation.
- Miscellaneous:
  - LGSVL = LG Silicon Valley Lab.
  - Based on the Unity engine.
  - An openAI-gym environment is provided for reinforcement learning: gym-lgsvl (a hypothetical usage sketch is given at the end of this section).
    - Default action space: steering and braking / throttle.
    - Default observation space: a single image from the front camera. Can be enriched.
  - For perception training, kitti_parser.py enables generating labelled data in KITTI format.
  - A custom license is defined.
    - "You may not sell or otherwise transfer or make available the Licensed Material, any copies of the Licensed Material, or any information derived from the Licensed Material in any form to any third parties for commercial purposes."
    - This makes it hard to compare with other simulators and AD software: for instance Carla, AirSim and DeepDrive are all under the MIT License, while the code for Autoware and Apollo is protected by the Apache 2 License.
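Below is a hypothetical usage sketch of such a gym-style wrapper, relying only on the standard OpenAI Gym API. The environment id and the exact observation / action contents are assumptions of mine, not taken from the gym-lgsvl documentation.

```python
import gym
import gym_lgsvl  # assumed import name, registers the LGSVL environments

env = gym.make("lgsvl-v0")  # assumed environment id
obs = env.reset()           # default observation: front-camera image
done = False
while not done:
    action = env.action_space.sample()  # [steering, braking / throttle]
    obs, reward, done, info = env.step(action)
env.close()
```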
"Overview of Tools Supporting Planning for Automated Driving"
- [2020] [📝] [🚗 Virtual Vehicle Research]
- [development tools]
Click to expand
The authors group tools that support planning in sections: maps , communication , traffic rules , middleware , simulators and benchmarks . Source. |
About simulators and dataset . And how to couple between tools, either with co-simulation software or open interfaces. Source. |
About maps. ''The planning tasks with different targets entail map models with different level of details. HD map provides the most sufficient information and can be generally categorized into three layers: road model, lane model and localization model''. Source. |
Authors: Tong, K., Ajanovic, Z., & Stettinger, G.
- Motivations:
  - 1- Help researchers make full use of open-source resources and reduce the effort of setting up a software platform that suits their needs.
    - [example] "It is a good option to choose open-source Autoware as software stack along with ROS middleware, as Autoware can be further transferred to a real vehicle. During the development, he or she can use Carla as a simulator, to get its benefits of graphic rendering and sensor simulation. To make the simulation more realistic, he or she might adopt commercial software CarMaker for sophisticated vehicle dynamics and open-source SUMO for large-scale traffic flow simulation. OpenDRIVE map can be used as a source and converted into the map format of Autoware, Carla and SUMO. Finally, CommonRoad can be used to evaluate the developed algorithm and benchmark it against other approaches."
  - 2- Avoid reinventing the wheel.
    - Algorithms are available/adapted from robotics.
    - Simulators are available/adapted from gaming.
- Mentioned software libraries for motion planning:
  - ROS (from robotics):
    - Open Motion Planning Library (OMPL)
    - MoveIt
    - navigation package
    - teb local planner
  - Python: PythonRobotics
  - CPP: CppRobotics
- How to express traffic rules in a form understandable by an algorithm?
  - 1- Traffic rules can be formalized in higher-order logic (e.g. using the Isabelle theorem prover) to check compliance with traffic rules unambiguously and formally for trajectory validation.
  - 2- Another approach is to represent traffic rules geometrically as obstacles in the configuration space of the motion planning problem.
  - "In some occasions, it is necessary to violate some rules during driving for achieving higher goals (i.e. avoiding collision) [... e.g. with] a rule book with hierarchical arrangement of different rules."
- About data visualization?
- What is missing for the research community?
  - Evaluation tools for quantitative comparison.
  - Evaluation tools incorporating human judgment, not only from the vehicle occupants but also from other road users.
  - A standard format for motion datasets.
- I am surprised the INTERACTION dataset was not mentioned.
"Decision-making for automated vehicles using a hierarchical behavior-based arbitration scheme"
- [2020] [📝] [🎓 FZI, KIT]
- [hierarchical behavioural planning, cost-based arbitration, behaviour components]
Click to expand
Both urban and highway behaviour options are combined using a cost-based arbitrator . Together with Parking and AvoidCollisionInLastResort , these four arbitrators and the SafeStop fallback are composed together to the top-most priority-based AutomatedDriving arbitrator. Source. |
Top-right: two possible options. The arbitrator generally prefers the follow lane behaviour as long as it matches the route. Here, a lane change is necessary and selected by the cost-based arbitration: ChangeLaneRight has lower cost than FollowEgoLane , mainly due to the routing term in the cost expression. Bottom: the resulting behaviour selection over time. Source. |
Authors: Orzechowski, P. F., Burger, C., & Lauer, M.
- Motivation:
  - Propose an alternative to FSMs (finite state machines) and behaviour-based systems (e.g. voting systems) in hierarchical architectures.
  - In particular, FSMs can suffer from:
    - Poor interpretability: why is one behaviour executed?
    - Maintainability: effort to refine existing behaviours.
    - Scalability: effort to achieve a high number of behaviours and to combine a large variety of scenarios.
    - Options selection: "multiple behaviour options are applicable but have no clear and consistent priority against each other."
      - "How and when should an automated vehicle switch from a regular ACC controller to a lane change, cooperative zip merge or parking planner?"
    - Multiple planners: each behaviour component can compute its manoeuvre command with any preferred state-of-the-art method.
      - "How can we support POMDPs, hybrid A* and any other planning method in our behaviour generation?"
- Main idea: cost-based arbitration between so-called "behaviour components".
  - The modularity of these components brings several advantages:
    - Various scenarios can be handled within a single framework: four-way intersections, T-junctions, roundabouts, multilane bypass roads, parking, etc.
    - By hierarchically combining behaviours, complex behaviour emerges from simple components.
    - Good efficiency: the atomic structure allows behaviour options to be evaluated in parallel.
- About arbitration (a minimal sketch of a cost-based arbitrator follows below):
  - "An arbitrator contains a list of behavior options to choose from. A specific selection logic determines which option is chosen based on abstract information, e.g., expected utility or priority."
  - [about cost] "The cost-based arbitrator selects the behavior option with the lowest expected cost."
  - Each behaviour option is evaluated based on its expected average travel velocity, incorporating routing costs and penalizing lane changes.
  - The resulting behaviour can thus be well explained:
    - "The selection logic of arbitrators is comprehensive."
  - About hierarchy:
    - "To generate even more complex behaviours, an arbitrator can also be a behaviour option of a hierarchically higher arbitrator."
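To make the selection logic concrete, here is a minimal sketch of a cost-based arbitrator. The class and method names are my own assumptions about the interface, not the paper's actual implementation.

```python
class CostBasedArbitrator:
    """Selects, among its applicable behaviour options, the one with the lowest expected cost."""

    def __init__(self, behaviour_options):
        # options can be atomic behaviour components or nested arbitrators (hierarchy)
        self.behaviour_options = behaviour_options

    def invocation_condition(self, observed_world) -> bool:
        # the arbitrator itself is applicable if at least one of its options is
        return any(b.invocation_condition(observed_world) for b in self.behaviour_options)

    def get_command(self, observed_world):
        applicable = [b for b in self.behaviour_options if b.invocation_condition(observed_world)]
        best = min(applicable, key=lambda b: b.expected_cost(observed_world))
        return best.get_command(observed_world)
```

Because an arbitrator exposes the same interface as a behaviour option, it can itself become an option of a hierarchically higher arbitrator, which is how the top-most AutomatedDriving arbitrator in the figure is composed.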
- About behaviour components (an interface sketch follows this list).
  - They are the smallest building blocks, representing basic tactical driving manoeuvres.
  - Examples of atomic behaviour components for simple tasks in urban scenarios: FollowLead, CrossIntersection, ChangeLane.
  - They can be specialized:
    - Dense-scenario behaviours: ApproachGap, IndicateIntention and MergeIntoGap to refine ChangeLane (multi-phase behaviour).
      - Note: an alternative could be to use one single integrated interaction-aware behaviour such as a POMDP.
    - Highway behaviours (structured but high speed): MergeOntoHighway, FollowHighwayLane, ChangeHighwayLane, ExitFromHighway.
    - Parking behaviours: LeaveGarage, ParkNearGoal.
    - Fail-safe emergency behaviours: EmergenyStop, EvadeObject, SafeStop.
  - For a behaviour to be selected, it should be applicable. Hence a behaviour is defined together with:
    - An invocation condition: when does it become applicable?
      - "[example:] The invocation condition of CrossIntersection is true as long as the current ego lane intersects other lanes within its planning horizon."
    - A commitment condition: when does it stay applicable?
  - This reminds me of the concept of macro actions, sometimes defined by a tuple <applicability condition, termination condition, primitive policy>.
  - It also makes me think of the MODIA framework and other scene-decomposition approaches.
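A minimal sketch of what such a behaviour-component interface could look like in code; the method names and the CrossIntersection example are my own illustration of the invocation / commitment idea, not the paper's API.

```python
from abc import ABC, abstractmethod

class BehaviourComponent(ABC):
    @abstractmethod
    def invocation_condition(self, observed_world) -> bool:
        """True when the behaviour becomes applicable."""

    @abstractmethod
    def commitment_condition(self, observed_world) -> bool:
        """True while an already-selected behaviour should stay applicable."""

    @abstractmethod
    def get_command(self, observed_world):
        """Compute the manoeuvre command with any preferred planning method (POMDP, hybrid A*, ...)."""

class CrossIntersection(BehaviourComponent):
    def invocation_condition(self, observed_world) -> bool:
        # applicable as long as the current ego lane intersects other lanes within the planning horizon
        return observed_world.ego_lane_intersects_other_lanes()   # placeholder query

    def commitment_condition(self, observed_world) -> bool:
        return not observed_world.intersection_cleared()          # placeholder query

    def get_command(self, observed_world):
        return observed_world.plan_intersection_crossing()        # placeholder planner call
```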
- A mid-to-mid approach:
  - [input] "The input is an abstract environment model that contains a fused, tracked and filtered representation of the world."
  - [output] The selected high-level decision is passed to a trajectory planner or controller.
  - What does the "decision" look like?
    - One-size-fits-all is not an option.
    - A distinction is made between manoeuvres in a structured or unstructured environment:
      - 1- unstructured: a trajectory, directly passed to a trajectory planner.
      - 2- structured: a corridor-based driving command, i.e. a tuple <maneuver corridor, reference line, predicted objects, maneuver variant>. It requires both a trajectory planner and a controller.
- One distinction:
  - 1- Top-down knowledge-based systems.
    - "The action selection [happens] in a centralized, top-down manner using a knowledge database."
    - "The engineer designing the action selection module (in FSMs the state transitions) has to be aware of the conditions, effects and possible interactions of all behaviors at hand."
  - 2- Bottom-up behaviour-based systems.
    - "Decouple actions into atomic simple behaviour components that should be aware of their conditions and effects."
    - E.g. voting systems.
  - Here the authors combine atomic behaviour components (bottom-up) with more complex behaviours using generic arbitrators (top-down).
"A Review of Motion Planning for Highway Autonomous Driving"
- [2019] [📝] [🎓 French Institute of Science and Technology for Transport] [🚗 VEDECOM Institute]
Click to expand
The review divides motion-planning into five parts. The decision-making part contains risk evaluation , criteria minimization , and constraint submission . In the last part, a low-level reactive planner deforms the generated motion from the high-level planner. Source. |
The review offers two detailed tools for comparing methods for motion planning for highway scenarios. Criteria for the generated motion include: feasible , safe , optimal , usable , adaptive , efficient , progressive and interactive . The authors stressed the importance of spatiotemporal consideration and emphasize that highway motion-planning is highly structured. Source. |
Contrary to solve-algorithm methods, set-algorithm methods require a complementary algorithm to be added to find the feasible motion. Depending on the importance of the generation (iv) and deformation (v) parts, approaches are more or less reactive or predictive. Finally, based on their work on AI-based algorithms, the authors define four subfamilies to compare to human driving: logic, heuristic, approximate reasoning, and human-like. Source. |
The review also offers overviews of possible space configurations, i.e. the choices for decomposition of the evolution space (sampling-based decomposition, connected cells decomposition and lattice representation) as well as path-finding algorithms (e.g. Dijkstra, A*, and RRT). Attractive and repulsive forces, parametric and semi-parametric curves, numerical optimization and artificial intelligence are also covered. Source. |
Authors: Claussmann, L., Revilloud, M., Gruyer, D., & Glaser, S.
"A Survey of Deep Learning Applications to Autonomous Vehicle Control"
- [2019] [📝] [🎓 University of Surrey] [🚗 Jaguar Land Rover]
Click to expand
Challenges for learning-based control methods. Source. |
Authors: Kuutti, S., Bowden, R., Jin, Y., Barber, P., & Fallah, S.
- Three categories are examined:
  - lateral control alone.
  - longitudinal control alone.
  - longitudinal and lateral control combined.
- Two quotes:
  - "While lateral control is typically achieved through vision, the longitudinal control relies on measurements of relative velocity and distance to the preceding/following vehicles. This means that ranging sensors such as RADAR or LIDAR are more commonly used in longitudinal control systems."
  - "While lateral control techniques favour supervised learning techniques trained on labelled datasets, longitudinal control techniques favour reinforcement learning methods which learn through interaction with the environment."
"Longitudinal Motion Planning for Autonomous Vehicles and Its Impact on Congestion: A Survey"
- [2019] [📝] [🎓 Georgia Institute of Technology]
Click to expand
mMP refers to machine learning methods for longitudinal motion planning. Source. |
Authors: Zhou, H., & Laval, J.
- This review was completed at a school of "civil and environmental engineering".
  - It does not make a new scientific contribution, but offers a quick overview of some current trends in decision-making.
  - The authors try to look at industrial applications (e.g. Waymo, Uber, Tesla), i.e. not just focusing on theoretical research. Since companies do not communicate explicitly about their approaches, most of their publications should be considered as research side-projects rather than the "actual state" of the industry.
- One focus of the review: the machine learning approaches to decision-making for longitudinal motion.
  - About the architecture and representation models. They mention the works of DeepDriving and (H. Xu, Gao, Yu, & Darrell, 2016).
    - Mediated perception approaches parse an entire scene to make a driving decision.
    - Direct perception approaches first extract affordance indicators (i.e. only the information that is important for driving in a particular situation) and then map them to actions.
      - "Only a small portion of detected objects are indeed related to the real driving reactions so that it would be meaningful to reduce the number of key perception indicators known as learning affordances."
    - Behavioural reflex approaches directly map an input image to a driving action by a regressor.
      - This end-to-end paradigm can be extended with auxiliary tasks such as learning semantic segmentation (this "side task" should further improve the model), leading to Privileged training.
- About the learning methods:
  - BC, RL, IRL and GAIL are considered.
  - The authors argue that their memory and prediction abilities should make them stand out from the rule-based approaches.
  - "Both BC and IRL algorithms implicitly assume that the demonstrations are complete, meaning that the action for each demonstrated state is fully observable and available."
  - "We argue that adopting RL transforms the problem of learnt longitudinal motion planning from imitating human demonstrations to searching for a policy complying a hand-crafted reward rule [...] No studies have shown that a genuine reward function for human driving really exists."
- About congestion:
  - "The AV industry has been mostly focusing on the long tail problem caused by corner errors related to safety, while the impact of AVs on traffic efficiency is almost ignored."
  - It reminds me of the finding of (Kellett, J., Barreto, R., Van Den Hengel, A. & Vogiatzis, N., 2019) in "How Might Autonomous Vehicles Impact the City? The Case of Commuting to Central Adelaide": driverless cars could lead to more traffic congestion.
"Design Space of Behaviour Planning for Autonomous Driving"
- [2019] [📝] [🎓 University of Waterloo]
Click to expand
Some figures:
The focus is on the BP module, together with its predecessor (environment ) and its successor (LP ) in a modular architecture. Source. |
Classification for Question 1 - environment representation. A combination is possible. In black, my notes giving examples. Source. |
Classification for Question 2 - on the architecture. Source. |
Classification for Question 3 - on the decision logic representation. Source. |
Authors: Ilievski, M., Sedwards, S., Gaurav, A., Balakrishnan, A., Sarkar, A., Lee, J., Bouchard, F., De Iaco, R., & Czarnecki K.
The authors divide their review into three sections:
- Question 1: How to represent the environment? (relation with the predecessor of BP)
  - Four representations are compared: raw data, feature-based, grid-based and latent representation.
- Question 2: How to communicate with other modules, especially the local planner (LP)? (relation with the successor (LP) of BP)
  - A first sub-question is the relevance of the separation BP / LP.
    - A complete separation (top-down) can lead to computational redundancy (both have a collision checker).
    - One idea, close to sampling techniques, could be to invert the traditional architecture for planning, i.e. generate multiple possible local paths (~LP) then select the best manoeuvre according to a given cost function (~BP). But this exacerbates the previous point.
  - A second sub-question concerns prediction: should the BP module have its own dedicated prediction module?
    - First, three kinds of prediction are listed, depending on what should be predicted (marked with ->):
      - Physics-based (-> trajectory).
      - Manoeuvre-based (-> low-level motion primitives).
      - Interaction-aware (-> intent).
    - Then, the authors distinguish between explicitly-defined and implicitly-defined prediction models:
      - Explicitly-defined can be:
        - Integrated with the motion planning process (called internal prediction models), such as belief-based decision making (e.g. POMDP). Ideal for planning under uncertainty.
        - Decoupled from the planning process (called external prediction models). There is a clear interface between prediction and planning, which aids modularity.
      - Implicitly-defined, such as RL techniques.
- Question 3: How to make BP decisions? (BP itself)
  - A first distinction in the representation of the decision logic is made based on non-learnt / learnt:
    - Using a set of explicitly programmed production rules, which can be divided into:
      - Imperative approaches, e.g. state machines.
      - Declarative approaches, often based on some probabilistic system.
        - The decision-tree structure and the (PO)MDP formulation make them more robust to uncertainty.
        - Examples include MCTS and online POMDP solvers.
    - The decision logic can also rely on mathematical models with parameters learned a priori.
      - A distinction is made depending on "where does the training data come from and when is it created?".
      - In other words, one could think of supervised learning (learning from example) versus reinforcement learning (learning from interaction).
      - The combination of both seems beneficial:
        - An initial behaviour is obtained through imitation learning (learning from example). Also possible with IRL.
        - But improvements are made through interaction with a simulated environment (learning from interaction).
        - By the way, the learning-from-interaction techniques raise the question of the origin of the experience (e.g. realism of the simulator) and its sampling efficiency.
      - Another promising direction is hierarchical RL, where the MDP is divided into sub-problems (the lower level for LP and the higher level for BP):
        - The lowest-level implementation of the hierarchy approximates a solution to the control and LP problem ...
        - ... while the higher level selects a manoeuvre to be executed by the lower-level implementations.
  - As mentioned in my section on Scenarios and Datasets, the authors point out the lack of benchmarks to compare and evaluate the performance of BP technologies.
One quote about the representation of decision logic:
- As identified in my notes about IV19, the combination of learnt and non-learnt approaches looks the most promising.
-
"Without learning, traditional robotic solutions cannot adequately handle complex, dynamic human environments, but ensuring the safety of learned systems remains a significant challenge."
-
"Hence, we speculate that future high performance and safe behaviour planning solutions will be hybrid and heterogeneous, incorporating modules consisting of learned systems supervised by programmed logic."
"A Behavioral Planning Framework for Autonomous Driving"
- [2014] [📝] [🎓 Carnegie Mellon University] [🚗 General Motors]
- [behavioural planning, sampling-based planner, decision under uncertainty, TORCS]
Click to expand
Some figures:
Comparison and fusion of the hierarchical and parallel architectures. Source. |
The PCB algorithm implemented in the BP module. Source. |
Related work by (Xu, Pan, Wei, & Dolan, 2014) - Grey ellipses indicate the magnitude of the uncertainty of state. Source. |
Authors: Wei, J., Snider, J. M., & Dolan, J. M.
Note: I find it very valuable to get insights from the CMU (Carnegie Mellon University) team, based on their experience of the DARPA Urban Challenges.
- Related works:
  - "A prediction- and cost function-based algorithm for robust autonomous freeway driving", 2010, by (Wei, Dolan, & Litkouhi, 2010).
    - They introduced the "Prediction- and Cost-function Based (PCB) algorithm" used here.
    - The idea is to generate-forward_simulate-evaluate a set of manoeuvres (a minimal sketch of this loop follows this list).
    - The planner can therefore take surrounding vehicles’ reactions into account in the cost function when it searches for the best strategy.
    - At the time, the authors rejected the option of a POMDP formulation (computing the control policy over the space of the belief state, which is a probability distribution over all the possible states), deemed too computationally expensive. Improvements in hardware and algorithms have been made since 2014.
  - "Motion planning under uncertainty for on-road autonomous driving", 2014, by (Xu, Pan, Wei, & Dolan, 2014).
    - An extension of the framework to consider uncertainty (both in the environment and in the other participants) in the decision-making.
    - The prediction module uses a Kalman Filter (assuming constant velocity).
    - For each candidate trajectory, the uncertainty can be estimated using a Linear-Quadratic Gaussian (LQG) framework (based on the noise characteristics of the localization and control).
    - Their Gaussian-based method gives some probabilistic safety guarantee (e.g. a 2% likelihood of collision).
- Proposed architecture for decision-making:
  - First ingredient: hierarchical architecture.
    - The hierarchy mission -> manoeuvre -> motion (the 3M concept) makes it very modular but can raise limitations:
    - "the higher-level decision making module usually does not have enough detailed information, and the lower-level layer does not have authority to re-evaluate the decision."
  - Second ingredient: parallel architecture.
    - This is inspired from ADAS engineering.
    - The control modules (ACC, Merge Assist, Lane Centering) are relatively independent and work in parallel.
    - In some complicated cases needing cooperation, this framework may not perform well.
      - This probably shows that just extending the common ADAS architectures cannot be enough to reach level-5 autonomy.
  - Idea of the proposed framework: combine the strengths of the hierarchical and parallel architectures.
    - This relieves the path planner and the control module (the search space is reduced).
    - Hence the computational cost shrinks (by over 90% compared to a sample-based planner in the spatio-temporal space).
- One module worth mentioning: the Traffic-free Reference Planner.
  - Its input: lane-level sub-missions from the Mission Planning.
  - Its output: kinematically and dynamically feasible paths and a speed profile for the Behavioural Planner (BP).
    - It assumes there is no traffic on the road, i.e. it ignores dynamic obstacles.
    - It also applies traffic rules such as speed limits.
  - This guides the BP layer, which considers both static and dynamic obstacles to generate so-called "controller directives" such as:
    - The lateral driving bias.
    - The desired leading vehicle to follow.
    - The aggressiveness of distance keeping.
    - The maximum speed.
"Action-based Representation Learning for Autonomous Driving"
- [representation learning, affordances, self-supervised, pre-training]
Click to expand
The so-called direct perception approach uses affordances to select low-level driving commands. Examples of affordances include is there a car in my lane within 10m? or the relative angle deviation between my car's heading and the lane, used to control the car. This approach offers interpretability. The question here is ''how to efficiently extract these affordances?''. For this classification / regression supervised learning task, the encoder is first pre-trained on another task (a proxy task) involving learning from action demonstration (e.g. behavioural cloning). The intuition is that, for a learnt BC model able to take good driving decisions most of the time, relevant information should have been captured in its encoder. Source. |
Different self-supervised learning tasks based on action prediction can be used to produce the encoder. Another idea is to use models trained with supervised learning, for instance ResNet, whose encoder performs worse. Probably because of the synthetic images? Source. |
Four affordances are predicted from images. They represent the explicit detection of hazards involving pedestrians and vehicles, respecting traffic lights and considering the heading of the vehicle within the current lane. PID controllers convert these affordances into throttle , brake and steering commands. Source. |
Authors: Xiao, Y., Codevilla, F., Pal, C., & López, A. M.
- One sentence:
  - "Expert demonstrations can act as an effective action-based representation learning technique."
- Motivations:
  - 1- Leverage driving demonstration data that can be easily obtained by simply recording the actions of good drivers.
  - 2- Be more interpretable than pure end-to-end imitation methods.
  - 3- Be less annotation-dependent, i.e. rely preferably on self-supervision and try to reduce data supervision (i.e., human annotation).
    - "Our method uses much less densely annotated data and does not use dataset aggregation (DAgger)."
- Main idea:
  - Use both manually annotated data (supervised learning) and expert demonstrations (imitation learning) to learn to extract affordances from images (representation learning).
  - This combination is beneficial, compared to taking each approach separately:
    - 1- Pure end-to-end imitation methods such as behavioural cloning could be used to directly predict control actions (throttle, brake, steering).
      - "In this pure data-centered approach, the supervision required to train deep end-to-end driving models does not come from human annotation; instead, the vehicle’s state variables are used as self-supervision (e.g. speed, steering, acceleration, braking) since these can be automatically collected from fleets of human-driven vehicles."
      - But it lacks interpretability, can have difficulty dealing with spurious correlations, and training may be unstable.
    - 2- Learning to extract the affordances from scratch would not be very efficient.
      - One could use a pre-trained backbone such as ResNet. But it does not necessarily capture the relevant information of the driving scene to make decisions.
      - "Action-based pre-training (Forward / Inverse / BC) outperforms all the other reported pre-training strategies (e.g. ImageNet). However, we see that the action-based pre-training is mostly beneficial to help on reliably estimating the vehicle’s relative angle with respect to the road."
- About driving affordances and the direct perception approach:
  - "A different paradigm, conceptually midway between pure modular and end-to-end driving ones, is the so-called direct perception approach, which focuses on learning deep models to predict driving affordances, from which an additional controller can maneuver the AV. In general, such affordances can be understood as a relatively small set of interpretable variables describing events that are relevant for an agent acting in an environment. Driving affordances bring interpretability while only requiring weak supervision, in particular, human annotations just at the image level (i.e., not pixel-wise)."
  - Four affordances are used by a rule-based controller to select throttle, brake and steering commands (a hypothetical controller sketch follows this list):
    - 1- Pedestrian hazard (bool): is there a pedestrian in our lane at a distance lower than 10m?
    - 2- Vehicle hazard (bool): is there a vehicle in our lane at a distance lower than 10m?
    - 3- Red traffic light (bool): is there a traffic light in red affecting our lane at a distance lower than 10m?
    - 4- Relative heading angle (float): relative angle of the longitudinal vehicle axis with respect to the lane in which it is navigating.
  - Extracting these affordances from the sensor data is called "representation learning".
  - Main idea here:
    - Learn to extract these affordances (supervised learning), with the encoder being pre-trained on the task of end-to-end driving, e.g. BC.
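A hypothetical sketch of how such affordances could be converted into low-level commands. The paper uses PID controllers; here I only show simple proportional terms, and the thresholds and gains are illustrative values of mine.

```python
from dataclasses import dataclass

@dataclass
class Affordances:
    pedestrian_hazard: bool   # pedestrian in our lane within 10 m?
    vehicle_hazard: bool      # vehicle in our lane within 10 m?
    red_traffic_light: bool   # red light affecting our lane within 10 m?
    relative_heading: float   # angle [rad] between the vehicle axis and the lane

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def rule_based_controller(aff, speed, target_speed=8.0, k_speed=0.3, k_steer=1.0):
    # longitudinal control: stop on any hazard, otherwise track a target speed
    if aff.pedestrian_hazard or aff.vehicle_hazard or aff.red_traffic_light:
        throttle, brake = 0.0, 1.0
    else:
        error = target_speed - speed
        throttle = clamp(k_speed * error, 0.0, 1.0)
        brake = clamp(-k_speed * error, 0.0, 1.0)
    # lateral control: steer so as to cancel the relative heading angle
    steering = clamp(-k_steer * aff.relative_heading, -1.0, 1.0)
    return throttle, brake, steering
```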
- Two stages with two datasets: a self-supervised dataset and a weakly-supervised dataset.
  - 1- Train an end-to-end driving model (e.g. BC) from (e.g. expert) demonstrations.
    - The final layers for action prediction are discarded.
    - Only the first part, i.e. the encoder, is used for the second stage.
    - It is called "self-supervised" since it does not require external manual annotations.
  - 2- Use this pre-trained encoder together with a multi-layer perceptron (MLP) to predict affordances.
    - The pre-training (stage 1) is beneficial because all the relevant information for driving should have been extracted by the encoder.
    - It is called "weakly-supervised" since it only requires image-level affordance annotations, i.e. not pixel-level.
- Why is it called an "action-based" method?
  - So far, I have mentioned behaviour cloning (BC) as a learning method that focuses on predicting the control actions and whose learnt encoder can be reused.
  - Other action-based pre-training tasks exist, for instance inverse dynamics models.
    - "Predicting the next states of an agent, or the action between state transitions, yields useful representations."
  - [Is it a good idea to use random actions?] "We show that learning from expert data in our approach leads to better representations compared to training inverse dynamics models. This shows that expert driving data (i.e. coming from human drivers) is an important source for representation learning."
"Learning to drive by imitation: an overview of deep behavior cloning methods"
- [2020] [📝] [🎓 University of Moncton]
- [overview]
Click to expand
Some simulators and datasets for supervised learning of end-to-end driving. Source. |
Instead of just single front-view camera frames (top and left), other sensor modalities can be used as inputs, for instance event-based cameras (bottom-right). Source. |
The temporal evolution of the scene can be captured by considering a sequence of past frames. Source. |
Other approaches also address the longitudinal control (top and right), while some try to exploit intermediate representations (bottom-left). Source. |
Source. |
Authors: Ly, A. O., & Akhloufi, M.
- Motivation:
  - An overview of the current state-of-the-art deep behaviour cloning methods for lane-stable driving.
  - [No RL] "By end-to-end, we mean supervised methods that map raw visual inputs to low-level (steering angle, speed, etc.) or high-level (driving path, driving intention, etc.) actuation commands using almost only deep networks."
- Five classes of methods:
  - 1- Pure imitative methods that make use of vanilla CNNs and take standard camera frames only as input.
    - The loss can be computed using the Mean Squared Error (MSE) between predictions and steering labels (a minimal sketch follows this list).
    - "Recovery from mistakes is made possible by adding synthesized data during training via simulations of car deviation from the center of the lane."
    - "Data augmentation was performed using a basic viewpoint transformation with the help of the left and right cameras."
  - 2- Models that use other types of perceptual sensors, such as event-based or fisheye cameras.
    - "A more realistic label augmentation is achieved with the help of the wide range of captures from the front fisheye camera compared to previous methods using shearing with side (right and left) cameras."
    - "Events based cameras consist of independent pixels that record intensity variation in an asynchronous way. Thus, giving more information in a time interval than traditional video cameras where changes taking place between two consecutive frames are not captured."
  - 3- Methods that consider previous driving history to estimate future driving commands.
  - 4- Systems that predict both lateral and longitudinal control commands.
    - "It outputs the vehicle curvature instead of the steering angle as generally found in the literature, which is justified by the fact that curvature is more general and does not vary from vehicle to vehicle."
  - 5- Techniques that leverage the power of mid-level representations for transfer learning or give more explanation in regards to taken actions.
    - "The motivation behind using a VAE architecture is to automatically mitigate the bias issue which occurs because generally the driving scenes in the datasets does not have the same proportions. In previous methods, this issue is solved by manually reducing the over represented scenes such as straight driving or stops."
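As a small illustration of that MSE objective, here is a minimal, hypothetical training step assuming a PyTorch model that maps camera frames to a steering angle.

```python
import torch.nn.functional as F

def bc_training_step(model, optimizer, frames, steering_labels):
    """frames: (B, C, H, W) tensor; steering_labels: (B, 1) tensor."""
    optimizer.zero_grad()
    predictions = model(frames)
    loss = F.mse_loss(predictions, steering_labels)  # MSE between predictions and steering labels
    loss.backward()
    optimizer.step()
    return loss.item()
```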
- Some take-aways:
  - "Models making use of non-standard cameras or intermediate representations are showing a lot of potential in comparison to pure imitative methods that take conventional video frames as input."
  - "The diversity in metrics and datasets used for reporting the results makes it very hard to strictly weigh the different models against each other."
  - Explainability and transparency of the taken decisions are important.
    - "A common approach in the literature is to analyse the pixels that lead to the greatest activation of neurons."
"Advisable Learning for Self-driving Vehicles by Internalizing Observation-to-Action Rules"
- [2020] [📝] [🎓 UC Berkeley]
- [attention, advisability]
Click to expand
Source. |
Source. |
Authors: Kim, J., Moon, S., Rohrbach, A., Darrell, T., & Canny, J.
- Related PhD thesis: "Explainable and Advisable Learning for Self-driving Vehicles" (Kim. J, 2020).
- Motivation:
  - An end-to-end model should be explainable, i.e. provide easy-to-interpret rationales for its behaviour:
    - 1- Summarize the visual observations (input) in natural language, e.g. "light is red".
      - Visual attention is not enough, verbalizing is needed.
    - 2- Predict an appropriate action response, e.g. "I see a pedestrian crossing, so I stop".
      - I.e. justify the decisions that are made and explain why they are reasonable in a human-understandable manner, i.e., again, in natural language.
    - 3- Predict a control signal, accordingly.
      - The command is conditioned on the predicted high-level action command, e.g. "maintain a slow speed".
      - The output is a sequence of waypoints, hence end-to-mid.
- About the dataset:
  - Berkeley DeepDrive-eXplanation (BDD-X) dataset (by the first author).
  - Together with camera front-views and the IMU signal, the dataset provides:
    - 1- Textual descriptions of the vehicle's actions: what the driver is doing.
    - 2- Textual explanations for the driver's actions: why the driver took that action, from the point of view of a driving instructor.
    - For instance the pair: ("the car slows down", "because it is approaching an intersection").
"Feudal Steering: Hierarchical Learning for Steering Angle Prediction"
- [2020] [📝] [🎓 Rutgers University] [🚗 Lockheed Martin]
- [hierarchical learning, temporal abstraction, t-SNE embedding]
Click to expand
Feudal learning for steering prediction. The worker decides the next steering angle conditioned on a goal (subroutine id) determined by the manager. The manager learns to predict these subroutine ids from a sequence of past states (brake, steer, throttle). The ground-truth subroutine ids are the centres of centroids obtained by unsupervised clustering. They should contain observable semantic meaning in terms of driving tasks. Source. |
Authors: Johnson, F., & Dana, K.
- Note: although terms and ideas from hierarchical reinforcement learning (HRL) are used, no RL is applied here!
- Motivation: temporal abstraction.
  - Problems in RL: delayed rewards and sparse credit assignment.
  - Some solutions: intrinsic rewards and temporal abstraction.
  - The idea of temporal abstraction is to break down the problem into more tractable pieces:
    - "At all times, human drivers are paying attention to two levels of their environment. The first level goal is on a finer grain: don’t hit obstacles in the immediate vicinity of the vehicle. The second level goal is on a coarser grain: plan actions a few steps ahead to maintain the proper course efficiently."
- The idea of feudal learning is to divide the task into:
  - 1- A manager network.
    - It operates at a lower temporal resolution and produces goal vectors that it passes to the worker network.
    - This goal vector should encapsulate a temporally extended action called a subroutine, skill, option, or macro-action.
    - Input: a sequence of previous steering angles.
    - Output: goal.
  - 2- A worker network: conditioned on the goal decided by the manager.
    - Input: the goal decided by the manager, its previous own prediction, a sequence of frames.
    - Output: steering.
  - The subroutine ids (manager net) and the steering angle prediction (worker net) are jointly learnt.
- What are the ground-truth goals used to train the manager?
  - They are the ids of the centroid centres obtained by clustering (unsupervised learning) all the training data (a minimal sketch follows this list):
    - 1- Data: steering, braking, and throttle data are concatenated every m=10 time steps to make a vector of length 3m=30.
    - 2- Encoding: projected into a t-SNE 2d-space.
    - 3- Clustering: K-means.
    - The 2d-coordinates of the cluster centroids are the subroutine ids, i.e. the possible goals.
      - How do they convert the 2d-coordinates into a single scalar?
  - "We aim to classify the steering angles into their temporally abstracted subroutines, also called options or macro-actions, associated with highway driving such as follow the sharp right bend, bumper-to-bumper traffic, bear left slightly."
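A minimal sketch of how such subroutine ids could be extracted, assuming a driving log of shape (T, 3) with columns (steering, braking, throttle). The window size follows the m=10 described above; the t-SNE perplexity and the number of clusters are illustrative choices of mine.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def subroutine_ids(driving_log, m=10, n_clusters=16):
    """driving_log: numpy array of shape (T, 3) with (steering, braking, throttle) per time step."""
    T = (len(driving_log) // m) * m
    # 1- concatenate m consecutive triplets -> one vector of length 3m per window
    windows = driving_log[:T].reshape(-1, 3 * m)
    # 2- embed each window into a 2d t-SNE space
    embedded = TSNE(n_components=2, perplexity=30).fit_transform(windows)
    # 3- cluster the embeddings; the cluster centres act as the "subroutine ids"
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(embedded)
    return kmeans.labels_, kmeans.cluster_centers_
```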
- What are the decision frequencies?
  - The manager considers the last 10 actions to decide the goal.
  - It seems like a smoothing process, where a window is applied?
  - It should be possible to achieve that with a recurrent net, shouldn't it?
- About t-SNE:
  - "t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition of how the data is arranged in a high-dimensional space [from towardsdatascience]."
  - Here it is used both as an embedding space for the driving data and as the subroutine ids themselves.
"A Survey of End-to-End Driving: Architectures and Training Methods"
- [2020] [📝] [🎓 University of Tartu]
- [review, distribution shift problem, domain adaptation, mid-to-mid]
Click to expand
Left: example of end-to-end architecture with key terms. Right: difference open-loop / close-loop evaluation. Source. |
Source. |
Source. |
Authors: Tampuu, A., Semikin, M., Muhammad, N., Fishman, D., & Matiisen, T.
- A rich literature overview and some useful reminders about general IL and RL concepts, with a focus on AD applications.
  - It constitutes a good complement to the "Related trends in research" part of my video "From RL to Inverse Reinforcement Learning: Intuitions, Concepts + Applications to Autonomous Driving".
- I especially like the structure of the document: it shows what one should consider when starting an end-to-end / IL project for AD.
  - I have just noted here some ideas I find interesting. In no way an exhaustive summary!
- 1- Learning methods: working with rewards (RL) or with losses (behavioural cloning).
  - About the distribution shift problem in behavioural cloning:
    - "If the driving decisions lead to unseen situations (not present in the training set), the model might no longer know how to behave."
    - Most solutions try to diversify the training data in some way - either by collecting or generating additional data:
      - data augmentation: e.g. one can place two additional cameras pointing forward-left and forward-right and associate the images with commands to turn right and turn left respectively (a minimal sketch follows this list).
      - data diversification: addition of temporally correlated noise and synthetic trajectory perturbations. Easier on "semantic" inputs than on camera inputs.
      - on-policy learning: recovery annotation and DAgger. The expert provides examples of how to solve situations the model's driving leads to. Also "Learning by cheating" by (Chen et al. 2019).
      - balancing the dataset: by upsampling the rarely occurring angles, downsampling the common ones, or by weighting the samples.
        - "Commonly, the collected datasets contain large amounts of repeated traffic situations and only few of those rare events."
        - The authors claim that only the joint distribution of inputs and outputs defines the rarity of a data point.
        - "Using more training data from CARLA Town1 decreases generalization ability in Town2. This illustrates that more data without more diversity is not useful."
        - Ideas for augmentation can be taken from the field of supervised learning, where it is already a largely-addressed topic.
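A minimal sketch of that side-camera augmentation, assuming recorded triplets of (center, left, right) images labelled with the center-camera steering. The correction offset is an illustrative value of mine, not one reported in the survey.

```python
STEERING_CORRECTION = 0.2  # assumed offset, tuned in practice

def augment_with_side_cameras(samples):
    """samples: iterable of dicts with keys 'center', 'left', 'right', 'steering'."""
    augmented = []
    for s in samples:
        augmented.append((s["center"], s["steering"]))
        # the left camera sees the road as if the car had drifted left -> steer more to the right
        augmented.append((s["left"], s["steering"] + STEERING_CORRECTION))
        # the right camera sees the road as if the car had drifted right -> steer more to the left
        augmented.append((s["right"], s["steering"] - STEERING_CORRECTION))
    return augmented
```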
  - About RL:
    - Policies can be first trained with IL and then fine-tuned with RL methods.
      - "This approach reduces the long training time of RL approaches and, as the RL-based fine-tuning happens online, also helps overcome the problem of IL models learning off-policy."
  - About domain adaptation and transfer from simulation to the real world (sim2real):
    - Techniques from supervised learning, such as fine-tuning, i.e. adapting the driving model to the new distribution, are rarely used.
    - Instead, one can adapt the incoming data and keep the driving model fixed.
      - A first idea is to transform real images into simulation-like images (the opposite - generating real-looking images - is challenging).
      - One can also extract the semantic segmentation of the scene from both the real and the simulated images and use it as the input for the driving policy.
- 2- Inputs.
  - In short: vision is key. Lidar and HD-maps are nice to have but expensive / tedious to maintain.
    - Additional inputs from independent modules (semantic segmentation, depth map, surface normals, optical flow and albedo) can improve the robustness.
  - About the inertia problem / causal confusion, e.g. when predicting the next ego-speed:
    - "As in the vast majority of samples the current [observed] and next speeds [to be predicted] are highly correlated, the model learns to base its speed prediction exclusively on current speed. This leads to the model being reluctant to change its speed, for example to start moving again after stopping behind another car or at a traffic light."
  - About affordances:
    - "Instead of parsing all the objects in the driving scene and performing robust localization (as modular approach), the system focuses on a small set of crucial indicators, called affordances."
- 3- Outputs.
  - "The outputs of the model define the level of understanding the model is expected to achieve."
  - Also related to the time horizon:
    - "When predicting instantaneous low-level commands, we are not explicitly forcing the model to plan a long-term trajectory."
  - Three types of predictions:
    - 3-1 Low-level commands.
      - "The majority of end-to-end models yield as output the steering angle and speed (or acceleration and brake commands) for the next timestep."
      - Low-level commands may be car-specific. For instance, vehicles answer differently to the same throttle / steering commands.
        - "The function between steering wheel angle and the resulting turning radius depends on the car's geometry, making this measure specific to the car type used for recording."
      - [About the regression loss] "Many authors have recently optimized speed and steering commands using L1 loss (mean absolute error, MAE) instead of L2 loss (mean squared error, MSE)."
    - 3-2 Future waypoints or desired trajectories.
      - This higher-level output modality is independent of the car geometry.
    - 3-3 Cost map, i.e. information about where it is safe to drive, leaving the trajectory generation to another module.
  - About multitask learning and auxiliary tasks:
    - The idea is to simultaneously train a separate set of networks to predict for instance semantic segmentation, optical flow, depth and other human-understandable representations from the camera feed.
    - "Based on the same extracted visual features that are fed to the decision-making branch (main task), one can also predict ego-speed, drivable area on the scene, and positions and speeds of other objects."
    - It offers more learning signals - at least for the shared layers.
    - It can also help understand the mistakes a model makes:
      - "A failure in an auxiliary task (e.g. object detection) might suggest that necessary information was not present already in the intermediate representations (layers) that it shared with the main task. Hence, also the main task did not have access to this information and might have failed for the same reason."
- 4- Evaluation: the difference between open-loop and closed-loop.
  - 4-1 open-loop: like in supervised learning:
    - One question = one answer.
    - Typically, a dataset is split into training and testing data.
    - Decisions are compared with the recorded actions of the demonstrator, assumed to be the ground truth.
  - 4-2 closed-loop: like in decision processes:
    - The problem consists in a multi-step interaction with some environment.
    - It directly measures the model's ability to drive on its own.
  - Interesting fact: good open-loop performance does not necessarily lead to good driving ability in closed-loop settings.
    - "Mean squared error (MSE) correlates with closed-loop success rate only weakly (correlation coefficient r = 0.39), so MAE, quantized classification error or thresholded relative error should be used instead (r > 0.6 for all three)."
    - About the balanced-MAE metric for open-loop evaluation, which correlates better with closed-loop performance than simple MAE (a small sketch follows this list):
      - "Balanced-MAE is computed by averaging the mean values of unequal-length bins according to steering angle. Because most data lies in the region around steering angle 0, equally weighting the bins grows the importance of rarely occurring (higher) steering angles."
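A small sketch of how such a balanced-MAE could be computed, assuming numpy arrays of ground-truth and predicted steering angles; the bin edges are illustrative.

```python
import numpy as np

def balanced_mae(y_true, y_pred, bins=np.linspace(-1.0, 1.0, 21)):
    """y_true, y_pred: numpy arrays of steering angles."""
    abs_err = np.abs(y_true - y_pred)
    bin_idx = np.digitize(y_true, bins)
    # mean absolute error inside each (unequal-length) steering-angle bin ...
    per_bin = [abs_err[bin_idx == b].mean() for b in np.unique(bin_idx)]
    # ... then equal weighting of the bins, so rare (large) angles count as much as straight driving
    return float(np.mean(per_bin))
```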
- 5- Interpretability:
  - 5-1 Either on the trained model ... (a saliency sketch follows below)
    - "Sensitivity analysis aims to determine the parts of an input that a model is most sensitive to. The most common approach involves computing the gradients with respect to the input and using the magnitude as the measure of sensitivity."
    - VisualBackProp: which input pixels influence the car's driving decision the most.
  - 5-2 ... or already during training.
    - "visual attention is a built-in mechanism present already when learning. Where to attend in the next timestep (the attention mask) is predicted as additional output in the current step and can be made to depend on additional sources of information (e.g. textual commands)."
-
About
end-to-end
neural nets and humans:-
"[
StarCraft
,Dota 2
,Go
andChess
solved withNN
]. Many of these solved tasks are in many aspects more complex than driving a car, a task that a large proportion of people successfully perform even when tired or distracted. A person can later recollect nothing or very little about the route, suggesting the task needs very little conscious attention and might be a simple behavior reflex task. It is therefore reasonable to believe that in the near future anend-to-end
approach is also capable to autonomously control a vehicle."
-
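As a side note on the `balanced-MAE` metric mentioned in the evaluation paragraph above, here is a minimal sketch (my own, with an arbitrary bin width) of how such a steering-binned open-loop metric could be computed:

```python
import numpy as np

def balanced_mae(pred_steer, true_steer, bin_width=0.1):
    """Average the per-bin MAE over steering-angle bins of unequal population,
    so that rare large steering angles weigh as much as the frequent ~0 ones."""
    errors = np.abs(np.asarray(pred_steer) - np.asarray(true_steer))
    bins = np.floor(np.asarray(true_steer) / bin_width).astype(int)
    per_bin_mae = [errors[bins == b].mean() for b in np.unique(bins)]
    return float(np.mean(per_bin_mae))

# toy usage: most samples are near 0 rad, a few are sharp turns
true = np.array([0.0, 0.01, -0.02, 0.0, 0.65, -0.7])
pred = np.array([0.0, 0.00, -0.01, 0.1, 0.30, -0.2])
print(balanced_mae(pred, true))  # larger than plain MAE, driven by the rare turns
```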
"Efficient Latent Representations using Multiple Tasks for Autonomous Driving"
-
[
2020
] [📝] [ 🎓Aalto University
] -
[
latent space representation
,multi-head decoder
,auxiliary tasks
]
Click to expand
The latent representation is enforced to predict the trajectories of both the ego vehicle and other vehicles in addition to the input image, using a multi-head network structure. Source. |
Authors: Kargar, E., & Kyrki, V.
- Motivations:
1-
Reduce the dimensionality of thefeature representation
of the scene - used as input to someIL
/RL
policy.- This is to improve most
mid-to-x
approaches that encode and process a vehicle’s environment as multi-channel and quite high-dimensional bird view images. ->
The idea here is to learn anencoder-decoder
.- The latent space has size
64
(way smaller than common64 x 64 x N
bird-views).
2-
Learn alatent representation
faster / with fewer data.- A single head decoder would just consider
reconstruction
. ->
The idea here is to use multiple heads in the decoder, i.e. make predictions of multiple auxiliary, application-relevant factors.-
"The multi-head model can reach the single-head model’s performance in
20
epochs, one-fifth of training time of the single-head model, with full dataset." -
"In general, the multi-heal model, using only
6.25%
of the dataset, converges faster and perform better than single head model trained on the full dataset."
3-
Learn apolicy
faster / with fewer data.
- Two components to train:
1-
Anencoder-decoder
learns to produce a latent representation (encoder
) coupled with a multiple-prediction-objective (decoder
).2-
Apolicy
uses the latent representation to predict low-level controls.
- About the
encoder-decoder
:inputs
: bird-view image containing:- Environment info, built from
HD Maps
andperception
. - Ego trajectory:
10
past poses. - Other trajectory:
10
past poses. - It forms a
256 x 256
image, which is resized to64 x 64
before being fed into the models.
outputs
: multiple auxiliary tasks:1-
Reconstruction head: reconstructing the input bird-view image.2-
Prediction head:1s
-motion-prediction for other agents.3-
Planning head:1s
-motion-prediction for the ego car.
- About the
policy
:- In their example, the authors implement
behaviour cloning
, i.e. supervised learning to reproduce the decision ofCARLA autopilot
. 1-
steering
prediction.2-
acceleration
classification -3
classes.
- How to deal with the unbalanced dataset?
- First, the authors note that no manual labelling is required to collect training data.
- But the recorded
steering
angle is zero most of the time - leading to a highly imbalanced dataset. - Solution (no further detail):
-
"Create a new dataset and balance it using sub-sampling".
-
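To make the multi-head decoder idea above concrete, here is a minimal PyTorch-style sketch of an encoder with a `64`-d latent space feeding three heads (reconstruction, prediction of others, ego planning). The layer sizes and the equal loss weighting are my assumptions, not the paper's values:

```python
import torch
import torch.nn as nn

class MultiHeadAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # encoder: 64x64 bird-view image -> 64-d latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Flatten(), nn.Linear(64 * 16 * 16, latent_dim))
        # three heads: image reconstruction + 1s motion of others + 1s ego plan
        self.recon_head = nn.Linear(latent_dim, 64 * 64 * 3)
        self.pred_head = nn.Linear(latent_dim, 10 * 2)   # e.g. 10 future (x, y) of others
        self.plan_head = nn.Linear(latent_dim, 10 * 2)   # e.g. 10 future (x, y) of ego

    def forward(self, img):
        z = self.encoder(img)
        return z, self.recon_head(z), self.pred_head(z), self.plan_head(z)

def multitask_loss(recon, pred, plan, img, others_gt, ego_gt):
    # equal weighting is an assumption; the paper may tune these terms
    return (nn.functional.mse_loss(recon, img.flatten(1))
            + nn.functional.mse_loss(pred, others_gt)
            + nn.functional.mse_loss(plan, ego_gt))
```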
"Robust Imitative Planning : Planning from Demonstrations Under Uncertainty"
-
[
2019
] [📝] [ 🎓University of Oxford
,UC Berkeley
,Carnegie Mellon University
] -
[
epistemic uncertainty
,risk-aware decision-making
,CARLA
]
Click to expand
Illustration of the state distribution shift in behavioural cloning (BC ) approaches. The models (e.g. neural networks) usually fail to generalize and instead extrapolate confidently yet incorrectly, resulting in arbitrary outputs and dangerous outcomes. Not to mention the compounding (or cascading) errors, inherent to the sequential decision making. Source. |
Testing behaviours on scenarios such as roundabouts that are not present in the training set. Source. |
Above - in their previous work, the authors introduced Deep imitative models (IM ). The imitative planning objective is the log posterior probability of a state trajectory, conditioned on satisfying some goal G . The state trajectory that has the highest likelihood w.r.t. the expert model q (S given φ ; θ ) is selected, i.e. maximum a posteriori probability (MAP ) estimate of how an expert would drive to the goal. This captures any inherent aleatoric stochasticity of the human behaviour (e.g., multi-modalities), but only uses a point-estimate of θ , thus q (s given φ ;θ ) does not quantify model (i.e. epistemic ) uncertainty. φ denotes the contextual information (3 previous states and current LIDAR observation) and s denotes the agent’s future states (i.e. the trajectory). Bottom - in this works, an ensemble of models is used: q (s given φ ; θk ) where θk denotes the parameters of the k -th model (neural network). The Aggregation Operator operator is applied on the posterior p(θ given D ). The previous work is one example of that, where a single θi is selected. Source. |
To save computation and improve runtime to real-time, the authors use a trajectory library: they perform K-means clustering of the expert plans from the training distribution and keep 128 of the centroids, allegedly reducing the planning time by a factor of 400. During optimization, the trajectory space is limited to only that trajectory library. It makes me think of templates sometimes used for path-planning. I also see that as a way to restrict the search in the trajectory space, similar to injecting expert knowledge about the feasibility of car trajectories. Source. |
Estimating the uncertainty is not enough. One should then forward that estimate to the planning module. This reminds me of an idea from (McAllister et al., 2017) about the key benefit of propagating uncertainty throughout the AV framework. Source. |
Authors: Tigas, P., Filos, A., Mcallister, R., Rhinehart, N., Levine, S., & Gal, Y.
-
Previous work:
"Deep Imitative Models for Flexible Inference, Planning, and Control"
(see below).- The idea was to combine the benefits of
imitation learning
(IL
) andgoal-directed planning
such asmodel-based RL
(MBRL
).- In other words, to complete planning based on some imitation prior, by combining generative modelling from demonstration data with planning.
- One key idea of this generative model of expert behaviour: perform context-conditioned density estimation of the distribution over future expert trajectories, i.e. score the "expertness" of any plan of future positions.
- Limitations:
- It only uses a point-estimate of
θ
. Hence it fails to capture epistemic uncertainty in the model’s density estimate. -
"Plans can be risky in scenes that are out-of-training-distribution since it confidently extrapolates in novel situations and lead to catastrophes".
-
Motivations here:
1-
Develop a model that captures epistemic uncertainty.2-
Estimating uncertainty is not a goal in itself: one also needs to provide a mechanism for taking low-risk actions that are likely to recover in uncertain situations.- I.e. both
aleatoric
andepistemic
uncertainty should be taken into account in the planning objective. - This reminds me of the figure in (McAllister et al., 2017) about the key benefit of propagating uncertainty throughout the AV framework.
-
One quote about behavioural cloning (
BC
) that suffers from state distribution shift (co-variate shift
):-
"Where high capacity parametric models (e.g. neural networks) usually fail to generalize, and instead extrapolate confidently yet incorrectly, resulting in arbitrary outputs and dangerous outcomes".
-
-
One quote about model-free
RL
:-
"The specification of a reward function is as hard as solving the original control problem in the first place."
-
-
About
epistemic
andaleatoric
uncertainties:-
"Generative models can provide a measure of their uncertainty in different situations, but robustness in novel environments requires estimating
epistemic uncertainty
(e.g., have I been in this state before?), where conventional density estimation models only capturealeatoric uncertainty
(e.g., what’s the frequency of times I ended up in this state?)."
-
-
How to capture uncertainty about previously unseen scenarios?
- Using an ensemble of density estimators and aggregate operators over the models’ outputs.
-
"By using demonstration data to learn density models over human-like driving, and then estimating its uncertainty about these densities using an ensemble of imitative models".
-
- The idea is to take the disagreement between the models into consideration and inform planning (a rough sketch is given at the end of this section).
-
"When a trajectory that was never seen before is selected, the model’s high
epistemic
uncertainty pushes us away from it. During planning, the disagreement between the most probable trajectories under the ensemble of imitative models is used to inform planning."
-
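A rough sketch of the robust planning objective described above, in my own simplification: `min` over the ensemble is used as one possible aggregation operator, and the model/goal interfaces (`ensemble_log_q`, `log_goal_likelihood`) are made-up placeholders:

```python
import numpy as np

def robust_imitative_plan(candidate_trajs, ensemble_log_q, log_goal_likelihood):
    """Pick the candidate trajectory maximizing a worst-case (min over the
    ensemble) imitation prior plus the goal likelihood.

    candidate_trajs:     list of trajectories (e.g. from a clustered trajectory library)
    ensemble_log_q:      list of K functions, each scoring log q(s | phi; theta_k)
    log_goal_likelihood: function scoring log p(G | s)
    """
    best_traj, best_score = None, -np.inf
    for s in candidate_trajs:
        # epistemic disagreement is handled by aggregating over the K models
        imitation_prior = min(log_q(s) for log_q in ensemble_log_q)
        score = imitation_prior + log_goal_likelihood(s)
        if score > best_score:
            best_traj, best_score = s, score
    return best_traj, best_score
```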
"End-to-end Interpretable Neural Motion Planner"
-
[
2019
] [📝] [ 🎓University of Toronto
] [ 🚗Uber
] -
[
interpretability
,trajectory sampling
]
Click to expand
The visualization of 3D detection, motion forecasting as well as learned cost-map volume offers interpretability. A set of candidate trajectories is sampled, first considering the geometrical path and then the speed profile. The trajectory with the minimum learned cost is selected. Source. |
Source. |
Authors: Zeng W., Luo W., Suo S., Sadat A., Yang B., Casas S. & Urtasun R.
-
Motivation is to bridge the gap between the
traditional engineering stack
and theend-to-end driving
frameworks.1-
Develop a learnable motion planner, avoiding the costly parameter tuning.2-
Ensure interpretability in the motion decision. This is done by offering an intermediate representation.3-
Handle uncertainty. This is allegedly achieved by using a learnt, non-parametric cost function.4-
Handle multi-modality in possible trajectories (e.gchanging lane
vskeeping lane
).
-
One quote about
RL
andIRL
:-
"It is unclear if
RL
andIRL
can scale to more realistic settings. Furthermore, these methods do not produce interpretable representations, which are desirable in safety critical applications".
-
-
Architecture:
Input
: raw LIDAR data and a HD map.1st intermediate result
: An "interpretable" bird’s eye view representation that includes:3D
detections.- Predictions of future trajectories (planning horizon of
3
seconds). - Some spatio-temporal cost volume defining the goodness of each position that the self-driving car can take within the planning horizon.
2nd intermediate result
: A set of diverse physically possible trajectories (candidates).- They are
Clothoid
curves being sampled. First building thegeometrical path
. Then thespeed profile
on it. -
"Note that
Clothoid
curves can not handle circle and straight line trajectories well, thus we sample them separately."
- They are
Final output
: The trajectory with the minimum learned cost.
-
Multi-objective:
1-
Perception
Loss - to predict the position of vehicles at every time frame.- Classification: Distinguish a vehicle from the background.
- Regression: Generate precise object bounding boxes.
2-
Planning
Loss.-
"Learning a reasonable cost volume is challenging as we do not have ground-truth. To overcome this difficulty, we minimize the
max-margin
loss where we use the ground-truth trajectory as a positive example, and randomly sampled trajectories as negative examples." - As stated, the intuition behind it is to encourage the demonstrated trajectory to have the minimal cost, and others to have higher costs (a minimal sketch is given at the end of this section).
- The model hence learns a cost volume that discriminates good trajectories from bad ones.
-
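A minimal sketch of such a `max-margin` planning loss, in a simplified form I wrote myself (the actual paper also adds distance-based margins and traffic-rule penalties):

```python
import torch

def max_margin_planning_loss(cost_volume_fn, expert_traj, sampled_trajs, margin=1.0):
    """Encourage the expert trajectory to have lower cost than sampled negatives.

    cost_volume_fn: maps a trajectory to its total cost under the learned cost volume
    expert_traj:    ground-truth (positive) trajectory
    sampled_trajs:  list of randomly sampled (negative) trajectories
    """
    expert_cost = cost_volume_fn(expert_traj)
    losses = [torch.clamp(expert_cost - cost_volume_fn(t) + margin, min=0.0)
              for t in sampled_trajs]
    # the hinge is zero once every negative is at least `margin` more costly than the expert
    return torch.stack(losses).mean()
```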
"Learning from Interventions using Hierarchical Policies for Safe Learning"
- [
2019
] [📝] [ 🎓University of Rochester, University of California San Diego
] - [
hierarchical
,sampling efficiency
,safe imitation learning
]
Click to expand
The main idea is to use Learning from Interventions (LfI ) in order to ensure safety and improve data efficiency, by intervening on sub-goals rather than trajectories. Both top-level policy (that generates sub-goals) and bottom-level policy are jointly learnt. Source. |
Authors: Bi, J., Dhiman, V., Xiao, T., & Xu, C.
- Motivations:
1-
Improve data-efficiency.2-
Ensure safety.
- One term: "Learning from Interventions" (
LfI
).- One way to classify the "learning from expert" techniques is to use the frequency of expert’s engagement.
High frequency
-> Learning from Demonstrations.Medium frequency
-> learning from Interventions.Low frequency
-> Learning from Evaluations.
- Ideas of
LfI
:-
"When an undesired state is detected, another policy is activated to take over actions from the agent when necessary."
- Hence the expert overseer only intervenes when it suspects that an unsafe action is about to be taken.
-
- Two issues:
1-
LfI
(as forLfD
) learn reactive behaviours.-
"Learning a supervised policy is known to have 'myopic' strategies, since it ignores the temporal dependence between consecutive states".
- Maybe one option could be to stack frames or to include the current speed in the
state
. But that makes the state space larger.
-
2-
The expert's intervention signal only arrives after a non-negligible delay.
- One way to classify the "learning from expert" techniques is to use the frequency of expert’s engagement.
- One idea to solve both issues: Hierarchy.
- The idea is to split the policy into two hierarchical levels, one that generates
sub-goals
for the future and another that generatesactions
to reach those desired sub-goals. - The motivation is to intervene on sub-goals rather than trajectories.
- One important parameter:
k
- The top-level policy predicts a sub-goal to be achieved
k
steps ahead in the future. - It represents a trade-off between:
- The ability for the
top-level
policy to predict sub-goals far into the future. - The ability for the
bottom-level
policy to follow it correctly.
- The ability for the
- The top-level policy predicts a sub-goal to be achieved
- One question: How to deal with the absence of ground-truth sub-goals?
- One solution is "Hindsight Experience Replay", i.e. consider an achieved goal as a desired goal for past observations.
- The authors present additional interpolation techniques.
- They also present a
Triplet Network
to train goal-embeddings (I did not understand everything).
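A tiny sketch of the hindsight-style relabelling idea mentioned above, where the state actually reached `k` steps later serves as the sub-goal label; the data interfaces and the value of `k` are my own placeholders, not the authors' setup:

```python
def relabel_with_hindsight(trajectory, k=10):
    """Build (observation, sub-goal) training pairs for the top-level policy
    by treating the state actually reached k steps later as the desired sub-goal."""
    pairs = []
    for t in range(len(trajectory) - k):
        obs_t = trajectory[t]["observation"]
        achieved_goal = trajectory[t + k]["state"]   # hindsight: what was actually reached
        pairs.append((obs_t, achieved_goal))
    return pairs

def train_hierarchy(top_policy, bottom_policy, trajectories, k=10):
    for traj in trajectories:
        # top level: predict the sub-goal k steps ahead
        for obs, sub_goal in relabel_with_hindsight(traj, k):
            top_policy.update(obs, target=sub_goal)
        # bottom level: predict the (expert or corrected) action given obs + sub-goal
        for t in range(len(traj) - k):
            bottom_policy.update(traj[t]["observation"], traj[t + k]["state"],
                                 target=traj[t]["action"])
```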
"Urban Driving with Conditional Imitation Learning"
Click to expand
The encoder is trained to reconstruct RGB , depth and segmentation , i.e. to learn scene understanding. It is augmented with optical flow for temporal information. As noted, such representations could be learned simultaneously with the driving policy, for example, through distillation. But for efficiency, this was pre-trained (Humans typically also have ~30 hours of driver training before taking the driving exam. But they start with huge prior knowledge). Interesting idea: the navigation command is injected as multiple locations of the control part. Source. |
Driving data is inherently heavily imbalanced, where most of the captured data will be driving near-straight in the middle of a lane. Any naive training will collapse to the dominant mode present in the data. No data augmentation is performed. Instead, during training, the authors sample data uniformly across lateral and longitudinal control dimensions. Source. |
Authors: Hawke, J., Shen, R., Gurau, C., Sharma, S., Reda, D., Nikolov, N., Mazur, P., Micklethwaite, S., Griffiths, N., Shah, A. & Kendall, A.
- Motivations:
1-
Learn bothsteering
andspeed
via Behavioural Cloning.2-
Use raw sensor (camera) inputs, rather than intermediate representations.3-
Train and test on dense urban environments.
- Why "conditional"?
- A route command (e.g.
turn left
,go straight
) resolves the ambiguity of multi-modal behaviours (e.g. when coming at an intersection). -
"We found that inputting the command multiple times at different stages of the network improves robustness of the model".
- A route command (e.g.
- Some ideas:
- Provide wider state observability through multiple camera views (single camera disobeys navigation interventions).
- Add temporal information via optical flow.
- Another option would be to stack frames. But it did not work well.
- Train the primary shared encoders and auxiliary independent decoders for a number of computer vision tasks.
-
"In robotics, the
test
data is the real-world, not a static dataset as is typical in mostML
problems. Every time our cars go out, the world is new and unique."
-
- One concept: "Causal confusion".
- A good video about Causal Confusion in Imitation Learning showing that "access to more information leads to worse generalisation under distribution shift".
-
"Spurious correlations cannot be distinguished from true causes in the demonstrations. [...] For example, inputting the current speed to the policy causes it to learn a trivial identity mapping, making the car unable to start from a static position."
- Two ideas during training:
- Using flow features to make the model use explicit motion information without learning the trivial solution of an identity mapping for speed and steering.
- Add random noise and use dropout on it.
- One alternative is to explicitly maintain a causal model.
- Another alternative is to learn to predict the speed, as detailed in "Exploring the Limitations of Behavior Cloning for Autonomous Driving".
- Output:
- The model decides on a "motion plan", i.e. not directly the low-level control?
- Concretely, the network gives one prediction and one slope, for both
speed
andsteering
, leading to two parameterised lines.
- Two types of tests:
1-
Closed-loop (i.e. go outside and drive).- The number and type of safety-driver interventions.
2-
Open-loop (i.e., evaluating on an offline dataset).- The weighted mean absolute error for
speed
andsteering
.- As noted, this can serve as a proxy for real world performance.
- The weighted mean absolute error for
-
"As discussed by [34] and [35], the correlation between
offline
open-loop
metrics andonline
closed-loop
performance is weak."
- About the training data:
- As stated, there are two levers to increase the performance:
1-
Algorithmic innovation.2-
Data.
- For this
IL
approach,30
hours of demonstrations. -
"Re-moving a quarter of the data notably degrades performance, and models trained with less data are almost undriveable."
- Next steps:
- I find the results already impressive. But as noted:
-
"The learned driving policies presented here need significant further work to be comparable to human driving".
-
- Ideas for improvements include:
- Add some predictive long-term planning model. At the moment, it does not have access to long-term dependencies and cannot reason about the road scene.
- Learn not only from demonstration, but also from mistakes.
- This reminds me of the concept of
ChauffeurNet
about "simulate the bad rather than just imitate the good".
- Continuous learning: Learning from corrective interventions would also be beneficial.
- The last point goes in the direction of adding learning signals, which was already done here.
- Imitation of human expert drivers (
supervised
learning). - Safety driver intervention data (
negative reinforcement
learning) and corrective action (supervised
learning). - Geometry, dynamics, motion and future prediction (
self-supervised
learning). - Labelled semantic computer vision data (
supervised
learning). - Simulation (
supervised
learning).
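Regarding the data imbalance discussed above (most frames have near-zero steering), here is a minimal sketch of one way to sample minibatches roughly uniformly across steering bins; the bin edges and the use of `WeightedRandomSampler` are my assumptions, not the authors' exact procedure:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(steering, n_bins=21):
    """Weight each sample inversely to the population of its steering bin,
    so minibatches are roughly uniform over the steering dimension."""
    bins = np.digitize(steering, np.linspace(-1.0, 1.0, n_bins))
    counts = np.bincount(bins, minlength=n_bins + 2)
    weights = 1.0 / counts[bins]          # rare bins get larger weights
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=len(steering), replacement=True)

# usage with a hypothetical `dataset` whose labels include the steering angle:
# sampler = make_balanced_sampler(dataset.steering)
# loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)
```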
"Application of Imitation Learning to Modeling Driver Behavior in Generalized Environments"
Click to expand
The IL models were trained on a straight road and tested on roads with high curvature. PS-GAIL is effective only while surrounded by other vehicles, while the RAIL policy remained stably within the bounds of the road thanks to the additional reward terms included in the learning process. Source. |
Authors: Lange, B. A., & Brannon, W. D.
- One motivation: Compare the robustness (domain adaptation) of three
IL
techniques: - One take-away: This student project builds a good overview of the different
IL
algorithms and why these algorithms came out.- Imitation Learning (
IL
) aims at building an (efficient) policy using some expert demonstrations. - Behavioural Cloning (
BC
) is a sub-class ofIL
. It treatsIL
as a supervised learning problem: a regression model is fit to thestate
/action
space given by the expert.-
Issue of distribution shift: "Because data is not infinite nor likely to contain information about all possible
state
/action
pairs in a continuousstate
/action
space,BC
can display undesirable effects when placed in these unknown or not well-known states." -
"A cascading effect is observed as the time horizon grows and errors expand upon each other."
-
- Several solutions (not exhaustive):
1-
DAgger
: Ask the expert to say what should be done in some encountered situations. Thus iteratively enriching the demonstration dataset.2-
IRL
: Human driving behaviour is not modelled inside a policy, but rather capture into a reward/cost function.- Based on this reward function, an (optimal) policy can be derived with classic
RL
techniques. - One issue: It can be computationally expensive.
3-
GAIL
(I still need to read more about it):-
"It fits distributions of states and actions given by an expert dataset, and a cost function is learned via Maximum Causal Entropy
IRL
." -
"When the
GAIL
-policy driven vehicle was placed in a multi-agent setting, in which multiple agents take over the learned policy, this algorithm produced undesirable results among the agents."
-
PS-GAIL
is therefore introduced for multi-agent driving models (agents share a single policy learnt withPS-TRPO
).-
"Though
PS-GAIL
yielded better results in multi-agent simulations thanGAIL
, its results still led to undesirable driving characteristics, including unwanted trajectory deviation and off-road duration."
-
RAIL
offers a fix for that: the policy-learning process is augmented with two types of reward terms:- Binary penalties: e.g. collision and hard braking.
- Smoothed penalties: "applied in advance of undesirable actions with the theory that this would prevent these actions from occurring".
- I see that technique as a way to incorporate knowledge.
- About the experiment:
- The three policies were originally trained on the straight roadway: cars only consider the lateral distance to the edge.
- In the "new" environment, a road curvature is introduced.
- Findings:
-
"None of them were able to fully accommodate the turn in the road."
PS-GAIL
is effective only while surrounded by other vehicles.- The smoothed reward augmentation helped
RAIL
, but it was too late to avoid off-road (the car is already driving too fast and does not dare ahard brake
which is strongly penalized). - The reward function should therefore be updated (back to reward engineering 😅), for instance adding a harder reward term to prevent the car from leaving the road.
-
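As a rough sketch of the `RAIL`-style reward augmentation described above (binary plus smoothed penalty terms), with thresholds and coefficients that are purely illustrative:

```python
def augmented_reward(base_reward, off_road_dist, hard_brake, time_to_collision,
                     w_offroad=1.0, w_brake=1.0, w_ttc=0.5):
    """Add RAIL-like penalty terms on top of the imitation (GAIL) reward.

    - binary penalties fire when the undesirable event has already happened
    - smoothed penalties ramp up *before* the event, to discourage it early
    """
    reward = base_reward
    # binary penalties
    if off_road_dist <= 0.0:          # vehicle already off the road
        reward -= w_offroad
    if hard_brake:                    # harsh braking event
        reward -= w_brake
    # smoothed penalty: grows as the time-to-collision shrinks below 3 s
    if time_to_collision < 3.0:
        reward -= w_ttc * (3.0 - time_to_collision) / 3.0
    return reward
```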
"Learning by Cheating"
-
[
2019
] [📝] [] [ 🎓UT Austin
] [ 🚗Intel Labs
] -
[
on-policy supervision
,DAgger
,conditional IL
,mid-to-mid
,CARLA
]
Click to expand
The main idea is to decompose the imitation learning (IL ) process into two stages: 1- Learn to act. 2- Learn to see. Source. |
mid-to-mid learning: Based on a processed bird’s-eye view map , the privileged agent predicts a sequence of waypoints to follow. This desired trajectory is eventually converted into low-level commands by two PID controllers. It is also worth noting how this privileged agent serves as an oracle that provides adaptive on-demand supervision to train the sensorimotor agent across all possible commands. Source. |
Example of privileged map supplied to the first agent. And details about the lateral PID controller that produces steering commands based on a list of target waypoints. Source. |
Authors: Chen, D., Zhou, B., Koltun, V. & Krähenbühl, P
- One motivation: decomposing the imitation learning (
IL
) process into two stages:Direct IL
(from expert trajectories to vision-based driving) conflates two difficult tasks:1-
Learning to see.2-
Learning to act.
- One term: "Cheating".
1-
First, train an agent that has access to privileged information:-
"This privileged agent cheats by observing the ground-truth layout of the environment and the positions of all traffic participants."
- Goal: The agent can focus on learning to act (it does not need to learn to see because it gets direct access to the environment’s state).
-
2-
Then, this privileged agent serves as a teacher to train a purely vision-based system (abundant supervision
).- Goal: Learning to see.
1-
First agent (privileged
agent):- Input: A processed
bird’s-eye view map
(with ground-truth information about lanes, traffic lights, vehicles and pedestrians) together with high-levelnavigation command
andcurrent speed
. - Output: A list of waypoints the vehicle should travel to.
- Hence
mid-to-mid
learning approach. - Goal: imitate the expert trajectories.
- Training: Behaviour cloning (
BC
) from a set of recorded expert driving trajectories.- Augmentation can be done offline, to facilitate generalization.
- The agent is thus placed in a variety of perturbed configurations to learn how to recover
- E.g. facing the sidewalk or placed on the opposite lane, it should find its way back onto the road.
2-
Second agent (sensorimotor
agent):- Input: Monocular
RGB
image, currentspeed
, and a high-levelnavigation command
. - Output: A list of waypoints.
- Goal: Imitate the privileged agent.
- One idea: "White-box" agent:
- The internal state of the
privileged
agent can be examined at will.- Based on that, one could test different high-level commands: "What would you do now if the command was [
follow-lane
] [go left
] [go right
] [go straight
]".
- This relates to
conditional IL
: all conditional branches are supervised during training.
- Another idea: "online learning" and "on-policy supervision":
-
"“On-policy” refers to the sensorimotor agent rolling out its own policy during training."
- Here, the decisions of the second agent are directly implemented (
close-loop
). - And an oracle is still available for the newly encountered situation (hence
on-policy
), which also accelerates the training. - This is an advantage of using a simulator: it would be difficult/impossible in the physical world.
- Here, the second agent is first trained
off-policy
(on expert demonstration) to speed up the learning (offline BC
), and only then goon-policy
:-
"Finally, we train the sensorimotor agent
on-policy
, using the privileged agent as an oracle that provides adaptive on-demand supervision in any state reached by the sensorimotor student." - The
sensorimotor
agent can thus be supervised on all its waypoints and across all commands at once.
-
- It resembles the Dataset aggregation technique of
DAgger
:-
"This enables automatic
DAgger
-like training in which supervision from the privileged agent is gathered adaptively via online rollouts of the sensorimotor agent."
-
-
- About the two benchmarks:
- Another idea: Do not directly output low-level commands.
- Instead, predict waypoints and speed targets.
- And rely on two
PID
controllers to implement them.-
1-
"We fit a parametrized circular arc to all waypoints using least-squares fitting and then steer towards a point on the arc." -
2-
"A longitudinalPID
controller tries to match a target velocity as closely as possible [...] We ignore negative throttle commands, and only brake if the predicted velocity is below some threshold (2 km/h
)."
-
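A very schematic sketch of the two-stage, `DAgger`-like training described above; the agent/environment interfaces (`update`, `predict`, `env.step`) are my own placeholders, not the authors' code:

```python
def train_learning_by_cheating(privileged_agent, sensorimotor_agent, env, expert_logs):
    """Stage 1: behaviour-clone the privileged agent from expert trajectories.
    Stage 2: behaviour-clone the sensorimotor agent, first offline, then on-policy
    with the privileged agent acting as an on-demand oracle (DAgger-like)."""
    # stage 1: privileged agent learns to act from ground-truth bird's-eye maps
    for bev_map, command, speed, expert_waypoints in expert_logs:
        privileged_agent.update(bev_map, command, speed, target=expert_waypoints)

    # stage 2a: offline warm-up of the vision-based student on the same logs
    for image, bev_map, command, speed, _ in expert_logs:
        teacher_waypoints = privileged_agent.predict(bev_map, command, speed)
        sensorimotor_agent.update(image, command, speed, target=teacher_waypoints)

    # stage 2b: on-policy supervision - the student drives, the teacher labels
    for episode in range(100):
        obs = env.reset()
        done = False
        while not done:
            waypoints = sensorimotor_agent.predict(obs.image, obs.command, obs.speed)
            # oracle supervision in every state the student actually visits
            teacher_waypoints = privileged_agent.predict(obs.bev_map, obs.command, obs.speed)
            sensorimotor_agent.update(obs.image, obs.command, obs.speed,
                                      target=teacher_waypoints)
            obs, done = env.step(waypoints)   # waypoints -> low-level PID controllers
    return sensorimotor_agent
```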
"Deep Imitative Models for Flexible Inference, Planning, and Control"
-
[
2019
] [📝] [🎞️] [🎞️] [] [ 🎓Carnegie Mellon University
,UC Berkeley
] -
[
conditional IL
,model-based RL
,CARLA
]
Click to expand
The main motivation is to combine the benefits of IL (to imitate some expert demonstrations) and goal-directed planning (e.g. model-based RL ). Source. |
φ represents the scene. It consists of the current lidar scan , previous states in the trajectory as well as the current traffic light state . Source. |
From left to right: Point , Line-Segment and Region (small and wide) Final State Indicators used for planning. Source. |
Comparison of features and implementations. Source. |
Authors: Rhinehart, N., McAllister, R., & Levine, S.
-
Main motivation: combine the benefits of
imitation learning
(IL
) andgoal-directed planning
such asmodel-based RL
(MBRL
).- Especially to generate interpretable, expert-like plans with offline learning and no reward engineering.
- Neither
IL
norMBRL
can do so. - In other words, it completes planning based on some imitation prior.
-
One concept: "Imitative Models"
- They are "probabilistic predictive models able to plan interpretable expert-like trajectories to achieve new goals".
- As for
IL
-> use expert demonstration:- It generates expert-like behaviors without reward function crafting.
- The model is learnt "offline" also means it avoids costly online data collection (contrary to
MBRL
- It learns a model of desirable behaviour dynamics (as opposed to learning the dynamics of any possible behaviour, as done by
MBRL
).
- As for
MBRL
-> use planning:- It achieves new goals (goals that were not seen during training). Therefore, it avoids the theoretical drift shortcomings (distribution shift) of vanilla behavioural cloning (
BC
). - It outputs (interpretable)
plan
to them at test-time, whichIL
cannot. - It does not need goal labels for training.
- It achieves new goals (goals that were not seen during training). Therefore, it avoids the theoretical drift shortcomings (distribution shift) of vanilla behavioural cloning (
- Binding
IL
andplanning
:- The learnt
imitative model
q(S|φ)
can generate trajectories that resemble those that the expert might generate.- These manoeuvres do not have a specific goal. How to direct our agent to goals?
- General tasks are defined by a set of goal variables
G
.- At test time, a route planner provides waypoints to the imitative planner, which computes expert-like paths for each candidate waypoint.
- The best plan is chosen according to the planning objective (e.g. prefer routes avoiding potholes) and provided to a low-level
PID
-controller in order to producesteering
andthrottle
actions. - In other words, the derived plan (list of set-points) should be:
- Maximizing the similarity to the expert demonstrations (term with
q
) - Maximizing the probability of reaching some general goals (term with
P(G)
).
- Maximizing the similarity to the expert demonstrations (term with
- How to represent goals?
dim=0
- with points:Final-State Indicator
.dim=1
- with lines:Line-Segment Final-State Indicator
.dim=2
- with areas (regions):Final-State Region Indicator
.
-
How to deal with traffic lights?
- The concept of
smart waypointer
is introduced. -
"It removes far waypoints beyond
5
meters from the vehicle when a red light is observed in the measurements provided by CARLA". -
"The planner prefers closer goals when obstructed, when the vehicle was already stopped, and when a red light was detected [...] The planner prefers farther goals when unobstructed and when green lights or no lights were observed."
-
About interpretability and safety:
-
"In contrast to black-box one-step
IL
that predicts controls, our method produces interpretable multi-step plans accompanied by two scores. One estimates the plan’sexpertness
, the second estimates its probability to achieve the goal."- The
imitative model
can produce some expert probability distribution function (PDF
), hence offering superior interpretability to one-stepIL
models. - It is able to score how likely a trajectory is to come from the expert.
- The probability to achieve a goal is based on some "Goal Indicator methods" (using "Goal Likelihoods"). I must say I did not fully understand that part.
- The
- The safety aspect relies on the fact that experts were driving safely and is formalized as a "plan reliability estimation":
-
"Besides using our model to make a best-effort attempt to reach a user-specified goal, the fact that our model produces explicit likelihoods can also be leveraged to test the reliability of a plan by evaluating whether reaching particular waypoints will result in human-like behavior or not."
- Based on this idea, a classification is performed to recognize safe and unsafe plans, based on the planning criterion.
-
-
-
About the baselines:
- Obviously, the proposed approach is compared to the two methods it aims at combining.
- About
MBRL
:1-
First, aforward dynamics model
is learnt using given observed expert data.- It does not imitate the expert preferred actions, but only models what is physically possible.
2-
The model then is used to plan a reachability tree through the free-space up to the waypoint while avoiding obstacles:- Playing with the
throttle
action, the search expands each state node and retains the50
closest nodes to the target waypoint. - The planner finally opts for the lowest-cost path that ends near the goal.
- Playing with the
-
"The task of evoking expert-like behavior is offloaded to the reward function, which can be difficult and time-consuming to craft properly."
- About
IL
: It used Conditional terms on States, leading toCILS
.S
forstate
: Instead of emitting low-level control commands (throttle
,steering
), it outputs set-points for somePID
-controller.C
forconditional
: To navigate at intersections, waypoints are classified into one of several directives: {Turn left
,Turn right
,Follow Lane
,Go Straight
}.- This is inspired by "End-to-end driving via conditional imitation learning" - (Codevilla et al. 2018) - detailed below.
"Conditional Vehicle Trajectories Prediction in CARLA Urban Environment"
Click to expand
Some figures:
End-to- Mid approach: 3 inputs with different levels of abstraction are used to predict the future positions on a fixed 2s -horizon of the ego vehicle and the neighbours. The ego trajectory is implemented by an external PID controller - Therefore, not end-to- end . Source. |
The past 3D-bounding boxes of the road users in the current reference are projected back in the current camera space. The past positions of ego and other vehicles are projected into some grid-map called proximity map . The image and the proximity map are concatenated to form context feature vector C . This context encoding is concatenated with the ego encoding, then fed into branches corresponding to the different high-level goals - conditional navigation goal . Source. |
Illustration of the distribution shift in imitation learning. Source. |
VisualBackProp highlights the image pixels which contributed the most to the final results - Traffic lights and their colours are important, together with highlights lane markings and curbs when there is a significant lateral deviation. Source. |
Authors: Buhet, T., Wirbel, E., & Perrotton, X.
- Previous works:
- "Imitation Learning for End to End Vehicle Longitudinal Control with Forward Camera" - (George, Buhet, Wirbel, Le-Gall, & Perrotton, 2018).
- "End to End Vehicle Lateral Control Using a Single Fisheye Camera" (Toromanoff, M., Wirbel, E., Wilhelm, F., Vejarano, C., Perrotton, X., & Moutarde, F. 2018).
- One term: "End-To-Middle".
- It is opposed to "End-To-End", i.e. it does not output "end" control signals such as
throttle
orsteering
but rather some desired trajectory, i.e. a mid-level representation.- Each trajectory is described by two polynomial functions (one for
x
, the other fory
), therefore the network has to predict a vector (x0
, ...,x4
,y0
, ...,y4
) for each vehicle. - The desired ego-trajectory is then implemented by an external controller (
PID
). Therefore, notend-to-end
.
- Advantages of
end-to-mid
: interpretability for the control part + less to be learnt by the net. - This approach is also an instance of "Direct perception":
-
"Instead of commands, the network predicts hand-picked parameters relevant to the driving (distance to the lines, to other vehicles), which are then fed to an independent controller".
-
- Small digression: if the raw perception measurements were first processed to form a mid-level input representation, the approach would be called
mid-to-mid
. An example is ChauffeurNet, detailed on this page as well.
- It is opposed to "End-To-End", i.e. it does not output "end" control signals such as
- About Ground truth:
- The expert demonstrations do not come from human recordings but rather from
CARLA
autopilot. 15
hours of driving inTown01
were collected.- As for human demonstrations, no annotation is needed.
- One term: "Conditional navigation goal".
- Together with the RGB images and the past positions, the network takes as input a navigation command to describe the desired behaviour of the ego vehicle at intersections.
- Hence, the future trajectory of the ego vehicle is conditioned by a navigation command.
- If the ego-car is approaching an intersection, the goal can be
left
,right
orcross
, else the goal is tokeep lane
. - That means
lane-change
is not an option.
-
"The last layers of the network are split into branches which are masked with the current navigation command, thus allowing the network to learn specific behaviours for each goal".
- Three ingredients to improve vanilla end-to-end imitation learning (
IL
):1
- Mix ofhigh
andlow
-level input (i.e. hybrid input):- Both raw signal (images) and partial environment abstraction (navigation commands) are used.
2
- Auxiliary tasks:- One head of the network predicts the future trajectories of the surrounding vehicles.
- It differs from the primary task which should decide the
2s
-ahead trajectory for the ego car. - Nevertheless, this secondary task helps: "Adding the neighbours prediction makes the ego prediction more compliant to traffic rules."
- This refers to the concept of "Privileged learning":
-
"The network is partly trained with an auxiliary task on a ground truth which is useful to driving, and on the rest is only trained for IL".
-
3
- Label augmentation:- The main challenge of
IL
is the difference between train and online test distributions. This is due to the difference betweenOpen-loop
control: decisions are not implemented.Close-loop
control: decisions are implemented, and the vehicle can end in a state absent from the train distribution, potentially causing "error accumulation".
- Data augmentation is used to reduce the gap between train and test distributions.
- Classical randomization is combined with
label augmentation
: data similar to failure cases is generated a posteriori.
- Three findings:
-
"There is a significant gap in performance when introducing the augmentation."
-
"The effect is much more noticeable on complex navigation tasks." (Errors accumulate quicker).
-
"Online test is the real significant indicator for
IL
when it is used for active control." (The common offline evaluation metrics may not be correlated to the online performance).
- Baselines:
- Conditional Imitation learning (
CIL
): "End-to-end driving via conditional imitation learning" video - (Codevilla, F., Müller, M., López, A., Koltun, V., & Dosovitskiy, A. 2017).CIL
produces instantaneous commands.
- Conditional Affordance Learning (
CAL
): "Conditional affordance learning for driving in urban environments" video - (Sauer, A., Savinov, N. & Geiger, A. 2018).CAL
produces "affordances" which are then given to a controller.
- One word about the choice of the simulator.
- A possible alternative to CARLA could be DeepDrive or the
LGSVL
simulator developed by the Advanced Platform Lab at the LG Electronics America R&D Centre. This looks promising.
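To make the polynomial trajectory output concrete, here is a small sketch of how a predicted coefficient vector (x0, ..., x4, y0, ..., y4) could be turned into waypoints over the 2s horizon; the time-parameterization and the sampling step are my assumptions:

```python
import numpy as np

def decode_polynomial_trajectory(coeffs, horizon=2.0, dt=0.1):
    """coeffs: (x0, ..., x4, y0, ..., y4) - one 4th-order polynomial per axis,
    assumed here to be parameterized by time. Returns (x, y) waypoints."""
    cx, cy = np.asarray(coeffs[:5]), np.asarray(coeffs[5:])
    t = np.arange(0.0, horizon + dt, dt)
    powers = np.vstack([t ** k for k in range(5)])      # shape (5, T)
    x = cx @ powers                                      # x(t) = sum_k x_k * t^k
    y = cy @ powers
    return np.stack([x, y], axis=1)                      # shape (T, 2)

# e.g. a gentle left curve over 2 seconds, fed to the external PID controller
waypoints = decode_polynomial_trajectory([0, 5, 0, 0, 0,   0, 0, 0.4, 0, 0])
```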
"Uncertainty Quantification with Statistical Guarantees in End-to-End Autonomous Driving Control"
Click to expand
One figure:
The trust or uncertainty in one decision can be measured based on the probability mass function around its mode. Source. |
The measures of uncertainty based on mutual information can be used to issue warnings to the driver and perform safety / emergency manoeuvres. Source. |
As noted by the authors: while the variance can be useful in collision avoidance, the wide variance of HMC causes a larger proportion of trajectories to fall outside of the safety boundary when a new weather is applied. Source. |
Authors: Michelmore, R., Wicker, M., Laurenti, L., Cardelli, L., Gal, Y., & Kwiatkowska, M
- One related work:
- NVIDIA’s
PilotNet
[DAVE-2
] where expert demonstrations are used together with supervised learning to map from images (front camera) to steering command. - Here, human demonstrations are collected in the CARLA simulator.
- NVIDIA’s
- One idea: use distribution in weights.
- The difference with
PilotNet
is that the neural network applies the "Bayesian" paradigm, i.e. each weight is described by a distribution (not just a single value). - The authors illustrate the benefits of that paradigm, imagining an obstacle in the middle of the road.
- The Bayesian controller may be uncertain on the steering angle to apply (e.g. a
2
-tail orM
-shape distribution). - A first option is to sample angles, which turns the car either right or left, with equal probability.
- Another option would be to simply select the mean value of the distribution, which aims straight at the obstacle.
- The motivation of this work is based on that example: "derive some precise quantitative measures of the BNN uncertainty to facilitate the detection of such ambiguous situation".
- One definition: "real-time decision confidence".
- This is the probability that the BNN controller is certain of its decision at the current time.
- The notion of trust can therefore be introduced: the idea is to compute the probability mass in a
ε
−ball around the decisionπ
(observation
) and classify it as certain if the resulting probability is greater than a threshold.- It reminds me the concept of
trust-region
optimisation in RL. - In extreme cases, all actions are equally distributed,
π
(observation
) has a very high variance, the agent does not know what to do (no trust) and will randomly sample an action.
- How to get these estimates? Three Bayesian inference methods are compared:
- Monte Carlo dropout (
MCD
). - Mean-field variational inference (
VI
). - Hamiltonian Monte Carlo (
HMC
).
- Monte Carlo dropout (
- What to do with this information?
-
"This measure of uncertainty can be employed together with commonly employed measures of uncertainty, such as mutual information, to quantify in real time the degree that the model is confident in its predictions and can offer a notion of trust in its predictions."
- I did not know about "mutual information" and liked the explanation of Wikipedia about the link of
MI
toentropy
andKL-div
.- I am a little bit confused: in what I read, the
MI
is a function of two random variables. What are they here? The authors rather speak about the uncertainty exhibited by the predictive distribution.
- I did not know about "mutual information" and liked the explanation of Wikipedia about the link of
- Depending on the uncertainty level, several actions are taken:
mutual information warnings
slow down the vehicle.standard warnings
slow down the vehicle and alert the operator of potential hazard.severe warnings
cause the car to safely brake and ask the operator to take control back.
-
- Another definition: "probabilistic safety", i.e. the probability that a BNN controller will keep the car "safe".
- Nice, but what is "safe"?
- It all relies on the assumption that expert demonstrations were all "safe", and measures how much of the trajectory belongs to this "safe set".
- I must admit I did not fully understand the measure on "safety" for some continuous trajectory and discrete demonstration set:
- A car can drive with a large lateral offset from the demonstration on a wide road while being "safe", while a thin lateral shift in a narrow street can lead to an "unsafe" situation.
- Not to mention that the scenario (e.g. configuration of obstacles) has probably changed in-between.
- This leads to the following point with an interesting application for scenario coverage.
- One idea: apply changes in scenery and weather conditions to evaluate model robustness.
- To check the generalization ability of a model, the safety analysis is re-run (offline) with other weather conditions.
- As noted in conclusion, this offline safety probability can be used as a guide for active learning in order to increase data coverage and scenario representation in training data.
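A small sketch of the "real-time decision confidence" idea above, using Monte Carlo samples of the steering posterior and an ε-ball around the chosen action; the thresholds and the mapping to warning levels are illustrative assumptions:

```python
import numpy as np

def decision_confidence(steer_samples, epsilon=0.05):
    """Probability mass within an epsilon-ball around the chosen steering value,
    estimated from MC samples (e.g. MC-dropout forward passes)."""
    steer_samples = np.asarray(steer_samples)
    decision = steer_samples.mean()          # or the posterior mode
    mass = np.mean(np.abs(steer_samples - decision) < epsilon)
    return decision, mass

def safety_action(confidence, warn_threshold=0.8, severe_threshold=0.5):
    if confidence < severe_threshold:
        return "severe warning: brake safely and hand control back to the operator"
    if confidence < warn_threshold:
        return "standard warning: slow down and alert the operator"
    return "certain: keep driving"

samples = np.random.normal(loc=0.1, scale=0.2, size=50)   # fake MC-dropout samples
decision, conf = decision_confidence(samples)
print(decision, conf, safety_action(conf))
```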
"Exploring the Limitations of Behavior Cloning for Autonomous Driving"
-
[
distributional shift problem
,off-policy data collection
,CARLA
,conditional imitation learning
,residual architecture
,reproducibility issue
,variance caused by initialization and sampling
]
Click to expand
One figure:
Conditional Imitation Learning is extended with a ResNet architecture and Speed prediction (CILRS ). Source. |
Authors: Codevilla, F., Santana, E., Antonio, M. L., & Gaidon, A.
- One term: “CILRS” = Conditional Imitation Learning extended with a ResNet architecture and Speed prediction.
- One Q&A: How to include in E2E learning information about the destination, i.e. to disambiguate imitation around multiple types of intersections?
- Add a high-level
navigational command
(e.g. take the next right, left, or stay in lane) to the tuple <observation
,expert action
> when building the dataset.
- Add a high-level
- One idea: learn to predict the ego speed (
mediated perception
) to address the inertia problem stemming from causal confusion (biased correlation between low speed and no acceleration - when the ego vehicle is stopped, e.g. at a red traffic light, the probability it stays static is indeed overwhelming in the training data).- A good video about Causal Confusion in Imitation Learning showing that "access to more information leads to worse generalisation under distribution shift".
- Another idea: The off-policy (expert) driving demonstration is not produced by a human, but rather generated from an omniscient "AI" agent.
- One quote:
"The more common the vehicle model and color, the better the trained agent reacts to it. This raises ethical challenges in automated driving".
"Conditional Affordance Learning for Driving in Urban Environments"
-
[
2018
] [📝] [🎞️] [🎞️] [] [ 🎓CVC, UAB, Barcelona
] [ 🚗Toyota
] -
[
CARLA
,end-to-mid
,direct perception
]
Click to expand
Some figures:
Examples of affordances, i.e. attributes of the environment which limit the space of allowed actions. A1 , A2 and A3 are predefined observation areas. Source. |
The presented direct perception method predicts a low-dimensional intermediate representation of the environment - affordance - which is then used in a conventional control algorithm. The affordance is conditioned for goal-directed navigation, i.e. before each intersection, it receives an instruction such as go straight , turn left or turn right . Source. |
The feature maps produced by a CNN feature extractor are stored in a memory and consumed by task-specific layers (one affordance has one task block). Every task block has its own specific temporal receptive field - it decides how much of the memory it needs. This figure also illustrates how the navigation command is used as switch between trained submodules. Source. |
Authors: Sauer, A., Savinov, N., & Geiger, A.
- One term: "Direct perception" (
DP
):- The goal of
DP
methods is to predict a low-dimensional intermediate representation of the environment which is then used in a conventional control algorithm to manoeuvre the vehicle. - With this regard,
DP
could also be saidend-to-
mid
. The mapping to learn is less complex thanend-to-
end
(from raw input to controls). DP
is meant to combine the advantages of two other commonly-used approaches: modular pipelinesMP
andend-to-end
methods such as imitation learningIL
or model-freeRL
.- Ground truth affordances are collected using
CARLA
. Several augmentations are performed.
- Related work on affordance learning and direct perception.
Deepdriving
: Learning affordance for direct perception in autonomous driving by (Chen, Seff, Kornhauser, & Xiao, 2015).Deepdriving
works on highway.- Here, the idea is extended to urban scenarios (with traffic signs, traffic lights, junctions) considering a sequence of images (not just one camera frame) for temporal information.
- One term: "Conditional Affordance Learning" (
CAL
):- "Conditional": The actions of the agent are conditioned on a high-level command given by the navigation system (the planner) prior to intersections. It describes the manoeuvre to be performed, e.g.,
go straight
,turn left
,turn right
. - "Affordance": Affordances are one example of
DP
representation. They are attributes of the environment which limit the space of allowed actions. Only6
affordances are used forCARLA
urban driving:Distance to vehicle
(continuous).Relative angle
(continuous and conditional).Distance to centre-line
(continuous and conditional).Speed Sign
(discrete).Red Traffic Light
(discrete - binary).Hazard
(discrete - binary).- The
Class Weighted Cross Entropy
is the loss used for discrete affordances to put more weights on rare but important occurrences (hazard
occurs rarely compared totraffic light red
).
- The
- "Learning": A single neural network trained with multi-task learning (
MTL
) predicts all affordances in a single forward pass (~50ms
). It only takes a single front-facing camera view as input.
- "Conditional": The actions of the agent are conditioned on a high-level command given by the navigation system (the planner) prior to intersections. It describes the manoeuvre to be performed, e.g.,
- About the controllers: The path-velocity decomposition is applied. Hence two controllers are used in parallel:
- 1-
throttle
andbrake
- Based on the predicted affordances, a state is "rule-based" assigned among:
cruising
,following
,over limit
,red light
, andhazard stop
(all are mutually exclusive). - Based on this state, the longitudinal control signals are derived, using
PID
or threshold-predefined values. - It can handle traffic lights, speed signs and smooth car-following.
- Note: The Supplementary Material provides detailed insights on controller tuning (especially
PID
) forCARLA
.
- 2-
steering
is controlled by a Stanley Controller, based on two conditional affordances:distance to centreline
andrelative angle
.
- 1-
- One idea: I often wonder what timeout I should set when testing a scenario with
CARLA
. The author computes this time based on the length of the pre-defined path (which is actually easily accessible):-
"The time limit equals the time needed to reach the goal when driving along the optimal path at
10 km/h
"
-
- Another idea: Attention Analysis.
- For better understanding on how affordances are constructed, the attention of the
CNN
using gradient-weighted class activation maps (Grad-CAMs
). - This "visual explanation" reminds me another technique used in
end-to-end
approaches,VisualBackProp
, that highlights the image pixels which contributed the most to the final results.
- Baselines and results:
- Compared to
CARLA
-based Modular Pipeline (MP
), Conditional Imitation Learning (CIL
) and Reinforcement Learning (RL
),CAL
particularly excels in generalizing to the new town.
- Where to provide the high-level navigation conditions?
- The authors find that "conditioning in the network has several advantages over conditioning in the controller".
- In addition, in the net, it is preferable to use the navigation command as switch between submodules rather than an input:
-
"We observed that training specialized submodules for each directional command leads to better performance compared to using the directional command as an additional input to the task networks".
-
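A condensed sketch of the rule-based longitudinal state assignment described above, mapping the predicted affordances to one of the mutually exclusive states; the priority order and the thresholds are my assumptions:

```python
def longitudinal_state(distance_to_vehicle, speed_sign, red_light, hazard,
                       ego_speed, follow_dist=15.0):
    """Assign one of the mutually exclusive longitudinal states from the
    predicted affordances (checked in decreasing order of priority)."""
    if hazard:
        return "hazard_stop"          # emergency braking
    if red_light:
        return "red_light"            # brake to stop before the light
    if ego_speed > speed_sign:
        return "over_limit"           # slow down to the posted speed limit
    if distance_to_vehicle < follow_dist:
        return "following"            # e.g. PID on the gap to the lead vehicle
    return "cruising"                 # e.g. PID on the target speed

print(longitudinal_state(distance_to_vehicle=40.0, speed_sign=30 / 3.6,
                         red_light=False, hazard=False, ego_speed=8.0))
```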
"Variational Autoencoder for End-to-End Control of Autonomous Driving with Novelty Detection and Training De-biasing"
-
[
VAE
,uncertainty estimation
,sampling efficiency
,augmentation
]
Click to expand
Some figures:
One particular latent variable ^y is explicitly supervised to predict steering control. Another interesting idea: augmentation is based on domain knowledge - if a model used to seeing the middle-view is given some left-view image, it should predict some correction to the right. Source. |
For each new image, empirical uncertainty estimates are computed by sampling from the variables of the latent space. These estimates lead to the D statistic that indicates whether an observed image is well captured by our trained model, i.e. novelty detection . Source. |
In a subsequent work, the VAE is conditioned onto the road topology. It serves multiple purposes such as localization and end-to-end navigation. The routed or unrouted map given as additional input goes toward the mid-to-end approach where processing is performed and/or external knowledge is embedded. Source. See this video temporal for evolution of the predictions. |
Authors: Amini, A., Schwarting, W., Rosman, G., Araki, B., Karaman, S., & Rus, D.
- One issue raised about vanilla
E2E
:- The lack a measure of associated confidence in the prediction.
- The lack of interpretation of the learned features.
- Having said that, the authors present an approach to both understand and estimate the confidence of the output.
- The idea is to use a Variational Autoencoder (
VAE
), taking benefit of its intermediate latent representation which is learnt in an unsupervised way and provides uncertainty estimates for every variable in the latent space via their parameters.
- One idea for the
VAE
: one particular latent variable is explicitly supervised to predict steering control.- The loss function of the
VAE
has therefore3
parts:- A
reconstruction
loss:L1
-norm between the input image and the output image. - A
latent
loss:KL
-divergence between the latent variables and a unit Gaussian, providing regularization for the latent space. - A
supervised latent
loss:MSE
between the predicted and actual curvature of the vehicle’s path.
- A
- The loss function of the
- One contribution: "Detection of novel events" (which have not been sufficiently trained for).
- To check if an observed image is well captured by the trained model, the idea is to propagate the
VAE
’s latent uncertainty through the decoder and compare the result with the original input. This is done by sampling (empirical uncertainty estimates). - The resulting pixel-wise expectation and variance are used to compute a sort of loss metric
D
(x
,ˆx
) whose distribution for the training-set is known (approximated with a histogram). - The image
x
is classified asnovel
if this statistic is outside of the95th
percentile of the training distribution and the prediction can finally be "untrusted to produce reliable outputs". -
"Our work presents an indicator to detect novel images that were not contained in the training distribution by weighting the reconstructed image by the latent uncertainty propagated through the network. High loss indicates that the model has not been trained on that type of image and thus reflects lower confidence in the network’s ability to generalize to that scenario."
- To check if an observed image is well captured by the trained model, the idea is to propagate the
- A second contribution: "Automated debiasing against learned biases".
- As for the novelty detection, it takes advantage of the latent space distribution and the possibility of sampling from the most representative regions of this space.
- Briefly said, the idea is to increase the proportion of rarer datapoints by dropping over-represented regions of the latent space to accelerate the training (sampling efficiency).
- This debiasing is not manually specified beforehand but based on learned latent variables.
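One possible way to implement this kind of latent-space debiasing, inspired by the description above (histogram bin count and smoothing constant are assumptions):

```python
import numpy as np

def debiasing_sample_weights(latent_means, n_bins=10, alpha=0.01):
    """Sketch: resample inversely to latent-space density, so rare examples are kept
    while over-represented regions are dropped more often."""
    weights = np.ones(len(latent_means))
    for d in range(latent_means.shape[1]):   # one histogram per latent dimension
        hist, edges = np.histogram(latent_means[:, d], bins=n_bins, density=True)
        bin_idx = np.clip(np.digitize(latent_means[:, d], edges[1:-1]), 0, n_bins - 1)
        weights *= 1.0 / (hist[bin_idx] + alpha)
    return weights / weights.sum()           # sampling probabilities for training
```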
- One reason to use single frame prediction (as opposed to
RNN
):-
""Note that only a single image is used as input at every time instant. This follows from original observations where models that were trained end-to-end with a temporal information (
CNN
+LSTM
) are unable to decouple the underlying spatial information from the temporal control aspect. While these models perform well on test datasets, they face control feedback issues when placed on a physical vehicle and consistently drift off the road.""
-
- One idea about augmentation (also met in the Behavioral Cloning Project of the Udacity Self-Driving Car Engineer Nanodegree):
-
"To inject domain knowledge into our network we augmented the dataset with images collected from cameras placed approximately
2
feet to the left and right of the main centre camera. We correspondingly changed the supervised control value to teach the model how to recover from off-centre positions."
-
- One note about the output:
-
"We refer to steering command interchangeably as the road curvature: the actual steering angle requires reasoning about road slip and control plant parameters that change between vehicles."
-
- Previous and further works:
- "Spatial Uncertainty Sampling for End-to-End control" - (Amini, Soleimany, Karaman, & Rus, 2018)
- "Variational End-to-End Navigation and Localization" - (Amini, Rosman, Karaman, & Rus, 2019)
- One idea: incorporate some coarse-grained roadmaps with raw perceptual data.
- Either unrouted (just containing the drivable roads).
Output
= continuous probability distribution over steering control. - Or routed (target road highlighted).
Output
= deterministic steering control to navigate.
- How to evaluate the continuous probability distribution over steering control given the human "scalar" demonstration?
-
"For a range of
z
-scores over the steering control distribution we compute the number of samples within the test set where the true (human) control output was within the predicted range."
-
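A small illustrative helper for this evaluation, assuming the network outputs a Gaussian `mu`/`sigma` per test sample (all names are hypothetical):

```python
import numpy as np

def coverage_at_z(mu, sigma, human_steering, z_scores=(0.5, 1.0, 2.0, 3.0)):
    """Fraction of test samples whose human control lies within mu +/- z*sigma."""
    coverage = {}
    for z in z_scores:
        inside = np.abs(human_steering - mu) <= z * sigma
        coverage[z] = float(inside.mean())
    return coverage
```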
- About the training dataset:
25 km
of urban driving data.
"ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst"
Click to expand
Two figures:
Different layers composing the mid-level representation . Source. |
Training architecture around ChauffeurNet with the different loss terms, that can be grouped into environment and imitation losses. Source. |
Authors: Bansal, M., Krizhevsky, A., & Ogale, A.
- One term: "mid-level representation"
- The decision-making task (between
perception
andcontrol
) is packed into one single "learnable" module.- Input: the representation divided into several image-like layers:
Map features
such as lanes, stop signs, cross-walks...;Traffic lights
;Speed Limit
;Intended route
;Current agent box
;Dynamic objects
;Past agent poses
.- Such a representation is generic, i.e. independent of the number of dynamic objects and independent of the road geometry/topology.
- I discuss some equivalent representations seen at IV19.
- Output: intended route, i.e. the future poses recurrently predicted by the introduced
ChauffeurNet
model.
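A toy sketch of assembling such an image-like input tensor (the channel list and rasterizer callables are placeholders; the actual ChauffeurNet rendering is more involved):

```python
import numpy as np

# Channel order is an assumption; the idea is simply to stack same-sized image layers.
LAYERS = ["roadmap", "traffic_lights", "speed_limit", "route",
          "current_agent_box", "dynamic_objects", "past_agent_poses"]

def build_mid_level_input(rasterizers, height=400, width=400):
    """Stack each rasterized layer into one H x W x C tensor fed to the model."""
    channels = [rasterizers[name](height, width) for name in LAYERS]  # each (H, W)
    return np.stack(channels, axis=-1)  # shape (H, W, len(LAYERS))
```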
- This architecture lies between E2E (from
pixels
directly tocontrol
) and fully decomposed modular pipelines (decomposingplanning
in multiple modules). - Two notable advantages over E2E:
- It alleviates the burdens of learning perception and control:
- The desired trajectory is passed to a controls optimizer that takes care of creating the low-level control signals.
- Not to mention that different types of vehicles may possibly utilize different control outputs to achieve the same driving trajectory.
- Perturbations and input data from simulation are easier to generate.
- One key finding: "pure imitation learning is not sufficient", despite the
60
days of continual driving (30 million
examples).- One quote about the "famous" distribution shift (deviation from the training distribution) in imitation learning:
"The key challenge is that we need to run the system closed-loop, where errors accumulate and induce a shift from the training distribution."
- The training data does not contain any real collisions. How can the agent efficiently learn to avoid them if it has never been exposed to them during training?
- One solution consists in exposing the model to non-expert behaviours, such as collisions and off-road driving, and in adding extra loss functions.
- Going beyond vanilla cloning.
- Trajectory perturbation: Expose the learner to synthesized data in the form of perturbations to the expert’s driving (e.g. jitter the midpoint pose and heading)
- One idea for future works is to use more complex augmentations, e.g. with RL, especially for highly interactive scenarios.
- Past dropout: to prevent using the history to cheat by just extrapolating from the past rather than finding the underlying causes of the behaviour.
- Hence the concept of tweaking the training data in order to “simulate the bad rather than just imitate the good”.
- Going beyond the vanilla imitation loss.
- Extend imitation losses.
- Add environment losses to discourage undesirable behaviour, e.g. measuring the overlap of predicted agent positions with the non-road regions.
- Use imitation dropout, i.e. sometimes favour the environment loss over the imitation loss.
"Imitating Driver Behavior with Generative Adversarial Networks"
-
[
2017
] [📝] [] [ 🎓Stanford
] -
[
adversarial learning
,distributional shift problem
,cascading errors
,IDM
,NGSIM
,rllab
]
Click to expand
Some figures:
The state consists in 51 features divided into 3 groups: The core features include hand-picked features such as Speed , Curvature and Lane Offset . The LIDAR-like beams capture the surrounding objects in a fixed-size representation independent of the number of vehicles. Finally, 3 binary indicator features identify when the ego vehicle encounters undesirable states - collision , drives off road , and travels in reverse . Source. |
As for common adversarial approaches, the objective function in GAIL includes some sigmoid cross entropy terms. The objective is to fit ψ for the discriminator. But this objective function is non-differentiable with respect to θ . One solution is to optimize πθ separately using RL . But with which reward function? In order to drive πθ into regions of the state-action space similar to those explored by the expert πE , a surrogate reward ˜r is generated from D _ψ based on samples and TRPO is used to perform a policy update of πθ . Source. |
Authors: Kuefler, A., Morton, J., Wheeler, T., & Kochenderfer, M.
- One term: the problem of "cascading errors" in behavioural cloning (
BC
).BC
, which treatsIL
as a supervised learning problem, tries to fit a model to a fixed dataset of expert state-action pairs. In other words,BC
solves a regression problem in which the policy parameterization is obtained by maximizing the likelihood of the actions taken in the training data.- But inaccuracies can lead the stochastic policy to states that are underrepresented in the training data (e.g., an ego-vehicle edging towards the side of the road). And datasets rarely contain information about how human drivers behave in such situations.
- The policy network is then forced to generalize, and this can lead to yet poorer predictions, and ultimately to invalid or unseen situations (e.g., off-road driving).
- "Cascading Errors" refers to this problem where small inaccuracies compound during simulation and the agent cannot recover from them.
- This issue is inherent to sequential decision making.
- As found by the results:
-
"The root-weighted square error results show that the feedforward
BC
model has the best short-horizon performance, but then begins to accumulate error for longer time horizons." -
"Only
GAIL
(and of courseIDM
+MOBIL
) are able to stay on the road for extended stretches."
-
- One idea:
RL
provides robustness against "cascading errors".RL
maximizes the global, expected return on a trajectory, rather than local instructions for each observation. Hence more appropriate for sequential decision making.- Also, the reward function
r
(s_t
,a_t
) is defined for all state-action pairs, allowing an agent to receive a learning signal even from unusual states. And these signals can establish preferences between mildly undesirable behaviour (e.g., hard braking) and extremely undesirable behaviour (e.g., collisions).- In contrast,
BC
only receives a learning signal for those states represented in a labelled, finite dataset. - Because handcrafting an accurate
RL
reward function is often difficult,IRL
seems promising. In addition, the imitation (via the recovered reward function) extends to unseen states: e.g. a vehicle that is perturbed toward the lane boundaries should know to return toward the lane centre.
- Another idea: use
GAIL
instead ofIRL
:-
"IRL approaches are typically computationally expensive in their recovery of an expert cost function. Instead, recent work has attempted to imitate expert behaviour through direct policy optimization, without first learning a cost function."
- "Generative Adversarial Imitation Learning" (
GAIL
) implements this idea:-
"Expert behaviour can be imitated by training a policy to produce actions that a binary classifier mistakes for those of an expert."
-
"
GAIL
trains a policy to perform expert-like behaviour by rewarding it for “deceiving” a classifier trained to discriminate between policy and expert state-action pairs."
-
- One contribution is to extend
GAIL
to the optimization of recurrent neural networks (GRU
in this case).
-
- One concept: "Trust Region Policy Optimization".
- Policy-gradient
RL
optimization with "Trust Region" is used to optimize the agent's policyπθ
, addressing the issue of training instability of vanilla policy-gradient methods.-
"
TRPO
updates policy parameters through a constrained optimization procedure that enforces that a policy cannot change too much in a single update, and hence limits the damage that can be caused by noisy gradient estimates."
-
- But what reward function to apply? Again, we do not want to do
IRL
. - Some "surrogate" reward function is empirically derived from the discriminator. Although it may be quite different from the true reward function optimized by expert, it can be used to drive
πθ
into regions of the state-action space similar to those explored byπE
.
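A minimal sketch of such a surrogate reward derived from the discriminator (the exact form and the labelling convention vary between GAIL implementations; this is one common choice, not necessarily the paper's):

```python
import numpy as np

def surrogate_reward(discriminator, state, action, eps=1e-8):
    """Reward the policy for (state, action) pairs the classifier mistakes for expert ones.

    Here D(s, a) is read as the probability of being policy-generated, so
    -log D is large when the discriminator is fooled; the opposite labelling
    convention leads to the equivalent -log(1 - D) form.
    """
    d = np.clip(discriminator(state, action), eps, 1.0 - eps)
    return -np.log(d)
```

These surrogate rewards then replace the (unknown) environment reward inside the TRPO policy update.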
- One finding: Should previous actions be included in the state
s
?-
"The previous action taken by the ego vehicle is not included in the set of features provided to the policies. We found that policies can develop an over-reliance on previous actions at the expense of relying on the other features contained in their input."
- But on the other hand, the authors find:
-
"The
GAIL
GRU
policy takes similar actions to humans, but oscillates between actions more than humans. For instance, rather than outputting a turn-rate of zero on straight road stretches, it will alternate between outputting small positive and negative turn-rates". -
"An engineered reward function could also be used to penalize the oscillations in acceleration and turn-rate produced by the
GAIL
GRU
".
-
- Some interesting interpretations about the
IDM
andMOBIL
driver models (resp. longitudinal and lateral control).- These commonly-used rule-based parametric models serve here as baselines:
-
"The Intelligent Driver Model (
IDM
) extended this work by capturing asymmetries between acceleration and deceleration, preferred free road and bumper-to-bumper headways, and realistic braking behaviour." -
"
MOBIL
maintains a utility function and 'politeness parameter' to capture intelligent driver behaviour in both acceleration and turning."
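For reference, a compact sketch of the IDM acceleration law used by this longitudinal baseline (parameter values are typical defaults, not necessarily those used in the paper):

```python
def idm_acceleration(v, v_lead, gap,
                     v0=30.0, T=1.5, a_max=1.0, b_comf=1.5, s0=2.0, delta=4):
    """Intelligent Driver Model: free-road term minus interaction term."""
    dv = v - v_lead                                     # closing speed to the leader
    s_star = s0 + max(0.0, v * T + v * dv / (2 * (a_max * b_comf) ** 0.5))
    return a_max * (1 - (v / v0) ** delta - (s_star / max(gap, 1e-3)) ** 2)
```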
"Deep Inverse Q-learning with Constraints"
-
[
constrained Q-learning
,constrained imitation
,Boltzmann distribution
,SUMO
]
Click to expand
Left: both the reward and Q -function are estimated based on demonstrations and a set of constraints. Right: Comparing the expert, the unconstrained (blue) and constrained (green) imitating agents as well as a RL agent trained to optimize the true MDP with constraints (yellow). The constrained imitation can keep high speed while ensuring no constrain violation. Not to mention that it can converge faster than the RL agent (yellow). Source. |
Compared to existing IRL approaches, the proposed methods can enforce additional constraints that were not part of the original demonstrations. And they do not require solving the MDP multiple times. Source. |
Derivation for the model-based case (Inverse Action-value Iteration): The IRL problem is transformed into solving a system of linear equations. ((3 ) reminds me of the law of total probability ). The demonstrations are assumed to come from an expert following a stochastic policy with an underlying Boltzmann distribution over optimal Q -values (which enables working with log ). With this formulation, it is possible to calculate a matching reward function for the observed (optimal) behaviour analytically in closed-form. An extension based on sampling is proposed for model-free problems. Source. |
While the Expert agent was not trained to include the Keep Right constraint (US-highway demonstrations), the Deep Constrained Inverse Q-learning (DCIQL ) agent is satisfying the Keep Right (German highway) and Safety constraints while still imitating to overtake the other vehicles in an anticipatory manner. Source. |
Authors: Kalweit, G., Huegle, M., Werling, M., & Boedecker, J.
-
Previous related works:
1-
Interpretable multi time-scale constraints in model-free deep reinforcement learning for autonomous driving
, (Kalweit, Huegle, Werling, & Boedecker, 2020)- About Constrained
Q
-learning. - Constraints are considered at different time scales:
- Traffic rules and constraints are ensured in predictable short-term horizon.
- Long-term goals are optimized by optimization of long-term return: With an expected sum of discounted or average
constraint
signals.
2-
Dynamic input for deep reinforcement learning in autonomous driving
, (Huegle, Kalweit, Mirchevska, Werling, & Boedecker, 2019)- A
DQN
agent is learnt with the following definitions and used as the expert to produce the demonstrations.simulator
:SUMO
.state
representation:DeepSet
to model interactions between an arbitrary number of objects or lanes.reward
function: minimize deviation to somedesired speed
.action
space: high-level manoeuvre in {keep lane
,perform left lane change
,perform right lane change
}.speed
is controlled by low-level controller.
- A
3-
Off-policy multi-step q-learning
, (Kalweit, Huegle, & Boedecker, 2019)- Considering two methods inspired by multi-step
TD
-learning to enhance data-efficiency while remaining off-policy:(1)
TruncatedQ
-functions: representing thereturn
for the firstn
steps of a policy rollout.(2)
ShiftedQ
-functions: acting as the far-sightedreturn
after this truncated rollout.
-
Motivations:
1-
Optimal constrained imitation.- The goal is to imitate an expert while respecting constraints, such as traffic rules, that may be violated by the expert.
-
"
DCIQL
is able to guarantee satisfaction of constraints on the long-term for optimal constrained imitation, even if the original demonstrations violate these constraints."
-
- For instance:
- Imitate a driver observed on the US highways.
- And transfer the policy to German highways by including a
keep-right
constraint.
2-
Improve the training efficiency ofMaxEnt IRL
to offer faster training convergence.-
"Popular
MaxEnt IRL
approaches require the computation of expectedstate visitation
frequencies for the optimal policy under an estimate of thereward
function. This usually requires intermediatevalue
estimation in the inner loop of the algorithm, slowing down convergence considerably." -
"One general limitation of
MaxEnt IRL
based methods, however, is that the consideredMDP
underlying the demonstrations has to be solved MANY times inside the inner loop of the algorithm." - The goal is here to solve the
MDP
underlying the demonstrated behaviour once to recover the expert policy.-
"Our approach needs to solve the
MDP
underlying the demonstrated behavior only once, leading to a speedup of up to several orders of magnitude compared to the popularMaximum Entropy IRL
algorithm and some of its variants." -
"Compared to our learned
reward
, the agent trained on the truereward
function has a higher demand for training samples and requires more iterations to achieve a well-performing policy. [...] Which we hypothesize to result from the bootstrapping formulation ofstate-action
visitations in ourIRL
formulation, suggesting a strong link to successor features."
-
-
3-
Focus on off-policyQ-learning
, where an optimal policy can be found on the basis of a given transition set.
-
Core assumption.
- The expert follows a
Boltzmann
distribution over optimalQ-values
.-
"We assume a policy that only maximizes the entropy over actions locally at each step as an approximation."
-
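In other words, the assumed expert policy is a softmax over optimal Q-values; a tiny sketch:

```python
import numpy as np

def boltzmann_policy(q_values):
    """pi(a|s) proportional to exp(Q(s,a)): the assumed expert action distribution."""
    z = q_values - q_values.max()      # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# In log-space, log pi(a|s) = Q(s,a) - logsumexp(Q(s, .)),
# which is what allows the reward to be recovered analytically in IAVI.
```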
-
Outputs.
- Each time, not only
r
but alsoQ
is estimated, i.e. apolicy
.- Deriving the
policy
by such imitation seems faster than solving theMDP
with the truereward
function: -
"Compared to our learned
reward
, the agent trained on the truereward
function has a higher demand for training samples and requires more iterations to achieve a well-performing policy."
-
Two families:
-
1-
model-based.-
"If the observed
transitions
are samples from the true optimalBoltzmann
distribution, we can recover the truereward
function of theMDP
in closed-form. In case of an infinite control problem or if no clear reverse topological order exists, we solve theMDP
by iterating multiple times until convergence." - If the
transition
model is known, theIRL
problem is converted to a system of linear equations: it is possible to calculate a matching reward function for the observed (optimal) behaviour analytically in closed-form. - Hence named "Inverse Action-value Iteration" (
IAVI
). -
"Intuitively, this formulation of the immediate
reward
encodes the local probability of actiona
while also ensuring the probability of the maximizing next action underQ-learning
. Hence, we note that this formulation of bootstrapping visitation frequencies bears a strong resemblance to the Successor Feature Representation."
-
-
2-
model-free.-
"To relax the assumption of an existing
transition
model andaction
probabilities, we extendIAVI
to a sampling-based algorithm." -
"We extend
IAVI
to a sampling based approach using stochastic approximation, which we call Inverse Q-learning (IQL
), usingShifted
Q-functions proposed in [6] to make the approach model-free.- The shifted
Q-value
QSh
(s
,a
) skips the immediatereward
for takinga
ins
and only considers the discountedQ-value
of thenext state
.
1-1.
Tabular Inverse Q-learning algorithm:- The
action
probabilities are approximated withstate-action
visitation counterρ
(s
,a
). -
[
transition
model] "In order to avoid the need of a modelM
[inη
(a
,s
)], we evaluate all other actions via ShiftedQ-functions
."
- The
1-2.
Deep (non-tabular)IQL
.- Continuous
state
s are now addressed. - The
reward
function is estimated with function approximatorr(·, ·|θr)
, parameterized byθr
. - The
Q
-function andShifted-Q
function are also estimated with nets. -
"We approximate the state-action visitation by classifier ρ(·, ·|
θρ
), parameterized byθρ
and with linear output."
-
-
-
How to enforce (additional) constraints?
- In one previous work, two sets of constraints were considered.
1-
One about theaction
(action masking
).2-
One about thepolicy
, with multi-step constraint signals with horizon.- An expected sum of discounted or average
constraint
signals is estimated.
- Here, only the
action
set is considered.- A set of constraints functions C={
ci
} is defined. Similar to thereward
function, it considers (s, a
):ci
(s
,a
). - A "safe"
action
set can be defined based on some threshold values for eachci
:Safe
i
= {a ∈ A
|ci(s, a)
≤βci
}
- In addition to the
Q
-function inIQL
, a constrainedQ
-functionQC
is estimated.-
"For policy extraction from
QC
afterQ-learning
, only theaction-values
of the constraint-satisfying actions must be considered."
-
- Including constraints directly in
IQL
leads to optimal constrained imitation from unconstrained demonstrations.
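A schematic of this constrained policy extraction (the constraint functions, thresholds and the fallback used when the safe set is empty are assumptions):

```python
def safe_action_set(state, actions, constraints, thresholds):
    """Safe(s) = {a : c_i(s, a) <= beta_i for every constraint c_i}."""
    return [a for a in actions
            if all(c(state, a) <= thresholds[name]
                   for name, c in constraints.items())]

def constrained_greedy_action(state, actions, q_c, constraints, thresholds):
    """Policy extraction: argmax of the constrained Q-function over safe actions only."""
    safe = safe_action_set(state, actions, constraints, thresholds) or actions  # fallback if empty
    return max(safe, key=lambda a: q_c(state, a))
```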
"Planning on the fast lane: Learning to interact using attention mechanisms in path integral inverse reinforcement learning"
-
[
2020
] [📝] [ 🎓TU Darmstadt
] [ 🚗Volkswagen
] -
[
max-entropy
,path integral
,sampling
,MPC
]
Click to expand
About path integral IRL and how the partition function in the MaxEnt-IRL formulation is approximated via sampling. Note the need to integrate over all possible trajectories (Π ) in the partition function . Besides, note the transition model M that produces the features but is also able to generate the next-state . Such a model-based generation is simple for static scenes, but can it work in dynamic environments (unknown system dynamics), where the future is way more uncertain? Source. |
Top-left: sampling-based and 'general-purpose' (non-hierarchical) planner that relies on some transition model. Top-right: the features used and their learnt/hard-coded associated weights . It can be seen that the weights are changing depending on the context (straight / curvy road). Bottom: For every planning cycle, a restricted set of demonstrations ΠD is considered, which are "geometrically" close (c.f. projection metric that transfers the actions of a manual drive into the state-action space of the planning algorithm) to the odometry record ζ (not very clear to me). Also note the labelling function that assigns categorical labels to transitions, e.g., a label associated with collision . Source. |
To ensure temporally consistent prediction, an analogy with the temporal abstraction of HRL is made. Source. |
To dynamically update the reward function while ensuring temporal consistency, the deep IRL architecture is separated into a policy attention and a temporal attention mechanism. The first one encodes the context of a situation and should learns to focus on collision-free policies in the configuration space. It also helps for dimension reduction. The second one predicts a mixture reward function given a history of context vectors. Source. |
Authors: Rosbach, S., Li, X., Großjohann, S., Homoceanu, S., & Roth, S.
-
Previous related works:
0-
Planning Universal On-Road Driving Strategies for Automated Vehicles
, (Heinrich, 2018) andOptimizing a driving strategy by its sensor coverage of relevant environment information
(Heinrich, Stubbemann, & Rojas, 2016).- A "general-purpose" (i.e. no
behavioural
/local path
hierarchy) planner.
- A "general-purpose" (i.e. no
1-
Driving with Style: Inverse Reinforcement Learning in General-Purpose Planning for Automated Driving
, (Rosbach, James, Großjohann, Homoceanu, & Roth, 2019).- Path integral (
PI
) maximum entropyIRL
method to learnreward
functions for/with the above planner. - The structure of the planner is leveraged to compute the
state visitation
, enablingMaxEnt-IRL
despite the high-dimensionalstate
space.
2-
Driving Style Encoder: Situational Reward Adaptation for General-Purpose Planning in Automated Driving
, (Rosbach et al., 2019).- Motivation: learn situation-dependent
reward
functions for the planner. - The (complex) mapping between the
situation
and theweights
of thereward function
is approximated by aNN
.
-
Motivations:
1-
Automate the tuning of thereward
function for a "general-purpose" (non hierarchical) planner, using human driving demonstrations.- The absence of temporal abstraction brings a constraint: a high-dimensional
state
space with continuous actions.
2-
Be able to update thereward
function dynamically, i.e. predict situation-dependentreward
functions.3-
Predict temporally-consistentreward
functions.- Since "Non-temporally-consistent
reward
functions"=>
"Non-persistent behaviours / interactions".
- Since "Non-temporally-consistent
-
About the planner.
- It is model-based, i.e. relies on some
transition model
to performed a forward search ofactions
via sampling, starting from some initial states0
.-
"The algorithm is similar to parallel breadth first search [leveraging GPU and made efficient with
pruning
] and forward value iteration."
-
- Optimization under constraints: The most promising sequence is selected based on some
reward
function while respectingkinematic
/dynamic
constraints.-
"The planner explores the subspace of feasible policies
Π
by samplingactions
from a distribution conditioned on vehicledynamics
for each states
." -
"The final driving policy is selected based on the policy value
V(π)
and model-based constraints."
-
-
Why no hierarchical planner, e.g. some tactical
manoeuvre
selection above some operational localtrajectory
optimization?-
"
Behavior
planning becomes difficult in complex and unforeseen driving situations in which the behavior fails to match predefined admissibility templates." -
[kinematic constraints] "These hierarchical planning architectures suffer from uncertain behavior planning due to insufficient knowledge about motion constraints. As a result, a maneuver may either be infeasible due to over-estimation or discarded due to under-estimation of the vehicle capabilities."
- One idea is instead to sample of a large set of
actions
that respectkinematic
constraints. And then evaluate the candidates with somecost
/reward
function. - The sequences of sampled actions can represent complex manoeuvres, implicitly including multiple behaviours, e.g.,
lane following
,lane changes
,swerving
, andemergency stops
.
-
-
Advantages of the flat planning architecture (no
planning
-task decomposition).-
"These general-purpose planners allow
behavior
-aware motion planning given a SINGLEreward
function." [Which would probably have to be adapted depending on the situation?] - Also, it can become more scalable since it does not rely on behaviour implementations: it does not decompose the decision-making based on behaviour templates for instance.
- But, again, there will not be a
One-size-fits-all
function. So now the challenge is to constantly adapt thereward
function based on the situation.
-
-
About the
action
space.- Time-continuous polynomial functions:
Longitudinal
actions described byvelocity
profiles, up to the5th
-order.Lateral
actions described bywheel
angle, up to the3th
-order.
- Time-continuous polynomial functions:
-
About
IRL
.1- Idea.
Find thereward
function weightsθ
that enable the optimal policyπ∗
to be at least as good as the demonstrated policy.- Issue: learning a
reward
function given an optimal policy is ambiguous since manyreward
functions may lead to the same optimal policy.
- Issue: learning a
2- Solution.
Max-margin classification.- Issue: it suffers from drawbacks in the case of imperfect demonstrations.
3- Solution.
Use a probabilistic model. For instance maximize the entropy of the distribution onstate
-actions
under the learned policy:MaxEnt-IRL
.-
"It solves the ambiguity of imperfect demonstrations by recovering a distribution over potential reward functions while avoiding any bias."
- How to compute the gradient of the
entropy
?- Many use state visitation calculation, similar to backward value iteration in
RL
.
- Issue: this is intractable in the high-dimensional
state
space.-
[Because no
hierarchy
/temporal abstraction
] "Our desired driving style requires high-resolution sampling of time-continuous actions, which produces a high-dimensionalstate
space representation."
-
-
4- Solution.
Combines search-based planning withMaxEnt-IRL
.- Solution: Use the graph representation of the planner (i.e. starting from
s0
, samplingactions
and using atransition model
) to approximate the required empirical feature expectations and to allowMaxEnt-IRL
. -
"The parallelism of the
action
sampling of the search-based planner allows us to explore a high-resolution state representationSt
for each discrete planning horizon incrementt
." -
"Our sample-based planning methodology allows us to approximate the
partition function
similar to Markov chain Monte Carlo methods." -
"Due to the high-resolution sampling of actions, we ensure that there are policies that are geometrically close to human-recorded odometry and resemble human driving styles. The task of
IRL
is to find the unknown reward function that increases the likelihood of these trajectories to be considered as optimal policies."
- Solution: Use the graph representation of the planner (i.e. starting from
-
About the
features
.- The vector of
features
is generated by the environment modelM
at each step:f
(s
,a
). - The mapping from the complex
feature
space to thereward
is here linear.- The
reward
is computed by weighting thefeatures
in a sum:r
(s
,a
) =f
(s
,a
) .θ
.
- The
- The
feature path integral
for a policyπ
is defined byfi
(π
) = −Integral-over-t
[γt.fi
(st, at
).dt
].- The path integral is approximated by the iterative execution of sampled
state-action
sets.
- The path integral is approximated by the iterative execution of sampled
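A small sketch of the linear reward and a discrete approximation of the feature path integral (the discount and step size are placeholders):

```python
import numpy as np

def linear_reward(features, theta):
    """r(s, a) = f(s, a) . theta  (linear mapping from features to reward)."""
    return float(np.dot(features, theta))

def feature_path_integral(feature_sequence, dt=0.1, gamma=1.0):
    """fi(pi) ~= - sum_t gamma^t * fi(s_t, a_t) * dt  (discrete approximation)."""
    f = np.asarray(feature_sequence)             # shape (T, n_features)
    discounts = gamma ** np.arange(len(f))
    return -(discounts[:, None] * f).sum(axis=0) * dt
```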
-
Why is it called "path integral"
MaxEnt-IRL
?- It builds on
Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals
, (Aghasadeghi & Bretl, 2011).-
"Similar to (Aghasadeghi et al.), we optimize [maximization of the
log-likelihood
of the expert behavior] under the constraint of matching the feature path integralsfπ
of the demonstration and feature expectations of the explored policies." - The expected
PI
feature valuesEp
(π|θ
)[fπ
] of the policy setΠ
should match the empirical feature valuesfˆΠD
of the demonstrations for each planning cycle of theMPC
.
-
- It builds on
-
How to adapt to continuously-changing objectives? I.e. learn situation-dependent reward functions.
-
"The probabilistic model
p(π|θ)
that recovers a single reward function for the demonstrated trajectories does not scale." -
"The tuned linear reward functions do not generalize well over different situations as the objectives change continuously, e.g., the importance of keeping to the lane center in straight road segments while allowing deviations in curvy parts."
Idea 1.
PI-clustered IRL
: Consider that there areN
different reward functions.- Reward functions (their
weights
for the linear combination) are computed for each cluster. -
"We utilize Expectation Maximization (
EM
) inIRL
, whereβπDc
is the probability that a demonstrationπD
belongs to a clusterc
[E
-step], andψ
(c
) is the estimated prior probability of a clusterc
[M
-step]."
- Reward functions (their
Idea 2.
Neural net as function approximator.- Input:
PI features
andactions
of sampled driving policies of anMPC
-based planner. - Output: a set of linear
reward
functionweights
for upcoming planning cycles:reward-weights
(t+1
)≈
net
[Θ
](fk
,ak
). - Hence the net learns a representation of the
statics
andkinematics
of the situation.-
"Previously sampled driving policies of the
MPC
are used as inputs to our neural network. The network learns a representation of the driving situation by matching distributions of features and actions to reward functions on the basis of the maximum entropy principle."
-
- It uses
1-d
convolutions over trajectories. With ideas similar toPointNet
:-
"The average pooling layers are used for dimensionality reduction of the features. Since we use only one-dimensional convolutional layers, no relationship is established between policies of a planning cycle by then. These inter-policy relationships are established by a series of
8
fully-connected layers at the end."
-
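A condensed sketch of such an architecture (layer sizes and the exact pooling scheme are guesses; the paper uses a deeper stack of fully-connected layers):

```python
import torch
import torch.nn as nn

class RewardWeightNet(nn.Module):
    """Per-policy 1-d convolutions + average pooling, then fully-connected layers
    that mix the policies and output the next cycle's linear reward weights."""
    def __init__(self, n_inputs, n_weights, n_policies):
        super().__init__()
        self.per_policy = nn.Sequential(
            nn.Conv1d(n_inputs, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # dimensionality reduction over the horizon
        )
        self.head = nn.Sequential(            # inter-policy relationships
            nn.Linear(32 * n_policies, 64), nn.ReLU(),
            nn.Linear(64, n_weights),
        )

    def forward(self, x):                     # x: (batch, n_policies, n_inputs, T)
        b, p, c, t = x.shape
        z = self.per_policy(x.reshape(b * p, c, t)).reshape(b, p * 32)
        return self.head(z)
```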
- During inference.
- The
MPC
re-plans in discrete time-stepsk
- After receiving the
features
andactions
of the latest planning cycle, the neural network infers the newreward
weights. -
"To enable smooth transitions of the
reward
functions, we utilize a predefined history sizeh
to calculate the empirical mean of weightsθˆ
. The weights hence obtained are used to continuously re-parameterize the planning algorithm for the subsequent planning cycle."
- The
-
-
How to dynamically update the
reward
function while enabling persistent behaviour over an extended time horizon?-
"Continuous reward function switches may result in non-stationary behavior over an extended planning horizon."
-
"The interaction with dynamic objects requires an extended planning horizon, which requires sequential context modeling."
- The reward function for the next planning cycle at time
t+1
is predicted with a net. With two attention mechanisms: 1-
Policy
(trajectory) attention mechanism.- Generate a low dimensional
context
vector of the driving situation fromfeatures
sampled-driving policies. -
"Inputs are a set of planning cycles each having a set of policies."
-
"The attention vector essentially filters non-human-like trajectories from the policy encoder."
- It also helps for dimension reduction.
-
"The size of the policy set used to understand the spatio-temporal scene can be significantly reduced by concentrating on relevant policies having a human-like driving style. In this work, we use a policy attention mechanism to achieve this dimension reduction using a situational
context
vector." -
"The attention networks stand out, having less parameters and a low-dimensional
context
vector while yielding similar performance as compared to larger neural network architectures."
-
- Generate a low dimensional
2-
Temporal
attention network (TAN
) with recurrent layers.- Predict a mixture
reward
function given a history ofcontext
vectors. -
"We use this context vector in a sequence model to predict a temporal
reward
function attention vector." -
"This
temporal
attention vector allows for stablereward
transitions for upcoming planning cycles of anMPC
-based planner."
- Predict a mixture
-
"We are able to produce stationary reward functions if the driving task does not change while at the same time addressing situation-dependent task switches with rapid response by giving the highest weight to the reward prediction of the last planning cycle."
-
"Efficient Sampling-Based Maximum Entropy Inverse Reinforcement Learning with Application to Autonomous Driving"
-
[
2020
] [📝] [ 🎓UC Berkeley
] -
[
max-entropy
,partition function
,sampling
,INTERACTION
]
Click to expand
The intractable partition Z function of Max-Entropy method is approximated by a sum of sampled trajectories. Source. |
Left: Prior knowledge is injected to make the sampled trajectories feasible, hence improving the efficiency of the IRL method. Middle: Along with speed-desired_speed , long-acc , lat-acc and long-jerk , two interactions features are considered. Bottom-right: Sample re-distribution is performed since generated samples are not necessarily uniformly distributed in the selected feature space. Top-right: The learned weights indicate that humans care more about longitudinal accelerations in both non-interactive and interactive scenarios. Source. |
Authors: Wu, Z., Sun, L., Zhan, W., Yang, C., & Tomizuka, M.
-
Motivations:
1-
The trajectories of the observed vehicles satisfy car kinematics constraints.- This should be considered while learning
reward
function.
- This should be considered while learning
2-
Uncertainties exist in real traffic demonstrations.- The demonstrations in naturalistic driving data are not necessarily optimal or near-optimal, and the
IRL
algorithms should be compatible with such uncertainties. Max-Entropy
methods (probabilistic) can cope with this sub-optimality.
- The demonstrations in naturalistic driving data are not necessarily optimal or near-optimal, and the
3-
The approach should converge quickly to scale to problems with large continuous-domain applications with long horizons.- The critical part in max-entropy
IRL
: How to estimate the intractable partitionZ
?
- The critical part in max-entropy
-
Some assumptions:
-
"We do not consider scenarios where human drivers change their
reward
functions along the demonstrations." -
"We also do not specify the diversity of
reward
functions among different human drivers. Hence, the acquiredreward
function is essentially an averaged result defined on the demonstration set."
-
-
Why "sampling-based"?
- The integral of the partition function is approximated by a sum over generated samples.
- It reminds me of Monte Carlo integration techniques.
- The samples are not random. Instead, they are feasible and represent long-horizon trajectories, leveraging prior knowledge on vehicle kinematics and motion planning.
- Efficiency:
1-
Around1 minute
to generate all samples for the entire training set.2-
The sampling process is one-shot in the algorithm through the training process (do they mean that the set needs only to be created once?).
- Sample Re-Distribution.
-
"The samples are not necessarily uniformly distributed in the selected feature space, which will cause biased evaluation of probabilities."
-
"To address this problem, we propose to use
Euclidean
distance [better metrics will be explored in future works] in the feature space as a similarity metric for re-distributing the samples."
-
- The sampling time of all trajectories is
∆t=0.1s
.
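A toy illustration of this sampling-based approximation of the partition function, assuming one feature vector per demonstrated / sampled trajectory (names and the inverse temperature `beta` are placeholders):

```python
import numpy as np

def trajectory_rewards(features, theta):
    """Linear reward per trajectory: R(xi) = f(xi) . theta."""
    return np.asarray(features) @ np.asarray(theta)

def maxent_log_likelihood(theta, demo_features, sample_features, beta=1.0):
    """Z is replaced by a sum (log-sum-exp) over kinematically feasible samples."""
    demo_rewards = beta * trajectory_rewards(demo_features, theta)
    sample_rewards = beta * trajectory_rewards(sample_features, theta)
    log_z = np.logaddexp.reduce(sample_rewards)   # log of the approximated partition function
    return float(np.mean(demo_rewards - log_z))
```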
-
Features:
1-
Non-interactive:speed
deviation todesired_speed
, longitudinal and lateralaccelerations
, longitudinaljerk
.2-
Interactive:future distance
: minimum spatial distance of two interactive vehicles within a predicted horizonτ-predict
assuming that they are maintaining their current speeds.future interaction distance
: minimum distance between their distances to the collision point.
- All are normalized in (
0, 1
).
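For illustration, the first interaction feature could be computed roughly like this under the constant-velocity assumption (the prediction horizon and step size are placeholders):

```python
import numpy as np

def future_distance(p_ego, v_ego, p_other, v_other, tau=3.0, dt=0.1):
    """Minimum spatial distance over the prediction horizon, assuming both
    vehicles keep their current velocities (2-d positions/velocities)."""
    t = np.arange(0.0, tau + dt, dt)[:, None]               # (T, 1)
    gap = (p_ego + t * v_ego) - (p_other + t * v_other)      # (T, 2) relative positions
    return float(np.linalg.norm(gap, axis=1).min())
```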
-
Metrics:
1-
Deterministic: feature deviation from the ground truth.2-
Deterministic: meanEuclidean
distance to the ground truth.3-
Probabilistic: the likelihood of the ground truth.
-
Baselines:
- They all are based on the principle of maximum entropy, but differ in the estimation of
Z
:1-
Continuous-domainIRL
(CIOC
).Z
is estimated in a continuous domain via Laplace approximation: thereward
at an arbitrary trajectoryξ˜
can be approximated by its second-order Taylor expansion at a demonstration trajectoryˆξD
.
2-
Optimization-approximatedIRL
(Opt-IRL
).- An optimal trajectory
ξopt
can be obtained by minimizing the updatedreward
function. Then,Z
≈exp
(βR
(θ
,ξopt
)). -
"In the forward problem at each iteration, it directly solves the optimization problem and use the optimal trajectories to represent the expected feature counts."
- An optimal trajectory
3-
Guided cost learning (GCL
).- This one is not model-based: it does not need manually crafted
features
, but automatically learns features via neural networks. - It uses rollouts (samples) of the
policy
network to estimateZ
in each iteration. - However, all these samples must be re-generated in every training iteration, while the proposed method only needs to generate all samples once.
- This one is not model-based: it does not need manually crafted
"Analyzing the Suitability of Cost Functions for Explaining and Imitating Human Driving Behavior based on Inverse Reinforcement Learning"
-
[
2020
] [📝] [ 🎓FZI
,KIT
,UC Berkeley
] -
[
max-entropy
]
Click to expand
Left: Definition of the features retrieved from trajectory demonstrations and the evaluation function . Right: max-Entropy IRL only requires locally optimal demonstrations because the gradient and Hessian of the reward function are only considered in the proximity of the demonstration. Note that the features are only based on the states , while the actions remain disregarded. And that their approach assumes that the cost function is parameterized as a linear combination of cost terms. Source. |
General cost function structures and commonly used trajectory features. Only one work considers crossing scenarios. To account for the right of way at intersections, the time that elapses between one vehicle leaving a conflict zone, i.e. an area where paths overlap, and another vehicle entering this zone, is considered: tTZC = dsecond /vsecond . Bottom: Due to the similarity of the variance -mean -ratio under different evaluation functions, the authors limit their experiments to the consideration of sum[f(t)²] , which is most used. Source. |
Authors: Naumann, M., Sun, L., Zhan, W., & Tomizuka, M.
-
Motivations:
1-
Overview of trajectoryfeatures
andcost
structures.2-
About demonstration selection: What are the requirements when entire trajectories are not available and trajectory segments must be used?- Not very clear to me.
-
"Bellman’s principle of optimality states that parts of optimal decision chains are also optimal decisions. Optimality, however, always refers to the entire decision chain. [...] Braking in front of a stop sign is only optimal as soon as the stop sign is considered within the
horizon
." -
"The key insight is that selected segments have to end in a
timestep
that is optimal, independent of the weights that are to be learned." -
"Assuming a non-negative cost definition, this motivates the choice of arbitrary trajectory segments ending in a timestep
T
such thatcT−d+1
...cT+d
(depending onxT−2d+1
...xT+2d
) are zero, i.e. optimal, independent ofθ
." -
"While this constraint limits the approach to cost functions that yield zero cost for some sections, it also yields the meaningful assumption that humans are not driven by a permanent dissatisfaction through their entire journey, but reach desirable states from time to time."
-
Miscellaneous: about
cost function
structures in related works:- Trajectory
features
can depend on:1-
A single trajectory only. They are based on ego-acceleration
,speed
andposition
for instance.2-
Trajectory ensembles. I.e. they describe quality of one trajectory with respect to the trajectories of other traffic participants. For instanceTTC
.
- As most approaches did not focus on crossings, the traffic rule features were not used by the related approaches.
- All approaches use a convenience term to prevent that being at a full stop is an optimal state with zero cost.
-
"In order to prevent that being at a full stop is beneficial, progress towards the target must be rewarded, that is,
costs
must be added in case of little progress. This can be done by considering the deviation from thedesired velocity
or thespeed limit
, or via the deviation from areference position
." -
"For stop signs, similarly, the deviation from a complete stop, i.e. the driven velocity at the stop line, can be used as a feature."
-
- All approaches incorporate both
smoothness
(longitudinal) andcurve comfort
(lateral). - The
lane geometry
is incorporated in the cost, unless it was already incorporated by using a predefined path.
-
What
feature
forinteraction
with others traffic participants?- Simply relative
speeds
andpositions
(gap
). - Most approaches assume that the future trajectory of others is known or provided by an upstream prediction module. The effect of the ego vehicle on the traffic participant can then be measured. For instance the induced cost, such as
deceleration
. -
"Other approaches do not rely on an upstream prediction, but incorporate the prediction of others into the planning by optimizing a global cost functional, which weights other traffic participants equally, or allows for more egoistic behavior based on a
cooperation factor
."
-
Some findings when applying
IRL
onINTERACTION
dataset on three scenarios:in-lane driving
,right turn
andstop
:1-
Among all scenarios, human drivers weightlongitudinal acceleration
higher thanlongitudinal jerks
.2-
The weight forlongitudinal
andlateral acceleration
are similar per scenario, such that neither seems to be preferred over the other. If implied by the scenario, as in the right turn, the weight decreases.3-
In theright turn
scenario, the weight of thelateral deviation
from the centerline is very large.-
"Rather than assuming that the
centerline
is especially important in turns, we hypothesize that a large weight ond-cl
is necessary to prefer turning over simply going straight, which would cause lessacceleration
cost."
-
-
"We found that the key
features
and human preferences differ largely, even in different single lane scenarios and disregarding interaction with other traffic participants."
"Modeling Human Driving Behavior through Generative Adversarial Imitation Learning"
Click to expand
Different variations of Generative Adversarial Imitation Learning (GAIL ) are used to model human drivers. These augmented GAIL -based models capture many desirable properties of both rule-based (IDM +MOBIL ) and machine learning (BC predicting single / multiple Gaussians) methods, while avoiding common pitfalls. Source. |
In Reward Augmented Imitation Learning (RAIL ), the imitation learning agent receives a second source of reward signals which is hard-coded to discourage undesirable driving behaviours. The reward can be either binary , receiving penalty when the collision actually occurs, or smoothed , via increasing penalties as it approaches an undesirable event. This should address the credit assignment problem in RL . Source. |
Authors: Bhattacharyya, R., Wulfe, B., Phillips, D., Kuefler, A., Morton, J., Senanayake, R., & Kochenderfer, M.
-
Related work:
- "Application of Imitation Learning to Modeling Driver Behavior in Generalized Environments", (Lange & Brannon, 2019), detailed in this page too.
-
Motivation: Derive realistic models of human drivers.
- Example of applications: populate surrounding vehicles with human-like behaviours in the simulation, to learn a driving policy.
-
Ingredients:
1-
Imitation learning instead ofRL
since thecost
function is unknown.2-
GAIL
instead ofapprenticeship learning
to not restrict the class ofcost
functions and avoid computationally expensiveRL
iterations.3-
Some variations ofGAIL
to deal with the specificities of driver modelling.
-
Challenges and solutions when modelling the driving task as a sequential decision-making problem (
MDP
formulation):1-
Continuousstate
andaction
spaces. And high dimensionality of thestate
representation.2-
Non-linearity in the desired mapping fromstates
toactions
.- For instance, large corrections in
steering
are applied to avoid collisions caused by small changes in the currentstate
. - Solution to
1-
+2-
: Neural nets.-
"The feedforward
MLP
is limited in its ability to adequately address partially observable environments. [...] By maintaining sufficient statistics of pastobservations
in memory, recurrent policies disambiguate perceptually similar states by acting with respect to histories of, rather than individualobservations
." GRU
layers are used: fewer parameters and still good performances.
-
3-
Stochasticity: humans may take differentactions
each time they encounter a given traffic scene.- Solution: Predicting a [Gaussian] distribution and sampling from it:
at
∼
πθ
(at
|st
).
4-
The underlyingcost
function is unknown. DirectRL
is not applicable.- Solution: Learning from demonstrations (imitation learning). E.g.
IRL
+RL
orBC
. -
"The goal is to infer this human policy from a dataset consisting of a sequence of (
state
,action
) tuples."
5-
Interaction between agents needs to be modelled, i.e. it is a multi-agent problem.- Solution:
GAIL
extension. A parameter-sharingGAIL
(PS-GAIL
) to tackle multi-agent driver modelling.
6-
GAIL
andPS-GAIL
are domain agnostic, making it difficult to encode specific knowledge relevant to driving in the learning process.- Solution:
GAIL
extension. Reward Augmented Imitation Learning (RAIL
).
7-
The human demonstrations dataset is a mixture of different driving styles. I.e. human demonstrations are dependent upon latent factors that may not be captured byGAIL
.- Solution:
GAIL
extension. [Burn-
]Information MaximizingGAIL
(Burn-InfoGAIL
) to disentangle the latent variability in demonstrations.
-
Issues with behavioural cloning (
BC
) (supervised version ofimitation learning
).-
"
BC
trains the policy on the distribution ofstates
encountered by the expert. During testing, however, the policy acts within the environment for long time horizons, and small errors in the learned policy or stochasticity in the environment can cause the agent to encounter a different distribution ofstates
from what it observed during training. This problem, referred to as covariate shift, generally results in the policy making increasingly large errors from which it cannot recover." -
"
BC
can be effective when a large number of demonstrations are available, but in many environments, it is not possible to obtain sufficient quantities of data." - Solutions to the covariate shift problem:
1-
Dataset Aggregation (DAgger
), assuming access to an expert.2-
Learn a replacement for thecost
function that generalizes to unobservedstates
.- Inverse reinforcement learning (
IRL
) andapprenticeship learning
. -
"The goal in apprenticeship learning is to find a policy that performs no worse than the expert under the true [unknown]
cost
function."
- Inverse reinforcement learning (
-
-
Issues with
apprenticeship learning
:- A class of cost functions is used.
1-
It is often defined as the span of a set of basis functions that must be defined manually (as opposed to learned from the observations).2-
This class may be restricting. I.e. no guarantee that the learning agent will perform no worse than the expert, and the agent can fail at imitating the expert.-
"There is no reason to assume that the
cost
function of the human drivers lies within a small function class. Instead, thecost
function could be quite complex, which makesGAIL
a suitable choice for driver modeling."
3-
It generally involves runningRL
repeatedly, hence large computational cost.
- A class of cost functions is used.
-
About Generative Adversarial Imitation Learning (
GAIL
):- Recommended video: This CS285 lecture of Sergey Levine.
- It is derived from an alternative approach to imitation learning called Maximum Causal Entropy
IRL
(MaxEntIRL
). -
"While
apprenticeship learning
attempts to find a policy that performs at least as well as the expert acrosscost
functions,MaxEntIRL
seeks acost
function for which the expert is uniquely optimal." -
"While existing
apprenticeship learning
formalisms used thecost
function as the descriptor of desirable behavior,GAIL
relies instead on the divergence between the demonstration occupancy distribution and the learning agent’s occupancy distribution." - Connections to
GAN
:- It performs binary classification of (
state
,action
) pairs drawn from the occupancy distributionsρπ
andρπE
. -
"Unlike
GANs
,GAIL
considers the environment as a black box, and thus the objective is not differentiable with respect to the parameters of the policy. Therefore, simultaneous gradient descent [forD
andG
] is not suitable for solving theGAIL
optimization objective." -
"Instead, optimization over the
GAIL
objective is performed by alternating between a gradient step to increase the objective function with respect to the discriminator parametersD
, and a Trust Region Policy Optimization (TRPO
) step (Schulman et al., 2015) to decrease the objective function with respect to the parametersθ
of the policyπθ
."
- It performs binary classification of (
-
Advantages of
GAIL
:1-
It removes the restriction that thecost
belongs to a highly limited class of functions.-
"Instead allowing it to be learned using expressive function approximators such as neural networks".
-
2-
It scales to largestate
/action
spaces to work for practical problems.TRPO
forGAIL
works with direct policy search as opposed to finding intermediate value functions.
-
"
GAIL
proposes a new cost function regularizer. This regularizer allows scaling to large state action spaces and removes the requirement to specify basis cost functions."
-
Three extensions of
GAIL
to account for the specificities of driver modelling.1-
Parameter-SharingGAIL
(PS-GAIL
).- Idea: account for the multi-agent nature of the problem resulting from the interaction between traffic participants.
- "We formulate multi-agent driving as a Markov game (Littman, 1994) consisting of
M
agents and an unknownreward
function." - It combines
GAIL
withPS-TRPO
. -
"
PS-GAIL
training procedure encourages stabler interactions between agents, thereby making them less likely to encounter extreme or unlikely driving situations."
2-
Reward Augmented Imitation Learning (RAIL
).- Idea: reward augmentation during training to provide domain knowledge.
- It helps to improve the
state
space exploration of the learning agent by discouraging badstates
such as those that could potentially lead to collisions. -
"These include penalties for
going off the road
,braking hard
, andcolliding
with other vehicles. All of these are undesirable driving behaviors and therefore should be discouraged in the learning agent." - Two kinds of penalties:
2.1-
Binary penalty.2.2-
Smoothed penalty.-
"We hypothesize that providing advanced warning to the imitation learning agent in the form of smaller, increasing penalties as the agent approaches an event threshold will address the credit assignment problem in
RL
." -
"For off-road driving, we linearly increase the penalty from
0
toR
when the vehicle is within0.5m
of the edge of the road. For hard braking, we linearly increase the penalty from0
toR/2
when the acceleration is between−2m/s2
and−3m/s2
."
-
-
"
PS-GAIL
andRAIL
policies are less likely to lead vehicles into collisions, extreme decelerations, and off-road driving." - It looks like now a combination of
cloning
andRL
now: the agent receivesrewards
for imitating theactions
and gets hard-codedrewards
/penalties
defined by the human developer.
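A sketch of such a smoothed penalty, using the thresholds quoted above (the ramp shapes and the scale `R` are otherwise assumptions):

```python
def smoothed_rail_penalty(dist_to_road_edge, acceleration, R=1.0):
    """Smoothed reward augmentation: ramp the off-road penalty from 0 to R within
    0.5 m of the road edge, and the hard-braking penalty from 0 to R/2 between
    -2 and -3 m/s^2 (values quoted from the paper; scaling is a sketch)."""
    penalty = 0.0
    if dist_to_road_edge < 0.5:
        penalty += R * (0.5 - max(dist_to_road_edge, 0.0)) / 0.5
    if acceleration < -2.0:
        penalty += (R / 2.0) * min((-2.0 - acceleration) / 1.0, 1.0)
    return -penalty      # added to the imitation (discriminator) reward signal
```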
3-
Information MaximizingGAIL
(InfoGAIL
).- Idea: assume that the expert policy is a mixture of experts.
-
[Different driving style are present in the dataset] "Aggressive drivers will demonstrate significantly different driving trajectories as compared to passive drivers, even for the same road geometry and traffic scenario. To uncover these latent factors of variation, and learn policies that produce trajectories corresponding to these latent factors,
InfoGAIL
was proposed." - To ensure that the learned policy utilizes the latent variable
z
as much as possible,InfoGAIL
tries to enforce highmutual information
betweenz
and thestate-action
pairs in the generated trajectory. - Extension:
Burn-InfoGAIL
.- Playback is used to initialize the ego vehicle: the "
burn-in
demonstration". -
"If the policy is initialized from a
state
sampled at the end of a demonstrator’s trajectory (as is the case when initializing the ego vehicle from a human playback), the driving policy’s actions should be consistent with the driver’s past behavior." -
"To address this issue of inconsistency with real driving behavior,
Burn-InfoGAIL
was introduced, where a policy must take over where an expert demonstration trajectory ends."
- When trained in a simulator, different parameterizations are possible, defining the style
z
of each car:Aggressive
: Highspeed
and largeacceleration
+ smallheadway distances
.Speeder
: same but largeheadway distances
.Passive
: Lowspeed
andacceleration
+ largeheadway distances
.Tailgating
: same but smallheadway distances
.
-
Experiments.
NGSIM
dataset:-
"The trajectories were smoothed using an extended Kalman filter on a bicycle model and projected to lanes using centerlines extracted from the
NGSIM
roadway geometry file."
-
- Metrics:
1-
Root Mean Square Error (RMSE
) metrics.2-
Metrics that quantify undesirable traffic phenomena:collisions
,hard-braking
, andoffroad driving
.
- Baselines:
BC
with single or mixture Gaussian regression.- Rule-based controller:
IDM
+MOBIL
.-
"A small amount of noise is added to both the lateral and longitudinal accelerations to make the controller nondeterministic."
-
- Simulation:
-
"The effectiveness of the resulting driving policy trained using
GAIL
in imitating human driving behavior is assessed by validation in rollouts conducted on the simulator."
-
- Some results:
-
[
GRU
helpsGAIL
, but notBC
] "Thus, we find that recurrence by itself is insufficient for addressing the detrimental effects that cascading errors can have onBC
policies." -
"Only
GAIL
-based policies (and of courseIDM
+MOBIL
) stay on the road for extended stretches."
-
-
Future work: How to refine the integration modelling?
-
"Explicitly modeling the interaction between agents in a centralized manner through the use of Graph Neural Networks."
-
"Deep Reinforcement Learning for Human-Like Driving Policies in Collision Avoidance Tasks of Self-Driving Cars"
-
[
2020
] [📝] [ 🎓University of the Negev
] -
[
data-driven reward
]
Click to expand
Note that the state variables are normalized in [0 , 1 ] or [-1 , 1 ] and that the previous actions are part of the state . Finally, both the previous and the current observations (only the current one for the scans ) are included in the state , in order to appreciate the temporal evolution. Source. |
Left: throttle and steering actions are not predicted as single scalars but rather as distributions. In this case a mixture of 3 Gaussian, each of them parametrized by a mean and a standard deviation . Weights are also learnt. This enable modelling multimodal distribution and offers better generalization capabilities. Right: the reward function is designed to make the agent imitate the expert driver's behaviour. Therefore the differences in term of mean speed and mean track position between the agent and expert driver are penalized. The mean speed and position of the expert driver is obtained from the learnt GP model. It also contains a non-learnable part: penalties for collision and action changes are independent of human driver observations. Source. |
Human speeds and lateral positions on the track are recorded and modelled using a GP regression. It is used to define the human-like behaviours in the reward function (instead of IRL ) as well as for comparison during test. Source. |
Authors: Emuna, R., Borowsky, A., & Biess, A.
- Motivations:
- Learn human-like behaviours via
RL
without traditionalIRL
. Imitation
should be considered in terms of `mean`
but also in terms of `variability`
.
- Learn human-like behaviours via
- Main idea: hybrid (
rule-based
anddata-driven
) reward shaping.- The idea is to build a model based on observation of human behaviours.
- In this case a Gaussian Process (
GP
) describes the distribution ofspeed
andlateral position
along a track.
- In this case a Gaussian Process (
- Deviations from these learnt parameters are then penalized in the
reward
function. - Two variants are defined:
1-
Thereward
function is fixed, using themeans
of the twoGPs
as reference `speeds`
andpositions
.2-
Thereward
function varies by sampling each time a trajectory from the learntGP
models and using its values as reference `speeds`
andpositions
.- The goal here is not only to imitate
mean
human behaviour but to recover also the variability in human driving.
- The goal here is not only to imitate
-
"Track
position
was recovered better thanspeed
and we concluded that the latter is related to an agent acting in a partially observable environment." - Note that the weights of feature in the
reward
function stay arbitrary (they are not learnt, contrary toIRL
).
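A minimal sketch of this hybrid reward idea, assuming a `GP` has been fitted on recorded human `speed` along the track (the `scikit-learn` usage, the toy data and the weights are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fit a GP on (track position -> human speed) demonstrations (toy data).
track_pos = np.linspace(0, 100, 50).reshape(-1, 1)
human_speed = 15 + 3 * np.sin(track_pos / 15).ravel()
gp_speed = GaussianProcessRegressor(kernel=RBF(length_scale=10.0)).fit(track_pos, human_speed)

def reward(agent_speed, agent_pos, collided, w_speed=0.1, w_col=10.0):
    # Variant 1: penalize deviation from the GP mean speed at this track position.
    ref_speed = gp_speed.predict(np.array([[agent_pos]]))[0]
    r = -w_speed * abs(agent_speed - ref_speed)
    # Non-learnable part: collision penalty, independent of the human observations.
    if collided:
        r -= w_col
    return r

# Variant 2 would instead sample a whole reference trajectory from the GP per episode:
ref_traj = gp_speed.sample_y(track_pos, n_samples=1)

print(reward(agent_speed=12.0, agent_pos=30.0, collided=False))
```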
- About the dynamic batch update.
-
"To improve exploration and avoid early termination, we used
reference state initialization
. We initialized thespeed
by sampling from a uniform distribution between30
to90km/h
. High variability in the policy at the beginning of training caused the agent to terminate after a few number of steps (30-40
). A full round of the track required about2000
steps. To improve learning we implemented a dynamic batch size that grows with the agent’s performance."
-
"Reinforcement Learning with Iterative Reasoning for Merging in Dense Traffic"
-
[
2020
] [📝] [ 🎓Stanford
] [ 🚗Honda
,Toyota
] -
[
curriculum learning
,level-k reasoning
]
Click to expand
Curriculum learning : the RL agent solves MDP s with iteratively increasing complexity. At each step of the curriculum, the behaviour of the cars in the environment is sampled from the previously learnt k-levels . Bottom left: 3 or 4 iterations seem to be enough and larger reasoning levels might not be needed for this merging task. Source. |
Authors: Bouton, M., Nakhaei, A., Isele, D., Fujimura, K., & Kochenderfer, M. J.
- Motivations:
1-
TrainingRL
agents more efficiently for complex traffic scenarios.- The goal is to avoid standard issues with
RL
:sparse rewards
,delayed rewards
, andgeneralization
. - Here the agent should merge in dense traffic, requiring interaction.
- The goal is to avoid standard issues with
2-
Cope with dense scenarios.-
"The lane change model
MOBIL
which is at the core of this rule-based policy has been designed for SPARSE traffic conditions [and performs poorly in comparison]."
-
3-
Learn a robust policy, able to deal with various behaviours.- Here learning is done iteratively, as the reasoning level increases, the learning agent is exposed to a larger variety of behaviours.
- Ingredients:
-
"Our training curriculum relies on the
level-k
cognitive hierarchy model from behavioral game theory".
-
- About
k-level
and game theory:-
"This model consists in assuming that an agent performs a limited number of iterations of strategic reasoning: (“I think that you think that I think”)."
- A level-
k
agent acts optimally against the strategy of a level-(k-1)
agent. - The level-
0
is not learnt but uses anIDM
+MOBIL
hand-engineered rule-based policy.
-
- About curriculum learning:
- The idea is to iteratively increase the complexity of the problem. Here increase the diversity and the optimality of the surrounding cars.
- Each cognitive level is trained in a
RL
environment populated with vehicles of any lower cognitive level.-
"We then train a level-
3
agent by populating the top lane with level-0
and level-2
agents and the bottom lane with level-0
or level-1
agents." -
"Note that a level-
1
policy corresponds to a standardRL
procedure [no further iteration]."
-
- Each policy is learnt with
DQN
.- To accelerate training at each time step, the authors re-use the weights from the previous iteration to start training.
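A schematic version of this curriculum (level-`0` rule-based, each new level trained against vehicles sampled from all lower levels, weights warm-started); the environment and `DQN` interfaces are placeholders, not the authors' code:

```python
import random

def train_dqn(env, init_weights=None):
    """Placeholder for a standard DQN run; returns a policy (here just a summary dict)."""
    return {"trained_against": env["other_levels"], "warm_start": init_weights is not None}

# Level-0 is the hand-engineered IDM+MOBIL rule-based policy, not learnt.
policies = {0: "IDM+MOBIL"}
weights = None

for k in range(1, 5):                                  # iteratively increasing reasoning level
    # Surrounding vehicles are sampled from any previously obtained (lower) level.
    other_levels = [random.choice(range(k)) for _ in range(8)]
    policies[k] = train_dqn({"other_levels": other_levels}, init_weights=weights)
    weights = policies[k]                              # warm-start for the next curriculum step

print(policies[3])
```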
MDP
formulation.- Actually, two policies are learnt:
- Policies
1
,3
, and5
: change-lane agents. - Policies
2
and4
: keep-lane agents.
- Policies
action
-
"The learned policy is intended to be high level. At deployment, we expect the agent to decide on a desired speed (
0 m/s
,3 m/s
,5 m/s
) and a lane change command while a lower lever controller, operating at higher frequency, is responsible for executing the motion and triggering emergency braking system if needed." - Simulation runs at
10Hz
but the agent takes an action every five simulation steps:0.5 s
between two actions. - The authors chose high-level
action
s and to rely onIDM
:-
"By using the
IDM
rule to compute the acceleration, the behavior of braking if there is a car in front will not have to be learned." -
"The longitudinal action space is safe by design. This can be thought of as a form of shield to the
RL
agent from taking unsafe actions."- Well, all learnt agent exhibit at least
2%
collision rate ??
- Well, all learnt agent exhibit at least
-
-
state
- Relative
pose
andspeed
of the8
closest surrounding vehicles. - Full observability is assumed.
-
"Measurement uncertainty can be handled online (after training) using the
QMDP
approximation technique".
-
- Relative
reward
- Penalty for collisions:
−1
. - Penalty for deviating from a desired velocity:
−0.001|v-ego − v-desired|
. - Reward for being in the top lane:
+0.01
for the merging-agent and0
for the keep-lane agent. - Reward for success (passing the blocked vehicle):
+1
.
- Penalty for collisions:
- Actually, two policies are learnt:
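Putting the listed `reward` terms together, a minimal sketch (signature and variable names are mine; the numerical values are those reported above):

```python
def merging_reward(collided, v_ego, v_desired, in_top_lane, passed_blocker, is_merging_agent):
    r = 0.0
    if collided:
        r -= 1.0                                   # collision penalty
    r -= 0.001 * abs(v_ego - v_desired)            # deviation from the desired velocity
    if in_top_lane and is_merging_agent:
        r += 0.01                                  # only the merging agent is rewarded for the top lane
    if passed_blocker:
        r += 1.0                                   # success: passing the blocked vehicle
    return r

print(merging_reward(False, v_ego=3.0, v_desired=5.0, in_top_lane=True,
                     passed_blocker=False, is_merging_agent=True))
```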
"Using Counterfactual Reasoning and Reinforcement Learning for Decision-Making in Autonomous Driving"
-
[
2020
] [📝] [ 🎓Technische Universität München
] [ 🚗fortiss
] [] -
[
counterfactual reasoning
]
Click to expand
The idea is to first train the agent interacting with different driver models. This should lead to a more robust policy. During inference the possible outcomes are first evaluated. If too many predictions result in collisions, a non-learnt controller takes over. Otherwise, the learnt policy is executed. Source. |
Authors: Hart, P., & Knoll, A.
-
Motivations:
- Cope with the behavioural uncertainties of other traffic participants.
-
The idea is to perform predictions considering multiple interacting driver models.
1-
Duringtraining
: expose multiple behaviour models.- The parametrized model
IDM
is used to describe more passive or aggressive drivers. - Model-free
RL
is used. The diversity of driver models should improve the robustness.
- The parametrized model
2-
Duringapplication
: at each step, the learned policy is first evaluated before being executed.- The evolution of the present scene is simulated using the different driver models.
- The outcomes are then aggregated:
1-
Collision rate.2-
Success rate (reaching the destination).- Based on these risk and performance metrics, the policy is applied or not.
- If the collision rate is too high, then the ego vehicle stays on its current lane, controlled by
IDM
. -
"Choosing the thresholds is nontrivial as this could lead to too passive or risky behaviors."
- It could be seen as some prediction-based
action masking
. (A minimal sketch of this runtime gate is given below.) - These multi-modal predictions also make me think of the roll-out phase in tree searches.
- Besides, it reminds me of the concept of
concurrent MDP
, where the agent tries to infer in whichMDP
(parametrized) it has been placed.
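A minimal sketch of the `application`-time gate described above, under my own reading of the paper (driver-model pool, threshold value and all names are illustrative):

```python
def evaluate_then_act(state, learned_policy, idm_fallback, driver_models,
                      simulate, collision_threshold=0.2):
    """Roll the current scene out under each behaviour model before trusting the learnt policy."""
    outcomes = [simulate(state, learned_policy, model) for model in driver_models]
    collision_rate = sum(o["collision"] for o in outcomes) / len(outcomes)
    success_rate = sum(o["reached_goal"] for o in outcomes) / len(outcomes)
    if collision_rate > collision_threshold:
        return idm_fallback(state)        # stay on the current lane, IDM-controlled
    return learned_policy(state)

# Toy usage with stubbed components:
decision = evaluate_then_act(
    state={},
    learned_policy=lambda s: "change_lane",
    idm_fallback=lambda s: "keep_lane",
    driver_models=["passive", "aggressive"],
    simulate=lambda s, pi, m: {"collision": m == "aggressive", "reached_goal": m == "passive"},
)
print(decision)
```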
-
Not clear to me:
- Why not doing planning if you explicitly know the
transition
(IDM
) and thereward
models? It would substantially increase the sampling efficiency.
- Why not doing planning if you explicitly know the
-
About the simulator:
-
About "counterfactual reasoning":
- From wikipedia: "Counterfactual thinking is, as it states: 'counter to the facts'. These thoughts consist of the 'What if?' ..."
-
"We use causal counterfactual reasoning: [...] sampling behaviors from a model pool for other traffic participants can be seen as assigning nonactual behaviors to other traffic participants.
"Modeling pedestrian-cyclist interactions in shared space using inverse reinforcement learning"
- [
2020
] [📝] [ 🎓University of British Columbia, Vancouver
] - [
max-entropy
,feature matching
]
Click to expand
Left: The contribution of each feature in the linear reward model differs between the Maximum Entropy (ME ) and the Feature Matching (FM ) algorithms. The FM algorithm is inconsistent across levels and has a higher intercept to parameter weight ratio compared with the estimated weights using the ME . Besides, why does it penalize all lateral distances and all speeds in these overtaking scenarios? Right: good idea how to visualize reward function for state of dimension 5 . Source. |
Authors: Alsaleh, R., & Sayed, T.
- In short: A simple but good illustration of
IRL
concepts using Maximum Entropy (ME
) and Feature Matching (FM
) algorithms.- It reminds me some experiments I talk about in this video: "From RL to Inverse Reinforcement Learning: Intuitions, Concepts + Applications to Autonomous Driving".
- Motivations, here:
1-
Work in non-motorized shared spaces, in this case a cyclist-pedestrian zone.- It means high degrees of freedom in motions for all participants.
- And offers complex road-user interactions (behaviours different than on conventional streets).
2-
Model the behaviour of cyclists in this share space using agent-based modelling.agent-based
as opposed to physics-based prediction models such associal force model
(SFM
) orcellular automata
(CA
).- The agent is trying to maximize an unknown
reward
function. - The recovery of that reward function is the core of the paper.
- First,
2
interaction types are considered:- The cyclist
following
the pedestrian. - The cyclist
overtaking
the pedestrian. - This distinction avoids the search for a 1-size-fits-all model.
- The cyclist
- About the
MDP
:- The cyclist is the
agent
. state
(absolute for the cyclist or relative compared to the pedestrian):longitudinal distance
lateral distance
angle difference
speed difference
cyclist speed
state
discretization:-
"Discretized for each interaction type by dividing each state feature into
6
levels based on equal frequency observation in each level." - This non-constant bin-width partially addresses the imbalanced dataset.
6^5
=7776
states.
-
action
:acceleration
yaw rate
action
discretization:-
"Dividing the acceleration into five levels based on equal frequency observation in each level."
5^2
=25
actions.
-
discount factor
:-
"A discount factor of
0.975
is used assuming10%
effect of the reward at a state3 sec
later (90
time steps) from the current state."
-
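A minimal sketch of the equal-frequency (quantile) discretization described above, plus a sanity check of the discount factor; data and names are toy assumptions:

```python
import numpy as np

def equal_frequency_bins(values, n_levels=6):
    """Bin edges such that each level contains roughly the same number of observations."""
    return np.quantile(values, np.linspace(0, 1, n_levels + 1)[1:-1])

rng = np.random.default_rng(0)
longitudinal_distance = rng.exponential(scale=5.0, size=1000)   # imbalanced feature (toy data)
edges = equal_frequency_bins(longitudinal_distance, n_levels=6)
levels = np.digitize(longitudinal_distance, edges)               # values in {0, ..., 5}

print(np.bincount(levels))       # roughly equal counts per level
print(6 ** 5, 5 ** 2)            # 7776 states, 25 actions
print(0.975 ** 90)               # ~0.10: 10% effect after 3 s, i.e. 90 time steps
```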
- About the dataset.
- Videos of two streets in Vancouver, for a total of
39
hours. 228
cyclist and276
pedestrian trajectories are extracted.
- Videos of two streets in Vancouver, for a total of
IRL
.- The two methods assume that the
reward
is a linear combination offeatures
. Herefeatures
arestate
components. 1-
Feature Matching (FM
).- It matches the feature counts of the expert trajectories.
- The authors do not details the
max-margin
part of the algorithm.
2-
Maximum Entropy (ME
).- It estimates the
reward
function parameters by maximizing the likelihood of the expert demonstrations under the maximum entropy distribution. - Being probabilistic, it can account for non-optimal observed behaviours.
- It estimates the
- The two methods assume that the
- The recovered reward model can be used for prediction. - How to measure the similarity between two trajectories?
1-
Mean Absolute Error
(MAE
).- It compares elements of same indices in the two sequences.
2-
Hausdorff Distance
.-
"It computes the largest distance between the simulated and the true trajectories while ignoring the time step alignment".
-
- Current limitations:
1-to-1
interactions, i.e. a single pedestrian/cyclist pair.- Low-density scenarios.
-
"[in future works] neighbor condition (i.e. other pedestrians and cyclists) and shared space density can be explicitly considered in the model."
"Accelerated Inverse Reinforcement Learning with Randomly Pre-sampled Policies for Autonomous Driving Reward Design"
- [
2019
] [📝] [ 🎓UC Berkeley
,Tsinghua University, Beijin
] - [
max-entropy
]
Click to expand
Instead of the costly RL optimisation step at each iteration of the vanilla IRL , the idea is to randomly sample a massive of policies in advance and then to pick one of them as the optimal policy. In case the sampled policy set does not contain the optimal policy, exploration of policy is introduced as well for supplement. Source. |
The approximation used in Kuderer et al. (2015) is applied here to compute the second term of gradient about the expected feature values. Source. |
Authors: Xin, L., Li, S. E., Wang, P., Cao, W., Nie, B., Chan, C., & Cheng, B.
- Reminder: Goal of
IRL
= Recover the reward function of an expert from demonstrations (here trajectories). - Motivations, here:
1-
Improve the efficiency of "weights updating" in the iterative routine ofIRL
- More precisely: generating the optimal policy using
model-free RL
suffers from low sampling efficiency and should therefore be avoided. - Hence the term "accelerated"
IRL
.
- More precisely: generating optimal policy using
2-
Embed human knowledge by restricting the search space (policy space).
- One idea: "Pre-designed policy subspace".
-
"An intuitive idea is to randomly sample a massive of policies in advance and then to pick one of them as the optimal policy instead of finding it via
RL
optimisation."
-
- How to construct the policies sub-space?
- Human knowledge about vehicle controllers is used.
- Parametrized linear controllers are implemented:
acc
=K1
∆d
+K2
∆v
+K3
*∆a
, where∆
are relative to the leading vehicle.- By sampling tuples of <
K1
,K2
,K3
> coefficients,1 million
(candidates) policies are generated to form the sub-space.
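A minimal sketch of how such a pre-sampled subspace of linear car-following controllers could be built; the sampling ranges for `K1`, `K2`, `K3` are assumptions, only the controller form comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
# Each candidate policy is a linear car-following controller acc = K1*dd + K2*dv + K3*da.
K = np.column_stack([
    rng.uniform(0.0, 1.0, N),     # K1: gain on the relative distance (assumed range)
    rng.uniform(0.0, 2.0, N),     # K2: gain on the relative speed (assumed range)
    rng.uniform(0.0, 0.5, N),     # K3: gain on the relative acceleration (assumed range)
])

def acceleration(policy_idx, delta_d, delta_v, delta_a):
    k1, k2, k3 = K[policy_idx]
    return k1 * delta_d + k2 * delta_v + k3 * delta_a

print(acceleration(0, delta_d=2.0, delta_v=-1.0, delta_a=0.0))
```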
- Section about
Max-Entropy IRL
(btw. very well explained, as for the section introducingIRL
):-
"Ziebart et al. (2008) employed the principle of
maximum entropy
to resolve ambiguities in choosing trajectory distributions. This principle maximizes the uncertainty and leads to the distribution over behaviors constrained to matching feature expectations, while being no more committed to any particular trajectory than this constraint requires". -
"Maximizing the entropy of the distribution over trajectories subject to the feature constraints from expert’s trajectories implies to maximize the likelihood under the maximum entropy (exponential family) distributions. The problem is convex for
MDPs
and the optima can be obtained using gradient-based optimization methods". -
"The gradient [of the Lagrangian] is the difference between empirical feature expectations and the learners expected feature expectations."
-
- How to compute the second term of this gradient?
- It implies integrating over all possible trajectories, which is infeasible.
- As Kuderer et al. (2015), one can compute the feature values of the most likely trajectory as an approximation of the feature expectation.
-
"With this approximation, only the optimal trajectory associated to the optimal policy is needed, in contrast to regarding the generated trajectories as a probability distribution."
- About the features.
- As noted in my experiments about
IRL
, they serve two purposes (infeature-matching
-basedIRL
methods):1-
In the reward function: they should represent "things we want" and "things we do not want".2-
In the feature-match: to compare two policies based on their sampled trajectories, they should capture relevant properties of driving behaviours.
- Three features for this longitudinal acceleration task:
front-veh time headway
.long. acc
.deviation to speed limit
.
- As noted in my experiments about
- Who was the expert?
- Expert followed a modified linear car-following (
MLCF
) model.
- Expert followed a modified linear car-following (
- Results.
- Iterations are stopped after
11
loops. - It would have been interesting for comparison to test a "classic"
IRL
method whereRL
optimizations are applied.
- Iterations are stopped after
"Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles"
Click to expand
Both behavioural planner and trajectory optimizer share the same cost function, whose weigth parameters are learnt from demonstration. Source. |
Authors: Sadat, A., Ren, M., Pokrovsky, A., Lin, Y., Yumer, E., & Urtasun, R.
- Main motivation:
- Design a decision module where both the behavioural planner and the trajectory optimizer share the same objective (i.e. cost function).
- Therefore "joint".
-
"[In approaches not-joint approaches] the final trajectory outputted by the trajectory planner might differ significantly from the one generated by the behavior planner, as they do not share the same objective".
- Requirements:
1-
Avoid time-consuming, error-prone, and iterative hand-tuning of cost parameters.- E.g. Learning-based approaches (
BC
).
- E.g. Learning-based approaches (
2-
Offer interpretability about the costs jointly imposed on these modules.- E.g. Traditional modular
2
-stage approaches.
- E.g. Traditional modular
- About the structure:
- The driving scene is described in
W
(desired route
,ego-state
,map
, anddetected objects
). ProbablyW
for "World"? - The behavioural planner (
BP
) decides two things based onW
:1-
A high-level behaviourb
.- The path to converge to, based on one chosen manoeuvre:
keep-lane
,left-lane-change
, orright-lane-change
. - The
left
andright
lane boundaries. - The obstacle
side assignment
: whether an obstacle should stay in thefront
,back
,left
, orright
to the ego-car.
- The path to converge to, based on one chosen manoeuvre:
2-
A coarse-level trajectoryτ
.- The loss has also a regularization term.
- This decision is "simply" the
argmin
of the shared cost-function, obtained by sampling + selecting the best.
- The "trajectory optimizer" refines
τ
based on the constraints imposed byb
.- E.g. an overlap cost will be incurred if the
side assignment
ofb
is violated.
- E.g. an overlap cost will be incurred if the
- A cost function parametrized by
w
assesses the quality of the selected <b
,τ
> pair:cost
=w^T
.sub-costs-vec
(τ
,b
,W
).- Sub-costs relate to safety, comfort, feasibility, mission completion, and traffic rules.
- The driving scene is described in
- Why "learnable"?
- Because the weight vector
w
that captures the importance of each sub-cost is learnt based on human demonstrations.-
"Our planner can be trained jointly end-to-end without requiring manual tuning of the costs functions".
-
- There are two losses for that objective:
1-
Imitation loss (withMSE
).- It applies on the <
b
,τ
> produced by theBP
.
- It applies on the <
2-
Max-margin loss to penalize trajectories that have small cost and are different from the human driving trajectory.- It applies on the <
τ
> produced by the trajectory optimizer. -
"This encourages the human driving trajectory to have smaller cost than other trajectories".
- It reminds me of the
max-margin
method inIRL
where the weights of the reward function should make the expert demonstration better than any other policy candidate.
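A minimal sketch of the two losses acting on a shared linear cost `w^T`.`sub-costs`; the hinge form, the toy sub-costs and the trade-off `alpha` are my own simplifications, not the paper's exact formulation:

```python
import numpy as np

def joint_planner_loss(w, f_human, f_candidates, tau_human, tau_bp, margin=1.0, alpha=1.0):
    """Imitation (MSE) term on the behavioural-planner trajectory plus a max-margin term
    encouraging the human trajectory to have the smallest cost w . f."""
    imitation = np.mean((tau_bp - tau_human) ** 2)
    cost_human = f_human @ w
    cost_candidates = f_candidates @ w
    max_margin = np.maximum(0.0, cost_human - cost_candidates + margin).sum()
    return imitation + alpha * max_margin

w = np.array([0.5, 1.0, 0.2])                            # sub-cost weights to be learnt
f_human = np.array([0.1, 0.2, 0.0])                      # sub-costs of the human trajectory
f_candidates = np.random.default_rng(0).random((8, 3))   # sub-costs of sampled trajectories
tau_human = np.linspace(0, 1, 10)
tau_bp = tau_human + 0.05
print(joint_planner_loss(w, f_human, f_candidates, tau_human, tau_bp))
```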
"Adversarial Inverse Reinforcement Learning for Decision Making in Autonomous Driving"
- [
2019
] [📝] [ 🎓UC Berkeley, Chalmers University, Peking University
] [ 🚗Zenuity
] - [
GAIL
,AIRL
,action-masking
,augmented reward function
]
Click to expand
Author: Wang, P., Liu, D., Chen, J., & Chan, C.-Y.
In Adversarial IRL (AIRL ), the discriminator tries to distinguish learnt actions from demonstrated expert actions. Action-masking is applied, removing some action combinations that are not preferable, in order to reduce the unnecessary exploration. Finally, the reward function of the discriminator is extended with some manually-designed semantic reward to help the agent successfully complete the lane change and not to collide with other objects. Source. |
- One related concept (detailed further on this page): Generative Adversarial Imitation Learning (
GAIL
).- An imitation learning method where the goal is to learn a policy against a discriminator that tries to distinguish learnt actions from expert actions.
- Another concept used here: Guided Cost Learning (
GCL
).- A
Max-Entropy
IRL
method that makes use of importance sampling (IS
) to approximate the partition function (the term in the gradient of the log-likelihood function that is hard to compute since it involves an integral of over all possible trajectories).
- A
- One concept introduced: Adversarial Inverse Reinforcement Learning (
AIRL
).- It combines
GAIL
withGCL
formulation.-
"It uses a special form of the discriminator different from that used in
GAIL
, and recovers a cost function and a policy simultaneously as that inGCL
but in an adversarial way."
-
- Another difference is the use of a model-free
RL
method to compute the new optimal policy, instead of model-basedguided policy search
(GPS
) used inGCL
:-
"As the dynamic driving environment is too complicated to learn for the driving task, we instead use a model-free policy optimization method."
-
- One motivation of
AIRL
is therefore to cope with changes in the dynamics of the environment and make the learnt policy more robust to system noise.
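For reference, a schematic version of the special discriminator form used in `AIRL` (following Fu et al., 2018, on which this work builds); plain `numpy`, scalar inputs for readability:

```python
import numpy as np

def airl_discriminator(f_value, pi_prob):
    """D(s,a) = exp(f(s,a)) / (exp(f(s,a)) + pi(a|s)): a learned reward term f coupled
    with the likelihood of the current policy."""
    return np.exp(f_value) / (np.exp(f_value) + pi_prob)

def recovered_reward(f_value, pi_prob):
    """The reward passed to the policy update, log D - log(1 - D), reduces to f - log pi."""
    d = airl_discriminator(f_value, pi_prob)
    return np.log(d) - np.log(1.0 - d)

print(recovered_reward(f_value=0.7, pi_prob=0.2))   # equals 0.7 - log(0.2)
```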
- One idea: Augment the learned reward with some "semantic reward" term to improve learning efficiency.
- The motivation is to manually embed some domain knowledge, in the generator reward function.
-
"This should provide the agent some informative guidance and assist it to learn fast."
- About the task:
-
"The task of our focus includes a
longitudinal
decision – the selection of a target gap - and alateral
decision – whether to commit the lane change right now." - It is a rather "high-level" decision:
- A low-level controller, consisting of a
PID
for lateral control andsliding-mode
for longitudinal control, is the use to execute the decision.
- A low-level controller, consisting of a
- The authors use some
action-masking
technics where only valid action pairs are allowed to reduce the agent’s unnecessary exploration.
-
"Predicting vehicle trajectories with inverse reinforcement learning"
- [
2019
] [📝] [ 🎓KTH
] - [
max-margin
]
Click to expand
Author: Hjaltason, B.
About the features: The φ are distances read from the origin of a vision field and are represented by red dotted lines. They take value in [0 , 1 ], where φi = 1 means the dotted line does not hit any object and φi = 0 means it hits an object at origin. In this case, two objects are inside the front vision field. Hence φ1 = 0.4 and φ2 = 0.6 . Source. |
Example of max-margin IRL . Source. |
- A good example of max-margin
IRL
:-
"There are two classes: The expert behaviour from data gets a label of
1
, and the "learnt" behaviours a label of-1
. The framework performs amax-margin
optimization step to maximise the difference between both classes. The result is an orthogonal vectorwi
from the max margin hyperplane, orthogonal to the estimated expert feature vectorµ(πE)
". - From this new
R=w*f
, an optimal policy is derived usingDDPG
. - Rollouts are performed to get an estimated feature vector that is added to the set of "learnt" behaviours.
- The process is repeated until convergence (when the estimated values
w*µ(π)
are close enough).
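A schematic version of that iterative loop (the `max-margin` step is written here with a linear `SVM`; the `DDPG` optimization and the rollout-based feature estimation are stubbed):

```python
import numpy as np
from sklearn.svm import SVC

def estimate_feature_expectation(seed, n_rollouts=10):
    """Stub: roll the current policy out and average its discounted feature vectors."""
    return np.random.default_rng(seed).normal(size=3)

mu_expert = np.array([1.0, 0.5, -0.2])          # estimated from the demonstrations (toy values)
mu_learnt = [estimate_feature_expectation(seed=0)]

for it in range(10):
    # Max-margin step: the expert feature vector gets label +1, the "learnt" ones -1.
    X = np.vstack([mu_expert] + mu_learnt)
    y = np.array([1] + [-1] * len(mu_learnt))
    w = SVC(kernel="linear", C=1e3).fit(X, y).coef_.ravel()   # reward weights, R = w . f
    # Here an optimal policy for R = w.f would be trained (DDPG in the thesis) and rolled out.
    mu_new = estimate_feature_expectation(seed=it + 1)
    if abs(w @ mu_expert - w @ mu_new) < 1e-2:                # stop when the w.mu values are close
        break
    mu_learnt.append(mu_new)

print(w)
```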
-
- Note about the reward function:
- Here, r(
s, a, s'
) is also function of the action and the next state. - Here a post about different forms of reward functions.
- Here, r(
"A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress"
- [
2019
] [📝] [ 🎓University of Georgia
] - [
reward engineering
]
Click to expand
Authors: Arora, S., & Doshi, P.
Trying to generalize and classify IRL methods. Source. |
I learnt about state visitation frequency: ψ (π )(s ) and the feature count expectation: µ (π )(φ ). Source. |
- This large review does not focus on
AD
applications, but it provides a good picture ofIRL
and can give ideas. Here are my take-aways. - Definition:
-
"Inverse reinforcement learning (
IRL
) is the problem of modeling the preferences of another agent using its observed behavior [hence class ofIL
], thereby avoiding a manual specification of its reward function."
-
- Potential
AD
applications ofIRL
:- Decision-making: If I find your underlying reward function, and I consider you as an expert, I can imitate you.
- Prediction: If I find your underlying reward function, I can imagine what you are going to do
- I start rethinking
Imitation Learning
. The goal ofIL
is to derive a policy based on some (expert) demonstrations.- Two branches emerge, depending on what structure is used to model the expert behaviour. Where is that model captured?
1-
In a policy.- This is a "direct approach". It includes
BC
and its variants. - The task is to learn that
state
->
action
mapping.
- This is a "direct approach". It includes
2-
In a reward function.- Core assumption: Each driver has an internal reward function and acts optimally w.r.t. it.
- The main task it to learn that reward function (
IRL
), which captures the expert's preferences. - The second step consists in deriving the optimal policy for this derived reward function.
As Ng and Russell put it: "The
reward function
, rather than thepolicy
, is the most succinct, robust, and transferable definition of the task"
- What happens if some states are missing in the demonstration?
1-
Direct methods will not know what to do. And will try to interpolate from similar states. This could be risky. (c.f.distributional shift
problem andDAgger
).-
"If a policy is used to describe a task, it will be less succinct since for each state we have to give a description of what the behaviour should look like". From this post
-
2-
IRL
methods acts optimally w.r.t. the underlying reward function, which could be better, since it is more robust.- This is particularly useful if we have an expert policy that is only approximately optimal.
- In other words, a policy that is better than the "expert" can be derived, while having very little exploration. This "minimal exploration" property is useful for tasks such as
AD
. - This is sometimes refers to as
Apprenticeship learning
.
- Two branches emerge, depending on what structure is used to model the expert behaviour. Where is that model captured?
- One new concept I learnt:
State-visitation frequency
(it reminds me some concepts of Markov chains).- Take a policy
π
. Let the agent run with it. Count how often it sees each state. This is called the `state-visitation frequency`
(note it is for a specificπ
). - Two ideas from there:
- Iterating until this
state-visitation frequency
stops changing yields theconverged frequency function
. - Multiplying that
converged state-visitation frequency
withreward
gives another perspective to thevalue function
.- The
value function
can now be seen as a linear combination of the expected feature countµ
(φk
)(π
) (also calledsuccessor feature
).
- The
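A minimal illustration of these two quantities on a toy Markov chain (transition matrix, features, start distribution and reward weights are all made up):

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],      # state-transition matrix under a fixed policy pi
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
phi = np.array([[1.0, 0.0],         # one feature vector phi(s) per state
                [0.0, 1.0],
                [0.5, 0.5]])
reward_weights = np.array([1.0, -0.5])
gamma, d = 0.95, np.array([1.0, 0.0, 0.0])    # discount and start distribution

# Discounted state-visitation frequencies: psi = sum_t gamma^t * d P^t.
psi = np.linalg.solve(np.eye(3) - gamma * P.T, d)
mu = psi @ phi                       # expected (discounted) feature counts mu(pi)
value = mu @ reward_weights          # the value of pi as a linear function of mu

print(psi, mu, value)
```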
- One common assumption:
-> "The solution is a weighted linear combination of a set of reward features".
- This greatly reduces the search space.
-
"It allowed the use of feature expectations as a sufficient statistic for representing the value of trajectories or the value of an expert’s policy."
- Known
IRL
issues (and solutions):1-
This is an under-constrained learning problem.-
"Many reward functions could explain the observations".
- Among them, they are highly "degenerate" functions with all reward values zero.
- One solution is to impose constraints in the optimization.
- For instance try to maximize the sum of "value-margins", i.e. the difference between the value functions of the best and the second-best actions.
-
"
mmp
makes the solution policy have state-action visitations that align with those in the expert’s demonstration." -
"
maxent
distributes probability mass based on entropy but under the constraint of feature expectation matching."
- Another common constraint is to encourage the reward function to be as simple as possible, similar to
L1
regularization in supervised learning.
-
2-
Two incomplete models:2.1-
How to deal with incomplete/absent model of transition probabilities?2.2-
How to select the reward features?-
"[One could] use neural networks as function approximators that avoid the cumbersome hand-engineering of appropriate reward features".
-
-
"These extensions share similarity with
model-free RL
where thetransition
model andreward function
features are also unknown".
3-
How to deal with noisy demonstrations?- Most approaches assume a Gaussian noise and therefore apply Gaussian filters.
- How to classify
IRL
methods?- It can be useful to ask yourself two questions:
1-
What are the parameters of the HypothesisR
function`?- Most approaches use the "linear approximation" and try to estimate the weights of the linear combination of features.
2-
What for "Divergence Metric", i.e. how to evaluate the discrepancy to the expert demonstrations?-
"[it boils down to] a search in reward function space that terminates when the behavior derived from the current solution aligns with the observed behavior."
- How to measure the closeness or the similarity to the expert?
1-
Compare the policies (i.e. the behaviour).- E.g. how many <
state
,action
> pairs are matching? -
"A difference between the two policies in just one state could still have a significant impact."
- E.g. how many <
2-
Compare the value functions (they are defined over all states).- The authors mention the
inverse learning error
(ILE
) =||
V
(expert policy
)-
V
(learnt policy
)||
and thevalue loss
(use as a margin).
- The authors mention the
-
- Classification:
Margin-based
optimization: Learn a reward function that explains the demonstrated policy better than alternative policies by amargin
(addressIRL
's "solution ambiguity").- The intuition here is that we want a reward function that clearly distinguishes the optimal policy from other possible policies.
Entropy-based
optimization: Apply the "maximum entropy principle" (together with the "feature expectations matching" constraint) to obtain a distribution over potential reward functions.Bayesian
inference to deriveP
(^R
|demonstration
- What about the likelihood
P
(<s
,a
> |ˆR
)? This probability is proportional to the exponentiated value function:exp
(Q
[s
,a
]).
- What for the likelihood
Regression
andclassification
.
- It can be useful to ask yourself two questions:
"Learning Reward Functions for Optimal Highway Merging"
Click to expand
Author: Weiss, E.
The assumption-free reward function that uses a simple polynomial form based on state and action values at each time step does better at minimizing both safety and mobility objectives, even though it does not incorporate human knowledge of typical reward function structures. About Pareto optimum: at these points, it becomes impossible to improve in the minimization of one objective without worsening our minimization of the other objective). Source. |
- What?
- A Project from the Stanford course (AA228/CS238 - Decision Making under Uncertainty). Examples of student projects can be found here.
- My main takeaway:
- A simple problem that illustrates the need for (learning more about)
IRL
.
- A simple problem that illustrates the need for (learning more about)
- The merging task is formulated as a simple
MDP
:- The state space has size
3
and is discretized:lat
+long
ego position andlong
position of the other car. - The other vehicle transitions stochastically (
T
) according to three simple behavioural models:fast
,slow
,average speed
driving. - The main contribution concerns the reward design: how to shape the reward function for this multi-objective (trade-off
safety
/efficiency
) optimization problem?
- The state space has size
- Two reward functions (
R
) are compared:-
1-
"The first formulation models rewards based on our prior knowledge of how we would expect autonomous vehicles to operate, directly encoding human values such assafety
andmobility
into this problem as a positive reward for merging, a penalty formerging close
to the other vehicle, and a penalty forstaying
in the on-ramp." -
2-
"The second reward function formulation assumes no prior knowledge of human values and instead comprises a simple degree-one polynomial expression for the components of thestate
and theaction
."- The parameters are tuned using a sort of grid search (no proper
IRL
).
- The parameters are tuned using a sort of grid search (no proper
-
- How to compare them?
- Since both
T
andR
are known, a planning (as opposed to learning) algorithm can be used to find the optimal policy. Herevalue iteration
is implemented. - The resulting agents are then evaluated based on two conflicting objectives:
-
"Minimizing the distance along the road at which point merging occurs and maximizing the
gap
between the two vehicles when merging."
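A minimal `value iteration` sketch for such a small discretized `MDP` with known `T` and `R` (the arrays below are random placeholders, not the project's models):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 60, 3, 0.95
# T[s, a, s'] : known transition model; R[s, a] : known reward model (placeholders).
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(500):
    Q = R + gamma * T @ V            # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

policy = Q.argmax(axis=1)
print(policy[:10])
```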
-
- Next step will be proper
IRL
:-
"We can therefore conclude that there may exist better reward functions for capturing optimal driving policies than either the intuitive prior knowledge reward function or the polynomial reward function, which doesn’t incorporate any human understanding of costs associated with
safety
andefficiency
."
-
"Game-theoretic Modeling of Traffic in Unsignalized Intersection Network for Autonomous Vehicle Control Verification and Validation"
- [
2019
] [📝] [ 🎓University of Michigan and Bilkent University, Ankara
] - [
DAgger
,level-k control policy
]
Click to expand
Authors: Tian, R., Li, N., Kolmanovsky, I., Yildiz, Y., & Girard, A.
-
This paper builds on several works (also analysed further below):
- "Adaptive Game-Theoretic Decision Making for Autonomous Vehicle Control at Roundabouts" - (Tian, Li, Li, et al., 2019).
- "Game Theoretic Modeling of Vehicle Interactions at Unsignalized Intersections and Application to Autonomous Vehicle Control" - (N. Li, Kolmanovsky, Girard, & Yildiz, 2018).
- "Game-theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems" - (N. Li et al., 2016).
-
Addressed problem: unsignalized intersections with heterogenous driving styles (
k
in [0
,1
,2
])- The problem is formulated using the level-
k
game-theory formalism (See analysed related works for more details).
- The problem is formulated using the level-
-
One idea: use imitation learning (
IL
) to obtain an explicit level-k
control policy.- A level-
k
policy is a mappingpi
: <ego state
,other's states
,ego k
>->
<sequence of ego actions
>. - The ego-agent maintains belief over the level
k
of other participants. These estimates are updated using maximum likelihood and Bayes rule. - A first attempt with supervised learning on a fix dataset (
behavioural cloning
) was not satisfying enough due to its drift shortcomings:-
"A small error may cause the vehicle to reach a state that is not exactly included in the dataset and, consequently, a large error may occur at the next time step."
-
- The solution is to also aggregate experience sampled from the currently learnt policy.
- The
DAgger
algorithm (Dataset Aggregation) is used in this work. - One point I did not understand: I am surprised that no initial "off-policy" demonstrations is used. The dataset
D
is initialized as empty. - The policy is represented by a neural network.
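A schematic `DAgger` loop as used here, with the game-theoretic level-`k` policy playing the role of the expert; everything below is a stub of that idea, not the authors' implementation:

```python
import random

def dagger(expert_policy, train, n_iterations=5, rollouts_per_iter=20):
    """Dataset Aggregation: visit states with the current learnt policy, label them
    with the expert's actions, aggregate everything, and re-train."""
    dataset = []                                       # initialized empty, as noted above
    policy = lambda state: random.choice([0, 1, 2])    # untrained initial policy
    for _ in range(n_iterations):
        for _ in range(rollouts_per_iter):
            state = random.random()   # stub: in practice a state visited while executing `policy`
            dataset.append((state, expert_policy(state)))
        policy = train(dataset)
    return policy

# Toy usage: the "expert" thresholds the state, "training" is a nearest-neighbour lookup.
expert = lambda s: int(s > 0.5)
train = lambda data: (lambda s: min(data, key=lambda d: abs(d[0] - s))[1])
print(dagger(expert, train)(0.8))
```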
- The
- A level-
"Interactive Decision Making for Autonomous Vehicles in Dense Traffic"
-
[
2019
] [📝] [ 🚗Honda
] -
[
game tree search
,interaction-aware decision making
]
Click to expand
In the rule-based stochastic driver model describing the other agents, 2 thresholds are introduced: The reaction threshold , sampled from the range {−1.5m , 0.4m }, describes whether or not the agent reacts to the ego car. The aggression threshold , uniformly sampled {−2.2 , 1.1m }, describes how the agent reacts. Source. |
Two tree searches are performed: The first step is to identify a target merging gap based on the probability of a successful merge for each of them. The second search involves forward simulation and collision checking for multiple ego and traffic intentions. In practice the author found that ''the coarse tree - i.e. with intention only - was sufficient for long term planning and only one intention depth needed to be considered for the fine-grained search''. This reduces this second tree to a matrix game. Source. |
Author: Isele, D.
- Three motivations when working on decision-making for merging in dense traffic:
1-
Prefergame theory
approaches overrule-based
planners.- To avoid the
frozen robot
issue, especially in dense traffic. -
"If the ego car were to wait for an opening, it may have to wait indefinitely, greatly frustrating drivers behind it".
- To avoid the
2-
Prefer thestochastic game
formulation overMDP
.- Merging in dense traffic involves interacting with self-interested agents ("self-interested" in the sense that they want to travel as fast as possible without crashing).
-
"
MDPs
assume agents follow a set distribution which limits an autonomous agent’s ability to handle non-stationary agents which change their behaviour over time." -
"
Stochastic games
are an extension toMDPs
that generalize to multiple agents, each of which has its own policy and own reward function." - In other words,
stochastic games
seen more appropriate to model interactive behaviours, especially in the forward rollout of tree search:- An interactive prediction model based on the concept of
counterfactual reasoning
is proposed. - It describes how behaviour might change in response to ego agent intervention.
- An interactive prediction model based on the concept of
3-
Prefertree search
overneural networks
.-
"Working with the
game trees
directly produces interpretable decisions which are better suited to safety guarantees, and ease the debugging of undesirable behaviour." - In addition, it is possible to include stochasticity for the tree search.
- More precisely, the probability of a successful merge is computed for each potential gap based on:
- The traffic participant’s willingness to yield.
- The size of the gap.
- The distance to the gap (from our current position).
- More precisely, the probability of a successful merge is computed for each potential gap based on:
-
- How to model other participants, so that they act "intelligently"?
-
"In order to validate our behaviour we need interactive agents to test against. This produces a
chicken and egg
problem, where we need to have an intelligent agent to develop and test our agent. To address this problem, we develop a stochastic rule-based merge behaviour which can give the appearance that agents are changing their mind." - This merging-response driver model builds on the ideas of
IDM
, introducing two thresholds (c.f. figure):- One threshold governs whether or not the agent reacts to the ego car,
- The second threshold determines how the agent reacts.
-
"This process can be viewed as a rule-based variant of negotiation strategies: an agent proposes he/she go first by making it more dangerous for the other, the other agent accepts by backing off."
-
- How to reduce the computational complexity of the probabilistic game tree search, while keeping safely considerations ?
- The forward simulation and the collision checking are costly operations. Especially when the depth of the tree increases.
- Some approximations include reducing the number of actions (for both the ego agent and the other agents), reducing the number of interacting participants and reducing the branching factor, as can be seen in the steps of the presented approach:
1-
Select an intention class based on a coarse search.
- the ego-actions are decomposed into asub-goal selection task
and awithin-sub-goal set of actions
.2-
Identify the interactive traffic participant.
- it is assumed that at any given time, the ego-agent interacts with only one other agent.3-
Predict other agents’ intentions.
- working withintentions
, the continuous action space can be discretized. It reminds me of the concept of `temporal abstraction`
which reduces the depth of the search.4-
Sample and evaluate the ego intentions.
- a set of safe (absence of collision) ego-intentions can be generated and assessed.5-
Act, observe, and update our probability models.
- the probability of safe successful merge.
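A minimal sketch of how a per-gap merge-success probability could combine the three factors listed above (the logistic form and the weights are my own illustration, not the paper's exact model):

```python
import math

def merge_success_probability(gap_size, dist_to_gap, willingness_to_yield,
                              w_gap=1.0, w_dist=0.5, w_yield=2.0, bias=-2.0):
    """Score a candidate gap from its size, its distance from the ego car and the
    estimated willingness of the rear vehicle to yield, squashed into [0, 1]."""
    score = bias + w_gap * gap_size - w_dist * dist_to_gap + w_yield * willingness_to_yield
    return 1.0 / (1.0 + math.exp(-score))

gaps = [dict(gap_size=3.0, dist_to_gap=5.0, willingness_to_yield=0.2),
        dict(gap_size=1.5, dist_to_gap=1.0, willingness_to_yield=0.8)]
best = max(gaps, key=lambda g: merge_success_probability(**g))
print(best)
```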
"Adaptive Robust Game-Theoretic Decision Making for Autonomous Vehicles"
-
[
2019
] [📝] [ 🎓University of Michigan
] [] -
[
k-level strategy
,MPC
,interaction-aware prediction
]
Click to expand
The agent maintain belief on the k parameter for other vehicles and updates it at each step. Source. |
Authors: Sankar, G. S., & Han, K.
- One related work (described further below): Decision making in dynamic and interactive environments based on cognitive hierarchy theory: Formulation, solution, and application to autonomous driving by (Li, S., Li, N., Girard, A., & Kolmanovsky, I. 2019).
- One framework: "level-
k
game-theoretic framework".- It is used to model the interactions between vehicles, taking into account the rationality of the other agents.
- The agents are categorized into hierarchical structure of their cognitive abilities, parametrized with a reasoning depth
k
in [0
,1
,2
].- A level-
0
vehicle considers the other vehicles in the traffic scenario as stationary obstacles, hence being "aggressive". - A level-
1
agent assumes other agents are at level-0
. ...
- A level-
- This parameter
k
is what the agent must estimate to model the interaction with the other vehicles.
- One term: "disturbance set".
- This set, denoted
W
, describe the uncertainty in the position estimate of other vehicle (with somedelta
, similar to the variance in Kalman filters). - It should capture both the uncertainty about the transition model and the uncertainty about the driver models.
- This set is considered when taking action using a "feedback min-max strategy".
- I must admit I did not fully understand the concept. Here is a quote:
-
"The min-max strategy considers the worst-case disturbance affecting the behaviour/performance of the system and provides control actions to mitigate the effect of the worst-case disturbance."
- The important idea is to adapt the size of this
W
set in order to avoid over-conservative behaviours (compared to reachable-set methods).- This is done based on the confidence in the estimated driver model (probability distribution of the estimated
k
) for the other vehicles.- If the agent is sure that the other car follows model
0
, then it should be "fully" conservative. - If the agent is sure it follows level
1
, then it could relax its conservatism (i.e. reduce the size of the disturbance set) since it is taken into consideration.
- If the agent is sure that the other car follows model
- This is done based on the confidence in the estimated driver model (probability distribution of the estimated
- This set, denoted
- I would like to draw some parallels:
- With
(PO)MDP
formulation: for the use of a transition model (or transition function) that is hard to define. - With
POMDP
formulation: for the tracking of beliefs about the driver model (or intention) of other vehicles.
k
) is updated at every step.
- The estimate of the probability distribution (for
- With
IRL
: where the agent can predict the reaction of other vehicles assuming they act optimally w.r.t a reward function it is estimating. - With
MPC
: the choice of the optimal control following a receding horizon strategy.
- With
"Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios"
Click to expand
- Note:
- this
190
-page thesis is also referenced in the sections for prediction and planning. - I really like how the author organizes synergies between three modules that are split and made independent in most modular architectures:
(1)
driver model(2)
behaviour prediction(3)
decision-making
- this
Author: Sierra Gonzalez, D.
- Related work: there are close concepts to the approach of
(Kuderer et al., 2015)
referenced below. - One idea: encode the driving preferences of a human driver with a reward function (or cost function), mentioning a quote from Abbeel, Ng and Russell:
“The reward function, rather than the policy or the value function, is the most succinct, robust, and transferable definition of a task”.
-
Other ideas:
- Use IRL to avoid the manual tuning of the parameters of the reward model. Hence learn a cost/reward function from demonstrations.
- Include dynamic features, such as the
time-headway
, in the linear combination of the cost function, to take the interactions between traffic participants into account. - Combine IRL with a trajectory planner based on "conformal spatiotemporal state lattices".
- The motivation is to deal with continuous state and action spaces and handle the presence of dynamic obstacles.
- Several advantages (I honestly did not understand that point): the ability to exploit the structure of the environment, to consider time as part of the state-space and respect the non-holonomic motion constraints of the vehicle.
-
One term: "planning-based motion prediction".
- The resulting reward function can be used to generate trajectory (for prediction), using optimal control.
- Simply put, it can be assumed that each vehicle in the scene behaves in the "risk-averse" manner encoded by the model, i.e. choosing actions leading to the lowest cost / highest reward.
- This method is also called "model-based prediction" since it relies on a reward function or on the models of an MDP.
- This prediction tool is not used alone but rather coupled with some DBN-based manoeuvre estimation (detailed in the section on prediction).
"An Auto-tuning Framework for Autonomous Vehicles"
-
[
2018
] [📝] [🚗Baidu
] -
[
max-margin
]
Click to expand
Two ideas of rank-based conditional IRL framework (RC -IRL ): Conditional comparison (left) and Rank -based learning (middle - is it a loss ? I think you want to maximize this term instead?). Right: Based on the idea of the maximum margin , the goal is to find the direction that clearly separates the demonstrated trajectory from randomly generated ones. Illustration of the benefits of using RC to prevent background shifting : Even if the optimal reward function direction is the same under the two scenarios, it may not be ideal to train them together because the optimal direction may be impacted by overfitting the background shifting . Instead, the idea of conditioning on scenarios can be viewed as a pairwise comparison, which can remove the background differences. Source. |
The human expert trajectory and randomly generated sample trajectories are sent to a SIAMESE network in a pair-wise manner. Again, I do not understand very well. Source. |
Authors: Fan, H., Xia, Z., Liu, C., Chen, Y., & Kong, Q.
-
Motivation:
- Define an automatic tuning method for the cost function used in the
Apollo
EM
-planning module to address many different scenarios. - The idea is to learn these parameters from human demonstration via
IRL
.
- Define an automatic tuning method for the cost function used in the
-
Two main ideas (to be honest, I have difficulties understanding their points):
-
1-
Conditional comparison.- How to measure similarities between the
expert policy
and acandidate policy
?- Usually: compare the
expectation
of theirvalue functions
. - Here: compare their
value functions
evaluatedstate
bystate
.
- Usually: compare the
- Why "conditional"?
- Because the loss function is conditional on
states
.- This can allegedly significantly reduce the
background variance
. - The authors use the term "background variance" to refer to the "differences in behaviours metrics", due to the diversity of scenarios. (Not very clear to me.)
- This can allegedly significantly reduce the
-
"Instead, the idea of conditioning on scenarios can be viewed as a pairwise comparison, which can remove the background differences."
- Because the loss function is conditional on
- How to measure similarities between the
-
2-
Rank-based learning.-
"To accelerate the training process and extend the coverage of corner cases, we sample random policies and compare against the expert demonstration instead of generating the optimal policy first, as in
policy gradient
." - Why "ranked"?
-
"Our assumption is that the human demonstrations rank near the top of the distribution of policies conditional on initial state on average."
-
"The value function is a rank or search objective for selecting best trajectories in the online module."
-
-
"Car-following method based on inverse reinforcement learning for autonomous vehicle decision-making"
-
[
2018
] [📝] [ 🎓Tsinghua University, California Institute of Technology, Hunan University
] -
[
maximum-margin IRL
]
Click to expand
Kernel functions are used on the continuous state space to obtain a smooth reward function using linear function approximation. Source. |
As often, the divergence metric (to measure the gap between one candidate and the expert) is the expected value function estimated on sampled trajectories. Example of how to use 2 other candidate policies. I am still confused that each of their decision is based on a state seen by the expert, i.e. they are not building their own full trajectory. Source. |
Authors: Gao, H., Shi, G., Xie, G., & Cheng, B.
- One idea: A simple and "educationally relevant" application to
IRL
and a good implementation of the algorithm of (Ng A. & Russell S., 2000): Algorithms for Inverse Reinforcement Learning.- Observe human behaviours during a "car following" task, assume his/her behaviour is optimal w.r.t. an hidden reward function, and try to estimate that function.
- Strong assumption:
no lane-change
,no overtaking
,no traffic-light
. In other worlds, just concerned about the longitudinal control.
- Which
IRL
method?Maximum-margin
. Prediction aim at learning a reward function that explains the demonstrated policy better than alternative policies by a margin.- The "margin" is there to address IRL's solution ambiguity.
- Steps:
1-
Define a simple2d
continuous state spaces
= (s0
,s1
).s0
=ego-speed
divided into15
intervals (eachcentre
will serve to buildmeans
for Gaussian kernel functions).s1
=dist-to-leader
divided into36
intervals (same remark).- A normalization is additionally applied.
2-
Feature transformation: Map the2d
continuous state to a finite number of features using kernel functions.- I recommend this short video about feature transformation using kernel functions.
- Here, Gaussian radial kernel functions are used:
- Why "radial"? The closer the state to the centre of the kernel, the higher the response of the function. And the further you go, the larger the response "falls".
- Why "Gaussian"? Because the standard deviation describes how sharp that "fall" is.
- Note that this functions are
2d
:mean
= (the centre of onespeed
interval, the centre of onedist
interval).
- The distance of the continuous state
s
=
(s0
,s1
) to each of the15
*36
=540
means
s
(i
,j
) can be computed. - This gives
540
kernel featuresf
(i
,j
) = K(s
,s
(i
,j
)).
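A minimal sketch of this kernel feature map (the `15`x`36` grid of centres comes from the paper; the bandwidth and variable names are mine):

```python
import numpy as np

# Kernel centres: 15 speed intervals x 36 distance intervals (centre of each interval).
speed_centres = np.linspace(0, 1, 15)        # normalized ego-speed
dist_centres = np.linspace(0, 1, 36)         # normalized distance-to-leader
centres = np.array([[s, d] for s in speed_centres for d in dist_centres])   # shape (540, 2)

def kernel_features(state, sigma=0.1):
    """Gaussian radial response of the 2-d state to each of the 540 kernel centres."""
    sq_dist = ((centres - np.asarray(state)) ** 2).sum(axis=1)
    return np.exp(-sq_dist / (2 * sigma ** 2))

f = kernel_features([0.4, 0.7])
reward_weights = np.zeros(540)               # the theta(i, j) to be estimated by max-margin IRL
print(f.shape, f @ reward_weights)           # one-step reward R(s) = theta . f(s)
```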
3-
The one-stepreward
is assumed to be a linear combination of these features.
- This is a list of
states
. This list can be mapped to a list ofrewards
. - The discounted sum of this list leads to the
trajectory return
, seen as expectedValue function
.
- This is a list of
- One could also form
540
lists for this trajectory (one per kernel feature). Then reduce them bydiscounted_sum()
, leading to540
V_f
(i
,j
) per trajectory.- The
trajectory return
is then a simple the linear combination:theta
(i
,j
)*
V_f
(i
,j
).
- The
- This can be computed for the demonstrating expert, as well as for many other policies.
- Again, the task it to tune the weights so that the expert results in the largest values, against all possible other policies.
- Given a policy, a trajectory can be constructed.
4-
The goal is now to find the540
theta
(i
,j
) weights parameters solution of themax-margin
objective:- One goal:
costly single-step deviation
.- Try to maximize the smallest difference one could find.
- I.e. select the best non-expert-policy action and try to maximize the difference to the expert-policy action in each state.
`max` [over `theta`] `min` [over `π`] of the `sum` [over `i`, `j`] of `theta`(`i`,`j`) * [`f_expert`(`i`,`j`) - `f_candidate`(`i`,`j`)].
- As often the
value function
serves as "divergence metric".
- Try to maximize the smallest difference one could find.
- One side heuristic to remove degenerate solutions:
-
"The reward functions with many small rewards are more natural and should be preferred". from here.
- Hence a regularization constraint (a constraint, not a loss like
L1
!) on thetheta
(i
,j
).
-
- The optimization problem with strict constraint is transformed into an optimization problem with "inequality" constraint.
- Violating constraints is allowed by penalized.
- As I understood from my readings, that relaxes the linear assumption in the case the true
reward function
cannot be expressed as a linear combination of the fixed basis functions.
- The resulting system of equations is solved here with Lagrange multipliers (linear programming was recommended in the orginal
max-margin
paper).
- One goal:
    - `5-` Once the `theta`(`i`, `j`) are estimated, the `R` function can be expressed.
- About the other policy "candidates":
    - "For each optimal car-following state, one of the other car-following actions is randomly selected for the solution".
    - In other words, in the `V`(`expert`) `>` `V`(`other_candidates`) goal, "`other_candidates`" refers to random policies.
    - It would have been interesting to have "better" competitors, for instance policies that are optimal w.r.t. the current estimate of the `R` function, e.g. learnt with `RL` algorithms.
        - That would lead to an iterative process that stops when `R` converges.
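To make steps `2-` to `4-` above more concrete, here is a minimal `Python` sketch of the kernel feature transformation and of the discounted feature expectations that enter the `max-margin` objective. The state bounds, the kernel width `SIGMA` and all helper names are my own assumptions for illustration, not values from the paper.

```python
import numpy as np

# Assumed discretization: 15 speed centres x 36 distance centres = 540 kernels.
SPEED_CENTRES = np.linspace(0.0, 30.0, 15)   # m/s  (bounds are assumptions)
DIST_CENTRES = np.linspace(0.0, 90.0, 36)    # m    (bounds are assumptions)
SIGMA = np.array([2.0, 2.5])                 # kernel widths, assumed

def kernel_features(state):
    """Map a continuous 2d state (ego-speed, dist-to-leader) to 540 Gaussian RBF responses."""
    s = np.asarray(state, dtype=float)
    grid = np.stack(np.meshgrid(SPEED_CENTRES, DIST_CENTRES, indexing="ij"), axis=-1)  # (15, 36, 2)
    diff = (grid - s) / SIGMA
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1)).ravel()                           # (540,)

def feature_expectations(trajectory, gamma=0.95):
    """Discounted sum of the kernel features along a trajectory: the 540 'V_f(i, j)' values."""
    return sum((gamma ** t) * kernel_features(s) for t, s in enumerate(trajectory))

def linear_return(theta, trajectory, gamma=0.95):
    """Trajectory return under the linear-reward assumption: theta . V_f."""
    return float(theta @ feature_expectations(trajectory, gamma))

# The max-margin step then tunes theta so that the expert's `linear_return` beats
# the returns of the candidate (here random) policies by the largest possible margin.
traj = [(12.0, 25.0), (12.5, 24.0), (13.0, 23.5)]   # toy (speed, distance) sequence
print(kernel_features(traj[0]).shape, linear_return(np.zeros(15 * 36), traj))
```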
-
"A Human-like Trajectory Planning Method by Learning from Naturalistic Driving Data"
-
[
2018
] [📝] [ 🎓Peking University
] [ 🚗Groupe PSA
] -
[
sampling-based trajectory planning
]
Click to expand
Source. |
Authors: He, X., Xu, D., Zhao, H., Moze, M., Aioun, F., & Franck, G.
- One idea: couple learning and sampling for motion planning.
- More precisely, learn from human demonstrations (offline) how to weight different contributions in a cost function (as opposed to hand-crafted approaches).
- This cost function is then used for trajectory planning (online) to evaluate and select one trajectory to follow, among a set of candidates generated by sampling methods.
- One detail: the weights of the optimal cost function minimise the sum of [`prob`(`candidate`) `*` `similarities`(`demonstration`, `candidate`)].
    - It is clear to me how a cost can be converted to some probability, using `softmax()` (a small sketch is given at the end of this section).
. - But for the similarity measure of a trajectory candidate, how to compute "its distance to the human driven one at the same driving situation"?
- Should the expert car have driven exactly on the same track before or is there any abstraction in the representation of the situation?
- How can it generalize at all if the similarity is estimated only on the location and velocity? The traffic situation will be different between two drives.
- One quote:
"The more similarity (i.e. less distance) the trajectory has with the human driven one, the higher probability it has to be selected."
"Learning driving styles for autonomous vehicles from demonstration"
-
[
2015
] [📝] [ 🎓University of Freiburg
] [ 🚗Bosch
] -
[
MaxEnt IRL
]
Click to expand
Source. |
Authors: Kuderer, M., Gulati, S., & Burgard, W.
- One important contribution: deal with continuous features such as the `integral of jerk` over the trajectory.
- One motivation: derive a cost function from observed trajectories.
    - The trajectory object is first mapped to some feature vector (`speed`, `acceleration` ...).
- One Q&A: How to then derive a
cost
(orreward
) from these features?- The authors assume the cost function to be a linear combination of the features.
- The goal is then about learning the weights.
- They acknowledge in the conclusion that it may be a too simple model. Maybe neural nets could help to capture some more complex relations.
- One concept: "Feature matching":
-
"Our goal is to find a generative model
p
(traj
|weights
) that yields trajectories that are similar to the observations." - How to define the "Similarity"?
- The "features" serve as a measure of similarity.
-
- Another concept: "`ME-IRL`" = Maximum Entropy `IRL`.
    - One issue: this "feature matching" formulation is ambiguous.
        - There are potentially many (degenerate) solutions `p`(`traj` | `weights`). For instance `weights` = zeros.
- There are potentially many (degenerated) solutions
- One idea is to introduce an additional goal:
- In this case: "Among all the distributions that match features, select the one that maximizes the entropy."
- The probability distribution over trajectories is in the form
exp
(-cost[features(traj), θ]
), to model that agents are exponentially more likely to select trajectories with lower cost.
- One issue: This "feature matching" formulation is ambiguous.
- About the maximum likelihood approximation in `MaxEnt-IRL`:
    - The gradient of the Lagrangian cost function turns out to be the difference between two terms:
        - `1-` The empirical feature values (easy to compute from the recorded demonstrations).
        - `2-` The expected feature values (hard to compute: it requires integrating over all possible trajectories).
    - An approximation is made to estimate the expected feature values: the authors compute the feature values of the "most likely" trajectory, instead of computing the expectation by sampling (a minimal sketch is given at the end of this section).
    - Interpretation:
        - "We assume that the demonstrations are in fact generated by minimizing a cost function (`IOC`), in contrast to the assumption that demonstrations are samples from a probability distribution (`IRL`)".
- One related work:
- "Learning to Predict Trajectories of Cooperatively Navigating Agents" by (Kretzschmar, H., Kuderer, M., & Burgard, W., 2014).
"TNT: Target-driveN Trajectory Prediction"
-
[
2020
] [📝] [🚗Waymo
] -
[
multimodal
,VectorNet
,goal-based prediction
]
Click to expand
TNT = target-driven trajectory prediction. The intuition is that the uncertainty of future states can be decomposed into two parts: the target or intent uncertainty, such as the decision between turning left and right ; and the control uncertainty, such as the fine-grained motion required to perform a turn. Accordingly, the probabilistic distribution is decomposed by conditioning on targets and then marginalizing over them. Source. |
Authors: Zhao, H., Gao, J., Lan, T., Sun, C., Sapp, B., Varadarajan, B., Shen, Y., Shen, Y., Chai, Y., Schmid, C., Li, C., & Anguelov, D.
-
One sentence:
-
"Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target
states
."
-
-
Motivations:
1-
Model multimodal futures distributions.2-
Be able to incorporate expert knowledge.3-
Do not rely on run-time sampling to estimate trajectory distributions.
-
How to model multimodal futures distributions?
1-
Futuremodes
are implicitly modelled aslatent
variables, which should capture the underlyingintents
of the agents.- Diverse trajectories can be generated by sampling from these implicit distributions.
- E.g.
CVAE
inDESIRE
,GAN
inSocialGAN
, single-step policy roll-out methods: they are prone to mode collapse too. - Issue: the use of
latent
variables to model intents prevents them from being interpreted. Incorporating expert knowledge is made challenging.
latent
space] to evaluate probabilistic queries (e.g., “how likely is the agent to turn left?”) and obtain implicit distributions. -
"Modeling the future as a discrete set of targets does not suffer from mode averaging, which is the major factor that hampers multimodal predictions."
2-
Decompose the trajectory prediction task into subtasks.- For instance with
planning
-based prediction:- First estimate a Bayesian posterior distribution of destinations.
- Then used
IRL
to plan the trajectories.
- Or by decomposing
goal
distribution estimation andgoal
-directedplanning
.
- For instance with
3-
Discretize the output space as intents or with anchors.IntentNet
[UBER
]: several commonmotion
categories are manually defined (e.g.left turn
andlane changes
) and a separate motion predictor is learnt for eachintent
.-
"
MultiPath
[Waymo
] andCoverNet
[nuTonomy
] chose to quantize the trajectories intoanchors
, where the trajectory prediction task is reformulated intoanchor
selection and offset regression." -
"Unlike
anchor
trajectories, the targets inTNT
are much lower dimensional and can be easily discretized via uniform sampling or based on expert knowledge (e.g. HD maps). Hence, they can be estimated more reliably."
-
TNT
= target-driven trajectory prediction.- The framework has
3
stages that are trained end-to-end: 1-
Target
prediction.- Estimate a distribution over candidate
targets
,T
steps into the future, given the encodedscene context
. - The potential future
targets
are modelled via a set ofN
discrete, quantized locations with continuous offsets.-
"We can see that with regression the performance improved by
0.16m
, which shows the necessity of position refinement from the original target coordinates."
-
- Estimate a distribution over candidate
2-
Target
-conditioned motion estimation.- Predict trajectory
state
sequences conditioned ontargets
. - Two assumptions:
1-
Future time steps are conditionally independent. Sequential predictions are therefore avoided.2-
The distribution of the trajectories is unimodal (normal
) given the target.
- Predict trajectory
3-
Scoring and selection.- Estimates the likelihood of each predicted trajectory, taking into account the
context
of all other predicted trajectories-
"Our final stage estimates the
likelihood
of full future trajectoriessF
. This differs from the second stage, which decomposes over time steps andtargets
, and from the first stage which only has knowledge oftargets
, but not full trajectories — e.g., atarget
might be estimated to have highlikelihood
, but a full trajectory to reach that target might not."
-
- Select a final compact set of trajectory predictions.
-
"This process is inspired by the non-maximum suppression algorithm commonly used for computer vision problems, such as object detection."
-
-
How is the
context
information encoded, i.e. the ego-car's interactions with the environment and the other agents?- When the HD map is available: Using the hierarchical graph neural network
VectorNet
[Waymo
].-
"Polylines are used to abstract the HD map elements
cP
(lanes, traffic signs) and agent trajectoriessP
; a subgraph network is applied to encode each polyline, which contains a variable number of vectors; then a global graph is used to model the interactions between polylines."
-
-
" If
scene context
is only available in the form of top-down imagery, aConvNet
is used as the context encoder."
- When the HD map is available: Using the hierarchical graph neural network
-
How to generate the
targets
, i.e. design thetarget space
?- The
target space
is approximated by a set of discrete locations. Expert knowledge (such as road topology) can be incorporated there, for instance by samplingtargets
on and around the lanes."-
"These
targets
are not only grounded in physical entities that are interpretable (e.g.location
), but also correlate well withintent
(e.g. alane change
or aright turn
)."
-
    - `1-` For vehicles: points are uniformly sampled on lane centerlines from the HD map and used as `target` candidates, with the assumption that vehicles never depart far away from lanes.
    - `2-` For pedestrians: a virtual grid is generated around the agent and the grid points are used as target candidates.
        - Grid of range `20m` x `20m` with a grid size of `0.5m` -> `1600` targets are considered and only the best `50` are kept for further processing (a minimal sketch is given at the end of this section).
-
What is the ground truth?
- Not very clear to me how you can use a single demonstration as an oracle: today you
turn left
, but yesterday, with samecontext
, youturned right
. -
"The
ground truth score
of each predicted trajectory is defined by its distance to ground truth trajectoryψ
(sF
)."
-
Results:
TNT
outperforms the state-of-the-art on prediction of:1-
Vehicles:Argoverse
Forecasting andINTERACTION
.2-
Pedestrian:Stanford Drone
and an in-house Pedestrian-at-Intersection dataset (PAID
).
- For benchmark,
MultiPath
andDESIRE
are reimplemented by replacing theirConvNet
context encoders withVectorNet
.
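A minimal sketch of the pedestrian target-candidate generation described above (`20m` x `20m` virtual grid with `0.5m` resolution, keep the best `50`); the scoring network of the first stage is mocked by random scores, and all names are assumptions.

```python
import numpy as np

def pedestrian_target_candidates(agent_xy, half_range=10.0, resolution=0.5):
    """Virtual grid of target candidates centred on the agent: 40 x 40 = 1600 points."""
    offsets = np.arange(-half_range, half_range, resolution)
    gx, gy = np.meshgrid(offsets, offsets)
    return np.stack([gx.ravel() + agent_xy[0], gy.ravel() + agent_xy[1]], axis=-1)   # (1600, 2)

def keep_top_k(candidates, scores, k=50):
    """Keep the k most likely targets for the motion-estimation and scoring stages."""
    order = np.argsort(scores)[::-1]
    return candidates[order[:k]]

cands = pedestrian_target_candidates(np.array([5.0, 2.0]))
dummy_scores = np.random.rand(len(cands))     # stands in for the (trained) target-prediction head
print(cands.shape, keep_top_k(cands, dummy_scores).shape)   # (1600, 2) (50, 2)
```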
"Modeling and Prediction of Human Driver Behavior: A Survey"
-
[
2020
] [📝] [ 🎓Stanford
,University of Illinois
] [🚗Qualcomm
] -
[
state estimation
,intention estimation
,trait estimation
,motion prediction
]
Click to expand
Terminology: the problem is formulated as a discrete-time multi-agent partially observable stochastic game (POSG ). In particular, the internal state can contain agent’s navigational goals or the behavioural traits . Source. |
Authors: Brown, K., Driggs-Campbell, K., & Kochenderfer, M. J.
-
Motivation:
- A review and taxonomy of
200
models from the literature on driver behaviour modelling.-
"In the context of the partially observable stochastic game (
POSG
) formulation, adriver behavior model
is a collection of assumptions about thehuman observation
functionG
,internal-state update
functionH
andpolicy
functionπ
(thestate-transition
functionF
also plays an important role in driver-modeling applications, though it has more to do with the vehicle than the driver)." - The "References" section is large and good!
-
- Models are categorized based on the tasks they aim to address.
1-
state
estimation.2-
intention
estimation.3-
trait
estimation.4-
motion
prediction.
- A review and taxonomy of
-
The following lists are non-exhaustive, see the tables for full details. Instead, they try to give an overview of the most represented instances:
-
1-
(Physical)state
estimation.- [Algorithm]: approximate recursive Bayesian filters. E.g.
KF
,PF
,moving average filter
. -
"Some advanced
state
estimation models take advantage of the structure inherent in the driving environment to improve filtering accuracy. E.g.DBN
".
- [Algorithm]: approximate recursive Bayesian filters. E.g.
-
2-
(Internal states)intention
estimation.-
"
Intention
estimation usually involves computing a probability distribution over a finite set of possible behavior modes - often corresponding to navigational goals (e.g.,change lanes
,overtake
) - that a driver might execute in the current situation." - [Architecture]:
[D]BN
,SVM
,HMM
,LSTM
. - [Scope]:
highway
,intersection
,lane-changing
,urban
. - [Evaluation]:
accuracy
(classification),ROC
curve,F1
,false positive rate
. - [Intention Space] (set of possible behaviour modes that may exist in a driver’s
internal state
- often combined):lateral
modes (e.g.lane-change
/lane-keeping
intentions),routes
(a sequence of decisions, e.g.turn right → go straight → turn right again
that a driver may intend to execute),longitudinal
modes (e.g.car-following
/cruising
),- joint
configurations
. -
"
Configuration
intentions are defined in terms of spatial relationships to other vehicles. For example,intention estimation
for amerging
scenario might involve reasoning about which gap between vehicles the target car intends to enter. Theintention space
of a car in the other lane might be whether or not to yield and allow the merging vehicle to enter."
- [Hypothesis Representation] (how to represent uncertainty in the intention hypothesis?):
discrete probability distribution
over possibleintentions
.-
"In contrast,
point estimate
hypothesis ignores uncertainty and simply assigns a probability of1
to a single (presumably the most likely) behavior mode."
-
- [Estimation / Inference Paradigm]: `single-shot`, `recursive`, `Bayesian` (based on probabilistic graphical models), `black-box`, `game theory`. A minimal sketch of a recursive Bayesian update is given at the end of this section.
    -
"
Recursive
estimation algorithms operate by repeatedly updating the intention hypothesis at each time step based on the new information received. In contrast,single-shot
estimators compute a new hypothesis from scratch at each inference step. The latter may operate over a history of observations, but it does not store any information between successive inference iterations." -
"Game-theoretic models are distinguished by being
interaction-aware
. They explicitly consider possible situational outcomes in order to compute or refine an intention hypothesis. Thisinteraction-awareness
can be as simple as pruning intentions with a high probability of conflicting with other drivers, or it can mean computing the Nash equilibrium of an explicitly formulated game with a payoff matrix."
-
-
-
3-
trait
estimation.-
"Whereas
intention
estimation reasons about what a driver is trying to do,trait estimation
reasons about factors that affect how the driver will do it. Broadly speaking, traits encompassskills
,preferences
, andstyle
, as well as properties likefatigue
,distractedness
, etc." -
"
Trait
estimation may be interpreted as the process of inferring the “parameters” of the driver’spolicy function π
on the basis of observed driving behavior. [...]Traits
can also be interpreted as part of the driver’s internal state." -
[Architecture]:
IDM
,MOBIL
,reward
parameters. -
[Training]:
IRL
,EM
,genetic algorithms
,heuristic
. -
[Theory]: Inverse
RL
. -
[Scope]:
car following
(IDM
),highway
,intersection
,urban
. -
[Trait Space]:
policy
parameters,reward
parameters (assuming that drivers are trying to optimize acost
function).-
"Some of the most widely known driver models are simple parametric controllers with tuneable “style” or “preference”
policy
parameters that represent intuitive behavioraltraits
of drivers. E.g.IDM
." IDM
traits:minimum desired gap
,desired time headway
,maximum feasible acceleration
,preferred deceleration
,maximum desired speed
.-
"
Reward
function parameters often correspond to the same intuitive notions mentioned above (e.g.,preferred velocity
), the important difference being that they parametrize areward
function rather than a closed-loop controlpolicy
."
-
-
[Hypothesis Representation] (uncertainty): in almost all cases, the hypothesis is represented by a
point estimate
rather than adistribution
. -
[Estimation Paradigm]:
offline
/online
.-
"Some models combine the two paradigms by computing a prior distribution
offline
, then tuning itonline
. This tuning procedure often relies on Bayesian methods."
-
-
[Model Class]:
heuristic
,optimization
,Bayesian
,IRL
,contextually varying
.-
"One simple approach to
offline
trait estimation is to settrait
parameters heuristically. Specifying parameters manually is one way to incorporate expert domain knowledge into models." -
"In some approaches,
trait
parameters are modeled as contextually varying, meaning that they vary based on the region of thestate
space (the context) or the current behavior mode."
-
-
-
4-
motion
prediction.-
"Infer the future physical
states
of the surrounding vehicles". - [Architecture]:
IDM
,LSTM
(and otherRNN
/NN
), constant acceleration / speed (CA
,CV
),encoder-decoder
,GMM
,GP
,adaptive
,spline
. - [Training]:
heuristic
.-
"Simple examples include rule-based heuristic control laws like
IDM
. More sophisticated examples include closed-loop policies based onNN
,DBN
, and random forests."
-
- [Theory]:
RL
,MPC
, trajectory optimization.-
"Some
MPC
policy models (including those used within a forward simulation paradigm) fall into the game theoretic category because they explicitly predict the future states of their environment (including other cars) before computing a planned trajectory."
-
- [Scope]:
highway
,car-following
(e.g. usingIDM
),intersection
,urban
. - [Evaluation]:
RMSE
,NLL
,MAE
,collision rate
. - [Vehicle dynamics model]:
linear
,learned
,bicycle kinematic
.-
"Many models in the literature assume linear
state-transition
dynamics. Linear models can be first order (i.e.,output
isposition
,input
isvelocity
), second order (i.e.,output
isposition
,input
isacceleration
), and so forth." -
"Kinematic models are simpler than dynamic models, but the no-slip assumption can lead to significant modeling errors."
-
"Some
state-transition
models are learned, in the sense that the observed correlation between consecutive predictedstates
results entirely from training on large datasets. Some incorporate an explicit transition model where the parameters are learned, whereas others simply output a full trajectory."
-
- [
Scene
-level uncertainty modelling]:single-scenario
(ignoring multimodal uncertainty at the scene level),partial scenario
,multi-scenario
(reason about the different possible scenarios that may follow from an initial traffic scene),reachable set
.-
"Some models reason only about a partial scenario, meaning they predict the motion of only a subset of vehicles in the traffic scene, usually under a single scenario."
-
"Some models reason about multimodal uncertainty on the
scene
-level by performing multiple (parallel) rollouts associated with different scenarios." -
"Rather than reasoning about the likelihood of future
states
, some models reason about reachability. Reachability analysis implies taking a worst-case mindset in terms of predicting vehicle motion."
- [
Agent
-level uncertainty modelling]:single deterministic
,particle set
,Gaussians
. - [Prediction paradigm]:
- open-loop
independent
trajectory prediction.-
"Many models operate under the independent prediction paradigm, meaning that they predict a full trajectory independently for each agent in the scene. These approaches are
interaction-unaware
because they are open-loop. Though they may account for interaction between vehicles at the current timet
, they do not explicitly reason about interaction over the prediction window fromt+1
to the prediction horizontf
. [...] Because independent trajectory prediction models ignore interaction, their predictive power tends to quickly degrade as the prediction horizon extends further into the future."
-
- closed-loop
forward
simulation.-
"In the forward simulation paradigm, motion hypotheses are computed by rolling out a closed-loop control policy
π
for each target vehicle." -game theoretic
prediction.
-
game theoretic
prediction.-
"Agents are modeled as looking ahead to consider the possible ramifications of their
actions
. This notion of looking ahead makes game-theoretic prediction models more deeplyinteraction-aware
than forward simulation models based on reactive closed-loop control."
-
- open-loop
-
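To make the `single-shot` vs `recursive` distinction above concrete, here is a minimal sketch of a recursive Bayesian update of a discrete `intention` hypothesis; the intention space and the observation-likelihood function are made-up placeholders, not models from the survey.

```python
import numpy as np

INTENTIONS = ["lane-keeping", "lane-change-left", "lane-change-right"]

def recursive_intention_update(belief, observation, likelihood_fn):
    """One step of a recursive Bayesian filter over a discrete intention space.
    belief: prior probability per intention; likelihood_fn: p(observation | intention)."""
    likelihoods = np.array([likelihood_fn(observation, i) for i in INTENTIONS])
    posterior = belief * likelihoods
    return posterior / posterior.sum()

def toy_likelihood(lateral_velocity, intention):
    """Dummy measurement model: a left-drifting lateral velocity supports a left lane change."""
    centres = {"lane-keeping": 0.0, "lane-change-left": 0.5, "lane-change-right": -0.5}
    return np.exp(-0.5 * ((lateral_velocity - centres[intention]) / 0.2) ** 2)

belief = np.ones(3) / 3
for obs in [0.1, 0.3, 0.45]:                  # successive lateral-velocity measurements
    belief = recursive_intention_update(belief, obs, toy_likelihood)
print(dict(zip(INTENTIONS, belief.round(3))))
```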
"Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge"
-
[
2020
] [📝] [🚗nuTonomy
] -
[
multimodal
,probabilistic
,mode collapse
,domain knowledge
,classification
]
Click to expand
Top: The idea of CoverNet is to first generate feasible future trajectories, and then classify them. It uses the past states of all road users and a HD map to compute a distribution over a vehicle's possible future states. Bottom-left: the set of trajectories can be reduced by considering the current state and the dynamics: at high speeds , sharp turns are not dynamically feasible for instance. Bottom-right: the contribution here also deals with "feasibility", i.e. tries to reduce the set using domain knowledge. A second loss is introduced to penalize predictions that go off-road . The first loss (cross entropy with closest prediction treated as ground truth) is also adapted: instead of a delta distribution over the closest mode, there is also probability assigned to near misses. Source. |
Authors: Boulton, F. A., Grigore, E. C., & Wolff, E. M.
-
Related work:
CoverNet
: Multimodal Behavior Prediction using Trajectory Sets, (Phan-Minh, Grigore, Boulton, Beijbom, & Wolff, 2019). -
Motivation:
- Extend their
CoverNet
by including further "domain knowledge".-
"Both dynamic constraints and "rules-of-the-road" place strong priors on likely motions."
CoverNet
: Predicted trajectories should be consistent with the current dynamic state.- This work : Predicted trajectories should stay on road.
-
- The main idea is to leverage the
map
information by adding an auxiliary loss that penalizes off-road predictions.
- Extend their
-
Motivations and ideas of
CoverNet
:1-
Avoid the issue of "mode collapse".- The prediction problem is treated as classification over a diverse set of trajectories.
- The trajectory sets for
CoverNet
is available onnuscenes-devkit
github.
2-
Ensure a desired level of coverage of thestate
space.- The larger and the more diverse the set, the higher the coverage. One can play with the resolution to ensure coverage guarantees, while pruning of the set improves the efficiency.
3-
Eliminate dynamically infeasible trajectories, i.e. introduced dynamic constraints.- Trajectories that are not physically possible are not considered, which limits the set of reachable states and improves the efficiency.
-
"We create a dynamic trajectory set based on the current
state
by integrating forward with our dynamic model over diverse control sequences."
-
Two losses:
1-
Moving beyond "standard"cross-entropy
loss for classification.- What is the ground truth trajectory? Obviously, it is not part of the set.
- One solution: designate the closest one in the set.
-
"We utilize cross-entropy with positive samples determined by the element in the
trajectory set
closest to the actual ground truth in minimum average of point-wise Euclidean distances." - Issue: This will penalize the second-closest trajectory just as much as the furthest, since it ignores the geometric structure of the trajectory set.
-
- Another idea: use a weighted cross-entropy loss, where the `weight` is a function of distance to the ground truth (a minimal sketch is given at the end of this section).
"Instead of a
delta
distribution over the closest mode, there is also probability assigned tonear misses
." - A threshold defines which trajectories are "close enough" to the ground truth.
-
- This weighted loss is adapted to favour mode diversity:
-
"We tried an "Avoid Nearby" weighted cross entropy loss that assigns weight of
1
to the closest match,0
to all other trajectories within2
meters of ground truth, and1/|K|
to the rest. We see that we are able to increase mode diversity and recover the performance of the baseline loss." -
"Our results indicate that losses that are better able to enforce mode diversity may lead to improved performance."
-
2-
Add an auxiliary loss for off-road predictions.- This helps learn domain knowledge, i.e. partially encode "rules-of-the-road".
-
"This auxiliary loss can easily be pretrained using only
map
information (e.g.,off-road
area), which significantly improves performance on small datasets."
-
Related works for
predictions
(we want multimodal and probabilistic trajectory predictions):- Input, i.e. encoding of the scene:
-
"State-of-the-art motion prediction algorithms now typically use
CNNs
to learn appropriate features from abirds-eye-view
rendering of the scene (map and road users)." - Graph neural networks (
GNNs
) looks promising to encode interactions. - Here: A
BEV
rasterRGB
image (fixed size) containingmap
information and the paststates
of all objects.- It is inspired by the work of
UBER
: Multimodal Trajectory Predictions for Autonomous Driving using Deep Convolutional Networks, (Cui et al., 2018).
- It is inspired by the work of
-
- Output, i.e. representing the possible future motions:
1-
Generative models.- They encode choice over multiple actions via sampling latent variables.
- Issue: multiple trajectory samples or
1
-step policy rollouts (e.g.R2P2
) are required at inference. - Examples: Stochastic policies,
CVAE
s andGAN
s.
2-
Regression.- Unimodal: predict a single future trajectory. Issue: unrealistically average over behaviours, even when predicting Gaussian uncertainty.
- Multimodal: distribution over multiple trajectories. Issue: suffer from mode collapse.
3-
Classification.-
"We choose not to learn an uncertainty distribution over the space. The density of our trajectory sets reduces its benefit compared to the case when there are a only a handful of modes."
- How to deal with varying number of classes to predict? Not clear to me.
-
- Input, i.e. encoding of the scene:
-
How to solve
mode collapse
in regression?- The authors consider
MultiPath
byWaymo
(detailed also in this page) as their baseline. - A set of anchor boxes can be used, much like in object detection:
-
"This model implements ordinal regression by first choosing among a fixed set of anchors (computed a priori) and then regressing to residuals from the chosen anchor. This model predicts a fixed number of trajectories (
modes
) and their associated probabilities." - The authors extend
MultiPath
with dynamically-computed anchors, based on the agent's currentspeed
.- Again, it makes no sense to consider anchors that are not dynamically reachable.
- They also found that using one order of magnitude more “anchor” trajectories than Waymo (
64
) is beneficial: better coverage of space via anchors, leaving the network to learn smaller residuals.
- The authors consider
-
Extensions:
- As pointed out by this
KIT
Master Thesis offer, the current state ofCoverNet
only has a motion model for cars. Predicting bicycles and pedestrians' motions would be a next step. - Interactions are ignored now.
- As pointed out by this
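A minimal sketch of the weighted cross-entropy idea above, including the "Avoid Nearby" weighting. The distance threshold and shapes follow the description in the text; the loss reduction (weights used as unnormalized soft targets) is one plausible reading, and may differ from the paper's exact implementation.

```python
import numpy as np

def avoid_nearby_weights(traj_set, gt_traj, radius=2.0):
    """'Avoid Nearby' weighting: 1 for the closest match, 0 for other near misses
    (within `radius` metres of ground truth), 1/|K| for the remaining trajectories."""
    dists = np.linalg.norm(traj_set - gt_traj[None], axis=-1).mean(axis=-1)   # mean point-wise L2
    weights = np.full(len(traj_set), 1.0 / len(traj_set))
    weights[dists < radius] = 0.0
    closest = int(np.argmin(dists))
    weights[closest] = 1.0
    return weights, closest

def weighted_cross_entropy(logits, weights):
    """Weights act here as unnormalized soft targets: loss = - sum_k w_k * log p_k."""
    log_probs = logits - logits.max()
    log_probs = log_probs - np.log(np.exp(log_probs).sum())
    return float(-(weights * log_probs).sum())

# Toy usage: 5 candidate trajectories of 12 (x, y) points each.
traj_set = np.random.randn(5, 12, 2)
gt = traj_set[2] + 0.1                         # ground truth lies close to candidate #2
weights, closest = avoid_nearby_weights(traj_set, gt)
print(closest, weights.round(2), weighted_cross_entropy(np.random.randn(5), weights))
```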
"PnPNet: End-to-End Perception and Prediction with Tracking in the Loop"
-
[
2020
] [📝] [ 🎓University of Toronto
] [🚗Uber
] -
[
joint perception + prediction
,multi-object tracking
]
Click to expand
The authors propose to leverage tracking for the joint perception +prediction task. Source. |
Top: One main idea is to make the prediction module directly reuse the scene context captured in the perception features, and also consider the past object tracks . Bottom: a second contribution is the use of a LSTM as a sequence model to learn the object trajectory representation. This encoding is jointly used for the tracking and prediction tasks. Source. |
Authors: Liang, M., Yang, B., Zeng, W., Chen, Y., Hu, R., Casas, S., & Urtasun, R.
-
Motivations:
    - `1-` Perform `perception` and `prediction` jointly, with a single neural network.
        - Therefore it is called "Perception and Prediction": `PnP`.
        - The whole model is also said to be `end-to-end`, because it is `end-to-end` trainable.
            - This contrasts with modular sequential architectures where both the perception output and the map information are forwarded to an independent `prediction` module, for instance in a bird's eye view (`BEV`) raster representation.
2-
Improveprediction
by leveraging the (past) temporal information (motion history) contained intracking
results.- In particular, one goal is to recover from long-term object occlusion.
-
"While all these [vanilla
PnP
] approaches share the sensor features fordetection
andprediction
, they fail to exploit the rich information of actors along the time dimension [...]. This may cause problems when dealing with occluded actors and may produce temporal inconsistency inpredictions
."
-
- The idea is to include
tracking
in the loop to improveprediction
(motion forecasting):-
"While the
detection
module processes sequential sensor data and generates object detections at each time step independently, thetracking
module associates these estimates across time for better understanding of object states (e.g., occlusion reasoning, trajectory smoothing), which in turn provides richer information for theprediction
module to produce accurate future trajectories." -
"Exploiting motion from explicit object trajectories is more accurate than inferring motion from the features computed from the raw sensor data. [this reduces the prediction error by (
∼6%
) in the experiment]"
-
- All modules share computation as there is a single backbone network, and the full model can be trained
end-to-end
.-
"While previous joint
perception
andprediction
models make the prediction module another convolutional header on top of thedetection
backbone network, which shares the same features with thedetection
header, inPnPNet
we put theprediction
module after explicit objecttracking
, with the object trajectory representation as input."
-
- In particular, one goal is to recover from long-term object occlusion.
-
How to represent (long-term) trajectories?
- The idea is to capture both
sensor
observation andmotion
information of actors. -
"For each object we first extract its inferred
motion
(from past detection estimates) and raw observations (fromsensor
features) at each time step, and then model its dynamics using a recurrent network." -
[interesting choice] "For angular velocity of ego car we parameterize it as its
cosine
andsine
values." - This trajectory representation is utilized in both
tracking
andprediction
modules.
- The idea is to capture both
-
About multi-object
tracking
(MOT
):- There exist two distinct challenges:
1-
The discrete problem ofdata association
between previous tracks and current detections.- Association errors (i.e.,
identity switches
) are prone to accumulate through time. -
"The
association problem
is formulated as abipartite
matching problem so that exclusivetrack-to-detection
correspondence is guaranteed. [...] Solved with theHungarian algorithm
." -
"Many frameworks have been proposed to solve the
data association problem
: e.g., Markov Decision Processes (MDP
),min-cost flow
,linear assignment
problem andgraph cut
."
- Association errors (i.e.,
2-
The continuous problem oftrajectory estimation
.- In the proposed approach, the
LSTM
representation of associated new tracks are refined to generate smoother trajectories:-
"For
trajectory refinement
, since it reduces the localization error of online generated perception results, it helps establish a smoother and more accurate motion history."
-
- In the proposed approach, the
- The proposed multi-object tracker solves both problems, therefore it is said "
discrete
-continuous
".
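For the bipartite `track-to-detection` association discussed in this section, a minimal sketch using `scipy`'s Hungarian-algorithm implementation. The centroid-distance affinity and the gating threshold are my own simplifications, not `PnPNet`'s learned affinity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centroids, detection_centroids, gate=3.0):
    """Bipartite track-to-detection matching (Hungarian algorithm on a distance cost)."""
    cost = np.linalg.norm(track_centroids[:, None] - detection_centroids[None, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]   # gating removes bad pairs
    unmatched = sorted(set(range(len(detection_centroids))) - {c for _, c in matches})
    return matches, unmatched                  # unmatched detections can start new tracks

tracks = np.array([[0.0, 0.0], [10.0, 5.0]])
detections = np.array([[0.4, 0.1], [10.2, 4.8], [30.0, 30.0]])
print(associate(tracks, detections))           # -> ([(0, 0), (1, 1)], [2])
```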
"VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation"
Click to expand
Both map features and our sensor input can be simplified into either a point , a polygon , or a curve , which can be approximately represented as polylines, and eventually further split into vector fragments. The set of such vectors form a simplified abstracted world used to make prediction with less computation than rasterized images encoded with ConvNets . Source. |
A vectorized representation of the scene is preferred to the combination (rasterized rendering + ConvNet encoding). A global interaction graph can be built from these vectorized elements, to model the higher-order relationships between entities. To further improve the prediction performance, a supervision auxiliary task is introduced. Source. |
Authors: Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C.
- Motivations:
1-
Reduce computation cost while offering good prediction performances.2-
Capture long range context information, for longer horizon prediction.ConvNets
are commonly used to encode the scene context, but they have limited receptive field.- And increasing
kernel size
andinput image size
is not so easy:- FLOPs of
ConvNets
increase quadratically with thekernel size
andinput image size
. - The number of parameters increases quadratically with the
kernel size
.
- FLOPs of
- Four ingredients:
1-
A vectorized representation is preferred to the combination (rasterized rendering +ConvNet
encoding).2-
A graph network, to model interactions.3-
Hierarchy, to first encodemap
andsensor
information, and then learn interactions.4-
A supervision auxiliary task, in parallel to the prediction task.
- Two kind of input:
1-
HD map
information: Structured road context information such aslane boundaries
,stop/yield signs
,crosswalks
andspeed bumps
.2-
Sensor
information: Agent trajectories.
- How to encode the scene context information?
1-
Rasterized representation.Rendering
: in a bird-eye image, with colour-coded attributes. Issue: colouring requires manual specifications.Encoding
: encode the scene context information withConvNets
. Issue: receptive field may be limited.-
"The most popular way to incorporate highly detailed maps into behavior prediction models is by rendering the map into pixels and encoding the scene information, such as
traffic signs
,lanes
, androad boundaries
, with a convolutional neural network (CNN
). However, this process requires a lot of compute and time. Additionally, processing maps as imagery makes it challenging to model long-range geometry, such as lanes merging ahead, which affects the quality of the predictions." - Impacting parameters:
- Convolutional kernel sizes.
- Resolution of the rasterized images.
- Feature cropping:
-
"A larger crop size (
3
vs1
) can significantly improve the performance, and cropping along observed trajectory also leads to better performance."
-
    - `2-` Vectorized representation.
        - All `map` and `trajectory` elements can be approximated as sequences of vectors (a minimal sketch is given at the end of this section).
        - "This avoids lossy rendering and computationally intensive `ConvNet` encoding steps."
- About graph neural networks (
GNN
s), from Rishabh Anand's medium article:1-
Given a graph, we first convert thenodes
to recurrent units and theedges
to feed-forward neural networks.2-
Then we performNeighbourhood Aggregation
(Message Passing
) for all nodesn
number of times.3-
Then we sum over the embedding vectors of all nodes to get graph representationH
. Here the "Global interaction graph".4-
Feel free to passH
into higher layers or use it to represent the graph’s unique properties! Here to learn interaction models to make prediction.
- About hierarchy:
1-
First, aggregate information among vectors inside a polyline, namelypolyline subgraphs
.- Graph neural networks (
GNN
s) are used to incorporate these sets of vectors -
"We treat each vector vi belonging to a polyline Pj as a node in the graph with node features."
- How to encode attributes of these geometric elements? E.g.
traffic light state
,speed limit
?- I must admit I did not fully understand. But from what I read on medium:
- Each
node
has a set of features defining it. - Each
edge
may connectnodes
together that have similar features.
- Each
- I must admit I did not fully understand. But from what I read on medium:
- Graph neural networks (
2-
Then, model the higher-order relationships among polylines, directly from their vectorized form.- Two interactions are jointly modelled:
1-
The interactions of multiple agents.2-
Their interactions with the entities from road maps.- E.g. a car enters an intersection, or a pedestrian approaches a crosswalk.
-
"We clearly observe that adding map information significantly improves the trajectory prediction performance."
- Two interactions are jointly modelled:
- An auxiliary task:
1-
Randomly masking outmap
features during training, such as astop sign
at a four-way intersection.2-
Require the net to complete it.-
"The goal is to incentivize the model to better capture interactions among nodes."
- And to learn to deal with occlusion.
-
"Adding this objective consistently helps with performance, especially at longer time horizons".
- Two training objectives:
1-
Main task = Prediction. Future trajectories.2-
Auxiliary task = Supervision.Huber loss
between predicted node features and ground-truth masked node features.
- Evaluation metrics:
- The "widely used" Average Displacement Error (
ADE
) computed over the entire trajectories. - The Displacement Error at
t
(DE@ts
) metric, wheret
in {1.0
,2.0
,3.0
} seconds.
- The "widely used" Average Displacement Error (
- Performances and computation cost.
VectorNet
is compared toConvNets
on theArgoverse
forecasting dataset, as well as on someWaymo
in-house prediction dataset.ConvNets
consumes200+
times moreFLOPs
thanVectorNet
for a single agent:10.56G
vs0.041G
. Factor5
when there are50
agents per scene.VectorNet
needs29%
of the parameters ofConvNets
:72K
vs246K
.VectorNet
achieves up to18%
better performance onArgoverse
.
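A minimal sketch of the vectorization step described above: a map or trajectory polyline is split into consecutive vectors carrying their start point, end point, attribute features and the id of the parent polyline. The attribute encoding and the array layout are my own assumptions, not `VectorNet`'s exact feature design.

```python
import numpy as np

def polyline_to_vectors(points, attributes, polyline_id):
    """Split a polyline of N points into N-1 vectors: [x_start, y_start, x_end, y_end, attrs..., id]."""
    points = np.asarray(points, dtype=float)
    starts, ends = points[:-1], points[1:]
    attrs = np.tile(np.asarray(attributes, dtype=float), (len(starts), 1))
    ids = np.full((len(starts), 1), float(polyline_id))
    return np.hstack([starts, ends, attrs, ids])

# A lane centreline with a 'speed limit' attribute, and a short agent track.
lane = polyline_to_vectors([[0, 0], [5, 0], [10, 1]], attributes=[13.9], polyline_id=0)
track = polyline_to_vectors([[2, -1], [3, -1], [4, -0.5]], attributes=[0.0], polyline_id=1)
nodes = np.vstack([lane, track])   # node features of the polyline subgraphs fed to the GNN
print(nodes.shape)                 # (4, 6)
```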
"Online parameter estimation for human driver behavior prediction"
-
[
2020
] [📝] [] [🎓Stanford
] [🚗Toyota Research Institute
] -
[
stochastic IDM
]
Click to expand
The vanilla IDM is a parametric rule-based car-following model that balances two forces: the desire to achieve free speed if there were no vehicle in front, and the need to maintain safe separation with the vehicle in front. It outputs an acceleration that is guaranteed to be collision free. The stochastic version introduces a new model parameter σ-IDM . Source. |
Authors: Bhattacharyya, R., Senanayake, R., Brown, K., & Kochenderfer
- Motivations:
1-
Explicitly model stochasticity in the behaviour of individual drivers.- Complex multi-modal distributions over possible outcomes should be modelled.
    - `2-` Provide safety guarantees.
    - `3-` Highway scenarios: no urban intersections.
rule-based
andlearning-based
estimation/prediction methods:- Interpretability.
- Guarantees on safety (the learning-based model Generative Adversarial Imitation Learning (
GAIL
) used as baseline is not collision free). - Validity even in regions of the state space that are under-represented in the data.
- High expressive power to capture nuanced driving behaviour.
- About the method:
-
"We apply online parameter estimation to an extension of the Intelligent Driver Model IDM that explicitly models stochasticity in the behavior of individual drivers."
- This rule-based method is online, as opposed for instance to the IDM with parameters obtained by offline estimation, using non-linear least squares.
- Particle filtering is used for the recursive Bayesian estimation (a minimal sketch is given at the end of this section).
- The derived parameter estimates are then used for forward motion prediction.
-
- About the estimated parameters (per observed vehicle):
1-
The desired velocity (v-des
).2-
The driver-dependent stochasticity on acceleration (σ-IDM
).- They are assumed stationary for each driver, i.e., human drivers do not change their latent driving behaviour over the time horizons.
- About the datasets:
NGSIM
for US Highway 101 at10 Hz
.- Highway Drone Dataset (
HighD
) at25 Hz
. RMSE
of the position and velocity are used to measure “closeness” of a predicted trajectory to the corresponding ground-truth trajectory.- Undesirable events, e.g. collision, going off-the-road, hard braking, that occur in each scene prediction are also considered.
- How to deal with the "particle deprivation problem"?:
- Particle deprivation = particles converge to one region of the state space and there is no exploration of other regions.
Dithering
method = external noise is added to aid exploration of state space regions.- From (Schön, Gustafsson, & Karlsson, 2009) in "The Particle Filter in Practice":
-
"Both the
process
noise andmeasurement
noise distributions need some dithering (increased covariance). Dithering theprocess
noise is a well-known method to mitigate the sample impoverishment problem. Dithering themeasurement
noise is a good way to mitigate the effects of outliers and to robustify thePF
in general".
-
- Here:
-
"We implement dithering by adding random noise to the top
20%
particles ranked according to the corresponding likelihood. The noise is sampled from a discrete uniform distribution withv-des
∈
{−0.5
,0
,0.5
} andσ-IDM
∈
{−0.1
,0
,0.1
}. (This preserves the discretization present in the initial sampling of particles).
-
- Future works:
- Non-stationarity.
- Combination with a lane changing model such as
MOBIL
to extend to two-dimensional driving behaviour.
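A minimal sketch of the two ingredients of this paper: a stochastic `IDM` step (deterministic `IDM` acceleration plus driver-dependent noise `σ-IDM`) and one particle-filter update that recursively estimates (`v-des`, `σ-IDM`) from an observed acceleration, with dithering of the top `20%` particles. The fixed `IDM` parameters, the measurement model and the helper names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

# Fixed IDM parameters (assumed); the two estimated, driver-specific ones are v_des and sigma.
A_MAX, B_COMF, T_HW, S0, DELTA = 1.5, 2.0, 1.5, 2.0, 4.0

def idm_acceleration(v, gap, dv, v_des):
    """Deterministic IDM: free-road term plus interaction term with the leading vehicle."""
    s_star = S0 + v * T_HW + v * dv / (2.0 * np.sqrt(A_MAX * B_COMF))
    return A_MAX * (1.0 - (v / v_des) ** DELTA - (s_star / gap) ** 2)

def stochastic_idm_acceleration(v, gap, dv, v_des, sigma, rng):
    """Stochastic IDM: Gaussian acceleration noise with driver-dependent scale sigma (σ-IDM)."""
    return idm_acceleration(v, gap, dv, v_des) + rng.normal(0.0, sigma)

def particle_filter_step(particles, weights, obs_accel, v, gap, dv, rng):
    """One recursive Bayesian update of the (v_des, sigma) particles from an observed acceleration."""
    v_des, sigma = particles[:, 0], particles[:, 1]
    predicted = idm_acceleration(v, gap, dv, v_des)
    lik = np.exp(-0.5 * ((obs_accel - predicted) / sigma) ** 2) / sigma   # Gaussian likelihood
    weights = weights * lik
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)     # resampling
    resampled = particles[idx].copy()
    # Dither the top-20% particles (ranked by likelihood) to fight particle deprivation.
    top = np.argsort(lik[idx])[-len(resampled) // 5:]
    resampled[top, 0] += rng.choice([-0.5, 0.0, 0.5], size=len(top))
    resampled[top, 1] += rng.choice([-0.1, 0.0, 0.1], size=len(top))
    resampled[:, 1] = np.clip(resampled[:, 1], 0.05, None)               # keep sigma positive
    return resampled, np.full(len(resampled), 1.0 / len(resampled))

rng = np.random.default_rng(0)
particles = np.column_stack([rng.uniform(20, 40, 500), rng.uniform(0.1, 1.0, 500)])
weights = np.full(500, 1.0 / 500)
particles, weights = particle_filter_step(particles, weights, obs_accel=0.3,
                                          v=25.0, gap=30.0, dv=-1.0, rng=rng)
print(particles.mean(axis=0))   # current estimate of (v-des, σ-IDM)
print(stochastic_idm_acceleration(25.0, 30.0, -1.0, v_des=30.0, sigma=0.3, rng=rng))
```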
"PLOP: Probabilistic poLynomial Objects trajectory Planning for autonomous driving"
-
[
2020
] [📝] [[🎞️](TO COME)] [ 🚗Valeo
] -
[
Gaussian mixture
,multi-trajectory prediction
,nuScenes
,A2D2
,auxiliary loss
]
Click to expand
The architecture has two main sections: an encoder to synthesize information and the predictor where we exploit it. Note that PLOP does not use the classic RNN decoder scheme for trajectory generation, preferring a single step version which predicts the coefficients of a polynomial function instead of the consecutive points. Also note the navigation command that conditions the ego prediction. Source. |
PLOP uses multimodal sensor data input: Lidar and camera . The map is accumulated over the past 2s , so 20 frames. It produces a multivariate gaussian mixture for a fixed number of K possible trajectories over a 4s horizon. Uncertainty and variability are handled by predicting vehicle trajectories as a probabilistic Gaussian Mixture models, constrained by a polynomial formulation. Source. |
Authors: Buhet, T., Wirbel, E., & Perrotton, X.
-
Motivations:
- The goal is to predict multiple feasible future trajectories, both for the ego vehicle and its neighbours, through a probabilistic framework.
    - In addition, in an `end-to-end` trainable fashion.
- It builds on a previous work: "Conditional vehicle trajectories prediction in carla urban environment" - (Buhet, Wirbel, & Perrotton, 2019). See analysis further below.
- The trajectory prediction based on polynomial representation is upgraded from deterministic output to multimodal probabilistic output.
- It re-uses the navigation command input for the conditional part of the network, e.g.
follow
,left
,straight
,right
. - One main difference is the introduction of a new input sensor: Lidar.
- And adding a semantic segmentation auxiliary loss.
- The authors also reflect about what metrics is relevant for trajectory prediction:
-
"We suggest to use two additional criteria to evaluate the predictions errors, one based on the most confident prediction, and one weighted by the confidence [how alternative trajectories with non maximum weights compare to the most confident trajectory]."
-
- The goal is to predicte multiple feasible future trajectories both for the ego vehicle and neighbors through a probabilistic framework.
-
One term: "Probabilistic poLynomial Objects trajectory Planning" =
PLOP
. -
I especially like their review on related works about data-driven predictions (section taken from the paper):
SocialLSTM
: encodes the relations between close agents introducing a social pooling layer.- -Deterministic approaches derived from
SocialLSTM
:SEQ2SEQ
presents a newLSTM
-based encoder-decoder network to predict trajectories into an occupancy grid map.SocialGAN
andSoPhie
use generative adversarial networks to tackle uncertainty in future paths and augment the original set of samples.CS-LSTM
extendsSocialLSTM
using convolutional layers to encode the relations between the different agents.
ChauffeurNet
uses a sophisticated neural network with a complex high level scene representation (roadmap
,traffic lights
,speed limit
,route
,dynamic bounding boxes
, etc.) for deterministic ego vehicle trajectory prediction.
- Other works use a graph representation of the interactions between the agents in combination with neural networks for trajectory planning.
- Probabilistic approaches:
- Many works like
PRECOG
,R2P2
,Multiple Futures Prediction
,SocialGAN
include probabilistic estimation by adding a probabilistic framework at the end of their architecture producing multiple trajectories for ego vehicle, nearby vehicles or both. - In
PRECOG
,Rhinehart et al.
build a probabilistic model that explicitly models interactions between agents, using latent variables to model the plausible reactions of agents to each other, with a possibility to pre-condition the trajectory of the ego vehicle by a goal. MultiPath
also reuses an idea from object detection algorithms using trajectory anchors extracted from the training data for ego vehicle prediction.
- Many works like
-
About the auxiliary semantic segmentation task.
- Teaching the network to represent such semantic in its features improves the prediction.
-
"Our objective here is to make sure that in the
RGB
image encoding, there is information about the road position and availability, the applicability of the traffic rules (traffic sign/signal), the vulnerable road users (pedestrians, cyclists, etc.) position, etc. This information is useful for trajectory planning and brings some explainability to our model."
-
About interactions with other vehicles.
- The prediction for each vehicle does not have direct access to the sequence of history positions of others.
-
"The encoding of the interaction between vehicles is implicitly computed by the birdview encoding."
- The number of predicted trajectories is fixed in the network architecture.
K=12
is chosen.-
"It allows our architecture to be agnostic to the number of considered neighbors."
-
-
Multi-trajectory prediction in a probabilistic framework.
-
"We want to predict a fixed number
K
of possible trajectories for each vehicle, and associate them to a probability distribution overx
andy
:x
is the longitudinal axis,y
the lateral axis, pointing left." - About the Gaussian Mixture.
- Vehicle trajectories are predicted as probabilistic Gaussian Mixture models, constrained by a polynomial formulation: the mean of the distribution is expressed as a polynomial of degree `4` in time (a minimal sketch is given at the end of this section).
"In the end, this representation can be interpreted as predicting
K
trajectories, each associated with a confidenceπk
[mixture weights shared for all sampled points belonging to the same trajectory], with sampled points following a Gaussian distribution centered on (µk,x,t
,µk,y,t
) and with standard deviation (σk,x,t
,σk,y,t
)." -
"
PLOP
does not use the classicRNN
decoder scheme for trajectory generation, preferring a single step version which predicts the coefficients of a polynomial function instead of the consecutive points." - This offers a measure of uncertainty on the predictions.
- For the ego car, the probability distribution is conditioned by the navigation command.
- Vehicle trajectories are predicted as probabilistic Gaussian Mixture models, constrained by a polynomial formulation: The mean of the distribution is expressed using a polynomial of degree
- About the loss:
negative log-likelihood
over all sampled points of the ground truth ego and neighbour vehicles trajectories.- There is also the auxiliary
cross entropy loss
for segmentation.
-
-
Some findings:
- The presented model seems very robust to the varying number of neighbours.
- Finally, for `5` agents or more, `PLOP` outperforms both `ESP` and `PRECOG` by a large margin, on authors-defined metrics.
"This result might be explained by our interaction encoding which is robust to the variations of
N
using only multiple birdview projections and our non-iterative single step trajectory generation."
- Finally, for
-
"Using
K = 1
approach yields very poor results, also visible in the training loss. It was an anticipated outcome due to the ambiguity of human behavior."
- The presented model seems very robust to the varying number of neighbours.
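A minimal sketch of the probabilistic output head described above: `K` trajectories whose means are degree-`4` polynomials of time, scored with a Gaussian-mixture negative log-likelihood against a ground-truth trajectory. Shapes, the constant standard deviations and the per-trajectory shared mixture weights follow the description loosely; the exact parameterization in `PLOP` may differ.

```python
import numpy as np

def polynomial_means(coeffs, times):
    """coeffs: (K, 2, 5) polynomial coefficients for x and y (degree 4) -> means of shape (K, T, 2)."""
    powers = times[:, None] ** np.arange(5)                  # (T, 5)
    return np.einsum("kdp,tp->ktd", coeffs, powers)          # (K, T, 2)

def gmm_trajectory_nll(coeffs, log_weights, sigmas, gt_traj, times):
    """Negative log-likelihood of the ground-truth trajectory under the K-component mixture."""
    mu = polynomial_means(coeffs, times)                     # (K, T, 2)
    diff = (gt_traj[None] - mu) / sigmas                     # (K, T, 2)
    log_gauss = -0.5 * diff ** 2 - np.log(sigmas) - 0.5 * np.log(2 * np.pi)
    log_traj = log_gauss.sum(axis=(1, 2))                    # log prob of the whole trajectory per mode
    log_w = log_weights - np.logaddexp.reduce(log_weights)   # normalized log mixture weights
    return -np.logaddexp.reduce(log_w + log_traj)

K, T = 12, 8                                   # 12 modes, 4s horizon sampled every 0.5s
times = np.linspace(0.5, 4.0, T)
coeffs = np.random.randn(K, 2, 5) * 0.1
sigmas = np.full((K, T, 2), 0.5)
gt = np.column_stack([times * 5.0, np.zeros(T)])   # roughly straight driving at 5 m/s
print(gmm_trajectory_nll(coeffs, np.zeros(K), sigmas, gt, times))
```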
"Probabilistic Future Prediction for Video Scene Understanding"
-
[
2020
] [📝] [🎞️] [🎞️ (blog)] [ 🎓University of Cambridge
] [ 🚗Wayve
] -
[
multi frame
,multi future
,auxiliary learning
,multi-task
,conditional imitation learning
]
Click to expand
One main motivation is to supply the Control module (e.g. policy learnt via IL ) with a representation capable of modelling probability of future events. The Dynamics module produces such spatio-temporal representation, not directly from images but from learnt scene features. That embeddings, that are used by the Control in order to learn driving policy, can be explicitly decoded to future semantic segmentation , depth , and optical flow . Note that the stochasticity of the future is modelled with a conditional variational approach minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution . Source. |
There are many possible futures approaching this four-way intersection. Using 3 different noise vectors makes the model imagine different driving manoeuvres at an intersection: driving straight , turning left or turning right . These samples predict 10 frames, or 2 seconds into the future. Source. |
The differential entropy of the present distribution, characterizing how unsure the model is about the future is used. As we approach the intersection, it increases. Source. |
Authors: Hu, A., Cotter, F., Mohan, N., Gurau, C., & Kendall, A.
-
Motivations:
1-
Supply the control module (e.g.IL
) with an appropriate representation forinteraction-aware
anduncertainty-aware
decision-making, i.e. one capable of modelling probability of future events.- Therefore the policy should receive temporal features explicitly trained to predict the future.
- Motivation for that: It is difficult to learn an effective temporal representation by only using imitation error as a learning signal.
- Therefore the policy should receive temporal features explicitly trained to predict the future.
- Others:
2-
"multi frame
andmulti future
" prediction.- Perform prediction:
- ... based on multiple past frames (i.e. not a single one).
- ... and producing multiple possible outcomes (i.e. not deterministic).
- Predict the stochasticity of the future, i.e. contemplate multiple possible outcomes and estimate the multi-modal uncertainty.
- Perform prediction:
3-
Offer a differentiable / end-to-end trainable system, as opposed to systems that reason over hand-coded representations.
IL
part into the layers that create the latent representation.
- I understand it as considering the loss of the
4-
Cope with multi-agent interaction situations such as traffic merging, i.e. do not predict the behaviour of each actor in the scene independently.- For instance by jointly predicting ego-motion and motion of other dynamic agents.
5-
Do not rely on anyHD-map
to predict the static scene, to stay resilient toHD-map
errors due to e.g. roadworks.
-
auxiliary learning
: The loss used to train the latent representation is composed of three terms (c.f. motivation3-
):future-prediction
: weighted sum of futuresegmentation
,depth
andoptical flow
losses.probabilistic
:KL
-divergence between thepresent
and thefuture
distributions.control
: regression for future time-steps up to someFuture control horizon
.
-
Let's explore some ideas behinds these three components.
-
1-
Temporal video encoding: How to build a temporal and visual representation?-
What should be predicted?
-
"Previous work on probabilistic future prediction focused on trajectory forecasting [DESIRE, Lee et al. 2017, Bhattacharyya et al. 2018, PRECOG, Rhinehart et al. 2019] or were restricted to single-frame image generation and low resolution (64x64) datasets that are either simulated (Moving MNIST) or with static scenes and limited dynamics."
-
"Directly predicting in the high-dimensional space of image pixels is unnecessary, as some details about the appearance of the world are irrelevant for planning and control."
- Instead, the task is to predict a more complete scene representation with
segmentation
,depth
, andflow
, two seconds in the future.
-
-
What should the
temporal module
process?- The temporal model should learn the spatio-temporal features from perception encodings [as opposed to RGB images].
- These encodings are "scene features" extracted from images by a
Perception
module. They constitute a more powerful and compact representation compared to RGB images.
-
How does the
temporal module
look like?-
"We propose a new spatio-temporal architecture that can learn hierarchically more complex features with a novel 3D convolutional structure incorporating both local and global space and time context."
-
-
The authors introduce a so-called
Temporal Block
module for temporal video encoding.- These
Temporal Block
should help to learn hierarchically more complex temporal features. With two main ideas: 1-
Decompose the convolutional filters and play with all possible configuration.-
"Learning
3D
filters is hard. Decomposing into two subtasks helps the network learn more efficient." -
"State-of-the-art works decompose
3D
filters into spatial and temporal convolutions. The model we propose further breaks down convolutions into many space-time combinations and context aggregation modules, stacking them together in a more complex hierarchical representation."
-
2-
Incorporate the "global context" in the features (I did not fully understand that).- They concatenate some local features based on
1x1x1
compression with some global features extracted withaverage pooling
. -
"By pooling the features spatially and temporally at different scales, each individual feature map also has information about the global scene context, which helps in ambiguous situations."
- They concatenate some local features based on
- These
-
-
2-
Probabilistic prediction: how to generate multiple futures?-
"There are various reasons why modelling the future is incredibly difficult: natural-scene data is rich in details, most of which are irrelevant for the driving task, dynamic agents have complex temporal dynamics, often controlled by unobservable variables, and the future is inherently uncertain, as multiple futures might arise from a unique and deterministic past."
- The idea is that the uncertainty of the future can be estimated by making the prediction probabilistic.
-
"From a unique past in the real-world, many futures are possible, but in reality we only observe one future. Consequently, modelling multi-modal futures from deterministic video training data is extremely challenging."
- Another challenge when trying to learn a multi-modal prediction model:
-
"If the network predicts a plausible future, but one that did not match the given training sequence, it will be heavily penalised."
-
-
"Our work addresses this by encoding the future state into a low-dimensional
future distribution
. We then allow the model to have a privileged view of the future through the future distribution at training time. As we cannot use the future at test time, we train apresent distribution
(using only the current state) to match thefuture distribution
through aKL
-divergence loss. We can then sample from the present distribution during inference, when we do not have access to the future."
-
- To put it another way, two probability distributions are modelled, in a conditional variational approach:
- A present distribution
P
, that represents all what could happen given the past context. - A future distribution
F
, that represents what actually happened in that particular observation.
-
[Learning to align the
present distribution
with thefuture distribution
] "As the future is multimodal, different futures might arise from a unique past contextzt
. Each of these futures will be captured by the future distributionF
that will pull the present distributionP
towards it." - How to evaluate predictions?
-
"Our probabilistic model should be accurate, that is to say at least one of the generated future should match the ground truth future. It should also be diverse".
- The authors use a diversity distance metric (
DDM
), which measures both accuracy and diversity of the distribution.
-
- How to quantify uncertainty?
- The framework can automatically infer which scenes are unusual or unexpected and where the model is uncertain of the future, by computing the differential entropy of the
present distribution
. - This is useful for understanding edge-cases and when the model needs to "pay more attention".
- The framework can automatically infer which scenes are unusual or unexpected and where the model is uncertain of the future, by computing the differential entropy of the
-
-
3-
The rich spatio-temporal features explicitly trained to predict the future are used to learn a driving policy.-
Conditional Imitation Learning is used to learn
speed
and steering
controls, i.e. regressing to the expert's true control actions {v
,θ
}.- One reason is that it is immediately transferable to the real world.
-
From the ablation study, it seems to highly benefit from both:
1-
Thetemporal features
."It is too difficult to forecast how the future is going to evolve with a single image".
2-
The fact that these features are capable ofprobabilistic predictions
.- Especially for multi-agent interaction scenarios.
-
About the training set:
-
"We address the inherent dataset bias by sampling data uniformly across lateral and longitudinal dimensions. First, the data is split into a histogram of bins by
steering
, and subsequently byspeed
. We found that weighting each data point proportionally to the width of the bin it belongs to avoids the need for alternative approaches such as data augmentation."
-
-
-
One exciting future direction:
- For the moment, the
control
module takes the representation learned from dynamics models. And ignores the predictions themselves.- By the way, why are predictions, especially for the ego trajectories, not conditionned on possible actions?
- It could use these probabilistic embedding capable of predicting multi-modal and plausible futures to generate imagined experience to train a policy in a model-based
RL
. - The design of the
reward
function from the latent space looks challenging at first sight.
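To make the `present` / `future` distribution trick more concrete, here is a minimal sketch (my own illustration, not the authors' code) in which both distributions are diagonal Gaussians predicted by small linear heads; the `DistributionHead` name and the feature sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class DistributionHead(nn.Module):
    """Map a feature vector to a diagonal Gaussian over a low-dimensional latent code."""
    def __init__(self, in_dim, latent_dim=32):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, features):
        mu, log_sigma = self.fc(features).chunk(2, dim=-1)
        return D.Normal(mu, log_sigma.exp())

present_head = DistributionHead(in_dim=512)    # sees only past spatio-temporal features
future_head = DistributionHead(in_dim=1024)    # privileged view: past + future features (training only)

past_feats = torch.randn(8, 512)               # dummy batch of encoded past observations
future_feats = torch.randn(8, 1024)            # dummy batch that also includes the observed future

P = present_head(past_feats)                   # present distribution
F = future_head(future_feats)                  # future distribution

# KL(F || P) pulls the present distribution towards the observed future mode.
kl_loss = D.kl_divergence(F, P).sum(dim=-1).mean()

# At inference the future is unavailable: sample the latent code from P instead,
# once per desired future, to obtain diverse predictions.
z = P.rsample()
```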
"Efficient Behavior-aware Control of Automated Vehicles at Crosswalks using Minimal Information Pedestrian Prediction Model"
-
[
2020
] [📝] [ 🎓University of Michigan
,University of Massachusetts
] -
[
interaction-aware decision-making
,probabilistic hybrid automaton
]
Click to expand
The pedestrian crossing behaviour is modelled as a probabilistic hybrid automaton. Source. |
The interaction is captured inside a gap-acceptance model: the pedestrian evaluates the available time gap to cross the street and either accept the gap by starting to cross or reject the gap by waiting at the crosswalk. Source. |
The baseline controller used for comparison is a finite state machine (FSM ) with four states. Whenever a pedestrian starts walking to cross the road, the controller always tries to stop, either by yielding or through hard stop . Source. |
Authors: Jayaraman, S. K., Jr, L. P. R., Yang, X. J., Pradhan, A. K., & Tilbury, D. M.
-
Motivations:
- Scenario: interaction with a pedestrian
approaching
/crossing
/waiting
at a crosswalk. 1-
A (1.1
) simple and (1.2
) interaction-aware pedestrianprediction
model.- That means no requirement of extensive amounts of data.
-
"The crossing model as a hybrid system with a gap acceptance model that required minimal information, namely pedestrian's
position
andvelocity
".- It does not require information about pedestrian
actions
orpose
. - It builds on "Analysis and prediction of pedestrian crosswalk behavior during automated vehicle interactions" by (Jayaraman, Tilbury, Yang, Pradhan, & Jr, 2020).
- It does not require information about pedestrian
2-
Effectively incorporating thesepredictions
in acontrol
framework- The idea is to first forecast the position of the pedestrian using a pedestrian model, and then react accordingly.
3-
Be efficient on bothwaiting
andapproaching
pedestrian scenarios.- Assuming always a
crossing
may lead to over-conservative policies. -
"[in simulation] only a fraction of pedestrians (
80%
) are randomly assigned the intention to cross the street."
- Assuming always a
- Scenario: interaction with a pedestrian
-
Why are
CV
andCA
prediction models not applicable?-
"At crosswalks, pedestrian behavior is much more unpredictable as they have to wait for an opportunity and decide when to cross."
- Longer durations are needed.
1-
Interaction must be taken into account.2-
The authors decide to model pedestrians as a hybrid automaton that switches between discrete actions.
-
-
One term: Behavior-aware Model Predictive Controller (
B-MPC
)1-
The pedestrian crossing behaviour is modelled as a probabilistic hybrid automaton:- Four states:
Approach Crosswalk
,Wait
,Cross
,Walk away
. - Probabilistic transitions: using pedestrian's
gap acceptance
- hence capturing interactions.-
"What is the probability of accepting the current traffic gap?
-
"Pedestrians evaluate the available time gap to cross the street and either accept the gap by starting to cross or reject the gap by waiting at the crosswalk."
-
- Four states:
2-
The problem is formulated as a constrained quadratic optimization problem:- Cost:
success
(passing the crosswalk),comfort
(penalize jerk and sudden changes in acceleration),efficiency
(deviation from the reference speed). - Constraints: respect
motion model
, restrictvelocity
,acceleration
, as well asjerk
, and ensurecollision avoidance
. - Solver: standard quadratic program solver in
MATLAB
.
- Cost:
-
Performances:
- Baseline controller:
- Finite state machine (
FSM
) with four states:Maintain Speed
,Accelerate
,Yield
, andHard Stop
. -
"Whenever a pedestrian starts walking to cross the road, the controller always tries to stop, either by
yielding
or throughhard stop
." -
"The Boolean variable
InCW
, denotes the pedestrian’s crossing activity:InCW=1
from the time the pedestrian started moving laterally to cross until they completely crossed theAV
lane, andInCW=0
otherwise." - That means the baseline controller does not react at all to "non-crossing" cases since it never sees the pedestrian crossing laterally.
- Finite state machine (
-
"It can be seen that the
B-MPC
is more aggressive, efficient, and comfortable than the baseline as observed through the higher average velocity, lower average acceleration effort, and lower average jerk respectively."
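A minimal sketch of how such a gap-acceptance transition could look inside the hybrid automaton; the logistic form and its coefficients are my own assumptions for illustration, not the model fitted in the paper:

```python
import numpy as np

# Hypothetical parameters: the paper fits its own gap-acceptance model from data.
ALPHA, BETA = 1.2, -4.0   # logistic coefficients (assumed, for illustration only)

def p_accept_gap(time_gap_s: float) -> float:
    """Probability that a waiting pedestrian accepts the available time gap and starts crossing."""
    return 1.0 / (1.0 + np.exp(-(ALPHA * time_gap_s + BETA)))

def step_pedestrian(state: str, time_gap_s: float, rng=np.random) -> str:
    """One probabilistic transition of a simplified crossing automaton
    (states: Approach, Wait, Cross, Walk_away)."""
    if state == "Wait":
        return "Cross" if rng.random() < p_accept_gap(time_gap_s) else "Wait"
    if state == "Approach":
        return "Wait"          # reached the curb (deterministic here for brevity)
    if state == "Cross":
        return "Walk_away"     # finished crossing (again simplified)
    return state

# Example: a 4-second gap is accepted with probability p_accept_gap(4.0).
print(p_accept_gap(4.0), step_pedestrian("Wait", 4.0))
```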
"Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions"
-
[
2019
] [📝] [ 🎓Caltech
] [🚗zoox
] -
[
multimodal
,probabilistic
,1-shot
]
Click to expand
The idea is to encode a history of world states (both static and dynamic ) and semantic map information in a unified, top-down spatial grid . This allows to use a deep convolutional architecture to model entity dynamics, entity interactions, and scene context jointly. The authors found that temporal convolutions achieved better performance and significantly faster training than an RNN structure. Source. |
Three 1-shot models are proposed. Top: parametric and continuous. Bottom: non-parametric and discrete (trajectories are here sampled for display from the state probabilities). Source. |
Authors: Hong, J., Sapp, B., & Philbin, J.
-
Motivations: A prediction method of future distributions of entity
state
that is:1-
Probabilistic.-
"A single most-likely point estimate isn't sufficient for a safety-critical system."
-
"Our perception module also gives us
state
estimation uncertainty in the form of covariance matrices, and we include this information in our representation via covariance norms."
-
2-
Multimodal.- It is important to cover a diversity of possible implicit actions an entity might take (e.g., which way through a junction).
3-
One-shot.- It should directly predict distributions of future states, rather than a single point estimate at each future timestep.
-
"For efficiency reasons, it is desirable to predict full trajectories (time sequences of
state
distributions) without iteratively applying a recurrence step." -
"The problem can be naturally formulated as a sequence-to-sequence generation problem. [...] We chose
ℓ=2.5s
of past history, and predict up tom=5s
in the future." -
"
DESIRE
andR2P2
address multimodality, but both do so via1
-step stochastic policies, in contrast to ours which directly predicts a time sequence of multimodal distributions. Such policy-based methods require bothfuture roll-out
andsampling
to obtain a set of possible trajectories, which has computational trade-offs to ourone-shot
feed-forward approach."
-
How to model entity interactions?
1-
Implicitly: By encoding them as surrounding dynamic context.2-
Explicitly: For instance,SocialLSTM
pools hidden temporalstate
between entity models.- Here, all surrounding entities are encoded within a specific tensor.
-
Input:
- A stack of
2d
-top-view-grids. Each frame has128×128
pixels, corresponding to50m×50m
.- For instance, the dynamic context is encoded in a
RGB
image with unique colours corresponding to each element type. - The
state
history of the considered entity is encoded in a stack of binary maps.- One could have use only
1
channel and play with the colour to represent the history.
- One could have use only
- For instance, the dynamic context is encoded in a
-
"Note that through rendering, we lose the true graph structure of the road network, leaving it as a modeling challenge to learn valid road rules like legal traffic direction, and valid paths through a junction."
- Cannot it just be coded in another tensor?
- A stack of
-
Output. Three approaches are proposed:
1-
Continuous + parametric representations.1.1.
A Gaussian distribution is regressed per future timestep.1.2.
Multi-modal Gaussian Regression (GMM-CVAE
- A set of Gaussians is predicted by sampling from a categorical latent variable.
- If not enhanced, this method is naive and suffers from exchangeability and mode collapse.
-
"In general, our mixture of sampled Gaussian trajectories underperformed our other proposed methods; we observed that some samples were implausible."
- One could have added an auxiliary loss that penalizes off-road predictions, as in the improved version of
CoverNet
2-
Discrete + non-parametric representations.- Predict occupancy grid maps.
- A grid is produced for each future modelled timestep.
- Each grid location holds the probability of the corresponding output
state
. - For comparison, trajectories are extracted via some trajectory sampling procedure.
- Predict occupancy grid maps.
-
Against non-learnt baselines:
-
"Interestingly, both
Linear
andIndustry
baselines performed worse relative to our methods at larger time offsets, but better at smaller offsets. This can be attributed to the fact that predicting near futures can be accurately achieved with classical physics (which both baselines leverage) — more distant future predictions, however, require more challenging semantic understanding."
-
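A simplified sketch of the top-down encoding of one entity's `state` history described above (points instead of full oriented boxes); the `128x128` pixels over `50m x 50m` follow the entry, everything else is an assumption:

```python
import numpy as np

GRID = 128          # pixels per side (128x128 covering 50m x 50m, as in the entry above)
METERS = 50.0
PX_PER_M = GRID / METERS

def to_pixel(xy_m, ego_xy_m):
    """Convert a world position (metres) to top-down grid indices centred on the ego vehicle."""
    dx, dy = xy_m[0] - ego_xy_m[0], xy_m[1] - ego_xy_m[1]
    col = int(GRID / 2 + dx * PX_PER_M)
    row = int(GRID / 2 - dy * PX_PER_M)
    return row, col

def render_history(history_xy, ego_xy):
    """Stack one binary 128x128 map per past timestep for the considered entity."""
    maps = np.zeros((len(history_xy), GRID, GRID), dtype=np.uint8)
    for t, xy in enumerate(history_xy):
        r, c = to_pixel(xy, ego_xy)
        if 0 <= r < GRID and 0 <= c < GRID:
            maps[t, r, c] = 1   # a real renderer would draw the full oriented bounding box
    return maps

# Example: 5 past positions of one entity, ego vehicle at the origin.
hist = [(-2.0, 10.0), (-1.5, 8.0), (-1.0, 6.0), (-0.5, 4.0), (0.0, 2.0)]
grids = render_history(hist, ego_xy=(0.0, 0.0))
print(grids.shape)   # (5, 128, 128)
```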
"Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios"
-
[
2019
] [📝] [ 🎓TUM
] [🚗BMW
] -
[
probabilistic predictions
,multi-modality
]
Click to expand
The network produces an action distribution for the next time-step. The features are function of one selected driver's route intention (such as turning left or right ) and the map . Redundant features can be pruned to reduce the complexity of the model: even with as few as 5 features (framed in blue), it is possible for the network to learn basic behaviour models that achieve lower losses than both baseline recurrent networks. Source. |
Right: At each time step, the confidence in the action changes. Left: How to compute some loss from the predicted variance if the ground-truth is a point-estimate? The predicted distribution can be evaluated at ground-truth, forming a likelihood. The negative-log-likelihood becomes the objective to minimize. The network can output high variance if it is not sure. But a regularization term deters it from being too uncertain. Source. |
Authors: Schulz, J., Hubmann, C., Morin, N., Löchner, J., & Burschka, D.
-
Motivations:
1-
Learn a driver model that is:- Probabilistic. I.e. capture multi-modality and uncertainty in the predicted low-level actions.
- Interaction-aware. Well, here the
actions
of surrounding vehicles are ignored, but theirstates
are considered - "Markovian", i.e. that makes
1
-step prediction from the currentstate
, assuming independence of previousstate
s /action
s.
2-
Simplicity + lightweight.- This model is intended to be integrated as a probabilistic transition model into sampling-based algorithms, e.g.
particle filtering
. - Applications include:
1-
Forward simulation-based interaction-awareplanning
algorithms, e.g.Monte Carlo tree search
.2-
Driver intention estimation and trajectoryprediction
, here aDBN
example.
- Since samples are plenty, runtime should be kept low. And therefore, nested net structures such as
DESIRE
are excluded.
- This model is intended to be integrated as a probabilistic transition model into sampling-based algorithms, e.g.
- Ingredients:
- Feedforward net predicting
steering
andacceleration
distributions. - Enable multi-modality by building one
input vector
, and making one prediction, per possibleroute
.
- Feedforward net predicting
-
About the model:
- Input: a set of features built from:
- One route intention. For instance, the distances of both agents to
entry
andexit
of the related conflict areas are computed. - The map.
- The kinematic state (
pos
,heading
,vel
) of the2
closest agents.
- One route intention. For instance, the distances of both agents to
- Output:
steering
andacceleration
distributions, modelled as Gaussian:mean
andstd
are estimated (cov
=0
).- Not the next
state
!!-
"Instead of directly learning a
state
transition model, we restrict the neural network to learn a2
-dimensionalaction
distribution comprisingacceleration
andsteering angle
."
-
- Practical implication when building the dataset from real data: the
action
s of observed vehicles are unknown, but inferred using aninverse bicycle model
.
- Not the next
- Using the model at run time:
1-
Sample the possibleroutes
.2-
For each route:- Start with one
state
. - Get one
action
distribution. Note that the uncertainty can change at each step. - Sample (
acc
,steer
) from this distribution. - Move to next
state
. - Repeat.
- Start with one
- Input: a set of features build from:
-
Issue with the accumulation of
1
-step to form long-term predictions:- As in vanilla imitation learning, it suffers from distribution shift resulting from the accumulating errors.
-
"If this error is too high, the features determined during forward simulation are not represented within the training data anymore."
- A
DAgger
-like solution could be considered.
-
About conditioning on a driver's route intention:
- Without conditioning, one would have to pack all the road information into the input (how many routes to describe?) and expect multiple trajectories to be produced (how many output heads?). Tricky.
- Conditioning offers two advantages:
-
"The learning algorithm does not have to cope with the multi-modality induced by different route options. The varying number of possible
routes
(depending on the road topology) is handled outside of the neural network." - It also allows to define (and detail) relevant features along the considered path: upcoming
road curvature
or longitudinal distances tostop lines
.
-
- Limit to this approach (again related to the "off-distribution" issue):
-
"When enumerating all possible routes and running a forward simulation for each of the conditioned models, there might exist route candidates that are so unlikely that they have never been followed in the training data. Thus their features may result in unreasonable actions during inference, as the network only learns what actions are reasonable given a route, but not which routes are reasonable given a situation."
-
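A small sketch of the run-time usage described in this entry (sample a route, then repeatedly sample actions from the predicted Gaussians and propagate a kinematic bicycle model); the wheelbase, the time step and the `dummy_action_net` stand-in are assumptions:

```python
import numpy as np

WHEELBASE = 2.7   # [m], assumed
DT = 0.2          # [s] prediction step, assumed

def bicycle_step(state, accel, steer):
    """Kinematic bicycle model: state = (x, y, heading, speed)."""
    x, y, yaw, v = state
    x += v * np.cos(yaw) * DT
    y += v * np.sin(yaw) * DT
    yaw += v / WHEELBASE * np.tan(steer) * DT
    v = max(0.0, v + accel * DT)
    return (x, y, yaw, v)

def dummy_action_net(features):
    """Stand-in for the learnt feed-forward model: returns the means and the standard
    deviations of the (acceleration, steering) Gaussians."""
    return (0.5, 0.0), (0.3, 0.02)

def rollout(state, route_features, action_net, horizon=25, rng=np.random):
    """Forward-simulate one route hypothesis by repeatedly sampling actions."""
    traj = [state]
    for _ in range(horizon):
        (mu_a, mu_s), (std_a, std_s) = action_net(route_features + list(state))
        accel = rng.normal(mu_a, std_a)
        steer = rng.normal(mu_s, std_s)
        state = bicycle_step(state, accel, steer)
        traj.append(state)
    return traj

# One rollout per sampled route intention; repeating this yields multi-modal predictions.
print(rollout((0.0, 0.0, 0.0, 8.0), route_features=[15.0, 30.0], action_net=dummy_action_net)[-1])
```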
"Learning Predictive Models From Observation and Interaction"
-
[
2019
] [📝] [🎞️] [ 🎓University of Pennsylvania
,Stanford University
,UC Berkeley
] [ 🚗Honda
] -
[
visual prediction
,domain transfer
,nuScenes
,BDD100K
]
Click to expand
The idea is to learn a latent representation z that corresponds to the true action . The model can then perform joint training on the two kinds of data: it optimizes the likelihood of the interaction data, for which the action s are available, and observation data, for which the action s are missing. Hence the visual predictive model can predict the next frame xt+1 conditioned on the current frame xt and action learnt representation zt . Source. |
The visual prediction model is trained using two driving sets: action -conditioned videos from Boston and action -free videos from the Singapore. Frames from both subsets come from BDD100K and nuScenes datasets. Source. |
Authors: Schmeckpeper, K., Xie, A., Rybkin, O., Tian, S., Daniilidis, K., Levine, S., & Finn, C.
- On concrete industrial use-case:
-
"Imagine that a self-driving car company has data from a fleet of cars with sensors that record both
video
and the driver’sactions
in one city, and a second fleet of cars that only record dashboardvideo
, withoutaction
s, in a second city." -
"If the goal is to train an
action
-conditioned model that can be utilized to predict the outcomes of steeringaction
s, our method allows us to train such a model using data from both cities, even though only one of them hasaction
s."
-
- Motivations (mainly for robotics, but also AD):
- Generate predictions for complex tasks and new environments, without costly expert demonstrations.
- More precisely, learn an
action
-conditioned video predictive model from two kinds of data:1-
passive observations: [x0
,a1
,x1
...aN
,xN
].- Videos of another agent, e.g. a human, might show the robot how to use a tool.
- Observations represent a powerful source of information about the world and how actions lead to outcomes.
- A learnt model could also be used for
planning
andcontrol
, i.e. to plan coordinated sequences of actions to bring about desired outcomes. - But may suffer from large domain shifts.
2-
active interactions: [x0
,x1
...xN
].- Usually more expensive.
- Two challenges:
1-
Observations are not annotated with suitableaction
s: e.g. only access to the dashcam, not thethrottle
for instance.- In other words,
action
s are only observed in a subset of the data. - The goal is to learn from videos without
action
s, allowing it to leverage videos of agents for which the actions are unknown (unsupervised manner).
- In other words,
2-
Shift in the "embodiment" of the agent: e.g. robots' arms and humans' ones have physical differences.- The goal is to bridge the gap between the two domains (e.g.,
human arms
vs.robot arms
).
- The goal is to bridge the gap between the two domains (e.g.,
- What is learnt?
p
(xc+1:T
|x1:c
,a1:T
)- I.e. prediction of future frames conditioned on a set of
c
context frames and sequence of actions.
- What tests?
1-
Different environment within the same underlying dataset: driving inBoston
andSingapore
.2-
Same environment but different embodiment:humans
androbots
manipulate objects with different arms.
- What is assessed?
1-
Prediction quality (AD
test).2-
Control performance (robotics
test).
"Deep Learning-based Vehicle Behaviour Prediction For Autonomous Driving Applications: A Review"
-
[
2019
] [📝] [ 🎓University of Warwick
] [ 🚗Jaguar Land Rover
] -
[
multi-modality prediction
]
Click to expand
The authors propose a new classification of behavioural prediction methods. Only deep learning approaches are considered and physics-based approaches are excluded. The criteria are about the input, the output and the deep learning method. Source. |
First criterion is about the input: What is the prediction based on? Important is to capture road structure and interactions while staying flexible in the representation (e.g. describe different types of intersections and work with varying numbers of target vehicles and surrounding vehicles ). Partial observability should be considered by design. Source. |
Second criterion is about the output: What is predicted? Important is to propagate the uncertainty from the input and consider multiple options (multi-modality). Therefore to reason with probabilities. Bottom - why multi-modality is important. Source. |
Authors: Mozaffari, S., Al-Jarrah, O. Y., Dianati, M., Jennings, P., & Mouzakitis, A.
- One mentioned review: (Lefèvre et al.) classifies vehicle (behaviour) prediction models into three groups:
1-
physics
-based- Use dynamic or kinematic models of vehicles, e.g. a constant velocity (
CV
) Kalman Filter model.
- Use dynamic or kinematic models of vehicles, e.g. a constant velocity (
2-
manoeuvre
-based- Predict vehicles' manoeuvres, i.e. a classification problem from a defined set.
3-
interaction
-aware- Consider interaction of vehicles in the input.
- About the terminology:
- "Target Vehicles" (
TV
) are vehicles whose behaviour we are interested in predicting. - The other are "Surrounding Vehicles" (
SV
). - The "Ego Vehicle" (
EV
) can be also considered as anSV
, if it is close enough toTV
s.
- "Target Vehicles" (
- Here, the authors ignore the
physics
-based methods and propose three criteria for comparison:1-
Input.- Track history of
TV
only. - Track history of
TV
andSV
s. - Simplified bird’s eye view.
- Raw sensor data.
- Track history of
2-
Output.- Intention
class
: From a set of pre-defined discrete classes, e.g.go straight
,turn left
, andturn right
. - Unimodal
trajectory
: Usually the one with highest likelihood or the average). - Intention-based
trajectory
: Predict the trajectory that corresponds to the most probable intention (first case). - Multimodal
trajectory
: Combine the previous ones. Two options, depending if the intention set is fixed or dynamically learnt:static
intention set: predict for each member of the set (an extension to intention-based trajectory prediction approaches).dynamic
intention set: due to dynamic definition of manoeuvres, they are prone to converge to a single manoeuvre or not being able to explore all the existing manoeuvres.
- Intention
3-
In-between (deep learning method).RNN
are used because of their temporal feature extracting power.CNN
are used for their spatial feature extracting ability (especially with bird's eye views).
- Important considerations for behavioural prediction:
- Traffic rules.
- Road geometry.
- Multimodality: there may exist more than one possible future behaviour.
- Interaction.
- Uncertainty: both
aleatoric
(measurement noise) andepistemic
(partial observability). Hence the prediction should be probabilistic. - Prediction horizon: approaches can serve different purposes based on how far in the future they predict (
short-term
orlong-term
future motion).
- Two methods I would like to learn more about:
social pooling
layers, e.g. used by (Deo & Trivedi, 2019):-
"A social tensor is a spatial grid around the target vehicle that the occupied cells are filled with the processed temporal data (e.g.,
LSTM
hidden state value) of the corresponding vehicle. It contains both the temporal dynamic of vehicles represented and spatial inter-dependencies among them."
-
graph
neural networks, e.g. (Diehl et al., 2019) or (Li et al., 2019):- Graph Convolutional Network (
GCN
). - Graph Attention Network (
GAT
).
- Graph Convolutional Network (
- Comments:
- Contrary to the object detection task, there is no benchmark for systematically evaluating previous studies on vehicle behaviour prediction.
- Urban scenarios are excluded in the comparison since
NGSIM I-80
andUS-101 highway
driving datasets are used. - Maybe the
INTERACTION Dataset
could be used.
- Urban scenarios are excluded in the comparison since
- The authors suggest embedding domain knowledge in the prediction, and call for practical considerations (industry-supported research).
-
"Factors such as environment conditions and set of traffic rules are not directly inputted to the prediction model."
-
"Practical limitations such as sensor impairments and limited computational resources have not been fully taken into account."
-
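A minimal sketch of the social tensor idea quoted above; the `13x3` grid and the cell sizes are assumptions borrowed from typical social-pooling setups, not necessarily those of the cited papers:

```python
import numpy as np

def social_tensor(tv_xy, sv_states, grid_shape=(13, 3), cell_m=(5.0, 3.7), hidden_dim=64):
    """Fill a spatial grid centred on the target vehicle (TV) with the LSTM hidden
    states of the surrounding vehicles (SVs) that occupy each cell.
    `sv_states` is a list of (x, y, hidden_vector) tuples in world coordinates."""
    rows, cols = grid_shape
    tensor = np.zeros((rows, cols, hidden_dim))
    for x, y, h in sv_states:
        r = int(rows / 2 + (y - tv_xy[1]) / cell_m[0])
        c = int(cols / 2 + (x - tv_xy[0]) / cell_m[1])
        if 0 <= r < rows and 0 <= c < cols:
            tensor[r, c] = h        # an occupied cell holds that vehicle's temporal encoding
    return tensor

# Example: two surrounding vehicles with dummy 64-d LSTM hidden states.
svs = [(3.7, 10.0, np.ones(64)), (-3.7, -5.0, np.full(64, 0.5))]
print(social_tensor((0.0, 0.0), svs).shape)   # (13, 3, 64)
```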
"Multi-Modal Simultaneous Forecasting of Vehicle Position Sequences using Social Attention"
-
[
2019
] [📝] [ 🎓Ecole CentraleSupelec
] [ 🚗Renault
] -
[
multi-modality prediction
,attention mechanism
]
Click to expand
Two multi-head attention layers are used to account for social interactions between all vehicles. They are combined with LSTM layers to offer joint, long-range and multi-modal forecasts. Source. |
Source. |
Authors: Mercat, J., Gilles, T., Zoghby, N. El, Sandou, G., Beauvois, D., & Gil, G. P.
- Previous work:
"Social Attention for Autonomous Decision-Making in Dense Traffic"
by (Leurent, & Mercat, 2019), detailed on this page as well. - Motivations:
1-
joint
- Considering interactions between all vehicles.2-
flexible
- Independant of the number/order of vehicles.3-
multi-modal
- Considering uncertainty.4-
long-horizon
- Predicting over a long range. Here5s
on simple highway scenarios.5-
interpretable
- E.g. using the social attention coefficients.6-
long distance interdependencies
- The authors decide to exclude the spatial grid representations that "limit the zone of interest to a predefined fixed size and the spatial relation precision to the grid cell size".
- Main idea: Stack
LSTM
layers withsocial
multi-head
attention
layers.- More precisely, the model is broken into four parts:
1-
AnEncoder
processes the sequences of all vehicle positions (no information aboutspeed
,orientation
,size
orblinker
).2-
ASelf-attention
layer captures interactions between all vehicles using "dot product attention". It has "multiple head", each specializing on different interaction patterns, e.g."closest front vehicle in any lane"
.3-
APredictor
, usingLSTM
cells, forecasts the positions.- A second multi-head self-attention layer is placed here.
4-
A finalDecoder
produces sequences of Gaussians mixtures for each vehicle.-
"What is forecast is not a mixture of trajectory density functions but a sequence of position mixture density functions. There is a dependency between forecasts at time
tk
and at timetk+1
but no explicit link between the modes at those times."
-
- More precisely, the model is broken into four parts:
- Two quotes about multi-modality prediction:
-
"When considering multiple modes, there is a challenging trade-off to find between anticipating a wide diversity of modes and focusing on realistic ones".
-
"
VAE
andGANs
are only able to generate an output distribution with sampling and do not express aPDF
".
-
- Baselines used to compare the presented "Social Attention Multi-Modal Prediction" approach:
- Constant velocity (
CV
), that uses Kalman filters (hence single modality). - Convolutional Social Pooling (
CSP
), that uses convolutional social pooling on a coarse spatial grid. Six mixture components are used. - Graph-based Interaction-aware Trajectory Prediction (
GRIP
), that uses aspatial
andtemporal
graph representation of the scene.
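A toy sketch of the LSTM-encoder + multi-head self-attention combination described in this entry, using off-the-shelf PyTorch layers (the dimensions are arbitrary, and the real model stacks several such layers):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
encoder = nn.LSTM(input_size=2, hidden_size=embed_dim, batch_first=True)
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Dummy scene: 6 vehicles, 10 past (x, y) positions each.
positions = torch.randn(6, 10, 2)
_, (h, _) = encoder(positions)            # one encoding per vehicle
vehicles = h[-1].unsqueeze(0)             # shape (1, 6, embed_dim)

# Every vehicle attends to every other vehicle; the attention weights are the
# interpretable "social" coefficients mentioned above.
context, weights = attention(vehicles, vehicles, vehicles)
print(context.shape, weights.shape)       # (1, 6, 64) (1, 6, 6)
```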
"MultiPath : Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction"
-
[
2019
] [📝] [ 🚗Waymo
] -
[
anchor
,multi-modality prediction
,weighted prediction
,mode collapse
]
Click to expand
Source. |
A discrete set of intents is modelled as a set of K=3 anchor trajectories. Uncertainty is assumed to be unimodal given intent (here 3 intents are considered) while control uncertainty is modelled with a Gaussian distribution dependent on each waypoint state of an anchor trajectory. Such an example shows that modelling multiple intents is important. Source. |
Authors: Chai, Y., Sapp, B., Bansal, M., & Anguelov, D.
- One idea: "Anchor Trajectories".
- "Anchor" is a common idea in
ML
. Concrete applications of "anchor" methods forAD
includeFaster-RCNN
andYOLO
for object detections.- Instead of directly predicting the size of a bounding box, the
NN
predicts offsets from a predetermined set of boxes with particular height-width ratios. Those predetermined set of boxes are the anchor boxes. (explanation from this page).
- Instead of directly predicting the size of a bounding box, the
- One could therefore draw a parallel between the sizes of bounding boxes in
Yolo
and the shape of trajectories: they could be approximated with some static predetermined patterns and refined to the current context (the actual task of theNN
here).-
"After doing some clustering studies on ground truth labels, it turns out that most bounding boxes have certain height-width ratios." [explanation about Yolo from this page]
-
"Our trajectory anchors are modes found in our training data in state-sequence space via unsupervised learning. These anchors provide templates for coarse-granularity futures for an agent and might correspond to semantic concepts like
change lanes
, orslow down
." [from the presented paper]
-
- This idea also reminds me of the concept of
pre-defined templates
used for path planning.
- "Anchor" is a common idea in
- One motivation: model multiple intents.
- This contrasts with the numerous approaches which predict one single most-likely trajectory per agent, usually via supervised regression.
- The multi-modality is important since prediction is inherently stochastic.
- The authors distinguish between
intent uncertainty
andcontrol uncertainty
(conditioned on intent).
- The authors distinguish between
- A Gaussian Mixture Model (
GMM
) distribution is used to model both types of uncertainty.-
"At inference, our model predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at each time step."
-
- One risk when working with multi-modality: directly learning a mixture suffers from issues of "mode collapse".
- This issue is common in
GAN
where the generator starts producing limited varieties of samples. - The solution implemented here is to estimate the anchors a priori before fixing them to learn the rest of our parameters (as for
Faster-RCNN
andYolo
for instance).
- This issue is common in
- Second motivation: weight the several trajectory predictions.
- This contrasts with methods that randomly sample from a generative model (e.g.
CVAE
andGAN
), leading to an unweighted set of trajectory samples (not to mention the problem of reproducibility and analysis). - Here, a parametric probability distribution is directly predicted: p(
trajectory
|observation
), together with a compact weighted set of explicit trajectories which summarizes this distribution well.- This contrasts with methods that outputs a probabilistic occupancy grid.
- This contrasts with methods that randomly sample from a generative model (e.g.
- About the "top-down" representation, structured in a
3d
array:- The first
2
dimensions represent spatial locations in the top-down image -
"The channels in the depth dimension hold
static
and time-varying (dynamic
) content of a fixed number of previous time steps."- Static context includes
lane connectivity
,lane type
,stop lines
,speed limit
. - Dynamic context includes
traffic light states
over the past5
time-steps. - The previous positions of the different dynamic objects are also encoded in some depth channels.
- Static context includes
- The first
- One word about the training dataset.
- The model is trained via
imitation learning
by fitting the parameters to maximize the log-likelihood of recorded driving trajectories. -
"The balanced dataset totals
3.85 million
examples, contains5.75 million
agent trajectories and constitutes approximately200 hours
of (real-world) driving."
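A rough sketch of the anchor idea: cluster the training trajectories once to obtain fixed anchors, then let the network output a distribution over anchors plus per-waypoint offsets. The dummy data, `K` and the output shapes are arbitrary, not the paper's actual heads:

```python
import numpy as np
from sklearn.cluster import KMeans

# Dummy training set: 1000 expert trajectories, 12 future (x, y) waypoints each.
trajectories = np.random.randn(1000, 12, 2).cumsum(axis=1)

# 1- Obtain K anchor trajectories a priori (unsupervised), then freeze them.
K = 16
anchors = KMeans(n_clusters=K, n_init=10).fit(
    trajectories.reshape(len(trajectories), -1)
).cluster_centers_.reshape(K, 12, 2)

# 2- At inference the network outputs, for each agent:
#    - a softmax over the K anchors (intent uncertainty),
#    - per-waypoint offsets and covariances w.r.t. each anchor (control uncertainty).
logits = np.random.randn(K)                       # dummy network output
probs = np.exp(logits) / np.exp(logits).sum()
offsets = np.random.randn(K, 12, 2) * 0.1         # dummy regressed offsets
predicted_waypoint_means = anchors + offsets      # one Gaussian mean per anchor and timestep
print(probs.shape, predicted_waypoint_means.shape)   # (16,) (16, 12, 2)
```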
"SafeCritic: Collision-Aware Trajectory Prediction"
-
[
2019
] [📝] [ 🎓University of Amsterdam
] [ 🚗BMW
] -
[
Conditional GAN
]
Click to expand
The Generator predicts trajectories that are scored against two criteria: The Discriminator (as in GAN ) for accuracy (i.e. consistent with the observed inputs) and the Critic (the generator acts as an Actor) for safety . The random noise vector variable z in the Generator can be sampled from N (0 , 1 ) to sample novel trajectories. Source. |
Several features offered by the predictions of SafeCritic : accuracy, diversity, attention and safety. Source. |
Authors: van der Heiden, T., Nagaraja, N. S., Weiss, C., & Gavves, E.
- Main motivation:
-
"We argue that one should take into account
safety
, when designing a model to predict future trajectories. Our focus is to generate trajectories that are not justaccurate
but also lead to minimum collisions and thus aresafe
. Safe trajectories are different from trajectories that try to imitate the ground truth, as the latter may lead toimplausible
paths, e.g, pedestrians going through walls." - Hence the trajectory predictions of the Generator are evaluated against multiple criteria:
Accuracy
: The Discriminator checks if the prediction is coherent / plausible with the observation.Safety
: Some Critic predicts the likelihood of a future dynamic and static collision.
- A third loss term is introduced:
-
"Training the generator is harder than training the discriminator, leading to slow convergence or even failure."
- An additional auto-encoding loss to the ground truth is introduced.
- It should encourage the model to avoid trivial solutions and mode collapse, and should increase the diversity of future generated trajectories.
- The term
mode collapse
means that instead of suggesting multiple trajectory candidates (multi-modal
), the model restricts its prediction to only one instance.
-
-
- About
RL
:- The authors mentioned several terms related to
RL
, in particular they try to draw a parallel with
:-
"
GANs
resembleIRL
in that the discriminator learns the cost function and the generator represents the policy."
-
- I got the idea, but I honestly did not understand where it is implemented here. In particular, no
MDP
formulation is given.
- The authors mentioned several terms related to
- About attention mechanism:
-
"We rely on attention mechanism for spatial relations in the scene to propose a compact representation for modelling interaction among all agents [...] We employ an attention mechanism to prioritize certain elements in the latent state representations."
- The grid-like scene representation is shared by both the Generator and the Critic.
-
- About the baselines:
- I like the "related work" section which shortly introduces the state-of-the-art trajectory prediction models based on deep learning.
SafeCritic
takes inspiration from some of their ideas, such as:- Aggregation of past information about multiple agents in a recurrent model.
- Use of Conditional
GAN
to offer the possibility to also generate novel trajectory given observation via sampling (standardGANs
have no encoder).
- Incorporation of semantic visual features (extracted by deep networks) combined with an attention mechanism.
SocialGAN
,SocialLSTM
,Car-Net
,SoPhie
andDESIRE
are used as baselines.R2P2
andSocialAttention
are also mentioned.
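A heavily simplified sketch of how the three training signals (discriminator, critic, reconstruction) could be combined for the generator; the loss forms and the weights are my assumptions, not the exact objective of the paper:

```python
import torch

def generator_loss(disc_score, critic_collision_prob, pred_traj, gt_traj,
                   w_adv=1.0, w_safe=1.0, w_rec=1.0):
    """Combine the three signals described above (weights are assumptions):
    - adversarial term: fool the discriminator (non-saturating GAN loss),
    - safety term: minimise the critic's predicted collision probability,
    - reconstruction term: stay close to the ground truth, against mode collapse."""
    adv = -torch.log(disc_score + 1e-8).mean()
    safe = critic_collision_prob.mean()
    rec = torch.nn.functional.mse_loss(pred_traj, gt_traj)
    return w_adv * adv + w_safe * safe + w_rec * rec

# Dummy tensors just to show the call signature.
loss = generator_loss(torch.rand(8), torch.rand(8), torch.randn(8, 12, 2), torch.randn(8, 12, 2))
print(float(loss))
```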
"A Review of Tracking, Prediction and Decision Making Methods for Autonomous Driving"
- [
2019
] [📝] [ 🎓University of Iasi
]
Click to expand
One figure:
Classification of motion models based on three increasingly abstract levels - adapted from (Lefèvre, S., Vasquez. D. & Laugier C. - 2014). Source. |
Authors: Leon, F., & Gavrilescu, M.
- A reference to one white paper: "Safety first for automated driving" 2019 - from Aptiv, Audi, Baidu, BMW, Continental, Daimler, Fiat Chrysler Automobiles, HERE, Infineon, Intel and Volkswagen (alphabetical order). The authors quote some of the good practices about Interpretation and Prediction:
- Predict only a short time into the future (the further the predicted state is in the future, the less likely it is that the prediction is correct).
- Rely on physics where possible (a vehicle driving in front of the automated vehicle will not stop in zero time on its own).
- Consider the compliance of other road users with traffic rules.
- Miscellaneous notes about prediction:
- The authors point the need of high-level reasoning (the more abstract the feature, the more reliable it is long term), mentioning both "affinity" and "attention" mechanisms.
- They also call for jointly addressing vehicle motion modelling and risk estimation (criticality assessment).
- Gaussian Processes are found to be a flexible tool for modelling motion patterns and are compared to Markov Models for prediction.
- In particular, GP regressions have the ability to quantify uncertainty (e.g. occlusion).
-
"CNNs can be superior to LSTMs for temporal modelling since trajectories are continuous in nature, do not have complicated "state", and have high spatial and temporal correlations".
"Deep Predictive Autonomous Driving Using Multi-Agent Joint Trajectory Prediction and Traffic Rules"
Click to expand
One figure:
The framework consists of four modules: encoder module, interaction module, prediction module and control module. Source. |
Authors: Cho, K., Ha, T., Lee, G., & Oh, S.
- One previous work: "Learning-Based Model Predictive Control under Signal Temporal Logic Specifications" by (Cho & Ho, 2018).
- One term: "robustness slackness" for
STL
-formula.- The motivation is to solve dilemma situations (inherent to strict compliance when all rules cannot be satisfied) by disobeying certain rules based on their predicted degree of satisfaction.
- The idea is to filter out non-plausible trajectories in the prediction step to only consider valid prediction candidates during planning.
- The filter considers some "rules" such as
Lane keeping
andCollision avoidance of front vehicle
orSpeed limit
(I did not understand why they are equally considered). - These rules are represented by Signal Temporal Logic (
STL
) formulas.- Note:
STL
is an extension of Linear Temporal Logic (with boolean predicates and discrete-time) with real-time and real-valued constraints.
- Note:
- A metric can be introduced to measure how well a given signal (here, a trajectory candidate) satisfies a
STL
formula.- This is called "robustness slackness" and acts as a margin to satisfaction of
STL
-formula.
- This is called "robustness slackness" and acts as a margin to satisfaction of
- This enables a "control under temporal logic specification" as mentioned by the authors.
- Architecture
- Encoder module: The observed trajectories are fed to some
LSTM
whose internal state is used by the two subsequent modules. - Interaction module: To consider interaction, all
LSTM
states are concatenated (joint state) together with a feature vector of relative distances. In addition, a CVAE is used for multi-modality (several possible trajectories are generated) and capture interactions (I did not fully understand that point), as stated by the authors:-
"The latent variable
z
models inherent structure in the interaction of multiple vehicles, and it also helps to describe underlying ambiguity of future behaviours of other vehicles."
-
- Prediction module: Based on the
LSTM
states, the concatenated vector and the latent variable, both future trajectories and margins to the satisfaction of each rule are predicted. - Control module: An
MPC
optimizes the control of the ego car, deciding which rules should be prioritized based on the two predicted objects (trajectories and robustness slackness).
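A tiny sketch of the "robustness slackness" idea for two always-type rules: the margin to satisfaction is the worst-case slack over the prediction horizon. Real STL robustness also handles nested temporal operators, which is omitted here:

```python
import numpy as np

def robustness_speed_limit(speeds, v_max):
    """Robustness of 'always (speed <= v_max)': the worst margin over the horizon.
    Positive = satisfied with slack, negative = violated."""
    return np.min(v_max - np.asarray(speeds))

def robustness_safe_distance(gaps, d_min):
    """Robustness of 'always (gap >= d_min)' to the front vehicle."""
    return np.min(np.asarray(gaps) - d_min)

# A predicted trajectory candidate can then be kept, discarded, or have its rule
# deliberately relaxed, based on these margins.
candidate = {"speeds": [12.0, 13.5, 14.2], "gaps": [9.0, 7.5, 6.1]}
print(robustness_speed_limit(candidate["speeds"], v_max=14.0),   # -0.2 -> violated
      robustness_safe_distance(candidate["gaps"], d_min=5.0))    # 1.1 -> satisfied
```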
"An Online Evolving Framework for Modeling the Safe Autonomous Vehicle Control System via Online Recognition of Latent Risks"
-
[
2019
] [📝] [ 🎓Ohio State University
] [ 🚗Ford
] -
[
MDP
,action-state transitions matrix
,SUMO
,risk assessment
]
Click to expand
One figure:
Both the state space and the transition model are adapted online, offering two features: prediction about the next state and detection of unknown (i.e. risky ) situations. Source. |
Authors: Han, T., Filev, D., & Ozguner, U.
- Motivation
- "Rule-based and supervised-learning methods cannot recognize unexpected situations so that the AV controller cannot react appropriately under unknown circumstances."
- Based on their previous work on RL “Highway Traffic Modeling and Decision Making for Autonomous Vehicle Using Reinforcement Learning” by (You, Lu, Filev, & Tsiotras, 2018).
- Main ideas: Both the state space and the transition model (here discrete state space so transition matrices) of an MDP are adapted online.
- I understand it as trying to learn the transition model (experience is generated using
SUMO
), hence to some extent going toward model-based RL. - The motivation is to assist any AV control framework with a so-called "evolving Finite State Machine" (
e
-FSM
).- By identifying state-transitions precisely, the future states can be predicted.
- By determining states uniquely (using online-clustering methods) and recognizing the state consistently (expressed by a probability distribution), initially unexpected dangerous situations can be detected.
- It reminds me of some ideas about risk assessment discussed during IV19: the discrepancy between the expected outcome and the observed outcome is used to quantify risk, i.e. the surprise or misinterpretation of the current situation.
- I understand it as trying to learn the transition model (experience is generated using
- Some concerns:
- "The dimension of transition matrices should be expanded to represent state-transitions between all existing states"
- What happens when the scenario gets more complex than the presented "simple car-following" and the state space (treated as discrete) becomes huge?
- In addition, "the total number of transition matrices is identical to the total number of actions".
- Even for the simple example, the acceleration command was sampled into
17
bins. Continuous action spaces are not an option.
"A Driving Intention Prediction Method Based on Hidden Markov Model for Autonomous Driving"
-
[
2019
] [📝] [ 🎓IEEE
] -
[
HMM
,Baum-Welch algorithm
,forward algorithm
]
Click to expand
One figure:
Source. |
Authors: Liu, S., Zheng, K., Zhao, L., & Fan, P.
- One term: "mobility feature matrix"
- The recorded data (e.g. absolute positions, timestamps ...) are processed to form the mobility feature matrix (e.g. speed, relative position, lateral gap in lane ...).
- Its size is
T × L × N
:T
time steps,L
vehicles,N
types of mobility features. - In the discrete characterization, this matrix is then turned into a set of observations using K-means clustering.
- In the continuous case, mobility features are modelled as Gaussian mixture models (GMMs).
- This work implements HMM concepts presented in my project Educational application of Hidden Markov Model to Autonomous Driving.
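For reference, a minimal forward algorithm over a discretised observation sequence, as used to recognize the driving intention online (toy numbers, not the paper's fitted model):

```python
import numpy as np

def forward(pi, A, B, observations):
    """HMM forward algorithm: returns the likelihood of the observation sequence
    and the filtered state distribution (belief over driving intentions)."""
    alpha = pi * B[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ A) * B[:, obs]
    return alpha.sum(), alpha / alpha.sum()

# Toy example with 2 hidden manoeuvres (e.g. lane-keep, lane-change) and 3
# discretised observation symbols (e.g. K-means cluster indices of the mobility features).
pi = np.array([0.8, 0.2])                         # initial intention distribution
A = np.array([[0.9, 0.1], [0.3, 0.7]])            # intention transition matrix
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])  # observation model
likelihood, belief = forward(pi, A, B, observations=[0, 1, 2, 2])
print(likelihood, belief)
```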
"Online Risk-Bounded Motion Planning for Autonomous Vehicles in Dynamic Environments"
-
[
2019
] [📝] [ 🎓MIT
] [ 🚗Toyota
] -
[
intention-aware planning
,manoeuvre-based motion prediction
,POMDP
,probabilistic risk assessment
,CARLA
]
Click to expand
One figure:
Source. |
Authors: Huang, X., Hong, S., Hofmann, A., & Williams, B.
- One term: "Probabilistic Flow Tubes" (
PFT
)- A motion representation used in the "Motion Model Generator".
- Instead of using hand-crafted rules for the transition model, the idea is to learn human behaviours from demonstration.
- The inferred models are encoded with PFTs and are used to generate probabilistic predictions for both manoeuvre (long-term reasoning) and motion of the other vehicles.
- The advantage of belief-based probabilistic planning is that it can avoid over-conservative behaviours while offering probabilistic safety guarantees.
- Another term: "Risk-bounded POMDP Planner"
- The uncertainty in the intention estimation is then propagated to the decision module.
- Some notion of risk, defined as the probability of collision, is evaluated and considered when taking actions, leading to the introduction of a "chance-constrained POMDP" (
CC-POMDP
). - The online solver uses a heuristic-search algorithm, Risk-Bounded AO* (
RAO*
), takes advantage of the risk estimation to prune the over-risky branches that violate the risk constraints and eventually outputs a plan with a guarantee over the probability of success.
- One quote (this could apply to many other works):
"One possible future work is to test our work in real systems".
"Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios"
-
[
planning-based motion prediction
,manoeuvre-based motion prediction
]
Click to expand
Author: Sierra Gonzalez, D.
-
Prediction techniques are often classified into three types:
physics-based
manoeuvre-based
(andgoal-based
).interaction-aware
-
As I understood, the main idea here is to combine prediction techniques (and their advantages).
- The driver-models (i.e. the reward functions previously learnt with IRL) can be used to identify the most likely, risk-aversive, anticipatory manoeuvres. This is called the
model-based
prediction by the author since it relies on one model.- But relying only on driver models to predict the behaviour of surrounding traffic might fail to predict dangerous manoeuvres.
- As stated, "the model-based method is not a reliable alternative for the short-term estimation of behaviour, since it cannot predict dangerous actions that deviate from what is encoded in the model".
- One solution is to add a term that represents how the observed movement of the target matches a given maneuver.
- In other words, to consider the noisy observation of the dynamics of the targets and include these so-called
dynamic evidence
into the prediction.
- The driver-models (i.e. the reward functions previously learnt with IRL) can be used to identify the most likely, risk-aversive, anticipatory manoeuvres. This is called the
-
Usage:
- The resulting approach is used in the probabilistic filtering framework to update the belief in the POMDP and in its rollout (to bias the construction of the history tree towards likely situations given the state and intention estimations of the surrounding vehicles).
- It improves the inference of manoeuvres, reducing rate of false positives in the detection of
lane change
manoeuvres and enables the exploration of situations in which the surrounding vehicles behave dangerously (not possible if relying on safe generative models such asIDM
).
-
One quote about this combination:
"This model mimics the reasoning process of human drivers: they can guess what a given vehicle is likely to do given the situation (the model-based prediction), but they closely monitor its dynamics to detect deviations from the expected behaviour".
-
One idea: use this combination for risk assessment.
- As stated, "if the intended and expected maneuver of a vehicle do not match, the situation is classified as dangerous and an alert is triggered".
- This is an important concept of risk assessment I could identify at IV19: a situation is dangerous if there is a discrepancy between what is expected (given the context) and what is observed.
-
One term: "Interacting Multiple Model" (
IMM
), used as baseline in the comparison.- The idea is to consider a group of motion models (e.g.
lane keeping with CV
,lane change with CV
) and continuously estimate which of them captures more accurately the dynamics exhibited by the target. - The final predictions are produced as a weighted combination of the individual predictions of each filter.
IMM
belongs to the physics-based predictions approaches and could be extended formanoeuvre inference
(called dynamics matching). It is often used to maintain the beliefs and guide the observation sampling in POMDP.- But the issue is that IMM completely disregards the interactions between vehicles.
"Decision making in dynamic and interactive environments based on cognitive hierarchy theory: Formulation, solution, and application to autonomous driving"
-
[
2019
] [📝] [ 🎓University of Michigan
] -
[
level-k game theory
,cognitive hierarchy theory
,interaction modelling
,interaction-aware decision making
]
Click to expand
Authors: Li, S., Li, N., Girard, A., & Kolmanovsky, I.
-
One concept:
cognitive hierarchy
.- Other drivers are assumed to follow some "cognitive behavioural models", parametrized with a so called "cognitive level"
σ
. - The goal is to obtain and maintain belief about
σ
based on observation in order to optimally respond (using anMPC
). - Three levels are considered:
- level-
0
: driver that treats other vehicles on road as stationary obstacles. - level-
1
: cautious/conservative driver. - level-
2
: aggressive driver.
- level-
- Other drivers are assumed to follow some "cognitive behavioural models", parametrized with a so called "cognitive level"
-
One quote about the "cognitive level" of human drivers:
"Humans are most commonly level-1 and level-2 reasoners".
Related works:
-
Li, S., Li, N., Girard, A. & Kolmanovsky, I. [2019]. "Decision making in dynamic and interactive environments based on cognitive hierarchy theory, Bayesian inference, and predictive control" [pdf]
-
Li, N., Oyler, D., Zhang, M., Yildiz, Y., Kolmanovsky, I., & Girard, A. [2016]. "Game-theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems" [pdf]
-
"If a driver assumes that the other drivers are level-
1
and takes an action accordingly, this driver is a level-2
driver". - Use RL with hierarchical assignment to learn the policy:
- First, the
π-0
(for level-0
) is learnt for the ego-agent. - Then
π-1
with all the other participants followingπ-0
. - Then
π-2
...
- First, the
- Action masking: "If a car in the left lane is in a parallel position, the controlled car cannot change lane to the left".
- "The use of these hard constrains eliminates the clearly undesirable behaviours better than through penalizing them in the reward function, and also increases the learning speed during training"
-
-
Ren, Y., Elliott, S., Wang, Y., Yang, Y., & Zhang, W. [2019]. "How Shall I Drive ? Interaction Modeling and Motion Planning towards Empathetic and Socially-Graceful Driving" [pdf] [code]
Source. |
Source. |
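A skeleton of the hierarchical level-k training described in this entry (train the level-0 policy, then train each level-k agent against frozen level-(k-1) opponents); `train_policy` is a placeholder for whatever RL algorithm is used:

```python
def train_level_k_policies(max_level, train_policy, level0_policy):
    """Iteratively build level-k policies: the level-k agent is trained while all
    other agents act with the (frozen) level-(k-1) policy. `train_policy(opponent)`
    stands in for any RL training routine and must return the learnt policy."""
    policies = {0: level0_policy}                       # e.g. treat others as static obstacles
    for k in range(1, max_level + 1):
        policies[k] = train_policy(opponent=policies[k - 1])
    return policies

# Toy stand-ins so that the sketch runs; a real setup would train e.g. a DQN here.
level0 = lambda obs: "keep_speed"
fake_trainer = lambda opponent: (lambda obs: ("respond_to", opponent(obs)))
policies = train_level_k_policies(2, fake_trainer, level0)
print(policies[2]("some observation"))
```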
"Formalizing Traffic Rules for Machine Interpretability"
-
[
2020
] [📝] [ 🚗Fortiss
] -
[
LTL
,Vienna convention
,SPOT
,INTERACTION Dataset
]
Click to expand
Top left: Different techniques on how to model the rules have been employed: formal logics such as Linear Temporal Logic LTL or Signal Temporal Logic (STL ), as well as real-value constraints. Middle and bottom: Rules are separated into premise and conclusion . The initial premise and exceptions (red) are combined by conjunction . Source |
Authors: Esterle, K., Gressenbuch, L., & Knoll, A.
-
Motivation:
-
"Traffic rules are fuzzy and not well defined, making them incomprehensible to machines."
- The authors formalize traffic rules from legal texts (here
StVO
) to a formal language (hereLTL
).
-
-
Which legal text defines rules?
- For instance the Straßenverkehrsordnung (
StVO
), which is the German concretization of theVienna Convention
on Road Traffic.
- For instance the Straßenverkehrsordnung (
-
Why Linear Temporal Logic (
LTL
) as the formal language to specify traffic rules?-
"During the legal analysis,
conjunction
,disjunction
,negation
andimplication
proved to be powerful and useful tools for formalizing rules. As traffic rules such asovertaking
consider temporal behaviors, we decided to useLTL
." -
"Others have used Signal Temporal Logic (
STL
) to obtain quantitative semantics about rule satisfaction. Quantitive semantics might be beneficial for relaxing the requirements to satisfy a rule."
-
-
Rules are separated into
premise
andconclusion
.-
"This allows rules to be separated into a
premise
about the current state of the environment, i.e. when a rule applies, and the legal behavior of the ego agent in that situation (conclusion
). Then, exceptions to the rules can be modeled to be part of the assumption."
-
-
Tools:
INTERACTION
: a dataset which focuses on dense interactions; it is used to analyze the compliance of each vehicle with the traffic rules.
: aC++
library for model checking, to translate the formalizedLTL
formula to a deterministic finite automaton and to manipulate the automatons.BARK
: a benchmarking framework.
-
Evaluation of rule-violation on public data:
-
"Roughly every fourth
lane change
does not keep asafe distance
to the rear vehicle, which is similar for theGerman
andChinese
Data."
-
"A hierarchical control system for autonomous driving towards urban challenges"
-
[
2020
] [📝] [ 🎓Chungbuk National University, Korea
] -
[
FSM
]
Click to expand
Behavioural planning is performed using a two-stage FSM. Right: transition conditions in the M-FSM. Source |
Authors: Van, N. D., Sualeh, M., Kim, D., & Kim, G. W.
- Motivations:
-
"In the
DARPA
Urban Challenge, StanfordJunior
team succeeded in applyingFSM
with several scenarios in urban traffic roads. However, the main drawback ofFSM
is the difficulty in solving uncertainty and in large-scale scenarios." - Here:
- The uncertainty is not addressed.
- The diversity of scenarios is handled by a two-stage Finite State Machine (
FSM
).
-
- About the two-state
FSM
:1-
A MissionFSM
(M-FSM
).- Five states:
Ready
,Stop-and-Go
(SAG
) (main mode),Change-Lane
(CL
),Emergency-stop
,avoid obstacle mode
.
- Five states:
2-
A ControlFSM
(C-FSM
) in eachM-FSM
state.
- The
decision
is then converted intospeed
andwaypoints
objectives, handled by the local path planning.- It uses a real-time hybrid
A*
algorithm with an occupancy grid map. - The communication
decision
->path planner
is unidirectional: No feedback is given regarding the feasibility for instance.
- It uses a real-time hybrid
"Trajectory Optimization and Situational Analysis Framework for Autonomous Overtaking with Visibility Maximization"
- [
2019
] [📝] [🎞️] [ 🎓National University of Singapore, Delft University, MIT
] - [
FSM
,occlusion
,partial observability
]
Click to expand
Left: previous work Source. Right: The BP FSM consists in 5 states and 11 transitions. Each transition from one state to the other is triggered by specific alphabet unique to the state. For instance, 1 is Obstacle to be overtaken in ego lane detected . Together with the MPC set of parameters, a guidance path is passed to the trajectory optimizer. Source. |
Authors: Andersen, H., Alonso-mora, J., Eng, Y. H., Rus, D., & Ang Jr, M. H.
- Main motivation:
- Deal with occlusions, i.e. partial observability.
- Use case: a car is illegally parked on the vehicle’s ego lane. It may fully occlude the visibility. But has to be overtaken.
- One related work:
- "Trajectory Optimization for Autonomous Overtaking with Visibility Maximization" - (Andersen et al., 2017)
- [🎞️].
- [🎞️].
- [🎞️].
- About the hierarchical structure.
1-
A high-level behaviour planner (BP
).- It is structured as a deterministic finite state machine (
FSM
). - States include:
Follow ego-lane
Visibility Maximization
Overtake
Merge back
Wait
- Transition are based on some deterministic
risk assessment
.- The authors argue that the deterministic methods (e.g. formal verification of trajectory using
reachability analysis
) are simpler and computationally more efficient than probabilistic versions, while being well suited for this information maximization: -
This is due to the fact that the designed behaviour planner explicitly breaks the traffic rule in order to progress along the vehicle’s course.
- Interface
1-
>2-
:- Each state corresponds to a specific set of parameters that is used in the trajectory optimizer (illustrated in the sketch below).
-
"In case of
Overtake
, a suggested guidance path is given to both theMPC
and `backup trajectory generator`".
2-
A trajectory optimizer.- The problem is formulated as a receding-horizon planner, and the task is to solve the non-linear constrained optimization in real time.
- Costs include
guidance path deviation
,progress
,speed deviation
,size of blind spot
(visible area) andcontrol inputs
. - Constraints include, among others,
obstacle avoidance
. - The prediction horizon of this
MPC
is5s
.
- Again (I really like this idea),
MPC
parameters are set by theBP
.- For instance, the cost for
path deviation
is high forFollow ego-lane
, while it can be reduced forVisibility Maximization
. -
"Increasing the visibility maximization cost resulted in the vehicle deviating from the path earlier and more abrupt, leading to frequent wait or merge back cases when an oncoming car comes into the vehicle’s sensor range. Reducing visibility maximization resulted in later and less abrupt deviation, leading to overtaking trajectories that are too late to be aborted. We tune the costs for a good trade-off in performance."
- Hence, depending on the state, the task might be to maximize the amount of information that the autonomous vehicle gains along its trajectory.
-
"Our method considers visibility as a part of both
decision-making
andtrajectory generation
".
"Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles"
- [
2019
] [📝] [ 🚗Uber
] - [
max-margin
]
Click to expand
Source. |
Authors: Sadat, A., Ren, M., Pokrovsky, A., Lin, Y., Yumer, E., & Urtasun, R.
- Main motivation:
- Design a decision module where both the behavioural planner and the trajectory optimizer share the same objective (i.e. cost function).
- Therefore "joint".
-
"[In approaches not-joint approaches] the final trajectory outputted by the trajectory planner might differ significantly from the one generated by the behavior planner, as they do not share the same objective".
- Requirements:
1-
Avoid time-consuming, error-prone, and iterative hand-tuning of cost parameters.- E.g. Learning-based approaches (
BC
).
2-
Offer interpretability about the costs jointly imposed on these modules.- E.g. Traditional modular
2
-stage approaches.
- About the structure:
- The driving scene is described in
W
(desired route
,ego-state
,map
, anddetected objects
). ProbablyW
for "World"? - The behavioural planner (
BP
) decides two things based onW
:1-
An high-level behaviourb
.- The path to converge to based on one chosen manoeuvre:
keep-lane
,left-lane-change
, orright-lane-change
. - The
left
andright
lane boundaries. - The obstacle
side assignment
: whether an obstacle should stay in thefront
,back
,left
, orright
to the ego-car.
2-
A coarse-level trajectoryτ
.- The loss also has a regularization term.
- This decision is "simply" the
argmin
of the shared cost-function, obtained by sampling+selecting the best.
- The "trajectory optimizer" refines
τ
based on the constraints imposed byb
.- For instance an overlap cost will be incurred if the
side assignment
ofb
is violated.
- A cost function parametrized by
w
assesses the quality of the selected <b
,τ
> pair:cost
=w^T
.sub-costs-vec
(τ
,b
,W
).- Sub-costs relate to safety, comfort, feasibility, mission completion, and traffic rules.
- Why "learnable"?
- Because the weight vector
w
that captures the importance of each sub-cost is learnt based on human demonstrations.-
"Our planner can be trained jointly end-to-end without requiring manual tuning of the costs functions".
-
- There are two losses for that objective (sketched below):
1-
Imitation loss (withMSE
).- It applies on the <
b
,τ
> produced by theBP
.
2-
Max-margin loss to penalize trajectories that have small cost and are different from the human driving trajectory.- It applies on the <
τ
> produced by the trajectory optimizer. -
"This encourages the human driving trajectory to have smaller cost than other trajectories".
- It reminds me of the
max-margin
method inIRL
where the weights of the reward function should make the expert demonstration better than any other policy candidate.
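A rough sketch of the shared linear cost `w^T`.`sub-costs-vec` and of the max-margin idea, in my own notation (the 3-d features, the margin values and the reduction to the single worst-offending candidate are simplifications; the paper combines this with the imitation loss on the `BP` output):

```python
import numpy as np

def cost(w: np.ndarray, features: np.ndarray) -> float:
    """Shared linear cost: c(tau, b, W) = w^T . sub_costs(tau, b, W)."""
    return float(w @ features)

def max_margin_loss(w, human_feat, candidate_feats, margins):
    """Hinge loss encouraging the human trajectory to be cheaper than every sampled
    candidate by a task-specific margin; only the worst offender is kept here."""
    c_human = cost(w, human_feat)
    return max(0.0, max(c_human + m - cost(w, f) for f, m in zip(candidate_feats, margins)))

# Toy usage with made-up 3-dimensional sub-cost features.
w = np.array([1.0, 0.5, 2.0])
human = np.array([0.2, 0.1, 0.0])
candidates = [np.array([0.1, 0.1, 0.0]), np.array([0.5, 0.2, 0.3])]
print(max_margin_loss(w, human, candidates, margins=[0.1, 0.1]))  # > 0: some candidate is 'too cheap'
```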
"Liability, Ethics, and Culture-Aware Behavior Specification using Rulebooks"
-
[
2019
] [📝] [] [🎞️] [🎞️] [ 🎓ETH Zurich
] [ 🚗nuTonomy
,Aptiv
] -
[
sampling-based planning
,safety validation
,reward function
,RSS
]
Click to expand
Some figures:
Defining the rulebook . Source. |
The rulebook is associated with an operator =< to prioritize between rules. Source. |
The rulebook serves for deciding which trajectory to take and can be adapted using a series of operations. Source. |
Authors: Censi, A., Slutsky, K., Wongpiromsarn, T., Yershov, D., Pendleton, S., Fu, J., & Frazzoli, E.
-
Allegedly how nuTonomy (an Aptiv company) cars work.
-
One main concept: "rulebook".
- It contains multiple
rules
, that specify the desired behaviour of the self-driving cars. - A rule is simply a scoring function, or “violation metric”, on the realizations (= trajectories).
- The degree of violation acts like some penalty term: here are some examples of the evaluation of a realization
x
evaluated by a ruler
:- For speed limit:
r
(x
) = interval for which the car was above45 km/h
. - For minimizing harm:
r
(x
) = kinetic energy transferred to human bodies.
- Together with the priority order between rules, these violation scores are used as a comparison operator to rank candidate trajectories (see the sketch below).
-
One idea: Hierarchy of rules.
- With many rules being defined, it may be impossible to find a realization (e.g. trajectory) that satisfies all.
- But even in critical situations, the algorithm must still make a choice: the least catastrophic option (hence there is no concept of infeasibility).
- To deal with this infeasibility, priorities are defined between conflicting rules, which are therefore hierarchically ordered.
- Hence a rulebook
R
comes with some operator<
: <R
,<
>. - This leads to some concepts:
- Safety vs. infractions.
- Ex.: a rule "not to collide with other objects" will have a higher priority than the rule "not crossing the double line".
- Liability-aware specification.
- Ex.: (edge-case): Instruct the agent to collide with the object on its lane, rather than collide with the object on the opposite lane, since changing lane will provoke an accident for which it would be at fault.
- This is close to the RSS ("responsibility-sensitive safety" model) of Mobileye.
- Hierarchy between rules:
- Top: Guarantee safety of humans.
- This is written analytically (e.g. a precise expression for the kinetic energy to minimize harm to people).
- Bottom: Comfort constraints and progress goals.
- Can be learnt based on observed behaviour (and also tend to be platform- and implementation- specific).
- Middle: All the other priorities among rule groups
- These are somewhat open for discussion.
-
How to build a rulebook:
- Rules can be defined analytically (e.g.
LTL
formalism) or learnt from data (for non-safety-critical rules). - Violation functions can be learned from data (e.g.
IRL
). - Priorities between rules can also be learnt.
-
One idea: manipulation of rulebooks.
- Regulations and cultures differ depending on the country and the state.
- A rulebook <
R
,<
> can easily be adapted using three operations (priority refinement
,rule augmentation
,rule aggregation
).
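A minimal sketch of how such a rulebook could rank candidate realizations: each rule returns a violation score and trajectories are compared rule by rule, highest priority first. The example rules and scores are invented, and the pre-order of the paper is simplified here to a strict lexicographic order:

```python
from typing import Callable, List, Sequence

class Rulebook:
    """Rules ordered from highest to lowest priority; each maps a trajectory to a
    violation score (0 = fully satisfied). The paper only requires a pre-order between
    rules; a total (lexicographic) order is assumed here for simplicity."""
    def __init__(self, rules: Sequence[Callable[[dict], float]]):
        self.rules = list(rules)

    def violation_vector(self, trajectory: dict) -> List[float]:
        return [rule(trajectory) for rule in self.rules]

    def best(self, trajectories):
        # Lexicographic comparison: minimize violations of high-priority rules first.
        return min(trajectories, key=self.violation_vector)

# Invented example: harm to humans outranks double-line crossing, which outranks comfort.
rulebook = Rulebook([
    lambda t: t["kinetic_energy_to_humans"],
    lambda t: t["time_over_double_line_s"],
    lambda t: t["max_jerk"],
])
candidates = [
    {"kinetic_energy_to_humans": 0.0, "time_over_double_line_s": 1.2, "max_jerk": 0.8},
    {"kinetic_energy_to_humans": 0.0, "time_over_double_line_s": 0.0, "max_jerk": 2.5},
]
print(rulebook.best(candidates))  # stays off the double line even at the cost of comfort
```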
-
Related work: several topics raised in this paper remind me of subjects addressed in Emilio Frazzoli, CTO, nuTonomy - 09.03.2018
- 1- Decision making with FSM:
- Too complex to code. Easy to make mistakes. Difficult to adjust. Impossible to debug (:cry:).
- 2- Decision making with E2E learning:
- Appealing since there are too many possible scenarios.
- But how to prove that and justify it to the authorities?
- One solution is to revert such imitation strategy: start by defining the rules.
- 3- Decision making "cost-function-based" methods
- 3-1-
RL
/MCTS
: not addressed here. - 3-2- Rule-based (not the
if
-else
-then
logic but rather traffic/behaviour rules).
- First note:
- Number of rules: small (
15
are enough for level-4
). - Number of possible scenarios: huge (combinational).
- Second note:
- Driving behaviours are hard to code.
- Driving behaviours are hard to learn.
- But driving behaviours are easy to assess.
- Strategy:
- 1- Generate candidate trajectories
- Not only in time and space.
- Also in terms of semantics (logical trajectories in a Kripke structure).
- 2- Check if they satisfy the constraints and pick the best.
- This involves linear operations.
- Conclusion:
-
"Rules and rules priorities, especially those that concern safety and liability, must be part of nation-wide regulations to be developed after an informed public discourse; it should not be up to engineers to choose these important aspects."
- This reminds me of the discussion about social acceptance I had at IV19.
- As E. Frazzoli concluded during his talk, the remaining question is:
- "We do not know how we want human-driven vehicle to behave?"
- Once we have the answer, that is easy.
-
Some figures from this related presentation:
Candidate trajectories are not just spatio-temporal but also semantic. Source. |
Define priorities between rules, as Asimov did for his laws. Source. |
As raised here by the main author of the paper, I am still wondering how the presented framework deals with the different sources of uncertainties. Source. |
"Provably Safe and Smooth Lane Changes in Mixed Traffic"
Click to expand
Some figures:
The first safe? check might lead to conservative behaviours (huge gaps would be needed for safe lane changes). Hence it is relaxed with some Probably Safe? condition. Source. |
Source. |
Formulation by Pek, Zahn, & Althoff, 2017. Source. |
Authors: Naumann, M., Königshof, H., & Stiller, C.
-
Main ideas:
- The notion of safety is based on the responsibility sensitive safety (
RSS
) definition.- As stated by the authors, "A
safe
lane change is guaranteed not tocause
a collision according to the previously defined rules, while a single vehicle cannot ensure that it will never be involved in a collision."
- As stated by the authors, "A
- Use set-based reachability analysis to prove the "RSS-safety" of a lane-change manoeuvre based on gap evaluation (a generic RSS gap formula is sketched below).
- In other words, it is the responsibility of the ego vehicle to maintain safe distances during the lane change manoeuvre.
-
Related works: A couple of safe distances are defined, building on
RSS
principles (after IV19, I tried to summarize some of the RSS concepts here).- "Verifying the Safety of Lane Change Maneuvers of Self-driving Vehicles Based on Formalized Traffic Rules", (Pek, Zahn, & Althoff, 2017)
"Decision-Making Framework for Autonomous Driving at Road Intersections: Safeguarding Against Collision, Overly Conservative Behavior, and Violation Vehicles"
-
[
2018
] [📝] [🎞️] [ 🎓Daejeon Research Institute, South Korea
] -
[
probabilistic risk assessment
,rule-based probabilistic decision making
]
Click to expand
One figure:
Source. |
Author: Noh, S.
- Many ML-based works criticize rule-based approaches (over-conservative, no generalization capability and painful parameter tuning).
- True, the presented framework contains many parameters whose tuning may be tedious.
- But this approach just works! At least they go out of the simulator and show some experiments on a real car.
- I really like their video, especially the multiple camera views together with the
RViz
representation. - It can be seen that probabilistic reasoning and uncertainty-aware decision making are essential for robustness.
- One term: "Time-to-Enter" (tte).
- It represents the time it takes a relevant vehicle to reach the potential collision area (CA), from its current position at its current speed.
- To deal with uncertainty in the measurements, a variant of this heuristic is coupled with a Bayesian network for probabilistic threat-assessment.
- One Q&A: What is the difference between situation awareness and situation assessment?
- In situation awareness, all possible routes are considered for the detected vehicles using a map. The vehicles whose potential route intersects with the ego-path are classified as relevant vehicles.
- In situation assessment, a threat level in {
Dangerous
,Attentive
,Safe
} is inferred for each relevant vehicle.
- One quote:
"The existing literature mostly focuses on motion prediction, threat assessment, or decision-making problems, but not all three in combination."
"MIDAS: Multi-agent Interaction-aware Decision-making with Adaptive Strategies for Urban Autonomous Navigation"
-
[
2020
] [📝] [ 🎓University of Pennsylvania
] [ 🚗Nuro
] -
[
attention
,parametrized driver-type
]
Click to expand
Top: scenarios are generated to require interaction-aware decisions from the ego agent. Bottom: A driver-type parameter is introduced to learn a single policy that works across different planning objectives. It represents the driving style such as the level of aggressiveness . It affects the terms in the reward function in an affine way. Source. |
The conditional parameter (ego’s driver-type ) should only affect the encoding of its own state , not the state of the other agents. Therefore it injected after the observation encoder. MIDAS is compared to DeepSet and “Social Attention”. SAB stands for ''set-attention block'', ISAB for ''induced SAB '' and PMA for ''pooling by multi-head attention''. Source. |
Authors: Chen, X., & Chaudhari, P.
-
Motivations:
1-
Derive an adaptive ego policy.- For instance, conditioned on a parameter that represents the driving style such as the level of
aggressiveness
. -
"
MIDAS
includes adriver-type
parameter to learn a single policy that works across different planning objectives."
2-
Handle an arbitrary number of other agents, with a permutation-invariant input representation.-
[
MIDAS
uses anattention
-mechanism] "The ability to pay attention to only the part of theobservation
vector that matters for control irrespective of the number of other agents in the vicinity."
-
3-
Decision-making should be interaction-aware.-
"A typical
planning
algorithm would predict the forward motion of the other cars and ego would stop until it is deemed safe and legal to proceed. While this is reasonable, it leads to overly conservative plans because it does not explicitly model the mutual influence of the actions of interacting agents." - The goal here is to obtain a policy more optimistic than a worst-case assumption via the tuning of the
driver-type
.
-
-
Why "
MIDAS
"? I could not find the meaning of this acronym. -
How to train a user-tuneable adaptive policy?
-
"Each agent possesses a real-valued parameter
βk
∈ [−1
,1
] that models its “driver-type”. A large value ofβk
indicates an aggressive agent and a small value ofβk
indicates that the agent is inclined to wait for others around it before making progress." β
is not observable to others and is used to determine the agent’s velocity asv = 2.7β + 8.3
.- This affine form
wβ+b
is also used in all sub-rewards of thereward
function:1-
Time-penalty for every timestep.2-
Reward for non-zero speed.3-
Timeout penalty that discourages ego from stopping the traffic flow.-
"This includes a stalement penalty where all nearby agents including ego are standstill waiting for one of them to take initiative and break the tie."
-
4-
Collision penalty.5-
A penalty for following too close to the agent in front. - This one does not depend onβ
.
- The goal is not to explicitly infer
β
fromobservation
s, as it is done by thebelief tracker
of somePOMDP
solvers. - Where to inject
β
?-
"We want ego’s
driver-type
information to only affect the encoding of its ownstate
, not the state of the other agents." -
"We use a two-layer perceptron with ReLU nonlinearities to embed the scalar variable
β
and add the output to the encoding of ego’sstate
." [Does that mean that the single scalarβ
goes alone through two FC layers?]
-
- To derive adjustable drivers, one could use
counterfactual reasoning
withk-levels
for instance.-
"It however uses self-play to train the policy and while this approach is reasonable for highway merging, the competition is likely to result in high collision rates in busy urban intersections such as ours."
-
-
-
About the
attention
-based architecture.- The permutation-invariance and size independence can be achieved by combining a
sum
followed by someaggregation
operator such asaverage pooling
.- The authors show the limitation of using a
sum
. Instead of preferring amax pooling
, the authors suggest usingattention
: -
"Observe however that the summation assigns the same weight to all elements in the input. As we see in our experiments, a value function using this
DeepSet
architecture is likely to be distracted by agents that do not inform the optimal action."
-
"An
attention
module is an elegant way for thevalue function
to learnkey
,query
,value
embeddings that pay moreattention
to parts of the input that are more relevant to the output (for decision making)." Set transformer
.-
"The set-transformer in
MIDAS
is an easy, automatic way to encode variable-sizedobservation
vectors. In this sense, our work is closest to “Social Attention”, (Leurent & Mercat, 2019), which learns to influence other agents based on road priority and demonstrates results on a limited set of road geometries."
-
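The contrast between a `DeepSet`-style sum and an `attention` pooling can be illustrated with toy, untrained projections; this is a generic single-head sketch, not the paper's set-transformer blocks (`SAB`/`ISAB`/`PMA`):

```python
import numpy as np

def deepset_pool(agent_feats: np.ndarray) -> np.ndarray:
    """DeepSet-style aggregation: every agent gets the same weight (here a plain sum)."""
    return agent_feats.sum(axis=0)

def attention_pool(ego_feat: np.ndarray, agent_feats: np.ndarray) -> np.ndarray:
    """Single-head dot-product attention with the ego encoding as query: agents that are
    more relevant to the ego receive larger weights. Random, untrained toy projections."""
    d = ego_feat.shape[0]
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = ego_feat @ W_q                      # (d,)
    k = agent_feats @ W_k                   # (n, d)
    v = agent_feats @ W_v                   # (n, d)
    scores = k @ q / np.sqrt(d)             # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                      # permutation-invariant, size-independent

ego = np.ones(8)
others = np.random.default_rng(1).standard_normal((5, 8))  # works for any number of agents
print(deepset_pool(others).shape, attention_pool(ego, others).shape)
```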
-
observation
space (not very clear).-
"Their observation vector contains the locations of all agents within a Euclidean distance of
10m
and is created in an ego-centric coordinate frame." - The list of
waypoints
to reach agents-specific goal locations is allegedly also part of theobservation
. - No orientation? No previous poses?
-
-
Binary
action
.-
"Control
actions
of all agents areukt
∈ {0
,1
} which correspond tostop
andgo
respectively." - No information about the
transition
/dynamics
model.
-
-
Tricks for off-policy RL, here
DQN
.-
"The
TD2
objective can be zero even if the value function is not accurate because the Bellman operator is only a contraction in theL∞
norm, not theL2
norm." - Together with
double DQN
, andduelling DQN
, the authors proposed two variants:1-
Thenet 1
(local) selects the action withargmax
fornet 2
(target
ortime-lagged
, whose weights are copied fromnet 1
at fixed periods) and vice-versa.-
"This forces the first copy, via its time-lagged parameters to be the evaluator for the second copy and vice-versa; it leads to further variance reduction of the target in the
TD
objective."
-
2-
Duringaction selection
,argmax
is applied on the average of theq
-values of both networks.
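My attempt to write down this cross-evaluation in a few lines (whether the two time-lagged copies are crossed exactly this way is my interpretation of the description above):

```python
import numpy as np

def cross_ddqn_targets(q1_next, q2_next, q1_lagged_next, q2_lagged_next,
                       rewards, dones, gamma=0.99):
    """The greedy action of each copy is evaluated by the *other* copy's time-lagged
    network, to further reduce the variance of the TD target.
    All q-inputs have shape (batch, n_actions)."""
    idx = np.arange(len(rewards))
    a1 = q1_next.argmax(axis=1)            # actions chosen by copy 1
    a2 = q2_next.argmax(axis=1)            # actions chosen by copy 2
    target1 = rewards + gamma * (1.0 - dones) * q2_lagged_next[idx, a1]
    target2 = rewards + gamma * (1.0 - dones) * q1_lagged_next[idx, a2]
    return target1, target2

def select_action(q1, q2):
    """At acting time: argmax over the averaged q-values of the two copies."""
    return int(np.argmax((q1 + q2) / 2.0))

rng = np.random.default_rng(0)
qs = [rng.standard_normal((4, 2)) for _ in range(4)]     # toy q-values: batch of 4, 2 actions
print(cross_ddqn_targets(*qs, rewards=np.ones(4), dones=np.zeros(4)))
```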
-
-
Evaluation.
- Non-ego agents drive using an Oracle policy that has full access to trajectories of nearby agents. Here it is rule-based (
TTC
). - Scenarios are diverse for training (no
collision
scenario during testing. [Why?]):generic
: Initial and goal locations are uniformly sampled.collision
: The ego will collide with at least one other agent in the future if it does not stop at an appropriate timestep.interaction
: At least2
other agents will arrive at a location simultaneously with the ego car.-
"Ego cannot do well in
interaction
episodes unless it negotiates with other agents." -
"We randomize over the number of agents,
driver-types
, agent IDs, road geometries, and add small perturbations to their arrival time to construct1917
interaction
episodes." -
"Curating the dataset in this fashion aids the reproducibility of results compared to using random seeds to initialize the environment."
-
-
"[Robustness] At test time, we add
Bernoulli
noise of probability0.1
to theactions
of other agents to model the fact that driving policies of other agents may be different from each other." - Performance in the simulator is evaluated based on:
1-
Thetime-to-finish
which is the average episode length.2-
Thecollision-
,timeout-
andsuccess rate
which refer to the percentage of episodes that end with the corresponding status (the three add up to1
).
-
"To qualitatively compare performance, we prioritize
collision rate
(an indicator forsafety
) over thetimeout rate
andtime-to-finish
(which indicateefficiency
). Performance of theOracle planner
is reported over4
trials. Performance of the trained policy is reported across4
random seeds."
"An end-to-end learning of driving strategies based on DDPG and imitation learning"
Click to expand
A small set of collected expert demonstrations is used to train an IL agent while pre-training the DDPG (offline RL ). Then, the IL agent is used to generate new experiences, stored in the M1 buffer. The RL agent is then trained online ('self-learning') and the generated experiences are stored in M2 . During this training phase, the sampling from M1 is progressively reduced. The decay of the sampling ratio is automated based on the return . Source. |
Authors: Zou, Q., Xiong, K., & Hou, Y.
- I must say the figures are low quality and some sections are not well-written. But I think the main idea is interesting to report here.
- Main inspiration:
Deep Q-learning from Demonstrations
(Hester et al. 2017) atDeepMind
.-
"We present an algorithm, Deep Q-learning from Demonstrations (
DQfD
), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism."
-
- Motivations:
- Improve the training efficiency of model-free off-policy
RL
algorithms.-
[issue with
DDPG
] "The reason whyRL
converges slowly is that it requires constant exploration from the environment. The data in the early experience pool (s
,a
,r
,s_
) are all low-reward data, and the algorithm cannot always be trained with useful data, resulting in a slow convergence rate." -
[issue with
behavioural cloning
] "The performance of pureIL
depends on the quality and richness of expert data, and cannot be self-improved."
-
-
[main idea] "Demonstration data generated by
IL
are used to accelerate the learning speed ofDDPG
, and then the algorithm is further enhanced through self-learning."
- Using
2
experience replay buffers.1-
Collect a small amount of expert demonstrations.-
"We collect about
2000
sets of artificial data from theTORCS
simulator as expert data and use them to train a simpleIL
network. Because there is less expert data,Dagger
(Dataset Aggregation) is introduced in the training method, so that the network can get the best results with the least data."
-
2-1.
Generate demonstration data using theIL
agent. And store them in theexpert pool
.2-2.
Meanwhile, pre-train theDDPG
algorithm on the demonstration data. It isoffline
, i.e. the agent trains solely without any interaction with the environment.3-
After the pre-training is completed, theDDPG
algorithm starts exploration. It stores the exploration data in theordinary pool
.- Experiences are sampled from the
2
pools. -
"We record the maximum value of total reward
E
whenIL
produced demonstration data for each round, and use it as the threshold to adjust thesampling ratio
[..] when the algorithm reaches an approximate expert level, the value of theexpert/ordinary
ratio will gradually decrease."
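A minimal sketch of the two-pool sampling with a decaying `expert/ordinary` ratio (the decay schedule and the thresholding on the best `IL` return `E` are my guesses at the mechanism described above):

```python
import random

class DualReplay:
    """Two experience pools: M1 filled with IL-generated demonstrations, M2 with the agent's
    own exploration. The expert sampling ratio decays once the agent's return approaches the
    best return E observed from the IL demonstrations (schedule simplified)."""
    def __init__(self, expert_return_threshold: float, initial_expert_ratio: float = 0.8):
        self.expert_pool, self.ordinary_pool = [], []
        self.expert_ratio = initial_expert_ratio
        self.threshold = expert_return_threshold

    def update_ratio(self, last_episode_return: float, decay: float = 0.95):
        if last_episode_return >= self.threshold:
            self.expert_ratio *= decay          # rely less and less on demonstrations

    def sample(self, batch_size: int):
        n_expert = min(int(round(batch_size * self.expert_ratio)), len(self.expert_pool))
        batch = random.sample(self.expert_pool, n_expert)
        batch += random.sample(self.ordinary_pool, min(batch_size - n_expert, len(self.ordinary_pool)))
        return batch
```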
- Results
-
[training time] "Experimental results show that the
DDPG-IL
algorithm is3
times faster than the ordinaryDDPG
and2
times faster than the single experience poolDDPG
based on demonstration data." - In addition,
DDPG-IL
shows the best performance.
-
"High-Speed Autonomous Drifting with Deep Reinforcement Learning"
-
[
generalization
,action smoothing
,ablation
,SAC
,CARLA
]
Click to expand
The goal is to control the vehicle to follow a trajectory at high speed (>80 km/h ) and drift through manifold corners with large side slip angles (>20° ), like a professional racing driver. The slip angle β is the angle between the direction of the heading and the direction of speed vector. The desired heading angle is determined by the vector field guidance (VFG ): it is close to the direction of the reference trajectory when the lateral error is small. Source. |
Top-left: The generalization capability is tested by using cars with different kinematics and dynamics. Top-right: An action smoothing strategy is adopted for stable control outputs. Bottom: the reward function penalized deviations to references states in term of distance , direction , and slip angle . The speed factor v is used to stimulate the vehicle to drive fast: If v is smaller than 6 m/s , the total reward is decreased by half as a punishment. Source. |
The proposed SAC -based approach as well as the three baselines can follow the reference trajectory. However, SAC achieves a much higher average velocity (80 km/h ) than the baselines. In addition, it is shown that the action smoothing strategy can improve the final performance by comparing SAC-WOS and SAC . Despite the action smoothing , the steering angles of DDPG is also shaky. Have a look at the steering gauges! Source. |
Authors: Cai, P., Mei, X., Tai, L., Sun, Y., & Liu, M.
-
Motivations:
1-
Learning-based.- The car dynamics during transient drift (high
speed
>80km/h
and slip angle
>20°
as opposed to steady-state drift) is too hard to model precisely. The authors claim it should rather be addressed by model-free learning methods.
- The car dynamics during transient drift (high
2-
Generalization.- The drift controller should generalize well on various
road structures
,tire friction
andvehicle types
.
- The drift controller should generalize well on various
-
MDP
formulation:state
(42
-dimensional):- [Close to
imitation learning
] It is called "error-basedstate
" since it describes deviations to the referenced drift trajectories (performed by a experienced driver with aLogitech G920
). - These deviations relate to the
location
,heading angle
,velocity
andslip angle
.-
[Similar to
D
inPID
] "Time derivatives of the error variables, such asd(ey)/d(t)
, are included to provide temporal information to the controller."
-
- The
state
also contains the laststeering
andthrottle
commands. Probably enabling consistent action selection in theaction smoothing
mechanism.
action
:- The
steering
is limited to a smaller range of [−0.8
,0.8
] instead of [−1
,1
] to prevent rollover. -
"Since the vehicle is expected to drive at high speed, we further limit the range of the
throttle
to [0.6
,1
] to prevent slow driving and improve training efficiency."
- The
action
smoothing against shaky control output.-
"We impose continuity in the
action
, by constraining the change of output with the deployed action in the previous step:a[t]
=K1
.a_net[t]
+K2
.a[t-1]
.
-
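The smoothing rule fits in two lines; if `K1 + K2 = 1`, the smoothed action stays within the network's output range. The gains below are placeholders (the paper tunes them):

```python
import numpy as np

class ActionSmoother:
    """a[t] = K1 * a_net[t] + K2 * a[t-1], as quoted above. The gains and the zero
    initialisation of the previous action are placeholder choices."""
    def __init__(self, k1: float = 0.7, k2: float = 0.3, action_dim: int = 2):
        self.k1, self.k2 = k1, k2
        self.prev = np.zeros(action_dim)   # [steering, throttle]

    def __call__(self, a_net) -> np.ndarray:
        a = self.k1 * np.asarray(a_net, dtype=float) + self.k2 * self.prev
        self.prev = a
        return a

smooth = ActionSmoother()
print(smooth([0.8, 1.0]), smooth([-0.8, 0.6]))   # abrupt sign flips in steering get damped
```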
-
Algorithms:
1-
DQN
can only handle the discreteaction
space: it selects among5
*10
=50
combinations, without theaction smoothing
strategy.2-
DDPG
is difficult to converge due to the limited exploration ability caused by its deterministic character.3-
[Proposed] Soft actor-critic (SAC
) offers a better convergence ability while avoiding the high sample complexity: Instead of only seeking to maximize the lifetime rewards,SAC
seeks to also maximize the entropy of the policy (as a regularizer). This encourages exploration: the policy should act as randomly as possible [encourage uniform action probability] while being able to succeed at the task.4-
TheSAC-WOS
baseline does not have anyaction smoothing
mechanism (a[t]
=K1
.a_net[t]
+K2
.a[t-1]
) and suffers from shaky behaviours.
-
Curriculum learning:
-
"Map (
a
) is relatively simple and is used for the first-stage training, in which the vehicle learns some basic driving skills such as speeding up by applying large values of throttle and drifting through some simple corners. Maps (b
-f
) have different levels of difficulty with diverse corner shapes, which are used for further training with the pre-trained weights from map (a
). The vehicle can use the knowledge learned from map (a
) and quickly adapt to these tougher maps, to learn a more advanced drift technique." - Is the replay buffer built from training with map (
a
) reused? What about the weighting parameterα
of the entropy termH
in the objective function?
-
-
Robust
training
for generalization:- The idea is to expose different
road structures
andcar models
during training, i.e. make theMDP
environmentstochastic
. -
"At the start of each episode, the
tire friction
andvehicle mass
are sampled from the range of [3.0
,4.0
] and [1.7t
,1.9t
] respectively." -
[Evaluation.] "Note that for each kind of vehicle, the referenced drift trajectories are different in order to meet the respective physical dynamics." [How are they adjusted?]
- Benefit of
SAC
:-
[from
BAIR
blog] "Due to entropy maximization at training time, the policy can readily generalize to these perturbations without any additional learning."
-
-
Ablation study for the
state
space.-
"Can we also provide less information during the training and achieve no degradation in the final performance?"
- Findings:
1-
Necessity ofslip angle
information (instate
andreward
) duringtraining
.-
[Need for
supervision
/expert demonstration
] "Generally, accurateslip angles
from expert drift trajectories are indeed necessary in the training stage, which can improve the final performance and the training efficiency."
-
2-
Non-degraded performance with a rough and easy-to-access reference trajectory duringtesting
. Making it less dependant on expert demonstraions.
-
"Trajectory based lateral control: A Reinforcement Learning case study"
-
[
generalization
,sim2real
,ablation
,DDPG
,IPG CarMaker
]
Click to expand
The task is to predict the steering commands to follow the given trajectory on a race track. Instead of a sum , a product of three deviation terms is proposed for the multi-objective reward . Not clear to me: which WP is considered for the deviation computation in the reward ? Source. |