Skip to content

Dudongyue/How-to-train-your-drone-using-PPO

 
 

Repository files navigation

How to train your drone using PPO

RL based robust controller for quadrotor. Proximal Policy Optimization (PPO) is used. This project is a part of the course XAI 601 Applications in Deep Learning in Korea University.

This repository contains MATLAB-based simulation for quadrotor using PPO agent.

Main Contributors: Soowon Kim ([email protected])

Affiliation: Korea University

Requirements

Tested on MATLAB 2021a. RL package is required.

How To

State, action and reward

Observation (state) consists of the errors between the reference trajectory and current position, velocity, and acclereation. Addition to that, the state has orientation (quaternion) and angular velocity of the agent. Then the state can be expressed as follows:

numObs = 16;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'Quad States';

Action consists of four thrusts by each rotor in Newton.

numAct = 4;
actInfo = rlNumericSpec([numAct 1]);
actInfo.Name = 'Quad Action';
actInfo.LowerLimit = [0,0,0,0]';

Reward functions are inspired by Peng's work. You can modify them in myStep.m.

% Rewards
r_pos = exp(-(1/0.5*pos_l2).^2);
r_vel = exp(-(1/1*vel_l2).^2);
r_acc = exp(-(1/1*acc_l2).^2);
r_yaw = exp(-(1/(5/180*pi)*yaw_error).^2);

Agent

In the declaration of actor, you can choose either single or dual tanh activation for the mean output of the actions.

% Define environment
env = rlFunctionEnv(obsInfo,actInfo,'myStep','myReset');
% Actor network
actor = actorNetwork(obsInfo, actInfo,'single');
% Critic network
critic = criticNetwork(obsInfo);

Now you can create PPO agent. You can experiment with different parameters but these setup worked for me.

agentOpts = rlPPOAgentOptions('SampleTime',0.01,...
    'ExperienceHorizon',2^11,...
    'MiniBatchSize',2^11,...
    'UseDeterministicExploitation',false,...
    'EntropyLossWeight',0.4);

agent = rlPPOAgent(actor,critic,agentOpts);

Train

To train the agent, simply use train.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000000, ...
    'MaxStepsPerEpisode',1000000, ...
    'Verbose',false, ...
    'Plots','training-progress',...
    'ScoreAveragingWindowLength',100,...
    'StopTrainingCriteria',"AverageReward",...
    'UseParallel',true,...
    'StopTrainingValue',10000000);
trainingStats = train(agent,env,trainOpts);

Test

To test the agent, run Test section under PPO.m. You can create your own trajectory by modifying myReset.m function by declaring your own path and tau_vec. You may want to modify some of the termination conditions that are used for training. For example, in myStep.m,

if Time >= total_time+1 && ~IsDone
%     IsDone = true;
    State = State(:,end);
    State(1:3) = pos_error;
    Time = 0;
    tau_vec = 9;
    vel_min = 1; vel_max = 5;
    vel = (vel_max-vel_min)*rand + vel_min;
    dist = vel*tau_vec;
    target = dist * random_unit_vector;
    path = [zeros(1,3); target'];
    Traj = MinimumSnapTrajectory(tau_vec, path);
    Fext = 1 * rand * random_unit_vector;
end

comment out all statements and use IsDone = true.

Trajectory

path consists of x,y,z coordinates of way points. It always starts and ends with zero velocity, acceleration, jerk, and snap. Then you have to declare time interval between way points. For example, you can set path=[0,0,0;1,1,1];. Since it only has start and end points, time interval can be set tau_vec=10;. It would look something like this.

alt text

You can also use time allocation technique to automatically set time intervals tau_vec using timeAllocation with γ parameter. For example,

path = [0,0,0;1,1,0;2,0,0;1,-1,0;0,0,0;-1,1,0;-2,0,0;-1,-1,0;0,0,0];
tau_vec = timeAllocation(path, 100)';

alt text

The greater the γ is, the higher the time penalty, resulting a faster trajectory.

To try a trained model, you can load 0615_FM.mat in results folder.

Aerobatic maneuvers

To train an agent to perform aerobatic maneuvers, declare a trajectory using loopTheLoop, splitS, or canopyRoll in myStep.m. For example,

[tau_vec, path] = splitS();

And modify reward functions so as the body x-axis to follow the direction of the velocity and the body z-axis to follow the direction of the acceleration of the reference trajectory. For example,

x_cos = acos(getCosineSimilarity(xb,desired_state.vel));
z_cos = acos(getCosineSimilarity(zb,desired_state.acc));
r_xb = exp(-(1/(30/180*pi)*x_cos).^2);
r_zb = exp(-(1/(30/180*pi)*z_cos).^2);
rewards = [0.6 0.1 0.1 0.1 0.1] .* [r_pos r_vel r_acc r_xb r_zb];

Each aerobatic maneuver has its own speicificity of the shape of the reward function.

Some results

PPO agent trained with random disturbances. Below is demonstrating the agent with sinusoidal vertical force applied.

PPO agent trained to perform Inside loop

PPO agent trained to perform Split S

PPO agent trained to perform Canopy roll

About

RL based robust controller for quadrotor.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • MATLAB 100.0%