Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo a pass on memory allocations #169

Open
PierreMartinon opened this issue Jul 11, 2024 · 29 comments
Open

Redo a pass on memory allocations #169

PierreMartinon opened this issue Jul 11, 2024 · 29 comments
Assignees

Comments

@PierreMartinon
Copy link
Member

PierreMartinon commented Jul 11, 2024

Todo: check the .mem files thingy (julia --track-allocation=user). Also recheck code_warntype for type unstabilities.

@PierreMartinon
Copy link
Member Author

Reminder: keep track of what has been tried in the performance Discussion

@jbcaillau
Copy link
Member

@PierreMartinon should more or less be subsumed by #188

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 13, 2024

Ha, you wish :D
In the current state, most of the remaining allocations seem to be for the variables passed to the functions, ie t,x,u,v. These are basically the same for in_place / out_of_place, and are surprisingly difficult to get rid of. Trying to reuse pre-allocated vectors (with out_of_place or in_place getters) led to more allocations in practice, which was a bit confusing. Currently the getters return views of vectors, and the scalar conversion is done if needed when calling the OCP functions, ie

ocp.dynamics((@view f[1:dim_OCP_x]), t, _x(x), _u(u), v)

with

_x(x::AbstractVector) = (dim_OCP_x == 1) ? x[1] : x[1:dim_OCP_x]
_u(u::AbstractVector) = (dim_NLP_u == 1) ? u[1] : u

An annoying point is that, probably due to the very small size of our vectors, there is no consistent performance gain when using dot operators and/or views.

Right now both versions have similar allocations, and in_place seems a bit faster. Still working on it.

@jbcaillau
Copy link
Member

jbcaillau commented Sep 13, 2024

@PierreMartinon When you pass sth like x[1:dim_OCP_x] you make a copy of the argument; a view must be passed to avoid this... or more simply x itself (when you pass not a slice but the full vector - no need for a view, then). Check:

julia> f!(y, x) = begin
       y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
       end
f! (generic function with 1 method)

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x))
       end
320

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x))
       end
240

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x[1:3]))
       end
400

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x[1:3]))
       end
320

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), @view(x[1:3])))
       end
240

Also note that more allocations can be saved using @views y[:] .= ... inside the in place function (see control-toolbox/CTBase.jl#271 (comment)); this is what is systematically done in the code generated from the abstract form:

julia> f!(y, x) = begin
       @views y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
       end
f! (generic function with 1 method)

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x))
       end
240

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x))
       end
160

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x[1:3]))
       end
320

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x[1:3]))
       end
240

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), @view(x[1:3])))
       end
160

So as rule, the good combination seems to be:

julia> f!(y, x) = begin
       @views y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
       end
f! (generic function with 1 method)

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x))
       end
160

Moreover, you should be able to write a code that is uniform and works both for vectors and scalars.

@jbcaillau
Copy link
Member

@amontoison any further comments on the post above?

@jbcaillau
Copy link
Member

@PierreMartinon comments on #169 (comment) above?

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 19, 2024

In the example x was already a view, and adding another view in _x gives slightly worse results.

For very small sizes, views are just not better than slices, due to the cost of creating the view (which is not just a pointer).

Regarding types, we still have some unstable ones due to the unions from CTBase (mayer for instance is now a Union{Nothing,Mayer,Mayer!}), but it's hard to tell the actual impact of these.

Same with the scalar / vector case, I ended up having similar performance using either the 'converters' (_x, _u) or getters that directly return scalar or vector values.

I'll probably merge the current inplace branch once it's cleaned up a bit. Performance is slightly better than current main.

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 20, 2024

I think we can still improve the allocations in the constraints: compared to the best_allocs_trapeze branch, introducing the midpoint method and the inplace case added some overhead (current code accepts both inplace and outplace(tm)).

We can probably use work arrays better too. Started to look at Profile.Allocs with PProf, flamegraphs are nice, however the 'source' view is not always accurate. More handy than the .mem files though.

@jbcaillau
Copy link
Member

@PierreMartinon ready to merge and test inplace code? (sorry if I missed sth!)

@PierreMartinon
Copy link
Member Author

Hi @jbcaillau ! Almost. The branch is similar in performance to main so merging is not a regression.
Still a few tests to do for the dynamics part.

In CTBase could we have the option to enable/disable inplace for abstract OCPs ?

@jbcaillau
Copy link
Member

Hi

Hi @jbcaillau ! Almost. The branch is similar in performance to main so merging is not a regression. Still a few tests to do for the dynamics part.

well, performance is supposed to be much better by reducing the number of allocations!

In CTBase could we have the option to enable/disable inplace for abstract OCPs ?

it is planned to keep only the memory efficient code, that is inplace alone. what would be te point to keep out of place?

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 25, 2024

Following our discussion this morning, a few details on the current version, more specifically the constraints.

function DOCP_constraints!(c, xu, docp::DOCP)

    # initialization
    if docp.has_free_t0 || docp.has_free_tf
        docp.get_time_grid!(xu)
    end
    v = docp.get_optim_variable(xu)
    work = setWorkArray(docp, xu, docp.NLP_time_grid, v)

    # main loop on time steps 
    for i = 1:docp.dim_NLP_steps+1
        setConstraintBlock!(docp, c, xu, v, docp.NLP_time_grid, i, work)
    end

    # point constraints
    setPointConstraints!(docp, c, xu, v)

    # NB. the function *needs* to return c for AD...
    return c
end

Here is the loop core function, here for midpoint since it is a bit simpler than trapeze (no save/reuse of dynamics). First we have the discretized dynamics

function setConstraintBlock!(docp::DOCP{Trapeze}, c, xu, v, time_grid, i, work)

    disc = docp.discretization

    # offset for previous steps
    offset = (i-1)*(docp.dim_NLP_x + docp.dim_path_cons)

    # 0. variables
    ti = time_grid[i]
    xi = get_state_at_time_step(xu, docp, i)
    ui = get_control_at_time_step(xu, docp, i)

    #1. state equation
    if i <= docp.dim_NLP_steps
        # more variables
        fi = copy(work) # create new copy, not just a reference
        tip1 = time_grid[i+1]
        xip1 = get_state_at_time_step(xu, docp, i+1)
        uip1 = get_control_at_time_step(xu, docp, i+1)
        docp.dynamics_ext!(work, tip1, xip1, uip1, v)

        # trapeze rule with 'smart' update for dynamics (similar with @.)
        c[offset+1:offset+docp.dim_NLP_x] = xip1 - (xi + 0.5 * (tip1 - ti) * (fi + work)) #+++ jet runtime dispatch here even with explicit index ranges
        offset += docp.dim_NLP_x
    end

followed by the path constraints (it is the same function, I split the quote for clarity).*

 
    # 2. path constraints
    # Notes on allocations:.= seems similar
    if docp.dim_u_cons > 0
        if docp.has_inplace
            docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v)
        else
            c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v)
        end
    end
    if docp.dim_x_cons > 0 
        if docp.has_inplace
            docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v)
        else
            c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v)
        end
    end
    if docp.dim_mixed_cons > 0 
        if docp.has_inplace
            docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v)
        else
            c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons] = docp.mixed_constraints[2](ti, docp._x(xi), docp._u(ui), v)
        end
    end

end

*Here you can see it still accepts the out of place format for comparison purposes. This is also true for the dynamics part, although the encapsulating function dynamics_ext! is always inplace.

The getter for x (u is similar)

function get_state_at_time_step(xu, docp::DOCP{Trapeze}, i)
    nx = docp.dim_NLP_x
    m = docp.dim_NLP_u
    offset = (nx + m) * (i-1)
    return @view xu[(offset + 1):(offset + nx)]
end

JET gives a number of runtime dispatch alerts, unfortunately the output is not comfortable to navigate.

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 25, 2024

The getters for x's and u's seem to allocate 80 bytes for the views (a view with constant indices is 48bytes), regardless of the size. NB. @jbcaillau the size 1 view is still a subarray and is not necessarily compatible with scalar syntax in OCP functions, did I miss something ?

Copying values for slices takes 96, 112, 112, 128, 128, etc (16 bytes alignment), so Views seem always better in terms of allocations.

For whatever reason, when testing the constraints evaluation, we see a bit less allocations for the slice version (299kB vs 308kB for views) but performance is similar. The main drawback of views is that they are allocated at each getter call, while the slice version may work inplace with a reusable work array ? I'll try the variant without getters at all first to check what could be gained.

@amontoison
Copy link
Contributor

A view should not allocate any bytes. Something looks wrong 🤔

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 25, 2024

Hi @amontoison ,
Testing the getters separately with btime did show 0 allocations, I'll stop using the .mem files. The source view from the profiler is more interesting:

setConstraintBlock!(::CTDirect.DOCP{CTDirect.Trapeze}, ::Vector{Float64}, ::Vector{Float64}, ::Float64, ::Vector{Any}, ::Int64, ::Vector{Float64})

/home/pierre/CTDirect.jl/src/trapeze.jl

  Total:           0       7298 (flat, cum) 92.90%
     89            .          .               ui = get_control_at_time_step(xu, docp, i) 
     90            .          .            
     91            .          .               #1. state equation 
     92            .          .               if i <= docp.dim_NLP_steps 
     93            .          .                   # more variables 
     94            .        100                   fi = copy(work) # create new copy, not just a reference 
     95            .          .                   tip1 = time_grid[i+1] 
     96            .          .                   xip1 = get_state_at_time_step(xu, docp, i+1) 
     97            .          .                   uip1 = get_control_at_time_step(xu, docp, i+1) 
     98            .       1300                   docp.dynamics_ext!(work, tip1, xip1, uip1, v) 
     99            .          .            
    100            .          .                   # trapeze rule with 'smart' update for dynamics (similar with @.) 
    101            .        800                   c[offset+1:offset+docp.dim_NLP_x] = xip1 - (xi + 0.5 * (tip1 - ti) * (fi + work)) #+++ jet runtime dispatch here even with explicit index ranges 
    102            .          .                   offset += docp.dim_NLP_x 
    103            .          .               end 
    104            .          .            
    105            .          .               # 2. path constraints 
    106            .          .               # Notes on allocations:.= seems similar 
    107            .          .               if docp.dim_u_cons > 0 
    108            .          .                   if docp.has_inplace 
    109            .          .                       docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v) 
    110            .          .                   else 
    111            .       1632                       c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v) 
    112            .          .                   end 
    113            .          .               end 
    114            .          .               if docp.dim_x_cons > 0  
    115            .          .                   if docp.has_inplace 
    116            .          .                       docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v) 
    117            .          .                   else 
    118            .       1531                       c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v) 
    119            .          .                   end 
    120            .          .               end 
    121            .          .               if docp.dim_mixed_cons > 0  
    122            .          .                   if docp.has_inplace 
    123            .          .                       docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v) 
    124            .          .                   else 
    125            .       1935                       c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons] = docp.mixed_constraints[2](ti, docp._x(xi), docp._u(ui), v) 
    126            .          .                   end 
    127            .          .               end 
    128            .          .            
    129            .          .           end 

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 25, 2024

Inplace version, the constraints still allocate apparently:

    107            .          .                   if docp.has_inplace 
    108            .       1515                       docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v) 
    109            .          .                   else 
    110            .          .                       c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v) 
    111            .          .                   end 
    112            .          .               end 
    113            .          .               if docp.dim_x_cons > 0  
    114            .          .                   if docp.has_inplace 
    115            .       1313                       docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v) 
    116            .          .                   else 
    117            .          .                       c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v) 
    118            .          .                   end 
    119            .          .               end 
    120            .          .               if docp.dim_mixed_cons > 0  
    121            .          .                   if docp.has_inplace 
    122            .       1717                       docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v) 
    123            .          .                   else 

The last one goes to

    207            .        101               function ψ!(val, t, x, u, v) # nonlinear mixed constraints (in place) 
    208            .          .                   j = 1 
    209            .        101                   for i ∈ 1:length(ψf) 
    210            .          .                       li = ψs[i] 
    211            .        909                       ψf[i](@view(val[j:(j + li - 1)]), t, x, u, v) 
    212            .          .                       j = j + li 
    213            .          .                   end 
    214            .          .                   return nothing 
    215            .          .               end 

I defined the constraint for the ocp as

constraint!(goddard, :mixed, f = m_fun!, lb = mf, ub = Inf)

with

    function m_fun!(c, x, u, v) 
        @views c[:] .= x[3]
        return
    end

(basic c[1] = x[3] is similar)

@amontoison
Copy link
Contributor

Did you try @code_warntype? I suspect a type instability.

@PierreMartinon
Copy link
Member Author

I think so too. Warntype was mostly ok on the constraints but it did not descend into the subfunctions. JET gave runtime dispatch at all OCP function calls, but I find the output harder to read. I'll try warntype on the subfunctions, thanks !

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Sep 27, 2024

Some progress. Fixed a type instability on the times, and got rid of the scalar/vector Union for the state/control/optim variables. Reduction of allocations is not as big as hoped though, still looking into it.

Update: getters for t,x,u,v are now type-stable and show 0 allocations. For the times we split the Union and use 2 distinct variables for fixed time vs free time index. For the scalar/vector x,u,v we now have 3 parametric types in DOCP and the getters dispatch along these types. Each getter only has 2 methods (scalar / vector) for its own variable, regardless of the others.

The vast majority of the remaining allocations occurs when calling OCP functions: mayer, dynamics, lagrange, and all constraints. JET indicates runtime dispatch at all these calls, and code_warntype flags them as 'Any'.
So here we are :D

I'll start with mayer.

@jbcaillau
Copy link
Member

@PierreMartinon thanks for the update (and @amontoison for the feedback):

  • good to see using views does not allocate (!)
  • so did splitting unions and using dispatch suppressed type instabilities?
  • what about the ocp primitives you mentioned (mayer and so)?
  • at the moment, we have union of vectors and scalars everywhere in ocp...

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Oct 3, 2024

@PierreMartinon thanks for the update (and @amontoison for the feedback):

* good to see using views does not allocate (!)

* so did splitting unions and using dispatch suppressed type instabilities?

Yes it certainly seems so. Older code used single methods with if statements and possibly different/ambiguous return types. Current code uses

  • two different variables for value/index in the case of initial.time and final.time
  • for the getters (x,u,v), DOCP is now parametrized by 3 indicators corresponding to scalar / vector state, control and variable. Getters have 2 methods each. There may be better ways to do this, but it seems to do the job. Fortunately there is no combinatorial aspect on the different types: each getter only cares about its own type.
abstract type ScalVect end
struct ScalVariable <: ScalVect end
struct VectVariable <: ScalVect end
...
struct DOCP{T <: Discretization, X <: ScalVect, U <: ScalVect, V <: ScalVect}
...
function get_OCP_state_at_time_step(xu, docp::DOCP{Trapeze, ScalVariable, <: ScalVect, <: ScalVect}, i)
    offset = (i-1) * (docp.dim_NLP_x + docp.dim_NLP_u)
    return xu[offset+1]
end
function get_OCP_state_at_time_step(xu, docp::DOCP{Trapeze, VectVariable, <: ScalVect, <: ScalVect}, i)
    offset = (i-1) * (docp.dim_NLP_x + docp.dim_NLP_u)
    return @view xu[(offset + 1):(offset + docp.dim_OCP_x)]
end
* what about the ocp primitives you mentioned (mayer and so)?

* at the moment, we have union of vectors and scalars everywhere in ocp...

See below. Maybe we'll add some post-processing in the OCP to define functions with more fixed types ?

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Oct 3, 2024

Looking at the mayer cost which is one of the simpler calls, and also the main part of the objective. 'Local Mayer' is the (inplace) locally defined version

function local_mayer(obj, x0, xf, v)
    obj[1] = xf[3]
    return
end

while 'OCP Mayer' refers to docp.ocp.mayer from CTBase. The call is done as follows (with $ for btime calls)

xu = CTDirect.DOCP_initial_guess(docp)
x0 = CTDirect.get_OCP_state_at_time_step(xu, docp, 1)
xf = CTDirect.get_OCP_state_at_time_step(xu, docp, N+1)
v = CTDirect.get_OCP_variable(xu, docp)
obj = similar(xu,1)
docp.ocp.mayer(obj, x0, xf, v)

We see that the local Mayer does not allocate, either with hardcoded views / scalar arguments or the new getters (while old getters added some allocations there).
However, the OCP Mayer has some allocations, which I'm trying to track.

julia> test_unit(;test_mayer=true, warntype=true, jet=true, profile=true)

Local Mayer: views for x0/xf and scalar v  4.474 ns (0 allocations: 0 bytes)
Local Mayer: param scal/vec getters  2.467 ns (0 allocations: 0 bytes)
OCP Mayer: param scal/vec getters  28.047 ns (3 allocations: 112 bytes)

Code_warntype looks clean except for the F.f! function that is tagged as 'Any'.

MethodInstance for (::Mayer!{NonFixed})(::Vector{Float64}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::Float64)
  from (F::Mayer!{NonFixed})(r::Union{Real, AbstractVector{<:Real}}, x0::Union{Real, AbstractVector{<:Real}}, xf::Union{Real, AbstractVector{<:Real}}, v::Union{Real, AbstractVector{<:Real}}) @ CTBase ~/.julia/packages/CTBase/0TgT4/src/functions.jl:179
Arguments
  F::Mayer!{NonFixed}
  r::Vector{Float64}
  x0::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
  xf::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
  v::Float64
Locals
  @_6::Any
Body::Nothing
1 ─ %1 = CTBase.Nothing::Core.Const(Nothing)
│   %2 = Base.getproperty(F, :f!)::Function
│   %3 = (%2)(r, x0, xf, v)::Any
│        (@_6 = %3)
│   %5 = (@_6 isa %1)::Bool
└──      goto #3 if not %5
2 ─      goto #4
3 ─ %8 = Base.convert(%1, @_6)::Core.Const(nothing)
└──      (@_6 = Core.typeassert(%8, %1))
4 ┄      return @_6::Nothing

JET indicates 2 runtime dispatches for the call to F.f!, the second one seems related to the return type ?

═════ 2 possible errors found ═════
┌ (::Mayer!{…})(r::Vector{…}, x0::SubArray{…}, xf::SubArray{…}, v::Float64) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
│ runtime dispatch detected: %1::Function(r::Vector{Float64}, x0::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, xf::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::Float64)::Any
└────────────────────
┌ (::Mayer!{…})(r::Vector{…}, x0::SubArray{…}, xf::SubArray{…}, v::Float64) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
│ runtime dispatch detected: convert(CTBase.Nothing, %2::Any)::Nothing
└────────────────────

Profile locates the 3 allocations at the f! call, and below that the trace looks very low level.

(::Mayer!{NonFixed})(::Vector{Float64}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::Float64)

/home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl
        
    179            .          .           function (F::Mayer!{NonFixed})(r::ctVector, x0::State, xf::State, v::Variable)::Nothing 
    180            .          3               return F.f!(r, x0, xf, v) 
    181            .          .           end 

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Oct 3, 2024

@jbcaillau @ocots Question: State is currently an union of scalar / vector, would it be possible, when building the ocp, to define State only after reading the state dimension, so it would be either scalar or vector, but not both ? Iirc the dimensions of the problems are supposed to be declared before the functions when using the def macro. Same for Control and Variable.

@gdalle
Copy link

gdalle commented Oct 31, 2024

The vast majority of the remaining allocations occurs when calling OCP functions: mayer, dynamics, lagrange, and all constraints. JET indicates runtime dispatch at all these calls, and code_warntype flags them as 'Any'.
So here we are :D

This confirms what we saw yesterday with @ocots. Your main problem everywhere (here and in CTBase) is fields with abstract types.

For more performance tips, as well as a faster way of profiling (@profview and @profview_allocs), check out my Julia & Optimization Days tutorial: https://gdalle.github.io/JuliaOptimizationDays2024-FastJulia/

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Nov 15, 2024

Hi there @jbcaillau @ocots @gdalle @joseph-gergaud,

After reworking the code a bit, the state equation part is no longer allocating :-)
Used both dot and views macros, as well as a single big work array for all dynamics evaluations (this latter part may take advantage of the simd macro at some point).
This removed 10 to 20% of the allocations for the constraints evaluation, with the rest occuring when calling the OCP functions, likely due to the runtime dispatch.
Olivier, I don't know if this is useful, but I can check the call for each OCP function individually.

@PierreMartinon
Copy link
Member Author

PierreMartinon commented Nov 19, 2024

To illustrate, here are the results of the tests in benchmark/prof.jl on the simple integrator problem:

@def ocp begin
    t ∈ [0, 1], time
    x ∈ R, state
    u ∈ R², control
    [0, 0] ≤ u(t) ≤ [Inf, Inf]
    x(0) == -1
    x(1) == 0
    ẋ(t) == -x(t) -u[1](t) + u[2](t)
    ∫((u[1](t)+u[2](t))^2) → min
end

a) @btime from BenchmarkTools indicates the execution time and memory allocations:

print("Objective"); @btime CTDirect.DOCP_objective($xu, $docp)
print("Constraints"); @btime CTDirect.DOCP_constraints!($c, $xu, $docp)
print("Transcription"); @btime direct_transcription($prob.ocp, grid_size=$grid_size)
print("Solve"); @btime direct_solve($prob.ocp, display=false, grid_size=$grid_size)

outputs

Objective  17.436 ns (2 allocations: 64 bytes)
Constraints  16.141 μs (1646 allocations: 59.45 KiB)
Transcription  5.817 ms (70264 allocations: 5.83 MiB)
Solve  15.649 ms (215572 allocations: 13.78 MiB)

b) code_warntype gives some information about the types

@code_warntype CTDirect.DOCP_objective(xu, docp)

gives (not showing the Body part which details each line / operation and is quite verbose)

MethodInstance for CTDirect.DOCP_objective(::Vector{Float64}, ::CTDirect.DOCP{CTDirect.Trapeze, CTDirect.ScalVariable, CTDirect.VectVariable, CTDirect.VectVariable})
  from DOCP_objective(xu, docp::CTDirect.DOCP) @ CTDirect ~/CTDirect.jl/src/problem.jl:338
Arguments
  #self#::Core.Const(CTDirect.DOCP_objective)
  xu::Vector{Float64}
  docp::CTDirect.DOCP{CTDirect.Trapeze, CTDirect.ScalVariable, CTDirect.VectVariable, CTDirect.VectVariable}
Locals
  x0::Float64
  xf::Float64
  v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
  N::Int64
  obj::Vector{Float64}
Body::Float64
 ...

and

@code_warntype CTDirect.DOCP_constraints!(c, xu, docp)

gives

MethodInstance for CTDirect.DOCP_constraints!(::Vector{Float64}, ::Vector{Float64}, ::CTDirect.DOCP{CTDirect.Trapeze, CTDirect.ScalVariable, CTDirect.VectVariable, CTDirect.VectVariable})
  from DOCP_constraints!(c, xu, docp::CTDirect.DOCP) @ CTDirect ~/CTDirect.jl/src/problem.jl:378
Arguments
  #self#::Core.Const(CTDirect.DOCP_constraints!)
  c::Vector{Float64}
  xu::Vector{Float64}
  docp::CTDirect.DOCP{CTDirect.Trapeze, CTDirect.ScalVariable, CTDirect.VectVariable, CTDirect.VectVariable}
Locals
  @_5::Union{Nothing, Tuple{Int64, Int64}}
  work::Vector{Float64}
  v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
  time_grid::Vector{Float64}
  i::Int64
Body::Vector{Float64}
 ...

There are no 'Any' types anymore with the parametrized getters handling the scalar/vector distinction.

c) Another tool is JET that signals possible runtime dispatch cases in particular (I had to force the display for whatever reason):

display(@report_opt CTDirect.DOCP_objective(xu, docp))

gives a runtime dispatch when calling the Mayer cost in DOCP_objective, as well as 2x2 runtime dispatches for the Mayer cost in the OCP.

═════ 5 possible errors found ═════
┌ DOCP_objective(xu::Vector{…}, docp::CTDirect.DOCP{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:352
│┌ (::Mayer!{Fixed})(r::Vector{Float64}, x0::Float64, xf::Float64, v::SubArray{Float64, 1, Vector{…}, Tuple{…}, true}) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:176
││ runtime dispatch detected: %1::Function(r::Vector{Float64}, x0::Float64, xf::Float64)::Any
│└────────────────────
│┌ (::Mayer!{Fixed})(r::Vector{Float64}, x0::Float64, xf::Float64, v::SubArray{Float64, 1, Vector{…}, Tuple{…}, true}) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:176
││ runtime dispatch detected: convert(CTBase.Nothing, %2::Any)::Nothing
│└────────────────────
│┌ (::Mayer!{NonFixed})(r::Vector{Float64}, x0::Float64, xf::Float64, v::SubArray{Float64, 1, Vector{…}, Tuple{…}, true}) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
││ runtime dispatch detected: %1::Function(r::Vector{Float64}, x0::Float64, xf::Float64, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
│┌ (::Mayer!{NonFixed})(r::Vector{Float64}, x0::Float64, xf::Float64, v::SubArray{Float64, 1, Vector{…}, Tuple{…}, true}) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
││ runtime dispatch detected: convert(CTBase.Nothing, %2::Any)::Nothing
│└────────────────────
┌ DOCP_objective(xu::Vector{…}, docp::CTDirect.DOCP{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:352
│ runtime dispatch detected: %99::Union{Nothing, Mayer, Mayer!}(%3::Vector{Float64}, %95::Float64, %69::Float64, %43::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
└────────────────────

For the constraints

display(@report_opt CTDirect.DOCP_constraints!(c, xu, docp))

we get something similar with 1 runtime dispatch for the calls to the dynamics and lagrange cost, and 2 for the control/state/mixed constraints, and boundary/variable constraints.

═════ 12 possible errors found ═════
┌ DOCP_constraints!(c::Vector{…}, xu::Vector{…}, docp::CTDirect.DOCP{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:383
│┌ setWorkArray(docp::CTDirect.DOCP{…}, xu::Vector{…}, time_grid::Vector{…}, v::SubArray{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:85
││ runtime dispatch detected: %130::Union{Nothing, Dynamics, Dynamics!}(%167::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, %56::Float64, %80::Float64, %126::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
│┌ setWorkArray(docp::CTDirect.DOCP{…}, xu::Vector{…}, time_grid::Vector{…}, v::SubArray{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:88
││ runtime dispatch detected: %173::Union{Nothing, Lagrange, Lagrange!}(%211::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, %56::Float64, %80::Float64, %126::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
┌ DOCP_constraints!(c::Vector{…}, xu::Vector{…}, docp::CTDirect.DOCP{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:387
│┌ setConstraintBlock!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}, time_grid::Vector{…}, i::Int64, work::Vector{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:131
││ runtime dispatch detected: (%699::Any)[2]::Any
│└────────────────────
│┌ setConstraintBlock!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}, time_grid::Vector{…}, i::Int64, work::Vector{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:131
││ runtime dispatch detected: %700::Any(%737::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, %22::Float64, %92::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
│┌ setConstraintBlock!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}, time_grid::Vector{…}, i::Int64, work::Vector{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:134
││ runtime dispatch detected: (%743::Any)[2]::Any
│└────────────────────
│┌ setConstraintBlock!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}, time_grid::Vector{…}, i::Int64, work::Vector{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:134
││ runtime dispatch detected: %744::Any(%785::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, %22::Float64, %46::Float64, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
│┌ setConstraintBlock!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}, time_grid::Vector{…}, i::Int64, work::Vector{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:137
││ runtime dispatch detected: (%791::Any)[2]::Any
│└────────────────────
│┌ setConstraintBlock!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}, time_grid::Vector{…}, i::Int64, work::Vector{…}) @ CTDirect /home/pierre/CTDirect.jl/src/trapeze.jl:137
││ runtime dispatch detected: %792::Any(%837::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, %22::Float64, %46::Float64, %92::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
┌ DOCP_constraints!(c::Vector{…}, xu::Vector{…}, docp::CTDirect.DOCP{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:391
│┌ setPointConstraints!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:445
││ runtime dispatch detected: (%66::Any)[2]::Any
│└────────────────────
│┌ setPointConstraints!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:445
││ runtime dispatch detected: %67::Any(%104::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, %33::Float64, %60::Float64, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any
│└────────────────────
│┌ setPointConstraints!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:450
││ runtime dispatch detected: (%110::Any)[2]::Any
│└────────────────────
│┌ setPointConstraints!(docp::CTDirect.DOCP{…}, c::Vector{…}, xu::Vector{…}, v::SubArray{…}) @ CTDirect /home/pierre/CTDirect.jl/src/problem.jl:450
││ runtime dispatch detected: %111::Any(%152::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true})::Any

It seems each call to an OCP function is flagged with 2 possible runtime dispatch, except dynamics and lagrange cost with only 1.

d) To conclude, we show the profiler ouptut for the constraints, to see where the allocations occur

Profile.Allocs.@profile sample_rate=1.0 CTDirect.DOCP_constraints!(c, xu, docp)
results = Profile.Allocs.fetch()
PProf.Allocs.pprof()

collect the profiling data and gives a link to a local webpage with the visualization from PProf

julia> test_unit(test_obj=false, test_trans=false, test_solve=false, profile=true)
Constraints  16.009 μs (1646 allocations: 59.45 KiB)
Analyzing 1646 allocation samples... 100%|█████████████████████████████████████████████████████| Time: 0:00:01
julia> Main binary filename not available.
Serving web UI on http://localhost:62261

I find the flamegraph and source views the most useful, even if they tend to be cluttered with a lot of noise from the julia internals (maybe the depth of analysis can be limited or something). The source view matches allocations with lines, such as

    378            .          .           function DOCP_constraints!(c, xu, docp::DOCP) 
    379            .          .            
    380            .          .               # initialization 
    381            .          .               time_grid = get_time_grid(xu, docp) 
    382            .          .               v = get_OCP_variable(xu, docp) 
    383            .       1618               work = setWorkArray(docp, xu, time_grid, v) 
    384            .          .            
    385            .          .               # main loop on time steps  
    386            .          .               for i = 1:docp.dim_NLP_steps+1 
    387            .          .                   setConstraintBlock!(docp, c, xu, v, time_grid, i, work) 
    388            .          .               end 
    389            .          .            
    390            .          .               # point constraints 
    391            .         28               setPointConstraints!(docp, c, xu, v) 
    392            .          .            
    393            .          .               # NB. the function *needs* to return c for AD... 
    394            .          .               return c 
    395            .          .           end 

which shows that the discretized dynamics in setConstraintsBlock! does not allocate anymore. For this problem with no path constraints almost all remaning allocations come from the dynamics / lagrange calls that allocate each time (cf the runtime dispatch warnings from JET):

     74            .          .           function setWorkArray(docp::DOCP{Trapeze}, xu, time_grid, v) 
     75            .          .               # use work array to store all dynamics + lagrange costs 
     76            .          2               work = similar(xu, docp.dim_NLP_x * (docp.dim_NLP_steps+1)) 
     77            .          .            
     78            .          .               # loop over time steps 
     79            .          .               for i = 1:docp.dim_NLP_steps+1 
     80            .          .                   offset = (i-1) * docp.dim_NLP_x 
     81            .          .                   ti = time_grid[i] 
     82            .          .                   xi = get_OCP_state_at_time_step(xu, docp, i) 
     83            .          .                   ui = get_OCP_control_at_time_step(xu, docp, i) 
     84            .          .                   # OCP dynamics 
     85            .        808                   docp.ocp.dynamics((@view work[offset+1:offset+docp.dim_OCP_x]), ti, xi, ui, v) 
     86            .          .                   # lagrange cost 
     87            .          .                   if docp.is_lagrange 
     88            .        808                       docp.ocp.lagrange((@view work[offset+docp.dim_NLP_x:offset+docp.dim_NLP_x]), ti, xi, ui, v) 
     89            .          .                   end 
     90            .          .               end 
     91            .          .               return work 
     92            .          .           end 

@gdalle
Copy link

gdalle commented Nov 19, 2024

Yes, that is due to field annotations like

struct MyType
    f::Function
end

which should be

struct MyType{F}
    f::F
end

@PierreMartinon
Copy link
Member Author

Thanks for the tip @gdalle , @ocots is currently reworking the OCP model and hopefully we can get rid of these remaining allocations.

@jbcaillau
Copy link
Member

back on track: 🚀 @PierreMartinon for cleaning up this and thanks @gdalle for the input. current refactoring by @ocots should take care of the remaining issues 🤞🏾

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants