Hi @MilesCranmer,

This is a strategy I used to greatly speed up my search. Letting SR run from scratch found a great fit over quite a long search (6 variables), and the result was eventually a combination of nonlinear interactions plus linear terms.
I didn't like the idea of running MLR first and then feeding the residuals to SR, as that would restrict the search space.
Instead, I did this in reverse: perform MLR on the residuals of the SR expression, inside a custom loss function.
The hope is that SR can explore the nonlinear interactions without wasting search effort on also finding the linear components (where applicable).
This way, the linear components mould themselves to the residuals of the expression being assessed, and hopefully this does not restrict the search space. It is also very useful if you have a categorical variable that you know has a linear relationship to y.
It does, however, mean that you need to run MLR on the residuals of the chosen expression afterward, so it is a bit more of a pain to recover the final expression (see the sketch at the end of this post). The reported complexity will also be higher than it might otherwise have been, although after simplification/factorisation it is not so different.
using SymbolicRegression, Statistics, LoopVectorization, Bumper, MLJ, DataFrames, CategoricalArrays, Zygote, SparseArrays, IterativeSolvers, TensorBoardLogger
function loss_fnc(tree, dataset::Dataset{T,L}, options, idx) where {T,L}
    # Extract data for the given indices (idx is only set when batching)
    X = idx === nothing ? dataset.X : dataset.X[:, idx]
    y = idx === nothing ? dataset.y : view(dataset.y, idx)
    weights = idx === nothing ? dataset.weights : view(dataset.weights, idx)  # categorical variable with a linear relationship to y, passed in as weights

    prediction, grad, complete = eval_grad_tree_array(tree, X, options; variable=true)
    if !complete
        return L(Inf)
    end
    # ... other code, such as monotonicity checks etc. Return early where possible to avoid unnecessary MLR processing.

    residuals = Float32.(prediction .- y)

    # One-hot encoding for the categorical variable (passed as weights)
    unique_weights = unique(weights)
    weight_map = Dict(weight => i for (i, weight) in enumerate(unique_weights))
    Z_w = sparse(1:length(weights), [weight_map[weights[i]] for i in 1:length(weights)], 1.0f0, length(weights), length(unique_weights))

    # Design matrix: intercept, one-hot categorical columns, raw features, and the SR prediction itself
    intercept = ones(Float32, length(weights))
    Z_X = X'
    Z_SR = Float32.(prediction)
    Z = hcat(intercept, Z_w, Z_X, Z_SR)

    # Fit the linear part to the residuals, then score the combined model
    β = IterativeSolvers.lsqr(Z, residuals)
    prediction .-= Z * β
    mse = mean((prediction .- y) .^ 2)
    return mse
end
This code does converge very quickly indeed in my use case. Sparse matrices and IterativeSolvers really sped up the processing time.
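For context, here is roughly how the loss plugs into the search via the MLJ interface. This is only a minimal sketch: the operators, batch settings, and data/column names are placeholders rather than my exact setup; the key points are loss_function=loss_fnc and passing the categorical variable through as the weights vector.

# Minimal usage sketch (placeholder data and settings)
model = SRRegressor(
    binary_operators=[+, -, *, /],
    unary_operators=[exp],
    loss_function=loss_fnc,
    batching=true,     # when batching, `idx` in the loss is the minibatch indices
    batch_size=500,
)
df = DataFrame(x1=rand(Float32, 1000), x2=rand(Float32, 1000))   # placeholder features
y = rand(Float32, 1000)                                          # placeholder target
category = Float32.(rand(1:4, 1000))                             # categorical variable, smuggled in as weights
mach = machine(model, df, y, category)
fit!(mach)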
It would be amazing to have some level of integration for MLR in one way or another as an option, although I can definitely see the importance of letting SR explore without any bias or predilection.
Do you think running MLR even 'in reverse' like this could still restrict the search space, since the search might evolve to rely on the linear components?
Edit: I also add the output of SR as a column in the MLR, so that it acquires its own coefficient. Very happy to share logs using the new logging feature to demonstrate just how much faster it converges.
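To make the "run MLR on the residuals afterward" step concrete, here is a sketch of how the final model can be recovered once the search has finished. The combined model is y_hat = f_SR(x) - Z*β, where f_SR is the chosen SR expression and Z is the same design matrix as in the loss (intercept, one-hot categorical, raw features, and f_SR(x) itself). The variable names (df_full, y_full, category_full) are illustrative and follow on from the sketch above.

# Predictions of the chosen expression on the full data set
f_SR = Float32.(predict(mach, df_full))
n = length(y_full)

# Rebuild the same design matrix used inside the loss
weight_map = Dict(w => i for (i, w) in enumerate(unique(category_full)))
Z_w = sparse(1:n, [weight_map[w] for w in category_full], 1.0f0, n, length(weight_map))
Z = hcat(ones(Float32, n), Z_w, Float32.(Matrix(df_full)), f_SR)

# Fit the linear part to the residuals of the SR expression, then combine
β = IterativeSolvers.lsqr(Z, f_SR .- Float32.(y_full))
ŷ = f_SR .- Z * β

The entries of β then give the intercept, the per-category offsets, the linear feature terms, and the coefficient on the SR output, which can all be folded back into a single closed-form expression.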