
add methods #4

Open · 5 of 9 tasks
lacava opened this issue Feb 20, 2019 · 19 comments
Labels: enhancement, help wanted

Comments

@lacava (Member) commented Feb 20, 2019

Add SR methods for comparison. The following come to mind:

lacava added the help wanted and enhancement labels on Feb 20, 2019
@jmmcd (Contributor) commented Mar 5, 2019

Another one that would be very nice is PGE (Worm & Chiu, GECCO 2013).

Paper: http://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_gecco2013.pdf
Code: https://github.com/verdverm/pypge

@lacava (Member, Author) commented Mar 5, 2019

Thanks, I'll reach out. It doesn't look like it's being maintained.

@folivetti (Contributor):

If I may, my algorithm was just accepted for publication:

Paper: https://www.mitpressjournals.org/doi/abs/10.1162/evco_a_00285
Code: https://github.com/folivetti/ITEA

Even though the code is in Haskell, I have included a Python wrapper in my repository, similar to your wrappers. Let me know if I can be of any help!

@lacava (Member, Author) commented Dec 21, 2020

Hi @folivetti, thanks for sharing. I'm going to upload a contributing guide soon that will detail how to include your method. Please stay tuned.

@lacava (Member, Author) commented Jan 13, 2021

Hi @folivetti, please see the contributing guide on the dev branch: https://github.com/EpistasisLab/regression-benchmark/blob/dev/CONTRIBUTING.md

Eventually this will be merged into master (still working on some hiccups with existing methods), but if you would like to start now, you can issue a PR to contribute your method to the dev branch. Let me know if you have any questions!
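
Roughly speaking, a contributed method boils down to a small Python module (under something like experiment/methods/) that exposes a ready-to-fit estimator and a function returning the final model as a string. The sketch below is only illustrative; the file name, the toy estimator, and the exact field names are placeholders, and the contributing guide is the authoritative reference:

# experiment/methods/ExampleRegressor.py -- illustrative layout only
from sklearn.linear_model import LinearRegression

class ExampleRegressor(LinearRegression):
    """Toy stand-in for a real SR method that can report its model as a string."""
    def expression(self):
        terms = [f'{c:.3g}*x_{i}' for i, c in enumerate(self.coef_)]
        return ' + '.join(terms) + f' + {self.intercept_:.3g}'

# estimator instance that the benchmark will fit
est = ExampleRegressor()

# small hyperparameter grid to tune over
hyper_params = [{'fit_intercept': (True, False)}]

def model(est, X=None):
    # return the final model as a sympy-compatible string
    return est.expression()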

@folivetti (Contributor):

Thanks! I guess my code is already halfway there. As soon as I get the time, I'll make the PR.

@folivetti (Contributor):

I finally took the time to implement the Python wrapper for ITEA. I have one final question: my code is written in Haskell and uses stack as the build tool. Should I include the installation of stack in the install script, or should I list it as a requirement in a README file?

To install stack you only need to run curl -sSL https://get.haskellstack.org/ | sh, but it may require sudo permission since it installs GMP.

@lacava (Member, Author) commented Mar 21, 2021 via email

@lacava (Member, Author) commented Mar 22, 2021

> To install stack you only need to run curl -sSL https://get.haskellstack.org/ | sh, but it may require sudo permission since it installs GMP.

sudo is OK.

Also wanted to mention that you can test the install locally by doing something like:

./configure
./install 
cd experiment
python -m pytest -v 

Also see the GitHub workflow for more info.

@lacava (Member, Author) commented Apr 17, 2021

@folivetti hope you got my email, but just checking: do you think you'll have time to get ITEA integrated this week? Many thanks!

@folivetti (Contributor):

> @folivetti hope you got my email, but just checking: do you think you'll have time to get ITEA integrated this week? Many thanks!

Yes, I did receive the email, thanks :-)
I have everything ready and should make the PR tomorrow. I'm just running some tests to double-check that everything works. Thanks!

@gkronber commented May 4, 2021

I'm adding a few more methods for future reference.

While it would be nice to have a transparent and objective way to compare all those methods, it will probably be impossible to have every SymReg method included in srbench, for various reasons (e.g. closed source, difficulty of providing a Python wrapper, methods tuned to work well only for certain problem characteristics, uncooperative authors, ...).

Researchers publishing SymReg methods should be made aware of srbench. I argue that, when reviewing or reading papers, we should be increasingly careful about new SymReg methods that are not included in srbench, even when they are published in reputable journals.

@lacava (Member, Author) commented May 4, 2021

Thanks for the list, @gkronber. Deep SR is implemented and I'm working on AI-Feynman.

lacava mentioned this issue May 7, 2021
@MilesCranmer (Contributor):

Hi @lacava et al., thanks for making this benchmark suite, it looks great! I just found out about your efforts on this today, I think it is a great idea.

I would be interested in helping add my methods: the Julia library SymbolicRegression.jl (mentioned in @gkronber's post) and the Python frontend PySR which I actively maintain. Before I get started, just to check, would it be doable to include Julia as part of the benchmarking script?

Second, what kinds of resources are available for the benchmark? My library tends to find better results the longer it runs, and it can be parallelized across multiple nodes.

Third, my methods output a list of equations rather than a single one. Is there a way I can pass the entire list through, or should I choose one equation to pass?

Lastly, I was wondering about benchmark coverage: I have a "high-dimensional" SR method described a bit here (https://arxiv.org/abs/2006.11287) which is made for sequences, sets, and graphs. Is there a benchmark included here for high-dimensional SR?

Thanks!
Miles

@lacava (Member, Author) commented Dec 13, 2021

> Hi @lacava et al., thanks for making this benchmark suite, it looks great! I just found out about your efforts on this today, I think it is a great idea.

Great! Thanks for reaching out.

> I would be interested in helping add my methods: the Julia library SymbolicRegression.jl (mentioned in @gkronber's post) and the Python frontend PySR which I actively maintain. Before I get started, just to check, would it be doable to include Julia as part of the benchmarking script?

We should definitely be able to support Julia. It will be easiest if there is a conda dependency for it. But we also are moving towards a Docker environment eventually.

> Second, what kinds of resources are available for the benchmark? My library tends to find better results the longer it runs, and it can be parallelized across multiple nodes.

In our current experiment (Table 2) we set the termination criteria to 500k evaluations per training or 48 hours for the real-world datasets, and 1M evaluations or 8 hours for the synthetic ground-truth datasets.

Most of the methods here are parallelizable, but because we're running 252 datasets, 10 trials, and 21 methods, it made more sense to give each a single core. The cluster we used has ~1100 CPU cores.

> Third, my methods output a list of equations rather than a single one. Is there a way I can pass the entire list through, or should I choose one equation to pass?

Only a single final model should be returned. Otherwise it wouldn't be a fair comparison since your method would have several chances to "win". (Incidentally, most of the GP-based SR methods also have a set of models, and use a hold-out set for final model selection. We could think about ways of comparing sets of equations in the future, but don't do so right now.)

Also, it would be ideal to return the equation string in sympy-compatible format to avoid a lot of post-processing from the last round.
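
For example (just a sketch, not our actual selection code), picking a single equation from a candidate list with a hold-out split and handing back a sympy-parsable string could look like:

import numpy as np
import sympy
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# toy data and a toy list of candidate expressions (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + np.sin(X[:, 1])
_, X_val, _, y_val = train_test_split(X, y, random_state=0)

candidates = ['2*x0', '2*x0 + sin(x1)', '2*x0 + sin(x1) + 0.01*x0*x1']

def predict(expr_str, X):
    # evaluate a sympy-compatible expression string on the data
    f = sympy.lambdify(sympy.symbols('x0 x1'), sympy.sympify(expr_str), 'numpy')
    return f(X[:, 0], X[:, 1])

# keep the single equation with the best hold-out error; that one string is
# what gets returned to the benchmark
best = min(candidates, key=lambda e: mean_squared_error(y_val, predict(e, X_val)))
print(best)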

> Lastly, I was wondering about benchmark coverage: I have a "high-dimensional" SR method described a bit here (https://arxiv.org/abs/2006.11287) which is made for sequences, sets, and graphs. Is there a benchmark included here for high-dimensional SR?

Currently we've mostly looked at tabular data. Have a look at the datasets in PMLB; the widest datasets have on the order of hundreds of features. But we're always looking for good benchmark problems.
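
If it helps to get a feel for the dimensionality, the PMLB datasets can be pulled directly with the pmlb package (the dataset name below is just one example):

from pmlb import fetch_data

# fetch one PMLB regression dataset and check its shape
X, y = fetch_data('529_pollen', return_X_y=True)
print(X.shape)  # (n_samples, n_features); the widest PMLB sets have on the order of hundreds of features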

@MilesCranmer (Contributor):

Thanks! This is very helpful.

I have a quick follow-up question about the suite. Are you benchmarking accuracy, or parsimony, or some combination? Or are you evaluating whether the recovered sympy expression is equal to the ground truth? PySR's default choice for "best" is similar to Eureqa, where they look for "cliffs" in the accuracy-vs-parsimony curve.

Also, final question: can the model use a different set of hyperparameters for the noisy vs. non-noisy datasets (e.g., to simulate an experimenter who knows a priori whether their data is noisy)?

Thanks again,
Miles

@lacava (Member, Author) commented Dec 14, 2021

> Are you benchmarking accuracy, or parsimony, or some combination? Or are you evaluating whether the recovered sympy expression is equal to the ground truth? PySR's default choice for "best" is similar to Eureqa, where they look for "cliffs" in the accuracy-vs-parsimony curve.

It's probably worth checking the details in the paper. We broke the comparison into real-world/"black-box" problems with no known model, and ground-truth problems generated from known functions. We benchmark accuracy and parsimony in the former case, and symbolic equivalence (within a linear transformation of the true model) in the latter.
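
As a rough illustration of the idea (not our actual check; the paper and repo define the exact criterion), a sympy-based test can ask whether the candidate's difference from, or ratio to, the true model simplifies to a constant:

import sympy

true_model = sympy.sympify('sin(x0) + x0**2')
candidate = sympy.sympify('3*sin(x0) + 3*x0**2')  # the true model, rescaled

# if either the difference or the ratio simplifies to a constant,
# the candidate matches the true model up to that transformation
diff = sympy.simplify(candidate - true_model)
ratio = sympy.simplify(candidate / true_model)
print(diff.is_constant() or ratio.is_constant())  # True here, via the ratio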

> Also, final question: can the model use a different set of hyperparameters for the noisy vs. non-noisy datasets (e.g., to simulate an experimenter who knows a priori whether their data is noisy)?

We don't support this at the moment, but most of the benchmarks have some amount of noise. One of our study findings was that AI-Feynman was particularly sensitive to target label noise.

@MilesCranmer (Contributor):

Added PySR and SymbolicRegression.jl in this PR: #62. Let me know what else I need to add, thanks!
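
For reference, a minimal standalone PySR call looks roughly like the sketch below (this uses the scikit-learn-style PySRRegressor interface; argument names may differ between releases):

import numpy as np
from pysr import PySRRegressor

# toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.5 * np.cos(X[:, 1]) + X[:, 0] ** 2

model = PySRRegressor(
    niterations=40,
    binary_operators=['+', '*'],
    unary_operators=['cos'],
    model_selection='best',  # Eureqa-style pick from the accuracy-vs-complexity front
)
model.fit(X, y)
print(model.sympy())  # the single selected equation as a sympy expression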

@fnpdaml commented Mar 2, 2023

> Add SR methods for comparison. The following come to mind:

Also please consider TuringBot:
https://turingbotsoftware.com/
(The free version is limited to a maximum of 50 rows of input data and 3 variables.)
Nevertheless, it has had the best success ratio in my empirical, personal usage.

From the documentation:
"TuringBot is also a console application that can be executed in a fully automated and customizable way"
https://turingbotsoftware.com/documentation.html#command-line

Again, I have no relation to the authors and/or copyright holders.
Cheers.
