This repository contains the code for

- a paper,
- (parts of) a dissertation,
- and the Python package `cffs`.
This README provides:
- An overview of the related publications.
- An outline of the repo structure.
- Steps for setting up a virtual environment and reproducing the experiments.
Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"
is a paper published in the journal SN Computer Science. You can find the paper here. You can find the corresponding complete experimental data (inputs as well as results) on KITopenData. We tagged the commits for reproducing these data:
- Use the tag `syn-pipeline-2021-03-26-paper-accept` to run `prepare_openml_datasets.py` and `syn_pipeline.py`.
- Use the tag `ms-pipeline-2021-03-26-paper-accept` to run `prepare_ms_dataset.py` and `ms_pipeline.py`.
- Use the tag `evaluation-2021-08-10-paper-accept` to run `syn_evaluation_journal.py` and `ms_evaluation_journal.py`.
Bach, Jakob. "Leveraging Constraints for User-Centric Feature Selection"
is a dissertation in progress. Once it is published, we will link it here as well. You can find the corresponding complete experimental data (inputs as well as results) on RADAR4KIT. We tagged the commits for reproducing these data:
- Use the tag `syn-pipeline-2021-03-26-dissertation` to run `prepare_openml_datasets.py` and `syn_pipeline.py` (same commit as for the journal version).
- Use the tag `ms-pipeline-2021-03-26-dissertation` to run `prepare_ms_dataset.py` and `ms_pipeline.py` (same commit as for the journal version).
- Use the tag `evaluation-2024-11-02-dissertation` to run `syn_evaluation_dissertation.py` and `ms_evaluation_dissertation.py`.
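For example, to reproduce the dissertation's evaluation, you can check out the corresponding tag before running the scripts (a minimal sketch using standard `git` commands from within a clone of this repository):

```
git fetch --tags                                 # make sure all tags are available locally
git checkout evaluation-2024-11-02-dissertation  # switch to the tagged commit (detached HEAD)
```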
On the top level, there are the following (non-code) files:

- `.gitignore`: For Python development.
- `LICENSE`: The software is MIT-licensed, so feel free to use the code.
- `README.md`: You are here 🙃
- `requirements.txt`: To set up an environment with all necessary dependencies; see below for details.
The folder `src` contains the code in multiple sub-directories:

- `cffs_package`: Code for SMT expressions (to formulate constraints), solving, and optimization. Organized as the standalone Python package `cffs` (i.e., it can be used without the remaining code). See the corresponding README for more information.
- `materials_science`: Code for our case study with manually defined constraints in materials science.
- `synthetic_constraints`: Code for our study with synthetically generated constraints on arbitrary datasets.
- `utilities`: Code for the experimental pipelines, like data I/O, computing feature qualities, and predicting.
Before running scripts to reproduce the experiments, you need to set up an environment with all necessary dependencies. Our code is implemented in Python (version 3.7).
If you use `conda`, you can install the correct Python version into a new `conda` environment and activate the environment as follows:

```
conda create --name <conda-env-name> python=3.7
conda activate <conda-env-name>
```
We used `virtualenv` (version 20.4.0) to create an environment for our experiments.
First, make sure you have the right Python version available.
Next, you can install `virtualenv` with

```
python -m pip install virtualenv==20.4.0
```

To set up an environment with `virtualenv`, run

```
python -m virtualenv -p <path/to/right/python/executable> <path/to/env/destination>
```

Activate the environment in Linux with

```
source <path/to/env/destination>/bin/activate
```

Activate the environment in Windows (note the back-slashes) with

```
<path\to\env\destination>\Scripts\activate
```

After activating the environment, you can use `python` and `pip` as usual.
To install all necessary dependencies for this repo, simply run

```
python -m pip install -r requirements.txt
```

If you make changes to the environment and want to persist them, run

```
python -m pip freeze > requirements.txt
```

To leave the environment, run

```
deactivate
```
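Putting these steps together, a complete Linux setup might look like this (the Python executable path and environment location are placeholders; adapt them to your system):

```
python -m pip install virtualenv==20.4.0                      # install virtualenv itself
python -m virtualenv -p /usr/bin/python3.7 ~/venvs/cffs-env   # create the environment
source ~/venvs/cffs-env/bin/activate                          # activate it
python -m pip install -r requirements.txt                     # install this repo's dependencies
```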
After setting up and activating an environment, you are ready to run the code. You can reproduce the results of both studies with the same three steps, i.e., by running three scripts each:
- **Prepare datasets:** Run the script `prepare_openml_datasets.py` or `prepare_ms_dataset.py` to prepare input data for the experimental pipeline. (If you use the experimental data linked above, you can skip this step for the materials-science dataset.) These scripts apply some pre-processing and then save feature data (`X`) and prediction target (`y`) as CSVs for each dataset. You can specify the output directory. We recommend `data/openml/` and `data/ms/` as output directories, so the following pipeline scripts work without specifying a directory. For the materials-science pre-processing, you need to provide the raw voxel dataset `voxel_data.csv` as an input. For the OpenML pre-processing, you need an internet connection, as the datasets are downloaded first. `prepare_demo_dataset.py` is a lightweight alternative to test the pipeline for synthetic constraints, as it just prepares one dataset, which is already part of `sklearn`.
- **Run experimental pipeline:** Run the script `syn_pipeline.py` or `ms_pipeline.py` to execute the experimental pipeline. These scripts save the results as one or more CSV file(s), with a merged results file available as `results.csv`. You can specify various options, e.g., the output directory, the number of cores, and the number of repetitions. We recommend using the default output directories `data/openml-results/` and `data/ms-results/`, so the following evaluation scripts work without specifying a directory.
- **Run evaluation:** Run the scripts `syn_evaluation_journal.py` and `ms_evaluation_journal.py` to create the paper's plots, or the scripts `syn_evaluation_dissertation.py` and `ms_evaluation_dissertation.py` to create the dissertation's plots. These scripts save the plots as PDFs. You can specify the input and output directory.
Execute all scripts from the `src` directory of this repo like this:

```
python -m synthetic_constraints.prepare_demo_dataset <options>
```

The `-m <module.import.syntax>` invocation makes sure imports of sub-packages work.
Note that you need to leave out the file ending `.py` in that call.
Passing `--help` as an option gives you an overview of each script's further options.
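For instance, a quick end-to-end test of the synthetic-constraints study might look like this (a sketch assuming the three scripts live in the `synthetic_constraints` sub-directory, an activated environment, and the default directories):

```
cd src
python -m synthetic_constraints.prepare_demo_dataset     # prepare one sklearn dataset as input
python -m synthetic_constraints.syn_pipeline             # run the experimental pipeline
python -m synthetic_constraints.syn_evaluation_journal   # create the paper's plots
```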