
FEDOT warm start experiment #6

Merged: 34 commits, May 21, 2024

Commits
2653757
add DatasetModelsFitnessScaler
MorrisNein Dec 13, 2023
87ab25d
simplify imports
MorrisNein Mar 26, 2024
6d3a30d
typing & other fixes
MorrisNein Dec 13, 2023
ad1e1fa
update DatasetModelsFitnessScaler to support different dataset types
MorrisNein Dec 13, 2023
176ce1f
fix typing
MorrisNein Dec 14, 2023
7aa5db8
fix typing [2]
MorrisNein Dec 14, 2023
b6605ac
add MetaLearningApproach
MorrisNein Oct 22, 2023
b72719f
add fedot_history_loader.py
MorrisNein Dec 13, 2023
0d814e0
add KNNSimilarityModelAdvice
MorrisNein Dec 13, 2023
1cf3a10
minor fixes
MorrisNein Dec 13, 2023
c938452
create Dockerfile and .dockerignore
MorrisNein Apr 20, 2023
0467259
create the experiment script & config
MorrisNein Jul 20, 2023
cac2ba3
adapt to #39
MorrisNein Jul 27, 2023
1e1b08c
add config for debugging
MorrisNein Jul 28, 2023
ff6852a
remove data leak
MorrisNein Oct 12, 2023
cf1190a
persist train/test datasets split
MorrisNein Oct 12, 2023
50379f8
add final choices to the best models
MorrisNein Oct 12, 2023
4f3d0d8
fix pipeline evaluation, compute fitness on test data;
MorrisNein Oct 22, 2023
bf8aac6
set TMPDIR from script
MorrisNein Nov 3, 2023
877be96
set logging level of FEDOT
MorrisNein Nov 7, 2023
fa48660
create config_light.yaml
MorrisNein Nov 10, 2023
28506e6
fix train/test split
MorrisNein Nov 13, 2023
071574b
add evaluation caching
MorrisNein Nov 15, 2023
2b9b863
split config file
MorrisNein Nov 21, 2023
8824679
increase debug fedot timeout
MorrisNein Nov 21, 2023
82eb33c
minor fixes
MorrisNein Nov 16, 2023
61641be
various experiment improvements & fixes
MorrisNein Dec 13, 2023
3a09a4d
add cache for AutoML repetitions
MorrisNein Dec 14, 2023
238483f
adjust configs to advise 3 initial assumptions; add prefix for config…
MorrisNein Dec 14, 2023
da6168b
fix after rebase
MorrisNein Dec 14, 2023
71dac64
fix after rebase
MorrisNein Mar 26, 2024
176c71d
some experiment fixes
MorrisNein Apr 1, 2024
26c0a52
experiment stability update
MorrisNein May 20, 2024
295c7e2
Builds fix (#98)
MorrisNein May 21, 2024
13 changes: 13 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Config & info files
.pep8speaks.yml
Dockerfile
LICENSE
README.md

# Unnecessary files
examples
notebooks
test

# User data
data/cache
30 changes: 30 additions & 0 deletions Dockerfile
@@ -0,0 +1,30 @@
# Download base image ubuntu 20.04
FROM ubuntu:20.04

# For apt to be noninteractive
ENV DEBIAN_FRONTEND noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN true

# Preseed tzdata, update package index, upgrade packages and install needed software
RUN truncate -s0 /tmp/preseed.cfg; \
echo "tzdata tzdata/Areas select Europe" >> /tmp/preseed.cfg; \
echo "tzdata tzdata/Zones/Europe select Berlin" >> /tmp/preseed.cfg; \
debconf-set-selections /tmp/preseed.cfg && \
rm -f /etc/timezone /etc/localtime && \
apt-get update && \
apt-get install -y nano && \
apt-get install -y mc && \
apt-get install -y python3.9 python3-pip && \
apt-get install -y git && \
rm -rf /var/lib/apt/lists/*

# Set the workdir
ENV WORKDIR /home/meta-automl-research
WORKDIR $WORKDIR
COPY . $WORKDIR

RUN pip3 install pip && \
pip install wheel && \
pip install --trusted-host pypi.python.org -r ${WORKDIR}/requirements.txt

ENV PYTHONPATH $WORKDIR
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@
[![package](https://badge.fury.io/py/gamlet.svg)](https://badge.fury.io/py/gamlet)
[![Build](https://github.com/ITMO-NSS-team/MetaFEDOT/actions/workflows/build.yml/badge.svg)](https://github.com/ITMO-NSS-team/MetaFEDOT/actions/workflows/build.yml)
[![Documentation Status](https://readthedocs.org/projects/gamlet/badge/?version=latest)](https://gamlet.readthedocs.io/en/latest/?badge=latest)
[![codecov](https://codecov.io/gh/ITMO-NSS-team/GAMLET/graph/badge.svg?token=N3Z9YTPHP9)](https://codecov.io/gh/ITMO-NSS-team/GAMLET)
<!-- [![codecov](https://codecov.io/gh/ITMO-NSS-team/GAMLET/graph/badge.svg?token=N3Z9YTPHP9)](https://codecov.io/gh/ITMO-NSS-team/GAMLET) -->
[![Visitors](https://api.visitorbadge.io/api/visitors?path=https%3A%2F%2Fgithub.com%2FITMO-NSS-team%2FMetaFEDOT&countColor=%23263759&style=plastic&labelStyle=lower)](https://visitorbadge.io/status?path=https%3A%2F%2Fgithub.com%2FITMO-NSS-team%2FMetaFEDOT)

GAMLET (previously known as MetaFEDOT) is an open platform for sharing meta-learning experiences in **AutoML** and more
Empty file added experiments/__init__.py
Empty file.
Empty file.
20 changes: 20 additions & 0 deletions experiments/fedot_warm_start/configs/config.yaml
@@ -0,0 +1,20 @@
---
seed: 42
tmpdir: '/var/essdata/tmp'
update_train_test_datasets_split: true

#data_settings:
n_datasets: null # null for all available datasets
test_size: 0.25
train_timeout: 15
test_timeout: 15
n_automl_repetitions: 10
#meta_learning_params:
n_best_dataset_models_to_memorize: 10
mf_extractor_params:
groups: general
assessor_params:
n_neighbors: 5
advisor_params:
minimal_distance: 1
n_best_to_advise: 3
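The commented markers (`#data_settings`, `#meta_learning_params`) suggest the config is a flat YAML mapping with logical groupings. A minimal sketch of how such settings could be held in a typed container; the class and its defaults mirror the keys shown above but are illustrative, not taken from the PR's `run.py`:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExperimentConfig:
    """Hypothetical container mirroring config.yaml (not from the PR)."""
    seed: int = 42
    n_datasets: Optional[int] = None  # None -> all available datasets
    test_size: float = 0.25
    train_timeout: int = 15
    test_timeout: int = 15
    n_automl_repetitions: int = 10
    n_best_dataset_models_to_memorize: int = 10
    assessor_params: dict = field(default_factory=lambda: {'n_neighbors': 5})
    advisor_params: dict = field(
        default_factory=lambda: {'minimal_distance': 1, 'n_best_to_advise': 3})

    @classmethod
    def from_dict(cls, raw: dict) -> 'ExperimentConfig':
        # Keep only keys the dataclass declares; ignore extras such as tmpdir.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in raw.items() if k in known})
```

Unknown keys are dropped rather than raising, which keeps the loader tolerant of config variants like `config_light.yaml`.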
21 changes: 21 additions & 0 deletions experiments/fedot_warm_start/configs/config_debug.yaml
@@ -0,0 +1,21 @@
---
seed: 42
save_dir_prefix: debug_
update_train_test_datasets_split: true
#data_settings:
n_datasets: 10 # null for all available datasets
test_size: 0.4
train_timeout: 15
test_timeout: 15
n_automl_repetitions: 1
#meta_learning_params:
n_best_dataset_models_to_memorize: 10
mf_extractor_params:
# groups: general
features:
- nr_inst
assessor_params:
n_neighbors: 2
advisor_params:
minimal_distance: 1
n_best_to_advise: 3
19 changes: 19 additions & 0 deletions experiments/fedot_warm_start/configs/config_light.yaml
@@ -0,0 +1,19 @@
---
seed: 42
tmpdir: '/var/essdata/tmp'
save_dir_prefix: light_
#data_settings:
n_datasets: 16 # null for all available datasets
test_size: 0.25
train_timeout: 15
test_timeout: 15
n_automl_repetitions: 10
#meta_learning_params:
n_best_dataset_models_to_memorize: 10
mf_extractor_params:
groups: general
assessor_params:
n_neighbors: 5
advisor_params:
minimal_distance: 1
n_best_to_advise: 3
10 changes: 10 additions & 0 deletions experiments/fedot_warm_start/configs/evaluation_config.yaml
@@ -0,0 +1,10 @@
split_seed: 0
collect_metrics:
- f1
- roc_auc
- accuracy
- neg_log_loss
- precision
baseline_model: 'catboost'
data_test_size: 0.25
data_split_seed: 0
7 changes: 7 additions & 0 deletions experiments/fedot_warm_start/configs/fedot_config.yaml
@@ -0,0 +1,7 @@
fedot_params:
problem: classification
logging_level: 10
n_jobs: 1
show_progress: false
cache_dir: '/var/essdata/tmp/fedot_cache'
use_auto_preprocessing: true
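The `fedot_params` mapping is presumably splatted straight into the FEDOT constructor. A sketch of that pattern using a stub class (standing in for the real `Fedot` API, so it runs without the library installed):

```python
class FedotStub:
    """Illustrative stand-in for fedot's Fedot class (not the real API)."""

    def __init__(self, problem, logging_level=20, n_jobs=-1,
                 show_progress=True, cache_dir=None,
                 use_auto_preprocessing=False):
        self.params = dict(problem=problem, logging_level=logging_level,
                           n_jobs=n_jobs, show_progress=show_progress,
                           cache_dir=cache_dir,
                           use_auto_preprocessing=use_auto_preprocessing)


# Mapping loaded from fedot_config.yaml; level 10 is DEBUG on the
# stdlib logging scale.
fedot_params = {
    'problem': 'classification',
    'logging_level': 10,
    'n_jobs': 1,
    'show_progress': False,
    'cache_dir': '/var/essdata/tmp/fedot_cache',
    'use_auto_preprocessing': True,
}
model = FedotStub(**fedot_params)
```

Keeping the constructor arguments in YAML means the experiment script never hard-codes AutoML settings.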
3 changes: 3 additions & 0 deletions experiments/fedot_warm_start/configs/use_configs.yaml
@@ -0,0 +1,3 @@
- config.yaml
- evaluation_config.yaml
- fedot_config.yaml
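`use_configs.yaml` lists several config files that are presumably combined into one settings mapping. The PR's actual merge strategy is not visible in this diff; a minimal shallow, later-wins merge might look like:

```python
def merge_configs(*configs: dict) -> dict:
    """Shallow merge of config mappings; later files override earlier ones.

    Note: nested sections such as advisor_params are replaced wholesale,
    not deep-merged -- an assumption, not the PR's documented behaviour.
    """
    merged: dict = {}
    for cfg in configs:
        merged.update(cfg)
    return merged
```

With this scheme, `config_debug.yaml` could override only `n_datasets` and timeouts while inheriting everything else from `config.yaml`.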
670 changes: 670 additions & 0 deletions experiments/fedot_warm_start/run.py

Large diffs are not rendered by default.

73 changes: 73 additions & 0 deletions experiments/fedot_warm_start/train_test_datasets_split.csv
@@ -0,0 +1,73 @@
dataset_id,dataset_name,category,is_train,NumberOfInstances,NumberOfFeatures,NumberOfClasses
1063,kc2,small_small_binary,1,small,small,binary
40927,CIFAR_10,big_big_big,1,big,big,big
1480,ilpd,small_small_binary,1,small,small,binary
54,vehicle,small_small_small,1,small,small,small
40978,Internet-Advertisements,big_big_binary,1,big,big,binary
1464,blood-transfusion-service-center,small_small_binary,1,small,small,binary
300,isolet,big_big_big,1,big,big,big
18,mfeat-morphological,small_small_big,1,small,small,big
23381,dresses-sales,small_small_binary,1,small,small,binary
46,splice,big_big_small,1,big,big,small
1461,bank-marketing,big_small_binary,1,big,small,binary
40966,MiceProtein,small_big_small,1,small,big,small
40983,wilt,big_small_binary,1,big,small,binary
469,analcatdata_dmft,small_small_small,1,small,small,small
1053,jm1,big_small_binary,1,big,small,binary
40499,texture,big_big_big,1,big,big,big
40701,churn,big_small_binary,1,big,small,binary
12,mfeat-factors,small_big_big,1,small,big,big
1486,nomao,big_big_binary,1,big,big,binary
40982,steel-plates-fault,small_small_small,1,small,small,small
1050,pc3,small_big_binary,1,small,big,binary
307,vowel,small_small_big,1,small,small,big
1475,first-order-theorem-proving,big_big_small,1,big,big,small
1049,pc4,small_big_binary,1,small,big,binary
23517,numerai28.6,big_small_binary,1,big,small,binary
1468,cnae-9,small_big_big,1,small,big,big
40984,segment,big_small_small,1,big,small,small
151,electricity,big_small_binary,1,big,small,binary
29,credit-approval,small_small_binary,1,small,small,binary
188,eucalyptus,small_small_small,1,small,small,small
40668,connect-4,big_big_small,1,big,big,small
1478,har,big_big_small,1,big,big,small
22,mfeat-zernike,small_big_big,1,small,big,big
1067,kc1,small_small_binary,1,small,small,binary
1487,ozone-level-8hr,big_big_binary,1,big,big,binary
6332,cylinder-bands,small_big_binary,1,small,big,binary
1497,wall-robot-navigation,big_small_small,1,big,small,small
1590,adult,big_small_binary,1,big,small,binary
16,mfeat-karhunen,small_big_big,1,small,big,big
1068,pc1,small_small_binary,1,small,small,binary
3,kr-vs-kp,big_big_binary,1,big,big,binary
28,optdigits,big_big_big,1,big,big,big
40996,Fashion-MNIST,big_big_big,1,big,big,big
1462,banknote-authentication,small_small_binary,1,small,small,binary
458,analcatdata_authorship,small_big_small,1,small,big,small
6,letter,big_small_big,1,big,small,big
40670,dna,big_big_small,1,big,big,small
1510,wdbc,small_big_binary,1,small,big,binary
40975,car,small_small_small,1,small,small,small
4134,Bioresponse,big_big_binary,1,big,big,binary
37,diabetes,small_small_binary,1,small,small,binary
44,spambase,big_big_binary,1,big,big,binary
15,breast-w,small_small_binary,1,small,small,binary
1501,semeion,small_big_big,1,small,big,big
40994,climate-model-simulation-crashes,small_small_binary,0,small,small,binary
4538,GesturePhaseSegmentationProcessed,big_big_small,0,big,big,small
14,mfeat-fourier,small_big_big,0,small,big,big
1485,madelon,big_big_binary,0,big,big,binary
11,balance-scale,small_small_small,0,small,small,small
23,cmc,small_small_small,0,small,small,small
554,mnist_784,big_big_big,0,big,big,big
4534,PhishingWebsites,big_big_binary,0,big,big,binary
38,sick,big_small_binary,0,big,small,binary
1494,qsar-biodeg,small_big_binary,0,small,big,binary
50,tic-tac-toe,small_small_binary,0,small,small,binary
40979,mfeat-pixel,small_big_big,0,small,big,big
1489,phoneme,big_small_binary,0,big,small,binary
31,credit-g,small_small_binary,0,small,small,binary
32,pendigits,big_small_big,0,big,small,big
41027,jungle_chess_2pcs_raw_endgame_complete,big_small_small,0,big,small,small
182,satimage,big_big_small,0,big,big,small
40923,Devnagari-Script,big_big_big,0,big,big,big
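The persisted split file can be consumed with the stdlib `csv` module; a sketch that partitions dataset ids by the `is_train` flag (column names taken from the header above):

```python
import csv
import io
from typing import List, Tuple


def load_split(csv_text: str) -> Tuple[List[int], List[int]]:
    """Partition dataset ids by the is_train flag ('1' -> train, '0' -> test)."""
    train: List[int] = []
    test: List[int] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = train if row['is_train'] == '1' else test
        target.append(int(row['dataset_id']))
    return train, test
```

Persisting this file (see the "persist train/test datasets split" commit) keeps the train/test partition identical across experiment reruns.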
11 changes: 6 additions & 5 deletions gamlet/approaches/knn_similarity_model_advice.py
@@ -4,6 +4,7 @@
from typing import Callable, List, Optional, Sequence

from golem.core.optimisers.opt_history_objects.opt_history import OptHistory
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

from gamlet.approaches import MetaLearningApproach
@@ -55,7 +56,7 @@ class Components:
class Data:
meta_features: DatasetMetaFeatures = None
datasets: List[OpenMLDataset] = None
datasets_data: List[OpenMLDataset] = None
datasets_data: List[TabularData] = None
dataset_ids: List[DatasetIDType] = None
best_models: List[List[EvaluatedModel]] = None

@@ -66,11 +67,11 @@ def fit(self,
data = self.data
params = self.parameters

data.datasets_data = list(datasets_data)
data.datasets = [d.dataset for d in datasets_data]
data.dataset_ids = [d.id for d in datasets_data]
data.meta_features = self.extract_train_meta_features(datasets_data)
data.dataset_ids = list(data.meta_features.index)
data.datasets_data = [d_d for d_d in datasets_data if d_d.id in data.dataset_ids]
data.datasets = [d_d.dataset for d_d in data.datasets_data]

data.meta_features = self.extract_train_meta_features(data.datasets_data)
self.fit_datasets_similarity_assessor(data.meta_features, data.dataset_ids)

data.best_models = self.load_models(data.datasets, histories, params.n_best_dataset_models_to_memorize,
7 changes: 5 additions & 2 deletions gamlet/components/meta_features_extractors/pymfe_extractor.py
@@ -32,8 +32,11 @@ def extract(self, data_sequence: Sequence[Union[DatasetBase, TabularData]],
for i, dataset_data in enumerate(tqdm(data_sequence, desc='Extracting meta features of the datasets')):
if isinstance(dataset_data, DatasetBase):
dataset_data = dataset_data.get_data()
meta_features = self._extract_single(dataset_data, fill_input_nans, fit_kwargs, extract_kwargs)
accumulated_meta_features.append(meta_features)
try:
meta_features = self._extract_single(dataset_data, fill_input_nans, fit_kwargs, extract_kwargs)
accumulated_meta_features.append(meta_features)
except Exception:
logger.exception(f'Dataset {dataset_data.dataset}: error while extracting meta-features.')

output = DatasetMetaFeatures(pd.concat(accumulated_meta_features), is_summarized=self.summarize_features,
features=self.features)
33 changes: 16 additions & 17 deletions gamlet/data_preparation/datasets_train_test_split.py
@@ -29,24 +29,23 @@ def openml_datasets_train_test_split(dataset_ids: List[OpenMLDatasetIDType], tes
single_value_categories = cat_counts[cat_counts == 1].index
idx = df_split_categories[df_split_categories['category'].isin(single_value_categories)].index
df_split_categories.loc[idx, 'category'] = 'single_value'
df_datasets_to_split = df_split_categories[df_split_categories['category'] != 'single_value']
df_test_only_datasets = df_split_categories[df_split_categories['category'] == 'single_value']
if not df_datasets_to_split.empty:
df_train_datasets, df_test_datasets = train_test_split(
df_datasets_to_split,
test_size=test_size,
shuffle=True,
stratify=df_datasets_to_split['category'],
random_state=seed
)
df_test_datasets = pd.concat([df_test_datasets, df_test_only_datasets])
single_value_datasets = df_split_categories[df_split_categories['category'] == 'single_value']
if len(single_value_datasets) >= 1:
df_datasets_to_split = df_split_categories
additional_datasets = pd.DataFrame([])
else:
df_datasets_to_split = df_split_categories[df_split_categories['category'] != 'single_value']
additional_datasets = single_value_datasets

df_train_datasets, df_test_datasets = train_test_split(
df_datasets_to_split,
test_size=test_size,
shuffle=True,
stratify=df_datasets_to_split['category'],
random_state=seed
)
df_train_datasets = pd.concat([df_train_datasets, additional_datasets])

df_train_datasets['is_train'] = 1
df_test_datasets['is_train'] = 0
df_split_datasets = pd.concat([df_train_datasets, df_test_datasets]).join(
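The hunk above deals with categories that occur only once, which a stratified splitter cannot handle. A dependency-free sketch of the underlying idea, separating single-member categories before splitting (the PR itself uses pandas and sklearn's `train_test_split`):

```python
from collections import Counter
from typing import List, Sequence, Tuple


def separate_rare_categories(items: Sequence, categories: Sequence[str]) -> Tuple[List, List]:
    """Split items into (splittable, rare): 'rare' items belong to a
    category seen only once, so a stratified split never receives a
    single-member class. Sketch of the diff's idea, not its exact code."""
    counts = Counter(categories)
    splittable: List = []
    rare: List = []
    for item, cat in zip(items, categories):
        target = splittable if counts[cat] > 1 else rare
        target.append(item)
    return splittable, rare
```

The rare items can then be appended to one side of the split after stratification, as the patched function does with `additional_datasets`.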
2 changes: 2 additions & 0 deletions requirements.txt
@@ -22,3 +22,5 @@ pytest>=7.4.0
scikit-learn>=1.0.0
scipy>=1.7.3
tqdm>=4.65.0
loguru
pecapiku @ git+https://github.com/MorrisNein/pecapiku
7 changes: 0 additions & 7 deletions tests/unit/surrogate/test_surrogate_model.py
@@ -29,10 +29,3 @@ def get_test_data():
x_pipe = torch.load(path / 'data_pipe_test.pt')
x_dset = torch.load(path / 'data_dset_test.pt')
return x_pipe, x_dset


def test_model_output(read_config):
x_pipe, x_dset = get_test_data()
model = create_model_from_config(read_config, x_pipe, x_dset)
pred = torch.squeeze(model.forward(x_pipe, x_dset))
assert pred.shape[0] == 256