VAE for tabular data for dimension reduction #81
Hello @caimiao0714, Thank you for the kind words and your interest in the repo. :) In such a case, you should only specify the dimension of your data points (i.e. 5 in your case) in the model configuration (see input_dim below):
from pythae.pipelines import TrainingPipeline
from pythae.models import VAE, VAEConfig
from pythae.trainers import BaseTrainerConfig
import numpy as np
import torch
# dummy datasets
dl_dt = torch.randn(2001, 5)
my_training_config = BaseTrainerConfig(
output_dir='./',
num_epochs=5,
learning_rate=1e-3,
per_device_train_batch_size=200,
per_device_eval_batch_size=200,
train_dataloader_num_workers=2,
eval_dataloader_num_workers=2,
steps_saving=20,
optimizer_cls="AdamW",
optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
scheduler_cls="ReduceLROnPlateau",
scheduler_params={"patience": 5, "factor": 0.5}
)
# Set up the model configuration
my_vae_config = VAEConfig(
input_dim=(5,), ####### This is what changed from your code #######
latent_dim=10
)
# Build the model
my_vae_model = VAE(model_config=my_vae_config)
# Build the Pipeline
pipeline = TrainingPipeline(
training_config=my_training_config,
model=my_vae_model
)
dl_train_sample = dl_dt[0:1000,:].numpy()
dl_eval_sample = dl_dt[1001:2001,:].numpy()
# Launch the Pipeline
pipeline(
train_data=dl_train_sample, # must be torch.Tensor, np.array or torch datasets
eval_data=dl_eval_sample # must be torch.Tensor, np.array or torch datasets
)
PS: Do not hesitate to adapt the neural networks you use for the encoder and decoder to make them better suited for tabular data as well. I hope this helps! Best, Clément
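A side note on the slices in the snippet above: `0:1000` covers rows 0 to 999 and `1001:2001` covers rows 1001 to 2000, so row 1000 ends up in neither set. The following is a minimal sketch of a split that uses every row, assuming NumPy only; the 50/50 ratio and the names `data`, `train_data` and `eval_data` are illustrative choices, not taken from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the dummy dataset dl_dt (2001 rows, 5 columns).
data = rng.standard_normal((2001, 5)).astype(np.float32)

# Shuffle the row indices once, then cut them into two disjoint parts.
indices = rng.permutation(len(data))
n_train = len(data) // 2  # hypothetical 50/50 split

train_data = data[indices[:n_train]]
eval_data = data[indices[n_train:]]

print(train_data.shape, eval_data.shape)
```

The two arrays can then be passed as `train_data` and `eval_data` in the same way as `dl_train_sample` and `dl_eval_sample`.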
Hi Clément, Thank you! This helps a lot. One more question concerns data generation after fitting the model. I notice that the example in the official manual generates new data as pictures (
Thanks,
Hi @caimiao0714, As to the generation of synthetic data, it is indeed performed after training the model. For instance, assuming that you have trained the model as explained in the previous comment, you can generate new synthetic tabular data as follows:
from pythae.models import AutoModel
from pythae.samplers import NormalSampler
# reload the trained model from the folder where it was stored
trained_model = AutoModel.load_from_folder('VAE_training_2023-03-23_18-25-25/final_model').eval()
# Create the sampler
sampler = NormalSampler(trained_model)
# Launch the sample function
gen_samples = sampler.sample(
num_samples=100, # specify the number of samples you want to generate
return_gen=True # specify that you want the sampler to return the generated samples
)
print(gen_samples.shape)
As to generating disentangled data, did you mean this in the sense of #78 ? I hope this helps :) Best, Clément
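Conceptually, sampling from the prior of a trained VAE amounts to drawing latent vectors from a standard normal distribution and passing them through the decoder. The sketch below illustrates this idea with NumPy only. The `decode` function is a hypothetical stand-in (a fixed random linear map), not the trained pythae decoder, and the dimensions simply mirror `latent_dim=10` and `input_dim=(5,)` from the training example.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, input_dim = 10, 5

# Hypothetical stand-in for a trained decoder: a fixed random linear map.
W = rng.standard_normal((latent_dim, input_dim))
b = np.zeros(input_dim)

def decode(z):
    """Map latent vectors of shape (n, latent_dim) to rows of shape (n, input_dim)."""
    return z @ W + b

# Draw latent vectors from the standard normal prior, then decode them.
z = rng.standard_normal((100, latent_dim))
gen_samples = decode(z)

print(gen_samples.shape)  # (100, 5)
```

Each generated row has the same number of columns as the training data, so the output can be treated as a synthetic table.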
Hi Clément, Thanks for your help in generating samples. This is very useful! For generating disentangled data, I'm not sure I fully understand issue #78. Let me try to illustrate my point in a simpler way.
Problem setting. For the dummy dataset generated by
Why I chose disentanglement learning. The reason why I'm trying to apply disentanglement learning for the dataset
Problem with the current code. At this stage, hopefully, you could see the problem with
I hope that my question and problem are clear.
Thanks,
Hi @caimiao0714, Sorry for the late reply. From what I understand (tell me if I am wrong), you would like to use a different representation of the input data that can be used as input for your supervised model. If so, you can definitely do this using the models available in the library. You can for instance use the latent representations as inputs of your model:
import torch
from pythae.models import AutoModel
# Reload the trained model
trained_model = AutoModel.load_from_folder('path/to/model').eval()
# Get the embeddings
embeddings = trained_model.embed(torch.from_numpy(dl_train_sample))
In such a case, each row of
I hope this helps. Best, Clément
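As a rough illustration of using the embeddings as predictors in a supervised model, here is a minimal sketch of an ordinary least squares fit with NumPy. The `embeddings` array is a random placeholder for the output of `trained_model.embed(...)` (which would first need converting to a NumPy array, for example with `.detach().cpu().numpy()`), and the outcome `y` is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, latent_dim = 1000, 10

# Placeholder for the latent embeddings (one row per observation).
embeddings = rng.standard_normal((n, latent_dim))
# Hypothetical outcome variable of the supervised model.
y = rng.standard_normal(n)

# Design matrix: an intercept column followed by the latent coordinates.
X = np.column_stack([np.ones(n), embeddings])

# Ordinary least squares fit of y on the embeddings.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print(coef.shape)  # (11,): one intercept plus one coefficient per latent dimension
```

The coefficients then refer to latent coordinates rather than to the original variables, which matters when interpreting the fitted model.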
Hi @caimiao0714, I am happy to see that this is working. As to the relationship between the latent embeddings and the input data, I am not sure what you are expecting from this. The VAE model will embed the input data in the latent space using potentially highly non-linear functions and so I am not sure that you will be able to relate the latent embedding coordinates directly to those of the input data. Nonetheless, you can still try with models that specifically target the tasks of learning disentangled representations such as the
Best, Clément
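One simple, purely descriptive way to look at how latent coordinates relate to input variables is the matrix of Pearson correlations between each input column and each latent coordinate. The sketch below assumes NumPy only; `x` and `z` are synthetic placeholders (a random table and a non-linear transform of it), not the output of a trained model, and a linear correlation can miss the non-linear dependence mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic placeholders: x plays the role of the input table,
# z the role of the latent embeddings (a non-linear function of x).
x = rng.standard_normal((1000, 5))
A = rng.standard_normal((5, 10))
z = np.tanh(x @ A)

def cross_correlation(x, z):
    """Pearson correlation between every column of x and every column of z."""
    xs = (x - x.mean(axis=0)) / x.std(axis=0)
    zs = (z - z.mean(axis=0)) / z.std(axis=0)
    return xs.T @ zs / len(x)

corr = cross_correlation(x, z)
print(corr.shape)  # (5, 10): rows are input variables, columns are latent coordinates
```

Entries close to 1 or -1 indicate a latent coordinate that tracks an input variable almost linearly; entries near 0 do not rule out a non-linear relationship.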
Hi Clément,
Thanks for creating and maintaining this great repo. I'm a biostatistician working on environmental epidemiology (meaning that I'm new to machine learning and my questions may be naive), and I'm trying to tackle the high correlation issue with VAE or disentanglement learning.
My question is quite different (in my view) from the questions in the example code: the data in my field are tabular datasets with observations in the rows and variables in the columns (2D), while the example data and code in the repo are mostly images (3D). I'm wondering how I could set up the correct dataset format and input dimension for benchmark_VAE to work. Please see a small example dataset below.
My aim is to reduce the y-dimension of this dataset because the variables (SO4, NO3, NH4, OM, BC) are highly correlated, and putting them in one model will cause variance inflation. I wonder how I could set up the right benchmark_VAE code to achieve this aim. Currently my code looks like this:
But it reported the following error. I guess I did not set up the input datasets and input dimensions correctly. Any ideas would be appreciated.
Thanks,
Miao
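To make the variance inflation concern concrete, the variance inflation factor (VIF) of each variable can be read off the diagonal of the inverse correlation matrix. The sketch below assumes NumPy only and uses synthetic data: five columns that share a common component, as hypothetical stand-ins for correlated variables such as SO4, NO3, NH4, OM and BC. It does not use the dataset from this issue.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Five synthetic columns driven by one shared component plus noise,
# so that they are strongly correlated with each other.
shared = rng.standard_normal((n, 1))
X = shared + 0.3 * rng.standard_normal((n, 5))

# Correlation matrix between the columns.
corr = np.corrcoef(X, rowvar=False)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))

print(np.round(vif, 2))
```

The same computation applied to the latent embeddings would show whether the lower-dimensional representation is less collinear than the original columns.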