ContrastiveLosses

Implementations and example use cases of loss functions used in contrastive representation learning. The loss functions are found in ContrastiveLosses.py.

Implemented in TensorFlow, originally for use on genetic data, here shown for other applications as well.
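For readers new to the topic, the following is a generic sketch of one widely used contrastive loss, NT-Xent (the loss popularized by SimCLR). It is written in plain NumPy so that it runs without TensorFlow, and it is not taken from this repository: the function name, the argument names and the temperature value are assumptions made for illustration only.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Generic NT-Xent loss; z_a[i] and z_b[i] are two views of the same sample."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # shape (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot product = cosine similarity
    sim = (z @ z.T) / temperature                      # pairwise similarities, shape (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own positive or negative

    # Row i of the first half is paired with row i of the second half, and vice versa.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), positives].mean()


# Matching views should give a lower loss than mismatched views.
views = np.eye(4)
loss_matched = nt_xent_loss(views, views)
loss_mismatched = nt_xent_loss(views, np.roll(views, 1, axis=0))
print(loss_matched, loss_mismatched)
```

The loss is small when the two views of a sample are closer to each other than to any other sample in the batch, which is the property contrastive training tries to enforce.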

Computational Environment

This code has been developed and tested only on Linux and is not guaranteed to work on Windows.

Singularity:

Singularity/Apptainer is the recommended way to set up the computational environment correctly. All required packages are included in the Apptainer definition file image.def.

Build the Apptainer image:

$ sudo apptainer build image.sif image.def

Run the image. The --nv flag exposes the NVIDIA GPU to the container:

$ apptainer run --nv image.sif

The container has Python 3.10 installed, so inside it you may need to invoke that version explicitly. For the genetic data example below, run

$ python3.10 run_gcae.py train ...

Virtual environment specification

As an alternative, the file requirements.txt lists the required packages. Create a Python virtual environment (venv) and run pip install -r requirements.txt to install everything needed to run the code.

Example on Genetic data:

This section gives an example use on a data set consisting of dog genotypes, with results presented as a poster at PAG30.

Getting the data:

In general, the code supports data in PLINK format, so you can use your own PLINK data. The data used in the example below is described in this paper and can be obtained by running

$ wget ftp://ftp.nhgri.nih.gov/pub/outgoing/dog_genome/SNP/2017-parker-data/*

It contains SNP data on ~1300 dogs from 23 clades, with ~150k variants.

Place these files in the Data folder (the commands below use --datadir=Data/dog), then run the commands below to train the model and project the samples. The accompanying file dog_superpopulations corresponds to this specific dataset.
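Before training, it can be useful to check that the files are in place and have the expected size. The sketch below assumes a PLINK 1 binary fileset (.bed, .bim and .fam sharing one prefix); the helper names are hypothetical and not part of this repository. It counts samples and variants and checks the three-byte .bed header, demonstrated here on a toy fileset written to a temporary directory.

```python
import tempfile
from pathlib import Path

BED_MAGIC = b"\x6c\x1b\x01"  # PLINK 1 binary header, variant-major layout


def count_nonempty_lines(path):
    """Return the number of non-blank lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def summarize_plink_fileset(prefix):
    """Summarize the .bed/.bim/.fam fileset that shares the given path prefix."""
    prefix = str(prefix)
    with open(prefix + ".bed", "rb") as handle:
        bed_ok = handle.read(3) == BED_MAGIC
    return {
        "samples": count_nonempty_lines(prefix + ".fam"),   # one line per sample
        "variants": count_nonempty_lines(prefix + ".bim"),  # one line per variant
        "bed_header_ok": bed_ok,
    }


# Toy fileset with 2 samples and 3 variants, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_prefix = Path(tmp) / "toy"
    Path(f"{toy_prefix}.fam").write_text("f1 i1 0 0 0 -9\nf2 i2 0 0 0 -9\n")
    Path(f"{toy_prefix}.bim").write_text("1 rs1 0 100 A G\n1 rs2 0 200 C T\n1 rs3 0 300 A C\n")
    # 3 header bytes, then one byte per variant (2 samples fit in a single byte).
    Path(f"{toy_prefix}.bed").write_bytes(BED_MAGIC + b"\x00" * 3)
    toy_summary = summarize_plink_fileset(toy_prefix)

print(toy_summary)
```

For the dog data, the call would be something like summarize_plink_fileset("Data/dog/All_Pure_150k"). That prefix is inferred from the --datadir and --data arguments used in the commands in this README and is an assumption about how the downloaded files are named; the result should be roughly in line with the ~1300 dogs and ~150k variants stated above.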

Command line interface

The program run_gcae.py runs the examples of contrastive learning on genetic data.

To run it, the user needs to state, among other parameters, whether to train a new model or to project already saved model states. This project is a continuation of GenoCAE and shares essentially the same API; see that page for a more detailed usage guide.

For example, to train a model on the dog dataset, run the following. This is an example using all samples in training, with the first 10k SNPs, and a 2D embedding model:

$ python3 run_gcae.py train --trainedmodeldir=./test --datadir=Data/dog --model_id=CM_2D_test --data=All_Pure_150k --train_opts_id=ex3_CL --data_opts_id=d_0_4_dog_cont --save_interval=5 --epochs=100

To plot results for saved model states in a directory, run

$ python3 run_gcae.py project --trainedmodeldir=./test --datadir=Data/dog --model_id=CM_2D_test --data=All_Pure_150k --train_opts_id=ex3_CL --data_opts_id=d_0_4_dog_cont --superpops=Data/dog/dog_superpopulations
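The train and project commands above share most of their arguments, and the project step has to point at the same model and data settings as the training run. One way to keep the two in sync is to build both commands from a single set of arguments, as in the sketch below. It uses only the flags and values shown above; the helper name build_command is hypothetical and not part of run_gcae.py.

```python
import shlex

# Arguments common to the train and project commands shown above.
SHARED_ARGS = {
    "trainedmodeldir": "./test",
    "datadir": "Data/dog",
    "model_id": "CM_2D_test",
    "data": "All_Pure_150k",
    "train_opts_id": "ex3_CL",
    "data_opts_id": "d_0_4_dog_cont",
}


def build_command(mode, extra_args=None, python="python3"):
    """Return the argument list for one run_gcae.py invocation."""
    args = {**SHARED_ARGS, **(extra_args or {})}
    return [python, "run_gcae.py", mode] + [f"--{key}={value}" for key, value in args.items()]


train_cmd = build_command("train", {"save_interval": 5, "epochs": 100})
project_cmd = build_command("project", {"superpops": "Data/dog/dog_superpopulations"})

# Print the commands; they could instead be executed with
# subprocess.run(train_cmd, check=True) from the repository root.
print(shlex.join(train_cmd))
print(shlex.join(project_cmd))
```

Inside the Apptainer container, passing python="python3.10" gives the interpreter mentioned in the environment section.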

Settings in the manuscript Dimensionality Reduction of Genetic Data using Contrastive Learning

The above example model has a 2-dimensional output. The model Contrastive3D.json is the one used in the preprint, and has a normalized 3-dimensional output.
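As a small illustration of what a normalized 3-dimensional output can look like, the sketch below rescales each embedding to unit Euclidean length, so that all points lie on the unit sphere. That "normalized" means unit L2 norm is an assumption here, since the README does not spell it out, and the function is a NumPy illustration rather than the layer used in Contrastive3D.json.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit Euclidean norm (eps guards against division by zero)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


points = np.array([[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]])
unit_points = l2_normalize(points)
print(unit_points)                          # rows now lie on the unit sphere
print(np.linalg.norm(unit_points, axis=1))  # each norm is 1 up to rounding
```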

The dog and Human Origins experiments in the manuscript use the data_opts files d_0_4_dog_filtered.json and d_0_4_human.json, and the train_opts files ex3_CL_dog3D.json and ex3_CL_human3D.json, respectively.

For the data itself, we refer to the respective sources; the URLs to the datasets are available in the manuscript. The evaluation metrics used to evaluate the embeddings, and the code for the plots in the manuscript, are found in evaluation_scripts/embedding_evaluations.py. The t-SNE and UMAP embeddings are created with calls from the file evaluation_scripts/umap_and_tsne.py.

The two files contain some hardcoded file paths to data and label information and will not run as-is. They are uploaded mainly to show the UMAP and t-SNE calls that produced the embeddings, the implementations of the metrics presented in the manuscript, and how the plots were produced. Since the runtime for the larger datasets in the paper is relatively long, the embeddings used in the manuscript are also uploaded and can be found in evaluation_scripts/manuscript_embeddings.

We are open to collaboration. If you need help getting started with this on your own data, please do not hesitate to get in touch. There are many hyperparameters that may need to be tuned.

Some notes:

Depending on the hardware setup, some minor changes may be needed for the code to run. One possible issue is the GPU running out of memory. Reducing the batch size in the train_opts file is one fix; another is to use fewer variants, which is set in the data_opts file.

On my current hardware setup, I had to explicitly allocate more memory than TensorFlow did automatically. This can be done by adding the line

tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3500)])

3.5 GB of VRAM should be enough to run the example with the current settings. On my machine, one training epoch takes ~25 seconds on the full dataset, which is what the above example uses, but this depends heavily on the settings, in particular the batch size.
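The memory-limit line in the notes above refers to a variable gpus that is not defined there. A minimal sketch of how it might be used is given below, assuming gpus is the list returned by tf.config.list_physical_devices("GPU") and that the call runs before TensorFlow initializes the GPU. Where exactly this belongs in the code base is not stated in this README, so treat the placement as an assumption as well.

```python
import tensorflow as tf

MEMORY_LIMIT_MB = 3500  # value from the note above (in MB); adjust to your GPU

# Assumption: `gpus` is the list of physical GPUs visible to TensorFlow.
gpus = tf.config.list_physical_devices("GPU")

if gpus:
    try:
        # Create a single virtual device on the first GPU with a fixed memory budget.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=MEMORY_LIMIT_MB)],
        )
    except RuntimeError as err:
        # Raised if the GPU has already been initialized by earlier TensorFlow calls.
        print(f"Could not set GPU memory limit: {err}")
else:
    print("No GPU visible to TensorFlow; running on CPU.")
```

Because the limit can only be set before the GPU is first used, the snippet needs to run before any model or tensor is created.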

PAG 31

To see other related visualizations, check out this website.

This repo contains the code that produced the embeddings using contrastive learning, as shown at PAG31, poster number 676, and at ICQG7, poster number 92 (day 2). The poster presented at PAG 31 contained some minor errors. Most notably, Figure 3 shows PCA with 10 dimensions instead of 2.
