
Segmentation-Guided Contrastive Learning of Representations (SegCLR)

Segmentation-guided Contrastive Learning of Representations (SegCLR) is a method for learning rich embedding representations of cellular morphology and ultrastructure. For a full description of the method, see the blog post and the updated preprint. Embeddings for two large-scale cortical datasets, one from human temporal cortex (explore in Neuroglancer) and one from mouse visual cortex (explore), are publicly released on Google Cloud Storage. SegCLR training data, as well as code for training new embedding and classification models, are also provided below.

Code release and demo notebooks

The open-source release of the SegCLR code to the connectomics repo is complete, aside from minor updates and bugfixes. The API may still undergo some changes.

The following Colab notebooks demonstrate how to use the different code modules:

  • Run a pretrained SegCLR embedding model in TensorFlow 2 to predict embeddings for an arbitrary data cutout fetched via TensorStore. This notebook shows how to instantiate a SegCLR model, load weights from a pretrained model, and run inference (a minimal sketch of this flow follows the list). Pretrained SegCLR models can also be loaded from TensorFlow 1.
  • Train a SegCLR embedding model. This notebook shows how to read positive pair tf.train.Examples from a TFRecord table, load the corresponding EM and segmentation data blocks, preprocess and batch them, and use them to train a SegCLR embedding model. By default the demo notebook is set to connect to a Google Colab instance with an NVIDIA T4 GPU, but for large-scale training or fine-tuning a larger GPU cluster should be used.
  • Access precomputed SegCLR embeddings from public CSV ZIP releases for h01 (human cortex) and MICrONS (mouse cortex). This notebook shows how to read the data remotely and parse it. It also demonstrates how to run dimensionality reduction to inspect embedding clusters (similar to paper figure 4).
  • Run a pretrained SegCLR subcompartment classifier. This notebook shows how to load a pretrained subcompartment classifier model and run it on embeddings for a test cell (as in paper figure 2).
  • Train a cell type classifier with out-of-distribution (OOD) detection. This notebook shows how to load ground truth cell type labels for the mouse cortex dataset and train a lightweight cell type classifier on top of SegCLR embeddings from scratch (as in paper figure 3). In this demo, the classifier is trained on glial cell types, while the neuron types are held out for evaluation, so the classifier must learn to reject the OOD neuron types. We do this by training a classifier with calibrated uncertainty estimates via SNGP (SNGP paper; as in paper figure 5).
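
As a quick orientation before opening the notebooks, here is a minimal sketch of the inference flow from the first notebook: fetch an EM cutout with TensorStore, mask it with the segmentation, and run a pretrained embedding model. The volume paths, input layout, masking step, and model-loading call are all assumptions for illustration; the demo notebook has the exact settings.

```python
# Minimal sketch of SegCLR embedding inference, assuming the released model
# directory loads as a TF2 SavedModel and inputs are segmentation-masked EM
# cutouts in [batch, x, y, z, channel] layout. Paths and shapes here are
# illustrative, not authoritative; see the demo notebook for the real setup.
import numpy as np
import tensorflow as tf
import tensorstore as ts

def open_volume(path):
    return ts.open({
        'driver': 'neuroglancer_precomputed',
        'kvstore': path,
    }).result()

em = open_volume('gs://h01-release/data/20210601/4nm_raw')  # assumed EM path
seg = open_volume('gs://h01-release/data/20210601/c3/flat_segmentation')  # hypothetical path

# Read a small cutout (channel 0); for simplicity this assumes EM and
# segmentation share a voxel grid, which may not hold in practice.
box = np.s_[2048:2177, 2048:2177, 1024:1153, 0]
em_cut = em[box].read().result().astype(np.float32) / 255.0
seg_cut = seg[box].read().result()

# Mask the EM data to the segment at the cutout center (SegCLR inputs are
# segmentation-masked cutouts).
segment_id = seg_cut[64, 64, 64]
masked = np.where(seg_cut == segment_id, em_cut, 0.0)

# Assumed: the directory holds a callable TF2 SavedModel; the demo notebook
# shows the exact loading path if this differs.
model = tf.saved_model.load('gs://h01-release/data/20230118/models/segclr-355200/')
embedding = model(tf.constant(masked[np.newaxis, ..., np.newaxis]))
print(embedding.shape)
```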

Data manifest

All results and training data for the SegCLR paper are released on Google Cloud Storage:

  • For the H01 human temporal cortex dataset: gs://h01-release/data/20230118/ (console)
  • For the MICrONS mouse visual cortex dataset: gs://iarpa_microns/minnie/minnie65/ (console)

To access Cloud Storage, you can use the console links above, the gsutil command-line program, or the Colab demo notebooks for programmatic access from Python (a small listing example follows).
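
For quick programmatic browsing, any GCS client works; the snippet below uses TensorFlow's gfile module, which understands gs:// URLs directly.

```python
# List the contents of the H01 release directory from Python.
import tensorflow as tf

for name in tf.io.gfile.listdir('gs://h01-release/data/20230118/'):
    print(name)
```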

Precomputed embeddings

The precomputed SegCLR embeddings are provided in CSV format, stored as sharded ZIP archives. See the demo notebook for examples of how to read the data from Python; a minimal reading sketch also follows the list below. The data are available here:

  • H01:
    • Unaggregated embeddings: gs://h01-release/data/20230118/c3/embeddings/segclr_nm_coord_csvzips/
    • Aggregated 10 um: gs://h01-release/data/20230118/c3/embeddings/segclr_nm_coord_aggregated_10um_csvzips
    • Aggregated 25 um: gs://h01-release/data/20230118/c3/embeddings/segclr_nm_coord_aggregated_25um_csvzips
  • MICrONS:
    • Unaggregated embeddings: gs://iarpa_microns/minnie/minnie65/embeddings_m343/segclr_nm_coord_public_offset_csvzips/
    • Aggregated 10 um: gs://iarpa_microns/minnie/minnie65/embeddings_m343/segclr_nm_coord_public_offset_aggregated_10um_csvzips/
    • Aggregated 25 um: gs://iarpa_microns/minnie/minnie65/embeddings_m343/segclr_nm_coord_public_offset_aggregated_25um_csvzips/
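
The sketch below reads one shard from the H01 release. The CSV column layout (segment ID, then x, y, z coordinates, then the embedding vector) is an assumption for illustration; the demo notebook documents the actual format.

```python
# Minimal sketch: read one embedding shard (a ZIP of CSV files) from GCS.
# The assumed row layout is: segment id, x, y, z, then embedding values.
import csv
import io
import zipfile

import tensorflow as tf

shards = tf.io.gfile.glob(
    'gs://h01-release/data/20230118/c3/embeddings/segclr_nm_coord_csvzips/*.zip')

with tf.io.gfile.GFile(shards[0], 'rb') as f:
    data = f.read()  # individual shards are small enough to hold in memory

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    member = zf.namelist()[0]
    for row in csv.reader(io.TextIOWrapper(zf.open(member))):
        seg_id, x, y, z = row[:4]
        embedding = [float(v) for v in row[4:]]
        print(seg_id, (x, y, z), len(embedding))
        break
```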

Pretrained embedding models

See the demo notebook for examples of how to load pretrained embedding models and run them on arbitrary data cutouts. The model checkpoints are stored at:

  • H01: gs://h01-release/data/20230118/models/segclr-355200/
  • MICrONS: gs://iarpa_microns/minnie/minnie65/embeddings/models/segclr-216000/

Pretrained model checkpoints can also be used as a starting point for fine-tuning an embedding model on new data. This can be faster and less resource intensive than training an embedding model from scratch (paper supplemental figure 2); a restore sketch follows.
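
A minimal sketch of restoring a released checkpoint as a fine-tuning starting point, assuming standard TF2 checkpoints. Building the model itself requires the connectomics code; `build_segclr_model` below is a hypothetical stand-in for that constructor.

```python
# Restore a released SegCLR checkpoint before fine-tuning.
import tensorflow as tf

model = build_segclr_model()  # hypothetical; see the embedding-model demo notebook
ckpt = tf.train.Checkpoint(model=model)
ckpt_path = tf.train.latest_checkpoint(
    'gs://h01-release/data/20230118/models/segclr-355200/')
ckpt.restore(ckpt_path).expect_partial()

# Fine-tune with a reduced learning rate relative to from-scratch training.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```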

Embedding model training data

See the demo notebook for examples of how to run a training loop.
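
To make the structure of that loop concrete, here is an illustrative SimCLR-style contrastive (NT-Xent) training step over positive pairs. This is a generic formulation of the objective family SegCLR belongs to, not the exact loss code from the connectomics repo.

```python
# Illustrative contrastive training step over positive pairs: each batch
# element in z_a is paired with the same-index element in z_b; all other
# in-batch elements serve as negatives.
import tensorflow as tf

def ntxent_loss(z_a, z_b, temperature=0.1):
    """z_a, z_b: [batch, dim] embeddings of the two halves of each positive pair."""
    z_a = tf.math.l2_normalize(z_a, axis=1)
    z_b = tf.math.l2_normalize(z_b, axis=1)
    logits = tf.matmul(z_a, z_b, transpose_b=True) / temperature  # [batch, batch]
    labels = tf.range(tf.shape(z_a)[0])  # positives sit on the diagonal
    loss_a = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    loss_b = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.transpose(logits), from_logits=True)
    return tf.reduce_mean(loss_a + loss_b)

@tf.function
def train_step(model, optimizer, batch_a, batch_b):
    with tf.GradientTape() as tape:
        loss = ntxent_loss(model(batch_a, training=True),
                           model(batch_b, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```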

The SegCLR embedding models were trained using the publicly released H01 and MICrONS EM datasets and segmentations.

Additionally, the training pipeline uses an input table of positive pair coordinates derived from skeletonized segmentations (a reading sketch follows this list):

  • H01: gs://h01-release/data/20230118/training_data/c3_positive_pairs/goog14c3_max200000_skip50.tfrecord-*-of-01000
  • MICrONS:
    • gs://iarpa_microns/minnie/minnie65/embeddings/training_data/positive_pairs/minnie65_v117_skip200.tfrecord-*-of-01000
    • gs://iarpa_microns/minnie/minnie65/embeddings_m343/training_data/positive_pairs/minnie65_v343_skip200.tfrecord-*-of-01000
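
The positive-pair tables can be streamed directly with tf.data. The sketch below dumps one record to reveal the schema rather than assuming feature names; the training demo notebook shows the full parsing and batching pipeline.

```python
# Stream positive-pair records from the public H01 TFRecord table.
import tensorflow as tf

files = tf.io.gfile.glob(
    'gs://h01-release/data/20230118/training_data/c3_positive_pairs/'
    'goog14c3_max200000_skip50.tfrecord-*-of-01000')
dataset = tf.data.TFRecordDataset(files)

for raw in dataset.take(1):
    example = tf.train.Example.FromString(raw.numpy())
    print(example)  # inspect the features to discover the actual field names
```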

Classification model training data and pretrained models

See the demo notebooks for how to load and run pretrained classifiers or train new classifiers on top of SegCLR embeddings.
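
A minimal sketch of the "train a new classifier" case: a lightweight linear head on top of precomputed embeddings. It assumes embeddings and integer labels are already loaded as NumPy arrays (see the CSV ZIP sketch above); the 64-dimensional embedding size and 4-class output are assumptions, and the paper's OOD-aware variant additionally uses SNGP, which this sketch omits.

```python
# Train a lightweight linear classifier on top of SegCLR embeddings.
import numpy as np
import tensorflow as tf

embedding_dim, num_classes = 64, 4  # assumed sizes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(embedding_dim,)),
    tf.keras.layers.Dense(num_classes),
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# Placeholder data; substitute real embeddings [N, 64] and labels [N].
embeddings = np.random.randn(1024, embedding_dim).astype(np.float32)
labels = np.random.randint(0, num_classes, size=1024)
model.fit(embeddings, labels, epochs=5, batch_size=128)
```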

The ground truth data used to train subcompartment or cell type classification models in the paper are provided here:

  • H01: gs://h01-release/data/20230118/training_data/
  • MICrONS: gs://iarpa_microns/minnie/minnie65/embedding_classification/training_data/

For H01 training, some of the ground truth was collected on earlier cutouts of the dataset, prior to the final full aligned dataset and c3 segmentation. These earlier cutouts are provided here:

  • gs://h01-release/data/20230118/training_data/phase1/ (Neuroglancer)
  • gs://h01-release/data/20230118/training_data/roi215/ (Neuroglancer)
  • gs://h01-release/data/20230118/training_data/roi466/ (Neuroglancer)

We also provide the following pretrained classification models:

  • H01:
    • Subcompartment classification trained on unaggregated embeddings: gs://h01-release/data/20230118/models/subcompartment_4class_0um_linear_20230127/
    • (additional model links to be added)
  • MICrONS:
    • Subcompartment classification trained on unaggregated embeddings: gs://iarpa_microns/minnie/minnie65/embedding_classification/models/subcompartment_0um_BERT_SNGP_20220819/
    • Subcompartment classification trained on 10 um aggregated embeddings: gs://iarpa_microns/minnie/minnie65/embedding_classification/models/subcompartment_10um_BERT_SNGP_20220819/