Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize cggnn workflow #248

Merged
merged 24 commits into from
Nov 17, 2023
Merged

Parallelize cggnn workflow #248

merged 24 commits into from
Nov 17, 2023

Conversation

CarlinLiao
Copy link
Collaborator

These changes finalize the transfer of responsibilities from the cg-gnn pip package to SPT and refactors the Nextflow cggnn workflow to create graphs in parallel from every specimen, in the process giving Nextflow more control and insight into the cggnn run process. In addition, the cggnn workflow now also saves multiple model points, the graphs created, and other metadata that could be reused.

@CarlinLiao CarlinLiao added the refactor Code change preserving functionality label Nov 16, 2023
@CarlinLiao CarlinLiao self-assigned this Nov 16, 2023
@CarlinLiao
Copy link
Collaborator Author

CarlinLiao commented Nov 16, 2023

Will be converted to a mergeable PR once I can get home and push that commit I left on my computer. That's needed for all tests to pass since bokeh was moved from a pip cg-gnn prerequisite to actually being used in SPT itself.

That said, this PR in its current state has a working cggnn workflow. However, in order to properly evaluate it, you need to make a slight change to the cggnn_environment.yml. Instead of allowing it to install spatialprofilingtoolbox[all] from pip, comment that line out and install a local build created from this branch. Once that's done, this is what the output will look like if you use it with the CyTOF dataset on RDS:

$ spt workflow configure --local --workflow='cggnn' --study-name='Melanoma CyTOF ICI' --database-config-file='../../spt_db.config' --workflow-config-file='workflow.config'
$ ls
configure.sh  main.nf  nextflow.config  run.sh  workflow.config
$ nextflow run .
N E X T F L O W  ~  version 23.10.0
Launching `./main.nf` [insane_mayer] DSL2 - revision: 856742884f
executor >  local (35)
[09/cacb09] process > echo_environment_variables [100%] 1 of 1 ✔
[ca/03dc41] process > prep_graph_creation        [100%] 1 of 1 ✔
[96/95dde9] process > create_specimen_graphs (9) [100%] 30 of 30 ✔
[f7/92863f] process > finalize_graphs            [100%] 1 of 1 ✔
[2a/22470e] process > train                      [100%] 1 of 1 ✔
[f1/fd0f17] process > upload_importances         [100%] 1 of 1 ✔
Completed at: 16-Nov-2023 16:36:05
Duration    : 8m 26s
CPU hours   : 0.2
Succeeded   : 35


$ ls
configure.sh  main.nf  nextflow.config  results  run.sh  work  workflow.config
$ ls results/
feature_names.txt  graph_info.pkl  graphs.bin  importances.csv  model
$ ls results/model/
classification_report.txt  model_best_validation_accuracy.pt  model_best_validation_loss.pt  model_best_validation_weighted_f1_score.pt

@CarlinLiao
Copy link
Collaborator Author

Something to note: I originally had the cggnn workflow environment only install the cggnn and workflow extras, but this would error when running spt workflow configure because running that would require importing the phenotype proximity workflow, which imports FeatureMatrixExtractor from db, which would cause an error because it needs db extras. So, pip install spatialprofilingtoolbox[all].

@CarlinLiao CarlinLiao linked an issue Nov 16, 2023 that may be closed by this pull request
@CarlinLiao
Copy link
Collaborator Author

CarlinLiao commented Nov 16, 2023

Interesting error from cggnn extract.

11-16 23:41:22 [  INFO   ] db.feature_matrix_extractor: Retrieving expression data from database.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 71, in <module>
    df_cell, df_label, label_to_result = extract_cggnn_data(
                                         ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/extract.py", line 135, in extract_cggnn_data
    for specimen, data in extractor.extract(study=study, retain_structure_id=True).items()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 80, in extract
    extraction = self._extract(
                 ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 98, in _extract
    data_arrays = self._retrieve_expressions_from_database(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 129, in _retrieve_expressions_from_database
    puller.pull(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 233, in pull
    self._retrieve_data_arrays(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 253, in _retrieve_data_arrays
    self._fill_data_arrays_for_study(
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 276, in _fill_data_arrays_for_study
    sparse_entries = self._get_sparse_entries(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 355, in _get_sparse_entries
    query, parameters = self._get_sparse_matrix_query_specimen_specific(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 381, in _get_sparse_matrix_query_specimen_specific
    range_definition = SparseMatrixPuller._retrieve_expressions_range(cursor, specimen)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py", line 395, in _retrieve_expressions_range
    cursor.execute(query, (scope,))
psycopg2.errors.UndefinedTable: relation "range_definitions" does not exist
LINE 3:         FROM range_definitions
                     ^

@jimmymathews
Copy link
Collaborator

The range_definitions table is the replacement for the previous index, introduced in the last few days.
Most likely your data images are outdated.

I am getting a different error (which may indicate accidental omission of some test data artifact?). During plotting testing:

  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/scripts/plot_interactives.py", line 39, in <module>
    plot_interactives(graphs_data, feature_names, args.output_directory, args.merge_rois)
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 47, in plot_interactives
    graphs = [_convert_dgl_to_networkx(graph, feature_names) for graph in dgl_graphs]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 47, in <listcomp>
    graphs = [_convert_dgl_to_networkx(graph, feature_names) for graph in dgl_graphs]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/spatialprofilingtoolbox/cggnn/interactives.py", line 122, in _convert_dgl_to_networkx
    raise ValueError(
ValueError: Importance scores not yet found. Calculate them and place them in graph.ndata[importance] first.

@CarlinLiao
Copy link
Collaborator Author

I know how to fix that one. I updated the test data artifacts for the new cg-gnn/spatialprofilingtoolbox object type split, but forgot to run a trained model on them to produce importance scores. Should be a quick fix.

@CarlinLiao
Copy link
Collaborator Author

After rebuilding the data loaded images and replacing the graphs.bin and graph_info.pkl, all tests pass.

@CarlinLiao CarlinLiao marked this pull request as ready for review November 17, 2023 15:39
@jimmymathews
Copy link
Collaborator

I modified the documentation in several places for style conformity.

@jimmymathews jimmymathews merged commit 4eb5b33 into main Nov 17, 2023
1 check passed
@jimmymathews jimmymathews deleted the parallel-cggnn branch November 17, 2023 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Code change preserving functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bring data transformation features from cg-gnn to SPT
2 participants