Exe 1340 update merge #3

ludvigla · 2024-01-22T16:37:41Z

merge.CellGraphAssay now makes better use of the hive-style partitioning of the edgelist parquet files. However, there are some pros and cons with this new approach which are discussed below.

Changes

RenameCells.CellGraphAssay has been updated to be more strict when renaming cell IDs. Cell names must now match ^[a-zA-Z][a-zA-Z0-9]*\\_RCVCMP\\d{7}$. This should start with a letter, then any number of letters or numbers (e.g. "Sample01") followed by an underscore and the pixel ID which should be RCVCMP followed by 7 digits.

When a two or more CellGraphAssays are merged (using merge), RenameCells is called under the hood to update the pixel IDs. If the component IDs are not formatted correctly, RenameCells throws an error.

The reason for this behavior is that we can use the sample ID part to figure out what folder to look for edgelist data. If we have a merged object with the following edgelist data directory:

├── sample=A
│   └── edgelist.parquet
└── sample=B
    └── edgelist.parquet

Then we can use the sample ID part of the cell names to select the appropriate directory to look in. For example, if we say that we want to load the edgelist for "B_RCVCMP0000001", we look into the sample=B directory and fetch component "RCVCMP0000001"". This way, we can avoid modifying the parquet files.

merge.CellGraphAssay now takes a parameter add.cell.ids which is required for RenameCells to work properly.
For instance, merge(seur1, seur2, add.cell.ids = c("A", "B")) is ok but merge(seur1, seur2) is not. If we just run merge(seur1, seur2), Seurat will add a suffix to the component IDs, e.g. -1, -2, ... which will break the whole idea of fetching edgelist data using the sample ID part of the cell names.

When merging is done correctly, pixelatorR will simply copy and rename the edgelist directories instead of loading them in memory and modifying the component IDs. If we start with this Seurat objects edgelist directory:

├── sample=S1
│   └── edgelist.parquet

and then run merge(seur1, seur2, add.cell.ids = c("A", "B")), we will get:

├── sample=A
│   └── edgelist.parquet
└── sample=B
    └── edgelist.parquet

where the individual edgelist.parquet files remain unmodified. This makes merging substantially faster than before.

LoadCellGraphs now handles the loading of edgelist differently. First, it collects the sample ID from the cell name (if available) and uses this information to fetch the appropriate edgelist. For example:
- Find sample ID and component ID from cell name "B_RCVCMP0000001" -> "B"
- Go to the sample=B directory and fetch the edgelist for component ID "RCVCMP0000001" in the parquet file
- Load edgelist and convert to a graph object

Cons

merge(seur1, seur2) breaks when CellGraphAssays are present
"double merging" doesn't work. If you have two merged Seurat objects, you cannot merge them again.

# OK
seur_merged <- merge(seur1, seur2, add.cell.ids = c("A", "B"))
# NOT OK
seur_double_merged <- merge(seur_merged, seur_merged, add.cell.ids = c("X", "Y"))

add.cell.ids is not a default parameter for merge and will therefore not show up as an option when using merge
awkward error message when doing an invalid merge. RenameCells.CellGraphAssay is called by the merge.Seurat function when merging two Seurat objects; however, RenameCells.CellGraphAssay doesn't know if the merging was done on two Seurat objects or two CellGraphAssay objects. Therefore, the error message has to instruct users to:
1. Use appropriate IDs
2. Use add.cell.ids when merging Seurat objects which is not the default
3. Not do double merging

NOTE: Another annoying thing is that when providing add.cell.ids to merge.Seurat, these are not actually passed down to merge.CellGraphAssay. Instead, merge.Seurat calls RenameCells on each individual CellGraphAssay to make sure that the names are unique, and then passes the CellGraphAssay to merge.CellGraphAssay.

Pros

Massive speed gain and reduced memory consumption when merging datasets. Previously, each parquet file had to be loaded in memory, the component IDs modified and the reexported. Now, we only have to copy the directories containing parquet files and renaming the directories.

Reviewer(s)

Please test the new merging feature and see if you think this is a reasonable way forward. I would highly appreciate any suggestions to circumvent some of the cons listed above. Below is some code that could help understand the behavior of merging a bit better:

# Please test with a real dataset to appreaciate the speed gain :-)
seur <- ReadMPX_Seurat(<pxl_file>)

# Error - add.cell.ids must be provided
seur_merged <- merge(seur, seur)

# OK :-D
system.time({
  seur_merged <- merge(seur, seur, add.cell.ids = c("A", "B"))
})

# Error - Cannot double merge objects :-(
seur_merged <- merge(seur_merged, seur_merged, add.cell.ids = c("A", "B"))

# Faster merging for many objects (8 Seurat objects) :-DDD
system.time({
  seur_merged <- merge(seur, 
                       list(seur, seur, seur, seur, seur, seur, seur), 
                       add.cell.ids = LETTERS[1:8])
})

# What's going on here?
# Initially, the edgelist is stored in a directory called sample=S1
ArrowDir(seur) %>% fs::dir_tree()
# The column names are the MPX IDs
colnames(seur)[1:10]

# When we merge the objects, the directories containing edgelists are 
# copied from each sample to a new directory, and the directories are 
# renamed to enable hive-style loading
ArrowDir(seur_merged) %>% fs::dir_tree()
# The column names are the MPX IDs prefixed with add.cell.ids
colnames(seur_merged) %>% head()
colnames(seur_merged) %>% tail()

…ph loading to work with hive-style edgelists.

…ied.

maxkarlsson

I have gone through the code and test run RenameCells.CellGraphAssay line by line and I think it looks like everything works as intended. It is a little bit difficult for me to follow why certain validations are there, but I have tried to ask questions in those places. The merging is indeed very fast, so great work!

R/load_cell_graphs.R

tests/testthat/test-CellGraphAssay-methods.R

R/objects.R

maxkarlsson · 2024-01-23T13:08:13Z

R/objects.R


-    fsd <- slot(object, name = "arrow_data")
+      # Copy directory
+      if (dir.exists(arrow_dirs)) {


Here you check whether an edgelist exists, and if so you copy that to a temporary folder?

Exactly. The only option is to make copies when renaming the cells. If we tried to overwrite the directories instead, we might break our precious object. For example:

se <- ReadMPX_Seurat(<path to pxl file>) se_renamed <- RenameCells(se, new.names = <some new names>)

We don't want to overwrite the edgelist directories which belongs to se so we have to copy the directories to a new temp dir. On the other hand, if we did:

se <- ReadMPX_Seurat(<path to pxl file>) # Overwrite se variable se <- RenameCells(se, new.names = <some new names>)

I guess it would make sense to overwrite, but it's tricky to know if the variable has been overwritten or not.

R/objects.R

ludvigla added 4 commits January 22, 2024 08:07

Updated merge function to handle hive-style partitioning. Changed gra…

32a1c6f

…ph loading to work with hive-style edgelists.

Updated tests

613d2d1

Updated docs

3c41eb7

Fixed examples where merge is used and argument add.cell.ids is requr…

6238a4c

…ied.

ludvigla requested a review from maxkarlsson January 22, 2024 16:37

maxkarlsson requested changes Jan 23, 2024

View reviewed changes

Fixed typo

42a5851

maxkarlsson approved these changes Jan 23, 2024

View reviewed changes

Merge branch 'main' into EXE-1340-update-merge

de274a2

ludvigla merged commit 7913cb0 into main Feb 12, 2024
5 of 6 checks passed

ludvigla deleted the EXE-1340-update-merge branch February 12, 2024 09:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exe 1340 update merge #3

Exe 1340 update merge #3

ludvigla commented Jan 22, 2024

maxkarlsson left a comment

maxkarlsson Jan 23, 2024

ludvigla Jan 23, 2024

Exe 1340 update merge #3

Exe 1340 update merge #3

Conversation

ludvigla commented Jan 22, 2024

Changes

Cons

Pros

Reviewer(s)

maxkarlsson left a comment

Choose a reason for hiding this comment

maxkarlsson Jan 23, 2024

Choose a reason for hiding this comment

ludvigla Jan 23, 2024

Choose a reason for hiding this comment