Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exe 1340 update merge #3

Merged
merged 6 commits into from
Feb 12, 2024
Merged

Exe 1340 update merge #3

merged 6 commits into from
Feb 12, 2024

Conversation

ludvigla
Copy link
Collaborator

merge.CellGraphAssay now makes better use of the hive-style partitioning of the edgelist parquet files. However, there are some pros and cons with this new approach which are discussed below.

Changes

  1. RenameCells.CellGraphAssay has been updated to be more strict when renaming cell IDs. Cell names must now match ^[a-zA-Z][a-zA-Z0-9]*\\_RCVCMP\\d{7}$. This should start with a letter, then any number of letters or numbers (e.g. "Sample01") followed by an underscore and the pixel ID which should be RCVCMP followed by 7 digits.

When a two or more CellGraphAssays are merged (using merge), RenameCells is called under the hood to update the pixel IDs. If the component IDs are not formatted correctly, RenameCells throws an error.

The reason for this behavior is that we can use the sample ID part to figure out what folder to look for edgelist data. If we have a merged object with the following edgelist data directory:

├── sample=A
│   └── edgelist.parquet
└── sample=B
    └── edgelist.parquet

Then we can use the sample ID part of the cell names to select the appropriate directory to look in. For example, if we say that we want to load the edgelist for "B_RCVCMP0000001", we look into the sample=B directory and fetch component "RCVCMP0000001"". This way, we can avoid modifying the parquet files.

  1. merge.CellGraphAssay now takes a parameter add.cell.ids which is required for RenameCells to work properly.
    For instance, merge(seur1, seur2, add.cell.ids = c("A", "B")) is ok but merge(seur1, seur2) is not. If we just run merge(seur1, seur2), Seurat will add a suffix to the component IDs, e.g. -1, -2, ... which will break the whole idea of fetching edgelist data using the sample ID part of the cell names.

When merging is done correctly, pixelatorR will simply copy and rename the edgelist directories instead of loading them in memory and modifying the component IDs. If we start with this Seurat objects edgelist directory:

├── sample=S1
│   └── edgelist.parquet

and then run merge(seur1, seur2, add.cell.ids = c("A", "B")), we will get:

├── sample=A
│   └── edgelist.parquet
└── sample=B
    └── edgelist.parquet

where the individual edgelist.parquet files remain unmodified. This makes merging substantially faster than before.

  1. LoadCellGraphs now handles the loading of edgelist differently. First, it collects the sample ID from the cell name (if available) and uses this information to fetch the appropriate edgelist. For example:
    • Find sample ID and component ID from cell name "B_RCVCMP0000001" -> "B"
    • Go to the sample=B directory and fetch the edgelist for component ID "RCVCMP0000001" in the parquet file
    • Load edgelist and convert to a graph object

Cons

  • merge(seur1, seur2) breaks when CellGraphAssays are present
  • "double merging" doesn't work. If you have two merged Seurat objects, you cannot merge them again.
# OK
seur_merged <- merge(seur1, seur2, add.cell.ids = c("A", "B"))
# NOT OK
seur_double_merged <- merge(seur_merged, seur_merged, add.cell.ids = c("X", "Y"))
  • add.cell.ids is not a default parameter for merge and will therefore not show up as an option when using merge
  • awkward error message when doing an invalid merge. RenameCells.CellGraphAssay is called by the merge.Seurat function when merging two Seurat objects; however, RenameCells.CellGraphAssay doesn't know if the merging was done on two Seurat objects or two CellGraphAssay objects. Therefore, the error message has to instruct users to:
    1. Use appropriate IDs
    2. Use add.cell.ids when merging Seurat objects which is not the default
    3. Not do double merging

NOTE: Another annoying thing is that when providing add.cell.ids to merge.Seurat, these are not actually passed down to merge.CellGraphAssay. Instead, merge.Seurat calls RenameCells on each individual CellGraphAssay to make sure that the names are unique, and then passes the CellGraphAssay to merge.CellGraphAssay.

Pros

  • Massive speed gain and reduced memory consumption when merging datasets. Previously, each parquet file had to be loaded in memory, the component IDs modified and the reexported. Now, we only have to copy the directories containing parquet files and renaming the directories.

Reviewer(s)

Please test the new merging feature and see if you think this is a reasonable way forward. I would highly appreciate any suggestions to circumvent some of the cons listed above. Below is some code that could help understand the behavior of merging a bit better:

# Please test with a real dataset to appreaciate the speed gain :-)
seur <- ReadMPX_Seurat(<pxl_file>)

# Error - add.cell.ids must be provided
seur_merged <- merge(seur, seur)

# OK :-D
system.time({
  seur_merged <- merge(seur, seur, add.cell.ids = c("A", "B"))
})

# Error - Cannot double merge objects :-(
seur_merged <- merge(seur_merged, seur_merged, add.cell.ids = c("A", "B"))

# Faster merging for many objects (8 Seurat objects) :-DDD
system.time({
  seur_merged <- merge(seur, 
                       list(seur, seur, seur, seur, seur, seur, seur), 
                       add.cell.ids = LETTERS[1:8])
})

# What's going on here?
# Initially, the edgelist is stored in a directory called sample=S1
ArrowDir(seur) %>% fs::dir_tree()
# The column names are the MPX IDs
colnames(seur)[1:10]

# When we merge the objects, the directories containing edgelists are 
# copied from each sample to a new directory, and the directories are 
# renamed to enable hive-style loading
ArrowDir(seur_merged) %>% fs::dir_tree()
# The column names are the MPX IDs prefixed with add.cell.ids
colnames(seur_merged) %>% head()
colnames(seur_merged) %>% tail()

@ludvigla ludvigla requested a review from maxkarlsson January 22, 2024 16:37
Copy link
Contributor

@maxkarlsson maxkarlsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone through the code and test run RenameCells.CellGraphAssay line by line and I think it looks like everything works as intended. It is a little bit difficult for me to follow why certain validations are there, but I have tried to ask questions in those places. The merging is indeed very fast, so great work!


fsd <- slot(object, name = "arrow_data")
# Copy directory
if (dir.exists(arrow_dirs)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you check whether an edgelist exists, and if so you copy that to a temporary folder?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. The only option is to make copies when renaming the cells. If we tried to overwrite the directories instead, we might break our precious object. For example:

se <- ReadMPX_Seurat(<path to pxl file>)
se_renamed <- RenameCells(se, new.names = <some new names>)

We don't want to overwrite the edgelist directories which belongs to se so we have to copy the directories to a new temp dir. On the other hand, if we did:

se <- ReadMPX_Seurat(<path to pxl file>)
# Overwrite se variable
se <- RenameCells(se, new.names = <some new names>)

I guess it would make sense to overwrite, but it's tricky to know if the variable has been overwritten or not.

@ludvigla ludvigla merged commit 7913cb0 into main Feb 12, 2024
5 of 6 checks passed
@ludvigla ludvigla deleted the EXE-1340-update-merge branch February 12, 2024 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants