-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exe 1340 update merge #3
Conversation
…ph loading to work with hive-style edgelists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have gone through the code and test run RenameCells.CellGraphAssay
line by line and I think it looks like everything works as intended. It is a little bit difficult for me to follow why certain validations are there, but I have tried to ask questions in those places. The merging is indeed very fast, so great work!
|
||
fsd <- slot(object, name = "arrow_data") | ||
# Copy directory | ||
if (dir.exists(arrow_dirs)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you check whether an edgelist exists, and if so you copy that to a temporary folder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. The only option is to make copies when renaming the cells. If we tried to overwrite the directories instead, we might break our precious object. For example:
se <- ReadMPX_Seurat(<path to pxl file>)
se_renamed <- RenameCells(se, new.names = <some new names>)
We don't want to overwrite the edgelist directories which belongs to se
so we have to copy the directories to a new temp dir. On the other hand, if we did:
se <- ReadMPX_Seurat(<path to pxl file>)
# Overwrite se variable
se <- RenameCells(se, new.names = <some new names>)
I guess it would make sense to overwrite, but it's tricky to know if the variable has been overwritten or not.
merge.CellGraphAssay
now makes better use of the hive-style partitioning of the edgelist parquet files. However, there are some pros and cons with this new approach which are discussed below.Changes
RenameCells.CellGraphAssay
has been updated to be more strict when renaming cell IDs. Cell names must now match^[a-zA-Z][a-zA-Z0-9]*\\_RCVCMP\\d{7}$
. This should start with a letter, then any number of letters or numbers (e.g. "Sample01") followed by an underscore and the pixel ID which should be RCVCMP followed by 7 digits.When a two or more CellGraphAssays are merged (using
merge
), RenameCells is called under the hood to update the pixel IDs. If the component IDs are not formatted correctly, RenameCells throws an error.The reason for this behavior is that we can use the sample ID part to figure out what folder to look for edgelist data. If we have a merged object with the following edgelist data directory:
Then we can use the sample ID part of the cell names to select the appropriate directory to look in. For example, if we say that we want to load the edgelist for "B_RCVCMP0000001", we look into the
sample=B
directory and fetch component "RCVCMP0000001"". This way, we can avoid modifying the parquet files.merge.CellGraphAssay
now takes a parameteradd.cell.ids
which is required for RenameCells to work properly.For instance,
merge(seur1, seur2, add.cell.ids = c("A", "B"))
is ok butmerge(seur1, seur2)
is not. If we just runmerge(seur1, seur2)
, Seurat will add a suffix to the component IDs, e.g. -1, -2, ... which will break the whole idea of fetching edgelist data using the sample ID part of the cell names.When merging is done correctly, pixelatorR will simply copy and rename the edgelist directories instead of loading them in memory and modifying the component IDs. If we start with this Seurat objects edgelist directory:
and then run
merge(seur1, seur2, add.cell.ids = c("A", "B"))
, we will get:where the individual edgelist.parquet files remain unmodified. This makes merging substantially faster than before.
LoadCellGraphs
now handles the loading of edgelist differently. First, it collects the sample ID from the cell name (if available) and uses this information to fetch the appropriate edgelist. For example:Cons
merge(seur1, seur2)
breaks when CellGraphAssays are presentadd.cell.ids
is not a default parameter for merge and will therefore not show up as an option when usingmerge
add.cell.ids
when merging Seurat objects which is not the defaultNOTE: Another annoying thing is that when providing
add.cell.ids
to merge.Seurat, these are not actually passed down to merge.CellGraphAssay. Instead, merge.Seurat calls RenameCells on each individual CellGraphAssay to make sure that the names are unique, and then passes the CellGraphAssay to merge.CellGraphAssay.Pros
Reviewer(s)
Please test the new merging feature and see if you think this is a reasonable way forward. I would highly appreciate any suggestions to circumvent some of the cons listed above. Below is some code that could help understand the behavior of merging a bit better: