Problems in hpb() when one or more clusters are very small #5

lcolladotor · 2022-03-08T22:27:25Z

Hi,

We ran into an scenario where a cluster is small such that https://github.com/ntdyjack/fasthplus/blob/main/R/bsf.R#L20 fails. That's because the small cluster is smaller than the t divided by the number of clusters. That is, t / length(q).

We can avoid this by choosing t with code like this:

find_t <- function(L, proportion = 0.05) {
    initial_t <- floor(length(L) * proportion)
    smallest_cluster_size <- min(table(L))
    n_labels <- length(unique(L))
    ifelse(smallest_cluster_size > (initial_t / n_labels), initial_t, smallest_cluster_size * n_labels)
}

Otherwise, when t / length(q) is larger than the smallest cluster size, then you run into an error with sample() because replace = FALSE by default. Alternatively,

fasthplus/R/bsf.R

Line 20 in 736494f

p <- as.vector(sapply(q, function(x) sample(x,round(t/length(q)))))

could be changed to have sample(replace = TRUE) but I don't know if this would break your bootstrap assumptions.

find_t <- function(L, proportion = 0.05) {
    initial_t <- floor(length(L) * proportion)
    smallest_cluster_size <- min(table(L))
    n_labels <- length(unique(L))
    ifelse(smallest_cluster_size > (initial_t / n_labels), initial_t, smallest_cluster_size * n_labels)
}

set.seed(20220308)
n_labels <- 7
L <- sample(letters[seq_len(n_labels)], 1e4, replace = TRUE)
min(table(L))
#> [1] 1382
find_t(L, 1)
#> [1] 9674
find_t(L, (find_t(L, 1) + n_labels)/1e4)
#> [1] 9674
find_t(L, (find_t(L, 1) - n_labels)/1e4)
#> [1] 9667
find_t(L, 0.05)
#> [1] 500

## Small reprex to force the error
label_list <- split(seq_len(length(L)), L)
q <- label_list
t <- find_t(L, 1) + n_labels
p <- as.vector(sapply(q, function(x) sample(x,round(t/length(q)))))
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
traceback()
#> No traceback available

^{Created on 2022-03-08 by the reprex package (v2.0.1)}

EDIT start

Looks like reprex doesn't include the traceback() output. Here it is:

Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'
> traceback()
6: sample.int(length(x), size, replace, prob)
5: sample(x, round(t/length(q))) at #1
4: FUN(X[[i]], ...)
3: lapply(X = X, FUN = FUN, ...)
2: sapply(q, function(x) sample(x, round(t/length(q))))
1: as.vector(sapply(q, function(x) sample(x, round(t/length(q)))))

EDIT end

The problem with the above solution, is that it could result in very low values of t when for example you have clusters that are very small compared to others. Abby @abspangler13 has a case where we have 15 clusters and the smallest one has a size of 8. So the smallest t would then be 15 * 8 = 120 as shown below, which would be very small.

> find_t(L=colData(spe)[[paste0("bayesSpace_harmony_",k)]],proportion = 0.01)
[1] 120
> sort(table(colData(spe)[[paste0("bayesSpace_harmony_",k)]]))

    6    12    14     1    11    15    13     2    10     9     5     8     3 
    8   167   519  1315  3501  3579  4104  4851  4998  9333  9492 11687 13310 
    7     4 
14535 18175

Best,
Leo

The text was updated successfully, but these errors were encountered:

lcolladotor · 2022-03-08T22:46:32Z

In the above case, we are considering simply dropping that tiny cluster so we can use a larger t without any changes to fasthplus.

…thplus https://github.com/LieberInstitute/spatialDLPFC/blob/main/code/analysis/06_fastplus/06_fasthplus.R#L46-L63 and avoiding ntdyjack/fasthplus#5 if possible

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems in hpb() when one or more clusters are very small #5

Problems in hpb() when one or more clusters are very small #5

lcolladotor commented Mar 8, 2022 •

edited

Loading

lcolladotor commented Mar 8, 2022

Problems in hpb() when one or more clusters are very small #5

Problems in hpb() when one or more clusters are very small #5

Comments

lcolladotor commented Mar 8, 2022 • edited Loading

lcolladotor commented Mar 8, 2022

lcolladotor commented Mar 8, 2022 •

edited

Loading