Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems in hpb() when one or more clusters are very small #5

Open
lcolladotor opened this issue Mar 8, 2022 · 1 comment
Open

Comments

@lcolladotor
Copy link

lcolladotor commented Mar 8, 2022

Hi,

We ran into an scenario where a cluster is small such that https://github.com/ntdyjack/fasthplus/blob/main/R/bsf.R#L20 fails. That's because the small cluster is smaller than the t divided by the number of clusters. That is, t / length(q).

We can avoid this by choosing t with code like this:

find_t <- function(L, proportion = 0.05) {
    initial_t <- floor(length(L) * proportion)
    smallest_cluster_size <- min(table(L))
    n_labels <- length(unique(L))
    ifelse(smallest_cluster_size > (initial_t / n_labels), initial_t, smallest_cluster_size * n_labels)
}

Otherwise, when t / length(q) is larger than the smallest cluster size, then you run into an error with sample() because replace = FALSE by default. Alternatively,

p <- as.vector(sapply(q, function(x) sample(x,round(t/length(q)))))
could be changed to have sample(replace = TRUE) but I don't know if this would break your bootstrap assumptions.

find_t <- function(L, proportion = 0.05) {
    initial_t <- floor(length(L) * proportion)
    smallest_cluster_size <- min(table(L))
    n_labels <- length(unique(L))
    ifelse(smallest_cluster_size > (initial_t / n_labels), initial_t, smallest_cluster_size * n_labels)
}

set.seed(20220308)
n_labels <- 7
L <- sample(letters[seq_len(n_labels)], 1e4, replace = TRUE)
min(table(L))
#> [1] 1382
find_t(L, 1)
#> [1] 9674
find_t(L, (find_t(L, 1) + n_labels)/1e4)
#> [1] 9674
find_t(L, (find_t(L, 1) - n_labels)/1e4)
#> [1] 9667
find_t(L, 0.05)
#> [1] 500

## Small reprex to force the error
label_list <- split(seq_len(length(L)), L)
q <- label_list
t <- find_t(L, 1) + n_labels
p <- as.vector(sapply(q, function(x) sample(x,round(t/length(q)))))
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
traceback()
#> No traceback available

Created on 2022-03-08 by the reprex package (v2.0.1)

EDIT start

Looks like reprex doesn't include the traceback() output. Here it is:

Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'
> traceback()
6: sample.int(length(x), size, replace, prob)
5: sample(x, round(t/length(q))) at #1
4: FUN(X[[i]], ...)
3: lapply(X = X, FUN = FUN, ...)
2: sapply(q, function(x) sample(x, round(t/length(q))))
1: as.vector(sapply(q, function(x) sample(x, round(t/length(q)))))

EDIT end

The problem with the above solution, is that it could result in very low values of t when for example you have clusters that are very small compared to others. Abby @abspangler13 has a case where we have 15 clusters and the smallest one has a size of 8. So the smallest t would then be 15 * 8 = 120 as shown below, which would be very small.

> find_t(L=colData(spe)[[paste0("bayesSpace_harmony_",k)]],proportion = 0.01)
[1] 120
> sort(table(colData(spe)[[paste0("bayesSpace_harmony_",k)]]))

    6    12    14     1    11    15    13     2    10     9     5     8     3 
    8   167   519  1315  3501  3579  4104  4851  4998  9333  9492 11687 13310 
    7     4 
14535 18175 

Best,
Leo

@lcolladotor
Copy link
Author

In the above case, we are considering simply dropping that tiny cluster so we can use a larger t without any changes to fasthplus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant