Check needed for NCOL(dtm) <= # of topics #23

Open
ko-ichi-h opened this issue Aug 2, 2021 · 6 comments

@ko-ichi-h
Contributor

ko-ichi-h commented Aug 2, 2021

Hello,

Thank you for developing such useful software!

When I run FindTopicsNumber(), I get results normally for some data, but for other data I get the following error.

fit models... done.
calculate metrics:
     Griffiths2004... done.
     CaoJuan2009... done.
     Arun2010...Error in FUN(X[[i]], ...) : 
     dims [product 71] do not match the length of object [80]
In addition: Warning message:
In cm1/cm2 :
  longer object length is not a multiple of shorter object length

And here is the R script file that gave me the above error:
ldatuning_error.zip

If I exclude "Arun2010" from the "metrics" option, I get results normally without any errors.

My sessionInfo():

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932   
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C                  
[5] LC_TIME=Japanese_Japan.932    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ldatuning_1.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         xml2_1.3.2         magrittr_2.0.1    
 [4] munsell_0.5.0      colorspace_2.0-2   tm_0.7-8          
 [7] R6_2.5.0           rlang_0.4.11       fansi_0.5.0       
[10] tools_4.1.0        parallel_4.1.0     grid_4.1.0        
[13] gtable_0.3.0       utf8_1.2.2         modeltools_0.2-23 
[16] ellipsis_0.3.2     tibble_3.1.3       lifecycle_1.0.0   
[19] crayon_1.4.1       gmp_0.6-2          NLP_0.2-1         
[22] ggplot2_3.3.5      vctrs_0.3.8        glue_1.4.2        
[25] slam_0.1-48        Rmpfr_0.8-4        compiler_4.1.0    
[28] pillar_1.6.2       topicmodels_0.2-12 scales_1.1.1      
[31] stats4_4.1.0       pkgconfig_2.0.3   

I also get the same error with R 3.x.

Best.

@ko-ichi-h ko-ichi-h changed the title from "FindTopicsNumber() gives me: simpleError in FUN(X[[1L]], ...): dims [product 71] do not match the length of object [80]" to "FindTopicsNumber() gives me: Error in FUN(X[[1L]], ...): dims [product 71] do not match the length of object [80]" on Aug 8, 2021
@titaniumtroop
Collaborator

Since you're getting results with some datasets/metrics and not others, I suspect you may have NAs, NaNs, NULLs, or other non-numeric values in your data that are causing this type of error. If you confirm the data aren't the issue, it would be helpful if you could post the traceback to pinpoint the error.

Just a note: if memory serves correctly, the original author wrote this package as a grad school project. I took over as the maintainer while working towards my own graduate degree. I'm out of school now so it's been a while since I've actively worked on the project (hence the delayed response), and there isn't any active development going on. If you're interested in contributing to the project, I'm happy to add you to the repo.

Thanks!

@ko-ichi-h
Contributor Author

Hello and thank you for your reply.

I believe the data is not the issue because (1) only "Arun2010" gives me the error while the other metrics return results, and (2) for some "topics" settings, "Arun2010" also returns results normally. The following command gives me the error, but if I delete ", 80" from the "topics" option, it runs normally.

result_tps <- FindTopicsNumber(
	dtm,
	topics   = c(seq(2, 35, by=3), 40, 45, 50, 60, 70, 80),
	metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010" , "Deveaud2014"),
	method   = "Gibbs",
	control  = list(seed = 1234567,  burnin = 1000),
	verbose = T
)

Anyway, here is the traceback() result:

9: FUN(X[[i]], ...)
8: lapply(X = X, FUN = FUN, ...)
7: sapply(models, FUN = function(model) {
       m1 <- exp(model@beta)
       m1.svd <- svd(m1)
       cm1 <- as.matrix(m1.svd$d)
       m2 <- model@gamma
       cm2 <- len %*% m2
       norm <- norm(as.matrix(len), type = "m")
       cm2 <- as.vector(cm2/norm)
       divergence <- sum(cm1 * log(cm1/cm2)) + sum(cm2 * log(cm2/cm1))
       return(divergence)
   })
6: Arun2010(models, dtm)
5: FindTopicsNumber(dtm, topics = c(seq(2, 35, by = 3), 40, 45, 
       50, 60, 70, 80), metrics = c("Griffiths2004", "CaoJuan2009", 
       "Arun2010", "Deveaud2014"), method = "Gibbs", control = list(seed = 1234567, 
       burnin = 1000), verbose = T) at ldatuning_error.r#1230
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("C:\\Users\\KO-ichi\\Desktop\\ldatuning_error.r")

Any help would be highly appreciated.

Thank you.

@titaniumtroop
Collaborator

What are the dimensions of your input dtm? It looks like the number of columns might be 71, which would correspond to the number of terms. Perhaps you can't generate a larger number of topics than you have terms using the Arun method.

If that's the case, there should be a check to confirm that the number of topics specified in FindTopicsNumber doesn't exceed the number of terms in the dtm.

@ko-ichi-h
Contributor Author

> What are the dimensions of your input dtm? It looks like the number of columns might be 71, which would correspond to the number of terms. Perhaps you can't generate a larger number of topics than you have terms using the Arun method.

Yes, you are absolutely right. The dtm has 71 columns, so svd() outputs only 71 singular values, while the other vector in the calculation has 80 entries (one per topic). That mismatch causes the error.

And yes again, that check should be performed, and a more human-readable error message would be nice.
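
For anyone reading later, the mismatch can probably be reproduced without fitting any model. This is only a sketch: the random matrix `m1` and vector `cm2` are stand-ins (assumptions) for `exp(model@beta)` and the per-topic vector in the traceback, using the 80 topics and 71 terms from this thread.

```r
# Toy dimensions taken from this thread: 80 topics, 71 terms.
set.seed(1)
n_topics <- 80
n_terms  <- 71

# Stand-in for exp(model@beta): one row per topic, one column per term.
m1 <- matrix(runif(n_topics * n_terms), nrow = n_topics, ncol = n_terms)

# svd() returns min(nrow, ncol) singular values, so only 71 of them here.
cm1 <- as.matrix(svd(m1)$d)
length(cm1)

# Stand-in for the per-topic vector, which has one entry per topic (80).
cm2 <- runif(n_topics)

# Dividing the 71 x 1 matrix by the length-80 vector raises an error,
# which is captured here as a message string instead of stopping the script.
res <- tryCatch(
  suppressWarnings(cm1 / cm2),
  error = function(e) conditionMessage(e)
)
print(res)
```

With topics equal to or below the number of terms, both objects have the same length and the division goes through.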

@titaniumtroop titaniumtroop changed the title from "FindTopicsNumber() gives me: Error in FUN(X[[1L]], ...): dims [product 71] do not match the length of object [80]" to "Check needed for NCOL(dtm) <= # of topics" on Aug 9, 2021
@titaniumtroop
Collaborator

Ok, glad we were able to identify the issue. I tagged this as something that needs work.

I question whether it ever makes sense to have more topics than terms. My suggestion would be for the check to throw an error if topics > terms, regardless of which algorithm is selected, unless someone can give a good example of why you'd want to have more topics than terms.

The error should occur before actual processing begins -- it wouldn't be fun for your processing to run for a few days only to get an error at the end.
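
A rough sketch of what such an upfront check might look like follows. `check_topics_vs_terms()` is a hypothetical helper, not an existing ldatuning function, and the plain matrix `toy_dtm` is only a stand-in for a real document-term matrix. The idea is to call it before any model is fitted.

```r
# Hypothetical helper, not part of ldatuning: stop early when any requested
# number of topics exceeds the number of terms (columns) in the dtm.
check_topics_vs_terms <- function(dtm, topics) {
  n_terms <- ncol(dtm)
  too_many <- topics[topics > n_terms]
  if (length(too_many) > 0) {
    stop(
      "Requested topic counts (", paste(too_many, collapse = ", "),
      ") exceed the number of terms in the dtm (", n_terms, ").",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Stand-in for a document-term matrix with 10 documents and 71 terms.
toy_dtm <- matrix(1L, nrow = 10, ncol = 71)

# Passes: every topic count is at most 71.
ok <- check_topics_vs_terms(toy_dtm, c(2, 35, 70))

# Fails fast: 80 topics is more than 71 terms.
msg <- tryCatch(
  check_topics_vs_terms(toy_dtm, c(2, 35, 70, 80)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Because the check only reads `ncol(dtm)` and the `topics` vector, it costs essentially nothing compared with fitting the models.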

@ko-ichi-h
Contributor Author

Hmm, it may be possible that term A forms topic Alpha, term B forms topic Beta, and term A & B together form topic Gamma. 2 terms and 3 topics may be possible I think.

So it would be fine to raise an error only when users specify "Arun2010".
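
Something along these lines is what I have in mind, as a sketch only. `check_arun_topics()` and its arguments are hypothetical names, and the committed fix may well look different.

```r
# Hypothetical variant: raise the error only when "Arun2010" is requested.
check_arun_topics <- function(n_terms, topics, metrics) {
  if ("Arun2010" %in% metrics && any(topics > n_terms)) {
    stop(
      "Arun2010 requires the number of topics to be at most the number ",
      "of terms (", n_terms, "); largest requested: ", max(topics), ".",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Without Arun2010, 80 topics on 71 terms is allowed through.
other_ok <- check_arun_topics(71, c(2, 40, 80), c("Griffiths2004", "CaoJuan2009"))

# With Arun2010, the same request is rejected before any fitting.
arun_msg <- tryCatch(
  check_arun_topics(71, c(2, 40, 80), c("CaoJuan2009", "Arun2010")),
  error = function(e) conditionMessage(e)
)
print(arun_msg)
```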

ko-ichi-h added a commit to ko-ichi-h/ldatuning that referenced this issue Mar 1, 2022
titaniumtroop added a commit that referenced this issue Mar 1, 2022