get.attribute.df() - how to handle duplicate CellIDs? #67

Open
reliscu opened this issue Oct 1, 2020 · 1 comment

reliscu commented Oct 1, 2020

Apparently the loom file I'm trying to work with has duplicated Cell IDs:

df=loom$get.attribute.df(MARGIN=2) 

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10X35_1_GATGGACCTTAT-’, ‘10X38_1_ATGGCTTCAGAC-’, ‘10X38_1_CCGAGAACAGTC-’, ‘10X38_1_GGAGTGACGTAC-’, ‘10X38_2_AGTCCTATAAGG-’, ‘10X38_2_ATCGTGACGGTT-’, ‘10X38_2_GCAAGAAGTGCT-’, ‘10X38_2_TCGACTGCAGTT-’, ‘10X43_2_GCCATGCTTCCG-’, ‘10X43_2_TAACACAGATGA-’, ‘10X43_3_CACGTGGACGGA-’, ‘10X48_2_AGCGGAGCGATT-’, ‘10X48_2_GTCTGAACCAGT-’, ‘10X48_2_TACCACCTGATG-’, ‘10X49_1_CAATACCACACA-’, ‘10X49_1_GGTGGATTCGTT-’, ‘10X49_3_ATGCCTGCGTAT-’, ‘10X49_3_CAAATGTCCTCG-’, ‘10X49_3_GGTAACGGAGGT-’, ‘10X49_4_AGCTGAATTCGG-’, ‘10X49_4_ATCCCTAGCGTT-’, ‘10X49_4_ATGCACTCTAGG-’, ‘10X49_4_ATGCTGTATCGG-’, ‘10X49_4_CAGTTGGATAGA-’, ‘10X49_4_CTTTCTCGTGAT-’, ‘10X49_4_GGAGCTACAGTC-’, ‘10X49_4_TCGAACCAGAAA-’, ‘10X49_4_TCTACTAAAAGC-’, ‘10X49_4_TTACGACCCTAC-’, ‘10X49_5_AGATGACGTTGA-’, ‘10X49_5_AGCGGATCTGGA-’, ‘10X49_5_AGGCTGACCTGA-’, ‘10X49_5_AGTCACTACGAC-’, [... truncated] 

Ultimately I would like to convert these data into a data frame for further analysis. What can I do?
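
One way to confirm the problem before converting is to count how often each ID occurs. The sketch below is in Python rather than R and is only an illustration: `find_duplicates` and the file name in the commented usage are hypothetical, and it assumes the IDs are stored in the `CellID` column attribute.

```python
from collections import Counter


def find_duplicates(ids):
    """Return {id: count} for every ID that occurs more than once."""
    return {cid: n for cid, n in Counter(ids).items() if n > 1}


# Hypothetical usage with loompy (file name is a placeholder):
#   import loompy
#   with loompy.connect("your_file.loom", mode="r") as ds:
#       print(find_duplicates(ds.ca.CellID))

# Stand-alone demonstration on a small list of IDs.
example = [
    "10X35_1_GATGGACCTTAT-",
    "10X38_1_ATGGCTTCAGAC-",
    "10X35_1_GATGGACCTTAT-",
]
print(find_duplicates(example))
```

An empty result would mean the IDs are already unique and the error has another cause.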

Session info:

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /wynton/home/cbi/shared/software/CBI/R-4.0.2/lib64/R/lib/libRblas.so
LAPACK: /wynton/home/cbi/shared/software/CBI/R-4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] loomR_0.2.1.9000    hdf5r_1.3.3         R6_2.4.1           
 [4] devtools_2.3.2      usethis_1.6.3       Matrix_1.2-18      
 [7] data.table_1.13.0   GEOquery_2.56.0     Biobase_2.48.0     
[10] BiocGenerics_0.34.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        pillar_1.4.6      compiler_4.0.2    remotes_2.2.0    
 [5] prettyunits_1.1.1 tools_4.0.2       bit_4.0.4         testthat_2.3.2   
 [9] digest_0.6.25     pkgbuild_1.1.0    pkgload_1.1.0     memoise_1.1.0    
[13] lifecycle_0.2.0   tibble_3.0.3      lattice_0.20-41   pkgconfig_2.0.3  
[17] rlang_0.4.7       cli_2.0.2         curl_4.3          stringr_1.4.0    
[21] withr_2.3.0       dplyr_1.0.2       xml2_1.3.2        desc_1.2.0       
[25] generics_0.0.2    fs_1.5.0          vctrs_0.3.4       hms_0.5.3        
[29] bit64_4.0.5       rprojroot_1.3-2   grid_4.0.2        tidyselect_1.1.0 
[33] glue_1.4.2        processx_3.4.4    pbapply_1.4-3     fansi_0.4.1      
[37] sessioninfo_1.1.1 limma_3.44.3      tidyr_1.1.2       readr_1.3.1      
[41] purrr_0.3.4       callr_3.4.4       magrittr_1.5      backports_1.1.10 
[45] ps_1.3.4          ellipsis_0.3.1    assertthat_0.2.1  stringi_1.5.3    
[49] crayon_1.3.4  

kvastad commented Apr 20, 2023

I've had a similar issue with a loom file containing duplicate CellIDs. With help from the authors of the publication (many thanks), it could be resolved by running a Python script on the dataset before using loomR. In this case the cells are unique, but some cell IDs are not, due to a string truncation error in their pipeline.

Here is the Python script. It appends a suffix ("2") to the second occurrence of each duplicated CellID.

-------- save this part below in a make_unique_CellID.py file --------

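from sys import exit
from collections import Counter

import loompy

# Open the loom file (loompy's default mode allows writing).
d = loompy.connect("l5_all.loom")

# IDs that occur more than once (for inspection only; not used below).
cn = Counter(d.ca.CellID)
duplicates = [cid for cid, n in cn.items() if n > 1]

# Append "2" to any CellID that has already been seen.
cellids = d.ca.CellID[:]
c = set()
for i in range(len(cellids)):
    cellid = cellids[i]
    if cellid in c:
        cellids[i] += "2"
    c.add(cellid)

# Write the modified IDs back and close the file.
d.ca.CellID = cellids
d.close()

exit()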

-------- end of make_unique_CellID.py file, don't save this line --------
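
Note that the script above appends the same "2" to every repeat, so an ID that occurs three or more times would still end up duplicated. If that applies to your file, a minimal sketch of a more general renaming step could look like the following. The function name `make_unique` and the `_` separator are my own choices for illustration, not part of loompy or loomR.

```python
from collections import Counter


def make_unique(ids, sep="_"):
    """Return a list in which every repeated ID gets a numbered suffix."""
    taken = set(ids)   # all IDs present in the input, plus new ones we create
    seen = Counter()   # how many times each original ID has been encountered
    result = []
    for cid in ids:
        seen[cid] += 1
        if seen[cid] == 1:
            # First occurrence keeps its original ID.
            result.append(cid)
            continue
        # Later occurrences get a suffix; skip any that would collide.
        n = seen[cid]
        candidate = f"{cid}{sep}{n}"
        while candidate in taken:
            n += 1
            candidate = f"{cid}{sep}{n}"
        taken.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["a", "b", "a", "a"]))
```

In the script above, the loop over `cellids` could be replaced by a call to such a function before writing the IDs back, assuming the attribute accepts a plain list or array of strings.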

Run the script in the same directory as the loom file (here, l5_all.loom).

To run it from the terminal in that directory, type:

python make_unique_CellID.py
