get.attribute.df() - how to handle duplicate CellIDs? #67

reliscu · 2020-10-01T04:53:23Z

Apparently the loom file I'm trying to work with has duplicated Cell IDs:

df=loom$get.attribute.df(MARGIN=2) 

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10X35_1_GATGGACCTTAT-’, ‘10X38_1_ATGGCTTCAGAC-’, ‘10X38_1_CCGAGAACAGTC-’, ‘10X38_1_GGAGTGACGTAC-’, ‘10X38_2_AGTCCTATAAGG-’, ‘10X38_2_ATCGTGACGGTT-’, ‘10X38_2_GCAAGAAGTGCT-’, ‘10X38_2_TCGACTGCAGTT-’, ‘10X43_2_GCCATGCTTCCG-’, ‘10X43_2_TAACACAGATGA-’, ‘10X43_3_CACGTGGACGGA-’, ‘10X48_2_AGCGGAGCGATT-’, ‘10X48_2_GTCTGAACCAGT-’, ‘10X48_2_TACCACCTGATG-’, ‘10X49_1_CAATACCACACA-’, ‘10X49_1_GGTGGATTCGTT-’, ‘10X49_3_ATGCCTGCGTAT-’, ‘10X49_3_CAAATGTCCTCG-’, ‘10X49_3_GGTAACGGAGGT-’, ‘10X49_4_AGCTGAATTCGG-’, ‘10X49_4_ATCCCTAGCGTT-’, ‘10X49_4_ATGCACTCTAGG-’, ‘10X49_4_ATGCTGTATCGG-’, ‘10X49_4_CAGTTGGATAGA-’, ‘10X49_4_CTTTCTCGTGAT-’, ‘10X49_4_GGAGCTACAGTC-’, ‘10X49_4_TCGAACCAGAAA-’, ‘10X49_4_TCTACTAAAAGC-’, ‘10X49_4_TTACGACCCTAC-’, ‘10X49_5_AGATGACGTTGA-’, ‘10X49_5_AGCGGATCTGGA-’, ‘10X49_5_AGGCTGACCTGA-’, ‘10X49_5_AGTCACTACGAC-’, [... truncated]

Ultimately I would like to convert these data into a data frame for further analysis. What can I do?

Session info:

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /wynton/home/cbi/shared/software/CBI/R-4.0.2/lib64/R/lib/libRblas.so
LAPACK: /wynton/home/cbi/shared/software/CBI/R-4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] loomR_0.2.1.9000    hdf5r_1.3.3         R6_2.4.1           
 [4] devtools_2.3.2      usethis_1.6.3       Matrix_1.2-18      
 [7] data.table_1.13.0   GEOquery_2.56.0     Biobase_2.48.0     
[10] BiocGenerics_0.34.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        pillar_1.4.6      compiler_4.0.2    remotes_2.2.0    
 [5] prettyunits_1.1.1 tools_4.0.2       bit_4.0.4         testthat_2.3.2   
 [9] digest_0.6.25     pkgbuild_1.1.0    pkgload_1.1.0     memoise_1.1.0    
[13] lifecycle_0.2.0   tibble_3.0.3      lattice_0.20-41   pkgconfig_2.0.3  
[17] rlang_0.4.7       cli_2.0.2         curl_4.3          stringr_1.4.0    
[21] withr_2.3.0       dplyr_1.0.2       xml2_1.3.2        desc_1.2.0       
[25] generics_0.0.2    fs_1.5.0          vctrs_0.3.4       hms_0.5.3        
[29] bit64_4.0.5       rprojroot_1.3-2   grid_4.0.2        tidyselect_1.1.0 
[33] glue_1.4.2        processx_3.4.4    pbapply_1.4-3     fansi_0.4.1      
[37] sessioninfo_1.1.1 limma_3.44.3      tidyr_1.1.2       readr_1.3.1      
[41] purrr_0.3.4       callr_3.4.4       magrittr_1.5      backports_1.1.10 
[45] ps_1.3.4          ellipsis_0.3.1    assertthat_0.2.1  stringi_1.5.3    
[49] crayon_1.3.4


kvastad · 2023-04-20T07:51:59Z

I've had a similar issue with a loom file containing duplicate CellIDs. With help from the authors of the publication (many thanks), it was resolved by running a Python script on the dataset before using loomR. In this case the cells themselves are unique, but some of them share a CellID because of a string truncation error in the authors' pipeline.

Here is the Python script. It appends the suffix "2" to every CellID that has already appeared earlier in the file, so the first occurrence keeps its original ID.

-------- save this part below in a make_unique_CellID.py file --------

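import loompy

# Open the loom file (loompy.connect opens it for reading and writing by default).
d = loompy.connect("l5_all.loom")

# Load the CellID column attribute as an array.
cellids = d.ca.CellID[:]

# Append "2" to every CellID that was already seen earlier in the file.
# Note: an ID that occurs three or more times would still collide after this.
seen = set()
for i in range(len(cellids)):
    cellid = cellids[i]
    if cellid in seen:
        cellids[i] += "2"
    seen.add(cellid)

# Write the modified IDs back to the file and close it.
d.ca.CellID = cellids
d.close()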

-------- end of make_unique_CellID.py file, don't save this line --------

Run the script in the same directory as the loom file, here it was l5_all.loom.

To run from the terminal in that directory type:

python make_unique_CellID.py
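
If a CellID could occur more than twice, appending a fixed "2" would produce a new collision. Below is a minimal sketch of a more general renaming step, not the publication authors' method. The function name make_unique and the "_dup" separator are hypothetical choices made for illustration. It works on a plain Python list, so it can be tried without a loom file.

```python
from collections import Counter


def make_unique(ids, sep="_dup"):
    """Return a new list in which repeated IDs get a numbered suffix.

    The first occurrence of an ID is kept unchanged; later occurrences
    become ID + sep + n, where n is chosen so the new name is not in use.
    """
    taken = set(ids)   # every name already present, to avoid new collisions
    seen = Counter()   # how many times each original ID has been visited
    result = []
    for cell_id in ids:
        seen[cell_id] += 1
        if seen[cell_id] == 1:
            result.append(cell_id)
            continue
        n = seen[cell_id]
        candidate = f"{cell_id}{sep}{n}"
        # Skip suffixes that would clash with an existing or generated name.
        while candidate in taken:
            n += 1
            candidate = f"{cell_id}{sep}{n}"
        taken.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(make_unique(["a", "b", "a", "a", "b"]))
```

To apply this to a loom file, the result would presumably be converted back to an array and assigned to d.ca.CellID in the same way as in the script above. That step is untested here.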
