diff --git a/.DS_Store b/.DS_Store
index c14e483..66fc502 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/.gitignore b/.gitignore
index a535e8d..606292c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,23 @@
 .Rproj.user
-.Rhistory
-.RData
 .Ruserdata
 doc
 Meta
+
+# History files
+.Rhistory
+.Rapp.history
+
+# Session Data files
+.RData
+
+# Example code in package build process
+*-Ex.R
+
+# Output files from R CMD check
+/*.Rcheck/
+
+# RStudio files
+.Rproj.user/
+
+# Mac OS
+.DS_Store
diff --git a/README.md b/README.md
index acce58a..3c8b4a0 100755
--- a/README.md
+++ b/README.md
@@ -1 +1,15 @@
-# cui2vec
\ No newline at end of file
+# cui2vec
+
+This repo contains the code associated with the following paper (under review):
+
+> Kompa, B., Schmaltz, A., Fried, I., Griffin, W., Palmer, N.P., Shi, X., Cai, T., Kohane, I.S., and Beam, A.L., 2019. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. arXiv preprint arXiv:1804.01486.
+
+# Overview
+
+This repo contains the R package `cui2vec`, which provides code for fitting embeddings to your own co-occurrence data in the manner presented in the above paper. The package can be installed locally from source. An overview of usage is provided in the following HTML vignette, which can be viewed in your browser:
+
+[vignettes/rendered/2019_07_31/cui2vecWorkflow.html](vignettes/rendered/2019_07_31/cui2vecWorkflow.html).
+
+Additional information on each of the public functions can be accessed in the standard way (e.g., `?cui2vec::construct_word2vec_embedding`).
+
+Data agreements prevent us from releasing all of our original source data, but upon acceptance, we will release our embeddings at the following URL: TBD.
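The README above mentions installing locally from source without showing the commands; a minimal sketch, assuming a local clone of the repo (the tarball version number is hypothetical):

```r
# From a shell, in the directory containing the cloned repo:
#   R CMD build cui2vec
#   R CMD INSTALL cui2vec_0.1.0.tar.gz   # version number is hypothetical

# Equivalently, from an R session (base R only, no extra packages):
install.packages("cui2vec_0.1.0.tar.gz", repos = NULL, type = "source")

# Per-function help is then available in the standard way:
library(cui2vec)
?construct_word2vec_embedding
```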
diff --git a/vignettes/.DS_Store b/vignettes/.DS_Store
index ca62346..a11c610 100755
Binary files a/vignettes/.DS_Store and b/vignettes/.DS_Store differ
diff --git a/vignettes/cui2vec.Rmd b/vignettes/cui2vec.Rmd
index 075e6fb..26ffbfe 100755
--- a/vignettes/cui2vec.Rmd
+++ b/vignettes/cui2vec.Rmd
@@ -16,8 +16,9 @@ knitr::opts_chunk$set(
   comment = "#>"
 )
 ```
-##cui2vec Overview
-Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts using an extremely large collection of multimodal medical data. This includes a insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles which can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. See [our preprint](https://arxiv.org/abs/1804.01486) [@Beam2018-vl] for more information.
+## cui2vec Overview
+
+Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts using an extremely large collection of multimodal medical data. This includes an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles that can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. See [our preprint](https://arxiv.org/abs/1804.01486) [@Beam2018-vl] for more information.
 
 In this vignette, we'll walk through the core steps of `cui2vec`. Start by loading the package:
@@ -25,7 +26,7 @@
 library(cui2vec)
 ```
 
-For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.RData` contains a term co-occurrence matrix (TCM) for all pairwise combination of CUIs (concept unique identifier) for a subsampling of 100 CUIs out of 18,000+. `singleton_counts.RData` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you've have a TCM and singleton count for your corpus of interest.
+For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.RData` contains a term co-occurrence matrix (TCM) for all pairwise combinations of CUIs (concept unique identifiers) for a subsampling of 100 CUIs out of 18,000+. `singleton_counts.RData` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you have a TCM and singleton counts for your corpus of interest.
 ```{r, message=FALSE}
 # denominator in PMI calculation
 N <- 261397
@@ -35,13 +36,13 @@
 load('singleton_counts.rda')
 ```
 
-The first step in the `cui2vec` algorithm is construct the Pointwise Mutual Information (PMI) matrix:
+The first step in the `cui2vec` algorithm is to construct the Pointwise Mutual Information (PMI) matrix:
 ```{r, message=FALSE}
 pmi <- construct_pmi(term_cooccurrence_matrix,singleton_counts,N)
 pmi[1:5, 1:3]
 ```
 
-Then you need to construct the Shift Positive Pointwise Mutual Information (SPPMI) matrix:
+Next, you need to construct the Shifted Positive Pointwise Mutual Information (SPPMI) matrix:
 ```{r, message=FALSE}
 sppmi <- construct_sppmi(pmi)
 sppmi[1:5, 1:5]
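For reference, the two steps in the hunk above follow the standard constructions: PMI(i, j) = log(n_ij * N / (n_i * n_j)) and SPPMI(i, j) = max(PMI(i, j) - log(k), 0). Below is a toy sketch of those textbook formulas, not the package's internals; all counts and the shift k = 1 are made up:

```r
# Toy co-occurrence matrix and singleton counts (all numbers made up)
tcm <- matrix(c( 0, 10,  2,
                10,  0,  5,
                 2,  5,  0), nrow = 3, byrow = TRUE)
singletons <- c(40, 55, 25)
N <- 120  # toy analogue of the N loaded in the vignette chunk above

# PMI(i, j) = log( n_ij * N / (n_i * n_j) ); zero counts give -Inf here
pmi_toy <- log(tcm * N / outer(singletons, singletons))

# SPPMI(i, j) = max(PMI(i, j) - log(k), 0); the -Inf entries clamp to 0
k <- 1
sppmi_toy <- pmax(pmi_toy - log(k), 0)
sppmi_toy
```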
@@ -49,11 +50,11 @@
 Finally, you can fit `cui2vec` embeddings using `construct_word2vec_embedding`. We'll keep this example small and only work with 20 dimensional embeddings.
 ```{r, message=FALSE}
-w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20)
+w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20, iters = 50)
 w2v_embedding[1:5, 1:5]
 ```
 
-We can also do `word2vec` on the term_cooccurrence_matrix matrix. We'll refer to these as PCA embeddings.
+We can also run PCA on the `term_cooccurrence_matrix`. We'll refer to these as PCA embeddings.
 ```{r, message=FALSE}
 pca_embedding <- construct_pca_embedding(term_cooccurrence_matrix, dim_size = 20)
 pca_embedding[1:5, 1:5]
 ```
@@ -70,7 +71,7 @@
 To run the benchmarks in our paper, we need some additional information about the embedding.
 ```{r}
 print(check_embedding_semantic_columns(w2v_embedding))
 ```
-As expected, this fails, since we just created the embeddings. We have a helper function to add this function to an embedding.
+As expected, this fails, since we just created the embeddings. We have a helper function to add this information to an embedding.
 ```{r, message=FALSE, results='hide'}
 glove_embedding <- bind_semantic_types(glove_embedding)
@@ -81,13 +82,13 @@
 Let's check that it worked:
 ```{r}
 w2v_embedding[1:5, 1:5]
 ```
 
-We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and use this similarity to assess whether or not the two concepts are related. There are five benchmarks:
+We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and we use this similarity to assess whether or not the two concepts are related. There are five benchmarks:
 
 * **Comorbid Conditions**: A comorbidity is a disease or condition that frequently accompanies a primary diagnosis.
 * **Causative Relationships**: The UMLS contains a table (MRREL) of entities known to be the cause of a certain result.
 * **National Drug File Reference Terminology (NDF-RT)**: We assess power to detect "may treat" and "may prevent" relationships using bootstrap scores of random drug-disease pairs.
 * **UMLS Semantic Type**: Semantic types are meta-information about which category a concept belongs to, and these categories are arranged in a hierarchy.
-* **Human Assessment of Concept Similarity**: We report the spearman correlation between the human assessment scores and cosine similarity from the embeddings.
+* **Human Assessment of Concept Similarity**: We report the Spearman correlation between the human assessment scores and cosine similarity from the embeddings.
 
 ```{r, message = FALSE, eval=FALSE}
 # No CUIs in our tiny embedding that overlap with comorbidity CUIs, so don't evaluate
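The benchmark hunks above compare concept pairs by the cosine similarity of their embedding vectors; here is a minimal standalone sketch of that computation (the helper name is ours, and the commented lookup assumes, hypothetically, that the embedding's row names are CUIs):

```r
# Cosine similarity between two vectors: <u, v> / (||u|| * ||v||)
cosine_sim <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

# Sanity checks: parallel vectors score 1, orthogonal vectors score 0
cosine_sim(c(1, 2, 3), c(2, 4, 6))  # 1
cosine_sim(c(1, 0), c(0, 1))        # 0

# Hypothetical use against a fitted embedding whose rows are CUIs:
# cosine_sim(w2v_embedding["C0004096", ], w2v_embedding["C0011849", ])
```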
diff --git a/vignettes/cui2vec.bib b/vignettes/cui2vec.bib
index da20112..f1e244d 100755
--- a/vignettes/cui2vec.bib
+++ b/vignettes/cui2vec.bib
@@ -1,13 +1,15 @@
-% Generated by Paperpile. Check out http://paperpile.com for more information.
-% BibTeX export options can be customized via Settings -> BibTeX.
+%% This BibTeX bibliography file was created using BibDesk.
+%% http://bibdesk.sourceforge.net/
 
-@ARTICLE{Beam2018-vl,
-  title = "Clinical Concept Embeddings Learned from Massive Sources of
-           Multimodal Medical Data",
-  author = "Beam, Andrew L and Kompa, Benjamin and Fried, Inbar and
-            Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane,
-            Isaac S",
-  abstract = "Word embeddings are a popular approach to unsupervised
+%% Created for Allen Schmaltz at 2019-07-31 20:15:58 -0400
+
+
+%% Saved with string encoding Unicode (UTF-8)
+
+
+
+@article{Beam2018-vl,
+  Abstract = {Word embeddings are a popular approach to unsupervised
              learning of word relationships that are widely used in
              natural language processing. In this article, we present a
             new set of embeddings for medical concepts learned using an
             extremely large collection of multimodal medical data.
             Leaning on recent theoretical insights, we demonstrate how
             an insurance claims database of 60 million members, a
             collection of 20 million clinical notes, and 1.7 million
             full text biomedical journal articles can be combined to
             embed concepts into a common space, resulting in the largest
             ever set of embeddings for 108,477 medical concepts. To
             evaluate our approach, we present a new benchmark
             methodology based on statistical power specifically designed
             to test embeddings of medical concepts. Our approach, called
             cui2vec, attains state of the art performance relative to
@@ -25,11 +27,12 @@
             previous methods in most instances. Finally, we provide a
             downloadable set of pre-trained embeddings for other
             researchers to use, as well as an online tool for
-             interactive exploration of the cui2vec embeddings.",
-  month = apr,
-  year = 2018,
-  keywords = "cui2vec",
-  archivePrefix = "arXiv",
-  primaryClass = "cs.CL",
-  eprint = "1804.01486"
-}
+             interactive exploration of the cui2vec embeddings.},
+  Archiveprefix = {arXiv},
+  Author = {Kompa, Benjamin and Schmaltz, Allen and Fried, Inbar and Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane, Isaac S and Beam, Andrew L},
+  Date-Modified = {2019-08-01 00:15:41 +0000},
+  Eprint = {1804.01486},
+  Keywords = {cui2vec},
+  Primaryclass = {cs.CL},
+  Title = {Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data},
+  Year = 2019}
diff --git a/vignettes/prev_cui2vec.bib b/vignettes/prev_cui2vec.bib
new file mode 100755
index 0000000..da20112
--- /dev/null
+++ b/vignettes/prev_cui2vec.bib
@@ -0,0 +1,35 @@
+% Generated by Paperpile. Check out http://paperpile.com for more information.
+% BibTeX export options can be customized via Settings -> BibTeX.
+
+@ARTICLE{Beam2018-vl,
+  title = "Clinical Concept Embeddings Learned from Massive Sources of
+           Multimodal Medical Data",
+  author = "Beam, Andrew L and Kompa, Benjamin and Fried, Inbar and
+            Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane,
+            Isaac S",
+  abstract = "Word embeddings are a popular approach to unsupervised
+              learning of word relationships that are widely used in
+              natural language processing. In this article, we present a
+              new set of embeddings for medical concepts learned using an
+              extremely large collection of multimodal medical data.
+              Leaning on recent theoretical insights, we demonstrate how
+              an insurance claims database of 60 million members, a
+              collection of 20 million clinical notes, and 1.7 million
+              full text biomedical journal articles can be combined to
+              embed concepts into a common space, resulting in the largest
+              ever set of embeddings for 108,477 medical concepts. To
+              evaluate our approach, we present a new benchmark
+              methodology based on statistical power specifically designed
+              to test embeddings of medical concepts. Our approach, called
+              cui2vec, attains state of the art performance relative to
+              previous methods in most instances. Finally, we provide a
+              downloadable set of pre-trained embeddings for other
+              researchers to use, as well as an online tool for
+              interactive exploration of the cui2vec embeddings.",
+  month = apr,
+  year = 2018,
+  keywords = "cui2vec",
+  archivePrefix = "arXiv",
+  primaryClass = "cs.CL",
+  eprint = "1804.01486"
+}