From 365b41c9434daec86608d011349b46e118b47ea8 Mon Sep 17 00:00:00 2001
From: Allen Schmaltz <allen.schmaltz@gmail.com>
Date: Wed, 31 Jul 2019 20:48:55 -0400
Subject: [PATCH] documentation updates

---
 .DS_Store                  | Bin 8196 -> 8196 bytes
 .gitignore                 |  21 +++++++++++++++++++--
 README.md                  |  16 +++++++++++++++-
 vignettes/.DS_Store        | Bin 6148 -> 6148 bytes
 vignettes/cui2vec.Rmd      |  21 +++++++++++----------
 vignettes/cui2vec.bib      |  37 ++++++++++++++++++++-----------------
 vignettes/prev_cui2vec.bib |  35 +++++++++++++++++++++++++++++++++++
 7 files changed, 100 insertions(+), 30 deletions(-)
 create mode 100755 vignettes/prev_cui2vec.bib
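
Note for reviewers (not part of the commit): the vignette touched below builds a PMI matrix and then an SPPMI matrix. The following base R sketch illustrates the textbook definitions on a toy matrix. It is an illustration under stated assumptions, not the package's implementation: the toy counts, the shift `k`, and all variable names are hypothetical, and `construct_pmi` / `construct_sppmi` may apply smoothing or other details that are not shown here.

```r
# Hypothetical toy data: symmetric co-occurrence counts for three made-up CUIs.
ids <- c("C1", "C2", "C3")
tcm <- matrix(c(0, 6, 1,
                6, 0, 3,
                1, 3, 0),
              nrow = 3, byrow = TRUE, dimnames = list(ids, ids))
singletons <- c(C1 = 10, C2 = 8, C3 = 5)  # raw count of each term
N <- 20                                   # denominator used for probabilities

# PMI(i, j) = log( p(i, j) / (p(i) * p(j)) )
joint <- tcm / N
marginal <- singletons / N
pmi <- log(joint / outer(marginal, marginal))

# SPPMI(i, j) = max(PMI(i, j) - log(k), 0); k = 1 means no shift (assumed here).
k <- 1
sppmi <- pmax(pmi - log(k), 0)

print(round(sppmi, 3))
```

Pairs that never co-occur have a PMI of `-Inf`, which the `pmax` step maps to 0, so the resulting matrix is non-negative.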

diff --git a/.DS_Store b/.DS_Store
index c14e483d293841bece72f76e001aa021197517cc..66fc5024cce66e22fdb856a237ef2acd4bc6149d 100644
GIT binary patch
delta 125
zcmZp1XmOa}&nU7nU^hRb$YvgaRZPOn47m)640&b2MR_^-dFc!c42+vM3q50!=VT~j
m$Ye-o$YV%lC;`G$hGJwz8w)L&HnU57W10M1WD{Y<W=sH#Y$9C%

delta 41
xcmZp1XmOa}&nUDpU^hRb&}JTiRZN@D2t8)nSdz!InO))=%Vc5EIU7r|nE)<h4wL`@

diff --git a/.gitignore b/.gitignore
index a535e8d..606292c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,23 @@
 .Rproj.user
-.Rhistory
-.RData
 .Ruserdata
 doc
 Meta
+
+# History files
+.Rhistory
+.Rapp.history
+
+# Session Data files
+.RData
+
+# Example code in package build process
+*-Ex.R
+
+# Output files from R CMD check
+/*.Rcheck/
+
+# RStudio files
+.Rproj.user/
+
+# Mac OS
+.DS_Store
diff --git a/README.md b/README.md
index acce58a..3c8b4a0 100755
--- a/README.md
+++ b/README.md
@@ -1 +1,15 @@
-# cui2vec
\ No newline at end of file
+# cui2vec
+
+This repo contains the code associated with the following paper (under review):
+
+> Kompa, B., Schmaltz, A., Fried, I., Griffin, W., Palmer, N.P., Shi, X., Cai, T., Kohane, I.S., and Beam, A.L., 2019. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. arXiv preprint arXiv:1804.01486.
+
+# Overview
+
+This repo contains the R package `cui2vec`, which provides code for fitting embeddings to your own co-occurrence data in the manner presented in the above paper. The package can be installed locally from source. An overview of usage is provided in the following HTML vignette, which can be viewed in your browser:
+
+[vignettes/rendered/2019_07_31/cui2vecWorkflow.html](vignettes/rendered/2019_07_31/cui2vecWorkflow.html).
+
+Additional information on each of the public functions can be accessed in the standard way (e.g., ```?cui2vec::construct_word2vec_embedding```).
+
+Data agreements prevent us from releasing all of our original source data, but upon acceptance, we will release our embeddings at the following URL: TBD.
diff --git a/vignettes/.DS_Store b/vignettes/.DS_Store
index ca62346f256a65d3e3d3321cbb7294b0e9dd5ef0..a11c6101b0f70f6254c61e548f3e22d924d0dbbc 100755
GIT binary patch
delta 57
zcmZoMXfc@J&&WJ6U^gT4WFE$~vK$OW45<ux3@Jbo#7Zd(F3QWv&r4@uU|`&QkkOEB
JGdss$egM7L4|@Or

delta 32
ocmZoMXfc@J&&V_}VE1GL5thmPjH@<3VN_<D*r2_co#QV*0J2~T+5i9m

diff --git a/vignettes/cui2vec.Rmd b/vignettes/cui2vec.Rmd
index 075e6fb..26ffbfe 100755
--- a/vignettes/cui2vec.Rmd
+++ b/vignettes/cui2vec.Rmd
@@ -16,8 +16,9 @@ knitr::opts_chunk$set(
   comment = "#>"
 )
 ```
-##cui2vec Overview 
-Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts using an extremely large collection of multimodal medical data. This includes a insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles which can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. See [our preprint](https://arxiv.org/abs/1804.01486) [@Beam2018-vl] for more information. 
+## cui2vec Overview 
+
+Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts using an extremely large collection of multimodal medical data. This includes an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles that can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. See [our preprint](https://arxiv.org/abs/1804.01486) [@Beam2018-vl] for more information. 
 
 
 In this vignette, we'll walk through the core steps of `cui2vec`. Start by loading the package: 
@@ -25,7 +26,7 @@ In this vignette, we'll walk through the core steps of `cui2vec`. Start by loadi
 library(cui2vec)
 ```
 
-For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.RData` contains a term co-occurrence matrix (TCM) for all pairwise combination of CUIs (concept unique identifier) for a subsampling of 100 CUIs out of 18,000+. `singleton_counts.RData` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you've have a TCM and singleton count for your corpus of interest. 
+For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.RData` contains a term co-occurrence matrix (TCM) for all pairwise combinations of CUIs (concept unique identifiers) for a subsample of 100 CUIs out of 18,000+. `singleton_counts.RData` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you have a TCM and singleton counts for your corpus of interest. 
 ```{r, message=FALSE}
 # denominator in PMI calculation 
 N <- 261397 
@@ -35,13 +36,13 @@ load('singleton_counts.rda')
 ```
 
 
-The first step in the `cui2vec` algorithm is construct the Pointwise Mutual Information (PMI) matrix: 
+The first step in the `cui2vec` algorithm is to construct the Pointwise Mutual Information (PMI) matrix: 
 ```{r, message=FALSE}
 pmi <- construct_pmi(term_cooccurrence_matrix,singleton_counts,N)
 pmi[1:5, 1:3]
 ```
 
-Then you need to construct the Shift Positive Pointwise Mutual Information (SPPMI) matrix: 
+Next, you need to construct the Shifted Positive Pointwise Mutual Information (SPPMI) matrix: 
 ```{r, message=FALSE}
 sppmi <- construct_sppmi(pmi)
 sppmi[1:5, 1:5]
@@ -49,11 +50,11 @@ sppmi[1:5, 1:5]
 
 Finally, you can fit `cui2vec` embeddings using `construct_word2vec_embedding`. We'll keep this example small and only work with 20 dimensional embeddings. 
 ```{r, message=FALSE}
-w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20)
+w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20, iters = 50)
 w2v_embedding[1:5, 1:5]
 ```
 
-We can also do `word2vec` on the term_cooccurrence_matrix matrix. We'll refer to these as PCA embeddings. 
+We can also run PCA on the `term_cooccurrence_matrix`. We'll refer to these as PCA embeddings. 
 ```{r, message=FALSE}
 pca_embedding <- construct_pca_embedding(term_cooccurrence_matrix, dim_size = 20)
 pca_embedding[1:5, 1:5]
@@ -70,7 +71,7 @@ To run the benchmarks in our paper, we need some additional information about th
 ```{r}
 print(check_embedding_semantic_columns(w2v_embedding))
 ```
-As expected, this fails, since we just created the embeddings. We have a helper function to add this function to an embedding. 
+As expected, this fails, since we just created the embeddings. We have a helper function to add this information to an embedding. 
 
 ```{r, message=FALSE, results='hide'}
 glove_embedding <- bind_semantic_types(glove_embedding)
@@ -81,13 +82,13 @@ Let's check that it worked:
 w2v_embedding[1:5, 1:5]
 ```
 
-We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts.  We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and use this similarity to assess whether or not the two concepts are related. There are five benchmarks: 
+We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and we use this similarity to assess whether or not the two concepts are related. There are five benchmarks: 
 
 * **Comorbid Conditions**: A comorbidity is a disease or condition that frequently accompanies a primary diagnosis. 
 * **Causative Relationships**: The UMLS contains a table (MRREL) of entities known to be the cause of a certain result. 
 * **National Drug File Reference Terminology (NDF-RT)**: We assess power to detect "may treat" and "may prevent" relationships using bootstrap scores of random drug-disease pairs.
 * **UMLS Semantic Type**:  Semantic types are meta-information about which category a concept belongs to, and these categories are arranged in a hierarchy.
-* **Human Assessment of Concept Similarity**: We report the spearman correlation between the human assessment scores and cosine similarity from the embeddings.
+* **Human Assessment of Concept Similarity**: We report the Spearman correlation between the human assessment scores and cosine similarity from the embeddings.
 
 ```{r, message = FALSE, eval=FALSE}
 # No CUIs in our tiny embeding that overlap with comorbidity CUIs, so don't evaluate
diff --git a/vignettes/cui2vec.bib b/vignettes/cui2vec.bib
index da20112..f1e244d 100755
--- a/vignettes/cui2vec.bib
+++ b/vignettes/cui2vec.bib
@@ -1,13 +1,15 @@
-% Generated by Paperpile. Check out http://paperpile.com for more information.
-% BibTeX export options can be customized via Settings -> BibTeX.
+%% This BibTeX bibliography file was created using BibDesk.
+%% http://bibdesk.sourceforge.net/
 
-@ARTICLE{Beam2018-vl,
-  title         = "Clinical Concept Embeddings Learned from Massive Sources of
-                   Multimodal Medical Data",
-  author        = "Beam, Andrew L and Kompa, Benjamin and Fried, Inbar and
-                   Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane,
-                   Isaac S",
-  abstract      = "Word embeddings are a popular approach to unsupervised
+%% Created for Allen Schmaltz at 2019-07-31 20:15:58 -0400 
+
+
+%% Saved with string encoding Unicode (UTF-8) 
+
+
+
+@article{Beam2018-vl,
+	Abstract = {Word embeddings are a popular approach to unsupervised
                    learning of word relationships that are widely used in
                    natural language processing. In this article, we present a
                    new set of embeddings for medical concepts learned using an
@@ -25,11 +27,12 @@ @ARTICLE{Beam2018-vl
                    previous methods in most instances. Finally, we provide a
                    downloadable set of pre-trained embeddings for other
                    researchers to use, as well as an online tool for
-                   interactive exploration of the cui2vec embeddings.",
-  month         =  apr,
-  year          =  2018,
-  keywords      = "cui2vec",
-  archivePrefix = "arXiv",
-  primaryClass  = "cs.CL",
-  eprint        = "1804.01486"
-}
+                   interactive exploration of the cui2vec embeddings.},
+	Archiveprefix = {arXiv},
+	Author = {Kompa, Benjamin and Schmaltz, Allen and Fried, Inbar and Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane, Isaac S and Beam, Andrew L},
+	Date-Modified = {2019-08-01 00:15:41 +0000},
+	Eprint = {1804.01486},
+	Keywords = {cui2vec},
+	Primaryclass = {cs.CL},
+	Title = {Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data},
+	Year = 2019}
diff --git a/vignettes/prev_cui2vec.bib b/vignettes/prev_cui2vec.bib
new file mode 100755
index 0000000..da20112
--- /dev/null
+++ b/vignettes/prev_cui2vec.bib
@@ -0,0 +1,35 @@
+% Generated by Paperpile. Check out http://paperpile.com for more information.
+% BibTeX export options can be customized via Settings -> BibTeX.
+
+@ARTICLE{Beam2018-vl,
+  title         = "Clinical Concept Embeddings Learned from Massive Sources of
+                   Multimodal Medical Data",
+  author        = "Beam, Andrew L and Kompa, Benjamin and Fried, Inbar and
+                   Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane,
+                   Isaac S",
+  abstract      = "Word embeddings are a popular approach to unsupervised
+                   learning of word relationships that are widely used in
+                   natural language processing. In this article, we present a
+                   new set of embeddings for medical concepts learned using an
+                   extremely large collection of multimodal medical data.
+                   Leaning on recent theoretical insights, we demonstrate how
+                   an insurance claims database of 60 million members, a
+                   collection of 20 million clinical notes, and 1.7 million
+                   full text biomedical journal articles can be combined to
+                   embed concepts into a common space, resulting in the largest
+                   ever set of embeddings for 108,477 medical concepts. To
+                   evaluate our approach, we present a new benchmark
+                   methodology based on statistical power specifically designed
+                   to test embeddings of medical concepts. Our approach, called
+                   cui2vec, attains state of the art performance relative to
+                   previous methods in most instances. Finally, we provide a
+                   downloadable set of pre-trained embeddings for other
+                   researchers to use, as well as an online tool for
+                   interactive exploration of the cui2vec embeddings.",
+  month         =  apr,
+  year          =  2018,
+  keywords      = "cui2vec",
+  archivePrefix = "arXiv",
+  primaryClass  = "cs.CL",
+  eprint        = "1804.01486"
+}