Commit

documentation updates
allenschmaltz committed Aug 1, 2019
1 parent 87c060f commit 365b41c
Showing 7 changed files with 100 additions and 30 deletions.
Binary file modified .DS_Store
Binary file not shown.
21 changes: 19 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,23 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
doc
Meta

# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# Example code in package build process
*-Ex.R

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# Mac OS
.DS_Store
16 changes: 15 additions & 1 deletion README.md
@@ -1 +1,15 @@
# cui2vec

This repo contains the code associated with the following paper (under review):

> Kompa, B., Schmaltz, A., Fried, I., Griffin, W., Palmer, N.P., Shi, X., Cai, T., Kohane, I.S., and Beam, A.L., 2019. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. arXiv preprint arXiv:1804.01486.
# Overview

This repo contains the R package `cui2vec`, which provides code for fitting embeddings to your own co-occurrence data in the manner presented in the above paper. The package can be installed locally from source. An overview of usage is provided in the following HTML vignette, which can be viewed in your browser:

[vignettes/rendered/2019_07_31/cui2vecWorkflow.html](vignettes/rendered/2019_07_31/cui2vecWorkflow.html).

Additional information on each of the public functions can be accessed in the standard way (e.g., ```?cui2vec::construct_word2vec_embedding```).

Data agreements prevent us from releasing all of our original source data, but upon acceptance, we will release our embeddings at the following URL: TBD.
Binary file modified vignettes/.DS_Store
Binary file not shown.
21 changes: 11 additions & 10 deletions vignettes/cui2vec.Rmd
@@ -16,16 +16,17 @@ knitr::opts_chunk$set(
comment = "#>"
)
```
-##cui2vec Overview
-Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts using an extremely large collection of multimodal medical data. This includes a insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles which can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. See [our preprint](https://arxiv.org/abs/1804.01486) [@Beam2018-vl] for more information.
+## cui2vec Overview
+
+Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts using an extremely large collection of multimodal medical data. This includes an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles that can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. See [our preprint](https://arxiv.org/abs/1804.01486) [@Beam2018-vl] for more information.


In this vignette, we'll walk through the core steps of `cui2vec`. Start by loading the package:
```{r setup}
library(cui2vec)
```

-For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.RData` contains a term co-occurrence matrix (TCM) for all pairwise combination of CUIs (concept unique identifier) for a subsampling of 100 CUIs out of 18,000+. `singleton_counts.RData` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you've have a TCM and singleton count for your corpus of interest.
+For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.RData` contains a term co-occurrence matrix (TCM) for all pairwise combinations of CUIs (concept unique identifier) for a subsampling of 100 CUIs out of 18,000+. `singleton_counts.RData` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you have a TCM and singleton count for your corpus of interest.
```{r, message=FALSE}
# denominator in PMI calculation
N <- 261397
@@ -35,25 +36,25 @@ load('singleton_counts.rda')
```
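As a toy illustration of what these two inputs represent, the following is a hypothetical Python sketch (the function name, windowing scheme, and data layout are my own assumptions for illustration, not the package's R code; the real inputs were produced by NILE over clinical notes). It builds a symmetric TCM and the singleton counts from tokenized documents:

```python
import numpy as np
from collections import Counter

def build_counts(docs, window=5):
    """Toy TCM and singleton counts from tokenized documents.

    Two terms co-occur if they fall within `window` tokens of each other.
    """
    singletons, pairs = Counter(), Counter()
    for doc in docs:
        for i, w in enumerate(doc):
            singletons[w] += 1
            for c in doc[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((w, c)))] += 1  # unordered pair key
    vocab = sorted(singletons)
    idx = {w: k for k, w in enumerate(vocab)}
    tcm = np.zeros((len(vocab), len(vocab)))
    for (a, b), n in pairs.items():
        tcm[idx[a], idx[b]] += n  # symmetric co-occurrence counts
        tcm[idx[b], idx[a]] += n
    return vocab, tcm, np.array([singletons[w] for w in vocab])
```

For real corpora the TCM is large and sparse, but the shape of the data is the same: a square co-occurrence matrix plus a vector of per-term counts.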


-The first step in the `cui2vec` algorithm is construct the Pointwise Mutual Information (PMI) matrix:
+The first step in the `cui2vec` algorithm is to construct the Pointwise Mutual Information (PMI) matrix:
```{r, message=FALSE}
pmi <- construct_pmi(term_cooccurrence_matrix,singleton_counts,N)
pmi[1:5, 1:3]
```
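For intuition, each PMI entry compares the observed co-occurrence probability of a pair against what independence would predict: PMI(i, j) = log(p(i, j) / (p(i) p(j))) = log(c_ij N / (c_i c_j)). A minimal NumPy sketch under that standard definition (illustrative only; the package's R implementation may handle normalization and zero counts differently):

```python
import numpy as np

def construct_pmi(tcm, singleton_counts, N):
    # PMI(i, j) = log( p(i, j) / (p(i) * p(j)) )
    #           = log( c_ij * N / (c_i * c_j) )
    p_ij = tcm / N
    p_i = singleton_counts / N
    with np.errstate(divide="ignore"):  # zero co-occurrences map to -inf
        return np.log(p_ij / np.outer(p_i, p_i))
```

Pairs that never co-occur get PMI of negative infinity here, which the positive-clipping step below removes anyway.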

-Then you need to construct the Shift Positive Pointwise Mutual Information (SPPMI) matrix:
+Next, you need to construct the Shifted Positive Pointwise Mutual Information (SPPMI) matrix:
```{r, message=FALSE}
sppmi <- construct_sppmi(pmi)
sppmi[1:5, 1:5]
```
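The SPPMI transform itself is simple: shift the PMI down by log(k), where k plays the role of word2vec's negative-sampling count (k = 1 means no shift), and clip negative values to zero. A hypothetical one-liner, assuming the `k = 1` default:

```python
import numpy as np

def construct_sppmi(pmi, k=1):
    # Shifted Positive PMI: SPPMI(i, j) = max(PMI(i, j) - log(k), 0)
    return np.maximum(pmi - np.log(k), 0.0)
```

Clipping makes the matrix nonnegative and much sparser, which is what makes the factorization in the next step behave well.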

Finally, you can fit `cui2vec` embeddings using `construct_word2vec_embedding`. We'll keep this example small and only work with 20-dimensional embeddings.
```{r, message=FALSE}
-w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20)
+w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20, iters=50)
w2v_embedding[1:5, 1:5]
```
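Under the hood, this family of methods follows Levy and Goldberg's observation that word2vec's skip-gram with negative sampling implicitly factorizes the SPPMI matrix, so an embedding can be recovered from a truncated SVD, taking W = U_d sqrt(S_d). A minimal sketch using an exact SVD (my own illustrative function, not the package's; the `iters` argument above suggests the R code uses an iterative or randomized solver instead):

```python
import numpy as np

def svd_embedding(sppmi, dim_size):
    # Factorize SPPMI ~= U S V^T and keep W = U_d * sqrt(S_d),
    # so that W W^T approximates SPPMI at rank dim_size.
    u, s, _ = np.linalg.svd(sppmi)
    return u[:, :dim_size] * np.sqrt(s[:dim_size])
```

Splitting the singular values symmetrically between word and context vectors (sqrt on each side) is the conventional choice for this factorization.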

-We can also do `word2vec` on the term_cooccurrence_matrix matrix. We'll refer to these as PCA embeddings.
+We can also do PCA on the `term_cooccurrence_matrix` matrix. We'll refer to these as PCA embeddings.
```{r, message=FALSE}
pca_embedding <- construct_pca_embedding(term_cooccurrence_matrix, dim_size = 20)
pca_embedding[1:5, 1:5]
@@ -70,7 +71,7 @@ To run the benchmarks in our paper, we need some additional information about th
```{r}
print(check_embedding_semantic_columns(w2v_embedding))
```
-As expected, this fails, since we just created the embeddings. We have a helper function to add this function to an embedding.
+As expected, this fails, since we just created the embeddings. We have a helper function to add this information to an embedding.

```{r, message=FALSE, results='hide'}
glove_embedding <- bind_semantic_types(glove_embedding)
@@ -81,13 +82,13 @@ Let's check that it worked:
w2v_embedding[1:5, 1:5]
```

-We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and use this similarity to assess whether or not the two concepts are related. There are five benchmarks:
+We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and we use this similarity to assess whether or not the two concepts are related. There are five benchmarks:

* **Comorbid Conditions**: A comorbidity is a disease or condition that frequently accompanies a primary diagnosis.
* **Causative Relationships**: The UMLS contains a table (MRREL) of entities known to be the cause of a certain result.
* **National Drug File Reference Terminology (NDF-RT)**: We assess power to detect "may treat" and "may prevent" relationships using bootstrap scores of random drug-disease pairs.
* **UMLS Semantic Type**: Semantic types are meta-information about which category a concept belongs to, and these categories are arranged in a hierarchy.
-* **Human Assessment of Concept Similarity**: We report the spearman correlation between the human assessment scores and cosine similarity from the embeddings.
+* **Human Assessment of Concept Similarity**: We report the Spearman correlation between the human assessment scores and cosine similarity from the embeddings.
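All five benchmarks rest on the same primitive: the cosine similarity between two concept vectors. As a quick illustrative sketch (plain Python/NumPy; the package computes this internally over rows of the embedding matrix):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A pair of strongly related CUIs should score near 1, while unrelated (orthogonal) vectors score near 0; the benchmarks turn these scores into statistical-power estimates against the known relationship sets.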

```{r, message = FALSE, eval=FALSE}
# No CUIs in our tiny embedding that overlap with comorbidity CUIs, so don't evaluate
37 changes: 20 additions & 17 deletions vignettes/cui2vec.bib
@@ -1,13 +1,15 @@
-% Generated by Paperpile. Check out http://paperpile.com for more information.
-% BibTeX export options can be customized via Settings -> BibTeX.
+%% This BibTeX bibliography file was created using BibDesk.
+%% http://bibdesk.sourceforge.net/
-@ARTICLE{Beam2018-vl,
-title = "Clinical Concept Embeddings Learned from Massive Sources of
-Multimodal Medical Data",
-author = "Beam, Andrew L and Kompa, Benjamin and Fried, Inbar and
-Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane,
-Isaac S",
-abstract = "Word embeddings are a popular approach to unsupervised
+%% Created for Allen Schmaltz at 2019-07-31 20:15:58 -0400
+%% Saved with string encoding Unicode (UTF-8)
+@article{Beam2018-vl,
+Abstract = {Word embeddings are a popular approach to unsupervised
learning of word relationships that are widely used in
natural language processing. In this article, we present a
new set of embeddings for medical concepts learned using an
@@ -25,11 +27,12 @@ @ARTICLE{Beam2018-vl
previous methods in most instances. Finally, we provide a
downloadable set of pre-trained embeddings for other
researchers to use, as well as an online tool for
-interactive exploration of the cui2vec embeddings.",
-month = apr,
-year = 2018,
-keywords = "cui2vec",
-archivePrefix = "arXiv",
-primaryClass = "cs.CL",
-eprint = "1804.01486"
-}
+interactive exploration of the cui2vec embeddings.},
+Archiveprefix = {arXiv},
+Author = {Kompa, Benjamin and Schmaltz, Allen and Fried, Inbar and Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane, Isaac S and Beam, Andrew L},
+Date-Modified = {2019-08-01 00:15:41 +0000},
+Eprint = {1804.01486},
+Keywords = {cui2vec},
+Primaryclass = {cs.CL},
+Title = {Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data},
+Year = 2019}
35 changes: 35 additions & 0 deletions vignettes/prev_cui2vec.bib
@@ -0,0 +1,35 @@
% Generated by Paperpile. Check out http://paperpile.com for more information.
% BibTeX export options can be customized via Settings -> BibTeX.
@ARTICLE{Beam2018-vl,
title = "Clinical Concept Embeddings Learned from Massive Sources of
Multimodal Medical Data",
author = "Beam, Andrew L and Kompa, Benjamin and Fried, Inbar and
Palmer, Nathan P and Shi, Xu and Cai, Tianxi and Kohane,
Isaac S",
abstract = "Word embeddings are a popular approach to unsupervised
learning of word relationships that are widely used in
natural language processing. In this article, we present a
new set of embeddings for medical concepts learned using an
extremely large collection of multimodal medical data.
Leaning on recent theoretical insights, we demonstrate how
an insurance claims database of 60 million members, a
collection of 20 million clinical notes, and 1.7 million
full text biomedical journal articles can be combined to
embed concepts into a common space, resulting in the largest
ever set of embeddings for 108,477 medical concepts. To
evaluate our approach, we present a new benchmark
methodology based on statistical power specifically designed
to test embeddings of medical concepts. Our approach, called
cui2vec, attains state of the art performance relative to
previous methods in most instances. Finally, we provide a
downloadable set of pre-trained embeddings for other
researchers to use, as well as an online tool for
interactive exploration of the cui2vec embeddings.",
month = apr,
year = 2018,
keywords = "cui2vec",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
eprint = "1804.01486"
}
