-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcleanExtractions.Rmd
94 lines (65 loc) · 2.74 KB
/
cleanExtractions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
---
title: Cleanup Score Extractions
output: html_document
---
```{r}
library(XLConnect)
library(stringr)
```
Connect to local MySQL server and read scores from tracking file.
Read the Excel file (from dropbox copy)
```{r}
library(RMySQL)
con = dbConnect(MySQL(), dbname='appy')
dbListTables(con)
dbWriteTable(con,"trackingFile",mergedPubmedTracking,overwrite=T)
# a function to simplify queries
query <- function(...) dbGetQuery(con, ...)
# all with any score
pmid_scores_tracking <- query("SELECt PMID FROM trackingfile WHERE include_ = 'yes' AND test_performance = 'yes'
AND test_type LIKE '%score%'")
pmid_score_tracking <- pmid_scores_tracking$PMID
pmid_score_tracking
```
Read the scores_extraction worksheet into R (from Dropbox copy)
```{r}
## NB list sheets: http://stackoverflow.com/questions/12945687/how-to-read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frame-e
# wb <- loadWorkbook("D:/Dropbox/BrownCEBM/Appendicitis/scores_extraction_6-23-2014.xlsx")
wb <- loadWorkbook("C:/Users/dsteele/Desktop/scores_extraction_6-26-2014.xlsx")
wb1 <- loadWorkbook("C:/Users/dsteele/Desktop/score_derivation_6-26-2014.xlsx")
scoredat <- readWorksheet(wb, sheet=getSheets(wb)[1])
scores <- readWorksheet(wb1, sheet=getSheets(wb1)[1])
pmid_score_extracts <- scoredat$study_pmid
pmid_score_extracted_papers <- as.numeric(unique(scoredat$study_pmid))
names(scoredat)
names(scores)
grouping <- unique(scoredat$test_grouping_for_analysis)
derivation <- unique(scores$score_name)
derivation
setdiff(grouping,derivation)
setdiff(derivation,grouping)
```
How many unique papers are there which have extracted scores?
Which papers have been extracted, but are not indicated as such in tracking file?
```{r}
setdiff(pmid_score_extracted_papers, pmid_score_tracking)
setdiff(pmid_score_tracking, pmid_score_extracted_papers)
```
Which papers containing 'score' have include_ = 'no?' OR test_performance='no'
```{r}
(pmid_scores_out <- query("SELECt PMID FROM trackingfile WHERE include_ = 'no?' AND test_type LIKE '%score%'"))
(pmid_scores_noTP <- query("SELECt PMID FROM trackingfile WHERE test_performance = 'no?'
AND test_type LIKE '%score%'"))
```
Use regular expressions to clean, harmonize and create dummy variables for analysis.
```{r}
wb <- loadWorkbook("D:/Dropbox/BrownCEBM/Appendicitis/scores_extraction_6-23-2014.xlsx")
scoredat <- readWorksheet(wb, sheet=getSheets(wb)[1])
names(scoredat)
library(stringr)
cutPoint <- str_trim(str_replace(str_replace(scoredat$definition_of_a_positive_test, ">=", ""),
"score", ""), side="both")
scoreName <- str_trim(str_replace(scoredat$test, "score|scoring system", ""), side="both")
scoreName
table(scoredat$Population)
```